In this guide on data analysis with Python, you will see:
- Why use Python for data analysis
- Common libraries for data analysis with Python
- A step-by-step tutorial to do data analysis in Python
- The process to follow when analyzing data
Let’s dive in!
Why Use Python for Data Analysis
Data analysis is usually performed with two main programming languages: Python and R. In particular, below are the main reasons to use Python for data analysis:
- Shallow learning curve: Python has a simple and readable syntax, making it accessible to beginners and experts alike.
- Versatility: Python can handle a variety of data types and formats, including CSV, Excel, JSON, SQL databases, Parquet, and others. Also, it is suitable for tasks ranging from simple data cleaning to complex machine learning and deep learning applications.
- Scalability: Python can handle both small datasets and large-scale data processing tasks. For example, libraries like Dask and PySpark help you deal with Big Data with minimal effort.
- Community support: Python has a large and active community of developers and data scientists who contribute to its ecosystem.
- Machine learning and AI integration: Python is the go-to language for machine learning and AI, with libraries like TensorFlow, PyTorch, and Keras supporting advanced analytics and predictive modeling.
- Reproducibility and collaboration: Jupyter Notebooks help you share and reproduce data analysis snippets, which is important for collaboration in data science.
- A single environment for different purposes: Python lets you use the same environment for many different tasks. For example, you can use the same Jupyter Notebook to scrape data from the web and then analyze it. In the same environment, you can also make predictions with machine learning models.
Common Libraries for Data Analysis With Python
Python is also widely used in the analytics field because of its rich ecosystem of libraries. Here are the most common libraries for data analysis in Python:
- NumPy: For numerical computations and handling multi-dimensional arrays.
- Pandas: For data manipulation and analysis, especially with tabular data.
- Matplotlib and Seaborn: For data visualization and creating insightful plots.
- SciPy: For scientific computing and advanced statistical analysis.
- Plotly: For creating interactive and animated plots.
See them in action in the guided section that follows!
Data Analysis With Python: A Complete Example
You now know why to use Python for data analysis and which common libraries support that task. Follow this step-by-step tutorial to learn how to perform data analysis with Python.
In this section, you will analyze Airbnb property information retrieved from a Bright Data free dataset.
Requirements
To follow this guide, you must have Python 3.6 or higher installed on your machine.
Step 1: Set Up the Environment and Install the Dependencies
Suppose you call the main folder of your project `data_analysis/`. At the end of this step, the folder will have the following structure:
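At its simplest, that structure is:

```
data_analysis/
├── analysis.ipynb
└── venv/
```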
Where:
- `analysis.ipynb` is the Jupyter Notebook that contains all the Python data analysis code.
- `venv/` contains the Python virtual environment.
You can create the `venv/` virtual environment directory like so:
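Assuming `python` points to your Python 3 installation, a standard command for that is:

```
python -m venv venv
```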
To activate it on Windows, run:
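On a typical Windows shell, that is:

```
venv\Scripts\activate
```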
Equivalently, on macOS/Linux, execute:
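In a standard macOS/Linux terminal, that means:

```
source venv/bin/activate
```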
In the activated virtual environment, install all the required libraries:
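A plausible install command, covering the libraries actually used in this tutorial plus Jupyter itself, is:

```
pip install pandas numpy matplotlib seaborn jupyter
```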
To create the `analysis.ipynb` file, you first need to enter the `data_analysis/` folder:
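On any platform, that is simply:

```
cd data_analysis
```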
Then, initialize a new Jupyter Notebook with this command:
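Assuming Jupyter was installed in the previous step, that command is:

```
jupyter notebook
```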
You can now access your Jupyter Notebook App at `http://localhost:8888` in your browser.
Create a new file by clicking on the “New > Python 3 (ipykernel)” option:
By default, the new file will be called `untitled.ipynb`. You can rename it in the dashboard as follows:
Great! You are now fully set up for data analysis with Python.
Step 2: Download the Data and Open It
The dataset used for this tutorial comes from Bright Data’s dataset marketplace. To download it, sign up for free on the platform and navigate to your user dashboard. Then, follow the “Web Datasets > Dataset” path to get to the dataset marketplace:
Scroll down and search for the “Airbnb Properties Information” card:
To download the dataset, click on the “Download sample > Download as CSV” option:
You can now rename the downloaded file, for example, as `airbnb.csv`. To open the CSV file in the Jupyter Notebook, write the following in a new cell:
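Assuming the file is saved in the project folder with the `airbnb.csv` name chosen above, the cell would look like this:

```python
import pandas as pd

# load the Airbnb CSV file into a pandas DataFrame
data = pd.read_csv("airbnb.csv")

# show the first 5 rows of the dataset
data.head()
```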
In this snippet:
- The `read_csv()` method opens the CSV file as a pandas DataFrame.
- The `head()` method shows the first 5 rows of the dataset.
Below is the expected result:
As you can see, this dataset has 45 columns. Normally, you would scroll to the right to view them all. However, since the number of columns is high, pandas hides some of them by default, so scrolling alone is not enough to see every column.
To really visualize all the columns, type the following in a separate cell:
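One way to do that with pandas is to lift the default column display limit, for example:

```python
# show all columns instead of truncating the output
pd.set_option("display.max_columns", None)

data.head()
```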
Step 3: Manage NaNs
In computing, `NaN` stands for “Not a Number”. When performing data analysis with Python, you can encounter datasets with empty values, strings where you should find numbers, or cells already labeled as `NaN` (see, for example, the `discount` column in the above image).
As your goal is to analyze data, you have to treat `NaN`s properly. You mainly have three ways to do so:
- Delete all the rows containing `NaN`s.
- Substitute the `NaN`s of a column with the mean calculated on the other numbers of the same column.
- Search for new data to enrich the source dataset.
For the sake of simplicity, let’s follow the first approach.
First, you have to verify whether all the values of the `discount` column are `NaN`s. If so, you can delete the whole column. To verify that, write the following in a new cell:
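With pandas, that check can be written as:

```python
# check whether every value in the "discount" column is NaN
data["discount"].isna().all()
```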
In this snippet, the `isna().all()` call checks whether all the values of the `discount` column are `NaN`s; the column is selected from the dataset with `data["discount"]`.
The result you will obtain is `True`, which means that the `discount` column can be dropped, as all its values are `NaN`s. To achieve that, write:
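For example, you can drop the column and override the original data frame like this:

```python
# drop the "discount" column and override the original DataFrame
data = data.drop("discount", axis=1)
```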
The original dataset has been overridden with a new one without the `discount` column.
Now you can analyze the entire dataset and see if there is any other `NaN` in the rows, like so:
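For instance, you can count the remaining `NaN`s across the whole data frame with:

```python
# count the NaN values over all rows and columns
data.isna().sum().sum()
```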
The result you will receive is 1248, meaning the data frame still contains 1248 `NaN`s. To drop the rows containing at least one `NaN`, type:
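A minimal way to do that is with `dropna()`:

```python
# remove every row that contains at least one NaN
data = data.dropna()
```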
Now, the `data` data frame has no `NaN`s and is ready for Python data analysis without any concerns about skewed outcomes.
To verify that the process went well, you can write:
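For example, you can re-run the same `NaN` count:

```python
# count the remaining missing values after dropping the rows
data.isna().sum().sum()
```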
The expected result is 0.
Step 4: Data Exploration
Before visualizing the Airbnb data, you need to get familiar with it. A good practice is to start by visualizing the statistics of your dataset like so:
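With pandas, that typically means calling `describe()` on the data frame:

```python
# summary statistics for the numerical columns
data.describe()
```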
This is the expected result:
The `describe()` method reports the statistics of the columns that contain numerical values. That is the very first way to start understanding your data. For example, the `host_rating` column reports the following interesting statistics:
- The dataset has a total of 182 host ratings (the `count` value).
- The maximum rating is 5, the minimum is 4.29, and the mean is 4.77.
Still, the above statistics may not be satisfying. So, try to visualize a scatter plot of the `host_rating` column to see if there is any interesting pattern you may want to investigate later. Here is how you can create a scatter plot with `seaborn`:
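Based on the breakdown that follows, a sketch of the cell (the exact figure size is an assumption) is:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# set the size of the figure in inches (assumed values)
plt.figure(figsize=(15, 10))

# scatter plot of host ratings against listing names
sns.scatterplot(data=data, x="host_rating", y="listing_name")
```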
The above snippet does the following:
- Defines the size of the figure (in inches) with the `figure()` method.
- Creates a scatter plot using seaborn through the `scatterplot()` method, configured with:
  - `data=data`: Means it must use the `data` data frame.
  - `x="host_rating"`: Puts the host rating values on the horizontal axis.
  - `y="listing_name"`: Puts the property listing names on the vertical axis.
This is the expected outcome:
Great plot, but we can do better!
Step 5: Data Transformation and Visualization
The previous scatter plot shows that there is no particular pattern in the host ratings. However, the majority of the ratings are greater than 4.7 points.
Imagine you are planning a holiday and want to stay in one of the best places. A question you might ask yourself is, “How much does it cost to stay in a house with a rating of at least 4.8?”
To answer that question, you first need to transform your data!
The transformation you can apply is to create a new data frame containing only the rows where the rating is greater than 4.8. It will have the column `listing_name` with the names of the apartments and the column `total_price` with their prices.
Get that subset and show its statistics with:
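Following the breakdown below, the cell looks something like this:

```python
# keep only the rows with a host rating above 4.8,
# selecting just the listing name and total price columns
high_ratings = data[data["host_rating"] > 4.8][["listing_name", "total_price"]]

# summary statistics of the filtered data frame
high_ratings.describe()
```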
The above snippet creates a new data frame called `high_ratings` like so:
- `data["host_rating"] > 4.8` filters for values greater than 4.8 in the `host_rating` column of the `data` data frame.
- `[["listing_name", "total_price"]]` selects only the `listing_name` and `total_price` columns from the filtered rows.
Below is the expected output:
The statistics show that the average total price of the selected apartments is $321, with a minimum of $19 and a maximum of $4230. This requires further analysis!
Visualize a scatter plot of the prices for the houses with high ratings by employing the same snippet you used before. All you need to do is change the variables used in the chart, like so:
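In practice, that means plotting the `high_ratings` data frame with `total_price` on the horizontal axis, for example:

```python
# same figure size as before (assumed values)
plt.figure(figsize=(15, 10))

# scatter plot of total prices against listing names for the top-rated houses
sns.scatterplot(data=high_ratings, x="total_price", y="listing_name")
```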
And this is the resulting plot:
This plot shows two interesting facts:
- The prices are all mainly under $500.
- The “Entire Cabin in Sevierville” and the “Entire Cabin in Pigeon” present prices that are way above $1000.
A better way to visualize the price range is by showing a box plot. This is how you can do that:
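A possible version of the cell, using seaborn's `boxplot()` on the same data, is:

```python
plt.figure(figsize=(15, 10))

# box plot of the price distribution for each top-rated listing
sns.boxplot(data=high_ratings, x="total_price", y="listing_name")
```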
This time, the resulting chart will be:
If you are asking yourself why the same house can have different costs, you have to remember that you filtered for users’ ratings. This means that different users paid differently and left different ratings.
Additionally, the significant price variation for the “Entire Cabin in Sevierville,” ranging from under $1,000 to over $4,000, may be due to the length of the stay. In detail, the original dataset includes a column called `travel_details`, which contains information about the duration of the stay. The wide price range could indicate that some users rented the house for an extended period. A deeper analysis using Python could help uncover more insights about that!
Step 6: Further Investigations via the Correlation Matrix
Python data analysis is about asking questions and seeking answers within the data you have. One effective way to spark these questions is by visualizing the correlation matrix.
The correlation matrix is a table that shows the correlation coefficients for different variables. The most used correlation coefficient is the Pearson Correlation Coefficient (PCC), which measures the linear correlation between two variables. Its values range from -1 to +1, which means:
- +1: If the value of a variable increases, the other increases linearly.
- -1: If the value of a variable increases, the other decreases linearly.
- 0: You cannot say anything about the linear relation of the two variables (that would require non-linear analysis).
In statistics, the absolute value of the linear correlation coefficient is usually interpreted as follows:
- 0.1-0.5: low correlation.
- 0.5-1: high correlation.
- ~0: no correlation.
To display the correlation matrix for the `data` data frame, you can type the following:
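Based on the breakdown that follows, a possible version of the cell (the figure size, annotations, and `numeric_only` flag are assumptions) is:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))

# upper-triangular mask so the matrix is drawn as a triangle, not a square
mask = np.triu(np.ones_like(data.corr(numeric_only=True), dtype=bool))

# heatmap of the Pearson coefficients computed by data.corr()
# (numeric_only=True restricts the calculation to numerical columns)
sns.heatmap(data.corr(numeric_only=True), mask=mask, annot=True)
```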
The above snippet does the following:
- The `np.triu()` method builds an upper-triangular mask for the matrix. This is used for a better visualization, so that the matrix is shown as a triangle rather than a square.
- The `sns.heatmap()` method creates a heatmap, again for better visualization. Inside it, the `data.corr()` call is what actually calculates the Pearson coefficients for the numerical columns of the `data` data frame.
Below is the result you will obtain:
The main idea when interpreting a correlation matrix is to find variables that have high correlation as these will be the starting point for new and deeper analysis. For example:
- The `lat` and `long` variables have a -0.98 correlation. This is expected, as latitude and longitude are strongly correlated when defining a specific location on Earth.
- The `host_rating` and `long` variables have a -0.69 correlation. This is an interesting result, meaning that the host rating is highly correlated with the longitude variable. So it seems that houses located in a certain area of the world have high host ratings.
- The `lat` and `long` variables have, respectively, a 0.63 and -0.69 correlation with `price`. That is enough to tell that the price per day is highly influenced by the location.

In your analysis, you should also search for non-correlated variables. For example, the coefficient of the `is_supperhost` and `price` variables is -0.18, which means that superhosts do not have the highest prices.
Now that the main concepts are clear, it is your turn to explore and analyze your data!
Step 7: Put It All Together
This is what the final Jupyter Notebook for data analysis with Python will look like:
Note the presence of different cells, each with its output.
The Process Behind Data Analysis With Python
The section above guided you through the process of data analysis with Python. Although it may have seemed like a step-by-step approach driven by opportunity, it was actually built on the following best practices:
- Data retrieval: If the data you need is already available in a database, lucky you! If not, you need to retrieve it using popular data sourcing methods like web scraping.
- Data cleaning: Handle `NaN`s, aggregate data, and apply the first filters to the initial dataset.
- Data exploration: Data exploration (sometimes also called data discovery) is the most important part of data analysis with Python. It requires producing basic plots that help you understand how your data is structured or whether it follows particular patterns.
- Data manipulation: After grasping the main ideas behind the data you are analyzing, you have to manipulate it. This part requires filtering datasets and often combining two or more datasets into one (as if you were performing table joins in SQL).
- Data visualization: This is the final part, where you visually present your data by making multiple plots on the manipulated datasets.
Conclusion
In this guide on data analysis with Python, you learned why you should use Python for analyzing data and which common libraries you can use for that purpose. You also went through a step-by-step tutorial and learned the process to follow when performing data analysis in Python.
You saw that Jupyter Notebooks help you create subsets of your data, visualize them, and discover powerful insights, all while keeping everything structured in the same environment. Now, where can you find ready-to-use datasets? Bright Data has you covered!
Bright Data operates a large, fast, and reliable proxy network, used by many Fortune 500 companies and over 20,000 customers. This network is used to ethically retrieve data from the web and offer it in a vast dataset marketplace, which includes:
- Business Datasets: Data from key sources like LinkedIn, CrunchBase, Owler, and Indeed.
- Ecommerce Datasets: Data from Amazon, Walmart, Target, Zara, Zalando, Asos, and many more.
- Real Estate Datasets: Data from websites such as Zillow, MLS, and more.
- Social Media Datasets: Data from Facebook, Instagram, YouTube, and Reddit.
- Financial Datasets: Data from Yahoo Finance, Market Watch, Investopedia, and more.
Create a free Bright Data account today and explore our datasets.
No credit card required