In this guide on data analysis with Python, you will see:
- Why use Python for data analysis
- Common libraries for data analysis with Python
- A step-by-step tutorial to do data analysis in Python
- The process to follow when analyzing data
Let’s dive in!
Why Use Python for Data Analysis
Data analysis is usually performed with two main programming languages: Python and R.
In particular, below are the main reasons to use Python for data analysis:
- Shallow learning curve: Python has a simple and readable syntax, making it accessible to beginners and experts alike.
- Versatility: Python can handle a variety of data types and formats, including CSV, Excel, JSON, SQL databases, Parquet, and others (see the sketch after this list). Also, it is suitable for tasks ranging from simple data cleaning to complex machine learning and deep learning applications.
- Scalability: Python can handle both small datasets and large-scale data processing tasks. For example, libraries like Dask and PySpark let you work with big data with minimal extra effort.
- Community support: Python has a large and active community of developers and data scientists who contribute to its ecosystem.
- Machine learning and AI integration: Python is the go-to language for machine learning and AI, with libraries like TensorFlow, PyTorch, and Keras supporting advanced analytics and predictive modeling.
- Reproducibility and collaboration: Jupyter Notebooks help you share and reproduce data analysis snippets, which is important for collaboration in data science.
- Unique environment for different purposes: Python offers the possibility to use the same environment for different purposes. For example, you can utilize the same Jupyter Notebook for scraping data from the web and then analyzing it. In the same environment, you can also make predictions with machine learning models.
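To make the versatility point above concrete, here is a minimal sketch of how pandas opens several of those formats with one-liners. The file names are placeholders, not files from this tutorial:
import pandas as pd
# Each loader returns a DataFrame, so the downstream analysis code stays the same
csv_df = pd.read_csv("data.csv")
excel_df = pd.read_excel("data.xlsx")         # requires the openpyxl package
json_df = pd.read_json("data.json")
parquet_df = pd.read_parquet("data.parquet")  # requires the pyarrow package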
Common Libraries for Data Analysis With Python
Python is widely used in the analytics field also thanks to its rich ecosystem of libraries. Here are the most common libraries for data analysis in Python:
- NumPy: For numerical computations and handling multi-dimensional arrays.
- Pandas: For data manipulation and analysis, especially with tabular data.
- Matplotlib and Seaborn: For data visualization and creating insightful plots.
- SciPy: For scientific computing and advanced statistical analysis.
- Plotly: For creating interactive and animated plots.
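As a minimal preview of how a couple of these libraries cooperate, here is a sketch using toy data (for illustration only, nothing from the tutorial's dataset):
import numpy as np
from scipy import stats
# NumPy: simulate 1,000 nightly prices (toy data)
rng = np.random.default_rng(seed=42)
prices = rng.lognormal(mean=4.5, sigma=0.6, size=1_000)
# SciPy: one-call summary statistics on the NumPy array
summary = stats.describe(prices)
print(f"mean={summary.mean:.2f}, variance={summary.variance:.2f}")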
See them in action in the guided section that follows!
Data Analysis With Python: A Complete Example
You now know why to use Python for data analysis and which common libraries support that task. Follow this step-by-step tutorial to learn how to perform data analysis with Python.
In this section, you will analyze Airbnb property information retrieved from a Bright Data free dataset.
Requirements
To follow this guide, you must have Python 3.6 or higher installed on your machine.
Step 1: Set Up the Environment and Install the Dependencies
Suppose you call the main folder of your project data_analysis/. At the end of this step, the folder will have the following structure:
data_analysis/
├── analysis.ipynb
└── venv/
Where:
- analysis.ipynb is the Jupyter Notebook that contains all the Python data analysis code.
- venv/ contains the Python virtual environment.
You can create the venv/ virtual environment directory like so:
python -m venv venv
To activate it on Windows, run:
venv\Scripts\activate
Equivalently, on macOS/Linux, execute:
source venv/bin/activate
In the activated virtual environment, install all the required libraries:
pip install pandas jupyter matplotlib seaborn numpy
To create the analysis.ipynb file, you first need to enter the data_analysis/ folder:
cd data_analysis
Then, initialize a new Jupyter Notebook with this command:
jupyter notebook
You can now access your Jupyter Notebook App at http://localhost:8888 in your browser.
Create a new file by clicking on the “New > Python 3 (ipykernel)” option:
By default, the new file will be called untitled.ipynb. You can rename it in the dashboard as follows:
Great! You are now fully set up for data analysis with Python.
Step 2: Download the Data and Open It
The dataset used for this tutorial comes from Bright Data’s dataset marketplace. To download it, sign up for free on the platform and navigate to your user dashboard. Then, follow the “Web Datasets > Dataset” path to get to the dataset marketplace:
Scroll down and search for the “Airbnb Properties Information” card:
To download the dataset, click on the “Download sample > Download as CSV” option:
You can now rename the downloaded file, for example, as airbnb.csv. To open the CSV file in the Jupyter Notebook, write the following in a new cell:
import pandas as pd
# Open CSV
data = pd.read_csv("airbnb.csv")
# Show head
data.head()
In this snippet:
- The read_csv() method opens the CSV file as a pandas DataFrame.
- The head() method shows the first 5 rows of the DataFrame.
Below is the expected result:
As you can see, this dataset has 45 columns. To see them, you would need to scroll the bar to the right. However, with this many columns, pandas hides some of them by default, so scrolling alone will not reveal them all.
To really visualize all the columns, type the following in a separate cell:
# Show all columns
pd.set_option("display.max_columns", None)
# Display the data frame
print(data)
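Another quick way to get an overview, using only standard pandas, is to print the shape of the data frame and its per-column summary:
# Number of rows and columns
print(data.shape)
# Column names, dtypes, and non-null counts
data.info()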
Step 3: Manage NaNs
In computing, NaN stands for “Not a Number”. When performing data analysis with Python, you can encounter datasets with empty values, strings where you should find numbers, or cells already labeled as NaN (see, for example, the discount column in the above image).
As your goal is to analyze data, you have to treat NaNs properly. You mainly have three ways to do so:
- Delete all the rows containing NaNs.
- Substitute the NaNs of a column with the mean calculated on the other numbers of the same column (sketched right after this list).
- Search for new data to enrich the source dataset.
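For reference, this is what the second option looks like in pandas. It is only a sketch, using ratings as an example numeric column, since the tutorial will follow the first approach instead:
# Sketch of option 2: replace the NaNs of a numeric column with the column mean
# (for illustration only; the tutorial below follows option 1 instead)
data["ratings"] = data["ratings"].fillna(data["ratings"].mean())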
For the sake of simplicity, let’s follow the first approach.
First, you have to verify whether all the values of the discount column are NaNs. If so, you can delete the whole column. To verify that, write the following in a new cell:
import numpy as np
is_discount_all_nan = data["discount"].isna().all()
print(f"Is the 'discount' column all NaNs? {is_discount_all_nan}")
In this snippet, the isna().all() method checks whether all the values of the discount column, selected from the dataset with data["discount"], are NaNs.
The result you will obtain is True, which means that the discount column can be dropped, as all its values are NaNs. To achieve that, write:
data = data.drop(columns=["discount"])
The original dataset has now been overwritten with a new one without the discount column.
Now you can analyze the entire dataset and see whether the remaining rows contain any other NaNs like so:
total_nans = data.isna().sum().sum()
print(f"Total number of NaN values in the data frame: {total_nans}")
The result you will receive is:
Total number of NaN values in the data frame: 1248
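Those 1248 NaNs are scattered across the data frame. Before dropping anything, you can optionally check which columns they live in with standard pandas:
# Count NaNs per column and show the most affected ones
print(data.isna().sum().sort_values(ascending=False).head(10))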
To drop every row containing at least one NaN, type:
data = data.dropna()
Now, the data data frame has no NaNs and is ready for Python data analysis without any concerns about skewed outcomes.
To verify that the process went well, you can write:
print(data.isna().sum().sum())
The expected result is 0.
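Note that a blanket dropna() removes a row if any of the remaining columns is missing, which can be aggressive on a dataset this wide. A gentler variant, sketched below with example columns from this dataset, only considers the fields you actually plan to analyze:
# Sketch: drop rows only when key analysis columns are missing
# (an alternative to the blanket dropna() used above)
data_subset = data.dropna(subset=["price", "ratings", "host_rating"])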
Step 4: Data Exploration
Before visualizing the Airbnb data, you need to get familiar with it. A good practice is to start by looking at the summary statistics of your dataset like so:
# Show statistics of the entire dataset
statistics = data.describe()
# Print statistics
print(statistics)
This is the expected result:
price ratings lat long guests \
count 182.000000 182.000000 182.000000 182.000000 182.000000
mean 147.523352 4.804505 6.754955 -68.300942 6.554945
std 156.574795 0.209834 27.795750 24.498326 3.012818
min 16.000000 4.000000 -21.837300 -106.817450 2.000000
25% 50.000000 4.710000 -21.717270 -86.628968 4.000000
50% 89.500000 4.865000 30.382710 -83.479890 6.000000
75% 180.750000 4.950000 30.398860 -43.925480 8.000000
max 1003.000000 5.000000 40.481580 -43.801300 16.000000
property_id host_number_of_reviews host_rating hosts_year \
count 1.820000e+02 182.000000 182.000000 182.000000
mean 1.323460e+17 3216.879121 4.776099 7.324176
std 3.307809e+17 4812.876819 0.138849 2.583280
min 3.089381e+06 2.000000 4.290000 1.000000
25% 3.107102e+07 73.000000 4.710000 6.000000
50% 4.375321e+07 3512.000000 4.710000 9.000000
75% 4.538668e+07 3512.000000 4.890000 9.000000
max 1.242049e+18 20189.000000 5.000000 11.000000
host_response_rate total_price
count 182.000000 182.000000
mean 98.538462 859.317363
std 8.012156 1498.684990
min 25.000000 19.000000
25% 100.000000 111.500000
50% 100.000000 350.000000
75% 100.000000 934.750000
max 100.000000 13165.000000
The describe() method reports the statistics related to the columns that have numerical values. That is the very first step toward understanding your data. For example, the host_rating column reports the following interesting statistics:
- The dataset has a total of 182 ratings (the count value).
- The maximum rating is 5, the minimum is 4.29, and the mean is 4.77.
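You can also pull those numbers for a single column directly, without scanning the whole describe() table:
# Count, extremes, and mean of the host_rating column
host_rating_stats = data["host_rating"].agg(["count", "min", "max", "mean"])
print(host_rating_stats)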
Still, the above statistics may not be enough. So, try to visualize a scatter plot of the host_rating column to see if there is any interesting pattern you may want to investigate later. Here is how you can create a scatter plot with seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
# Define figure size
plt.figure(figsize=(15, 10))
# Plot the data
sns.scatterplot(data=data, x="host_rating", y="listing_name")
# Labeling
plt.title("HOST RATINGS SCATTERPLOT", fontsize=20)
plt.xlabel("Host ratings", fontsize=16)
plt.ylabel("Houses", fontsize=16)
# Show plot
plt.show()
The above snippet does the following:
- Defines the size of the figure (in inches) with the figure() method.
- Creates a scatter plot by using seaborn through the scatterplot() method, configured with:
  - data=data: Means it must use the data data frame.
  - x="host_rating": Puts the host rating values on the horizontal axis.
  - y="listing_name": Puts the property listing names on the vertical axis.
This is the expected outcome:
Great plot, but we can do better!
Step 5: Data Transformation and Visualization
The previous scatter plot shows that there is no particular pattern in the host ratings. However, the majority of the ratings are greater than 4.7 points.
Imagine you are planning a holiday and want to stay in one of the best places. A question you might ask yourself is, “How much does it cost to stay in a house with a rating of at least 4.8?”
To answer that question, you first need to transform your data!
The transformation you can do is to create a new data frame containing only the rows where the rating is greater than 4.8. This will contain the listing_name column with the names of the apartments and the total_price column with their prices.
Get that subset and show its statistics with:
# Filter the DataFrame
high_ratings = data[data["host_rating"] > 4.8][["listing_name", "total_price"]]
# Calculate and print statistics
high_ratings_statistics = high_ratings.describe()
print(high_ratings_statistics)
The above snippet creates a new data frame called high_ratings like so:
- data["host_rating"] > 4.8 filters for values greater than 4.8 in the host_rating column of the data data frame.
- [["listing_name", "total_price"]] selects only the listing_name and total_price columns from the filtered rows.
Below is the expected output:
total_price
count 78.000000
mean 321.061026
std 711.340269
min 19.000000
25% 78.250000
50% 116.000000
75% 206.000000
max 4230.000000
The statistics show that the average total price of the selected apartments is $321, with a minimum of $19 and a maximum of $4,230. That calls for further analysis!
Visualize a scatter plot of the prices of the highly rated houses by employing the same snippet you used before. All you need to do is change the variables used in the chart like so:
# Show grid for better visualization (the style must be set before plotting)
sns.set_style("ticks", {'axes.grid': True})
# Define figure size
plt.figure(figsize=(12, 8))
# Plot the data
sns.scatterplot(data=high_ratings, x='total_price', y='listing_name')
# Labeling
plt.title('HIGH RATING HOUSES PRICES', fontsize=20)
plt.xlabel('Prices', fontsize=16)
plt.ylabel('Houses', fontsize=16)
# Show plot
plt.show()
And this is the resulting plot:
This plot shows two interesting facts:
- The prices are mostly under $500.
- The “Entire Cabin in Sevierville” and the “Entire Cabin in Pigeon” have prices well above $1,000.
A better way to visualize the price range is by showing a box plot. This is how you can do that:
# Define figure size
plt.figure(figsize=(15, 10))
# Plotting the boxplot
sns.boxplot(data=high_ratings, x='total_price', y='listing_name')
# Labeling
plt.title('HIGH RATING HOUSES PRICES - BOXPLOT', fontsize=20)
plt.xlabel('Prices', fontsize=16)
plt.ylabel('Houses', fontsize=16)
# Show plot
plt.show()
This time, the resulting chart will be:
If you are asking yourself why the same house can have different costs, remember that the dataset reflects individual users' stays: different users paid different prices and left different ratings.
Additionally, the significant price variation for the “Entire Cabin in Sevierville,” ranging from under $1,000 to over $4,000, may be due to the length of the stay. In detail, the original dataset includes a column called travel_details, which contains information about the duration of the stay. The wide price range could indicate that some users rented the house for an extended period. A deeper analysis using Python could help uncover more insights about that, as sketched below.
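As a starting point for that deeper analysis, you could relate the prices of a single listing to its stay details. The sketch below assumes the travel_details column can be grouped as-is, which may not match its exact format in the dataset:
# Sketch: group the Sevierville cabin's prices by stay details
# (assumes "travel_details" is a groupable column in the raw CSV)
raw = pd.read_csv("airbnb.csv")
sevierville = raw[raw["listing_name"].str.contains("Sevierville", na=False)]
print(sevierville.groupby("travel_details")["total_price"].agg(["count", "min", "max", "mean"]))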
Step 6: Further Investigations via the Correlation Matrix
Python data analysis is about asking questions and seeking answers within the data you have. One effective way to spark these questions is by visualizing the correlation matrix.
The correlation matrix is a table that shows the correlation coefficients for different variables. The most used correlation coefficient is the Pearson Correlation Coefficient (PCC), which measures the linear correlation between two variables. Its values range from -1 to +1, which means:
- +1: If the value of a variable increases, the other increases linearly.
- -1: If the value of a variable increases, the other decreases linearly.
- 0: You cannot say anything about the linear relationship between the two variables (exploring it requires non-linear analysis).
In statistics, the values of linear correlation are commonly interpreted as follows:
- 0.1 to 0.5 (in absolute value): low correlation.
- 0.6 to 1 (in absolute value): high correlation.
- 0: no correlation.
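Before plotting the whole matrix, you can sanity-check a single pair of columns. For instance, the PCC between the lat and long columns takes one line, since pandas' Series.corr() computes the Pearson coefficient by default:
# Pearson correlation between two columns of the data frame
pcc = data["lat"].corr(data["long"])
print(f"PCC between lat and long: {pcc:.2f}")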
To display the correlation matrix for the data data frame, you can type the following:
# Set the image dimensions
plt.figure(figsize=(12, 10))
# Labeling
plt.title('CORRELATION MATRIX', fontsize=20)
plt.xticks(fontsize=16)  # x-axis font size
plt.yticks(fontsize=16)  # y-axis font size
# Compute the Pearson coefficients on numeric columns only
correlation_matrix = data.corr(numeric_only=True)
# Mask to hide the redundant upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# Correlation matrix heatmap
dataplot = sns.heatmap(correlation_matrix, annot=True, fmt='.2f', mask=mask, annot_kws={"size": 12})
The above snippet does the following:
- The np.triu() method keeps only the upper triangle of a matrix. Here it is used as a mask for better visualization, so that the matrix is shown as a triangle and not as a redundant square.
- The sns.heatmap() method creates a heatmap, also for better visualization. The values it displays come from data.corr(numeric_only=True), which actually calculates the Pearson coefficient for each pair of numeric columns in the data data frame.
Below is the result you will obtain:
The main idea when interpreting a correlation matrix is to find variables that have high correlation as these will be the starting point for new and deeper analysis. For example:
- The lat and long variables have a -0.98 correlation. This is expected, as latitude and longitude are strongly correlated when defining a specific location on Earth.
- The host_rating and long variables have a -0.69 correlation. This is an interesting result: the rating of the host is highly correlated with the longitude, so houses located in a certain area of the world seem to receive high host ratings.
- The lat and long variables have, respectively, a 0.63 and a -0.69 correlation with price. That is enough to tell that the price per day is highly influenced by the location.
In your analysis, you should also search for non-correlated variables. For example, the coefficient between is_supperhost and price is -0.18, which means that superhosts do not have the highest prices.
Now that the main concepts are clear, it is your turn to explore and analyze your data!
Step 7: Put It All Together
This is what the final Jupyter Notebook for data analysis with Python will look like:
Note the presence of different cells, each with its output.
The Process Behind Data Analysis With Python
The section above guided you through the process of data analysis with Python. Although it may have seemed like a step-by-step approach driven by convenience, it was actually built on the following best practices:
- Data retrieval: If you are lucky enough to have the data you need in a database, lucky you! If not, you need to retrieve it using popular data sourcing methods like web scraping.
- Data cleaning: Handle NaNs, aggregate data, and apply the first filters to the initial dataset.
- Data exploration: Data exploration, sometimes also called data discovery, is the most important part of data analysis with Python. It requires producing basic plots that help you understand how your data is structured and whether it follows particular patterns.
- Data manipulation: After grasping the main ideas behind the data you are analyzing, you have to manipulate it. This part requires filtering datasets and often combining two or more datasets into one (as if you were performing table joins in SQL).
- Data visualization: This is the final part, where you visually present your data by making multiple plots from the manipulated datasets.
Conclusion
In this guide on data analysis with Python, you learned why you should use Python for analyzing data and which common libraries support that purpose. You also went through a step-by-step tutorial and learned the process to follow to perform data analysis in Python.
You saw that Jupyter Notebooks help you create subsets of your data, visualize them, and discover powerful insights, all while keeping everything structured in the same environment. Now, where can you find ready-to-use datasets? Bright Data has you covered!
Bright Data operates a large, fast, and reliable proxy network, used by many Fortune 500 companies and over 20,000 customers. This network is used to ethically retrieve data from the web, which is then offered in a vast dataset marketplace that includes:
- Business Datasets: Data from key sources like LinkedIn, CrunchBase, Owler, and Indeed.
- Ecommerce Datasets: Data from Amazon, Walmart, Target, Zara, Zalando, Asos, and many more.
- Real Estate Datasets: Data from websites such as Zillow, MLS, and more.
- Social Media Datasets: Data from Facebook, Instagram, YouTube, and Reddit.
- Financial Datasets: Data from Yahoo Finance, Market Watch, Investopedia, and more.
Create a free Bright Data account today and explore our datasets.
No credit card required