In this guide on data analysis with Python, you will see:
- Why use Python for data analysis
- Common libraries for data analysis with Python
- A step-by-step tutorial to do data analysis in Python
- The process to follow when analyzing data
Let’s dive in!
Why Use Python for Data Analysis
Data analysis is usually performed with two main programming languages: Python and R.
In particular, below are the main reasons to use Python for data analysis:
- Shallow learning curve: Python has a simple and readable syntax, making it accessible to beginners and experts alike.
- Versatility: Python can handle a variety of data types and formats, including CSV, Excel, JSON, SQL databases, Parquet, and others (see the sketch after this list). Also, it is suitable for tasks ranging from simple data cleaning to complex machine learning and deep learning applications.
- Scalability: Python can handle both small datasets and large-scale data processing tasks. For example, libraries like Dask and PySpark let you work with big data with minimal extra effort.
- Community support: Python has a large and active community of developers and data scientists who contribute to its ecosystem.
- Machine learning and AI integration: Python is the go-to language for machine learning and AI, with libraries like TensorFlow, PyTorch, and Keras supporting advanced analytics and predictive modeling.
- Reproducibility and collaboration: Jupyter Notebooks help you share and reproduce data analysis snippets, which is important for collaboration in data science.
- Unique environment for different purposes: Python offers the possibility to use the same environment for different purposes. For example, you can utilize the same Jupyter Notebook for scraping data from the web and then analyzing it. In the same environment, you can also make predictions with machine learning models.
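To make the versatility point above concrete, here is a minimal sketch of how pandas opens several of those formats with one-liners. The file names are placeholders, not files from this tutorial:
import pandas as pd
# Each loader returns a DataFrame, so the downstream analysis code stays the same
csv_df = pd.read_csv("data.csv")
excel_df = pd.read_excel("data.xlsx")         # requires the openpyxl package
json_df = pd.read_json("data.json")
parquet_df = pd.read_parquet("data.parquet")  # requires the pyarrow package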
Common Libraries for Data Analysis With Python
Python is widely used in the analytics field also thanks to its rich ecosystem of libraries. Here are the most common libraries for data analysis in Python:
- NumPy: For numerical computations and handling multi-dimensional arrays.
- Pandas: For data manipulation and analysis, especially with tabular data.
- Matplotlib and Seaborn: For data visualization and creating insightful plots.
- SciPy: For scientific computing and advanced statistical analysis.
- Plotly: For creating interactive and animated plots.
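As a minimal preview of how a couple of these libraries cooperate, here is a sketch using toy data (for illustration only, nothing from the tutorial's dataset):
import numpy as np
from scipy import stats
# NumPy: simulate 1,000 nightly prices (toy data)
rng = np.random.default_rng(seed=42)
prices = rng.lognormal(mean=4.5, sigma=0.6, size=1_000)
# SciPy: one-call summary statistics on the NumPy array
summary = stats.describe(prices)
print(f"mean={summary.mean:.2f}, variance={summary.variance:.2f}")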
See them in action in the guided section that follows!
Data Analysis With Python: A Complete Example
You now know why to use Python for data analysis and which common libraries support that task. Follow this step-by-step tutorial to learn how to perform data analysis with Python.
In this section, you will analyze Airbnb property information retrieved from a Bright Data free dataset.
Requirements
To follow this guide, you must have Python 3.6 or higher installed on your machine.
Step 1: Set Up the Environment and Install the Dependencies
Suppose you call the main folder of your project data_analysis/. At the end of this step, the folder will have the following structure:
data_analysis/
├── analysis.ipynb
└── venv/
Where:
- analysis.ipynb is the Jupyter Notebook that contains all the Python data analysis code.
- venv/ contains the Python virtual environment.
You can create the venv/ virtual environment directory like so:
python -m venv venv
To activate it on Windows, run:
venv\Scripts\activate
Equivalently, on macOS/Linux, execute:
source venv/bin/activate
In the activated virtual environment, install all the required libraries:
pip install pandas jupyter matplotlib seaborn numpy
To create the analysis.ipynb file, you first need to enter the data_analysis/ folder:
cd data_analysis
Then, initialize a new Jupyter Notebook with this command:
jupyter notebook
You can now access your Jupyter Notebook App at http://localhost:8888 in your browser.
Create a new file by clicking on the “New > Python 3 (ipykernel)” option:
By default, the new file will be called untitled.ipynb. You can rename it in the dashboard as follows:
Great! You are now fully set up for data analysis with Python.
Step 2: Download the Data and Open It
The dataset used for this tutorial comes from Bright Data’s dataset marketplace. To download it, sign up for free on the platform and navigate to your user dashboard. Then, follow the “Web Datasets > Dataset” path to get to the dataset marketplace:
Scroll down and search for the “Airbnb Properties Information” card:
To download the dataset, click on the “Download sample > Download as CSV” option:
You can now rename the downloaded file, for example, as airbnb.csv. To open the CSV file in the Jupyter Notebook, write the following in a new cell:
import pandas as pd
# Open CSV
data = pd.read_csv("airbnb.csv")
# Show head
data.head()
In this snippet:
- The read_csv() method opens the CSV file as a pandas DataFrame.
- The head() method shows the first 5 rows of the DataFrame.
Below is the expected result:
As you can see, this dataset has 45 columns. To see them, you would need to scroll the bar to the right. However, with this many columns, pandas hides some of them by default, so scrolling alone will not reveal them all.
To really visualize all the columns, type the following in a separate cell:
# Show all columns
pd.set_option("display.max_columns", None)
# Display the data frame
print(data)
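Another quick way to get an overview, using only standard pandas, is to print the shape of the data frame and its per-column summary:
# Number of rows and columns
print(data.shape)
# Column names, dtypes, and non-null counts
data.info()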
Step 3: Manage NaNs
In computing, NaN stands for “Not a Number”. When performing data analysis with Python, you can encounter datasets with empty values, strings where you should find numbers, or cells already labeled as NaN (see, for example, the discount column in the above image).
As your goal is to analyze data, you have to treat NaNs properly. You mainly have three ways to do so:
- Delete all the rows containing NaNs.
- Substitute the NaNs of a column with the mean calculated on the other numbers of the same column (sketched right after this list).
- Search for new data to enrich the source dataset.
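For reference, this is what the second option looks like in pandas. It is only a sketch, using ratings as an example numeric column, since the tutorial will follow the first approach instead:
# Sketch of option 2: replace the NaNs of a numeric column with the column mean
# (for illustration only; the tutorial below follows option 1 instead)
data["ratings"] = data["ratings"].fillna(data["ratings"].mean())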
For the sake of simplicity, let’s follow the first approach.
First, you have to verify whether all the values of the discount column are NaNs. If so, you can delete the whole column. To verify that, write the following in a new cell:
import numpy as np
is_discount_all_nan = data["discount"].isna().all()
print(f"Is the 'discount' column all NaNs? {is_discount_all_nan}")
In this snippet, the isna().all() method checks whether all the values of the discount column, selected from the dataset with data["discount"], are NaNs.
The result you will obtain is True, which means that the discount column can be dropped, as all its values are NaNs. To achieve that, write:
data = data.drop(columns=["discount"])
The original dataset has now been overwritten with a new one without the discount column.
Now you can analyze the entire dataset and see whether the remaining rows contain any other NaNs like so:
total_nans = data.isna().sum().sum()
print(f"Total number of NaN values in the data frame: {total_nans}")
The result you will receive is:
Total number of NaN values in the data frame: 1248
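Those 1248 NaNs are scattered across the data frame. Before dropping anything, you can optionally check which columns they live in with standard pandas:
# Count NaNs per column and show the most affected ones
print(data.isna().sum().sort_values(ascending=False).head(10))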
To drop every row containing at least one NaN, type:
data = data.dropna()
Now, the data data frame has no NaNs and is ready for Python data analysis without any concerns about skewed outcomes.
To verify that the process went well, you can write:
print(data.isna().sum().sum())
The expected result is 0.
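Note that a blanket dropna() removes a row if any of the remaining columns is missing, which can be aggressive on a dataset this wide. A gentler variant, sketched below with example columns from this dataset, only considers the fields you actually plan to analyze:
# Sketch: drop rows only when key analysis columns are missing
# (an alternative to the blanket dropna() used above)
data_subset = data.dropna(subset=["price", "ratings", "host_rating"])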
Step 4: Data Exploration
Before visualizing the Airbnb data, you need to get familiar with it. A good practice is to start by looking at the summary statistics of your dataset like so:
# Show statistics of the entire dataset
statistics = data.describe()
# Print statistics
print(statistics)
This is the expected result:
price ratings lat long guests \
count 182.000000 182.000000 182.000000 182.000000 182.000000
mean 147.523352 4.804505 6.754955 -68.300942 6.554945
std 156.574795 0.209834 27.795750 24.498326 3.012818
min 16.000000 4.000000 -21.837300 -106.817450 2.000000
25% 50.000000 4.710000 -21.717270 -86.628968 4.000000
50% 89.500000 4.865000 30.382710 -83.479890 6.000000
75% 180.750000 4.950000 30.398860 -43.925480 8.000000
max 1003.000000 5.000000 40.481580 -43.801300 16.000000
property_id host_number_of_reviews host_rating hosts_year \
count 1.820000e+02 182.000000 182.000000 182.000000
mean 1.323460e+17 3216.879121 4.776099 7.324176
std 3.307809e+17 4812.876819 0.138849 2.583280
min 3.089381e+06 2.000000 4.290000 1.000000
25% 3.107102e+07 73.000000 4.710000 6.000000
50% 4.375321e+07 3512.000000 4.710000 9.000000
75% 4.538668e+07 3512.000000 4.890000 9.000000
max 1.242049e+18 20189.000000 5.000000 11.000000
host_response_rate total_price
count 182.000000 182.000000
mean 98.538462 859.317363
std 8.012156 1498.684990
min 25.000000 19.000000
25% 100.000000 111.500000
50% 100.000000 350.000000
75% 100.000000 934.750000
max 100.000000 13165.000000
The describe() method reports the statistics related to the columns that have numerical values. That is the very first step toward understanding your data. For example, the host_rating column reports the following interesting statistics:
- The dataset has a total of 182 ratings (the count value).
- The maximum rating is 5, the minimum is 4.29, and the mean is 4.77.
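You can also pull those numbers for a single column directly, without scanning the whole describe() table:
# Count, extremes, and mean of the host_rating column
host_rating_stats = data["host_rating"].agg(["count", "min", "max", "mean"])
print(host_rating_stats)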
Still, the above statistics may not be enough. So, try to visualize a scatter plot of the host_rating column to see if there is any interesting pattern you may want to investigate later. Here is how you can create a scatter plot with seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
# Define figure size
plt.figure(figsize=(15, 10))
# Plot the data
sns.scatterplot(data=data, x="host_rating", y="listing_name")
# Labeling
plt.title("HOST RATINGS SCATTERPLOT", fontsize=20)
plt.xlabel("Host ratings", fontsize=16)
plt.ylabel("Houses", fontsize=16)
# Show plot
plt.show()
The above snippet does the following:
- Defines the size of the figure (in inches) with the figure() method.
- Creates a scatter plot by using seaborn through the scatterplot() method, configured with:
  - data=data: Means it must use the data data frame.
  - x="host_rating": Puts the host rating values on the horizontal axis.
  - y="listing_name": Puts the property listing names on the vertical axis.
This is the expected outcome:
Great plot, but we can do better!
Step 5: Data Transformation and Visualization
The previous scatter plot shows that there is no particular pattern in the host ratings. However, the majority of the ratings are greater than 4.7 points.
Imagine you are planning a holiday and want to stay in one of the best places. A question you might ask yourself is, “How much does it cost to stay in a house with a rating of at least 4.8?”
To answer that question, you first need to transform your data!
The transformation you can do is to create a new data frame containing only the rows where the rating is greater than 4.8. This will contain the listing_name column with the names of the apartments and the total_price column with their prices.
Get that subset and show its statistics with:
# Filter the DataFrame
high_ratings = data[data["host_rating"] > 4.8][["listing_name", "total_price"]]
# Calculate and print statistics
high_ratings_statistics = high_ratings.describe()
print(high_ratings_statistics)
The above snippet creates a new data frame called high_ratings like so:
- data["host_rating"] > 4.8 filters for values greater than 4.8 in the host_rating column of the data data frame.
- [["listing_name", "total_price"]] selects only the listing_name and total_price columns from the filtered rows.
Below is the expected output:
total_price
count 78.000000
mean 321.061026
std 711.340269
min 19.000000
25% 78.250000
50% 116.000000
75% 206.000000
max 4230.000000
The statistics show that the average total price of the selected apartments is $321, with a minimum of $19 and a maximum of $4,230. That calls for further analysis!
Visualize a scatter plot of the prices of the highly rated houses by employing the same snippet you used before. All you need to do is change the variables used in the chart like so:
# Show grid for better visualization (the style must be set before plotting)
sns.set_style("ticks", {'axes.grid': True})
# Define figure size
plt.figure(figsize=(12, 8))
# Plot the data
sns.scatterplot(data=high_ratings, x='total_price', y='listing_name')
# Labeling
plt.title('HIGH RATING HOUSES PRICES', fontsize=20)
plt.xlabel('Prices', fontsize=16)
plt.ylabel('Houses', fontsize=16)
# Show plot
plt.show()
And this is the resulting plot:
This plot shows two interesting facts:
- The prices are mostly under $500.
- The “Entire Cabin in Sevierville” and the “Entire Cabin in Pigeon” have prices well above $1,000.
A better way to visualize the price range is by showing a box plot. This is how you can do that:
# Define figure size
plt.figure(figsize=(15, 10))
# Plotting the boxplot
sns.boxplot(data=high_ratings, x='total_price', y='listing_name')
# Labeling
plt.title('HIGH RATING HOUSES PRICES - BOXPLOT', fontsize=20)
plt.xlabel('Prices', fontsize=16)
plt.ylabel('Houses', fontsize=16)
# Show plot
plt.show()
This time, the resulting chart will be:
If you are asking yourself why the same house can have different costs, remember that the dataset reflects individual users' stays: different users paid different prices and left different ratings.
Additionally, the significant price variation for the “Entire Cabin in Sevierville,” ranging from under $1,000 to over $4,000, may be due to the length of the stay. In detail, the original dataset includes a column called travel_details, which contains information about the duration of the stay. The wide price range could indicate that some users rented the house for an extended period. A deeper analysis using Python could help uncover more insights about that, as sketched below.
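As a starting point for that deeper analysis, you could relate the prices of a single listing to its stay details. The sketch below assumes the travel_details column can be grouped as-is, which may not match its exact format in the dataset:
# Sketch: group the Sevierville cabin's prices by stay details
# (assumes "travel_details" is a groupable column in the raw CSV)
raw = pd.read_csv("airbnb.csv")
sevierville = raw[raw["listing_name"].str.contains("Sevierville", na=False)]
print(sevierville.groupby("travel_details")["total_price"].agg(["count", "min", "max", "mean"]))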
Step 6: Further Investigations via the Correlation Matrix
Python data analysis is about asking questions and seeking answers within the data you have. One effective way to spark these questions is by visualizing the correlation matrix.
The correlation matrix is a table that shows the correlation coefficients for different variables. The most used correlation coefficient is the Pearson Correlation Coefficient (PCC), which measures the linear correlation between two variables. Its values range from -1 to +1, which means:
- +1: If the value of a variable increases, the other increases linearly.
- -1: If the value of a variable increases, the other decreases linearly.
- 0: You cannot say anything about the linear relationship between the two variables (exploring it requires non-linear analysis).
In statistics, the values of linear correlation are commonly interpreted as follows:
- 0.1 to 0.5 (in absolute value): low correlation.
- 0.6 to 1 (in absolute value): high correlation.
- 0: no correlation.
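Before plotting the whole matrix, you can sanity-check a single pair of columns. For instance, the PCC between the lat and long columns takes one line, since pandas' Series.corr() computes the Pearson coefficient by default:
# Pearson correlation between two columns of the data frame
pcc = data["lat"].corr(data["long"])
print(f"PCC between lat and long: {pcc:.2f}")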
To display the correlation matrix for the data data frame, you can type the following:
# Set the image dimensions
plt.figure(figsize=(12, 10))
# Labeling
plt.title('CORRELATION MATRIX', fontsize=20)
plt.xticks(fontsize=16)  # x-axis font size
plt.yticks(fontsize=16)  # y-axis font size
# Compute the Pearson coefficients on numeric columns only
correlation_matrix = data.corr(numeric_only=True)
# Mask to hide the redundant upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# Correlation matrix heatmap
dataplot = sns.heatmap(correlation_matrix, annot=True, fmt='.2f', mask=mask, annot_kws={"size": 12})
The above snippet does the following:
- The np.triu() method keeps only the upper triangle of a matrix. Here it is used as a mask for better visualization, so that the matrix is shown as a triangle and not as a redundant square.
- The sns.heatmap() method creates a heatmap, also for better visualization. The values it displays come from data.corr(numeric_only=True), which actually calculates the Pearson coefficient for each pair of numeric columns in the data data frame.
Below is the result you will obtain:
The main idea when interpreting a correlation matrix is to find variables that have high correlation as these will be the starting point for new and deeper analysis. For example:
- The lat and long variables have a -0.98 correlation. This is expected, as latitude and longitude are strongly correlated when defining a specific location on Earth.
- The host_rating and long variables have a -0.69 correlation. This is an interesting result: the rating of the host is highly correlated with the longitude, so houses located in a certain area of the world seem to receive high host ratings.
- The lat and long variables have, respectively, a 0.63 and a -0.69 correlation with price. That is enough to tell that the price per day is highly influenced by the location.
In your analysis, you should also search for non-correlated variables. For example, the coefficient between is_supperhost and price is -0.18, which means that superhosts do not have the highest prices.
Now that the main concepts are clear, it is your turn to explore and analyze your data!
Step 7: Put It All Together
This is what the final Jupyter Notebook for data analysis with Python will look like:
Note the presence of different cells, each with its output.
The Process Behind Data Analysis With Python
The section above guided you through the process of data analysis with Python. Although it may have seemed like a step-by-step approach driven by convenience, it was actually built on the following best practices:
- Data retrieval: If you are lucky enough to have the data you need in a database, lucky you! If not, you need to retrieve it using popular data sourcing methods like web scraping.
- Data cleaning: Handle NaNs, aggregate data, and apply the first filters to the initial dataset.
- Data exploration: Data exploration, sometimes also called data discovery, is the most important part of data analysis with Python. It requires producing basic plots that help you understand how your data is structured and whether it follows particular patterns.
- Data manipulation: After grasping the main ideas behind the data you are analyzing, you have to manipulate it. This part requires filtering datasets and often combining two or more datasets into one (as if you were performing table joins in SQL).
- Data visualization: This is the final part, where you visually present your data by making multiple plots from the manipulated datasets.
Conclusion
In this guide on data analysis with Python, you learned why you should use Python for analyzing data and which common libraries support that purpose. You also went through a step-by-step tutorial and learned the process to follow to perform data analysis in Python.
You saw that Jupyter Notebooks help you create subsets of your data, visualize them, and discover powerful insights, all while keeping everything structured in the same environment. Now, where can you find ready-to-use datasets? Bright Data has you covered!
Bright Data operates a large, fast, and reliable proxy network, used by many Fortune 500 companies and over 20,000 customers. This network is used to ethically retrieve data from the web, which is then offered in a vast dataset marketplace that includes:
- Business Datasets: Data from key sources like LinkedIn, CrunchBase, Owler, and Indeed.
- Ecommerce Datasets: Data from Amazon, Walmart, Target, Zara, Zalando, Asos, and many more.
- Real Estate Datasets: Data from websites such as Zillow, MLS, and more.
- Social Media Datasets: Data from Facebook, Instagram, YouTube, and Reddit.
- Financial Datasets: Data from Yahoo Finance, Market Watch, Investopedia, and more.
Create a free Bright Data account today and explore our datasets.
No credit card required