Before data can be used, it needs to be cleaned. This arduous process includes everything from removing duplicate values to handling missing values and outliers. The more thorough your cleaning stage, the easier it will be to derive value from your datasets in the data manipulation, algorithm training, and modeling stages.
In fact, the bulk of any data science-based project first requires effective data cleaning and manipulation.
What is Data Manipulation/Cleaning?
As a data scientist, upon beginning a project, you will need to start by gathering a variety of data sets, either by extracting them yourself from external websites, or by receiving them from different internal sources, depending on your requirements.
Not all of the data you acquire will be relevant to your cause. In order to separate the relevant data from the irrelevant, you will need to cleanse the collected data sets. In other words, you may need to remove or modify columns, remove duplicate values, deal with missing values and outlier data, and so on. You may also need to normalize and scale your data in order for it to fit within a certain range.
Data cleaning also includes visualizing the data through graphs and statistical functions in order to understand its underlying characteristics, such as the mean, median, range, and distribution.
Why is Data Manipulation/Cleansing Important to Data Scientists?
Before any data scientist can focus on modeling, they will need to master data cleaning. How effectively you clean your data will determine how complicated your modeling needs to be. The more organized your data sets are in the cleansing stage, the simpler your learning algorithms can be in the modeling stage. The structure of your data will also have a direct impact on the precision of your projections.
In short, data cleansing is just as crucial as building the algorithms themselves. Once you have mastered data cleansing you can expect:
- Lower processing times
- More precise projections
- Simplified algorithm functionality
- More effective model learning
Python is becoming the favored coding language in data science for many reasons. For one, it provides a variety of computation libraries that can be used for data science projects, including data manipulation and cleansing. In this article, we will be using the Pandas Python library.
6 Steps to Manipulate and Cleanse Data with Python:
#1: Imputing missing values – Fill in gaps using a standard statistical constant (mean, median, or mode) or KNN imputation.
#2: Carrying out outlier/anomaly detection – You can accomplish this using Isolation Forest, One-Class SVM, and/or Local Outlier Factor outlier detection algorithms.
#3: Utilizing cleaning techniques for the X-variables – Here you apply custom functions, remove duplicates, and replace key values.
#4: Using cleaning techniques for the Y-variable – Here it is important to do label encoding, one-hot encoding, or dictionary mapping.
#5: Merging DataFrames – This step includes concatenating, merging, and joining.
#6: Parsing dates – Here you use auto-format detection to convert strings into DateTime objects, and then convert those DateTime objects into numbers.
Let’s go into detail for each step:
ONE: Imputing Missing Values
One of the most common issues you may come across in raw extracted data sets is missing values. As long as there are not too many of them, they can easily be imputed at this stage.
- Simple imputation methods fill in missing values (NaN) with a statistical measure of each column. The strategy parameter can be set to ‘mean’, ‘median’, ‘most_frequent’ (the mode), or ‘constant’ (a manual value).
- KNN imputation is a more complex method for imputing missing values. The KNN algorithm finds data points similar to those with missing values and uses them to fill in the gaps.
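As a sketch of simple imputation, here is scikit-learn's SimpleImputer applied to a small made-up DataFrame (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values (NaN)
data = pd.DataFrame({"age": [25, np.nan, 31, 40],
                     "income": [50000, 62000, np.nan, 58000]})

# Fill NaNs with each column's mean; strategy can also be
# 'median', 'most_frequent', or 'constant' (with fill_value=...)
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```

After fitting, every NaN is replaced by its column's mean, so `filled` contains no missing values.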
It is important to note that in order to use KNN imputation, the data needs to be normalized to remove differences in scale. To use KNN imputation you will need to:
- Normalize the data
- KNN impute to fill in missing values
- Inverse scale/normalize the data again
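The three steps above can be sketched with scikit-learn's MinMaxScaler and KNNImputer (the small height/weight DataFrame is an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

data = pd.DataFrame({"height": [150.0, 160.0, np.nan, 180.0],
                     "weight": [50.0, 60.0, 65.0, 80.0]})

# 1. Normalize so distances are not dominated by large-scale columns
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)  # NaNs pass through the scaler

# 2. KNN impute: fill each gap from the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(scaled)

# 3. Inverse-transform back to the original scale
restored = pd.DataFrame(scaler.inverse_transform(imputed),
                        columns=data.columns)
```

The values that were present originally come back unchanged; only the gaps are filled.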
TWO: Outlier and Anomaly Detection
- Isolation Forest is an algorithm that returns an anomaly score for each sample. The algorithm selects a feature and isolates observations by randomly choosing a split value. The length of the path needed to isolate a sample represents its normality: anomalies produce noticeably shorter paths. A ‘forest’ of such random ‘trees’ averages these path lengths to reveal the anomalies.
- One-Class SVM is another method for finding outliers. It is suited to cases where Isolation Forest cannot be applied because its variance would be excessive.
- Local Outlier Factor is a third method used to detect anomalies. It measures the local density of each sample in comparison to that of its neighbors. Samples with substantially lower density than their neighbors are likely to be outliers. This algorithm is distance-based, so you will need to normalize the data before using it. It is a higher-variance alternative to Isolation Forest.
It is important when using any of these three methods to be sure that the anomalies are not simply data clusters. You can use PCA visualization to double-check.
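A minimal sketch of anomaly detection with scikit-learn's IsolationForest; the synthetic data and the contamination rate are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 100 ordinary points near the origin plus one obvious outlier
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X = np.vstack([normal, [[10.0, 10.0]]])

# contamination = expected share of outliers (a guess, tune per dataset)
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
```

The extreme point at (10, 10) is isolated in very few splits, so it receives the label -1.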
THREE: X-Variable Cleaning Methods
- Applying custom functions is necessary when cleaning cannot be done via the built-in functions. In this case you may need to write your own functions, but try an existing built-in function first.
- Removing duplicates is an important part of data cleansing. This can be done with data.drop_duplicates(), which removes rows of identical value. You must be careful to check that the duplicate rows are not errors, especially in smaller datasets.
- Sampling data points is important for large datasets. This allows you to sample random data points and can be done with data.sample(number_of_samples).
- Renaming columns is done with data.rename(), where each key in the mapping is the original column name and each value is the new name.
- Replacing values can be done with data.replace(), which takes a value in the DataFrame and the value to substitute for it. This is useful for converting sentinel values into NaN so that imputing algorithms can work effectively.
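The X-variable techniques above can be sketched on a small made-up DataFrame (the city/temperature columns and the "Unknown" sentinel are assumptions for illustration):

```python
import pandas as pd

data = pd.DataFrame({"City": ["NYC", "NYC", "LA", "Unknown"],
                     "Temp": [70, 70, 75, 80]})

# Remove rows that are exact duplicates
data = data.drop_duplicates()

# Replace a sentinel value with NA so imputers can treat it as missing
data = data.replace("Unknown", pd.NA)

# Rename a column: {original_name: new_name}
data = data.rename(columns={"Temp": "temperature_f"})

# Apply a custom function when no built-in fits
data["temperature_c"] = data["temperature_f"].apply(lambda f: (f - 32) * 5 / 9)

# Random sample of data points (useful on large datasets)
subset = data.sample(2, random_state=0)
```

Each call returns a new DataFrame, so the steps chain naturally.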
FOUR: Y-Variable Cleaning Methods
- Label Encoding is necessary for categorical y-variables. If your data has two classes, they need to be converted into 0 and 1, because machine learning algorithms can only operate on numbers. You can do this with the .map() function, which takes a dictionary mapping the original class names to numbers and replaces each value accordingly. If there are too many classes to map manually, you can use sklearn’s automated LabelEncoder. This method is beneficial because the data can easily be reverted to the original format with encoder.inverse_transform(array).
- One-Hot Encoding may be preferred when you have many classes and do not want to impose a quantitative ordering on the data. With one-hot encoding, every y-value becomes a vector whose length equals the number of classes, with a ‘1’ at the index of that sample’s class and ‘0’s everywhere else. Pandas has a built-in function, get_dummies, that automatically outputs the one-hot encoded DataFrame.
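Both y-variable encodings can be sketched as follows (the animal labels are a made-up example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(["cat", "dog", "dog", "bird"])

# Manual mapping with .map() when there are few classes
mapped = y.map({"bird": 0, "cat": 1, "dog": 2})

# sklearn's LabelEncoder for many classes; reversible
encoder = LabelEncoder()
encoded = encoder.fit_transform(y)              # classes sorted alphabetically
original = encoder.inverse_transform(encoded)   # back to the string labels

# One-hot encoding: one column per class, '1' marks the sample's class
one_hot = pd.get_dummies(y)
```

Note that LabelEncoder assigns codes in alphabetical order of the class names, which is why inverse_transform can recover the originals exactly.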
FIVE: Merging DataFrames
- Concatenation is the top-down method of stacking DataFrames.
- Merging is the left-to-right process of combining two DataFrames on a common key.
- Joining covers the other types of merging. An inner merge only combines rows where a key exists in both DataFrames; a left outer join keeps all keys in the left DataFrame, while rows from the right DataFrame are included only if their keys exist in the left one.
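The three operations can be sketched with pandas on two small hypothetical frames sharing a "key" column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Concatenation: stack the frames top-down
stacked = pd.concat([left, right], ignore_index=True)

# Inner merge: keep only keys present in both frames (b and c)
inner = left.merge(right, on="key", how="inner")

# Left outer join: keep every key from the left frame;
# 'a' gets NaN for y, and 'd' from the right frame is dropped
left_join = left.merge(right, on="key", how="left")
```

The `how` parameter also accepts "right" and "outer" for the remaining join types.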
SIX: Parsing Dates
- Dates can be very difficult data to work with, but they are also some of the most important, which is why it is crucial to understand how to handle them correctly.
- Auto-format string-to-datetime conversion is a crucial skill, as datasets rarely arrive with ready-made datetime objects. You can use dateutil to automatically detect the positions of the day, month, and year.
- Converting dates to numbers is necessary for models to understand the concept of time. Each datetime object is converted to a number, for example the number of days elapsed since the earliest date in your dataset. The conversion function is applied to the date column using .apply().
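Both steps can be sketched with dateutil and pandas; the date strings below are made up to show mixed formats:

```python
import pandas as pd
from dateutil import parser

# dateutil auto-detects the positions of day, month, and year
dt = parser.parse("March 5, 2021")  # -> datetime(2021, 3, 5, 0, 0)

# Parse a column of mixed-format date strings
dates = pd.Series(["2021-03-05", "03/07/2021", "10 Mar 2021"])
parsed = dates.apply(parser.parse)

# Convert each date to days elapsed since the earliest date,
# giving the model a plain number to work with
days = (parsed - parsed.min()).dt.days
```

Note that for ambiguous strings like "03/07/2021", dateutil defaults to month-first parsing; pass dayfirst=True to parser.parse if your data uses day-first dates.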
Data manipulation and cleansing are crucial steps for all data scientists to take before they can begin any Machine Learning project. This article provides a step-by-step guide to reduce confusion and save time. Python’s libraries allow you to manipulate your data now so you can yield more accurate results further down the line.