Data Cleaning : Data Science Skill Portfolio

Data cleansing is the process of altering data in a given storage resource to make sure that it is accurate and correct. There are many ways to pursue data cleansing in various software and data storage architectures; most of them center on the careful review of data sets and the protocols associated with any particular data storage technology.

Data cleansing is also known as data cleaning or data scrubbing.

Better data beats fancier algorithms

In fact, if you have a properly cleaned data set, even simple algorithms can learn impressive insights from the data!

Obviously, different types of data will require different types of cleaning. However, the systematic approach laid out in this lesson can always serve as a good starting point.

Data Cleaning Steps

Step 1 : Remove Unwanted Observations

The first step to data cleaning is removing unwanted observations from your dataset.

This includes duplicate or irrelevant observations.

Duplicate observations

Duplicate observations most frequently arise during data collection, such as when you:

Combine datasets from multiple places
Scrape data
Receive data from clients/other departments
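
As a minimal sketch of what this looks like in practice, the pandas snippet below combines two small tables and removes the repeated row. The column names and values are made up for illustration, and it assumes a duplicate means a row that matches an earlier row in every column.

```python
import pandas as pd

# Hypothetical listings collected from two places; listing 2 appears in both.
source_a = pd.DataFrame({"listing_id": [1, 2], "price": [250000, 310000]})
source_b = pd.DataFrame({"listing_id": [2, 3], "price": [310000, 195000]})

# Stack the two sources into one dataset.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Count rows that exactly repeat an earlier row, then drop them.
n_duplicates = int(combined.duplicated().sum())
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

If only some columns identify an observation, drop_duplicates also accepts a subset argument listing those columns.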

Irrelevant observations

Irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve.

For example, if you were building a model for Single-Family homes only, you wouldn’t want observations for Apartments in there.
This is also a great time to review your charts from Exploratory Analysis. You can look at the distribution charts for categorical features to see if there are any classes that shouldn’t be there.
Checking for irrelevant observations before engineering features can save you many headaches down the road.
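
Here is a small sketch of the Single-Family example, assuming a hypothetical property_type column. It looks at the class counts first, then keeps only the rows that fit the problem.

```python
import pandas as pd

# Hypothetical housing data; the column names and values are for illustration only.
homes = pd.DataFrame({
    "property_type": ["Single-Family", "Apartment", "Single-Family", "Apartment"],
    "price": [250000, 120000, 310000, 98000],
})

# Check which classes are present before deciding what to remove.
class_counts = homes["property_type"].value_counts()

# Keep only the observations relevant to a Single-Family model.
single_family = homes[homes["property_type"] == "Single-Family"].reset_index(drop=True)

print(class_counts)
print(single_family)
```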

Step 2 : Fix Structural Errors

Structural errors are those that arise during measurement, data transfer, or other types of “poor housekeeping.”

For instance, you can check for typos or inconsistent capitalization. This is mostly a concern for categorical features, and you can look at your bar plots to check.

Also check for mislabeled classes, i.e. separate classes that should really be the same.

For example:

If ‘N/A’ and ‘Not Applicable’ appear as two separate classes, you should combine them.
‘IT’ and ‘information_technology’ should be a single class.
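
One way to sketch these fixes in pandas is to normalize whitespace and capitalization first, then map labels that mean the same thing onto one class. The department column and the mapping below are assumptions for illustration; in a real dataset you would build the mapping by hand after looking at the classes.

```python
import pandas as pd

# Hypothetical categorical feature with typos, stray spaces and mislabeled classes.
df = pd.DataFrame({
    "department": ["IT", "information_technology", "it ", "N/A", "Not Applicable", "Sales"]
})

# Step 1: strip surrounding whitespace and use one capitalization everywhere.
normalized = df["department"].str.strip().str.lower()

# Step 2: merge labels that should really be the same class (hand-written mapping).
label_map = {
    "information_technology": "it",
    "not applicable": "n/a",
}
df["department_clean"] = normalized.replace(label_map)

print(df["department_clean"].value_counts())
```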

Step 3 : Filter Unwanted Outliers

Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models.

In general, if you have a legitimate reason to remove an outlier, removing it will help your model’s performance.

However, outliers are innocent until proven guilty. You should never remove an outlier just because it’s a “big number.” That big number could be very informative for your model.

We can’t stress this enough: you must have a good reason for removing an outlier, such as suspicious measurements that are unlikely to be real data.
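
The sketch below flags candidate outliers for review instead of deleting them. It uses the common 1.5 × IQR rule, which is an assumption on our part and not something this lesson prescribes, and the lot_size_sqft column is hypothetical. Whether a flagged value is actually removed should still depend on finding a good reason.

```python
import pandas as pd

# Hypothetical lot sizes; the last value looks suspicious but is NOT removed automatically.
lots = pd.DataFrame({"lot_size_sqft": [5000, 5200, 4800, 5100, 4900, 5300, 1_000_000]})

# Interquartile range of the feature.
q1 = lots["lot_size_sqft"].quantile(0.25)
q3 = lots["lot_size_sqft"].quantile(0.75)
iqr = q3 - q1

# Flag values far outside the middle of the distribution (1.5 * IQR rule of thumb).
lots["outlier_candidate"] = (
    (lots["lot_size_sqft"] < q1 - 1.5 * iqr) | (lots["lot_size_sqft"] > q3 + 1.5 * iqr)
)

# Rows to inspect by hand before deciding whether they are real data.
to_review = lots[lots["outlier_candidate"]]
print(to_review)
```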

Step 4 : Handle Missing Data

Missing data is a deceptively tricky issue in applied machine learning.

First, just to be clear, you cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.

“Common sense” is not sensible here

Unfortunately, in our experience, the two most commonly recommended ways of dealing with missing data actually suck.

They are:

Dropping observations that have missing values
Imputing the missing values based on other observations

Dropping missing values is sub-optimal because when you drop observations, you drop information.

The fact that the value was missing may be informative in itself.
Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!

Imputing missing values is sub-optimal because the value was originally missing but you filled it in, which always leads to a loss in information, no matter how sophisticated your imputation method is.

Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing.
Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the patterns already provided by other features.

Missing categorical data

The best way to handle missing data for categorical features is to simply label them as ‘Missing’!

You’re essentially adding a new class for the feature.
This tells the algorithm that the value was missing.
This also gets around the technical requirement for no missing values.
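
In pandas, this can be sketched with a single fillna call. The exterior column is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with two missing entries.
df = pd.DataFrame({"exterior": ["Brick", np.nan, "Wood", None]})

# Treat missingness as its own class instead of dropping or guessing.
df["exterior"] = df["exterior"].fillna("Missing")

print(df["exterior"].value_counts())
```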

Missing numeric data

For missing numeric data, you should flag and fill the values.

Flag the observation with an indicator variable of missingness.
Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.

By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
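
A minimal sketch of flag and fill, assuming a hypothetical garage_sqft feature. The order matters: create the indicator before filling, otherwise the information about which values were missing is already gone.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with missing values.
df = pd.DataFrame({"garage_sqft": [400.0, np.nan, 520.0, np.nan]})

# Flag: indicator variable that is 1 where the value was missing, 0 otherwise.
df["garage_sqft_missing"] = df["garage_sqft"].isna().astype(int)

# Fill: replace the missing values with 0 to satisfy the no-missing-values requirement.
df["garage_sqft"] = df["garage_sqft"].fillna(0)

print(df)
```

In a linear model, for example, the coefficient learned for garage_sqft_missing plays the role of that constant for missingness.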

Data Cleaning as Data Science Skill Portfolio

Data scientists can expect to spend up to 80% of their time on a new project cleaning data. This is a huge pain point for teams.

If you can show that you’re experienced at cleaning data, you’ll immediately be more valuable. To create a data cleaning project, find some messy data sets, and start cleaning.

If you’re working with Python, pandas is a great library to use, and if you’re working with R, you can use the dplyr package.

Make sure to showcase the following skills:

Importing data
Joining multiple datasets
Detecting missing values
Detecting anomalies
Imputing missing values
Data quality assurance
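
As a starting point for such a project, the sketch below touches the first three skills: importing data, joining datasets, and detecting missing values. The CSV contents are invented and read from in-memory strings so the example is self-contained; in a real project you would pass file paths to read_csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for two messy files on disk.
listings_csv = io.StringIO("listing_id,price\n1,250000\n2,\n3,195000\n")
agents_csv = io.StringIO("listing_id,agent\n1,Avery\n3,Blake\n")

# Importing data.
listings = pd.read_csv(listings_csv)
agents = pd.read_csv(agents_csv)

# Joining multiple datasets: a left join keeps every listing, matched or not.
merged = listings.merge(agents, on="listing_id", how="left")

# Detecting missing values: count them per column.
missing_per_column = merged.isna().sum()
print(missing_per_column)
```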