Impute missing data values in Python – 3 Easy Ways!
Why do we need to impute missing data values?
Before going ahead with imputation, let us understand what a missing value is.
A missing value is an entry in the dataset that is absent or null, typically because the data was not recorded during research or data collection.
Training a machine learning model on data with missing values is problematic for the following reasons:
- Reduces the efficiency of the ML model.
- Affects the overall distribution of data values.
- Leads to biased estimates from the ML model.
This is where imputation comes into the picture.
By imputation, we mean replacing the missing or null values in the dataset with a particular substitute value.
Imputation can be done using any of the techniques below:
- Impute by mean
- Impute by median
- KNN imputation
Let us now understand and implement each of the techniques in the upcoming section.
1. Impute missing data values by MEAN
The missing values can be imputed with the mean of that particular feature/data variable. That is, the null or missing values can be replaced by the mean of the data values of that particular data column.
Let us have a look at the below dataset which we will be using throughout the article.
As clearly seen, the above dataset contains NULL values. Let us now try to impute them with the mean of the feature.
Import the required libraries
First, let us load the necessary dataset into the working environment.
We have used the pandas.read_csv() function to load the dataset into the environment.
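The article's dataset file is not reproduced here, so the loading step can only be sketched. The example below assumes a tiny made-up CSV held in memory; only the column name custAge comes from the article, while the profession column and all the rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the article's CSV file. Only the column name
# 'custAge' comes from the article; the rows are made up for illustration.
csv_text = """custAge,profession
34,admin
,technician
45,
,admin
51,services
"""

# pandas.read_csv() accepts a file path or any file-like object.
# With a real file on disk, you would pass its path instead.
data = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV are read in as NaN (missing values).
print(data.head())
```

Passing a file-like object keeps the sketch self-contained; the call itself is the same one you would use with a path to your own file.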
Verify missing values in the dataset
Before we impute missing data values, it is necessary to check for and detect missing values using the isnull() function, as shown below:
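A minimal sketch of this check, assuming a small made-up DataFrame rather than the article's data (only the column name custAge is taken from the article):

```python
import numpy as np
import pandas as pd

# Made-up sample; 'profession' is a hypothetical column for illustration.
data = pd.DataFrame({
    "custAge": [34, np.nan, 45, np.nan, 51],
    "profession": ["admin", "technician", None, "admin", "services"],
})

# isnull() flags every missing cell as True; sum() then counts the
# True values column by column.
missing_counts = data.isnull().sum()
print(missing_counts)
```

On the full dataset, the same two calls produce one count per column, which is how the per-feature totals are obtained.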
After executing the above line of code, we get the following count of missing values as output:
As clearly seen, the data variable ‘custAge’ contains 1804 missing values out of 7414 records.
Use the mean() method on all the null values
Further, we have used the mean() function to impute all the null values with the mean of the column ‘custAge’.
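The article's own code for this step is not shown, so the following is one possible way to do it. It assumes that the column mean is passed to pandas' fillna(), and it uses a made-up sample in place of the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column name 'custAge' comes from the article.
data = pd.DataFrame({"custAge": [30, np.nan, 40, np.nan, 50]})

# mean() skips NaN by default, so this is the mean of the observed values.
mean_age = data["custAge"].mean()

# fillna() replaces every missing entry with the given value
# (assumption: the article may have applied the mean differently).
data["custAge"] = data["custAge"].fillna(mean_age)
print(data)
```

Here the observed ages are 30, 40 and 50, so both gaps are filled with 40.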
Verify the changes
After performing the imputation with mean, let us check whether all the values have been imputed or not.
As seen below, all the missing values have been imputed and thus, we see no more missing values present.
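This verification can be sketched by counting the missing values before and after the imputation. The sample below is made up, and the fillna() call is an assumption about how the mean was applied.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column name 'custAge' comes from the article.
data = pd.DataFrame({"custAge": [30, np.nan, 40, np.nan, 50]})

# Count missing values, impute with the mean, then count again.
before = data["custAge"].isnull().sum()
data["custAge"] = data["custAge"].fillna(data["custAge"].mean())
after = data["custAge"].isnull().sum()

print(f"missing before: {before}, missing after: {after}")
```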
2. Imputation with median
In this technique, we impute the missing values with the median of the data values or the data set.
Let us understand this with the below example.
Here, we have imputed the missing values with the median of the feature.
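As a rough sketch of median imputation, assuming pandas' median() combined with fillna() (the article's exact call is not shown) and a made-up sample that contains one extreme value:

```python
import numpy as np
import pandas as pd

# Made-up sample with one extreme value (100) to contrast mean and median.
data = pd.DataFrame({"custAge": [20, np.nan, 30, np.nan, 100]})

# median() ignores NaN; for the observed values 20, 30, 100 it is 30,
# whereas the mean would be pulled up to 50 by the extreme value.
median_age = data["custAge"].median()

data["custAge"] = data["custAge"].fillna(median_age)
print(data)
```

The median is less sensitive to extreme values than the mean, which is the usual reason to prefer it for skewed features.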
3. KNN Imputation
In this technique, the missing values are imputed using the KNN (K-nearest-neighbour) algorithm.
Each missing value is replaced by an estimate computed from the nearest neighbouring records.
Let us understand the implementation using the below example:
Here is the count of missing values:
In the below piece of code, we have converted the data types of the data variables to object type with categorical codes assigned to them.
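The original code for this conversion is not shown, so the sketch below is only one way it might look. It assumes pandas' category dtype is used to obtain integer codes; the profession column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample; 'profession' is a hypothetical text column.
data = pd.DataFrame({
    "custAge": [30, np.nan, 40, np.nan, 50],
    "profession": ["admin", "technician", None, "admin", "services"],
})

# Hypothetical list of the text columns to encode.
categorical_cols = ["profession"]

for col in categorical_cols:
    # Each distinct label gets an integer code; pandas uses -1 for missing.
    codes = data[col].astype("category").cat.codes
    # Turn the -1 marker back into NaN so the gap can still be imputed later.
    data[col] = codes.astype("float64").where(codes >= 0)

print(data)
```

Keeping the gaps as NaN instead of -1 matters here, because otherwise a later imputation step would treat -1 as a real category.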
The missing values are then imputed with the nearest neighbour possible.
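The article's KNN code is not reproduced, so the sketch below swaps in scikit-learn's KNNImputer as a stand-in; the original may have used a different library. The tenure column, the sample rows and the choice of n_neighbors=2 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric sample; 'tenure' is a hypothetical second feature that
# the imputer uses to measure how close two records are.
data = pd.DataFrame({
    "custAge": [30.0, np.nan, 40.0, 50.0],
    "tenure": [1.0, 2.0, 3.0, 9.0],
})

# Each missing value is replaced by the average of that feature over the
# n_neighbors closest rows, with distance computed on the observed features.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

print(imputed)
```

In this sample, the record with the missing age has tenure 2, so its two nearest neighbours are the records with tenure 1 and 3, and the gap is filled with the average of their ages.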
Output of imputation:
With this, we have come to the end of this topic. In this article, we have implemented 3 different techniques of imputation.
Feel free to comment below in case you come across any questions.