# Impute missing data values in Python – 3 Easy Ways!

## Why do we need to impute missing data values?

Before going ahead with imputation, let us understand what a missing value is.

A missing value is an entry in the dataset that is absent or null, typically because the value was not recorded during research or data collection.

Missing values in the data used for a machine learning model are a problem for the following reasons:

- They **reduce the efficiency** of the ML model.
- They **affect the overall distribution** of data values.
- They lead to a **biased effect** in the estimation of the ML model.

This is where imputation comes into the picture.

Imputation means replacing the missing or null values in the dataset with a substitute value.

Imputation can be done using any of the techniques below:

- **Impute by mean**
- **Impute by median**
- **KNN imputation**

Let us now understand and implement each of the techniques in the upcoming section.

## 1. Impute missing data values by MEAN

The missing values can be imputed with the mean of that particular feature or data variable. That is, the null or missing values are replaced by the mean of the observed data values in that column of the dataset.

**Let us have a look at the dataset below, which we will be using throughout the article.**

The dataset contains NULL values. Let us now try to impute them with the mean of the feature.

### Import the required libraries

First, let us load the necessary libraries into the working environment.

```python
# Load libraries
import os
import pandas as pd
import numpy as np
```

We use the `pandas.read_csv()` function to load the dataset into the environment.

```python
marketing_train = pd.read_csv("C:/marketing_tr.csv")
```

### Verify missing values in the dataset

Before imputing the missing data values, we need to check for and detect their presence using the `isnull()` function, as shown below:

```python
marketing_train.isnull().sum()
```

After executing the above line of code, we get the following count of missing values as output:

```
custAge       1804
profession       0
marital          0
responded        0
dtype: int64
```

As clearly seen, the data variable ‘custAge’ contains 1804 missing values out of 7414 records.
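The same check can be tried on a small frame without the CSV file. The sketch below uses a hypothetical toy DataFrame (the column names mirror the article's dataset, but the values are invented) and also computes the share of missing values per column, which is often easier to read than a raw count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "custAge": [34.0, np.nan, 51.0, np.nan],
    "profession": ["admin", "services", "admin", "technician"],
})

missing_count = toy.isnull().sum()   # number of nulls per column
missing_share = toy.isnull().mean()  # fraction of nulls per column

print(missing_count)
print(missing_share)
```

Applied to the figures above, 1804 missing values out of 7414 records is roughly 24% of the ‘custAge’ column.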

### Use the mean() method on all the null values

Next, we use the `mean()` function to impute all the null values with the mean of the column ‘custAge’.

```python
missing_col = ['custAge']

# Technique 1: Using mean to impute the missing values
for i in missing_col:
    marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].mean()
```
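The `.loc` assignment above can also be written with `fillna()`, which some readers may find easier to follow. This is a minimal sketch on a hypothetical five-row column, not the article's dataset; it relies on `mean()` skipping NaN values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; values are invented for illustration only.
toy = pd.DataFrame({"custAge": [30.0, np.nan, 50.0, np.nan, 40.0]})

# mean() ignores NaN by default: (30 + 50 + 40) / 3 = 40.0
col_mean = toy["custAge"].mean()

# Replace every NaN in the column with that mean
toy["custAge"] = toy["custAge"].fillna(col_mean)

print(toy["custAge"].tolist())  # [30.0, 40.0, 50.0, 40.0, 40.0]
```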

### Verify the changes

After performing the imputation with mean, let us check whether all the values have been imputed or not.

```python
marketing_train.isnull().sum()
```

As seen below, all the missing values have been imputed and thus, we see no more missing values present.

```
custAge       0
profession    0
marital       0
responded     0
dtype: int64
```

## 2. Imputation with median

In this technique, we impute the missing values with the median of the observed data values in that column.

Let us understand this with the below example.

**Example:**

```python
# Load libraries
import os
import pandas as pd
import numpy as np

marketing_train = pd.read_csv("C:/marketing_tr.csv")
print("count of NULL values before imputation\n")
print(marketing_train.isnull().sum())  # print() so the counts also show when run as a script

missing_col = ['custAge']

# Technique 2: Using median to impute the missing values
for i in missing_col:
    marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].median()

print("count of NULL values after imputation\n")
print(marketing_train.isnull().sum())
```

Here, we have imputed the missing values with the median using the `median()` function.

**Output:**

```
count of NULL values before imputation

custAge       1804
profession       0
marital          0
responded        0
dtype: int64
count of NULL values after imputation

custAge       0
profession    0
marital       0
responded     0
dtype: int64
```
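The article does not compare the two statistics, but one commonly cited difference is that the median is less sensitive to extreme values than the mean. The sketch below illustrates this on a hypothetical series with one outlier; the numbers are invented and do not come from the marketing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 95 acts as an outlier, NaN is the value to impute.
ages = pd.Series([25.0, 30.0, 35.0, np.nan, 95.0])

mean_value = ages.mean()      # (25 + 30 + 35 + 95) / 4 = 46.25
median_value = ages.median()  # (30 + 35) / 2 = 32.5

filled_with_mean = ages.fillna(mean_value)
filled_with_median = ages.fillna(median_value)

print(filled_with_mean[3], filled_with_median[3])  # 46.25 32.5
```

With the outlier present, the mean-imputed value is pulled well above most of the observed ages, while the median-imputed value stays close to them.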

## 3. KNN Imputation

In this technique, the missing values are imputed using the KNN algorithm, i.e. the **K-nearest-neighbour algorithm**.

Each missing value is replaced by an estimate computed from its nearest neighbours, that is, the records most similar to it on the other variables.
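To make the idea concrete, here is a simplified NumPy sketch of nearest-neighbour imputation for a single column. It is an illustration only, not the routine used later in this article: the helper `knn_impute_column` is hypothetical, it takes an unweighted mean of the `k` closest rows, and it assumes the other columns have no missing values.

```python
import numpy as np

def knn_impute_column(X, target_col, k=3):
    """Fill NaNs in one column with the mean of that column over the k rows
    closest on the remaining columns (Euclidean distance).
    Assumes the remaining columns contain no missing values."""
    X = np.asarray(X, dtype=float).copy()
    other_cols = [c for c in range(X.shape[1]) if c != target_col]
    missing = np.isnan(X[:, target_col])
    donors = X[~missing]  # rows where the target column is observed
    for row in np.where(missing)[0]:
        # Distance from the incomplete row to every donor row
        dists = np.linalg.norm(donors[:, other_cols] - X[row, other_cols], axis=1)
        nearest = np.argsort(dists)[:k]
        X[row, target_col] = donors[nearest, target_col].mean()
    return X

# Hypothetical data: column 0 is a fully observed feature, column 1 has one gap.
data = np.array([
    [1.0, 20.0],
    [2.0, 22.0],
    [3.0, 24.0],
    [10.0, 60.0],
    [2.5, np.nan],
])

filled = knn_impute_column(data, target_col=1, k=3)
print(filled[4, 1])  # 22.0, the mean of 20, 22 and 24
```

The distant row (feature value 10) does not influence the result, which is the main difference from mean imputation: only similar records contribute to the estimate.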

Let us understand the implementation using the below example:

**KNN Imputation:**

```python
# Load libraries
import os
import pandas as pd
import numpy as np

marketing_train = pd.read_csv("C:/marketing_tr.csv")
print("count of NULL values before imputation\n")
marketing_train.isnull().sum()
```

Here is the count of missing values:

```
count of NULL values before imputation

custAge       1804
profession       0
marital          0
responded        0
dtype: int64
```

Before the imputation step, the data types of the data variables are converted to object type with categorical codes assigned to them. The missing values are then imputed from the nearest neighbours.
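The conversion step is described but not included in the listing that follows. A minimal sketch of one way to do it is shown here on a hypothetical toy frame; unlike the description above, it leaves the columns as integer codes rather than casting them back to object type.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "custAge": [34.0, None, 51.0],
    "profession": ["admin", "services", "admin"],
    "marital": ["married", "single", "single"],
})

# Replace every non-numeric column with integer category codes
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = pd.Categorical(toy[col]).codes

print(toy)
```

Categories are numbered in sorted order, so here "admin" becomes 0 and "services" becomes 1. The numeric ‘custAge’ column, including its missing value, is left untouched for the imputer to fill.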

```python
# Apply KNN imputation algorithm
from fancyimpute import KNN  # assumed source of KNN; the import is missing from the original listing
marketing_train = pd.DataFrame(KNN(k=3).fit_transform(marketing_train), columns=marketing_train.columns)
```

**Output of imputation**:

```
Imputing row 1/7414 with 0 missing, elapsed time: 13.293
Imputing row 101/7414 with 1 missing, elapsed time: 13.311
Imputing row 201/7414 with 0 missing, elapsed time: 13.319
Imputing row 301/7414 with 0 missing, elapsed time: 13.319
Imputing row 401/7414 with 0 missing, elapsed time: 13.329
.
.
.
.
.
Imputing row 7101/7414 with 1 missing, elapsed time: 13.610
Imputing row 7201/7414 with 0 missing, elapsed time: 13.610
Imputing row 7301/7414 with 0 missing, elapsed time: 13.618
Imputing row 7401/7414 with 0 missing, elapsed time: 13.618
```

```python
print("count of NULL values after imputation\n")
marketing_train.isnull().sum()
```

**Output:**

```
count of NULL values after imputation

custAge       0
profession    0
marital       0
responded     0
dtype: int64
```

## Conclusion

With this, we have come to the end of this topic. In this article, we have implemented 3 different techniques of imputation.

Feel free to comment below in case you come across any question.