Handling missing values is a common task for data scientists building machine learning models. There are several ways to deal with them, and if you want to use an advanced technique, the mice library in R is a great option.
MICE stands for Multivariate Imputation by Chained Equations, and it works by creating multiple imputations (replacement values) for multivariate missing data. The MICE algorithm can be used with different data types such as continuous, binary, unordered categorical, and ordered categorical data.
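If mice is not already installed on your machine, you can install it from CRAN before loading it. The snippet below is only a minimal setup sketch; the package name is the standard CRAN name.

# Install mice from CRAN if it is not already available, then load it
if (!requireNamespace("mice", quietly = TRUE)) {
  install.packages("mice")
}
library(mice)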
In this guide, you will learn how to work with the mice library in R.
You will use a fictitious dataset of loan applicants containing 600 observations and eight variables, as described below:
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")
approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Investment: Total investment in stocks and mutual funds (in USD), as declared by the applicant
Purpose: Purpose of applying for the loan

The first step is to load the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(mice)
library(VIM)

dat <- read_csv("C:/Notes_Old/A_Resources/data_qna/Content writing/R guides/caret package/data_mice.csv")

glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Income <int> 3000, 3000, 3000, 3000, 8990, NA, NA, NA, NA, NA, N...
$ Loan_amount <dbl> 6000, NA, NA, NA, 8091, NA, NA, NA, NA, NA, NA, NA,...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", NA,...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"...
$ Age <int> 27, 29, 27, 33, 29, NA, 29, 27, 33, 29, NA, 29, 27,...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 121...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel"...
The output shows that the dataset has four numerical and four character variables. The code below converts the character variables into factors.
names <- c(1,4,5,8)
dat[,names] <- lapply(dat[,names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No...
$ Income <int> 3000, 3000, 3000, 3000, 8990, NA, NA, NA, NA, NA, N...
$ Loan_amount <dbl> 6000, NA, NA, NA, 8091, NA, NA, NA, NA, NA, NA, NA,...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, NA, NA, S...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, N...
$ Age <int> 27, 29, 27, 33, 29, NA, 29, 27, 33, 29, NA, 29, 27,...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 121...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, ...
The summary() function provides a quick overview of the variables and missing values, if any.
summary(dat)
Output:
Is_graduate Income Loan_amount Credit_score
 No :130 Min. : 3000 Min. : 6000 Not _satisfactory:123
 Yes:470 1st Qu.: 39045 1st Qu.:115665 Satisfactory :458
 Median : 50995 Median :135990 NA's : 19
 Mean : 65901 Mean :149313
 3rd Qu.: 76170 3rd Qu.:170740
 Max. :277770 Max. :466660
 NA's :20 NA's :17
 approval_status Age Investment Purpose
 No :190 Min. :22.00 Min. : 2100 Education: 94
 Yes:410 1st Qu.:35.00 1st Qu.: 16678 Home :132
 Median :50.00 Median : 26439 Others : 64
 Mean :48.82 Mean : 34442 Personal :174
 3rd Qu.:61.00 3rd Qu.: 35000 Travel :118
 Max. :76.00 Max. :190422 NA's : 18
 NA's :19
The output above shows that some of the variables have missing values, represented by NA's. To understand the pattern of missing values better, you can use the md.pattern() function.
md.pattern(dat)
Output:
    Is_graduate approval_status Investment Loan_amount Purpose
559           1               1          1           1       1
  4           1               1          1           1       1
  4           1               1          1           1       1
  3           1               1          1           1       1
 10           1               1          1           1       0
  3           1               1          1           1       0
  2           1               1          1           0       1
  2           1               1          1           0       1
  1           1               1          1           0       1
  1           1               1          1           0       1
  6           1               1          1           0       1
  4           1               1          1           0       0
  1           1               1          1           0       0
              0               0          0          17      18

    Credit_score Age Income
559            1   1      1  0
  4            1   0      1  1
  4            0   1      1  1
  3            0   1      0  2
 10            1   0      1  2
  3            1   0      0  3
  2            1   1      1  1
  2            1   1      0  2
  1            1   0      0  3
  1            0   1      1  2
  6            0   1      0  3
  4            0   1      0  4
  1            0   0      0  5
              19  19     20 93
The topmost row of the output indicates that there are 559 records with no missing values. The remaining rows show the different patterns of missingness, and the last row gives the total number of missing values in each variable; for example, the Income variable has twenty missing values overall.
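If you want a plain count of missing values per variable as a cross-check of the md.pattern() summary, base R is enough. This is a minimal sketch using the dat object created above.

# Count the missing values in each column of the data
colSums(is.na(dat))

# Equivalently, with sapply
sapply(dat, function(x) sum(is.na(x)))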
The missing value pattern can also be analyzed with the code below.
plot1 <- aggr(dat, col=c('blue','red'), numbers=TRUE, sortVars=TRUE, labels=names(dat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Output:
Variables sorted by number of missings:
        Variable      Count
          Income 0.03333333
    Credit_score 0.03166667
             Age 0.03166667
         Purpose 0.03000000
     Loan_amount 0.02833333
     Is_graduate 0.00000000
 approval_status 0.00000000
      Investment 0.00000000
The output above shows the proportion of missing values in each of the variables. Overall, around 93 percent of the observations have no missing values, which can also be seen in the right-hand panel of the plot produced by the aggr() function.
Since the number of missing values is not large, you could simply remove these observations. The objective here, however, is to use the mice library to treat the missing values.
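For comparison, dropping the incomplete rows would be a one-liner, at the cost of losing those observations. This is only a sketch with an illustrative object name; the rest of the guide keeps all 600 rows and imputes the missing values instead.

# Listwise deletion: keep only the rows with no missing values
dat_complete <- na.omit(dat)
nrow(dat_complete)   # 559 complete records, matching the md.pattern() output above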
The mice() function is used to impute missing values. Some of the important arguments used in the code are explained below.
data: A data frame or a matrix containing the incomplete data. Missing values are coded as NA.
m: The number of multiple imputations. The default value is five.
method: The imputation method to be used for each column in the data. In this case, you are using predictive mean matching ("pmm").
maxit: A scalar giving the number of iterations. The default value is five.

These arguments are passed to the imputation function below.
imputed_data <- mice(dat, m=5, maxit=50, meth='pmm', seed=500)
summary(imputed_data)
Output:
Class: mids
Number of multiple imputations: 5
Imputation methods:
    Is_graduate          Income     Loan_amount    Credit_score approval_status
             ""           "pmm"           "pmm"           "pmm"              ""
            Age      Investment         Purpose
          "pmm"              ""           "pmm"
PredictorMatrix:
                Is_graduate Income Loan_amount Credit_score approval_status Age
Is_graduate               0      1           1            1               1   1
Income                    1      0           1            1               1   1
Loan_amount               1      1           0            1               1   1
Credit_score              1      1           1            0               1   1
approval_status           1      1           1            1               0   1
Age                       1      1           1            1               1   0
                Investment Purpose
Is_graduate              1       1
Income                   1       1
Loan_amount              1       1
Credit_score             1       1
approval_status          1       1
Age                      1       1
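The PredictorMatrix in the output above shows which variables are used as predictors when imputing each incomplete variable (a 1 means the column variable is used to impute the row variable). If you ever need to change this, you can edit the matrix and pass it back to mice(). The sketch below is purely illustrative; it assumes, only as an example, that you want to stop Investment from being used as a predictor, and the object names are hypothetical.

# A dry run (maxit = 0) returns the default predictor matrix without imputing anything
init <- mice(dat, maxit = 0)
pred <- init$predictorMatrix

# Stop Investment from being used as a predictor for the other variables
pred[, "Investment"] <- 0

# Re-run the imputation with the customized predictor matrix
imputed_custom <- mice(dat, m = 5, maxit = 50, meth = 'pmm',
                       predictorMatrix = pred, seed = 500)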
If you want to look at the imputed values for a specific variable, for instance Purpose, you can do that with the code below.
imputed_data$imp$Purpose
Output:
            1         2        3        4         5
9      Travel Education   Travel Personal    Travel
10  Education      Home     Home     Home      Home
11       Home  Personal     Home     Home    Travel
12       Home      Home   Travel     Home      Home
13  Education Education   Travel   Travel      Home
588    Travel    Others   Travel   Travel  Personal
589    Travel    Travel Personal Personal  Personal
590    Travel    Travel   Travel   Travel  Personal
591    Travel  Personal   Travel   Travel    Others
592    Travel  Personal   Travel   Travel Education
593      Home Education Personal   Travel Education
594      Home      Home     Home     Home      Home
595  Personal Education   Travel   Travel Education
596    Travel    Travel   Travel   Travel      Home
597  Personal    Travel   Travel   Travel      Home
598      Home Education   Travel   Travel Education
599    Others  Personal Personal   Travel  Personal
600    Others    Travel   Travel   Travel    Travel
The above output shows that for the 18 missing values in the Purpose variable, there are five sets of imputations available.
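Before using the imputations, it is good practice to compare the distribution of the imputed values with that of the observed values. The mice package provides lattice-based diagnostic plots for this; the sketch below assumes the imputed_data object created above and uses Income only as an example.

# Density of observed versus imputed values for the imputed numeric variables
densityplot(imputed_data)

# Strip plot of observed versus imputed Income values across the five imputations
stripplot(imputed_data, Income ~ .imp, pch = 20)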
The next step is to fill in the missing values across the entire dataset with the code below. The missing values are replaced with the values from the first of the five imputed datasets, indicated by the value of one in the second argument.
completeddata1 <- complete(imputed_data, 1)
summary(completeddata1)
Output:
Is_graduate Income Loan_amount Credit_score
 No :130 Min. : 3000 Min. : 6000 Not _satisfactory:129
 Yes:470 1st Qu.: 38498 1st Qu.:112973 Satisfactory :471
 Median : 50835 Median :134385
 Mean : 65819 Mean :146552
 3rd Qu.: 76040 3rd Qu.:168715
 Max. :277770 Max. :466660
 approval_status Age Investment Purpose
 No :190 Min. :22.00 Min. : 2100 Education: 96
 Yes:410 1st Qu.:35.00 1st Qu.: 16678 Home :137
 Median :50.00 Median : 26439 Others : 66
 Mean :49.18 Mean : 34442 Personal :176
 3rd Qu.:61.25 3rd Qu.: 35000 Travel :125
 Max. :76.00 Max. :190422
The summary of the new data shows the absence of any missing values, indicating that the missing value imputation is complete. You can go ahead and use the new data for model building to check model performance on the imputed data.
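The complete() function is not limited to the first imputation. You can extract any of the five completed datasets, or stack all of them in long format, as in the sketch below (the object names are illustrative).

# The third completed dataset
completeddata3 <- complete(imputed_data, 3)

# All five completed datasets stacked in long format,
# with .imp and .id columns identifying the imputation number and the original row
all_imputations <- complete(imputed_data, action = "long")
dim(all_imputations)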
The lines of code below create a data partition, build a random forest model on the training set, and evaluate it on the test set.
# Create Data Partition
set.seed(100)
trainRowNumbers <- createDataPartition(completeddata1$approval_status, p=0.7, list=FALSE)
train <- completeddata1[trainRowNumbers,]
test <- completeddata1[-trainRowNumbers,]

# Build Random Forest Model
control1 <- trainControl(sampling="rose", method="repeatedcv", number=5, repeats=5)
rf_model <- train(approval_status ~ ., data=train, method="rf", metric="Accuracy", trControl=control1)

# Model Evaluation
predictTest <- predict(rf_model, newdata = test, type = "raw")
table(test$approval_status, predictTest)
Output:
      predictTest
       No Yes
  No   45  12
  Yes  11 112
The accuracy can be calculated from the above confusion matrix with the code below.
(112+45)/nrow(test)
Output:
[1] 0.8722222
The output shows that the accuracy on the test data is approximately 87 percent, which indicates good model performance.
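Note that building a model on a single completed dataset, as done above, does not reflect the uncertainty captured by the other four imputations. The mice package also supports the classical multiple-imputation workflow, in which the same model is fitted to each completed dataset and the estimates are pooled using Rubin's rules. The sketch below illustrates this with a simple logistic regression; the choice of predictors is only for illustration.

# Fit the same model to each of the five completed datasets
fit <- with(imputed_data, glm(approval_status ~ Income + Loan_amount + Age,
                              family = binomial))

# Pool the five sets of estimates using Rubin's rules
pooled <- pool(fit)
summary(pooled)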
In this guide, you learned about the mice library, one of the advanced packages in R for missing value imputation. You learned how to identify and visualize the patterns of missing values in data, and how to impute them with the mice library. This will help you in data preprocessing and preparation for machine learning.
To learn more about data science and machine learning with R, please refer to the following guides: