Working with data is a core requirement for data science professionals. The foundation for working with data is understanding the most common data types and knowing how to process, query, and convert them. In this guide, you will learn techniques for querying and converting data types in R.
There are several data types in R. The most important ones (character, numeric, integer, and logical) are covered below, followed by the data frame.
The data type of a variable can be checked with the class() or typeof() function.
t = "pluralsight"
class(t)
typeof(t)
Output:
[1] "character"

[1] "character"
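For character data the two functions agree, but they can differ for other types: class() reports the object's high-level class, while typeof() reports its internal storage type. The following is a minimal sketch of that difference; the variable names and values are illustrative assumptions, not part of the guide's data.

```r
# Illustrative values; the variable names are hypothetical
num_val <- 3.5   # a double-precision number
int_val <- 3L    # the L suffix creates an integer

class(num_val)   # "numeric"
typeof(num_val)  # "double"
class(int_val)   # "integer"
typeof(int_val)  # "integer"
```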
N = 3.5
class(N)
Output:
[1] "numeric"
The variable N is stored as a numeric value, and not an integer. This can be checked using the is.integer() function.
is.integer(N)
Output:
[1] FALSE
A numeric value can be converted to an integer with the as.integer() function. Also, all integers are numeric, but the reverse is not true.
i = as.integer(3.1)
print(i)
Output:
[1] 3
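Note that as.integer() drops the fractional part rather than rounding. The sketch below illustrates this with a few arbitrary example values (chosen here for illustration only), and also shows the L suffix as another way to create an integer.

```r
# as.integer() truncates toward zero; it does not round
as.integer(3.9)    # 3
as.integer(-3.9)   # -3
round(3.9)         # 4, for comparison (the result is still numeric)

# An integer can also be written directly with the L suffix
j <- 5L
is.integer(j)      # TRUE
is.numeric(j)      # TRUE: integers also count as numeric
is.integer(5)      # FALSE: a plain 5 is stored as a double
```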
x = 100
y = 56
x < y
Output:
[1] FALSE
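A comparison like the one above produces a logical value. Comparisons are vectorized, and in arithmetic TRUE counts as 1 and FALSE as 0, which makes logical vectors handy for counting and filtering. A small sketch, using made-up scores as an assumption:

```r
# Hypothetical values used only for illustration
scores <- c(100, 56, 72, 38)

above_50 <- scores > 50   # element-wise comparison
class(above_50)           # "logical"
sum(above_50)             # 3: number of TRUE values
mean(above_50)            # 0.75: proportion of TRUE values
scores[above_50]          # 100 56 72: keep elements where the test is TRUE
```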
The most common data types are discussed above, but the most important data type is a data frame.
A data frame is the de facto data type for most data science projects because it is organized in a tabular format. In simple terms, a data frame is a special type of list in which all the elements are of equal length.
Data frames are normally created by the read_csv() and read.table() functions when importing data into R. You can also create a new data frame with the data.frame() function.
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)
df
Output:
   rollnum h1 h2
1        1 15 81
2        2 16 82
3        3 17 83
4        4 18 84
5        5 19 85
6        6 20 86
7        7 21 87
8        8 22 88
9        9 23 89
10      10 24 90
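The description of a data frame as a list of equal-length elements can be checked directly. The sketch below rebuilds the same df and inspects it with base R functions.

```r
# Rebuild the example data frame so this block runs on its own
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)

is.list(df)         # TRUE: a data frame is a list of columns
length(df)          # 3: one list element per column
sapply(df, length)  # every column has 10 elements
sapply(df, class)   # every column here is "integer"
dim(df)             # 10 3: rows, then columns
```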
The most common way of working with a data frame is to import a flat file, such as a CSV or Excel file, into the R environment. The code below performs this task and loads the data that will be used in the subsequent sections.
library(readr)
library(dplyr)  # provides glimpse()
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 585
Variables: 6
$ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose         <chr> "Business", "Personal", "Travel", "Personal", "Persona...
The output shows there are 585 observations of 6 variables, described below.
UID: Unique identifier tag of the loan applicant.
Income: Annual income of the applicant (in US dollars).
Credit_score: Whether the applicant's credit score was satisfactory or not.
approval_status: Whether the loan application was approved ("1") or not ("0").
Age: The applicant's age in years.
Purpose: The reason for the loan application.
For data science and machine learning, it's important for the variables to be in the right data type. To begin, you will use the str() function, which prints the structure of the data.
str(dat)
Output:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 585 obs. of 6 variables:
 $ UID            : chr "UIDA467" "UIDA402" "UIDA354" "UIDA209" ...
 $ Income         : num 36850 45470 53240 198400 83410 ...
 $ Credit_score   : Factor w/ 2 levels "Not _satisfactory",..: 2 2 2 2 1 2 2 2 2 2 ...
 $ approval_status: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age            : int -12 -10 -3 23 23 23 23 23 23 24 ...
 $ Purpose        : Factor w/ 6 levels "Business","Education",..: 1 4 5 4 4 4 4 4 5 4
From the output above, you can see that the data has six variables, three numerical and three categorical. You will start by understanding the levels of character variables.
table(dat$Credit_score)
Output:
Not _satisfactory      Satisfactory
              124               461
The variable Credit_score has only two levels, so it can be converted to a factor variable with the as.factor() function.
dat$Credit_score = as.factor(dat$Credit_score)
class(dat$Credit_score)
Output:
[1] "factor"
Next, inspect the number of levels for the variable Purpose.
table(dat$Purpose)
Output:
 Business Education Furniture  Personal    Travel   Wedding
       43       184        37       161       122        38
There are six levels in the variable Purpose, which can be converted to the factor data type with the code below.
dat$Purpose = as.factor(dat$Purpose)
class(dat$Purpose)
Output:
[1] "factor"
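Once a variable is a factor, its levels can be inspected directly. The sketch below uses a short made-up character vector (an assumption, standing in for the Purpose column) to show how as.factor() assigns levels and the integer codes stored behind them.

```r
# Hypothetical stand-in for a categorical column
purpose <- c("Travel", "Business", "Travel", "Personal")
purpose_f <- as.factor(purpose)

levels(purpose_f)      # "Business" "Personal" "Travel": sorted alphabetically
nlevels(purpose_f)     # 3
as.integer(purpose_f)  # 3 1 3 2: position of each value in levels()
```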
The last conversion to make is for the variable approval_status. Start by examining the class of the variable.
class(dat$approval_status)

table(dat$approval_status)
Output:
[1] "integer"

  0   1
186 399
The class of the variable approval_status is shown as integer, but it takes only two values, zero and one. In fact, this is a categorical variable and needs to be converted to a factor.
dat$approval_status = as.factor(dat$approval_status)
class(dat$approval_status)
Output:
[1] "factor"
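One thing to keep in mind after converting a 0/1 variable to a factor: calling as.integer() on the factor returns the internal level codes, not the original numbers. If the numeric values are needed again, a common approach is to convert to character first. A minimal sketch with made-up values (an assumption, not the guide's data):

```r
# Hypothetical 0/1 values standing in for approval_status
status <- as.factor(c(1, 0, 1, 1))

levels(status)                    # "0" "1"
as.integer(status)                # 2 1 2 2: level codes, not the original values
as.integer(as.character(status))  # 1 0 1 1: original values recovered
```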
The required conversions have been made, and this can be verified with the code below.
glimpse(dat)
Output:
Observations: 585
Variables: 6
$ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose         <fct> Business, Personal, Travel, Personal, Personal, Person...
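The three conversions above were done one column at a time. When several columns need the same conversion, one possible alternative is to apply as.factor() to all of them with lapply(). The sketch below uses a small toy data frame whose rows are invented for illustration; only the column names follow the guide's data.

```r
# Toy data frame; the rows are hypothetical, the column names mirror the guide
toy <- data.frame(
  Credit_score    = c("Satisfactory", "Not _satisfactory", "Satisfactory"),
  approval_status = c(1L, 0L, 1L),
  Purpose         = c("Travel", "Business", "Personal"),
  Income          = c(36850.4, 45470.2, 53240.2),
  stringsAsFactors = FALSE
)

# Convert all categorical columns in one step
cat_cols <- c("Credit_score", "approval_status", "Purpose")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

sapply(toy, class)  # the three listed columns are "factor"; Income stays "numeric"
```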
You have inspected and converted the variables in the section above, and will now learn how to query some of the numerical variables. The summary() function provides key statistics about the variables.
summary(dat)
Output:
     UID                Income                 Credit_score  approval_status
 Length:585         Min.   :  3000   Not _satisfactory:124   0:186
 Class :character   1st Qu.: 38890   Satisfactory     :461   1:399
 Mode  :character   Median : 51440
                    Mean   : 71655
                    3rd Qu.: 77570
                    Max.   :844490

      Age              Purpose
 Min.   :-12.00   Business : 43
 1st Qu.: 37.00   Education:184
 Median : 51.00   Furniture: 37
 Mean   : 49.39   Personal :161
 3rd Qu.: 61.00   Travel   :122
 Max.   : 76.00   Wedding  : 38
From the output above, you can see that the variable Age has negative values. This is incorrect data and needs further querying. There are various ways to do this, one of which is to find out how many such values there are.
neg_age = dat[dat$Age < 0, ]
nrow(neg_age)
Output:
[1] 3
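The same count can be obtained without building a subset, by summing the logical condition. The sketch below uses a short toy vector (an assumption that mirrors the first few Age values shown earlier) and also shows which() for locating the offending positions.

```r
# Toy vector mirroring the first few Age values; not the full data set
ages <- c(-12, -10, -3, 23, 23, 24)

sum(ages < 0)    # 3: TRUE counts as 1
which(ages < 0)  # 1 2 3: positions of the negative values

# If the column may contain missing values, drop them from the count
ages_na <- c(ages, NA)
sum(ages_na < 0, na.rm = TRUE)  # 3
```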
There are only three such records, so deleting them would have little effect on the analysis. Alternatively, you can create a new logical variable that checks whether the age is negative.
The first line uses the ifelse() function to create a new variable, AgeNegative, that is TRUE if the age is negative and FALSE otherwise.
The second line prints the first five values of the variable.
# Use logical TRUE/FALSE rather than the strings "TRUE"/"FALSE"
dat$AgeNegative <- ifelse(dat$Age < 0, TRUE, FALSE)
dat$AgeNegative[1:5]
Output:
[1]  TRUE  TRUE  TRUE FALSE FALSE
The output above shows that the first three values are TRUE, which corresponds to the three negative age values in the data. In a similar manner, you can inspect other variables in the data.
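Once the invalid records are identified, they can either be dropped or marked as missing. Which option is appropriate depends on the analysis; the sketch below shows both on a toy data frame whose UID and Age values are invented for illustration.

```r
# Toy data frame; the UID and Age values are hypothetical
toy <- data.frame(UID = c("A", "B", "C", "D"),
                  Age = c(-12, 23, -3, 24))

# Option 1: drop the rows with a negative age
clean <- toy[toy$Age >= 0, ]
nrow(clean)           # 2

# Option 2: keep the rows but mark the invalid ages as missing
toy$Age[toy$Age < 0] <- NA
sum(is.na(toy$Age))   # 2
```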
In this guide, you learned about the most common data types, and acquired the knowledge of querying and converting them. This will help you understand and transform data better to perform complex data science tasks.