When working with data, it's important to know what you want to do with the substantial amount of information you have. To figure out the facts from the data, some level of manipulation is necessary, as it is rare to get the data in exactly the right form.
For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.
1# The easiest way to install dplyr is to install the whole tidyverse
2install.packages("tidyverse")
3
4# Alternate way to install only dplyr
5install.packages("dplyr")
dplyr provides a consistent set of functions to solve data manipulation problems. Some of these include:
filter()
: to select records based on their valuesarrange()
: to reorderselect()
: to select variables from the datasetmutate()
: to add new variablessummarize()
: to condense multiple values into onegroup_by()
: to break down the dataset into specified groups of rowsEvery data manipulation activity would not simple that will use one or two functions. To make the code easy to read, dplyr uses %>%
(the piping operator) from magrittr, which turns x%>%f(x)
into f(x, y)
.
We'll be using this operator to help explain dplyr.
To explore the functions of dplyr, we need a dataset. We will use the flights dataset from the nycflights13 package, which contains several useful datasets.
1# Installing nycflights13 package
2install.packages("nycflights13")
3library(nycflights13)
4data(flights)
5# looking into sample data
6head(flights)
7# A tibble: 6 x 19
8 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
9 <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
101 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
112 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
123 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
134 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
145 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
156 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
16# ... with 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
To get a subset of rows from the main dataset, use the filter()
function. This function takes dataframe
as the first argument, and subsequent arguments are the expression that filters the dataframe.
1# loading dplyr library
2library(dplyr)
3
4# filtering flights dataframe on year
5filter(flights, year ==2013)
6# adding more expression for year and month
7filter(flights, year = 2013, month = 1)
The filter expressions are applied as an AND
operation.
The arrange()
function works similarly to filter()
except it arranges the rows of the dataframe. The first argument of the function is the dataframe name and the subsequent arguments are the column names.
1# arranging the flights dataframe on year
2arrange(flights, year)
3
4# arranging dataframe on multiple columns
5arrange(flights, year, month, day)
6
7# The default order of arrange() function is ascending if we want, we can arrange in descending order
8arrange(flights, desc(year))
When providing multiple columns, each column breaks the arrangement of the rows of the preceding one.
If you are interested only in a few columns of the dataset, pull them using the select()
function. The first argument ise the name of the dataframe and subsequent arguments are the names of the columns or expressions.
1# selecting columns through their names
2select(flights, year, month, day)
3
4# selecting columns with expressions
5select(flights, starts_with("arr"))
To add a new column that is a function of the existing columns in the dataframe, use the mutate()
function. The first argument of the mutate function is the name of the dataframe and the subsequent arguments are the formula for the new columns.
1# The flights dataset has distance and air_time, so we will add speed in the dataframe
2mutate(flights, speed = distance/air_time)
This will create a new column, speed
, in the flights dataframe.
Summarized data helps make decisions that are difficult to decide properly going through a huge amount of information. The summarize function helps in this scenario.
1# lets get mean of the delay time in arrival from flights dataframe.
2summarize(flights, delay = mean(arr_delay, na.rm = TRUE)
The summarize()
function is used a lot with the group_by()
function as it gives more detailed information when used along with group_by()
.
In grouped operations, the dataset breaks down into specified groups of rows. In dplyr, this is done with the group_by()
function. The first argument of the function is the data frame name and subsequent arguments are those columns that take part in grouping the rows.
We generally use the group_by()
function along with some aggregate functions.
1# Grouping dataset in years
2group_by(flights, year)
3# Using along with summarize() function
4summarize(group_by(flight, year), delay = mean(arr_delay, na.rm = TRUE)
When used in the summarize()
function, the result is the mean of arr_delay
for each year.
So far, we have applyied the data manipulation function alone, but in this section, we will take on a task in which we have to use multiple functions and club them with the help of the piping(%>%)
operator.
Say you need to calculate the mean delay in arrival and departure for every month of the year 2013 from the flights dataset. If you read the statement, it looks complicated. So you will break it in the code section.
1# Using functions along with %>% operator
2
3flight%>%
4 filter(year = 2013)%>% # First filter the required rows
5 select(year, month, arr_delay, dep_delay)%>% # Second selecting the necessary columns
6 group_by(year, month)%>% # Third grouping the rows
7 summarise( arrival = mean(arr_delay, na.rm = TRUE),
8 departure = mean(dep_delay, na.rm = TRUE)) # Fourth now calculating the columns
The code above uses the %>%
operator to give you a better understanding of the code.
Whenever we are working with data, we need to do some manipulations to get the most valuable information. In real-life scenarios, datasets are more complicated and contain a lot of errors, so we have to write code that can manipulates data efficiently and tackles the errors.
Also, the volume of data in real-life cases is much higher. We can handle this if we understand the task and break it into small functions.
You can get more information about dplyr functions here.