Introducing dplyr for data manipulation in R

For most analyses, a large proportion of the code is used to import, reformat and clean the data, and only a small portion is used to run the statistical tests. Despite this, university courses tend to focus on statistical analyses and neglect the data processing steps.

Base R (the packages that are automatically installed) has many very powerful tools for data processing, but an extra package called dplyr can make this processing much easier.

In this post, I’m going to compare how some data.frame manipulations can be done in base R and with dplyr.

First, we need to install the dplyr package, and also the gapminder package which contains the data set I want to process. You only need do this once, and some of these packages may already be installed.

install.packages(c("dplyr", "gapminder"))

Then we need to load the packages.

library("dplyr")
library("gapminder")

The gapminder data set contains life expectancy, population and GDP per capita data for different countries from 1957 to 2007. These are the first few rows

gapminder

## # A tibble: 1,704 × 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438 14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854 12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822 13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

The gapminder data is in a special type of data.frame called a tibble. For most functions, this behaves just like a data.frame except that it has a better print function – see ?tibble for details.

Filtering rows

I want extract from the gapminder data.frame the data for the African countries with a GDP of less than $500 per capita in 1967.

With base R, we can use square brackets to select the rows. Remember these work like this: myDataFrame[rowSelector, columnSelector] where rowSelector is a logical vector (TRUE/FALSE) of rows to return, or a vector of the row numbers to return.

gapminder1967 <- gapminder[gapminder$continent == "Africa" & gapminder$year == 1967 & gapminder$gdpPercap < 500, ]
gapminder1967

## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Lesotho    Africa  1967  48.492  996380  498.6390
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

In this code, the & represents AND, so only rows that are TRUE for all three tests are returned. Remember to use == as a test of equality.

With dplyr, we can use filter().

filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500)

## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Lesotho    Africa  1967  48.492  996380  498.6390
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

I think this is cleaner, more explicit code. The first argument is the name of the data.frame; subsequent arguments are the logical tests. If the logical tests need to be combined with AND, they can be seperated by commas.

Use filter to extract

the data for Norway
the data for 2007 in Asia or Africa

Selecting columns

We can select a single column using the dollar notation

gapminder1967$country

## [1] Burundi Eritrea Lesotho Malawi 
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe

If we want more columns, we need to use the myDataFrame[rowSelector, columnSelector] notation.

gapminder1967[, c("country", "lifeExp")]

## # A tibble: 4 × 2
##   country lifeExp
##    <fctr>   <dbl>
## 1 Burundi  43.548
## 2 Eritrea  42.189
## 3 Lesotho  48.492
## 4  Malawi  39.487

With dplyr, we can use select()

select(gapminder1967, country, lifeExp)

## # A tibble: 4 × 2
##   country lifeExp
##    <fctr>   <dbl>
## 1 Burundi  43.548
## 2 Eritrea  42.189
## 3 Lesotho  48.492
## 4  Malawi  39.487

The first argument to select is the data.frame, the subsequent columns are the required columns. The column names are not quoted.

This is often enough, but sometimes other methods for selecting columns with select are useful.

Adjacent columns can be selected by using firstColumnName:lastColumnName notation

select(gapminder1967, country:pop)

## # A tibble: 4 × 5
##   country continent  year lifeExp     pop
##    <fctr>    <fctr> <int>   <dbl>   <int>
## 1 Burundi    Africa  1967  43.548 3330989
## 2 Eritrea    Africa  1967  42.189 1820319
## 3 Lesotho    Africa  1967  48.492  996380
## 4  Malawi    Africa  1967  39.487 4147252

Columns can be removed by by putting a minus sign in front of their name

select(gapminder1967, -continent)

## # A tibble: 4 × 5
##   country  year lifeExp     pop gdpPercap
##    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Burundi  1967  43.548 3330989  412.9775
## 2 Eritrea  1967  42.189 1820319  468.7950
## 3 Lesotho  1967  48.492  996380  498.6390
## 4  Malawi  1967  39.487 4147252  495.5148

Alternatively, a helper function can be used

select(gapminder1967, starts_with("c"))

## # A tibble: 4 × 2
##   country continent
##    <fctr>    <fctr>
## 1 Burundi    Africa
## 2 Eritrea    Africa
## 3 Lesotho    Africa
## 4  Malawi    Africa

This will select all columns that start with a “c”. See ?starts_with for other helper functions.

Using select() extract

columns for lifeExp and gpdPercap
All columns except contintent

Arranging rows

Rows can be sorted in base R using order(). This code orders the data by life expectancy.

gapminder1967[order(gapminder1967$lifeExp), ]

## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1  Malawi    Africa  1967  39.487 4147252  495.5148
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Burundi    Africa  1967  43.548 3330989  412.9775
## 4 Lesotho    Africa  1967  48.492  996380  498.6390

In dplyr, we can use arrange().

arrange(gapminder1967, lifeExp)

## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1  Malawi    Africa  1967  39.487 4147252  495.5148
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950
## 3 Burundi    Africa  1967  43.548 3330989  412.9775
## 4 Lesotho    Africa  1967  48.492  996380  498.6390

To arrange rows in the reverse direction use desc().

arrange(gapminder1967, desc(lifeExp))

## # A tibble: 4 × 6
##   country continent  year lifeExp     pop gdpPercap
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1 Lesotho    Africa  1967  48.492  996380  498.6390
## 2 Burundi    Africa  1967  43.548 3330989  412.9775
## 3 Eritrea    Africa  1967  42.189 1820319  468.7950
## 4  Malawi    Africa  1967  39.487 4147252  495.5148

Using filter and arrange

Which African country had the highest life expectancy in 2007

Chaining dplyr functions

dplyr functions can be used seperately, with multiple commands run on a data.frame via intermediate objects.

gapminder1967 <- filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500)
result <- select(gapminder1967, country, lifeExp)
result <- arrange(result, lifeExp)

dplyr functions can also be chained together with pipes which look like %>%. Pipes pass the output of one function to the next function in the chain as the first argument. The above code can be written with pipes as

result <- gapminder %>%
  filter(continent == "Africa", year == 1967, gdpPercap < 500) %>%
  select(country)

As the number of processing steps increases, this notation becomes more and more useful. When writing a dplyr chain, it is best to write one command at a time and then test whether it works. Remember to put the pipe %>% after each command.

Using pipes %>%

Which Asian country had the lowest life expectancy in 2002?

Mutating columns

We can add a new column to a data.frame with mutate().

This code will add GDP to the data.frame

gapminder1967 %>%
  mutate(GDP = gdpPercap * pop)

## # A tibble: 4 × 7
##   country continent  year lifeExp     pop gdpPercap        GDP
##    <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>      <dbl>
## 1 Burundi    Africa  1967  43.548 3330989  412.9775 1375623555
## 2 Eritrea    Africa  1967  42.189 1820319  468.7950  853356391
## 3 Lesotho    Africa  1967  48.492  996380  498.6390  496833953
## 4  Malawi    Africa  1967  39.487 4147252  495.5148 2055024665

To replace a column, use an existing column name rather than making a new one.

Using mutate()

Add new columns holding the log of population and gpd per capita

Summarising data

All the above processing can be done in base R, but is neater and more explicit in dplyr. Some things are much easier to do in dplyr.

For example, calculating the mean and standard deviation of life expectancy per continent for 2007. With base R this gets quite complicated.

With dplyr, we need to declare which column (or columns) is to be use to group the data with group_by, and then use summarise.

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    n = n(), 
    mean_lifeExp = mean(lifeExp), 
    sd_lifeExp = sd(lifeExp)
  )

## # A tibble: 5 × 4
##   continent     n mean_lifeExp sd_lifeExp
##      <fctr> <int>        <dbl>      <dbl>
## 1    Africa    52     54.80604  9.6307807
## 2  Americas    25     73.60812  4.4409476
## 3      Asia    33     70.72848  7.9637245
## 4    Europe    30     77.64860  2.9798127
## 5   Oceania     2     80.71950  0.7290271

n() reports how many cases there are for each level of the grouping variable.

If there are many columns to summarise or mutate with the same functions, summarise_all or mutate_all and related functions can be useful.

What happens if you use mutate() instead of summarise() in this code?

Which countries have a life expectancy higher than the mean for their continent?

Other useful functions

rename for renaming columns

gapminder %>%
  rename(population = pop) # newName = oldName

## # A tibble: 1,704 × 6
##        country continent  year lifeExp population gdpPercap
##         <fctr>    <fctr> <int>   <dbl>      <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801    8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332    9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997   10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020   11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088   13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438   14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854   12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822   13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674   16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763   22227415  635.3414
## # ... with 1,694 more rows

slice selects rows by position

gapminder %>%
  group_by(continent) %>%
  slice(1:2)

## Source: local data frame [10 x 6]
## Groups: continent [5]
## 
##        country continent  year lifeExp      pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
## 1      Algeria    Africa  1952  43.077  9279525  2449.0082
## 2      Algeria    Africa  1957  45.685 10270856  3013.9760
## 3    Argentina  Americas  1952  62.485 17876956  5911.3151
## 4    Argentina  Americas  1957  64.399 19610538  6856.8562
## 5  Afghanistan      Asia  1952  28.801  8425333   779.4453
## 6  Afghanistan      Asia  1957  30.332  9240934   820.8530
## 7      Albania    Europe  1952  55.230  1282697  1601.0561
## 8      Albania    Europe  1957  59.280  1476505  1942.2842
## 9    Australia   Oceania  1952  69.120  8691212 10039.5956
## 10   Australia   Oceania  1957  70.330  9712569 10949.6496

This will return the first two rows for each continent.

distinct removes duplicate rows

gapminder %>%
  distinct(continent)

## # A tibble: 5 × 1
##   continent
##      <fctr>
## 1      Asia
## 2    Europe
## 3    Africa
## 4  Americas
## 5   Oceania

left_join joins two data.frames by one or more columns in common. For example, to find which countries had a reduction in life expectancy between 2002 and 2007, we can generate subsets of gapminder for each year and then join them by country and continent.

left_join(
  filter(gapminder, year == 2002),
  filter(gapminder, year == 2007),
  by = c("continent" = "continent","country" = "country"), 
  suffix = c(".2002", ".2007")
) %>%
  filter(lifeExp.2002 &gt; lifeExp.2007) %>%
  select(country, continent, starts_with("lifeExp")) %>%
  mutate(diff = lifeExp.2002 - lifeExp.2007)

## # A tibble: 5 × 5
##        country continent lifeExp.2002 lifeExp.2007  diff
##         <fctr>    <fctr>        <dbl>        <dbl> <dbl>
## 1        Gabon    Africa       56.761       56.735 0.026
## 2      Lesotho    Africa       44.593       42.592 2.001
## 3   Mozambique    Africa       44.026       42.082 1.944
## 4 South Africa    Africa       53.365       49.339 4.026
## 5    Swaziland    Africa       43.869       39.613 4.256