For most analyses, a large proportion of the code is used to import, reformat and clean the data, and only a small portion is used to run the statistical tests. Despite this, university courses tend to focus on statistical analyses and neglect the data processing steps.
Base R (the packages that are automatically installed) has many very powerful tools for data processing, but an extra package called dplyr
can make this processing much easier.
In this post, I’m going to compare how some data.frame manipulations can be done in base R and with dplyr
.
First, we need to install the dplyr
package, and also the gapminder
package which contains the data set I want to process. You only need do this once, and some of these packages may already be installed.
install.packages(c("dplyr", "gapminder"))
Then we need to load the packages.
library("dplyr") library("gapminder")
The gapminder data set contains life expectancy, population and GDP per capita data for different countries from 1957 to 2007. These are the first few rows
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
The gapminder data is in a special type of data.frame
called a tibble. For most functions, this behaves just like a data.frame
except that it has a better print function – see ?tibble
for details.
Filtering rows
I want extract from the gapminder data.frame the data for the African countries with a GDP of less than $500 per capita in 1967.
With base R, we can use square brackets to select the rows. Remember these work like this: myDataFrame[rowSelector, columnSelector]
where rowSelector is a logical vector (TRUE/FALSE) of rows to return, or a vector of the row numbers to return.
gapminder1967 <- gapminder[gapminder$continent == "Africa" & gapminder$year == 1967 & gapminder$gdpPercap < 500, ] gapminder1967
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Burundi Africa 1967 43.548 3330989 412.9775
## 2 Eritrea Africa 1967 42.189 1820319 468.7950
## 3 Lesotho Africa 1967 48.492 996380 498.6390
## 4 Malawi Africa 1967 39.487 4147252 495.5148
In this code, the &
represents AND, so only rows that are TRUE for all three tests are returned. Remember to use ==
as a test of equality.
With dplyr
, we can use filter()
.
filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500)
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Burundi Africa 1967 43.548 3330989 412.9775
## 2 Eritrea Africa 1967 42.189 1820319 468.7950
## 3 Lesotho Africa 1967 48.492 996380 498.6390
## 4 Malawi Africa 1967 39.487 4147252 495.5148
I think this is cleaner, more explicit code. The first argument is the name of the data.frame; subsequent arguments are the logical tests. If the logical tests need to be combined with AND, they can be seperated by commas.
Use filter
to extract
- the data for Norway
- the data for 2007 in Asia or Africa
Selecting columns
We can select a single column using the dollar notation
gapminder1967$country
## [1] Burundi Eritrea Lesotho Malawi
## 142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
If we want more columns, we need to use the myDataFrame[rowSelector, columnSelector]
notation.
gapminder1967[, c("country", "lifeExp")]
## # A tibble: 4 × 2
## country lifeExp
## <fctr> <dbl>
## 1 Burundi 43.548
## 2 Eritrea 42.189
## 3 Lesotho 48.492
## 4 Malawi 39.487
With dplyr
, we can use select()
select(gapminder1967, country, lifeExp)
## # A tibble: 4 × 2
## country lifeExp
## <fctr> <dbl>
## 1 Burundi 43.548
## 2 Eritrea 42.189
## 3 Lesotho 48.492
## 4 Malawi 39.487
The first argument to select
is the data.frame, the subsequent columns are the required columns. The column names are not quoted.
This is often enough, but sometimes other methods for selecting columns with select
are useful.
Adjacent columns can be selected by using firstColumnName:lastColumnName
notation
select(gapminder1967, country:pop)
## # A tibble: 4 × 5
## country continent year lifeExp pop
## <fctr> <fctr> <int> <dbl> <int>
## 1 Burundi Africa 1967 43.548 3330989
## 2 Eritrea Africa 1967 42.189 1820319
## 3 Lesotho Africa 1967 48.492 996380
## 4 Malawi Africa 1967 39.487 4147252
Columns can be removed by by putting a minus sign in front of their name
select(gapminder1967, -continent)
## # A tibble: 4 × 5
## country year lifeExp pop gdpPercap
## <fctr> <int> <dbl> <int> <dbl>
## 1 Burundi 1967 43.548 3330989 412.9775
## 2 Eritrea 1967 42.189 1820319 468.7950
## 3 Lesotho 1967 48.492 996380 498.6390
## 4 Malawi 1967 39.487 4147252 495.5148
Alternatively, a helper function can be used
select(gapminder1967, starts_with("c"))
## # A tibble: 4 × 2
## country continent
## <fctr> <fctr>
## 1 Burundi Africa
## 2 Eritrea Africa
## 3 Lesotho Africa
## 4 Malawi Africa
This will select all columns that start with a “c”. See ?starts_with
for other helper functions.
Using select()
extract
- columns for lifeExp and gpdPercap
- All columns except contintent
Arranging rows
Rows can be sorted in base R using order()
. This code orders the data by life expectancy.
gapminder1967[order(gapminder1967$lifeExp), ]
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Malawi Africa 1967 39.487 4147252 495.5148
## 2 Eritrea Africa 1967 42.189 1820319 468.7950
## 3 Burundi Africa 1967 43.548 3330989 412.9775
## 4 Lesotho Africa 1967 48.492 996380 498.6390
In dplyr
, we can use arrange()
.
arrange(gapminder1967, lifeExp)
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Malawi Africa 1967 39.487 4147252 495.5148
## 2 Eritrea Africa 1967 42.189 1820319 468.7950
## 3 Burundi Africa 1967 43.548 3330989 412.9775
## 4 Lesotho Africa 1967 48.492 996380 498.6390
To arrange rows in the reverse direction use desc()
.
arrange(gapminder1967, desc(lifeExp))
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Lesotho Africa 1967 48.492 996380 498.6390
## 2 Burundi Africa 1967 43.548 3330989 412.9775
## 3 Eritrea Africa 1967 42.189 1820319 468.7950
## 4 Malawi Africa 1967 39.487 4147252 495.5148
Using filter
and arrange
- Which African country had the highest life expectancy in 2007
Chaining dplyr functions
dplyr
functions can be used seperately, with multiple commands run on a data.frame via intermediate objects.
gapminder1967 <- filter(gapminder, continent == "Africa", year == 1967, gdpPercap < 500) result <- select(gapminder1967, country, lifeExp) result <- arrange(result, lifeExp)
dplyr
functions can also be chained together with pipes which look like %>%
. Pipes pass the output of one function to the next function in the chain as the first argument. The above code can be written with pipes as
result <- gapminder %>% filter(continent == "Africa", year == 1967, gdpPercap < 500) %>% select(country)
As the number of processing steps increases, this notation becomes more and more useful. When writing a dplyr chain, it is best to write one command at a time and then test whether it works. Remember to put the pipe %>%
after each command.
Using pipes %>%
- Which Asian country had the lowest life expectancy in 2002?
Mutating columns
We can add a new column to a data.frame with mutate()
.
This code will add GDP to the data.frame
gapminder1967 %>% mutate(GDP = gdpPercap * pop)
## # A tibble: 4 × 7
## country continent year lifeExp pop gdpPercap GDP
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Burundi Africa 1967 43.548 3330989 412.9775 1375623555
## 2 Eritrea Africa 1967 42.189 1820319 468.7950 853356391
## 3 Lesotho Africa 1967 48.492 996380 498.6390 496833953
## 4 Malawi Africa 1967 39.487 4147252 495.5148 2055024665
To replace a column, use an existing column name rather than making a new one.
Using mutate()
- Add new columns holding the log of population and gpd per capita
Summarising data
All the above processing can be done in base R, but is neater and more explicit in dplyr
. Some things are much easier to do in dplyr
.
For example, calculating the mean and standard deviation of life expectancy per continent for 2007. With base R this gets quite complicated.
With dplyr
, we need to declare which column (or columns) is to be use to group the data with group_by
, and then use summarise
.
gapminder %>% filter(year == 2007) %>% group_by(continent) %>% summarise( n = n(), mean_lifeExp = mean(lifeExp), sd_lifeExp = sd(lifeExp) )
## # A tibble: 5 × 4
## continent n mean_lifeExp sd_lifeExp
## <fctr> <int> <dbl> <dbl>
## 1 Africa 52 54.80604 9.6307807
## 2 Americas 25 73.60812 4.4409476
## 3 Asia 33 70.72848 7.9637245
## 4 Europe 30 77.64860 2.9798127
## 5 Oceania 2 80.71950 0.7290271
n()
reports how many cases there are for each level of the grouping variable.
If there are many columns to summarise or mutate with the same functions, summarise_all
or mutate_all
and related functions can be useful.
What happens if you use mutate()
instead of summarise()
in this code?
Which countries have a life expectancy higher than the mean for their continent?
Other useful functions
rename
for renaming columns
gapminder %>% rename(population = pop) # newName = oldName
## # A tibble: 1,704 × 6
## country continent year lifeExp population gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
slice
selects rows by position
gapminder %>% group_by(continent) %>% slice(1:2)
## Source: local data frame [10 x 6]
## Groups: continent [5]
##
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Algeria Africa 1952 43.077 9279525 2449.0082
## 2 Algeria Africa 1957 45.685 10270856 3013.9760
## 3 Argentina Americas 1952 62.485 17876956 5911.3151
## 4 Argentina Americas 1957 64.399 19610538 6856.8562
## 5 Afghanistan Asia 1952 28.801 8425333 779.4453
## 6 Afghanistan Asia 1957 30.332 9240934 820.8530
## 7 Albania Europe 1952 55.230 1282697 1601.0561
## 8 Albania Europe 1957 59.280 1476505 1942.2842
## 9 Australia Oceania 1952 69.120 8691212 10039.5956
## 10 Australia Oceania 1957 70.330 9712569 10949.6496
This will return the first two rows for each continent.
distinct
removes duplicate rows
gapminder %>% distinct(continent)
## # A tibble: 5 × 1
## continent
## <fctr>
## 1 Asia
## 2 Europe
## 3 Africa
## 4 Americas
## 5 Oceania
left_join
joins two data.frames by one or more columns in common. For example, to find which countries had a reduction in life expectancy between 2002 and 2007, we can generate subsets of gapminder for each year and then join them by country and continent.
left_join( filter(gapminder, year == 2002), filter(gapminder, year == 2007), by = c("continent" = "continent","country" = "country"), suffix = c(".2002", ".2007") ) %>% filter(lifeExp.2002 > lifeExp.2007) %>% select(country, continent, starts_with("lifeExp")) %>% mutate(diff = lifeExp.2002 - lifeExp.2007)
## # A tibble: 5 × 5
## country continent lifeExp.2002 lifeExp.2007 diff
## <fctr> <fctr> <dbl> <dbl> <dbl>
## 1 Gabon Africa 56.761 56.735 0.026
## 2 Lesotho Africa 44.593 42.592 2.001
## 3 Mozambique Africa 44.026 42.082 1.944
## 4 South Africa Africa 53.365 49.339 4.026
## 5 Swaziland Africa 43.869 39.613 4.256