March, 2015 | Scott Withrow

dplyr and all of the packages from the Wickham-verse (ggplot2, reshape2, tidyr, ggviz, etc.) have rapidly become essential to the way I visualize my data and construct my syntax. I spent most of a three hour class period going over the fine points of how to use R (stay tuned, it will be posted eventually), but my students thought it would be helpful if they had a more brief guide about the various functions. So here it is!

The dataset used is about 17,700 cases sampled from a larger IMDB and Rotten Tomatoes dataset. You can find the data used in this example here: http://degovx.eurybia.feralhosting.com/moviescleaned.RData

#############
#   dplyr   #
# Functions #
#############

require("dplyr")

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

setwd("E:/Rcourse")
load("moviesclean.RData")

options(digits = 2)

# dplyr can modify any aspect of a dataframe as well as present the dataframe
# in the table format
# 
# table format
movies <- tbl_df(movies)
movies

## Source: local data frame [17,751 x 24]
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)

# To modify dataframes you can use a variety of commands
# Within each command (verb) you can use modifiers (adverbs)
# 
# dplyr commands have the form of VERB(DATA, ADVERBS, OPTIONS)

# Below are the main verbs and their adverbs
# VERB: Select which returns a subset of the columns
# ADVERBS:
# starts_with("X")
# ends_with("X")
# contains("X")
# matches("X")
# num_range("X", 1:5, width = 2) selects X01, x02, x03, x04 etc.
# You can also use "-" to select all but.
select(movies, Title, starts_with("Genre"), contains("rating"))

## Source: local data frame [17,751 x 9]
## 
##                         Title   Genre_1   Genre_2 Genre_3 imdbRating
## 1     The Great Train Robbery     Short   Western      NA        7.4
## 2       Juve Against Fantomas     Crime     Drama      NA        6.6
## 3                     Cabiria Adventure     Drama History        6.5
## 4  Tillie's Punctured Romance    Comedy        NA      NA        7.3
## 5                Regeneration Biography     Crime   Drama        6.8
## 6                Les vampires    Action Adventure   Crime        6.7
## 7                      Mickey    Comedy     Drama      NA        7.5
## 8                   J'accuse!    Horror       War      NA        7.0
## 9            True Heart Susie    Comedy     Drama Romance        7.1
## 10    Dr. Jekyll and Mr. Hyde     Drama    Horror  Sci-Fi        7.1
## ..                        ...       ...       ...     ...        ...
## Variables not shown: RTomRating (dbl), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), RTomRatingCatagory (fctr)

# Here we match title (the default behavior of matches("X")), then find things that
# start with Genre, and any variable that contains "rating".


# VERB: Filter which returns a subset of the rows
# ADVERBS: 
# All base r math and statistical commands as well as boolean operators
# For example
# x < y, x > y, x <= y, x >= y, x == y, x != y
# and all boolean operators
# !, &, and |
# R also has a special operator x %in% [vector]
filter(movies, Genre_1 == "Drama" | Genre_1 == "Comedy", 
       !(Language_01 == "English"), Runtime > 60, imdbRating %in% c(1,2,3,4,5,6,7,8,9))

## Source: local data frame [231 x 24]
## 
##       X               Title Year Runtime   Released imdbRating imdbVotes
## 1    29 Battleship Potemkin 1925      66 1925-12-24          8     34093
## 2    40               Faust 1926      85 1926-12-06          8      8753
## 3   100         Miss Europe 1930      93 1930-08-01          7       380
## 4   131      The Blue Light 1932      85 1934-05-08          7       659
## 5   898        Early Summer 1951     124 1972-08-02          8      3539
## 6  1050         I Vitelloni 1953     104 1956-11-07          8      8478
## 7  1076    A Lesson in Love 1954      96 1960-03-14          7      1273
## 8  1154               Ordet 1955     126 1955-01-10          8      8260
## 9  1171     Street of Shame 1956      87 1959-06-04          8      1758
## 10 1185    The Burmese Harp 1956     116 1967-04-28          8      3340
## ..  ...                 ...  ...     ...        ...        ...       ...
## Variables not shown: RTomRating (dbl), Fresh (int), Rotten (int),
##   RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2
##   (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02 (fctr),
##   Language_03 (fctr), Country_01 (fctr), Country_02 (fctr), weighted
##   (dbl), RTomRatingCatagory (fctr), Director_01 (fctr), Director_02 (fctr)

# Here we find rows where Genre_1 is Drama OR Comedy (the | makes it an or statement), AND 
# (the default behavior is that a comma indicates and), Language_01 does NOT equal English 
# (the ! inverts the statement), AND Runtime is greater than 60, AND finally that imdbRating
# matches the numbers 1,2,3,4,5,6,7,8, or 9 (essentially whole numbers only).


# VERB: Summarize which reduces each group to a single row by calculating aggregate measures
# ADVERBS: 
# fist(x) The first element of vector x
# last(x) The last element of vector x
# nth(x, n) The nth element of vector x
# n() The number of rows in the data.frame or group of observations that summarise() describes
# n_distinct(x) The number of unique values in vector x
# And any math or statistic function that can be used as an aggregator of data
# Adverbs have the form desired.name = adverb
summary.movies <- summarize(movies, First.Title = first(Title), Last.Title = last(Title), Middle.Title = nth(Title, 8875),
          Total.Titles = n(), Distinct.Genres = n_distinct(Genre_1), Average.Rating = mean(imdbRating),
          Best.Rating = max(imdbRating))
print.data.frame(summary.movies)

##               First.Title  Last.Title
## 1 The Great Train Robbery Citizenfour
##                                     Middle.Title Total.Titles
## 1 Escape to Life: The Erika and Klaus Mann Story        17751
##   Distinct.Genres Average.Rating Best.Rating
## 1              23            6.5         9.4

# Here we summarized our dataset by finding the first Title in the dataframe and
# the last title. Then we Found the title in the 8875th place (roughly the
# middle), The total number of rows (and also the total number of titles), the
# number of distinct genres, the average rating and the maximum rating.


# VERB: Arrange which reorders the rows according to single or multiple variables
# ADVERB:
# DESC which inverts the order
arrange(movies, desc(imdbRating), desc(RTomRating), Title)

## Source: local data frame [17,751 x 24]
## 
##        X                                         Title Year Runtime
## 1  12550                                  Interstellar 2014     169
## 2   5951                      The Shawshank Redemption 1994     142
## 3   2468                                 The Godfather 1972     175
## 4   2655                        The Godfather: Part II 1974     200
## 5   5927                                  Pulp Fiction 1994     154
## 6   1888                The Good, the Bad and the Ugly 1966     161
## 7  11783                               The Dark Knight 2008     152
## 8   1232                                  12 Angry Men 1957      96
## 9   5657                              Schindler's List 1993     195
## 10  7710 The Lord of the Rings: The Return of the King 2003     201
## ..   ...                                           ...  ...     ...
## Variables not shown: Released (date), imdbRating (dbl), imdbVotes (int),
##   RTomRating (dbl), Fresh (int), Rotten (int), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2 (fctr), Genre_3
##   (fctr), Language_01 (fctr), Language_02 (fctr), Language_03 (fctr),
##   Country_01 (fctr), Country_02 (fctr), weighted (dbl), RTomRatingCatagory
##   (fctr), Director_01 (fctr), Director_02 (fctr)

# Here we rearranged our data (always the complete dataframe given to arrange
# not just the specified rows) in descending order of imdbRating, then when
# there were ties RTomRating was used to break ties. Finally, Ties were broken
# by Title.


# VERB: Mutate which adds columns from existing data 
# ADVERBS: Any mathmatical or
# statistical function (including user created functions) that can be performed 
# on a row. Any variable created can be used in subsequent calculatons.
composite <- mutate(movies, composite = (imdbRating + RTomRating) / 2,
       avg.composite = mean(composite), deviation = composite - avg.composite)
select(composite, composite, avg.composite, deviation)

## Source: local data frame [17,751 x 3]
## 
##    composite avg.composite deviation
## 1        7.5           6.3     1.226
## 2        7.5           6.3     1.276
## 3        7.4           6.3     1.126
## 4        6.8           6.3     0.576
## 5        8.0           6.3     1.726
## 6        7.8           6.3     1.476
## 7        6.3           6.3     0.026
## 8        7.2           6.3     0.926
## 9        6.2           6.3    -0.024
## 10       7.4           6.3     1.176
## ..       ...           ...       ...

# Here we create a composite variable which is the mean of IMDB and Rotten
# Tomatoes ratings. Then we found the mean of those composite scores. Finally,
# we created a deviation from the mean based on the composite - the mean. We
# then displayed only those columns with selct.


# VERB: group_by() which creates metadata groups that summarize will use
# to give breakdowns. Multiple groups can be specified in the group_by procedure
# ADVERBS: NONE

#Notice how the only change is there is now a "Groups:" entry at the top. 
group_by(movies, Genre_1)

## Source: local data frame [17,751 x 24]
## Groups: Genre_1
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)

#The real change is when you run summarize
grouped <- group_by(movies, Genre_1)

summary.movies <- summarize(grouped, First.Title = first(Title), Last.Title = last(Title),
          Total.Titles = n(), Average.Rating = mean(imdbRating), Best.Rating = max(imdbRating))

print.data.frame(summary.movies)

##        Genre_1                        First.Title
## 1       Action                       Les vampires
## 2        Adult                           Caligula
## 3    Adventure                            Cabiria
## 4    Animation                 Gulliver's Travels
## 5    Biography                       Regeneration
## 6       Comedy         Tillie's Punctured Romance
## 7        Crime              Juve Against Fantomas
## 8  Documentary H„xan: Witchcraft Through the Ages
## 9        Drama            Dr. Jekyll and Mr. Hyde
## 10      Family                             Skippy
## 11     Fantasy                            Destiny
## 12   Film-Noir                              Laura
## 13     History                      Western Union
## 14      Horror                          J'accuse!
## 15       Music                  One Night of Love
## 16     Musical                The Broadway Melody
## 17     Mystery             The Kennel Murder Case
## 18     Romance                        Easy Virtue
## 19      Sci-Fi                     The Devil-Doll
## 20       Short            The Great Train Robbery
## 21    Thriller                           Sabotage
## 22         War                      The Way Ahead
## 23     Western                     The Iron Horse
##                        Last.Title Total.Titles Average.Rating Best.Rating
## 1          I Am a Knife with Legs         1889            6.2         9.0
## 2                      Destricted            3            4.7         5.2
## 3                         Pirates          687            6.4         9.4
## 4  Thunder and the House of Magic          472            6.7         8.6
## 5                  The Golden Era          619            7.0         8.9
## 6                   Force Majeure         4488            6.3         8.6
## 7                   The Blue Room         1142            6.7         9.3
## 8                     Citizenfour         2119            7.2         8.9
## 9                      But Always         4757            6.7         8.9
## 10               Teen Beach Movie           56            6.0         8.2
## 11 Painted Skin: The Resurrection           80            6.0         8.0
## 12              I Bury the Living           14            7.3         8.4
## 13                        Phantom            5            6.7         7.1
## 14       The Houses October Built          839            5.7         8.6
## 15                     Tamla Rose           10            6.4         8.4
## 16           Peaches Does Herself           48            6.5         7.8
## 17                    Frequencies          108            6.7         8.6
## 18                Still the Water           74            6.5         8.1
## 19                     The Signal           81            5.9         8.2
## 20                        Hellion           34            6.7         8.4
## 21                     Heatstroke          149            6.1         8.0
## 22                Dark Blue World            9            7.0         7.5
## 23               Django Unchained           68            6.9         9.0

# Here we have the same summarized data as before (with the exceptions of
# Distinct.Genres since there will only be 1 and middle movie since that will
# change by catagory). The only difference is that now the displayed data has a
# row for every unique genre. We can use this to compute aggregate statistics.


# The final part of the dplyr package is imported from the magrittr package. and
# is not a verb at all Instead it is an operator that takes commands and imports
# them into the first statement of the next variable The command is the pipe
# %>%. It works by joining two sides of an equation with the words "and then". 
# For instance, LEFTHANDSIDE %>% (AND THEN) RIGHTHANDSIDE
# 
# You can enable the pipe by starting dplyr or by uzing the origional package magrittr.
# require("magrittr")
#
# We can use everything we learned to explore our dataset for interesting
# results. Let's look at feature length (90 minute) movies released between 1950
# and 2000 We want to see a summary of movies by Genre. The summary should
# include the number of movies in that category and the mean composite score.
# The composite score whould be a sum of imdbRating and RTomRating. Sorted by
# Composite score (highest first).
movies %>%
  group_by(Genre_1) %>%
    filter(Released > as.Date("1950-01-01") & Released < as.Date("2000-01-01"), Runtime >= 90) %>%
      mutate(composite = imdbRating + RTomRating) %>%
       summarize(N = n(), Composite = mean(composite)) %>%
        arrange(desc(Composite))

## Source: local data frame [22 x 3]
## 
##        Genre_1    N Composite
## 1  Documentary  110        15
## 2          War    5        14
## 3    Film-Noir    2        14
## 4      Mystery   44        14
## 5    Biography  246        14
## 6      Western   52        14
## 7        Drama 1607        14
## 8    Animation   40        14
## 9        Crime  429        13
## 10       Music    1        13
## ..         ...  ...       ...

# Here we take the movies dataset. Then we use pipe to put that into the first
# part of the group_by verb This imports the data. Then we assign the group
# Genre_1 and take that new data and apply the filter statement. This reduces
# our data by only selecting the rows we desire (between 1950 and 200 and over
# 90 minutes long). We take that reduced number of rows and we add a composite
# of the imdbRating and RTomRating. After the composite is created we take that
# data and import it into the summarize function where we compute N and the
# composite score. Last, we import that data into arrange where we sort it by
# composite.

// //

Scott Withrow

Assistant Professor|Koc University

Monthly Archives: March 2015

A Brief Guide to dplyr

Teaching R