dplyr and all of the packages from the Wickham-verse (ggplot2, reshape2, tidyr, ggviz, etc.) have rapidly become essential to the way I visualize my data and construct my syntax. I spent most of a three hour class period going over the fine points of how to use R (stay tuned, it will be posted eventually), but my students thought it would be helpful if they had a more brief guide about the various functions. So here it is!
The dataset used is about 17,700 cases sampled from a larger IMDB and Rotten Tomatoes dataset. You can find the data used in this example here: http://degovx.eurybia.feralhosting.com/moviescleaned.RData
#############
# dplyr #
# Functions #
#############
require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("E:/Rcourse")
load("moviesclean.RData")
options(digits = 2)
# dplyr can modify any aspect of a dataframe as well as present the dataframe
# in the table format
#
# table format
movies <- tbl_df(movies)
movies
## Source: local data frame [17,751 x 24]
##
## X Title Year Runtime Released imdbRating
## 1 1 The Great Train Robbery 1903 11 1903-12-01 7.4
## 2 2 Juve Against Fantomas 1913 61 1913-10-02 6.6
## 3 3 Cabiria 1914 148 1914-06-01 6.5
## 4 4 Tillie's Punctured Romance 1914 82 1914-12-21 7.3
## 5 5 Regeneration 1915 72 1915-09-13 6.8
## 6 6 Les vampires 1915 399 1916-11-23 6.7
## 7 7 Mickey 1918 93 1918-08-01 7.5
## 8 8 J'accuse! 1919 166 1919-04-25 7.0
## 9 9 True Heart Susie 1919 87 1919-06-01 7.1
## 10 10 Dr. Jekyll and Mr. Hyde 1920 49 1920-04-01 7.1
## .. .. ... ... ... ... ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
## Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
## (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
## (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
## weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
## Director_02 (fctr)
# To modify dataframes you can use a variety of commands
# Within each command (verb) you can use modifiers (adverbs)
#
# dplyr commands have the form of VERB(DATA, ADVERBS, OPTIONS)
# Below are the main verbs and their adverbs
# VERB: Select which returns a subset of the columns
# ADVERBS:
# starts_with("X")
# ends_with("X")
# contains("X")
# matches("X")
# num_range("X", 1:5, width = 2) selects X01, x02, x03, x04 etc.
# You can also use "-" to select all but.
select(movies, Title, starts_with("Genre"), contains("rating"))
## Source: local data frame [17,751 x 9]
##
## Title Genre_1 Genre_2 Genre_3 imdbRating
## 1 The Great Train Robbery Short Western NA 7.4
## 2 Juve Against Fantomas Crime Drama NA 6.6
## 3 Cabiria Adventure Drama History 6.5
## 4 Tillie's Punctured Romance Comedy NA NA 7.3
## 5 Regeneration Biography Crime Drama 6.8
## 6 Les vampires Action Adventure Crime 6.7
## 7 Mickey Comedy Drama NA 7.5
## 8 J'accuse! Horror War NA 7.0
## 9 True Heart Susie Comedy Drama Romance 7.1
## 10 Dr. Jekyll and Mr. Hyde Drama Horror Sci-Fi 7.1
## .. ... ... ... ... ...
## Variables not shown: RTomRating (dbl), RTomUserRating (dbl),
## imdbRatingCatagory (fctr), RTomRatingCatagory (fctr)
# Here we match title (the default behavior of matches("X")), then find things that
# start with Genre, and any variable that contains "rating".
# VERB: Filter which returns a subset of the rows
# ADVERBS:
# All base r math and statistical commands as well as boolean operators
# For example
# x < y, x > y, x <= y, x >= y, x == y, x != y
# and all boolean operators
# !, &, and |
# R also has a special operator x %in% [vector]
filter(movies, Genre_1 == "Drama" | Genre_1 == "Comedy",
!(Language_01 == "English"), Runtime > 60, imdbRating %in% c(1,2,3,4,5,6,7,8,9))
## Source: local data frame [231 x 24]
##
## X Title Year Runtime Released imdbRating imdbVotes
## 1 29 Battleship Potemkin 1925 66 1925-12-24 8 34093
## 2 40 Faust 1926 85 1926-12-06 8 8753
## 3 100 Miss Europe 1930 93 1930-08-01 7 380
## 4 131 The Blue Light 1932 85 1934-05-08 7 659
## 5 898 Early Summer 1951 124 1972-08-02 8 3539
## 6 1050 I Vitelloni 1953 104 1956-11-07 8 8478
## 7 1076 A Lesson in Love 1954 96 1960-03-14 7 1273
## 8 1154 Ordet 1955 126 1955-01-10 8 8260
## 9 1171 Street of Shame 1956 87 1959-06-04 8 1758
## 10 1185 The Burmese Harp 1956 116 1967-04-28 8 3340
## .. ... ... ... ... ... ... ...
## Variables not shown: RTomRating (dbl), Fresh (int), Rotten (int),
## RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2
## (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02 (fctr),
## Language_03 (fctr), Country_01 (fctr), Country_02 (fctr), weighted
## (dbl), RTomRatingCatagory (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we find rows where Genre_1 is Drama OR Comedy (the | makes it an or statement), AND
# (the default behavior is that a comma indicates and), Language_01 does NOT equal English
# (the ! inverts the statement), AND Runtime is greater than 60, AND finally that imdbRating
# matches the numbers 1,2,3,4,5,6,7,8, or 9 (essentially whole numbers only).
# VERB: Summarize which reduces each group to a single row by calculating aggregate measures
# ADVERBS:
# fist(x) The first element of vector x
# last(x) The last element of vector x
# nth(x, n) The nth element of vector x
# n() The number of rows in the data.frame or group of observations that summarise() describes
# n_distinct(x) The number of unique values in vector x
# And any math or statistic function that can be used as an aggregator of data
# Adverbs have the form desired.name = adverb
summary.movies <- summarize(movies, First.Title = first(Title), Last.Title = last(Title), Middle.Title = nth(Title, 8875),
Total.Titles = n(), Distinct.Genres = n_distinct(Genre_1), Average.Rating = mean(imdbRating),
Best.Rating = max(imdbRating))
print.data.frame(summary.movies)
## First.Title Last.Title
## 1 The Great Train Robbery Citizenfour
## Middle.Title Total.Titles
## 1 Escape to Life: The Erika and Klaus Mann Story 17751
## Distinct.Genres Average.Rating Best.Rating
## 1 23 6.5 9.4
# Here we summarized our dataset by finding the first Title in the dataframe and
# the last title. Then we Found the title in the 8875th place (roughly the
# middle), The total number of rows (and also the total number of titles), the
# number of distinct genres, the average rating and the maximum rating.
# VERB: Arrange which reorders the rows according to single or multiple variables
# ADVERB:
# DESC which inverts the order
arrange(movies, desc(imdbRating), desc(RTomRating), Title)
## Source: local data frame [17,751 x 24]
##
## X Title Year Runtime
## 1 12550 Interstellar 2014 169
## 2 5951 The Shawshank Redemption 1994 142
## 3 2468 The Godfather 1972 175
## 4 2655 The Godfather: Part II 1974 200
## 5 5927 Pulp Fiction 1994 154
## 6 1888 The Good, the Bad and the Ugly 1966 161
## 7 11783 The Dark Knight 2008 152
## 8 1232 12 Angry Men 1957 96
## 9 5657 Schindler's List 1993 195
## 10 7710 The Lord of the Rings: The Return of the King 2003 201
## .. ... ... ... ...
## Variables not shown: Released (date), imdbRating (dbl), imdbVotes (int),
## RTomRating (dbl), Fresh (int), Rotten (int), RTomUserRating (dbl),
## imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2 (fctr), Genre_3
## (fctr), Language_01 (fctr), Language_02 (fctr), Language_03 (fctr),
## Country_01 (fctr), Country_02 (fctr), weighted (dbl), RTomRatingCatagory
## (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we rearranged our data (always the complete dataframe given to arrange
# not just the specified rows) in descending order of imdbRating, then when
# there were ties RTomRating was used to break ties. Finally, Ties were broken
# by Title.
# VERB: Mutate which adds columns from existing data
# ADVERBS: Any mathmatical or
# statistical function (including user created functions) that can be performed
# on a row. Any variable created can be used in subsequent calculatons.
composite <- mutate(movies, composite = (imdbRating + RTomRating) / 2,
avg.composite = mean(composite), deviation = composite - avg.composite)
select(composite, composite, avg.composite, deviation)
## Source: local data frame [17,751 x 3]
##
## composite avg.composite deviation
## 1 7.5 6.3 1.226
## 2 7.5 6.3 1.276
## 3 7.4 6.3 1.126
## 4 6.8 6.3 0.576
## 5 8.0 6.3 1.726
## 6 7.8 6.3 1.476
## 7 6.3 6.3 0.026
## 8 7.2 6.3 0.926
## 9 6.2 6.3 -0.024
## 10 7.4 6.3 1.176
## .. ... ... ...
# Here we create a composite variable which is the mean of IMDB and Rotten
# Tomatoes ratings. Then we found the mean of those composite scores. Finally,
# we created a deviation from the mean based on the composite - the mean. We
# then displayed only those columns with selct.
# VERB: group_by() which creates metadata groups that summarize will use
# to give breakdowns. Multiple groups can be specified in the group_by procedure
# ADVERBS: NONE
#Notice how the only change is there is now a "Groups:" entry at the top.
group_by(movies, Genre_1)
## Source: local data frame [17,751 x 24]
## Groups: Genre_1
##
## X Title Year Runtime Released imdbRating
## 1 1 The Great Train Robbery 1903 11 1903-12-01 7.4
## 2 2 Juve Against Fantomas 1913 61 1913-10-02 6.6
## 3 3 Cabiria 1914 148 1914-06-01 6.5
## 4 4 Tillie's Punctured Romance 1914 82 1914-12-21 7.3
## 5 5 Regeneration 1915 72 1915-09-13 6.8
## 6 6 Les vampires 1915 399 1916-11-23 6.7
## 7 7 Mickey 1918 93 1918-08-01 7.5
## 8 8 J'accuse! 1919 166 1919-04-25 7.0
## 9 9 True Heart Susie 1919 87 1919-06-01 7.1
## 10 10 Dr. Jekyll and Mr. Hyde 1920 49 1920-04-01 7.1
## .. .. ... ... ... ... ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
## Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
## (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
## (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
## weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
## Director_02 (fctr)
#The real change is when you run summarize
grouped <- group_by(movies, Genre_1)
summary.movies <- summarize(grouped, First.Title = first(Title), Last.Title = last(Title),
Total.Titles = n(), Average.Rating = mean(imdbRating), Best.Rating = max(imdbRating))
print.data.frame(summary.movies)
## Genre_1 First.Title
## 1 Action Les vampires
## 2 Adult Caligula
## 3 Adventure Cabiria
## 4 Animation Gulliver's Travels
## 5 Biography Regeneration
## 6 Comedy Tillie's Punctured Romance
## 7 Crime Juve Against Fantomas
## 8 Documentary H„xan: Witchcraft Through the Ages
## 9 Drama Dr. Jekyll and Mr. Hyde
## 10 Family Skippy
## 11 Fantasy Destiny
## 12 Film-Noir Laura
## 13 History Western Union
## 14 Horror J'accuse!
## 15 Music One Night of Love
## 16 Musical The Broadway Melody
## 17 Mystery The Kennel Murder Case
## 18 Romance Easy Virtue
## 19 Sci-Fi The Devil-Doll
## 20 Short The Great Train Robbery
## 21 Thriller Sabotage
## 22 War The Way Ahead
## 23 Western The Iron Horse
## Last.Title Total.Titles Average.Rating Best.Rating
## 1 I Am a Knife with Legs 1889 6.2 9.0
## 2 Destricted 3 4.7 5.2
## 3 Pirates 687 6.4 9.4
## 4 Thunder and the House of Magic 472 6.7 8.6
## 5 The Golden Era 619 7.0 8.9
## 6 Force Majeure 4488 6.3 8.6
## 7 The Blue Room 1142 6.7 9.3
## 8 Citizenfour 2119 7.2 8.9
## 9 But Always 4757 6.7 8.9
## 10 Teen Beach Movie 56 6.0 8.2
## 11 Painted Skin: The Resurrection 80 6.0 8.0
## 12 I Bury the Living 14 7.3 8.4
## 13 Phantom 5 6.7 7.1
## 14 The Houses October Built 839 5.7 8.6
## 15 Tamla Rose 10 6.4 8.4
## 16 Peaches Does Herself 48 6.5 7.8
## 17 Frequencies 108 6.7 8.6
## 18 Still the Water 74 6.5 8.1
## 19 The Signal 81 5.9 8.2
## 20 Hellion 34 6.7 8.4
## 21 Heatstroke 149 6.1 8.0
## 22 Dark Blue World 9 7.0 7.5
## 23 Django Unchained 68 6.9 9.0
# Here we have the same summarized data as before (with the exceptions of
# Distinct.Genres since there will only be 1 and middle movie since that will
# change by catagory). The only difference is that now the displayed data has a
# row for every unique genre. We can use this to compute aggregate statistics.
# The final part of the dplyr package is imported from the magrittr package. and
# is not a verb at all Instead it is an operator that takes commands and imports
# them into the first statement of the next variable The command is the pipe
# %>%. It works by joining two sides of an equation with the words "and then".
# For instance, LEFTHANDSIDE %>% (AND THEN) RIGHTHANDSIDE
#
# You can enable the pipe by starting dplyr or by uzing the origional package magrittr.
# require("magrittr")
#
# We can use everything we learned to explore our dataset for interesting
# results. Let's look at feature length (90 minute) movies released between 1950
# and 2000 We want to see a summary of movies by Genre. The summary should
# include the number of movies in that category and the mean composite score.
# The composite score whould be a sum of imdbRating and RTomRating. Sorted by
# Composite score (highest first).
movies %>%
group_by(Genre_1) %>%
filter(Released > as.Date("1950-01-01") & Released < as.Date("2000-01-01"), Runtime >= 90) %>%
mutate(composite = imdbRating + RTomRating) %>%
summarize(N = n(), Composite = mean(composite)) %>%
arrange(desc(Composite))
## Source: local data frame [22 x 3]
##
## Genre_1 N Composite
## 1 Documentary 110 15
## 2 War 5 14
## 3 Film-Noir 2 14
## 4 Mystery 44 14
## 5 Biography 246 14
## 6 Western 52 14
## 7 Drama 1607 14
## 8 Animation 40 14
## 9 Crime 429 13
## 10 Music 1 13
## .. ... ... ...
# Here we take the movies dataset. Then we use pipe to put that into the first
# part of the group_by verb This imports the data. Then we assign the group
# Genre_1 and take that new data and apply the filter statement. This reduces
# our data by only selecting the rows we desire (between 1950 and 200 and over
# 90 minutes long). We take that reduced number of rows and we add a composite
# of the imdbRating and RTomRating. After the composite is created we take that
# data and import it into the summarize function where we compute N and the
# composite score. Last, we import that data into arrange where we sort it by
# composite.
// //