Monthly Archives: March 2015

A Brief Guide to dplyr

dplyr and all of the packages from the Wickham-verse (ggplot2, reshape2, tidyr, ggviz, etc.) have rapidly become essential to the way I visualize my data and construct my syntax. I spent most of a three hour class period going over the fine points of how to use R (stay tuned, it will be posted eventually), but my students thought it would be helpful if they had a more brief guide about the various functions. So here it is!

The dataset used is about 17,700 cases sampled from a larger IMDB and Rotten Tomatoes dataset. You can find the data used in this example here: http://degovx.eurybia.feralhosting.com/moviescleaned.RData

#############
#   dplyr   #
# Functions #
#############

require("dplyr")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("E:/Rcourse")
load("moviesclean.RData")

options(digits = 2)

# dplyr can modify any aspect of a dataframe as well as present the dataframe
# in the table format
# 
# table format
movies <- tbl_df(movies)
movies
## Source: local data frame [17,751 x 24]
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)
# To modify dataframes you can use a variety of commands
# Within each command (verb) you can use modifiers (adverbs)
# 
# dplyr commands have the form of VERB(DATA, ADVERBS, OPTIONS)

# Below are the main verbs and their adverbs
# VERB: Select which returns a subset of the columns
# ADVERBS:
# starts_with("X")
# ends_with("X")
# contains("X")
# matches("X")
# num_range("X", 1:5, width = 2) selects X01, x02, x03, x04 etc.
# You can also use "-" to select all but.
select(movies, Title, starts_with("Genre"), contains("rating"))
## Source: local data frame [17,751 x 9]
## 
##                         Title   Genre_1   Genre_2 Genre_3 imdbRating
## 1     The Great Train Robbery     Short   Western      NA        7.4
## 2       Juve Against Fantomas     Crime     Drama      NA        6.6
## 3                     Cabiria Adventure     Drama History        6.5
## 4  Tillie's Punctured Romance    Comedy        NA      NA        7.3
## 5                Regeneration Biography     Crime   Drama        6.8
## 6                Les vampires    Action Adventure   Crime        6.7
## 7                      Mickey    Comedy     Drama      NA        7.5
## 8                   J'accuse!    Horror       War      NA        7.0
## 9            True Heart Susie    Comedy     Drama Romance        7.1
## 10    Dr. Jekyll and Mr. Hyde     Drama    Horror  Sci-Fi        7.1
## ..                        ...       ...       ...     ...        ...
## Variables not shown: RTomRating (dbl), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), RTomRatingCatagory (fctr)
# Here we match title (the default behavior of matches("X")), then find things that
# start with Genre, and any variable that contains "rating".


# VERB: Filter which returns a subset of the rows
# ADVERBS: 
# All base r math and statistical commands as well as boolean operators
# For example
# x < y, x > y, x <= y, x >= y, x == y, x != y
# and all boolean operators
# !, &, and |
# R also has a special operator x %in% [vector]
filter(movies, Genre_1 == "Drama" | Genre_1 == "Comedy", 
       !(Language_01 == "English"), Runtime > 60, imdbRating %in% c(1,2,3,4,5,6,7,8,9))
## Source: local data frame [231 x 24]
## 
##       X               Title Year Runtime   Released imdbRating imdbVotes
## 1    29 Battleship Potemkin 1925      66 1925-12-24          8     34093
## 2    40               Faust 1926      85 1926-12-06          8      8753
## 3   100         Miss Europe 1930      93 1930-08-01          7       380
## 4   131      The Blue Light 1932      85 1934-05-08          7       659
## 5   898        Early Summer 1951     124 1972-08-02          8      3539
## 6  1050         I Vitelloni 1953     104 1956-11-07          8      8478
## 7  1076    A Lesson in Love 1954      96 1960-03-14          7      1273
## 8  1154               Ordet 1955     126 1955-01-10          8      8260
## 9  1171     Street of Shame 1956      87 1959-06-04          8      1758
## 10 1185    The Burmese Harp 1956     116 1967-04-28          8      3340
## ..  ...                 ...  ...     ...        ...        ...       ...
## Variables not shown: RTomRating (dbl), Fresh (int), Rotten (int),
##   RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2
##   (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02 (fctr),
##   Language_03 (fctr), Country_01 (fctr), Country_02 (fctr), weighted
##   (dbl), RTomRatingCatagory (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we find rows where Genre_1 is Drama OR Comedy (the | makes it an or statement), AND 
# (the default behavior is that a comma indicates and), Language_01 does NOT equal English 
# (the ! inverts the statement), AND Runtime is greater than 60, AND finally that imdbRating
# matches the numbers 1,2,3,4,5,6,7,8, or 9 (essentially whole numbers only).


# VERB: Summarize which reduces each group to a single row by calculating aggregate measures
# ADVERBS: 
# fist(x) The first element of vector x
# last(x) The last element of vector x
# nth(x, n) The nth element of vector x
# n() The number of rows in the data.frame or group of observations that summarise() describes
# n_distinct(x) The number of unique values in vector x
# And any math or statistic function that can be used as an aggregator of data
# Adverbs have the form desired.name = adverb
summary.movies <- summarize(movies, First.Title = first(Title), Last.Title = last(Title), Middle.Title = nth(Title, 8875),
          Total.Titles = n(), Distinct.Genres = n_distinct(Genre_1), Average.Rating = mean(imdbRating),
          Best.Rating = max(imdbRating))
print.data.frame(summary.movies)
##               First.Title  Last.Title
## 1 The Great Train Robbery Citizenfour
##                                     Middle.Title Total.Titles
## 1 Escape to Life: The Erika and Klaus Mann Story        17751
##   Distinct.Genres Average.Rating Best.Rating
## 1              23            6.5         9.4
# Here we summarized our dataset by finding the first Title in the dataframe and
# the last title. Then we Found the title in the 8875th place (roughly the
# middle), The total number of rows (and also the total number of titles), the
# number of distinct genres, the average rating and the maximum rating.


# VERB: Arrange which reorders the rows according to single or multiple variables
# ADVERB:
# DESC which inverts the order
arrange(movies, desc(imdbRating), desc(RTomRating), Title)
## Source: local data frame [17,751 x 24]
## 
##        X                                         Title Year Runtime
## 1  12550                                  Interstellar 2014     169
## 2   5951                      The Shawshank Redemption 1994     142
## 3   2468                                 The Godfather 1972     175
## 4   2655                        The Godfather: Part II 1974     200
## 5   5927                                  Pulp Fiction 1994     154
## 6   1888                The Good, the Bad and the Ugly 1966     161
## 7  11783                               The Dark Knight 2008     152
## 8   1232                                  12 Angry Men 1957      96
## 9   5657                              Schindler's List 1993     195
## 10  7710 The Lord of the Rings: The Return of the King 2003     201
## ..   ...                                           ...  ...     ...
## Variables not shown: Released (date), imdbRating (dbl), imdbVotes (int),
##   RTomRating (dbl), Fresh (int), Rotten (int), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2 (fctr), Genre_3
##   (fctr), Language_01 (fctr), Language_02 (fctr), Language_03 (fctr),
##   Country_01 (fctr), Country_02 (fctr), weighted (dbl), RTomRatingCatagory
##   (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we rearranged our data (always the complete dataframe given to arrange
# not just the specified rows) in descending order of imdbRating, then when
# there were ties RTomRating was used to break ties. Finally, Ties were broken
# by Title.


# VERB: Mutate which adds columns from existing data 
# ADVERBS: Any mathmatical or
# statistical function (including user created functions) that can be performed 
# on a row. Any variable created can be used in subsequent calculatons.
composite <- mutate(movies, composite = (imdbRating + RTomRating) / 2,
       avg.composite = mean(composite), deviation = composite - avg.composite)
select(composite, composite, avg.composite, deviation)
## Source: local data frame [17,751 x 3]
## 
##    composite avg.composite deviation
## 1        7.5           6.3     1.226
## 2        7.5           6.3     1.276
## 3        7.4           6.3     1.126
## 4        6.8           6.3     0.576
## 5        8.0           6.3     1.726
## 6        7.8           6.3     1.476
## 7        6.3           6.3     0.026
## 8        7.2           6.3     0.926
## 9        6.2           6.3    -0.024
## 10       7.4           6.3     1.176
## ..       ...           ...       ...
# Here we create a composite variable which is the mean of IMDB and Rotten
# Tomatoes ratings. Then we found the mean of those composite scores. Finally,
# we created a deviation from the mean based on the composite - the mean. We
# then displayed only those columns with selct.


# VERB: group_by() which creates metadata groups that summarize will use
# to give breakdowns. Multiple groups can be specified in the group_by procedure
# ADVERBS: NONE

#Notice how the only change is there is now a "Groups:" entry at the top. 
group_by(movies, Genre_1)
## Source: local data frame [17,751 x 24]
## Groups: Genre_1
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)
#The real change is when you run summarize
grouped <- group_by(movies, Genre_1)

summary.movies <- summarize(grouped, First.Title = first(Title), Last.Title = last(Title),
          Total.Titles = n(), Average.Rating = mean(imdbRating), Best.Rating = max(imdbRating))

print.data.frame(summary.movies)
##        Genre_1                        First.Title
## 1       Action                       Les vampires
## 2        Adult                           Caligula
## 3    Adventure                            Cabiria
## 4    Animation                 Gulliver's Travels
## 5    Biography                       Regeneration
## 6       Comedy         Tillie's Punctured Romance
## 7        Crime              Juve Against Fantomas
## 8  Documentary H„xan: Witchcraft Through the Ages
## 9        Drama            Dr. Jekyll and Mr. Hyde
## 10      Family                             Skippy
## 11     Fantasy                            Destiny
## 12   Film-Noir                              Laura
## 13     History                      Western Union
## 14      Horror                          J'accuse!
## 15       Music                  One Night of Love
## 16     Musical                The Broadway Melody
## 17     Mystery             The Kennel Murder Case
## 18     Romance                        Easy Virtue
## 19      Sci-Fi                     The Devil-Doll
## 20       Short            The Great Train Robbery
## 21    Thriller                           Sabotage
## 22         War                      The Way Ahead
## 23     Western                     The Iron Horse
##                        Last.Title Total.Titles Average.Rating Best.Rating
## 1          I Am a Knife with Legs         1889            6.2         9.0
## 2                      Destricted            3            4.7         5.2
## 3                         Pirates          687            6.4         9.4
## 4  Thunder and the House of Magic          472            6.7         8.6
## 5                  The Golden Era          619            7.0         8.9
## 6                   Force Majeure         4488            6.3         8.6
## 7                   The Blue Room         1142            6.7         9.3
## 8                     Citizenfour         2119            7.2         8.9
## 9                      But Always         4757            6.7         8.9
## 10               Teen Beach Movie           56            6.0         8.2
## 11 Painted Skin: The Resurrection           80            6.0         8.0
## 12              I Bury the Living           14            7.3         8.4
## 13                        Phantom            5            6.7         7.1
## 14       The Houses October Built          839            5.7         8.6
## 15                     Tamla Rose           10            6.4         8.4
## 16           Peaches Does Herself           48            6.5         7.8
## 17                    Frequencies          108            6.7         8.6
## 18                Still the Water           74            6.5         8.1
## 19                     The Signal           81            5.9         8.2
## 20                        Hellion           34            6.7         8.4
## 21                     Heatstroke          149            6.1         8.0
## 22                Dark Blue World            9            7.0         7.5
## 23               Django Unchained           68            6.9         9.0
# Here we have the same summarized data as before (with the exceptions of
# Distinct.Genres since there will only be 1 and middle movie since that will
# change by catagory). The only difference is that now the displayed data has a
# row for every unique genre. We can use this to compute aggregate statistics.


# The final part of the dplyr package is imported from the magrittr package. and
# is not a verb at all Instead it is an operator that takes commands and imports
# them into the first statement of the next variable The command is the pipe
# %>%. It works by joining two sides of an equation with the words "and then". 
# For instance, LEFTHANDSIDE %>% (AND THEN) RIGHTHANDSIDE
# 
# You can enable the pipe by starting dplyr or by uzing the origional package magrittr.
# require("magrittr")
#
# We can use everything we learned to explore our dataset for interesting
# results. Let's look at feature length (90 minute) movies released between 1950
# and 2000 We want to see a summary of movies by Genre. The summary should
# include the number of movies in that category and the mean composite score.
# The composite score whould be a sum of imdbRating and RTomRating. Sorted by
# Composite score (highest first).
movies %>%
  group_by(Genre_1) %>%
    filter(Released > as.Date("1950-01-01") & Released < as.Date("2000-01-01"), Runtime >= 90) %>%
      mutate(composite = imdbRating + RTomRating) %>%
       summarize(N = n(), Composite = mean(composite)) %>%
        arrange(desc(Composite))
## Source: local data frame [22 x 3]
## 
##        Genre_1    N Composite
## 1  Documentary  110        15
## 2          War    5        14
## 3    Film-Noir    2        14
## 4      Mystery   44        14
## 5    Biography  246        14
## 6      Western   52        14
## 7        Drama 1607        14
## 8    Animation   40        14
## 9        Crime  429        13
## 10       Music    1        13
## ..         ...  ...       ...
# Here we take the movies dataset. Then we use pipe to put that into the first
# part of the group_by verb This imports the data. Then we assign the group
# Genre_1 and take that new data and apply the filter statement. This reduces
# our data by only selecting the rows we desire (between 1950 and 200 and over
# 90 minutes long). We take that reduced number of rows and we add a composite
# of the imdbRating and RTomRating. After the composite is created we take that
# data and import it into the summarize function where we compute N and the
# composite score. Last, we import that data into arrange where we sort it by
# composite.

// //

Teaching R

I have been learning and using R in my research for a number of years now which has been a lot of fun! On the down side, most of the education undergraduates receive is in SPSS and graduate students will either continue that trend or move on to STATA or SAS. I consider myself proficient in the use of SAS and SPSS but I would much rather use R. For one, the ability to do all my analyses in one place is very appealing. I can pull in data, clean it, conduct item response theory analyses, exploratory or confirmatory factor analysis, multilevel modeling using IRT theta scores, etc. A process that might take two or three programs with any other software. Two, it allows easy work across my work machine (Windows) and my home machine (Linux). Three, it’s open and free so I don’t have to wait for my university to get around to updating the license so I can do my work.

In an effort to have a few more students able to use R I proposed a course in R programming. I was amazed at the overwhelmingly positive reactions I received from my colleagues. I created the course and was again amazed when it reached capacity within two days of listing. Although I am not done with this semester yet (barely half way, maalesef) I have learned a lot about what I know and what I can do with R (usually how much I don’t know). From the questions students ask to the clever ways they find to make the lab assignments easy I have learned more this semester than in the years of steady use.

Stay tuned I will be posting my notes so others can benefit!