Lesson 3: Intermediate Data Management

Most of the time someone spends working on a data analysis problem will be getting the data into the program. This starts with reading in a file and turns into transforming variables, correcting mistakes, and sometimes redesigning the entire dataset (e.g., long to tall changes). In this module we will begin to talk about how we can manipulate data in R to be the form we want for the analyses we want to conduct.

We are going to begin working with an actual dataset that will hopefully be a little entertaining as well as informative. First, we set a working directory to save files, graphs, etc. into. We can also have R create directories for us Show warnings = FALSE ignores warnings telling us the directory already exists (if it does already exist)

dir.create("E:/Rcourse/L3", showWarnings = FALSE)

Then set as working directory

setwd("E:/Rcourse/L3")

The dataset we will be using is a from IMDB and Rotten Tomatoes. You can find and create your own dataset with these variables or use the one linked here. We can start off by saving the file to the directory we set in the setwd step.

download.file(“http://degovx.eurybia.feralhosting.com/movies.RData”, “orig_movies.RData”)

Note: Making directories close to a root or in a home directory (documents in Windows, home in Linux, and whatever the equivalent is on an Apple machine) will make things a little less messy. I tend to partition a drive for doing analyses that automatically backs up every couple of days.

Once we have saved our file we can load it with a simple load statement (for RData file). We will look at a few different types of loading formats over the rest of the course including csv and fixed widths. See lesson 1 for some examples of loading and how to get data from common statistical programs.

load("E:/Rcourse/L3/orig_movies.RData")

The first step is to see how the data looks and to see how R has imported the various variables. We can do that with head (first 6 rows), tail (last 6 rows), summary (variable names and quantiles, max, and NAs of numeric variables), and str (type of variables and examples of the first few rows from each variable)

head(movies)

##                    Title Year Appropriate Runtime                    Genre
## 1             Carmencita 1894   NOT RATED   1 min       Documentary, Short
## 2 Le clown et ses chiens 1892        <NA>    <NA>         Animation, Short
## 3         Pauvre Pierrot 1892        <NA>   4 min Animation, Comedy, Short
## 4            Un bon bock 1892        <NA>    <NA>         Animation, Short
## 5       Blacksmith Scene 1893     UNRATED   1 min                    Short
## 6      Chinese Opium Den 1894        <NA>   1 min                    Short
##           Released             Director Writer Metacritic imdbRating
## 1             <NA> William K.L. Dickson   <NA>         NA        5.9
## 2 October 28, 1892        Émile Reynaud   <NA>         NA        6.5
## 3 October 28, 1892        Émile Reynaud   <NA>         NA        6.7
## 4 October 28, 1892        Émile Reynaud   <NA>         NA        6.6
## 5     May 09, 1893 William K.L. Dickson   <NA>         NA        6.3
## 6 October 17, 1894 William K.L. Dickson   <NA>         NA        5.9
##   imdbVotes Language Country Awards rtomRating rtomMeter rtomVotes Fresh
## 1       982     <NA>     USA   <NA>         NA        NA        NA    NA
## 2       118     <NA>  France   <NA>         NA        NA        NA    NA
## 3       523     <NA>  France   <NA>         NA        NA        NA    NA
## 4        79     <NA>  France   <NA>         NA        NA        NA    NA
## 5      1134     <NA>     USA 1 win.         NA        NA        NA    NA
## 6        53  English     USA   <NA>         NA        NA        NA    NA
##   Rotten rtomUserMeter rtomUserRating rtomUserVotes BoxOffice
## 1     NA            NA             NA            NA      <NA>
## 2     NA            NA             NA            NA      <NA>
## 3     NA            NA             NA            NA      <NA>
## 4     NA           100            2.8           216      <NA>
## 5     NA            32            3.0           184      <NA>
## 6     NA            NA             NA            NA      <NA>

summary(movies)

##     Title               Year           Appropriate       
##  Length:548010      Length:548010      Length:548010     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    Runtime             Genre             Released        
##  Length:548010      Length:548010      Length:548010     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    Director            Writer            Metacritic       imdbRating    
##  Length:548010      Length:548010      Min.   :  1.0    Min.   : 1.00   
##  Class :character   Class :character   1st Qu.: 45.0    1st Qu.: 5.60   
##  Mode  :character   Mode  :character   Median : 58.0    Median : 6.60   
##                                        Mean   : 56.5    Mean   : 6.38   
##                                        3rd Qu.: 69.0    3rd Qu.: 7.30   
##                                        Max.   :100.0    Max.   :10.00   
##                                        NA's   :537880   NA's   :197639  
##    imdbVotes         Language           Country         
##  Min.   :      5   Length:548010      Length:548010     
##  1st Qu.:     11   Class :character   Class :character  
##  Median :     29   Mode  :character   Mode  :character  
##  Mean   :   1614                                        
##  3rd Qu.:    127                                        
##  Max.   :1465210                                        
##  NA's   :197640                                         
##     Awards            rtomRating       rtomMeter        rtomVotes     
##  Length:548010      Min.   : 0.0     Min.   :  0.0    Min.   :  1.0   
##  Class :character   1st Qu.: 5.0     1st Qu.: 40.0    1st Qu.:  6.0   
##  Mode  :character   Median : 6.2     Median : 67.0    Median : 14.0   
##                     Mean   : 6.0     Mean   : 61.6    Mean   : 34.8   
##                     3rd Qu.: 7.1     3rd Qu.: 86.0    3rd Qu.: 39.0   
##                     Max.   :10.0     Max.   :100.0    Max.   :328.0   
##                     NA's   :529206   NA's   :529017   NA's   :526140  
##      Fresh            Rotten       rtomUserMeter    rtomUserRating  
##  Min.   :  0.0    Min.   :  0.0    Min.   :  0.0    Min.   :0.0     
##  1st Qu.:  3.0    1st Qu.:  1.0    1st Qu.: 34.0    1st Qu.:2.2     
##  Median :  8.0    Median :  4.0    Median : 58.0    Median :3.1     
##  Mean   : 21.6    Mean   : 13.2    Mean   : 55.8    Mean   :2.6     
##  3rd Qu.: 24.0    3rd Qu.: 12.0    3rd Qu.: 79.0    3rd Qu.:3.6     
##  Max.   :311.0    Max.   :195.0    Max.   :100.0    Max.   :5.0     
##  NA's   :526140   NA's   :526140   NA's   :485854   NA's   :455274  
##  rtomUserVotes       BoxOffice        
##  Min.   :       0   Length:548010     
##  1st Qu.:       6   Class :character  
##  Median :      74   Mode  :character  
##  Mean   :   25271                     
##  3rd Qu.:     512                     
##  Max.   :35791395                     
##  NA's   :455131

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    548010 obs. of  23 variables:
##  $ Title         : chr  "Carmencita" "Le clown et ses chiens" "Pauvre Pierrot" "Un bon bock" ...
##  $ Year          : chr  "1894" "1892" "1892" "1892" ...
##  $ Appropriate   : chr  "NOT RATED" NA NA NA ...
##  $ Runtime       : chr  "1 min" NA "4 min" NA ...
##  $ Genre         : chr  "Documentary, Short" "Animation, Short" "Animation, Comedy, Short" "Animation, Short" ...
##  $ Released      : chr  NA "October 28, 1892" "October 28, 1892" "October 28, 1892" ...
##  $ Director      : chr  "William K.L. Dickson" "Émile Reynaud" "Émile Reynaud" "Émile Reynaud" ...
##  $ Writer        : chr  NA NA NA NA ...
##  $ Metacritic    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ imdbRating    : num  5.9 6.5 6.7 6.6 6.3 5.9 5.5 5.9 4.9 6.9 ...
##  $ imdbVotes     : int  982 118 523 79 1134 53 365 967 60 3376 ...
##  $ Language      : chr  NA NA NA NA ...
##  $ Country       : chr  "USA" "France" "France" "France" ...
##  $ Awards        : chr  NA NA NA NA ...
##  $ rtomRating    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ rtomMeter     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rtomVotes     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Fresh         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Rotten        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rtomUserMeter : int  NA NA NA 100 32 NA NA NA NA NA ...
##  $ rtomUserRating: num  NA NA NA 2.8 3 NA NA NA NA 3.8 ...
##  $ rtomUserVotes : int  NA NA NA 216 184 NA NA NA NA 124 ...
##  $ BoxOffice     : chr  NA NA NA NA ...

If you would like to interactively look at your dataset you can use view view(movies)

The first thing you should ask yourself is “How does the data look?” Then you should ask yourself “Are there any obvious issues?” Finally, you will probably have to ask yourself. How do I solve these problems?

For this dataset I see quite a few issues. Runtime, Released, Year, BoxOffice, Awards, and Appropriate are character vectors. Year and Released should be date variables. Genre, and Language have multiple entries in some rows. Genre, Language, Appropriate, Awards, and Country should all be nominal factors. Like most data analysis sessions getting the data into a format that is easy to use will be the first step. Followed by creating any variables that you may need. Using this data we will cover the basic tools of restructuring a dataset. ##Recoding Variables

Whether you are looking for a median split (shudder), you want to reverse code a scale, or your advisor has insisted you catagorize a continuous variable (because you would never do that yourself right?) you can do that relatively easily with base-R functionality. Recoding into a new variable is also great practice for the referencing we learned in previous lessons.

Let’s start out by breaking IMDB ratings down into bad (1-4), average (5-7), and good (8-10).

We can do this moderately quickly but cumbersomely with base-R First we duplicate the data from imdbRating into a new variable.

movies$imdbrate_cat <- movies$imdbRating

Then we take that new variable and we use the referencing to subset the variable into a smaller piece. Basically, we are saying from the movies dataset take the variable ($) imdbrate_cat. Then, within that variable find all instances where that variable meets a certain criterion. In this case that criterion is a logical operator. There are a number of simple logical operators in r.

Some Logical Operators:

< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x | y x OR y
x & y x AND y
isTRUE(x) test if X is TRUE

Behind the scenes R makes a vector of TRUE and FALSE statements for every row, then selects all the rows which equal TRUE. With that subset in mind we tell r to assign either “Good”, “Average”, or “Bad

movies$imdbrate_cat[movies$imdbrate_cat >= 8] <- "Good"
movies$imdbrate_cat[movies$imdbrate_cat > 4 & 
                      movies$imdbrate_cat < 8] <- "Average"
movies$imdbrate_cat[movies$imdbrate_cat <= 4] <- "Bad"

We can take a look at our catagorization with the table() function. Table creates a contingency table of counts of combinations of factor levels. See help(table) for more information.

table(movies$imdbrate_cat)

## 
## Average     Bad    Good 
##  289971   21960   38440

We should also summarize our new variable to see how it was created.

summary(movies$imdbrate_cat)

##    Length     Class      Mode 
##    548010 character character

It looks like this variable is still a character variable and needs to be transformed into a factor. See Lesson 1 if you don’t remember how.

movies$imdbrate_cat <- factor(movies$imdbrate_cat, ordered=TRUE, 
                              levels=c("Bad", "Average", "Good"))
summary(movies$imdbrate_cat)

##     Bad Average    Good    NA's 
##   21960  289971   38440  197639

For many simple subsets or recodes this functionality is good. However, when you want to recode a lot of catagories at once using a package can be helpful. The best package for recoding (among a lot of other useful features that we will be using in this class) is the Companion to Applied Regression package.

require("car") #If you don't have it already install.packages("car")

## Loading required package: car

We will be using the recode() function. For more help type ?car or ?recode

movies$imdbrate_cat <- NA

NA is missing data. R treats both numeric and character missing the same. you can find missing with is.na() and exclude missing with na.rm = TRUE or use listwise deletion with na.omit = TRUE

recode(variable, “character string of changes”, options) Here we are telling cat to recode imdb rating. We are giving it a character string that has a couple of commands that recode recognizes (hi and lo). For more information on the commands available for recode try help(recode). We are also telling it that the outcome of the recode should be a factor and that factor is ordinal by passing the levels argument. Levels works exactly like in the factor command in base-R.

movies$imdbrate_cat <- recode(movies$imdbRating, "8:hi='Good';lo:4='Bad';4:8='Average'", 
                              as.factor.result = TRUE, levels=c("Bad", "Average", "Good"))
summary(movies$imdbrate_cat)

##     Bad Average    Good    NA's 
##   21960  289971   38440  197639

Later when we learn the apply statement we will learn how to recode any number of variables quickly using recode or base-R.

For now, let’s unload the car package since we won’t be using it anymore in this lesson.

detach("package:car", unload = TRUE)

Renaming Variables

You can open an interactive window with fix() to rename or modify variables like you would in SPSS or Excel. However, I find the interactive window to be rather slow and clumsy to use for more than 1 or 2 variables at a time. There are ways to rename variables through base-R functionality but I find packages simplify things immensly. For this we will be using the package dplyr. We will be using this package extensively in the next data management section as well as through the rest of the course.

require(dplyr) # install.packages("dplyr")

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

As you can see a few objects are “masked” from “package:Stats”. This means that those commands (filer, lag, intersect, etc.) will now be interpreted by the dplyr package rather than by the base-R package “Stats.” Normally, this won’t be much of an issue for most users since a well built package won’t break any functionality. However, this may not always be the case. If you find something just stops working try unloading any packages that aren’t essential to what you are doing. It can be a good tactic to load a package, do what you need with it, and unload it. Particularly, when you see one package masks commands from another package you will usually find functionality of one or both of the packages will be compromised. Pay attention to this warnings and if you want to use one of the commands that is being masked look into how that new command works to make sure you aren’t accidently making errors that can hinder interpretation of your results later.

rename(dataset, new_variable_name = old_variable_name). Additional variables can be added seperated by a comma.

movies <- rename(movies, imdbRatingCatagory = imdbrate_cat)

Date Variables

We have a couple of date variables in the dataset that will need to have some work done on them but the released variable needs the most so let’s start there. First, let’s take a look at released.

summary(movies$Released)

##    Length     Class      Mode 
##    548010 character character

str(movies$Released)

##  chr [1:548010] NA "October 28, 1892" "October 28, 1892" ...

head(movies$Released)

## [1] NA                 "October 28, 1892" "October 28, 1892"
## [4] "October 28, 1892" "May 09, 1893"     "October 17, 1894"

We have a character variable with the form MMMM dd, yyyy and some missing data. To see how R works with character data try visualing the variable with plot(movies$Released)

Looks like a character variable is unplottable. You could try a few other commands but you would find that character variables are pretty useless for analysis purposes. Later, we will find some uses for character variables, but most of the time you will want to get variables in the form of a number or factor.

R lets us tell it how variables should be classified with as. For example as.date(), as.numeric(), as.vector(), as.data.frame(), as.logical(). We can also test typing with is. for example, is.date() or is.numeric().

Before we convert this character string into a date we should make sure our localization is set correctly. In this case the data are English United States format. If your computer isn’t set to that localization you will have a bad time dealing with these dates.

Luckily, we can set localizations eazily.

Sys.setlocale("LC_TIME", "English")

## [1] "English_United States.1252"

Let’s try setting this variable as a date

head(as.Date(movies$Released))

The R as.Date command is well scripted and can recognize most common date formats. Sometimes, like above, you might run into something a little strange or unusual and you might need to set the format yourself. You can do that pretty easily. R uses POSIX for conversion. To find the commands specific to your format try help(strptime).

Looking at the help file we can see that a full month written out is a %B while the day is %d and four digit year is %Y

Note: as.Date uses a standard representation of time which sets a specific second as reference and all dates and times are stored as the time since or before that point.

We can specify our format as well as overwriting the current variable by giving as.Date a character string with the exact format (including commas, slashes, and dashes). We can also specify the time zone if there are hours, minutes, seconds, etc. in the string.

movies$Released <- as.Date(movies$Released, "%B %d, %Y", tz = "UTC")

summary(movies$Released)

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1888-10-14" "1982-12-03" "2004-09-23" "1993-04-16" "2010-06-13" 
##         Max.         NA's 
## "2022-01-01"      "66665"

str(movies$Released)

##  Date[1:548010], format: NA "1892-10-28" "1892-10-28" "1892-10-28" "1893-05-09" ...

head(movies$Released)

## [1] NA           "1892-10-28" "1892-10-28" "1892-10-28" "1893-05-09"
## [6] "1894-10-17"

Now our file has an appropriate date variable. We can see from summary that we are dealing with movies as early as 1888 and as late as 2022 (projected movies).

We can create a quick graph to look at number of movies released over time

hist(movies$Released, breaks = 10, main="Histogram of Movie Release Dates", xlab="Release Year")

Using some of the fancy things we learned from last lesson we can put together a better graph. Let’s break it up so this histogram has a break for every year, change from density to frequency, and cut off the projected movies from the graph. We will have to specify that we want the x axis to be in years by using the as.Date command and format. Format will let us specify how we want the date represented. In this case, we want just years.

hist(movies$Released, breaks = (2015 - 1888), freq = TRUE, 
     main = "Histogram of Movie Release Dates", xlab = "Release Year",
     xlim = c(as.Date("1888", format="%Y"), as.Date("2015", format="%Y")))

Regular Expressions in R

Now let’s convert Runtime to numeric. As we can see it’s a character vector still.

summary(movies$Runtime)

##    Length     Class      Mode 
##    548010 character character

str(movies$Runtime)

##  chr [1:548010] "1 min" NA "4 min" NA "1 min" "1 min" ...

head(movies$Runtime)

## [1] "1 min" NA      "4 min" NA      "1 min" "1 min"

We can try using the as.numberic statement like we did with as.Date

head(as.numeric(movies$Runtime))

## Warning in head(as.numeric(movies$Runtime)): NAs introduced by coercion

## [1] NA NA NA NA NA NA

Uhoh looks like R can’t handle that min in the string.

We can get around that by using pattern matching on the data. In this case we can use gsub (grep substitution of all matches using regular expressions) Use ?gsub for more information. Regular expressions are very powerful and spending some time learning how to use them will be a great benefit not only to R programming but any use of computers. You can learn a lot about regex from Wikipedia (https://en.wikipedia.org/wiki/Regular_expression) or from http://www.regular-expressions.info/tutorial.html

In this case we need only the most basic and simple search to accomplish our goals.

movies$Runtime <- as.numeric(gsub(" min", "", movies$Runtime))

## Warning: NAs introduced by coercion

summary(movies$Runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.15   19.00   60.00   61.10   91.00  964.00   76827

str(movies$Runtime)

##  num [1:548010] 1 NA 4 NA 1 1 1 1 45 1 ...

head(movies$Runtime)

## [1]  1 NA  4 NA  1  1

Alright, we now have a usable variable. We can see we have movies from 15 seconds long to a little over 16 hours long. “Wait, what?” you are saying. 16 hour long movies? We can learn a bit about that exceptionally long movie fairly easily. We can find the maximum number with max() and remove NA’s from the analysis with na.rm = TRUE

max(movies$Runtime, na.rm = TRUE)

## [1] 964

Then all we have to do is look up that value. The easiest way to do that is to use which(). Which takes a logical vector (a vector of true and false statements) and returns the row number(s) that match the logic. Which is a useful function since it automatically allows NAs and sets them to FALSE. This is important because it gives an accurate return index. It’s a useful shortcut to a long statement.

movies[which(movies$Runtime == 964), ]

## Source: local data frame [1 x 24]
## 
##         Title  Year Appropriate Runtime                  Genre   Released
##         (chr) (chr)       (chr)   (dbl)                  (chr)     (date)
## 1 24 in Seven  2009          NA     964 Documentary, Adventure 2009-08-01
## Variables not shown: Director (chr), Writer (chr), Metacritic (int),
##   imdbRating (dbl), imdbVotes (int), Language (chr), Country (chr), Awards
##   (chr), rtomRating (dbl), rtomMeter (int), rtomVotes (int), Fresh (int),
##   Rotten (int), rtomUserMeter (int), rtomUserRating (dbl), rtomUserVotes
##   (int), BoxOffice (chr), imdbRatingCatagory (fctr)

“24 in Seven” is the long movie that doesn’t have any ratings. We can use which to find other things too. Like movies that are longer than 10 hours. Here we can use the index to select only titles and ratings.

movies[which(movies$Runtime > 600), c("Title", "imdbRating")]

## Source: local data frame [167 x 2]
## 
##                              Title imdbRating
##                              (chr)      (dbl)
## 1       Semnadtsat mgnoveniy vesny        9.1
## 2                      War & Peace        8.2
## 3                           Chlopi        7.2
## 4                      Noce i dnie        7.4
## 5                      I, Claudius        9.1
## 6   How Yukong Moved the Mountains        7.2
## 7                My Uncle Napoleon        8.6
## 8  Washington: Behind Closed Doors        8.0
## 9                 Die Buddenbrooks        8.1
## 10                       Flambards        8.3
## ..                             ...        ...

Let’s use our graphing skills again

hist(movies$Runtime, breaks = 100, main = "Histogram of Movie Runtimes", 
     xlab = "Runtime in Minutes")

Those pesky outliers are making our graph pretty hard to read. Let’s drop down the graph to be movies that are just 5 hours or less. Before we just modified the graph to exclude those values. Let’s use the subsetting we learned to reduce the actual data being processed by the graph. This will generally be the prefered way since it speeds up the rest of the graphing process and takes less computational cycles.

hist(movies$Runtime[movies$Runtime < 300], breaks = 100, 
     main = "Histogram of Movie Runtimes Excluding Outliers", 
     col = gray(0:100/100), xlab = "Runtime (Minutes)")

View the data again now that we have transformed it.

summary(movies)

##     Title               Year           Appropriate           Runtime      
##  Length:548010      Length:548010      Length:548010      Min.   :  0.15  
##  Class :character   Class :character   Class :character   1st Qu.: 19.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 60.00  
##                                                           Mean   : 61.10  
##                                                           3rd Qu.: 91.00  
##                                                           Max.   :964.00  
##                                                           NA's   :76827   
##     Genre              Released            Director        
##  Length:548010      Min.   :1888-10-14   Length:548010     
##  Class :character   1st Qu.:1982-12-03   Class :character  
##  Mode  :character   Median :2004-09-23   Mode  :character  
##                     Mean   :1993-04-16                     
##                     3rd Qu.:2010-06-13                     
##                     Max.   :2022-01-01                     
##                     NA's   :66665                          
##     Writer            Metacritic       imdbRating       imdbVotes      
##  Length:548010      Min.   :  1.0    Min.   : 1.00    Min.   :      5  
##  Class :character   1st Qu.: 45.0    1st Qu.: 5.60    1st Qu.:     11  
##  Mode  :character   Median : 58.0    Median : 6.60    Median :     29  
##                     Mean   : 56.5    Mean   : 6.38    Mean   :   1614  
##                     3rd Qu.: 69.0    3rd Qu.: 7.30    3rd Qu.:    127  
##                     Max.   :100.0    Max.   :10.00    Max.   :1465210  
##                     NA's   :537880   NA's   :197639   NA's   :197640   
##    Language           Country             Awards            rtomRating    
##  Length:548010      Length:548010      Length:548010      Min.   : 0.0    
##  Class :character   Class :character   Class :character   1st Qu.: 5.0    
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.2    
##                                                           Mean   : 6.0    
##                                                           3rd Qu.: 7.1    
##                                                           Max.   :10.0    
##                                                           NA's   :529206  
##    rtomMeter        rtomVotes          Fresh            Rotten      
##  Min.   :  0.0    Min.   :  1.0    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.: 40.0    1st Qu.:  6.0    1st Qu.:  3.0    1st Qu.:  1.0   
##  Median : 67.0    Median : 14.0    Median :  8.0    Median :  4.0   
##  Mean   : 61.6    Mean   : 34.8    Mean   : 21.6    Mean   : 13.2   
##  3rd Qu.: 86.0    3rd Qu.: 39.0    3rd Qu.: 24.0    3rd Qu.: 12.0   
##  Max.   :100.0    Max.   :328.0    Max.   :311.0    Max.   :195.0   
##  NA's   :529017   NA's   :526140   NA's   :526140   NA's   :526140  
##  rtomUserMeter    rtomUserRating   rtomUserVotes       BoxOffice        
##  Min.   :  0.0    Min.   :0.0      Min.   :       0   Length:548010     
##  1st Qu.: 34.0    1st Qu.:2.2      1st Qu.:       6   Class :character  
##  Median : 58.0    Median :3.1      Median :      74   Mode  :character  
##  Mean   : 55.8    Mean   :2.6      Mean   :   25271                     
##  3rd Qu.: 79.0    3rd Qu.:3.6      3rd Qu.:     512                     
##  Max.   :100.0    Max.   :5.0      Max.   :35791395                     
##  NA's   :485854   NA's   :455274   NA's   :455131                       
##  imdbRatingCatagory
##  Bad    : 21960    
##  Average:289971    
##  Good   : 38440    
##  NA's   :197639    
##                    
##                    
##

It looks like we still need to work on the variables Year, Appropriate, Genre, Director, Writer, Language, Country, Awards, and BoxOffice.

We will want to use Genre so let’s start there. First, we can use table to see a better summary of the data.

head(table(movies$Genre))

## 
##                   Action            Action, Adult Action, Adult, Adventure 
##                     4563                       14                        6 
## Action, Adult, Animation    Action, Adult, Comedy     Action, Adult, Crime 
##                        1                        8                        5

Here, with just head we can see that many catagories are present seperated by a comma. We could use regex to seperate these into new catagories but luckily for us there are already packages that can accomplish splitting, stacking, and shaping. Extra luckily for us these strings of combined information are constructed so the first entry is most important, second most, etc. That means we can take the first few catagories for each of these splits.

To accomplish the splitting we will be using tidyr. This is part of the family of packages like dplyr and plyr that we will be using in this course to manage data. Here we will be using a basic functionality to seperate a column of data into a number of new columns.

require(tidyr) # install.packages("tidyr")

## Loading required package: tidyr

the command we are interested in is separate(). separate(data, col, into, sep = “[^[:alnum:]]+”, remove = TRUE, convert = FALSE, extra = “error”, …)

For our purposes we are going to use the data movies, the column Genre, we want to make three new columns Genre_1, Genre_2, Genre_3, keep the original column in the data, split based on a comma, and drop any extra columns of data after three.

movies_genresplit <- separate(movies, Genre, c("Genre_1", "Genre_2", "Genre_3"), 
         remove = FALSE, sep = ",", extra = "drop")

tidyr is generally quick and easy but there are alternatives. A good alternative is the splitstackshape package which will automaticaly create a number of columns equal to the maximum number of splits. Splitstackshape uses the package data.table which we will not be covering in this course so once it has been used the resulting data needs to be converted back into a data frame before we can use it.

First, we will have to unload some packages since they conflict with splitstackshape

detach("package:dplyr", unload = TRUE)
detach("package:tidyr", unload = TRUE)

splitstackshape example

require("splitstackshape")

## Loading required package: splitstackshape
## Loading required package: data.table

movies <- cSplit(as.data.frame(movies), "Genre", sep=",")

setDF(movies)

summary(movies)

##     Title               Year           Appropriate           Runtime      
##  Length:548010      Length:548010      Length:548010      Min.   :  0.15  
##  Class :character   Class :character   Class :character   1st Qu.: 19.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 60.00  
##                                                           Mean   : 61.10  
##                                                           3rd Qu.: 91.00  
##                                                           Max.   :964.00  
##                                                           NA's   :76827   
##     Released            Director            Writer         
##  Min.   :1888-10-14   Length:548010      Length:548010     
##  1st Qu.:1982-12-03   Class :character   Class :character  
##  Median :2004-09-23   Mode  :character   Mode  :character  
##  Mean   :1993-04-16                                        
##  3rd Qu.:2010-06-13                                        
##  Max.   :2022-01-01                                        
##  NA's   :66665                                             
##    Metacritic       imdbRating       imdbVotes         Language        
##  Min.   :  1.0    Min.   : 1.00    Min.   :      5   Length:548010     
##  1st Qu.: 45.0    1st Qu.: 5.60    1st Qu.:     11   Class :character  
##  Median : 58.0    Median : 6.60    Median :     29   Mode  :character  
##  Mean   : 56.5    Mean   : 6.38    Mean   :   1614                     
##  3rd Qu.: 69.0    3rd Qu.: 7.30    3rd Qu.:    127                     
##  Max.   :100.0    Max.   :10.00    Max.   :1465210                     
##  NA's   :537880   NA's   :197639   NA's   :197640                      
##    Country             Awards            rtomRating       rtomMeter     
##  Length:548010      Length:548010      Min.   : 0.0     Min.   :  0.0   
##  Class :character   Class :character   1st Qu.: 5.0     1st Qu.: 40.0   
##  Mode  :character   Mode  :character   Median : 6.2     Median : 67.0   
##                                        Mean   : 6.0     Mean   : 61.6   
##                                        3rd Qu.: 7.1     3rd Qu.: 86.0   
##                                        Max.   :10.0     Max.   :100.0   
##                                        NA's   :529206   NA's   :529017  
##    rtomVotes          Fresh            Rotten       rtomUserMeter   
##  Min.   :  1.0    Min.   :  0.0    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.:  6.0    1st Qu.:  3.0    1st Qu.:  1.0    1st Qu.: 34.0   
##  Median : 14.0    Median :  8.0    Median :  4.0    Median : 58.0   
##  Mean   : 34.8    Mean   : 21.6    Mean   : 13.2    Mean   : 55.8   
##  3rd Qu.: 39.0    3rd Qu.: 24.0    3rd Qu.: 12.0    3rd Qu.: 79.0   
##  Max.   :328.0    Max.   :311.0    Max.   :195.0    Max.   :100.0   
##  NA's   :526140   NA's   :526140   NA's   :526140   NA's   :485854  
##  rtomUserRating   rtomUserVotes       BoxOffice         imdbRatingCatagory
##  Min.   :0.0      Min.   :       0   Length:548010      Bad    : 21960    
##  1st Qu.:2.2      1st Qu.:       6   Class :character   Average:289971    
##  Median :3.1      Median :      74   Mode  :character   Good   : 38440    
##  Mean   :2.6      Mean   :   25271                      NA's   :197639    
##  3rd Qu.:3.6      3rd Qu.:     512                                        
##  Max.   :5.0      Max.   :35791395                                        
##  NA's   :455274   NA's   :455131                                          
##         Genre_1          Genre_2           Genre_3           Genre_4      
##  Short      :110902   Drama  : 62760   Drama   : 15207   Thriller:   269  
##  Documentary: 86989   Short  : 46557   Romance : 11662   Romance :   246  
##  Drama      : 81578   Comedy : 36044   Thriller:  9668   Family  :   200  
##  Comedy     : 74449   Romance: 14718   Comedy  :  7354   Drama   :   165  
##  Action     : 25638   Crime  :  9770   Family  :  6791   Fantasy :   151  
##  (Other)    :131529   (Other):102559   (Other) : 50743   (Other) :  1045  
##  NA's       : 36925   NA's   :275602   NA's    :446585   NA's    :545934  
##      Genre_5           Genre_6           Genre_7           Genre_8      
##  Thriller:    80   Thriller:    16   Sci-Fi  :     5   Music   :     1  
##  Romance :    52   Romance :    14   Thriller:     4   Musical :     1  
##  Sci-Fi  :    45   Sci-Fi  :     6   Music   :     2   Romance :     1  
##  War     :    40   Sport   :     6   War     :     2   Sci-Fi  :     1  
##  Fantasy :    33   Mystery :     5   Adult   :     1   Thriller:     4  
##  (Other) :   210   (Other) :    25   (Other) :     4   War     :     1  
##  NA's    :547550   NA's    :547938   NA's    :547992   NA's    :548001

Using either method we have sucessfully broken the Genre variable into multiple groups. Depending on which method we use we will have to do some cleanup. Using tidyr we would need to set the variables as factors since they are still strings. Using splitstackshape we can delete the extra columns that aren’t providing much information. For now, let’s use the output from splitstackshape and cover how to delete columns in R.

Subsetting a Dataset

It looks like we got 8 genre categories although the 4th genre only contains a handful of entries and the 8th only has 9. Let’s go ahead and delete the Genre_4 through 8. We can keep or drop variables a number of different ways.

Like we did above in the recoding section we can use dataframe element selection.

Since the extra columns are at the end of the dataset we can keep the beginning.

head(movies[, 1:26]) #We could just set this as movies with <-

##                    Title Year Appropriate Runtime   Released
## 1             Carmencita 1894   NOT RATED       1       <NA>
## 2 Le clown et ses chiens 1892        <NA>      NA 1892-10-28
## 3         Pauvre Pierrot 1892        <NA>       4 1892-10-28
## 4            Un bon bock 1892        <NA>      NA 1892-10-28
## 5       Blacksmith Scene 1893     UNRATED       1 1893-05-09
## 6      Chinese Opium Den 1894        <NA>       1 1894-10-17
##               Director Writer Metacritic imdbRating imdbVotes Language
## 1 William K.L. Dickson   <NA>         NA        5.9       982     <NA>
## 2        Émile Reynaud   <NA>         NA        6.5       118     <NA>
## 3        Émile Reynaud   <NA>         NA        6.7       523     <NA>
## 4        Émile Reynaud   <NA>         NA        6.6        79     <NA>
## 5 William K.L. Dickson   <NA>         NA        6.3      1134     <NA>
## 6 William K.L. Dickson   <NA>         NA        5.9        53  English
##   Country Awards rtomRating rtomMeter rtomVotes Fresh Rotten rtomUserMeter
## 1     USA   <NA>         NA        NA        NA    NA     NA            NA
## 2  France   <NA>         NA        NA        NA    NA     NA            NA
## 3  France   <NA>         NA        NA        NA    NA     NA            NA
## 4  France   <NA>         NA        NA        NA    NA     NA           100
## 5     USA 1 win.         NA        NA        NA    NA     NA            32
## 6     USA   <NA>         NA        NA        NA    NA     NA            NA
##   rtomUserRating rtomUserVotes BoxOffice imdbRatingCatagory     Genre_1
## 1             NA            NA      <NA>            Average Documentary
## 2             NA            NA      <NA>            Average   Animation
## 3             NA            NA      <NA>            Average   Animation
## 4            2.8           216      <NA>            Average   Animation
## 5            3.0           184      <NA>            Average       Short
## 6             NA            NA      <NA>            Average       Short
##   Genre_2 Genre_3
## 1   Short    <NA>
## 2   Short    <NA>
## 3  Comedy   Short
## 4   Short    <NA>
## 5    <NA>    <NA>
## 6    <NA>    <NA>

We could also refer to the entire string by name but that would be tedious. Instead let’s use the subset function. We can combine that with the through statement we learned earlier (:) to grab a range. NOTE: This doesn’t count up the numbers after the _ but selects the Genre_4 and Genre_8 and everything that is between the two of them in the dataset. Even things that don’t have Genre in the title. We will learn more later about how to select things that are counting up.

Subset works by giving a data location, then a logical statement for selecting rows, then select which tells which columns to use. If we give it no logical statement it assumes all rows and if give no column statements it assums all columns.

movies <- subset(movies, select = -(Genre_4:Genre_8))

Typical Genres

table(movies$Genre_1)

## 
##      Action       Adult   Adventure   Animation   Biography      Comedy 
##       25638       16260       10160       25101        4134       74449 
##       Crime Documentary       Drama      Family     Fantasy   Film-Noir 
##       13490       86989       81578        7988        1997          36 
##   Game-Show     History      Horror       Music     Musical     Mystery 
##        2101        1120        9673        7543        3126        2476 
##        News  Reality-TV     Romance      Sci-Fi       Short       Sport 
##        1576        5004        3347        2118      110902        2553 
##   Talk-Show    Thriller         War     Western 
##        2726        4885         763        3352

Subset is a very useful command that we will be using a bit in this class. We can easily use it to replicate a few of the logical statements we made earlier with a bit more ease.

Like selecting very long movies Before we used: movies[which(movies$Runtime > 600), c(“Title”, “imdbRating”)]

head(subset(movies, Runtime > 600, select = c("Title", "imdbRating")))

##                                Title imdbRating
## 51836     Semnadtsat mgnoveniy vesny        9.1
## 51860                    War & Peace        8.2
## 52050                         Chlopi        7.2
## 55075                    Noce i dnie        7.4
## 55513                    I, Claudius        9.1
## 55784 How Yukong Moved the Mountains        7.2

We can embed that subset into a statement to select rows for things like graphing. In this case we subset the data then use the $ to create a vector of just imdbRatings.

hist(subset(movies, Runtime > 600)$imdbRating, 
     main = "Histogram of IMDB Ratings for movies Over 600 Minutes",
     xlab = "IMDB ratings", xlim=c(1, 10), breaks = 10)

Now that we are experts on splitting and subsetting data let’s go ahead and tackle a couple of other variables. Why don’t you try and work through Language and Country yourselves before following along.

head(table(movies$Language))

## 
##                               Abkhazian 
##                                       2 
##                              Aboriginal 
##                                      16 
##                     Aboriginal, English 
##                                       5 
## Aboriginal, English, French, Portuguese 
##                                       1 
##           Aboriginal, Japanese, Hokkien 
##                                       1 
##         Aboriginal, Polynesian, English 
##                                       1

head(table(movies$Country))

## 
##                      Afghanistan              Afghanistan, France 
##                               21                                2 
## Afghanistan, France, Germany, UK                Afghanistan, Iran 
##                                1                                1 
##             Afghanistan, Ireland      Afghanistan, Ireland, Japan 
##                                1                                1

Like before these are long strings seperated by commas. Since we still have splitstackshape loaded I will use that again. One of the positives is that you can pass a character vector to cSplit to split multiple variables simultaneously.

movies <- cSplit(movies, c("Language","Country"), sep=",")
setDF(movies)

summary(movies)

##     Title               Year           Appropriate           Runtime      
##  Length:548010      Length:548010      Length:548010      Min.   :  0.15  
##  Class :character   Class :character   Class :character   1st Qu.: 19.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 60.00  
##                                                           Mean   : 61.10  
##                                                           3rd Qu.: 91.00  
##                                                           Max.   :964.00  
##                                                           NA's   :76827   
##     Released            Director            Writer         
##  Min.   :1888-10-14   Length:548010      Length:548010     
##  1st Qu.:1982-12-03   Class :character   Class :character  
##  Median :2004-09-23   Mode  :character   Mode  :character  
##  Mean   :1993-04-16                                        
##  3rd Qu.:2010-06-13                                        
##  Max.   :2022-01-01                                        
##  NA's   :66665                                             
##    Metacritic       imdbRating       imdbVotes          Awards         
##  Min.   :  1.0    Min.   : 1.00    Min.   :      5   Length:548010     
##  1st Qu.: 45.0    1st Qu.: 5.60    1st Qu.:     11   Class :character  
##  Median : 58.0    Median : 6.60    Median :     29   Mode  :character  
##  Mean   : 56.5    Mean   : 6.38    Mean   :   1614                     
##  3rd Qu.: 69.0    3rd Qu.: 7.30    3rd Qu.:    127                     
##  Max.   :100.0    Max.   :10.00    Max.   :1465210                     
##  NA's   :537880   NA's   :197639   NA's   :197640                      
##    rtomRating       rtomMeter        rtomVotes          Fresh       
##  Min.   : 0.0     Min.   :  0.0    Min.   :  1.0    Min.   :  0.0   
##  1st Qu.: 5.0     1st Qu.: 40.0    1st Qu.:  6.0    1st Qu.:  3.0   
##  Median : 6.2     Median : 67.0    Median : 14.0    Median :  8.0   
##  Mean   : 6.0     Mean   : 61.6    Mean   : 34.8    Mean   : 21.6   
##  3rd Qu.: 7.1     3rd Qu.: 86.0    3rd Qu.: 39.0    3rd Qu.: 24.0   
##  Max.   :10.0     Max.   :100.0    Max.   :328.0    Max.   :311.0   
##  NA's   :529206   NA's   :529017   NA's   :526140   NA's   :526140  
##      Rotten       rtomUserMeter    rtomUserRating   rtomUserVotes     
##  Min.   :  0.0    Min.   :  0.0    Min.   :0.0      Min.   :       0  
##  1st Qu.:  1.0    1st Qu.: 34.0    1st Qu.:2.2      1st Qu.:       6  
##  Median :  4.0    Median : 58.0    Median :3.1      Median :      74  
##  Mean   : 13.2    Mean   : 55.8    Mean   :2.6      Mean   :   25271  
##  3rd Qu.: 12.0    3rd Qu.: 79.0    3rd Qu.:3.6      3rd Qu.:     512  
##  Max.   :195.0    Max.   :100.0    Max.   :5.0      Max.   :35791395  
##  NA's   :526140   NA's   :485854   NA's   :455274   NA's   :455131    
##   BoxOffice         imdbRatingCatagory        Genre_1      
##  Length:548010      Bad    : 21960     Short      :110902  
##  Class :character   Average:289971     Documentary: 86989  
##  Mode  :character   Good   : 38440     Drama      : 81578  
##                     NA's   :197639     Comedy     : 74449  
##                                        Action     : 25638  
##                                        (Other)    :131529  
##                                        NA's       : 36925  
##     Genre_2           Genre_3         Language_01      Language_02    
##  Drama  : 62760   Drama   : 15207   English :274133   English:  7695  
##  Short  : 46557   Romance : 11662   German  : 28684   French :  3217  
##  Comedy : 36044   Thriller:  9668   Spanish : 27610   Spanish:  2834  
##  Romance: 14718   Comedy  :  7354   French  : 26076   German :  2366  
##  Crime  :  9770   Family  :  6791   Japanese: 18533   Tagalog:  1759  
##  (Other):102559   (Other) : 50743   (Other) :102686   (Other): 14433  
##  NA's   :275602   NA's    :446585   NA's    : 70288   NA's   :515706  
##   Language_03      Language_04      Language_05      Language_06    
##  English:  1881   English:   381   English:   112   German :    33  
##  French :   969   French :   307   French :    99   English:    32  
##  German :   828   German :   270   Spanish:    96   Spanish:    26  
##  Spanish:   590   Spanish:   203   German :    85   Italian:    24  
##  Italian:   443   Italian:   191   Italian:    75   Russian:    24  
##  (Other):  4148   (Other):  1525   (Other):   620   (Other):   218  
##  NA's   :539151   NA's   :545133   NA's   :546923   NA's   :547653  
##      Language_07         Language_08             Language_09    
##  Greek     :    17   Hebrew    :    10   Icelandic     :     9  
##  Russian   :    11   Portuguese:     7   Spanish       :     5  
##  Portuguese:    10   Italian   :     6   Italian       :     4  
##  Polish    :     9   Norwegian :     6   Russian       :     4  
##  English   :     7   Spanish   :     6   Serbo-Croatian:     4  
##  (Other)   :   110   (Other)   :    66   (Other)       :    43  
##  NA's      :547846   NA's      :547909   NA's          :547941  
##      Language_10             Language_11         Language_12    
##  Italian   :     9   Norwegian     :     8   Portuguese:     5  
##  Portuguese:     6   Spanish       :     5   Spanish   :     4  
##  Spanish   :     6   Swedish       :     5   French    :     3  
##  Polish    :     4   Portuguese    :     3   Polish    :     3  
##  Danish    :     3   Serbo-Croatian:     3   Italian   :     2  
##  (Other)   :    25   (Other)       :    21   (Other)   :    17  
##  NA's      :547957   NA's          :547965   NA's      :547976  
##          Language_13             Language_14             Language_15    
##  Portuguese    :     6   Serbo-Croatian:     4   Spanish       :     5  
##  Serbo-Croatian:     4   Spanish       :     4   Serbo-Croatian:     3  
##  Swedish       :     3   Turkish       :     4   Swedish       :     3  
##  Raeto-Romance :     2   Slovenian     :     3   German        :     1  
##  Slovak        :     2   Romanian      :     2   Hebrew        :     1  
##  (Other)       :     9   (Other)       :     7   (Other)       :     4  
##  NA's          :547984   NA's          :547986   NA's          :547993  
##     Language_16      Language_17        Language_18      Language_19    
##  Swedish  :     5   Turkish:     5   Bengali  :     1   Maltese:     1  
##  Turkish  :     3   Spanish:     2   Quechua  :     1   Russian:     1  
##  Slovenian:     2   Arabic :     1   Spanish  :     2   Swedish:     1  
##  Spanish  :     2   Finnish:     1   Swedish  :     2   Turkish:     2  
##  Arabic   :     1   Klingon:     1   Turkish  :     1   Uzbek  :     1  
##  (Other)  :     3   (Other):     3   Ukrainian:     1   NA's   :548004  
##  NA's     :547994   NA's   :547997   NA's     :548002                   
##   Language_20     Language_21    Language_22    Language_23   
##  Dutch  :     1   Mode:logical   Mode:logical   Mode:logical  
##  Turkish:     1   NA's:548010    NA's:548010    NA's:548010   
##  NA's   :548008                                               
##                                                               
##                                                               
##                                                               
##                                                               
##  Language_24    Language_25    Language_26    Language_27   
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:548010    NA's:548010    NA's:548010    NA's:548010   
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##    Country_01       Country_02       Country_03       Country_04    
##  USA    :202753   USA    :  6556   France :   949   France :   236  
##  UK     : 42264   France :  4579   USA    :   851   USA    :   213  
##  France : 25468   UK     :  2689   Germany:   732   Germany:   190  
##  Canada : 20137   Canada :  2526   UK     :   653   UK     :   179  
##  Japan  : 20082   Germany:  2389   Italy  :   449   Italy  :   144  
##  (Other):189691   (Other): 20353   (Other):  5030   (Other):  1694  
##  NA's   : 47615   NA's   :508918   NA's   :539346   NA's   :545354  
##    Country_05       Country_06           Country_07    
##  France :    83   Germany:    34   France     :    13  
##  USA    :    83   USA    :    34   Germany    :    13  
##  UK     :    77   France :    29   Netherlands:    12  
##  Germany:    64   Italy  :    26   UK         :    11  
##  Spain  :    36   UK     :    23   USA        :    11  
##  (Other):   664   (Other):   265   (Other)    :   171  
##  NA's   :547003   NA's   :547599   NA's       :547779  
##        Country_08       Country_09         Country_10    
##  Germany    :     9   France :     8   UK       :     4  
##  Austria    :     6   Canada :     5   Australia:     3  
##  France     :     6   Italy  :     5   Belgium  :     3  
##  Switzerland:     6   Belgium:     4   China    :     3  
##  UK         :     6   UK     :     4   Israel   :     3  
##  (Other)    :   123   (Other):    79   (Other)  :    55  
##  NA's       :547854   NA's   :547905   NA's     :547939  
##         Country_11       Country_12       Country_13       Country_14    
##  Hungary     :     4   Germany:     3   France :     5   Ecuador:     2  
##  Italy       :     4   China  :     2   India  :     2   Estonia:     2  
##  South Africa:     4   Egypt  :     2   Israel :     2   Germany:     2  
##  Finland     :     3   Ireland:     2   Mexico :     2   Russia :     2  
##  Canada      :     2   Italy  :     2   UK     :     2   Brazil :     1  
##  (Other)     :    40   (Other):    27   (Other):    16   (Other):    14  
##  NA's        :547953   NA's   :547972   NA's   :547981   NA's   :547987  
##    Country_15              Country_16                  Country_17    
##  Denmark:     3   Austria       :     2   Croatia           :     2  
##  Israel :     3   Czech Republic:     2   India             :     2  
##  Georgia:     2   Congo         :     1   China             :     1  
##  Japan  :     2   Côte d'Ivoire :     1   Czech Republic    :     1  
##  China  :     1   Ethiopia      :     1   Dominican Republic:     1  
##  (Other):     9   (Other)       :    11   (Other)           :     7  
##  NA's   :547990   NA's          :547992   NA's              :547996  
##     Country_18                      Country_19         Country_20    
##  Bulgaria:     2   Belgium               :     3   Austria  :     2  
##  Cambodia:     2   Bangladesh            :     1   Argentina:     1  
##  Brazil  :     1   Bosnia and Herzegovina:     1   Australia:     1  
##  Croatia :     1   Brazil                :     1   Belgium  :     1  
##  Germany :     1   France                :     1   Bolivia  :     1  
##  (Other) :     6   (Other)               :     6   (Other)  :     7  
##  NA's    :547997   NA's                  :547997   NA's     :547997  
##        Country_21         Country_22            Country_23    
##  Albania    :     1   Australia:     1   Argentina   :     1  
##  Bhutan     :     1   Canada   :     1   Belgium     :     1  
##  Chile      :     1   France   :     1   Egypt       :     1  
##  India      :     1   Russia   :     1   South Africa:     1  
##  Philippines:     1   Slovenia :     1   Spain       :     1  
##  (Other)    :     4   (Other)  :     2   (Other)     :     2  
##  NA's       :548001   NA's     :548003   NA's        :548003  
##           Country_24          Country_25        Country_26    
##  Australia     :     1   Antarctica:     1   Cambodia:     1  
##  Czech Republic:     1   China     :     1   NA's    :548009  
##  Sweden        :     1   Turkey    :     1                    
##  Syria         :     1   UK        :     1                    
##  Uruguay       :     1   NA's      :548006                    
##  NA's          :548005                                        
##                                                               
##      Country_27    
##  Australia:     1  
##  NA's     :548009  
##                    
##                    
##                    
##                    
##

Like before we will also need to deal with all these extra columns. Many of which don’t have much information contained within.

movies <- subset(movies, select = -(Language_04:Language_27))
movies <- subset(movies, select = -(Country_04:Country_27))

Now that we have divided up language and country why don’t we see how many Turkish movies Are in this list. We also want to find out something about how good the movies are so let’s select movies that don’t have missing data for IMDb Ratings. Here we can combine what we have learned before to create a powerful search of the data to create a subset.

TurkishMovies <- subset(movies, Country_01=="Turkey" & 
                          Language_01=="Turkish" &
                          !is.na(imdbRating), 
                        select=c("Title", "imdbRating"))
head(TurkishMovies, n = 10)

##                                                        Title imdbRating
## 13528                              Aysel: Batakli damin kizi        7.1
## 45280                                   O Beautiful Istanbul        8.2
## 45503                                             Dry Summer        8.0
## 45856                                       Atesli delikanli        5.2
## 49119                                                   Hope        8.2
## 49374 Little Ayse and the Magic Dwarfs in the Land of Dreams        5.0
## 50333                                             Umutsuzlar        7.3
## 50583                                                   Agit        7.1
## 51942                                                   Baba        7.3
## 52237                                                  Gelin        7.8

So what are the best and the worst Turkish movies? We can use order to arrange our dataset and head to display just the top and tail to display the bottom

TurkishMovies <- TurkishMovies[order(-TurkishMovies$imdbRating, TurkishMovies$Title), ]

head(TurkishMovies)

##                         Title imdbRating
## 153141        The Chaos Class        9.5
## 480119 CM101MMXI Fundamentals        9.4
## 480973 CM101MMXI Fundamentals        9.4
## 377955           C.M.Y.L.M.Z.        9.3
## 533766            Kardes Payi        9.3
## 213335       Bir Tat Bir Doku        9.2

tail(TurkishMovies)

##                                Title imdbRating
## 461030                   Hep ezildim        1.8
## 410372               Kahkaha Marketi        1.8
## 254248 Keloglan vs. the Black Prince        1.8
## 329286                Super Agent K9        1.4
## 298894              Yildizlar savasi        1.2
## 529115                       Tersine        1.0

It seems our dataset contains at least a few duplicated rows. For instance here we can see CM101MMXI Fundamentals seems to be so good it was included twice. We can verify that is in fact a duplicate by looking at it.

movies[c(480119, 480973), ]

##                         Title Year Appropriate Runtime   Released
## 480119 CM101MMXI Fundamentals 2013        <NA>     139 2013-01-03
## 480973 CM101MMXI Fundamentals 2013        <NA>     139 2013-01-03
##            Director     Writer Metacritic imdbRating imdbVotes Awards
## 480119 Murat Dundar Cem Yilmaz         NA        9.4     32992   <NA>
## 480973 Murat Dundar Cem Yilmaz         NA        9.4     32161   <NA>
##        rtomRating rtomMeter rtomVotes Fresh Rotten rtomUserMeter
## 480119         NA        NA        NA    NA     NA            NA
## 480973         NA        NA        NA    NA     NA            NA
##        rtomUserRating rtomUserVotes BoxOffice imdbRatingCatagory
## 480119             NA            NA      <NA>               Good
## 480973             NA            NA      <NA>               Good
##            Genre_1 Genre_2 Genre_3 Language_01 Language_02 Language_03
## 480119 Documentary  Comedy    <NA>     Turkish        <NA>        <NA>
## 480973 Documentary  Comedy    <NA>     Turkish        <NA>        <NA>
##        Country_01 Country_02 Country_03
## 480119     Turkey       <NA>       <NA>
## 480973     Turkey       <NA>       <NA>

Finding duplicate rows can be tedious in a small dataset. In a dataset with over 500k observations it’s impossible to do manually. R has a couple of ways to identify unique and duplicate cases.

The easiest is duplicated(dataframe) which looks for two rows that are exactly the same. This will identify many duplicates but not ones with even superficial differences. For example, CM101MMXI Fundamentals has different numbers of votes in the imdbVotes variable so it wouldn’t be identified as a duplicate. It will also assume that the first row in the dataset is the original unless fromLast = TRUE. Finally, it is not the fastest operation. We can have it run a bit quicker by getting rid of movies that have the same Title but it will remove movies that are actually different that happen to have the same title. How you manage duplicates is up to you, for the purposes of this dataset a conservative approach is best since it’s likely many movies share a title.

We can save this as two data frames. One with the duplicated movies and the second is our movies dataframe with the duplicates removed.

duplicated_movies <- movies[duplicated(movies), ]

movies <- movies[!duplicated(movies), ]

Exporting a File

Alright now let’s export this file so we can use it in later projects. It’s good data management to create new files after modifying your data so you can go back to previous versions or to store each iteration of your data in a new folder. Always preserve your original data without modifications. You have all the syntax here in R so you can easily re-run a syntax file partially to correct errors. It’s never that easy to recover missing, destroyed, or incorrectly modified data.

New Folder Method

At the beginning on the lesson we created a folder and set our working directory to that folder. Then we downloaded our file and called it movies_orig.RData. As long as we don’t overwrite that file we will have our original data. I prefer this method myself because I like folder based organization. Usually, I name my folders with the project name followed by the date I am working.

save(data, list of objects to save, file to save them in)

For example, we could save everything into an RData file with:

save(movies, file="movies.RData")

or as a csv with

write.table(movies, file="movies.csv", sep=",")

If you are going to continue using R I recommend keeping files in RData it’s faster and smaller.

file.info(c("movies.csv", "movies.RData"))

##                   size isdir mode               mtime               ctime
## movies.csv   107340331 FALSE  666 2015-09-10 12:10:46 2015-09-09 15:12:51
## movies.RData  25541925 FALSE  666 2015-09-10 12:10:27 2015-09-09 15:12:43
##                            atime exe
## movies.csv   2015-09-09 15:12:51  no
## movies.RData 2015-09-09 15:12:43  no

Automatic Naming of Data Files

We can use R to automatically generate these saves in an easy to preserve format. We combine the paste which we have used before with the same command and a new command Sys.Date() which looks to your computer to find what today’s date is. This can be modified to include the hour, minute, or second to make completely sure every time you save a file it generates a new one. I find that having the day in front of the file is frequently good enough.

save(movies, file = paste(Sys.Date(), "movies.RData", sep=" "))

Let’s test out our knowledge of cleaning a dataset with lab 3. You can download it here: https://docs.google.com/document/d/17MFukYodljz6nIdgJIvS-G_g6B11nHcvZmsf4yhiP4M/edit?usp=sharing

Answers: https://drive.google.com/file/d/0BzzRhb-koTrLNWJsT2N4ZTAtU0U/view?usp=sharing

If you are having issues with the answers try downloading them and opening them in your browser of choice.

Scott Withrow

Assistant Professor|Koc University

Category Archives: Uncategorized

Part III: Intermediate Data Management