
Part II: Introduction to Base-R Graphing

Here we begin the journey that is graphing with R. The ability to make beautiful and compelling graphs quickly was what drew me to R in the first place. Later, I began to use graphing packages like ggplot2 and ggvis and quickly found that making high-quality, publishable plots is easy. Perhaps one of the most exciting (and newer) features of R is the introduction of packages like shiny, which let you turn R code into interactive, JavaScript-powered web reports. Later we will explore some of these exciting uses of R. In particular, we will focus on how to make an interactive report where someone can drag a slider bar around to adjust aspects of your graph. A sure sign that you are bound for promotion!

Graphing in R is very powerful. Think of graphing in R as a construction project. We start by laying down a foundation (specifying the data), then we build the framework (specifying the axes, labeling, titling, etc.), then we fill in the rest of the structure with the walls and details (specifying the statistics that are displayed in the graph). Base-R has a large suite of tools for graphing and does a commendable job quickly plotting what researchers need to see. The tools exist to build any plot you desire, but many turn to packages for true graphing freedom. The most popular packages are lattice and ggplot2, with the successor to ggplot2, ggvis, gaining in popularity. We will later be covering ggplot2 since it is more refined and less subject to change than ggvis.

We will work with one of R's built-in learning dataframes today, mtcars. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Run help(mtcars) for more information on the dataset.

This is an American dataset, so let's convert the measurements to metric units. Like we did in lesson 1, we use within() to state which dataframe to use (in this case mtcars). Then we use curly brackets to frame what we want to manipulate; the curly brackets help keep the syntax organized. At the end we assign the result back to the mtcars dataframe with a right-facing arrow (->).

within(mtcars, {
  kpl    <- mpg * 0.425    # miles per gallon to kilometers per liter
  wt.mt  <- wt * 0.454     # weight in 1,000 lbs to metric tons
  disp.c <- disp * 16.39   # displacement in cubic inches to cubic centimeters
}) -> mtcars  

The Most Basic Graph

First we lay the foundation

Graph kilometers per liter by weight.
We are using the mtcars dataframe and some variables that are in that dataframe. Like in lesson 1, we need to tell R which dataframe the variables are in. We do that by using the $: mtcars is the data and WITHIN ($) that data is the variable wt.mt.
Then we overlay that foundation with a least squares line.
abline = straight line graphic
lm = linear model
~ is the "by" operator; here we are saying model kpl by wt.mt

plot(mtcars$wt.mt, mtcars$kpl)
abline(lm(mtcars$kpl ~ mtcars$wt.mt))
title("Regression of Kilometers per Liter on Weight in Metric Tons")

[Plot 1]

Now let's put some structural components in place.

Saving a Graph to the Hard Drive

I am too lazy to make a folder so let’s have R do it for us.

dir.create("E:/Rcourse/L2", showWarnings = FALSE)

Make that new folder the working directory.

setwd("E:/Rcourse/L2")

Let's take the commands above and create a file instead of displaying the plot on screen.
First we need to tell R which graphics engine to use. I prefer png since it's a good mix of compression and quality. You can specify pdf or tiff for good lossless saves, jpg for small and low quality, or bmp, xfig, and postscript for embedding or modifications. Just be sure that whatever engine you specify, you also specify a file extension that matches.
This starts a graphics device (dev) which saves plotting output to that file until you close it with dev.off(). You can use this to capture any other plots you like.

png("graph1.png")
plot(mtcars$wt.mt, mtcars$kpl)
abline(lm(mtcars$kpl ~ mtcars$wt.mt))
title("Regression of Kilometers per Liter on Weight in Metric Tons")
dev.off()
## png 
##   2

Notice nothing is generated in the plot window.
You can specify the size of the graph in the device call with width, height, and units. You can also specify the base text size with pointsize, the background with bg, the resolution in ppi with res, and, depending on the file type, some measure of quality or compression. See ?png or ?pdf for more information.

png("graph2.png", width = 1000, height = 806, units = "px", res = 150)
plot(mtcars$wt.mt, mtcars$kpl)
abline(lm(mtcars$kpl ~ mtcars$wt.mt))
title("Regression of Kilometers per Liter on Weight in Metric Tons")
dev.off()
## png 
##   2

Here we are specifying a graph that is 1000 by 806 pixels and adjusting res so the text and points aren't tiny at that size.
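pointsize and bg are not used above, so here is a minimal sketch of those two (the file name is arbitrary):

png("graph3.png", width = 800, height = 600, pointsize = 14, bg = "ivory")
plot(mtcars$wt.mt, mtcars$kpl)
dev.off()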

If you have been saving images and suddenly your commands don't seem to be doing anything anymore, it's probably because a device is still open. Simply run dev.off() until R prints "null device 1" or gives the error "cannot shut down device 1".
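If you are unsure how many devices are open, a small loop takes care of it:

while (dev.cur() > 1) dev.off()  # keep closing until only the null device (device 1) remains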

RStudio also supports saving a graph through the point-and-click menus. Click Export in the Plots pane and modify the settings accordingly.

Making Graphs Pretty and Functional

R controls graph displays with graphical parameters, or par(). They take the form par(optionname = VALUE, optionname = VALUE).

par(no.readonly=TRUE) #These are all the parameters you can manipulate.
## $xlog
## [1] FALSE
## 
## $ylog
## [1] FALSE
## 
## $adj
## [1] 0.5
## 
## $ann
## [1] TRUE
## 
## $ask
## [1] FALSE
## 
## $bg
## [1] "white"
## 
## $bty
## [1] "o"
## 
## $cex
## [1] 1
## 
## $cex.axis
## [1] 1
## 
## $cex.lab
## [1] 1
## 
## $cex.main
## [1] 1.2
## 
## $cex.sub
## [1] 1
## 
## $col
## [1] "black"
## 
## $col.axis
## [1] "black"
## 
## $col.lab
## [1] "black"
## 
## $col.main
## [1] "black"
## 
## $col.sub
## [1] "black"
## 
## $crt
## [1] 0
## 
## $err
## [1] 0
## 
## $family
## [1] ""
## 
## $fg
## [1] "black"
## 
## $fig
## [1] 0 1 0 1
## 
## $fin
## [1] 6.999999 4.999999
## 
## $font
## [1] 1
## 
## $font.axis
## [1] 1
## 
## $font.lab
## [1] 1
## 
## $font.main
## [1] 2
## 
## $font.sub
## [1] 1
## 
## $lab
## [1] 5 5 7
## 
## $las
## [1] 0
## 
## $lend
## [1] "round"
## 
## $lheight
## [1] 1
## 
## $ljoin
## [1] "round"
## 
## $lmitre
## [1] 10
## 
## $lty
## [1] "solid"
## 
## $lwd
## [1] 1
## 
## $mai
## [1] 1.02 0.82 0.82 0.42
## 
## $mar
## [1] 5.1 4.1 4.1 2.1
## 
## $mex
## [1] 1
## 
## $mfcol
## [1] 1 1
## 
## $mfg
## [1] 1 1 1 1
## 
## $mfrow
## [1] 1 1
## 
## $mgp
## [1] 3 1 0
## 
## $mkh
## [1] 0.001
## 
## $new
## [1] FALSE
## 
## $oma
## [1] 0 0 0 0
## 
## $omd
## [1] 0 1 0 1
## 
## $omi
## [1] 0 0 0 0
## 
## $pch
## [1] 1
## 
## $pin
## [1] 5.759999 3.159999
## 
## $plt
## [1] 0.1171429 0.9400000 0.2040000 0.8360000
## 
## $ps
## [1] 12
## 
## $pty
## [1] "m"
## 
## $smo
## [1] 1
## 
## $srt
## [1] 0
## 
## $tck
## [1] NA
## 
## $tcl
## [1] -0.5
## 
## $usr
## [1] 0 1 0 1
## 
## $xaxp
## [1] 0 1 5
## 
## $xaxs
## [1] "r"
## 
## $xaxt
## [1] "s"
## 
## $xpd
## [1] FALSE
## 
## $yaxp
## [1] 0 1 5
## 
## $yaxs
## [1] "r"
## 
## $yaxt
## [1] "s"
## 
## $ylbias
## [1] 0.2

Let's change the shape of the dots to triangles and the line to a dashed one. The first step is to save the default parameters. It is not essential that you do so, but it helps reset things if you mess up and don't remember what you did or how to fix the mistake.

defaultpar <- par(no.readonly=TRUE)

par(lty=2, pch=17)
plot(mtcars$wt.mt, mtcars$kpl)
abline(lm(mtcars$kpl ~ mtcars$wt.mt))
title("Regression of Kilometers per Liter on Weight in Metric Tons")

[Plot 2]

par(defaultpar)

In RStudio you can also reset your parameters to the default by clicking Clear All in the plots window.

Common parameters
lty = line type
pch = plotted point type
cex = symbol size
lwd = line width
How can I find more? ?par or help("par")
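As a quick sketch combining all four of these parameters (the values are arbitrary):

plot(mtcars$wt.mt, mtcars$kpl, pch = 16, cex = 1.5)      # filled circles at 1.5x size
abline(lm(mtcars$kpl ~ mtcars$wt.mt), lty = 3, lwd = 2)  # dotted line at double width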

Most plot functions allow you to specify everything inline. This tends to be how I modify plot options. It only lasts for one plot, but in my experience I am seldom changing every point in dozens of graphs, so it is rarely worth using global pars.

plot(mtcars$wt.mt, mtcars$kpl, lty=2, pch=17, 
     abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
     main="Regression of Kilometers per Liter on Weight in Metric Tons")

[Plot 3]

Like with lm, some graphing functions accept the "by" formula notation.
The form is Y ~ X (Y by X).

boxplot(mtcars$kpl ~ mtcars$gear, 
        main = "Boxplot of Kilometers per Liter by Number of Gears")

[Plot 4]

Coloring a graph.

Everything can be colored. col = plot color, col.axis = axis color, col.lab = labels color, col.main = title color, col.sub = subtitle color, fg = foreground color, and bg = background color. Color can be specified many ways:
col = 1 | Specified by position in the current palette()
col = "white" | Specified by name
col = "#FFFFFF" | Specified by hexadecimal
col = rgb(1, 1, 1) | Specified by RGB values
col = hsv(0, 0, 1) | Specified by HSV values
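As a quick illustration, the same steel-blue shade can be requested by name, hexadecimal, RGB, and (approximately) HSV; the numeric form depends on whatever palette() currently returns:

blues <- c("steelblue",                             # by name
           "#4682B4",                               # by hexadecimal
           rgb(70, 130, 180, maxColorValue = 255),  # by RGB values
           hsv(0.58, 0.61, 0.71))                   # roughly the same hue, by HSV
pie(rep(1, 4), col = blues, labels = c("name", "hex", "rgb()", "hsv()"))

palette()  # col = 1, col = 2, ... index into this vector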

colors() #all the built-in color names in R
##   [1] "white"                "aliceblue"            "antiquewhite"        
##   [4] "antiquewhite1"        "antiquewhite2"        "antiquewhite3"       
##   [7] "antiquewhite4"        "aquamarine"           "aquamarine1"         
##  [10] "aquamarine2"          "aquamarine3"          "aquamarine4"         
##  [13] "azure"                "azure1"               "azure2"              
##  [16] "azure3"               "azure4"               "beige"               
##  [19] "bisque"               "bisque1"              "bisque2"             
##  [22] "bisque3"              "bisque4"              "black"               
##  [25] "blanchedalmond"       "blue"                 "blue1"               
##  [28] "blue2"                "blue3"                "blue4"               
##  [31] "blueviolet"           "brown"                "brown1"              
##  [34] "brown2"               "brown3"               "brown4"              
##  [37] "burlywood"            "burlywood1"           "burlywood2"          
##  [40] "burlywood3"           "burlywood4"           "cadetblue"           
##  [43] "cadetblue1"           "cadetblue2"           "cadetblue3"          
##  [46] "cadetblue4"           "chartreuse"           "chartreuse1"         
##  [49] "chartreuse2"          "chartreuse3"          "chartreuse4"         
##  [52] "chocolate"            "chocolate1"           "chocolate2"          
##  [55] "chocolate3"           "chocolate4"           "coral"               
##  [58] "coral1"               "coral2"               "coral3"              
##  [61] "coral4"               "cornflowerblue"       "cornsilk"            
##  [64] "cornsilk1"            "cornsilk2"            "cornsilk3"           
##  [67] "cornsilk4"            "cyan"                 "cyan1"               
##  [70] "cyan2"                "cyan3"                "cyan4"               
##  [73] "darkblue"             "darkcyan"             "darkgoldenrod"       
##  [76] "darkgoldenrod1"       "darkgoldenrod2"       "darkgoldenrod3"      
##  [79] "darkgoldenrod4"       "darkgray"             "darkgreen"           
##  [82] "darkgrey"             "darkkhaki"            "darkmagenta"         
##  [85] "darkolivegreen"       "darkolivegreen1"      "darkolivegreen2"     
##  [88] "darkolivegreen3"      "darkolivegreen4"      "darkorange"          
##  [91] "darkorange1"          "darkorange2"          "darkorange3"         
##  [94] "darkorange4"          "darkorchid"           "darkorchid1"         
##  [97] "darkorchid2"          "darkorchid3"          "darkorchid4"         
## [100] "darkred"              "darksalmon"           "darkseagreen"        
## [103] "darkseagreen1"        "darkseagreen2"        "darkseagreen3"       
## [106] "darkseagreen4"        "darkslateblue"        "darkslategray"       
## [109] "darkslategray1"       "darkslategray2"       "darkslategray3"      
## [112] "darkslategray4"       "darkslategrey"        "darkturquoise"       
## [115] "darkviolet"           "deeppink"             "deeppink1"           
## [118] "deeppink2"            "deeppink3"            "deeppink4"           
## [121] "deepskyblue"          "deepskyblue1"         "deepskyblue2"        
## [124] "deepskyblue3"         "deepskyblue4"         "dimgray"             
## [127] "dimgrey"              "dodgerblue"           "dodgerblue1"         
## [130] "dodgerblue2"          "dodgerblue3"          "dodgerblue4"         
## [133] "firebrick"            "firebrick1"           "firebrick2"          
## [136] "firebrick3"           "firebrick4"           "floralwhite"         
## [139] "forestgreen"          "gainsboro"            "ghostwhite"          
## [142] "gold"                 "gold1"                "gold2"               
## [145] "gold3"                "gold4"                "goldenrod"           
## [148] "goldenrod1"           "goldenrod2"           "goldenrod3"          
## [151] "goldenrod4"           "gray"                 "gray0"               
## [154] "gray1"                "gray2"                "gray3"               
## [157] "gray4"                "gray5"                "gray6"               
## [160] "gray7"                "gray8"                "gray9"               
## [163] "gray10"               "gray11"               "gray12"              
## [166] "gray13"               "gray14"               "gray15"              
## [169] "gray16"               "gray17"               "gray18"              
## [172] "gray19"               "gray20"               "gray21"              
## [175] "gray22"               "gray23"               "gray24"              
## [178] "gray25"               "gray26"               "gray27"              
## [181] "gray28"               "gray29"               "gray30"              
## [184] "gray31"               "gray32"               "gray33"              
## [187] "gray34"               "gray35"               "gray36"              
## [190] "gray37"               "gray38"               "gray39"              
## [193] "gray40"               "gray41"               "gray42"              
## [196] "gray43"               "gray44"               "gray45"              
## [199] "gray46"               "gray47"               "gray48"              
## [202] "gray49"               "gray50"               "gray51"              
## [205] "gray52"               "gray53"               "gray54"              
## [208] "gray55"               "gray56"               "gray57"              
## [211] "gray58"               "gray59"               "gray60"              
## [214] "gray61"               "gray62"               "gray63"              
## [217] "gray64"               "gray65"               "gray66"              
## [220] "gray67"               "gray68"               "gray69"              
## [223] "gray70"               "gray71"               "gray72"              
## [226] "gray73"               "gray74"               "gray75"              
## [229] "gray76"               "gray77"               "gray78"              
## [232] "gray79"               "gray80"               "gray81"              
## [235] "gray82"               "gray83"               "gray84"              
## [238] "gray85"               "gray86"               "gray87"              
## [241] "gray88"               "gray89"               "gray90"              
## [244] "gray91"               "gray92"               "gray93"              
## [247] "gray94"               "gray95"               "gray96"              
## [250] "gray97"               "gray98"               "gray99"              
## [253] "gray100"              "green"                "green1"              
## [256] "green2"               "green3"               "green4"              
## [259] "greenyellow"          "grey"                 "grey0"               
## [262] "grey1"                "grey2"                "grey3"               
## [265] "grey4"                "grey5"                "grey6"               
## [268] "grey7"                "grey8"                "grey9"               
## [271] "grey10"               "grey11"               "grey12"              
## [274] "grey13"               "grey14"               "grey15"              
## [277] "grey16"               "grey17"               "grey18"              
## [280] "grey19"               "grey20"               "grey21"              
## [283] "grey22"               "grey23"               "grey24"              
## [286] "grey25"               "grey26"               "grey27"              
## [289] "grey28"               "grey29"               "grey30"              
## [292] "grey31"               "grey32"               "grey33"              
## [295] "grey34"               "grey35"               "grey36"              
## [298] "grey37"               "grey38"               "grey39"              
## [301] "grey40"               "grey41"               "grey42"              
## [304] "grey43"               "grey44"               "grey45"              
## [307] "grey46"               "grey47"               "grey48"              
## [310] "grey49"               "grey50"               "grey51"              
## [313] "grey52"               "grey53"               "grey54"              
## [316] "grey55"               "grey56"               "grey57"              
## [319] "grey58"               "grey59"               "grey60"              
## [322] "grey61"               "grey62"               "grey63"              
## [325] "grey64"               "grey65"               "grey66"              
## [328] "grey67"               "grey68"               "grey69"              
## [331] "grey70"               "grey71"               "grey72"              
## [334] "grey73"               "grey74"               "grey75"              
## [337] "grey76"               "grey77"               "grey78"              
## [340] "grey79"               "grey80"               "grey81"              
## [343] "grey82"               "grey83"               "grey84"              
## [346] "grey85"               "grey86"               "grey87"              
## [349] "grey88"               "grey89"               "grey90"              
## [352] "grey91"               "grey92"               "grey93"              
## [355] "grey94"               "grey95"               "grey96"              
## [358] "grey97"               "grey98"               "grey99"              
## [361] "grey100"              "honeydew"             "honeydew1"           
## [364] "honeydew2"            "honeydew3"            "honeydew4"           
## [367] "hotpink"              "hotpink1"             "hotpink2"            
## [370] "hotpink3"             "hotpink4"             "indianred"           
## [373] "indianred1"           "indianred2"           "indianred3"          
## [376] "indianred4"           "ivory"                "ivory1"              
## [379] "ivory2"               "ivory3"               "ivory4"              
## [382] "khaki"                "khaki1"               "khaki2"              
## [385] "khaki3"               "khaki4"               "lavender"            
## [388] "lavenderblush"        "lavenderblush1"       "lavenderblush2"      
## [391] "lavenderblush3"       "lavenderblush4"       "lawngreen"           
## [394] "lemonchiffon"         "lemonchiffon1"        "lemonchiffon2"       
## [397] "lemonchiffon3"        "lemonchiffon4"        "lightblue"           
## [400] "lightblue1"           "lightblue2"           "lightblue3"          
## [403] "lightblue4"           "lightcoral"           "lightcyan"           
## [406] "lightcyan1"           "lightcyan2"           "lightcyan3"          
## [409] "lightcyan4"           "lightgoldenrod"       "lightgoldenrod1"     
## [412] "lightgoldenrod2"      "lightgoldenrod3"      "lightgoldenrod4"     
## [415] "lightgoldenrodyellow" "lightgray"            "lightgreen"          
## [418] "lightgrey"            "lightpink"            "lightpink1"          
## [421] "lightpink2"           "lightpink3"           "lightpink4"          
## [424] "lightsalmon"          "lightsalmon1"         "lightsalmon2"        
## [427] "lightsalmon3"         "lightsalmon4"         "lightseagreen"       
## [430] "lightskyblue"         "lightskyblue1"        "lightskyblue2"       
## [433] "lightskyblue3"        "lightskyblue4"        "lightslateblue"      
## [436] "lightslategray"       "lightslategrey"       "lightsteelblue"      
## [439] "lightsteelblue1"      "lightsteelblue2"      "lightsteelblue3"     
## [442] "lightsteelblue4"      "lightyellow"          "lightyellow1"        
## [445] "lightyellow2"         "lightyellow3"         "lightyellow4"        
## [448] "limegreen"            "linen"                "magenta"             
## [451] "magenta1"             "magenta2"             "magenta3"            
## [454] "magenta4"             "maroon"               "maroon1"             
## [457] "maroon2"              "maroon3"              "maroon4"             
## [460] "mediumaquamarine"     "mediumblue"           "mediumorchid"        
## [463] "mediumorchid1"        "mediumorchid2"        "mediumorchid3"       
## [466] "mediumorchid4"        "mediumpurple"         "mediumpurple1"       
## [469] "mediumpurple2"        "mediumpurple3"        "mediumpurple4"       
## [472] "mediumseagreen"       "mediumslateblue"      "mediumspringgreen"   
## [475] "mediumturquoise"      "mediumvioletred"      "midnightblue"        
## [478] "mintcream"            "mistyrose"            "mistyrose1"          
## [481] "mistyrose2"           "mistyrose3"           "mistyrose4"          
## [484] "moccasin"             "navajowhite"          "navajowhite1"        
## [487] "navajowhite2"         "navajowhite3"         "navajowhite4"        
## [490] "navy"                 "navyblue"             "oldlace"             
## [493] "olivedrab"            "olivedrab1"           "olivedrab2"          
## [496] "olivedrab3"           "olivedrab4"           "orange"              
## [499] "orange1"              "orange2"              "orange3"             
## [502] "orange4"              "orangered"            "orangered1"          
## [505] "orangered2"           "orangered3"           "orangered4"          
## [508] "orchid"               "orchid1"              "orchid2"             
## [511] "orchid3"              "orchid4"              "palegoldenrod"       
## [514] "palegreen"            "palegreen1"           "palegreen2"          
## [517] "palegreen3"           "palegreen4"           "paleturquoise"       
## [520] "paleturquoise1"       "paleturquoise2"       "paleturquoise3"      
## [523] "paleturquoise4"       "palevioletred"        "palevioletred1"      
## [526] "palevioletred2"       "palevioletred3"       "palevioletred4"      
## [529] "papayawhip"           "peachpuff"            "peachpuff1"          
## [532] "peachpuff2"           "peachpuff3"           "peachpuff4"          
## [535] "peru"                 "pink"                 "pink1"               
## [538] "pink2"                "pink3"                "pink4"               
## [541] "plum"                 "plum1"                "plum2"               
## [544] "plum3"                "plum4"                "powderblue"          
## [547] "purple"               "purple1"              "purple2"             
## [550] "purple3"              "purple4"              "red"                 
## [553] "red1"                 "red2"                 "red3"                
## [556] "red4"                 "rosybrown"            "rosybrown1"          
## [559] "rosybrown2"           "rosybrown3"           "rosybrown4"          
## [562] "royalblue"            "royalblue1"           "royalblue2"          
## [565] "royalblue3"           "royalblue4"           "saddlebrown"         
## [568] "salmon"               "salmon1"              "salmon2"             
## [571] "salmon3"              "salmon4"              "sandybrown"          
## [574] "seagreen"             "seagreen1"            "seagreen2"           
## [577] "seagreen3"            "seagreen4"            "seashell"            
## [580] "seashell1"            "seashell2"            "seashell3"           
## [583] "seashell4"            "sienna"               "sienna1"             
## [586] "sienna2"              "sienna3"              "sienna4"             
## [589] "skyblue"              "skyblue1"             "skyblue2"            
## [592] "skyblue3"             "skyblue4"             "slateblue"           
## [595] "slateblue1"           "slateblue2"           "slateblue3"          
## [598] "slateblue4"           "slategray"            "slategray1"          
## [601] "slategray2"           "slategray3"           "slategray4"          
## [604] "slategrey"            "snow"                 "snow1"               
## [607] "snow2"                "snow3"                "snow4"               
## [610] "springgreen"          "springgreen1"         "springgreen2"        
## [613] "springgreen3"         "springgreen4"         "steelblue"           
## [616] "steelblue1"           "steelblue2"           "steelblue3"          
## [619] "steelblue4"           "tan"                  "tan1"                
## [622] "tan2"                 "tan3"                 "tan4"                
## [625] "thistle"              "thistle1"             "thistle2"            
## [628] "thistle3"             "thistle4"             "tomato"              
## [631] "tomato1"              "tomato2"              "tomato3"             
## [634] "tomato4"              "turquoise"            "turquoise1"          
## [637] "turquoise2"           "turquoise3"           "turquoise4"          
## [640] "violet"               "violetred"            "violetred1"          
## [643] "violetred2"           "violetred3"           "violetred4"          
## [646] "wheat"                "wheat1"               "wheat2"              
## [649] "wheat3"               "wheat4"               "whitesmoke"          
## [652] "yellow"               "yellow1"              "yellow2"             
## [655] "yellow3"              "yellow4"              "yellowgreen"

You can also use this PDF
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf from
Earl F. Glynn’s page on Stowers Institute for Medical Research.

R also features a variety of premade palettes.
For example,

Rainbow

N <- 10
Color <- rainbow(N)
pie(rep(1,N), col=Color)

[Plot 5]

Gray

Color <- gray(0:N/N)
pie(rep(1,N), col=Color)

[Plot 6]

Heat

Color <- heat.colors(N)
pie(rep(1,N), col=Color)

[Plot 7]

Topographic

Color <- topo.colors(N)
pie(rep(1,N), col=Color)

[Plot 8]

Change the N and see what kinds of colors you can get.

Text and symbols

Text and symbols are modified with cex. cex = symbol size relative to the default (1),
cex.axis = magnification of the axis text,
cex.lab, cex.main, cex.sub are all magnifications relative to the cex setting.
font = 1, plain; 2 = bold; 3 = italic; 4 = bold italic; 5 = symbol.
font.lab, font.main, font.sub, etc. all change the font for that area. ps = text point size. The final text size is ps * cex.
family = font family, e.g., serif, sans, mono, etc. (a quick sketch using family and ps follows the examples below).

Examples:

plot(mtcars$wt.mt, mtcars$kpl,
     abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
     main="Defaults")

[Plot 9]

plot(mtcars$wt.mt, mtcars$kpl, cex = 2, 
     abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
     main="Big Symbols")

[Plot 10]

plot(mtcars$wt.mt, mtcars$kpl, cex = 1, font = 3,
     cex.main = .75, cex.lab = 2, abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
     main="Italic Axes Labels, Large Text Legends, and Small Title")

[Plot 11]
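family and ps were not used in the examples above, so here is a minimal sketch of those two (assuming the defaults were saved in defaultpar, as earlier):

par(family = "serif", ps = 16)   # serif font family, 16 point base text size
plot(mtcars$wt.mt, mtcars$kpl, main = "Serif Font, 16 Point Base Text")
par(defaultpar)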

Dimensions

pin = c(width, height) changes the absolute size of the plot in inches. This makes the whole graph fit into a specific size while all other options stay static; in other words, making the graph very big doesn't necessarily make the text fit well. mai = c(bottom, left, top, right) sets the margins in inches. You can change specific parts of how the graph is plotted with margins. They can get quite complex, but there is a very nice guide available through http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/
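As a minimal sketch (the margin values here are just illustrative), widening the left margin with mai might look like this:

par(mai = c(1.02, 1.6, 0.82, 0.42))  # bottom, left, top, right margins in inches
plot(mtcars$wt.mt, mtcars$kpl, ylab = "Kilometers per Liter")
par(defaultpar)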

Let's put all this to use.
The par commands apply to all of the following graphs, but the inline options apply only to their own graph. We start by setting the dimensions of the plot to 5 inches wide by 4 inches tall. Then we make a thicker line and larger text with lwd and cex. Finally, we make the axis text smaller and italicised.

par(pin=c(5,4))
par(lwd=2, cex=1.5)
par(cex.axis=.75, font.axis=3)

For each plot independently, we will change the color and shape of the symbols.

plot(mtcars$wt.mt, mtcars$kpl,
  abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
  main="Defaults",
  pch = 19,
  col = "dodgerblue")

[Plot 12]

plot(mtcars$wt.mt, mtcars$kpl,
     abline(lm(mtcars$kpl ~ mtcars$wt.mt)), 
     main="Defaults",
     pch = 23,
     col = "indianred")

[Plot 13]

plot(mtcars$kpl, mtcars$hp, pch = 23, col="blue", 
     abline(lm(mtcars$hp ~ mtcars$kpl)))

[Plot 14]

And reset the global parameters to their defaults.

par(defaultpar)

Text customization

You can add text with main (title), sub (subtitle), xlab (x axis label), and ylab (y axis label).

plot(mtcars$kpl, mtcars$wt.mt, 
     xlab = "Kilometers per Liter", 
     ylab = "Weight in Metric Tons", 
     main = "Scatterplot of K/L and WT", 
     sub = "Data from mtcars")

[Plot 15]

You can also annotate a graph with text() and mtext(). First we create a graph.
Then, over the top of that graph, we write the name of each car at the intersection of its wt.mt and kpl values. Since the car names are the row names, we can use row.names(mtcars). pos refers to the position where the text is written; we can use 4 to place it to the right of the point.

plot(mtcars$wt.mt, mtcars$kpl, 
     main = "K/L vs. Weight", 
     xlab = "Weight", 
     ylab = "Kilometers per Liter",
     pch = 18, 
     col = "steelblue")

text(mtcars$wt.mt, mtcars$kpl, row.names(mtcars), cex = .6, pos = 4, col = "Blue")

[Plot 16]

If we wanted instead to see how many cylinders each car has, we could graph that just as easily by specifying that variable as the text to place at those positions.

plot(mtcars$wt.mt, mtcars$kpl, 
     main = "K/L vs. Weight", 
     xlab = "Weight", 
     ylab = "Kilometers per Liter",
     pch = 18, 
     col = "steelblue")

text(mtcars$wt.mt, mtcars$kpl, mtcars$cyl, cex = .6, pos = 4, col = "steelblue")

[Plot 17]

You can adjust the limits of the axes with xlim and ylim.
To set limits you give a vector of the lower and upper coordinates, e.g., c(-5, 32).

plot(mtcars$wt.mt, mtcars$kpl, 
     main = "K/L vs. Weight", 
     xlab = "Weight", 
     ylab = "Kilometers per Liter",
     pch = 18, 
     col = "Purple", 
     xlim=c(0,10), 
     ylim=c(0,40)) 

[Plot 18]

Combining Graphs

R can produce your plots in a matrix with par. One option in par is mfrow, which creates a matrix of plots that are filled in by row; mfcol is the by-column version. This will automatically adjust things like cex so the graphs fit into the new matrix structure. Alternatively, you can use layout or split.screen. All the options have their strengths and weaknesses and none of them can be used together. Spend some time looking over the help documents for the three methods and choose the one that makes the most sense to you. I prefer layout, which has the form:
layout(mat, widths = rep.int(1, ncol(mat)), heights = rep.int(1, nrow(mat)), respect = FALSE). This lays out where the next N figures will be plotted. layout lets you choose exactly where on the plot things appear and how much room they take up. In the matrix you use an integer to specify which plot goes where: 1 is the next plot, 2 is the plot after that, and so on up to the number of plots you intend to put in the matrix. A 0 means don't use that area, and the same number in multiple cells means use all of those cells for one plot (span across the cells).
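mfrow itself is not demonstrated below, so here is a minimal sketch of that simpler approach (assuming the defaults were saved in defaultpar as above):

par(mfrow = c(1, 2))  # one row, two columns, filled left to right
hist(mtcars$kpl, main = "Histogram of Kilometers per Liter")
hist(mtcars$wt.mt, main = "Histogram of Weight")
par(defaultpar)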

Let's start with a matrix of plots where the next 4 plots each get their own cell, entered by row in a 2 by 2 fashion.
We can test whether this is the layout we want with layout.show(n), where n is the number of plots we want to see (4 in this case).

layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE), respect = TRUE)

layout.show(4)

[Plot 19]

Okay, we have the arrangement we are looking for. Now we just create 4 plots and they will be filled in as they are plotted.

layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE), respect = TRUE)
plot(mtcars$wt.mt, mtcars$kpl, main = "Scatterplot of K/L and WT")
plot(mtcars$wt.mt, mtcars$disp.c, main = "Scatterplot of Weight and Displacement")
hist(mtcars$wt.mt, main = "Histogram of Weight")
boxplot(mtcars$wt.mt, main = "Boxplot of Weight")

[Plot 20]

Then we need to reset back to the basics.

par(defaultpar)

We could replicate most of what we have above but also assign the entire top row to 1 graph.

layout(matrix(c(1,1,2,3), 2, 2, byrow=TRUE))

layout.show(3)

[Plot 21]

hist(mtcars$wt.mt, main = "Histogram of Weight")
hist(mtcars$kpl, main = "Histogram of Kilometers per Liter")
hist(mtcars$disp.c, main = "Histogram of Displacement")

[Plot 22]

Then we need to reset back to the basics.

par(defaultpar)

Finally, sometimes you will need very fine control over the graphs. To do that we use fig to specify the exact coordinates for a plot to take up. fig is specified as a numerical vector of the form c(x1, x2, y1, y2), which gives the coordinates of the figure region in the display region of the device. If you set this you start a new plot, so to add to an existing plot use new = TRUE. The plotting area goes from 0 to 1 (think of it like proportions of the plotting area you want a figure to be inside). You can go negative or over 1 if you want to plot outside the typical range.

Let's start with a plot that goes from 0% to 80% of X and 0% to 80% of Y. Then we will graph onto the remaining area above and to the right of that region. What we want to create are boxplots around a scatterplot, showing how the data are distributed along each axis.

After that we specify a graph to fill the rest of the space. This is where it can begin to get tricky. Since the top graph spans the same x range, that part is easy (0 and 0.8). But the graph on the y axis will be small if we tell it to take only the remaining space, so it's best to play around with the exact dimensions until the output looks good to you. If you are using RStudio, don't rely on the preview, since it will scale to the dimensions of your monitor. You will need to use Zoom or save the graph in order to get the best dimensions for display or print.

par(fig=c(0, 0.8, 0, 0.8)) #Specify coordinates for plot

layout.show(1) #Check if this is the right plotting area.

[Plot 23]

plot(mtcars$wt.mt, mtcars$kpl,
     xlab = "Weight in Metric Tons",
     ylab = "Kilometers per Liter",
     col = "steelblue", pch = 10) #Create our plot.

par(fig=c(0, 0.8, 0.55, 1), new = TRUE)

# For the boxplot we can flip the graph with horizontal = TRUE and
# disable the display of the axes with axes = FALSE.

boxplot(mtcars$wt.mt, horizontal = TRUE, axes = FALSE, col = "steelblue2")

par(fig=c(0.7, 0.95, 0, 0.8), new = TRUE)
boxplot(mtcars$kpl, axes=FALSE, col = "steelblue2")

mtext("Scatterplot of K/L and Weight with Density Boxplots", side = 3, outer = TRUE, 
      col = "mediumvioletred", line = -3, cex = 1.5)

[Plot 24]

Finally, we can add a title with mtext (if we used main in the original graph, it would overlay the position we want the boxplot to be in). We use side to say where it should be positioned; in this case 3, which is the top. Then we tell it that graphing outside the plot area is fine with outer = TRUE. Last, we need to offset the title a bit with line = -3. This too takes a little trial and error to find a good position, based on the size of the graph you are constructing.

par(defaultpar)

I, frankly, do not like using fig. I find the plots never quite turn out how you want and there is simply too much fiddling around and inexactness. Usually, if you want a complex plot it can be accomplished easier through the use of packages. Those packages usually come with a better way to print and save the plot as well.

 

Now that you have become an expert on creating graphs with Base-R why not give the lab a try? It’s a rather simple exercise where you try and replicate a few graphs by using what you learned above.

Lab 2: https://docs.google.com/document/d/1g3nQ1a0shnvXC-PkPcAYKV5fuDWMxMORVQG5armCbXw/edit?usp=sharing

Answers: https://drive.google.com/file/d/0BzzRhb-koTrLNnpOZXhNMEplVjA/view?usp=sharing

If you are having issues with the answers try downloading them and opening them in your browser of choice.

Part I: Basic Data Structures and R Syntax

I just finished up teaching a semester-long course on R programming in the social sciences. After gathering a lot of feedback from the students and reviewing the notes I took, I am going back through the course and making modifications and extensions on some topics. As I modify the course and shape it to be even better, I will be posting syntax and output for people to follow along, and labs for testing abilities!

The following Teaching and Learning R series is aimed at helping someone who knows very little about R and computer programming in general, but has basic statistical knowledge (the General Linear Model), learn how to properly format data, graph, run statistical analyses, and output results from R in a usable format. This includes publication-quality graphs and tables. If you have any comments or feedback, please don't hesitate to email me directly or leave a comment on the blog!

Please try to write out the syntax yourself. Try and play around a little and see what you get and what the boundaries are. At the end of the lesson is a lab to help test your skills!

R's syntax is flexible: you may format your code however you want, within certain limits. R is whitespace insensitive, so you can use as many or as few spaces as you wish. R is case sensitive, so you will need to be sure you are capitalizing things consistently. Although you may do whatever you wish with your syntax, there are a number of conventions that will make your code easier to read, follow, and understand. In particular, I believe that having spaces around operators, spaces after all commas, and a consistent methodology for naming variables are the most essential habits to get into. I would suggest Hadley Wickham's R style guide (http://adv-r.had.co.nz/Style.html). Read through that document and commit it to memory and you will have an easier time with R programming. Google also maintains a useful style guide (http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml).
Use comments for everything! You don't know when you will want to review code or give code to someone else. A good description of what you were thinking when you wrote it, what you hoped to accomplish, and why you did what you did will save you a lot of time in the long run. Writing comments in RStudio is easy. Just type a # and follow it with a long line of text. Then highlight the line of text and click Reflow Comment in the Code menu. In Windows the shortcut is Ctrl+Shift+/.
R is composed of a number of components. You have the console, where everything is run. In a terminal or plain Base-R session you would do most of your coding here. This is a great place to do simple analyses, quick plots, or to test things out; you can hit the up arrow while in the console to see past entries. This is the lower left window in RStudio. You have a syntax (source) view available where you can spend more time structuring syntax and flows. This is generally where you will do most of your work and is the upper left window in RStudio. When a variable is created, a dataset loaded, or something is saved, it is stored in the workspace. Think of this like a desktop where all your documents are located. This is represented as "Environment" in the upper right corner in RStudio (it's basically a constant display of str()). Last, there are a number of objects that will be created in any analysis (e.g., plots) which will be pop-ups in Base-R and are stored in the lower right window in RStudio.
There is a large amount of information and guides available through the R Project Manuals page (http://cran.r-project.org/). I would recommend reading and following along with them, particularly the beginning of "An Introduction to R" and "Data Import/Export", as those are very helpful topics. You can find a lot of help through a number of websites as well. The most popular place to ask R questions is Stack Overflow (http://stackoverflow.com/questions/tagged/r). There is a smaller but still helpful community you can access through reddit as well (http://www.reddit.com/r/rstats). On either of these websites you can ask basic and complex questions, but you should try searching for similar questions first; people can get quite grumpy with repeated questions that have already been answered. If you have a new question, try to provide example data, either as a download or as syntax that creates a small dataset like the one you are working with, along with what you expect the output to look like when you are done. If you don't, the first responses to your question will be someone telling you to do exactly that. Finally, you can access the help manual for each function with help(command) or ?command, and you can search all the installed help files with ??command. Try it out: ?sum

Vectors

x <- c(10.4, 5.6, 3.1, 6.4, 21.7) #c combines (concatenates) values into a vector

OR

assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))

<- is therefore a shortcut for assign(). -> and = also assign, as long as the arrow points in the correct direction. The = is usually used less than the arrows, since with the arrows you are always sure which direction the assignment is going.
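As a quick illustration, all three forms below leave x holding the same vector:

x <- c(10.4, 5.6, 3.1, 6.4, 21.7)  # leftward assignment
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x  # rightward assignment
x = c(10.4, 5.6, 3.1, 6.4, 21.7)   # = also assigns at the top level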

If we do arithmetic on this vector it doesn’t change the vector

1 / x
## [1] 0.09615385 0.17857143 0.32258065 0.15625000 0.04608295
x + 10
## [1] 20.4 15.6 13.1 16.4 31.7
x * 100
## [1] 1040  560  310  640 2170

Only if we assign it a variable name is it stored.
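For example (x.plus is just an illustrative name):

x + 10            # the result is printed but x itself is unchanged
x.plus <- x + 10  # assigning stores the result in a new vector, x.plus
x                 # still the original values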

We can also use vectors within vectors

y <- c(x, 0, x)
y
##  [1] 10.4  5.6  3.1  6.4 21.7  0.0 10.4  5.6  3.1  6.4 21.7

R will always make vector arithmetic the length of the longest vector, recycling the shorter vectors as needed.

v <- 2 * x + y + 1
## Warning in 2 * x + y: longer object length is not a multiple of shorter
## object length
v
##  [1] 32.2 17.8 10.3 20.2 66.1 21.8 22.6 12.8 16.9 50.8 43.5

x is recycled 2.2 times, y is used once, and 1 is repeated 11 times.

You can also do arithmetic between parts of vectors

x[2] * x[4]
## [1] 35.84

You can also have vectors of characters

a <- c("one", "two", "three")
a
## [1] "one"   "two"   "three"

and logical

b <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
b
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Vector Referencing

vector[position]

x[2] #second position
## [1] 5.6
x[c(2, 5)] #second and fifth position
## [1]  5.6 21.7

R also supports a "through" range with the colon operator

x[2:6] #positions 2 through 6 (returns an NA because 6 doesn't exist)
## [1]  5.6  3.1  6.4 21.7   NA

Matrices

matrix(data = NA, nrow = numberofrows, ncol = numberofcolumns, byrow = FALSE, dimnames = list(rownames, colnames))

c <- matrix(1:20, nrow = 5, ncol = 4) # note: naming an object c masks the built-in c() function; fine for a quick demo, best avoided in real code

add byrow = TRUE to fill in the matrix by rows

c
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
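The dimnames argument from the template above isn't shown in that example, so here is a small sketch with made-up row and column names:

m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))
m
##      col1 col2 col3
## row1    1    2    3
## row2    4    5    6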

Matrix referencing

matrix[row position, col position]

A blank means all

c[1, ] #all of row 1
## [1]  1  6 11 16
c[, 1] #all of column 1
## [1] 1 2 3 4 5
c[5, 2] #cell from row 5 column 2
## [1] 10
c[c(2, 5), 4] #rows 2 and 5 in column 4
## [1] 17 20
c[1:3, 2:3] #rows 1 through 3 and columns 2 through 3
##      [,1] [,2]
## [1,]    6   11
## [2,]    7   12
## [3,]    8   13

Arrays

array(data = NA, dim = length(data), dimnames = NULL)

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
z
## , , C1
## 
##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6
## 
## , , C2
## 
##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12
## 
## , , C3
## 
##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18
## 
## , , C4
## 
##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

Array referencing

array[row position, col position, dimension position]

z[1, 2, 1:3]
## C1 C2 C3 
##  3  9 15

Data Frames

Similar to what you would expect to work with in SPSS, SAS, Excel, etc.

Sepallength <- c(5.1, 4.9, 7, 6.4, 6.3, 5.8)
Sepalwidth <- c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7)
Petallength <- c(1.4, 1.4, 4.7, 4.5, 6.0, 5.1)
Petalwidth <- c(.2, .2,1.4, 1.5,  2.5, 1.9)
Species <- c("I. setosa", "I. setosa", "I. versicolor", "I. versicolor", "I. virginica", "I. virginica")
Firis <- data.frame(Sepallength, Sepalwidth, Petallength, Petalwidth, Species)
Firis
##   Sepallength Sepalwidth Petallength Petalwidth       Species
## 1         5.1        3.5         1.4        0.2     I. setosa
## 2         4.9        3.0         1.4        0.2     I. setosa
## 3         7.0        3.2         4.7        1.4 I. versicolor
## 4         6.4        3.2         4.5        1.5 I. versicolor
## 5         6.3        3.3         6.0        2.5  I. virginica
## 6         5.8        2.7         5.1        1.9  I. virginica

Data frame referencing

dataframe[row position, col position]

Unlike with matrices you can also use column names

Firis[c(1, 3)] #Comparing Sepal Length and Petal Length
##   Sepallength Petallength
## 1         5.1         1.4
## 2         4.9         1.4
## 3         7.0         4.7
## 4         6.4         4.5
## 5         6.3         6.0
## 6         5.8         5.1

Instead of counting columns we can refer to column name

Firis[c("Sepallength", "Petallength")] 
##   Sepallength Petallength
## 1         5.1         1.4
## 2         4.9         1.4
## 3         7.0         4.7
## 4         6.4         4.5
## 5         6.3         6.0
## 6         5.8         5.1

The most common way we will reference something is with a $. A $ means within. We can call a single variable with dataframe$variable_name

Firis$Sepalwidth 
## [1] 3.5 3.0 3.2 3.2 3.3 2.7

Selecting a single variable is very important, especially when we want to cross tabulate.

table(Firis$Sepalwidth, Firis$Species)
##      
##       I. setosa I. versicolor I. virginica
##   2.7         0             0            1
##   3           1             0            0
##   3.2         0             2            0
##   3.3         0             0            1
##   3.5         1             0            0

I have a little secret: the iris data in its entirety already exists inside Base-R. Let's clear the workspace and then load up that data.

rm(list=ls())

There is a lot of data so we can get partial pictures of the dataset with

head(iris) #first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris) #last 6 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
summary(iris) #summary statistics for each column
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
str(iris) #the types of variables in the data frame
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We could use table(iris$Sepal.Width, iris$Species) to see an expanded version of the above table, or we can make sure R will use the iris data. We can do that with attach(dataframe). This attaches the data frame to the search path so its variables are accessible by name in all functions without telling them which dataset they belong to. However, variables that you create and add to the dataset will NOT be automatically attached. Most programmers, myself included, would recommend not using attach.

attach(iris)
table(Sepal.Width, Species)
##            Species
## Sepal.Width setosa versicolor virginica
##         2        0          1         0
##         2.2      0          2         1
##         2.3      1          3         0
##         2.4      0          3         0
##         2.5      0          4         4
##         2.6      0          3         2
##         2.7      0          5         4
##         2.8      0          6         8
##         2.9      1          7         2
##         3        6          8        12
##         3.1      4          3         4
##         3.2      5          3         5
##         3.3      2          1         3
##         3.4      9          1         2
##         3.5      6          0         0
##         3.6      3          0         1
##         3.7      3          0         0
##         3.8      4          0         2
##         3.9      2          0         0
##         4        1          0         0
##         4.1      1          0         0
##         4.2      1          0         0
##         4.4      1          0         0

You can reverse attach with detach()

detach(iris)

You can also temporarily run a series of operations using a data frame's variables with with()

with(iris, {
  plot(Species, Petal.Length, main="Petal Length by Species")
})

 

[Plot: Petal Length by Species]

The limitation of with() is that it only evaluates the expressions you give it; nothing is written back to the data frame. If we want the result returned as a modified copy of the data frame, we use within().

within(iris, {
  Petal.Area <- Petal.Length * Petal.Width #Create the variable
})
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## ...          ...         ...          ...         ...        ...
## 150          5.9         3.0          5.1         1.8  virginica
##     Petal.Area
## 1         0.28
## 2         0.28
## ...        ...
## 150       9.18

Notice how within() prints the entire data frame, all 150 rows, with our new Petal.Area column?

Compare that to with().

with(iris, {
  Petal.Area <- Petal.Length * Petal.Width
})

Nothing is printed this time.

Neither call actually changes iris. To save the new column we need to assign the result back to the data frame (or to a new one).

within(iris, {
  Petal.Area <- Petal.Length * Petal.Width
}) -> iris #Assign the result back to the iris dataframe
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1          5.1         3.5          1.4         0.2  setosa       0.28
## 2          4.9         3.0          1.4         0.2  setosa       0.28
## 3          4.7         3.2          1.3         0.2  setosa       0.26
## 4          4.6         3.1          1.5         0.2  setosa       0.30
## 5          5.0         3.6          1.4         0.2  setosa       0.28
## 6          5.4         3.9          1.7         0.4  setosa       0.68

We now have Petal.Area as a column in our dataframe.

If we had used with() instead, we would only get the new variable itself:

with(iris, {
  Petal.Area <- Petal.Length * Petal.Width
}) -> iris2
head(iris2)
## [1] 0.28 0.28 0.26 0.30 0.28 0.68
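
That makes with() a natural fit when you just want to compute something from the columns without modifying the data frame. A quick sketch (the particular statistics here are only for illustration):

with(iris, mean(Petal.Length * Petal.Width)) #average petal area, roughly 5.79
with(iris, cor(Sepal.Length, Petal.Length)) #correlation between two columns, roughly 0.87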

Factors

R automatically creates integer codes for text entries when you turn them into factors. Factors can be confusing at first, but they are quite powerful. You can read more about how R deals with factors at http://www.stat.berkeley.edu/~s133/factors.html

diabetes <- c("Type1", "Type2", "Type1", "Type2")
diabetes
## [1] "Type1" "Type2" "Type1" "Type2"
class(diabetes) #class tells us what type of variable we have
## [1] "character"
str(diabetes)
##  chr [1:4] "Type1" "Type2" "Type1" "Type2"
diabetes <- factor(diabetes)
diabetes #notice how the "" are gone
## [1] Type1 Type2 Type1 Type2
## Levels: Type1 Type2
class(diabetes)
## [1] "factor"
str(diabetes) 
##  Factor w/ 2 levels "Type1","Type2": 1 2 1 2

You can see the codes now. Codes are assigned to the categories in alphabetical order. This is a NOMINAL variable.
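
You can also look at the codes and their labels directly; a quick sketch using the diabetes factor from above:

as.integer(diabetes) #the underlying codes: 1 2 1 2
levels(diabetes) #the labels those codes map to: "Type1" "Type2"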

rating <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")
rating <- factor(rating)
rating
## [1] Strongly Disagree Disagree          Agree             Strongly Agree   
## Levels: Agree Disagree Strongly Agree Strongly Disagree
class(rating)
## [1] "factor"
str(rating) #notice agree is 1, then disagree is 2, etc.
##  Factor w/ 4 levels "Agree","Disagree",..: 4 2 1 3

To make this an ORDINAL variable we need to use ordered = TRUE and specify the levels in order:

rating <- factor(c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree"),
                 ordered=TRUE, 
                 levels=c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree"))
rating
## [1] Strongly Disagree Disagree          Agree             Strongly Agree   
## Levels: Strongly Disagree < Disagree < Agree < Strongly Agree
class(rating)
## [1] "ordered" "factor"
str(rating)
##  Ord.factor w/ 4 levels "Strongly Disagree"<..: 1 2 3 4

If you have numeric data that you want to turn into a categorical variable, supply labels for the levels: rating <- factor(rating, levels = 1:4, labels = c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree"))
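
Here is a minimal, self-contained sketch of that conversion; the numeric responses below are made up for illustration:

raw.ratings <- c(1, 3, 2, 4, 1) #hypothetical numeric responses
raw.ratings <- factor(raw.ratings, levels = 1:4,
                      labels = c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree"))
raw.ratings #prints the labels instead of the numbers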

Let's pretend someone rated how much they liked those irises. We can use a randomizer to assign these values for us quickly. If we want the randomization to be reproducible we can use set.seed, which tells R to start from this specific seed the next time it generates random numbers.

set.seed(42); rating <- sample(c("Very Pretty", "Pretty", "Ugly", "Very Ugly"), 
                               150, replace = TRUE)

Normally the seed is derived from the current time and the process ID.

rating <- factor(rating, ordered=TRUE, 
                 levels=c("Very Pretty", "Pretty", "Ugly", "Very Ugly"))

Let’s recreate that exact same data with just a numeric representation for comparison.

set.seed(42); rating.numeric <- sample(1:4, 150, replace = TRUE)

Then we add them to the iris data frame

iris$rating <- rating
iris$rating.numeric <- rating.numeric
summary(iris) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species     Petal.Area             rating   rating.numeric 
##  setosa    :50   Min.   : 0.110   Very Pretty:34   Min.   :1.000  
##  versicolor:50   1st Qu.: 0.420   Pretty     :29   1st Qu.:2.000  
##  virginica :50   Median : 5.615   Ugly       :46   Median :3.000  
##                  Mean   : 5.794   Very Ugly  :41   Mean   :2.627  
##                  3rd Qu.: 9.690                    3rd Qu.:4.000  
##                  Max.   :15.870                    Max.   :4.000

Notice how Species and rating are summarized as counts for each level, while rating.numeric gets a mean and quantiles, even though the factors are stored as numeric codes underneath.

str(iris)
## 'data.frame':    150 obs. of  8 variables:
##  $ Sepal.Length  : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width   : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length  : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width   : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species       : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Area    : num  0.28 0.28 0.26 0.3 0.28 0.68 0.42 0.3 0.28 0.15 ...
##  $ rating        : Ord.factor w/ 4 levels "Very Pretty"<..: 4 4 2 4 3 3 3 1 3 3 ...
##  $ rating.numeric: int  4 4 2 4 3 3 3 1 3 3 ...

Here is a quick example of what this difference looks like when you use these variables for visualization or statistics.

with(iris, {
  plot(rating, Sepal.Width, main="Ordinal Factor Rating")
  plot(rating.numeric, Sepal.Width, main="Numeric Factor Rating")
})

 

[Plot: "Ordinal Factor Rating" (box plots of Sepal.Width by rating level)]

[Plot: "Numeric Factor Rating" (scatterplot of Sepal.Width against rating.numeric)]

Importing a Dataset

Create a folder close to root for your course work; I usually use something like E:/Rcourse/L1. You can have R create the directory for you easily.

dir.create("E:/Rcourse/L1", showWarnings = FALSE)

Then set the working directory for R to that folder. This lets you import and use the file more easily. It also tells R where to look for old workspaces and where to put anything it creates (like a save file). I strongly, strongly recommend that you create a new directory for every analysis. Keep your original data in pristine form and have a syntax file that cleans the data and saves the result to a new directory. Then, when you do a primary analysis, load that cleaned data and save any modifications you make to a new directory. This allows you to go back to previous steps and easily make changes without having to start over from the very beginning. It also means you will never have to admit you lost data, overwrote data, or in general screwed up. Computers have essentially unlimited storage for typical social science research (a million rows of 30 variables stored in RData format will probably be less than 25 megabytes). A sketch of this workflow follows the setwd() call below.

setwd("E:/Rcourse/L1")
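
Here is a minimal sketch of that raw-to-clean workflow; the folder and file names below are made up for illustration:

#Cleaning script: read the pristine raw data, clean it, save the result to a new directory
raw <- read.csv("E:/Rcourse/raw/survey_raw.csv")
clean <- na.omit(raw) #whatever cleaning steps you actually need
dir.create("E:/Rcourse/clean", showWarnings = FALSE)
save(clean, file = "E:/Rcourse/clean/survey_clean.RData")

#Analysis script: load the cleaned data and leave the raw file untouched
load("E:/Rcourse/clean/survey_clean.RData")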

Text

A delimited file is almost always the best way to import data into R; I would suggest exporting from SAS, SPSS, Excel, etc. as a CSV and then importing that. We can even download a file from the internet if we know where to look for it. Here we pull some responses to a Job in General survey:

JiG <- read.csv(file = "http://degovx.eurybia.feralhosting.com/JiG.csv", 
                fileEncoding = "UTF-8-BOM")

Most Windows programs write a special bit of text, called a Byte Order Mark, at the front of text-based files; it can show up as a bit of garbage at the start of the first variable name in the header. Passing fileEncoding = "UTF-8-BOM" tells read.csv to strip that mark.

head(JiG)
##   XJIG1 XJIG2 XJIG3 XJIG4 XJIG5 XJIG6 XJIG7 XJIG8 XJIG9 XJIG10 XJIG11
## 1     3     3     3     3     3     3     3     3     3      3      3
## 2     3     3     1     3     3     3     3     3     3      0      3
## 3     3     3     3     3     3     3     3     3     3      0      0
## 4     3     3     0     3     3     3     3     3     3      0      0
## 5     3     3     3     3     3     3     3     3     3      3      3
## 6     3     3     3     3     3     3     3     3     3      0      3
##   XJIG12 XJIG13 XJIG14 XJIG15 XJIG16 XJIG17 XJIG18 XJIG19x XJIG20x XJIG21x
## 1      3      3      3      3      3      3      3       3       3       3
## 2      3      3      3      0      3      3      3       3       3       3
## 3      3      3      3      0      3      3      3       3       3       0
## 4      3      0      3      0      3      3      3       0       3       0
## 5      3      3      3      3      3      3      3       3       3       3
## 6      3      3      3      3      3      3      3       3       3       3
summary(JiG)
##      XJIG1           XJIG2           XJIG3           XJIG4      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :0.000   Median :3.000  
##  Mean   :2.488   Mean   :2.699   Mean   :1.092   Mean   :2.663  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##      XJIG5           XJIG6           XJIG7           XJIG8      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.527   Mean   :2.577   Mean   :2.321   Mean   :2.746  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##      XJIG9          XJIG10           XJIG11          XJIG12     
##  Min.   :0.00   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.00   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:3.000  
##  Median :3.00   Median :0.0000   Median :3.000   Median :3.000  
##  Mean   :2.76   Mean   :0.9461   Mean   :2.122   Mean   :2.574  
##  3rd Qu.:3.00   3rd Qu.:3.0000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.00   Max.   :3.0000   Max.   :3.000   Max.   :3.000  
##      XJIG13          XJIG14          XJIG15          XJIG16    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:3.00  
##  Median :3.000   Median :3.000   Median :0.000   Median :3.00  
##  Mean   :1.832   Mean   :2.382   Mean   :1.119   Mean   :2.78  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.00  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.00  
##      XJIG17          XJIG18         XJIG19x        XJIG20x     
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :3.00   Median :3.000  
##  Mean   :2.184   Mean   :2.679   Mean   :2.27   Mean   :2.224  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.00   Max.   :3.000  
##     XJIG21x     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :1.283  
##  3rd Qu.:3.000  
##  Max.   :3.000
str(JiG)
## 'data.frame':    1485 obs. of  21 variables:
##  $ XJIG1  : int  3 3 3 3 3 3 3 3 0 3 ...
##  $ XJIG2  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG3  : int  3 1 3 0 3 3 3 0 0 3 ...
##  $ XJIG4  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG5  : int  3 3 3 3 3 3 3 0 3 3 ...
##  $ XJIG6  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG7  : int  3 3 3 3 3 3 3 1 3 3 ...
##  $ XJIG8  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG9  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG10 : int  3 0 0 0 3 0 3 0 0 1 ...
##  $ XJIG11 : int  3 3 0 0 3 3 3 0 3 3 ...
##  $ XJIG12 : int  3 3 3 3 3 3 3 1 3 3 ...
##  $ XJIG13 : int  3 3 3 0 3 3 3 0 0 3 ...
##  $ XJIG14 : int  3 3 3 3 3 3 3 1 3 3 ...
##  $ XJIG15 : int  3 0 0 0 3 3 3 0 0 3 ...
##  $ XJIG16 : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG17 : int  3 3 3 3 3 3 3 1 3 3 ...
##  $ XJIG18 : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ XJIG19x: int  3 3 3 0 3 3 3 0 3 3 ...
##  $ XJIG20x: int  3 3 3 3 3 3 3 1 3 3 ...
##  $ XJIG21x: int  3 3 0 0 3 3 3 0 0 3 ...

We can even open the dataset in a spreadsheet-style viewer for interaction

# View(JiG)

Excel

Excel files can be read directly with an add-on package such as RODBC (Windows only) or xlsx (cross-platform, but it requires Java).
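
With the xlsx package the call looks something like this (the file name and sheet index are placeholders):

library(xlsx)
mydataframe <- read.xlsx("mydata.xlsx", sheetIndex = 1)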

SPSS, Stata, SAS

Most statistical software packages are supported through the foreign package:

library(foreign)
mydataframe <- read.spss("mydata.sav", use.value.labels = TRUE) #SPSS
mydataframe <- read.dta("mydata.dta") #Stata
mydataframe <- read.xport("mydata.xpt") #SAS transport file
We can save the file we just downloaded as an RData file

save(JiG, file = "JiG.RData")

Or export it as a csv

write.csv(JiG, file = "JiG.csv")

If you are going to continue using R, I recommend keeping files in RData format; it's faster to read and smaller on disk.

file.info(c("JiG.csv", "JiG.RData"))
##            size isdir mode               mtime               ctime
## JiG.csv   73330 FALSE  666 2015-06-24 14:00:56 2015-06-23 16:55:48
## JiG.RData  7964 FALSE  666 2015-06-24 14:00:56 2015-06-23 16:55:48
##                         atime exe
## JiG.csv   2015-06-23 16:55:48  no
## JiG.RData 2015-06-23 16:55:48  no
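
To get the data back in a later session, load() recreates the object under its original name:

load("JiG.RData") #the JiG data frame reappears in the workspace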

Using knitR

For your labs and for creating beautiful reports you will be writing a syntax file that can be run in its entirety to give all the answers. It should also include comments like this specifying what question the next block of syntax is designed to answer. Once you are done with your syntax file you will run it with knitr. In RStudio you run knitr through File -> Knit and select HTML notebook. Later we will go over how to use knitr to make pretty reports.

 

Now that you have completed Lesson 1 why not give your new skills a test?

Lab 1: https://docs.google.com/document/d/1BhOOOHf3-PrFurB3ZbuLb70zpFtc_8hYKITnLN_It7E/edit?usp=sharing

Answers: https://drive.google.com/file/d/0BzzRhb-koTrLZHRIM25QRTdZSFU/view?usp=sharing

If you are having issues with the answers try downloading them and opening them in your browser of choice.

A Brief Guide to dplyr

dplyr and all of the packages from the Wickham-verse (ggplot2, reshape2, tidyr, ggvis, etc.) have rapidly become essential to the way I visualize my data and construct my syntax. I spent most of a three-hour class period going over the fine points of how to use R (stay tuned, it will be posted eventually), but my students thought it would be helpful to have a briefer guide to the various functions. So here it is!

The dataset used is about 17,700 cases sampled from a larger IMDB and Rotten Tomatoes dataset. You can find the data used in this example here: http://degovx.eurybia.feralhosting.com/moviescleaned.RData

#############
#   dplyr   #
# Functions #
#############

require("dplyr")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("E:/Rcourse")
load("moviesclean.RData")

options(digits = 2)

# dplyr can modify any aspect of a dataframe as well as present the dataframe
# in the table format
# 
# table format
movies <- tbl_df(movies)
movies
## Source: local data frame [17,751 x 24]
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)
# To modify dataframes you can use a variety of commands
# Within each command (verb) you can use modifiers (adverbs)
# 
# dplyr commands have the form of VERB(DATA, ADVERBS, OPTIONS)

# Below are the main verbs and their adverbs
# VERB: Select which returns a subset of the columns
# ADVERBS:
# starts_with("X")
# ends_with("X")
# contains("X")
# matches("X")
# num_range("X", 1:5, width = 2) selects X01, X02, X03, X04, X05
# You can also prefix any of these with "-" to drop columns instead of keeping them
# (a hypothetical example follows the select() call below).
select(movies, Title, starts_with("Genre"), contains("rating"))
## Source: local data frame [17,751 x 9]
## 
##                         Title   Genre_1   Genre_2 Genre_3 imdbRating
## 1     The Great Train Robbery     Short   Western      NA        7.4
## 2       Juve Against Fantomas     Crime     Drama      NA        6.6
## 3                     Cabiria Adventure     Drama History        6.5
## 4  Tillie's Punctured Romance    Comedy        NA      NA        7.3
## 5                Regeneration Biography     Crime   Drama        6.8
## 6                Les vampires    Action Adventure   Crime        6.7
## 7                      Mickey    Comedy     Drama      NA        7.5
## 8                   J'accuse!    Horror       War      NA        7.0
## 9            True Heart Susie    Comedy     Drama Romance        7.1
## 10    Dr. Jekyll and Mr. Hyde     Drama    Horror  Sci-Fi        7.1
## ..                        ...       ...       ...     ...        ...
## Variables not shown: RTomRating (dbl), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), RTomRatingCatagory (fctr)
# Here we select Title by its exact name, then every column that starts with
# "Genre", and every column whose name contains "rating" (contains() ignores
# case by default, which is why imdbRating and RTomRating both match).
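# You could also use "-" to drop columns; for example, a call like
# select(movies, -starts_with("Language")) would return everything except the
# Language columns (not run here).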


# VERB: Filter which returns a subset of the rows
# ADVERBS: 
# Any base R comparison or logical expression that returns TRUE/FALSE
# For example
# x < y, x > y, x <= y, x >= y, x == y, x != y
# and the boolean operators
# !, &, and |
# R also has a special operator, x %in% [vector]
# (a hypothetical %in% example follows the filter() call below).
filter(movies, Genre_1 == "Drama" | Genre_1 == "Comedy", 
       !(Language_01 == "English"), Runtime > 60, imdbRating %in% c(1,2,3,4,5,6,7,8,9))
## Source: local data frame [231 x 24]
## 
##       X               Title Year Runtime   Released imdbRating imdbVotes
## 1    29 Battleship Potemkin 1925      66 1925-12-24          8     34093
## 2    40               Faust 1926      85 1926-12-06          8      8753
## 3   100         Miss Europe 1930      93 1930-08-01          7       380
## 4   131      The Blue Light 1932      85 1934-05-08          7       659
## 5   898        Early Summer 1951     124 1972-08-02          8      3539
## 6  1050         I Vitelloni 1953     104 1956-11-07          8      8478
## 7  1076    A Lesson in Love 1954      96 1960-03-14          7      1273
## 8  1154               Ordet 1955     126 1955-01-10          8      8260
## 9  1171     Street of Shame 1956      87 1959-06-04          8      1758
## 10 1185    The Burmese Harp 1956     116 1967-04-28          8      3340
## ..  ...                 ...  ...     ...        ...        ...       ...
## Variables not shown: RTomRating (dbl), Fresh (int), Rotten (int),
##   RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2
##   (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02 (fctr),
##   Language_03 (fctr), Country_01 (fctr), Country_02 (fctr), weighted
##   (dbl), RTomRatingCatagory (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we find rows where Genre_1 is Drama OR Comedy (the | makes it an or statement), AND 
# (the default behavior is that a comma indicates and), Language_01 does NOT equal English 
# (the ! inverts the statement), AND Runtime is greater than 60, AND finally that imdbRating
# matches the numbers 1,2,3,4,5,6,7,8, or 9 (essentially whole numbers only).
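# The %in% operator also accepts text; a call like
# filter(movies, Genre_1 %in% c("Drama", "Comedy")) would keep the same two
# genres as the | statement above (not run here).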


# VERB: Summarize which reduces each group to a single row by calculating aggregate measures
# ADVERBS: 
# first(x) The first element of vector x
# last(x) The last element of vector x
# nth(x, n) The nth element of vector x
# n() The number of rows in the data.frame or group of observations that summarise() describes
# n_distinct(x) The number of unique values in vector x
# And any math or statistic function that can be used as an aggregator of data
# Adverbs have the form desired.name = adverb
summary.movies <- summarize(movies, First.Title = first(Title), Last.Title = last(Title), Middle.Title = nth(Title, 8875),
          Total.Titles = n(), Distinct.Genres = n_distinct(Genre_1), Average.Rating = mean(imdbRating),
          Best.Rating = max(imdbRating))
print.data.frame(summary.movies)
##               First.Title  Last.Title
## 1 The Great Train Robbery Citizenfour
##                                     Middle.Title Total.Titles
## 1 Escape to Life: The Erika and Klaus Mann Story        17751
##   Distinct.Genres Average.Rating Best.Rating
## 1              23            6.5         9.4
# Here we summarized our dataset by finding the first title in the dataframe and
# the last title. Then we found the title in the 8875th place (roughly the
# middle), the total number of rows (which is also the total number of titles),
# the number of distinct genres, the average rating, and the maximum rating.


# VERB: Arrange which reorders the rows according to single or multiple variables
# ADVERB:
# DESC which inverts the order
arrange(movies, desc(imdbRating), desc(RTomRating), Title)
## Source: local data frame [17,751 x 24]
## 
##        X                                         Title Year Runtime
## 1  12550                                  Interstellar 2014     169
## 2   5951                      The Shawshank Redemption 1994     142
## 3   2468                                 The Godfather 1972     175
## 4   2655                        The Godfather: Part II 1974     200
## 5   5927                                  Pulp Fiction 1994     154
## 6   1888                The Good, the Bad and the Ugly 1966     161
## 7  11783                               The Dark Knight 2008     152
## 8   1232                                  12 Angry Men 1957      96
## 9   5657                              Schindler's List 1993     195
## 10  7710 The Lord of the Rings: The Return of the King 2003     201
## ..   ...                                           ...  ...     ...
## Variables not shown: Released (date), imdbRating (dbl), imdbVotes (int),
##   RTomRating (dbl), Fresh (int), Rotten (int), RTomUserRating (dbl),
##   imdbRatingCatagory (fctr), Genre_1 (fctr), Genre_2 (fctr), Genre_3
##   (fctr), Language_01 (fctr), Language_02 (fctr), Language_03 (fctr),
##   Country_01 (fctr), Country_02 (fctr), weighted (dbl), RTomRatingCatagory
##   (fctr), Director_01 (fctr), Director_02 (fctr)
# Here we rearranged our data (arrange always returns the complete dataframe,
# not just the sorting columns) in descending order of imdbRating. Ties were
# broken by RTomRating (also descending), and any remaining ties were broken
# by Title.


# VERB: Mutate which adds columns computed from existing data
# ADVERBS: Any mathematical or
# statistical function (including user-created functions) that can be performed
# on a row. Any variable created can be used in subsequent calculations.
composite <- mutate(movies, composite = (imdbRating + RTomRating) / 2,
       avg.composite = mean(composite), deviation = composite - avg.composite)
select(composite, composite, avg.composite, deviation)
## Source: local data frame [17,751 x 3]
## 
##    composite avg.composite deviation
## 1        7.5           6.3     1.226
## 2        7.5           6.3     1.276
## 3        7.4           6.3     1.126
## 4        6.8           6.3     0.576
## 5        8.0           6.3     1.726
## 6        7.8           6.3     1.476
## 7        6.3           6.3     0.026
## 8        7.2           6.3     0.926
## 9        6.2           6.3    -0.024
## 10       7.4           6.3     1.176
## ..       ...           ...       ...
# Here we create a composite variable which is the mean of IMDB and Rotten
# Tomatoes ratings. Then we found the mean of those composite scores. Finally,
# we created a deviation from the mean based on the composite - the mean. We
# then displayed only those columns with select.


# VERB: group_by() which creates metadata groups that summarize will use
# to give breakdowns. Multiple groups can be specified in the group_by procedure
# ADVERBS: NONE

#Notice how the only change is that there is now a "Groups:" entry at the top.
group_by(movies, Genre_1)
## Source: local data frame [17,751 x 24]
## Groups: Genre_1
## 
##     X                      Title Year Runtime   Released imdbRating
## 1   1    The Great Train Robbery 1903      11 1903-12-01        7.4
## 2   2      Juve Against Fantomas 1913      61 1913-10-02        6.6
## 3   3                    Cabiria 1914     148 1914-06-01        6.5
## 4   4 Tillie's Punctured Romance 1914      82 1914-12-21        7.3
## 5   5               Regeneration 1915      72 1915-09-13        6.8
## 6   6               Les vampires 1915     399 1916-11-23        6.7
## 7   7                     Mickey 1918      93 1918-08-01        7.5
## 8   8                  J'accuse! 1919     166 1919-04-25        7.0
## 9   9           True Heart Susie 1919      87 1919-06-01        7.1
## 10 10    Dr. Jekyll and Mr. Hyde 1920      49 1920-04-01        7.1
## .. ..                        ...  ...     ...        ...        ...
## Variables not shown: imdbVotes (int), RTomRating (dbl), Fresh (int),
##   Rotten (int), RTomUserRating (dbl), imdbRatingCatagory (fctr), Genre_1
##   (fctr), Genre_2 (fctr), Genre_3 (fctr), Language_01 (fctr), Language_02
##   (fctr), Language_03 (fctr), Country_01 (fctr), Country_02 (fctr),
##   weighted (dbl), RTomRatingCatagory (fctr), Director_01 (fctr),
##   Director_02 (fctr)
#The real change is when you run summarize
grouped <- group_by(movies, Genre_1)

summary.movies <- summarize(grouped, First.Title = first(Title), Last.Title = last(Title),
          Total.Titles = n(), Average.Rating = mean(imdbRating), Best.Rating = max(imdbRating))

print.data.frame(summary.movies)
##        Genre_1                        First.Title
## 1       Action                       Les vampires
## 2        Adult                           Caligula
## 3    Adventure                            Cabiria
## 4    Animation                 Gulliver's Travels
## 5    Biography                       Regeneration
## 6       Comedy         Tillie's Punctured Romance
## 7        Crime              Juve Against Fantomas
## 8  Documentary H„xan: Witchcraft Through the Ages
## 9        Drama            Dr. Jekyll and Mr. Hyde
## 10      Family                             Skippy
## 11     Fantasy                            Destiny
## 12   Film-Noir                              Laura
## 13     History                      Western Union
## 14      Horror                          J'accuse!
## 15       Music                  One Night of Love
## 16     Musical                The Broadway Melody
## 17     Mystery             The Kennel Murder Case
## 18     Romance                        Easy Virtue
## 19      Sci-Fi                     The Devil-Doll
## 20       Short            The Great Train Robbery
## 21    Thriller                           Sabotage
## 22         War                      The Way Ahead
## 23     Western                     The Iron Horse
##                        Last.Title Total.Titles Average.Rating Best.Rating
## 1          I Am a Knife with Legs         1889            6.2         9.0
## 2                      Destricted            3            4.7         5.2
## 3                         Pirates          687            6.4         9.4
## 4  Thunder and the House of Magic          472            6.7         8.6
## 5                  The Golden Era          619            7.0         8.9
## 6                   Force Majeure         4488            6.3         8.6
## 7                   The Blue Room         1142            6.7         9.3
## 8                     Citizenfour         2119            7.2         8.9
## 9                      But Always         4757            6.7         8.9
## 10               Teen Beach Movie           56            6.0         8.2
## 11 Painted Skin: The Resurrection           80            6.0         8.0
## 12              I Bury the Living           14            7.3         8.4
## 13                        Phantom            5            6.7         7.1
## 14       The Houses October Built          839            5.7         8.6
## 15                     Tamla Rose           10            6.4         8.4
## 16           Peaches Does Herself           48            6.5         7.8
## 17                    Frequencies          108            6.7         8.6
## 18                Still the Water           74            6.5         8.1
## 19                     The Signal           81            5.9         8.2
## 20                        Hellion           34            6.7         8.4
## 21                     Heatstroke          149            6.1         8.0
## 22                Dark Blue World            9            7.0         7.5
## 23               Django Unchained           68            6.9         9.0
# Here we have the same summarized data as before (with the exceptions of
# Distinct.Genres, since there will only be 1 per group, and the middle movie,
# since that will change by category). The only difference is that now the
# displayed data has a row for every unique genre. We can use this to compute
# aggregate statistics.


# The final piece of the dplyr package is imported from the magrittr package
# and is not a verb at all. Instead it is an operator that pipes the result of
# one command into the first argument of the next command. The operator is the
# pipe %>%. It works by joining the two sides of an expression with the words
# "and then": LEFTHANDSIDE %>% (AND THEN) RIGHTHANDSIDE
# 
# You can enable the pipe by loading dplyr or by using the original package, magrittr.
# require("magrittr")
#
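# As a quick illustration of the pipe (not part of the example below):
# movies %>% summarize(N = n()) is equivalent to summarize(movies, N = n())
#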
# We can use everything we learned to explore our dataset for interesting
# results. Let's look at feature-length (90+ minute) movies released between
# 1950 and 2000. We want to see a summary of movies by Genre. The summary
# should include the number of movies in that category and the mean composite
# score. The composite score should be the sum of imdbRating and RTomRating,
# sorted by composite score (highest first).
movies %>%
  group_by(Genre_1) %>%
    filter(Released > as.Date("1950-01-01") & Released < as.Date("2000-01-01"), Runtime >= 90) %>%
      mutate(composite = imdbRating + RTomRating) %>%
       summarize(N = n(), Composite = mean(composite)) %>%
        arrange(desc(Composite))
## Source: local data frame [22 x 3]
## 
##        Genre_1    N Composite
## 1  Documentary  110        15
## 2          War    5        14
## 3    Film-Noir    2        14
## 4      Mystery   44        14
## 5    Biography  246        14
## 6      Western   52        14
## 7        Drama 1607        14
## 8    Animation   40        14
## 9        Crime  429        13
## 10       Music    1        13
## ..         ...  ...       ...
# Here we take the movies dataset. Then we use the pipe to put it into the
# first argument of the group_by verb. This imports the data. Then we assign
# the group Genre_1, take that new data, and apply the filter statement. This
# reduces our data by selecting only the rows we desire (released between 1950
# and 2000 and at least 90 minutes long). We take that reduced number of rows
# and we add a composite
# of the imdbRating and RTomRating. After the composite is created we take that
# data and import it into the summarize function where we compute N and the
# composite score. Last, we import that data into arrange where we sort it by
# composite.


Teaching R

I have been learning and using R in my research for a number of years now, which has been a lot of fun! On the downside, most of the education undergraduates receive is in SPSS, and graduate students will either continue that trend or move on to Stata or SAS. I consider myself proficient in the use of SAS and SPSS, but I would much rather use R. For one, the ability to do all my analyses in one place is very appealing: I can pull in data, clean it, conduct item response theory analyses, exploratory or confirmatory factor analysis, multilevel modeling using IRT theta scores, and so on, a process that might take two or three programs with any other software. Two, it allows easy work across my work machine (Windows) and my home machine (Linux). Three, it's open and free, so I don't have to wait for my university to get around to updating the license before I can do my work.

In an effort to have a few more students able to use R, I proposed a course in R programming. I was amazed at the overwhelmingly positive reactions I received from my colleagues. I created the course and was again amazed when it reached capacity within two days of listing. Although I am not done with this semester yet (barely halfway, unfortunately), I have learned a lot about what I know and what I can do with R (usually how much I don't know). From the questions students ask to the clever ways they find to make the lab assignments easy, I have learned more this semester than in years of steady use.

Stay tuned; I will be posting my notes so others can benefit!

Automating the Accept and Reject process in MTurk with R

I love spending a ton of time making an R program to accomplish something that I could do manually in 10 minutes or have an undergraduate do for essentially free!

See relevant XKCD.

I hope someone can find this bit of code useful.

Directions:
Go to MTurk and download your worker results csv file.
Then download your survey data. I use Qualtrics. The script assumes you have a two-line header, with variable names on line 1 and descriptions of those variables on line 2. If you have a one-line header you will need to tweak the code a bit.

The way I structure my MTurk data collection is a field in the HIT asking for a survey key. I honestly don’t bother too much with generating one for every participant. I find it doesn’t add much. What I do instead is have the worker paste a static code into the box. On the survey side I have participants go to their dashboard and copy/paste their Worker ID into my survey on the last page.

The syntax gathers this static key and checks whether it was entered. Then it checks whether there was a substantial amount of missing data. It also checks for the correct answers to attention check items and compares them to a threshold. Last, it looks for people who have IDs in the MTurk file but not in the survey (the people who find your static key on a website and enter it). Failure to meet any of these conditions results in a rejection. The rejection messages are coded to be specific to the type of rejection so feedback is customized.

To run it you will need to have the dplyr and psych packages installed: dplyr for general awesomeness and the pipe command, and psych for the ease of scoring multiple choice tests (used for the attention check items). If one or more attention check items have multiple correct responses I recommend recoding them into a binary right/wrong format beforehand.
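
A minimal sketch of that recoding; the responses and the set of correct answers below are hypothetical:

responses <- c(2, 1, 4, 3, 2) #raw responses to one hypothetical attention check item
as.integer(responses %in% c(2, 4)) #1 = right, 0 = wrong; here 2 and 4 both count as correct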

#Scott Withrow
#2015-02-12

# Match Worker IDs from the survey and MTurk
# Rejects people that fail attention check items

# In your survey and MTurk data, name the worker ID variable "WorkerId"
# MTurk should automatically call your survey key Answer.surveycode, but make sure that is true

# Fill in the Values Below ----------------------------------------------------
#Key provided for "completed survey"
surveykey <- "yUacuEugjBohzK1OMrqmU7o6P"

#The number of acceptable NAs in a respondent.
acceptablena = 10

#List of attention check items. Set to NA if none.
attncheck = c("IPIP_30", "NA_2")

#List of correct answers to the attncheck items.
attncheck.correct = c(2, 3)

#Number of attention check items that can be failed
attnfail = 2

#Working Directory (where the datafiles are)
wrkdir <- "E:/data/GEC/"

#Names of the datafiles
survey.data <- "GEC_Validity_Study.csv"
mturk.data <- "mturk.csv"

# The program begins! ----------------------------------------------------------
#Setup
setwd(wrkdir)
require("dplyr")
require("psych")

#Read in that funky Qualtrics csv output.
header <- scan(survey.data, nlines = 1, what = character(), sep=",")
survey <- read.table(survey.data, skip = 2, stringsAsFactors = FALSE, fill = TRUE, header = FALSE, sep = ",")
names(survey) <- header
survey <- tbl_df(survey)
#survey

#Read in the Amazon.com worker output file.
mturk <- tbl_df(read.table(mturk.data, stringsAsFactors = FALSE, fill = TRUE, header = TRUE, sep = ","))
#mturk

#Returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
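#For example: trim("  ABC123  ") returns "ABC123"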

#Clean up the Worker ID string so we can do a validation check.
survey$WorkerId <- trim(survey$WorkerId)

#Clean up the survey key code field for validation checks.
mturk$Answer.surveycode <- trim(mturk$Answer.surveycode)

#Identify people whose ID's are not in the survey and reject them
bad.id <- !(mturk$WorkerId %in% intersect(survey$WorkerId, mturk$WorkerId))
mturk$Reject[bad.id] <- "Your Worker ID was not found in the survey and your survey key code could not be authenticated. Sorry."

#Reject people that don't have a matching survey code.
mturk$Reject[mturk$Answer.surveycode != surveykey] <- "The survey key code entered into the HIT did not match the code provided by the survey or was left blank. Sorry."

#Reject people that have too many NAs
survey$numNA <- rowSums(is.na(survey)) #count the missing responses for each respondent
survey$Reject[survey$numNA > acceptablena] <- "The survey submitted contains more than an acceptable minimum number of missing values and is not considered a completed HIT. Sorry."

#Reject people that failed the attention check items.
survey.attncheck <- tbl_df(data.frame(score.multiple.choice(attncheck.correct, survey[attncheck], score = FALSE)))
survey$attntotal <- rowSums(survey.attncheck) #number of attention checks passed
survey$Reject[(length(attncheck) - survey$attntotal) > attnfail] <- "The survey submitted contains more than an acceptable minimum number of failed attention checks and is not considered a completed HIT. Sorry."

#Create a smaller dataset containing the rejected people from the survey
survey.reject <- filter(survey, !is.na(Reject)) %>%
  select(Reject, WorkerId)

#Merge the datasets
merged <- merge(mturk, survey.reject, by="WorkerId", all.x = TRUE)

#Merge the reject columns
merged <- within(merged, {
  Reject <- ifelse(is.na(Reject.x), Reject.y, Reject.x)
})

#Approve everyone that didn't get rejected
merged$Approve[is.na(merged$Reject)] <- "x"

#Drop the extra columns
merged <- select(merged, -(Reject.x), -(Reject.y))

#Rejected Participants
Rejected <- filter(merged, !is.na(Reject)) %>% select(WorkerId, Reject, Answer.surveycode, AssignmentId)

#Approved Participants
Approved <- filter(merged, is.na(Reject)) %>% select(WorkerId, AssignmentId)

#Save as a csv for uploading to mturk
write.csv(merged, file = "Upload_Mturk.csv", row.names = FALSE, na = "")

#Save a rejected csv for double checking
write.csv(Rejected, file = "Rejected_Participants.csv", row.names = FALSE, na = "")

#Save an approved csv
write.csv(Approved, file = "Approved_Participants.csv", row.names = FALSE, na = "")