Last week we learned about R:
In data analysis and model fitting, we often have missing values. NA
represents missing values and NaN
means “not a number”, which is a special type of missing value.
> m <- matrix(nrow=3, ncol=3)
> m
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
> 0/1
[1] 0
> 1/0
[1] Inf
> 0/0
[1] NaN
NULL
NULL
is a special type of reserved value in R.
> x <- vector(mode="list", length=3)
> x
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
We saw earlier that when we mixed classes in a vector they were all coerced to be of type character:
> c("a", 1:3, TRUE, FALSE)
[1] "a" "1" "2" "3" "TRUE" "FALSE"
You can directly apply coercion with functions as.numeric()
, as.character()
, as.logical()
, etc.
This doesn’t always work out well:
> x <- 1:3
> as.character(x)
[1] "1" "2" "3"
>
> y <- c("a", "b", "c")
> as.numeric(y)
Warning: NAs introduced by coercion
[1] NA NA NA
Lists allow you to hold different classes of objects in one variable.
> x <- list(1:3, "a", c(TRUE, FALSE))
> x
[[1]]
[1] 1 2 3
[[2]]
[1] "a"
[[3]]
[1] TRUE FALSE
>
> # access any element of the list
> x[[2]]
[1] "a"
> x[[3]][2]
[1] FALSE
The elements of a list can be given names.
> x <- list(counting=1:3, char="a", logic=c(TRUE, FALSE))
> x
$counting
[1] 1 2 3
$char
[1] "a"
$logic
[1] TRUE FALSE
>
> # access any element of the list
> x$char
[1] "a"
> x$logic[2]
[1] FALSE
The data frame is one of the most important objects in R. Data sets very often come in tabular form of mixed classes, and data frames are constructed exactly for this.
Data frames are lists where each element has the same length.
> df <- data.frame(counting=1:3, char=c("a", "b", "c"),
+ logic=c(TRUE, FALSE, TRUE))
> df
counting char logic
1 1 a TRUE
2 2 b FALSE
3 3 c TRUE
>
> nrow(df)
[1] 3
> ncol(df)
[1] 3
> dim(df)
[1] 3 3
>
> names(df)
[1] "counting" "char" "logic"
>
> attributes(df)
$names
[1] "counting" "char" "logic"
$row.names
[1] 1 2 3
$class
[1] "data.frame"
Attributes give information (or meta-data) about R objects. The previous slide shows attributes(df)
, the attributes of the data frame df
.
> x <- 1:3
> attributes(x) # no attributes for a standard vector
NULL
>
> m <- matrix(1:6, nrow=2, ncol=3)
> attributes(m)
$dim
[1] 2 3
> paint <- factor(c("red", "white", "blue", "blue", "red",
+ "red"))
> attributes(paint)
$levels
[1] "blue" "red" "white"
$class
[1] "factor"
Names can be assigned to columns and rows of vectors, matrices, and data frames. This makes your code easier to write and read.
> names(x) <- c("Princeton", "Rutgers", "Penn")
> x
Princeton Rutgers Penn
1 2 3
>
> colnames(m) <- c("NJ", "NY", "PA")
> rownames(m) <- c("East", "West")
> m
NJ NY PA
East 1 3 5
West 2 4 6
> colnames(m)
[1] "NJ" "NY" "PA"
Displaying or assigning names to these three types of objects does not have consistent syntax.
Object | Column Names | Row Names |
---|---|---|
vector | names() |
N/A |
data frame | names() |
row.names() |
data frame | colnames() |
rownames() |
matrix | colnames() |
rownames() |
Reproducibility involves being able to recalculate the exact numbers in a data analysis using the code and raw data provided by the analyst.
Reproducibility is often difficult to achieve and has slowed down the discovery of important data analytic errors.
Reproducibility should not be confused with “correctness” of a data analysis. A data analysis can be fully reproducible and recreate all numbers in an analysis and still be misleading or incorrect.
Taken from Elements of Data Analytic Style
Reproducible research is often used these days to indicate the ability to recalculate the exact numbers in a data analysis
Replicable research results often refers to the ability to independently carry out a study (thereby collecting new data) and coming to equivalent conclusions as the original study
These two terms are often confused, so it is important to clearly state the definition
Use a data analysis script – e.g., R Markdown (discussed next section!) or iPython Notebooks
Record versions of software and paramaters – e.g., use sessionInfo()
in R as in project_1.Rmd
Organize your data analysis
Use version control – e.g., GitHub
Set a random number generator seed – e.g., use set.seed()
in R
Have someone else run your analysis
R Markdown was developed by the RStudio team to allow one to write reproducible research documents using Markdown and knitr
. This is contained in the rmarkdown
package, but can easily be carried out in RStudio.
Markdown was originally developed as a very simply text-to-html conversion tool. With Pandoc, Markdown is a very simply text-to-X
conversion tool where X
can be many different formats: html, LaTeX, PDF, Word, etc.
R Markdown documents begin with a metadata section, the YAML header, that can include information on the title, author, and date as well as options for customizing output.
---
title: "SML 201 -- Project 1"
author: "Your Name"
date: February 8, 2016
output: pdf_document
---
Many options are available. See http://rmarkdown.rstudio.com for full documentation.
Headers:
# Header 1
## Header 2
### Header 3
Emphasis:
*italic* **bold**
_italic_ __bold__
Tables:
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
Unordered list:
- Item 1
- Item 2
- Item 2a
- Item 2b
Ordered list:
1. Item 1
2. Item 2
3. Item 3
- Item 3a
- Item 3b
Links:
http://example.com
[linked phrase](http://example.com)
Blockquotes:
Florence Nightingale once said:
> For the sick it is important
> to have the best.
Plain code blocks:
```
This text is displayed verbatim with no formatting.
```
Inline Code:
We use the `print()` function to print the contents
of a variable in R.
Additional documentation and examples can be found here and here.
The knitr
R package allows one to execute R code within a document, and to display the code itself and its output (if desired). This is particularly easy to do in the R Markdown setting. For example…
Placing the following text in an R Markdown file
The sum of 2 and 2 is `r 2+2`.
produces in the output file
The sum of 2 and 2 is 4.
Chunks of R code separated from the text. In R Markdown:
```{r}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
[1] 3
> print(x)
[1] 2
echo
In R Markdown:
```{r, echo=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
[1] 3
[1] 2
results
In R Markdown:
```{r, results="hide"}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
> print(x)
include
In R Markdown:
```{r, include=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
(nothing)
eval
In R Markdown:
```{r, eval=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
> print(x)
Naming your chunks can be useful for identifying them in your file and during the execution, and also to denote dependencies among chunks.
```{r my_first_chunk}
x <- 2
x + 1
print(x)
```
cache
Sometimes you don’t want to run chunks over and over, especially for large calculations. You can “cache” them.
```{r chunk1, cache=TRUE, include=FALSE}
x <- 2
```
```{r chunk2, cache=TRUE, dependson="chunk1"}
y <- 3
z <- x + y
```
This creates a directory called cache
in your working directory that stores the objects created or modified in these chunks. When chunk1
is modified, it is re-run. Since chunk2
depends on chunk1
, it will also be re-run.
You can add chunk options regarding the placement and size of figures. Examples include:
fig.width
fig.height
fig.align
If you will be using the same options on most chunks, you can set default options for the entire document. Run something like this at the beginning of your document with your desired chunk options.
```{r my_opts, cache=FALSE, echo=FALSE}
library("knitr")
opts_chunk$set(fig.align="center", fig.height=4, fig.width=6)
```
Control structures in R allow you to control the flow of execution of a series of R expressions
They allow you to put some logic into your R code, rather than just always executing the same R code every time
Control structures also allow you to respond to inputs or to features of the data and execute different R expressions accordingly
Paraphrased from R Programming for Data Science
if
and else
: testing a condition and acting on itfor
: execute a loop a fixed number of timeswhile
: execute a loop while a condition is truerepeat
: execute an infinite loop (must break out of it to stop)break
: break the execution of a loopnext
: skip an interation of a loopFrom R Programming for Data Science
R has built-in functions that produce TRUE
or FALSE
such as is.vector
or is.na
. You can also do the following:
x == y
: does x equal y?x > y
: is x greater than y? (also <
less than)x >= y
: is x greater than or equal to y?x && y
: are both x and y true?x || y
: is either x or y true?!is.vector(x)
: this is TRUE
if x is not a vectorif
Idea:
if(<condition>) {
## do something
}
## Continue with rest of code
Example:
> x <- c(1,2,3)
> if(is.numeric(x)) {
+ x+2
+ }
[1] 3 4 5
if
-else
Idea:
if(<condition>) {
## do something
}
else {
## do something else
}
Example:
> x <- c("a", "b", "c")
> if(is.numeric(x)) {
+ print(x+2)
+ } else {
+ class(x)
+ }
[1] "character"
for
LoopsExample:
> for(i in 1:10) {
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for
Loops (cont’d)Examples:
> x <- c("a", "b", "c", "d")
>
> for(i in 1:4) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
>
> for(i in seq_along(x)) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
for
LoopsExample:
> m <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
>
> for(i in seq_len(nrow(m))) {
+ for(j in seq_len(ncol(m))) {
+ print(m[i,j])
+ }
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
while
Example:
> x <- 1:10
> idx <- 1
>
> while(x[idx] < 4) {
+ print(x[idx])
+ idx <- idx + 1
+ }
[1] 1
[1] 2
[1] 3
>
> idx
[1] 4
Repeats the loop until while the condition is TRUE
.
repeat
Example:
> x <- 1:10
> idx <- 1
>
> repeat {
+ print(x[idx])
+ idx <- idx + 1
+ if(idx >= 4) {
+ break
+ }
+ }
[1] 1
[1] 2
[1] 3
>
> idx
[1] 4
Repeats the loop until break
is executed.
break
and next
break
ends the loop. next
skips the rest of the current loop iteration.
Example:
> x <- 1:1000
> for(idx in 1:1000) {
+ # %% calculates division remainder
+ if((x[idx] %% 2) > 0) {
+ next
+ } else if(x[idx] > 10) { # an else-if!!
+ break
+ } else {
+ print(x[idx])
+ }
+ }
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
R is usually smart about doing calculations with vectors. Examples:
>
> x <- 1:3
> y <- 4:6
>
> 2*x # same as c(2*x[1], 2*x[2], 2*x[3])
[1] 2 4 6
> x + 1 # same as c(x[1]+1, x[2]+1, x[3]+1)
[1] 2 3 4
> x + y # same as c(x[1]+y[1], x[2]+y[2], x[3]+y[3])
[1] 5 7 9
> x*y # same as c(x[1]*y[1], x[2]*y[2], x[3]*y[3])
[1] 4 10 18
If two vectors are of different lengths, R tries to find a solution for you (and doesn’t always tell you).
> x <- 1:5
> y <- 1:2
> x+y
Warning in x + y: longer object length is not a multiple of
shorter object length
[1] 2 4 4 6 6
What happened here?
Operations on matrices are also vectorized. Example:
> x <- matrix(1:4, nrow=2, ncol=2, byrow=TRUE)
> y <- matrix(1:4, nrow=2, ncol=2)
>
> x+y
[,1] [,2]
[1,] 2 5
[2,] 5 8
>
> x*y
[,1] [,2]
[1,] 1 6
[2,] 6 16
What happens when we do calculations involving a vector and a matrix? Example:
> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:2
>
> x + z
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 6 7 8
>
> x * z
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 8 10 12
Another example:
> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:3
>
> x + z
[,1] [,2] [,3]
[1,] 2 5 5
[2,] 6 6 9
>
> x * z
[,1] [,2] [,3]
[1,] 1 6 6
[2,] 8 5 18
What happened this time?
We saw &&
and ||
applied to pairs of logical values. We can also vectorize these operations.
> a <- c(TRUE, TRUE, FALSE)
> b <- c(FALSE, TRUE, FALSE)
>
> a | b
[1] TRUE TRUE FALSE
> a & b
[1] FALSE TRUE FALSE
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.12.3 devtools_1.10.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.2.1 tools_3.2.3
[4] htmltools_0.3 revealjs_0.5.1 yaml_2.1.13
[7] memoise_1.0.0 stringi_1.0-1 rmarkdown_0.9.2
[10] highr_0.5.1 stringr_1.0.0 digest_0.6.9
[13] evaluate_0.8