The way a question is phrased can influence a person’s response. For example, Pew Research Center conducted a survey with the following question:
“As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [People who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?”
For each randomly sampled respondent, the statements in brackets were randomized: either they were kept in the order given above, or the two statements were reversed.
Table 6.2, reproduced below, shows the results of this experiment.
| 2nd Statement | Sample Size | % Approve Law | % Disapprove Law | % Other |
|---|---|---|---|---|
| “people who cannot afford it will receive financial help from the government” | 771 | 47 | 49 | 3 |
| “people who do not buy it will pay a penalty” | 732 | 34 | 63 | 3 |
Create and interpret a 90% confidence interval for the difference in approval rates. Also perform a hypothesis test of whether the approval rates are equal.
> x <- round(c(0.47*771, 0.34*732))  # approximate "approve" counts
> n <- round(c(771*0.97, 732*0.97))  # exclude the ~3% "Other" responses
> prop.test(x=x, n=n, conf.level=0.90)
2-sample test for equality of proportions with
continuity correction
data: x out of n
X-squared = 26.023, df = 1, p-value = 3.374e-07
alternative hypothesis: two.sided
90 percent confidence interval:
0.08979649 0.17670950
sample estimates:
prop 1 prop 2
0.4839572 0.3507042
The book OIS does a “by hand” calculation using the \(z\)-statistic and arrives at a similar, though not identical, answer.
> p1.hat <- 0.47
> n1 <- 771
> p2.hat <- 0.34
> n2 <- 732
> stderr <- sqrt(p1.hat*(1-p1.hat)/n1 + p2.hat*(1-p2.hat)/n2)
>
> # the 90% CI
> (p1.hat - p2.hat) + c(-1,1)*abs(qnorm(0.05))*stderr
[1] 0.08872616 0.17127384
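The “by hand” hypothesis test of equal approval rates can be completed with the same quantities defined above (p1.hat, p2.hat, n1, n2). A minimal sketch, using the pooled proportion to form the standard error under the null hypothesis; the squared \(z\)-statistic should be close to the X-squared value reported by prop.test() above, which additionally applies a continuity correction and excludes the “Other” responses.
# "by hand" z-test of H0: p1 = p2, using the pooled proportion for the null standard error
p.pool <- (p1.hat*n1 + p2.hat*n2) / (n1 + n2)
se0 <- sqrt(p.pool * (1 - p.pool) * (1/n1 + 1/n2))
z <- (p1.hat - p2.hat) / se0
z^2                 # should be close to the X-squared statistic above
2 * pnorm(-abs(z))  # two-sided p-value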
poisson.test()
> str(poisson.test)
function (x, T = 1, r = 1, alternative = c("two.sided",
"less", "greater"), conf.level = 0.95)
From the help:
Arguments
x number of events. A vector of length one or two.
T time base for event count. A vector of length one or two.
r hypothesized rate or rate ratio
alternative indicates the alternative hypothesis and must be one of
"two.sided", "greater" or "less". You can specify just the initial letter.
conf.level confidence level for the returned confidence interval.
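With a single count, these arguments specify a one-sample exact Poisson test. A minimal sketch with made-up numbers: 10 events observed over 2 time units, testing a hypothesized rate of 3 events per unit time.
# hypothetical one-sample example: 10 events observed over 2 time units,
# testing H0: rate = 3 events per unit time
poisson.test(x = 10, T = 2, r = 3, alternative = "two.sided", conf.level = 0.95)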
RNA-Seq gene expression was measured for p53 lung tissue in 12 healthy individuals and 14 individuals with lung cancer.
The counts were given as follows.
Healthy: 82 64 66 88 65 81 85 87 60 79 80 72
Cancer: 59 50 60 60 78 69 70 67 72 66 66 68 54 62
It is hypothesized that p53 expression is higher in healthy individuals. Test this hypothesis, and form a 99% CI.
> healthy <- c(82, 64, 66, 88, 65, 81, 85, 87, 60, 79, 80, 72)
> cancer <- c(59, 50, 60, 60, 78, 69, 70, 67, 72, 66, 66, 68,
+ 54, 62)
> poisson.test(x=c(sum(healthy), sum(cancer)), T=c(12,14),
+ conf.level=0.99)
Comparison of Poisson rates
data: c(sum(healthy), sum(cancer)) time base: c(12, 14)
count1 = 909, expected count1 = 835.38, p-value =
0.0005739
alternative hypothesis: true rate ratio is not equal to 1
99 percent confidence interval:
1.041626 1.330051
sample estimates:
rate ratio
1.177026
> poisson.test(x=c(sum(healthy), sum(cancer)), T=c(12,14),
+ alternative="less", conf.level=0.99)
Comparison of Poisson rates
data: c(sum(healthy), sum(cancer)) time base: c(12, 14)
count1 = 909, expected count1 = 835.38, p-value =
0.9998
alternative hypothesis: true rate ratio is less than 1
99 percent confidence interval:
0.000000 1.314529
sample estimates:
rate ratio
1.177026
> poisson.test(x=c(sum(healthy), sum(cancer)), T=c(12,14),
+ alternative="greater", conf.level=0.99)
Comparison of Poisson rates
data: c(sum(healthy), sum(cancer)) time base: c(12, 14)
count1 = 909, expected count1 = 835.38, p-value =
0.0002881
alternative hypothesis: true rate ratio is greater than 1
99 percent confidence interval:
1.053921 Inf
sample estimates:
rate ratio
1.177026
Which analysis is the more informative and scientifically correct one, and why?
One of the most important goals when analyzing data is to understand how variables relate to one another.
We will consider both categorical and quantitative variables to achieve this goal, studying several approaches over the next few weeks.
> library("MASS")
> data("survey", package="MASS")
> survey <- tbl_df(survey)
> head(survey)
Source: local data frame [6 x 12]
Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer
(fctr) (dbl) (dbl) (fctr) (fctr) (int) (fctr) (fctr)
1 Female 18.5 18.0 Right R on L 92 Left Some
2 Male 19.5 20.5 Left R on L 104 Left None
3 Male 18.0 13.3 Right L on R 87 Neither None
4 Male 18.8 18.9 Right R on L NA Neither None
5 Male 20.0 20.0 Right Neither 35 Right Some
6 Female 18.0 17.7 Right L on R 64 Right Some
Variables not shown: Smoke (fctr), Height (dbl), M.I (fctr),
Age (dbl)
A contingency table:
> tbl = table(survey$Sex, survey$W.Hnd)
> tbl
Left Right
Female 7 110
Male 10 108
Let’s test the null hypothesis that sex and writing hand are independent vs. the alternative hypothesis that they are dependent.
> str(chisq.test)
function (x, y = NULL, correct = TRUE, p = rep(1/length(x),
length(x)), rescale.p = FALSE, simulate.p.value = FALSE,
B = 2000)
> chisq.test(tbl)
Pearson's Chi-squared test with Yates' continuity
correction
data: tbl
X-squared = 0.23563, df = 1, p-value = 0.6274
A \(\chi^2\) distribution with \(d\) degrees of freedom is the distribution of the sum of the squares of \(d\) independent Normal\((0,1)\) random variables:
\[ Z_1^2 + Z_2^2 + \cdots + Z_d^2 \sim \chi^2_d \]
where \(Z_1, Z_2, \ldots, Z_d\) are iid Normal\((0,1)\).
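A quick simulation can make this concrete; a minimal sketch comparing sums of \(d\) squared standard Normals to draws from rchisq():
# simulate chi-squared draws as sums of squared standard Normals
set.seed(1)
d <- 3
z.sq.sums <- replicate(10000, sum(rnorm(d)^2))
chisq.draws <- rchisq(10000, df = d)
summary(z.sq.sums)              # the two sets of draws should look very similar
summary(chisq.draws)
qqplot(z.sq.sums, chisq.draws)  # points should fall near the line y = x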
Observed counts:
> tbl
Left Right
Female 7 110
Male 10 108
Expected (under \(H_0\)) counts:
> n <- sum(tbl)
> p <- sum(tbl[1,])/n # freq Female
> q <- sum(tbl[,1])/n # freq Left
> expected <- n * matrix(c(p*q, (1-p)*q, p*(1-q), (1-p)*(1-q)),
+ nrow=2)
> expected
[,1] [,2]
[1,] 8.46383 108.5362
[2,] 8.53617 109.4638
The chi-squared statistic is calculated as
\[ X^2 = \sum \frac{(O - E)^2}{E} \]
where \(O\) is the observed count, \(E\) is the expected count, and the sum is taken over all cells in the table.
> X2 <- sum((tbl - expected)^2 / expected)
> X2
[1] 0.5435149
>
> chisq.test(tbl, correct=FALSE)$statistic # equals X2
X-squared
0.5435149
> chisq.test(tbl)$statistic # with continuity correction
X-squared
0.2356302
The null distribution of \(X^2\) is a \(\chi^2\) distribution with \(d\) degrees of freedom. We calculate \(d\) by \(d = (r-1)(c-1)\) where \(r\) is the number of rows and \(c\) is the number of columns.
> 1-pchisq(X2, df=1)
[1] 0.4609797
>
> chisq.test(tbl, correct=FALSE)$p.value
[1] 0.4609797
The theoretical derivation of this test is beyond the scope of this course.
However, it is worth noting that it is related to the \(Z\)-statistic approximation from last week:
\[Z = \frac{\mbox{estimator} - \mbox{parameter}}{\mbox{standard error}} \sim \mbox{Normal}(0,1).\]
When the assumptions behind this approximation are violated (for example, when some expected cell counts are small), continuity corrections and simulation-based p-values can be used, or other tests can be applied, such as Fisher’s Exact Test; see fisher.test().
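As a sketch of the latter, Fisher’s exact test can be applied directly to the 2 x 2 sex by writing-hand table from above; it conditions on the table margins rather than relying on the large-sample \(\chi^2\) approximation.
# Fisher's exact test on the 2 x 2 sex by writing-hand table from above
fisher.test(tbl)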
> tbl = table(survey$Clap, survey$W.Hnd)
> tbl
Left Right
Left 9 29
Neither 5 45
Right 4 143
Note that now one of the categorical variables takes three values.
Also note the existence of low cell counts.
> chisq.test(tbl)
Warning in chisq.test(tbl): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: tbl
X-squared = 19.252, df = 2, p-value = 6.598e-05
We address this warning by simulating tables from the null hypothesis.
> chisq.test(tbl, simulate.p.value = TRUE, B=10000)
Pearson's Chi-squared test with simulated p-value
(based on 10000 replicates)
data: tbl
X-squared = 19.252, df = NA, p-value = 9.999e-05
> tbl = table(survey$Exer, survey$W.Hnd)
> tbl
Left Right
Freq 7 107
None 3 21
Some 8 90
>
> chisq.test(tbl, simulate.p.value = TRUE, B=10000)
Pearson's Chi-squared test with simulated p-value
(based on 10000 replicates)
data: tbl
X-squared = 1.2065, df = NA, p-value = 0.5532
> tbl = table(survey$Smoke, survey$Exer)
> tbl
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
>
> chisq.test(tbl, simulate.p.value = TRUE, B=10000)
Pearson's Chi-squared test with simulated p-value
(based on 10000 replicates)
data: tbl
X-squared = 5.4885, df = NA, p-value = 0.4794
What feature of the data is this test ignoring?
The chisq.test() function also performs goodness-of-fit tests. These are goodness-of-fit tests of a set of probabilities, closely related to our tests of proportions from last week.
For example, suppose we want to test whether a six-sided die is fair. We roll the die 100 times and record the frequency with which we observe each face.
> die
die
1 2 3 4 5 6
17 14 21 28 14 6
> chisq.test(x=die, p=rep(1/6, 6))
Chi-squared test for given probabilities
data: die
X-squared = 16.52, df = 5, p-value = 0.005506
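As a side note, a table of counts like die can be produced from raw rolls with table(); a minimal sketch with simulated rolls of a fair die (these simulated counts will not match the observed counts above):
# simulate 100 rolls of a fair die and tabulate the face counts
set.seed(1)
rolls <- sample(1:6, size = 100, replace = TRUE)
table(rolls)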
It is often the case that two or more quantitative variables are measured on each unit of observation (such as an individual).
We are then often interested in characterizing how pairs of variables are associated or how they vary together.
A commonly used measure is “correlation”, which is best suited to measuring linear associations.
Suppose we observe \(n\) pairs of data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). Their sample correlation is
\begin{eqnarray}
r_{xy} & = & \frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2 \sum_{i=1}^n (y_i - \overline{y})^2}} \\
& = & \frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{(n-1) s_x s_y}
\end{eqnarray}
where \(s_x\) and \(s_y\) are the sample standard deviations of each measured variable.
There are other ways to measure correlation that are less reliant on linear trends in covariation and are also more robust to outliers.
Specifically, one can convert each measured variable to ranks by size (1 for the smallest, \(n\) for the largest) and then use a formula for correlation designed for these ranks.
One popular measure of rank-based correlation is the Spearman correlation.
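A minimal sketch of this idea with made-up numbers: when there are no ties, the Spearman correlation is simply the Pearson correlation computed on the ranks.
# with no ties, the Spearman correlation equals the Pearson correlation of the ranks
x <- c(1.2, 3.5, 2.8, 0.7, 4.1)    # made-up example values
y <- c(2.0, 6.3, 3.9, 1.1, 100)    # note the large outlier in y
cor(x, y, method = "spearman")
cor(rank(x), rank(y))               # same value
cor(x, y)                           # the Pearson correlation is sensitive to the outlier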
Suppose there are two random variables \(X\) and \(Y\). Their population correlation is
\[ \rho_{XY} = \frac{\operatorname{E}\left[(X-\operatorname{E}[X])(Y-\operatorname{E}[Y])\right]}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}} \]
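The sample correlation estimates this population quantity. A minimal sketch, assuming we simulate bivariate Normal data with a chosen \(\rho\) using MASS::mvrnorm() (MASS was loaded above):
# simulate bivariate Normal data with population correlation rho = 0.6
library("MASS")
set.seed(1)
rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # unit variances, correlation rho
xy <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
cor(xy[, 1], xy[, 2])  # should be close to 0.6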
> ggplot(data = survey, mapping=aes(x=Wr.Hnd, y=Height)) +
+ geom_point() + geom_vline(xintercept=mean(survey$Wr.Hnd, na.rm=TRUE)) +
+ geom_hline(yintercept=mean(survey$Height, na.rm=TRUE))
> str(cor)
function (x, y = NULL, use = "everything", method = c("pearson",
"kendall", "spearman"))
>
> cor(survey$Wr.Hnd, survey$Height,
+ use="pairwise.complete.obs")
[1] 0.6009909
> DF <- survey %>% dplyr::select(Wr.Hnd, Height) %>% na.omit()
> sum((DF$Wr.Hnd - mean(DF$Wr.Hnd)) *
+ (DF$Height - mean(DF$Height))) /
+ ((nrow(DF)-1) * sd(DF$Wr.Hnd) * sd(DF$Height))
[1] 0.6009909
Image from Wikipedia.
> str(cor.test)
function (x, ...)
From the help file:
Usage
cor.test(x, ...)
## Default S3 method:
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, continuity = FALSE,
...)
## S3 method for class 'formula'
cor.test(formula, data, subset, na.action, ...)
> cor.test(x=survey$Wr.Hnd, y=survey$Height)
Pearson's product-moment correlation
data: survey$Wr.Hnd and survey$Height
t = 10.792, df = 206, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5063486 0.6813271
sample estimates:
cor
0.6009909
Compare the following to the above output of cor.test().
> r <- cor(survey$Wr.Hnd, survey$Height,
+ use="pairwise.complete.obs")
> df <- sum(complete.cases(survey[,c("Wr.Hnd", "Height")]))-2
> # dplyr way to get df:
> # df <- (survey %>% select(Wr.Hnd, Height) %>%
> # na.omit() %>% nrow())-2
>
> tstat <- r/sqrt((1 - r^2)/df)
> tstat
[1] 10.79234
>
> pvalue <- 2*pt(q=-abs(tstat), df=df)
> pvalue
[1] 8.227549e-22
> ggplot(data = survey) +
+ geom_point(aes(x=Wr.Hnd, y=NW.Hnd))
> cor.test(x=survey$Wr.Hnd, y=survey$NW.Hnd)
Pearson's product-moment correlation
data: survey$Wr.Hnd and survey$NW.Hnd
t = 45.712, df = 234, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9336780 0.9597816
sample estimates:
cor
0.9483103
> library("car")
> data("Davis", package="car")
> htwt <- tbl_df(Davis)
> htwt[12,c(2,3)] <- htwt[12,c(3,2)]  # fix a data-entry error: height and weight were swapped for observation 12
> head(htwt)
Source: local data frame [6 x 5]
sex weight height repwt repht
(fctr) (int) (int) (int) (int)
1 M 77 182 77 180
2 F 58 161 51 159
3 F 53 161 54 158
4 M 68 177 70 175
5 F 59 157 59 155
6 M 76 170 76 165
> ggplot(htwt) +
+ geom_point(aes(x=height, y=weight, color=sex), size=2, alpha=0.5) +
+ scale_color_manual(values=c("red", "blue"))
> cor.test(x=htwt$height, y=htwt$weight)
Pearson's product-moment correlation
data: htwt$height and htwt$weight
t = 17.04, df = 198, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7080838 0.8218898
sample estimates:
cor
0.7710743
Recall we had to fix an error in the data, which we noticed as an outlier in the scatterplot. Here is the effect of the outlier:
> cor.test(x=Davis$height, y=Davis$weight)
Pearson's product-moment correlation
data: Davis$height and Davis$weight
t = 2.7179, df = 198, p-value = 0.007152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05228435 0.31997151
sample estimates:
cor
0.1896496
Let’s use the Spearman rank-based correlation:
> cor.test(x=Davis$height, y=Davis$weight, method="spearman")
Warning in cor.test.default(x = Davis$height, y = Davis$weight,
method = "spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: Davis$height and Davis$weight
S = 308750, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7684305
> htwt %>% filter(sex=="F") %>%
+ cor.test(~ height + weight, data = .)
Pearson's product-moment correlation
data: height and weight
t = 6.2801, df = 110, p-value = 6.922e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3627531 0.6384268
sample estimates:
cor
0.5137293
> htwt %>% filter(sex=="M") %>%
+ cor.test(~ height + weight, data = .)
Pearson's product-moment correlation
data: height and weight
t = 5.9388, df = 86, p-value = 5.922e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3718488 0.6727460
sample estimates:
cor
0.5392906
Why are the stratified correlations lower?
It is often the case that we would like to build a model that explains the variation of one variable in terms of other variables.
Least squares linear regression is one of the simplest and most useful modeling systems for doing so.
It is simple to fit, it satisfies some optimality criteria, and it is straightforward to check assumptions on the data so that statistical inference can be performed.
Let’s start with least squares linear regression of just two variables.
Suppose that we have observed \(n\) pairs of data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\).
Least squares linear regression models variation of \(y_i\) in terms of \(\beta_0 + \beta_1 x_i\) where \(\beta_0\) and \(\beta_1\) are chosen to satisfy a least squares optimization.
The least squares regression line is formed from the value of \(\beta_0\) and \(\beta_1\) that minimize:
\[\sum_{i=1}^n \left( y_i - \beta_0 - \beta_1 x_i \right)^2.\]
For a given set of data, there is a unique solution to this minimization as long as there are at least two unique values among \(x_1, x_2, \ldots, x_n\).
Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be the values that minimize this sum of squares.
These values are:
\[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}\]
\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\]
These values have a useful interpretation: the slope \(\hat{\beta}_1\) rescales the correlation by the ratio of the standard deviations, and the intercept \(\hat{\beta}_0\) guarantees that the fitted line passes through the point \((\overline{x}, \overline{y})\).
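Before applying these closed-form expressions, here is a minimal sketch showing that numerically minimizing the sum of squares with optim() recovers essentially the same values computed below; the starting values c(0, 0) and the BFGS method are arbitrary choices for this sketch.
# numerically minimize the residual sum of squares over (beta0, beta1)
rss <- function(beta, x, y) {
  sum((y - beta[1] - beta[2] * x)^2)
}
fit <- optim(par = c(0, 0), fn = rss, method = "BFGS",
             x = htwt$height, y = htwt$weight)
fit$par  # should be close to the closed-form estimates computed below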
> ggplot(data=htwt, mapping=aes(x=height, y=weight)) +
+ geom_point(size=2, alpha=0.5) +
+ geom_smooth(method="lm", se=FALSE, formula=y~x)
> beta1 <- cor(htwt$height, htwt$weight) *
+ sd(htwt$weight) / sd(htwt$height)
> beta1
[1] 1.150092
>
> beta0 <- mean(htwt$weight) - beta1 * mean(htwt$height)
> beta0
[1] -130.9104
>
> yhat <- beta0 + beta1 * htwt$height
> df <- data.frame(htwt, yhat=yhat)
> ggplot(data=df) + geom_point(aes(x=height, y=weight), size=2, alpha=0.5) +
+ geom_line(aes(x=height, y=yhat), color="blue", size=1.2)
> myfit <- lm(weight ~ height, data=htwt)
> myfit
Call:
lm(formula = weight ~ height, data = htwt)
Coefficients:
(Intercept) height
-130.91 1.15
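As a quick check (a sketch, not part of the recorded session above), the coefficients stored in the fitted object can be extracted and compared to the “by hand” values:
# extract the fitted coefficients and compare to the "by hand" values above
coef(myfit)
c(beta0, beta1)  # should agree up to printing precision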
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] car_2.1-1 MASS_7.3-45 broom_0.4.0
[4] dplyr_0.4.3 ggplot2_2.1.0 knitr_1.12.3
[7] magrittr_1.5 devtools_1.10.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 nloptr_1.0.4 formatR_1.2.1
[4] plyr_1.8.3 tools_3.2.3 digest_0.6.9
[7] lme4_1.1-11 evaluate_0.8 memoise_1.0.0
[10] gtable_0.2.0 nlme_3.1-125 lattice_0.20-33
[13] mgcv_1.8-11 Matrix_1.2-3 psych_1.5.8
[16] DBI_0.3.1 yaml_2.1.13 parallel_3.2.3
[19] SparseM_1.7 stringr_1.0.0 MatrixModels_0.4-1
[22] revealjs_0.5.1 grid_3.2.3 nnet_7.3-12
[25] R6_2.1.2 rmarkdown_0.9.5 minqa_1.2.4
[28] reshape2_1.4.1 tidyr_0.4.1 scales_0.4.0
[31] htmltools_0.3 splines_3.2.3 assertthat_0.1
[34] pbkrtest_0.4-6 mnormt_1.5-3 colorspace_1.2-6
[37] quantreg_5.21 labeling_0.3 stringi_1.0-1
[40] lazyeval_0.1.10 munsell_0.4.3