Probabilistic modeling and/or statistical inference are required in data science when the goals include generalizing beyond the observed data, such as to a larger population or to future observations.
It is possible to do data analysis without probability and formal statistical inference, but the conclusions are then limited to describing the data at hand.
A proper mathematical formulation of a probability measure should include the following properties:

1. \({\rm Pr}(S) = 1\), where \(S\) is the sample space of all possible outcomes.
2. \({\rm Pr}(A) \geq 0\) for any event \(A\).
3. If \(A_1, A_2, \ldots\) are mutually exclusive events, then \({\rm Pr}(A_1 \mbox{ or } A_2 \mbox{ or } \cdots) = {\rm Pr}(A_1) + {\rm Pr}(A_2) + \cdots\).
The probability that at least one of two events occurs is given by the following general relationship:
\[{\rm Pr}(A \mbox{ or } B) = {\rm Pr}(A) + {\rm Pr}(B) - {\rm Pr}(A \mbox{ and } B)\]
An important calculation in probability and statistics is the conditional probability: the probability of an event \(A\), given that we know event \(B\) has occurred. This is defined as:
\[{\rm Pr}(A | B) = \frac{\operatorname{Pr}(A \mbox{ and } B)}{\operatorname{Pr}(B)}\]
Two events \(A\) and \(B\) are by definition independent when:
\[{\rm Pr}(A \mbox{ and } B) = {\rm Pr}(A) \, {\rm Pr}(B)\]
\[{\rm Pr}(A | B) = {\rm Pr}(A)\]
\[{\rm Pr}(B | A) = {\rm Pr}(B)\]
All three of these are equivalent (the conditional forms require the conditioning event to have positive probability).
A common approach in statistics is to obtain the conditional probability of one event given another through the reverse conditional probability and the marginal probabilities. This is called Bayes' Theorem:
\[{\rm Pr}(B | A) = \frac{\operatorname{Pr}(A | B){\rm Pr}(B)}{\operatorname{Pr}(A)}\]
This forms the basis of Bayesian Inference but has more general use in carrying out probability calculations.
We will define a random variable (rv) to be a variable that takes values according to a probability distribution.
We define a random variable through its probability mass function (pmf) for discrete rv’s or its probability density function (pdf) for continuous rv’s.
We can also define the rv through its cumulative distribution function (cdf). The pmf/pdf determines the cdf, and vice versa.
Note: There’s a more technical and rigorous definition of a rv, but it does not affect how we will use random variables.
A discrete rv \(X\) takes on a discrete set of values such as \(S\) = {1, 2, …, \(n\)} or \(S\) = {0, 1, 2, 3, …}.
Its distribution is characterized by its pmf: \[f(x) = {\rm Pr}(X = x) \mbox{ for } x \in S.\]
Its cdf is: \[F(y) = {\rm Pr}(X \leq y) = \sum_{x \leq y} {\rm Pr}(X = x).\]
Examples:
Probability | CDF | PMF |
---|---|---|
\({\rm Pr}(X \leq b)\) | \(F(b)\) | \(\sum_{x \leq b} f(x)\) |
\({\rm Pr}(X \geq a)\) | \(1-F(a-1)\) | \(\sum_{x \geq a} f(x)\) |
\({\rm Pr}(X > a)\) | \(1-F(a)\) | \(\sum_{x > a} f(x)\) |
\({\rm Pr}(a \leq X \leq b)\) | \(F(b) - F(a-1)\) | \(\sum_{a \leq x \leq b} f(x)\) |
\({\rm Pr}(a < X \leq b)\) | \(F(b) - F(a)\) | \(\sum_{a < x \leq b} f(x)\) |
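To make the relationship between the pmf and cdf concrete, here is a minimal sketch using a small, hypothetical pmf on \(S\) = {1, 2, 3, 4} (the probabilities are made up for illustration):

> f <- c(0.1, 0.2, 0.3, 0.4)   # hypothetical pmf values f(1), ..., f(4)
> Fx <- cumsum(f)              # cdf values: Fx[y] = F(y), the running sum of the pmf
> Fx[3] - Fx[1]                # Pr(2 <= X <= 3) as F(b) - F(a-1)
[1] 0.5
> sum(f[2:3])                  # the same probability as a direct pmf sum
[1] 0.5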
A continuous rv \(X\) takes on a continuous set of values such as \(S\) = [0, \(\infty\)) or \(S\) = \(\mathbb{R}\) = (-\(\infty\), \(\infty\)).
The probability that \(X\) takes on any specific value is 0; but the probability it lies within an interval can be non-zero. Its pdf \(f(x)\) therefore gives an infinitesimal, local, relative probability.
For \(S\) = (-\(\infty\), \(\infty\)), its cdf is: \[F(y) = {\rm Pr}(X \leq y) = \int_{-\infty}^y f(x) dx.\]
When \(S\) = [0, \(\infty\)), the integral starts at 0.
Examples:
Probability | CDF | PDF |
---|---|---|
\({\rm Pr}(X \leq b)\) | \(F(b)\) | \(\int_{-\infty}^{b} f(x) dx\) |
\({\rm Pr}(X \geq a)\) | \(1-F(a)\) | \(\int_{a}^{\infty} f(x) dx\) |
\({\rm Pr}(X > a)\) | \(1-F(a)\) | \(\int_{a}^{\infty} f(x) dx\) |
\({\rm Pr}(a \leq X \leq b)\) | \(F(b) - F(a)\) | \(\int_{a}^{b} f(x) dx\) |
\({\rm Pr}(a < X \leq b)\) | \(F(b) - F(a)\) | \(\int_{a}^{b} f(x) dx\) |
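As a quick numerical check of the continuous case, the cdf difference can be compared to a numerical integral of the pdf. This sketch uses the standard Normal distribution (covered below) and R's integrate function:

> pnorm(1) - pnorm(0)                       # Pr(0 < X <= 1) via the cdf
[1] 0.3413447
> integrate(dnorm, lower=0, upper=1)$value  # the same probability via the pdf
[1] 0.3413447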
PMFs and PDFs are defined as zero outside of the sample space \(S\). That is:
\[f(x) = 0 \mbox{ for } x \not\in S\]
Also, they sum or integrate to 1:
\[\sum_{x \in S} f(x) = 1\]
\[\int_{x \in S} f(x) dx = 1\]
We earlier discussed measures of center and spread for a set of data, such as the mean and the variance.
Analogous measures exist for probability distributions.
These are distinguished by calling those on data “sample” measures (e.g., sample mean) and those on probability distributions “population” measures (e.g., population mean).
The expected value, also called the “population mean”, is a measure of center for a rv. It is calculated in a fashion analogous to the sample mean:
\[{\rm E}[X] = \sum_{x \in S} x \ f(x) \ \ \ \ \mbox{(discrete)}\]
\[{\rm E}[X] = \int_{-\infty}^{\infty} x \ f(x) \ dx \ \ \ \ \mbox{(continuous)}\]
The variance, also called the “population variance”, is a measure of spread for a rv. It is calculated in a fashion analogous to the sample variance:
\[{\rm Var}(X) = {\rm E} \left[\left(X-{\rm E}[X]\right)^2\right]; \quad \quad {\rm SD}(X) = \sqrt{\operatorname{Var}(X)}\]
\[{\rm Var}(X) = \sum_{x \in S} \left(x-{\rm E}[X]\right)^2 \ f(x) \ \ \ \ \mbox{(discrete)}\]
\[{\rm Var}(X) = \int_{-\infty}^{\infty} \left(x-{\rm E}[X]\right)^2 \ f(x) \ dx \ \ \ \ \mbox{(continuous)}\]
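These population quantities can be computed directly from a pmf. A minimal sketch for a fair six-sided die, i.e., the Uniform{1, ..., 6} distribution:

> x <- 1:6              # sample space
> f <- rep(1/6, 6)      # pmf: all outcomes equally likely
> sum(x * f)            # E[X]
[1] 3.5
> sum((x - 3.5)^2 * f)  # Var(X) = 35/12
[1] 2.916667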
The pmf/pdf, cdf, quantile function, and random number generator for many important random variables are built into R. They all follow the same naming form, where `<name>` is replaced with the name used in R for each specific distribution:

- `d<name>`: pmf or pdf
- `p<name>`: cdf
- `q<name>`: quantile function (inverse cdf)
- `r<name>`: random number generator

To see a list of the available distributions, type `?Distributions` in R.
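For example, here is the quartet for the Normal distribution (detailed below); the input values are arbitrary illustrations:

> dnorm(0)      # pdf evaluated at 0
[1] 0.3989423
> pnorm(1.96)   # cdf evaluated at 1.96
[1] 0.9750021
> qnorm(0.975)  # quantile function (inverse cdf)
[1] 1.959964
> rnorm(3)      # three random Normal(0, 1) draws (random output omitted)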
This simple rv distribution assigns equal probabilities to a finite set of values:
\[X \sim \mbox{Uniform}\{1, 2, \ldots, n\}\]
\[S = \{1, 2, \ldots, n\}\]
\[f(x) = 1/n \mbox{ for } x \in S\]
\[{\rm E}[X] = \frac{n+1}{2}, \ {\rm Var}(X) = \frac{n^2-1}{12}\]
There is no family of functions built into R for this distribution since it is so simple. However, it is possible to generate random values via the `sample` function:
> n <- 20L
> sample(x=1:n, size=10, replace=TRUE)
[1] 13 4 8 1 13 3 1 8 2 6
>
> x <- sample(x=1:n, size=1e6, replace=TRUE)
> mean(x) - (n+1)/2
[1] -0.005421
> var(x) - (n^2-1)/12
[1] -0.008098145
A single success/failure event, such as heads/tails when flipping a coin or survival/death.
\[X \sim \mbox{Bernoulli}(p)\]
\[S = \{0, 1\}\]
\[f(x) = p^x (1-p)^{1-x} \mbox{ for } x \in S\]
\[{\rm E}[X] = p, \ {\rm Var}(X) = p(1-p)\]
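Base R has no separate Bernoulli family; since Bernoulli(\(p\)) is the same as Binomial(1, \(p\)), the binomial functions (introduced next) with size=1 serve the purpose. The value of p here is an arbitrary illustration:

> p <- 0.3
> dbinom(1, size=1, prob=p)   # f(1) = p
[1] 0.3
> rbinom(10, size=1, prob=p)  # ten Bernoulli(0.3) draws (random output omitted)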
An extension of the Bernoulli distribution to \(n\) simultaneous, independent success/failure trials, counting the total number of successes.
\[X \sim \mbox{Binomial}(n, p)\]
\[S = \{0, 1, 2, \ldots, n\}\]
\[f(x) = {n \choose x} p^x (1-p)^{n-x} \mbox{ for } x \in S\]
\[{\rm E}[X] = np, \ {\rm Var}(X) = np(1-p)\]
Note that \({n \choose x} = \frac{n!}{x! (n-x)!}\) is the number of unique ways to choose \(x\) items from \(n\) without respect to order.
> str(dbinom)
function (x, size, prob, log = FALSE)
> str(pbinom)
function (q, size, prob, lower.tail = TRUE, log.p = FALSE)
> str(qbinom)
function (p, size, prob, lower.tail = TRUE, log.p = FALSE)
> str(rbinom)
function (n, size, prob)
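A small worked example: the probability of at most 8 successes out of 20 trials with \(p = 0.5\), computed both from the cdf and by summing the pmf:

> pbinom(8, size=20, prob=0.5)         # F(8)
[1] 0.2517223
> sum(dbinom(0:8, size=20, prob=0.5))  # equivalent pmf sum
[1] 0.2517223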
Models the number of occurrences of something within a defined time/space period, where the occurrences are independent. Examples: the number of lightning strikes on campus in a given year; the number of emails received on a given day.
\[X \sim \mbox{Poisson}(\lambda)\]
\[S = \{0, 1, 2, 3, \ldots \}\]
\[f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \mbox{ for } x \in S\]
\[{\rm E}[X] = \lambda, \ {\rm Var}(X) = \lambda\]
> str(dpois)
function (x, lambda, log = FALSE)
> str(ppois)
function (q, lambda, lower.tail = TRUE, log.p = FALSE)
> str(qpois)
function (p, lambda, lower.tail = TRUE, log.p = FALSE)
> str(rpois)
function (n, lambda)
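A quick check that the Poisson pmf and cdf agree, where \(\lambda = 6\) is an arbitrary choice:

> ppois(4, lambda=6)         # F(4)
[1] 0.2850565
> sum(dpois(0:4, lambda=6))  # equivalent pmf sum
[1] 0.2850565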
Models the scenario where all values in the unit interval [0,1] are equally likely.
\[X \sim \mbox{Uniform}(0,1)\]
\[S = [0,1]\]
\[f(x) = 1 \mbox{ for } x \in S\]
\[F(y) = y \mbox{ for } y \in S\]
\[{\rm E}[X] = 1/2, \ {\rm Var}(X) = 1/12\]
> str(dunif)
function (x, min = 0, max = 1, log = FALSE)
> str(punif)
function (q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
> str(qunif)
function (p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
> str(runif)
function (n, min = 0, max = 1)
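The identity \(F(y) = y\) is easy to verify numerically; the quantile function is likewise the identity on [0, 1] (the input value is arbitrary):

> punif(0.37)   # F(0.37) = 0.37
[1] 0.37
> qunif(0.37)   # the inverse cdf returns the same point
[1] 0.37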
Models a time to failure and has the “memoryless property”: \({\rm Pr}(X > s + t \, | \, X > s) = {\rm Pr}(X > t)\), illustrated numerically below.
\[X \sim \mbox{Exponential}(\lambda)\]
\[S = [0, \infty)\]
\[f(x) = \lambda e^{-\lambda x} \mbox{ for } x \in S\]
\[F(y) = 1 - e^{-\lambda y} \mbox{ for } y \in S\]
\[{\rm E}[X] = \frac{1}{\lambda}, \ {\rm Var}(X) = \frac{1}{\lambda^2}\]
> str(dexp)
function (x, rate = 1, log = FALSE)
> str(pexp)
function (q, rate = 1, lower.tail = TRUE, log.p = FALSE)
> str(qexp)
function (p, rate = 1, lower.tail = TRUE, log.p = FALSE)
> str(rexp)
function (n, rate = 1)
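A numerical illustration of the memoryless property stated above; the rate and the values of \(s\) and \(t\) are arbitrary:

> rate <- 2; s <- 1; t <- 0.5
> # Pr(X > s + t | X > s) = Pr(X > s + t) / Pr(X > s)
> pexp(s+t, rate=rate, lower.tail=FALSE) / pexp(s, rate=rate, lower.tail=FALSE)
[1] 0.3678794
> pexp(t, rate=rate, lower.tail=FALSE)   # Pr(X > t), the same value
[1] 0.3678794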
Due to the Central Limit Theorem (covered later), this “bell curve” distribution is often observed in properly normalized real data.
\[X \sim \mbox{Normal}(\mu, \sigma^2)\]
\[S = (-\infty, \infty)\]
\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \mbox{ for } x \in S\]
\[{\rm E}[X] = \mu, \ {\rm Var}(X) = \sigma^2\]
> str(dnorm) #notice it requires the STANDARD DEVIATION, not the variance
function (x, mean = 0, sd = 1, log = FALSE)
> str(pnorm)
function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> str(qnorm)
function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> str(rnorm)
function (n, mean = 0, sd = 1)
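As an illustration, the familiar 68–95–99.7 rule follows directly from pnorm:

> pnorm(1) - pnorm(-1)   # probability within 1 sd of the mean
[1] 0.6826895
> pnorm(2) - pnorm(-2)   # within 2 sd
[1] 0.9544997
> pnorm(3) - pnorm(-3)   # within 3 sd
[1] 0.9973002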
Suppose that \(X\) is a random variable and that \(a\) and \(b\) are constants. Then:
\[{\rm E}\left[a + bX \right] = a + b {\rm E}[X]\]
\[{\rm Var}\left(a + bX \right) = b^2 {\rm Var}(X)\]
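A simulation sketch of these two identities; the distribution and the constants are arbitrary choices, and the random outputs are omitted:

> x <- rnorm(1e6, mean=2, sd=3)   # a large sample from an arbitrary rv
> a <- 5; b <- -2
> mean(a + b*x)   # approximately a + b*E[X] = 5 - 2*2 = 1
> var(a + b*x)    # approximately b^2*Var(X) = 4*9 = 36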
If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (iid) random variables, then:
\[{\rm E}\left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n {\rm E}[X_i]\]
\[{\rm Var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n {\rm Var}(X_i)\]
Suppose \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (iid) random variables. Let \(\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i\) be their sample mean. Then:
\[{\rm E}\left[\overline{X}\right] = {\rm E}[X_i]\]
\[{\rm Var}\left(\overline{X}\right) = \frac{1}{n}{\rm Var}(X_i)\]
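A simulation check of the variance identity, using an arbitrary Poisson(6) example with \(n = 25\) (random output omitted):

> x_bar <- replicate(n=1e4, expr=mean(rpois(n=25, lambda=6)))  # 10,000 sample means
> var(x_bar)   # approximately Var(X_i)/n = 6/25 = 0.24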
Suppose \(X_1, X_2, \ldots, X_n\) are iid rv’s with population mean \({\rm E}[X_i] = \mu\) and variance \({\rm Var}(X_i) = \sigma^2\).
Then for “large \(n\)”, \(\sqrt{n}(\overline{X} - \mu)\) approximately follows the Normal(0, \(\sigma^2\)) distribution.
As \(n \rightarrow \infty\), this approximation becomes exact.
Let \(X_1, X_2, \ldots, X_{40}\) be iid Poisson(\(\lambda\)) with \(\lambda=6\).
We will form \(\sqrt{40}(\overline{X} - 6)\) over 10,000 realizations and compare their distribution to a Normal(0, 6) distribution.
> x <- replicate(n=1e4, expr=rpois(n=40, lambda=6),
+ simplify="matrix")
> x_bar <- apply(x, 2, mean)
> clt <- sqrt(40)*(x_bar - 6)
>
> df <- data.frame(clt=clt, x = seq(-18,18,length.out=1e4),
+ y = dnorm(seq(-18,18,length.out=1e4),
+ sd=sqrt(6)))
> ggplot(data=df) +
+ geom_histogram(aes(x=clt, y=..density..), color="blue",
+ fill="lightgray", binwidth=0.75) +
+ geom_line(aes(x=x, y=y), size=1.5)
Individuals are uniformly and independently randomly sampled from a population.
The measurements taken on these individuals are then modeled as random variables, specifically random realizations from the complete population of values.
Simple random samples form the basis of modern surveys.
Individuals under study are randomly assigned to one of two or more available treatments.
This induces randomization directly into the study and breaks the relationship between the treatments and other variables that may be influencing the response of interest.
This is the gold standard study design in clinical trials to assess the evidence that a new drug works on a given disease.
The sampling distribution of a statistic is the probability distribution of the statistic under repeated realizations of the data from the assumed data generating probability distribution.
The sampling distribution is how we connect an observed statistic to the population.
Suppose I claim that a specific coin is fair, i.e., that it lands on heads or tails with equal probability.
I flip it 20 times and it lands on heads 16 times.
Let’s simulate 10,000 times what my estimate would look like if \(p=0.5\) and I repeated the 20 coin flips over and over.
> x <- replicate(n=1e4, expr=rbinom(1, size=20, prob=0.5))
> sim_p_hat <- x/20
> my_p_hat <- 16/20
What can I do with this information?
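One natural next step, sketched here as an illustration: compare my_p_hat to the simulated sampling distribution by computing the fraction of fair-coin estimates at least as extreme, which approximates a one-sided p-value under the \(p = 0.5\) model (random output omitted):

> mean(sim_p_hat >= my_p_hat)   # fraction of simulations with an estimate >= 0.8

A small value here is evidence against the claim that the coin is fair.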
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] ggplot2_2.1.0 knitr_1.12.3 magrittr_1.5
[4] devtools_1.10.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 revealjs_0.5.1 digest_0.6.9
[4] plyr_1.8.3 grid_3.2.3 gtable_0.2.0
[7] formatR_1.2.1 evaluate_0.8 scales_0.4.0
[10] stringi_1.0-1 rmarkdown_0.9.5 labeling_0.3
[13] tools_3.2.3 stringr_1.0.0 munsell_0.4.3
[16] yaml_2.1.13 colorspace_1.2-6 memoise_1.0.0
[19] htmltools_0.3