Introduction to statistics

Author

Panagiotis Papasaikas

Published

November 27, 2024

Introduction

In the next three lectures we will introduce some basic concepts and the corresponding R-framework for statistical analysis, modelling and testing. These are all critical parts of research in the natural sciences irrespective of the specific domain or experimental methodology. They equip us with a formal, mathematical language for the description of experimental observations and allow us to prioritize hypotheses by assigning to them degrees of confidence. Therefore, the proper selection and application of statistical tools is paramount.

R is at its heart a statistical language and provides us with a convenient framework to carry out these tasks. Today, we start by introducing the concepts (and related R functions) of statistical variables, distributions, descriptive statistics for data samples, variable dependence and testing.

Load required packages and Download datasets

library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(ggplot2)

The melanoma dataset example

Consider the melanoma dataset. Let’s look at the histograms of the melanoma thickness when split by sex:

# Loading the data `melanoma_data.txt` into `mel`
mel <- read.delim("data/melanoma_data.txt",colClasses=c("character","integer","factor","factor","integer","integer","numeric","factor") )

ggplot(mel, aes(x=thickness, color=sex, fill=sex)) +
  geom_histogram(aes(y=after_stat(density)), position="identity", alpha=0.5) +
  scale_color_manual(values=c("#E69F00", "#56B4E9")) +
  scale_fill_manual(values=c("#E69F00", "#56B4E9")) +
  labs(title="Thickness histogram plot", x="thickness", y="density") +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Are the two density plots identical? If not, how do they differ? How confident can we be that the thickness of melanoma in males and females is different?

In order to be able to answer such questions we will first introduce some measures that summarize properties of the distributions of variables such as their location and their spread (scatter) as well as the relationship between variables.

We will next introduce R tools for summarizing and comparing properties of datasets and for associating degrees of confidence to the results of these comparisons.

Random Variables, probability distributions and random samples.

A random variable is any observable property whose outcome is not fixed. They are called variables because their outcomes vary and random because the way in which these outcomes vary depends on some underlying chance (but not necessarily unpredictable!) process.

  • Examples of random variables:
    • Size and weight of people
    • Average temperature of the day
    • Expression level of a gene
    • Rolls of a die

Random variables can be continuous (when the possible outcomes can take any value within a continuous range) or discrete (when the possible outcomes form a countable set of distinct values).

For example the thickness of melanoma is a case of a continuous random variable. Gender is an example of a discrete random variable.

  • The function whose area over a range of values gives the probability of a continuous random variable falling in that range is called the probability density function (pdf); see the short example below.
  • The function that gives the probability of a discrete random variable taking on a specific value is called the probability mass function (pmf).
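
As a minimal illustration (the values below are arbitrary examples, not taken from the lecture), R's d* functions evaluate densities or masses, while the p* functions return cumulative probabilities:

dnorm(1.5, mean = 0, sd = 1)   # pdf of N(0,1) evaluated at 1.5 (a density, not a probability by itself)
pnorm(2) - pnorm(1)            # probability that a N(0,1) variable falls in the range [1, 2]
dpois(3, lambda = 2)           # pmf of a Poisson(2): probability of observing exactly 3 events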

Probability distributions

A probability distribution is the complete description of the possible outcomes of a variable. It assigns a probability value for each possible outcome (or range of outcomes) of our variable. Knowing the underlying distribution of a variable does not necessarily mean that we have full knowledge of the process that generates the outcomes but rather that we can make accurate statements on the probability of observing an outcome.

Depending on the type of variable that they describe, distributions can also be continuous or discrete.
Depending on their general properties (e.g. shape, possible range of values) they fall into more specific families. Distributions are fully specified by a set of values called the parameters of the distribution.

Uniform distribution (continuous or discrete)

  • Arises when all possible outcomes (values of a random variable) have equal probability.
  • Parameter-free (given the domain of the random variable)
set.seed(1)
# discrete
r_unif_discr <- sample(x = 1:6, size = 1000, replace = TRUE)
ggplot() + geom_histogram(aes(x=r_unif_discr)) + xlab("observation") + ylab("frequency")+ ggtitle("Rolling a die 1000 times") +
geom_hline(yintercept = 1000 / 6, linetype="dashed", color = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# continuous
r_unif_cont <- runif(n = 1000, min = 0, max = 1)
nbins <- 100
ggplot() + geom_histogram(aes(x=r_unif_cont),bins = nbins) + xlab(expression(paste("Final resting angle (divided by ", 2*pi, ")"))) + ylab("frequency")+ ggtitle("1000 spins of a spinner") + geom_hline(yintercept = 1000 / nbins, linetype="dashed", color = "red")

Gaussian/normal distribution (continuous)

  • Arises when a random variable (quantity) results from the sum of many independent, identically distributed (i.i.d.) random variables with finite variance (central limit theorem); see the sketch after this list. This does not depend on the shape of the individual distributions.
  • Parameters: mean (\(\mu\)) and standard deviation (\(\sigma\))
  • Examples: repeated independent noisy measurements of the same quantity
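
As a quick sanity check of the central limit theorem statement above, here is a minimal sketch (arbitrary seed, sample size and variable name `clt_sums`; not part of the original code): summing many i.i.d. uniform variables already yields an approximately bell-shaped histogram.

set.seed(3)                                      # arbitrary seed, for reproducibility
clt_sums <- replicate(10000, sum(runif(20)))     # each value is the sum of 20 i.i.d. uniform variables
ggplot() + geom_histogram(aes(x=clt_sums), bins = 50) + xlab("sum of 20 uniform variables") + ylab("frequency") + ggtitle("Central limit theorem illustration")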

Sampling from a normal distribution with specific mean and standard deviation is also straightforward:

r_norm <- rnorm(10000, mean=0, sd=1)
nbins <- 100
ggplot() + geom_histogram(aes(x=r_norm),bins = nbins) + xlab("value") + ylab("frequency")+ ggtitle("sampling a normal") 

Poisson distribution (discrete)

  • Describes the number of events observed in a fixed interval of time or space, if events are occurring at a constant average rate.
  • Parameters: lambda (\(\lambda\), corresponds to mean and variance)
  • Examples:
    • number of cars driving by per minute (assuming constant traffic rate)
    • number of viruses infecting the same cell, given a multiplicity of infection

Generating random observations (sampling) from a Poisson distribution with a specific lambda:

r_pois <- rpois(1000, lambda = 2)
ggplot() + geom_bar(aes(x=r_pois)) + xlab("value") + ylab("frequency")+ ggtitle("sampling a poisson") 

For more distributions and details…

… see the Appendix

Data Samples

Why do you think it is important / useful to know the underlying distribution of a random variable?

How can we know what distribution governs the behavior of a variable of interest?

  • Sometimes we can guess based on prior knowledge of the process that generates outcomes (can you think of such cases?).
  • More commonly (and more safely) we collect samples of outcomes of our variable of interest and make a more informed decision.

A data sample is a collection of observed outcomes of a variable. The distribution that arises from a sample of observations is called the empirical distribution. When our samples are large enough and cover in an unbiased (representative) way the complete space of possible outcomes we can infer the (true) underlying distribution that gives rise to the data. Collection of large, representative samples is, therefore, a critical step in the process of studying a random variable. The density plots in the previous examples are representations of empirical distributions. They are based on limited sampling of the possible outcomes. They are, thus, crude approximations of the true probability function of those variables.
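
As a small illustration (a sketch with an arbitrary seed and arbitrary sample sizes, not part of the original code): the empirical distribution of a normal sample approaches the true density as the sample size grows.

set.seed(42)                                      # arbitrary seed
s_small <- rnorm(50)                              # a small sample from N(0,1)
s_large <- rnorm(50000)                           # a much larger sample from the same distribution
xs <- seq(-4, 4, length.out = 200)
hist(s_small, breaks = 30, freq = FALSE, main = "empirical, n = 50", xlab = "value")
lines(xs, dnorm(xs), col = "red")                 # true density: only a crude match
hist(s_large, breaks = 50, freq = FALSE, main = "empirical, n = 50000", xlab = "value")
lines(xs, dnorm(xs), col = "red")                 # true density: now closely matched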

Exercise 1

Produce vectors representing 100 samples for the following random variables:

  1. The result (heads/tails) of a coin toss. What is the ratio of heads/tails? Repeat using 100000 samples. How does the result change?

  2. The sum of two dice throws. Produce a histogram for the values. Which value appears most often? What is the ratio of the incidence of 7 over the incidence of 3? Repeat using 100000 samples. How do the results change? Can you estimate the probability of obtaining a value greater than 9?

  3. The number of daily views of a webpage that on average receives 14000 views per week, assuming constant daily traffic. Produce a histogram for the values.

  4. The temperature T1 at 12pm of a given day of the year, assuming it follows a normal distribution with mean 15C and standard deviation 3C ( T1~N(15,3) )

  5. The temperature T2 8h later that same day assuming it follows the function: T2 = T1 - 5 + e, where e follows a normal distribution with mean 0 and standard deviation 2.5 . Produce a scatterplot of T2 as a function of T1.

#1.
ss  <- 100
Throws <- sample(c("H","T"), size=ss, replace=TRUE)
ratio <- sum(Throws=="H")/sum(Throws=="T")

# Set ss to 100000 to produce the larger sample. As the sample grows the ratio of
# heads to tails consistently approaches 1.

#2.
ss <- 100000
sum2D <- sample(1:6, ss, replace=TRUE) + sample(1:6, ss, replace=TRUE)
hist(sum2D, breaks=100)

ggplot() + geom_bar(aes(x=sum2D)) + xlab("value") + ylab("frequency")+ ggtitle("2-dice throws") 

which.max(table(sum2D))
7 
6 
ratio <- sum(sum2D==7)/sum(sum2D==3)
# Here we already use the larger sample (ss = 100000); with ss = 100 the estimates are much noisier.
# As the sample grows it becomes clear that the most probable single value is 7
# and the ratio of 7s to 3s approaches 3 (P(7) = 6/36 vs P(3) = 2/36).
# Using a large sample size we can also estimate the probability mass. E.g. for values greater than 9:
ss <- 1e6
sum2D <- sample(1:6, ss, replace=TRUE) + sample(1:6, ss, replace=TRUE)
Pg9 <- sum(sum2D > 9) / ss # Approaches P(10) + P(11) + P(12) = 6/36 = 1/6

#3.
ss <- 100000
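# The page receives 14000 views per week, i.e. 14000 / 7 = 2000 views per day on average, hence lambda = 2000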
views <- rpois(ss, lambda = 2000)
ggplot() + geom_histogram(aes(x=views)) + xlab("value") + ylab("frequency")+ ggtitle("page views") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#4.
ss <- 100
T1 <- rnorm(ss,15,3)
ggplot() + geom_histogram(aes(x=T1)) + xlab("temperature") + ylab("frequency")+ ggtitle("Temp. @ 12pm") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#5.
T2 <- T1 - 5 + rnorm(ss,0,2.5) 
ggplot() + geom_histogram(aes(x=T2)) + xlab("temperature") + ylab("frequency")+ ggtitle("Temp. @ 8pm") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# T1 vs T2:
ggplot() + geom_point(aes(x=T1, y=T2) )+ xlab("T @ 12pm") + ylab("T @ 8pm") + geom_abline(slope=1, intercept=0, linetype="dashed", color = "red")

Descriptive Statistics

Descriptive statistics are summary measures that describe basic features of a distribution (or empirical distribution). They are of three types:

- Measures of location or central tendency: describe where a typical outcome tends to fall on average. Three common measures of location are the mean, the median and the mode:

  • Mean is the sum of all the variable's values divided by the total number of observations (sample size).
  • Median is the middle value of the outcomes. It splits our dataset into two equal parts: half of the values are smaller than the median and half are larger.
  • Mode is the value that occurs most commonly in our sample.
Robustness

Q: What do you think are some advantages and disadvantages of these measures?

A: A critical difference between these measures is how robust they are to outliers. Both the median and the mode do not break down easily in the presence of outliers. In contrast, the mean is sensitive even to a relatively small fraction of outliers in the data sample.

Robustness to outliers is important to keep in mind when choosing statistics.
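
A tiny illustration (toy numbers, not taken from the melanoma data): adding a single extreme outlier shifts the mean substantially while the median barely moves.

x <- c(2, 3, 3, 4, 5)        # a small toy sample
x_out <- c(x, 100)           # the same sample with one extreme outlier added
mean(x); mean(x_out)         # 3.4 vs 19.5: the mean is pulled strongly towards the outlier
median(x); median(x_out)     # 3 vs 3.5: the median is almost unaffected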

- Measures of scatter (or dispersion): describe how spread out our observations are. Some common measures of dispersion:

  • Range The difference between the maximum and minimum value in a sample. The related interquartile range (IQR) is the difference between the 75th and 25th percentiles of the data.
  • Variance This is the average of the squared differences between each value and the mean.
  • Standard deviation The square root of the variance.

Again, these measures differ in terms of their sensitivity to outliers. The IQR is a more robust measure of dispersion than the range, variance and standard deviation, which are severely distorted by even a few high-magnitude outliers.
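
Continuing the toy example above (same hypothetical vectors), the standard deviation is inflated by the single outlier while the IQR changes only slightly:

x <- c(2, 3, 3, 4, 5); x_out <- c(x, 100)   # same toy sample as above, with one outlier
sd(x); sd(x_out)                            # about 1.1 vs about 39: strongly distorted
IQR(x); IQR(x_out)                          # 1 vs 1.75: only mildly affected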

- Measures of shape: try to capture how observations are distributed around the location. They can indicate, for example, whether a distribution is symmetric or skewed, unimodal or multimodal. Examples of such measures are skewness and kurtosis (not of much practical use).

n=50
set.seed(1)
X=c(50,sample(c(1:10), n-1,replace=TRUE)) #Uniformly sampling 1:10 plus an outlier
Y=c(50, rnorm(n-1, 0, 2)  ) #Sampling from a normal with mean 0, sd 2 plus an outlier

#Some measures of location for X,Y
mode.X=as.numeric(names(which.max(table(X)))) # the value that occurs most often
mu.Y=mean(Y)
med.Y=median(Y)

#Measures of dispersion of variable Y
rng.Y=diff(range(Y))
var.Y=var(Y)
std.Y=sqrt(var.Y)
iqr.Y=IQR(Y)


ggplot() + geom_bar(aes(x=X)) + xlab("value") + ylab("frequency")+ 
  geom_vline(xintercept = mode.X,linetype="dashed", color = "red" ) +
  annotate(geom="text",x=mode.X+5, y=max(table(X))-1, label=paste("mode.X=",mode.X) )

ggplot() + geom_histogram(aes(x=Y),bins = 64) + xlab("value") + ylab("frequency")+ 
  geom_vline(xintercept = mu.Y,linetype="dashed", color = "red" ) +
  annotate(geom="text",x=mu.Y+5, y=4, label=paste("mean.Y=",round(mu.Y,3)) ) +
  geom_vline(xintercept = med.Y,linetype="dashed", color = "red" ) +
  annotate(geom="text",x=med.Y+5, y=6, label=paste("median.Y=",round(med.Y,3)) ) +
  annotate(geom="text",x=15, y=10, label=paste("var Y=",round(var.Y,3)) ) +
  annotate(geom="text",x=15, y=11, label=paste("rng Y=",round(rng.Y,3)) ) +
  annotate(geom="text",x=15, y=12, label=paste("iqr Y=",round(iqr.Y,3)) )

Statistical dependence

Random variables can be related to each other. This is the result of the underlying processes that generate their outcomes being somehow linked. For example two variables can be affected by a common upstream process, or they can be causally linked. When such a relationship exists we say that the variables are statistically dependent. On the other hand, variables that are not linked in any way are statistically independent. In other words when two variables are statistically independent knowledge of the outcome of one does not improve our ability to predict the outcome of the other.

A common measure of dependence is correlation. Two commonly used measures of correlation are Pearson's and Spearman's correlation coefficients.

Both measures take values in the range [-1, 1], but they differ both in the assumptions they make about the underlying relationship between the two variables and in how robust they are to the presence of outliers:

#Calculate pearson's and spearman's correlation in two examples:  
#In the presence of outliers
p.cor=cor(X,Y)
sp.cor=cor(X,Y,method="spearman")


ggplot() + geom_point(aes(x=X, y=Y) ) + 
annotate(geom="text",x=15, y=25, label=paste("pearson's cor=",round(p.cor,3)) ) +
annotate(geom="text",x=15, y=30, label=paste("spearman  cor=",round(sp.cor,3)) )

The presence of a single large outlier in those two samples dominates the outcome of Pearson’s correlation coefficient. In contrast, the importance of a few high-magnitude outliers is moderate in the case of Spearman’s correlation coefficient because of the way it operates (only the rank of observations is important, not their magnitude. see example below for more details).

#When relationship is not linear (but still monotonic)
X1=runif(1000,0,0.25)
Y1=(1/X1) + rnorm(1000, 0,1 )
#Pearson's  and Spearman's correlation:
p.cor=cor(X1,Y1)
sp.cor=cor(X1,Y1,method="spearman")


ggplot() + geom_point(aes(x=X1, y=Y1) ) + 
annotate(geom="text",x=0.15, y=700, label=paste("pearson's cor=",round(p.cor,3)) ) +
annotate(geom="text",x=0.15, y=1700, label=paste("spearman  cor=",round(sp.cor,3)) )

# How Spearman's correlation works: convert variables to ranks, then calculate Pearson's rho
X1r=rank(X1)
Y1r=rank(Y1)
p.cor=cor(X1r,Y1r)
ggplot() + geom_point(aes(x=X1r, y=Y1r) ) + 
annotate(geom="text",x=300, y=200, label=paste("pearson's cor=",round(p.cor,3)) ) 

Q: Which of these two measures of correlation is more appropriate in this case?

A: Evidently, Spearman's correlation is a better choice, as it captures the strong association between the two variables. Pearson's correlation fails to do so in this case because it assumes a linear relationship between the two variables. In contrast, Spearman's correlation is more flexible and can capture arbitrary monotonic associations between variables.

It is important to keep in mind that the existence of statistical dependence only tells us if variables are linked and nothing about how they are organized in terms of “cause and effect” relationships. In the case of correlated variables this is commonly summarized as correlation does not imply causation.

[Figure: chocolate consumption vs. Nobel Prizes per country]

Exercise 2

  1. Location measures: What are the mean and median of the melanoma thickness variable? How about the same measures when grouped by sex?

  2. Dispersion and shape: What is the range, IQR, variance and standard deviation for those groups? Are the distributions symmetric, unimodal or multimodal for each of the groups?

Based on the previous results can you conclude if gender and melanoma thickness are statistically dependent? If not what do you think is missing?

  3. Plot melanoma thickness against age. Calculate the Spearman and Pearson correlation between these two variables. In this case, is one more appropriate than the other?
mel <- read.delim("data/melanoma_data.txt")
#1. Location measures:
mean.both=mean(mel$thickness)
median.both=median(mel$thickness)
#Grouped by sex:
mean.0=mean(mel$thickness[mel$sex == 0])
mean.1=mean(mel$thickness[mel$sex == 1])
median.0=median(mel$thickness[mel$sex == 0])
median.1=median(mel$thickness[mel$sex == 1])

#2. Dispersion measures:
range.both=diff(range(mel$thickness))
iqr.both=IQR(mel$thickness)
var.both=var(mel$thickness)
sd.both=sd(mel$thickness) #same as sd.both=sqrt(var.both)
#Grouped by sex:
range.0=diff(range(mel$thickness[mel$sex == 0]))
iqr.0=IQR(mel$thickness[mel$sex == 0])
var.0=var(mel$thickness[mel$sex == 0])
sd.0=sd(mel$thickness[mel$sex == 0]) #same as sd.both=sqrt(var.0)
range.1=diff(range(mel$thickness[mel$sex == 1]))
iqr.1=IQR(mel$thickness[mel$sex == 1])
var.1=var(mel$thickness[mel$sex == 1])
sd.1=sd(mel$thickness[mel$sex == 1]) #same as sd.both=sqrt(var.1)
# Looking at the histograms of the data the melanoma thickness distributions
# appear unimodal and skewed -with a heavy tail towards higher values- (see section 2 for the histograms)
# Actually the distribution looks to fit the profile of either a truncated normal distribution ( https://en.wikipedia.org/wiki/Truncated_normal_distribution )
# or perhaps an exponential: (https://en.wikipedia.org/wiki/Exponential_distribution)
# It is tough to pinpoint the exact distribution with such a small sample.
# It is also tough to decide whether gender and mel. thickness are statistically dependent only based on these measurements.
# Although there is a difference in location it is hard to decide whether this is meaningful or just part of expected sample variance.
# We need a proper Hypothesis testing procedure to put a confidence level on our statement!


#3. Correlation
ggplot(mel, aes(x=age, y=thickness) ) + geom_point()

cor.pearson=cor(mel$age, mel$thickness)
cor.spearman=cor(mel$age, mel$thickness,method="spearman")      
# Both correlation measurements are similar.
# This makes sense: There are no heavy outliers and the (weak) association
# between the two variables appears to be linear.
# Deciding whether this association is meaningful is again tough without
# a proper test setting.

Hypothesis testing

The process of formulating hypotheses in a formal way, constructing appropriate tests for these hypotheses and interpreting the results of these tests in order to accept or reject the original hypothesis is known as hypothesis testing.

The steps of hypothesis testing are the following:

  • Formalize your question in the form of a null hypothesis (commonly denoted H0) and an alternative hypothesis (Ha or H1). The null and alternative hypotheses “split the world” into two mutually exclusive realities.
  • Calculate the probability of your observations (samples) under the assumption that the null hypothesis is true (illustrated by the sketch after this list).
  • Accept or reject the null hypothesis in favor of the alternative on the grounds of the previous calculation.
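
As an illustration of the second step, here is a minimal permutation sketch (simulated data with an arbitrary seed; the samples `g1` and `g2` and the permutation scheme are hypothetical, not part of the lecture code). It estimates how probable a difference in group means at least as large as the observed one is when the null hypothesis of "no difference" holds:

set.seed(7)                                                 # arbitrary seed
g1 <- rnorm(30, mean = 0); g2 <- rnorm(30, mean = 0.8)      # two hypothetical samples
obs <- abs(mean(g1) - mean(g2))                             # observed difference in means
pooled <- c(g1, g2)
perm <- replicate(10000, {                                  # re-assign group labels at random (under H0 the labels do not matter)
  idx <- sample(length(pooled), length(g1))
  abs(mean(pooled[idx]) - mean(pooled[-idx]))
})
mean(perm >= obs)                                           # fraction of shuffles at least as extreme: an empirical p-value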

Parametric and non-parametric tests.

How can we decide on a test that will calculate the probability of our observation under the null hypothesis? First, it depends on what we want to compare: there are different tests for comparing measures of location (e.g. means), measures of dispersion (e.g. variances), for testing dependence between variables, or for testing whether a sample comes from a specific distribution.

Second, it depends on our prior knowledge about the distribution of the involved variables. Tests that utilize prior assumptions about the underlying distribution are known as parametric while those that are agnostic to the distributions that generate the observations are non-parametric.

For example, the t-test for the difference of means is a parametric test for differences in a specific location measure (the mean) between two populations. It is parametric because it relies on strong assumptions about the underlying distributions, most importantly that the tested variables come from a normal distribution. A non-parametric counterpart of the t-test is Wilcoxon's rank-sum test (also known as the Mann-Whitney test). It tests whether the location of two samples differs, but makes no assumptions about the underlying distribution.

An example of a parametric measure of variable dependence is Pearson's correlation coefficient. It is parametric because it assumes (among other things) that the two tested variables are linked by a linear relationship. Its non-parametric counterpart is Spearman's correlation coefficient. Below, these tests are applied to the X and Y samples generated earlier:

t.test(Y)

    One Sample t-test

data:  Y
t = 0.81664, df = 49, p-value = 0.4181
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -1.234733  2.925228
sample estimates:
mean of x 
0.8452476 
t.test(Y,alternative="greater")

    One Sample t-test

data:  Y
t = 0.81664, df = 49, p-value = 0.209
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 -0.8900412        Inf
sample estimates:
mean of x 
0.8452476 
wilcox.test(Y)

    Wilcoxon signed rank test with continuity correction

data:  Y
V = 590, p-value = 0.65
alternative hypothesis: true location is not equal to 0
wilcox.test(Y,alternative="greater")

    Wilcoxon signed rank test with continuity correction

data:  Y
V = 590, p-value = 0.6784
alternative hypothesis: true location is greater than 0
cor.test(X,Y)

    Pearson's product-moment correlation

data:  X and Y
t = 11.737, df = 48, p-value = 1.037e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7665633 0.9191809
sample estimates:
      cor 
0.8611636 
cor.test(X,Y,method="spearman")
Warning in cor.test.default(X, Y, method = "spearman"): Cannot compute
exact p-value with ties

    Spearman's rank correlation rho

data:  X and Y
S = 23052, p-value = 0.4598
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.1069331 

Exercise 3

  1. Construct a test for checking whether the mean thickness is different in the two sexes. What should your null hypothesis be? If you had to decide between a t-test and Wilcoxon's test, which one would you choose? Why?

  2. Construct a test for checking whether thickness is correlated with age. In this case what is the null hypothesis? Is Spearman's or Pearson's correlation more appropriate in this case?

#1. Here our null hypothesis is that average thickness is the same in the two sexes 
# and our alternative that it is different. 
# A t-test is not the proper choice in this case since the distribution of thickness
# is clearly not Gaussian (see the exercise 2, question 2 comments on the shape of the distribution).
# Wilcoxon's test is a safer choice:
wilcox.test(mel$thickness[mel$sex==0],mel$thickness[mel$sex==1],alternative = "two.sided",conf.int = TRUE)

    Wilcoxon rank sum test with continuity correction

data:  mel$thickness[mel$sex == 0] and mel$thickness[mel$sex == 1]
W = 3794.5, p-value = 0.004213
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -1.5899252 -0.2599798
sample estimates:
difference in location 
            -0.8099166 
# Based on the results of this test we can reject the null hypothesis (same thickness in both sexes)
# with confidence > 99.5% (since the p.value < 0.005)

#2. Here our null hypothesis is that thickness IS NOT correlated with age.
# Since, as we saw in exercise 2, both Pearson and Spearman correlation give similar results,
# we can try both tests:
cor.test(mel$thickness,mel$age)

    Pearson's product-moment correlation

data:  mel$thickness and mel$age
t = 3.0981, df = 203, p-value = 0.002223
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.0777074 0.3396259
sample estimates:
      cor 
0.2124798 
cor.test(mel$thickness,mel$age,method="spearman")
Warning in cor.test.default(mel$thickness, mel$age, method = "spearman"):
Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  mel$thickness and mel$age
S = 1146124, p-value = 0.003719
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2017634 
# Both tests give similar results. That is we can
# reject our null (there is no correlation i.e rho==0)
# at a confidence level >99.6% (Both p.values < 0.004)
# I would have a slight preference for the Spearman test
# because Pearson's correlation assumptions tend to break
# for heavy-tailed distributions (such as the melanoma thickness).

Appendix: Distributions

Here is a short overview of a few distributions that are commonly used when analyzing data.
This part will not be covered in the lecture and is there just for your information. It aims to give you an intuition for when different distributions arise, how they are parameterized,
and what they look like given their parameters.

Uniform distribution (continuous or discrete)

continuous: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous) discrete: https://en.wikipedia.org/wiki/Discrete_uniform_distribution

Arises when all possible outcomes have equal probability.

Parameters (continuous): min and max

R functions: dunif, punif, qunif and runif

Probability density:

set.seed(1)
par(mfrow = c(1,2))
# discrete
r_unif_discr <- sample(x = 1:6, size = 1000, replace = TRUE)
plot(table(r_unif_discr))
abline(h = 1000 / 6, lty = 3)
# continuous
r_unif_cont <- runif(n = 1000, min = 0, max = 1)
hist(r_unif_cont)

Examples:
- continuous: P values in [0,1] under the null hypothesis
- discrete: getting a number from {1,2,3,4,5,6} when rolling a die (an unloaded die)

Gaussian (normal) distribution (continuous)

https://en.wikipedia.org/wiki/Normal_distribution

Arises when a random variable (quantity) is resulting from the sum of many independent, identically distributed random variables with finite variance (central limit theorem, https://en.wikipedia.org/wiki/Central_limit_theorem). This does not depend on the shape of the individual distributions.

Frequently observed in “nature” (for example, physical quantities are often expected to be the sum of many independent processes, including measurement errors). Because of that, and because of its “nice” mathematical properties, it is often used to model data of unknown distribution.

Parameters: mean (\(\mu\)) and standard deviation (\(\sigma\))

R functions: dnorm, pnorm, qnorm and rnorm

Probability density:

xs <- seq(-8, 8, length.out = 256)
mu <- 0
sds <- c(1, 1.5, 2, 3)
plot(xs, dnorm(x = xs, mean = mu, sd = sds[1]), type = "l", col = 2)
lines(xs, dnorm(x = xs, mean = mu - 1.5, sd = sds[1]), col = 2, lty = 2)
for (i in 2:4)
    lines(xs, dnorm(x = xs, mean = mu, sd = sds[i]), col = i + 1)

Examples:
- repeated independent noisy measurements of the same quantity

# Draw N samples of size n using the sampling function FUN, and append their element-wise sum
getNsamples <- function(N, n=100, FUN = runif, ...) {
    stopifnot(is.numeric(N),
              length(N) == 1,
              is.numeric(n),
              length(n) == 1,
              is.function(FUN),
              "n" %in% names(formals(runif)))
    sL <- lapply(seq.int(N), function(i) FUN(n = n, ...))
    nm <- deparse(substitute(FUN))
    names(sL) <- paste0(nm,"_",seq.int(N))
    sL <- c(sL, list(Reduce("+", sL)))
    names(sL)[length(sL)] <- paste0(nm, "_sum")
    return(sL)
}

set.seed(2)
dat.unif <- getNsamples(3, n =  100, FUN = runif)
dat.pois <- getNsamples(3, n =  500, FUN = rpois, lambda = 3)
dat.exp  <- getNsamples(3, n = 1000, FUN = rexp, rate = 0.5)

par(mfrow = c(2,2))
for (i in seq_along(dat.unif))
    hist(dat.unif[[i]], main = names(dat.unif)[i], xlab = "Values")

par(mfrow = c(2,2))
for (i in seq_along(dat.pois))
    hist(dat.pois[[i]], main = names(dat.pois)[i], xlab = "Values")

par(mfrow = c(2,2))
for (i in seq_along(dat.exp))
    hist(dat.exp[[i]], main = names(dat.exp)[i], xlab = "Values")

Exponential distribution (continuous)

https://en.wikipedia.org/wiki/Exponential_distribution

Describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.

Parameters: rate (greater than zero)

R functions: dexp, pexp, qexp and rexp

Probability density:

xs <- seq(0, 5, length.out = 128)
rates <- c(0.5, 1.0, 1.5)
plot(c(0,5), c(0,1.6), type = "n", xlab = "x", ylab = "P(x)")
for (i in seq_along(rates)) 
    lines(xs, dexp(xs, rates[i]), col = i + 1)
    ## lines(xs, rates[i] * exp(-rates[i] * xs), col = i + 1)
legend(x = "topright", legend = rates, title = "rate", lwd = 1, col = 2:4, bty = "n")

Examples:
- residence time of a transcription factor molecule (constant rate of leaving)
- waiting time until next red car drives by (assuming constant traffic rate)
- distance between neighboring genes (assuming random and uniform gene locations)
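
A small sampling sketch (arbitrary seed and rate, not part of the original appendix code) to complement the density plot above: the average waiting time approaches 1/rate.

set.seed(4)                                       # arbitrary seed
r_exp <- rexp(n = 10000, rate = 0.5)              # waiting times for events occurring at rate 0.5 per unit time
mean(r_exp)                                       # close to 1 / rate = 2
hist(r_exp, breaks = 50, freq = FALSE, main = "sampling an exponential", xlab = "value")
curve(dexp(x, rate = 0.5), add = TRUE, col = 2)   # overlay the true density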

Poisson distribution (discrete)

https://en.wikipedia.org/wiki/Poisson_distribution

Describes the number of events occurring in a fixed interval of time or space, if events are occurring at a known and constant average rate.

Parameters: lambda (corresponds to mean and variance)

R functions: dpois, ppois, qpois and rpois

Probability density:

xs <- 0:20
lambdas <- c(1.0, 4.0, 10.0)
plot(c(0,20), c(0,0.4), type = "n", xlab = "x", ylab = "P(x)")
for (i in seq_along(lambdas)) {
    lines(xs, dpois(x = xs, lambda = lambdas[i]), col = i + 1)
    points(xs, dpois(x = xs, lambda = lambdas[i]), pch = 1, col = i + 1)
}
legend(x = "topright", legend = lambdas, title = "lambda", lwd = 1, col = 2:4, bty = "n")

Examples:
- number of red cars driving by per hour (assuming constant traffic rate)
- number of genes per Mb (assuming random and uniform gene locations)
- number of viruses infecting a given cell, given a multiplicity of infection of lambda

nGenes <- 200
genomeSize <- 1e6
binSize <- 1e4
nBins <- genomeSize / binSize

set.seed(3)
genePositions <- runif(n = nGenes, min = 1, max = genomeSize)

nGenesPer10kb <- tabulate(round(genePositions/binSize))
dGenesPer10kb <- table(nGenesPer10kb)

plot(as.numeric(names(dGenesPer10kb)), as.numeric(dGenesPer10kb), type = "p",
     ylim = c(0,max(dGenesPer10kb)), xlab = "Number of genes per 10kb", ylab = "Number of 10kb bins")
lines(0:6, nBins * dpois(x = 0:6, lambda = nGenes/nBins), col = 2)

Binomial distribution (discrete)

https://en.wikipedia.org/wiki/Binomial_distribution

Describes the number of successes in a given series of independent experiments with binary outcomes and the same success probability in each trial.

Parameters: size (number of trials) and prob (success probability)

R functions: dbinom, pbinom, qbinom and rbinom

Probability density:

xs <- 0:40
ns <- c(20, 20, 40)
ps <- c(0.5, 0.7, 0.5)
plot(c(0,40), c(0,0.25), type = "n", xlab = "x", ylab = "P(x)")
for (i in seq_along(ns)) {
    lines(xs, dbinom(x = xs, size = ns[i], prob = ps[i]), lwd = 1, col = i + 1)
    points(xs, dbinom(x = xs, size = ns[i], prob = ps[i]), pch = 1, col = i + 1)
}
legend(x = "topright", legend = sprintf("p = %.1f  n = %d", ps, ns), lwd = 1, pch = 1, col = 2:4, bty = "n")

Examples:
- Number of heads when flipping a coin \(n\) times (\(p = 0.5\))
- Number of methylated CpGs covered \(n\) times, if the average methylation is \(p\)
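
As with the other distributions, sampling is straightforward. A small sketch (arbitrary seed; size and prob chosen only for illustration):

set.seed(5)                                          # arbitrary seed
r_binom <- rbinom(n = 10000, size = 20, prob = 0.5)  # heads out of 20 fair coin flips, repeated 10000 times
mean(r_binom)                                        # close to size * prob = 10
plot(table(r_binom), xlab = "number of heads in 20 flips", ylab = "frequency")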