Some useful probability distributions

In this section we will explore two useful probability distributions using R.

Note

It's good to get used to visualising these in R. But if you prefer you can also use the interactive distribution zoo site to explore these distributions.

Note on learning outcomes

For this part of CM4, we are focussed on the concepts. You don't need to know all the mathematical details of these distributions, nor do you need to know all the R-specific details below. On the other hand, you should have a sense of what they are used to represent and why we might be interested in them, their shape and how the parameters work.

The Normal distribution

The normal distribution, also known as the Gaussian distribution, is commonly used to model variables that arise as sums of lots of other things.

For example, we might imagine that the expression of a gene depends on lots of factors that determine expression levels, all of which add up to the eventual expression level. Or we might imagine that a phenotype (like height, for example) is determined by lots of genetic and environmental factors all adding up. It turns out that variables that are sums of lots of small things like this tend to be approximately normally distributed.

A famous demonstration of this is the so-called 'Galton board' which you can see here:

Galton board video

img

The result of adding up all those random jumps to the left and right is... a normal distribution.
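If you'd like to see this without the video, here is a minimal sketch of a simulated Galton board. It assumes each ball makes a fixed number of independent jumps of -1 or +1 with equal probability; the number of rows, the number of balls and the seed are arbitrary illustrative choices, not values taken from the video.

```r
set.seed(1)        # arbitrary seed, only so the picture is reproducible
n_rows  = 50       # assumed number of rows of pegs
n_balls = 10000    # assumed number of balls dropped

# Each ball makes n_rows independent jumps of -1 (left) or +1 (right)
jumps = matrix(
  sample( c( -1, 1 ), size = n_balls * n_rows, replace = TRUE ),
  nrow = n_balls
)
position = rowSums( jumps )   # final position = sum of all the jumps

# Final positions move in steps of 2, so use bins of width 2 centred on them
hist(
  position,
  breaks = seq( from = -n_rows - 1, to = n_rows + 1, by = 2 ),
  freq = FALSE,
  xlab = "Final position",
  main = "Simulated Galton board"
)

# Overlay the normal density with the same mean (0) and variance (n_rows)
x = seq( from = -n_rows, to = n_rows, by = 0.1 )
lines( x, dnorm( x, mean = 0, sd = sqrt( n_rows )), lwd = 2 )
```

Each jump has variance 1, so the sum of n_rows jumps has variance n_rows, which is why that value is used for the overlaid curve.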

Challenge

Pick a mean value $\mu$ (start somewhere between $-10$ and $10$) and a variance $v$ (which must be positive - for example, $2$ is a good starting choice). Then plot the density of the normal distribution over the continuous range $x = -20 \cdots 20$.

The normal distribution density in R is given by the dnorm() function.

For example, you could plot it by creating a grid of values to plot at:

x = seq( from = -20, to = 20, by = 0.001 )

and choosing a mean and variance parameter:

mu = 5
variance = 2

Then use the dnorm() function to plot:

plot(
  x,
  dnorm(x, mean = mu, sd = sqrt( variance )),
  type = 'l',
  lwd = 2,
  xlab = "x",
  ylab = "Normal distribution density",
  xlim = c( -20, 20 ),
  ylim = c( 0, 0.4 ),
  bty = 'n'
)
grid()

img

Note. As I've done above, it is best to set sensible xlim and ylim values.

How does the distribution differ as you vary the mean and the variance?
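One way to explore this is to draw several curves on the same axes. The sketch below does that for a handful of (mean, variance) pairs; the particular values in `params` are arbitrary illustrative choices, so do swap in your own.

```r
x = seq( from = -20, to = 20, by = 0.01 )

# Illustrative (mean, variance) pairs - arbitrary choices, change them freely
params = data.frame(
  mu       = c( 0, 5, 0, -5 ),
  variance = c( 2, 2, 10, 25 )
)

# Empty plot with fixed axes, so that all the curves are directly comparable
plot(
  NULL,
  xlim = c( -20, 20 ), ylim = c( 0, 0.4 ),
  xlab = "x", ylab = "Normal distribution density",
  bty = 'n'
)
for( i in 1:nrow( params )) {
  lines(
    x,
    dnorm( x, mean = params$mu[i], sd = sqrt( params$variance[i] )),
    col = i, lwd = 2
  )
}
legend(
  "topright",
  legend = sprintf( "mean = %g, variance = %g", params$mu, params$variance ),
  col = 1:nrow( params ), lwd = 2, bty = 'n'
)
grid()
```

You should find that changing the mean slides the curve left or right without changing its shape, while increasing the variance makes it wider and lower.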

For reference, here is the formula for the normal distribution density:

$$x|\mu,v \sim \frac{1}{\sqrt{2\pi v}}\cdot e^{-\frac{1}{2}\frac{(x-\mu)^2}{v}}$$

It's not as complicated as it looks - the first bit $\left(\frac{1}{\sqrt{2\pi v}}\right)$ is just the normalising constant (it doesn't depend on $x$ and is just there to make the density integrate to $1$). The second part is more or less just a quadratic function of $x$, depending on the squared distance of $x$ from the mean relative to the variance, $\frac{(x-\mu)^2}{v}$. (You don't need to know these details for the exam.)
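If you want to convince yourself that this is the formula behind dnorm(), here is a small sketch that writes the density out by hand and compares it with the built-in function. The helper name `normal_density` is made up for this illustration. Note the minus sign in the exponent, which is what makes the density fall away as $x$ moves from the mean.

```r
# Hand-written normal density with mean mu and variance v
# (normal_density is a hypothetical helper name, not a built-in R function)
normal_density = function( x, mu, v ) {
  normalising_constant = 1 / sqrt( 2 * pi * v )
  quadratic_term = ( x - mu )^2 / v
  normalising_constant * exp( -0.5 * quadratic_term )
}

mu = 5
v = 2
x = seq( from = -20, to = 20, by = 0.001 )

# Largest difference from dnorm() over the grid; should be negligibly small
max( abs( normal_density( x, mu, v ) - dnorm( x, mean = mu, sd = sqrt( v ))))

# The area under the curve should be very close to 1
integrate( normal_density, lower = -Inf, upper = Inf, mu = mu, v = v )
```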

Binomial distribution

The binomial distribution answers the following important question:

Suppose a particular allele $A$ is at frequency $p$ in a population of interest. If we sample $n$ chromosomes and genotype them, how many will carry the allele?
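To get a feel for the question, you could simulate the sampling experiment directly. The sketch below assumes each sampled chromosome carries the allele independently with probability p; the values of n, p, the number of repeats and the seed are arbitrary illustrative choices.

```r
set.seed(1)             # arbitrary seed, only so the output is reproducible
n = 20                  # chromosomes sampled per experiment (illustrative)
p = 0.1                 # allele frequency in the population (illustrative)
n_experiments = 10000   # how many times the experiment is repeated

# One experiment: sample n chromosomes, each carrying the allele with
# probability p, and count the carriers. Repeat many times.
counts = replicate( n_experiments, sum( runif( n ) < p ))

# Proportion of experiments that gave each possible count 0, 1, ..., n
simulated = table( factor( counts, levels = 0:n )) / n_experiments
barplot( simulated, xlab = "Number carrying the allele", ylab = "Proportion of experiments" )
```

The answer varies from one experiment to the next, and the binomial distribution describes how often each count is expected to turn up.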

Challenge

Pick a number of samples $n$ (start between 5 and 20) and a probability or frequency $p$ (start between 0.1 and 0.9).

Then plot the binomial distribution over the range of integers $x = 0, 1, 2, \cdots, n$.

For example, you could pick parameters like this:

n = 20
p = 0.1

and use dbinom() to plot it:

x = 0:n
binom = dbinom( x, size = n, prob = p )
names(binom) = x
barplot(binom)

img

Question. How does the shape of the binomial differ as you vary $n$ and $p$?
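One way to explore this question is to draw a grid of bar plots, one for each combination of a few values. This is only a sketch: the values in `n_values` and `p_values` are arbitrary illustrative choices, and the loop variables are named `this_n` and `this_p` so that they don't overwrite any n and p you have already set.

```r
# Illustrative parameter values - arbitrary choices, change them freely
n_values = c( 5, 20 )
p_values = c( 0.1, 0.5, 0.9 )

# One row of panels per value of n, one column per value of p
old_par = par( mfrow = c( length( n_values ), length( p_values )))
for( this_n in n_values ) {
  for( this_p in p_values ) {
    probs = dbinom( 0:this_n, size = this_n, prob = this_p )
    names( probs ) = 0:this_n
    barplot(
      probs,
      ylim = c( 0, 0.6 ),
      main = sprintf( "n = %g, p = %g", this_n, this_p )
    )
  }
}
par( old_par )   # restore the previous plot layout
```

Things to look out for: where the highest bar sits relative to n times p, and how lopsided the distribution is when p is close to 0 or 1.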

Note. The plot above for $n=20$ and $p=0.1$ says that - if the true frequency was $10$% and we sampled $20$ chromosomes, we'd be most likely to find $2$ that carry the allele - but it could be as high as, say, $6$.

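On the other hand - we'd be very unlikely to find as many as $10$ copies of the allele. This can be computed using the corresponding pbinom() function. With lower.tail = FALSE it returns the probability of a count strictly greater than q, so q = 9 gives the probability of $10$ or more:

pbinom(
  q = 9,
  size = n,
  prob = p,
  lower.tail = FALSE # upper tail: total `dbinom()` mass at 10, 11, ..., n
)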
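It is easy to be off by one with tail probabilities, so it can be worth checking pbinom() against a direct sum of dbinom() values. The sketch below does this for n = 20 and p = 0.1; the variable names `at_least_10` and `more_than_10` are just illustrative.

```r
n = 20
p = 0.1

# Probability of 10 or more carriers, adding up the individual probabilities
at_least_10 = sum( dbinom( 10:n, size = n, prob = p ))

# Probability of 11 or more carriers (that is, strictly more than 10)
more_than_10 = sum( dbinom( 11:n, size = n, prob = p ))

# With lower.tail = FALSE, pbinom() gives the probability of MORE than q, so:
pbinom( q = 9,  size = n, prob = p, lower.tail = FALSE )   # agrees with at_least_10
pbinom( q = 10, size = n, prob = p, lower.tail = FALSE )   # agrees with more_than_10

# Print the two sums for comparison
c( at_least_10 = at_least_10, more_than_10 = more_than_10 )
```

Either way the probability is very small here, but the distinction matters more when the tail is not so far from the bulk of the distribution.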