###########
#
# GenMedMSc2023_CM2.2_practical_extension.R
#
## A longer, harder optional extended practical to accompany lectures
## CM2.1 and CM2.2 of the Genomic Medicine MSc core module 2
## by Luke Jostins-Dean, Jennie Astley and Alexandra Martin-Geary
## 24th October 2023
#

##NOTE: This is an older, and harder version of the practical.
##We no longer require students to carry this out, and they have found it too difficult
##But please do feel welcome to try it, we think it has some nice stuff in it

##compared to the current (2023) version of the practical it covers additional topics including:
# - building larger pedigrees
# - simulating genotypes on a pedigree using transmission matrices
# - carrying out segregation analysis
# - doing the power calculations from the lecture

##Part 0 - setting up packages and downloading files

#There are three example files in the same directory on canvas that you downloaded this R script from 

#You will need to edit this command to where-ever you downloaded the three example files to on your system:
# Copy folder path to clipboard from Files
file_prefix <- '~/MSc/'

#you will also need to install three packages from CRAN: pedtools, paramlink2 and genpwr, if you don't already have them installed:
install.packages('pedtools')
install.packages('paramlink2')
install.packages('genpwr')


###Part 1 - making and visualising simple pedigrees

### The package pedtools can make pedigrees. Lets load the package:
library(pedtools)

#we can make a simple nuclear family, with alternating sex offspring
nuc1 <- nuclearPed(nch=5,sex=c(1,2))
nuc1

#this type of format is often called "pedigree" or "ped" format
#each individual has an id (here numbered 1 to 7), as well as the ID of that individual's father ("fid") and mother ("mid"), where "*" or "0" means that that parent isn't present in the pedigree, and the sex (1 for male, 2 for female). Pedigree format can also include other information that isn't present here, such as disese status.

#we can visualise the ped file using the default plotting function
plot(nuc1)

#this creates pedigree plot, for family tree. Squares are male, circles are female, the parents are at the top and all the offspring are at the bottom.

#next, we are going to put some genotypes onto the pedigree. For now, we will make them up, assuming both parents are heterozygous (i.e. have dosage = 1), and the offspring range from homozygous reference (dosage = 0), heterozygous (dosage = 1) or homozygous alternative (dosage = 2): 
nuc1_genotypes <- c(1,1,0,2,1,1,2)

#pedtools has a "marker" file format that holds marker data for a pedigree. 
nuc1_markers <- marker(nuc1,geno=c("A/A","A/a","a/a")[1 + nuc1_genotypes])

#pedtools can now visualise this 
plot(nuc1,marker= nuc1_markers)

#finally, lets assume that this variant we have made up drives a fully penetrant recessive Mendelian disease. So we make an affection status vector, which is TRUE for individuals who are homozygous (and thus have the disase) and false (i.e. does does not have the disease) otherwise.
nuc1_affected <- nuc1_genotypes == 2

#lets plot the pedigree, with genotypes and affection statuses, together
plot(nuc1,marker= nuc1_markers, aff = which(nuc1_affected))

##Tasks for part 1
#Task 1.1 - make another example pedigree with genotypes and affection status, illustrating a dominant disease
#Task 1.2 - make another one, this time showing an X-linked recessive disease. This will require a bit of thinking, as males should be coded as "a" or "A", and female "A/A", "A/a" or "a/a". You can tell the marker() function to expect an X chromosome by setting the argument chrom="X".

###Part 2 - bigger pedigrees, and simulating genotypes and phenotypes on pedigrees

#we can make a bigger set of cousins by joining together pedigrees

#lets simulate a nuclear family for each of the offspring in nuc1
nuc2.1 <- nuclearPed(nch=5,sex=c(1,2))
nuc2.2 <- nuclearPed(nch=5,sex=c(1,2))
nuc2.3 <- nuclearPed(nch=5,sex=c(1,2))
nuc2.4 <- nuclearPed(nch=5,sex=c(1,2))
nuc2.5 <- nuclearPed(nch=5,sex=c(1,2))

#and join each of them onto the first pedigree
bigped1 = mergePed(nuc1, nuc2.1, by = c("7" = "1"))
bigped1 = mergePed(bigped1, nuc2.2, by = c("6" = "2"))
bigped1 = mergePed(bigped1, nuc2.3, by = c("5" = "1"))
bigped1 = mergePed(bigped1, nuc2.4, by = c("4" = "2"))
bigped1 = mergePed(bigped1, nuc2.5, by = c("3" = "1"),relabel=TRUE)

#plot this - we have one big pedigree with five children and 25 grandchildren
plot(bigped1)
#you might need to make the plotting area bigger to see all the ids
#note that we used "relabel=TRUE" in the mergePed argument above, so the ids have now changed to make them easier to read (which means individual "7" in this pedigree is the person who was called individual "5" in the last one) 

#for adding in genotypes, we are now going to use simulation, rather than typing 37 individual genotypes. To do this, we need to think about transmission probabilities

#transmission probabilities are a fancy way of saying the numbers you get out of Punnett square, which you probably did in school, and which Mendel came up with back in the 19th century. 

#So, for example, if both parents have dosage = 0 (i.e. both are "A/A"), then there is a 100% change that their offspring will have dosage = 0 as well. And, to give anothjer example, if one parent is dosage = 0 (A/A) and the other is dosage = 1 (A/a), there is a 50% chance the offspring will be dosage = 0 (A/A) and 50% chance they will be dosage = 1 (A/a).

#this three-dimensional array gives all possible transmission probabalities;
transMat <- array(dim=c(3,3,3))
transMat[,,1] <- cbind(c(1,0.5,0),c(0.5,0.25,0),c(0,0,0))
transMat[,,2] <- cbind(c(0,0.5,1),c(0.5,0.5,0.5),c(1,0.5,0))
transMat[,,3] <- cbind(c(0,0,0),c(0,0.25,0.5),c(0,0.5,1))
dimnames(transMat) <- list(c("A/A","A/a","a/a"),c("A/A","A/a","a/a"),c("A/A","A/a","a/a"))

#so, e.g., for dosage=0 parents, dosage = 0 offspring probability is 1 (i.e. 100%):
transMat['A/A',"A/A","A/A"]

#you can also see all the transmission probabilities for a particular set of parental genotypes 
transMat['A/a',"A/a",]

#you can then simulate genotypes for offspring use the R function sample(). This command samples a single genotype (size=1) for the offspring of heterozygous parents.
sample(c("A/A","A/a","a/a"),size=1,prob=transMat['A/a',"A/a",])

#Task 2.1 - check that this command gives the expected 1:2:1 ratio expected from Mendel's laws. You can change the "size" argument to increase the number of offspring you sample.

#we can make a function to carry out simulations automatically for a given individual in a pedigree, sampling from a set allele frequency f for founders (i.e. individuals with no parents in the pedigree), and from the transmission probabilities for non-founders (i.e. offspring).
simulateGenotype <- function(pedigree,genotypes,i,f,transMat){
	
	#check if the individual is a founder (i.e. has no parents)
	isFounder <- pedigree$FIDX[i] == 0 & pedigree$MIDX[i] == 0
	
	if (isFounder) {
		#if it is a founder, sample genotypes from the allele frequency
		return(rbinom(1,2,f))
	} else {
		#if it is not a founder, use the Mendelian tranmission probabilities:
		#use the pedigree to find the genotypes for the parents
		M_geno <- genotypes[pedigree$MIDX[i]]
		F_geno <- genotypes[pedigree$FIDX[i]]

		#get the transmission probabilities for this individual
		probs <- transMat[1+M_geno,1+F_geno,]
	    return(sample(0:2,1,prob=probs))
	}
}

#let's make an vector to hold the genotypes
bigped1_genotypes <- rep(NA,length(bigped1$ID))

#and set both the grandparents to be heterozygous (i.e. dosage=1)
bigped1_genotypes[1] <- 1
bigped1_genotypes[2] <- 1

#and then simulate genotypes for everyone else (lets set f=0.0001, ie the frequency of the mutation is only 0.01% in the population)
for (i in 3:length(bigped1$ID)) bigped1_genotypes[i] <- simulateGenotype(bigped1, bigped1_genotypes,i,f=0.0001,transMat=transMat)

#we can make a marker object from the genotypes
bigped1_marker <- marker(bigped1,geno=c("A/A","A/a","a/a")[1 + bigped1_genotypes])

plot(bigped1,marker= bigped1_marker)

#finally, let's think about simulating phenotypes
#instead of looking at a fully penetrant disease, lets model a disease with incomplete penetrance and phenocopies

#we can define the disease model and the penetrance using genotype-specific prevalences, which give the probability of getting disease for the three genotype classes ("A/A", "A/a","a/a").

#For instance an entirely genetic, fully penetrent recessive disease has the vector c(0,0,1), i.e. prevalence = 0 in homozygous reference, pentrance = 0 in heterozygotes and prevalence = 1 in homozygous alternative (i.e. all homozygous mutants have the disease, and no-one else). 

#A dominant disease with partial penetrance (i.e. some people who have the mutation do not get the disease) and a small rate of phenocopies (i.e. people who get the disease without having the mutation) might have a penetrance vector of c(0.001,0.8,0.8). I.e. 0.1% of people without the mutation get the disease, and 80% of those with the mutation get it.

#Let's use that second one as an example, as set the prevalence vector:
Kvec <- c("A/A"=0.001,"A/a"=0.8,"a/a"=0.8)

#We can simulate affection status by sampling a random number from the uniform distribution and checking whether it is smaller than the penetrance (ask an instructor if you are confused by this). 

#e.g. to simulate effect status for a heterozygous individual you could use:
runif(1) < Kvec["A/a"]
#if you run this a bunch of times, you will see it comes back "TRUE" 80% of the time

#So now let's sample affection status for the whole pedigree, based on their genotypes
bigped1_affected <- runif(length(bigped1$ID)) < Kvec[1 + bigped1_genotypes]

#and visualize it:
plot(bigped1,marker= bigped1_marker,aff=which(bigped1_affected))


###Part 3 - segregation analysis

#below is a function for carrying out segregation analysis on a pedigree
#specifically, it takes in a genetic model (allele frequency f, prevalence vector Kvec), a pedigree and associated phenotypes (i.e. affection statuses), and a proposed genotype for one or two of the founders, and returns a log likelihood.
#this is slightly complicated R programming (it involves recursive programming - you can see that the function calls itself), so don't worry if you don't understand all the details. An instructor will be happy to explain how this works, if you would like help to understand it.

getLogL_ind <- function(geno, phenotypes, pedigree,i,f,transMat,Kvec, set_partner_geno = NA){
	#function for calculating the likelihood of observing the affection status of all individuals descended from a given individual, conditional on a genetic model (allele frequency and penetrance vector), and the genotype of the given individual (and, optionally, their partner)
	#geno <- the proposed genotype of the individual
	#phenotypes <- a vector of affection statuses for all individuals in the pedigree
	#pedigree <- a pedigree object for the family
	#i <- the index of the individual in the pedigree
	#f <- the allele frequency
	#transMat <- the transition matrix
	#Kvec <- the genotype-specific penentrances
	
	#define a logsum function, to add up values in log-space
	logsum <- function(x) {
		if (all(x == -Inf)) return(-Inf)
		else return(max(x) + log(sum(exp(x - max(x)))))
	}
	
	#get the affection status for the individual i
	aff <- phenotypes[i]
	
	#check if they are a leaf (i.e. have no children)
	isLeaf <- !(pedigree$ID[i] %in%  pedigree$FIDX | pedigree$ID[i] %in%  pedigree$MIDX)
	if (isLeaf){
		#if they are a leaf, return the log likelihood they have the disease given their hypothesed genotype
		return(dbinom(aff,1,Kvec[1 + geno],log=TRUE))
	} else {
		#otherwise, calculate the likelihood for all their descendents
		
		#find out which individuals are their offspring and partner
		offspring <- which(pedigree$ID[i] == pedigree$FIDX | pedigree$ID[i] ==  pedigree$MIDX)
		partner <- setdiff(c(pedigree$FIDX[offspring],pedigree$MIDX[offspring]),i)
		
		#find the log likelihood of their affection status
		SelfAffL <- dbinom(aff,1,Kvec[1 + geno],log=TRUE)

		#test each of their partners possible genotypes
		partnerGenoVec <- rep(NA,2)
		for (partnerGeno in 0:2){
			
			#get their partners affection status
			partnerAff <- phenotypes[partner]
			
			#get their partners genotype and affection status likelihoods (assuming their partner is a founder)
			PartnerGenoL <- dbinom(partnerGeno,2,f,log=TRUE)
			PartnerAffL <- dbinom(partnerAff,1,Kvec[1 + partnerGeno],log=TRUE)
			
			#check each of the offspring
			offspringVec <- rep(NA,length(offspring))
			for (j in 1:length(offspring)){
				#check all possible genotypes for this offspring
				offspringGenoVec <- rep(NA,2)
				for (offspringGeno in 0:2){
					#get the genotype status likelihoods for this offspring from the transmission matrix
					OffspringGenoL <- log(transMat[1 + geno, 1 + partnerGeno, 1 + offspringGeno])
					
					#use this function recursively to get the affection status likelihood for this individual and all its offspring (if they have any)
					OffspringAffL <- getLogL_ind(offspringGeno, phenotypes, pedigree, offspring[j],f,transMat,Kvec)
					
					#add the above together
					offspringGenoVec[1 + offspringGeno] <- OffspringGenoL + OffspringAffL
				}
				
				#combine the likelihoods across different offspring genotypes
				offspringVec[j] <- logsum(offspringGenoVec)
			}
			
			if (is.na(set_partner_geno)){
				#if the partners genotype is unset, add up the offspring
				partnerGenoVec[1 + partnerGeno] <- sum(offspringVec) + PartnerGenoL + PartnerAffL
			} else {
				#otherwise, only include this likelihood if the genotype is equal to the preset partner genotype
				if (set_partner_geno  == partnerGeno){
					partnerGenoVec[1 + partnerGeno] <- sum(offspringVec) + PartnerAffL 
				} else {
					partnerGenoVec[1 + partnerGeno] <- -Inf
				}
			}
		}
		return(SelfAffL + logsum(partnerGenoVec))
		
	}
}

#Let's use this function to test some possible genetic architectures

#possibility 1 - there is no mutation, the family is just at a high risk of disease regardless of genotype for some environmental reason (prevalence = 0.5)

#so, set the parents genotypes to be equal to 0, the allele frequency to be 0, and the prevalence equal to 0.5 regardless of genotype
getLogL_ind(geno=0, set_partner_geno =0,phenotypes= bigped1_affected, pedigree= bigped1,i=1,f=0,transMat=transMat,Kvec=c(0.5,0.5,0.5))

#possibility 2 - the true model (i.e. Kvec = 0.001,0.8,0.8), with both parents carrying the mutation
getLogL_ind(geno=1, set_partner_geno =1,phenotypes= bigped1_affected, pedigree= bigped1,i=1,f=0.0001,transMat=transMat,Kvec=c(0.001,0.8,0.8))


#we can see that the true genetic model has a much higher (i.e. less negative) log likelihood than the non-genetic model

#Task 3.1  - Can this pedigree be explained by a fully penetrant recessive variant? How about a partially penetrant one?
#Task 3.2  - How strong is the evidence for reduced penetrance in this pedigree? 


###Part 4 - Reading in pedigree files, and running a linkage scan

#read in the paramlink2 library, for doing parametric linkage
library(paramlink2)

#we have a ped file of cousins, similar to the previous pedigree. Let's read it in and take a look at it;
pedfile <- readPed(paste0(file_prefix,'/Pedigree2.ped'))

plot(pedfile)

#we also have affection statuses for the individuals
afffile <- read.table(paste0(file_prefix,'/Pedigree2.aff'))

#let's just check that the individuals are in the same order
all(afffile$id == pedfile$ID)

#and let's plot the affection status on the pedigree
plot(pedfile,aff=which(afffile$phenotype))

#lots of cases!

#note that in this case, the markers are stored in the ped file itself, as a list of marker objects. This shows us the geneotypes for the first marker:
pedfile$MARKERS[[1]]

#the pedigree file itself doesn't store details on the markers (their position, their chromosome, etc). That is stored in a map file, which includes the chromosome, the name of the marker, and the position of the marker (in Mb):
mapfile <- read.table(paste0(file_prefix,'/Pedigree2.map'))
head(mapfile)

#so, we can see that the first marker:
mapfile[1,]

#is on chromosome 1, is called "marker1", and is at around 1.43Mb. We can also plot it on the pedigree:

plot(pedfile,marker=pedfile$MARKERS[[1]],aff=which(afffile$phenotype))

#this first variant doesn't seem to line up with affection status.

#Can any of the other variants explain the disease? Lets do a parametric linkage scan. Remember, parametric means that you set model parameters yourself, so we need to decide on what inheritance model to use.

#What models are compatible with this data? It couldn't be a fully penetrant dominant model, because the disease skips a generation in some cases (e.g. individual 6). It could be a partially penetrant dominant variant, though. Lets try that:

#we set the model to autosomal dominant ("AD"), and set the penetrances, with the diseaseModel function:
modAD = diseaseModel("AD", penetrances=c(0,0.8,0.8))

#and then we run the linkage scan
lodsAD = lod(pedfile, aff = which(afffile$phenotype), model = modAD)

#we can plot the LOD scores for each marker
plot(lodsAD)

#no LOD scores greater than 2, so no signal there.

#Let's try another model. This pedigree could be compatible with a recessive model, if the variant was common in the population and was being introduced into the family at multiple times. Lets try that:

#set the model to autosomal recessive ("AR"), with default penetrances.
modAR = diseaseModel("AR")

#run the linkage scan:
lodsAR = lod(pedfile, aff = which(afffile$phenotype), model = modAR)

#And plot the results
plot(lodsAR)

#Bingo! We have a big hit (LOD > 6).

#Task 4.1 - which marker is this hit? Where is it located in the genome?
#Task 4.2 - make a plot of this marker on the pedigree, along with affection status - how well does it seem to fit?
#Tast 4.3 - what is the allele frequency of this variant in the founders of this pedigree? Does this seem suprising? Under what circumstances might you see a pathogenic variant to rise to such a high allele frequency?


###Part 5 - A very quick detour into power and candidate gene studies

#We won't run an actual candidate gene study here, because the principles are very similar to running a proper genome-wide association study, which we will cover later on in the course.

#But we are going to cover how we generate the sort of power and posterior curves we saw in the lecture

#we will use the R package genpwr to calculate power curves:
library(genpwr)

#suppose we have a candidate gene study of a binary trait, looking at one variant, using a p-value threshold of 0.05, in 200 cases and 200 controls, and with an ellele frequency of 30%. How does power change with the effect size?

#lets set an array to contain a set of odds ratios to test, we will test 100 odds ratios, spaced linearly on the log scale between 1.01 and 2:
ORs <- exp(seq(log(1.01),log(2),length.out=100))
ORs

#now lets calculate the power. We will assume we are using a logistic model. Note that, instead of setting the number of cases and controls, we set a total sample size (N=400), and specify that 50% of them are cases (Case.Rate=0.5). We will assume the true model is additive model, and that we are also testing using an additive model (we discussed in CM2.1.2 what this means).
pwrs <- genpwr.calc(calc="power",model="logistic",N=400,Case.Rate=0.5,MAF=0.3,OR=ORs,True.Model="Additive",Test.Model="Additive",Alpha=0.05)

#this object, pwrs, includes all the details of the power calculations. we can use it to plot odds ratio against sample size (we will put the x-axis on the log scale, to make it more readable).
plot(pwrs$OR,pwrs$Power,log="x")

#we can see from this plot that this study would have 50% power for an odds ratio of about 1.4, and an 80% power for an odds ratio of 1.6.

#if we set an alpha value (we will use alpha=0.05 for p < 0.05) and the prior (which will we set to be 1 in 20):
alpha <- 0.05
prior <- 1/20 

#we can calculate the posterior (i.e. the probability the association is a true positive and not a false positive), as a function of the effect size, using the equation from the lecture
posterior <- prior*pwrs$Power/(prior*pwrs$Power + alpha*(1 - prior))

#you can see that the posterior never goes above 52%, and for small odds ratios (<1.2) it is very low indeed.
plot(pwrs$OR, posterior,log="x")


#Task 5.1 - Remember that in the lecture we did a Bayesian analysis of the PPARg candidate gene study. We said that the true effect size was OR=1.2, the sample size of 3000 (assuming 50% cases), a p-value threshold of 0.002, and the minor allele frequency of 0.13. Can you reproduce the posterior plot, as a function of the prior?
#Task 5.2 - if the PPARg paper had only achieved a p-value threshold of p < 0.05, would it have been strong enough evidence to think it was likely PPARg was associated with disease? At a prior value of 1 in 100, what p-value threshold would you need to have seen to be confident the association was real?