Featured image of post Heritability

Heritability

Introduction

Heritability is the proportion of variation in a trait within a population that can be attributed to genetic differences. There are two type of definition and Nanow-sense Heritability is the common one.

Broad-sense Heritability

H2=VGVG+VE=VGVP\begin{align*} H^2&=\frac{V_G}{V_G+V_E} \\ &=\frac{V_G}{V_P} \end{align*}
  • VPV_P is phenotype variation
  • VGV_G is genetic variation
  • VEV_E is environment
H2=VGVP\begin{align*} H^2=\frac{V_G}{V_P} \end{align*}

Nanow-sense Heritability

h2=VAVP,whereVG=VA+VNA\begin{align*} h^2=\frac{V_A}{V_P}, \quad \text{where} V_G=V_A+V_{NA} \end{align*}
  • VAV_A is additive genetic variation
  • VNAV_{NA} is non-addrive genetic variation

Example-Dominant Coding

In dominant coding, genotypes CCCC, CTC T and TTT T are coded as 11, 11 and 00, respectively, if CC is the minor allele.

Example-Epistasis

Labrador coat color is determined by two genes with four genotypes: BEBE, bEbE, BeBe, bebe

  • Color is black when genotype is BEB-E-
  • Color is chocolate when genotype is bbEbbE-
  • Color is yellow when genotype is ee--ee

Note

  1. Heritability refers to a specific population, not to individuals.

  2. Heritability \neq inheritance. For example, your brown hair may be inherited from your father, but the heritability of brown hair in the population may be low.

  3. Heritability \neq total genetic contribution. A low h2=VAVPh^2 = \frac{V_A}{V_P} does not necessarily mean that genetics plays a small role.

  4. If h2h^2 is low, identifying associated genes might be less fruitful.

There are 33 common types of heritability.

Family-Based Heritability (hfamily2)(h^2_{\text{family}})

Family-based studies, often twin studies, estimate heritability by comparing monozygotic (MZ) twins and dizygotic (DZ) twins. Let rMZr_{MZ} be the phenotypic correlation for MZ twins and rDZr_{DZ} for DZ twins.

{rMZ=A+CrDZ=A2+C\begin{align*} \left\lbrace \begin{array}{lll} r_{MZ} = A+C \\ r_{DZ} = \frac{A}{2}+C \end{array} \right. \end{align*}

where

  • AA is additive genetic effect
  • CC is shared (common) environmental effect

We esitmate

hfamily2=A=2(rMZrDZ)C=ArMZ\begin{align*} h^2_{\text{family}} & =A=2(r_{MZ}-r_{DZ}) \\ C & = A-r_{MZ} \end{align*}

error E=1CE = 1-C and A+C+E=1A+C+E=1

SNP-Based Heritability (hSNP2)(h^2_{\text{SNP}})

Estimated using tools such as GCTA under the mixed linear model

Y=Xβ+Wu+ε\begin{align*} \mathbf{Y=X \beta+W u+\varepsilon} \end{align*}

where

  • uN(0,Iσu2)\mathbf{u} \sim N\left(0, \mathbf{I \sigma_u^2}\right)
  • εN(0,Iσ22)\varepsilon \sim N\left(0, \mathbf{I \sigma_2^2}\right)
  • β\beta is a fixed effect (no variation)

So the variance of Y\mathbf{Y} is

Var(Y)=V=Var(Wu)+Var(ε)=WWTσu2+Iσε2\begin{align*} \operatorname{Var}(\mathbf{Y})= & V \\ = & \operatorname{Var}(\mathbf{W u})+\operatorname{Var}(\varepsilon) \\ = & \mathbf{W W^T \sigma_u^2+I \sigma_{\varepsilon}^2} \end{align*}W={wij},wij=Xij2pj2pj(1pj)\begin{align*} \mathbf{W}=\left\{w_{i j}\right\}, \quad w_{i j}=\frac{X_{i j}-2 p_j}{\sqrt{2 p_j\left(1-p_j\right)}} \end{align*}

where

  • XijX_{ij} is jthj-th SNP for ithi-th individual
  • pjp_j is jthj-th SNP MAF

Define Genetic Relationship Matrix (GRM)

A=WWTN\begin{align*} A=\frac{\mathbf{WW^T}}{N} \end{align*}

where σg2=Nσu2\mathbf{\sigma_g^2}=N \mathbf{\sigma_u^2}

Rewiting the model

Y=Xβ+g+ε,gN(0,Aσg2)\begin{align*} \Rightarrow \quad \mathbf{Y=X \beta+g+\varepsilon}, \quad \mathbf{g} \sim N\left(\mathbf{0, A \sigma_g^2}\right) \end{align*}

We estimate σg2\sigma_g^2 using REML (Restricted Maximum Likelihood). The proportion of phenotypic variance explained by the SNPs used to construct the GRM is given by

hSNP2=σg2Var(Y)\begin{align*} h^2_{\text{SNP}} = \frac{\sigma_g^2}{\text{Var}(Y)} \end{align*}

More detail read the paragraph GCTA

GWAS-based Heritability (hGWAS 2)(h_{\text {GWAS }}^2)

This estimates heritability using only the significant SNPs identified in GWAS. Assuming mm significant SNPs are linearly associated with the trait

Y=β0+i=1mβiXi+ε\begin{align*} {Y}=\beta_0+\sum_{i=1}^m \beta_i X_i+\varepsilon \end{align*}

The heritability is

hGWAS 2=Var(Y^)Var(Y)\begin{align*} h_{\text {GWAS }}^2 = \frac{Var(\hat{Y})}{Var(Y)} \end{align*}

Relationship Between Heritability Types

hfamily 2>hSNP 2>>hGWAS 2\begin{align*} h_{\text {family }}^2 > h_{\text {SNP }}^2 >> h_{\text {GWAS }}^2 \end{align*}

This gap is known as missing heritability.

Code

We used the gcta64 command to estimate the Genetic Relationship Matrix (GRM) for each of the 22 chromosomes separately. Compared to estimating the GRM for all autosomes together, we found that the results are identical. However, the former method took approximately two and a half hours, significantly longer than the latter, which required only ten minutes.

Computing Separately

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
## estimate h^2_SNP
getwd()
setwd("D:/GWAS_CLASS/GCTA")


## step0: split snp data to different chr
for (i in 1:22) {
  system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i, " --make-bed --out D:/GWAS_CLASS/GCTA/data/merge_chr", i))
}

## step1: make GRM 
# --maf: filter SNPs
# --make-grm: make GRM
# --thread-num: Parallel computation. You should generally not specify a number of threads that exceeds the number of physical cores.
for (i in 1:22) {
  system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i ," --maf 0.01 --make-grm --out D:/GWAS_CLASS/GCTA/data/merge_chr", i, " --thread-num 10"))
}

## step2: build grm_chrs.txt put in all chr GRM file name 
writeLines(paste0("D:/GWAS_CLASS/GCTA/data/merge_chr", 1:22), "D:/GWAS_CLASS/GCTA/grm_list.txt")

## step3: merge all the GRMs by the following command:
system("gcta64 --mgrm D:/GWAS_CLASS/GCTA/grm_list.txt --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge")

## step4: remove cryptic relatedness: 0.025 roughly corresponds to individuals who are less related than third-degree
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge --grm-cutoff 0.025 --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered")

## step5: estimating the variance explained by the SNPs (heritability)
### input: GRM in step 3 (grm_merge) + phenotype info (pheno.txt)
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --pheno D:/GWAS_CLASS/20101123/process/pheno.txt --reml --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --thread-num 10")

Computing Together

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
## estimate h^2_GWAS
system("plink --bfile D:/GWAS_CLASS/20101123/process/merge --extract D:/GWAS_CLASS/20101123/process/significant_snp.txt --recodeA --out D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp")

df <- fread("D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp.raw")
pheno <- fread("D:/GWAS_CLASS/20101123/process/pheno.txt")
df <- df[,-(1:6)]
df$y <- pheno$Height

lm_sigsnp <- lm(y ~ ., data = df)
summary(lm_sigsnp)

Result

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# Summary result of REML analysis:
# Source	Variance	SE
# V(G)	10.927486	69.918584
# V(e)	0.000011	69.472995
# Vp	10.927497	5.158109
# V(G)/Vp	0.999999	6.357631
# 
# Sampling variance/covariance of the estimates of variance components:
# 4.888608e+03	-4.844250e+03
# -4.844250e+03	4.826497e+03	

Reference

Licensed under CC BY-NC-SA 4.0
使用 Hugo 建立
主題 StackJimmy 設計