Heritability

Introduction

Heritability is the proportion of variation in a trait within a population that can be attributed to genetic differences. There are two type of definition and Nanow-sense Heritability is the common one.

Broad-sense Heritability

\begin{align*} H^2&=\frac{V_G}{V_G+V_E} \\ &=\frac{V_G}{V_P} \end{align*}

$V_P$ is phenotype variation
$V_G$ is genetic variation
$V_E$ is environment

\begin{align*} H^2=\frac{V_G}{V_P} \end{align*}

Nanow-sense Heritability

\begin{align*} h^2=\frac{V_A}{V_P}, \quad \text{where} V_G=V_A+V_{NA} \end{align*}

$V_A$ is additive genetic variation
$V_{NA}$ is non-addrive genetic variation

Example-Dominant Coding

In dominant coding, genotypes $CC$ , $C T$ and $T T$ are coded as $1$ , $1$ and $0$ , respectively, if $C$ is the minor allele.

Example-Epistasis

Labrador coat color is determined by two genes with four genotypes: $BE$ , $bE$ , $Be$ , $be$

Color is black when genotype is $B-E-$
Color is chocolate when genotype is $bbE-$
Color is yellow when genotype is $--ee$

Note

Heritability refers to a specific population, not to individuals.
Heritability $\neq$ inheritance. For example, your brown hair may be inherited from your father, but the heritability of brown hair in the population may be low.
Heritability $\neq$ total genetic contribution. A low $h^2 = \frac{V_A}{V_P}$ does not necessarily mean that genetics plays a small role.
If $h^2$ is low, identifying associated genes might be less fruitful.

There are $3$ common types of heritability.

Family-Based Heritability $(h^2_{\text{family}})$

Family-based studies, often twin studies, estimate heritability by comparing monozygotic (MZ) twins and dizygotic (DZ) twins. Let $r_{MZ}$ be the phenotypic correlation for MZ twins and $r_{DZ}$ for DZ twins.

\begin{align*} \left\lbrace \begin{array}{lll} r_{MZ} = A+C \\ r_{DZ} = \frac{A}{2}+C \end{array} \right. \end{align*}

where

$A$ is additive genetic effect
$C$ is shared (common) environmental effect

We esitmate

\begin{align*} h^2_{\text{family}} & =A=2(r_{MZ}-r_{DZ}) \\ C & = A-r_{MZ} \end{align*}

error $E = 1-C$ and $A+C+E=1$

SNP-Based Heritability $(h^2_{\text{SNP}})$

Estimated using tools such as GCTA under the mixed linear model

\begin{align*} \mathbf{Y=X \beta+W u+\varepsilon} \end{align*}

where

$\mathbf{u} \sim N\left(0, \mathbf{I \sigma_u^2}\right)$
$\varepsilon \sim N\left(0, \mathbf{I \sigma_2^2}\right)$
$\beta$ is a fixed effect (no variation)

So the variance of $\mathbf{Y}$ is

\begin{align*} \operatorname{Var}(\mathbf{Y})= & V \\ = & \operatorname{Var}(\mathbf{W u})+\operatorname{Var}(\varepsilon) \\ = & \mathbf{W W^T \sigma_u^2+I \sigma_{\varepsilon}^2} \end{align*}

\begin{align*} \mathbf{W}=\left\{w_{i j}\right\}, \quad w_{i j}=\frac{X_{i j}-2 p_j}{\sqrt{2 p_j\left(1-p_j\right)}} \end{align*}

where

$X_{ij}$ is $j-th$ SNP for $i-th$ individual
$p_j$ is $j-th$ SNP MAF

Define Genetic Relationship Matrix (GRM)

\begin{align*} A=\frac{\mathbf{WW^T}}{N} \end{align*}

where $\mathbf{\sigma_g^2}=N \mathbf{\sigma_u^2}$

Rewiting the model

\begin{align*} \Rightarrow \quad \mathbf{Y=X \beta+g+\varepsilon}, \quad \mathbf{g} \sim N\left(\mathbf{0, A \sigma_g^2}\right) \end{align*}

We estimate $\sigma_g^2$ using REML (Restricted Maximum Likelihood). The proportion of phenotypic variance explained by the SNPs used to construct the GRM is given by

\begin{align*} h^2_{\text{SNP}} = \frac{\sigma_g^2}{\text{Var}(Y)} \end{align*}

GWAS-based Heritability $(h_{\text {GWAS }}^2)$

This estimates heritability using only the significant SNPs identified in GWAS. Assuming $m$ significant SNPs are linearly associated with the trait

\begin{align*} {Y}=\beta_0+\sum_{i=1}^m \beta_i X_i+\varepsilon \end{align*}

The heritability is

\begin{align*} h_{\text {GWAS }}^2 = \frac{Var(\hat{Y})}{Var(Y)} \end{align*}

Relationship Between Heritability Types

\begin{align*} h_{\text {family }}^2 > h_{\text {SNP }}^2 >> h_{\text {GWAS }}^2 \end{align*}

This gap is known as missing heritability.

Code

We used the gcta64 command to estimate the Genetic Relationship Matrix (GRM) for each of the 22 chromosomes separately. Compared to estimating the GRM for all autosomes together, we found that the results are identical. However, the former method took approximately two and a half hours, significantly longer than the latter, which required only ten minutes.

Computing Separately

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
## estimate h^2_SNP
getwd()
setwd("D:/GWAS_CLASS/GCTA")


## step0: split snp data to different chr
for (i in 1:22) {
  system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i, " --make-bed --out D:/GWAS_CLASS/GCTA/data/merge_chr", i))
}

## step1: make GRM 
# --maf: filter SNPs
# --make-grm: make GRM
# --thread-num: Parallel computation. You should generally not specify a number of threads that exceeds the number of physical cores.
for (i in 1:22) {
  system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i ," --maf 0.01 --make-grm --out D:/GWAS_CLASS/GCTA/data/merge_chr", i, " --thread-num 10"))
}

## step2: build grm_chrs.txt put in all chr GRM file name 
writeLines(paste0("D:/GWAS_CLASS/GCTA/data/merge_chr", 1:22), "D:/GWAS_CLASS/GCTA/grm_list.txt")

## step3: merge all the GRMs by the following command:
system("gcta64 --mgrm D:/GWAS_CLASS/GCTA/grm_list.txt --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge")

## step4: remove cryptic relatedness: 0.025 roughly corresponds to individuals who are less related than third-degree
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge --grm-cutoff 0.025 --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered")

## step5: estimating the variance explained by the SNPs (heritability)
### input: GRM in step 3 (grm_merge) + phenotype info (pheno.txt)
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --pheno D:/GWAS_CLASS/20101123/process/pheno.txt --reml --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --thread-num 10")

Computing Together

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
## estimate h^2_GWAS
system("plink --bfile D:/GWAS_CLASS/20101123/process/merge --extract D:/GWAS_CLASS/20101123/process/significant_snp.txt --recodeA --out D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp")

df <- fread("D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp.raw")
pheno <- fread("D:/GWAS_CLASS/20101123/process/pheno.txt")
df <- df[,-(1:6)]
df$y <- pheno$Height

lm_sigsnp <- lm(y ~ ., data = df)
summary(lm_sigsnp)

Result

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# Summary result of REML analysis:
# Source	Variance	SE
# V(G)	10.927486	69.918584
# V(e)	0.000011	69.472995
# Vp	10.927497	5.158109
# V(G)/Vp	0.999999	6.357631
# 
# Sampling variance/covariance of the estimates of variance components:
# 4.888608e+03	-4.844250e+03
# -4.844250e+03	4.826497e+03	

Reference

Labrador
Surface from bing

Introduction

Broad-sense Heritability

Nanow-sense Heritability

Example-Dominant Coding

Example-Epistasis

Family-Based Heritability (hfamily2)(h^2_{\text{family}})(hfamily2​)

SNP-Based Heritability (hSNP2)(h^2_{\text{SNP}})(hSNP2​)

GWAS-based Heritability (hGWAS 2)(h_{\text {GWAS }}^2)(hGWAS 2​)

Relationship Between Heritability Types

Code

Computing Separately

Computing Together

Reference

Family-Based Heritability $(h^2_{\text{family}})$

SNP-Based Heritability $(h^2_{\text{SNP}})$

GWAS-based Heritability $(h_{\text {GWAS }}^2)$