GCTA

gcta上網站，gcta compute heritability, 放QC後的snp，計算GRM，reml 估出variance，算出 heritability。以往是使用GWAS找出跟trait顯著有關的位點，計算這些位點的heritability。但是有些位點跟trait有關，不過effect size很小，很難用GWAS找出。GCTA 好處是，可以估計所有 SNP 綜合起來的 heritability，省略掉使用GWAS尋找顯著相關的snp。假如同一筆資料，GCTA算出的heritability很高，但GWAS找到的位點算出很低，代表可以再多得到樣本，找到更多的資訊

{GCTA}

GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait \textcolor{blue}{rather than} testing the association of any particular SNP to the trait.
GCTA’s five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation.

% GCTA用來估計一群SNP的var，我們可以把SNP用一個個染色體來區分，或是分成編碼乘蛋白質的genetic snp跟不直接影響蛋白質，但是會影響基因表達的intergenetic snp，而不是估計特定的1,2個snp % 估計不同樣本之間的snp相近程度；估計snp可以解釋的變異

{Precess}

Quality control the SNPs
Compute GRM/relatedness matrix in GCTA
choose a mixed linear model
REML method in GCTA

% 步驟是使用QC過的SNP，計算GRM 親緣矩陣，選個混合模型，使用 algorithm 算出最像是unbiased 的variance，知道variance 之後，就能算Heritability %首先，因為GCTA裡面沒有函數可以做QC，所以會使用QC過的SNP，像是MAF、 Hardy-Weinberg equilibrium

{GRM(Genetic Relationship Matrix)}

For $n$ individuals, $m$ SNPs

$ \mathbf{X} $: $ n \times m $ genotype matrix
$ p_j $: $j-th$ SNP MAF

$\mathbf{X}^{\text{norm}}$ is standardized genotype matrix by

$$ X_{ij}^{\text{norm}} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}} $$

Define GRM $\mathbf{A}=\frac{1}{m}\mathbf{X}^{\text{norm}} (\mathbf{X}^{\text{norm}})'$ by

$$ A_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(X_{ij} - 2p_j)(X_{ik} - 2p_j)}{2p_j(1 - p_j)} $$

% A_jk就是第j,k個樣本，都有標準化的SNP 向量，然後是s維度的，把向量做內積 % 算出GRM可以知道樣本之間的基因相似程度。接著我們要挑選掉基因相似的樣本

{GRM}

Suppose minor allele is a and MAF is $p$, then

$$ \begin{align*} \left\lbrace \begin{array}{lll} P(SNP = AA) = (1-p)^2 & \text{with} & 0\text{ minor allele} \\ P(SNP = Aa) = 2p(1-p) & \text{with} & 1\text{ minor allele} \\ P(SNP = aa) = p^2 & \text{with} & 2\text{ minor allele} \\ \end{array} \right. \end{align*} $$$$ \begin{align*} E(SNP) &= 0 \times (1-p)^2 +1 \times 2p(1-p) + 2 \times p^2 \\ & = 2p \end{align*} $$$$ \begin{align*} Var(SNP) &= 0^2 \times (1-p)^2 +1^2 \times 2p(1-p) + 2^2 \times p^2 -E^2(SNP) \\ & = 2p(1-p) \end{align*} $$

{Example}

$$ \begin{array}{|c|c|c|c|} \hline \textbf{Individual} & \textbf{SNP1} & \textbf{SNP2} & \textbf{SNP3} \\ \hline A & 0 & 1 & 1 \\ \hline B & 1 & 0 & 1 \\ \hline \text{MAF} (p_j) & 0.25 & 0.25 & 0.5 \\ \hline \end{array} $$

$X^{\text{norm}}_{ij} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1 - p_j)}}$

Standardize $X$

$$ \begin{array}{|c|c|c|c|} \hline individual & SNP1 & SNP2 & SNP3 \\ \hline A & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline B & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline \end{array} $$

{Example}

the standardized genotype matrix is:

$$ X^{\text{norm}} = \begin{bmatrix} -0.816 & 0.816 & 0 \\ 0.816 & -0.816 & 0 \end{bmatrix} $$

Genetic Relationship Matrix is $\frac{1}{3} X^{\text{norm}} (X^{\text{norm}})^T$

$$ \begin{bmatrix} 0.443 & -0.443 \\ -0.443 & 0.443 \end{bmatrix} $$

{Exclude Close Relatives}

Including close relatives, this estimate could be a \textcolor{blue}{biased} estimate of total genetic variance

% 很相似的基因好比兄弟姊妹，他們來自相同的生長環境，這環境有可能影響基因表達，假如這資料也拿來使用，variance 就會估不好

{Mixed Linear Model}

$$ \begin{align*} \mathbf{Y_{n \times 1}}= \underbrace{\mathbf{X_{n \times p} \beta_{p \times 1}} }_{\text{fixed term}} + \underbrace{\mathbf{W_{n \times q}u_{q \times 1}} }_{\text{random term}} +\mathbf{\varepsilon_{n \times 1}} \end{align*} $$

$\mathbf{u} \sim N(0,\mathbf{I} \sigma_u^2)$, $\mathbf{\varepsilon} \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
$\mathbf{W}$ is standardized genotype matrix with $w_{ij} = \frac{x_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}}$, where $x_{ij}$ is $i-th$ individual $j-th$ SNP and $p_j$ is $j-th$ SNP MAF
$Var(\mathbf{Y}) = \mathbf{W W'}\sigma_u^2 +\mathbf{I} \sigma_\varepsilon^2$

%X is sex, age, 20PCs, trait?；u is snp effect % Y is phenotype，像是身高、BMI 等的性狀，上次講錯了

{Model 1}

To estimate the variance explained by all autosomal SNPs, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{\varepsilon} \end{align*} $$

The Mixed Linear Model is equivalent to this model with $\mathbf{A*g} = \frac{\mathbf{W W'}}{m},\ \sigma*\mathbf{g}^2 = m\sigma\_\mathbf{u}^2 $

$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
$Var(\mathbf{Y}) =\mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{I} \sigma_\varepsilon^2$

%這個模型，把所有的SNP看成同一個影響

{Model 2}

To partition genetic variance onto each of the 22 autosomes, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \sum_{i=1}^{22} \mathbf{g_i} + \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g_i} \sim N(0,\mathbf{A_i} \sigma_i^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_i}$ is GRM from the SNPs on $i-th$ chromosome
$Var(\mathbf{Y}) = \sum_{i=1}^{22} \mathbf{A_i} \sigma_i^2 + \mathbf{I} \mathbf{\sigma_\varepsilon^2}$

% 把SNP用染色體來區分，分成22個effect

{Model 3}

To estimate the variance of genotype-environment interaction effects, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{ge} + \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\mathbf{ge} \sim N(0,\mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
$Var(\mathbf{Y}) = \mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2 + \mathbf{I} \sigma_\varepsilon^2$
$$ \mathbf{A_{ge}}= \left\lbrace \begin{array}{lll} \mathbf{A_{g}} & \text{if} & \text{pairs of individuals in the same environment} \\ \mathbf{0} & \text{if} & \text{pairs of individuals in different environment}\end{array} \right. $$

% 假如想知道環境對基因表達的影響，也可以多一個SNp跟環境的交互作用向，這裡用ge表示

{Model 3}

$Cov(y_i,y_k) = \left\lbrace \begin{array}{lll} {A_{ik}}(\sigma_{\mathbf{ge}}^2+\sigma_{\mathbf{g}}^2) & \text{if} & \text{same environment} \\ {A_{ik}} \sigma_{\mathbf{g}}^2 & \text{if} & \text{different environment} \end{array} \right.$

##　｛Build Model} For model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g_{cis}} \sim N(0,\mathbf{A_{cis}} \sigma_{cis}^2)$, $\mathbf{g_{trans}} \sim N(0,\mathbf{A_{trans}} \sigma_{trans}^2)$, $\mathbf{g_{GE}} \sim N(0,\mathbf{A_{GE}} \sigma_{GE}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
$\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A*{cis}} \sigma*{cis}^2 + \mathbf{A*{trans}} \sigma*{trans}^2+ \mathbf{A*{GE}} \sigma*{GE}^2+ \mathbf{I} \sigma\_\varepsilon^2 $
$\theta = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_{\varepsilon}^2)$

Log likelihood function

$$ \begin{align*} & L_Y(\beta, \theta) \\ = & -\frac{n}{2} \ln(2\pi) -\frac{1}{2} \ln|\mathbf{V}| -\frac{1}{2} \mathbf{(Y - X\beta)^T V^{-1} (Y - X\beta)} \\ \propto & -\frac{1}{2} \left[ \ln |\mathbf{V}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right] \end{align*} $$

REML

Log likelihood function independent to $\beta$ is the target of REML (restricted maximum likelihood). Comparing to ML, variance estimator is unbiased in REML. Let $\mathbf{M}\ s.t. \mathbf{M}\mathbf{X} = 0$

$$ \begin{align*} &\text{Let } \mathbf{W} = \mathbf{M}\mathbf{Y} \\ &E(\mathbf{W}) = \mathbf{M}\mathbf{X}\beta = 0 \\ &\text{Var}(\mathbf{W}) = \mathbf{M}\mathbf{V}\mathbf{M}^T \end{align*} $$

Transfer $\mathbf{X}$ to $\mathbf{M}\mathbf{X}$，$\mathbf{Y}$ to $\mathbf{M}\mathbf{Y}$. Log likelihood function of $\mathbf{W}$

$$ \begin{align*} L_\text{REML}(\theta) & \propto -\frac{1}{2} \left[ \ln |\mathbf{V}| + \ln |\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right] \\ &\propto -\frac{1}{2} \left[ \ln | \mathbf{M}\mathbf{V}\mathbf{M}^T | + \ln |\mathbf{(MX)}^T(\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1}\mathbf{MX}| + (\mathbf{MY} - \mathbf{MX}\beta)^T \mathbf{V}^{-1} (\mathbf{MY} - \mathbf{MX}\beta) \right] \\ &\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{Y}^T\mathbf{M}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{M}^T\mathbf{Y} \right] \\ &\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \\ \end{align*} $$

Therefore

$$ \begin{align*} &L_\text{REML}(\theta) = -\frac{n-p}{2}\ln(2 \pi) -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \end{align*} $$

Log likelihood function of $\mathbf{W}$ is independent to $\beta$.

some problem

It’s very common that $\mathbf{M}\mathbf{V}\mathbf{M}^T \text{ be singular}$

$$\Rightarrow \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| \to -\infty$$

REML avoid it.

Generalized least square (GLS)

$$ \begin{align*} &\mathbf{Y} = \mathbf{X}\beta + \epsilon, \quad \text{Var}(\epsilon) = \mathbf{V}, \quad \mathbf{V} = \mathbf{\Sigma} \mathbf{\Sigma}^T \\ &E(\epsilon) = 0 \\ &\mathbf{\Sigma}^{-1}\mathbf{Y} = \mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{\Sigma}^{-1}\epsilon \\ &{\mathbf{Y}} = {\mathbf{X}}\beta + {\epsilon} \\ &\text{Var}({\epsilon}) = \text{Var}(\mathbf{\Sigma}^{-1}\epsilon) \\ &\quad = \mathbf{\Sigma}^{-1} \text{Var}(\epsilon) {\mathbf{\Sigma}^{-1}}^T \\ &\quad = \mathbf{\Sigma}^{-1} \mathbf{\Sigma} \mathbf{\Sigma}^T {\mathbf{\Sigma}^{-1}}^T \\ &\quad = I \end{align*} $$$$ \begin{align*} \hat{\beta} &= ({\mathbf{X}}^T {\mathbf{X}})^{-1} ({\mathbf{X}}^T {\mathbf{Y}}) \\ &= \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^{-1} \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{X})^{-1} \mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y} \end{align*} $$

{EM Algorithm}

For estimating $(\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$, we use EM algorithm as an initial step to determine the direction of the iteration updates

$$ \begin{align*} \sigma^{2(1)}_i = \frac{1}{n} \left[ \sigma^{4(0)}_i \mathbf{Y^T P A_i P Y} + \operatorname{tr} (\sigma^{2(0)}_i \mathbf{I} - \sigma^{4(0)}_i \mathbf{P A_i}) \right] \end{align*} $$

where

$$ \begin{align*} & i = 1, \cdots, 4 \\ & \mathbf{P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}} \end{align*} $$

{Average Information

Algorithm} After one EM iteration, GCTA switches to the average information algorithm

$$ \begin{align*} \bm{\theta}^{(t+1)} = \bm{\theta}^{(t)} + (\mathbf{AI}^{(t)})^{-1} \frac{\partial L}{\partial \bm{\theta}} \Big|_{\bm{\theta}^{(t)}} \end{align*} $$

where $\bm{\theta} = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$

The iteration stop when $L^{(t+1)}-L^{(t)}< 10^{-4}$
In the iteration process, if $\sigma_i^2<0$, set $\sigma_i^2= 10^{-6}\sigma_Y^2$, where $\sigma_Y^2$ is phenotype variance

{REML Method}

$$ \begin{align*} & \mathbf{A I}=\mathbf{1} / \mathbf{2}\left[\begin{array}{cccc} \mathbf{Y}^{\prime} \mathbf{P} A_1 \mathbf{P} A_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P A}_1 \mathbf{P A}_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P} A_1 \mathbf{P P Y} \\ \vdots & \vdots & \vdots & \vdots \\ \mathbf{Y}^{\prime} \mathbf{P A}_r \mathbf{P} A_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P A}_r \mathbf{P A}_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P A _ { r }} \mathbf{P P Y} \\ \mathbf{Y}^{\prime} \mathbf{P P} \mathbf{A}_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P P} A_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P P P Y} \end{array}\right] ;\\ & \partial L / \partial \boldsymbol{\theta}=-1 / 2\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{P} \mathbf{A}_1\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} \\ \vdots \\ \operatorname{tr}\left(\mathbf{P} A_r\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P Y} \\ \operatorname{tr}(\mathbf{P})-\mathbf{Y}^{\prime} \mathbf{P P Y}\end{array}\right] \end{align*} $$

{REML (restricted maximum likelihood)}

As $Y_1, \cdots , Y_n$ is constant mean, $A^T X\beta = 0$. Hence, $L_W(\beta, \sigma_c^2, \sigma_t^2)$ doesn’t depend on $\beta$
Compared to ML, REML is less affected by fixed effects
Compared to ML, REML has lower bias.

{Heritability}

Assume model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$

where $\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A*{cis}} \sigma*{cis}^2 + \mathbf{A*{trans}} \sigma*{trans}^2+ \mathbf{A*{GE}} \sigma*{GE}^2+ \mathbf{I} \sigma\_\varepsilon^2 $

$h^2_{\mathbf{g},cis} = \frac{\sigma^2_{\mathbf{g},cis}}{\mathbf{V}}$
$h^2_{\mathbf{g},trans} = \frac{\sigma^2_{\mathbf{g},trans}}{\mathbf{V}}$
$h^2_{\mathbf{g},GE} = \frac{\sigma^2_{\mathbf{g},GE}}{\mathbf{V}}$

{GCTA}

{Precess}

{GRM(Genetic Relationship Matrix)}

{GRM}

{Example}

{Example}

{Exclude Close Relatives}

{Mixed Linear Model}

{Model 1}

{Model 2}

{Model 3}

{Model 3}

REML

some problem

Generalized least square (GLS)

{EM Algorithm}

{Average Information

{REML Method}

{REML (restricted maximum likelihood)}

{Heritability}

Reference