GCTA

  • gcta上網站,gcta compute heritability, 放QC後的snp,計算GRM,reml 估出variance,算出 heritability。以往是使用GWAS找出跟trait顯著有關的位點,計算這些位點的heritability。但是有些位點跟trait有關,不過effect size很小,很難用GWAS找出。GCTA 好處是,可以估計所有 SNP 綜合起來的 heritability,省略掉使用GWAS尋找顯著相關的snp。假如同一筆資料,GCTA算出的heritability很高,但GWAS找到的位點算出很低,代表可以再多得到樣本,找到更多的資訊

{GCTA}

  • GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait \textcolor{blue}{rather than} testing the association of any particular SNP to the trait.
  • GCTA’s five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation.

% GCTA用來估計一群SNP的var,我們可以把SNP用一個個染色體來區分,或是分成編碼乘蛋白質的genetic snp跟不直接影響蛋白質,但是會影響基因表達的intergenetic snp,而不是估計特定的1,2個snp % 估計不同樣本之間的snp相近程度;估計snp可以解釋的變異

{Precess}

  1. Quality control the SNPs
  2. Compute GRM/relatedness matrix in GCTA
  3. choose a mixed linear model
  4. REML method in GCTA

% 步驟是使用QC過的SNP,計算GRM 親緣矩陣,選個混合模型,使用 algorithm 算出最像是unbiased 的variance,知道variance 之後,就能算Heritability %首先,因為GCTA裡面沒有函數可以做QC,所以會使用QC過的SNP,像是MAF、 Hardy-Weinberg equilibrium

{GRM(Genetic Relationship Matrix)}

For $n$ individuals, $m$ SNPs

  • \( \mathbf{X} \): \( n \times m \) genotype matrix
  • \( p_j \): $j-th$ SNP MAF

$\mathbf{X}^{\text{norm}}$ is standardized genotype matrix by

$$ X_{ij}^{\text{norm}} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}} $$

Define GRM $\mathbf{A}=\frac{1}{m}\mathbf{X}^{\text{norm}} (\mathbf{X}^{\text{norm}})'$ by

$$ A_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(X_{ij} - 2p_j)(X_{ik} - 2p_j)}{2p_j(1 - p_j)} $$

% A_jk就是第j,k個樣本,都有標準化的SNP 向量,然後是s維度的,把向量做內積 % 算出GRM可以知道樣本之間的基因相似程度。接著我們要挑選掉基因相似的樣本

{GRM}

Suppose minor allele is a and MAF is $p$, then
$$ \begin{align*} \left\lbrace \begin{array}{lll} P(SNP = AA) = (1-p)^2 & \text{with} & 0\text{ minor allele} \\ P(SNP = Aa) = 2p(1-p) & \text{with} & 1\text{ minor allele} \\ P(SNP = aa) = p^2 & \text{with} & 2\text{ minor allele} \\ \end{array} \right. \end{align*} $$$$ \begin{align*} E(SNP) &= 0 \times (1-p)^2 +1 \times 2p(1-p) + 2 \times p^2 \\ & = 2p \end{align*} $$$$ \begin{align*} Var(SNP) &= 0^2 \times (1-p)^2 +1^2 \times 2p(1-p) + 2^2 \times p^2 -E^2(SNP) \\ & = 2p(1-p) \end{align*} $$

{Example}

$$ \begin{array}{|c|c|c|c|} \hline \textbf{Individual} & \textbf{SNP1} & \textbf{SNP2} & \textbf{SNP3} \\ \hline A & 0 & 1 & 1 \\ \hline B & 1 & 0 & 1 \\ \hline \text{MAF} (p_j) & 0.25 & 0.25 & 0.5 \\ \hline \end{array} $$

$X^{\text{norm}}_{ij} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1 - p_j)}}$

Standardize $X$

$$ \begin{array}{|c|c|c|c|} \hline individual & SNP1 & SNP2 & SNP3 \\ \hline A & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline B & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline \end{array} $$

{Example}

the standardized genotype matrix is:

$$ X^{\text{norm}} = \begin{bmatrix} -0.816 & 0.816 & 0 \\ 0.816 & -0.816 & 0 \end{bmatrix} $$

Genetic Relationship Matrix is $\frac{1}{3} X^{\text{norm}} (X^{\text{norm}})^T$

$$ \begin{bmatrix} 0.443 & -0.443 \\ -0.443 & 0.443 \end{bmatrix} $$

{Exclude Close Relatives}

  • Including close relatives, this estimate could be a \textcolor{blue}{biased} estimate of total genetic variance

% 很相似的基因好比兄弟姊妹,他們來自相同的生長環境,這環境有可能影響基因表達,假如這資料也拿來使用,variance 就會估不好

{Mixed Linear Model}

$$ \begin{align*} \mathbf{Y_{n \times 1}}= \underbrace{\mathbf{X_{n \times p} \beta_{p \times 1}} }_{\text{fixed term}} + \underbrace{\mathbf{W_{n \times q}u_{q \times 1}} }_{\text{random term}} +\mathbf{\varepsilon_{n \times 1}} \end{align*} $$
  • $\mathbf{u} \sim N(0,\mathbf{I} \sigma_u^2)$, $\mathbf{\varepsilon} \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
  • $\mathbf{W}$ is standardized genotype matrix with $w_{ij} = \frac{x_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}}$, where $x_{ij}$ is $i-th$ individual $j-th$ SNP and $p_j$ is $j-th$ SNP MAF
  • $Var(\mathbf{Y}) = \mathbf{W W'}\sigma_u^2 +\mathbf{I} \sigma_\varepsilon^2$

%X is sex, age, 20PCs, trait?;u is snp effect % Y is phenotype,像是身高、BMI 等的性狀,上次講錯了

{Model 1}

To estimate the variance explained by all autosomal SNPs, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{\varepsilon} \end{align*} $$

The Mixed Linear Model is equivalent to this model with $\mathbf{A*g} = \frac{\mathbf{W W'}}{m},\ \sigma*\mathbf{g}^2 = m\sigma\_\mathbf{u}^2 $

  • $\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
  • $Var(\mathbf{Y}) =\mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{I} \sigma_\varepsilon^2$

%這個模型,把所有的SNP看成同一個影響

{Model 2}

To partition genetic variance onto each of the 22 autosomes, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \sum_{i=1}^{22} \mathbf{g_i} + \mathbf{\varepsilon} \end{align*} $$
  • $\mathbf{g_i} \sim N(0,\mathbf{A_i} \sigma_i^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_i}$ is GRM from the SNPs on $i-th$ chromosome
  • $Var(\mathbf{Y}) = \sum_{i=1}^{22} \mathbf{A_i} \sigma_i^2 + \mathbf{I} \mathbf{\sigma_\varepsilon^2}$

% 把SNP用染色體來區分,分成22個effect

{Model 3}

To estimate the variance of genotype-environment interaction effects, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{ge} + \mathbf{\varepsilon} \end{align*} $$
  • $\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\mathbf{ge} \sim N(0,\mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
  • $Var(\mathbf{Y}) = \mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2 + \mathbf{I} \sigma_\varepsilon^2$
  • $$ \mathbf{A_{ge}}= \left\lbrace \begin{array}{lll} \mathbf{A_{g}} & \text{if} & \text{pairs of individuals in the same environment} \\ \mathbf{0} & \text{if} & \text{pairs of individuals in different environment}\end{array} \right. $$

% 假如想知道環境對基因表達的影響,也可以多一個SNp跟環境的交互作用向,這裡用ge表示

{Model 3}

$Cov(y_i,y_k) = \left\lbrace \begin{array}{lll} {A_{ik}}(\sigma_{\mathbf{ge}}^2+\sigma_{\mathbf{g}}^2) & \text{if} & \text{same environment} \\ {A_{ik}} \sigma_{\mathbf{g}}^2 & \text{if} & \text{different environment} \end{array} \right.$

## {Build Model} For model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$
  • $\mathbf{g_{cis}} \sim N(0,\mathbf{A_{cis}} \sigma_{cis}^2)$, $\mathbf{g_{trans}} \sim N(0,\mathbf{A_{trans}} \sigma_{trans}^2)$, $\mathbf{g_{GE}} \sim N(0,\mathbf{A_{GE}} \sigma_{GE}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
  • $\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A*{cis}} \sigma*{cis}^2 + \mathbf{A*{trans}} \sigma*{trans}^2+ \mathbf{A*{GE}} \sigma*{GE}^2+ \mathbf{I} \sigma\_\varepsilon^2 $
  • $\theta = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_{\varepsilon}^2)$

Log likelihood function

$$ \begin{align*} & L_Y(\beta, \theta) \\ = & -\frac{n}{2} \ln(2\pi) -\frac{1}{2} \ln|\mathbf{V}| -\frac{1}{2} \mathbf{(Y - X\beta)^T V^{-1} (Y - X\beta)} \\ \propto & -\frac{1}{2} \left[ \ln |\mathbf{V}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right] \end{align*} $$

REML

Log likelihood function independent to $\beta$ is the target of REML (restricted maximum likelihood). Comparing to ML, variance estimator is unbiased in REML. Let $\mathbf{M}\ s.t. \mathbf{M}\mathbf{X} = 0$

$$ \begin{align*} &\text{Let } \mathbf{W} = \mathbf{M}\mathbf{Y} \\ &E(\mathbf{W}) = \mathbf{M}\mathbf{X}\beta = 0 \\ &\text{Var}(\mathbf{W}) = \mathbf{M}\mathbf{V}\mathbf{M}^T \end{align*} $$

Transfer $\mathbf{X}$ to $\mathbf{M}\mathbf{X}$,$\mathbf{Y}$ to $\mathbf{M}\mathbf{Y}$. Log likelihood function of $\mathbf{W}$

$$ \begin{align*} L_\text{REML}(\theta) & \propto -\frac{1}{2} \left[ \ln |\mathbf{V}| + \ln |\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right] \\ &\propto -\frac{1}{2} \left[ \ln | \mathbf{M}\mathbf{V}\mathbf{M}^T | + \ln |\mathbf{(MX)}^T(\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1}\mathbf{MX}| + (\mathbf{MY} - \mathbf{MX}\beta)^T \mathbf{V}^{-1} (\mathbf{MY} - \mathbf{MX}\beta) \right] \\ &\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{Y}^T\mathbf{M}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{M}^T\mathbf{Y} \right] \\ &\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \\ \end{align*} $$

Therefore

$$ \begin{align*} &L_\text{REML}(\theta) = -\frac{n-p}{2}\ln(2 \pi) -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \end{align*} $$

Log likelihood function of $\mathbf{W}$ is independent to $\beta$.

some problem

It’s very common that $\mathbf{M}\mathbf{V}\mathbf{M}^T \text{ be singular}$

$$\Rightarrow \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| \to -\infty$$

REML avoid it.

Generalized least square (GLS)

$$ \begin{align*} &\mathbf{Y} = \mathbf{X}\beta + \epsilon, \quad \text{Var}(\epsilon) = \mathbf{V}, \quad \mathbf{V} = \mathbf{\Sigma} \mathbf{\Sigma}^T \\ &E(\epsilon) = 0 \\ &\mathbf{\Sigma}^{-1}\mathbf{Y} = \mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{\Sigma}^{-1}\epsilon \\ &{\mathbf{Y}} = {\mathbf{X}}\beta + {\epsilon} \\ &\text{Var}({\epsilon}) = \text{Var}(\mathbf{\Sigma}^{-1}\epsilon) \\ &\quad = \mathbf{\Sigma}^{-1} \text{Var}(\epsilon) {\mathbf{\Sigma}^{-1}}^T \\ &\quad = \mathbf{\Sigma}^{-1} \mathbf{\Sigma} \mathbf{\Sigma}^T {\mathbf{\Sigma}^{-1}}^T \\ &\quad = I \end{align*} $$$$ \begin{align*} \hat{\beta} &= ({\mathbf{X}}^T {\mathbf{X}})^{-1} ({\mathbf{X}}^T {\mathbf{Y}}) \\ &= \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^{-1} \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{X})^{-1} \mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y} \end{align*} $$

{EM Algorithm}

For estimating $(\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$, we use EM algorithm as an initial step to determine the direction of the iteration updates

$$ \begin{align*} \sigma^{2(1)}_i = \frac{1}{n} \left[ \sigma^{4(0)}_i \mathbf{Y^T P A_i P Y} + \operatorname{tr} (\sigma^{2(0)}_i \mathbf{I} - \sigma^{4(0)}_i \mathbf{P A_i}) \right] \end{align*} $$

where

$$ \begin{align*} & i = 1, \cdots, 4 \\ & \mathbf{P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}} \end{align*} $$

{Average Information

Algorithm} After one EM iteration, GCTA switches to the average information algorithm

$$ \begin{align*} \bm{\theta}^{(t+1)} = \bm{\theta}^{(t)} + (\mathbf{AI}^{(t)})^{-1} \frac{\partial L}{\partial \bm{\theta}} \Big|_{\bm{\theta}^{(t)}} \end{align*} $$

where $\bm{\theta} = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$

  • The iteration stop when $L^{(t+1)}-L^{(t)}< 10^{-4}$
  • In the iteration process, if $\sigma_i^2<0$, set $\sigma_i^2= 10^{-6}\sigma_Y^2$, where $\sigma_Y^2$ is phenotype variance

{REML Method}

$$ \begin{align*} & \mathbf{A I}=\mathbf{1} / \mathbf{2}\left[\begin{array}{cccc} \mathbf{Y}^{\prime} \mathbf{P} A_1 \mathbf{P} A_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P A}_1 \mathbf{P A}_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P} A_1 \mathbf{P P Y} \\ \vdots & \vdots & \vdots & \vdots \\ \mathbf{Y}^{\prime} \mathbf{P A}_r \mathbf{P} A_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P A}_r \mathbf{P A}_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P A _ { r }} \mathbf{P P Y} \\ \mathbf{Y}^{\prime} \mathbf{P P} \mathbf{A}_1 \mathbf{P Y} & \cdots & \mathbf{Y}^{\prime} \mathbf{P P} A_r \mathbf{P Y} & \mathbf{Y}^{\prime} \mathbf{P P P Y} \end{array}\right] ;\\ & \partial L / \partial \boldsymbol{\theta}=-1 / 2\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{P} \mathbf{A}_1\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} \\ \vdots \\ \operatorname{tr}\left(\mathbf{P} A_r\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P Y} \\ \operatorname{tr}(\mathbf{P})-\mathbf{Y}^{\prime} \mathbf{P P Y}\end{array}\right] \end{align*} $$

{REML (restricted maximum likelihood)}

  • As $Y_1, \cdots , Y_n$ is constant mean, $A^T X\beta = 0$. Hence, $L_W(\beta, \sigma_c^2, \sigma_t^2)$ doesn’t depend on $\beta$
  • Compared to ML, REML is less affected by fixed effects
  • Compared to ML, REML has lower bias.

{Heritability}

Assume model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$

where $\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A*{cis}} \sigma*{cis}^2 + \mathbf{A*{trans}} \sigma*{trans}^2+ \mathbf{A*{GE}} \sigma*{GE}^2+ \mathbf{I} \sigma\_\varepsilon^2 $

  • $h^2_{\mathbf{g},cis} = \frac{\sigma^2_{\mathbf{g},cis}}{\mathbf{V}}$
  • $h^2_{\mathbf{g},trans} = \frac{\sigma^2_{\mathbf{g},trans}}{\mathbf{V}}$
  • $h^2_{\mathbf{g},GE} = \frac{\sigma^2_{\mathbf{g},GE}}{\mathbf{V}}$

Reference

Licensed under CC BY-NC-SA 4.0
comments powered by Disqus
使用 Hugo 建立
主題 StackJimmy 設計