Meta Analysis

Introduction

Material and Method

flowchart

Data

We used data from the 1000 Genomes Project to run a GWAS for height. The analysis covered chromosomes 1 through 22, with a total of 36,820,992 variants across 1,092 individuals.

Genotype QC

  1. Exclude SNPs or individuals with a missing rate $> 0.1$: 36,820,992 variants and 1,092 individuals pass the filter
  2. Exclude SNPs with MAF $\leq 0.05$: 6,797,981 variants and 1,092 individuals pass the filter
  3. Exclude SNPs with a Hardy-Weinberg equilibrium (HWE) test p-value $< 0.0001$: 4,941,621 variants and 1,092 individuals pass the filter
  4. Prune SNPs in LD ($r^2 > 0.2$ within a 500 bp window) to obtain the SNP set used for PCA: 299,901 variants and 1,092 individuals pass the filter
  5. Flip the sign of the effect estimate ($\hat{\beta}$ to $-\hat{\beta}$)
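The first two filters above can be sketched in a few lines of Python. This is only an illustration on toy data, not the pipeline used for the study: it assumes genotypes are coded as 0/1/2 alternate-allele counts with `None` for a missing call, and the SNP names and helper functions (`missing_rate`, `minor_allele_frequency`, `passes_qc`) are hypothetical.

```python
def missing_rate(genotypes):
    """Fraction of missing calls (None) for one SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def passes_qc(genotypes, max_missing=0.1, min_maf=0.05):
    """Steps 1 and 2: drop a SNP if missing rate > 0.1 or MAF <= 0.05."""
    if missing_rate(genotypes) > max_missing:
        return False
    return minor_allele_frequency(genotypes) > min_maf


# Toy data: one genotype list per SNP (hypothetical identifiers).
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],
    "snp_rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "snp_missing": [None, None, 1, 0, 1, 2, 0, 1, 1, 0],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
print(kept)
```

In practice these filters, together with the HWE test and LD pruning, are usually run with a dedicated genetics tool rather than hand-written code.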

Fixed Effect

Random Effect

Result

SNP Finding

LD

Manhattan Plot

Code

Introduction

$$(\hat{\beta}_i, \sigma_i), \ i=1,\ldots,N$$

where

  • $\sigma_i$ is standard error of $\hat{\beta}_i$

$\widetilde{\beta} = \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}}$ is the inverse-variance weighted estimate: each study is weighted by $\sigma_i^{-2}$, the most common choice of weight. If the $\hat{\beta}_i \sim {N}(\beta, \sigma_i^2)$ are independent, then

$$\begin{align*} Var(\widetilde{\beta}) &= {Var}\left( \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}} \right) \\ &= \frac{\sum_{i=1}^N {Var}\left( \hat{\beta}_i \sigma_i^{-2} \right)}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-4} \sigma_i^2}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-2}}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{1}{\sum_{i=1}^N \sigma_i^{-2}} \end{align*}$$
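The derivation above maps directly to code. Below is a minimal sketch, assuming the per-study estimates and standard errors are given as plain lists; the function name `inverse_variance_weighted` and the numbers are illustrative only.

```python
import math


def inverse_variance_weighted(betas, ses):
    """Return the pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in ses]  # sigma_i^{-2}
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)  # Var = 1 / sum(sigma_i^{-2})
    return pooled, pooled_se


# Three hypothetical studies: the weights are 400, 100 and 400.
betas = [0.10, 0.20, 0.15]
ses = [0.05, 0.10, 0.05]
pooled, pooled_se = inverse_variance_weighted(betas, ses)
print(round(pooled, 4), round(pooled_se, 4))
```

The study with the largest standard error contributes the least to the pooled estimate, and the pooled standard error is smaller than any single study's.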

GWAS Model

$$y = \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k + \epsilon$$

Under logistic regression,

$$y \sim \text{Ber}(p), \ y \in \{0, 1\}$$

$$\begin{align*} \text{logit}(p) &= \log\left(\frac{p}{1-p}\right) \\ &= \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k \end{align*}$$

With the covariates held fixed, $\beta$ is the log odds ratio:

$$\begin{align*} \beta &= \log\left( \frac{p(x=1)}{1-p(x=1)} \right) - \log\left( \frac{p(x=0)}{1-p(x=0)} \right) \\ &= \log\left( \frac{p(x=1)}{1-p(x=1)} \middle/ \frac{p(x=0)}{1-p(x=0)} \right) \end{align*}$$

Chi-square Test for Heterogeneity in Effect

$$\begin{align*} Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1} \end{align*}$$

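This tests whether any $\beta_i$ differs significantly from the others. Under $H_0: \beta_1 = \beta_2 = \cdots = \beta_N$, $Q$ follows a $\chi^2_{N-1}$ distribution.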

$$\begin{align*} I^2 &= 100\% \cdot \frac{Q - \text{df}}{Q}\\ \end{align*}$$
  • $I^2 = 0-25\%$: Low heterogeneity; the effects are consistent with $\beta_1 = \beta_2 = \cdots = \beta_N$, and $H_0$ is typically not rejected
  • $I^2 = 25-50\%$: Moderate heterogeneity
  • $I^2 = 50-75\%$: Substantial heterogeneity
  • $I^2 > 75\%$: Considerable heterogeneity; $H_0$ is typically rejected

where

  • $Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$
  • df = $N-1$
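As a rough sketch of the two quantities defined above, the snippet below computes $Q$ and $I^2$ from per-study estimates and standard errors. The function names and the toy numbers are illustrative. Setting $I^2$ to 0 when $Q <$ df is a common convention that the formula above does not spell out, so treat it as an assumption.

```python
def cochran_q(betas, ses):
    """Q = sum(((beta_i - pooled) / sigma_i)^2), pooled = IVW estimate."""
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return sum(((b - pooled) / se) ** 2 for b, se in zip(betas, ses))


def i_squared(q, n_studies):
    """I^2 in percent, truncated at 0 (assumed convention for Q < df)."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


# Three hypothetical studies with equal standard errors.
betas = [0.10, 0.30, 0.20]
ses = [0.05, 0.05, 0.05]
q = cochran_q(betas, ses)
print(round(q, 4), round(i_squared(q, len(betas)), 2))
```

Here the pooled estimate is 0.20, so the two outer studies each contribute $(0.10 / 0.05)^2 = 4$ to $Q$, giving $Q = 8$ and $I^2 = 75\%$.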

Cochran’s Q test

Cochran’s Q test may be underpowered when few studies are included or when event rates are low. It is therefore often recommended to use a significance threshold higher than 0.05 (commonly 0.10) when using Cochran’s Q test to assess statistical heterogeneity.

$$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$$
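To turn $Q$ into a p-value without external libraries, one option is the upper tail of the $\chi^2_{N-1}$ distribution computed through the incomplete-gamma recurrence. The sketch below is written under that assumption. `chi2_sf` and `cochran_q_pvalue` are hypothetical helper names, and the approach is only meant for modest integer degrees of freedom (`math.gamma` overflows for very large df).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form base cases: df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        sf, a = math.erfc(math.sqrt(half)), 0.5
    else:
        sf, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^{-h} / Gamma(a + 1),
    # stepped until a reaches df / 2.
    while 2 * a < df:
        sf += half**a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return sf


def cochran_q_pvalue(q, n_studies):
    """p-value of Cochran's Q under H0, using df = N - 1."""
    return chi2_sf(q, n_studies - 1)


# Hypothetical example: Q = 8 from N = 3 studies (df = 2).
p = cochran_q_pvalue(8.0, 3)
print(p, p < 0.10)
```

With the more lenient 0.10 threshold discussed above, this example would be flagged as heterogeneous.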

Heterogeneity in Effect

The genetic effect on a trait can vary across individuals or populations, even when the trait looks the same. This heterogeneity may arise from:

  • Differences in LD structure
  • Interactions with environmental or other genetic exposures at different frequencies

Fixed Effect Meta-Analysis

$$\begin{align*} (\hat{\beta}_i, \sigma^2_{i}),\quad i = 1, \ldots, N,\quad N \text{ studies} \end{align*}$$

where

  • $\hat{\beta_i}$ is effect size
  • $\sigma_i^2$ is variance
$$\begin{align*} \hat{\beta}_i \sim N(\beta, \sigma^2_{i}) \\ \tilde{\beta} = \frac{ \sum_{i=1}^N \hat{\beta}_i \sigma^{-2}_{i} }{ \sum_{i=1}^N \sigma^{-2}_{i} } \end{align*}$$

Random Effect Meta-Analysis

$$\begin{align*} &\hat{\beta}_i \sim {N}(\beta_i, \sigma_i^2) , \quad \beta_i \sim {N}(\mu, \tau^2) \end{align*}$$

where

  • $\sigma^2_i$ is the sampling variance within study $i$
  • $\tau^2$ is the variance between studies
$$\begin{align*} \text{Var}(\hat{\beta}_i) &= E(\text{Var}(\hat{\beta}_i | \beta_i)) + \text{Var}(E(\hat{\beta}_i | \beta_i)) \\ &= E(\sigma_i^2) + \text{Var}(\beta_i) \\ &= \sigma_i^2 + \tau^2 \end{align*}$$

$$\begin{align*} \Rightarrow \hat{\beta}_i \sim {N}(\mu, \sigma_i^2 + {\tau}^2) \end{align*}$$
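Since each $\hat{\beta}_i$ has total variance $\sigma_i^2 + \tau^2$, a random-effects pooled estimate can weight each study by the inverse of that sum. The sketch below assumes $\tau^2$ is estimated with the DerSimonian-Laird moment estimator. That estimator is not specified in these notes and is used here only as one common choice; the function names and data are illustrative.

```python
import math


def dersimonian_laird_tau2(betas, ses):
    """Moment estimate of tau^2 (assumed estimator, truncated at 0)."""
    w = [1.0 / se**2 for se in ses]
    sw = sum(w)
    pooled = sum(wi * b for wi, b in zip(w, betas)) / sw
    q = sum(wi * (b - pooled) ** 2 for wi, b in zip(w, betas))
    c = sw - sum(wi**2 for wi in w) / sw
    return max(0.0, (q - (len(betas) - 1)) / c)


def random_effects(betas, ses, tau2):
    """Pooled mean mu and its standard error with weights 1/(sigma_i^2 + tau^2)."""
    w = [1.0 / (se**2 + tau2) for se in ses]
    sw = sum(w)
    mu = sum(wi * b for wi, b in zip(w, betas)) / sw
    return mu, math.sqrt(1.0 / sw)


# Three hypothetical studies with equal standard errors.
betas = [0.10, 0.30, 0.20]
ses = [0.05, 0.05, 0.05]
tau2 = dersimonian_laird_tau2(betas, ses)
mu, mu_se = random_effects(betas, ses, tau2)
print(round(tau2, 4), round(mu, 4), round(mu_se, 4))
```

When the estimated $\tau^2$ is 0 the weights reduce to $\sigma_i^{-2}$ and the result matches the fixed-effect estimate. When it is positive, as in this example, the pooled standard error is wider than the fixed-effect one.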

Probability Distributions

$$\begin{align*} P(\hat{\beta}_i | \mu, \tau) &= \int P(\hat{\beta}_i, \beta_i | \mu, \tau) \, d\beta_i \\ &= \int P(\hat{\beta}_i | \beta_i) P(\beta_i | \mu, \tau) \, d\beta_i \\ &\propto \int \exp\left\{ -\frac{1}{2\sigma_i^2} (\hat{\beta}_i - \beta_i)^2 - \frac{1}{2\tau^2} (\beta_i - \mu)^2 \right\} \, d\beta_i \end{align*}$$

The joint distribution $P(\hat{\beta}_i, \beta_i | \mu, \tau)$ is bivariate normal, so the marginal distribution of $\hat{\beta}_i$ is also normal.
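A quick simulation can serve as a sanity check on this marginal distribution. The sketch below draws $\beta_i$ from $N(\mu, \tau^2)$ and then $\hat{\beta}_i$ from $N(\beta_i, \sigma^2)$. The parameter values are arbitrary assumptions, and a single common $\sigma$ is used for simplicity.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

mu, tau, sigma = 0.2, 0.4, 0.3  # arbitrary illustrative values
draws = []
for _ in range(100_000):
    beta_i = random.gauss(mu, tau)  # true study effect
    draws.append(random.gauss(beta_i, sigma))  # observed estimate

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(sample_mean, sample_var, sigma**2 + tau**2)
```

The sample mean should land close to $\mu$ and the sample variance close to $\sigma^2 + \tau^2 = 0.25$, up to simulation noise.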

Genomic Control

$$\begin{align*} \lambda_{\text{GC}} &= \frac{\text{median}(\chi^2_{\text{observed}})}{\text{median}(\chi^2_{\text{expected}})} = \frac{\text{median}(\chi^2_{\text{observed}})}{0.455} \\ &\lambda_{\text{GC}} \begin{cases} \approx 1: & \text{well-calibrated} \\ > 1: & \text{inflated} \\ < 1: & \text{conservative test} \end{cases} \end{align*}$$

where $\text{median}(\chi^2_{\text{expected}}) \approx 0.455$ is the median of the $\chi^2$ distribution with 1 degree of freedom under the null, and the adjusted statistic is $\chi^2_{\text{adjusted}} = \frac{\chi^2_{\text{observed}}}{\lambda_{\text{GC}}}$

Allelic Chi-square Test

Consider a SNP with two alleles, A and a, with the following counts in cases and controls:

$$ \begin{array}{c|cc} & A & a \\ \hline \text{Case} & O_{1A} & O_{1a} \\ \text{Control} & O_{0A} & O_{0a} \\ \end{array} $$

where $N = O_{1A}+O_{1a}+O_{0A}+O_{0a}$

$$ E_{1A} = \frac{(O_{1A}+O_{0A})(O_{1A}+O_{1a})}{N} $$

The remaining expected counts are computed in the same way:

$$ E_{1a}, E_{0A}, E_{0a} $$

Next, compute the chi-square statistic for each SNP:

$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} $$
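The allelic test above amounts to a standard chi-square test on a 2x2 table of allele counts. A minimal sketch follows, with a hypothetical function name and made-up counts. It assumes every row and column total is positive, since otherwise an expected count would be zero.

```python
def allelic_chi_square(case_A, case_a, control_A, control_a):
    """Chi-square statistic (df = 1) for a 2x2 allele-count table."""
    observed = [[case_A, case_a], [control_A, control_a]]
    n = case_A + case_a + control_A + control_a
    row_totals = [sum(row) for row in observed]
    col_totals = [case_A + control_A, case_a + control_a]
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            # E = (row total * column total) / N
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (observed[r][c] - expected) ** 2 / expected
    return chi2


# Made-up counts: allele A is more common in cases than in controls.
chi2 = allelic_chi_square(60, 40, 40, 60)
print(chi2)
```

In this example every expected count is 50 and every observed count differs from it by 10, so each of the four cells contributes $100 / 50 = 2$.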

Genomic control does not recompute $\chi^2$. Instead, it assumes the observed $\chi^2$ values are inflated and corrects them by dividing by $\lambda_{\text{GC}}$:

$$ \chi^2_{\text{GC}} = \frac{\chi^2}{\lambda_{\text{GC}}} $$

How is λGC computed?

Divide the median of the $\chi^2$ statistics across all SNPs by the theoretical median (for df = 1).

$$ \text{median}(\chi^2_{df=1}) = 0.455 $$

Correct the statistic computed for each SNP:

$$ \chi^2_{\text{GC}, i} = \frac{\chi^2_i}{\lambda_{\text{GC}}}, \quad \lambda_{\text{GC}} = \frac{\text{median}(\chi^2_{\text{all SNPs}})}{0.455} $$

Then convert to a p-value:

$$ p_i^{\text{GC}} = 1 - F_{\chi^2_{df=1}}(\chi^2_{\text{GC}, i}) $$
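The whole genomic control procedure fits in a few lines. The sketch below assumes the input is a list of per-SNP chi-square statistics with 1 degree of freedom. It uses the identity $1 - F_{\chi^2_{df=1}}(x) = \text{erfc}(\sqrt{x/2})$ for the p-value; the function name and the three toy statistics are illustrative.

```python
import math
import statistics

CHI2_DF1_MEDIAN = 0.455  # theoretical median of chi-square with df = 1


def genomic_control(chi2_values):
    """Return lambda_GC, the adjusted statistics, and their p-values."""
    lam = statistics.median(chi2_values) / CHI2_DF1_MEDIAN
    adjusted = [x / lam for x in chi2_values]
    # Upper tail of chi-square with df = 1: erfc(sqrt(x / 2)).
    pvalues = [math.erfc(math.sqrt(x / 2.0)) for x in adjusted]
    return lam, adjusted, pvalues


# Toy statistics whose median (0.91) is twice the theoretical median.
lam, adjusted, pvalues = genomic_control([0.1, 0.91, 5.0])
print(round(lam, 3), [round(x, 3) for x in adjusted])
```

Because the median statistic is divided back down to 0.455, its adjusted p-value sits near 0.5, which is what a well-calibrated test should give at the median.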

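Genomic control is not well suited to polygenic traits, since true polygenic signal itself raises the median $\chi^2$ and the correction then becomes too conservative.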

Next Steps

Following the GWAS, several post-GWAS analyses can be conducted, including fine-mapping, functional annotation, and the calculation of polygenic risk scores (PRS). In addition, the GWAS Catalog provides a large repository of existing GWAS summary statistics, which we can use to validate the significant SNPs identified in our study.

Reference

1000G
Prof. lhchien Course

Licensed under CC BY-NC-SA 4.0