Introduction
Material and Method
flowchart
Data
We used data from the 1000 Genomes Project to perform a GWAS of height. The analysis covered chromosomes 1 through 22, with a total of 36,820,992 variants across 1,092 individuals.
Genotype QC
- Exclude SNPs or individuals with a missing rate $> 0.1$: 36,820,992 variants and 1,092 individuals pass the filter
- Exclude SNPs with MAF $\leq 0.05$: 6,797,981 variants and 1,092 individuals pass the filter
- Exclude SNPs with a HWE test p-value $< 0.0001$: 4,941,621 variants and 1,092 individuals pass the filter
- LD pruning for PCA, excluding SNPs with pairwise $r^2 > 0.2$ within a 500 bp window: 299,901 variants and 1,092 individuals pass the filter
- flip beta to -beta
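The per-SNP filters above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `snp_qc`, the genotype coding (0/1/2 allele counts, `None` for a missing call), and the use of a 1-df chi-square goodness-of-fit test for HWE are all choices made here, not something the source specifies. A dedicated toolkit would normally do this step, and it may use an exact HWE test instead. Individual-level missingness and LD pruning are not covered.

```python
import math

def snp_qc(genotypes, max_missing=0.1, min_maf=0.05, min_hwe_p=1e-4):
    """Apply the missingness, MAF, and HWE filters to one SNP (hypothetical helper).

    genotypes: allele counts (0, 1, or 2) per individual; None marks a missing call.
    Returns (passes, stats).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return False, {"missing_rate": missing_rate, "maf": None, "hwe_p": None}

    n = len(called)
    freq = sum(called) / (2 * n)      # frequency of the counted allele
    maf = min(freq, 1 - freq)         # minor allele frequency

    # HWE check: chi-square goodness of fit (1 df) of genotype counts
    # against the Hardy-Weinberg proportions q^2, 2pq, p^2.
    observed = [called.count(k) for k in (0, 1, 2)]
    expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))   # P(chi2_1 > chi2)

    # Keep the SNP only if it survives all three exclusion rules.
    passes = missing_rate <= max_missing and maf > min_maf and hwe_p >= min_hwe_p
    return passes, {"missing_rate": missing_rate, "maf": maf, "hwe_p": hwe_p}


if __name__ == "__main__":
    ok, stats = snp_qc([0] * 50 + [1] * 40 + [2] * 10)
    print(ok, stats)
```

The thresholds mirror the list: a SNP is dropped when its missing rate exceeds 0.1, its MAF is at most 0.05, or its HWE p-value is below 0.0001.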
Fixed Effect
Random Effect
Result
SNP Finding
LD
Manhattan Plot
Code
Introduction
$$(\hat{\beta}_i, \sigma_i), \ i=1,\ldots,N$$
where
- $\hat{\beta}_i$ is the effect size estimate from study $i$
- $\sigma_i$ is the standard error of $\hat{\beta}_i$
$\widetilde{\beta} = \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}}$ is the inverse-variance weighted estimate: each $\hat{\beta}_i$ is weighted by $\sigma_i^{-2}$, a commonly used weight. If the $\hat{\beta}_i \sim N(\beta, \sigma_i^2)$ are independent, then
$$\begin{align*} \operatorname{Var}(\widetilde{\beta}) &= \operatorname{Var}\left( \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}} \right) \\ &= \frac{\sum_{i=1}^N \operatorname{Var}\left( \hat{\beta}_i \sigma_i^{-2} \right)}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-4} \sigma_i^2}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-2}}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{1}{\sum_{i=1}^N \sigma_i^{-2}} \end{align*}$$
GWAS Model
$$y = \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k + \epsilon$$
where $x$ is the genotype and $z_1, \ldots, z_k$ are covariates. Under logistic regression,
$$y \sim \text{Ber}(p), \ y \in \{0, 1\}$$
$$\begin{align*} \text{logit}(p) &= \log\left(\frac{p}{1-p}\right) \\ &= \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k \end{align*}$$
There is no error term on the logit scale, because the randomness enters through the Bernoulli distribution of $y$. With the covariates held fixed, $\beta$ is the log odds ratio for a one-unit increase in $x$:
$$\begin{align*} \beta &= \log\left( \frac{p(x=1)}{1-p(x=1)} \right) - \log\left( \frac{p(x=0)}{1-p(x=0)} \right) \\ &= \log\left( \frac{p(x=1)}{1-p(x=1)} \middle/ \frac{p(x=0)}{1-p(x=0)} \right) \end{align*}$$
Chi-square Test for Heterogeneity in Effect
$$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$$
This tests $H_0: \beta_1 = \beta_2 = \cdots = \beta_N$, i.e. whether any of the $\beta_i$ differ significantly from the others.
$$I^2 = 100\% \cdot \frac{Q - \text{df}}{Q}$$
where
- $Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$
- $\text{df} = N-1$
Interpretation:
- $I^2$ from 0% to 25%: low heterogeneity, consistent with $H_0: \beta_1 = \beta_2 = \cdots = \beta_N$ ($H_0$ is not rejected)
- $I^2$ from 25% to 50%: moderate heterogeneity
- $I^2$ from 50% to 75%: substantial heterogeneity
- $I^2 > 75\%$: considerable heterogeneity ($H_0$ is rejected)
Cochran’s Q test
Cochran's Q test may be underpowered when few studies are included or when event rates are low. It is therefore often recommended to use a significance threshold higher than 0.05 when using Cochran's Q test to assess statistical heterogeneity.
$$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$$
- With a large sample, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)<0.05$
- With a small sample, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)<0.1$
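The Q statistic, its p-value, and $I^2$ can be computed together. The sketch below is a minimal illustration, not a reference implementation: the function names `chi2_sf` and `heterogeneity` are made up here, the chi-square tail probability uses the standard closed-form series for integer degrees of freedom so that only the standard library is needed, and negative $I^2$ values are truncated to 0, which is the usual convention but is not stated in the formula above.

```python
import math

def chi2_sf(x, df):
    """P(chi2_df > x) for integer df >= 1, via closed-form finite series."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2))
    #   + sqrt(2x/pi) * exp(-x/2) * sum_{j=0}^{(df-3)/2} x^j / (1*3*...*(2j+1))
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail

def heterogeneity(betas, ses):
    """Return (Q, df, p_value, I2 in percent) for per-study estimates and SEs."""
    weights = [1 / s ** 2 for s in ses]
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(((b - beta_fixed) / s) ** 2 for b, s in zip(betas, ses))
    df = len(betas) - 1
    p_value = chi2_sf(q, df)
    # Assumption: negative I^2 is reported as 0.
    i2 = max(0.0, 100 * (q - df) / q) if q > 0 else 0.0
    return q, df, p_value, i2


if __name__ == "__main__":
    print(heterogeneity([0.0, 2.0], [0.5, 0.5]))
```

The returned p-value is then compared with 0.05 or 0.1 according to the rule above.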
Heterogeneity in Effect
The genetic effect on a trait can vary across individuals or populations, even when the trait itself looks the same. Such heterogeneity may arise from
- Differences in LD structure
- Interactions with environmental or other genetic exposures at different frequencies
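Fixed Effect Meta-Analysis
$$\begin{align*} (\hat{\beta}_i, \sigma^2_{i}),\quad i = 1, \ldots, N,\quad N \text{ studies} \end{align*}$$
where
- $\hat{\beta}_i$ is the effect size estimate from study $i$
- $\sigma_i^2$ is the variance of $\hat{\beta}_i$
All studies are assumed to estimate one common effect $\beta$.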
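As a small worked illustration, the inverse-variance weighted estimate $\widetilde{\beta}$ and its variance $1/\sum_i \sigma_i^{-2}$ derived earlier can be computed as follows. The function name `fixed_effect_meta` and the two-sided normal p-value are additions made for this sketch, assuming $\widetilde{\beta}$ is approximately normal.

```python
import math

def fixed_effect_meta(betas, variances):
    """Inverse-variance weighted fixed-effect estimate (hypothetical helper).

    betas: per-study effect estimates; variances: their sampling variances.
    Returns (beta, se, z, p) with a two-sided normal p-value.
    """
    weights = [1 / v for v in variances]          # w_i = sigma_i^-2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1 / total)                     # Var = 1 / sum(sigma_i^-2)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))          # two-sided P(|Z| > |z|)
    return beta, se, z, p


if __name__ == "__main__":
    print(fixed_effect_meta([0.2, 0.4], [0.01, 0.01]))
```

With equal variances the estimate reduces to the plain average of the study estimates, and the standard error shrinks by a factor of $\sqrt{N}$.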
Random Effect Meta-Analysis
$$\hat{\beta}_i \sim N(\beta_i, \sigma_i^2), \quad \beta_i \sim N(\mu, \tau^2)$$
where
- $\sigma^2_i$ is the within-study sampling variance (the sampling error within a study)
- $\tau^2$ is the between-study variance (the differences between studies)
Probability Distributions
$$\begin{align*} P(\hat{\beta}_i \mid \mu, \tau) &= \int P(\hat{\beta}_i, \beta_i \mid \mu, \tau) \, d\beta_i \\ &= \int P(\hat{\beta}_i \mid \beta_i) P(\beta_i \mid \mu, \tau) \, d\beta_i \\ &\propto \int \exp\left\{ -\frac{1}{2\sigma_i^2} (\hat{\beta}_i - \beta_i)^2 - \frac{1}{2\tau^2} (\beta_i - \mu)^2 \right\} \, d\beta_i \end{align*}$$
The joint distribution $P(\hat{\beta}_i, \beta_i)$ is bivariate normal, so the marginal distribution of $\hat{\beta}_i$ is also normal: $\hat{\beta}_i \mid \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2)$.
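Because each $\hat{\beta}_i$ is marginally normal with mean $\mu$ and variance $\sigma_i^2 + \tau^2$, an estimate of $\mu$ can reuse inverse-variance weighting with the weights $1/(\sigma_i^2 + \tau^2)$. The notes do not say how $\tau^2$ is estimated. The sketch below assumes the DerSimonian-Laird moment estimator, one common choice, and the function name `random_effects_meta` is hypothetical.

```python
import math

def random_effects_meta(betas, variances):
    """Random-effects estimate of mu, assuming the DerSimonian-Laird tau^2.

    Returns (mu, se, tau2).
    """
    w = [1 / v for v in variances]
    sw = sum(w)
    beta_fixed = sum(wi * b for wi, b in zip(w, betas)) / sw
    q = sum(wi * (b - beta_fixed) ** 2 for wi, b in zip(w, betas))  # Cochran's Q
    df = len(betas) - 1

    # DerSimonian-Laird: tau^2 = max(0, (Q - df) / (sum w - sum w^2 / sum w))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Marginal variance of each estimate is sigma_i^2 + tau^2.
    w_star = [1 / (v + tau2) for v in variances]
    mu = sum(wi * b for wi, b in zip(w_star, betas)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return mu, se, tau2


if __name__ == "__main__":
    print(random_effects_meta([0.0, 2.0], [0.25, 0.25]))
```

When the studies agree closely, $\tau^2$ is estimated as 0 and the result coincides with the fixed-effect estimate.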
Genomic Control
$$\lambda_{\text{GC}} = \frac{\text{median}(\chi^2_{\text{observed}})}{\text{median}(\chi^2_{\text{expected}})} = \frac{\text{median}(\chi^2_{\text{observed}})}{0.455}$$
$$\lambda_{\text{GC}} \begin{cases} \approx 1: & \text{well-calibrated test} \\ > 1: & \text{inflated test} \\ < 1: & \text{conservative test} \end{cases}$$
where 0.455 is the median of the $\chi^2$ distribution with 1 degree of freedom, expected under the null hypothesis, and the adjusted statistic is $\chi^2_{\text{adjusted}} = \frac{\chi^2_{\text{observed}}}{\lambda_{\text{GC}}}$
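A minimal sketch of this calculation, assuming the per-SNP statistics are 1-df chi-square values held in a plain list. The function name `genomic_control` is hypothetical, and the constant 0.455 is the rounded median used in these notes.

```python
import statistics

CHI2_1_MEDIAN = 0.455  # median of chi-square with 1 df, rounded as in the notes

def genomic_control(chi2_stats):
    """Return (lambda_gc, adjusted statistics) for a list of 1-df chi-square values."""
    lambda_gc = statistics.median(chi2_stats) / CHI2_1_MEDIAN
    adjusted = [x / lambda_gc for x in chi2_stats]
    return lambda_gc, adjusted


if __name__ == "__main__":
    lam, adj = genomic_control([0.455, 0.91, 1.82])
    print(lam, adj)
```

After the adjustment, the median of the corrected statistics equals 0.455 by construction.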
Allelic Chi-square Test
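Suppose a SNP has two alleles, A and a, with the following counts in cases and controls:
$$ \begin{array}{c|cc} & A & a \\ \hline \text{Case} & O_{1A} & O_{1a} \\ \text{Control} & O_{0A} & O_{0a} \\ \end{array} $$
where $N = O_{1A}+O_{1a}+O_{0A}+O_{0a}$. The expected count for allele A in cases is
$$ E_{1A} = \frac{(O_{1A}+O_{0A})(O_{1A}+O_{1a})}{N} $$
The remaining expected counts are computed in the same way:
$$ E_{1a}, E_{0A}, E_{0a} $$
Next, compute the chi-square statistic for each SNP:
$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} $$
Genomic control does not recompute $\chi^2$. It assumes that the $\chi^2$ values in the data are inflated and corrects them by dividing by $\lambda_{\text{GC}}$:
$$ \chi^2_{\text{GC}} = \frac{\chi^2}{\lambda_{\text{GC}}} $$
How is $\lambda_{\text{GC}}$ computed?
Take the median of the $\chi^2$ statistics across all SNPs and divide it by the theoretical median (for df = 1).
$$ \text{median}(\chi^2_{df=1}) = 0.455 $$
Then correct the statistic computed for each SNP:
$$ \chi^2_{\text{GC}, i} = \frac{\chi^2_i}{\lambda_{\text{GC}}}, \qquad \lambda_{\text{GC}} = \frac{\text{median}(\chi^2_{\text{all SNPs}})}{0.455} $$
and convert it to a p-value:
$$ p_i^{\text{GC}} = 1 - F_{\chi^2_{df=1}}(\chi^2_{\text{GC}, i}) $$
Genomic control is not well suited to polygenic traits: when many SNPs have true effects, the median $\chi^2$ is raised by real signal as well as by confounding, so dividing by $\lambda_{\text{GC}}$ overcorrects.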
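The allelic test and the conversion to a corrected p-value can be sketched for a single SNP as follows. Both function names are hypothetical, the counts in the usage example are made up for illustration, and $\lambda_{\text{GC}}$ is passed in as a number that would come from the genome-wide median. The p-value uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def allelic_chi2(case_A, case_a, control_A, control_a):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    n = case_A + case_a + control_A + control_a
    rows = [(case_A, case_a), (control_A, control_a)]
    col_totals = (case_A + control_A, case_a + control_a)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n   # E = row total * column total / N
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def gc_p_value(chi2, lambda_gc=1.0):
    """p-value of the statistic after dividing by lambda_gc (1 df)."""
    return math.erfc(math.sqrt(chi2 / lambda_gc / 2))


if __name__ == "__main__":
    stat = allelic_chi2(60, 40, 40, 60)   # illustrative counts only
    print(stat, gc_p_value(stat), gc_p_value(stat, lambda_gc=2.0))
```

A larger $\lambda_{\text{GC}}$ shrinks the statistic and therefore raises the p-value, which is the intended effect of the correction.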
Next Steps
Following the GWAS, several post-GWAS analyses can be conducted, including fine-mapping, functional annotation, and the calculation of polygenic risk scores (PRS). In addition, the GWAS Catalog provides a large repository of existing GWAS summary statistics, which we can use to validate the significant SNPs identified in our study.