<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Genome on Quan's Base</title><link>https://jiangcc.netlify.app/categories/genome/</link><description>Recent content in Genome on Quan's Base</description><generator>Hugo -- gohugo.io</generator><language>zh-tw</language><lastBuildDate>Sat, 10 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://jiangcc.netlify.app/categories/genome/index.xml" rel="self" type="application/rss+xml"/><item><title>Bayesian Interpretation for Positive False Discovery Rate</title><link>https://jiangcc.netlify.app/p/bayesian-interpretation-for-positive-false-discovery-rate/</link><pubDate>Sat, 10 Jan 2026 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/bayesian-interpretation-for-positive-false-discovery-rate/</guid><description>&lt;h2 id="problem">Problem&lt;/h2>
&lt;p>In multiple testing, we are often concerned with the proportion of false positives among all rejected hypotheses rather than with the probability of wrongly rejecting at least one hypothesis. That is, we accept that some true null hypotheses will be rejected, as long as their proportion is controlled.&lt;/p>
&lt;p>The positive false discovery rate (pFDR) can be written as a Bayesian posterior probability.&lt;/p>
&lt;h2 id="pfdr">pFDR&lt;/h2>
&lt;h3 id="concept">Concept&lt;/h3>
$$
\begin{array}{l|cc|c}
&amp; \text{Not rejected} &amp; \text{Rejected} &amp; \text{Total} \\
\hline
\text{Null true} &amp; U &amp; V &amp; m_0 \\
\text{Alternative true} &amp; T &amp; S &amp; m_1 \\
\hline
\text{Total} &amp; W &amp; R &amp; m
\end{array}
$$&lt;p>Possible outcomes from $m$ hypothesis tests. In this notation, $\mathrm{pFDR} = E\left[ V/R \mid R > 0 \right]$.&lt;/p>
&lt;h3 id="theorem">Theorem&lt;/h3>
&lt;h4 id="posterior-thm">Theorem 1 (posterior probability)&lt;/h4>
&lt;p>Suppose $m$ identical hypothesis tests are performed with statistics $T_1, \cdots, T_m$ and significance region $\Gamma$. Assume that the pairs $(T_i , H_i)$ are i.i.d., that $T_i \mid H_i \sim (1-H_i)F_0 + H_i F_1$ for a null distribution $F_0$ and an alternative distribution $F_1$, and that $H_i \sim \mathrm{Ber}(\pi_1)$. Then&lt;/p>
$$
\begin{align*}
\mathrm{pFDR}(\Gamma) = \frac{\pi_0 \, P(T \in \Gamma \mid H=0)}{P(T \in \Gamma)} = P(H=0 \mid T \in \Gamma)
\end{align*}
$$&lt;p>where $\pi_0=1-\pi_1 = P(H=0)$ and $P(T \in \Gamma) = \pi_0 P(T \in \Gamma \mid H=0) + \pi_1 P(T \in \Gamma \mid H=1)$.&lt;/p>
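&lt;p>As a quick numerical illustration of the posterior form of pFDR (the numbers are invented for illustration and do not come from the paper), take $\pi_0 = 0.9$, $P(T \in \Gamma \mid H=0) = 0.05$ and $P(T \in \Gamma \mid H=1) = 0.6$. Bayes' rule then gives&lt;/p>
$$
\begin{align*}
P(H=0 \mid T \in \Gamma) &amp;= \frac{\pi_0 \, P(T \in \Gamma \mid H=0)}{\pi_0 \, P(T \in \Gamma \mid H=0) + \pi_1 \, P(T \in \Gamma \mid H=1)} \\
&amp;= \frac{0.9 \times 0.05}{0.9 \times 0.05 + 0.1 \times 0.6} = \frac{0.045}{0.105} \approx 0.43
\end{align*}
$$&lt;p>So even with a 5% per-test false positive rate, roughly 43% of the rejections would be false discoveries in this hypothetical setting, because true alternatives are rare.&lt;/p>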
&lt;p>For a nested family of significance regions $\{\Gamma_\alpha\}$, the p-value of an observed statistic $T = t$ is defined as&lt;/p>
$$
\begin{align*}
\mathrm{p\text{-}value}(t)= \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}}
P(T\in\Gamma_\alpha \mid H=0)
\end{align*}
$$&lt;p>For an observed statistic $T = t$ define the q-value of $t$ to be&lt;/p>
$$
\begin{align*}
\text{q-value}(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \text{pFDR}(\Gamma_{\alpha})
\end{align*}
$$&lt;h4 id="corollary-2">Corollary 2&lt;/h4>
&lt;p>Under the assumptions of Theorem 1,&lt;/p>
$$
\begin{align*}
\text{q-value}(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} P(H=0 \mid T \in \Gamma_{\alpha} )
\end{align*}
$$&lt;h4 id="thm-for-dependence-stat">Theorem for dependent statistics&lt;/h4>
&lt;p>Suppose that, as $m \to \infty$, for each $\alpha > 0$ and for some continuous functions $G_0$ and $G_1$,&lt;/p>
$$
\begin{align*}
\sum_{i=1}^{m} \frac{(1 - H_i)}{m} \to \pi_0, \quad\frac{V_m(\Gamma_\alpha)}{\sum_{i=1}^{m} (1 - H_i)} \to G_0(\alpha), \quad \frac{S_m(\Gamma_\alpha)}{\sum_{i=1}^{m} H_i} \to G_1(\alpha)
\end{align*}
$$&lt;p>with probability 1&lt;/p>
&lt;p>Then for any $\delta>0$&lt;/p>
$$
\begin{align*}
\text{(i)} &amp; \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \frac{V_m(\Gamma_\alpha)}{R_m(\Gamma_\alpha) \vee 1} - P_{\infty}(H = 0 \mid T \in \Gamma_\alpha) \right| \stackrel{a.s.}{=} 0 \\
\text{(ii)} &amp; \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \text{FDR}_m(\Gamma_\alpha) - P_{\infty}(H = 0 \mid T \in \Gamma_\alpha) \right| = 0 \\
\text{(iii)} &amp; \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \text{pFDR}_m(\Gamma_\alpha) - P_{\infty}(H = 0 \mid T \in \Gamma_\alpha) \right| = 0
\end{align*}
$$&lt;p>where $P_{\infty}(H = 0 \mid T \in \Gamma_\alpha) = \frac{\pi_0 \cdot G_0(\alpha)}{\pi_0 \cdot G_0(\alpha) + (1 - \pi_0) \cdot G_1(\alpha)}$.&lt;/p>
&lt;h3 id="benefit">Benefit&lt;/h3>
&lt;h3 id="limitation">Limitation&lt;/h3>
&lt;h2 id="common-technology">Common Technology&lt;/h2>
&lt;h3 id="omnibus-test">Omnibus Test&lt;/h3>
&lt;p>The Omnibus Test uses summary data to deal with multiple cohorts/methods. In this paper, we use the omnibus test to check for significant associations across predictions from YFS, METSIM, and NTR (different tissues). For gene $i$&lt;/p>
$$
\begin{align*}
\text{omnibus}_i = \mathbf{Z_i^T C_i^{-1} Z_i} \overset{approx}{\sim} \chi^2_3
\end{align*}
$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$\mathbf{Z_i}$ is a $3 \times 1$ vector of TWAS Z-scores, one per cohort&lt;/li>
&lt;li>$\mathbf{C_i}$ is the $3 \times 3$ correlation matrix across the $3$ cohorts&lt;/li>
&lt;/ul>
&lt;h2 id="performance">Performance&lt;/h2>
&lt;h3 id="true-data">True Data&lt;/h3>
&lt;h2 id="simulation">Simulation&lt;/h2>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-6/The-positive-false-discovery-rate--a-Bayesian-interpretation-and/10.1214/aos/1074290335.full" target="_blank" rel="noopener"
>The Positive False Discovery Rate: A Bayesian Interpretation and the q-value&lt;/a> &lt;br>&lt;/li>
&lt;/ul></description></item><item><title>Meta Analysis</title><link>https://jiangcc.netlify.app/p/meta-analysis/</link><pubDate>Fri, 13 Jun 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/meta-analysis/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The idea of meta-analysis is to combine results across studies, because more associated SNPs can be detected as the total sample size increases. There are two types of meta-analysis: fixed-effect and random-effect. Suppose there are $N$ studies, and study $i$ reports the pair&lt;/p>
$$(\hat{\beta}_i, \sigma_i), \ i=1,\ldots,N$$&lt;p>where $\hat{\beta}_i$ is the estimated effect size and $\sigma_i$ is its standard error.&lt;/p>
&lt;p>A common pooled estimate is the inverse-variance-weighted mean $\widetilde{\beta} = \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}}$. If the $\hat{\beta}_i \sim {N}(\beta, \sigma_i^2)$ are independent, then&lt;/p>
$$
\begin{align*}
Var(\widetilde{\beta})
&amp;= {Var}\left( \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}} \right) \\
&amp;= \frac{\sum_{i=1}^N {Var}\left( \hat{\beta}_i \sigma_i^{-2} \right)}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\
&amp;= \frac{\sum_{i=1}^N \sigma_i^{-4} \sigma_i^2}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\
&amp;= \frac{\sum_{i=1}^N \sigma_i^{-2}}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\
&amp;= \frac{1}{\sum_{i=1}^N \sigma_i^{-2}}
\end{align*}
$$&lt;ul>
&lt;li>The fixed-effect model uses inverse-variance weighting and is appropriate when the studies have very similar population backgrounds and experimental methods&lt;/li>
&lt;li>In the random-effect model, the variance includes both the within-study error ($\sigma_i^2$) and the between-study heterogeneity ($\tau^2$)&lt;/li>
&lt;li>Heterogeneity test: before pooling the data, we must check whether the studies are compatible, for example with Cochran’s Q test&lt;/li>
&lt;li>&lt;strong>Genomic Control ($\lambda_{GC}$)&lt;/strong>: corrects for population stratification within each study&lt;/li>
&lt;li>&lt;strong>Funnel Plot&lt;/strong>: checks for publication bias&lt;/li>
&lt;li>&lt;strong>Forest Plot&lt;/strong>: shows the effect direction of a single SNP in each study&lt;/li>
&lt;/ul>
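&lt;p>The inverse-variance-weighted estimate and its variance derived above can be computed in a few lines of R. This is a minimal sketch with made-up inputs: the function name &lt;code>ivw_meta&lt;/code>, the example numbers, and the two-sided normal p-value are illustrative assumptions, not part of the original analysis.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Inverse-variance-weighted (fixed-effect) pooled estimate.
# beta_hat and se are hypothetical per-study estimates and standard errors.
ivw_meta &amp;lt;- function(beta_hat, se) {
  w &amp;lt;- 1 / se^2                            # weights sigma_i^{-2}
  beta_tilde &amp;lt;- sum(w * beta_hat) / sum(w)
  se_tilde &amp;lt;- sqrt(1 / sum(w))             # Var(beta_tilde) = 1 / sum(sigma_i^{-2})
  z &amp;lt;- beta_tilde / se_tilde
  p &amp;lt;- 2 * pnorm(-abs(z))                  # two-sided normal p-value (assumed test)
  list(beta = beta_tilde, se = se_tilde, z = z, p = p)
}

res &amp;lt;- ivw_meta(beta_hat = c(0.10, 0.20, 0.15), se = c(0.05, 0.10, 0.05))
print(res)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Studies with smaller standard errors receive larger weights, so the pooled standard error is never larger than the smallest individual one.&lt;/p>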
&lt;h2 id="material-and-method">Material and Method&lt;/h2>
&lt;p>flowchart&lt;/p>
&lt;h3 id="data">Data&lt;/h3>
&lt;p>We utilized data from the &lt;a class="link" href="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/" >1000 Genomes Project&lt;/a> to perform a GWAS for &lt;a class="link" href="https://github.com/lhchien-ndhu/113-2-Statistical-Genoimcs/tree/main/0412%20%E6%9C%9F%E4%B8%AD%E5%A0%B1%E5%91%8A" target="_blank" rel="noopener"
>height&lt;/a>. The study encompassed chromosomes 1 through 22, analyzing a total of 36,820,992 variants across 1,092 individuals.&lt;/p>
&lt;h3 id="genotype-qc">Genotype QC&lt;/h3>
&lt;ol>
&lt;li>Exclude SNPs or individuals with a missing rate $> 0.1$: 36,820,992 variants and 1,092 people pass the filter&lt;/li>
&lt;li>Exclude SNPs with MAF $\leq 0.05$:
6,797,981 variants and 1,092 people pass the filter&lt;/li>
&lt;li>Exclude SNPs whose Hardy-Weinberg equilibrium (HWE) test p-value is $&lt; 0.0001$:
4,941,621 variants and 1,092 people pass the filter&lt;/li>
&lt;li>LD-prune the SNPs used for PCA, excluding SNPs with pairwise $r^2 > 0.2$ within a 500 bp window:
299,901 variants and 1,092 people pass the filter&lt;/li>
&lt;li>Flip the sign of the effect estimate ($\hat{\beta}$ to $-\hat{\beta}$) where a study's effect allele differs, so that all studies refer to the same allele&lt;/li>
&lt;/ol>
&lt;h3 id="fix-effect-meta-analysis">Fixed Effect Meta-Analysis&lt;/h3>
$$\begin{align*}
(\hat{\beta}_i, \sigma^2_{i}),\quad i = 1, \ldots, N,\quad N \text{ studies}
\end{align*}$$&lt;p>
where&lt;/p>
&lt;ul>
&lt;li>$\hat{\beta}_i$ is the estimated effect size in study $i$&lt;/li>
&lt;li>$\sigma_i^2$ is the sampling variance of $\hat{\beta}_i$&lt;/li>
&lt;/ul>
$$\begin{align*}
\hat{\beta}_i \sim N(\beta, \sigma^2_{i}) \\
\tilde{\beta} = \frac{ \sum_{i=1}^N \hat{\beta}_i \sigma^{-2}_{i} }{ \sum_{i=1}^N \sigma^{-2}_{i} }
\end{align*}$$&lt;h3 id="random-effect-meta-analysis">Random Effect Meta-Analysis&lt;/h3>
$$\begin{align*}
&amp;\hat{\beta}_i \sim {N}(\beta_i, \sigma_i^2) , \quad \beta_i \sim {N}(\mu, \tau^2)
\end{align*}$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$\sigma^2_i$ is the within-study sampling variance&lt;/li>
&lt;li>$\tau^2$ is the between-study variance&lt;/li>
&lt;/ul>
$$\begin{align*}
\text{Var}(\hat{\beta}_i) &amp;= E(\text{Var}(\hat{\beta}_i | \beta_i)) + \text{Var}(E(\hat{\beta}_i | \beta_i)) \\
&amp;= E(\sigma_i^2) + \text{Var}(\beta_i) \\ &amp;= \sigma_i^2 + \tau^2
\end{align*}$$$$\begin{align*}
\Rightarrow \hat{\beta}_i \sim {N}(\mu, \sigma_i^2 + {\tau}^2)
\end{align*}$$&lt;h3 id="chi-square-test-for-heterogeneity-in-effect">Chi-square Test for Heterogeneity in Effect&lt;/h3>
$$\begin{align*}
Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}
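\end{align*}$$&lt;p>This tests whether any $\beta_i$ differs significantly from the others, i.e. $H_0: \beta_1 = \beta_2 = \cdots = \beta_N$; the $\chi^2_{N-1}$ distribution of $Q$ holds approximately under $H_0$. The $I^2$ statistic expresses how much of $Q$ exceeds what chance alone would produce (negative values are usually set to $0$):&lt;/p>
$$\begin{align*}
I^2 &amp;= 100\% \cdot \frac{Q - \text{df}}{Q}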
\end{align*}$$&lt;ul>
&lt;li>$I^2$ from $0\%$ to $25\%$: low heterogeneity $(\beta_1 = \beta_2 = \cdots = \beta_N)$; do not reject $H_0$&lt;/li>
&lt;li>$I^2$ from $25\%$ to $50\%$: moderate heterogeneity&lt;/li>
&lt;li>$I^2$ from $50\%$ to $75\%$: substantial heterogeneity&lt;/li>
&lt;li>$I^2 > 75\%$: considerable heterogeneity; reject $H_0$&lt;/li>
&lt;/ul>
&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$&lt;/li>
&lt;li>df = $N-1$&lt;/li>
&lt;/ul>
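&lt;p>The $Q$ and $I^2$ formulas above translate directly into R. The sketch below uses invented study estimates, and the helper name &lt;code>heterogeneity&lt;/code> is hypothetical; flooring $I^2$ at zero follows the usual convention rather than anything stated in the notes.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Cochran's Q and I^2 for hypothetical per-study estimates.
heterogeneity &amp;lt;- function(beta_hat, se) {
  w &amp;lt;- 1 / se^2
  beta_tilde &amp;lt;- sum(w * beta_hat) / sum(w)       # fixed-effect pooled estimate
  q &amp;lt;- sum(((beta_hat - beta_tilde) / se)^2)     # Cochran's Q
  df &amp;lt;- length(beta_hat) - 1
  p &amp;lt;- pchisq(q, df = df, lower.tail = FALSE)    # P(chi^2_{N-1} &amp;gt; Q)
  i2 &amp;lt;- max(0, 100 * (q - df) / q)               # I^2 in percent, floored at 0
  list(Q = q, df = df, p = p, I2 = i2)
}

het &amp;lt;- heterogeneity(beta_hat = c(0.10, 0.20, 0.15), se = c(0.05, 0.10, 0.05))
print(het)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>In this toy input $Q$ is smaller than its degrees of freedom, so $I^2$ is floored at zero and the three studies look homogeneous.&lt;/p>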
&lt;h3 id="cochrans-q-test">&lt;strong>Cochran’s Q test&lt;/strong>&lt;/h3>
&lt;p>Cochran’s Q test may be underpowered when few studies are included or when event rates are low. It is therefore often recommended to use a larger significance threshold than 0.05 (for example 0.1) when using Cochran’s Q test to assess statistical heterogeneity.&lt;/p>
$$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$$&lt;ul>
&lt;li>
&lt;p>In large samples, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)&lt;0.05$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In small samples, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)&lt;0.1$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reference:
&lt;a class="link" href="https://www.ncbi.nlm.nih.gov/books/NBK53317/table/ch3.t2/#:~:text=Cochran%27s%20Q%20test%20is%20the,within%20subjects%20within%20a%20study" target="_blank" rel="noopener"
>https://www.ncbi.nlm.nih.gov/books/NBK53317/table/ch3.t2/#:~:text=Cochran's%20Q%20test%20is%20the,within%20subjects%20within%20a%20study&lt;/a>.
 &lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="heterogeneity-in-effect">Heterogeneity in Effect&lt;/h3>
&lt;p>Heterogeneity in effect means that the genetic influence on a trait varies across individuals or populations, even when the trait itself looks the same. It may arise from:&lt;/p>
&lt;ul>
&lt;li>Differences in LD structure&lt;/li>
&lt;li>Interactions with environmental or other genetic exposures at different frequencies&lt;/li>
&lt;/ul>
&lt;h3 id="genomic-control">Genomic Control&lt;/h3>
$$\begin{align*}
\lambda_{\text{GC}} &amp;= \frac{\text{median}(\chi^2_{\text{observed}})}{\text{median}(\chi^2_{\text{expected}})} = \frac{\text{median}(\chi^2_{\text{observed}})}{0.455} \\
&amp;\lambda_{\text{GC}} \begin{cases}
\approx 1: &amp; \text{well calibrated} \\
> 1: &amp; \text{inflated} \\
&lt; 1: &amp; \text{conservative}
\end{cases}
\end{align*}$$&lt;p>where $0.455$ is the median of the $\chi^2_1$ distribution expected under the null, and the adjusted statistic is $\chi^2_{\text{adjusted}} = \frac{\chi^2_{\text{observed}}}{\lambda_{\text{GC}}}$.&lt;/p>
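&lt;h3 id="如何計算-λgc">&lt;strong>How to compute λGC?&lt;/strong>&lt;/h3>
&lt;p>Take the &lt;strong>median of the $\chi^2$ statistics over all SNPs&lt;/strong> and divide it by the theoretical median (for df = 1).&lt;/p>
$$
\text{median}(\chi^2_{df=1}) = 0.455
$$&lt;p>Then correct the statistic computed for each SNP:&lt;/p>
$$
\chi^2_{\text{GC}, i} = \frac{\chi^2_i}{\lambda_{\text{GC}}}, \qquad \lambda_{\text{GC}} = \frac{\text{median}(\chi^2_{\text{all SNPs}})}{0.455}
$$&lt;p>and convert it back to a p-value:&lt;/p>
$$
p_i^{\text{GC}} = 1 - F_{\chi^2_{df=1}}(\chi^2_{\text{GC}, i})
$$&lt;p>Genomic control is not well suited to polygenic traits, because genuine polygenic signal also raises the median $\chi^2$, so dividing by $\lambda_{\text{GC}}$ can overcorrect.&lt;/p>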
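&lt;p>The three steps above (estimate $\lambda_{\text{GC}}$, rescale each statistic, recompute p-values) can be sketched in R as follows. The input vector is simulated with an artificial inflation factor of 1.2, which is an assumption made purely for illustration.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Genomic control on a vector of 1-df chi-square statistics.
# chisq is simulated here; in practice it would hold one statistic per SNP.
set.seed(1)
chisq &amp;lt;- 1.2 * rchisq(10000, df = 1)               # hypothetical inflated statistics

lambda_gc &amp;lt;- median(chisq) / qchisq(0.5, df = 1)   # qchisq(0.5, 1) is about 0.455
chisq_gc &amp;lt;- chisq / lambda_gc                      # corrected statistics
p_gc &amp;lt;- pchisq(chisq_gc, df = 1, lower.tail = FALSE)

print(lambda_gc)
print(median(chisq_gc))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>After the correction, the median of the rescaled statistics equals the theoretical median by construction.&lt;/p>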
&lt;h2 id="result">Result&lt;/h2>
&lt;h3 id="snp-finding">SNP Finding&lt;/h3>
&lt;h3 id="ld">LD&lt;/h3>
&lt;h3 id="manhattan-plot">Manhattan Plot&lt;/h3>
&lt;h2 id="code">Code&lt;/h2>
&lt;h2 id="gwas-model">GWAS Model&lt;/h2>
$$y = \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k + \epsilon$$&lt;p>where $x$ is the SNP genotype and $z_1, \ldots, z_k$ are covariates. Under logistic regression,&lt;/p>
$$y \sim \text{Ber}(p), \ y \in \{0, 1\}$$$$\begin{align*}
\text{logit}(p) &amp;= \log\left(\frac{p}{1-p}\right) \\
&amp;= \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k
\end{align*}$$$$\begin{align*}
\beta &amp;= \log\left( \frac{p(x=1)}{1-p(x=1)} \right) - \log\left( \frac{p(x=0)}{1-p(x=0)} \right) \\
&amp;= \log\left( \frac{p(x=1)}{1-p(x=1)} \middle/ \frac{p(x=0)}{1-p(x=0)} \right)
\end{align*}$$&lt;h2 id="probability-distributions">Probability Distributions&lt;/h2>
$$\begin{align*}
P(\hat{\beta}_i \mid \mu, \tau) &amp;= \int P(\hat{\beta}_i, \beta_i \mid \mu, \tau) \, d\beta_i \\
&amp;= \int P(\hat{\beta}_i \mid \beta_i) P(\beta_i \mid \mu, \tau) \, d\beta_i \\
&amp;\propto \int \exp\left\{ -\frac{1}{2\sigma_i^2} (\hat{\beta}_i - \beta_i)^2 - \frac{1}{2\tau^2} (\beta_i - \mu)^2 \right\} \, d\beta_i
\end{align*}$$&lt;p>Since $(\hat{\beta}_i ,\beta_i)$ is bivariate normal, the marginal distribution of $\hat{\beta}_i$ is also normal, namely $N(\mu, \sigma_i^2 + \tau^2)$.&lt;/p>
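&lt;p>The integral above can be evaluated by completing the square in $\beta_i$. The following is a sketch of that standard calculation, writing $a = \sigma_i^{-2}$ and $b = \tau^{-2}$ (shorthand introduced here, not in the original notes):&lt;/p>
$$
\begin{align*}
a(\hat{\beta}_i - \beta_i)^2 + b(\beta_i - \mu)^2 &amp;= (a+b)\left(\beta_i - \frac{a\hat{\beta}_i + b\mu}{a+b}\right)^2 + \frac{ab}{a+b}(\hat{\beta}_i - \mu)^2, \\
\frac{ab}{a+b} &amp;= \frac{1}{\sigma_i^2 + \tau^2}
\end{align*}
$$&lt;p>The first term integrates over $\beta_i$ to a constant that does not depend on $\hat{\beta}_i$, which leaves&lt;/p>
$$
\begin{align*}
P(\hat{\beta}_i \mid \mu, \tau) \propto \exp\left\{ -\frac{(\hat{\beta}_i - \mu)^2}{2(\sigma_i^2 + \tau^2)} \right\}
\end{align*}
$$&lt;p>This is the kernel of a normal density with mean $\mu$ and variance $\sigma_i^2 + \tau^2$, in agreement with the variance decomposition in the random effect model.&lt;/p>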
&lt;h3 id="allelic-chi-square-test">Allelic Chi-square Test&lt;/h3>
&lt;p>Suppose a SNP has two alleles, A and a, with the following counts among cases and controls:&lt;/p>
$$
\begin{array}{c|cc}
&amp; A &amp; a \\
\hline
\text{Case} &amp; O_{1A} &amp; O_{1a} \\
\text{Control} &amp; O_{0A} &amp; O_{0a} \\
\end{array}
$$&lt;p>
where $N = O_{1A}+O_{1a}+O_{0A}+O_{0a}$&lt;/p>
$$
E_{1A} = \frac{(O_{1A}+O_{0A})(O_{1A}+O_{1a})}{N}
$$&lt;p>The remaining expected counts are computed in the same way:&lt;/p>
$$
E_{1a}, E_{0A}, E_{0a}
$$&lt;p>Then compute the chi-square statistic for each SNP:&lt;/p>
$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
$$&lt;p>Genomic control does not recompute $\chi^2$; it assumes that the observed $\chi^2$ values are inflated and corrects them by dividing by $\lambda_{\text{GC}}$:&lt;/p>
$$
\chi^2_{\text{GC}} = \frac{\chi^2}{\lambda_{\text{GC}}}
$$&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Following the GWAS, several post-GWAS analyses can be conducted, including fine-mapping, functional annotation, and the calculation of &lt;a class="link" href="https://jiangcc.netlify.app/p/polygenic-risk-score/" target="_blank" rel="noopener"
>polygenic risk scores (PRS)&lt;/a>. Furthermore, the &lt;a class="link" href="https://www.ebi.ac.uk/gwas/efotraits/EFO_0005570" target="_blank" rel="noopener"
>GWAS catalog&lt;/a> provides a vast repository of existing GWAS summary statistics. We can leverage this data to validate the significant SNPs identified in our study.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;p>&lt;a class="link" href="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/" target="_blank" rel="noopener"
>1000G&lt;/a> &lt;br>
&lt;a class="link" href="https://github.com/lhchien-ndhu/113-2-Statistical-Genoimcs" target="_blank" rel="noopener"
>Prof. lhchien Course&lt;/a>&lt;/p></description></item><item><title>TWAS</title><link>https://jiangcc.netlify.app/p/twas/</link><pubDate>Sun, 11 May 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/twas/</guid><description>&lt;img src="https://jiangcc.netlify.app/p/twas/surface.jpeg" alt="Featured image of post TWAS" />&lt;h2 id="problem">Problem&lt;/h2>
&lt;p>Studies of complex traits often have small sample sizes. Some methods address this, such as analyzing the overlap between eQTLs and GWAS trait variants, but they may miss expression effects of small size.&lt;/p>
&lt;h2 id="twas">TWAS&lt;/h2>
&lt;h3 id="concept">Concept&lt;/h3>
&lt;p>First, we check that the cis-heritability of expression is significantly non-zero ($h^2_{cis} \neq 0$). Then we use measured expression data to train an expression imputation model. Three imputation models are considered, based on the cis-eQTL, BLUP, and BSLMM, respectively. Comparing their $\frac{r^2}{h^2}$ shows that BSLMM performs best. Finally, we impute expression-trait association statistics from GWAS summary statistics and the trained imputation model.&lt;/p>
&lt;h3 id="benefit">Benefit&lt;/h3>
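&lt;p>TWAS does not require gene expression data for the GWAS samples; expression is needed only in the reference data used to train the imputation model.&lt;/p>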
&lt;h3 id="limitation">Limitation&lt;/h3>
&lt;ol>
&lt;li>We assume that SNPs affect traits through gene expression.&lt;/li>
&lt;li>TWAS cannot determine the direction of causality. One way to probe this is to add a trait term to the linear model: if the imputed expression is no longer significant, this suggests a phenotype-mediated effect (SNP → trait → expression).&lt;/li>
&lt;/ol>
&lt;h2 id="common-technology">Common Technology&lt;/h2>
&lt;h3 id="omnibus-test">Omnibus Test&lt;/h3>
&lt;p>The Omnibus Test uses summary data to deal with multiple cohorts/methods. In this paper, we use the omnibus test to check for significant associations across predictions from YFS, METSIM, and NTR (different tissues). For gene $i$&lt;/p>
$$
\begin{align*}
\text{omnibus}_i = \mathbf{Z_i^T C_i^{-1} Z_i} \overset{approx}{\sim} \chi^2_3
\end{align*}
$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$\mathbf{Z_i}$ is a $3 \times 1$ vector of TWAS Z-scores, one per cohort&lt;/li>
&lt;li>$\mathbf{C_i}$ is the $3 \times 3$ correlation matrix across the $3$ cohorts&lt;/li>
&lt;/ul>
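&lt;p>The omnibus statistic above is a quadratic form compared against a $\chi^2_3$ distribution. A minimal R sketch for one gene follows; the Z-scores and the correlation matrix are made-up values for illustration only.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Omnibus statistic Z' C^{-1} Z for one gene across 3 cohorts.
# z and C are invented values, not results from the paper.
z &amp;lt;- c(2.1, 1.8, 2.5)                          # TWAS Z-scores from the 3 cohorts
C &amp;lt;- matrix(c(1.0, 0.3, 0.2,
              0.3, 1.0, 0.4,
              0.2, 0.4, 1.0), nrow = 3, byrow = TRUE)

omnibus &amp;lt;- drop(t(z) %*% solve(C, z))          # solve(C, z) avoids forming C^{-1}
p_value &amp;lt;- pchisq(omnibus, df = length(z), lower.tail = FALSE)

print(c(omnibus = omnibus, p = p_value))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>When the cohorts are uncorrelated ($\mathbf{C_i}$ is the identity), the statistic reduces to the sum of squared Z-scores.&lt;/p>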
&lt;h3 id="permutation-test">Permutation Test&lt;/h3>
&lt;p>A permutation test requires no distributional assumption. It is a nonparametric method for testing whether groups differ significantly. In this paper, the expression-trait association is shuffled 1,000 times for each TWAS gene, and the distribution of the shuffled Z-scores $Z_{perm}$, which follows $N(0, \Sigma_{s,s})$, is plotted. The p-value is computed as&lt;/p>
$$
\begin{align*}
\text{p-value} = \frac{\displaystyle \sum_{i=1}^{1000}I(Z_{obs} &lt; Z_{perm,i})}{1000}
\end{align*}
$$&lt;p>If p-value$&lt;0.05$, we reject null hypothesis (expression $\perp$ trait).&lt;/p>
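&lt;p>The permutation p-value above is simply the share of permuted Z-scores that exceed the observed one. A minimal R sketch, with simulated stand-ins for both the observed and the permuted Z-scores (neither comes from the paper):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># One-sided permutation p-value: share of permuted Z-scores above the observed one.
# z_obs and z_perm are simulated placeholders.
set.seed(42)
z_obs &amp;lt;- 2.8
z_perm &amp;lt;- rnorm(1000)                         # 1,000 permuted Z-scores

p_value &amp;lt;- mean(z_obs &amp;lt; z_perm)               # sum(I(z_obs &amp;lt; z_perm)) / 1000
reject &amp;lt;- p_value &amp;lt; 0.05

print(p_value)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>A common variant adds 1 to both the numerator and the denominator so that the estimated p-value is never exactly zero.&lt;/p>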
&lt;h2 id="performance">Performance&lt;/h2>
&lt;h3 id="true-data">True Data&lt;/h3>
&lt;p>TWAS identified 25 novel expression-trait associations using summary association statistics from a 2010 lipid GWAS.&lt;/p>
&lt;h2 id="simulation">Simulation&lt;/h2>
&lt;h3 id="under-null">Under null&lt;/h3>
&lt;p>We simulate from two null models. For the model in which expression is independent of the SNPs (expression $\perp$ SNP) while the trait is cis-heritable:&lt;/p>
$$\text{Z-score} \sim N\left(0,\mathbf{\frac{WZ}{(W\Sigma_{s,s} W')^{1/2}}}\right) ,\ \text{expression} \sim N(0,1)$$&lt;p>For the model in which the trait is independent of the SNPs (trait $\perp$ SNP) while expression is cis-heritable:&lt;/p>
$$ \text{Z-score} \sim N(0,1) ,\ \text{expression}=\sum_i X_i +\varepsilon$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$\mathbf{W=\Sigma_{e,s}\Sigma^{-1}_{s,s}}$&lt;/li>
&lt;li>$\mathbf{\Sigma_{e,s}}:$ covariance between SNPs and expression&lt;/li>
&lt;li>$\mathbf{\Sigma_{s,s}}:$ covariance among all SNPs&lt;/li>
&lt;/ul>
&lt;h3 id="under-alternative">Under alternative&lt;/h3>
&lt;p>We use $6000$ unrelated METSIM GWAS samples, $100$ genes, and the SNPs within 1 Mb of each gene. For each of the $100$ genes, expression is simulated as&lt;/p>
$$
\begin{align*}
\mathbf{E}=\mathbf{X {\beta} + \varepsilon},\ \text{where } \varepsilon,\ \beta \text{ from Normal} \quad (1)
\end{align*}
$$&lt;p>to achieve $h^2_{cis-g}=0.17$. $1000$ samples with SNPs and simulated expression are then held out for training $(1)$, and $(1)$ is used to simulate expression for the remaining $5000$ samples. For these $5000$ samples, the phenotype $Y$ is simulated as&lt;/p>
$$
\begin{align*}
Y=E \alpha'+\varepsilon \quad (2)
\end{align*}
$$&lt;p>so that $h^2_E=\frac{0.1}{180}$ or $\frac{0.2}{180}$. The expression simulation $(1)$ and the phenotype simulation $(2)$ for the $5000$ samples are repeated $60$ times with different $\varepsilon$. After computing the Z-scores between the SNPs and the phenotype, this yields a simulated GWAS of size $5000 \times 60$.&lt;/p>
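&lt;p>One way to reach a target heritability in a simulation like $(1)$ is to rescale the noise term after drawing the genetic component. The R sketch below illustrates that idea only; the genotypes, dimensions and effect sizes are invented, and only the target value $0.17$ is taken from the text.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Simulate expression E = X beta + eps with a target cis-heritability.
# Genotypes, dimensions and effect sizes are invented; only h2 = 0.17 is from the text.
set.seed(7)
n &amp;lt;- 1000; m &amp;lt;- 50; h2 &amp;lt;- 0.17
maf &amp;lt;- runif(m, 0.05, 0.5)
X &amp;lt;- sapply(maf, function(f) rbinom(n, size = 2, prob = f))  # n x m genotype matrix
beta &amp;lt;- rnorm(m)

g &amp;lt;- drop(X %*% beta)                           # genetic component
eps &amp;lt;- rnorm(n)
# Rescale the noise so that var(g) / (var(g) + var(eps)) equals h2 in this sample
eps &amp;lt;- eps * sqrt(var(g) * (1 - h2) / h2 / var(eps))
E &amp;lt;- g + eps

print(var(g) / (var(g) + var(eps)))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The same rescaling could be applied to the phenotype simulation $(2)$ with a different target value.&lt;/p>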
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://pubmed.ncbi.nlm.nih.gov/26854917/" target="_blank" rel="noopener"
>Integrative approaches for large-scale transcriptome-wide association studies&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://cloufield.github.io/GWASTutorial/99_About/" target="_blank" rel="noopener"
>GWASTutorial&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2022.867724/full#B30" target="_blank" rel="noopener"
>Empirically-Derived Null Distributions&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://yanglab.westlake.edu.cn/software/gcta/#Overview" target="_blank" rel="noopener"
>GCTA coding&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://escholarship.org/uc/item/9ms4h3zm" target="_blank" rel="noopener"
>ImpG-Summary&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://xiangzhou.github.io/software/" target="_blank" rel="noopener"
>GEMMA&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://annahutch.github.io/PhD/LD-score-regression.html" target="_blank" rel="noopener"
>LD score and genomic-control method&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://www.hudsonalpha.org/wp-content/uploads/2024/01/EDNA_transcription-Translation.jpeg" target="_blank" rel="noopener"
>Surface picture&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>GWAS</title><link>https://jiangcc.netlify.app/p/gwas/</link><pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/gwas/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Genome-wide Association Study (GWAS) is a classic methodology for identifying SNPs associated with common diseases. By scanning the entire genome without prior hypotheses, it utilizes linear models to identify SNPs with significant statistical associations to specific diseases. However, because allele frequencies vary across populations and current GWAS data are predominantly derived from European ancestry, the predictive performance across different ethnic groups remains limited. Furthermore, due to Linkage Disequilibrium (LD), the significant loci identified by GWAS are often merely &amp;ldquo;tagging SNPs&amp;rdquo; rather than the actual causal variants responsible for the disease.&lt;/p>
&lt;p>During the Quality Control (QC) process, SNPs with a low Minor Allele Frequency (MAF) are typically excluded. Rare variants have extremely low frequencies; for instance, a MAF of 0.01 means that only about 1 in 100 chromosomes carries the minor allele, so roughly 2 in 100 individuals are carriers (each person has two copies of each autosome). When the sample size of a study is insufficient, the influence of these few rare-variant carriers often fails to reach the stringent p-value threshold required for genome-wide significance.&lt;/p>
&lt;h2 id="material-and-method">Material and Method&lt;/h2>
&lt;p>The GWAS workflow begins with stringent QC based on MAF and Hardy-Weinberg Equilibrium (HWE). Following QC, the filtered SNPs are used for the primary association analysis. For population structure correction, a subset of SNPs is subjected to LD pruning to perform Principal Component Analysis (PCA). The final Linear Regression Model incorporates the post-QC SNPs as the independent variable, with Principal Components (PCs) and gender included as covariates to account for population stratification and confounding factors.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="flowchart.png" style="width:30%;">
&lt;/div>
&lt;h3 id="data">Data&lt;/h3>
&lt;p>We utilized data from the &lt;a class="link" href="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/" >1000 Genomes Project&lt;/a> to perform a GWAS for &lt;a class="link" href="https://github.com/lhchien-ndhu/113-2-Statistical-Genoimcs/tree/main/0412%20%E6%9C%9F%E4%B8%AD%E5%A0%B1%E5%91%8A" target="_blank" rel="noopener"
>height&lt;/a>. The study encompassed chromosomes 1 through 22, analyzing a total of 36,820,992 variants across 1,092 individuals.&lt;/p>
&lt;h3 id="genotype-qc">Genotype QC&lt;/h3>
&lt;ol>
&lt;li>Exclude SNPs or individuals with a missing rate $> 0.1$: 36,820,992 variants and 1,092 people pass the filter&lt;/li>
&lt;li>Exclude SNPs with MAF $\leq 0.05$:
6,797,981 variants and 1,092 people pass the filter&lt;/li>
&lt;li>Exclude SNPs whose Hardy-Weinberg equilibrium (HWE) test p-value is $&lt; 0.0001$:
4,941,621 variants and 1,092 people pass the filter&lt;/li>
&lt;li>LD-prune the SNPs used for PCA, excluding SNPs with pairwise $r^2 > 0.2$ within a 500 bp window:
299,901 variants and 1,092 people pass the filter&lt;/li>
&lt;/ol>
&lt;h3 id="linear-model">Linear model&lt;/h3>
&lt;p>After applying filters for missingness rate, MAF, and HWE, the dataset retained 4,941,621 variants and 1,092 individuals. This high-quality dataset served as the foundation for constructing the linear model&lt;/p>
$$
\begin{align*}
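Y_i &amp;= \beta_0 + \beta_j X_{ij} + \gamma_0 \, \text{Gender}_i + \gamma_1 {PC}_{1i} + \gamma_2 {PC}_{2i} + \gamma_3 {PC}_{3i} + \varepsilon_i,\\
i &amp;=1, \cdots , 1092; \quad j=1, \cdots , 4941621
\end{align*}
$$&lt;p>where $Y_i$ is the height of the $i$-th sample and $X_{ij}$ is the genotype of the $i$-th sample at the $j$-th SNP; one model is fitted per SNP.&lt;/p>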
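&lt;p>For a single SNP, a model of this form can be fitted with &lt;code>lm()&lt;/code>. The sketch below runs on simulated data; the column names and effect sizes are illustrative assumptions, and it does not reproduce the actual analysis pipeline.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Per-SNP linear regression with gender and three PCs as covariates.
# All data below are simulated; column names are illustrative.
set.seed(3)
n &amp;lt;- 500
dat &amp;lt;- data.frame(
  snp    = rbinom(n, size = 2, prob = 0.3),     # genotype coded 0/1/2
  gender = rbinom(n, size = 1, prob = 0.5),
  pc1 = rnorm(n), pc2 = rnorm(n), pc3 = rnorm(n)
)
dat$height &amp;lt;- 170 + 1.5 * dat$snp + 6 * dat$gender + rnorm(n, sd = 5)

fit &amp;lt;- lm(height ~ snp + gender + pc1 + pc2 + pc3, data = dat)
snp_row &amp;lt;- summary(fit)$coefficients["snp", ]  # estimate, SE, t value, p-value
print(snp_row)
print(snp_row["Pr(&amp;gt;|t|)"] &amp;lt; 5e-8)            # compare with the genome-wide threshold
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>In a genome-wide scan the same fit is repeated for every SNP, which is why dedicated tools are normally used instead of a loop over &lt;code>lm()&lt;/code> calls.&lt;/p>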
&lt;h2 id="result">Result&lt;/h2>
&lt;h3 id="snp-finding">SNP Finding&lt;/h3>
&lt;p>We identified several SNPs significantly associated with height, using the standard threshold of p-value $&lt; 5 \cdot 10^{-8}$. These lead SNPs and their corresponding statistics are summarized in the table below.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="snp_sig.png" style="width:90%;">
&lt;/div>
&lt;h3 id="ld">LD&lt;/h3>
&lt;p>To determine if multiple significant SNPs represent the same underlying genetic signal, we evaluated the LD between them. Our analysis indicates that five significant SNPs on Chromosome 5 are in high LD with one another, suggesting they likely tag the same causal locus. Similarly, high LD was observed between two SNPs on Chromosome 6, as well as two SNPs on Chromosome 17.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="sig_ld.png" style="width:90%;">
&lt;/div>
&lt;h3 id="manhattan-plot">Manhattan Plot&lt;/h3>
&lt;p>The Manhattan plot visualizes the association results across the genome, with the red dashed line indicating the significance threshold ($-\log_{10}(5 \cdot 10^{-8})$). The plot highlights prominent peaks of significant SNPs, most notably on Chromosome 5, which demonstrates the strongest association with the trait.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="Manhattan_Plot.png" style="width:90%;">
&lt;/div>
&lt;h2 id="code">Code&lt;/h2>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data.table&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dplyr&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>merge data&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">setwd&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/midterm&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kr">for&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="kr">in&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">22&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">## vcf to bed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">paste0&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --vcf ALL.chr&amp;#34;&lt;/span> &lt;span class="p">,&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz --make-bed --out chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="p">)&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">system&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">## give variants whose SNP ID is missing (.) a new ID&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">paste0&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34; --snps-only just-acgt --set-missing-var-ids @:#[b37] --make-bed --out chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;_TransformMissing&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">system&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">list&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">list&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kr">for&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="kr">in&lt;/span> &lt;span class="m">2&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">22&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">## merge files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">list&lt;/span>&lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">list&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">rbind&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">paste&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;_TransformMissing&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;.bed&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;.bim&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;.fam&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="n">sep&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s">&amp;#34;&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Create merge_files.txt, listing the names of the files to merge&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">write.table&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">list&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s">&amp;#34;merge_files.txt&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">row.names&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">col.names&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">quote&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Merge the bed, bim, fam of chr1_TransformMissing with the files listed in merge_files.txt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile chr1_TransformMissing --merge-list merge_files.txt --make-bed --out process//merge&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>QC, LD pruning, and PCA&lt;/p>
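&lt;p>The block below ends by assembling the covariate table column by column, which only works if &lt;code>pheno.txt&lt;/code> lists individuals in the same order as the &lt;code>.eigenvec&lt;/code> file. As a minimal sketch of a safer alternative, the toy example here joins the two tables on the individual ID. All IDs and values are made up, and it assumes the phenotype file carries an &lt;code>IID&lt;/code> column, which the original post does not confirm.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Made-up stand-ins for merge_pca.eigenvec and pheno.txt (hypothetical IDs and values)
e_vec_toy &amp;lt;- data.frame(V1 = c(&amp;#34;F1&amp;#34;, &amp;#34;F2&amp;#34;, &amp;#34;F3&amp;#34;),
                        V2 = c(&amp;#34;I1&amp;#34;, &amp;#34;I2&amp;#34;, &amp;#34;I3&amp;#34;),
                        V3 = c(0.01, -0.02, 0.03),
                        V4 = c(0.10, 0.20, -0.30))
# Deliberately listed in a different order than e_vec_toy
pheno_toy &amp;lt;- data.frame(IID = c(&amp;#34;I3&amp;#34;, &amp;#34;I1&amp;#34;, &amp;#34;I2&amp;#34;),
                        Gender = c(2, 1, 2))

# Name the eigenvec columns: family ID, individual ID, then the PCs
names(e_vec_toy) &amp;lt;- c(&amp;#34;FID&amp;#34;, &amp;#34;IID&amp;#34;, &amp;#34;PC1&amp;#34;, &amp;#34;PC2&amp;#34;)

# Join on IID so each individual's gender lines up with its own PCs
covar_toy &amp;lt;- merge(e_vec_toy, pheno_toy, by = &amp;#34;IID&amp;#34;)
covar_toy &amp;lt;- covar_toy[, c(&amp;#34;FID&amp;#34;, &amp;#34;IID&amp;#34;, &amp;#34;Gender&amp;#34;, &amp;#34;PC1&amp;#34;, &amp;#34;PC2&amp;#34;)]
covar_toy
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Joining by ID also makes it visible when an individual is missing from one of the files, because that individual simply drops out of the merged table instead of shifting every later row.&lt;/p>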
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;span class="lnt">31
&lt;/span>&lt;span class="lnt">32
&lt;/span>&lt;span class="lnt">33
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">setwd&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/midterm&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Generate two files recording missingness per individual and per SNP&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile process/merge --missing&amp;#34;&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Drop individuals with missing rate above 0.1 (--mind); --geno, --maf and --hwe filter SNPs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile process//merge --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.0001 --make-bed --out process//merge_QC&amp;#34;&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Convert to ped and map&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile process//merge_QC --recode --out process//merge_QCPed&amp;#34;&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># LD pruning (writes only prune.in and prune.out)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --file process//merge_QCPed --indep-pairwise 500 50 0.2 --out process//merge_QCld&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># choose SNP in prune.in, output .ped and .map&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --file process//merge_QCPed --extract process//merge_QCld.prune.in --recode --out process//merge_prune&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#pca&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --file process//merge_prune --pca --out process//merge_pca&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">e.vec&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;process//merge_pca.eigenvec&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">g&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;process//pheno.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">covar&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">data.frame&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FID&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">e.vec&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">IID&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">e.vec&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">gender&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">g&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">Gender&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PC1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">e.vec&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PC2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">e.vec&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V4&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">PC3&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">e.vec&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">write.table&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">covar&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s">&amp;#34;D://GWAS_CLASS//20101123//process//covar.txt&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">row.names&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">quote&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Linear model&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;span class="lnt">2
&lt;/span>&lt;span class="lnt">3
&lt;/span>&lt;span class="lnt">4
&lt;/span>&lt;span class="lnt">5
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## association&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">### Important: pheno.txt must be a plain-text file with the fields of each row separated by tabs, so the columns will not look aligned when opened in Notepad&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile process//merge_QC --pheno process//pheno.txt --pheno-name Height --make-bed --out process//merge_f&amp;#34;&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## linear regression with gender and PC1 to PC3 as covariates&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile process//merge_f --covar process//covar.txt --covar-name gender PC1 PC2 PC3 --allow-no-sex --linear --out process//linear_model&amp;#34;&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Select significant SNPs&lt;/p>
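&lt;p>The code below keeps one row per SNP, the additive (ADD) term, and then applies the genome-wide threshold of 5e-8. With four covariates, PLINK writes five rows per SNP to the &lt;code>.assoc.linear&lt;/code> file, labelled in its &lt;code>TEST&lt;/code> column. The toy sketch here, with made-up SNP names and p-values, shows that selecting by that label picks the same rows as taking every fifth row, without depending on the exact number of SNPs or covariates.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r"># Toy stand-in for a .assoc.linear table: 2 SNPs x 5 rows each
# (ADD plus four covariates). All values are made up.
tests &amp;lt;- c(&amp;#34;ADD&amp;#34;, &amp;#34;gender&amp;#34;, &amp;#34;PC1&amp;#34;, &amp;#34;PC2&amp;#34;, &amp;#34;PC3&amp;#34;)
r_toy &amp;lt;- data.frame(
  SNP  = rep(c(&amp;#34;rs1&amp;#34;, &amp;#34;rs2&amp;#34;), each = 5),
  TEST = rep(tests, times = 2),
  P    = c(1e-9, 0.2, 0.5, 0.7, 0.9, 0.03, 0.2, 0.5, 0.7, 0.9)
)

# Select the additive term by label
by_label &amp;lt;- r_toy[r_toy$TEST == &amp;#34;ADD&amp;#34;, ]

# Select the additive term by position: rows 1, 6, ...
n_snp &amp;lt;- nrow(r_toy) / 5
by_index &amp;lt;- r_toy[1 + 5 * (0:(n_snp - 1)), ]

# Both approaches return the same SNPs
identical(as.character(by_label$SNP), as.character(by_index$SNP))

# Keep SNPs below the genome-wide significance threshold
sig_toy &amp;lt;- by_label[by_label$P &amp;lt; 5e-8, ]
as.character(sig_toy$SNP)
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The positional version silently returns the wrong rows if a covariate is added or removed, whereas the label-based version keeps working.&lt;/p>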
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## plot&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">r&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">read.table&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s">&amp;#34;process//linear_model.assoc.linear&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">T&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">head&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loc&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;process//merge_f.bim&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Combine V5 and V6 (the two alleles)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loc&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">loc&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">mutate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">CodingAllele&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">paste&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">V5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">V6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sep&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;/&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">names&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">[1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">4&lt;/span>&lt;span class="n">]&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;CHR&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;SNP&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;v3&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34;BP&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## There are 4941621 SNPs; keep only the ADD term of the linear model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">r_snp&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">r[r&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">TEST&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s">&amp;#34;ADD&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">dim&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r_snp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">head&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r_snp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">dim&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">dim&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">snp_sig&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">r_snp&lt;/span>&lt;span class="nf">[which&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r_snp&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">P&lt;/span>&lt;span class="o">&amp;lt;&lt;/span>&lt;span class="m">5&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="nf">^&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">-8&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">snp_sig&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">snp_sig&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">left_join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span> &lt;span class="nf">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">SNP&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">CodingAllele&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">by&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;SNP&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Are the significant SNPs in high LD?&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">prune_out&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;process//merge_QCld.prune.out&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">snp_sig_ld&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">intersect&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">snp_sig&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">SNP&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">prune_out&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">V1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">snp_sig_ld&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sig&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">data.frame&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sig_snp&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">snp_sig&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">SNP&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">write.table&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">file&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s">&amp;#34;process//significant_snp.txt&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">row.names&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">col.names&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">quote&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">F&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># select significant_snp.txt snp in merge_QCPed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --file process//merge_QCPed --extract process//significant_snp.txt --recode --out process//merge_sig&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># get all pairs ld&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --file process//merge_sig --r2 --ld-window-r2 0 --out process//sig_snp_LD&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Manhattan plot&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ggplot2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dplyr&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ggrepel&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># to avoid overlapping labels&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Compute -log10(p-value)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">r_snp&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">logP&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="nf">log10&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r_snp&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">P&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Build the index i (position on the x axis)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">r_snp&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">r_snp&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span> &lt;span class="nf">arrange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">CHR&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">BP&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">mutate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="nf">n&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Take the middle position of each chromosome as its x-axis tick&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">chr_labels&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">r_snp&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">group_by&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">CHR&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">summarize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">median&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Keep the significant SNPs (p &amp;lt; 5e-8)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sig_snps&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">r_snp&lt;/span> &lt;span class="o">%&amp;gt;%&lt;/span> &lt;span class="nf">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">P&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="m">5e-8&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Draw the plot&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">ggplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">r_snp&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">logP&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">color&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">as.factor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">CHR&lt;/span> &lt;span class="o">%%&lt;/span> &lt;span class="m">2&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_point&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">scale_color_manual&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">values&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;skyblue&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34;grey&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_hline&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yintercept&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="nf">log10&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">5e-8&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">color&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;red&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">linetype&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;dashed&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">scale_x_continuous&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">breaks&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">chr_labels&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">center&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">chr_labels&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">CHR&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">labs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Chromosome&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;-log10(p-value)&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;Manhattan Plot&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme_minimal&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">theme&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">legend.position&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;none&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># add SNP labels&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">geom_text_repel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">sig_snps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nf">aes&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SNP&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">max.overlaps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">20&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Following the GWAS, several post-GWAS analyses can be conducted, including fine-mapping, functional annotation, and the calculation of &lt;a class="link" href="https://jiangcc.netlify.app/p/polygenic-risk-score/" target="_blank" rel="noopener"
>polygenic risk scores (PRS)&lt;/a>. Furthermore, the &lt;a class="link" href="https://www.ebi.ac.uk/gwas/efotraits/EFO_0005570" target="_blank" rel="noopener"
>GWAS catalog&lt;/a> provides a vast repository of existing GWAS summary statistics. We can leverage this data to validate the significant SNPs identified in our study.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;p>&lt;a class="link" href="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/" target="_blank" rel="noopener"
>1000G&lt;/a> &lt;br>
&lt;a class="link" href="https://github.com/lhchien-ndhu/113-2-Statistical-Genoimcs" target="_blank" rel="noopener"
>Prof. lhchien Course&lt;/a>&lt;/p></description></item><item><title>Heritability</title><link>https://jiangcc.netlify.app/p/heritability/</link><pubDate>Tue, 22 Apr 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/heritability/</guid><description>&lt;img src="https://jiangcc.netlify.app/p/heritability/surface.jpg" alt="Featured image of post Heritability" />&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Heritability is the proportion of variation in a trait within a population that can be attributed to genetic differences. There are two definitions, broad-sense and narrow-sense heritability; the narrow-sense one is more commonly used.&lt;/p>
&lt;h2 id="broad-sense-heritability">Broad-sense Heritability&lt;/h2>
$$
\begin{align*}
H^2&amp;=\frac{V_G}{V_G+V_E} \\
&amp;=\frac{V_G}{V_P}
\end{align*}
$$&lt;ul>
&lt;li>$V_P$ is the phenotypic variance&lt;/li>
&lt;li>$V_G$ is the genetic variance&lt;/li>
&lt;li>$V_E$ is the environmental variance&lt;/li>
&lt;/ul>
&lt;h2 id="nanow-sense-heritability">Narrow-sense Heritability&lt;/h2>
$$
\begin{align*}
h^2=\frac{V_A}{V_P}, \quad \text{where } V_G=V_A+V_{NA}
\end{align*}
$$&lt;ul>
&lt;li>$V_A$ is additive genetic variation&lt;/li>
&lt;li>$V_{NA}$ is non-additive genetic variation (e.g. dominance and epistasis)&lt;/li>
&lt;/ul>
&lt;h3 id="example-dominant-coding">Example-Dominant Coding&lt;/h3>
&lt;p>In dominant coding, genotypes $CC$, $C T$ and $T T$ are coded as $1$, $1$ and $0$, respectively, if $C$ is the minor allele.&lt;/p>
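&lt;p>The coding rule above can be sketched in a few lines of Python. The helper name &lt;code>dominant_code&lt;/code> is hypothetical and only meant as an illustration: a genotype is coded $1$ when it carries at least one copy of the minor allele.&lt;/p>

```python
# Hypothetical helper illustrating dominant coding of a biallelic genotype.
def dominant_code(genotype: str, minor_allele: str) -> int:
    """Return 1 if the genotype carries at least one minor allele, else 0."""
    return 1 if minor_allele in genotype else 0

# With C as the minor allele, CC and CT are coded 1 and TT is coded 0.
codes = {g: dominant_code(g, "C") for g in ("CC", "CT", "TT")}
print(codes)  # {'CC': 1, 'CT': 1, 'TT': 0}
```

&lt;p>Under this coding the heterozygote is treated like the minor-allele homozygote, so the model cannot distinguish one copy from two.&lt;/p>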
&lt;h3 id="example-epistasis">Example-Epistasis&lt;/h3>
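&lt;p>Labrador coat color is determined by two genes, $B/b$ and $E/e$, with four allele combinations: $BE$, $bE$, $Be$, $be$. The $ee$ genotype masks the effect of the $B$ gene, so the two genes do not act additively.&lt;/p>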
&lt;ul>
&lt;li>Color is &lt;font color="black">black&lt;/font> when genotype is $B-E-$&lt;/li>
&lt;li>Color is &lt;font color="chocolate">chocolate&lt;/font> when genotype is $bbE-$&lt;/li>
&lt;li>Color is &lt;font color="gold">yellow&lt;/font> when genotype is $--ee$&lt;/li>
&lt;/ul>
&lt;div style="text-align: center;">
&lt;img src="dog.jpg" style="width:90%;">
&lt;/div>
&lt;p>&lt;strong>Note&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Heritability refers to a specific population, not to individuals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heritability $\neq$ inheritance. For example, your brown hair may be inherited from your father, but the heritability of brown hair in the population may be low.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Heritability $\neq$ total genetic contribution. A low $h^2 = \frac{V_A}{V_P}$ does not necessarily mean that genetics plays a small role.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If $h^2$ is low, identifying associated genes might be less fruitful.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>There are three common types of heritability estimate.&lt;/p>
&lt;h2 id="family-based-heritability-h2_textfamily">Family-Based Heritability $(h^2_{\text{family}})$&lt;/h2>
&lt;p>Family-based studies, often twin studies, estimate heritability by comparing monozygotic (MZ) twins and dizygotic (DZ) twins. Let $r_{MZ}$ be the phenotypic correlation for MZ twins and $r_{DZ}$ for DZ twins.&lt;/p>
$$
\begin{align*}
\left\lbrace \begin{array}{lll}
r_{MZ} = A+C \\
r_{DZ} = \frac{A}{2}+C
\end{array} \right.
\end{align*}
$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$A$ is additive genetic effect&lt;/li>
&lt;li>$C$ is shared (common) environmental effect&lt;/li>
&lt;/ul>
&lt;p>We estimate&lt;/p>
$$
\begin{align*}
h^2_{\text{family}} &amp; =A=2(r_{MZ}-r_{DZ}) \\
C &amp; = r_{MZ}-A = 2r_{DZ}-r_{MZ}
\end{align*}
$$&lt;p>The remaining unique-environment and error component is $E = 1-r_{MZ}$, so that $A+C+E=1$.&lt;/p>
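&lt;p>As a minimal sketch, the two twin equations $r_{MZ} = A + C$ and $r_{DZ} = \frac{A}{2} + C$ can be solved directly. The function name &lt;code>falconer_ace&lt;/code> and the twin correlations below are made up for illustration and are not taken from any real study.&lt;/p>

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Solve r_MZ = A + C and r_DZ = A/2 + C, then set E = 1 - A - C."""
    a = 2.0 * (r_mz - r_dz)  # additive genetic share, i.e. h^2_family
    c = r_mz - a             # shared environment, equal to 2*r_DZ - r_MZ
    e = 1.0 - a - c          # unique environment and error, equal to 1 - r_MZ
    return {"A": a, "C": c, "E": e}

# Made-up twin correlations, for illustration only.
est = falconer_ace(r_mz=0.8, r_dz=0.5)
print(est)  # A is about 0.6, C about 0.2, E about 0.2
```

&lt;p>If $r_{MZ}$ is more than twice $r_{DZ}$, this simple solution gives a negative $C$, which usually signals that the model assumptions do not hold.&lt;/p>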
&lt;h2 id="snp-based-heritability-h2_textsnp">SNP-Based Heritability $(h^2_{\text{SNP}})$&lt;/h2>
&lt;p>Estimated using tools such as &lt;a class="link" href="https://jiangcc.netlify.app/p/gcta/" target="_blank" rel="noopener"
>GCTA&lt;/a> under the mixed linear model&lt;/p>
$$
\begin{align*}
\mathbf{Y=X \beta+W u+\varepsilon}
\end{align*}
$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$\mathbf{u} \sim N\left(0, \mathbf{I \sigma_u^2}\right)$&lt;/li>
&lt;li>$\varepsilon \sim N\left(0, \mathbf{I \sigma_\varepsilon^2}\right)$&lt;/li>
&lt;li>$\beta$ is a fixed effect (no variation)&lt;/li>
&lt;/ul>
&lt;p>So the variance of $\mathbf{Y}$ is&lt;/p>
$$
\begin{align*}
\operatorname{Var}(\mathbf{Y})= &amp; V \\
= &amp; \operatorname{Var}(\mathbf{W u})+\operatorname{Var}(\varepsilon) \\
= &amp; \mathbf{W W^T \sigma_u^2+I \sigma_{\varepsilon}^2}
\end{align*}
$$&lt;p>The &lt;strong>standardized genotype matrix&lt;/strong>&lt;/p>
$$
\begin{align*}
\mathbf{W}=\left\{w_{i j}\right\}, \quad w_{i j}=\frac{X_{i j}-2 p_j}{\sqrt{2 p_j\left(1-p_j\right)}}
\end{align*}
$$&lt;p>where&lt;/p>
&lt;ul>
&lt;li>$X_{ij}$ is the minor-allele count of the $j$-th SNP for the $i$-th individual&lt;/li>
&lt;li>$p_j$ is the MAF of the $j$-th SNP&lt;/li>
&lt;/ul>
&lt;p>Define &lt;strong>Genetic Relationship Matrix (GRM)&lt;/strong>&lt;/p>
$$
\begin{align*}
A=\frac{\mathbf{WW^T}}{N}
\end{align*}
$$&lt;p>where $N$ is the number of SNPs and $\mathbf{\sigma_g^2}=N \mathbf{\sigma_u^2}$&lt;/p>
&lt;p>Rewriting the model&lt;/p>
$$
\begin{align*}
\Rightarrow \quad \mathbf{Y=X \beta+g+\varepsilon}, \quad \mathbf{g} \sim N\left(\mathbf{0, A \sigma_g^2}\right)
\end{align*}
$$&lt;p>We estimate $\sigma_g^2$ using REML (Restricted Maximum Likelihood). The proportion of phenotypic variance explained by &lt;font color="blue">the SNPs used to construct the GRM&lt;/font> is given by&lt;/p>
$$
\begin{align*}
h^2_{\text{SNP}} = \frac{\sigma_g^2}{\text{Var}(Y)}
\end{align*}
$$&lt;p>For more detail, see the &lt;a class="link" href="https://jiangcc.netlify.app/p/gcta/" target="_blank" rel="noopener"
>GCTA&lt;/a> post.&lt;/p>
&lt;h2 id="gwas-based-heritability-h_text-gwas-2">GWAS-based Heritability $(h_{\text {GWAS }}^2)$&lt;/h2>
&lt;p>This estimates heritability using only the &lt;font color="blue">significant SNPs identified in GWAS&lt;/font>. Assuming $m$ significant SNPs are linearly associated with the trait&lt;/p>
$$
\begin{align*}
{Y}=\beta_0+\sum_{i=1}^m \beta_i X_i+\varepsilon
\end{align*}
$$&lt;p>The heritability is&lt;/p>
$$
\begin{align*}
h_{\text {GWAS }}^2 = \frac{Var(\hat{Y})}{Var(Y)}
\end{align*}
$$&lt;h3 id="relationship-between-heritability-types">Relationship Between Heritability Types&lt;/h3>
$$
\begin{align*}
h_{\text {family }}^2 > h_{\text {SNP }}^2 \gg h_{\text {GWAS }}^2
\end{align*}
$$&lt;p>The gap between the family-based estimate and the SNP- and GWAS-based estimates is known as &lt;strong>missing heritability&lt;/strong>.&lt;/p>
&lt;h2 id="code">Code&lt;/h2>
&lt;p>We used the &lt;code>gcta64&lt;/code> command to estimate the Genetic Relationship Matrix (GRM) for each of the 22 autosomes separately and then merged the results. Compared with estimating the GRM for all autosomes in a single run, the results were identical. However, the per-chromosome approach took approximately two and a half hours, whereas the single run required only about ten minutes.&lt;/p>
&lt;h3 id="computing-separately">Computing Separately&lt;/h3>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## estimate h^2_SNP&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">getwd&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">setwd&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/GCTA&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step0: split snp data to different chr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kr">for&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="kr">in&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">22&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">paste0&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34; --make-bed --out D:/GWAS_CLASS/GCTA/data/merge_chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step1: make GRM&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># --maf: filter SNPs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># --make-grm: make GRM&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># --thread-num: Parallel computation. You should generally not specify a number of threads that exceeds the number of physical cores.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kr">for&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="kr">in&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">22&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">paste0&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="p">,&lt;/span>&lt;span class="s">&amp;#34; --maf 0.01 --make-grm --out D:/GWAS_CLASS/GCTA/data/merge_chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s">&amp;#34; --thread-num 10&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step2: write grm_list.txt, listing the file prefix of every per-chromosome GRM&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">writeLines&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nf">paste0&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/GCTA/data/merge_chr&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">22&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="s">&amp;#34;D:/GWAS_CLASS/GCTA/grm_list.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step3: merge all the GRMs by the following command:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;gcta64 --mgrm D:/GWAS_CLASS/GCTA/grm_list.txt --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step4: remove cryptic relatedness: 0.025 roughly corresponds to individuals who are less related than third-degree&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge --grm-cutoff 0.025 --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## step5: estimating the variance explained by the SNPs (heritability)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">### input: filtered GRM from step 4 (grm_merge_filtered) + phenotype info (pheno.txt)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --pheno D:/GWAS_CLASS/20101123/process/pheno.txt --reml --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --thread-num 10&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h3 id="computing-together">Computing Together&lt;/h3>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## estimate h^2_GWAS&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">system&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;plink --bfile D:/GWAS_CLASS/20101123/process/merge --extract D:/GWAS_CLASS/20101123/process/significant_snp.txt --recodeA --out D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
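&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data.table&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># provides fread()&lt;/span>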
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp.raw&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">pheno&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">fread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;D:/GWAS_CLASS/20101123/process/pheno.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">df[&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="m">1&lt;/span>&lt;span class="o">:&lt;/span>&lt;span class="m">6&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="n">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">pheno&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">Height&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lm_sigsnp&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="nf">lm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">~&lt;/span> &lt;span class="n">.,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nf">summary&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lm_sigsnp&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>&lt;strong>Result&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Summary result of REML analysis:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Source Variance SE&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># V(G) 10.927486 69.918584&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># V(e) 0.000011 69.472995&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Vp 10.927497 5.158109&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># V(G)/Vp 0.999999 6.357631&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Sampling variance/covariance of the estimates of variance components:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># 4.888608e+03 -4.844250e+03&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># -4.844250e+03 4.826497e+03&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://www.narlabs.org.tw/files/file_pool/1/0J357576112842629654/%E5%9C%96%E4%BA%8C_20191220.jpg" target="_blank" rel="noopener"
>Labrador&lt;/a> &lt;br>&lt;/li>
&lt;li>Surface from &lt;a class="link" href="https://www.bing.com/?FORM=GENBHP" target="_blank" rel="noopener"
>bing&lt;/a> &lt;br>&lt;/li>
&lt;/ul></description></item><item><title>GCTA</title><link>https://jiangcc.netlify.app/p/gcta/</link><pubDate>Fri, 07 Mar 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/gcta/</guid><description>&lt;ul>
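&lt;li>GCTA (see its website) computes heritability: take the SNPs that passed QC, compute the GRM, estimate the variance components with REML, and derive the heritability. Traditionally, GWAS is used to find the loci significantly associated with a trait, and heritability is computed from those loci. However, some loci are associated with the trait but have very small effect sizes and are hard to detect with GWAS. The advantage of GCTA is that it estimates the heritability of all SNPs combined, skipping the step of finding significantly associated SNPs with GWAS. If, on the same data, the heritability from GCTA is high but the heritability from the GWAS-identified loci is low, collecting more samples is likely to reveal more information.&lt;/li>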
&lt;/ul>
&lt;h2 id="gcta">GCTA&lt;/h2>
&lt;ul>
&lt;li>GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait &lt;font color="blue">rather than&lt;/font> testing the association of any particular SNP to the trait.&lt;/li>
&lt;li>GCTA’s five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation.&lt;/li>
&lt;/ul>
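&lt;p>GCTA estimates the variance explained by a group of SNPs rather than by one or two specific SNPs. The SNPs can be grouped by chromosome, or split into genic SNPs in protein-coding regions and intergenic SNPs that do not directly alter proteins but can affect gene expression.
It also estimates how similar different samples are at the SNP level, and the variance that the SNPs can explain.&lt;/p>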
&lt;h2 id="precess">Process&lt;/h2>
&lt;ol>
&lt;li>Run quality control (QC) on the SNPs&lt;/li>
&lt;li>Compute the GRM (relatedness matrix) in GCTA&lt;/li>
&lt;li>Choose a mixed linear model&lt;/li>
&lt;li>Estimate the variance components with REML in GCTA&lt;/li>
&lt;/ol>
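&lt;p>The procedure is to take the QC-ed SNPs, compute the GRM (relatedness matrix), choose a mixed model, and use an algorithm to obtain variance estimates that are as close to unbiased as possible. Once the variance components are known, the heritability follows.
Because GCTA does not provide a full QC pipeline of its own, we start from SNPs that have already passed QC filters such as MAF and Hardy-Weinberg equilibrium.&lt;/p>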
&lt;h2 id="grmgenetic-relationship-matrix">GRM (Genetic Relationship Matrix)&lt;/h2>
&lt;p>For $n$ individuals and $m$ SNPs&lt;/p>
&lt;ul>
&lt;li>$\mathbf{X}$: $n \times m$ genotype matrix of minor-allele counts&lt;/li>
&lt;li>$p_j$: MAF of the $j$-th SNP&lt;/li>
&lt;/ul>
&lt;p>$\mathbf{X}^{\text{norm}}$ is standardized genotype matrix by&lt;/p>
$$
X_{ij}^{\text{norm}} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}}
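$$&lt;p>Define the GRM $\mathbf{A}=\frac{1}{m}\mathbf{X}^{\text{norm}} (\mathbf{X}^{\text{norm}})'$, whose entry for individuals $j$ and $k$ is&lt;/p>
$$
A_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(X_{ji} - 2p_i)(X_{ki} - 2p_i)}{2p_i(1 - p_i)}
$$&lt;p>$A_{jk}$ is the inner product of the standardized $m$-dimensional SNP vectors of the $j$-th and $k$-th samples, divided by $m$.
The GRM therefore measures how genetically similar the samples are. Next, we filter out samples that are too similar.&lt;/p>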
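&lt;p>The definition above can be sketched in plain Python. The function names &lt;code>standardize&lt;/code> and &lt;code>grm&lt;/code> are hypothetical, and the genotype matrix and MAFs are toy values assumed for illustration; GCTA itself computes the GRM with &lt;code>--make-grm&lt;/code>, and this sketch is not its implementation.&lt;/p>

```python
from math import sqrt

def standardize(genotypes, mafs):
    """Scale each minor-allele count x of SNP j to (x - 2p_j) / sqrt(2p_j(1 - p_j))."""
    return [
        [(x - 2 * p) / sqrt(2 * p * (1 - p)) for x, p in zip(row, mafs)]
        for row in genotypes
    ]

def grm(genotypes, mafs):
    """Return A = X_norm X_norm' / m for n individuals (rows) and m SNPs (columns)."""
    z = standardize(genotypes, mafs)
    m = len(mafs)
    return [[sum(a * b for a, b in zip(zj, zk)) / m for zk in z] for zj in z]

# Toy data (assumed): 2 individuals, 3 SNPs, with the MAFs taken as given.
X = [[0, 1, 1],
     [1, 0, 1]]
p = [0.25, 0.25, 0.5]
A = grm(X, p)
for row in A:
    print([round(v, 3) for v in row])
# [0.444, -0.444]
# [-0.444, 0.444]
```

&lt;p>The diagonal entries describe each individual against itself, and the negative off-diagonal entry says the two individuals are less similar than an average pair would be under the assumed allele frequencies.&lt;/p>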
&lt;h2 id="grm">GRM&lt;/h2>
&lt;p>Suppose the minor allele is $a$ and the MAF is $p$. Under Hardy-Weinberg equilibrium,&lt;/p>
$$
\begin{align*}
\left\lbrace \begin{array}{lll}
P(SNP = AA) = (1-p)^2 &amp; \text{with} &amp; 0\text{ minor alleles} \\
P(SNP = Aa) = 2p(1-p) &amp; \text{with} &amp; 1\text{ minor allele} \\
P(SNP = aa) = p^2 &amp; \text{with} &amp; 2\text{ minor alleles} \\
\end{array} \right.
\end{align*}
$$$$
\begin{align*}
E(SNP) &amp;= 0 \times (1-p)^2 +1 \times 2p(1-p) + 2 \times p^2 \\
&amp; = 2p
\end{align*}
$$$$
\begin{align*}
Var(SNP) &amp;= 0^2 \times (1-p)^2 +1^2 \times 2p(1-p) + 2^2 \times p^2 -E^2(SNP) \\
&amp; = 2p(1-p)
\end{align*}
$$&lt;h3 id="example">Example&lt;/h3>
$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Individual} &amp; \textbf{SNP1} &amp; \textbf{SNP2} &amp; \textbf{SNP3} \\ \hline
A &amp; 0 &amp; 1 &amp; 1 \\ \hline
B &amp; 1 &amp; 0 &amp; 1 \\ \hline
\text{MAF} (p_j) &amp; 0.25 &amp; 0.25 &amp; 0.5 \\ \hline
\end{array}
$$&lt;p>$X^{\text{norm}}_{ij} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1 - p_j)}}$&lt;/p>
&lt;p>Standardize $X$&lt;/p>
$$
\begin{array}{|c|c|c|c|}
\hline
individual &amp; SNP1 &amp; SNP2 &amp; SNP3 \\ \hline
A &amp; \displaystyle \frac{0 - 0.5}{0.612} &amp; \displaystyle \frac{1 - 0.5}{0.612} &amp; \displaystyle \frac{1 - 1}{0.707} \\ \hline
B &amp; \displaystyle \frac{1 - 0.5}{0.612} &amp; \displaystyle \frac{0 - 0.5}{0.612} &amp; \displaystyle \frac{1 - 1}{0.707} \\ \hline
\end{array}
$$&lt;h3 id="example-1">Example&lt;/h3>
&lt;p>The standardized genotype matrix is:&lt;/p>
$$
X^{\text{norm}} =
\begin{bmatrix}
-0.816 &amp; 0.816 &amp; 0 \\
0.816 &amp; -0.816 &amp; 0
\end{bmatrix}
$$&lt;p>The Genetic Relationship Matrix is $\frac{1}{3} X^{\text{norm}} (X^{\text{norm}})^T$&lt;/p>
$$
\begin{bmatrix}
0.444 &amp; -0.444 \\
-0.444 &amp; 0.444
\end{bmatrix}
$$&lt;h3 id="exclude-close-relatives">Exclude Close Relatives&lt;/h3>
&lt;ul>
&lt;li>If close relatives are included, the estimate could be a &lt;font color="blue">biased&lt;/font> estimate of the total genetic variance&lt;/li>
&lt;/ul>
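&lt;p>Individuals with very similar genotypes, such as siblings, typically grew up in the same environment, which may affect gene expression. If such data are included, the variance components are estimated poorly.&lt;/p>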
&lt;h3 id="mixed-linear-model">Mixed Linear Model&lt;/h3>
$$
\begin{align*}
\mathbf{Y_{n \times 1}}= \underbrace{\mathbf{X_{n \times p} \beta_{p \times 1}} }_{\text{fixed term}} + \underbrace{\mathbf{W_{n \times q}u_{q \times 1}} }_{\text{random term}} +\mathbf{\varepsilon_{n \times 1}}
\end{align*}
$$&lt;ul>
&lt;li>$\mathbf{u} \sim N(0,\mathbf{I} \sigma_u^2)$, $\mathbf{\varepsilon} \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$&lt;/li>
&lt;li>$\mathbf{W}$ is the standardized genotype matrix with $w_{ij} = \frac{x_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}}$, where $x_{ij}$ is the genotype of the $i$-th individual at the $j$-th SNP and $p_j$ is the MAF of the $j$-th SNP&lt;/li>
&lt;li>$Var(\mathbf{Y}) = \mathbf{W W'}\sigma_u^2 +\mathbf{I} \sigma_\varepsilon^2$&lt;/li>
&lt;/ul>
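&lt;p>$\mathbf{X}$ contains covariates such as sex, age, and the top 20 principal components (PCs); $\mathbf{u}$ is the vector of SNP effects.
$\mathbf{Y}$ is the phenotype, that is, a trait such as height or BMI.&lt;/p>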
&lt;h3 id="model-1">Model 1&lt;/h3>
&lt;p>To estimate the variance explained by all autosomal SNPs, we specify the model as&lt;/p>
$$
\begin{align*}
\mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{\varepsilon}
\end{align*}
$$&lt;p>The Mixed Linear Model is equivalent to this model with $\mathbf{A_g} = \frac{\mathbf{W W'}}{m},\ \sigma_\mathbf{g}^2 = m\sigma_\mathbf{u}^2$&lt;/p>
&lt;ul>
&lt;li>$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM&lt;/li>
&lt;li>$Var(\mathbf{Y}) =\mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{I} \sigma_\varepsilon^2$&lt;/li>
&lt;/ul>
&lt;p>This model treats all SNPs together as a single genetic effect.&lt;/p>
&lt;h3 id="model-2">Model 2&lt;/h3>
&lt;p>To partition genetic variance onto each of the 22 autosomes, we specify the model as&lt;/p>
$$
\begin{align*}
\mathbf{Y} = \mathbf{X\beta} + \sum_{i=1}^{22} \mathbf{g_i} + \mathbf{\varepsilon}
\end{align*}
$$&lt;ul>
&lt;li>$\mathbf{g_i} \sim N(0,\mathbf{A_i} \sigma_i^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_i}$ is the GRM built from the SNPs on the $i$-th chromosome&lt;/li>
&lt;li>$Var(\mathbf{Y}) = \sum_{i=1}^{22} \mathbf{A_i} \sigma_i^2 + \mathbf{I} \mathbf{\sigma_\varepsilon^2}$&lt;/li>
&lt;/ul>
&lt;p>Here the SNPs are grouped by chromosome, giving 22 separate genetic effects.&lt;/p>
&lt;h3 id="model-3">Model 3&lt;/h3>
&lt;p>To estimate the variance of genotype-environment interaction effects, we specify the model as&lt;/p>
$$
\begin{align*}
\mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{ge} + \mathbf{\varepsilon}
\end{align*}
$$&lt;ul>
&lt;li>$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\mathbf{ge} \sim N(0,\mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM&lt;/li>
&lt;li>$Var(\mathbf{Y}) = \mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2 + \mathbf{I} \sigma_\varepsilon^2$&lt;/li>
&lt;li>$$
\mathbf{A_{ge}}= \left\lbrace \begin{array}{lll}
\mathbf{A_{g}} &amp; \text{if} &amp; \text{pairs of individuals in the same environment} \\
\mathbf{0} &amp; \text{if} &amp; \text{pairs of individuals in different environment}\end{array} \right.
$$&lt;/li>
&lt;/ul>
&lt;p>To learn how the environment affects gene expression, we can add a SNP-by-environment interaction term, denoted $\mathbf{ge}$ here.&lt;/p>
&lt;h3 id="model-3-1">Model 3&lt;/h3>
&lt;p>$Cov(y_i,y_k)
= \left\lbrace \begin{array}{lll}
{A_{ik}}(\sigma_{\mathbf{ge}}^2+\sigma_{\mathbf{g}}^2) &amp; \text{if} &amp; \text{same environment} \\
{A_{ik}} \sigma_{\mathbf{g}}^2 &amp; \text{if} &amp; \text{different environment}
\end{array} \right.$&lt;/p>
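&lt;p>The covariance structure of Model 3 can be sketched as follows. The function names &lt;code>build_a_ge&lt;/code> and &lt;code>cov_y&lt;/code>, the $3 \times 3$ GRM, the environment labels, and the variance components are all assumed toy values, chosen only to show how $\mathbf{A_{ge}}$ zeroes out pairs from different environments.&lt;/p>

```python
def build_a_ge(a_g, env):
    """Keep A_g[i][k] when individuals i and k share an environment, else 0."""
    n = len(env)
    return [[a_g[i][k] if env[i] == env[k] else 0.0 for k in range(n)]
            for i in range(n)]

def cov_y(a_g, env, var_g, var_ge, var_e):
    """Var(Y) = A_g * var_g + A_ge * var_ge + I * var_e, entry by entry."""
    a_ge = build_a_ge(a_g, env)
    n = len(env)
    return [
        [a_g[i][k] * var_g + a_ge[i][k] * var_ge + (var_e if i == k else 0.0)
         for k in range(n)]
        for i in range(n)
    ]

# Assumed toy inputs: a 3x3 GRM and one environment label per individual.
A_g = [[1.0, 0.2, 0.1],
       [0.2, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
env = ["urban", "urban", "rural"]
V = cov_y(A_g, env, var_g=2.0, var_ge=1.0, var_e=0.5)
print(V[0][1])  # same environment: about 0.2 * (2.0 + 1.0)
print(V[0][2])  # different environment: about 0.1 * 2.0
```

&lt;p>Individuals 0 and 1 share an environment, so their covariance picks up both variance components, while the pair 0 and 2 only shares the additive genetic part.&lt;/p>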
&lt;h2 id="build-model">Build Model&lt;/h2>
&lt;p>For the model&lt;/p>
$$
\begin{align*}
\mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon}
\end{align*}
$$&lt;ul>
&lt;li>$\mathbf{g_{cis}} \sim N(0,\mathbf{A_{cis}} \sigma_{cis}^2)$, $\mathbf{g_{trans}} \sim N(0,\mathbf{A_{trans}} \sigma_{trans}^2)$, $\mathbf{g_{GE}} \sim N(0,\mathbf{A_{GE}} \sigma_{GE}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$&lt;/li>
&lt;li>$\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A_{cis}} \sigma_{cis}^2 + \mathbf{A_{trans}} \sigma_{trans}^2+ \mathbf{A_{GE}} \sigma_{GE}^2+ \mathbf{I} \sigma_\varepsilon^2$&lt;/li>
&lt;li>$\theta = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_{\varepsilon}^2)$&lt;/li>
&lt;/ul>
&lt;p>Log likelihood function&lt;/p>
$$
\begin{align*}
&amp; L_Y(\beta, \theta) \\
= &amp; -\frac{n}{2} \ln(2\pi) -\frac{1}{2} \ln|\mathbf{V}| -\frac{1}{2} \mathbf{(Y - X\beta)^T V^{-1} (Y - X\beta)} \\
\propto &amp; -\frac{1}{2} \left[ \ln |\mathbf{V}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right]
\end{align*}
$$&lt;h2 id="reml">REML&lt;/h2>
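&lt;p>REML (restricted maximum likelihood) maximizes a log-likelihood that does not depend on $\beta$. Compared with ML, the REML variance estimator corrects the bias caused by estimating $\beta$, which is why it is often described as &lt;strong>unbiased&lt;/strong>. Let $\mathbf{M}$ be an $(n-p) \times n$ matrix of full row rank such that $\mathbf{M}\mathbf{X} = 0$&lt;/p>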
$$
\begin{align*}
&amp;\text{Let } \mathbf{W} = \mathbf{M}\mathbf{Y} \\
&amp;E(\mathbf{W}) = \mathbf{M}\mathbf{X}\beta = 0 \\
&amp;\text{Var}(\mathbf{W}) = \mathbf{M}\mathbf{V}\mathbf{M}^T
\end{align*}
$$&lt;p>Replace $\mathbf{X}$ by $\mathbf{M}\mathbf{X}$ and $\mathbf{Y}$ by $\mathbf{M}\mathbf{Y}$. Writing $\hat{\beta}$ for the GLS estimator of $\beta$ and using $\mathbf{M}\mathbf{X} = 0$, the log-likelihood of $\mathbf{W}$ is, up to additive constants that do not depend on $\theta$,&lt;/p>
$$
\begin{align*}
L_\text{REML}(\theta) &amp; \propto -\frac{1}{2} \left[ \ln |\mathbf{V}| + \ln |\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}| + (\mathbf{Y} - \mathbf{X}\hat{\beta})^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\hat{\beta}) \right] \\
&amp;\propto -\frac{1}{2} \left[ \ln | \mathbf{M}\mathbf{V}\mathbf{M}^T | + (\mathbf{MY} - \mathbf{MX}\beta)^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} (\mathbf{MY} - \mathbf{MX}\beta) \right] \\
&amp;\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{Y}^T\mathbf{M}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{M}\mathbf{Y} \right] \\
&amp;\propto -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right]
\end{align*}
$$&lt;p>Therefore&lt;/p>
$$
\begin{align*}
&amp;L_\text{REML}(\theta) = -\frac{n-p}{2}\ln(2 \pi) -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right]
\end{align*}
$$&lt;p>The log-likelihood function of $\mathbf{W}$ does not depend on $\beta$.&lt;/p>
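&lt;p>A minimal numerical sketch of this restricted log-likelihood, assuming NumPy. The function name, the way $\mathbf{M}$ is built (from a complete QR factorization of $\mathbf{X}$, assumed to have full column rank) and the toy covariance are illustrative choices, not part of the derivation above. The last lines check that shifting $\mathbf{Y}$ by $\mathbf{X}b$ leaves the value unchanged.&lt;/p>

```python
import numpy as np

def reml_loglik(Y, X, V):
    """Restricted log-likelihood of W = M Y, where M X = 0."""
    n, p = X.shape
    # Columns p..n-1 of the complete Q factor are orthogonal to the columns of X,
    # so M = Q[:, p:].T has full row rank n - p and satisfies M X = 0.
    Q, _ = np.linalg.qr(X, mode="complete")
    M = Q[:, p:].T
    W = M @ Y
    MVM = M @ V @ M.T
    _, logdet = np.linalg.slogdet(MVM)
    quad = W @ np.linalg.solve(MVM, W)
    return -0.5 * (n - p) * np.log(2 * np.pi) - 0.5 * (logdet + quad)

rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = 0.5 * np.eye(n) + 0.5  # hypothetical covariance: 1.0 on the diagonal, 0.5 elsewhere
Y = rng.normal(size=n)
# Shifting Y by any X b should leave the restricted likelihood unchanged.
shifted = Y + X @ np.array([3.0, -2.0])
print(reml_loglik(Y, X, V), reml_loglik(shifted, X, V))
```

&lt;p>The two printed values agree up to floating-point error, which is the numerical counterpart of the statement that the likelihood of $\mathbf{W}$ carries no information about $\beta$.&lt;/p>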
&lt;h3 id="some-problem">some problem&lt;/h3>
&lt;p>It&amp;rsquo;s very common that $\mathbf{M}\mathbf{V}\mathbf{M}^T \text{ be singular}$&lt;/p>
$$\Rightarrow \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| \to -\infty$$&lt;p>REML avoid it.&lt;/p>
&lt;h3 id="generalized-least-square-gls">Generalized least square (GLS)&lt;/h3>
$$
\begin{align*}
&amp;\mathbf{Y} = \mathbf{X}\beta + \epsilon, \quad \text{Var}(\epsilon) = \mathbf{V}, \quad \mathbf{V} = \mathbf{\Sigma} \mathbf{\Sigma}^T \\
&amp;E(\epsilon) = 0 \\
&amp;\mathbf{\Sigma}^{-1}\mathbf{Y} = \mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{\Sigma}^{-1}\epsilon \\
&amp;\tilde{\mathbf{Y}} = \tilde{\mathbf{X}}\beta + \tilde{\epsilon}, \quad \tilde{\mathbf{Y}} = \mathbf{\Sigma}^{-1}\mathbf{Y}, \ \tilde{\mathbf{X}} = \mathbf{\Sigma}^{-1}\mathbf{X}, \ \tilde{\epsilon} = \mathbf{\Sigma}^{-1}\epsilon \\
&amp;\text{Var}(\tilde{\epsilon}) = \text{Var}(\mathbf{\Sigma}^{-1}\epsilon) \\
&amp;\quad = \mathbf{\Sigma}^{-1} \text{Var}(\epsilon) {\mathbf{\Sigma}^{-1}}^T \\
&amp;\quad = \mathbf{\Sigma}^{-1} \mathbf{\Sigma} \mathbf{\Sigma}^T {\mathbf{\Sigma}^{-1}}^T \\
&amp;\quad = I
\end{align*}
$$$$
\begin{align*}
\hat{\beta} &amp;= (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} (\tilde{\mathbf{X}}^T \tilde{\mathbf{Y}}) \\
&amp;= \left[ \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right) \right]^{-1} \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\
&amp;= (\mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{X})^{-1} \mathbf{X}^T{\mathbf{\Sigma}^{-1}}^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\
&amp;= (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}
\end{align*}
$$&lt;h2 id="em-algorithm">{EM Algorithm}&lt;/h2>
&lt;p>For estimating $(\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$, we use EM algorithm as an initial step to determine the direction of the iteration updates&lt;/p>
$$
\begin{align*}
\sigma^{2(1)}_i = \frac{1}{n} \left[ \sigma^{4(0)}_i \mathbf{Y^T P A_i P Y} + \operatorname{tr} (\sigma^{2(0)}_i \mathbf{I} - \sigma^{4(0)}_i \mathbf{P A_i}) \right]
\end{align*}
$$&lt;p>where&lt;/p>
$$
\begin{align*}
&amp; i = 1, \cdots, 4 \\
&amp; \mathbf{P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}}
\end{align*}
$$&lt;h2 id="average-information">{Average Information&lt;/h2>
&lt;p>Algorithm}
After one EM iteration, GCTA switches to the average information algorithm&lt;/p>
$$
\begin{align*}
\bm{\theta}^{(t+1)} = \bm{\theta}^{(t)} + (\mathbf{AI}^{(t)})^{-1} \frac{\partial L}{\partial \bm{\theta}} \Big|_{\bm{\theta}^{(t)}}
\end{align*}
$$&lt;p>where $\bm{\theta} = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$&lt;/p>
&lt;ul>
&lt;li>The iteration stops when $L^{(t+1)}-L^{(t)}&lt; 10^{-4}$&lt;/li>
&lt;li>During the iteration, if $\sigma_i^2&lt;0$, set $\sigma_i^2= 10^{-6}\sigma_Y^2$, where $\sigma_Y^2$ is the phenotypic variance&lt;/li>
&lt;/ul>
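&lt;p>The update rule can be sketched in NumPy as follows. This is an illustration under stated assumptions, not GCTA&amp;rsquo;s implementation: the function names, the toy relationship matrix and the fixed number of iterations (used here instead of the log-likelihood stopping rule) are all hypothetical. The gradient and the AI matrix follow the formulas given in the REML Method section of this post, with the identity matrix as the last entry of &lt;code>A_list&lt;/code>.&lt;/p>

```python
import numpy as np

def projection_matrix(V, X):
    """P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1."""
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    return Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)

def ai_reml_step(theta, Y, X, A_list, var_y):
    """One average-information update of the variance components."""
    V = sum(t * A for t, A in zip(theta, A_list))
    P = projection_matrix(V, X)
    PY = P @ Y
    APY = [A @ PY for A in A_list]
    # Gradient: -1/2 [tr(P A_i) - Y' P A_i P Y]
    grad = np.array([-0.5 * (np.trace(P @ A) - PY @ a) for A, a in zip(A_list, APY)])
    # AI matrix: 1/2 Y' P A_i P A_j P Y, using the symmetry of P and A_i
    AI = 0.5 * np.array([[a_i @ P @ a_j for a_j in APY] for a_i in APY])
    new_theta = theta + np.linalg.solve(AI, grad)
    # Negative components are reset to 1e-6 times the phenotypic variance.
    return np.where(new_theta >= 0, new_theta, 1e-6 * var_y), grad, AI

rng = np.random.default_rng(1)
n = 40
X = np.ones((n, 1))                  # intercept only
Z = rng.normal(size=(n, 60))
A_list = [Z @ Z.T / 60, np.eye(n)]   # hypothetical relationship matrix, then the residual identity
Y = rng.normal(size=n)
theta = np.array([0.5, 0.5])
for _ in range(20):                  # fixed iteration count, for illustration only
    theta, grad, AI = ai_reml_step(theta, Y, X, A_list, Y.var())
print(theta)
```

&lt;p>Each step needs only first derivatives and quadratic forms in $\mathbf{P}$, which is why the average information matrix is cheaper to evaluate than the exact Hessian.&lt;/p>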
&lt;h2 id="reml-method">{REML Method}&lt;/h2>
$$
\begin{align*}
&amp; \mathbf{AI}=\frac{1}{2}\left[\begin{array}{cccc}
\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} &amp; \cdots &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{P} \mathbf{Y} \\
\vdots &amp; \ddots &amp; \vdots &amp; \vdots \\
\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} &amp; \cdots &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{P} \mathbf{Y} \\
\mathbf{Y}^{\prime} \mathbf{P} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} &amp; \cdots &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} &amp; \mathbf{Y}^{\prime} \mathbf{P} \mathbf{P} \mathbf{P} \mathbf{Y}
\end{array}\right] ;\\
&amp; \frac{\partial L}{\partial \boldsymbol{\theta}}=-\frac{1}{2}\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{P} \mathbf{A}_1\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} \\ \vdots \\ \operatorname{tr}\left(\mathbf{P} \mathbf{A}_r\right)-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} \\ \operatorname{tr}(\mathbf{P})-\mathbf{Y}^{\prime} \mathbf{P} \mathbf{P} \mathbf{Y}\end{array}\right]
\end{align*}
$$&lt;h2 id="reml-restricted-maximum-likelihood">{REML (restricted maximum likelihood)}&lt;/h2>
&lt;ul>
&lt;li>When $Y_1, \cdots , Y_n$ have a constant mean, $A^T X\beta = 0$, where $A^T$ plays the role of $\mathbf{M}$ above. Hence, $L_W(\beta, \sigma_c^2, \sigma_t^2)$ doesn&amp;rsquo;t depend on $\beta$&lt;/li>
&lt;li>Compared to ML, REML is less affected by fixed effects&lt;/li>
&lt;li>Compared to ML, REML has lower bias.&lt;/li>
&lt;/ul>
&lt;h2 id="heritability">{Heritability}&lt;/h2>
&lt;p>Assume model&lt;/p>
$$
\begin{align*}
\mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon}
\end{align*}
$$&lt;p>where $\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A}_{cis} \sigma_{cis}^2 + \mathbf{A}_{trans} \sigma_{trans}^2+ \mathbf{A}_{GE} \sigma_{GE}^2+ \mathbf{I} \sigma_\varepsilon^2 $&lt;/p>
&lt;ul>
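&lt;li>$h^2_{\mathbf{g},cis} = \frac{\sigma_{cis}^2}{\sigma_P^2}$&lt;/li>
&lt;li>$h^2_{\mathbf{g},trans} = \frac{\sigma_{trans}^2}{\sigma_P^2}$&lt;/li>
&lt;li>$h^2_{\mathbf{g},GE} = \frac{\sigma_{GE}^2}{\sigma_P^2}$&lt;/li>
&lt;li>$\sigma_P^2 = \sigma_{cis}^2 + \sigma_{trans}^2 + \sigma_{GE}^2 + \sigma_\varepsilon^2$ is the total phenotypic variance, the scalar that takes the place of the matrix $\mathbf{V}$ in the denominator&lt;/li>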
&lt;/ul>
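&lt;p>As a small illustration, each heritability can be computed as a share of the total variance once the variance components are estimated. The sketch assumes the denominator is the sum of the four components; the function name and the numbers are hypothetical, not estimates from any data set.&lt;/p>

```python
def heritability(var_components):
    """Share of the total variance for each component.

    var_components maps a component name to its estimated variance.
    """
    total = sum(var_components.values())
    return {name: value / total for name, value in var_components.items()}

# Hypothetical REML estimates, for illustration only.
estimates = {"cis": 0.10, "trans": 0.25, "GE": 0.05, "residual": 0.60}
h2 = heritability(estimates)
print(h2["cis"], h2["trans"], h2["GE"])
```

&lt;p>By construction the shares, including the residual one, add up to 1.&lt;/p>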
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://rongxie.wordpress.com/wp-content/uploads/2011/01/statistics-for-spatial-data-revised-version-1993.pdf" target="_blank" rel="noopener"
>REML (p92)&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3014363/" target="_blank" rel="noopener"
>GCTA paper&lt;/a> &lt;br>&lt;/li>
&lt;/ul></description></item><item><title>RNA- Sequencing</title><link>https://jiangcc.netlify.app/p/rna-sequencing/</link><pubDate>Sun, 16 Feb 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/rna-sequencing/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>RNA sequencing (RNA-Seq) is a powerful technology that enables researchers to quantify RNA levels, identify novel transcripts, and analyze alternative splicing events, offering deeper insights into cellular function and disease mechanisms. Key processes such as read alignment, normalization methods like RPKM and TPM, and splicing analysis ensure accurate interpretation of RNA-Seq data. However, despite its advantages, RNA-Seq faces challenges, including biases in library preparation, sequencing errors, and the complexity of data analysis.&lt;/p>
&lt;h2 id="workflow">Workflow&lt;/h2>
&lt;div style="text-align: center;">
&lt;img src="workflow.jpg" style="width:100%;">
&lt;/div>
&lt;h3 id="alternative-splicing">Alternative Splicing&lt;/h3>
&lt;p>A cellular process in which exons from the same gene are joined in different combinations, resulting in distinct but related mRNA transcripts (isoforms).&lt;/p>
&lt;h3 id="read-sequencing-technology">Read Sequencing Technology&lt;/h3>
&lt;p>There are three common sequencing technologies that turn transcripts into reads.&lt;/p>
&lt;ul>
&lt;li>Illumina&lt;/li>
&lt;li>Nanopore&lt;/li>
&lt;li>PacBio&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>High-throughput technology&lt;/strong> refers to massively parallel sequencing, which generates millions to billions of reads in a single experiment.&lt;/p>
&lt;h3 id="read-alignment">Read Alignment&lt;/h3>
&lt;p>There are two types of read alignment: aligning reads to a reference transcriptome and aligning reads to a reference genome.&lt;br>
Aligning reads to reference transcriptome:&lt;/p>
&lt;ul>
&lt;li>Performs well if the reference transcriptome is sufficiently complete&lt;/li>
&lt;li>Faster&lt;/li>
&lt;li>Used for computing gene expression&lt;/li>
&lt;/ul>
&lt;p>Aligning reads to reference genome:&lt;/p>
&lt;ul>
&lt;li>Enables the identification of new isoforms (novel isoforms)&lt;/li>
&lt;li>Handles intron-exon structure, providing a more complete but slower analysis&lt;/li>
&lt;/ul>
&lt;p>When introns are large, the latter method (aligning to the genome) requires significantly more time to process intron-exon structures. On the other hand, transcriptome-based alignment is faster because it does not handle intron-exon structures. However, this trade-off makes it difficult to detect rare disease-related transcripts and reduces accuracy when the reference transcriptome is incomplete.&lt;/p>
&lt;h4 id="rpkm">RPKM&lt;/h4>
&lt;p>RPKM (Reads Per Kilobase per Million mapped reads) is a unit for measuring gene expression.&lt;/p>
$$
\begin{align*}
\text{RPKM}_g = \frac{r_g \times 10^9}{{fl}_g \times R}
\end{align*}
$$&lt;p>where&lt;/p>
$$
\begin{align*}
&amp; r_g =\text{ number of reads mapped to gene } g \\
&amp; {fl}_g = \text{ length of gene } g \text{ in bp} \\
&amp; R = \text{ total number of mapped reads across all genes}
\end{align*}
$$&lt;h4 id="example">Example&lt;/h4>
&lt;div style="text-align: center;">
&lt;img src="RPKM.png" style="width:100%;">
&lt;/div>
&lt;p>Genes are not limited to A, B, and C in this example. Let&amp;rsquo;s focus on gene A in sample 1, which has $12$ reads and a mapped gene length of $600$ bp. The total number of reads across all genes is $6 \times 10^6$.&lt;/p>
$$
\begin{align*}
\text{RPKM}_A
&amp; = \frac{12 \times 10^9}{600 \times 6 \times 10^6} \\
&amp; \approx 3.33
\end{align*}
$$&lt;h4 id="tpm">TPM&lt;/h4>
&lt;p>TPM (Transcripts Per Million) is another unit for measuring gene expression.&lt;/p>
$$
\begin{align*}
\text{TPM}=\frac{r_g \times rl \times 10^6}{{fl}_g \times T},
\end{align*}
$$&lt;p>where&lt;/p>
$$
\begin{align*}
&amp; r_g =\text{ number of reads mapped to gene } g \\
&amp; rl = \text{ read length} \\
&amp; {fl}_g = \text{ length of gene } g \\
&amp; T=\displaystyle \sum_{g' \in G} \frac{r_{g'} \times rl}{{fl}_{g'}}
\end{align*}
$$&lt;p>Read length depends on the sequencing technology rather than the transcript itself, so we use $rl$ instead of $rl_g$.&lt;/p>
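&lt;p>Because $rl$ is the same for every gene, it cancels between the numerator and $T$, so TPM can equivalently be written with read counts per unit of gene length alone. A short check, using only the symbols defined above (the summation index $g'$ is introduced here for clarity):&lt;/p>
$$
\begin{align*}
\text{TPM}_g
&amp; = \frac{r_g \times rl \times 10^6}{{fl}_g \times \sum_{g' \in G} \frac{r_{g'} \times rl}{{fl}_{g'}}} \\
&amp; = \frac{r_g / {fl}_g}{\sum_{g' \in G} r_{g'} / {fl}_{g'}} \times 10^6
\end{align*}
$$
&lt;p>In particular, $\sum_{g \in G} \text{TPM}_g = 10^6$ in every sample, which makes TPM values comparable across samples.&lt;/p>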
&lt;h4 id="example-1">Example&lt;/h4>
&lt;div style="text-align: center;">
&lt;img src="tpm1.png" style="width:100%;">
&lt;/div>
&lt;p>We normalize by gene length because, at the same expression level, a longer gene yields more reads. The read count divided by the gene length in kilobases is defined as &lt;strong>RPK (Reads Per Kilobase)&lt;/strong>.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="tpm2.png" style="width:100%;">
&lt;/div>
&lt;p>Suppose there are only genes A, B, and C, and the total RPK values of the two samples are $650$ and $700$, respectively.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="tpm3.png" style="width:100%;">
&lt;/div>
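&lt;p>The RPKM and TPM definitions above can be sketched in a few lines of Python. The function names and the three-gene counts and lengths are hypothetical; only the gene A numbers (12 reads, 600 bp, $6 \times 10^6$ total reads) come from the RPKM example.&lt;/p>

```python
def rpkm(reads, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return reads * 1e9 / (gene_length_bp * total_reads)

def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million for every gene in one sample."""
    # Reads per kilobase (RPK) for each gene.
    rpk = {g: read_counts[g] / (gene_lengths_bp[g] / 1000) for g in read_counts}
    total_rpk = sum(rpk.values())
    return {g: value / total_rpk * 1e6 for g, value in rpk.items()}

# Gene A of sample 1 from the RPKM example: 12 reads, 600 bp, 6e6 total reads.
print(round(rpkm(12, 600, 6e6), 2))  # 3.33

# Hypothetical counts and lengths for a three-gene sample.
counts = {"A": 12, "B": 25, "C": 8}
lengths = {"A": 600, "B": 1000, "C": 400}
values = tpm(counts, lengths)
print(values)
```

&lt;p>The read length does not appear in &lt;code>tpm&lt;/code> because it is identical for all genes and cancels out of the ratio.&lt;/p>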
&lt;h2 id="additional">Additional&lt;/h2>
&lt;h3 id="sequencing-depth">Sequencing Depth&lt;/h3>
&lt;p>Sequencing depth refers to the average number of times a nucleotide is covered by a read. A higher sequencing depth yields more informative data but comes at a higher cost.&lt;/p>
$$
\begin{align*}
\text{Sequencing Depth} = \frac{\text{read length}\times \text{number of reads}}{\text{reference sequence length}}
\end{align*}
$$&lt;h3 id="example-2">Example&lt;/h3>
&lt;p>Suppose there are $10^8$ reads of length $150$ bp and the reference sequence is $3 \times 10^9$ bp long.&lt;/p>
$$
\begin{align*}
\text{Sequencing Depth}
&amp; = \frac{10^8 \times 150}{3 \times 10^9} \\
&amp; = 5 \text{X}
\end{align*}
$$&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.997383/full" target="_blank" rel="noopener"
>RNA-seq data science: From raw data to effective interpretation&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://link.springer.com/article/10.1007/s12064-012-0162-3" target="_blank" rel="noopener"
>Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://izabelcavassim.wordpress.com/2015/03/09/rpkm-and-fpkm-normalization-units-of-expression/" target="_blank" rel="noopener"
>RPKM example&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="http://toolsbiotech.blog.fc2.com/blog-entry-171.html" target="_blank" rel="noopener"
>Read Count&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://3billion.io/blog/sequencing-depth-vs-coverage" target="_blank" rel="noopener"
>Sequencing Depth vs Coverage&lt;/a> &lt;br>&lt;/li>
&lt;/ul></description></item><item><title>Summary PrediXcan</title><link>https://jiangcc.netlify.app/p/summary-predixcan/</link><pubDate>Thu, 02 Jan 2025 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/summary-predixcan/</guid><description>&lt;h2 id="background">Background&lt;/h2>
&lt;p>GWAS identifies SNPs that affect a trait, but the mechanism is unknown. We use S-PrediXcan to determine whether SNPs affect the trait through gene expression.&lt;/p>
&lt;h3 id="gene-regulation">Gene Regulation&lt;/h3>
&lt;p>Although each cell in the body contains the same DNA sequence, cells do not all express the same set of genes.
Each cell type transcribes a different subset of the genes encoded in its DNA into mRNA and translates them into protein. The process
of expressing genes to produce mRNA and protein is called &lt;strong>gene expression&lt;/strong>, and the mechanism that controls which
genes are expressed is called &lt;strong>gene regulation&lt;/strong>. If a human chromosome were stretched out linearly, it would be over $4$ cm long,
and if every gene were expressed, the cell would have to be enormous.&lt;/p>
&lt;p>&lt;strong>Alternative RNA splicing&lt;/strong> is a common mechanism of gene regulation in eukaryotes. Up to $70\%$ of human genes
are expressed as multiple proteins through it. The pre-mRNA (primary transcript) is made up of introns and exons, and different combinations of introns
or exons are removed from it. The resulting spliced mRNAs are translated into different proteins.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="alternative_splicing.png" style="width:70%;">
&lt;/div>
&lt;h3 id="colocalization">Colocalization&lt;/h3>
&lt;p>&lt;strong>Colocalization&lt;/strong> means that the GWAS and eQTL signals overlap at the same locus. It helps determine whether a GWAS SNP affects
gene expression. There are three possible explanations for overlapping signals&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Linkage:&lt;br>
Two independent causal variants are closely located in the genome, leading to overlapping signals.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Causality:&lt;br>
A SNP directly affects the trait by changing gene expression, representing a direct causal relationship.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Pleiotropy:&lt;br>
A single SNP independently affects multiple traits. The association between these traits is caused by the same SNP,
but the effects occur through different biological pathways.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>We use Mendelian randomization to check whether the relationship is causality or pleiotropy.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="colocalizing_signal.png" style="width:90%;">
&lt;/div>
&lt;h3 id="heterogeneity">Heterogeneity&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>allelic heterogeneity:&lt;br>
A similar phenotype is produced by different alleles within the same gene&lt;/p>
&lt;/li>
&lt;li>
&lt;p>locus heterogeneity:&lt;br>
A similar phenotype is produced by mutations at different loci.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div style="text-align: center;">
&lt;img src="chrom.png" style="width:70%;">
&lt;/div>
&lt;h3 id="bayesian-sparse-linear-mixed-models-bslmm">Bayesian Sparse Linear Mixed Models (BSLMM)&lt;/h3>
&lt;p>For $n$ samples and $p$ SNPs&lt;/p>
$$
\begin{align*}
Y_i = \sum_{j=1}^p X_{ij} \beta_j + u_i + \epsilon_i
\end{align*}
$$&lt;ul>
&lt;li>$Y_i$: phenotype of i-th sample&lt;/li>
&lt;li>$X_{ij}$: genotype of i-th sample at j-th SNP&lt;/li>
&lt;li>$\beta_j$: effect size&lt;/li>
&lt;li>$u_i$: random effect for i-th sample&lt;/li>
&lt;li>$\epsilon_i$: error term&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Sparse Component&lt;/strong>&lt;br>
Most SNP effect sizes are zero, so the model keeps only the important SNPs.&lt;/p>
&lt;p>&lt;strong>Polygenic Component&lt;/strong>&lt;br>
Many SNP effect sizes are very small, so the SNPs contribute to the trait jointly.&lt;/p>
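&lt;p>To make the two components concrete, the following sketch simulates one phenotype from the model above. It is only a data-generating illustration, not a BSLMM fitting procedure; the function name, the parameter names and all numeric settings are hypothetical. The random effect $u_i$ is generated as the sum of many tiny SNP effects, which is one common way to represent a polygenic term.&lt;/p>

```python
import numpy as np

def simulate_bslmm(X, n_sparse, sparse_sd, polygenic_sd, error_sd, rng):
    """Draw one phenotype vector with sparse and polygenic components."""
    n, p = X.shape
    # Sparse component: only n_sparse SNPs get a nonzero effect.
    beta = np.zeros(p)
    chosen = rng.choice(p, size=n_sparse, replace=False)
    beta[chosen] = rng.normal(scale=sparse_sd, size=n_sparse)
    # Polygenic component: every SNP adds a tiny effect, collected in u.
    small = rng.normal(scale=polygenic_sd / np.sqrt(p), size=p)
    u = X @ small
    eps = rng.normal(scale=error_sd, size=n)
    return X @ beta + u + eps, beta

rng = np.random.default_rng(2)
# Hypothetical genotypes: 100 samples, 500 SNPs, allele frequency 0.3, then centered.
G = rng.binomial(2, 0.3, size=(100, 500)).astype(float)
G = G - G.mean(axis=0)
y, beta = simulate_bslmm(G, n_sparse=5, sparse_sd=0.5, polygenic_sd=0.3, error_sd=1.0, rng=rng)
print(np.count_nonzero(beta), y.shape)
```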
&lt;h2 id="method">Method&lt;/h2>
&lt;h3 id="gwas-and-predixcan">GWAS and PrediXcan&lt;/h3>
&lt;p>We assume that the phenotype is a linear function of $X_l$ and of $T_g$, respectively.&lt;/p>
$$
\begin{align}
&amp; Y = \alpha_1 + X_l \beta_l + \eta \\
&amp; Y = \alpha_2 + T_g \gamma_g + \epsilon
\end{align}
$$&lt;ul>
&lt;li>$\alpha_1 , \ \alpha_2$ are intercepts&lt;/li>
&lt;li>$\eta, \ \epsilon$ are error terms&lt;/li>
&lt;li>$T_g = \sum_{l \in \text{Model}_g}^{} w_{lg} X_l$, predicted gene expression (transcriptome)&lt;/li>
&lt;li>$\text{Var}(T_g) = \hat{\sigma}_g^2$&lt;/li>
&lt;li>$X_l$ is $l$-th SNP allelic dosage (genotype)&lt;/li>
&lt;li>$\text{Var}(X_l) = \hat{\sigma}_l^2$&lt;/li>
&lt;li>$Y$ is level of the trait (phenotype)&lt;/li>
&lt;li>$\text{Var}(Y) = \hat{\sigma}_Y^2$&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>PrediXcan&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>$w_{lg}$ from &lt;a class="link" href="https://predictdb.org/" target="_blank" rel="noopener"
>predictDB&lt;/a>&lt;/li>
&lt;li>compute the predicted transcriptome $\hat{T}$&lt;/li>
&lt;li>regress $Y$ on $\hat{T}$ to get $\hat{\gamma_g}$&lt;/li>
&lt;/ol>
&lt;p>PrediXcan is a computational algorithm developed to exploit GTEx data, including eQTL
identification and the relationship of eQTLs to complex traits. PrediXcan evaluates the aggregate
effects of cis-regulatory variants (within 1 Mb upstream or downstream of the genes of interest)
on gene expression via an elastic net regression method. Consequently, PrediXcan may identify
loci with modest to weak effect sizes that do not achieve significance in variant-based association studies.&lt;/p>
&lt;h3 id="s-predixcan">S-PrediXcan&lt;/h3>
&lt;ul>
&lt;li>$w_{lg}$ from &lt;a class="link" href="https://predictdb.org/" target="_blank" rel="noopener"
>predictDB&lt;/a>&lt;/li>
&lt;li>$\hat{\sigma}_g$ from training set or reference set&lt;/li>
&lt;li>$\hat{\beta}_l, \ \text{se}(\hat{\beta}_l)$ from GWAS&lt;/li>
&lt;/ul>
&lt;p>We get&lt;/p>
$$
\begin{align*}
Z_g
= &amp; \sum_{l \in \text{Model}_g}^{} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta}_l)} {\sqrt{\dfrac{1-\mathit{R}_l^2}{1-\mathit{R}_g^2}}} \\
\approx &amp; \sum_{l \in \text{Model}_g}^{} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta}_l)}
\end{align*}
$$&lt;ul>
&lt;li>Make sure that the GWAS and the prediction model are based on the same population.&lt;/li>
&lt;li>This gives $\hat{\gamma_g}$ and its z-score&lt;/li>
&lt;/ul>
&lt;div style="text-align: center;">
&lt;img src="fig1.png" style="width:90%;">
&lt;/div>
&lt;h3 id="pve-by-snp-and-transcriptome">PVE by SNP and Transcriptome&lt;/h3>
&lt;p>The proportions of variance explained (PVE) by the covariates $X_l$ and $T_g$ are&lt;/p>
$$
\begin{align*}
&amp; R_g^2 = \frac{ \text{var}(T_g \hat{\gamma_g} ) }{ \text{var}(Y) } = \hat{\gamma}_g^2 \ \frac{\hat{\sigma}_g^2}{\hat{\sigma}_Y^2} \\
&amp; R_l^2 = \frac{ \text{var}(X_l \hat{\beta}_l ) }{ \text{var}(Y) } = \hat{\beta}_l^2 \ \frac{\hat{\sigma}_l^2}{\hat{\sigma}_Y^2}
\end{align*}
$$&lt;h3 id="predicted-effect-size">Predicted Effect Size&lt;/h3>
&lt;p>We represent $\hat{\sigma}_g^2$ in matrix form&lt;/p>
$$
\begin{align*}
\hat{\sigma}_g^2
&amp;= \text{Var}\left(\sum_{l \in \text{Model}_g}^{} w_{lg} X_l\right) \\
&amp;= \text{Var}(\mathbf{W}_g'\mathbf{X}_g) \\
&amp;= \mathbf{W}_g' \cdot \text{Var}(\mathbf{X}_g) \cdot \mathbf{W}_g \\
&amp;= \mathbf{W}_g' \cdot \mathbf{\Gamma}_g \cdot \mathbf{W}_g
\end{align*}
$$&lt;ul>
&lt;li>$\mathbf{X}_g$ is the $n \times p$ matrix of SNP data in the model of $g$&lt;/li>
&lt;li>$\bar{\mathbf{X}}_g$ is the $n \times p$ matrix whose column $l$ holds the column mean of $X_l$&lt;/li>
&lt;li>$\mathbf{W}_g$ is the vector of $w_{lg}$ for SNPs in the model of $g$&lt;/li>
&lt;li>$\mathbf{\Gamma}_g = \frac{1}{n}(\mathbf{X}_g - \bar{\mathbf{X}}_g)'(\mathbf{X}_g - \bar{\mathbf{X}}_g)$, the sample covariance matrix of $\mathbf{X}_g$&lt;/li>
&lt;/ul>
&lt;p>Under the linear model, the estimated effect size (coefficient) of covariate $X_l$ is&lt;/p>
$$
\begin{align*}
&amp; \hat{\beta}_l = \frac{\text{Cov}(X_l, Y)}{\text{Var}(X_l)} = \frac{\text{Cov}(X_l, Y)}{\hat{\sigma}_l^2} \\
\Rightarrow \ &amp; \text{Cov}(X_l, Y) = \hat{\beta}_l \hat{\sigma}_l^2
\end{align*}
$$&lt;p>And coefficient of covariate $T_g$ is&lt;/p>
$$
\begin{align*}
\hat{\gamma_g} &amp;= \dfrac{\text{Cov}(T_g, Y)}{\hat{\sigma}_g^2} \\
&amp;= \dfrac{\text{Cov}(\sum_{l \in \text{Model}_g} w_{lg} X_l, Y)}{\hat{\sigma}_g^2} \\
&amp;= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \text{Cov}(X_l, Y)}{\hat{\sigma}_g^2} \\
&amp;= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \hat{\beta}_l\hat{\sigma}_l^2}{\hat{\sigma}_g^2}
\end{align*}
$$&lt;p>From the linear model&lt;/p>
$$
\begin{align*}
&amp; Y = \alpha_1 + X_l \beta_l + \eta \\
\Rightarrow \ &amp; \hat{\sigma}_Y^2 = \hat{\sigma}_\eta^2 + \hat{\sigma}_l^2 \hat{\beta}_l^2
\end{align*}
$$&lt;p>We rewrite the variance&lt;/p>
$$
\begin{align*}
\text{var}(\hat{\beta_l}) &amp;= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(Y_i-\bar{Y})}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\
&amp;= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})Y_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\
&amp;= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(\alpha_1 + X_{li} \beta_l + \eta_i)}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\
&amp;= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l}) \eta_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\
&amp;= \dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})^ 2 \cdot \sigma_{\eta}^2}{(\sum_{i=1}^n (X_{li}-\bar{X_l})^2)^2} \\
&amp;= \dfrac{\sigma_{\eta}^2}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2} \\
&amp;= \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_l^2 \hat{\beta}_l^2}{n\hat{\sigma}_l^2} \\
&amp;= \frac{\hat{\sigma}_Y^2(1 - R_l^2)}{n\hat{\sigma}_l^2}
\end{align*}
$$&lt;p>So&lt;/p>
$$
\begin{align}
\frac{\hat{\sigma}_Y^2}{n} = \dfrac{se^2(\hat{\beta_l}) \cdot\hat{\sigma}_l^2}{1 - R_l^2}
\end{align}
$$&lt;p>Similarly,&lt;/p>
$$
\begin{align*}
\text{var}(\hat{\gamma_g}) = \frac{\hat{\sigma}_Y^2}{n} \cdot \frac{(1 - R_g^2)}{\hat{\sigma}_g^2}
\end{align*}
$$&lt;p>By $(3)$,&lt;/p>
$$
\begin{align*}
se(\hat{\gamma_g})
&amp; = \sqrt{\text{var}(\hat{\gamma_g})} \\
&amp; = se(\hat{\beta_l}) \cdot \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \cdot \sqrt{ \frac{(1 - R_g^2)}{(1 - R_l^2)}}
\end{align*}
$$&lt;p>We infer the PrediXcan results ($\hat{\gamma_g},\ \text{se}(\hat{\gamma_g})$) from the GWAS results ($\hat{\beta}_l,\ \text{se}(\hat{\beta}_l)$), SNP information ($\hat{\sigma}_l^2,\ \mathbf{\Gamma}_g$) and PredictDB weights ($w_{lg}$).&lt;/p>
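&lt;p>The derivation can be checked numerically. The sketch below is not the S-PrediXcan software: the function names, the simulated genotypes and the weights are hypothetical. It computes $\hat{\gamma_g}$ and the approximate $Z_g$ from summary statistics and compares them with a direct regression of $Y$ on the predicted expression. Note that &lt;code>np.cov&lt;/code> divides by $n-1$ rather than $n$, which cancels in every ratio used here.&lt;/p>

```python
import numpy as np

def s_predixcan(weights, beta, se, gamma_cov):
    """Return (gamma_hat, approximate Z_g) for one gene from summary statistics."""
    sigma_l2 = np.diag(gamma_cov)                # SNP variances
    sigma_g2 = weights @ gamma_cov @ weights     # variance of predicted expression
    gamma_hat = np.sum(weights * beta * sigma_l2) / sigma_g2
    z = np.sum(weights * np.sqrt(sigma_l2 / sigma_g2) * beta / se)
    return gamma_hat, z

def ols(x, y):
    """Slope and its standard error from a simple linear regression."""
    xc = x - x.mean()
    slope = xc @ y / (xc @ xc)
    resid = (y - y.mean()) - slope * xc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    return slope, se

rng = np.random.default_rng(3)
n = 2000
X = rng.binomial(2, 0.3, size=(n, 5)).astype(float)   # hypothetical genotypes
w = np.array([0.5, -0.3, 0.2, 0.4, -0.1])             # hypothetical weights
Y = 0.2 * (X @ w) + rng.normal(size=n)

# Per-SNP GWAS summary statistics (slope and standard error).
gwas = np.array([ols(X[:, l], Y) for l in range(5)])
gamma_hat, z = s_predixcan(w, gwas[:, 0], gwas[:, 1], np.cov(X, rowvar=False))

# Direct PrediXcan-style regression on individual-level data.
direct_gamma, direct_se = ols(X @ w, Y)
print(gamma_hat, direct_gamma)
print(z, direct_gamma / direct_se)
```

&lt;p>The two estimates of $\hat{\gamma_g}$ agree up to rounding when the covariance comes from the same individuals, while the z-scores differ slightly because the factor $\sqrt{(1-R_l^2)/(1-R_g^2)}$ is dropped.&lt;/p>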
&lt;h2 id="results">Results&lt;/h2>
&lt;h3 id="compare-predixcan-and-s-predixcan">Compare PrediXcan and S-PrediXcan&lt;/h3>
&lt;div style="text-align: center;">
&lt;img src="fig2.png" style="width:90%;">
&lt;/div>
&lt;p>$w_{lg}$ from &lt;a class="link" href="https://predictdb.org/" target="_blank" rel="noopener"
>predictDB&lt;/a> that based on EUR Depression Genes and Network’s (DGN) Whole Blood data,
GTEx, Framingham, etc. Training set will usually be different from the study sets. When individual level data are
not available from the training set we use population reference sets such as 1000 Genomes data.&lt;/p>
&lt;p>&lt;strong>2a&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>$Y$ is a phenotype simulated under $H_0:$ the phenotype is independent of the transcriptome (predicted gene expression). So model (2) has no
covariate $\hat{T}$, only some environmental covariates.&lt;/li>
&lt;li>The study sets (GWAS sets) and reference sets (LD calculation sets) both consisted of African (661), East Asian (504), and
European (503) individuals from the 1000 Genomes Project&lt;/li>
&lt;/ul>
&lt;p>Within the same population, S-PrediXcan and PrediXcan results are highly correlated. Even across different populations, they remain highly correlated.
Furthermore, when AFR is the study or reference set, the $r^2$ with EUR is higher than with EAS.&lt;/p>
&lt;p>&lt;strong>2b&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>$Y$ is intrinsic growth phenotype&lt;/li>
&lt;li>study sets were a subset of 140 individuals from each of the African, Asian, and European groups from 1000 Genomes Project, and
reference sets consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project&lt;/li>
&lt;/ul>
&lt;p>The study sets are smaller here, which may increase $se(\hat{\beta})$ and hence decrease the z-scores. PrediXcan and
S-PrediXcan results differ slightly, so the $r^2$ values on the diagonal are smaller than in Figure 2a.&lt;/p>
&lt;p>&lt;strong>2c&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>$Y$ is the disease status from the bipolar disorder and type 1 diabetes studies&lt;/li>
&lt;li>The study sets consisted of British individuals; the reference set was the European population subset of the 1000 Genomes Project&lt;/li>
&lt;/ul>
&lt;h3 id="colocalization-status-of-s-predixcan">Colocalization Status of S-PrediXcan&lt;/h3>
&lt;div style="text-align: center;">
&lt;img src="fig3.png" style="width:90%;">
&lt;/div>
&lt;p>Five conditions&lt;/p>
&lt;ol>
&lt;li>$H_0:$ the SNP signal is associated with neither eQTL nor GWAS.&lt;/li>
&lt;li>$H_1:$ the SNP signal is associated with eQTL but not GWAS.&lt;/li>
&lt;li>$H_2:$ the SNP signal is associated with GWAS but not eQTL.&lt;/li>
&lt;li>$H_3:$ the SNP signal is associated with both eQTL and GWAS, but through independent signals (distinct causal variants, i.e. linkage).&lt;/li>
&lt;li>$H_4:$ the SNP signal is associated with both eQTL and GWAS through a shared signal (colocalized).&lt;/li>
&lt;/ol>
&lt;p>If we keep only Bonferroni-significant S-PrediXcan results, the associations tend to cluster into three distinct regions.&lt;/p>
&lt;h3 id="compare-s-twas-and-s-predixcan">Compare S-TWAS and S-PrediXcan&lt;/h3>
&lt;div style="text-align: center;">
&lt;img src="fig4.png" style="width:90%;">
&lt;/div>
&lt;div style="text-align: center;">
&lt;img src="supplement12.png" style="width:50%;">
&lt;/div>
&lt;ul>
&lt;li>The difference between S-TWAS and S-PrediXcan lies in the prediction models: TWAS uses BSLMM whereas PrediXcan uses elastic net&lt;/li>
&lt;li>Regarding the COLOC-estimated proportion of non-colocalized genes, the &lt;strong>polygenic component&lt;/strong> of BSLMM accounts for the effects of many
SNPs combined, which increases the chance of a non-colocalized result.&lt;/li>
&lt;li>Mancuso et al. filtered out genes with low GCTA heritability, so TWAS has fewer significant genes than PrediXcan. But
the significance levels of TWAS and PrediXcan are similar.&lt;/li>
&lt;/ul>
&lt;h3 id="predicted-performance-by-trait">Predicted Performance by Trait&lt;/h3>
&lt;div style="display: flex; justify-content: space-between;">
&lt;img src="predicted performance R2 by phenotype.png" style="width:48%;">
&lt;img src="predicted performance p-value by phenotype .png" style="width:48%;">
&lt;/div>
&lt;p>Prediction performance is better when&lt;/p>
&lt;ol>
&lt;li>the prediction performance $R^2$ increases&lt;/li>
&lt;li>the prediction performance $p$-value decreases&lt;/li>
&lt;/ol>
&lt;p>The Z-score increases when prediction performance is better.
This means that S-PrediXcan associations tend to be more significant when the prediction is more reliable.&lt;/p>
&lt;h3 id="hypotesis">Hypotesis&lt;/h3>
&lt;div style="text-align: center;">
&lt;img src="fig7.png" style="width:100%;">
&lt;/div>
&lt;h3 id="example">Example&lt;/h3>
&lt;div style="text-align: center;">
&lt;img src="fig8a.png" style="width:100%;">
&lt;/div>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a class="link" href="https://www.nature.com/articles/s41467-018-03621-1" target="_blank" rel="noopener"
>Paper&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-03621-1/MediaObjects/41467_2018_3621_MOESM3_ESM.pdf" target="_blank" rel="noopener"
>Supplementary Data 1~4 Information&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://www.biologyalive.com/life/classes/genetics/documents/Unit%201/genetics/defin.htm" target="_blank" rel="noopener"
>Definitions in Genetics&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://www.ukbiobank.ac.uk/enable-your-research/register" target="_blank" rel="noopener"
>UK Biobank (Phenotype Data)&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://rwu.pressbooks.pub/bio103/chapter/regulation-of-gene-expression/" target="_blank" rel="noopener"
>Regulation of Gene Expression&lt;/a> &lt;br>&lt;/li>
&lt;li>&lt;a class="link" href="https://www.google.com.tw/books/edition/Genomic_Colocalization_and_Enrichment_An/SaYhEAAAQBAJ?hl=zh-TW&amp;amp;gbpv=1&amp;amp;dq=colocalized&amp;#43;snp&amp;amp;pg=PA98&amp;amp;printsec=frontcover" target="_blank" rel="noopener"
>colocalizing signal&lt;/a> &lt;br>&lt;/li>
&lt;/ul>
&lt;h2 id="data">data&lt;/h2>
&lt;p>&lt;a class="link" href="https://cg.bsc.es/gera_summary_stats/" target="_blank" rel="noopener"
>GERA data&lt;/a>&lt;br>
&lt;a class="link" href="https://predictdb.org/post/2021/07/21/gtex-v8-models-on-eqtl-and-sqtl/" target="_blank" rel="noopener"
>GTEx&lt;/a>&lt;br>
&lt;a class="link" href="https://www.chg.ox.ac.uk/~wrayner/tools/" target="_blank" rel="noopener"
>1000 G&lt;/a>&lt;br>
&lt;a class="link" href="http://gene2pheno.org" target="_blank" rel="noopener"
>summary statistic&lt;/a>&lt;/p></description></item><item><title>Polygenic Risk Score</title><link>https://jiangcc.netlify.app/p/polygenic-risk-score/</link><pubDate>Wed, 27 Nov 2024 00:00:00 +0000</pubDate><guid>https://jiangcc.netlify.app/p/polygenic-risk-score/</guid><description>&lt;h2 id="background">Background&lt;/h2>
&lt;p>A polygenic risk score (PRS) estimates an individual&amp;rsquo;s risk of a disease or disease-related trait.
The higher the PRS, the larger the combined effect of the individual&amp;rsquo;s SNPs on the disease. But the PRS transfers poorly across populations: if the data sample is European, the PRS
performs well for Europeans &lt;strong>only&lt;/strong>. In this paper, we add the polygenic transcriptome risk score (PTRS), which improves
portability across populations.&lt;/p>
&lt;h2 id="data">Data&lt;/h2>
&lt;p>The discovery set consists of 356,476 unrelated Europeans in the UK Biobank. For the training sets, the first set of prediction models was trained in European
individuals from GTEx v8 in whole blood. The second set of models was
trained using array-based expression in monocyte samples of Europeans from MESA.&lt;/p>
$$
\begin{align*}
f(\mathbf{X}) = T
\end{align*}
$$&lt;p>To predict the transcriptome ($\hat{T}$, the set of predicted gene expression levels), we downloaded the prediction weights (coefficients of the linear function) of the GTEx tissue models and the MESA models
collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.&lt;/p>
&lt;div style="text-align: center;">
&lt;img src="data.png" style="width:90%;">
&lt;/div>
&lt;h3 id="dealing-data">Dealing Data&lt;/h3>
&lt;p>Using UK Biobank data&lt;/p>
&lt;ul>
&lt;li>GTEx data contain SNPs and the corresponding gene expression data.&lt;/li>
&lt;li>Covariates include the first 20 genetic PCs, age, sex, 17 phenotypes (blood cell counts and similar measures), and ancestry&lt;/li>
&lt;li>Individuals are labeled as EUR, S.ASN, E.ASN, or AFR&lt;/li>
&lt;li>Remove data with high missing rates&lt;/li>
&lt;li>Individuals with multiple arrays (measured several times with different instruments): take the average&lt;br>
Individuals with multiple instances (measured several times at different stages): use the first non-missing value&lt;/li>
&lt;li>The predicted transcriptome depends on population and tissue: different tissues or
populations use different functions to transform SNPs into gene expression&lt;/li>
&lt;/ul>
&lt;h3 id="quality-control-on-self-reported-ancestry">Quality control on self-reported ancestry&lt;/h3>
&lt;p>Because many individuals may report their ancestry incorrectly, we define a similarity $S_{ik}$ between individual $i$ and population $k$&lt;/p>
$$
\begin{align*}
S_{ik} = \log{P(\text{PC}_i^1, \cdots, \text{PC}_i^{10} |\ \widehat{\mu_k}, \widehat{\Sigma_k})}
\end{align*}
$$&lt;p>where $\widehat{\mu_k}, \widehat{\Sigma_k}$ are the sample mean and sample covariance matrix of the PCs in population $k$, respectively.&lt;br>
For the 4 populations (EUR, S.ASN, E.ASN, and AFR) and for un-assigned individuals, we keep those with $S_{ik} > -50$.&lt;/p>
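&lt;p>A minimal sketch of the similarity score, assuming the probability in $S_{ik}$ is a multivariate normal density of the first 10 PCs. The function name and the simulated reference individuals are hypothetical; only the threshold of $-50$ comes from the text.&lt;/p>

```python
import numpy as np

def log_similarity(pcs, mean, cov):
    """Log density of a multivariate normal evaluated at one PC vector."""
    d = len(mean)
    diff = pcs - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

rng = np.random.default_rng(4)
# Hypothetical reference individuals of one population, described by 10 PCs.
reference = rng.normal(size=(500, 10))
mu_hat = reference.mean(axis=0)
sigma_hat = np.cov(reference, rowvar=False)

candidate = rng.normal(size=10)
score = log_similarity(candidate, mu_hat, sigma_hat)
keep = score > -50   # threshold quoted in the text
print(score, keep)
```

&lt;p>An individual far from the population centre in PC space gets a large Mahalanobis distance, hence a very negative score, and is filtered out.&lt;/p>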
&lt;h2 id="proportion-of-variance-explained">Proportion of Variance Explained&lt;/h2>
&lt;p>First, we convert the predicted gene expression $\hat{T}_{ig}$ to $\widetilde{T}_{ig}$ by a rank-based inverse normal transformation.
This bounds the range of the transformed gene expression by normal quantiles, and
the transformed gene expression approximately follows $N(0,1)$. For gene $g$, the transformed expression of individual $i$ among $N$ individuals is&lt;/p>
$$
\begin{align*}
\widetilde{T}_{ig} = \Phi^{-1}\left(\frac{\text{rank}(\hat{T}_{ig})}{N+1}\right)
\end{align*}
$$&lt;p>Suppose the observed phenotype ( $Y_i$ ) has a linear relation with the $l$th covariate ( $C_{il}$ ) and
the inverse-normalized predicted expression of gene $g$ ( $\widetilde{T}_{ig}$ ), where $M$ is the number of genes:&lt;/p>
$$
\begin{align*}
&amp; Y_i = \mu + \sum_l C_{il} a_l + \sum_g \widetilde{T}_{ig} \beta_g + \varepsilon_i \\
&amp; \varepsilon_i \overset{\text{iid }}{\sim} N(0, \sigma_e^2) \\
&amp; \beta_g \overset{\text{iid }}{\sim} N\left( 0, \frac{\sigma_g^2}{M} \right)
\end{align*}
$$&lt;p>The proportion of variance explained (PVE) is similar in spirit to $R^2$: it measures how much of the variation in the data is explained by a covariate.
$R^2$ measures the variation explained by all covariates together, whereas PVE focuses on the covariates of interest.
The PVE for each trait depends on the heritability of the trait. The PVE of gene $g$ is&lt;/p>
$$
\begin{align*}
\text{PVE}_g = \frac{\hat{\sigma}_g^2}{M (\hat{\sigma}_e^2 + \hat{\sigma}_g^2)}
\end{align*}
$$&lt;h2 id="polygenic-risk-scores">Polygenic Risk Scores&lt;/h2>
&lt;p>We use LD clumping to select independent and significant SNPs. Using the discovery data, the polygenic risk score (PRS)
for individual $i$ at GWAS p-value threshold $t$ is&lt;/p>
$$
\begin{align*}
\text{PRS}_i^t=\sum_{j: p_j \leq t} X_{i j} \widehat{b}_j
\end{align*}
$$&lt;ul>
&lt;li>$\widehat{b}_j$: the effect size (coefficient) of SNP $j$ estimated by GWAS&lt;/li>
&lt;li>$X_{ij}$: the genotype of individual $i$ at SNP $j$, i.e. the number of risk alleles (0, 1, or 2)&lt;/li>
&lt;li>$p_j$: the GWAS p-value of SNP $j$&lt;/li>
&lt;/ul>
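&lt;p>A small worked example with hypothetical numbers (not taken from the paper) may help. Suppose three clumped SNPs have p-values $p_1 = 10^{-8}$, $p_2 = 10^{-4}$, $p_3 = 0.03$ and estimated effects $\widehat{b}_1 = 0.2$, $\widehat{b}_2 = -0.1$, $\widehat{b}_3 = 0.05$, and individual $i$ carries $X_{i1} = 2$, $X_{i2} = 1$, $X_{i3} = 1$ risk alleles. Then&lt;/p>
$$
\begin{align*}
\text{PRS}_i^{t = 10^{-3}} &amp; = 2 \times 0.2 + 1 \times (-0.1) = 0.3 \\
\text{PRS}_i^{t = 0.05} &amp; = 2 \times 0.2 + 1 \times (-0.1) + 1 \times 0.05 = 0.35
\end{align*}
$$&lt;p>A looser threshold admits more SNPs, so each threshold gives a different score for the same individual.&lt;/p>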
&lt;p>&lt;strong>Independent and Significant&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Significant: we want the SNPs that influence the disease, so we choose the significant ones.&lt;/li>
&lt;li>Independent: if correlated SNPs are put into the PRS, the same effect is counted repeatedly and the score becomes inaccurate.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitation of PRS&lt;/strong>&lt;br>
When computing PRS, different SNPs may be selected for different ancestries. Minority groups may therefore be overlooked when
public health measures are implemented.&lt;/p>
&lt;h3 id="ld-clumping-vs-ld-pruning">LD Clumping v.s. LD Pruning&lt;/h3>
&lt;p>Note that SNPs are not evenly spaced along the genome.&lt;/p>
&lt;p>&lt;strong>LD Clumping&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Order the SNPs by p-value.&lt;/li>
&lt;li>Take the SNP with the smallest p-value as the center of a 250 kb window.&lt;/li>
&lt;li>Remove the SNPs in the window whose $r^2$ with the center SNP satisfies $r^2 > 0.1$.&lt;/li>
&lt;li>Proceed to the remaining SNP with the next smallest p-value and repeat.&lt;/li>
&lt;/ol>
&lt;div style="text-align: center;">
&lt;img src="ld_clumping.png" style="width:50%;">
&lt;/div>
&lt;p>&lt;strong>LD Pruning&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Within a window, compute $r^2$ for each pair of SNPs.&lt;/li>
&lt;li>If $r^2$ exceeds the threshold, delete the SNP with the smaller MAF.&lt;/li>
&lt;li>Move to the next window.&lt;/li>
&lt;/ol>
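&lt;p>For reference, the $r^2$ in both procedures is the squared correlation between two SNPs. A sketch of the standard haplotype-based definition, where $p_A$ and $p_B$ are the frequencies of one allele at each SNP and $p_{AB}$ is the frequency of the haplotype carrying both (these symbols are introduced here for illustration, and software may estimate $r^2$ from unphased genotypes instead):&lt;/p>
$$
\begin{align*}
&amp; D = p_{AB} - p_A p_B \\
&amp; r^2 = \frac{D^2}{p_A (1 - p_A) p_B (1 - p_B)}
\end{align*}
$$&lt;p>With the hypothetical values $p_A = p_B = 0.5$ and $p_{AB} = 0.4$, we get $D = 0.15$ and $r^2 = 0.0225 / 0.0625 = 0.36$, so this pair would exceed a threshold of $0.2$ and one of the two SNPs would be removed.&lt;/p>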
&lt;p>&lt;strong>Example for LD Pruning&lt;/strong>&lt;/p>
&lt;p>We compute the LD of each pair of SNPs within a window of $50000$ SNPs; if $r^2 > 0.2$, the SNP with the smaller MAF is deleted. The window then shifts by $5000$ SNPs and the procedure repeats. LD is never computed between SNPs on different chromosomes, even if a chromosome has only a few SNPs.&lt;br>
The windows are: window 1: &lt;code>0 – 50,000-th SNP&lt;/code>; window 2: &lt;code>5,000-th SNP – 55,000-th SNP&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-html" data-lang="html">&lt;span class="line">&lt;span class="cl">plink --bfile xxx1 --chr 1-22 --indep-pairwise 50000 5000 0.2 --out output2
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>We compute the LD of each pair of SNPs within a $50$ kb window; if $r^2 > 0.2$, the SNP with the smaller MAF is deleted. The window then shifts by 5 SNPs and the procedure repeats. LD is never computed between SNPs on different chromosomes, even if a chromosome has only a few SNPs.&lt;br>
The windows are: window 1: &lt;code>0 – 50 kb&lt;/code>; window 2: &lt;code>the next 5 SNP – (the next 5 SNP pos +50 kb)&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-html" data-lang="html">&lt;span class="line">&lt;span class="cl">plink --bfile xxx1 --chr 1-22 --indep-pairwise-kb 50 5 0.2 --out output2
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h3 id="prs-practicality">PRS Practicality&lt;/h3>
&lt;ol>
&lt;li>We select independent (weakly correlated) SNPs by LD clumping.&lt;/li>
&lt;li>For each of the 11 p-value thresholds, we obtain a different PRS.&lt;/li>
&lt;/ol>
&lt;h3 id="risk-allele">Risk Allele&lt;/h3>
&lt;p>An allele may be defined by a combination of SNPs, for example two. The disease risk increases with the number of risk alleles.
For example, APOE, which is associated with Alzheimer&amp;rsquo;s disease, has three alleles, and APOE $\varepsilon_4$ is the risk allele.&lt;/p>
&lt;ol>
&lt;li>APOE $\varepsilon_2$: consists of SNPs rs429358 (T), rs7412 (T)&lt;/li>
&lt;li>APOE $\varepsilon_3$: consists of SNPs rs429358 (T), rs7412 (C)&lt;/li>
&lt;li>APOE $\varepsilon_4$: consists of SNPs rs429358 (C), rs7412 (C)&lt;/li>
&lt;/ol>
&lt;h2 id="polygenic-transcriptome-risk-scores">Polygenic Transcriptome Risk Scores&lt;/h2>
&lt;p>The advantage of polygenic transcriptome risk scores (PTRS) is that the PTRS model has fewer covariates (genes rather than SNPs) and requires a smaller sample size for training. In addition, gene expression is biologically closer to disease than the SNP data used in PRS. In this paper, the discovery data are used to fit the PTRS. The PTRS for individual $i$ at hyperparameters $\boldsymbol{\lambda}$, summing over genes $g$, is&lt;/p>
$$
\begin{align*}
\text{PTRS}_i^{\boldsymbol{\lambda}} = \sum_g {\hat{T}_{ig}\beta_g^{\boldsymbol{\lambda}}}
\end{align*}
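$$&lt;p>where $\boldsymbol{\lambda}$ consists of the elastic net penalty strength $\lambda$ and mixing parameter $\alpha$, and $\beta_g^{\boldsymbol{\lambda}}$ is the weight of gene $g$ fitted at those hyperparameters.&lt;/p>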
&lt;h3 id="elastic-net">Elastic Net&lt;/h3>
&lt;p>Elastic net is a variable selection method, but it is unsuitable when there are too many covariates. PTRS has fewer covariates than PRS, so elastic net is applied only to PTRS. For PRS, LD clumping and p-value filtering are used. For $N$ individuals, elastic net finds the optimal $\boldsymbol{\beta^{EN}}$:&lt;/p>
$$
\begin{align*}
\boldsymbol{\beta^{EN}}
&amp; = \displaystyle \arg \min_{\beta} \left\{ \underbrace{\frac{1}{N} \| \boldsymbol{Y} - \boldsymbol{X} \boldsymbol{\beta}-
\boldsymbol{\beta_0} \|_2^2 }_{\textcolor{blue}{\text{loss}}} + \lambda \left[ \alpha \left\| \boldsymbol{\beta} \right\|_1 +
(1-\alpha) \left\| \boldsymbol{\beta} \right\|_2^2 \right] \right\} \\
\end{align*}
$$&lt;p>where&lt;/p>
$$
\begin{align*}
&amp; \boldsymbol{\beta} = [\beta_1, \ldots, \beta_{M+L}]^T \in \mathbb{R}^{(M+L) \times 1} \\
&amp; \boldsymbol{\beta_0} = [\beta_0, \ldots, \beta_0]^T \in \mathbb{R}^{N \times 1}, \text{ intercept repeated for each individual} \\
&amp; \boldsymbol{Y} \in \mathbb{R}^{N \times 1}, \text{ vector of observed phenotypes} \\
&amp; \boldsymbol{X} = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N \times (M+L)} \\
&amp; M = \text{ number of genes } \\
&amp; L= \text{ number of covariates } \\
&amp; \hat{T}_g \in \mathbb{R}^{N \times 1}, \text{ predicted standardized expression of the } g \text{-th gene} \\
&amp; C_l \in \mathbb{R}^{N \times 1}, \text{ observed standardized } l \text{-th covariate} \\
\end{align*}
$$&lt;p>&lt;strong>Degenerate Model&lt;/strong>&lt;/p>
$$
\begin{align*}
Y= \text{constant}+\varepsilon
\end{align*}
$$&lt;h3 id="ptrs-practicality">PTRS Practicality&lt;/h3>
&lt;ol>
&lt;li>We set $\alpha = 0.1$ and find $\lambda_{\text{max}}$ as the smallest $\lambda$ satisfying $|\nabla l(\beta)| \leq \alpha \lambda$ at $\beta = 0$, i.e. the smallest penalty that yields the degenerate model.&lt;/li>
&lt;li>To match the 11 PRS p-value cutoffs, we build a grid of $\lambda$ values by selecting 20 equally spaced points on the log scale between
$1.5\lambda_{\text{max}}$ and $\frac{\lambda_{\text{max}}}{10^4}$, and we use the first 11 non-degenerate models for each population.&lt;/li>
&lt;li>With $\alpha=0.1$, fit $\beta$ for each $\lambda$ by elastic net.&lt;/li>
&lt;li>For each of the 11 $\lambda$ values, we obtain a different PTRS.&lt;/li>
&lt;/ol>
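&lt;p>The grid in step 2 can be written explicitly. A sketch, assuming the 20 points include both endpoints and are indexed $k = 0, \ldots, 19$ (the indexing and the symbol $\rho$ are chosen here for illustration):&lt;/p>
$$
\begin{align*}
\lambda_k = 1.5 \lambda_{\text{max}} \cdot \rho^{k}, \quad \rho = \left( \frac{10^{-4}}{1.5} \right)^{1/19} \approx 0.60, \quad k = 0, 1, \ldots, 19
\end{align*}
$$&lt;p>so that $\lambda_0 = 1.5 \lambda_{\text{max}}$ and $\lambda_{19} = \lambda_{\text{max}} / 10^4$, and each step shrinks the penalty by roughly 40 percent.&lt;/p>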
&lt;h2 id="partial-r-squared">Partial $R$ Squared&lt;/h2>
&lt;p>To compare the performance of PRS and PTRS, the partial $R$ squared ($\widetilde{R^2}$) is used as the evaluation metric. The partial $R^2$ is also called the
prediction accuracy. Let $y_i$ denote the observed phenotype and
$\hat{y_i}$ the predicted phenotype, and compare&lt;/p>
$$
\begin{align*}
\text{Null model}: y \sim 1 + \textit{covariates} \quad \text{vs.} \quad \text{Full model}: y \sim 1 + \textit{covariates} + \hat{y}
\end{align*}
$$$$
\begin{align*}
&amp; \widetilde{R^2} \\
= \quad &amp; 1- \dfrac{\text{SSE}_\text{full} }{\text{SSE}_\text{null} } \\
= \quad &amp; \frac{C^2(y, \hat{y})}{C(y, y)C(\hat{y}, \hat{y})} \\
\end{align*}
$$&lt;p>where&lt;/p>
$$
\begin{align*}
&amp; C(u, v) = u^t v - u^t H v \\
&amp; H = \widetilde{C} (\widetilde{C}^t \widetilde{C})^{-1} \widetilde{C}^t \\
&amp; \widetilde{C} = [1, C_1, \dots, C_L]
\end{align*}
$$&lt;h3 id="how-to-choose-hyperparameters">How to choose hyperparameters?&lt;/h3>
&lt;ol>
&lt;li>Compute the PTRS weights in the discovery set (UKB EUR) and test them in the 5 target sets.&lt;/li>
&lt;li>Split each target set into two equal-size parts, a validation set and a test set.&lt;/li>
&lt;li>Select the hyperparameters (p-value cutoff in clumping and thresholding,
$\lambda$ in elastic net) that maximize $\widetilde{R^2}$ in the validation set.&lt;/li>
&lt;/ol>
&lt;p>After choosing the hyperparameters, we calculate $\widetilde{R^2}$ in the test set.
This procedure is repeated $10$ times, and the average $\widetilde{R^2}$
is reported as the prediction accuracy.&lt;/p>
&lt;h2 id="combining-ptrs-and-prs">Combining PTRS and PRS&lt;/h2>
&lt;p>The combined score is $\hat{y_i} = c_1 \text{PRS}_i^\lambda + c_2 \text{PTRS}_i^\lambda$.&lt;/p>
&lt;ol>
&lt;li>Split each target set into two equal-size parts, a validation set and a test set.&lt;/li>
&lt;li>Split the validation set into two equal-size parts.&lt;/li>
&lt;li>For each of the 11 $\lambda$ thresholds, find ${\arg \min}_{c_1,\ c_2} \sum_i (y_i-\hat{y_i})^2$ in the first validation set. The scores differ across thresholds, so $c_1,\ c_2$ differ too, just as a linear model fitted to different $x_i$ gives a different fitted line.&lt;/li>
&lt;li>This gives the best $c_1,\ c_2$ for each threshold. We then use $c_1$, $c_2$ to compute the combined score in the second validation set.&lt;/li>
&lt;li>This gives the combined score at 11 thresholds for each population. We select the threshold with the largest $\widetilde{R^2}$.&lt;/li>
&lt;li>This gives the best tuned $c_1,\ c_2$ and threshold for each population, which we apply to the test set.&lt;/li>
&lt;li>Compute the final $\widetilde{R^2}$ in the test set.&lt;/li>
&lt;/ol>
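&lt;p>Step 3 is an ordinary least-squares fit with two predictors. A sketch of its closed form, assuming no intercept (as in the combined score written above) and stacking the two scores of the $n$ individuals in the first validation set into $Z = [\text{PRS}^\lambda, \text{PTRS}^\lambda] \in \mathbb{R}^{n \times 2}$ (the symbols $Z$ and $n$ are introduced here for illustration):&lt;/p>
$$
\begin{align*}
\begin{bmatrix} \hat{c}_1 \\ \hat{c}_2 \end{bmatrix} = (Z^T Z)^{-1} Z^T y
\end{align*}
$$&lt;p>This requires $Z^T Z$ to be invertible, i.e. the PRS and PTRS must not be collinear in that subset.&lt;/p>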
&lt;h2 id="portability-of-prs-and-ptrs">Portability of PRS and PTRS&lt;/h2>
&lt;p>The portability of PRS is&lt;/p>
$$
\begin{align*}
\frac{\text{prediction accuracy in target set}}{ \text{prediction accuracy in European reference set}}
\end{align*}
$$&lt;p>The portability of PTRS is&lt;/p>
$$
\begin{align*}
\frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} },
\quad \text{ where } \widetilde{R^2}_{\text{EUR ref}} \text{ is } \widetilde{R^2} \text{ in MESA EUR model }
\end{align*}
$$&lt;p>The MESA EUR model is expected to perform better
than the MESA AFHI model among EUR individuals, so&lt;/p>
$$
\begin{align*}
&amp; \widetilde{R^2}_{\text{MESA AFHI}} &lt;\widetilde{R^2}_{\text{EUR ref}} \\
\Rightarrow \quad &amp; \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} } &lt; \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{MESA AFHI}} }
\end{align*}
$$&lt;p>Therefore, this definition of the portability of PTRS is conservative.&lt;/p>
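&lt;p>A numerical illustration with hypothetical values (not results from the paper): suppose $\widetilde{R^2} = 0.02$ in a target set, $\widetilde{R^2}_{\text{EUR ref}} = 0.05$, and $\widetilde{R^2}_{\text{MESA AFHI}} = 0.04$ among EUR individuals. Then&lt;/p>
$$
\begin{align*}
\frac{0.02}{0.05} = 0.4 &lt; \frac{0.02}{0.04} = 0.5
\end{align*}
$$&lt;p>Dividing by the larger EUR reference gives the smaller portability, so the reported value is the more cautious of the two.&lt;/p>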
&lt;h1 id="result">Result&lt;/h1>
&lt;ul>
&lt;li>In the paper, Fig. 3(b) uses PVE for PTRS and heritability for PRS.
In the plot, the $\widetilde{R^2}$ of PTRS can reach its upper bound (heritability), but that of PRS cannot.&lt;/li>
&lt;li>Within a fixed ancestry group, PTRS performs worse than PRS, but PTRS + PRS performs better than PRS alone.&lt;/li>
&lt;li>The weights from PredictDB are used to compute the predicted gene expression (the weights were estimated in another paper).
Here we use the weights directly.&lt;/li>
&lt;li>In the paper, the traits are continuous, so survival analysis is not involved.
If the trait were discrete, we would use survival analysis.&lt;/li>
&lt;li>Finally, one PRS and one PTRS are computed for each ancestry group.&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;p>&lt;a class="link" href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02591-w" target="_blank" rel="noopener"
>Polygenic transcriptome risk scores (PTRS) can improve portability of polygenic risk scores across ancestries&lt;/a> &lt;br>&lt;/p></description></item></channel></rss>