Genome on Quan's Base

Bayesian Interpretation for Positive False Discovery Rate

Sat, 10 Jan 2026 00:00:00 +0000

Problem

In multiple testing, we are often concerned with the rate of false positives among all rejected hypotheses rather than the probability of wrongly rejecting at least one hypothesis. We accept that some true null hypotheses will be rejected, as long as their proportion among the rejections is controlled.

pFDR can be written as a Bayesian posterior probability

pFDR

Concept

$$ \begin{array}{lccc} \hline & \text{Not rejected} & \text{Rejected} & \text{Total} \\ \hline \text{Null true} & U & V & m_0 \\ \text{Alternative true} & T & S & m_1 \\ \hline \text{Total} & W & R & m \\ \hline \end{array} $$

Possible outcomes from $m$ hypothesis tests. With this notation, the positive false discovery rate is $\mathrm{pFDR} = E\left[ V/R \mid R > 0 \right]$.

Theorem

Theorem 1 (pFDR as a posterior probability)

Suppose $m$ identical hypothesis tests are performed with statistics $T_1, \ldots, T_m$ and significance region $\Gamma$. Assume that the pairs $(T_i, H_i)$ are i.i.d., with $T_i \mid H_i \sim (1-H_i)F_0 + H_i F_1$ for a null distribution $F_0$ and an alternative distribution $F_1$, and $H_i \sim \mathrm{Bernoulli}(\pi_1)$. Then

$$ \begin{align*} \mathrm{pFDR}(\Gamma)= P(H=0 \mid T \in \Gamma) = \frac{\pi_0 \, P(T \in \Gamma \mid H=0)}{P(T \in \Gamma)} \end{align*} $$

where $\pi_0=1-\pi_1$
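To make the posterior reading concrete, here is a minimal sketch that evaluates $P(H=0 \mid T \in \Gamma)$ for a one-sided region $\Gamma = \{T \geq c\}$. The normal null $N(0,1)$, the normal alternative $N(3,1)$, $\pi_0 = 0.9$ and the helper names are all assumptions made for illustration; they do not come from the theorem itself.

```python
import math

def normal_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability P(T >= x) of a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def pfdr_upper_tail(c, pi0, alt_mean):
    """pFDR of the region {T >= c}, computed as the posterior P(H=0 | T >= c).

    Assumes F_0 = N(0, 1) and F_1 = N(alt_mean, 1) (illustrative choices).
    """
    null_mass = pi0 * normal_sf(c)                        # pi_0 * P(T in Gamma | H=0)
    alt_mass = (1.0 - pi0) * normal_sf(c, mean=alt_mean)  # pi_1 * P(T in Gamma | H=1)
    return null_mass / (null_mass + alt_mass)

# A loose cutoff versus a strict cutoff on the same two-group model.
pfdr_loose = pfdr_upper_tail(c=1.645, pi0=0.9, alt_mean=3.0)
pfdr_strict = pfdr_upper_tail(c=3.0, pi0=0.9, alt_mean=3.0)
print(pfdr_loose, pfdr_strict)
```

Even though the loose cutoff corresponds to a p-value threshold of about 0.05, roughly a third of its rejections are expected to be true nulls in this setting, because nulls are nine times more common than alternatives.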

For a nested family of significance regions $\{\Gamma_\alpha\}$, the p-value of an observed statistic $T = t$ is defined to be

$$ \begin{align*} \mathrm{p\text{-}value}(t)= \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} P(T\in\Gamma_\alpha \mid H=0) \end{align*} $$

For an observed statistic $T = t$ define the q-value of $t$ to be

$$ \begin{align*} \text{q-value}(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \text{pFDR}(\Gamma_{\alpha}) \end{align*} $$

Corollary 2

Under the assumptions of Theorem 1,

$$ \begin{align*} \text{q-value}(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} P(H=0 \mid T \in \Gamma_{\alpha} ) \end{align*} $$

Theorem for dependent statistics

Suppose that, as $m \to \infty$, for each $\alpha>0$ and for some continuous functions $G_0$ and $G_1$,

$$ \begin{align*} \sum_{i=1}^{m} \frac{(1 - H_i)}{m} \to \pi_0, \quad\frac{V_m(\Gamma_\alpha)}{\sum_{i=1}^{m} (1 - H_i)} \to G_0(\alpha), \quad \frac{S_m(\Gamma_\alpha)}{\sum_{i=1}^{m} H_i} \to G_1(\alpha) \end{align*} $$

with probability 1

Then for any $\delta>0$

$$ \begin{align*} \text{(i)} & \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \frac{V_m(\Gamma_\alpha)}{R_m(\Gamma_\alpha) \vee 1} - P_{\infty}(H = 0 \mid X \in \Gamma_\alpha) \right| \stackrel{a.s.}{=} 0 \\ \text{(ii)} & \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \text{FDR}_m(\Gamma_\alpha) - P_{\infty}(H = 0 \mid X \in \Gamma_\alpha) \right| = 0 \\ \text{(iii)} & \quad \lim_{m \to \infty} \sup_{\alpha \geq \delta} \left| \text{pFDR}_m(\Gamma_\alpha) - P_{\infty}(H = 0 \mid X \in \Gamma_\alpha) \right| = 0 \end{align*} $$

where $P_{\infty}(H = 0 \mid X \in \Gamma_\alpha) = \frac{\pi_0 \cdot G_0(\alpha)}{\pi_0 \cdot G_0(\alpha) + (1 - \pi_0) \cdot G_1(\alpha)}$

Benefit

Limitation

Common Technology

Omnibus Test

The Omnibus Test uses summary data to deal with multiple cohorts/methods. In this paper, we use the omnibus test to check for significant associations across predictions from YFS, METSIM, and NTR (different tissues). For gene $i$

$$ \begin{align*} \text{omnibus}_i = \mathbf{Z_i^T C_i^{-1} Z_i} \overset{approx}{\sim} \chi^2_3 \end{align*} $$

where

$\mathbf{Z_i}$ is the $3 \times 1$ vector of TWAS Z-scores for gene $i$, one per cohort
$\mathbf{C_i}$ is the $3 \times 3$ correlation matrix among the $3$ cohorts
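As a rough illustration of the statistic above, the sketch below computes $\mathbf{Z^T C^{-1} Z}$ for one gene and converts it to a p-value with the closed-form upper tail of $\chi^2_3$. The Z-scores and the correlation matrix are made-up numbers, and the helper names (`solve_linear`, `chi2_sf_df3`, `omnibus_test`) are hypothetical, not taken from the paper or any package.

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def omnibus_test(z, corr):
    """Return (statistic, p-value) for Z^T C^{-1} Z referred to chi-square(3)."""
    c_inv_z = solve_linear(corr, z)  # C^{-1} Z without forming the inverse explicitly
    stat = sum(zi * wi for zi, wi in zip(z, c_inv_z))
    return stat, chi2_sf_df3(stat)

z_scores = [2.1, 1.8, 2.5]  # illustrative Z-scores for the three cohorts
corr = [[1.0, 0.3, 0.2],
        [0.3, 1.0, 0.4],
        [0.2, 0.4, 1.0]]    # illustrative correlation matrix
stat, p_value = omnibus_test(z_scores, corr)
print(stat, p_value)
```

Solving the linear system instead of inverting $\mathbf{C_i}$ is a design choice for numerical stability; with uncorrelated cohorts the statistic reduces to the sum of squared Z-scores.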

Performance

True Data

Simulation

Reference

Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6), 2013-2035.

Meta Analysis

Fri, 13 Jun 2025 00:00:00 +0000

Introduction

Meta-analysis combines results across studies. As the combined sample size increases, more associated SNPs can be detected. There are two types of meta-analysis: fixed-effect and random-effects. Suppose we have $N$ studies, and study $i$ reports an effect size estimate $\hat{\beta}_i$ with standard error $\sigma_i$:

$$(\hat{\beta}_i, \sigma_i), \ i=1,\ldots,N$$

A common way to combine them is inverse variance weighting, $\widetilde{\beta} = \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}}$. If the $\hat{\beta}_i \sim {N}(\beta, \sigma_i^2)$ are independent, then

$$ \begin{align*} Var(\widetilde{\beta}) &= {Var}\left( \frac{\sum_{i=1}^N \hat{\beta}_i \sigma_i^{-2}}{\sum_{i=1}^N \sigma_i^{-2}} \right) \\ &= \frac{\sum_{i=1}^N {Var}\left( \hat{\beta}_i \sigma_i^{-2} \right)}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-4} \sigma_i^2}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{\sum_{i=1}^N \sigma_i^{-2}}{\left( \sum_{i=1}^N \sigma_i^{-2} \right)^2} \\ &= \frac{1}{\sum_{i=1}^N \sigma_i^{-2}} \end{align*} $$
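The derivation above translates directly into a few lines of code. This is a minimal sketch with made-up effect sizes and standard errors; the function name `fixed_effect_meta` is hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]          # sigma_i^{-2}
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)                      # Var = 1 / sum(sigma_i^{-2})
    return beta, se

betas = [0.10, 0.20, 0.15]   # illustrative per-study effect sizes
ses = [0.05, 0.10, 0.05]     # illustrative per-study standard errors
beta_meta, se_meta = fixed_effect_meta(betas, ses)
z = beta_meta / se_meta      # Z-score of the pooled estimate
print(beta_meta, se_meta, z)
```

The pooled standard error is smaller than any single study's standard error, which is where the gain in power from meta-analysis comes from.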

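Key concepts:
Fixed-effect model: uses inverse variance weighting; appropriate when the studies are very similar in population background and experimental method
Random-effects model: the variance includes within-study error ($\sigma_i^2$) and between-study heterogeneity ($\tau^2$)
Heterogeneity test: before pooling the data, we must check whether the studies are compatible, e.g. with Cochran’s Q test
Genomic Control ($\lambda_{GC}$): corrects for population stratification within each study
Funnel plot: checks for publication bias
Forest plot: shows the direction of effect of a single SNP in each study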

Material and Method

flowchart

Data

We utilized data from the 1,000 Genomes Project to perform a GWAS for height. The study encompassed chromosomes 1 through 22, analyzing a total of 36,820,992 variants across 1,092 individuals.

Genotype QC

Excluding SNPs or individuals with missing rate $> 0.1$: 36820992 variants and 1092 people pass the filter
Excluding SNPs with MAF $< 0.05$: 6797981 variants and 1092 people pass the filter
Excluding SNPs with HWE test p-value $< 0.0001$: 4941621 variants and 1092 people pass the filter
LD pruning for PCA, excluding SNPs with pairwise $r^2 > 0.2$ within a 500-SNP window: 299901 variants and 1092 people pass the filter
Align effect alleles across studies, flipping $\beta$ to $-\beta$ where needed

Fixed-Effect Meta-Analysis

$$\begin{align*} (\hat{\beta}_i, \sigma^2_{i}),\quad i = 1, \ldots, N,\quad N \text{ studies} \end{align*}$$

where

$\hat{\beta}_i$ is the estimated effect size in study $i$
$\sigma_i^2$ is the variance of $\hat{\beta}_i$

$$\begin{align*} \hat{\beta}_i \sim N(\beta, \sigma^2_{i}) \\ \tilde{\beta} = \frac{ \sum_{i=1}^N \hat{\beta}_i \sigma^{-2}_{i} }{ \sum_{i=1}^N \sigma^{-2}_{i} } \end{align*}$$

Random-Effects Meta-Analysis

$$\begin{align*} &\hat{\beta}_i \sim {N}(\beta_i, \sigma_i^2) , \quad \beta_i \sim {N}(\mu, \tau^2) \end{align*}$$

where

$\sigma^2_i$ is the sampling variance within study $i$
$\tau^2$ is the variance between studies

$$\begin{align*} \text{Var}(\hat{\beta}_i) &= E(\text{Var}(\hat{\beta}_i \mid \beta_i)) + \text{Var}(E(\hat{\beta}_i \mid \beta_i)) \\ &= E(\sigma_i^2) + \text{Var}(\beta_i) \\ &= \sigma_i^2 + \tau^2 \end{align*}$$

$$\begin{align*} \Rightarrow \hat{\beta}_i \sim {N}(\mu, \sigma_i^2 + {\tau}^2) \end{align*}$$

Chi-square Test for Heterogeneity in Effect

$$\begin{align*} Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1} \end{align*}$$

This tests whether any of the $\beta_i$ differ significantly, i.e. $H_0: \beta_1 = \beta_2 = \cdots = \beta_N$.

$$\begin{align*} I^2 &= 100\% \cdot \frac{Q - \text{df}}{Q} \end{align*}$$

where $Q$ is the statistic above and df = $N-1$. $I^2$ describes how much of the variation among the estimates is due to heterogeneity rather than chance:

$I^2 = 0-25\%$: low heterogeneity, consistent with $\beta_1 = \beta_2 = \cdots = \beta_N$ (no evidence against $H_0$)
$I^2 = 25-50\%$: moderate
$I^2 = 50-75\%$: substantial
$I^2 > 75\%$: considerable heterogeneity (evidence against $H_0$)

Cochran’s Q test

Cochran’s Q test may be underpowered when few studies are included or when event rates are low. It is therefore often recommended to use a less stringent threshold (for example 0.10 rather than 0.05) for statistical significance when using Cochran’s Q test to assess heterogeneity.

$$Q = \sum_{i=1}^N \left( \frac{\hat{\beta}_i - \widetilde{\beta}}{\sigma_i} \right)^2 \sim \chi^2_{N-1}$$

With many studies, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)<0.05$
With few studies, reject $H_0$ if the p-value $P(\chi^2_{N-1}>Q)<0.1$
Reference: https://www.ncbi.nlm.nih.gov/books/NBK53317/table/ch3.t2/
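The heterogeneity quantities can be sketched as follows. The input numbers are made up, the function names are hypothetical, and two choices are assumptions rather than something stated in the text: $I^2$ is truncated at 0 when $Q < \text{df}$ (a common convention), and the $\chi^2$ tail probability is computed with a closed-form recursion that only works for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1   # exact tail for df = 1
    else:
        sf, k = math.exp(-half), 2              # exact tail for df = 2
    while k < df:
        # SF_{k+2}(x) = SF_k(x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        sf += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return sf

def cochran_q(betas, ses):
    """Return (Q, df, I^2 in percent, p-value) for a set of study estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(((b - beta_fixed) / se) ** 2 for b, se in zip(betas, ses))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # truncated at 0 (assumption)
    return q, df, i2, chi2_sf(q, df)

q, df, i2, p_het = cochran_q([0.10, 0.20, 0.15], [0.05, 0.10, 0.05])
print(q, df, i2, p_het)
```

In this toy input $Q$ is below its degrees of freedom, so $I^2$ is 0 and the test gives no evidence of heterogeneity.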

Heterogeneity in Effect

The genetic influence on a trait can vary across individuals or populations, even when the trait looks the same. This heterogeneity may arise from:

Differences in LD structure
Interactions with environmental or other genetic exposures at different frequencies

Genomic Control

$$\begin{align*} \lambda_{\text{GC}} &= \frac{\text{median}(\chi^2_{\text{observed}})}{\text{median}(\chi^2_{\text{expected}})} = \frac{\text{median}(\chi^2_{\text{observed}})}{0.455} \\ &\lambda_{\text{GC}} \begin{cases} \approx 1: & \text{well calibrated} \\ > 1: & \text{inflated test statistics} \\ < 1: & \text{conservative test} \end{cases} \end{align*}$$

where $0.455$ is the median of the $\chi^2$ distribution with 1 degree of freedom expected under the null, and the adjusted statistic is $\chi^2_{\text{adjusted}} = \frac{\chi^2_{\text{observed}}}{\lambda_{\text{GC}}}$

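How is $\lambda_{\text{GC}}$ computed?

Take the median of the $\chi^2$ statistics over all SNPs and divide it by the theoretical median (for df = 1).

$$ \text{median}(\chi^2_{df=1}) = 0.455 $$

Correct the statistic computed for each SNP

$$ \chi^2_{\text{GC}, i} = \frac{\chi^2_i}{\lambda_{\text{GC}}}, \quad \lambda_{\text{GC}} = \frac{\text{median}(\chi^2_{\text{all SNPs}})}{0.455} $$

Then convert back to a p-value

$$ p_i^{\text{GC}} = 1 - F_{\chi^2_{df=1}}(\chi^2_{\text{GC}, i}) $$

Genomic control is not well suited to polygenic traits, because true polygenic signal also inflates the median $\chi^2$.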
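The three steps above (estimate $\lambda_{\text{GC}}$, rescale each statistic, convert to a p-value) can be sketched like this. The five $\chi^2$ values are made up, `genomic_control` is a hypothetical helper, and the df = 1 tail probability uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math
import statistics

CHI2_DF1_MEDIAN = 0.4549364  # median of chi-square with 1 df (0.455 when rounded)

def genomic_control(chi2_stats):
    """Return (lambda_GC, corrected statistics, corrected p-values)."""
    lam = statistics.median(chi2_stats) / CHI2_DF1_MEDIAN
    corrected = [x / lam for x in chi2_stats]
    # upper tail of chi-square with 1 df: P(chi2 > x) = erfc(sqrt(x / 2))
    p_values = [math.erfc(math.sqrt(x / 2.0)) for x in corrected]
    return lam, corrected, p_values

chi2_stats = [0.1, 0.3, 0.91, 2.0, 12.0]  # illustrative per-SNP statistics
lam, chi2_gc, p_gc = genomic_control(chi2_stats)
print(lam, chi2_gc, p_gc)
```

After the correction, the median of the rescaled statistics equals the null median by construction, and every p-value becomes less extreme when $\lambda_{\text{GC}} > 1$.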

Result

SNP Finding

LD

Manhattan Plot

Code

GWAS Model

$$y = \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k + \epsilon$$

Under logistic regression,

$$y \sim \text{Ber}(p), \ y \in \{0, 1\}$$

$$\begin{align*} \text{logit}(p) &= \log\left(\frac{p}{1-p}\right) \\ &= \beta_0 + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_k z_k \end{align*}$$

Holding the covariates fixed, $\beta$ is the log odds ratio between $x=1$ and $x=0$:

$$\begin{align*} \beta &= \log\left( \frac{p(x=1)}{1-p(x=1)} \right) - \log\left( \frac{p(x=0)}{1-p(x=0)} \right) \\ &= \log\left( \frac{p(x=1)}{1-p(x=1)} \middle/ \frac{p(x=0)}{1-p(x=0)} \right) \end{align*}$$

Probability Distributions

$$\begin{align*} P(\hat{\beta}_i \mid \mu, \tau) &= \int P(\hat{\beta}_i, \beta_i \mid \mu, \tau) \, d\beta_i \\ &= \int P(\hat{\beta}_i \mid \beta_i) P(\beta_i \mid \mu, \tau) \, d\beta_i \\ &\propto \int \exp\left\{ -\frac{1}{2\sigma_i^2} (\hat{\beta}_i - \beta_i)^2 - \frac{1}{2\tau^2} (\beta_i - \mu)^2 \right\} \, d\beta_i \end{align*}$$

$(\hat{\beta}_i ,\beta_i)$ is jointly bivariate normal, so the marginal distribution of $\hat{\beta}_i$ is still normal: $\hat{\beta}_i \sim N(\mu, \sigma_i^2 + \tau^2)$.

Allelic Chi-square Test

Suppose a SNP has two alleles, A and a, with the following counts in cases and controls

$$ \begin{array}{c|cc} & A & a \\ \hline \text{Case} & O_{1A} & O_{1a} \\ \text{Control} & O_{0A} & O_{0a} \\ \end{array} $$

where $N = O_{1A}+O_{1a}+O_{0A}+O_{0a}$

$$ E_{1A} = \frac{(O_{1A}+O_{0A})(O_{1A}+O_{1a})}{N} $$

The remaining expected counts are computed in the same way

$$ E_{1a}, E_{0A}, E_{0a} $$

Then compute the chi-square statistic for each SNP, summing over the four cells

$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} $$
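A minimal sketch of the allelic test for one SNP follows. The allele counts are made up, `allelic_chi2` is a hypothetical helper, and the p-value uses 1 degree of freedom (the usual choice for a 2 x 2 table).

```python
import math

def allelic_chi2(case_A, case_a, control_A, control_a):
    """Chi-square statistic and p-value (1 df) for a 2 x 2 allele count table."""
    observed = [[case_A, case_a], [control_A, control_a]]
    n = case_A + case_a + control_A + control_a
    row_totals = [sum(row) for row in observed]              # case, control
    col_totals = [case_A + control_A, case_a + control_a]    # allele A, allele a
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n     # E = row total * column total / N
            chi2 += (observed[r][c] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))               # P(chi2_1 > chi2)
    return chi2, p_value

chi2, p_value = allelic_chi2(case_A=60, case_a=40, control_A=40, control_a=60)
print(chi2, p_value)
```

Running this over every SNP gives the vector of $\chi^2$ statistics whose median enters $\lambda_{\text{GC}}$.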

Genomic control does not recompute $\chi^2$. Instead, it treats the observed $\chi^2$ values as inflated and corrects them by dividing by $\lambda_{\text{GC}}$

$$ \chi^2_{\text{GC}} = \frac{\chi^2}{\lambda_{\text{GC}}} $$

Future Work

Following the GWAS, several post-GWAS analyses can be conducted, including fine-mapping, functional annotation, and the calculation of polygenic risk scores (PRS). Furthermore, the GWAS catalog provides a vast repository of existing GWAS summary statistics. We can leverage this data to validate the significant SNPs identified in our study.

Reference

1000G
Prof. lhchien Course

TWAS

Sun, 11 May 2025 00:00:00 +0000

Problem

Studies that measure gene expression together with a complex trait often have small sample sizes. Some methods address this, such as looking for overlap between eQTLs and GWAS trait-associated variants, but they may miss expression effects of small size.

TWAS

Concept

First, test whether the cis-heritability of expression, $h^2_{cis}$, is significantly different from zero. Then we use measured expression data to train an expression imputation model. Three imputation models are compared, based on the top cis-eQTL, BLUP, and BSLMM. Comparing their $\frac{r^2}{h^2}$ (prediction accuracy relative to heritability), BSLMM performs best. Finally, we impute expression-trait association statistics from GWAS summary statistics and the trained imputation model.

Benefit

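Gene expression does not need to be measured in the GWAS samples; TWAS only needs a reference panel with expression to train the imputation model.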

Limitation

TWAS assumes that SNPs affect traits through gene expression.
TWAS cannot establish the direction of causality. One check is to add a trait term to the linear model: if the imputed expression is no longer significant, this suggests a phenotype-mediated effect (SNP → trait → expression).

Common Technology

Omnibus Test

$$ \begin{align*} \text{omnibus}_i = \mathbf{Z_i^T C_i^{-1} Z_i} \overset{approx}{\sim} \chi^2_3 \end{align*} $$

where

$\mathbf{Z_i}$ is the $3 \times 1$ vector of TWAS Z-scores for gene $i$, one per cohort
$\mathbf{C_i}$ is the $3 \times 3$ correlation matrix among the $3$ cohorts

Permutation Test

A permutation test needs no distributional assumption. It is a nonparametric method for testing whether groups of data differ significantly. In this paper, we shuffle the expression-trait association 1,000 times for each TWAS gene and plot the distribution of the shuffled Z-scores $Z_{perm}$, which follow $N(0, \Sigma_{s,s})$. We compute the p-value

$$ \begin{align*} \text{p-value} = \frac{\displaystyle \sum_{i=1}^{1000} I(Z_{obs} < Z_{perm,i})}{1000} \end{align*} $$

If p-value$<0.05$, we reject null hypothesis (expression $\perp$ trait).
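The counting formula above can be sketched on toy data. This is a generic permutation test, not the paper's exact scheme: it shuffles the trait values, uses the Pearson correlation in place of the TWAS Z-score, and every name and number in it is an assumption for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(expression, trait, n_perm=1000, seed=0):
    """Share of permutations whose statistic exceeds the observed one."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    observed = pearson(expression, trait)
    shuffled = list(trait)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break the expression-trait pairing
        if pearson(expression, shuffled) > observed:
            exceed += 1
    return exceed / n_perm

# Toy data in which the trait depends strongly on expression.
expression = [float(i) for i in range(20)]
trait = [2.0 * e + (1.0 if i % 2 else -1.0) for i, e in enumerate(expression)]
p_value = permutation_p_value(expression, trait)
print(p_value)
```

With 1,000 permutations the smallest nonzero p-value is 0.001, so the number of permutations limits the achievable resolution.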

Performance

True Data

TWAS identified 25 novel expression-trait associations using summary association statistics from a 2010 lipid GWAS.

Simulation

Under null

We simulate expression under two null models. In the first, expression is independent of the SNPs and the trait is cis-heritable:

$$\text{Z-score} \sim N\left(0,\mathbf{\frac{WZ}{(W\Sigma_{s,s} W')^{1/2}}}\right) ,\ \text{expression} \sim N(0,1)$$

In the second, the trait is independent of the SNPs and expression is cis-heritable:

$$ \text{Z-score} \sim N(0,1) ,\ \text{expression}=\sum_i X_i +\varepsilon$$

where

$\mathbf{W=\Sigma_{e,s}\Sigma^{-1}_{s,s}}$
$\mathbf{\Sigma_{e,s}}:$ covariance between SNPs and expression
$\mathbf{\Sigma_{s,s}}:$ covariance among all SNPs

Under alternative

We use $6000$ unrelated METSIM GWAS samples and $100$ genes, together with the SNPs in the surrounding 1 Mb of each gene. For each of the $100$ genes, expression is simulated as

$$ \begin{align*} \mathbf{E}=\mathbf{X {\beta} + \varepsilon},\ \text{where } \varepsilon,\ \beta \text{ from Normal} \quad (1) \end{align*} $$

to achieve $h^2_{cis-g}=0.17$. $1000$ samples with SNPs and simulated expression are then withheld for training $(1)$, and we use $(1)$ to simulate expression for the remaining $5000$ samples. For these $5000$ samples, the phenotype $Y$ is simulated as

$$ \begin{align*} Y=E \alpha'+\varepsilon \quad (2) \end{align*} $$

so that $h^2_E=\frac{0.1}{180}$ or $\frac{0.2}{180}$. The expression simulation $(1)$ and the phenotype simulation $(2)$ for the $5000$ samples are repeated $60$ times with different $\varepsilon$. Computing the Z-scores between each SNP and the phenotype then gives a simulated GWAS of size $5000 \times 60$.

Reference

GWAS

Wed, 23 Apr 2025 00:00:00 +0000

Introduction

Genome-wide Association Study (GWAS) is a classic methodology for identifying SNPs associated with common diseases. By scanning the entire genome without prior hypotheses, it utilizes linear models to identify SNPs with significant statistical associations to specific diseases. However, because allele frequencies vary across populations and current GWAS data are predominantly derived from European ancestry, the predictive performance across different ethnic groups remains limited. Furthermore, due to Linkage Disequilibrium (LD), the significant loci identified by GWAS are often merely “tagging SNPs” rather than the actual causal variants responsible for the disease.

During the Quality Control (QC) process, SNPs with a low Minor Allele Frequency (MAF) are typically excluded. Rare variants have extremely low frequencies; for instance, a MAF of 0.01 implies that, on average, only one chromosome in 100 (roughly two individuals in 100) carries the minor allele. When the sample size of a study is insufficient, the influence of these few rare-variant carriers often fails to reach the stringent p-value threshold required for genome-wide significance.

Material and Method

The GWAS workflow begins with stringent QC based on MAF and Hardy-Weinberg Equilibrium (HWE). Following QC, the filtered SNPs are used for the primary association analysis. For population structure correction, a subset of SNPs is subjected to LD pruning to perform Principal Component Analysis (PCA). The final Linear Regression Model incorporates the post-QC SNPs as the independent variable, with Principal Components (PCs) and gender included as covariates to account for population stratification and confounding factors.

Data

We utilized data from the 1,000 Genomes Project to perform a GWAS for height. The study encompassed chromosomes 1 through 22, analyzing a total of 36,820,992 variants across 1,092 individuals.

Genotype QC

Excluding SNPs or individuals with missing rate $> 0.1$: 36820992 variants and 1092 people pass the filter
Excluding SNPs with MAF $< 0.05$: 6797981 variants and 1092 people pass the filter
Excluding SNPs with HWE test p-value $< 0.0001$: 4941621 variants and 1092 people pass the filter
LD pruning for PCA, excluding SNPs with pairwise $r^2 > 0.2$ within a 500-SNP window: 299901 variants and 1092 people pass the filter

Linear model

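After applying filters for missingness rate, MAF, and HWE, the dataset retained 4,941,621 variants and 1,092 individuals. This high-quality dataset served as the foundation for the linear model, which is fitted separately for each SNP $j$:

$$ \begin{align*} Y_i &= \beta_0 + \beta_j X_{ij} + \gamma_1 \text{Gender}_i + \gamma_2 {PC}_{1,i} + \gamma_3 {PC}_{2,i} + \gamma_4 {PC}_{3,i} + \varepsilon_i,\\ i &=1, \cdots , 1092; \quad j=1, \cdots , 4941621 \end{align*} $$

where $Y_i$ is the height of the $i$-th individual and $X_{ij}$ is that individual's genotype at the $j$-th SNP. The PCs are computed from the 299,901 LD-pruned SNPs.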

Result

SNP Finding

We identified several SNPs significantly associated with height, using the standard threshold of p-value $< 5 \cdot 10^{-8}$. These lead SNPs and their corresponding statistics are summarized in the table below.

LD

To determine if multiple significant SNPs represent the same underlying genetic signal, we evaluated the LD between them. Our analysis indicates that five significant SNPs on Chromosome 5 are in high LD with one another, suggesting they likely tag the same causal locus. Similarly, high LD was observed between two SNPs on Chromosome 6, as well as two SNPs on Chromosome 17.

Manhattan Plot

The Manhattan plot visualizes the association results across the genome, with the red dashed line indicating the significance threshold ($-\log_{10}(5 \cdot 10^{-8})$). The plot highlights prominent peaks of significant SNPs, most notably on Chromosome 5, which demonstrates the strongest association with the trait.

Code

library(data.table)
library(dplyr)

merge data

setwd("D:/GWAS_CLASS/midterm")

for (i in 1:22) {
 ## vcf to bed
 paste0("plink --vcf ALL.chr" ,i ,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz --make-bed --out chr", i ) %>%
 system()

 ## keep SNPs only and give variants whose ID is "." a new name of the form chr:position[b37]
 paste0("plink --bfile chr", i, " --snps-only just-acgt --set-missing-var-ids @:#[b37] --make-bed --out chr", i, "_TransformMissing") %>%
 system()
}


## build merge_files.txt: one row per chromosome (chr2 to chr22) listing its .bed, .bim and .fam files
merge_files <- t(sapply(2:22, function(i) {
 paste0("chr", i, "_TransformMissing", c(".bed", ".bim", ".fam"))
}))
write.table(merge_files, file = "merge_files.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

## merge chr1_TransformMissing with every file set listed in merge_files.txt
system("plink --bfile chr1_TransformMissing --merge-list merge_files.txt --make-bed --out process//merge")

QC and LD and PCA

setwd("D:/GWAS_CLASS/midterm")

## write two reports: missingness per individual and missingness per SNP
system("plink --bfile process/merge --missing" )

# QC: drop individuals (--mind) and SNPs (--geno) with missing rate > 0.1, SNPs with MAF < 0.05, and SNPs with HWE p-value < 0.0001
system("plink --bfile process//merge --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.0001 --make-bed --out process//merge_QC" )

# convert to ped, map
system("plink --bfile process//merge_QC --recode --out process//merge_QCPed" )

# LD pruning (only writes prune.in and prune.out)
system("plink --file process//merge_QCPed --indep-pairwise 500 50 0.2 --out process//merge_QCld")

# choose SNP in prune.in, output .ped and .map
system("plink --file process//merge_QCPed --extract process//merge_QCld.prune.in --recode --out process//merge_prune")

#pca
system("plink --file process//merge_prune --pca --out process//merge_pca")


e.vec <- fread("process//merge_pca.eigenvec")

g <- fread("process//pheno.txt")
covar <- data.frame(FID = e.vec$V1,
 IID = e.vec$V2,
 gender = g$Gender,
 PC1 = e.vec$V3,
 PC2 = e.vec$V4,
 PC3 = e.vec$V5)


write.table(covar,file="D://GWAS_CLASS//20101123//process//covar.txt",row.names=F,quote=F)

## association
### Note: pheno.txt must be a plain-text file with tab-separated columns, so it will not look aligned when opened in a text editor
system("plink --bfile process//merge_QC --pheno process//pheno.txt --pheno-name Height --make-bed --out process//merge_f" )
## linear regression of height on each SNP, with gender and PC1 to PC3 as covariates
system("plink --bfile process//merge_f --covar process//covar.txt --covar-name gender PC1 PC2 PC3 --allow-no-sex --linear --out process//linear_model" )

choose sig SNP

## plot
r <- read.table(file="process//linear_model.assoc.linear", header = T)
head(r)
loc <- fread("process//merge_f.bim", header = F)

# combine V5 and V6 (the two alleles) into one label
loc <- loc %>%
 mutate(CodingAllele = paste(V5, V6, sep = "/"))

names(loc)[1:4] <- c("CHR","SNP","v3","BP")


## there are 4941621 SNPs; keep only the ADD (additive SNP effect) row of each linear model
r_snp <- r[r$TEST == "ADD", ]
dim(r_snp)
head(r_snp)

dim(r)
dim(loc)


snp_sig <- r_snp[which(r_snp$P<5*10^(-8)), ]
snp_sig <- snp_sig %>%
 left_join(loc %>% select(SNP, CodingAllele), by = "SNP")

sig SNP is LD high?

# significant SNPs that LD pruning removed, i.e. SNPs in LD with a retained SNP
prune_out <- fread("process//merge_QCld.prune.out", header = FALSE)$V1
snp_sig_ld <- intersect(snp_sig$SNP, prune_out)
snp_sig_ld

sig <- data.frame(sig_snp = snp_sig$SNP)
write.table(sig, file="process//significant_snp.txt", row.names=F, col.names=F, quote=F)

# select significant_snp.txt snp in merge_QCPed
system("plink --file process//merge_QCPed --extract process//significant_snp.txt --recode --out process//merge_sig")

# get all pairs ld
system("plink --file process//merge_sig --r2 --ld-window-r2 0 --out process//sig_snp_LD")

plot

library(ggplot2)
library(dplyr)
library(ggrepel) # avoids overlapping labels

# compute -log10(p-value)
r_snp$logP <- -log10(r_snp$P)

# build the index i used as the x axis
r_snp <- r_snp %>% arrange(CHR, BP) %>%
 mutate(i = 1:n())

# midpoint of each chromosome, used as the position of its x-axis tick
chr_labels <- r_snp %>%
 group_by(CHR) %>%
 summarize(center = median(i))

# keep the significant SNPs (p < 5e-8)
sig_snps <- r_snp %>% filter(P < 5e-8)

# draw the plot
ggplot(r_snp, aes(x = i, y = logP, color = as.factor(CHR %% 2))) +
 geom_point(size = 1) +
 scale_color_manual(values = c("skyblue", "grey")) +
 geom_hline(yintercept = -log10(5e-8), color = "red", linetype = "dashed") +
 scale_x_continuous(breaks = chr_labels$center, labels = chr_labels$CHR) +
 labs(x = "Chromosome", y = "-log10(p-value)", title = "Manhattan Plot") +
 theme_minimal() +
 theme(legend.position = "none") +
 # add SNP labels
 geom_text_repel(data = sig_snps, aes(label = SNP), size = 3, max.overlaps = 20)

Next Steps

Reference

1000G
Prof. lhchien Course

Heritability

Tue, 22 Apr 2025 00:00:00 +0000

Introduction

Heritability is the proportion of variation in a trait within a population that can be attributed to genetic differences. There are two definitions, broad-sense and narrow-sense, and narrow-sense heritability is the more commonly used one.

Broad-sense Heritability

$$ \begin{align*} H^2&=\frac{V_G}{V_G+V_E} \\ &=\frac{V_G}{V_P} \end{align*} $$

$V_P$ is the phenotypic variance
$V_G$ is the genetic variance
$V_E$ is the environmental variance

Narrow-sense Heritability

$$ \begin{align*} h^2=\frac{V_A}{V_P}, \quad \text{where } V_G=V_A+V_{NA} \end{align*} $$

$V_A$ is the additive genetic variance
$V_{NA}$ is the non-additive genetic variance (for example dominance and epistasis)

Example-Dominant Coding

In dominant coding, genotypes $CC$, $C T$ and $T T$ are coded as $1$, $1$ and $0$, respectively, if $C$ is the minor allele.
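As a small sketch of the difference between codings, the snippet below maps the three genotypes to the dominant coding described above and, for contrast, to the additive (minor allele count) coding. Both helper names are hypothetical, and the additive coding is included only as a comparison.

```python
def additive_coding(genotype, minor="C"):
    """Number of minor alleles carried: 0, 1 or 2."""
    return genotype.count(minor)

def dominant_coding(genotype, minor="C"):
    """1 if at least one minor allele is carried, otherwise 0."""
    return 1 if minor in genotype else 0

genotypes = ["CC", "CT", "TT"]
additive = [additive_coding(g) for g in genotypes]
dominant = [dominant_coding(g) for g in genotypes]
print(additive, dominant)
```

Under dominant coding the heterozygote is treated like the minor-allele homozygote, so the effect is no longer proportional to the allele count; this is one source of non-additive variance.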

Example-Epistasis

Labrador coat color is determined by two genes, each with two alleles ($B$/$b$ and $E$/$e$). The $E$ gene can mask the effect of the $B$ gene, which is an example of epistasis:

Color is black when genotype is $B-E-$
Color is chocolate when genotype is $bbE-$
Color is yellow when genotype is $--ee$
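The three rules above can be written as a tiny lookup, shown here only to make the masking explicit. The function name and the two-letter genotype strings (such as "Bb" and "Ee") are assumptions for illustration.

```python
def coat_color(b_locus, e_locus):
    """Coat color from the genotypes at the B and E genes, e.g. "Bb" and "Ee"."""
    if "E" not in e_locus:
        return "yellow"      # ee masks whatever is present at the B gene
    if "B" in b_locus:
        return "black"       # at least one B and at least one E
    return "chocolate"       # bb with at least one E

for b in ("BB", "Bb", "bb"):
    for e in ("EE", "Ee", "ee"):
        print(b, e, coat_color(b, e))
```

Because the effect of the $B$ gene depends on the genotype at the $E$ gene, the two genes do not contribute additively to the phenotype.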

Note

Heritability refers to a specific population, not to individuals.
Heritability $\neq$ inheritance. For example, your brown hair may be inherited from your father, but the heritability of brown hair in the population may be low.
Heritability $\neq$ total genetic contribution. A low $h^2 = \frac{V_A}{V_P}$ does not necessarily mean that genetics plays a small role.
If $h^2$ is low, identifying associated genes might be less fruitful.

There are $3$ common types of heritability.

Family-Based Heritability $(h^2_{\text{family}})$

Family-based studies, often twin studies, estimate heritability by comparing monozygotic (MZ) twins and dizygotic (DZ) twins. Let $r_{MZ}$ be the phenotypic correlation for MZ twins and $r_{DZ}$ for DZ twins.

$$ \begin{align*} \left\lbrace \begin{array}{lll} r_{MZ} = A+C \\ r_{DZ} = \frac{A}{2}+C \end{array} \right. \end{align*} $$

where

$A$ is additive genetic effect
$C$ is shared (common) environmental effect

We estimate

$$ \begin{align*} h^2_{\text{family}} & =A=2(r_{MZ}-r_{DZ}) \\ C & = r_{MZ}-A = 2r_{DZ}-r_{MZ} \end{align*} $$

The unique environment (error) term is $E = 1-r_{MZ}$, so that $A+C+E=1$
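Solving the two twin-correlation equations gives $A = 2(r_{MZ} - r_{DZ})$, $C = r_{MZ} - A$ and $E = 1 - r_{MZ}$. A minimal sketch with made-up correlations (the function name `falconer_ace` is hypothetical):

```python
def falconer_ace(r_mz, r_dz):
    """Additive (A), shared-environment (C) and unique-environment (E) shares."""
    a = 2.0 * (r_mz - r_dz)   # from r_MZ - r_DZ = A / 2
    c = r_mz - a              # equivalently 2 * r_DZ - r_MZ
    e = 1.0 - r_mz            # whatever MZ twins do not share
    return a, c, e

a, c, e = falconer_ace(r_mz=0.8, r_dz=0.5)  # illustrative twin correlations
print(a, c, e)
```

If $r_{MZ} > 2 r_{DZ}$ the estimate of $C$ becomes negative, which is usually read as a sign that the simple ACE assumptions do not hold for that trait.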

SNP-Based Heritability $(h^2_{\text{SNP}})$

Estimated using tools such as GCTA under the mixed linear model

$$ \begin{align*} \mathbf{Y=X \beta+W u+\varepsilon} \end{align*} $$

where

$\mathbf{u} \sim N\left(0, \mathbf{I \sigma_u^2}\right)$
$\varepsilon \sim N\left(0, \mathbf{I \sigma_{\varepsilon}^2}\right)$
$\beta$ is a fixed effect (no variation)

So the variance of $\mathbf{Y}$ is

$$ \begin{align*} \operatorname{Var}(\mathbf{Y})= & V \\ = & \operatorname{Var}(\mathbf{W u})+\operatorname{Var}(\varepsilon) \\ = & \mathbf{W W^T \sigma_u^2+I \sigma_{\varepsilon}^2} \end{align*} $$

The standardized genotype matrix

$$ \begin{align*} \mathbf{W}=\left\{w_{i j}\right\}, \quad w_{i j}=\frac{X_{i j}-2 p_j}{\sqrt{2 p_j\left(1-p_j\right)}} \end{align*} $$

where

$X_{ij}$ is the genotype of the $i$-th individual at the $j$-th SNP
$p_j$ is the MAF of the $j$-th SNP

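Define the Genetic Relationship Matrix (GRM)

$$ \begin{align*} A=\frac{\mathbf{WW^T}}{N} \end{align*} $$

where $N$ is the number of SNPs, and let $\mathbf{\sigma_g^2}=N \mathbf{\sigma_u^2}$, so that $\mathbf{W W^T \sigma_u^2} = \mathbf{A \sigma_g^2}$

Rewriting the model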

$$ \begin{align*} \Rightarrow \quad \mathbf{Y=X \beta+g+\varepsilon}, \quad \mathbf{g} \sim N\left(\mathbf{0, A \sigma_g^2}\right) \end{align*} $$

We estimate $\sigma_g^2$ using REML (Restricted Maximum Likelihood). The proportion of phenotypic variance explained by the SNPs used to construct the GRM is given by

$$ \begin{align*} h^2_{\text{SNP}} = \frac{\sigma_g^2}{\text{Var}(Y)} \end{align*} $$

GWAS-based Heritability $(h_{\text {GWAS }}^2)$

This estimates heritability using only the significant SNPs identified in GWAS. Assuming $m$ significant SNPs are linearly associated with the trait

$$ \begin{align*} {Y}=\beta_0+\sum_{i=1}^m \beta_i X_i+\varepsilon \end{align*} $$

The heritability is

$$ \begin{align*} h_{\text {GWAS }}^2 = \frac{Var(\hat{Y})}{Var(Y)} \end{align*} $$

Relationship Between Heritability Types

$$ \begin{align*} h_{\text {family }}^2 > h_{\text {SNP }}^2 \gg h_{\text {GWAS }}^2 \end{align*} $$

This gap is known as missing heritability.

Code

We used the gcta64 command to estimate the Genetic Relationship Matrix (GRM) for each of the 22 chromosomes separately. Compared to estimating the GRM for all autosomes together, we found that the results are identical. However, the former method took approximately two and a half hours, significantly longer than the latter, which required only ten minutes.

Computing Separately

## estimate h^2_SNP
getwd()
setwd("D:/GWAS_CLASS/GCTA")


## step0: split snp data to different chr
for (i in 1:22) {
 system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i, " --make-bed --out D:/GWAS_CLASS/GCTA/data/merge_chr", i))
}

## step1: make GRM
# --maf: filter SNPs
# --make-grm: make GRM
# --thread-num: Parallel computation. You should generally not specify a number of threads that exceeds the number of physical cores.
for (i in 1:22) {
 system(paste0("gcta64 --bfile D:/GWAS_CLASS/20101123/process/merge --chr ", i ," --maf 0.01 --make-grm --out D:/GWAS_CLASS/GCTA/data/merge_chr", i, " --thread-num 10"))
}

## step2: build grm_list.txt, which lists the GRM file prefix of every chromosome
writeLines(paste0("D:/GWAS_CLASS/GCTA/data/merge_chr", 1:22), "D:/GWAS_CLASS/GCTA/grm_list.txt")

## step3: merge all the GRMs by the following command:
system("gcta64 --mgrm D:/GWAS_CLASS/GCTA/grm_list.txt --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge")

## step4: remove cryptic relatedness: 0.025 roughly corresponds to individuals who are less related than third-degree
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge --grm-cutoff 0.025 --make-grm --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered")

## step5: estimating the variance explained by the SNPs (heritability)
### input: filtered GRM from step 4 (grm_merge_filtered) + phenotype info (pheno.txt)
system("gcta64 --grm D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --pheno D:/GWAS_CLASS/20101123/process/pheno.txt --reml --out D:/GWAS_CLASS/GCTA/data/grm_merge_filtered --thread-num 10")

Computing Together

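## estimate h^2_GWAS
library(data.table)

# export the significant SNPs as additive (0/1/2) allele counts
system("plink --bfile D:/GWAS_CLASS/20101123/process/merge --extract D:/GWAS_CLASS/20101123/process/significant_snp.txt --recodeA --out D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp")

df <- fread("D:/GWAS_CLASS/GCTA/data/GWAS_sig_snp.raw")
pheno <- fread("D:/GWAS_CLASS/20101123/process/pheno.txt")
# drop the six leading .raw columns (FID, IID, PAT, MAT, SEX, PHENOTYPE); assumes pheno.txt is in the same sample order
df <- df[,-(1:6)]
df$y <- pheno$Height

# the R-squared of this regression is Var(Y_hat) / Var(Y), i.e. h^2_GWAS
lm_sigsnp <- lm(y ~ ., data = df)
summary(lm_sigsnp)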

Result

# Summary result of REML analysis:
# Source Variance SE
# V(G) 10.927486 69.918584
# V(e) 0.000011 69.472995
# Vp 10.927497 5.158109
# V(G)/Vp 0.999999 6.357631
#
# Sampling variance/covariance of the estimates of variance components:
# 4.888608e+03 -4.844250e+03
# -4.844250e+03 4.826497e+03

Reference

Labrador
Surface from bing

GCTA

Fri, 07 Mar 2025 00:00:00 +0000

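GCTA (see the GCTA website) computes heritability: take the post-QC SNPs, compute the GRM, estimate the variance components with REML, and obtain the heritability. The traditional approach uses GWAS to find the loci significantly associated with the trait and computes the heritability from those loci. However, some loci affect the trait with very small effect sizes and are hard to find with GWAS. The advantage of GCTA is that it estimates the heritability of all SNPs jointly, without first searching for significant SNPs with GWAS. If, for the same data, the heritability from GCTA is high but the heritability from the GWAS hits is low, collecting more samples should reveal more signal.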

GCTA

GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait.
GCTA’s five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation.

Note: GCTA estimates the variance of a group of SNPs rather than of one or two specific SNPs. The SNPs can be grouped by chromosome, or split into genic SNPs that code for protein and intergenic SNPs that do not directly affect the protein but do affect gene expression. GCTA also estimates how similar the SNPs of different samples are, and how much variation the SNPs explain.

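Process

Quality control the SNPs
Compute the GRM/relatedness matrix in GCTA
Choose a mixed linear model
Apply the REML method in GCTA

Note: using the QC'd SNPs, compute the GRM (relatedness matrix), choose a mixed model, and use the REML algorithm to obtain an approximately unbiased variance estimate; once the variance components are known, the heritability follows. QC such as MAF and Hardy-Weinberg equilibrium filtering is done before running GCTA, which itself offers only limited filtering.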

GRM (Genetic Relationship Matrix)

For $n$ individuals and $m$ SNPs

$\mathbf{X}$: $n \times m$ genotype matrix
$p_j$: MAF of the $j$-th SNP

$\mathbf{X}^{\text{norm}}$ is the standardized genotype matrix, with entries

$$ X_{ij}^{\text{norm}} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}} $$

Define the GRM $\mathbf{A}=\frac{1}{m}\mathbf{X}^{\text{norm}} (\mathbf{X}^{\text{norm}})'$, whose entry for individuals $j$ and $k$ is

$$ A_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(X_{ji} - 2p_i)(X_{ki} - 2p_i)}{2p_i(1 - p_i)} $$

Note: $A_{jk}$ is the inner product of the standardized SNP vectors (each of dimension $m$) of samples $j$ and $k$, divided by $m$. The GRM shows how genetically similar the samples are. Next, we remove samples that are genetically too similar.

GRM

Suppose the minor allele is $a$ and the MAF is $p$. Under Hardy-Weinberg equilibrium,

$$ \begin{align*} \left\lbrace \begin{array}{lll} P(SNP = AA) = (1-p)^2 & \text{with} & 0\text{ minor alleles} \\ P(SNP = Aa) = 2p(1-p) & \text{with} & 1\text{ minor allele} \\ P(SNP = aa) = p^2 & \text{with} & 2\text{ minor alleles} \\ \end{array} \right. \end{align*} $$

$$ \begin{align*} E(SNP) &= 0 \times (1-p)^2 +1 \times 2p(1-p) + 2 \times p^2 \\ & = 2p \end{align*} $$

$$ \begin{align*} Var(SNP) &= 0^2 \times (1-p)^2 +1^2 \times 2p(1-p) + 2^2 \times p^2 -E^2(SNP) \\ & = 2p(1-p) \end{align*} $$

{Example}

$$ \begin{array}{|c|c|c|c|} \hline \textbf{Individual} & \textbf{SNP1} & \textbf{SNP2} & \textbf{SNP3} \\ \hline A & 0 & 1 & 1 \\ \hline B & 1 & 0 & 1 \\ \hline \text{MAF} (p_j) & 0.25 & 0.25 & 0.5 \\ \hline \end{array} $$

$X^{\text{norm}}_{ij} = \frac{X_{ij} - 2p_j}{\sqrt{2p_j(1 - p_j)}}$

Standardize $X$

$$ \begin{array}{|c|c|c|c|} \hline individual & SNP1 & SNP2 & SNP3 \\ \hline A & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline B & \displaystyle \frac{1 - 0.5}{0.612} & \displaystyle \frac{0 - 0.5}{0.612} & \displaystyle \frac{1 - 1}{0.707} \\ \hline \end{array} $$

{Example}

the standardized genotype matrix is:

$$ X^{\text{norm}} = \begin{bmatrix} -0.816 & 0.816 & 0 \\ 0.816 & -0.816 & 0 \end{bmatrix} $$

The Genetic Relationship Matrix is $\frac{1}{3} X^{\text{norm}} (X^{\text{norm}})^T$

$$ \begin{bmatrix} 0.444 & -0.444 \\ -0.444 & 0.444 \end{bmatrix} $$
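The worked example can be checked numerically. The following is a minimal sketch assuming NumPy is available; the function name `grm` is illustrative and is not part of GCTA.

```python
import numpy as np

def grm(genotypes, maf):
    """Return A = X_norm X_norm' / m for an n x m matrix of minor-allele counts."""
    X = np.asarray(genotypes, dtype=float)
    p = np.asarray(maf, dtype=float)
    # Standardize each SNP column: subtract the mean 2p, divide by sqrt(2p(1-p)).
    X_norm = (X - 2 * p) / np.sqrt(2 * p * (1 - p))
    m = X.shape[1]
    return X_norm @ X_norm.T / m

# Individuals A and B with the MAFs from the example table.
X = [[0, 1, 1],
     [1, 0, 1]]
p = [0.25, 0.25, 0.5]
A = grm(X, p)
print(np.round(A, 3))
```

SNP3 contributes nothing here because both individuals carry exactly the expected count $2p_3 = 1$.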

{Exclude Close Relatives}

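If close relatives are included, the estimate could be a biased estimate of the total genetic variance

% Very similar genomes, such as those of siblings, come from a shared environment, and that environment may affect gene expression. If such data are included, the variance will be poorly estimated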

{Mixed Linear Model}

$$ \begin{align*} \mathbf{Y_{n \times 1}}= \underbrace{\mathbf{X_{n \times p} \beta_{p \times 1}} }_{\text{fixed term}} + \underbrace{\mathbf{W_{n \times q}u_{q \times 1}} }_{\text{random term}} +\mathbf{\varepsilon_{n \times 1}} \end{align*} $$

$\mathbf{u} \sim N(0,\mathbf{I} \sigma_u^2)$, $\mathbf{\varepsilon} \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
$\mathbf{W}$ is the standardized genotype matrix with $w_{ij} = \frac{x_{ij} - 2p_j}{\sqrt{2p_j(1-p_j)}}$, where $x_{ij}$ is the genotype of the $i$-th individual at the $j$-th SNP and $p_j$ is the MAF of the $j$-th SNP
$Var(\mathbf{Y}) = \mathbf{W W'}\sigma_u^2 +\mathbf{I} \sigma_\varepsilon^2$

% X contains sex, age, 20 PCs, trait?; u is the SNP effect % Y is the phenotype, i.e. a trait such as height or BMI; I misstated this last time

{Model 1}

To estimate the variance explained by all autosomal SNPs, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{\varepsilon} \end{align*} $$

The Mixed Linear Model is equivalent to this model with $\mathbf{A_g} = \frac{\mathbf{W W'}}{m},\ \sigma_\mathbf{g}^2 = m\sigma_\mathbf{u}^2$, where $m$ is the number of SNPs

$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
$Var(\mathbf{Y}) =\mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{I} \sigma_\varepsilon^2$

% This model treats all the SNPs as a single combined effect

{Model 2}

To partition genetic variance onto each of the 22 autosomes, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \sum_{i=1}^{22} \mathbf{g_i} + \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g_i} \sim N(0,\mathbf{A_i} \sigma_i^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_i}$ is the GRM computed from the SNPs on the $i$-th chromosome
$Var(\mathbf{Y}) = \sum_{i=1}^{22} \mathbf{A_i} \sigma_i^2 + \mathbf{I} \mathbf{\sigma_\varepsilon^2}$

% Group the SNPs by chromosome, giving 22 effects

{Model 3}

To estimate the variance of genotype-environment interaction effects, we specify the model as

$$ \begin{align*} \mathbf{Y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{ge} + \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g} \sim N(0,\mathbf{A_g} \sigma_\mathbf{g}^2)$, $\mathbf{ge} \sim N(0,\mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$, where $\mathbf{A_g}$ is GRM
$Var(\mathbf{Y}) = \mathbf{A_g} \sigma_\mathbf{g}^2 + \mathbf{A_{ge}} \sigma_{\mathbf{ge}}^2 + \mathbf{I} \sigma_\varepsilon^2$
$$ \mathbf{A_{ge}}= \left\lbrace \begin{array}{lll} \mathbf{A_{g}} & \text{if} & \text{pairs of individuals in the same environment} \\ \mathbf{0} & \text{if} & \text{pairs of individuals in different environment}\end{array} \right. $$

% If we want to know how the environment affects gene expression, we can add a SNP-by-environment interaction term, denoted ge here

{Model 3}

$Cov(y_i,y_k) = \left\lbrace \begin{array}{lll} {A_{ik}}(\sigma_{\mathbf{ge}}^2+\sigma_{\mathbf{g}}^2) & \text{if} & \text{same environment} \\ {A_{ik}} \sigma_{\mathbf{g}}^2 & \text{if} & \text{different environment} \end{array} \right.$
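The block structure of $\mathbf{A_{ge}}$ can be built directly from environment labels. This is a small sketch assuming NumPy; the function name `ge_relationship`, the toy GRM, the environment labels, and the variance components are all hypothetical values chosen for illustration.

```python
import numpy as np

def ge_relationship(A_g, env):
    """Keep A_g entries for pairs sharing an environment, zero elsewhere."""
    env = np.asarray(env)
    same_env = env[:, None] == env[None, :]   # n x n boolean mask
    return np.where(same_env, A_g, 0.0)

# Toy GRM for three individuals (illustrative values only).
A_g = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
env = ["urban", "urban", "rural"]             # hypothetical environment labels
A_ge = ge_relationship(A_g, env)

sigma_g2, sigma_ge2 = 0.5, 0.2                # illustrative variance components
genetic_cov = A_g * sigma_g2 + A_ge * sigma_ge2
```

For the pair in the same environment the covariance is $A_{ik}(\sigma_{\mathbf{g}}^2 + \sigma_{\mathbf{ge}}^2)$; for pairs in different environments only the $\sigma_{\mathbf{g}}^2$ term remains.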

{Build Model}

For model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$

$\mathbf{g_{cis}} \sim N(0,\mathbf{A_{cis}} \sigma_{cis}^2)$, $\mathbf{g_{trans}} \sim N(0,\mathbf{A_{trans}} \sigma_{trans}^2)$, $\mathbf{g_{GE}} \sim N(0,\mathbf{A_{GE}} \sigma_{GE}^2)$, $\varepsilon \sim N(0,\mathbf{I} \sigma_\varepsilon^2)$
$\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A_{cis}} \sigma_{cis}^2 + \mathbf{A_{trans}} \sigma_{trans}^2+ \mathbf{A_{GE}} \sigma_{GE}^2+ \mathbf{I} \sigma_\varepsilon^2$
$\theta = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_{\varepsilon}^2)$

Log likelihood function

$$ \begin{align*} & L_Y(\beta, \theta) \\ = & -\frac{n}{2} \ln(2\pi) -\frac{1}{2} \ln|\mathbf{V}| -\frac{1}{2} \mathbf{(Y - X\beta)^T V^{-1} (Y - X\beta)} \\ \propto & -\frac{1}{2} \left[ \ln |\mathbf{V}| + (\mathbf{Y} - \mathbf{X}\beta)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \right] \end{align*} $$

REML

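REML (restricted maximum likelihood) maximizes a log-likelihood that does not depend on $\beta$. Compared to ML, the REML variance estimator is less biased because it accounts for the degrees of freedom used to estimate the fixed effects. Let $\mathbf{M}$ be an $(n-p) \times n$ matrix of full row rank such that $\mathbf{M}\mathbf{X} = 0$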

$$ \begin{align*} &\text{Let } \mathbf{W} = \mathbf{M}\mathbf{Y} \\ &E(\mathbf{W}) = \mathbf{M}\mathbf{X}\beta = 0 \\ &\text{Var}(\mathbf{W}) = \mathbf{M}\mathbf{V}\mathbf{M}^T \end{align*} $$

Transform $\mathbf{X}$ to $\mathbf{M}\mathbf{X}$ and $\mathbf{Y}$ to $\mathbf{M}\mathbf{Y}$. Since $\mathbf{W} \sim N(0, \mathbf{M}\mathbf{V}\mathbf{M}^T)$, the log-likelihood of $\mathbf{W}$ is

$$ \begin{align*} L_\text{REML}(\theta) &\propto -\frac{1}{2} \left[ \ln | \mathbf{M}\mathbf{V}\mathbf{M}^T | + (\mathbf{MY} - \mathbf{MX}\beta)^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} (\mathbf{MY} - \mathbf{MX}\beta) \right] \\ &= -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{Y}^T\mathbf{M}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{M}\mathbf{Y} \right] \\ &= -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \\ &= -\frac{1}{2} \left[ \ln |\mathbf{V}| + \ln |\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}| + (\mathbf{Y} - \mathbf{X}\hat{\beta})^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\hat{\beta}) \right] + \text{const} \end{align*} $$

where $\hat{\beta} = (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}$ is the generalized least squares estimate and the constant does not depend on $\theta$.

Therefore

$$ \begin{align*} &L_\text{REML}(\theta) = -\frac{n-p}{2}\ln(2 \pi) -\frac{1}{2} \left[ \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| + \mathbf{W}^T (\mathbf{M}\mathbf{V}\mathbf{M}^T)^{-1} \mathbf{W} \right] \end{align*} $$

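The log-likelihood of $\mathbf{W}$ does not depend on $\beta$.

Some problems

If $\mathbf{M}$ is taken to be an $n \times n$ projection matrix, $\mathbf{M}\mathbf{V}\mathbf{M}^T$ is singular

$$\Rightarrow \ln |\mathbf{M}\mathbf{V}\mathbf{M}^T| \to -\infty$$

REML avoids this by using only $n - p$ linearly independent rows of $\mathbf{M}$, or equivalently the form written in terms of $\mathbf{V}$ and $\mathbf{X}$.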

Generalized least square (GLS)

$$ \begin{align*} &\mathbf{Y} = \mathbf{X}\beta + \epsilon, \quad \text{Var}(\epsilon) = \mathbf{V}, \quad \mathbf{V} = \mathbf{\Sigma} \mathbf{\Sigma}^T \\ &E(\epsilon) = 0 \\ &\mathbf{\Sigma}^{-1}\mathbf{Y} = \mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{\Sigma}^{-1}\epsilon \\ &\tilde{\mathbf{Y}} = \tilde{\mathbf{X}}\beta + \tilde{\epsilon} \\ &\text{Var}(\tilde{\epsilon}) = \text{Var}(\mathbf{\Sigma}^{-1}\epsilon) \\ &\quad = \mathbf{\Sigma}^{-1} \text{Var}(\epsilon) (\mathbf{\Sigma}^{-1})^T \\ &\quad = \mathbf{\Sigma}^{-1} \mathbf{\Sigma} \mathbf{\Sigma}^T (\mathbf{\Sigma}^{-1})^T \\ &\quad = \mathbf{I} \end{align*} $$

$$ \begin{align*} \hat{\beta} &= (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \tilde{\mathbf{Y}} \\ &= \left[ \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right) \right]^{-1} \left( \mathbf{\Sigma}^{-1}\mathbf{X} \right)^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T(\mathbf{\Sigma}^{-1})^T \mathbf{\Sigma}^{-1}\mathbf{X})^{-1} \mathbf{X}^T(\mathbf{\Sigma}^{-1})^T \mathbf{\Sigma}^{-1}\mathbf{Y} \\ &= (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y} \end{align*} $$
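The equivalence between whitening followed by ordinary least squares and the closed-form GLS estimator can be verified on simulated data. This is a sketch assuming NumPy; the dimensions, the covariance matrix, and the true coefficients are arbitrary illustrative choices, and the Cholesky factor is used as one possible $\mathbf{\Sigma}$ with $\mathbf{V} = \mathbf{\Sigma}\mathbf{\Sigma}^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
B = rng.normal(size=(n, n))
V = B @ B.T + n * np.eye(n)            # symmetric positive-definite covariance
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), V)

# Route 1: whiten with Sigma^{-1} (Sigma = Cholesky factor of V), then run OLS.
Sigma = np.linalg.cholesky(V)
X_t = np.linalg.solve(Sigma, X)        # Sigma^{-1} X
Y_t = np.linalg.solve(Sigma, Y)        # Sigma^{-1} Y
beta_whitened = np.linalg.lstsq(X_t, Y_t, rcond=None)[0]

# Route 2: closed form (X' V^{-1} X)^{-1} X' V^{-1} Y.
Vinv_X = np.linalg.solve(V, X)
Vinv_Y = np.linalg.solve(V, Y)
beta_closed = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

print(beta_whitened, beta_closed)
```

Both routes give the same estimate up to floating-point error, which is the content of the last line of the derivation.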

{EM Algorithm}

To estimate $(\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$, we use the EM algorithm as an initial step to determine the direction of the iteration updates

$$ \begin{align*} \sigma^{2(1)}_i = \frac{1}{n} \left[ \sigma^{4(0)}_i \mathbf{Y^T P A_i P Y} + \operatorname{tr} (\sigma^{2(0)}_i \mathbf{I} - \sigma^{4(0)}_i \mathbf{P A_i}) \right] \end{align*} $$

where

$$ \begin{align*} & i = 1, \cdots, 4 \\ & \mathbf{P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}} \end{align*} $$
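One EM update can be written out directly from the two formulas above. The sketch below assumes NumPy and uses a single genetic component plus the residual on simulated data; the function names `projection_matrix` and `em_update`, the data, and the starting values are illustrative and are not GCTA's implementation.

```python
import numpy as np

def projection_matrix(V, X):
    """P = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}."""
    Vinv = np.linalg.inv(V)
    Vinv_X = Vinv @ X
    return Vinv - Vinv_X @ np.linalg.solve(X.T @ Vinv_X, Vinv_X.T)

def em_update(Y, X, A_list, sigma2):
    """One EM step; A_list holds one matrix per component, identity for the residual."""
    n = len(Y)
    V = sum(s * A for s, A in zip(sigma2, A_list))
    P = projection_matrix(V, X)
    PY = P @ Y
    updated = []
    for s, A in zip(sigma2, A_list):
        quad = PY @ A @ PY                              # Y' P A_i P Y
        trace = np.trace(s * np.eye(n) - s**2 * P @ A)  # tr(s I - s^2 P A_i)
        updated.append((s**2 * quad + trace) / n)
    return np.array(updated), P

# Simulated data: n individuals, m standardized SNPs, intercept-only fixed effect.
rng = np.random.default_rng(1)
n, m = 40, 200
W = rng.normal(size=(n, m))
A = W @ W.T / m
X = np.ones((n, 1))
Y = rng.normal(size=n)
sigma2_new, P = em_update(Y, X, [A, np.eye(n)], sigma2=[0.5, 0.5])
```

Note that $\mathbf{P}\mathbf{X} = 0$, so the update does not depend on the fixed effects.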

{Average Information Algorithm}

After one EM iteration, GCTA switches to the average information algorithm

$$ \begin{align*} \bm{\theta}^{(t+1)} = \bm{\theta}^{(t)} + (\mathbf{AI}^{(t)})^{-1} \frac{\partial L}{\partial \bm{\theta}} \Big|_{\bm{\theta}^{(t)}} \end{align*} $$

where $\bm{\theta} = (\sigma_{cis}^2, \sigma_{trans}^2, \sigma_{GE}^2, \sigma_\varepsilon^2) = (\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_4 ^2)$

The iteration stops when $L^{(t+1)}-L^{(t)}< 10^{-4}$
In the iteration process, if $\sigma_i^2<0$, set $\sigma_i^2= 10^{-6}\sigma_Y^2$, where $\sigma_Y^2$ is phenotype variance

{REML Method}

$$ \begin{align*} & \mathbf{AI}=\frac{1}{2}\left[\begin{array}{cccc} \mathbf{Y}' \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} & \cdots & \mathbf{Y}' \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} & \mathbf{Y}' \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{P} \mathbf{Y} \\ \vdots & \ddots & \vdots & \vdots \\ \mathbf{Y}' \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} & \cdots & \mathbf{Y}' \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} & \mathbf{Y}' \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{P} \mathbf{Y} \\ \mathbf{Y}' \mathbf{P} \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} & \cdots & \mathbf{Y}' \mathbf{P} \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} & \mathbf{Y}' \mathbf{P} \mathbf{P} \mathbf{P} \mathbf{Y} \end{array}\right] ;\\ & \frac{\partial L}{\partial \boldsymbol{\theta}}=-\frac{1}{2}\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{P} \mathbf{A}_1\right)-\mathbf{Y}' \mathbf{P} \mathbf{A}_1 \mathbf{P} \mathbf{Y} \\ \vdots \\ \operatorname{tr}\left(\mathbf{P} \mathbf{A}_r\right)-\mathbf{Y}' \mathbf{P} \mathbf{A}_r \mathbf{P} \mathbf{Y} \\ \operatorname{tr}(\mathbf{P})-\mathbf{Y}' \mathbf{P} \mathbf{P} \mathbf{Y}\end{array}\right] \end{align*} $$

{REML (restricted maximum likelihood)}

Since $\mathbf{M}\mathbf{X} = 0$, we have $\mathbf{M}\mathbf{X}\beta = 0$. Hence the likelihood of $\mathbf{W} = \mathbf{M}\mathbf{Y}$ doesn’t depend on $\beta$
Compared to ML, REML is less affected by fixed effects
Compared to ML, REML has lower bias.

{Heritability}

Assume model

$$ \begin{align*} \mathbf{Y_{n \times 1}} = \mathbf{X\beta} + \mathbf{g}_{cis} + \mathbf{g}_{trans} + \mathbf{g}_{GE}+ \mathbf{\varepsilon} \end{align*} $$

where $\mathbf{V} = Var(\mathbf{Y}) =\mathbf{A_{cis}} \sigma_{cis}^2 + \mathbf{A_{trans}} \sigma_{trans}^2+ \mathbf{A_{GE}} \sigma_{GE}^2+ \mathbf{I} \sigma_\varepsilon^2$

Let $\sigma_Y^2 = \sigma_{cis}^2 + \sigma_{trans}^2 + \sigma_{GE}^2 + \sigma_\varepsilon^2$ be the total phenotypic variance. Then

$h^2_{\mathbf{g},cis} = \frac{\sigma^2_{cis}}{\sigma_Y^2}$
$h^2_{\mathbf{g},trans} = \frac{\sigma^2_{trans}}{\sigma_Y^2}$
$h^2_{\mathbf{g},GE} = \frac{\sigma^2_{GE}}{\sigma_Y^2}$

Reference

RNA-Sequencing

Sun, 16 Feb 2025 00:00:00 +0000

Introduction

RNA sequencing (RNA-Seq) is a powerful technology that enables researchers to quantify RNA levels, identify novel transcripts, and analyze alternative splicing events, offering deeper insights into cellular function and disease mechanisms. Key processes such as read alignment, normalization methods like RPKM and TPM, and splicing analysis ensure accurate interpretation of RNA-Seq data. However, despite its advantages, RNA-Seq faces challenges, including biases in library preparation, sequencing errors, and the complexity of data analysis.

Workflow

Alternative Splicing

A cellular process in which exons from the same gene are joined in different combinations, resulting in distinct but related mRNA transcripts (isoforms).

Read Sequencing Technology

There are three common technologies for splitting transcripts into reads.

Illumina
Nanopore
PacBio

High-throughput technology refers to massively parallel sequencing, which generates millions to billions of reads in a single experiment.

Read Alignment

There are two types of read alignment: aligning reads to a reference transcriptome and aligning reads to a reference genome.
Aligning reads to reference transcriptome:

Performs well if the reference transcriptome is sufficiently complete
Faster
Used for computing gene expression

Aligning reads to reference genome:

Enables the identification of new isoforms (novel isoforms)
Handles intron-exon structure, providing a more complete but slower analysis

When introns are large, the latter method (aligning to the genome) requires significantly more time to process intron-exon structures. On the other hand, transcriptome-based alignment is faster because it does not handle intron-exon structures. However, this trade-off makes it difficult to detect rare disease-related transcripts and reduces accuracy when the reference transcriptome is incomplete.

RPKM

RPKM (Reads Per Kilobase per Million mapped reads) is a unit of gene expression.

$$ \begin{align*} \text{RPKM}_g = \frac{r_g \times 10^9}{{fl}_g \times R} \end{align*} $$

where

$$ \begin{align*} & r_g =\text{ number of reads mapped to gene } g \\ & {fl}_g = \text{ mapped length of gene } g \text{ (bp)} \\ & R = \text{ total number of mapped reads across all genes} \end{align*} $$

Example

Genes are not limited to A, B, and C in this example. Let’s focus on gene A in sample 1, which has $12$ reads and a mapped gene length of $600$. The total number of reads across all genes is $6 \times 10^6$.

$$ \begin{align*} \text{RPKM}_A & = \frac{12 \times 10^9}{600 \times 6 \times 10^6} \\ & \approx 3.33 \end{align*} $$
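The same arithmetic as a small function; the name `rpkm` and its parameter names are illustrative.

```python
def rpkm(read_count, gene_length_bp, total_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_reads)

# Gene A in sample 1: 12 reads, length 600, 6 million reads in total.
rpkm_a = rpkm(read_count=12, gene_length_bp=600, total_reads=6e6)
print(round(rpkm_a, 2))  # 3.33
```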

TPM

TPM (Transcripts Per Million) is a unit of gene expression.

$$ \begin{align*} \text{TPM}_g=\frac{r_g \times rl \times 10^6}{{fl}_g \times T}, \end{align*} $$

where

$$ \begin{align*} & r_g =\text{ number of reads mapped to gene } g \\ & rl = \text{ read length} \\ & {fl}_g = \text{ mapped length of gene } g \\ & T=\displaystyle \sum_{g \in G} \frac{r_g \times rl}{{fl}_g} \end{align*} $$

Read length depends on the sequencing technology rather than the transcript itself, so we use $rl$ instead of $rl_g$.
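A minimal sketch of the TPM formula follows. The read counts and gene lengths are hypothetical values chosen for illustration, not the data of the example in this post, and the function name `tpm` is an assumption.

```python
def tpm(read_counts, gene_lengths_bp, read_length=1.0):
    """TPM per gene; read_length cancels out but is kept to mirror the formula."""
    rates = [r * read_length / fl for r, fl in zip(read_counts, gene_lengths_bp)]
    total = sum(rates)                      # T in the formula
    return [rate * 1e6 / total for rate in rates]

# Three hypothetical genes whose counts are proportional to their lengths.
values = tpm(read_counts=[12, 30, 8], gene_lengths_bp=[600, 1500, 400], read_length=100)
print(values)
```

Because every gene here has the same number of reads per base, all three receive the same TPM, and the values always sum to $10^6$ within a sample.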

Example

We normalize by gene length because longer genes collect more reads at the same expression level. Therefore, the read count divided by gene length (in kilobases) is defined as RPK (Reads Per Kilobase).

Suppose there are only genes A, B, and C, and the total RPK values are $650$ and $700$, respectively.

Additional

Sequencing Depth

Sequencing depth refers to the average number of times a nucleotide is mapped by a read. A higher sequencing depth generates more informative reads but comes at a higher cost.

$$ \begin{align*} \text{Sequencing Depth} = \frac{\text{reads length}\times \text{reads number}}{\text{reference sequence length}} \end{align*} $$

Example

$10^8$ reads with length $150$ bp, the reference sequence length $3 \times 10^9$ bp.

$$ \begin{align*} \text{Sequencing Depth} & = \frac{10^8 \times 150}{3 \times 10^9} \\ & = 5 \text{X} \end{align*} $$
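The depth formula, and its inverse for planning an experiment, can be sketched as follows. The function names and the 30X target are illustrative assumptions.

```python
import math

def sequencing_depth(read_length_bp, n_reads, reference_length_bp):
    """Average number of reads covering each base of the reference."""
    return read_length_bp * n_reads / reference_length_bp

def reads_needed(target_depth, read_length_bp, reference_length_bp):
    """Smallest number of reads that reaches the target depth."""
    return math.ceil(target_depth * reference_length_bp / read_length_bp)

depth = sequencing_depth(150, 10**8, 3 * 10**9)   # the example above
reads_30x = reads_needed(30, 150, 3 * 10**9)      # hypothetical 30X target
print(depth, reads_30x)
```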

Reference

Summary PrediXcan

Thu, 02 Jan 2025 00:00:00 +0000

Background

GWAS identifies SNPs that affect a trait, but the mechanism is unknown. We use S-PrediXcan to determine whether SNPs affect the trait through gene expression.

Gene Regulation

Although every cell in the body contains the same DNA sequence, cells do not express the same set of genes. In each cell, a different subset of the genes encoded in the DNA is transcribed into mRNA and translated into protein. The process of producing mRNA and protein from a gene is called gene expression, and the mechanism that controls which genes are expressed is called gene regulation. If a human chromosome were stretched out linearly, it would be over $4$ cm long, and if every gene were expressed, the cell would have to be enormous.

Alternative RNA splicing is a common mechanism of gene regulation in eukaryotes. Up to $70\%$ of genes in humans are expressed as multiple proteins through it. The pre-mRNA is made up of introns and exons. Different combinations of introns, and sometimes exons, are removed from the primary transcript, so the spliced mRNAs encode different proteins.

Colocalization

Colocalization means that the GWAS and eQTL signals overlap at the same locus. It helps determine whether a SNP found by GWAS affects gene expression. There are three possible scenarios

Linkage:
Two independent causal variants are closely located in the genome, leading to overlapping signals.
Causality:
A SNP directly affects the trait by changing gene expression, representing a direct causal relationship.
Pleiotropy:
A single SNP independently affects multiple traits. The association between these traits is caused by the same SNP, but the effects occur through different biological pathways.

We use Mendelian Randomization to check whether the scenario is causality or pleiotropy.

Heterogeneity

Allelic heterogeneity:
A similar phenotype is produced by different alleles within the same gene.
Locus heterogeneity:
A similar phenotype is produced by mutations at different loci.

Bayesian Sparse Linear Mixed Models (BSLMM)

For $n$ samples and $p$ SNPs

$$ \begin{align*} Y_i = \sum_{j=1}^p X_{ij} \beta_j + u_i + \epsilon_i \end{align*} $$

$Y_i$: phenotype of i-th sample
$X_{ij}$: genotype of i-th sample at j-th SNP
$\beta_j$: effect size
$u_i$: random effect for i-th sample
$\epsilon_i$: error term

Sparse Component
Most SNP effect sizes are zero, so the model contains only the important SNPs.

Polygenic Component
Many SNP effect sizes are very small, so the SNPs contribute to the trait jointly.

Method

GWAS and PrediXcan

We assume that the phenotype is a linear function of $X_l$ and of $T_g$, respectively.

$$ \begin{align} & Y = \alpha_1 + X_l \beta_l + \eta \\ & Y = \alpha_2 + T_g \gamma_g + \epsilon \end{align} $$

$\alpha_1 , \ \alpha_2$ are constants
$\eta, \ \epsilon$ are error terms
$T_g = \sum_{l \in \text{Model}_g}^{} w_{lg} X_l$, predicted gene expression (transcriptome)
$\text{Var}(T_g) = \hat{\sigma}_g^2$
$X_l$ is $l$-th SNP allelic dosage (genotype)
$\text{Var}(X_l) = \hat{\sigma}_l^2$
$Y$ is level of the trait (phenotype)
$\text{Var}(Y) = \hat{\sigma}_Y^2$

PrediXcan

$w_{lg}$ from predictDB
get transcriptome $\hat{T}$
get $\hat{\gamma_g}$

PrediXcan is a computational algorithm developed to exploit GTEx data, including the identification of eQTLs and their relationship to complex traits. PrediXcan evaluates the aggregate effects of cis-regulatory variants (within 1 Mb upstream or downstream of genes of interest) on gene expression via an elastic net regression method, and consequently, PrediXcan may identify loci with modest to weak effect sizes that do not achieve significance in variant-based association studies.

S-PrediXcan

$w_{lg}$ from predictDB
$\hat{\sigma}_g$ from training set or reference set
$\hat{\beta}_l, \ \text{se}(\hat{\beta}_l)$ from GWAS

We get

$$ \begin{align*} Z_g = & \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta}_l)} {\sqrt{\dfrac{1-R_l^2}{1-R_g^2}}} \\ \approx & \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta}_l)} \end{align*} $$

Make sure that the GWAS and the prediction model are based on the same population.
Get $\hat{\gamma_g}$ and the z-score
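The approximate $Z_g$ is a weighted sum of GWAS z-scores. The sketch below uses only the standard library; the function names and every number in the example (weights, standard deviations, GWAS estimates) are hypothetical and serve only to show the arithmetic. The two-sided p-value assumes $Z_g$ is standard normal under the null.

```python
import math

def spredixcan_z(weights, snp_sd, gwas_beta, gwas_se, gene_sd):
    """Approximate Z_g = sum_l w_lg * (sigma_l / sigma_g) * beta_l / se(beta_l)."""
    return sum(
        w * (sd / gene_sd) * (b / se)
        for w, sd, b, se in zip(weights, snp_sd, gwas_beta, gwas_se)
    )

def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical two-SNP model for one gene.
z_g = spredixcan_z(
    weights=[0.5, -0.2],      # w_lg, as would come from a prediction model
    snp_sd=[0.6, 0.7],        # sigma_l, from a reference panel
    gwas_beta=[0.10, -0.05],  # beta_l, from GWAS summary statistics
    gwas_se=[0.02, 0.025],    # se(beta_l)
    gene_sd=0.4,              # sigma_g
)
p_g = two_sided_p(z_g)
print(z_g, p_g)
```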

PVE by SNP and Transcriptome

Proportion of variance explained (PVE) by covariate $X_l$ and $T_g$ are

$$ \begin{align*} & R_g^2 = \frac{ \text{var}(T_g \hat{\gamma_g} ) }{ \text{var}(Y) } = \hat{\gamma}_g^2 \ \frac{\hat{\sigma}_g^2}{\hat{\sigma}_Y^2} \\ & R_l^2 = \frac{ \text{var}(X_l \hat{\beta}_l ) }{ \text{var}(Y) } = \hat{\beta}_l^2 \ \frac{\hat{\sigma}_l^2}{\hat{\sigma}_Y^2} \end{align*} $$

Predicted Effect Size

We represent $\hat{\sigma}_g^2$ in matrix form

$$ \begin{align*} \hat{\sigma}_g^2 &= \text{Var}\left(\sum_{l \in \text{Model}_g} w_{lg} X_l\right) \\ &= \text{Var}(\mathbf{X}_g\mathbf{W}_g) \\ &= \mathbf{W}_g' \cdot \text{Var}(\mathbf{X}_g) \cdot \mathbf{W}_g \\ &= \mathbf{W}_g' \cdot \mathbf{\Gamma}_g \cdot \mathbf{W}_g \end{align*} $$

$\mathbf{X}_g$ is the $n \times p$ matrix of SNP data in the model of gene $g$
$\bar{\mathbf{X}}_g$ is the $n \times p$ matrix whose column $l$ is filled with the column mean of $X_l$
$\mathbf{W}_g$ is the vector of $w_{lg}$ for the SNPs in the model of $g$
$\mathbf{\Gamma}_g = \frac{1}{n}(\mathbf{X}_g - \bar{\mathbf{X}}_g)'(\mathbf{X}_g - \bar{\mathbf{X}}_g)$, the sample covariance matrix of $\mathbf{X}_g$

Under the linear model, the estimated effect size (coefficient) of covariate $X_l$ is

$$ \begin{align*} & \hat{\beta}_l = \frac{\text{Cov}(X_l, Y)}{\text{Var}(X_l)} = \frac{\text{Cov}(X_l, Y)}{\hat{\sigma}_l^2} \\ \Rightarrow \ & \text{Cov}(X_l, Y) = \hat{\beta}_l \hat{\sigma}_l^2 \end{align*} $$

And coefficient of covariate $T_g$ is

$$ \begin{align*} \hat{\gamma_g} &= \dfrac{\text{Cov}(T_g, Y)}{\hat{\sigma}_g^2} \\ &= \dfrac{\text{Cov}(\sum_{l \in \text{Model}_g} w_{lg} X_l, Y)}{\hat{\sigma}_g^2} \\ &= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \text{Cov}(X_l, Y)}{\hat{\sigma}_g^2} \\ &= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \hat{\beta}_l\hat{\sigma}_l^2}{\hat{\sigma}_g^2} \end{align*} $$

By the linear assumption

$$ \begin{align*} & Y = \alpha_1 + X_l \beta_l + \eta \\ \Rightarrow \ & \hat{\sigma}_Y^2 = \hat{\sigma}_\eta^2 + \hat{\sigma}_l^2 \hat{\beta}_l^2 \end{align*} $$

We rewrite the variance

$$ \begin{align*} \text{var}(\hat{\beta_l}) &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(Y_i-\bar{Y})}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})Y_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(\alpha_1 + X_{li} \beta_l + \eta_i)}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l}) \eta_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})^ 2 \cdot \sigma_{\eta}^2}{(\sum_{i=1}^n (X_{li}-\bar{X_l})^2)^2} \\ &= \dfrac{\sigma_{\eta}^2}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2} \\ &\approx \frac{\hat{\sigma}_\eta^2}{n\hat{\sigma}_l^2} \\ &= \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_l^2 \hat{\beta}_l^2}{n\hat{\sigma}_l^2} \\ &= \frac{\hat{\sigma}_Y^2(1 - R_l^2)}{n\hat{\sigma}_l^2} \end{align*} $$

$$ \begin{align} \frac{\hat{\sigma}_Y^2}{n} = \dfrac{se^2(\hat{\beta_l}) \cdot\hat{\sigma}_l^2}{1 - R_l^2} \end{align} $$

Similarly,

$$ \begin{align*} \text{var}(\hat{\gamma_g}) = \frac{\hat{\sigma}_Y^2}{n} \cdot \frac{(1 - R_g^2)}{\hat{\sigma}_g^2} \end{align*} $$

Substituting the expression for $\frac{\hat{\sigma}_Y^2}{n}$ above,

$$ \begin{align*} se(\hat{\gamma_g}) & = \sqrt{\text{var}(\hat{\gamma_g})} \\ & = se(\hat{\beta_l}) \cdot \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \cdot \sqrt{ \frac{(1 - R_g^2)}{(1 - R_l^2)}} \end{align*} $$

We infer PrediXcan results ($\hat{\gamma_g},\ \text{se}(\hat{\gamma_g})$) using GWAS results ($\hat{\beta}_l,\ \text{se}(\hat{\beta}_l)$), SNPs information ($\hat{\sigma}_l^2,\ \mathbf{\Gamma}_g$) and PredictDB weights ($w_{lg}$).

Results

Compare PrediXcan and S-PrediXcan

$w_{lg}$ come from PredictDB, whose models are based on the European Depression Genes and Networks (DGN) whole blood data, GTEx, Framingham, etc. The training set will usually be different from the study set. When individual-level data are not available from the training set, we use population reference sets such as 1000 Genomes data.

$Y$ is a simulated phenotype generated under $H_0:$ the phenotype is independent of the transcriptome (predicted gene expression). So model (2) has no $\hat{T}$ covariate, only some environmental covariates.
The study sets (GWAS sets) and reference sets (LD calculation sets) both consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project

Within the same population, S-PrediXcan and PrediXcan results are highly correlated. Even across different populations, they are still highly correlated. Furthermore, for the AFR study/reference set, the $r^2$ for EUR is higher than for EAS.

$Y$ is intrinsic growth phenotype
study sets were a subset of 140 individuals from each of the African, Asian, and European groups from 1000 Genomes Project, and reference sets consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project

The study sets contain fewer samples. This may increase $se(\hat{\beta})$ and therefore decrease the z-scores. The PrediXcan and S-PrediXcan results differ slightly, so the $r^2$ in the diagonal plots is smaller than in Figure 2a.

$Y$ is from the bipolar disorder and type 1 diabetes studies
The study sets consisted of British individuals, and the reference set was the European population subset of the 1000 Genomes Project

Colocalization Status of S-PrediXcan

Five conditions

$H_0:$ the SNP signal is associated with neither eQTL nor GWAS.
$H_1:$ the SNP signal is associated with eQTL but not GWAS.
$H_2:$ the SNP signal is associated with GWAS but not eQTL.
$H_3:$ the SNP signal is associated with both eQTL and GWAS, and the signals are independent (pleiotropy).
$H_4:$ the SNP signal is associated with both eQTL and GWAS, and the signal is shared (colocalized).

If we keep only Bonferroni-significant S-PrediXcan results, associations tend to cluster into three distinct regions

Compare S-TWAS and S-PrediXcan

The difference between S-TWAS and S-PrediXcan is the prediction model. TWAS uses BSLMM whereas PrediXcan uses elastic net
For the COLOC-estimated proportion of non-colocalized associations, the polygenic component of BSLMM considers the effects of combinations of many SNPs, which increases the chance of a non-colocalized result.
Mancuso et al. filtered out genes with low GCTA heritability, so TWAS has fewer significant genes than PrediXcan. But the significance levels of TWAS and PrediXcan are similar.

Predicted Performance by Trait

Prediction performance is better when

the prediction performance $R^2$ increases
the prediction performance $p$-value decreases

Z-scores increase when prediction performance is better. That is, S-PrediXcan associations tend to be more significant when the prediction is more reliable.

Hypothesis

Example

Reference

data

GERA data
GTEx
1000 G
summary statistic

Polygenic Risk Score

Wed, 27 Nov 2024 00:00:00 +0000

Background

A Polygenic Risk Score (PRS) is a score that estimates an individual's risk of a disease or disease-related trait. The higher the PRS, the greater the combined effect of the individual's SNPs on the disease. However, the PRS transfers poorly across ancestries: if the data sample is European, the PRS performs well for Europeans only. The paper discussed here adds the PTRS, which improves portability across ancestries.

Data

Use 356,476 unrelated Europeans in the UK Biobank as the discovery set. For the training sets, the first set of models had been trained on whole blood from European individuals in GTEx v8. The second set of models had been trained using array-based expression in monocyte samples of Europeans from MESA.

$$ \begin{align*} f(\mathbf{X}) = T \end{align*} $$

To predict the transcriptome ($\hat{T}$, the set of predicted gene expression levels), we downloaded the prediction weights (coefficients of the linear function) for the GTEx and MESA models collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.

Dealing Data

Use UK Biobank data

GTEx data contain SNPs and the corresponding gene expression data.
Covariates include the first 20 genetic PCs, age, sex, 17 phenotypes (blood cell counts and similar measures), and ancestry
Individuals are labeled as EUR, S.ASN, E.ASN, or AFR
Remove data with high missing rates
Individuals with multiple arrays (measured several times with different instruments): take the average
Individuals with multiple instances (measured several times at different stages): use the first non-missing value
The predicted transcriptome depends on ancestry and tissue: different tissues or ancestries have different functions mapping SNPs to gene expression

Quality control on self-reported ancestry

Because many samples may report wrong information about their ancestry, we define a similarity $S_{ik}$ for individual $i$ and population $k$

$$ \begin{align*} S_{ik} = \log{P(\text{PC}_i^1, \cdots, \text{PC}_i^{10} |\ \widehat{\mu_k}, \widehat{\Sigma_k})} \end{align*} $$

where $\widehat{\mu_k}, \widehat{\Sigma_k}$ are the sample mean and sample covariance matrix of the PCs in population $k$, respectively
For the 4 populations (EUR, S.ASN, E.ASN, and AFR) and the un-assigned individuals, we keep those data with $S_{ik} > -50$.

Proportion of Variance Explained

First, we convert the predicted gene expression $\hat{T}_{ig}$ to $\widetilde{T}_{ig}$ by a rank-based inverse normal transformation. This controls the range of the transformed gene expression through normal quantiles, and the transformed gene expression approximately follows $N(0,1)$. For gene $g$, the transformed gene expression of individual $i$ is

$$ \begin{align*} \widetilde{T}_{ig} = \Phi^{-1}\left(\frac{\text{rank}(\hat{T}_{ig})}{N+1}\right) \end{align*} $$
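A minimal sketch of this transformation using only the standard library follows. The function name is illustrative, the input values are hypothetical, and ties are ranked in input order, which is an assumption since the text does not say how ties are handled.

```python
from statistics import NormalDist

def inverse_normal_transform(values):
    """Map values to standard normal quantiles via rank / (N + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):   # ranks run from 1 to N
        ranks[idx] = rank
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

# Hypothetical predicted expression of one gene in five individuals.
transformed = inverse_normal_transform([2.3, -0.4, 5.1, 0.9, 1.7])
print(transformed)
```

Dividing by $N + 1$ rather than $N$ keeps every argument of $\Phi^{-1}$ strictly between 0 and 1, so the largest value does not map to infinity.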

Suppose the observed phenotype ( $Y_i$ ) has a linear relation with the $l$th covariate ( $C_{il}$ ) and the inverse-normalized predicted expression of gene g ( $\widetilde{T}_{ig}$ )

$$ \begin{align*} & Y_i = \mu + \sum_l C_{il} a_l + \sum_g \widetilde{T}_{ig} \beta_g + \varepsilon_i \\ & \varepsilon_i \overset{\text{iid }}{\sim} N(0, \sigma_e^2) \\ & \beta_g \overset{\text{iid }}{\sim} N\left( 0, \frac{\sigma_g^2}{M} \right) \end{align*} $$

The idea of the proportion of variance explained (PVE) is similar to $R^2$: it measures how much of the variation in the data is explained by covariates. $R^2$ measures the variation explained by all covariates together, whereas PVE can focus on the covariates we are concerned with. The PVE for each trait depends on the heritability of the trait. The PVE of gene $g$ is

$$ \begin{align*} \text{PVE}_g = \frac{\hat{\sigma}_g^2}{M (\hat{\sigma}_e^2 + \hat{\sigma}_g^2)} \end{align*} $$

Polygenic Risk Scores

We use LD clumping to select independent and significant SNPs. Using the discovery data, the Polygenic Risk Score (PRS) for individual $i$ at GWAS p-value threshold $t$ is

$$ \begin{align*} \text{PRS}_i^t=\sum_{j: p_j \leq t} X_{i j} \widehat{b}_j \end{align*} $$

$\hat{b}_j$: effect size (coefficient) of SNP $j$ estimated by GWAS
$X_{ij}$: genotype of individual $i$ at SNP $j$, i.e. the number of risk alleles (0, 1, or 2)
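The score is a thresholded weighted sum. In the sketch below, the function name, the dosages, the effect sizes, the p-values, and the thresholds are all hypothetical illustration values.

```python
def polygenic_risk_score(dosages, effects, p_values, threshold):
    """PRS for one individual: sum of X_ij * b_j over SNPs with p_j <= threshold."""
    return sum(
        x * b for x, b, p in zip(dosages, effects, p_values) if p <= threshold
    )

dosages = [0, 1, 2, 1]                  # risk-allele counts of one individual
effects = [0.30, -0.10, 0.25, 0.05]     # GWAS effect sizes b_j
p_values = [1e-9, 3e-4, 2e-6, 0.2]      # GWAS p-values p_j

# One score per p-value threshold t.
scores = {
    t: polygenic_risk_score(dosages, effects, p_values, t)
    for t in (5e-8, 1e-5, 1e-3, 1.0)
}
print(scores)
```

Looser thresholds admit more SNPs, so the same individual receives a different score at each threshold.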

Independent and Significant

Significant: We want the SNPs that influence the disease, so we choose the significant ones.
Independent: If correlated SNPs are put into the PRS, the score counts the same effect repeatedly and becomes inaccurate.

Limitation of PRS
When computing a PRS, different SNPs may be chosen for different ancestries. Minority groups may be overlooked when public health measures are implemented.

LD Clumping v.s. LD Pruning

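The distance between adjacent SNPs is not constant, so a window can be defined either by physical distance or by the number of SNPs.

LD Clumping

Order the SNPs by p-value
Take the SNP with the smallest p-value as the center of a 250 kb range
Remove the SNPs in that range that have $r^2 > 0.1$ with the center SNP
Proceed to the remaining SNP with the next smallest p-value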
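The clumping steps can be sketched as a greedy loop. This is a simplified illustration, not PLINK's implementation: it assumes all SNPs lie on one chromosome, that pairwise $r^2$ values are supplied in a dictionary, and that the 250 kb range extends on both sides of the center SNP. The SNP IDs, positions, p-values, and $r^2$ values are hypothetical.

```python
def ld_clump(snps, r2, window_bp=250_000, r2_threshold=0.1):
    """Greedy LD clumping.

    snps: list of (snp_id, position_bp, p_value) on one chromosome.
    r2:   dict mapping frozenset({id_a, id_b}) to r^2; missing pairs count as 0.
    """
    remaining = sorted(snps, key=lambda s: s[2])    # most significant first
    index_snps = []
    while remaining:
        lead_id, lead_pos, _ = remaining.pop(0)
        index_snps.append(lead_id)
        kept = []
        for snp_id, pos, p in remaining:
            in_window = abs(pos - lead_pos) <= window_bp
            linked = r2.get(frozenset((lead_id, snp_id)), 0.0) > r2_threshold
            if not (in_window and linked):          # drop only nearby, correlated SNPs
                kept.append((snp_id, pos, p))
        remaining = kept
    return index_snps

snps = [("rs1", 100_000, 1e-8), ("rs2", 150_000, 1e-6),
        ("rs3", 180_000, 1e-4), ("rs4", 900_000, 1e-5)]
r2 = {frozenset(("rs1", "rs2")): 0.8,
      frozenset(("rs1", "rs3")): 0.05,
      frozenset(("rs2", "rs3")): 0.9}
clumped = ld_clump(snps, r2)
print(clumped)  # ['rs1', 'rs4', 'rs3']
```

Here rs2 is removed because it is close to and correlated with rs1, while rs3 survives because its only strong correlation is with the already removed rs2.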

LD Pruning

In a window, compute $r^2$ for each pair of SNPs
If $r^2$ is bigger than the threshold, delete the SNP with the smaller MAF
Move to the next window

Example for LD Pruning

We calculate the LD of each pair of SNPs in a window of $50000$ SNPs; if $r^2 > 0.2$, the SNP with the smaller MAF is deleted. Then the window shifts by $5000$ SNPs and the procedure repeats. LD is not computed across chromosomes, even when a chromosome has few SNPs.
The windows are: window 1: SNPs 0 – 50,000; window 2: SNPs 5,000 – 55,000


plink --bfile xxx1 --chr 1-22 --indep-pairwise 50000 5000 0.2 --out output2

We calculate the LD of each pair of SNPs in a $50$ kb window; if $r^2 > 0.2$, the SNP with the smaller MAF is deleted. Then the window shifts by 5 SNPs and the procedure repeats. LD is not computed across chromosomes, even when a chromosome has few SNPs. The windows are: window 1: 0 – 50 kb; window 2: from the position of the 5th next SNP to that position + 50 kb


plink --bfile xxx1 --chr 1-22 --indep-pairwise-kb 50 5 0.2 --out output2

PRS Practicality

We choose independent and low correlated SNPs by LD clumping.
For 11 p-value thresholds, we get different PRS.

Risk Allele

An allele of a gene can be defined by a combination of SNPs. Disease risk generally increases with the number of risk alleles. For example, the APOE gene, which is associated with Alzheimer's disease, has three alleles, and APOE $\varepsilon_4$ is the risk allele.

APOE $\varepsilon_2$: consists of SNPs rs429358 (T), rs7412 (T)
APOE $\varepsilon_3$: consists of SNPs rs429358 (T), rs7412 (C)
APOE $\varepsilon_4$: consists of SNPs rs429358 (C), rs7412 (C)
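As an illustration, the table above can be turned into a small lookup that counts risk alleles. This sketch assumes the two haplotypes of a person are already phased, each given as a hypothetical `(rs429358 allele, rs7412 allele)` tuple; the function names are made up for the example.

```python
# Haplotype (rs429358 allele, rs7412 allele) -> APOE allele, as listed above.
APOE_ALLELE = {("T", "T"): "e2", ("T", "C"): "e3", ("C", "C"): "e4"}


def apoe_genotype(hap1, hap2):
    """Return the two APOE alleles carried on two phased haplotypes."""
    return tuple(sorted(APOE_ALLELE[h] for h in (hap1, hap2)))


def risk_allele_count(hap1, hap2, risk="e4"):
    """Number of copies of the risk allele: 0, 1, or 2."""
    return apoe_genotype(hap1, hap2).count(risk)


genotype = apoe_genotype(("T", "C"), ("C", "C"))   # one e3 and one e4 haplotype
n_risk = risk_allele_count(("T", "C"), ("C", "C"))
```

A haplotype that is not in the table, such as `("C", "T")`, raises a `KeyError` in this sketch rather than being assigned an allele.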

Polygenic Transcriptome Risk Scores

The advantage of polygenic transcriptome risk scores (PTRS) is that the PTRS model has fewer covariates and requires a smaller sample size for training. Gene expression is also closer to the disease than the SNP data used in PRS. In this paper, the discovery data are used to build the PTRS. The PTRS for individual $i$ with hyperparameters $\boldsymbol{\lambda}$, summed over genes $g$, is

$$ \begin{align*} \text{PTRS}_i^{\boldsymbol{\lambda}} = \sum_g {\hat{T}_{ig}\beta_g^{\boldsymbol{\lambda}}} \end{align*} $$

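where $\hat{T}_{ig}$ is the predicted expression of gene $g$ for individual $i$, $\beta_g^{\boldsymbol{\lambda}}$ is the weight of gene $g$, and $\boldsymbol{\lambda}$ is made up of the penalty term $\lambda$ and the mixing parameter $\alpha$ of the elastic net.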
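Numerically, the sum over genes is a matrix-vector product. The sketch below uses made-up numbers for three individuals and two genes; the array names are chosen for the example only.

```python
import numpy as np

# Hypothetical toy data: 3 individuals (rows), 2 genes (columns).
T_hat = np.array([[0.5, -1.0],
                  [1.5,  0.0],
                  [-0.5, 2.0]])   # T_hat[i, g]: predicted expression of gene g
beta = np.array([0.2, -0.1])      # beta[g]: weight of gene g for one fixed lambda

# PTRS_i = sum over g of T_hat[i, g] * beta[g]
ptrs = T_hat @ beta
```

Each choice of hyperparameters gives a different weight vector and therefore a different PTRS for the same individuals.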

Elastic Net

Elastic net is a variable selection method, but it is unsuitable when there are too many covariates. PTRS has fewer covariates than PRS, so elastic net is applied only to PTRS; for PRS, LD clumping and p-value filtering are used instead. Elastic net finds the optimal $\boldsymbol{\beta^{EN}}$. For $N$ individuals,

$$ \begin{align*} (\hat{\beta}_0, \boldsymbol{\beta^{EN}}) & = \displaystyle \arg \min_{\beta_0,\ \boldsymbol{\beta}} \left\{ \underbrace{\frac{1}{N} \| \boldsymbol{Y} - \boldsymbol{X} \boldsymbol{\beta}- \beta_0 \boldsymbol{1}_N \|_2^2 }_{\textcolor{blue}{\text{loss}}} + \lambda \left[ \alpha \left\| \boldsymbol{\beta} \right\|_1 + (1-\alpha) \left\| \boldsymbol{\beta} \right\|_2^2 \right] \right\} \end{align*} $$

where

$$ \begin{align*} & \boldsymbol{\beta} = [\beta_1, \ldots, \beta_{M+L}]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \beta_0 \in \mathbb{R}, \text{ intercept}, \quad \boldsymbol{1}_N = [1, \ldots, 1]^T \in \mathbb{R}^{N \times 1} \\ & \boldsymbol{Y} \in \mathbb{R}^{N \times 1}, \text{ vector of observed phenotypes} \\ & \boldsymbol{X} = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N \times (M+L)} \\ & M = \text{ number of genes } \\ & L= \text{ number of covariates } \\ & \hat{T}_i \in \mathbb{R}^{N \times 1}, \text{ predicted standardized i-th gene expression} \\ & C_i \in \mathbb{R}^{N \times 1}, \text{ observed i-th standardized covariate} \end{align*} $$
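To make the objective concrete, the sketch below evaluates it for given coefficients. It treats the intercept as a single scalar added to every individual's prediction, which is an assumption about how the intercept term is meant; the function name and toy numbers are made up for the example.

```python
import numpy as np


def elastic_net_objective(beta0, beta, X, Y, lam, alpha):
    """Loss plus penalty, following the formula above.

    beta0: scalar intercept (assumed shared by all individuals).
    beta:  coefficient vector, one entry per column of X.
    """
    residual = Y - X @ beta - beta0
    loss = np.mean(residual ** 2)                       # (1/N) * ||residual||_2^2
    penalty = lam * (alpha * np.sum(np.abs(beta))       # L1 part
                     + (1 - alpha) * np.sum(beta ** 2)) # squared L2 part
    return loss + penalty


# Toy example (made-up numbers).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([1.0, 2.0])
value = elastic_net_objective(0.0, np.array([1.0, 1.0]), X, Y, lam=1.0, alpha=0.1)
```

Note that packages such as glmnet and scikit-learn scale the loss by $\frac{1}{2N}$ and halve the squared $L_2$ term, so their $\lambda$ values are not directly interchangeable with the ones in this formula.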

Degenerate Model

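When $\lambda$ is so large that every coefficient is shrunk to zero, the fit reduces to an intercept-only model:

$$ \begin{align*} Y= \text{constant}+\varepsilon \end{align*} $$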

PTRS Practicality

We set $\alpha = 0.1$ and take $\lambda_{\text{max}}$ to be the smallest $\lambda$ at which all coefficients are zero, that is, the smallest value satisfying $|\nabla_j l(\beta)| \leq \alpha \lambda$ for every coordinate $j$ at $\beta = 0$, where $l$ is the loss.
To match the 11 PRS p-value cutoffs, we build a grid of $\lambda$ values by selecting 20 points equally spaced on the log scale between $1.5\lambda_{\text{max}}$ and $\frac{\lambda_{\text{max}}}{10^4}$, and we use the first 11 non-degenerate models for each population.
With $\alpha=0.1$, we fit the elastic net to get $\beta$ for each $\lambda$.
Each of the 11 $\lambda$ values gives a different PTRS.
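The grid construction can be sketched as follows. The function names and the way fitted coefficients are passed in (`betas`, one vector per $\lambda$) are assumptions made for illustration; the elastic net fit itself is not shown.

```python
import numpy as np


def lambda_grid(lambda_max, n_points=20):
    """n_points values equally spaced on the log scale,
    from 1.5 * lambda_max down to lambda_max / 1e4."""
    return np.geomspace(1.5 * lambda_max, lambda_max / 1e4, num=n_points)


def first_non_degenerate(lambdas, betas, n_models=11):
    """Keep the first n_models lambdas whose coefficients are not all zero."""
    kept = [lam for lam, b in zip(lambdas, betas) if np.any(b != 0)]
    return kept[:n_models]


grid = lambda_grid(2.0)  # lambda_max = 2.0 is a made-up value
```

Starting the grid above $\lambda_{\text{max}}$ means the first few models are expected to be degenerate, which is why only non-degenerate models are counted toward the 11.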

Partial $R$ Squared

To compare the performance of PRS and PTRS, partial $R$ squared ($\widetilde{R^2}$) is used as the evaluation metric. Partial $R^2$ is also called the prediction accuracy. Let $y_i$ denote the observed phenotype and $\hat{y_i}$ the predicted phenotype, and compare the two models

$$ \begin{align*} \text{Null model}: y \sim 1 + \textit{covariates} \quad \text{ v.s. } \quad \text{Full model}: y \sim 1 + \textit{covariates} + \hat{y} \end{align*} $$

$$ \begin{align*} \widetilde{R^2} & = 1- \dfrac{\text{SSE}_\text{full} }{\text{SSE}_\text{null} } \\ & = \frac{C^2(y, \hat{y})}{C(y, y)C(\hat{y}, \hat{y})} \end{align*} $$

where

$$ \begin{align*} & C(u, v) = u^t v - u^t H v \\ & H = \widetilde{C} (\widetilde{C}^t \widetilde{C})^{-1} \widetilde{C}^t \\ & \widetilde{C} = [1, C_1, \dots, C_L] \end{align*} $$
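The closed form with $C(u, v)$ can be computed directly. The sketch below is a minimal NumPy version; the function name and argument layout are chosen for the example, and `covariates` is assumed to be an array with one column per covariate.

```python
import numpy as np


def partial_r2(y, y_hat, covariates):
    """Partial R^2 of y_hat for y, adjusting for an intercept and covariates."""
    n = len(y)
    C = np.column_stack([np.ones(n), covariates])   # C-tilde = [1, C_1, ..., C_L]
    H = C @ np.linalg.solve(C.T @ C, C.T)           # projection onto columns of C-tilde

    def c(u, v):                                    # C(u, v) = u'v - u'Hv
        return u @ v - u @ H @ v

    return c(y, y_hat) ** 2 / (c(y, y) * c(y_hat, y_hat))
```

This gives the same value as fitting the null and full regressions and taking $1 - \text{SSE}_\text{full} / \text{SSE}_\text{null}$, without fitting either model explicitly.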

How to choose hyperparameters?

Compute the PTRS weights in the discovery set (UKB EUR) and test them in the 5 target sets.
Split each target set into two equal-size parts, a validation set and a test set.
Select the hyperparameters (p-value cutoff in clumping and thresholding, $\lambda$ in elastic net) that maximize $\widetilde{R^2}$ in the validation set.

After choosing the hyperparameters, we calculate $\widetilde{R^2}$ in the test set. This procedure is repeated $10$ times, and the average $\widetilde{R^2}$ is reported as the prediction accuracy.
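The split, select, and test loop can be sketched as below. Everything here is illustrative: the function names, the random splitting scheme, and the toy data are assumptions, and a plain squared correlation stands in for $\widetilde{R^2}$ to keep the example short.

```python
import random


def squared_corr(y, y_hat):
    """Stand-in metric (NOT partial R^2): squared Pearson correlation."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    vy = sum((a - my) ** 2 for a in y)
    vh = sum((b - mh) ** 2 for b in y_hat)
    return 0.0 if vy == 0 or vh == 0 else cov * cov / (vy * vh)


def select_and_test(y, scores, metric, n_repeats=10, seed=0):
    """scores: dict hyperparameter -> list of scores, one per individual."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        val, test = idx[: len(idx) // 2], idx[len(idx) // 2:]
        # Choose the hyperparameter that does best on the validation half ...
        best = max(scores, key=lambda h: metric([y[i] for i in val],
                                                [scores[h][i] for i in val]))
        # ... and evaluate only that choice on the held-out test half.
        results.append(metric([y[i] for i in test],
                              [scores[best][i] for i in test]))
    return sum(results) / n_repeats


# Toy data (made up): one informative score and one uninformative score.
y = [float(i) for i in range(20)]
scores = {"good": list(y), "bad": [float(i % 2) for i in range(20)]}
accuracy = select_and_test(y, scores, squared_corr)
```

Keeping the test half out of the selection step is what makes the averaged accuracy an honest estimate for the chosen hyperparameters.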

Combining PTRS and PRS

Combined score: $\hat{y_i} = c_1 \text{PRS}_i^\lambda + c_2 \text{PTRS}_i^\lambda$

Split each target set into two equal-size parts, a validation set and a test set.
Split the validation set into two equal-size parts.
For each of the 11 thresholds $\lambda$, find ${\arg \min}_{c_1,\ c_2} \sum_i (y_i-\hat{y_i})^2$ in the first validation part. The scores differ between thresholds, so $c_1,\ c_2$ differ too, just as a linear model fitted to different $x_i$ gives a different fitted line.
With the best $c_1,\ c_2$ for each threshold, compute the combined score in the second validation part.
This gives a combined score at each of the 11 thresholds for each population; select the threshold with the largest $\widetilde{R^2}$.
Apply the selected threshold and its $c_1,\ c_2$ for each population to the test set.
This gives the final $\widetilde{R^2}$.
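For one threshold, fitting $c_1$ and $c_2$ is an ordinary least-squares problem without an intercept, as the combined score is written. A minimal sketch with made-up numbers and hypothetical function names:

```python
import numpy as np


def fit_combination(y, prs, ptrs):
    """Least-squares c1, c2 minimising sum_i (y_i - c1*PRS_i - c2*PTRS_i)^2."""
    A = np.column_stack([prs, ptrs])
    (c1, c2), *_ = np.linalg.lstsq(A, y, rcond=None)
    return c1, c2


def combined_score(prs, ptrs, c1, c2):
    """Combined score c1 * PRS + c2 * PTRS for each individual."""
    return c1 * np.asarray(prs) + c2 * np.asarray(ptrs)


# Toy data (made up): y is exactly 2 * PRS + 3 * PTRS.
prs = np.array([1.0, 0.0, 1.0, 2.0])
ptrs = np.array([0.0, 1.0, 1.0, 1.0])
y = 2.0 * prs + 3.0 * ptrs
c1, c2 = fit_combination(y, prs, ptrs)
```

In the procedure above, `fit_combination` would be run on the first validation part and `combined_score` on the second part and on the test set, once per threshold.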

Portability of PRS and PTRS

Portability of PRS is

$$ \begin{align*} \frac{\text{prediction accuracy in target set}}{ \text{prediction accuracy in European reference set}} \end{align*} $$

Portability of PTRS is

$$ \begin{align*} \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} }, \quad \text{ where } \widetilde{R^2}_{\text{EUR ref}} \text{ is } \widetilde{R^2} \text{ in MESA EUR model } \end{align*} $$

The MESA EUR model is expected to perform better than the MESA AFHI model among EUR individuals, so

$$ \begin{align*} & \widetilde{R^2}_{\text{MESA AFHI}} <\widetilde{R^2}_{\text{EUR ref}} \\ \Rightarrow \quad & \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} } < \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{MESA AFHI}} } \end{align*} $$

Therefore, this definition of the portability of PTRS is conservative: dividing by the larger $\widetilde{R^2}_{\text{EUR ref}}$ gives the smaller ratio.

Result

In the paper, Fig. 3(b) uses PVE for PTRS and heritability for PRS. In the plot, the $\widetilde{R^2}$ of PTRS can reach its upper bound (heritability), but that of PRS cannot.
Within a fixed population, PTRS performs worse than PRS, but PTRS + PRS performs better than PRS alone.
The weights for predicting gene expression come from PredictDB (they were trained in another paper). Here we use the weights directly.
In the paper, the traits are continuous, so survival analysis is not involved. For a discrete trait, we would use survival analysis.
Finally, one PRS and one PTRS are computed for each population.

Reference

Polygenic transcriptome risk scores (PTRS) can improve portability of polygenic risk scores across ancestries