Summary PrediXcan

Background

GWAS identifies SNPs associated with a trait, but the underlying mechanism is usually unknown. We use S-PrediXcan to test whether these SNPs affect the trait through gene expression.

Gene Regulation

Although every cell in the body contains the same DNA sequence, cells do not all express the same set of genes. Each cell type transcribes a different subset of the genes encoded in its DNA into mRNA, which is then translated into protein. The process of expressing genes to produce mRNA and protein is called gene expression, and the mechanism that controls which genes are expressed is called gene regulation. If a human chromosome were stretched out linearly, it would be over $4$ cm long, and if every gene were expressed, the cell would have to be enormous.

Alternative RNA splicing is a common mechanism of gene regulation in eukaryotes. Up to $70\%$ of human genes are expressed as multiple proteins through it. The pre-mRNA (primary transcript) is made up of introns and exons. During splicing, introns, and sometimes exons, are removed from the primary transcript, and different combinations of the remaining exons are joined. The differently spliced mRNAs are translated into different proteins.

Colocalization

Colocalization means that the GWAS signal and the eQTL signal overlap at the same locus. It helps determine whether a GWAS SNP affects gene expression. Overlapping signals can arise in three situations:

Linkage:
Two independent causal variants are closely located in the genome, leading to overlapping signals.
Causality:
A SNP directly affects the trait by changing gene expression, representing a direct causal relationship.
Pleiotropy:
A single SNP independently affects multiple traits. The association between these traits is caused by the same SNP, but the effects occur through different biological pathways.

We use Mendelian randomization to check whether the situation is causality or pleiotropy.

Heterogeneity

Allelic heterogeneity:
A similar phenotype is produced by different alleles within the same gene.
Locus heterogeneity:
A similar phenotype is produced by mutations at different loci.

Bayesian Sparse Linear Mixed Models (BSLMM)

For $n$ samples and $p$ SNPs

$$ \begin{align*} Y_i = \sum_{j=1}^p X_{ij} \beta_j + u_i + \epsilon_i \end{align*} $$

$Y_i$: phenotype of i-th sample
$X_{ij}$: genotype of i-th sample at j-th SNP
$\beta_j$: effect size
$u_i$: random effect for i-th sample
$\epsilon_i$: error term

Sparse Component
Most SNPs have an effect size of exactly zero; only a few important SNPs have non-zero effects.

Polygenic Component
Many SNPs have very small effect sizes and contribute to the trait jointly.
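
The two components can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_bslmm`, the allele frequency, and the effect-size scales are made up for illustration, and the random effect $u_i$ is represented here as the sum of many tiny per-SNP effects rather than through a kinship matrix.

```python
import random

def simulate_bslmm(n=200, p=50, n_sparse=3, seed=0):
    """Simulate Y_i = sum_j X_ij * beta_j + u_i + eps_i with illustrative settings."""
    rng = random.Random(seed)
    # Genotypes: allelic dosages 0/1/2 with an assumed allele frequency of 0.3.
    X = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(p)]
         for _ in range(n)]
    # Sparse component: only a few SNPs receive a non-zero (larger) effect.
    beta = [0.0] * p
    for j in rng.sample(range(p), n_sparse):
        beta[j] = rng.gauss(0.0, 1.0)
    # Polygenic component: every SNP contributes a tiny effect, summed into u_i.
    small = [rng.gauss(0.0, 0.05) for _ in range(p)]
    u = [sum(x * s for x, s in zip(row, small)) for row in X]
    # Independent error term.
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    Y = [sum(x * b for x, b in zip(row, beta)) + u_i + e
         for row, u_i, e in zip(X, u, eps)]
    return X, beta, u, Y

X, beta, u, Y = simulate_bslmm()
n_nonzero = sum(b != 0.0 for b in beta)  # number of SNPs in the sparse component
```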

Method

GWAS and PrediXcan

We assume that the phenotype is a linear function of $X_l$ and of $T_g$, respectively.

$$ \begin{align} & Y = \alpha_1 + X_l \beta_l + \eta \\ & Y = \alpha_2 + T_g \gamma_g + \epsilon \end{align} $$

$\alpha_1, \ \alpha_2$ are constants (intercepts)
$\eta, \ \epsilon$ are error terms
$T_g = \sum_{l \in \text{Model}_g} w_{lg} X_l$ is the predicted gene expression (transcriptome)
$\text{Var}(T_g) = \hat{\sigma}_g^2$
$X_l$ is the allelic dosage (genotype) of the $l$-th SNP
$\text{Var}(X_l) = \hat{\sigma}_l^2$
$Y$ is the level of the trait (phenotype)
$\text{Var}(Y) = \hat{\sigma}_Y^2$

PrediXcan

take $w_{lg}$ from PredictDB
compute the predicted transcriptome $\hat{T}$ from individual-level genotypes
regress $Y$ on $\hat{T}$ to get $\hat{\gamma_g}$
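
The three steps can be sketched in a few lines of Python. This is not the PrediXcan software: the helper names `predict_expression` and `simple_regression`, the weights, and the toy data are all hypothetical and serve only to show the order of operations.

```python
def predict_expression(X, w):
    """T_g per sample: weighted sum of dosages over the SNPs in the gene's model."""
    return [sum(w_l * x_l for w_l, x_l in zip(w, row)) for row in X]

def simple_regression(x, y):
    """OLS slope, its standard error, and the z-score for y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se, slope / se

# Toy data (made up): 6 samples, 2 SNPs in the model of one gene.
X = [[0, 1], [1, 2], [2, 0], [1, 1], [0, 2], [2, 2]]
w = [0.5, -0.2]                        # hypothetical PredictDB-style weights
Y = [0.7, 1.1, 3.05, 1.55, 0.3, 2.1]   # made-up phenotype values

T_hat = predict_expression(X, w)                            # step 2
gamma_hat, se_gamma, z_gene = simple_regression(T_hat, Y)   # step 3
```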

PrediXcan is a computational algorithm developed to exploit GTEx data, including the identification of eQTLs and their relationship to complex traits. PrediXcan evaluates the aggregate effect of cis-regulatory variants (within 1 Mb upstream or downstream of the gene of interest) on gene expression via elastic net regression. Consequently, PrediXcan may identify loci with modest to weak effect sizes that do not reach significance in variant-based association studies.

S-PrediXcan

$w_{lg}$ from PredictDB
$\hat{\sigma}_g$ from training set or reference set
$\hat{\beta}_l, \ \text{se}(\hat{\beta}_l)$ from GWAS

We get

$$ \begin{align*} Z_g &= \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{\text{se}(\hat{\beta}_l)} \sqrt{\dfrac{1-R_l^2}{1-R_g^2}} \\ &\approx \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{\text{se}(\hat{\beta}_l)} \end{align*} $$

Make sure that the GWAS and the prediction model are based on the same population.
Get $\hat{\gamma_g}$ and the z-score $Z_g$.
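
The approximate formula is a weighted sum of per-SNP GWAS z-scores. A minimal Python sketch follows; the function name `spredixcan_z` and all numbers are illustrative assumptions, not part of the S-PrediXcan software.

```python
def spredixcan_z(w, sigma_l, sigma_g, beta, se):
    """Approximate Z_g = sum_l w_lg * (sigma_l / sigma_g) * (beta_l / se_l)."""
    return sum(
        w_l * (s_l / sigma_g) * (b_l / se_l)
        for w_l, s_l, b_l, se_l in zip(w, sigma_l, beta, se)
    )

# Made-up inputs for one gene with two SNPs in its model.
w = [0.5, -0.2]          # prediction weights w_lg
sigma_l = [0.7, 0.6]     # SNP standard deviations from a reference set
sigma_g = 0.4            # standard deviation of predicted expression
beta = [0.1, 0.05]       # GWAS effect sizes
se = [0.02, 0.025]       # GWAS standard errors

z_g = spredixcan_z(w, sigma_l, sigma_g, beta, se)
```

With these inputs the two terms are $0.5 \cdot 1.75 \cdot 5 = 4.375$ and $-0.2 \cdot 1.5 \cdot 2 = -0.6$, so $Z_g = 3.775$.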

PVE by SNP and Transcriptome

The proportions of variance explained (PVE) by the covariates $T_g$ and $X_l$ are

$$ \begin{align*} & R_g^2 = \frac{ \text{var}(T_g \hat{\gamma_g} ) }{ \text{var}(Y) } = \hat{\gamma}_g^2 \ \frac{\hat{\sigma}_g^2}{\hat{\sigma}_Y^2} \\ & R_l^2 = \frac{ \text{var}(X_l \hat{\beta}_l ) }{ \text{var}(Y) } = \hat{\beta}_l^2 \ \frac{\hat{\sigma}_l^2}{\hat{\sigma}_Y^2} \end{align*} $$

Predicted Effect Size

We represent $\hat{\sigma}_g^2$ in matrix form

$$ \begin{align*} \hat{\sigma}_g^2 &= \text{Var}\left(\sum_{l \in \text{Model}_g} w_{lg} X_l\right) \\ &= \text{Var}(\mathbf{X}_g\mathbf{W}_g) \\ &= \mathbf{W}_g' \cdot \text{Var}(\mathbf{X}_g) \cdot \mathbf{W}_g \\ &= \mathbf{W}_g' \cdot \mathbf{\Gamma}_g \cdot \mathbf{W}_g \end{align*} $$

$\mathbf{X}_g$ is the $n \times p$ matrix of SNP data in the model of gene $g$
$\bar{\mathbf{X}}_g$ is the $n \times p$ matrix whose column $l$ holds the column mean of $X_l$
$\mathbf{W}_g$ is the vector of $w_{lg}$ for SNPs in the model of $g$
$\mathbf{\Gamma}_g = (\mathbf{X}_g - \bar{\mathbf{X}}_g)'(\mathbf{X}_g - \bar{\mathbf{X}}_g) / n$, the sample covariance matrix of $\mathbf{X}_g$
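
The matrix form can be checked numerically. The sketch below assumes that $\mathbf{\Gamma}_g$ is normalized by $n$ (so that it is a covariance matrix) and uses made-up genotypes and weights; the helper names are hypothetical.

```python
def covariance_matrix(X):
    """Covariance of the columns of X (list of rows), dividing by n."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(p)]
            for a in range(p)]

def predicted_expression_variance(w, gamma):
    """Quadratic form W' * Gamma * W."""
    p = len(w)
    return sum(w[a] * gamma[a][b] * w[b] for a in range(p) for b in range(p))

# Toy reference genotypes (6 samples, 2 SNPs) and hypothetical weights.
X = [[0, 1], [1, 2], [2, 0], [1, 1], [0, 2], [2, 2]]
w = [0.5, -0.2]

var_from_matrix = predicted_expression_variance(w, covariance_matrix(X))

# Direct computation: variance of the predicted expression T = X w.
T = [sum(w_l * x_l for w_l, x_l in zip(w, row)) for row in X]
mean_T = sum(T) / len(T)
var_direct = sum((t - mean_T) ** 2 for t in T) / len(T)
```

Both routes give the same number, which is why S-PrediXcan only needs the SNP covariance matrix from a reference set rather than individual-level predicted expression.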

Under the linear model, the estimated effect size (coefficient) of the covariate $X_l$ is

$$ \begin{align*} & \hat{\beta}_l = \frac{\text{Cov}(X_l, Y)}{\text{Var}(X_l)} = \frac{\text{Cov}(X_l, Y)}{\hat{\sigma}_l^2} \\ \Rightarrow \ & \text{Cov}(X_l, Y) = \hat{\beta}_l \hat{\sigma}_l^2 \end{align*} $$

Likewise, the coefficient of the covariate $T_g$ is

$$ \begin{align*} \hat{\gamma_g} &= \dfrac{\text{Cov}(T_g, Y)}{\hat{\sigma}_g^2} \\ &= \dfrac{\text{Cov}(\sum_{l \in \text{Model}_g} w_{lg} X_l, Y)}{\hat{\sigma}_g^2} \\ &= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \text{Cov}(X_l, Y)}{\hat{\sigma}_g^2} \\ &= \sum_{l \in \text{Model}_g} \dfrac{w_{lg} \hat{\beta}_l\hat{\sigma}_l^2}{\hat{\sigma}_g^2} \end{align*} $$

From the linear model for $X_l$,

$$ \begin{align*} & Y = \alpha_1 + X_l \beta_l + \eta \\ \Rightarrow \ & \hat{\sigma}_Y^2 = \hat{\sigma}_\eta^2 + \hat{\sigma}_l^2 \hat{\beta}_l^2 \end{align*} $$

We rewrite the variance of $\hat{\beta_l}$, treating the genotypes as fixed and the errors $\eta_i$ as independent with variance $\sigma_{\eta}^2$. In the last steps, $\sigma_{\eta}^2$ is estimated by $\hat{\sigma}_Y^2 - \hat{\sigma}_l^2 \hat{\beta}_l^2$ and $\sum_{i=1}^n (X_{li}-\bar{X_l})^2 = n\hat{\sigma}_l^2$.

$$ \begin{align*} \text{var}(\hat{\beta_l}) &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(Y_i-\bar{Y})}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})Y_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})(\alpha_1 + X_{li} \beta_l + \eta_i)}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \text{var}\left(\dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l}) \eta_i}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2}\right) \\ &= \dfrac{\sum_{i=1}^n (X_{li}-\bar{X_l})^2 \cdot \sigma_{\eta}^2}{(\sum_{i=1}^n (X_{li}-\bar{X_l})^2)^2} \\ &= \dfrac{\sigma_{\eta}^2}{\sum_{i=1}^n (X_{li}-\bar{X_l})^2} \\ &= \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_l^2 \hat{\beta}_l^2}{n\hat{\sigma}_l^2} \\ &= \frac{\hat{\sigma}_Y^2(1 - R_l^2)}{n\hat{\sigma}_l^2} \end{align*} $$

$$ \begin{align} \frac{\hat{\sigma}_Y^2}{n} = \dfrac{se^2(\hat{\beta_l}) \cdot\hat{\sigma}_l^2}{1 - R_l^2} \end{align} $$

Similarly,

$$ \begin{align*} \text{var}(\hat{\gamma_g}) = \frac{\hat{\sigma}_Y^2}{n} \cdot \frac{(1 - R_g^2)}{\hat{\sigma}_g^2} \end{align*} $$

Substituting the expression for $\hat{\sigma}_Y^2 / n$ obtained from $\text{var}(\hat{\beta_l})$ above,

$$ \begin{align*} se(\hat{\gamma_g}) & = \sqrt{\text{var}(\hat{\gamma_g})} \\ & = se(\hat{\beta_l}) \cdot \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \cdot \sqrt{ \frac{(1 - R_g^2)}{(1 - R_l^2)}} \end{align*} $$
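
Dividing $\hat{\gamma_g}$ by $se(\hat{\gamma_g})$ gives the gene-level z-score. The worked step below only rearranges the two formulas above. Because the expression for $se(\hat{\gamma_g})$ holds for every SNP $l$ in the model, it can be applied term by term inside the sum. The final approximation assumes that $R_l^2$ and $R_g^2$ are both small, so that the square-root factor is close to $1$.

$$ \begin{align*} Z_g = \frac{\hat{\gamma_g}}{se(\hat{\gamma_g})} &= \sum_{l \in \text{Model}_g} \frac{w_{lg} \hat{\beta}_l \hat{\sigma}_l^2}{\hat{\sigma}_g^2} \cdot \frac{\hat{\sigma}_g}{se(\hat{\beta_l}) \, \hat{\sigma}_l} \sqrt{\frac{1 - R_l^2}{1 - R_g^2}} \\ &= \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta_l})} \sqrt{\frac{1 - R_l^2}{1 - R_g^2}} \\ &\approx \sum_{l \in \text{Model}_g} w_{lg} \ \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \ \frac{\hat{\beta}_l}{se(\hat{\beta_l})} \end{align*} $$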

We infer the PrediXcan results ($\hat{\gamma_g},\ \text{se}(\hat{\gamma_g})$) using GWAS results ($\hat{\beta}_l,\ \text{se}(\hat{\beta}_l)$), SNP information ($\hat{\sigma}_l^2,\ \mathbf{\Gamma}_g$), and PredictDB weights ($w_{lg}$).
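
The derivation can be checked end to end on simulated data. The sketch below is a toy setup under stated assumptions: the weights, allele frequency, and effect size are made up, and the same samples serve as study set and reference set, which is the ideal case. It compares the individual-level PrediXcan z-score with the summary-based one, both with and without the $\sqrt{(1-R_l^2)/(1-R_g^2)}$ factor.

```python
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def ols(x, y):
    """Slope and standard error for y = a + b*x + error."""
    n, mx, my = len(x), mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    rss = sum((c - my - b * (a - mx)) ** 2 for a, c in zip(x, y))
    return b, (rss / (n - 2) / sxx) ** 0.5

rng = random.Random(1)
n, w = 500, [0.6, -0.3, 0.2]   # hypothetical weights for one gene
X = [[sum(rng.random() < 0.4 for _ in range(2)) for _ in w] for _ in range(n)]
T = [sum(wl * x for wl, x in zip(w, row)) for row in X]
Y = [0.3 * t + rng.gauss(0, 1) for t in T]

# PrediXcan: individual-level regression of Y on predicted expression.
gamma_hat, se_gamma = ols(T, Y)
z_predixcan = gamma_hat / se_gamma

# S-PrediXcan: per-SNP GWAS results plus variances.
var_y, var_g = var(Y), var(T)
# R_g^2 needs gamma_hat, so it is only available here because we simulate;
# with real summary statistics the approximate form is used instead.
r2_g = gamma_hat ** 2 * var_g / var_y
z_exact = z_approx = 0.0
for l, wl in enumerate(w):
    x_l = [row[l] for row in X]
    beta_l, se_l = ols(x_l, Y)           # per-SNP GWAS regression
    var_l = var(x_l)
    r2_l = beta_l ** 2 * var_l / var_y
    term = wl * (var_l / var_g) ** 0.5 * beta_l / se_l
    z_approx += term
    z_exact += term * ((1 - r2_l) / (1 - r2_g)) ** 0.5
```

In this setup `z_exact` reproduces `z_predixcan` up to floating-point error, while `z_approx` differs slightly because the $R^2$ factor is dropped. With a separate reference set or a mismatched population the agreement would be looser.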

Results

Compare PrediXcan and S-PrediXcan

The weights $w_{lg}$ come from PredictDB, trained on data sets such as the European (EUR) Depression Genes and Networks (DGN) whole blood data, GTEx, and Framingham. The training set is usually different from the study set. When individual-level data are not available from the training set, we use population reference sets such as the 1000 Genomes data.

$Y$ is a phenotype simulated under $H_0:$ the phenotype is independent of the transcriptome (predicted gene expression). Model (2) therefore does not include the covariate $\hat{T}$, only some environmental covariates.
Study sets (GWAS sets) and reference sets (LD calculation sets) both consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project.

Within the same population, S-PrediXcan and PrediXcan results are highly correlated. Even when the study and reference sets come from different populations, the correlation remains high. Furthermore, for an AFR study or reference set, the $r^2$ with EUR is higher than with EAS.

$Y$ is the intrinsic growth phenotype.
Study sets were a subset of 140 individuals from each of the African, Asian, and European groups of the 1000 Genomes Project, and reference sets consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project.

The study sets are smaller here, which may increase $se(\hat{\beta})$ and therefore decrease the z-scores. PrediXcan and S-PrediXcan results differ slightly, so the $r^2$ in the diagonal plots is smaller than in Figure 2a.

$Y$ is the phenotype from the bipolar disorder and type 1 diabetes studies.
Study sets consisted of British individuals, and the reference set was the European population subset of the 1000 Genomes Project.

Colocalization Status of S-PrediXcan

Five hypotheses

$H_0:$ no association with either eQTL or GWAS.
$H_1:$ association with eQTL but not GWAS.
$H_2:$ association with GWAS but not eQTL.
$H_3:$ association with both eQTL and GWAS, through independent signals (two distinct causal variants, not colocalized).
$H_4:$ association with both eQTL and GWAS, through a shared signal (colocalized).

If we keep only Bonferroni-significant S-PrediXcan results, associations tend to cluster into three distinct regions
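
Bonferroni filtering of gene-level z-scores can be sketched as follows. The gene names, z-scores, and the significance level of $0.05$ are made-up assumptions; the two-sided p-value uses the standard normal distribution.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_significant(z_by_gene, alpha=0.05):
    """Keep genes whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(z_by_gene)
    return {g: z for g, z in z_by_gene.items() if two_sided_p(z) < threshold}

# Hypothetical S-PrediXcan z-scores for four genes.
z_scores = {"GENE_A": 5.2, "GENE_B": -1.1, "GENE_C": -4.8, "GENE_D": 2.0}
significant = bonferroni_significant(z_scores)
```

Here the threshold is $0.05 / 4 = 0.0125$, so `GENE_D` with $z = 2.0$ is nominally significant but does not survive the correction.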

Compare S-TWAS and S-PrediXcan

The difference between S-TWAS and S-PrediXcan is the prediction model: TWAS uses BSLMM, whereas PrediXcan uses elastic net.
For the COLOC-estimated proportion of non-colocalized genes, the polygenic component of BSLMM takes the combined effects of many SNPs into account, which increases the chance of a non-colocalized result.
Mancuso et al. filtered out genes with low GCTA heritability, so TWAS reports fewer significant genes than PrediXcan, but the significance levels of TWAS and PrediXcan are similar.

Predicted Performance by Trait

Prediction performance is better when

the prediction performance $R^2$ increases
the prediction performance $p$-value decreases

Z-scores increase in magnitude as prediction performance improves. In other words, S-PrediXcan associations tend to be more significant when the expression prediction is more reliable.

Hypothesis

Example

Reference

Data

GERA data
GTEx
1000 Genomes
Summary statistics