f (x) = \frac{1}{\sqrt{2 \pi}}

Background

Polygenic Risk Score (PRS), a score to estimate the risk of a disease or disease-related trait for an individual. SNP affect the disease more as PRS increase. But PRS is limited for race. If the data sample is European, the PRS will get a great performance for European only. In this paper, the we try to add PTRS which improve the portability of race.

Data

\begin{align*} f(\bold{X}) = T \end{align*}

For predicting transcriptome ( $\hat{T}$ , set of predicted genes expression), we downloaded the prediction weights (coefficients of linear function) from data GTEx in tissue, MESA collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.

Dealing Data

Use UK Biobank data

GTEx data containing SNP and corresponding gene expression data.
Covariates include first genetic 20 PCs, age, sex, 17 phenotypes (血球等指標), ancestry race
labeled individuals as EUR, S.ASN, E.ASN, AFR
Remove high missing rates data
individual with multiple arrays (measure many times with different instruments): taking the average
individual with multiple instances (measure many times with stage): Use the first non-missing value
predicted transcriptome depend on race, tissue. Different tissue or race with difference function transform SNP to gene expression

Quality control on self-reported ancestry

\begin{align*} S_{ik} = \log{P(\text{PC}_i^1, \cdots, \text{PC}_i^{10} |\ \widehat{\mu_k}, \widehat{\Sigma_k})} \end{align*}

where $\widehat{\mu_k}, \widehat{\Sigma_k}$ are sample mean, sample var respectively
For 4 populations (EUR, S.ASN, E.ASN, and AFR) and un-assigned populations, we choose those data with $S_{ik} > -50$ .

Proportion of Variance Explained

\begin{align*} \widetilde{T}_{ig} = \Phi^{-1}\left(\frac{\text{rank}(\hat{T}_{ig})}{N+1}\right) \end{align*}

\begin{align*} & Y_i = \mu + \sum_l C_{il} a_l + \sum_g \widetilde{T}_{ig} \beta_g + \varepsilon_i \\ & \varepsilon_i \overset{\text{iid }}{\sim} N(0, \sigma_e^2) \\ & \beta_g \overset{\text{iid }}{\sim} N\left( 0, \frac{\sigma_g^2}{M} \right) \end{align*}

\begin{align*} \text{PVE}_g = \frac{\hat{\sigma}_g^2}{M (\hat{\sigma}_e^2 + \hat{\sigma}_g^2)} \end{align*}

Polygenic Risk Scores

\begin{align*} \text{PRS}_i^t=\sum_{j: p_j \leq t} X_{i j} \widehat{b}_j \end{align*}

$\hat{b}_j$ : according to GWAS, we get coefficient of SNPs
$X_i$ : phenotype of $i$ SNP, number of risk allele also, 0 or 1 or 2

Independent and Significant

Significant: Obviously, we want to get the influent SNP for disease. So choose significant ones.
Independent: Putting related SNP into PRS, we get the score that repeated specific effect. It’s not accurate.

Fault of PRS
Computing PRS, we may choose different SNPs for difference race. Minority racial groups may be ignored when implementing public health measures.

LD Clumping v.s. LD Pruning

Because distance of every SNP is not equal. LD Clumping

Ordering SNP by p-value
Using SNP with the smallest p-value as the center for a range 250kb
Removing those SNP with $R^2 > 0.1$
Proceeding to the SNP with the next smallest p-value

LD Pruning

In a window, we compute $r^2$ of each pair SNP
If $r^2$ bigger than threshold, delete the snp which with smaller MAF
Remove to next window

PRS Practicality

We choose independent and low correlated SNPs by LD clumping.
For 11 p-value thresholds, we get different PRS.

Risk Allele

One allele consist of two SNP generally. Getting disease risk will increase as the number of risk allele increase. For example, Alzheimer with three allele, and APOE $\varepsilon_4$ is risk allele.

APOE $\varepsilon_2$ : consist of SNPs rs429358 (T), rs7412 (T)
APOE $\varepsilon_3$ : consist of SNPs rs429358 (T), rs7412 (C)
APOE $\varepsilon_4$ : consist of SNPs rs429358 (C), rs7412 (C)

Polygenic Transcriptome Risk Scores

\begin{align*} \text{PTRS}_i^{\boldsymbol{\lambda}} = \sum_g {\hat{T}_{ig}\beta_g^{\boldsymbol{\lambda}}} \end{align*}

where $\boldsymbol{\lambda}$ is made up of penalty term $\lambda$ and $\alpha$ in elastic net

Elastic Net

\begin{align*} \boldsymbol{\beta^{EN}} & = \displaystyle \arg \min_{\beta} \left\{ \underbrace{\frac{1}{N} \| \boldsymbol{Y} - \boldsymbol{X} \boldsymbol{\beta}- \boldsymbol{\beta_0} \|_2^2 }_{\textcolor{blue}{\text{loss}}} + \lambda \left[ \alpha \left\| \boldsymbol{\beta} \right\|_1 + (1-\alpha) \left\| \boldsymbol{\beta} \right\|_2^2 \right] \right\} \\ \end{align*}

\begin{align*} & \boldsymbol{\beta} = [0, \beta_1, \ldots, \beta_{M+L-1}]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{\beta_0} = [\beta_0, 0, \ldots, 0]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{Y} \in \mathbb{R}^{N \times 1}, \text{ observed phenotypes matrix} \\ & \boldsymbol{X} = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N \times (M+L)} \\ & M = \text{ number of genes } \\ & L= \text{ number of covariates } \\ & \hat{T}_i \in \mathbb{R}^{N \times 1}, \text{ predicted standardized i-th gene expression} \\ & C_i \in \mathbb{R}^{N \times 1}, \text{ observed i-th standardized covariate} \\ \end{align*}

\begin{align*} Y= \text{constant}+\varepsilon \end{align*}

PTRS Practicality

We set $\alpha = 0.1$ and find $\lambda_{\text{max}}$ as the smallest value satisfying $|\nabla l(\beta)| \leq \alpha \lambda$ .
To match the 11 PRS p-value cutoffs, we build a set of $lambda$ by selecting 20 equally spaced points in log scale between $1.5\lambda_{\text{max}}$ and $\frac{\lambda_{\text{max}}}{10^4}$ . And we use the first 11 non-degenerate models for each population.
At $\alpha=0.1$ , get $\beta$ in each $\lambda$ by elastic net.
For 11 $\lambda$ , we get different PTRS.

Partial $R$ Squared

\begin{align*} \text{Null model}:y ∼ 1+ \textit{ covariates } \quad \text{ v.s. } \quad \text{ Full model }: y ∼ 1+\textit{ covariates }+\hat{y_i} \end{align*}

\begin{align*} & \widetilde{R^2} \\ = \quad & 1- \dfrac{\text{SSE}_\text{full} }{\text{SSE}_\text{null} } \\ = \quad & \frac{C^2(y, \hat{y})}{C(y, y)C(\hat{y}, \hat{y})} \\ \end{align*}

\begin{align*} & C(u, v) = u^t v - u^t H v \\ & H = \widetilde{C} (\widetilde{C}^t \widetilde{C})^{-1} \widetilde{C}^t \\ & \widetilde{C} = [1, C_1, \dots, C_L] \end{align*}

How to choose hyperparameters?

Computing PTRS weights in discovery set (UKB EUR) and tested in the 5 target sets
Spliting each target set into two equal-size parts, a validation set and a test set
Selecting hyperparameters (p-value cutoff in clumping and thresholding, $\lambda$ in elastic net) maximize $\widetilde{R^2}$ in validation set

After choosing hyperparameters, we calculate the $\widetilde{R^2}$ in test set. This procedure was repeated $10$ times and we get the average $\widetilde{R^2}$ as the prediction accuracy.

Combining PTRS and PRS

combined score $\hat{y_i}$ = $c_1 PRS_i^\lambda+ c_2 PTRS_i^\lambda$

Spliting each target set into two equal-size parts, a validation set and a test set.
Spliting validation set into two equal-size parts.
For 11 $\lambda$ thresholds, find ${\arg \min}_{c_1,\ c_2} \sum_i (y_i-\hat{y_i})^2$ in first validation set. We get different $y_i$ in different threshold. So $c_1,\ c_2$ will different too. The idea like linear model, different $x_i$ will get different fitted line.
Now, we get the best $c_1,\ c_2$ in each thresold. Then we use $c_1$ , $c_2$ to compute combined score in second validation set.
Now, we get the combine score in 11 thresolds for each population. We select the threshold with biggest $\widetilde{R^2}$ .
Now, we get the best tuning $c_1,\ c_2$ and threshold in each population. And use it in test set.
Getting final $\widetilde{R^2}$ $\sim \sim$

Portability of PRS and PTRS

\begin{align*} \frac{\text{prediction accuracy in target set}}{ \text{prediction accuracy in European reference set}} \end{align*}

\begin{align*} \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} }, \quad \text{ where } \widetilde{R^2}_{\text{EUR ref}} \text{ is } \widetilde{R^2} \text{ in MESA EUR model } \end{align*}

Since MESA EUR model is expected to perform better than MESA AFHI model among EUR individuals. So

\begin{align*} & \widetilde{R^2}_{\text{MESA AFHI}} <\widetilde{R^2}_{\text{EUR ref}} \\ \Rightarrow \quad & \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} } < \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{MESA AFHI}} } \end{align*}

Therefore, definition of Portability of PTRS is conservative.

Result

In paper, fig 3 (b) means PTRS will use PVE; PRS will use heritability. In the plot, PTRS $\widetilde{R^2}$ can achieve upper bound (heritability), but PRS’s can not.
The performance of PTRS worse than PRS in fixed race. But PTRS + PRS better than PRS only.
Using PredictDB data to get the weight, and compute predicted gene expression (in other paper). There we use the weight directly.
In the paper, trait is continuous number. It is not related to survival analysis. As the trait is discrete, we will use survival analysis.
Finally, each race group will compute one PRS and PTRS.

Reference

Polygenic transcriptome risk scores(PTRS) can improve portability of polygenic risk scores across ancestries