Polygenic Risk Score

Background

Polygenic Risk Score (PRS), a score to estimate the risk of a disease or disease-related trait for an individual. SNP affect the disease more as PRS increase. But PRS is limited for race. If the data sample is European, the PRS will get a great performance for European only. In this paper, the we try to add PTRS which improve the portability of race.

Data

(put setup picture ) Use 356,476 unrelated Europeans in the UK Biobank for the discovery set. In training sets, the first set of models had been trained in European individuals from the GTEx v8 in whole blood. The second set of models had been trained using array-based expression in monocyte samples of Europeans from MESA.

$$ \begin{align*} f(\bold{X}) = T \end{align*} $$

For predicting transcriptome ($\hat{T}$, set of predicted genes expression), we downloaded the prediction weights (coefficients of linear function) from data GTEx in tissue, MESA collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.

Dealing Data

Use UK Biobank data

  • GTEx data containing SNP and corresponding gene expression data.
  • Covariates include first genetic 20 PCs, age, sex, 17 phenotypes (血球等指標), ancestry race
  • labeled individuals as EUR, S.ASN, E.ASN, AFR
  • Remove high missing rates data
  • individual with multiple arrays (measure many times with different instruments): taking the average
    individual with multiple instances (measure many times with stage): Use the first non-missing value
  • predicted transcriptome depend on race, tissue. Different tissue or race with difference function transform SNP to gene expression

Quality control on self-reported ancestry

Because lots sample may give wrong information about ancestry race. We defined similarity $S_{ik}$ for individual $i$ in population $k$

$$ \begin{align*} S_{ik} = \log{P(\text{PC}_i^1, \cdots, \text{PC}_i^{10} |\ \widehat{\mu_k}, \widehat{\Sigma_k})} \end{align*} $$

where $\widehat{\mu_k}, \widehat{\Sigma_k}$ are sample mean, sample var respectively
For 4 populations (EUR, S.ASN, E.ASN, and AFR) and un-assigned populations, we choose those data with $S_{ik} > -50$.

Proportion of Variance Explained

First, we convert the predicted gene expression $\hat{T}_{ig}$ to $\widetilde{T}_{ig}$. We can control the range of transformed gene expression like Normal quantile. What’s more, the transformed gene expression will follow $N(0,1)$. For gene $g$, $i$ individual transformed gene expression is

$$ \begin{align*} \widetilde{T}_{ig} = \Phi^{-1}\left(\frac{\text{rank}(\hat{T}_{ig})}{N+1}\right) \end{align*} $$

Suppose the observed phenotype ( $Y_i$ ) has a linear relation with the $l$th covariate ( $C_{il}$ ) and the inverse-normalized predicted expression of gene g ( $\widetilde{T}_{ig}$ )

$$ \begin{align*} & Y_i = \mu + \sum_l C_{il} a_l + \sum_g \widetilde{T}_{ig} \beta_g + \varepsilon_i \\ & \varepsilon_i \overset{\text{iid }}{\sim} N(0, \sigma_e^2) \\ & \beta_g \overset{\text{iid }}{\sim} N\left( 0, \frac{\sigma_g^2}{M} \right) \end{align*} $$

Proportion of variance explained (PVE) idea like $R^2$. We determine the amount of explained data variation by the covariate. $R^2$ determine the amount of explained data variation by whole covariates; PVE can focus on the covariate we concerned. And PVE for each trait will depend on the heritability of the trait. PVE of gene $g$ is

$$ \begin{align*} \text{PVE}_g = \frac{\hat{\sigma}_g^2}{M (\hat{\sigma}_e^2 + \hat{\sigma}_g^2)} \end{align*} $$

Polygenic Risk Scores

We need to use LD clumping filtering the independent and significant SNP. Using discovery data to compute PRS, polygenic Risk Scores (PRS) for individual $i$ at GWAS p-value thresholds $t$

$$ \begin{align*} \text{PRS}_i^t=\sum_{j: p_j \leq t} X_{i j} \widehat{b}_j \end{align*} $$
  • $\hat{b}_j$: according to GWAS, we get coefficient of SNPs
  • $X_i$: phenotype of $i$ SNP, number of risk allele also, 0 or 1 or 2

Independent and Significant

  • Significant: Obviously, we want to get the influent SNP for disease. So choose significant ones.
  • Independent: Putting related SNP into PRS, we get the score that repeated specific effect. It’s not accurate.

Fault of PRS
Computing PRS, we may choose different SNPs for difference race. Minority racial groups may be ignored when implementing public health measures.

LD Clumping v.s. LD Pruning

Because distance of every SNP is not equal. LD Clumping

  1. Ordering SNP by p-value
  2. Using SNP with the smallest p-value as the center for a range 250kb
  3. Removing those SNP with $R^2 > 0.1$
  4. Proceeding to the SNP with the next smallest p-value

LD Pruning

  1. In a window, we compute $r^2$ of each pair SNP
  2. If $r^2$ bigger than threshold, delete the snp which with smaller MAF
  3. Remove to next window

Example for LD Pruning

We calculate each pair SNP LD in $50000$ SNP window size, if $r^2 > 0.2$, delete the smaller MAF one. And do it for next window. It doesn’t compute LD in different chromosome even though SNP are a few.
Region is that window 1: 0 – 50,000-th SNP; window 2: 5,000-th SNP – 55,000-th SNP

1
plink --bfile xxx1 --chr 1-22 --indep-pairwise 50000 5000 0.2 --out output2

We calculate each pair SNP LD in $50$ kb window size, if $r^2 > 0.2$, delete the smaller MAF one. And do it for next window. It doesn’t compute LD in different chromosome even though SNP are a few. Region is that window 1: 0 – 50 kb; window 2: the next 5 SNP – (the next 5 SNP pos +50 kb)

1
plink --bfile xxx1 --chr 1-22 --indep-pairwise-kb 50 5 0.2 --out output2

PRS Practicality

  1. We choose independent and low correlated SNPs by LD clumping.
  2. For 11 p-value thresholds, we get different PRS.

Risk Allele

One allele consist of two SNP generally. Getting disease risk will increase as the number of risk allele increase. For example, Alzheimer with three allele, and APOE $\varepsilon_4$ is risk allele.

  1. APOE $\varepsilon_2$: consist of SNPs rs429358 (T), rs7412 (T)
  2. APOE $\varepsilon_3$: consist of SNPs rs429358 (T), rs7412 (C)
  3. APOE $\varepsilon_4$: consist of SNPs rs429358 (C), rs7412 (C)

Polygenic Transcriptome Risk Scores

The advantage of polygenic transcriptome risk scores(PTRS) is that the PTRS model has fewer covariates anf requires a smaller sample size for training.And the used gene is more closed to disease rather than SNPs data in PRS. In this paper, we using discovery data to find PTRS. PTRS for individual i at GWAS p-value thresholds $\lambda$ and gene $g$ is

$$ \begin{align*} \text{PTRS}_i^{\boldsymbol{\lambda}} = \sum_g {\hat{T}_{ig}\beta_g^{\boldsymbol{\lambda}}} \end{align*} $$

where $\boldsymbol{\lambda}$ is made up of penalty term $\lambda$ and $\alpha$ in elastic net

Elastic Net

A variable selection method is used, but it’s unsuitable for too many covariates. PTRS has fewer covariates than PRS, so elastic net is applied only for PTRS. For PRS, LD clumping and p-value filtering are used. Elastic net identifies the optimal $\boldsymbol{\beta^{EN}}$. For N individuals

$$ \begin{align*} \boldsymbol{\beta^{EN}} & = \displaystyle \arg \min_{\beta} \left\{ \underbrace{\frac{1}{N} \| \boldsymbol{Y} - \boldsymbol{X} \boldsymbol{\beta}- \boldsymbol{\beta_0} \|_2^2 }_{\textcolor{blue}{\text{loss}}} + \lambda \left[ \alpha \left\| \boldsymbol{\beta} \right\|_1 + (1-\alpha) \left\| \boldsymbol{\beta} \right\|_2^2 \right] \right\} \\ \end{align*} $$

where

$$ \begin{align*} & \boldsymbol{\beta} = [0, \beta_1, \ldots, \beta_{M+L-1}]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{\beta_0} = [\beta_0, 0, \ldots, 0]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{Y} \in \mathbb{R}^{N \times 1}, \text{ observed phenotypes matrix} \\ & \boldsymbol{X} = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N \times (M+L)} \\ & M = \text{ number of genes } \\ & L= \text{ number of covariates } \\ & \hat{T}_i \in \mathbb{R}^{N \times 1}, \text{ predicted standardized i-th gene expression} \\ & C_i \in \mathbb{R}^{N \times 1}, \text{ observed i-th standardized covariate} \\ \end{align*} $$

Degenerate Model

$$ \begin{align*} Y= \text{constant}+\varepsilon \end{align*} $$

PTRS Practicality

  1. We set $\alpha = 0.1$ and find $\lambda_{\text{max}}$ as the smallest value satisfying $|\nabla l(\beta)| \leq \alpha \lambda$.
  2. To match the 11 PRS p-value cutoffs, we build a set of $lambda$ by selecting 20 equally spaced points in log scale between $1.5\lambda_{\text{max}}$ and $\frac{\lambda_{\text{max}}}{10^4}$. And we use the first 11 non-degenerate models for each population.
  3. At $\alpha=0.1$, get $\beta$ in each $\lambda$ by elastic net.
  4. For 11 $\lambda$, we get different PTRS.

Partial $R$ Squared

TO compare he performance of PRS and PTRS, Partial $R$ Squared ($\widetilde{R^2}$) can be used as the metric for evaluation. Partial $R^2$ called prediction accuracy also. Let $y_i$ denote the observed phenotype, $\hat{y_i}$ denote the predicted phenotype. And

$$ \begin{align*} \text{Null model}:y ∼ 1+ \textit{ covariates } \quad \text{ v.s. } \quad \text{ Full model }: y ∼ 1+\textit{ covariates }+\hat{y_i} \end{align*} $$$$ \begin{align*} & \widetilde{R^2} \\ = \quad & 1- \dfrac{\text{SSE}_\text{full} }{\text{SSE}_\text{null} } \\ = \quad & \frac{C^2(y, \hat{y})}{C(y, y)C(\hat{y}, \hat{y})} \\ \end{align*} $$

where

$$ \begin{align*} & C(u, v) = u^t v - u^t H v \\ & H = \widetilde{C} (\widetilde{C}^t \widetilde{C})^{-1} \widetilde{C}^t \\ & \widetilde{C} = [1, C_1, \dots, C_L] \end{align*} $$

How to choose hyperparameters?

  1. Computing PTRS weights in discovery set (UKB EUR) and tested in the 5 target sets
  2. Spliting each target set into two equal-size parts, a validation set and a test set
  3. Selecting hyperparameters (p-value cutoff in clumping and thresholding, $\lambda$ in elastic net) maximize $\widetilde{R^2} $ in validation set

After choosing hyperparameters, we calculate the $\widetilde{R^2} $ in test set. This procedure was repeated $10$ times and we get the average $\widetilde{R^2} $ as the prediction accuracy.

Combining PTRS and PRS

combined score $\hat{y_i}$= $c_1 PRS_i^\lambda+ c_2 PTRS_i^\lambda$

  1. Spliting each target set into two equal-size parts, a validation set and a test set.
  2. Spliting validation set into two equal-size parts.
  3. For 11 $\lambda$ thresholds, find ${\arg \min}_{c_1,\ c_2} \sum_i (y_i-\hat{y_i})^2$ in first validation set. We get different $y_i$ in different threshold. So $c_1,\ c_2$ will different too. The idea like linear model, different $x_i$ will get different fitted line.
  4. Now, we get the best $c_1,\ c_2$ in each thresold. Then we use $c_1$, $c_2$ to compute combined score in second validation set.
  5. Now, we get the combine score in 11 thresolds for each population. We select the threshold with biggest $\widetilde{R^2}$.
  6. Now, we get the best tuning $c_1,\ c_2$ and threshold in each population. And use it in test set.
  7. Getting final $\widetilde{R^2}$ $\sim \sim$

Portability of PRS and PTRS

Portability of PRS is

$$ \begin{align*} \frac{\text{prediction accuracy in target set}}{ \text{prediction accuracy in European reference set}} \end{align*} $$

Portability of PTRS is

$$ \begin{align*} \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} }, \quad \text{ where } \widetilde{R^2}_{\text{EUR ref}} \text{ is } \widetilde{R^2} \text{ in MESA EUR model } \end{align*} $$

Since MESA EUR model is expected to perform better than MESA AFHI model among EUR individuals. So

$$ \begin{align*} & \widetilde{R^2}_{\text{MESA AFHI}} <\widetilde{R^2}_{\text{EUR ref}} \\ \Rightarrow \quad & \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} } < \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{MESA AFHI}} } \end{align*} $$

Therefore, definition of Portability of PTRS is conservative.

Result

  • In paper, fig 3 (b) means PTRS will use PVE; PRS will use heritability. In the plot, PTRS $\widetilde{R^2}$ can achieve upper bound (heritability), but PRS’s can not.
  • The performance of PTRS worse than PRS in fixed race. But PTRS + PRS better than PRS only.
  • Using PredictDB data to get the weight, and compute predicted gene expression (in other paper). There we use the weight directly.
  • In the paper, trait is continuous number. It is not related to survival analysis. As the trait is discrete, we will use survival analysis.
  • Finally, each race group will compute one PRS and PTRS.

Reference

Polygenic transcriptome risk scores(PTRS) can improve portability of polygenic risk scores across ancestries

Licensed under CC BY-NC-SA 4.0
comments powered by Disqus
使用 Hugo 建立
主題 StackJimmy 設計