Polygenic Risk Score

f(x)=12πf (x) = \frac{1}{\sqrt{2 \pi}}

Background

Polygenic Risk Score (PRS), a score to estimate the risk of a disease or disease-related trait for an individual. SNP affect the disease more as PRS increase. But PRS is limited for race. If the data sample is European, the PRS will get a great performance for European only. In this paper, the we try to add PTRS which improve the portability of race.

Data

f(X)=T\begin{align*} f(\bold{X}) = T \end{align*}

For predicting transcriptome (T^\hat{T}, set of predicted genes expression), we downloaded the prediction weights (coefficients of linear function) from data GTEx in tissue, MESA collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.

Dealing Data

Use UK Biobank data

  • GTEx data containing SNP and corresponding gene expression data.
  • Covariates include first genetic 20 PCs, age, sex, 17 phenotypes (血球等指標), ancestry race
  • labeled individuals as EUR, S.ASN, E.ASN, AFR
  • Remove high missing rates data
  • individual with multiple arrays (measure many times with different instruments): taking the average
    individual with multiple instances (measure many times with stage): Use the first non-missing value
  • predicted transcriptome depend on race, tissue. Different tissue or race with difference function transform SNP to gene expression

Quality control on self-reported ancestry

Sik=logP(PCi1,,PCi10 μk^,Σk^)\begin{align*} S_{ik} = \log{P(\text{PC}_i^1, \cdots, \text{PC}_i^{10} |\ \widehat{\mu_k}, \widehat{\Sigma_k})} \end{align*}

where μk^,Σk^\widehat{\mu_k}, \widehat{\Sigma_k} are sample mean, sample var respectively
For 4 populations (EUR, S.ASN, E.ASN, and AFR) and un-assigned populations, we choose those data with Sik>50S_{ik} > -50.

Proportion of Variance Explained

T~ig=Φ1(rank(T^ig)N+1)\begin{align*} \widetilde{T}_{ig} = \Phi^{-1}\left(\frac{\text{rank}(\hat{T}_{ig})}{N+1}\right) \end{align*}Yi=μ+lCilal+gT~igβg+εiεiiid N(0,σe2)βgiid N(0,σg2M)\begin{align*} & Y_i = \mu + \sum_l C_{il} a_l + \sum_g \widetilde{T}_{ig} \beta_g + \varepsilon_i \\ & \varepsilon_i \overset{\text{iid }}{\sim} N(0, \sigma_e^2) \\ & \beta_g \overset{\text{iid }}{\sim} N\left( 0, \frac{\sigma_g^2}{M} \right) \end{align*}PVEg=σ^g2M(σ^e2+σ^g2)\begin{align*} \text{PVE}_g = \frac{\hat{\sigma}_g^2}{M (\hat{\sigma}_e^2 + \hat{\sigma}_g^2)} \end{align*}

Polygenic Risk Scores

PRSit=j:pjtXijb^j\begin{align*} \text{PRS}_i^t=\sum_{j: p_j \leq t} X_{i j} \widehat{b}_j \end{align*}
  • b^j\hat{b}_j: according to GWAS, we get coefficient of SNPs
  • XiX_i: phenotype of ii SNP, number of risk allele also, 0 or 1 or 2

Independent and Significant

  • Significant: Obviously, we want to get the influent SNP for disease. So choose significant ones.
  • Independent: Putting related SNP into PRS, we get the score that repeated specific effect. It’s not accurate.

Fault of PRS
Computing PRS, we may choose different SNPs for difference race. Minority racial groups may be ignored when implementing public health measures.

LD Clumping v.s. LD Pruning

Because distance of every SNP is not equal. LD Clumping

  1. Ordering SNP by p-value
  2. Using SNP with the smallest p-value as the center for a range 250kb
  3. Removing those SNP with R2>0.1R^2 > 0.1
  4. Proceeding to the SNP with the next smallest p-value

LD Pruning

  1. In a window, we compute r2r^2 of each pair SNP
  2. If r2r^2 bigger than threshold, delete the snp which with smaller MAF
  3. Remove to next window

PRS Practicality

  1. We choose independent and low correlated SNPs by LD clumping.
  2. For 11 p-value thresholds, we get different PRS.

Risk Allele

One allele consist of two SNP generally. Getting disease risk will increase as the number of risk allele increase. For example, Alzheimer with three allele, and APOE ε4\varepsilon_4 is risk allele.

  1. APOE ε2\varepsilon_2: consist of SNPs rs429358 (T), rs7412 (T)
  2. APOE ε3\varepsilon_3: consist of SNPs rs429358 (T), rs7412 (C)
  3. APOE ε4\varepsilon_4: consist of SNPs rs429358 (C), rs7412 (C)

Polygenic Transcriptome Risk Scores

PTRSiλ=gT^igβgλ\begin{align*} \text{PTRS}_i^{\boldsymbol{\lambda}} = \sum_g {\hat{T}_{ig}\beta_g^{\boldsymbol{\lambda}}} \end{align*}

where λ\boldsymbol{\lambda} is made up of penalty term λ\lambda and α\alpha in elastic net

Elastic Net

βEN=argminβ{1NYXββ022loss+λ[αβ1+(1α)β22]}\begin{align*} \boldsymbol{\beta^{EN}} & = \displaystyle \arg \min_{\beta} \left\{ \underbrace{\frac{1}{N} \| \boldsymbol{Y} - \boldsymbol{X} \boldsymbol{\beta}- \boldsymbol{\beta_0} \|_2^2 }_{\textcolor{blue}{\text{loss}}} + \lambda \left[ \alpha \left\| \boldsymbol{\beta} \right\|_1 + (1-\alpha) \left\| \boldsymbol{\beta} \right\|_2^2 \right] \right\} \\ \end{align*}β=[0,β1,,βM+L1]TR(M+L)×1β0=[β0,0,,0]TR(M+L)×1YRN×1, observed phenotypes matrixX=[T^1,,T^M,C1,,CL]RN×(M+L)M= number of genes L= number of covariates T^iRN×1, predicted standardized i-th gene expressionCiRN×1, observed i-th standardized covariate\begin{align*} & \boldsymbol{\beta} = [0, \beta_1, \ldots, \beta_{M+L-1}]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{\beta_0} = [\beta_0, 0, \ldots, 0]^T \in \mathbb{R}^{(M+L) \times 1} \\ & \boldsymbol{Y} \in \mathbb{R}^{N \times 1}, \text{ observed phenotypes matrix} \\ & \boldsymbol{X} = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N \times (M+L)} \\ & M = \text{ number of genes } \\ & L= \text{ number of covariates } \\ & \hat{T}_i \in \mathbb{R}^{N \times 1}, \text{ predicted standardized i-th gene expression} \\ & C_i \in \mathbb{R}^{N \times 1}, \text{ observed i-th standardized covariate} \\ \end{align*}Y=constant+ε\begin{align*} Y= \text{constant}+\varepsilon \end{align*}

PTRS Practicality

  1. We set α=0.1\alpha = 0.1 and find λmax\lambda_{\text{max}} as the smallest value satisfying l(β)αλ|\nabla l(\beta)| \leq \alpha \lambda.
  2. To match the 11 PRS p-value cutoffs, we build a set of lambdalambda by selecting 20 equally spaced points in log scale between 1.5λmax1.5\lambda_{\text{max}} and λmax104\frac{\lambda_{\text{max}}}{10^4}. And we use the first 11 non-degenerate models for each population.
  3. At α=0.1\alpha=0.1, get β\beta in each λ\lambda by elastic net.
  4. For 11 λ\lambda, we get different PTRS.

Partial RR Squared

Null model:y1+ covariates  v.s.  Full model :y1+ covariates +yi^\begin{align*} \text{Null model}:y ∼ 1+ \textit{ covariates } \quad \text{ v.s. } \quad \text{ Full model }: y ∼ 1+\textit{ covariates }+\hat{y_i} \end{align*}R2~=1SSEfullSSEnull=C2(y,y^)C(y,y)C(y^,y^)\begin{align*} & \widetilde{R^2} \\ = \quad & 1- \dfrac{\text{SSE}_\text{full} }{\text{SSE}_\text{null} } \\ = \quad & \frac{C^2(y, \hat{y})}{C(y, y)C(\hat{y}, \hat{y})} \\ \end{align*}C(u,v)=utvutHvH=C~(C~tC~)1C~tC~=[1,C1,,CL]\begin{align*} & C(u, v) = u^t v - u^t H v \\ & H = \widetilde{C} (\widetilde{C}^t \widetilde{C})^{-1} \widetilde{C}^t \\ & \widetilde{C} = [1, C_1, \dots, C_L] \end{align*}

How to choose hyperparameters?

  1. Computing PTRS weights in discovery set (UKB EUR) and tested in the 5 target sets
  2. Spliting each target set into two equal-size parts, a validation set and a test set
  3. Selecting hyperparameters (p-value cutoff in clumping and thresholding, λ\lambda in elastic net) maximize R2~\widetilde{R^2} in validation set

After choosing hyperparameters, we calculate the R2~\widetilde{R^2} in test set. This procedure was repeated 1010 times and we get the average R2~\widetilde{R^2} as the prediction accuracy.

Combining PTRS and PRS

combined score yi^\hat{y_i}= c1PRSiλ+c2PTRSiλc_1 PRS_i^\lambda+ c_2 PTRS_i^\lambda

  1. Spliting each target set into two equal-size parts, a validation set and a test set.
  2. Spliting validation set into two equal-size parts.
  3. For 11 λ\lambda thresholds, find argminc1, c2i(yiyi^)2{\arg \min}_{c_1,\ c_2} \sum_i (y_i-\hat{y_i})^2 in first validation set. We get different yiy_i in different threshold. So c1, c2c_1,\ c_2 will different too. The idea like linear model, different xix_i will get different fitted line.
  4. Now, we get the best c1, c2c_1,\ c_2 in each thresold. Then we use c1c_1, c2c_2 to compute combined score in second validation set.
  5. Now, we get the combine score in 11 thresolds for each population. We select the threshold with biggest R2~\widetilde{R^2}.
  6. Now, we get the best tuning c1, c2c_1,\ c_2 and threshold in each population. And use it in test set.
  7. Getting final R2~\widetilde{R^2} \sim \sim

Portability of PRS and PTRS

prediction accuracy in target setprediction accuracy in European reference set\begin{align*} \frac{\text{prediction accuracy in target set}}{ \text{prediction accuracy in European reference set}} \end{align*}R2~ in target setR2~EUR ref, where R2~EUR ref is R2~ in MESA EUR model \begin{align*} \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} }, \quad \text{ where } \widetilde{R^2}_{\text{EUR ref}} \text{ is } \widetilde{R^2} \text{ in MESA EUR model } \end{align*}

Since MESA EUR model is expected to perform better than MESA AFHI model among EUR individuals. So

R2~MESA AFHI<R2~EUR refR2~ in target setR2~EUR ref<R2~ in target setR2~MESA AFHI\begin{align*} & \widetilde{R^2}_{\text{MESA AFHI}} <\widetilde{R^2}_{\text{EUR ref}} \\ \Rightarrow \quad & \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{EUR ref}} } < \frac{ \widetilde{R^2} \text{ in target set}}{ \widetilde{R^2}_{\text{MESA AFHI}} } \end{align*}

Therefore, definition of Portability of PTRS is conservative.

Result

  • In paper, fig 3 (b) means PTRS will use PVE; PRS will use heritability. In the plot, PTRS R2~\widetilde{R^2} can achieve upper bound (heritability), but PRS’s can not.
  • The performance of PTRS worse than PRS in fixed race. But PTRS + PRS better than PRS only.
  • Using PredictDB data to get the weight, and compute predicted gene expression (in other paper). There we use the weight directly.
  • In the paper, trait is continuous number. It is not related to survival analysis. As the trait is discrete, we will use survival analysis.
  • Finally, each race group will compute one PRS and PTRS.

Reference

Polygenic transcriptome risk scores(PTRS) can improve portability of polygenic risk scores across ancestries

Licensed under CC BY-NC-SA 4.0
使用 Hugo 建立
主題 StackJimmy 設計