A polygenic risk score (PRS) estimates an individual's risk of a disease or disease-related trait.
The higher the PRS, the larger the combined contribution of the individual's SNPs to the disease. However, PRS transfers poorly across ancestries: if the discovery sample is European, the PRS
performs well for European individuals only. The paper adds the polygenic transcriptome risk score (PTRS), which improves
portability across ancestries.
Data
f(X) = T, where X is the genotype (SNPs) and T is the transcriptome (gene expression)
To predict the transcriptome ($\hat{T}$, the set of predicted gene expression levels), we downloaded the prediction weights (the coefficients of a linear function) trained on GTEx (per tissue) and MESA data, as
collected in PredictDB. The weights for calculating PRS and PTRS were estimated in the discovery set.
Data Processing
Use UK Biobank data
GTEx data contain SNPs and the corresponding gene expression data.
Covariates include the first 20 genetic PCs, age, sex, 17 phenotypes (blood cell indices and similar measures), and ancestry
Individuals are labeled as EUR, S.ASN, E.ASN, or AFR
Remove data with high missing rates
Individuals with multiple arrays (measured several times with different instruments): take the average
Individuals with multiple instances (measured several times at different stages): use the first non-missing value
The predicted transcriptome depends on ancestry and tissue: a different tissue or
ancestry has a different function mapping SNPs to gene expression
Quality control on self-reported ancestry
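$$S_{ik} = \log P(\mathrm{PC}_{i1}, \cdots, \mathrm{PC}_{i10} \mid \mu_k, \Sigma_k)$$
where $\mu_k$ and $\Sigma_k$ are the sample mean and sample covariance of the PCs in population $k$, respectively.
For the 4 populations (EUR, S.ASN, E.ASN, and AFR) and the unassigned individuals, we keep the individuals with $S_{ik} > -50$.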
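The ancestry score can be sketched in plain Python as a multivariate normal log-density. This is a minimal sketch, assuming $\Sigma_k$ is a full, positive-definite covariance matrix over the PCs. The function names (`cholesky`, `mvn_logpdf`, `passes_ancestry_qc`) are hypothetical, and the toy inputs use 2 PCs instead of 10.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def mvn_logpdf(x, mu, sigma):
    """log P(x | mu, sigma) for a multivariate normal distribution."""
    d = len(x)
    L = cholesky(sigma)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    # Forward substitution: solve L z = diff, so that z.z = diff^T sigma^{-1} diff.
    z = []
    for i in range(d):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))
    quad = sum(v * v for v in z)
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)

def passes_ancestry_qc(pcs, mu, sigma, cutoff=-50.0):
    """Keep an individual when the score S_ik exceeds the cutoff."""
    return mvn_logpdf(pcs, mu, sigma) > cutoff

# Toy population k with 2 PCs (the notes use 10 PCs).
mu_k = [0.0, 0.0]
sigma_k = [[1.0, 0.0], [0.0, 1.0]]
print(passes_ancestry_qc([0.5, -0.3], mu_k, sigma_k))  # close to the population mean
print(passes_ancestry_qc([20.0, 0.0], mu_k, sigma_k))  # far from the population mean
```

In practice $\mu_k$ and $\Sigma_k$ would be estimated from the individuals who self-report as population $k$.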
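The PRS of individual $i$ is $\mathrm{PRS}_i = \sum_j \hat{b}_j X_{ij}$, where
$\hat{b}_j$: the effect size of SNP $j$, estimated by GWAS
$X_{ij}$: the genotype of individual $i$ at SNP $j$, i.e. the number of risk alleles (0, 1, or 2)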
Independent and Significant
Significant: we want the SNPs that influence the disease, so we choose the significant ones.
Independent: if correlated SNPs are put into the PRS, the score counts the same effect repeatedly and becomes inaccurate.
Limitation of PRS
When computing PRS, the selected SNPs may differ between ancestries, so minority ancestry groups may be overlooked when
public health measures are implemented.
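LD Clumping vs. LD Pruning
Both methods select approximately independent SNPs. They rely on LD rather than on a fixed spacing because SNPs are not evenly spaced along the genome.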
LD Clumping
Order the SNPs by p-value
Take the SNP with the smallest p-value as the index SNP, at the center of a 250 kb window
Remove the SNPs in that window whose $r^2$ with the index SNP is above 0.1
Proceed to the remaining SNP with the next smallest p-value and repeat
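The clumping steps can be sketched as a greedy loop. This is an illustrative sketch, not the exact procedure of any specific tool: it assumes all SNPs lie on one chromosome, treats the window as "within 250 kb of the index SNP", and takes pairwise $r^2$ from a lookup function. The names `ld_clump`, `r2_lookup`, and the toy SNPs are hypothetical.

```python
def ld_clump(snps, r2, window_bp=250_000, r2_max=0.1):
    """Greedy LD clumping.

    snps: list of dicts with keys 'id', 'pos' (base pairs) and 'p' (GWAS p-value).
    r2:   function (id_a, id_b) -> squared correlation between two SNPs.
    Returns the ids of the kept (index) SNPs, most significant first.
    """
    removed = set()
    kept = []
    for index in sorted(snps, key=lambda s: s["p"]):
        if index["id"] in removed:
            continue  # already clumped under a more significant SNP
        kept.append(index["id"])
        for other in snps:
            if other["id"] == index["id"] or other["id"] in removed or other["id"] in kept:
                continue
            # Assumption: "window" means within window_bp of the index SNP.
            close = abs(other["pos"] - index["pos"]) <= window_bp
            if close and r2(index["id"], other["id"]) > r2_max:
                removed.add(other["id"])
    return kept

# Toy LD structure: unlisted pairs are treated as uncorrelated.
toy_r2 = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "D")): 0.05,
    frozenset(("B", "C")): 0.9,
}

def r2_lookup(a, b):
    return toy_r2.get(frozenset((a, b)), 0.0)

toy_snps = [
    {"id": "A", "pos": 1_000, "p": 1e-8},
    {"id": "B", "pos": 2_000, "p": 1e-5},
    {"id": "C", "pos": 500_000, "p": 1e-6},
    {"id": "D", "pos": 3_000, "p": 1e-2},
]
clumped = ld_clump(toy_snps, r2_lookup)
print(clumped)
```

Here B is dropped because it is in high LD with the more significant SNP A, while C survives because it lies outside A's window.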
LD Pruning
In a window, compute $r^2$ for each pair of SNPs
If $r^2$ is larger than the threshold, delete the SNP with the smaller MAF
Move to the next window
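For comparison, pruning can be sketched as a sliding window that never looks at p-values. This is a minimal sketch under assumptions: the window is counted in SNPs, the step size and threshold are caller-supplied, and ties in MAF drop the later SNP. The names `ld_prune` and `r2_lookup` are hypothetical.

```python
def ld_prune(snps, r2, window_size, step, r2_max):
    """Sliding-window LD pruning.

    snps: list of dicts with keys 'id' and 'maf', sorted by genomic position.
    r2:   function (id_a, id_b) -> squared correlation between two SNPs.
    window_size and step are counted in SNPs (an assumption of this sketch).
    """
    removed = set()
    start = 0
    while start < len(snps):
        window = [s for s in snps[start:start + window_size] if s["id"] not in removed]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                a, b = window[i], window[j]
                if a["id"] in removed or b["id"] in removed:
                    continue
                if r2(a["id"], b["id"]) > r2_max:
                    # Drop the SNP with the smaller minor allele frequency.
                    loser = a if a["maf"] < b["maf"] else b
                    removed.add(loser["id"])
        start += step
    return [s["id"] for s in snps if s["id"] not in removed]

toy_r2 = {frozenset(("A", "B")): 0.5, frozenset(("B", "C")): 0.9}

def r2_lookup(a, b):
    return toy_r2.get(frozenset((a, b)), 0.0)

toy_snps = [
    {"id": "A", "maf": 0.30},
    {"id": "B", "maf": 0.10},
    {"id": "C", "maf": 0.20},
]
pruned = ld_prune(toy_snps, r2_lookup, window_size=3, step=1, r2_max=0.2)
print(pruned)
```

Unlike clumping, the choice of which SNP to keep ignores association strength, which is why clumping is the usual choice when building a PRS.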
PRS Practicality
We choose independent, weakly correlated SNPs by LD clumping.
Applying 11 p-value thresholds yields 11 different PRS.
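Clumping followed by thresholding can be sketched as one weighted sum per cutoff. This is a sketch under assumptions: the 11 cutoffs are not listed in these notes, so the three thresholds below are hypothetical, as are the function names and the inclusion rule `p <= threshold`.

```python
def prs(genotype, betas, snp_ids):
    """PRS = sum over SNPs of (GWAS effect size) x (number of risk alleles)."""
    return sum(betas[s] * genotype[s] for s in snp_ids)

def prs_by_threshold(genotypes, betas, pvals, clumped_ids, thresholds):
    """One PRS per individual for each p-value threshold.

    genotypes:   list of dicts {snp_id: risk-allele count in (0, 1, 2)}.
    clumped_ids: SNPs that survived LD clumping.
    """
    scores = {}
    for t in thresholds:
        selected = [s for s in clumped_ids if pvals[s] <= t]
        scores[t] = [prs(g, betas, selected) for g in genotypes]
    return scores

# Toy inputs; thresholds are illustrative, not the cutoffs used in the paper.
betas = {"A": 0.5, "C": -0.2, "D": 0.1}
pvals = {"A": 1e-8, "C": 1e-6, "D": 1e-2}
genotypes = [{"A": 2, "C": 0, "D": 1}, {"A": 0, "C": 1, "D": 2}]
scores = prs_by_threshold(genotypes, betas, pvals, ["A", "C", "D"], [1e-7, 1e-5, 0.05])
for threshold, values in scores.items():
    print(threshold, values)
```

A stricter threshold keeps fewer SNPs, so each threshold produces a different score for the same individual.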
Risk Allele
An allele can be defined by more than one SNP; each APOE allele below is defined by two SNPs. Disease risk increases with the number of risk alleles carried.
For example, APOE has three alleles relevant to Alzheimer's disease, and APOE ε4 is the risk allele.
APOE ε2: consists of SNPs rs429358 (T), rs7412 (T)
APOE ε3: consists of SNPs rs429358 (T), rs7412 (C)
APOE ε4: consists of SNPs rs429358 (C), rs7412 (C)
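The APOE example maps directly to a lookup table. This is a small sketch, assuming phased haplotypes are available (each haplotype gives the allele at rs429358 and at rs7412). The names `APOE_ALLELES`, `apoe_allele`, and `count_e4` are hypothetical, and a combination not listed above raises a `KeyError`.

```python
# (rs429358 allele, rs7412 allele) -> APOE allele, as listed in the notes.
APOE_ALLELES = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
}

def apoe_allele(rs429358, rs7412):
    """APOE allele carried on one haplotype."""
    return APOE_ALLELES[(rs429358, rs7412)]

def count_e4(haplotype_1, haplotype_2):
    """Number of e4 risk alleles (0, 1, or 2) for one individual."""
    return sum(apoe_allele(*h) == "e4" for h in (haplotype_1, haplotype_2))

print(count_e4(("C", "C"), ("T", "C")))  # one e4 haplotype and one e3 haplotype
```

The returned count is the 0, 1, or 2 value that enters a risk score as the number of risk alleles.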
Polygenic Transcriptome Risk Scores
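$$\mathrm{PTRS}_i^{\lambda} = \sum_g \hat{T}_{ig}\,\beta_g^{\lambda}$$
where the superscript $\lambda$ stands for the elastic net hyperparameters: the penalty strength $\lambda$ and the mixing parameter $\alpha$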
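The PTRS is a two-step weighted sum: SNPs to predicted expression, then predicted expression to score. The sketch below assumes the prediction weights are available as a nested dictionary; the gene names, SNP ids, weights, and function names are hypothetical, and the standardization of predicted expression is omitted for brevity.

```python
def predict_expression(genotype, weights):
    """Predicted expression per gene: a linear function of SNP dosages.

    genotype: {snp_id: allele count}
    weights:  {gene: {snp_id: prediction weight}}
    """
    return {
        gene: sum(w * genotype.get(snp, 0) for snp, w in snp_weights.items())
        for gene, snp_weights in weights.items()
    }

def ptrs(pred_expr, gene_betas):
    """PTRS = sum over genes of (predicted expression) x (gene-level weight)."""
    return sum(pred_expr[g] * b for g, b in gene_betas.items())

# Toy inputs (hypothetical ids and values).
weights = {"GENE1": {"rs1": 0.5, "rs2": -1.0}, "GENE2": {"rs3": 2.0}}
gene_betas = {"GENE1": 0.3, "GENE2": -0.1}  # one set of betas per lambda
genotype = {"rs1": 2, "rs2": 0, "rs3": 1}

expr = predict_expression(genotype, weights)
print(expr, ptrs(expr, gene_betas))
```

Each value of $\lambda$ gives a different set of gene-level weights, and therefore a different PTRS for the same individual.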
Elastic Net
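$$
\beta^{EN} = \arg\min_{\beta} \left\{ \underbrace{\frac{1}{N} \lVert Y - X\beta - \beta_0 \rVert_2^2}_{\text{loss}} + \lambda \left[ \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \right] \right\}
$$
$\beta = [0, \beta_1, \ldots, \beta_{M+L-1}]^T \in \mathbb{R}^{(M+L)\times 1}$
$\beta_0 = [\beta_0, 0, \ldots, 0]^T \in \mathbb{R}^{(M+L)\times 1}$
$Y \in \mathbb{R}^{N\times 1}$, observed phenotypes
$X = [\hat{T}_1, \ldots, \hat{T}_M, C_1, \ldots, C_L] \in \mathbb{R}^{N\times(M+L)}$
$M$ = number of genes, $L$ = number of covariates
$\hat{T}_i \in \mathbb{R}^{N\times 1}$, predicted standardized $i$-th gene expression
$C_i \in \mathbb{R}^{N\times 1}$, observed $i$-th standardized covariate
$Y = \text{constant} + \varepsilon$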
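One way to minimize an objective of this form is cyclic coordinate descent with soft-thresholding. The sketch below is illustrative and is not claimed to be the solver used in the paper: it treats the intercept as a single unpenalized scalar, penalizes every column of X equally, and runs a fixed number of sweeps. The function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, lam, alpha, n_iter=200):
    """Minimize (1/N)||y - b0 - X b||^2 + lam*[alpha*||b||_1 + (1-alpha)*||b||_2^2].

    X: list of N rows, each a list of p features. Returns (b0, b).
    Assumption: b0 is an unpenalized scalar intercept.
    """
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        fitted = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        intercept = sum(y[i] - fitted[i] for i in range(n)) / n
        for j in range(p):
            # Partial residual that leaves out feature j.
            r = [
                y[i] - intercept - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = (2.0 / n) * sum(X[i][j] * r[i] for i in range(n))
            denom = (2.0 / n) * sum(X[i][j] ** 2 for i in range(n)) + 2.0 * lam * (1.0 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / denom
    return intercept, beta

# Toy data: y = 1 + 2 * x1, with x2 irrelevant.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.0, -1.0, 3.0, -1.0]
for lam in (0.0, 10.0, 100.0):
    print(lam, elastic_net_cd(X, y, lam=lam, alpha=0.1))
```

With `lam=0` the fit reduces to ordinary least squares, and a large enough `lam` shrinks every coefficient to exactly zero, which is the degenerate intercept-only model.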
PTRS Practicality
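We set $\alpha = 0.1$ and find $\lambda_{\max}$ as the smallest $\lambda$ satisfying $|\nabla l(\beta)| \le \alpha\lambda$ at $\beta = 0$, where $l$ is the loss; at this penalty all coefficients are shrunk to zero.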
To match the 11 PRS p-value cutoffs, we build a grid of $\lambda$ values by selecting 20 equally spaced points on the log scale between
$1.5\lambda_{\max}$ and $\lambda_{\max}/10^4$, and we use the first 11 non-degenerate models for each population.
At $\alpha = 0.1$, fit the elastic net to get $\beta$ for each $\lambda$.
The 11 $\lambda$ values yield 11 different PTRS.
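The grid construction can be sketched as follows. This sketch makes two assumptions that should be checked against the paper: the grid runs from $1.5\lambda_{\max}$ down to $\lambda_{\max}/10^4$ (both endpoints are parameters here), and a "degenerate" model is one whose coefficients are all zero. The function names and the toy `fit` are hypothetical.

```python
import math

def lambda_grid(lam_max, n_points=20, hi_factor=1.5, lo_factor=1e-4):
    """n_points values equally spaced on the log scale, from largest to smallest."""
    log_hi = math.log(hi_factor * lam_max)
    log_lo = math.log(lo_factor * lam_max)
    step = (log_lo - log_hi) / (n_points - 1)
    return [math.exp(log_hi + k * step) for k in range(n_points)]

def first_non_degenerate(lambdas, fit, n_models=11):
    """Walk the grid and keep the first n_models fits with a nonzero coefficient.

    fit: function lam -> list of coefficients.
    Assumption: a model is degenerate when every coefficient is zero.
    """
    chosen = []
    for lam in lambdas:
        beta = fit(lam)
        if any(b != 0.0 for b in beta):
            chosen.append((lam, beta))
        if len(chosen) == n_models:
            break
    return chosen

# Toy one-coefficient "fit" that becomes all-zero once lam >= 40.
def toy_fit(lam, alpha=0.1):
    return [max(0.0, 4.0 - alpha * lam)]

grid = lambda_grid(lam_max=40.0)
models = first_non_degenerate(grid, toy_fit)
print(len(grid), len(models), models[0][0])
```

Starting slightly above $\lambda_{\max}$ guarantees that the path begins at the all-zero model, so the first non-degenerate fits are the sparsest ones.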
Partial R Squared
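Null model: $y \sim 1 + \text{covariates}$ vs. Full model: $y \sim 1 + \text{covariates} + \hat{y}_i$
$$
\begin{aligned}
R^2 &= 1 - \frac{SSE_{\text{full}}}{SSE_{\text{null}}} \\
&= \frac{C^2(y, \hat{y})}{C(y, y)\, C(\hat{y}, \hat{y})} \\
C(u, v) &= u^t v - u^t H v \\
H &= C (C^t C)^{-1} C^t \\
C &= [1, C_1, \ldots, C_L]
\end{aligned}
$$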
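The partial $R^2$ can be sketched by residualizing both $y$ and $\hat{y}$ on the covariates and then taking the squared correlation of the residuals, which is the same quantity as $C^2(y,\hat{y}) / (C(y,y)\,C(\hat{y},\hat{y}))$. This is a plain-Python sketch for small inputs; the helper names `solve`, `residualize`, and `partial_r2` are hypothetical, and the toy data are made up.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def residualize(v, C):
    """Return v - H v with H = C (C^t C)^{-1} C^t (C given as a list of rows)."""
    n, k = len(C), len(C[0])
    CtC = [[sum(C[i][a] * C[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Ctv = [sum(C[i][a] * v[i] for i in range(n)) for a in range(k)]
    coef = solve(CtC, Ctv)
    return [v[i] - sum(C[i][a] * coef[a] for a in range(k)) for i in range(n)]

def partial_r2(y, y_hat, covariates):
    """C(y, y_hat)^2 / (C(y, y) * C(y_hat, y_hat)), with an intercept column added."""
    C = [[1.0] + list(row) for row in covariates]
    ry, rh = residualize(y, C), residualize(y_hat, C)
    c_yh = sum(a * b for a, b in zip(ry, rh))
    c_yy = sum(a * a for a in ry)
    c_hh = sum(b * b for b in rh)
    return c_yh ** 2 / (c_yy * c_hh)

# Toy data: one covariate, six individuals.
cov = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1.0, 2.5, 2.0, 4.5, 3.0, 6.5]
y_hat = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
print(partial_r2(y, y_hat, cov))
```

The two expressions for $R^2$ agree because $I - H$ is a projection, so fitting the full model only has to explain what the covariates left over.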
How to choose hyperparameters?
Compute the PTRS weights in the discovery set (UKB EUR) and test them in the 5 target sets
Split each target set into two equal-size parts, a validation set and a test set
Select the hyperparameters (the p-value cutoff in clumping and thresholding,
$\lambda$ in elastic net) that maximize $R^2$ in the validation set
After choosing the hyperparameters, calculate $R^2$ in the test set
Repeat this procedure 10 times and take the average $R^2$
as the prediction accuracy
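The tuning loop can be sketched as repeated random splits. This is a simplified sketch: it uses the plain squared correlation as the score instead of the partial $R^2$ with covariates, and the function names, seed, and toy scores are hypothetical.

```python
import random

def squared_corr(y, y_hat):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    syy = sum((a - my) ** 2 for a in y)
    shh = sum((b - mh) ** 2 for b in y_hat)
    syh = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    if syy == 0.0 or shh == 0.0:
        return 0.0
    return syh ** 2 / (syy * shh)

def tune_and_test(y, scores_by_hp, score_fn, n_repeats=10, seed=0):
    """Average test-set score after picking the hyperparameter on the validation half.

    scores_by_hp: {hyperparameter: list of per-individual scores (PRS or PTRS)}.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    test_scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        val, test = idx[:half], idx[half:]

        def val_score(hp):
            return score_fn([y[i] for i in val], [scores_by_hp[hp][i] for i in val])

        best_hp = max(scores_by_hp, key=val_score)
        test_scores.append(
            score_fn([y[i] for i in test], [scores_by_hp[best_hp][i] for i in test])
        )
    return sum(test_scores) / len(test_scores)

# Toy target set: one informative score and one uninformative score.
y = [float(i) for i in range(20)]
scores_by_hp = {"good": list(y), "bad": [float((i * 7) % 5) for i in range(20)]}
print(tune_and_test(y, scores_by_hp, squared_corr))
```

Selecting on one half and scoring on the other keeps the reported accuracy from being inflated by the hyperparameter search.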
Combining PTRS and PRS
Combined score: $\hat{y}_i = c_1 \mathrm{PRS}_i^{\lambda} + c_2 \mathrm{PTRS}_i^{\lambda}$
Split each target set into two equal-size parts, a validation set and a test set.
Split the validation set into two equal-size parts.
For each of the 11 thresholds, find $\arg\min_{c_1, c_2} \sum_i (y_i - \hat{y}_i)^2$ in the first validation part. Each threshold gives different PRS and PTRS values, so $c_1, c_2$ differ too, just as a linear model fitted to different $x_i$ gives a different fitted line.
With the best $c_1, c_2$ for each threshold, compute the combined score in the second validation part.
This gives the combined score at 11 thresholds for each population; select the threshold with the largest $R^2$.
Apply the selected threshold and its $c_1, c_2$ for each population to the test set.
This gives the final $R^2$.
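The combination step can be sketched with a closed-form two-variable least squares fit. This sketch assumes there is no intercept (as in the formula for the combined score), that the $k$-th PRS threshold is paired with the $k$-th PTRS $\lambda$, and that the plain squared correlation stands in for $R^2$. All function names and toy values are hypothetical.

```python
def fit_c1_c2(y, prs, ptrs):
    """Least squares for y ~ c1*prs + c2*ptrs via the 2x2 normal equations."""
    a11 = sum(p * p for p in prs)
    a12 = sum(p * t for p, t in zip(prs, ptrs))
    a22 = sum(t * t for t in ptrs)
    b1 = sum(p * v for p, v in zip(prs, y))
    b2 = sum(t * v for t, v in zip(ptrs, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def squared_corr(y, y_hat):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    syy = sum((a - my) ** 2 for a in y)
    shh = sum((b - mh) ** 2 for b in y_hat)
    syh = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    return 0.0 if syy == 0.0 or shh == 0.0 else syh ** 2 / (syy * shh)

def select_combination(y, prs_by_thr, ptrs_by_thr, fit_idx, eval_idx):
    """Fit c1, c2 per threshold on fit_idx, score on eval_idx, return the best.

    Returns (r2, threshold index, c1, c2).
    """
    best = None
    for k in range(len(prs_by_thr)):
        prs, ptrs = prs_by_thr[k], ptrs_by_thr[k]
        c1, c2 = fit_c1_c2(
            [y[i] for i in fit_idx], [prs[i] for i in fit_idx], [ptrs[i] for i in fit_idx]
        )
        y_hat = [c1 * prs[i] + c2 * ptrs[i] for i in eval_idx]
        r2 = squared_corr([y[i] for i in eval_idx], y_hat)
        if best is None or r2 > best[0]:
            best = (r2, k, c1, c2)
    return best

# Toy validation set with two thresholds; threshold 0 is the informative one.
prs_by_thr = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]]
ptrs_by_thr = [[2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0], [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]]
y = [2.0 * p + 3.0 * t for p, t in zip(prs_by_thr[0], ptrs_by_thr[0])]
best = select_combination(y, prs_by_thr, ptrs_by_thr, fit_idx=[0, 1, 2, 3], eval_idx=[4, 5, 6, 7])
print(best)
```

The selected threshold and its $c_1, c_2$ would then be applied unchanged to the held-out test set.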
Portability of PRS and PTRS
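$$\text{Portability} = \frac{\text{prediction accuracy in target set}}{\text{prediction accuracy in European reference set}} = \frac{R^2 \text{ in target set}}{R^2_{\text{EUR ref}}}$$
where $R^2_{\text{EUR ref}}$ is the $R^2$ of the MESA EUR model.
The MESA EUR model is expected to perform better
than the MESA AFHI model among EUR individuals, so
$$R^2_{\text{MESA AFHI}} < R^2_{\text{EUR ref}} \;\Rightarrow\; \frac{R^2 \text{ in target set}}{R^2_{\text{EUR ref}}} < \frac{R^2 \text{ in target set}}{R^2_{\text{MESA AFHI}}}$$
Therefore, this definition of the portability of PTRS is conservative.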
Result
In Fig. 3(b) of the paper, PTRS is compared with PVE and PRS is compared with heritability.
In the plot, the PTRS $R^2$ can reach the upper bound (heritability), but the PRS $R^2$ cannot.
For a fixed ancestry, PTRS performs worse than PRS, but PTRS + PRS performs better than PRS alone.
The PredictDB weights are used to compute predicted gene expression (the weights come from another paper).
Here we use the weights directly.
In the paper, the traits are continuous, so survival analysis is not involved.
When the trait is discrete, we will use survival analysis.
Finally, one PRS and one PTRS are computed for each ancestry group.