Introduction
In a linear model where the predictors $x$ are highly correlated with each other but only a small subset actually influences the response $y$, many of the regression coefficients $\beta$ will be zero. In such cases, how can we identify which predictors truly affect $y$?
From a linear modeling perspective, we can consider all possible combinations of predictors—that is, whether or not to include each variable $x$ in the model. For $J$ predictors, there are $2^J$ possible models.
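As a tiny illustration of this enumeration, each candidate model can be encoded as a binary inclusion vector over the predictors (the names below are for illustration only):

```python
from itertools import product

# Each model is an inclusion vector over the J predictors:
# gamma[j] == 1 means predictor x_j is included in the model.
J = 3
models = list(product([0, 1], repeat=J))

print(len(models))  # 2^J = 8 candidate models
for gamma in models:
    included = [f"x{j}" for j, g in enumerate(gamma) if g]
    print(gamma, included or ["(null model)"])
```

Already at $J = 30$ this enumeration exceeds a billion models, which is why exhaustive search is only feasible for very small $J$.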
For each variable $x_j$, we can compute the posterior probability, after observing the data, that it appears in the model with a non-zero coefficient ($\beta_j \neq 0$). Summing the posterior probabilities of all models in which $x_j$ is included with $\beta_j \neq 0$ gives the probability that $x_j$ has a linear effect on $y$.
This is how I understand the concept of posterior inclusion probability.
Model
Fine-mapping using individual-level data is typically performed by fitting the multiple linear regression model
$$ Y_{n \times 1} = X_{n \times J} \beta_{J \times 1} + \epsilon_{n \times 1} $$

where
- $X$ is a matrix of highly correlated SNPs
- $\beta$ represents sparse effect sizes (most elements are zero)
- $\epsilon \sim N(0, \sigma^2 I_n)$ is an error term
This model is natural because SNPs are often correlated due to linkage disequilibrium (LD), and only a few SNPs are expected to be causal.
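A minimal simulation of this setup might look as follows. The equicorrelated covariance used to mimic LD, and the choice of which SNPs are causal, are illustrative assumptions, not part of any real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 500, 10

# Correlated "SNP" matrix: equicorrelated columns (rho mimics LD).
rho = 0.8
cov = rho * np.ones((J, J)) + (1 - rho) * np.eye(J)
X = rng.multivariate_normal(np.zeros(J), cov, size=n)

# Sparse effects: only 2 of the 10 predictors are causal.
beta = np.zeros(J)
beta[[2, 7]] = [1.0, -0.8]

# Y = X beta + epsilon, with epsilon ~ N(0, sigma^2 I_n)
sigma = 1.0
Y = X @ beta + rng.normal(0, sigma, size=n)
```

Because the columns of $X$ are strongly correlated, ordinary least squares on such data tends to spread the signal across neighboring SNPs, which is exactly the problem fine-mapping addresses.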
PIP
The posterior inclusion probability (PIP) is defined as
$$ \text{PIP}_j = P(\beta_j \neq 0 \mid X, Y) $$

This is the posterior probability that SNP $j$ has a non-zero effect. It can be computed by summing the posterior probabilities of all models that include $X_j$. A high PIP for SNP $j$ indicates a high probability that it has a causal effect on the phenotype $Y$. PIP is a key metric used in fine-mapping.
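For small $J$, the model-summation definition can be computed directly. The sketch below scores each model with BIC as a stand-in for the log marginal likelihood (an illustrative approximation, not how fine-mapping software actually scores models) and sums the resulting model probabilities over all models containing each SNP:

```python
from itertools import product

import numpy as np

def pip_by_enumeration(X, Y, incl_prior=0.5):
    """Approximate PIPs by enumerating all 2^J models.

    Each model is scored by -BIC/2 (an illustrative stand-in for the
    log marginal likelihood), plus an independent inclusion prior
    P(gamma_j = 1) = incl_prior. Feasible only for small J.
    """
    n, J = X.shape
    gammas = list(product([0, 1], repeat=J))
    log_post = []
    for gamma in gammas:
        idx = [j for j, g in enumerate(gamma) if g]
        if idx:
            coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
            resid = Y - X[:, idx] @ coef
        else:
            resid = Y
        rss = resid @ resid
        k = len(idx)
        bic = n * np.log(rss / n) + k * np.log(n)
        log_prior = k * np.log(incl_prior) + (J - k) * np.log(1 - incl_prior)
        log_post.append(-0.5 * bic + log_prior)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # PIP_j = sum of posterior probabilities of models containing SNP j
    return np.array([post[[g[j] == 1 for g in gammas]].sum() for j in range(J)])
```

Production fine-mapping methods replace the BIC score with a proper marginal likelihood under a sparse effect-size prior, but the sum-over-models structure is the same.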
Credible Set
A credible set is a collection of SNPs selected based on their PIPs. We rank all SNPs by PIP in descending order and add SNPs to the set until the cumulative sum of PIPs reaches a chosen threshold (e.g. 0.95). Under the model, we are then at least 95% confident that a true causal variant lies within the credible set.
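The greedy construction above is short enough to write out directly (the example PIP values are made up for illustration):

```python
import numpy as np

def credible_set(pips, threshold=0.95):
    """Smallest set of SNPs whose PIPs, taken in descending order,
    sum to at least the threshold."""
    order = np.argsort(pips)[::-1]      # rank SNPs by PIP, descending
    csum = np.cumsum(pips[order])
    k = np.searchsorted(csum, threshold) + 1
    return order[:k].tolist()

pips = np.array([0.01, 0.62, 0.05, 0.30, 0.02])
print(credible_set(pips))  # → [1, 3, 2]
```

Here SNPs 1 and 3 alone sum to 0.92, so SNP 2 is also needed to push the cumulative PIP past 0.95.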
MCMC
Since evaluating all $2^J$ possible models is computationally infeasible when $J$ is large, Markov Chain Monte Carlo (MCMC) methods are used to approximate the posterior distribution efficiently.
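A minimal Metropolis-Hastings sketch over inclusion vectors illustrates the idea: propose flipping one inclusion indicator per iteration, accept or reject by the change in model score, and estimate each PIP as the fraction of sampled models that include that SNP. As above, -BIC/2 is used as an illustrative stand-in for the log model posterior:

```python
import numpy as np

def model_score(X, Y, gamma):
    """Log model score: -BIC/2 (illustrative stand-in for the log
    marginal likelihood under a uniform model prior)."""
    n = len(Y)
    idx = np.flatnonzero(gamma)
    if idx.size:
        coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
        resid = Y - X[:, idx] @ coef
    else:
        resid = Y
    rss = resid @ resid
    return -0.5 * (n * np.log(rss / n) + idx.size * np.log(n))

def mcmc_pip(X, Y, n_iter=5000, seed=0):
    """Metropolis-Hastings over inclusion vectors gamma: propose
    flipping one random indicator per iteration; the PIP estimate is
    the fraction of sampled models in which each SNP is included."""
    rng = np.random.default_rng(seed)
    J = X.shape[1]
    gamma = np.zeros(J, dtype=int)          # start from the null model
    score = model_score(X, Y, gamma)
    counts = np.zeros(J)
    for _ in range(n_iter):
        j = rng.integers(J)
        prop = gamma.copy()
        prop[j] ^= 1                        # flip one inclusion indicator
        prop_score = model_score(X, Y, prop)
        # Symmetric proposal, so accept with prob min(1, exp(diff))
        if np.log(rng.random()) < prop_score - score:
            gamma, score = prop, prop_score
        counts += gamma
    return counts / n_iter
```

Real samplers add burn-in, multiple chains, and smarter proposals (e.g. swap moves between correlated SNPs), but the chain above already concentrates on high-posterior models without ever touching all $2^J$ of them.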