Introduction
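In a case-control study, sampling must ensure that cases and controls are distributed consistently across variables such as age, sex, and ethnicity; otherwise false-positive results can arise. For example, suppose every case is 60 years old and has genotype AA at SNP rs123, while no control is 60 years old and none has genotype AA at rs123. The conclusion would be that genotype AA at rs123 carries several times the disease risk of the other genotypes. Is that reasonable? Age and genotype are perfectly confounded here, so the association cannot be attributed to the genotype alone.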
Statistical Test
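The tests below are commonly used and can answer questions about the samples at hand. Does a variable follow the distribution we expect? Are two variables independent of each other? Which tests can be used with small samples? Do continuous and categorical variables call for different tests?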
Chi-square Goodness of Fit Test
This tests whether the sample follows a specified distribution. We suppose
$$ \begin{align*} (O_1, \cdots, O_k) \sim \text{Multinomial}(n; p_1, \cdots , p_k) \end{align*} $$
The $n$ samples are classified into $k$ classes, and each sample falls into class $i$ with probability $p_i$. The hypothesized probabilities are $p_i^*$.
$$ \begin{array}{l|llll|l} \textbf{Class} & 1 & 2 & \dots & k & \textbf{Total} \\ \hline \textbf{Observation} & O_1 & O_2 & \dots & O_k & n \\ \textbf{Probability} & p_1^* & p_2^* & \dots & p_k^* & 1 \\ \textbf{Expectation} & n p_1^* & n p_2^* & \dots & n p_k^* & n \end{array} $$
Statistic
$$ \begin{align*} \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \quad E_i = n p_i^* \end{align*} $$
Testing
$$ \begin{align*} H_0:p_i = p_i^* \text{ for all } i \quad v.s. \quad H_1: \text{not } H_0 \end{align*} $$
If $\chi^2 >\chi^2_{k-1, \alpha} $, reject $H_0$.
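As a rough illustration of the statistic above, here is a minimal Python sketch using only the standard library. The counts and hypothesized probabilities are made up for illustration, `chi_square_gof` is a hypothetical helper name, and the closed-form p-value applies only to the $k=3$ (2 degrees of freedom) case used here.

```python
import math

def chi_square_gof(observed, probs):
    """Return the goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in probs]  # E_i = n * p_i^*
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical counts for k = 3 classes (illustrative, not real data).
observed = [45, 35, 20]
hypothesized = [0.5, 0.3, 0.2]

stat, df = chi_square_gof(observed, hypothesized)

# For df = 2 the chi-square upper tail has the closed form exp(-x / 2).
p_value = math.exp(-stat / 2)

CRITICAL_DF2_ALPHA05 = 5.991  # tabulated chi^2_{2, 0.05}
reject = stat > CRITICAL_DF2_ALPHA05

print(f"chi2 = {stat:.4f}, df = {df}, p = {p_value:.4f}, reject H0: {reject}")
```

With these numbers the statistic is about 1.33, well below the critical value, so $H_0$ is not rejected.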
Chi-square Test
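Two categorical variables $A,B$ are classified into $r$ and $k$ classes respectively. $O_{ij}$ is the observed count in cell $(i,j)$ of the $r \times k$ contingency table, and $E_{ij}$ is the count expected under independence.
Statistic
$$ \begin{align*} \chi^2 = \sum_{i=1}^r \sum_{j=1}^k \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n} \end{align*} $$
Testing
$$ \begin{align*} H_0: A \perp B \quad v.s. \quad H_1: \text{not } H_0 \end{align*} $$
If $\chi^2 >\chi^2_{(k-1)(r-1), \alpha} $, reject $H_0$.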
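The independence statistic can be sketched in a few lines of standard-library Python. The 2x2 table below (sex by case/control status) is hypothetical, `chi_square_independence` is an illustrative helper name, and the closed-form p-value is valid only for 1 degree of freedom.

```python
import math

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for an r x k table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = case / control, columns = two sex categories.
table = [[110, 90],
         [90, 110]]

stat, df = chi_square_independence(table)

# For df = 1, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

CRITICAL_DF1_ALPHA05 = 3.841  # tabulated chi^2_{1, 0.05}
reject = stat > CRITICAL_DF1_ALPHA05

print(f"chi2 = {stat:.3f}, df = {df}, p = {p_value:.4f}, reject H0: {reject}")
```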
Fisher Exact Test
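When the number of observations is small, the chi-square approximation becomes unreliable and an exact test can be used instead.
$$ \begin{array}{l|ll|l} A \setminus B & \textbf{1} & \textbf{2} & \textbf{Total} \\ \hline \textbf{1} & a & b & a+b \\ \textbf{2} & c & d & c+d \\ \hline \textbf{Total} & a+c & b+d & n \end{array} $$
Testing
$$ \begin{align*} H_0: A \perp B \quad v.s. \quad H_1: \text{not } H_0 \end{align*} $$
Under $H_0$ with all margins fixed, the probability of the observed table is hypergeometric:
$$ \begin{align*} P(\text{table}) = \frac{(a+c)!(b+d)!(a+b)!(c+d)!}{a!b!c!d!n!} \end{align*} $$
The p-value is the sum of this probability over all tables with the same margins that are at least as extreme as the observed one (for the two-sided test, tables whose probability does not exceed that of the observed table). If $\text{p-value} < \alpha $, reject $H_0$.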
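A minimal standard-library sketch of the two-sided test, assuming the common convention that "at least as extreme" means "no more probable than the observed table". The function name `fisher_exact_two_sided` and the example counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Return (probability of the observed table, two-sided p-value)."""
    r1, r2 = a + b, c + d   # row totals
    c1 = a + c              # first column total
    n = r1 + r2
    denom = comb(n, c1)

    def weight(x):
        # Unnormalized hypergeometric probability of a table whose top-left cell is x.
        return comb(r1, x) * comb(r2, c1 - x)

    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible range of the top-left cell
    observed = weight(a)
    # Integer comparison keeps the "no more probable than observed" rule exact.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return observed / denom, extreme / denom

# Hypothetical small 2x2 table: a=3, b=1, c=1, d=3.
p_table, p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"P(table) = {p_table:.4f}, two-sided p-value = {p_value:.4f}")
```

For this table the single-table probability is 16/70, while the p-value is 34/70, which shows why the two should not be conflated.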
Yates Continuity Correction
When the number of observations is small, Yates suggested applying a continuity correction to the chi-square test on a $2 \times 2$ table. It is typically used when at least one cell of the contingency table has an expected count smaller than 5.
$$ \begin{array}{l|ll|l} A \setminus B & \textbf{1} & \textbf{2} & \textbf{Total} \\ \hline \textbf{1} & a & b & a+b \\ \textbf{2} & c & d & c+d \\ \hline \textbf{Total} & a+c & b+d & n \end{array} $$
Statistic
$$ \begin{align*} \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(\left| O_{ij} - E_{ij} \right|-0.5)^2}{E_{ij}} \end{align*} $$
Testing
$$ \begin{align*} H_0: A \perp B \quad v.s. \quad H_1: \text{not } H_0 \end{align*} $$
If $\chi^2 >\chi^2_{1, \alpha} $, reject $H_0$ (a $2 \times 2$ table has $(2-1)(2-1)=1$ degree of freedom).
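To see how much the correction can matter, the sketch below computes the statistic with and without it on a hypothetical $2 \times 2$ table in which two expected counts fall below 5. `chi_square_2x2` is an illustrative helper, not a library function.

```python
def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff -= 0.5  # continuity correction
            stat += diff ** 2 / expected
    return stat

# Hypothetical counts; expected counts are 4.5 and 5.5 in each row.
table = [[7, 3],
         [2, 8]]

uncorrected = chi_square_2x2(table, yates=False)
corrected = chi_square_2x2(table, yates=True)

CRITICAL_DF1_ALPHA05 = 3.841  # tabulated chi^2_{1, 0.05}
print(f"uncorrected = {uncorrected:.4f}, reject: {uncorrected > CRITICAL_DF1_ALPHA05}")
print(f"corrected   = {corrected:.4f}, reject: {corrected > CRITICAL_DF1_ALPHA05}")
```

Here the uncorrected statistic exceeds the critical value while the corrected one does not, so the correction makes the test more conservative.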
Mann-Whitney U Test (Wilcoxon Rank-sum Test)
Define two independent samples $A,B$, where $n_1, n_2$ are the sizes of $A,B$, and $R_1, R_2$ are the rank sums of $A,B$ after ranking the pooled observations.
Statistic
$$ \begin{align*} U_1 &= n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 \\ U_2 &= n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 \\ U &= \min(U_1, U_2) \end{align*} $$
Testing
$$ \begin{align*} H_0: \text{distribution of } A, B \text{ are equal} \quad v.s. \quad H_1: \text{not } H_0 \end{align*} $$
If $U \leq U_{\frac{\alpha}{2}, (n_1,n_2)} $, reject $H_0$, where $U_{\frac{\alpha}{2}, (n_1,n_2)}$ is the tabulated critical value.
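The following standard-library sketch computes $U_1$, $U_2$, and $U$ from the formulas above. Instead of looking up a critical value, it assumes a small sample and obtains an exact two-sided p-value by enumerating every way of assigning $n_1$ of the pooled ranks to sample $A$. The measurements and the helper names are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return ranks

def u_statistics(r1, r2, n1, n2):
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)

def mann_whitney_exact(a, b):
    """Return (U1, U2, U, exact two-sided p-value); feasible for small samples only."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    total = sum(ranks)
    r1 = sum(ranks[:n1])
    u1, u2, u = u_statistics(r1, total - r1, n1, n2)
    hits = count = 0
    for subset in combinations(ranks, n1):  # every possible rank set for sample A
        s1 = sum(subset)
        if u_statistics(s1, total - s1, n1, n2)[2] <= u:
            hits += 1
        count += 1
    return u1, u2, u, hits / count

# Hypothetical measurements in which every value of A is below every value of B.
a = [1.2, 2.3, 2.9, 3.1, 3.8]
b = [4.0, 4.4, 5.1, 5.6, 6.2]
u1, u2, u, p_value = mann_whitney_exact(a, b)
print(f"U1 = {u1}, U2 = {u2}, U = {u}, exact p = {p_value:.4f}")
```

Note that $U_1 + U_2 = n_1 n_2$ always holds, which is a quick sanity check on the rank sums.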
T Test
Test of Proportion
Suppose the samples are independent of one another. Define two groups $A,B$, where $n_1, n_2$ are the sizes of $A,B$, $x_1, x_2$ are the numbers of samples with the attribute of interest in $A,B$, and $\hat{p}_1 = \frac{x_1}{n_1}, \hat{p}_2 = \frac{x_2}{n_2}$ are the sample proportions. The pooled proportion is $\hat{p}= \frac{x_1+x_2}{n_1+n_2}$.
Statistic
$$ \begin{align*} z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\left(1-\hat{p}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \end{align*} $$
Testing
$$ \begin{align*} H_0: p_1=p_2 \quad v.s. \quad H_1:p_1\neq p_2 \end{align*} $$
If $\left| z \right|>z_{\frac{\alpha}{2}}$, reject $H_0$.
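A minimal sketch of the pooled two-proportion z-test using Python's standard `statistics.NormalDist`. The counts are hypothetical, and `two_proportion_z_test` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2, alpha=0.05):
    """Return (z, two-sided p-value, reject H0?) for the pooled z-test."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / standard_error
    std_normal = NormalDist()
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    critical = std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return z, p_value, abs(z) > critical

# Hypothetical counts: 110 of 200 in group A vs. 90 of 200 in group B.
z, p_value, reject = two_proportion_z_test(110, 200, 90, 200)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

For a $2 \times 2$ table, $z^2$ equals the uncorrected chi-square statistic of the independence test, so the two tests give the same two-sided decision.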
A Point of Confusion
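The Chi-Square Test is commonly used to check that the sex distribution does not differ significantly between cases and controls. Could I instead use the Goodness of Fit Test, assume the variable follows a multinomial distribution with sex proportions $a, 1-a$, and run the test twice to confirm separately that the sex distribution in cases and in controls shows no significant difference? That was my question at the time. I later realized that this approach has some drawbacks:
- It requires introducing a prior or assumed proportion $a$. How should $a$ be chosen?
- Suppose that among cases the proportion is $55\%$ vs. the assumed $50\%$, which is not significant, and among controls it is $45\%$ vs. $50\%$, which is also not significant. Neither Goodness of Fit Test shows a significant difference. By contrast, the Chi-Square Test of independence may detect the $55\%$ vs. $45\%$ difference.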
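The second drawback can be checked numerically. The sketch below assumes 200 cases and 200 controls (the sample sizes are an assumption chosen for illustration; the text only gives percentages) and takes $a = 0.5$. The helper names are hypothetical.

```python
def gof_statistic(observed, probs):
    """Chi-square goodness-of-fit statistic."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def independence_statistic(table):
    """Chi-square independence statistic for an r x k table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(table))
        for j in range(len(table[0]))
    )

CRITICAL_DF1_ALPHA05 = 3.841  # tabulated chi^2_{1, 0.05}; both tests have df = 1 here

cases = [110, 90]      # 55% vs. 45% (assumed n = 200)
controls = [90, 110]   # 45% vs. 55% (assumed n = 200)

gof_cases = gof_statistic(cases, [0.5, 0.5])
gof_controls = gof_statistic(controls, [0.5, 0.5])
independence = independence_statistic([cases, controls])

print(f"GOF, cases:    {gof_cases:.2f}, significant: {gof_cases > CRITICAL_DF1_ALPHA05}")
print(f"GOF, controls: {gof_controls:.2f}, significant: {gof_controls > CRITICAL_DF1_ALPHA05}")
print(f"Independence:  {independence:.2f}, significant: {independence > CRITICAL_DF1_ALPHA05}")
```

Under these assumed counts each goodness-of-fit statistic is 2.0 (not significant), while the independence statistic is 4.0 (significant at $\alpha = 0.05$).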
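When to use it?
Case-control studies often use logistic regression to compare the disease risk of samples under different conditions. We need to confirm that cases and controls are distributed consistently across variables such as age, sex, and ethnicity, by checking whether the null hypothesis $H_0$ is rejected. When the distributions are inconsistent, consider adding the variable (sex, age, ancestry PCs, etc.) to the logistic regression model, so that its coefficient is estimated and the effect it causes can be measured.
$$ \begin{align*} Y &= \ln{\frac{p}{1-p}} \\ &= \beta_0+\beta_1 G +\beta_2 \text{Sex} + \beta_3 \text{PC}_1 + \beta_4 \text{PC}_2 \end{align*} $$
Adding the first few ancestry principal components from PCA helps guard against problems such as population stratification.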
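As a short worked step on how a coefficient in this model is read: holding Sex, $\text{PC}_1$, and $\text{PC}_2$ fixed, raising $G$ by one unit adds $\beta_1$ to the log-odds, so the odds are multiplied by $e^{\beta_1}$. The coding of $G$ (for example, as an allele count) is an assumption here, since the text does not specify it.
$$ \begin{align*} \frac{\text{odds}(G = g+1)}{\text{odds}(G = g)} &= \frac{\exp\left(\beta_0+\beta_1 (g+1) +\beta_2 \text{Sex} + \beta_3 \text{PC}_1 + \beta_4 \text{PC}_2\right)}{\exp\left(\beta_0+\beta_1 g +\beta_2 \text{Sex} + \beta_3 \text{PC}_1 + \beta_4 \text{PC}_2\right)} \\ &= e^{\beta_1} \end{align*} $$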