RNA Sequencing

Introduction

RNA sequencing (RNA-Seq) is a powerful technology that enables researchers to quantify RNA levels, identify novel transcripts, and analyze alternative splicing events, offering deeper insight into cellular function and disease mechanisms. Key processing steps, such as read alignment, normalization with units like RPKM and TPM, and splicing analysis, support accurate interpretation of RNA-Seq data. Despite its advantages, RNA-Seq faces challenges, including biases in library preparation, sequencing errors, and the complexity of data analysis.

Workflow

Alternative Splicing

Alternative splicing is a cellular process in which exons from the same gene are joined in different combinations, resulting in distinct but related mRNA transcripts (isoforms).

Read Sequencing Technology

Three common technologies are used to sequence transcripts as reads:

Illumina
Nanopore
PacBio

High-throughput sequencing refers to massively parallel sequencing, which generates millions to billions of reads in a single experiment.

Read Alignment

There are two types of read alignment: aligning reads to a reference transcriptome and aligning reads to a reference genome.

Aligning reads to a reference transcriptome:

Performs well if the reference transcriptome is sufficiently complete
Faster
Used for computing gene expression

Aligning reads to a reference genome:

Enables the identification of new (novel) isoforms
Handles intron-exon structure, providing a more complete but slower analysis

When introns are large, the latter method (aligning to the genome) requires significantly more time to process intron-exon structures. Transcriptome-based alignment is faster because it does not handle intron-exon structures. The cost of that speed is that it struggles to detect rare disease-related transcripts and loses accuracy when the reference transcriptome is incomplete.
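The difference between the two strategies can be illustrated with a toy sketch. The sequences, the `find_spliced_alignment` helper, and the `min_anchor` parameter are all hypothetical, and the brute-force search only stands in for what real spliced aligners do with indexed references. The point is that a read spanning an exon-exon junction matches the transcript contiguously, but has to be split around the intron to match the genome.

```python
# Toy sequences (hypothetical): one gene with two exons and one intron.
exon1 = "ATGGCCAAGT"
intron = "GTAAGTCCCTTTTCAG"
exon2 = "CTGAAGGTCA"

genome = exon1 + intron + exon2   # genomic DNA still contains the intron
transcript = exon1 + exon2        # mature mRNA: the intron is spliced out

# A read that spans the exon1/exon2 junction.
read = exon1[-6:] + exon2[:6]


def find_spliced_alignment(read, genome, min_anchor=4):
    """Brute-force search for a split of `read` into two parts that both
    occur in `genome`, in order, with a gap (the intron) between them.

    Returns (prefix_start, suffix_start, split) or None.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        p = genome.find(prefix)
        if p == -1:
            continue
        # The suffix must start after the end of the prefix match.
        s = genome.find(suffix, p + split)
        if s > p + split:
            return p, s, split
    return None


print(read in transcript)   # contiguous match against the transcriptome
print(read in genome)       # no contiguous match against the genome
print(find_spliced_alignment(read, genome))
```

The extra search over split points is a small-scale picture of why genome alignment is slower, and also why it can reveal junctions that are missing from the reference transcriptome.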

RPKM

RPKM (Reads Per Kilobase per Million mapped reads) is a unit of gene expression.

$$ \begin{align*} \text{RPKM}_g = \frac{r_g \times 10^9}{{fl}_g \times R} \end{align*} $$

where

$$ \begin{align*} & r_g =\text{ number of reads mapped to gene } g \\ & {fl}_g = \text{ length of gene } g \text{ in bp} \\ & R = \text{ total number of mapped reads across all genes} \end{align*} $$

Example

Genes are not limited to A, B, and C in this example. Let's focus on gene A in sample 1, which has $12$ reads and a mapped gene length of $600$ bp. The total number of reads across all genes is $6 \times 10^6$.

$$ \begin{align*} \text{RPKM}_A & = \frac{12 \times 10^9}{600 \times 6 \times 10^6} \\ & \approx 3.33 \end{align*} $$
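The same calculation can be checked with a minimal Python sketch. The function name `rpkm` and its parameter names are chosen here for illustration only.

```python
def rpkm(read_count, gene_length_bp, total_reads):
    """RPKM = r_g * 10^9 / (fl_g * R), with the gene length given in bp."""
    return read_count * 1e9 / (gene_length_bp * total_reads)


# Gene A in sample 1: 12 reads, 600 bp, 6 * 10^6 reads in total.
rpkm_a = rpkm(12, 600, 6_000_000)
print(round(rpkm_a, 2))  # 3.33
```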

TPM

TPM (Transcripts Per Million) is a unit of gene expression.

$$ \begin{align*} \text{TPM}_g=\frac{r_g \times rl \times 10^6}{{fl}_g \times T}, \end{align*} $$

where

$$ \begin{align*} & r_g =\text{ number of reads mapped to gene } g \\ & rl = \text{ read length} \\ & {fl}_g = \text{ length of gene } g \\ & T=\displaystyle \sum_{g' \in G} \frac{r_{g'} \times rl}{{fl}_{g'}} \end{align*} $$

Read length depends on the sequencing technology rather than the transcript itself, so we use $rl$ instead of $rl_g$.
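Because $rl$ is the same for every gene, it cancels between the numerator and $T$. Written out as a short derivation from the definition above (the only new symbol is the summation index $g'$):

$$ \begin{align*} \text{TPM}_g & = \frac{r_g \times rl \times 10^6}{{fl}_g \times \sum_{g' \in G} \frac{r_{g'} \times rl}{{fl}_{g'}}} \\ & = \frac{r_g / {fl}_g}{\sum_{g' \in G} r_{g'} / {fl}_{g'}} \times 10^6 \end{align*} $$

So TPM is each gene's length-normalized read count expressed as a share of the total, scaled to one million, and the TPM values of all genes in a sample sum to $10^6$.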

Example

We normalize by gene length because longer genes yield more reads at the same expression level. The read count divided by gene length in kilobases is defined as RPK (Reads Per Kilobase).

Suppose there are only genes A, B, and C, and the total RPK values in the two samples are $650$ and $700$, respectively.
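The RPK-then-scale procedure can be sketched in Python. The read counts and gene lengths below are made-up illustrative values, not the numbers from the example above, and the function name `tpm` is an assumption of this sketch.

```python
def tpm(read_counts, gene_lengths_bp):
    """Compute TPM per gene from read counts and gene lengths (in bp).

    Step 1: RPK = reads / (gene length in kilobases).
    Step 2: divide each RPK by the sample's total RPK and scale to 10^6.
    """
    rpk = {g: read_counts[g] / (gene_lengths_bp[g] / 1000) for g in read_counts}
    total_rpk = sum(rpk.values())
    return {g: rpk[g] / total_rpk * 1e6 for g in rpk}


# Hypothetical sample containing only genes A, B, and C.
counts = {"A": 10, "B": 20, "C": 30}
lengths = {"A": 1000, "B": 2000, "C": 1000}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value))
```

In this made-up sample, B has twice the reads of A but is also twice as long, so both end up with the same TPM. This is the effect of normalizing by length before normalizing by the sample total.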

Additional

Sequencing Depth

Sequencing depth refers to the average number of times a nucleotide in the reference is covered by a read. A higher sequencing depth generates more informative data but comes at a higher cost.

$$ \begin{align*} \text{Sequencing Depth} = \frac{\text{read length}\times \text{number of reads}}{\text{reference sequence length}} \end{align*} $$

Example

Suppose there are $10^8$ reads of length $150$ bp and the reference sequence is $3 \times 10^9$ bp long.

$$ \begin{align*} \text{Sequencing Depth} & = \frac{10^8 \times 150}{3 \times 10^9} \\ & = 5 \text{X} \end{align*} $$
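The same arithmetic as a minimal Python sketch. The function and parameter names are chosen here for illustration.

```python
def sequencing_depth(read_length_bp, num_reads, reference_length_bp):
    """Average number of times each reference base is covered by a read."""
    return read_length_bp * num_reads / reference_length_bp


depth = sequencing_depth(150, 10**8, 3 * 10**9)
print(f"{depth:g}X")  # 5X
```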
