RPKM and FPKM explained

RNA-Seq provides a quantitative approximation of transcript abundance in the form of read counts. However, these counts must be normalized to remove technical biases inherent in RNA-Seq, in particular the length of each RNA species and the sequencing depth of each sample. Deeper sequencing yields higher counts, which biases comparisons between runs sequenced to different depths. Similarly, longer transcripts are more likely to have reads mapped to them, which yields higher counts and biases comparisons between transcripts of different lengths.

Both RPKM and FPKM provide the means for such normalization.

RPKM = reads per kilobase per million mapped reads
= [# of reads mapped to transcript]/[length of transcript in kilobases]/[millions of mapped reads]
= [# of reads mapped to transcript]/([length of transcript in bases]/1000)/([total mapped reads]/10^6)
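
As a rough illustration of the RPKM formula above, here is a minimal Python sketch. The function name `rpkm` and all of the numbers are made up for this example; they are not taken from Mortazavi et al. The two calls describe the same hypothetical transcript sequenced at two depths, so the raw counts differ by a factor of two while the normalized values agree.

```python
def rpkm(mapped_reads, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return mapped_reads / length_kb / millions_mapped


# Hypothetical numbers: one 2,000 bp transcript in two runs of different depth.
shallow = rpkm(mapped_reads=500, transcript_length_bp=2_000,
               total_mapped_reads=25_000_000)
deep = rpkm(mapped_reads=1_000, transcript_length_bp=2_000,
            total_mapped_reads=50_000_000)

print(shallow, deep)  # 10.0 10.0
```

Note that with 50 million mapped reads the last division is by 50, not by 10^6.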

FPKM = fragments per kilobase per million mapped fragments
= [# of fragments mapped to transcript]/[length of transcript in kilobases]/[millions of mapped fragments]
= [# of fragments mapped to transcript]/([length of transcript in bases]/1000)/([total mapped fragments]/10^6)

FPKM is analogous to RPKM but counts fragments rather than reads. A fragment observed in an RNA-Seq experiment may be represented by more than one read, as in paired-end experiments, where the two reads of a pair come from the same fragment.
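
To make the read-versus-fragment distinction concrete, here is a small Python sketch. It assumes that both mates of a pair carry the same read name (as with the QNAME field in SAM/BAM files), so the number of unique names is the number of fragments. The helper names `count_fragments` and `fpkm`, the read names, and the numbers are all hypothetical.

```python
def count_fragments(read_names):
    """Count fragments, assuming both mates of a pair share one read name."""
    return len(set(read_names))


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    length_kb = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / length_kb / millions_mapped


# Hypothetical reads mapped to one 1,500 bp transcript. frag1 and frag2 have
# both mates mapped; frag3 has only one mate mapped.
reads_on_transcript = ["frag1", "frag1", "frag2", "frag2", "frag3"]

n_reads = len(reads_on_transcript)                  # 5 reads
n_fragments = count_fragments(reads_on_transcript)  # 3 fragments

value = fpkm(n_fragments, transcript_length_bp=1_500,
             total_mapped_fragments=2_000_000)
print(n_reads, n_fragments, value)  # 5 3 1.0
```

Counting the five reads instead of the three fragments would overstate the abundance, which is why paired-end data are usually summarized per fragment.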

1. Ying Zhang |

Don’t you need to divide it by the total number of reads generated in each sample? For example, if the depth of sequencing yields 50 million reads per sample, you should divide it by 50 instead of 10^6.

• Jean |

Yes, you are right! The total reads variable was missing from the original equation. The page has been updated to correct the error. Thanks for catching that!

Please refer to Mortazavi et al., 2008 (RPKM) and Trapnell et al., 2010 (FPKM) to confirm.

• Jean |

Depends. I don’t think there’s a hard rule, though in general FPKMs are more appropriate for paired-end RNA-seq experiments; the two reads of a pair come from the same fragment and should not be counted twice. But depending on your problem of interest, you may want raw counts, you may not want to normalize, you may want to remove duplicates, or you may want other metrics that deviate from the standard FPKM or RPKM.

2. Shelley B |

Very useful! Thank you for a clear and concise definition. I’m also new to the field.

3. lorraine |

This is the most concise and clear explanation for a novice to genomics like me. Also, high five for not letting HFZ rip off your words. (His site was unfortunately what showed up first in my search results.)

4. sonal mundhra |

I am new to this, so my question may seem a little stupid. If I am using next-gen sequencing for differential gene expression studies in treated vs. control samples, what should I use, RPKM or FPKM?

• Jean |

Whether RPKM or FPKM should be used depends on whether your sequencing was done with single-end or paired-end reads, not on the type of analysis you are doing. That said, if you are doing differential expression analysis, many pipelines such as DESeq require raw counts (neither RPKM nor FPKM).

5. Van |

May I ask how you get the transcript length information needed to calculate RPKM? In my case, I am trying to get RPKM values for ChIP-seq data. Thank you.

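library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # assumption: human genes; swap in your organism's dataset
gos <- getBM(attributes = c("ensembl_gene_id", "start_position", "end_position"), mart = mart)
transcript_length <- gos$end_position - gos$start_position + 1  # inclusive genomic span of each gene (introns included)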