Design Considerations

The success of a short read sequencing experiment depends on proper design.  Substantial resources (money, time, and samples) are committed to each run, and it is extremely inefficient, at the very least, to discover after the fact that the design was inadequate for the questions asked.


For most short read sequencing studies, a fundamental question is the depth of sequencing: how many reads are necessary to answer the question.  For determination of genetic variation (i.e., SNP analysis), one can assume a relatively even distribution of reads across the genome.  However, one must give some thought to how many times a variant needs to be read before it can be declared real rather than a sequencing anomaly.  For differential gene expression, the primary question is how rare the transcript of interest is, since the abundance of the least abundant transcript determines how many reads are necessary.  There is some evidence that increasing the number of reads does not linearly increase the depth of detection, and that the number of transcripts detected instead levels off somewhere between 4 and 6 million reads (Ramskold et al, 2009, PLoS Comput Biol 5 e1000598, doi:10.1371/journal.pcbi.1000598).  If this is correct, then reads beyond that point act as repeated measurements of the same transcripts and can still serve to make the data more reliable.
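The depth question above can be turned into rough arithmetic.  The following is a minimal sketch, not a validated planning tool.  It assumes that reads are sampled uniformly at random, so that the number of reads covering a site, or coming from a given transcript, is approximately Poisson distributed.  Real libraries show more variation than this, so the results are best read as optimistic lower bounds.  The function names and the example numbers (genome size, read length, transcript fraction, thresholds) are illustrative assumptions and are not taken from any particular experiment.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is read (coverage = N * L / G)."""
    return num_reads * read_length / genome_size


def prob_at_least(k, mean):
    """P(X >= k) for X ~ Poisson(mean).

    Assumption: read counts follow a Poisson distribution, which is only
    a rough model of real sequencing data.
    """
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def reads_for_transcript(fraction, min_reads, confidence=0.95):
    """Smallest total read count N such that a transcript making up
    `fraction` of the library yields at least `min_reads` reads with
    probability >= `confidence` (Poisson model, mean = N * fraction)."""
    lo, hi = 0, 1
    # Grow the upper bound until it satisfies the requirement.
    while prob_at_least(min_reads, hi * fraction) < confidence:
        hi *= 2
    # Binary search between a failing bound (lo) and a passing bound (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prob_at_least(min_reads, mid * fraction) >= confidence:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical SNP study: 100 million 100-base reads, 3 Gb genome.
    cov = mean_coverage(100_000_000, 100, 3_000_000_000)
    print("mean coverage:", round(cov, 2))
    print("P(site read >= 3 times):", round(prob_at_least(3, cov), 3))

    # Hypothetical rare transcript: one read in a million, seen >= 10 times.
    print("reads needed:", reads_for_transcript(1e-6, 10))
```

The same functions can be rerun with the genome size, read length, and detection threshold appropriate to a planned experiment.  The choice of how many supporting reads constitute a "real" variant or a detected transcript remains a judgment for the investigator and is simply a parameter here.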


Microarray studies have emphasized the need for biological replicates.  For gene expression studies on the HiSeq 2000, the same issue needs to be addressed: is the variation between two samples random, or does it reflect a systematic difference?  This question can only be answered with biological replicates.  The case for technical replicates is more controversial.  Technical variation is likely to arise more from sample preparation bias than from the sequencing itself.  This means that running more lanes may not address it.  Instead, technical variation may need to be addressed at the level of library construction, where the most notable source of variability is the fragmentation of the sample.


Before starting an experiment, one also needs to verify that a reference genome is available for assembly of the reads.  Although tools are available for de novo assembly of short read data, most experiments will proceed using a well-annotated reference genome as a scaffold for the assembly of the data.  This is particularly true for gene expression studies using RNA-seq.  Reference genomes are available for many model organisms, but their quality varies, and the first step in any project should be to identify and obtain the reference data.