
Cleaning second-generation sequencing data

As the title implies, I would like to talk about the pre-processing of sequencing reads. Originally, this was intended to be a question posted on the bioinformatics forum biostars.org, but I discovered, once I finished writing it, that the content exceeded the length limit that is imposed. That is how this text became my first blog post. Nonetheless, I still want all your answers, comments, suggestions and protests, just as much as if I had posted it as a question on the forum. To be honest, the main purpose of this post is to start a discussion on the subject. So please don’t hesitate to go to the comment section afterwards.

Preparing and cleaning data from second-generation sequencing is an important step that heavily influences downstream analysis. This is certainly the case for metagenomics-type studies that don’t necessarily proceed by first performing an assembly. Yet it appears to be a vaguely described step with no clear established procedures. In other words, I think this cleaning business hasn’t been sorted out. There is a multitude of solutions out there, all advertising themselves as reliable, but often poorly documented and badly coded. It seems to be more a question of habit and favorite command-line options than of a real methodology. I would like to ask everyone working on such problems: how do you solve them?

So that the answers you might come up with are the best possible, I would like to provide some context and describe the problem fully. To do this, I will first present my case and summarize the purpose of cleaning sequencing data. Finally, I will list, and rant about, the software I have tried so far.

My case is the following: I got a few SFF files back from the sequencing facility for a study on some bacterial communities in lakes. They are the result of a “shotgun metagenomics” experiment where all DNA in an environmental sample is randomly sheared and sequenced. The machine used is a 454 GS FLX Titanium. Some runs are standard, some runs are paired-end with a linker sequence that I know. Having computed a few statistics on the data with the nice program called fastqc, it is evident that the data needs heavy cleaning before further analysis, as probably do all sequencing runs of this kind. One would not dare perform gene calling on the reads as they stand. An example output of this program is shown below:

Example fastqc graph
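
As a side note: fastqc does not, to my knowledge, read SFF files directly, so a conversion step is needed before this first inspection is even possible. A minimal sketch with Biopython (the same library my script at the end of this post relies on); the file names are of course made up:

# Convert an SFF file to FASTQ so that fastqc can inspect it.
# The "sff-trim" variant applies the clipping coordinates stored in
# the file; plain "sff" would keep the raw, untrimmed bases instead.
from Bio import SeqIO
count = SeqIO.convert("reads.sff", "sff-trim", "reads.fastq", "fastq")
print("Converted %i reads" % count)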

As far as I can see, the goals of cleaning can be summarized in eleven points:

  1. Duplicates (removal of artificial ones, conservation of natural ones)
  2. Homopolymers (removal or correction)
  3. Chimeras (removal or correction)
  4. Contamination (e.g. removal of sequences from Homo sapiens)
  5. Undesirables (e.g. removal of viral sequences)
  6. Low complexity regions (removal or masking; a small sketch of this one follows the list)
  7. Low quality regions (removal or masking)
  8. Tags, adaptors (removal)
  9. Barcodes (removal or demultiplexing)
  10. Splitting (only for paired-end reads)
  11. Other (e.g. extra trimming of bases at beginning and end of reads)
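
To make point 6 a bit more concrete, here is a minimal sketch of what low-complexity masking can amount to: a sliding-window Shannon entropy filter, similar in spirit to what DUST-like tools do. The window size and the 1.0-bit threshold are arbitrary values I picked for illustration:

import math
from collections import Counter

def window_entropy(seq):
    # Shannon entropy (in bits) of the base composition of *seq* #
    total = float(len(seq))
    return -sum(n/total * math.log(n/total, 2) for n in Counter(seq).values())

def mask_low_complexity(seq, windowsize=12, threshold=1.0):
    # Replace every window whose entropy falls below *threshold* with Ns #
    masked = list(seq)
    for i in range(len(seq) - windowsize + 1):
        if window_entropy(seq[i:i + windowsize]) < threshold:
            masked[i:i + windowsize] = 'N' * windowsize
    return ''.join(masked)

# The poly-A stretch (and a base or two at its edges) gets masked:
print(mask_low_complexity("ACGTACGTACGTAAAAAAAAAAAAAAAAAAAAACGTACGTACGT"))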

I have done a short survey of the current tools and can only report an unsatisfactory situation. Let’s look at what the community of biologist-programmers has produced. The tools are listed in no particular order. This review is an attempt at venting my frustration, and tries to capture the reactions of a new user coming to the field of DNA sequence analysis, starting with the pre-treatment of the data.

FASTX-Toolkit

Looking at their webpage, the project hasn’t been updated since 2010 and marks itself as version 0.0.13. It is written in a mix of C, Perl and shell, and is composed of individual scripts one has to glue together. It does not handle Duplicates, Contamination, Chimeras, Splitting or Homopolymers. Briefly looking at the source, I didn’t see any unit tests present. When specifying a quality score, the documentation does not tell us what standard is expected (Illumina 1.0, Illumina 1.3 or Illumina 1.8+).

It is licensed under AGPL3. There is no paper associated.
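
To illustrate why that last omission matters: the same ASCII character decodes to wildly different PHRED scores depending on the offset convention. Sanger and Illumina 1.8+ use an offset of 33, while Illumina 1.3 to 1.7 used 64 (the old Solexa/Illumina 1.0 scale additionally used a different score formula, which I leave out here):

# One quality character, two very different meanings.
char = 'B'
print("Sanger / Illumina 1.8+ (offset 33): Q%i" % (ord(char) - 33))  # Q33: good
print("Illumina 1.3 - 1.7     (offset 64): Q%i" % (ord(char) - 64))  # Q2: junk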

PRINSEQ

The main interface to this one seems to be via a web browser. That works great when you have only a few samples, but manually uploading each one, or having to interact with a GUI every time one wants to try different parameters, is too cumbersome. They do have a standalone downloadable version of the code, but it’s a “Lite” version with fewer options. Looking at the code, there was no trace of unit tests. On the front page they have a very discouraging notice set on a red background, reading approximately: “Until version 0.19.5 sequences were not being trimmed correctly when using some options please rerun your data if you used prior versions”.

It is written in Perl and GPL3 licensed. Link to the paper.

COMAV’s clean_reads

The longer and more detailed documentation, when compared to the others, is welcome. When testing it for the first time, I specified -e 10,0 as an option, telling the program to clip ten bases from the start of every read. However, the resulting FASTQ file had reads with unchanged lengths. I reported this to the developer and he replied that it was indeed a bug in the program. He also admitted in his post that there were no unit tests and probably never will be, because the project was dropped in favor of writing yet another new cleaning pipeline.

It is written in Python and GPL3 licensed. Link to the paper.
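
Incidentally, the operation that failed is trivial to express with Biopython, which makes the bug all the more surprising. A sketch of what clipping ten leading bases amounts to (file names made up):

from Bio import SeqIO

# Drop the first ten bases (and their qualities) of every read.
reads = SeqIO.parse("reads.fastq", "fastq")
clipped = (read[10:] for read in reads)  # slicing a record keeps the PHRED scores in sync
SeqIO.write(clipped, "reads.clipped.fastq", "fastq")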

Pyrocleaner

The installation of this one is pretty hard, as one dependency is a piece of proprietary, commercial software from Roche. Another dependency is obtained only by filling in a lengthy form, e-mailing someone at Washington University and waiting for a reply. A peculiar installation procedure, but if you are ready to do all that, you will find one of the rare packages that takes care of artificial duplicates at the same time as quality control. Unfortunately, my testing showed that it is extremely slow. On one dataset it ran for over 30 hours and produced a result that fastqc still quantified as having >30% duplication levels. On the paired-end dataset it did not complete and produced the following error:

Traceback (most recent call last):
  File "~/bin/pyrocleaner", line 783, in <module>
    [seqs, in_shot_gun, del_by_length_pe, del_by_ns_pe, del_by_complexity_pe, del_by_quality_pe, del_by_pairends] = filter_pairends(seqs, options)
  File "~/bin/pyrocleaner", line 446, in filter_pairends
    sub_seq1.id = id + ".r"
AttributeError: 'list' object has no attribute 'id'

The code that can be downloaded from the Toulouse INRA website contains no unit tests. This is a pity, as they advertised in the paper that they “used technical validation run produced in collaboration with Roche”.

It is written in Python and GPL3 licensed. Link to the paper.

QIIME

Indubitably, the hardest package to install in this list. On the website, the suggestion is made to download a virtualbox image to run the software. Distributing a virtualbox image is usually a glaring symptom of a program that has become so complicated to install in a *nix environment that one has to ship a full operating system, set up just right, to make it work. The drawbacks of running computationally heavy operations inside a virtualized machine are, in any case, too great to consider this approach. I was unable to install it on a machine where I did not have super user access, so I reverted to installing it on a slower personal machine that I controlled. Even that failed on my first attempt.

If one finally gets access to a working version of QIIME, the script split_libraries.py takes SFF files and provides some types of quality control. It handles homopolymers and demultiplexing, and can use a sliding quality window (why a read is discarded, rather than trimmed, when the window average falls below the given threshold while the read is still longer than the minimum sequence length is beyond me). However, even if one doesn’t want to demultiplex, the script still demands a particular “mapping file”. My tests have shown that it fails if the reads don’t have barcodes and the -b 0 option is given. QIIME has many other irritating idiosyncrasies, such as not functioning if one uses underscores in the sample names in the mapping file. QIIME needs pairs of FASTA and QUAL files, and cannot use FASTQ files. A separate script, identify_chimeric_seqs.py, can identify chimeric sequences. Overall, it was a highly user-unfriendly experience. An extra downside for me was the fact that the support forums are hosted on Google Groups: to get a Google account you have to provide a valid phone number on which they will call you or text a validation code.

QIIME does have an extensive suite of unit tests. This did not, however, prevent them from releasing code that output erroneous p-values without anyone noticing for quite some time. Just the other day, another script was discovered to be computing an incorrect t-statistic. This raises once again the question of the trust we place in code that is not reviewed or tested. They also have a worrisome quantity of issues in their github tracker.

QIIME is written in Python and appears to be GPL licensed. Link to the paper.
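
For the record, here is roughly how one ends up driving split_libraries.py from a script (via the sh module, as in my own pipeline below). The flag spellings come from my notes and may well differ between QIIME versions, so treat this as a sketch rather than a recipe:

import sh

# Run split_libraries.py on non-barcoded reads. A mapping file must be
# supplied even so, and -b 0 declares zero-length barcodes: the very
# combination that failed in my tests.
split_libraries = sh.Command("split_libraries.py")
split_libraries(f="reads.fasta", q="reads.qual", m="mapping.txt",
                b=0, o="split_output")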

Mothur

Easier to install than QIIME, but its equal in ignoring user-friendliness. In this project, they actually created their own replica of a shell as an interface to the software. Happily, there is nonetheless a “batch mode” for Mothur that enables you to run it non-interactively, though with a non-standard syntax (a sketch follows below). Other nuisances include that it doesn’t support the FASTQ format and only takes combinations of FASTA and QUAL files. As usual, nothing is said about the expected standard in the QUAL file. One cannot specify an output directory for the operation: the trimmed read file is always created in the input file’s directory, regardless of the current directory. Mothur does not use the standard output alone; it will, at every command, duplicate its standard output into endless logfiles. The documentation for cleaning reads describes some of the parameters for the trim.seqs command but is a bit vague in several places. Does the qtrim option also apply to qwindowaverage? What does the removelast parameter really do? Not the slightest trace of tests is to be found in the provided code.

Mothur is written in C++ with some Fortran and is GPL3 licensed. Link to the paper.
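
The batch mode sketch promised above: commands are passed on Mothur’s command line, prefixed with a hash sign. I drive it with the sh module for consistency with my script below; the trim.seqs parameters shown are the ones from the documentation page I just complained about, and which combination of them is correct is precisely what remains unclear:

import sh

# Mothur batch mode: the whole command is one shell argument.
sh.mothur("#trim.seqs(fasta=reads.fasta, qfile=reads.qual, "
          "qwindowaverage=25, qtrim=T)")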

CD-HIT-454

This program only handles the duplicate read analysis while ignoring all other aspects of the cleaning problem. It seems to be, with Pyrocleaner, the only solution to provide this. The parameters all have default values but are poorly described. One has no idea how to fine-tune the parameters for one’s own use or how changing them affects the output. Plus, when using it, should one run it before or after trimming? Trimming could make two artificial duplicates more similar to one another, helping CD-HIT detect them, but it could just as well have the opposite effect of differentiating two duplicate reads and preventing their detection. Also, should it intervene before or after homopolymer removal? What about paired-end datasets: are the reads artificially duplicated before or after ligation? I have found no reviews, articles or documentation that address these questions. Publications that use this tool just seem to say “Oh yeah yeah, we CD-HITted them in our pipeline, don’t worry” without further detail. No sign of unit tests with the code. In fact, on the CD-HIT website one can read the following message: “Bugs: There are a number of outstanding bugs in the current implementation. We are always looking for volunteers […]”.

Written in C++ with some Perl, and licensed under GPL2. Link to the paper.

Qtrim

Consists of a single, badly written, 1800-line-long Python file. It is not free software, as commercial users may need to pay a license fee to the South African National Bioinformatics Institute. No information about the expected PHRED format. No homopolymer handling, no duplicate handling, etc. No unit tests. Draws some nice plots. No paper associated.

NGS QC toolkit

A collection of 14 Perl scripts to be used individually. Supports FASTQ variants, offers homopolymer trimming and can even draw some graphs. However, no tests. I didn’t find any license information except this: “All rights about the toolkit are reserved to National Institute of Plant Genome Research, Aruna Asaf Ali Marg, New Delhi”. Link to the paper.

Others

To make this post complete, here are a few other tools which I didn’t take as much time to test. The story, however, is often similar to the ones you have just read.

  1. Trimmomatic (Specific for Illumina reads)
  2. biopieces (Piece together single scripts)
  3. SeqtrimNEXT (Calls CD-HIT)

Conclusion

Ideally, one would have a solution that is installable with the single command easy_install --user supercleaner and simply takes an SFF file as input while outputting a FASTQ file. By default, these would be the standard in and the standard out, and all other parameters would be optional. It would take care of all eleven aspects of cleaning mentioned above in a thoroughly tested manner. Such software will probably never be available, but the situation could certainly be improved. I have the impression that, in the current situation, preparing sequencing data for analysis is an obscure task, guided more by prayers and trust in unknown pieces of code than by scientific evidence and facts. What do you think?

-Edit 1

One of the criticisms I received from a reader of this post is that ”[…] scientists should concentrate on science, not design”. While I agree with this statement in a broad scope, I don’t think it is a good argument for justifying the sloppiness with which some scientific software is written. Firstly, the difficulty of installation and execution maybe saves a little time for the developer of the software, but how much time does it suck up from the countless users? Each of these PhD students, postdocs and professors now has to stop doing science for an hour or two and start battling with IT problems. Better interfaces mean more time doing science and focusing on problems that matter. Secondly, the lack of code review, validation and unit testing is certainly unscientific. The writing of tests for a program can be seen as part of the software design process, but isn’t it just as importantly a part of the scientific process? Who checks that the method, or simply the implementation of the algorithm described, is accurate? I haven’t played the publishing game much yet, but I know that we cannot count on the peer-review process to do this. From what I have seen, reviewers don’t always try to install and execute the software that is provided with a submission, let alone criticize it. This has happened even in the case of methodology submissions.

-Edit 2

So, I continued playing around a bit with these tools after writing this post, but my thesis must advance. The time I planned to dedicate to setting up this part of my analysis has run out and I need to move on to the next steps. For anyone interested, this is what I ended up putting in my pipeline. It’s just one more custom-made script in the jungle of bioinformatics code and nothing special. It probably resembles what other people have been doing before me! I will return to this part of the problem later if it proves necessary for the quality of the results produced further down.

#!/usr/bin/env python

"""
A custom-made script to perform quality control on a
set of nucleotide sequences coming from a sequencing experiment.
Takes an SFF file on the standard in, and outputs a FASTQ file.

You can use this script from the shell like this:
$ supercleaner < reads.sff > reads.clean.fastq

It needs these python modules installed:
- biopython
- numpy
- sh

And these binaries in your $PATH:
- cd-hit-454

Written by Lucas Sinclair.
Kopimi.
"""

# Modules #
import sys, argparse, tempfile, shutil
import numpy, sh
from Bio import SeqIO

###############################################################################
def test():
    # A manual sanity check for the trim() function below #
    # Constants #
    seq = "TGACGGAGTGTAACTCGATG"
    scores = [31]*10 + [10]*10
    assert len(seq) == len(scores)
    # Make object #
    from Bio.SeqRecord import SeqRecord
    from Bio.Seq import Seq
    read = SeqRecord(Seq(seq), id="test", name="test", description="test")
    read.letter_annotations["phred_quality"] = scores
    # Trim it #
    trimmed = next(trim([read], 10, 21, 5))
    assert str(trimmed.seq) == "TGACGGAGTGTA"

###############################################################################
def trim(reads, minlength, threshold, windowsize):
    for read in reads:
        # Length #
        if len(read) < minlength: continue
        # PHRED score #
        scores = read.letter_annotations["phred_quality"]
        averaged = moving_average(scores, windowsize)
        for i, value in enumerate(averaged):
            if value < threshold:
                # Truncate so that no window of size *windowsize* within
                # the kept part averages below *threshold* #
                length = i + windowsize - 1
                if length >= minlength:
                    yield read[:length]
                break
        # All ok #
        else:
            yield read

###############################################################################
def moving_average(interval, windowsize):
    # Flat-window convolution; 'valid' mode returns only averages computed
    # from complete windows, i.e. len(interval)-windowsize+1 values #
    window = numpy.ones(int(windowsize)) / float(windowsize)
    return numpy.convolve(interval, window, 'valid')

###############################################################################
minlength_help = """Sequences below this length are discarded. Sequences of
this length or longer proceed on to further quality control. Defaults to
150 base pairs."""
threshold_help = """Sequences falling below this quality will be trimmed.
Defaults to 21."""
windowsize_help = """The PHRED scores of each sequence will be averaged over
a window of this size. Sequences are truncated when the average quality inside
the window falls below the threshold such that no region of length *windowsize*
can have an average quality below that of *threshold*. The sequence is trimmed
regardless of the fact that downstream windows might again rise above the
average quality score threshold. Defaults to 5 base pairs."""

###############################################################################
if __name__ == '__main__':
    # Parse the shell arguments #
    parser = argparse.ArgumentParser()
    parser.description = sys.modules[__name__].__doc__
    parser.formatter_class = argparse.RawDescriptionHelpFormatter
    parser.add_argument("--minlength", help=minlength_help, default=150, type=int)
    parser.add_argument("--threshold", help=threshold_help, default=21, type=int)
    parser.add_argument("--windowsize", help=windowsize_help, default=5, type=int)
    args = parser.parse_args()
    # Temporary files #
    tmp_dir = tempfile.mkdtemp()
    trim_file = tmp_dir + '/trimmed.fastq'
    clean_file = tmp_dir + '/clean.fastq'
    log_file = tmp_dir + '/cdhit.log'
    # Parse sequences #
    reads = SeqIO.parse(sys.stdin, "sff-trim")
    trimmed = trim(reads, args.minlength, args.threshold, args.windowsize)
    # Do it #
    try: SeqIO.write(trimmed, trim_file, 'fastq')
    except IOError: parser.error('Could not read the standard in.')
    except AssertionError: parser.error('Problem with the standard in.')
    # Cluster #
    cdhit = sh.Command('cd-hit-454')
    parameters = ['-g', 1, '-c', 0.96, '-T', 1, '-M', 3000, '-aS', 0.8]
    cdhit(*parameters, i=trim_file, o=clean_file, _out=log_file)
    # Print on standard out #
    sh.cat(clean_file, _out=sys.stdout)
    # Cleanup #
    shutil.rmtree(tmp_dir)

Go to the reddit comment section

Go to the bio-support answers section