crest4.classify

Written by Lucas Sinclair. GNUv3 Licensed. Contact at www.sinclair.bio

Created in May 2021. Last updated in September 2023.

View Source

  1#!/usr/bin/env python3
  2# -*- coding: utf-8 -*-
  3
  4"""
  5Written by Lucas Sinclair.
  6GNUv3 Licensed.
  7Contact at www.sinclair.bio
  8
  9Created in May 2021.
 10Last updated in September 2023.
 11"""
 12
 13# Built-in modules #
 14import os, multiprocessing
 15
 16# Internal modules #
 17import crest4
 18import crest4.databases
 19from crest4.query import Query
 20from crest4.databases import CrestDatabase
 21
 22# First party modules #
 23from plumbing.cache      import property_cached
 24from autopaths.file_path import FilePath
 25from autopaths.dir_path  import DirectoryPath
 26
 27# Constants #
 28all_db_choices = ('midori253darn', 'silvamod138pr2', 'silvamod128', 'mitofish',
 29                  'ssuome')
 30
 31###############################################################################
 32class Classify:
 33    """
 34    This is the main object offered by the `crest4` package.
 35    It enables you to automatically assign taxonomic names to DNA sequences
 36    obtained from environmental sequencing.
 37
 38    After creating a new instance of a `Classify` object, you can simply call
 39    it to get your input data processed and the assignments output file
 40    generated. Examples are included in the `README.md` file of this package
 41    which can directly be seen on the GitHub page at:
 42
 43    https://github.com/xapple/crest4/
 44    """
 45
 46    def __init__(self,
 47                 fasta,
 48                 search_algo = 'blast',
 49                 num_threads = 1,
 50                 search_db   = 'ssuome',
 51                 output_dir  = None,
 52                 search_hits = None,
 53                 min_score   = None,
 54                 score_drop  = 2.0,
 55                 min_smlrty  = True,
 56                 otu_table   = None,
 57                 ):
 58        """
 59        Args:
 60
 61            fasta: The path to a single FASTA file as a string.
 62                   These are the sequences that will be taxonomically
 63                   classified.
 64
 65            search_algo: The algorithm used for the sequence similarity search
 66                         that will be run to match the sequences against the
 67                         database chosen. Either `blast` or `vsearch`. No
 68                         other values are currently supported. By default,
 69                         `blast`.
 70
 71            num_threads: The number of processors to use for the sequence
 72                         similarity search. By default, parallelism is turned
 73                         off and this value is 1. If you pass the value `True`
 74                         we will run as many processes as there are CPUs but
 75                         no more than 32.
 76
 77            search_db: The database used for the sequence similarity search.
 78                       Either `midori253darn`, `silvamod138pr2`, 'mitofish',
 79                       `silvamod128` or `ssuome`.
 80                       By default, `ssuome`. Optionally, the user can
 81                       provide a custom database by specifying the full path
 82                       to a directory containing all required files under
 83                       `search_db`. See the README for more information.
 84
 85            output_dir: The directory into which all the classification
 86                        results will be written to. This defaults to a
 87                        directory with the same name as the original FASTA
 88                        file and a `.crest4` suffix appended.
 89
 90            search_hits: The path where the search results will be stored.
 91                         This defaults to the output directory. However,
 92                         if the search operation has already been completed
 93                         beforehand, specify the path here to skip the
 94                         sequence similarity search step and go directly to
 95                         the taxonomy step. If a hits file exists in the output
 96                         directory and this option is not specified, it is
 97                         deleted and regenerated.
 98
 99            min_score: The minimum bit-score for a search hit to be considered
100                       when using BLAST as the search algorithm. All hits below
101                       this score are ignored. When using VSEARCH, this value
102                       instead indicates the minimum identity between two
103                       sequences for the hit to be considered.
104                       The default is `155` for BLAST and `0.75` for VSEARCH.
105
106            score_drop: Determines the range of hits to retain and the range
107                        to discard based on a drop in percentage from the score
108                        of the best hit. Any hit below the following value:
109                        "(100 - score_drop)/100 * best_hit_score" is ignored.
110                        By default, `2.0`.
111
112            min_smlrty: Determines if the minimum similarity filter is turned
113                        on or off. Pass any value like `False` to turn it off.
114                        The minimum similarity filter prevents classification
115                        to higher ranks when a minimum rank-identity is not
116                        met. The default is `True`.
117
118            otu_table: Optionally, one can specify the path to an existing OTU
119                       table in CSV or TSV format when running `crest4`.
120                       The sequence names in the OTU table must be rows and
121                       have to match the names in the FASTA file. The column,
122                       on the other hand, provide your samples names.
123                       When this option is used, then two extra output files
124                       are generated. Firstly, a table summarizing the
125                       assignment counts per taxa. Secondly, a table
126                       propagating the sequence counts upwards
127                       in a cumulative fashion.
128                       """
129        # Save attributes #
130        self.fasta       = fasta
131        self.search_algo = search_algo
132        self.num_threads = num_threads
133        self.search_db   = search_db
134        self.output_dir  = output_dir
135        self.search_hits = search_hits
136        self.min_score   = min_score
137        self.score_drop  = score_drop
138        self.min_smlrty  = min_smlrty
139        self.otu_table   = otu_table
140        # Assign default values and change others #
141        self.transform()
142        # Validate attributes #
143        self.validate()
144
145    def transform(self):
146        """
147        This method will replace empty attributes with defaults when this is
148        needed and will convert others to proper types.
149        """
150        # The fasta should be a FASTA object #
151        if self.fasta is not None:
152            from fasta import FASTA
153            self.fasta = FASTA(self.fasta)
154        # Default for the number of threads #
155        if not isinstance(self.num_threads, int):
156            if self.num_threads is True:
157                self.num_threads = min(multiprocessing.cpu_count(), 32)
158            elif self.num_threads is False:
159                self.num_threads = 1
160            elif self.num_threads.lower() == 'true':
161                self.num_threads = min(multiprocessing.cpu_count(), 32)
162        # The database is always a string #
163        self.search_db = str(self.search_db)
164        # Default for the output directory #
165        if self.output_dir is None:
166            self.output_dir = self.fasta + '.crest4/'
167        self.output_dir = DirectoryPath(self.output_dir)
168        # The search hits is a file somewhere if passed #
169        if self.search_hits is not None:
170            self.search_hits = FilePath(self.search_hits)
171        # Default for the search hits file if not passed #
172        if self.search_hits is None:
173            self.search_hits = FilePath(self.output_dir + 'search.hits')
174            self.search_hits.remove()
175        # Default for the minimum score #
176        if self.min_score is None:
177            if self.search_algo == 'blast':
178                self.min_score = 155
179            if self.search_algo == 'vsearch':
180                self.min_score = 0.75
181        # The score drop has to be a float not a string #
182        try:
183            self.score_drop = float(self.score_drop)
184        except ValueError:
185            msg = "The score drop value must be numerical (not '%s')."
186            raise ValueError(msg % self.score_drop)
187        # Turn off the minimum similarity filter if the user passed any value #
188        if self.min_smlrty is not True: self.min_smlrty = False
189        # The OTU table is a file somewhere if passed #
190        if self.otu_table is not None:
191            self.otu_table = FilePath(self.otu_table)
192            self.otu_table.must_exist()
193
194    def validate(self):
195        """
196        This method will raise an Exception if any of the arguments passed by
197        the user are illegal.
198        """
199        # The fasta should exist if passed #
200        if self.fasta is not None:
201            self.fasta.must_exist()
202        # Either the FASTA file or the hits file has to contain something #
203        if not self.fasta and not self.search_hits:
204            msg = "Neither the FASTA file at '%s' nor the search hits file at" \
205                  " '%s' contain any data. Cannot proceed."
206            raise Exception(msg % (self.fasta, self.search_hits))
207        # Check the search algorithm #
208        if self.search_algo not in ('blast', 'vsearch'):
209            msg = "The search algorithm '%s' is not supported."
210            raise ValueError(msg % self.search_algo)
211        # The search database is a known entry or exists on the filesystem #
212        if self.search_db not in all_db_choices:
213            if not os.path.exists(self.search_db):
214                msg = "The search database '%s' is not supported."
215                raise ValueError(msg % self.search_db)
216        # Check the minimum score value is above zero #
217        if self.min_score < 0.0:
218            msg = "The minimum score cannot be smaller than zero ('%s')."
219            raise ValueError(msg % self.min_score)
220        # Check the minimum score value is below one #
221        if self.min_score > 1.0:
222            if self.search_algo == 'vsearch':
223                msg = "The minimum score cannot be more than 1.0 when" \
224                      " using VSEARCH ('%s') because it represents the" \
225                      " the minimum identity between two sequences."
226                raise ValueError(msg % self.min_score)
227        # Check the score drop value #
228        if self.score_drop < 0.0:
229            msg = "The score drop value cannot be smaller than zero ('%s')."
230            raise ValueError(msg % self.min_score)
231        if self.score_drop > 100.0:
232            msg = "The score drop value cannot be over 100 ('%s')."
233            raise ValueError(msg % self.min_score)
234
235    def __repr__(self):
236        """A simple representation of this object to avoid memory addresses."""
237        return "<%s object on '%s'>" % (self.__class__.__name__, self.fasta)
238
239    @property_cached
240    def database(self):
241        """
242        Retrieve the database object that the user has selected.
243        This can be either one of the standard databases, or a custom
244        specified path.
245        """
246        if self.search_db not in all_db_choices:
247            # Take the absolute path #
248            user_path = os.path.abspath(self.search_db)
249            # Make the object #
250            return CrestDatabase(custom_path=user_path)
251        else:
252            return getattr(crest4.databases, self.search_db)
253
254    #------------------------------ Searching --------------------------------#
255    @property_cached
256    def seqsearch(self):
257        """
258        An object representing the sequence similarity search.
259        Makes use of the `seqsearch` module. For reference:
260
261        * Setting `-outfmt` to 5 means XML output.
262        * Setting `-outfmt` to 6 means tabular output.
263        * Setting `-outfmt` to 7 means tabular output with comments.
264        """
265        # Initialize #
266        params = {}
267        # If the user chose BLAST then we have to specify tabular output #
268        if self.search_algo == 'blast':
269            params = {'-outfmt': '7 qseqid sseqid bitscore length nident'}
270        # In case the user chose VSEARCH we specify the minimum identify
271        # and the minimum sequence match length
272        if self.search_algo == 'vsearch':
273            params = {'--id':      self.min_score,
274                      '--mincols': 25}
275        # Build and return the object #
276        from seqsearch.search import SeqSearch
277        return SeqSearch(input_fasta = self.fasta,
278                         database    = self.database,
279                         seq_type    = 'nucl',
280                         algorithm   = self.search_algo,
281                         filtering   = {'max_targets': 100},
282                         num_threads = self.num_threads,
283                         out_path    = self.search_hits,
284                         params      = params)
285
286    def search(self):
287        """A method to launch the sequence similarity search."""
288        # Launch the search algorithm #
289        return self.seqsearch.run()
290
291    #----------------------------- Assigning ---------------------------------#
292    @property_cached
293    def score_frac(self):
294        """
295        Using the parameter `self.score_drop` which is a percentage (e.g. 2)
296        indicating a drop, we compute the minimum remaining amount of score
297        allowed, as a fraction (e.g. 0.98).
298        """
299        return 1 - (self.score_drop / 100)
300
301    @property_cached
302    def queries(self):
303        """
304        Parses the output of the sequence search program used which returns
305        a list of `QueryResult` objects from the `SearchIO.parse` module.
306        The list contains one Query object per sequence that was originally
307        inputted. Use these objects to access the taxonomic assignments.
308        """
309        # Check if the search has been done already #
310        if not self.search_hits: self.search()
311        # Iterate on the sequence search results #
312        result = [Query(self, query) for query in self.seqsearch.results]
313        # VSEARCH entirely forgets about sequences that had no hits.
314        # Instead of still listing them in the output like BLAST.
315        # So we have to add them back to the list in this awkward manner
316        if self.search_algo == 'vsearch':
317            reported_names = set(query.name for query in result)
318            for seq in self.fasta:
319                if seq.id not in reported_names:
320                    q = type('FakeQuery', (), {'hits': [], 'id': seq.id})
321                    result.append(Query(self, q))
322        # Return #
323        return result
324
325    @property_cached
326    def queries_by_id(self):
327        """
328        References the same Query objects as the `queries` property above,
329        except that this time they are in a dictionary with the query ids
330        (i.e. the original fasta ids) as keys instead of in a list.
331        """
332        return {query.name: query for query in self.queries}
333
334    #------------------------------- Outputs ---------------------------------#
335    @property_cached
336    def out_file(self):
337        """
338        The path to the file that will contain the taxonomic assignments
339        for every sequence.
340        """
341        # Make sure that output directory exists #
342        self.output_dir.create_if_not_exists()
343        # Return #
344        return self.output_dir + "assignments.txt"
345
346    @property_cached
347    def otu_info(self):
348        """An object giving access to the OTU table information and methods."""
349        from crest4.otu_tables import InfoFromTableOTUs
350        return InfoFromTableOTUs(self, self.otu_table)
351
352    def __call__(self):
353        """Generate outputs."""
354        # Intro message #
355        print('Running crest4 version ' + crest4.__version__)
356        # Iterate #
357        self.out_file.writelines(query.tax_string for query in self.queries)
358        # Special case where an OTU table was passed #
359        if self.otu_table:
360            path_by_rank    = self.output_dir + 'otus_by_rank.tsv'
361            path_cumulative = self.output_dir + 'otus_cumulative.tsv'
362            otus_by_rank    = self.otu_info.otus_by_rank
363            otus_cumulative = self.otu_info.otus_cumulative
364            otus_by_rank.to_csv(path_by_rank, index=False, sep='\t')
365            otus_cumulative.to_csv(path_cumulative, index=False, sep='\t')
366        # Print a success message #
367        msg = "Classification ran successfully. Results are placed in '%s'."
368        print(msg % self.out_file)
369        # Return #
370        return self.out_file

all_db_choices = ('midori253darn', 'silvamod138pr2', 'silvamod128', 'mitofish', 'ssuome')

class Classify: View Source

 33class Classify:
 34    """
 35    This is the main object offered by the `crest4` package.
 36    It enables you to automatically assign taxonomic names to DNA sequences
 37    obtained from environmental sequencing.
 38
 39    After creating a new instance of a `Classify` object, you can simply call
 40    it to get your input data processed and the assignments output file
 41    generated. Examples are included in the `README.md` file of this package
 42    which can directly be seen on the GitHub page at:
 43
 44    https://github.com/xapple/crest4/
 45    """
 46
 47    def __init__(self,
 48                 fasta,
 49                 search_algo = 'blast',
 50                 num_threads = 1,
 51                 search_db   = 'ssuome',
 52                 output_dir  = None,
 53                 search_hits = None,
 54                 min_score   = None,
 55                 score_drop  = 2.0,
 56                 min_smlrty  = True,
 57                 otu_table   = None,
 58                 ):
 59        """
 60        Args:
 61
 62            fasta: The path to a single FASTA file as a string.
 63                   These are the sequences that will be taxonomically
 64                   classified.
 65
 66            search_algo: The algorithm used for the sequence similarity search
 67                         that will be run to match the sequences against the
 68                         database chosen. Either `blast` or `vsearch`. No
 69                         other values are currently supported. By default,
 70                         `blast`.
 71
 72            num_threads: The number of processors to use for the sequence
 73                         similarity search. By default, parallelism is turned
 74                         off and this value is 1. If you pass the value `True`
 75                         we will run as many processes as there are CPUs but
 76                         no more than 32.
 77
 78            search_db: The database used for the sequence similarity search.
 79                       Either `midori253darn`, `silvamod138pr2`, 'mitofish',
 80                       `silvamod128` or `ssuome`.
 81                       By default, `ssuome`. Optionally, the user can
 82                       provide a custom database by specifying the full path
 83                       to a directory containing all required files under
 84                       `search_db`. See the README for more information.
 85
 86            output_dir: The directory into which all the classification
 87                        results will be written to. This defaults to a
 88                        directory with the same name as the original FASTA
 89                        file and a `.crest4` suffix appended.
 90
 91            search_hits: The path where the search results will be stored.
 92                         This defaults to the output directory. However,
 93                         if the search operation has already been completed
 94                         beforehand, specify the path here to skip the
 95                         sequence similarity search step and go directly to
 96                         the taxonomy step. If a hits file exists in the output
 97                         directory and this option is not specified, it is
 98                         deleted and regenerated.
 99
100            min_score: The minimum bit-score for a search hit to be considered
101                       when using BLAST as the search algorithm. All hits below
102                       this score are ignored. When using VSEARCH, this value
103                       instead indicates the minimum identity between two
104                       sequences for the hit to be considered.
105                       The default is `155` for BLAST and `0.75` for VSEARCH.
106
107            score_drop: Determines the range of hits to retain and the range
108                        to discard based on a drop in percentage from the score
109                        of the best hit. Any hit below the following value:
110                        "(100 - score_drop)/100 * best_hit_score" is ignored.
111                        By default, `2.0`.
112
113            min_smlrty: Determines if the minimum similarity filter is turned
114                        on or off. Pass any value like `False` to turn it off.
115                        The minimum similarity filter prevents classification
116                        to higher ranks when a minimum rank-identity is not
117                        met. The default is `True`.
118
119            otu_table: Optionally, one can specify the path to an existing OTU
120                       table in CSV or TSV format when running `crest4`.
121                       The sequence names in the OTU table must be rows and
122                       have to match the names in the FASTA file. The column,
123                       on the other hand, provide your samples names.
124                       When this option is used, then two extra output files
125                       are generated. Firstly, a table summarizing the
126                       assignment counts per taxa. Secondly, a table
127                       propagating the sequence counts upwards
128                       in a cumulative fashion.
129                       """
130        # Save attributes #
131        self.fasta       = fasta
132        self.search_algo = search_algo
133        self.num_threads = num_threads
134        self.search_db   = search_db
135        self.output_dir  = output_dir
136        self.search_hits = search_hits
137        self.min_score   = min_score
138        self.score_drop  = score_drop
139        self.min_smlrty  = min_smlrty
140        self.otu_table   = otu_table
141        # Assign default values and change others #
142        self.transform()
143        # Validate attributes #
144        self.validate()
145
146    def transform(self):
147        """
148        This method will replace empty attributes with defaults when this is
149        needed and will convert others to proper types.
150        """
151        # The fasta should be a FASTA object #
152        if self.fasta is not None:
153            from fasta import FASTA
154            self.fasta = FASTA(self.fasta)
155        # Default for the number of threads #
156        if not isinstance(self.num_threads, int):
157            if self.num_threads is True:
158                self.num_threads = min(multiprocessing.cpu_count(), 32)
159            elif self.num_threads is False:
160                self.num_threads = 1
161            elif self.num_threads.lower() == 'true':
162                self.num_threads = min(multiprocessing.cpu_count(), 32)
163        # The database is always a string #
164        self.search_db = str(self.search_db)
165        # Default for the output directory #
166        if self.output_dir is None:
167            self.output_dir = self.fasta + '.crest4/'
168        self.output_dir = DirectoryPath(self.output_dir)
169        # The search hits is a file somewhere if passed #
170        if self.search_hits is not None:
171            self.search_hits = FilePath(self.search_hits)
172        # Default for the search hits file if not passed #
173        if self.search_hits is None:
174            self.search_hits = FilePath(self.output_dir + 'search.hits')
175            self.search_hits.remove()
176        # Default for the minimum score #
177        if self.min_score is None:
178            if self.search_algo == 'blast':
179                self.min_score = 155
180            if self.search_algo == 'vsearch':
181                self.min_score = 0.75
182        # The score drop has to be a float not a string #
183        try:
184            self.score_drop = float(self.score_drop)
185        except ValueError:
186            msg = "The score drop value must be numerical (not '%s')."
187            raise ValueError(msg % self.score_drop)
188        # Turn off the minimum similarity filter if the user passed any value #
189        if self.min_smlrty is not True: self.min_smlrty = False
190        # The OTU table is a file somewhere if passed #
191        if self.otu_table is not None:
192            self.otu_table = FilePath(self.otu_table)
193            self.otu_table.must_exist()
194
195    def validate(self):
196        """
197        This method will raise an Exception if any of the arguments passed by
198        the user are illegal.
199        """
200        # The fasta should exist if passed #
201        if self.fasta is not None:
202            self.fasta.must_exist()
203        # Either the FASTA file or the hits file has to contain something #
204        if not self.fasta and not self.search_hits:
205            msg = "Neither the FASTA file at '%s' nor the search hits file at" \
206                  " '%s' contain any data. Cannot proceed."
207            raise Exception(msg % (self.fasta, self.search_hits))
208        # Check the search algorithm #
209        if self.search_algo not in ('blast', 'vsearch'):
210            msg = "The search algorithm '%s' is not supported."
211            raise ValueError(msg % self.search_algo)
212        # The search database is a known entry or exists on the filesystem #
213        if self.search_db not in all_db_choices:
214            if not os.path.exists(self.search_db):
215                msg = "The search database '%s' is not supported."
216                raise ValueError(msg % self.search_db)
217        # Check the minimum score value is above zero #
218        if self.min_score < 0.0:
219            msg = "The minimum score cannot be smaller than zero ('%s')."
220            raise ValueError(msg % self.min_score)
221        # Check the minimum score value is below one #
222        if self.min_score > 1.0:
223            if self.search_algo == 'vsearch':
224                msg = "The minimum score cannot be more than 1.0 when" \
225                      " using VSEARCH ('%s') because it represents the" \
226                      " the minimum identity between two sequences."
227                raise ValueError(msg % self.min_score)
228        # Check the score drop value #
229        if self.score_drop < 0.0:
230            msg = "The score drop value cannot be smaller than zero ('%s')."
231            raise ValueError(msg % self.min_score)
232        if self.score_drop > 100.0:
233            msg = "The score drop value cannot be over 100 ('%s')."
234            raise ValueError(msg % self.min_score)
235
236    def __repr__(self):
237        """A simple representation of this object to avoid memory addresses."""
238        return "<%s object on '%s'>" % (self.__class__.__name__, self.fasta)
239
240    @property_cached
241    def database(self):
242        """
243        Retrieve the database object that the user has selected.
244        This can be either one of the standard databases, or a custom
245        specified path.
246        """
247        if self.search_db not in all_db_choices:
248            # Take the absolute path #
249            user_path = os.path.abspath(self.search_db)
250            # Make the object #
251            return CrestDatabase(custom_path=user_path)
252        else:
253            return getattr(crest4.databases, self.search_db)
254
255    #------------------------------ Searching --------------------------------#
256    @property_cached
257    def seqsearch(self):
258        """
259        An object representing the sequence similarity search.
260        Makes use of the `seqsearch` module. For reference:
261
262        * Setting `-outfmt` to 5 means XML output.
263        * Setting `-outfmt` to 6 means tabular output.
264        * Setting `-outfmt` to 7 means tabular output with comments.
265        """
266        # Initialize #
267        params = {}
268        # If the user chose BLAST then we have to specify tabular output #
269        if self.search_algo == 'blast':
270            params = {'-outfmt': '7 qseqid sseqid bitscore length nident'}
271        # In case the user chose VSEARCH we specify the minimum identify
272        # and the minimum sequence match length
273        if self.search_algo == 'vsearch':
274            params = {'--id':      self.min_score,
275                      '--mincols': 25}
276        # Build and return the object #
277        from seqsearch.search import SeqSearch
278        return SeqSearch(input_fasta = self.fasta,
279                         database    = self.database,
280                         seq_type    = 'nucl',
281                         algorithm   = self.search_algo,
282                         filtering   = {'max_targets': 100},
283                         num_threads = self.num_threads,
284                         out_path    = self.search_hits,
285                         params      = params)
286
287    def search(self):
288        """A method to launch the sequence similarity search."""
289        # Launch the search algorithm #
290        return self.seqsearch.run()
291
292    #----------------------------- Assigning ---------------------------------#
293    @property_cached
294    def score_frac(self):
295        """
296        Using the parameter `self.score_drop` which is a percentage (e.g. 2)
297        indicating a drop, we compute the minimum remaining amount of score
298        allowed, as a fraction (e.g. 0.98).
299        """
300        return 1 - (self.score_drop / 100)
301
302    @property_cached
303    def queries(self):
304        """
305        Parses the output of the sequence search program used which returns
306        a list of `QueryResult` objects from the `SearchIO.parse` module.
307        The list contains one Query object per sequence that was originally
308        inputted. Use these objects to access the taxonomic assignments.
309        """
310        # Check if the search has been done already #
311        if not self.search_hits: self.search()
312        # Iterate on the sequence search results #
313        result = [Query(self, query) for query in self.seqsearch.results]
314        # VSEARCH entirely forgets about sequences that had no hits.
315        # Instead of still listing them in the output like BLAST.
316        # So we have to add them back to the list in this awkward manner
317        if self.search_algo == 'vsearch':
318            reported_names = set(query.name for query in result)
319            for seq in self.fasta:
320                if seq.id not in reported_names:
321                    q = type('FakeQuery', (), {'hits': [], 'id': seq.id})
322                    result.append(Query(self, q))
323        # Return #
324        return result
325
326    @property_cached
327    def queries_by_id(self):
328        """
329        References the same Query objects as the `queries` property above,
330        except that this time they are in a dictionary with the query ids
331        (i.e. the original fasta ids) as keys instead of in a list.
332        """
333        return {query.name: query for query in self.queries}
334
335    #------------------------------- Outputs ---------------------------------#
336    @property_cached
337    def out_file(self):
338        """
339        The path to the file that will contain the taxonomic assignments
340        for every sequence.
341        """
342        # Make sure that output directory exists #
343        self.output_dir.create_if_not_exists()
344        # Return #
345        return self.output_dir + "assignments.txt"
346
347    @property_cached
348    def otu_info(self):
349        """An object giving access to the OTU table information and methods."""
350        from crest4.otu_tables import InfoFromTableOTUs
351        return InfoFromTableOTUs(self, self.otu_table)
352
353    def __call__(self):
354        """Generate outputs."""
355        # Intro message #
356        print('Running crest4 version ' + crest4.__version__)
357        # Iterate #
358        self.out_file.writelines(query.tax_string for query in self.queries)
359        # Special case where an OTU table was passed #
360        if self.otu_table:
361            path_by_rank    = self.output_dir + 'otus_by_rank.tsv'
362            path_cumulative = self.output_dir + 'otus_cumulative.tsv'
363            otus_by_rank    = self.otu_info.otus_by_rank
364            otus_cumulative = self.otu_info.otus_cumulative
365            otus_by_rank.to_csv(path_by_rank, index=False, sep='\t')
366            otus_cumulative.to_csv(path_cumulative, index=False, sep='\t')
367        # Print a success message #
368        msg = "Classification ran successfully. Results are placed in '%s'."
369        print(msg % self.out_file)
370        # Return #
371        return self.out_file

This is the main object offered by the crest4 package. It enables you to automatically assign taxonomic names to DNA sequences obtained from environmental sequencing.

After creating a new instance of a Classify object, you can simply call it to get your input data processed and the assignments output file generated. Examples are included in the README.md file of this package which can directly be seen on the GitHub page at:

https://github.com/xapple/crest4/

Classify( fasta, search_algo='blast', num_threads=1, search_db='ssuome', output_dir=None, search_hits=None, min_score=None, score_drop=2.0, min_smlrty=True, otu_table=None) View Source

 47    def __init__(self,
 48                 fasta,
 49                 search_algo = 'blast',
 50                 num_threads = 1,
 51                 search_db   = 'ssuome',
 52                 output_dir  = None,
 53                 search_hits = None,
 54                 min_score   = None,
 55                 score_drop  = 2.0,
 56                 min_smlrty  = True,
 57                 otu_table   = None,
 58                 ):
 59        """
 60        Args:
 61
 62            fasta: The path to a single FASTA file as a string.
 63                   These are the sequences that will be taxonomically
 64                   classified.
 65
 66            search_algo: The algorithm used for the sequence similarity search
 67                         that will be run to match the sequences against the
 68                         database chosen. Either `blast` or `vsearch`. No
 69                         other values are currently supported. By default,
 70                         `blast`.
 71
 72            num_threads: The number of processors to use for the sequence
 73                         similarity search. By default, parallelism is turned
 74                         off and this value is 1. If you pass the value `True`
 75                         we will run as many processes as there are CPUs but
 76                         no more than 32.
 77
 78            search_db: The database used for the sequence similarity search.
 79                       Either `midori253darn`, `silvamod138pr2`, 'mitofish',
 80                       `silvamod128` or `ssuome`.
 81                       By default, `ssuome`. Optionally, the user can
 82                       provide a custom database by specifying the full path
 83                       to a directory containing all required files under
 84                       `search_db`. See the README for more information.
 85
 86            output_dir: The directory into which all the classification
 87                        results will be written to. This defaults to a
 88                        directory with the same name as the original FASTA
 89                        file and a `.crest4` suffix appended.
 90
 91            search_hits: The path where the search results will be stored.
 92                         This defaults to the output directory. However,
 93                         if the search operation has already been completed
 94                         beforehand, specify the path here to skip the
 95                         sequence similarity search step and go directly to
 96                         the taxonomy step. If a hits file exists in the output
 97                         directory and this option is not specified, it is
 98                         deleted and regenerated.
 99
100            min_score: The minimum bit-score for a search hit to be considered
101                       when using BLAST as the search algorithm. All hits below
102                       this score are ignored. When using VSEARCH, this value
103                       instead indicates the minimum identity between two
104                       sequences for the hit to be considered.
105                       The default is `155` for BLAST and `0.75` for VSEARCH.
106
107            score_drop: Determines the range of hits to retain and the range
108                        to discard based on a drop in percentage from the score
109                        of the best hit. Any hit below the following value:
110                        "(100 - score_drop)/100 * best_hit_score" is ignored.
111                        By default, `2.0`.
112
113            min_smlrty: Determines if the minimum similarity filter is turned
114                        on or off. Pass any value like `False` to turn it off.
115                        The minimum similarity filter prevents classification
116                        to higher ranks when a minimum rank-identity is not
117                        met. The default is `True`.
118
119            otu_table: Optionally, one can specify the path to an existing OTU
120                       table in CSV or TSV format when running `crest4`.
121                       The sequence names in the OTU table must be rows and
122                       have to match the names in the FASTA file. The column,
123                       on the other hand, provide your samples names.
124                       When this option is used, then two extra output files
125                       are generated. Firstly, a table summarizing the
126                       assignment counts per taxa. Secondly, a table
127                       propagating the sequence counts upwards
128                       in a cumulative fashion.
129                       """
130        # Save attributes #
131        self.fasta       = fasta
132        self.search_algo = search_algo
133        self.num_threads = num_threads
134        self.search_db   = search_db
135        self.output_dir  = output_dir
136        self.search_hits = search_hits
137        self.min_score   = min_score
138        self.score_drop  = score_drop
139        self.min_smlrty  = min_smlrty
140        self.otu_table   = otu_table
141        # Assign default values and change others #
142        self.transform()
143        # Validate attributes #
144        self.validate()

Args:

fasta: The path to a single FASTA file as a string.
       These are the sequences that will be taxonomically
       classified.

search_algo: The algorithm used for the sequence similarity search
             that will be run to match the sequences against the
             database chosen. Either `blast` or `vsearch`. No
             other values are currently supported. By default,
             `blast`.

num_threads: The number of processors to use for the sequence
             similarity search. By default, parallelism is turned
             off and this value is 1. If you pass the value `True`
             we will run as many processes as there are CPUs but
             no more than 32.

search_db: The database used for the sequence similarity search.
           Either `midori253darn`, `silvamod138pr2`, 'mitofish',
           `silvamod128` or `ssuome`.
           By default, `ssuome`. Optionally, the user can
           provide a custom database by specifying the full path
           to a directory containing all required files under
           `search_db`. See the README for more information.

output_dir: The directory into which all the classification
            results will be written to. This defaults to a
            directory with the same name as the original FASTA
            file and a `.crest4` suffix appended.

search_hits: The path where the search results will be stored.
             This defaults to the output directory. However,
             if the search operation has already been completed
             beforehand, specify the path here to skip the
             sequence similarity search step and go directly to
             the taxonomy step. If a hits file exists in the output
             directory and this option is not specified, it is
             deleted and regenerated.

min_score: The minimum bit-score for a search hit to be considered
           when using BLAST as the search algorithm. All hits below
           this score are ignored. When using VSEARCH, this value
           instead indicates the minimum identity between two
           sequences for the hit to be considered.
           The default is `155` for BLAST and `0.75` for VSEARCH.

score_drop: Determines the range of hits to retain and the range
            to discard based on a drop in percentage from the score
            of the best hit. Any hit below the following value:
            "(100 - score_drop)/100 * best_hit_score" is ignored.
            By default, `2.0`.

min_smlrty: Determines if the minimum similarity filter is turned
            on or off. Pass any value like `False` to turn it off.
            The minimum similarity filter prevents classification
            to higher ranks when a minimum rank-identity is not
            met. The default is `True`.

otu_table: Optionally, one can specify the path to an existing OTU
           table in CSV or TSV format when running `crest4`.
           The sequence names in the OTU table must be rows and
           have to match the names in the FASTA file. The column,
           on the other hand, provide your samples names.
           When this option is used, then two extra output files
           are generated. Firstly, a table summarizing the
           assignment counts per taxa. Secondly, a table
           propagating the sequence counts upwards
           in a cumulative fashion.

fasta

search_algo

num_threads

search_db

output_dir

search_hits

min_score

score_drop

min_smlrty

otu_table

def transform(self): View Source

146    def transform(self):
147        """
148        This method will replace empty attributes with defaults when this is
149        needed and will convert others to proper types.
150        """
151        # The fasta should be a FASTA object #
152        if self.fasta is not None:
153            from fasta import FASTA
154            self.fasta = FASTA(self.fasta)
155        # Default for the number of threads #
156        if not isinstance(self.num_threads, int):
157            if self.num_threads is True:
158                self.num_threads = min(multiprocessing.cpu_count(), 32)
159            elif self.num_threads is False:
160                self.num_threads = 1
161            elif self.num_threads.lower() == 'true':
162                self.num_threads = min(multiprocessing.cpu_count(), 32)
163        # The database is always a string #
164        self.search_db = str(self.search_db)
165        # Default for the output directory #
166        if self.output_dir is None:
167            self.output_dir = self.fasta + '.crest4/'
168        self.output_dir = DirectoryPath(self.output_dir)
169        # The search hits is a file somewhere if passed #
170        if self.search_hits is not None:
171            self.search_hits = FilePath(self.search_hits)
172        # Default for the search hits file if not passed #
173        if self.search_hits is None:
174            self.search_hits = FilePath(self.output_dir + 'search.hits')
175            self.search_hits.remove()
176        # Default for the minimum score #
177        if self.min_score is None:
178            if self.search_algo == 'blast':
179                self.min_score = 155
180            if self.search_algo == 'vsearch':
181                self.min_score = 0.75
182        # The score drop has to be a float not a string #
183        try:
184            self.score_drop = float(self.score_drop)
185        except ValueError:
186            msg = "The score drop value must be numerical (not '%s')."
187            raise ValueError(msg % self.score_drop)
188        # Turn off the minimum similarity filter if the user passed any value #
189        if self.min_smlrty is not True: self.min_smlrty = False
190        # The OTU table is a file somewhere if passed #
191        if self.otu_table is not None:
192            self.otu_table = FilePath(self.otu_table)
193            self.otu_table.must_exist()

This method will replace empty attributes with defaults when this is needed and will convert others to proper types.

def validate(self): View Source

195    def validate(self):
196        """
197        This method will raise an Exception if any of the arguments passed by
198        the user are illegal.
199        """
200        # The fasta should exist if passed #
201        if self.fasta is not None:
202            self.fasta.must_exist()
203        # Either the FASTA file or the hits file has to contain something #
204        if not self.fasta and not self.search_hits:
205            msg = "Neither the FASTA file at '%s' nor the search hits file at" \
206                  " '%s' contain any data. Cannot proceed."
207            raise Exception(msg % (self.fasta, self.search_hits))
208        # Check the search algorithm #
209        if self.search_algo not in ('blast', 'vsearch'):
210            msg = "The search algorithm '%s' is not supported."
211            raise ValueError(msg % self.search_algo)
212        # The search database is a known entry or exists on the filesystem #
213        if self.search_db not in all_db_choices:
214            if not os.path.exists(self.search_db):
215                msg = "The search database '%s' is not supported."
216                raise ValueError(msg % self.search_db)
217        # Check the minimum score value is above zero #
218        if self.min_score < 0.0:
219            msg = "The minimum score cannot be smaller than zero ('%s')."
220            raise ValueError(msg % self.min_score)
221        # Check the minimum score value is below one #
222        if self.min_score > 1.0:
223            if self.search_algo == 'vsearch':
224                msg = "The minimum score cannot be more than 1.0 when" \
225                      " using VSEARCH ('%s') because it represents the" \
226                      " the minimum identity between two sequences."
227                raise ValueError(msg % self.min_score)
228        # Check the score drop value #
229        if self.score_drop < 0.0:
230            msg = "The score drop value cannot be smaller than zero ('%s')."
231            raise ValueError(msg % self.min_score)
232        if self.score_drop > 100.0:
233            msg = "The score drop value cannot be over 100 ('%s')."
234            raise ValueError(msg % self.min_score)

This method will raise an Exception if any of the arguments passed by the user are illegal.

database

Retrieve the database object that the user has selected. This can be either one of the standard databases, or a custom specified path.

seqsearch

An object representing the sequence similarity search. Makes use of the seqsearch module. For reference:

Setting -outfmt to 5 means XML output.
Setting -outfmt to 6 means tabular output.
Setting -outfmt to 7 means tabular output with comments.

def search(self): View Source

287    def search(self):
288        """A method to launch the sequence similarity search."""
289        # Launch the search algorithm #
290        return self.seqsearch.run()

A method to launch the sequence similarity search.

score_frac

Using the parameter self.score_drop which is a percentage (e.g. 2) indicating a drop, we compute the minimum remaining amount of score allowed, as a fraction (e.g. 0.98).

queries

Parses the output of the sequence search program used which returns a list of QueryResult objects from the SearchIO.parse module. The list contains one Query object per sequence that was originally inputted. Use these objects to access the taxonomic assignments.

queries_by_id

References the same Query objects as the queries property above, except that this time they are in a dictionary with the query ids (i.e. the original fasta ids) as keys instead of in a list.

out_file

The path to the file that will contain the taxonomic assignments for every sequence.

otu_info

An object giving access to the OTU table information and methods.