crest4.classify
Written by Lucas Sinclair. GNUv3 Licensed. Contact at www.sinclair.bio
Created in May 2021. Last updated in September 2023.
1#!/usr/bin/env python3 2# -*- coding: utf-8 -*- 3 4""" 5Written by Lucas Sinclair. 6GNUv3 Licensed. 7Contact at www.sinclair.bio 8 9Created in May 2021. 10Last updated in September 2023. 11""" 12 13# Built-in modules # 14import os, multiprocessing 15 16# Internal modules # 17import crest4 18import crest4.databases 19from crest4.query import Query 20from crest4.databases import CrestDatabase 21 22# First party modules # 23from plumbing.cache import property_cached 24from autopaths.file_path import FilePath 25from autopaths.dir_path import DirectoryPath 26 27# Constants # 28all_db_choices = ('midori253darn', 'silvamod138pr2', 'silvamod128', 'mitofish', 29 'ssuome') 30 31############################################################################### 32class Classify: 33 """ 34 This is the main object offered by the `crest4` package. 35 It enables you to automatically assign taxonomic names to DNA sequences 36 obtained from environmental sequencing. 37 38 After creating a new instance of a `Classify` object, you can simply call 39 it to get your input data processed and the assignments output file 40 generated. Examples are included in the `README.md` file of this package 41 which can directly be seen on the GitHub page at: 42 43 https://github.com/xapple/crest4/ 44 """ 45 46 def __init__(self, 47 fasta, 48 search_algo = 'blast', 49 num_threads = 1, 50 search_db = 'ssuome', 51 output_dir = None, 52 search_hits = None, 53 min_score = None, 54 score_drop = 2.0, 55 min_smlrty = True, 56 otu_table = None, 57 ): 58 """ 59 Args: 60 61 fasta: The path to a single FASTA file as a string. 62 These are the sequences that will be taxonomically 63 classified. 64 65 search_algo: The algorithm used for the sequence similarity search 66 that will be run to match the sequences against the 67 database chosen. Either `blast` or `vsearch`. No 68 other values are currently supported. By default, 69 `blast`. 70 71 num_threads: The number of processors to use for the sequence 72 similarity search. By default, parallelism is turned 73 off and this value is 1. If you pass the value `True` 74 we will run as many processes as there are CPUs but 75 no more than 32. 76 77 search_db: The database used for the sequence similarity search. 78 Either `midori253darn`, `silvamod138pr2`, 'mitofish', 79 `silvamod128` or `ssuome`. 80 By default, `ssuome`. Optionally, the user can 81 provide a custom database by specifying the full path 82 to a directory containing all required files under 83 `search_db`. See the README for more information. 84 85 output_dir: The directory into which all the classification 86 results will be written to. This defaults to a 87 directory with the same name as the original FASTA 88 file and a `.crest4` suffix appended. 89 90 search_hits: The path where the search results will be stored. 91 This defaults to the output directory. However, 92 if the search operation has already been completed 93 beforehand, specify the path here to skip the 94 sequence similarity search step and go directly to 95 the taxonomy step. If a hits file exists in the output 96 directory and this option is not specified, it is 97 deleted and regenerated. 98 99 min_score: The minimum bit-score for a search hit to be considered 100 when using BLAST as the search algorithm. All hits below 101 this score are ignored. When using VSEARCH, this value 102 instead indicates the minimum identity between two 103 sequences for the hit to be considered. 104 The default is `155` for BLAST and `0.75` for VSEARCH. 105 106 score_drop: Determines the range of hits to retain and the range 107 to discard based on a drop in percentage from the score 108 of the best hit. Any hit below the following value: 109 "(100 - score_drop)/100 * best_hit_score" is ignored. 110 By default, `2.0`. 111 112 min_smlrty: Determines if the minimum similarity filter is turned 113 on or off. Pass any value like `False` to turn it off. 114 The minimum similarity filter prevents classification 115 to higher ranks when a minimum rank-identity is not 116 met. The default is `True`. 117 118 otu_table: Optionally, one can specify the path to an existing OTU 119 table in CSV or TSV format when running `crest4`. 120 The sequence names in the OTU table must be rows and 121 have to match the names in the FASTA file. The column, 122 on the other hand, provide your samples names. 123 When this option is used, then two extra output files 124 are generated. Firstly, a table summarizing the 125 assignment counts per taxa. Secondly, a table 126 propagating the sequence counts upwards 127 in a cumulative fashion. 128 """ 129 # Save attributes # 130 self.fasta = fasta 131 self.search_algo = search_algo 132 self.num_threads = num_threads 133 self.search_db = search_db 134 self.output_dir = output_dir 135 self.search_hits = search_hits 136 self.min_score = min_score 137 self.score_drop = score_drop 138 self.min_smlrty = min_smlrty 139 self.otu_table = otu_table 140 # Assign default values and change others # 141 self.transform() 142 # Validate attributes # 143 self.validate() 144 145 def transform(self): 146 """ 147 This method will replace empty attributes with defaults when this is 148 needed and will convert others to proper types. 149 """ 150 # The fasta should be a FASTA object # 151 if self.fasta is not None: 152 from fasta import FASTA 153 self.fasta = FASTA(self.fasta) 154 # Default for the number of threads # 155 if not isinstance(self.num_threads, int): 156 if self.num_threads is True: 157 self.num_threads = min(multiprocessing.cpu_count(), 32) 158 elif self.num_threads is False: 159 self.num_threads = 1 160 elif self.num_threads.lower() == 'true': 161 self.num_threads = min(multiprocessing.cpu_count(), 32) 162 # The database is always a string # 163 self.search_db = str(self.search_db) 164 # Default for the output directory # 165 if self.output_dir is None: 166 self.output_dir = self.fasta + '.crest4/' 167 self.output_dir = DirectoryPath(self.output_dir) 168 # The search hits is a file somewhere if passed # 169 if self.search_hits is not None: 170 self.search_hits = FilePath(self.search_hits) 171 # Default for the search hits file if not passed # 172 if self.search_hits is None: 173 self.search_hits = FilePath(self.output_dir + 'search.hits') 174 self.search_hits.remove() 175 # Default for the minimum score # 176 if self.min_score is None: 177 if self.search_algo == 'blast': 178 self.min_score = 155 179 if self.search_algo == 'vsearch': 180 self.min_score = 0.75 181 # The score drop has to be a float not a string # 182 try: 183 self.score_drop = float(self.score_drop) 184 except ValueError: 185 msg = "The score drop value must be numerical (not '%s')." 186 raise ValueError(msg % self.score_drop) 187 # Turn off the minimum similarity filter if the user passed any value # 188 if self.min_smlrty is not True: self.min_smlrty = False 189 # The OTU table is a file somewhere if passed # 190 if self.otu_table is not None: 191 self.otu_table = FilePath(self.otu_table) 192 self.otu_table.must_exist() 193 194 def validate(self): 195 """ 196 This method will raise an Exception if any of the arguments passed by 197 the user are illegal. 198 """ 199 # The fasta should exist if passed # 200 if self.fasta is not None: 201 self.fasta.must_exist() 202 # Either the FASTA file or the hits file has to contain something # 203 if not self.fasta and not self.search_hits: 204 msg = "Neither the FASTA file at '%s' nor the search hits file at" \ 205 " '%s' contain any data. Cannot proceed." 206 raise Exception(msg % (self.fasta, self.search_hits)) 207 # Check the search algorithm # 208 if self.search_algo not in ('blast', 'vsearch'): 209 msg = "The search algorithm '%s' is not supported." 210 raise ValueError(msg % self.search_algo) 211 # The search database is a known entry or exists on the filesystem # 212 if self.search_db not in all_db_choices: 213 if not os.path.exists(self.search_db): 214 msg = "The search database '%s' is not supported." 215 raise ValueError(msg % self.search_db) 216 # Check the minimum score value is above zero # 217 if self.min_score < 0.0: 218 msg = "The minimum score cannot be smaller than zero ('%s')." 219 raise ValueError(msg % self.min_score) 220 # Check the minimum score value is below one # 221 if self.min_score > 1.0: 222 if self.search_algo == 'vsearch': 223 msg = "The minimum score cannot be more than 1.0 when" \ 224 " using VSEARCH ('%s') because it represents the" \ 225 " the minimum identity between two sequences." 226 raise ValueError(msg % self.min_score) 227 # Check the score drop value # 228 if self.score_drop < 0.0: 229 msg = "The score drop value cannot be smaller than zero ('%s')." 230 raise ValueError(msg % self.min_score) 231 if self.score_drop > 100.0: 232 msg = "The score drop value cannot be over 100 ('%s')." 233 raise ValueError(msg % self.min_score) 234 235 def __repr__(self): 236 """A simple representation of this object to avoid memory addresses.""" 237 return "<%s object on '%s'>" % (self.__class__.__name__, self.fasta) 238 239 @property_cached 240 def database(self): 241 """ 242 Retrieve the database object that the user has selected. 243 This can be either one of the standard databases, or a custom 244 specified path. 245 """ 246 if self.search_db not in all_db_choices: 247 # Take the absolute path # 248 user_path = os.path.abspath(self.search_db) 249 # Make the object # 250 return CrestDatabase(custom_path=user_path) 251 else: 252 return getattr(crest4.databases, self.search_db) 253 254 #------------------------------ Searching --------------------------------# 255 @property_cached 256 def seqsearch(self): 257 """ 258 An object representing the sequence similarity search. 259 Makes use of the `seqsearch` module. For reference: 260 261 * Setting `-outfmt` to 5 means XML output. 262 * Setting `-outfmt` to 6 means tabular output. 263 * Setting `-outfmt` to 7 means tabular output with comments. 264 """ 265 # Initialize # 266 params = {} 267 # If the user chose BLAST then we have to specify tabular output # 268 if self.search_algo == 'blast': 269 params = {'-outfmt': '7 qseqid sseqid bitscore length nident'} 270 # In case the user chose VSEARCH we specify the minimum identify 271 # and the minimum sequence match length 272 if self.search_algo == 'vsearch': 273 params = {'--id': self.min_score, 274 '--mincols': 25} 275 # Build and return the object # 276 from seqsearch.search import SeqSearch 277 return SeqSearch(input_fasta = self.fasta, 278 database = self.database, 279 seq_type = 'nucl', 280 algorithm = self.search_algo, 281 filtering = {'max_targets': 100}, 282 num_threads = self.num_threads, 283 out_path = self.search_hits, 284 params = params) 285 286 def search(self): 287 """A method to launch the sequence similarity search.""" 288 # Launch the search algorithm # 289 return self.seqsearch.run() 290 291 #----------------------------- Assigning ---------------------------------# 292 @property_cached 293 def score_frac(self): 294 """ 295 Using the parameter `self.score_drop` which is a percentage (e.g. 2) 296 indicating a drop, we compute the minimum remaining amount of score 297 allowed, as a fraction (e.g. 0.98). 298 """ 299 return 1 - (self.score_drop / 100) 300 301 @property_cached 302 def queries(self): 303 """ 304 Parses the output of the sequence search program used which returns 305 a list of `QueryResult` objects from the `SearchIO.parse` module. 306 The list contains one Query object per sequence that was originally 307 inputted. Use these objects to access the taxonomic assignments. 308 """ 309 # Check if the search has been done already # 310 if not self.search_hits: self.search() 311 # Iterate on the sequence search results # 312 result = [Query(self, query) for query in self.seqsearch.results] 313 # VSEARCH entirely forgets about sequences that had no hits. 314 # Instead of still listing them in the output like BLAST. 315 # So we have to add them back to the list in this awkward manner 316 if self.search_algo == 'vsearch': 317 reported_names = set(query.name for query in result) 318 for seq in self.fasta: 319 if seq.id not in reported_names: 320 q = type('FakeQuery', (), {'hits': [], 'id': seq.id}) 321 result.append(Query(self, q)) 322 # Return # 323 return result 324 325 @property_cached 326 def queries_by_id(self): 327 """ 328 References the same Query objects as the `queries` property above, 329 except that this time they are in a dictionary with the query ids 330 (i.e. the original fasta ids) as keys instead of in a list. 331 """ 332 return {query.name: query for query in self.queries} 333 334 #------------------------------- Outputs ---------------------------------# 335 @property_cached 336 def out_file(self): 337 """ 338 The path to the file that will contain the taxonomic assignments 339 for every sequence. 340 """ 341 # Make sure that output directory exists # 342 self.output_dir.create_if_not_exists() 343 # Return # 344 return self.output_dir + "assignments.txt" 345 346 @property_cached 347 def otu_info(self): 348 """An object giving access to the OTU table information and methods.""" 349 from crest4.otu_tables import InfoFromTableOTUs 350 return InfoFromTableOTUs(self, self.otu_table) 351 352 def __call__(self): 353 """Generate outputs.""" 354 # Intro message # 355 print('Running crest4 version ' + crest4.__version__) 356 # Iterate # 357 self.out_file.writelines(query.tax_string for query in self.queries) 358 # Special case where an OTU table was passed # 359 if self.otu_table: 360 path_by_rank = self.output_dir + 'otus_by_rank.tsv' 361 path_cumulative = self.output_dir + 'otus_cumulative.tsv' 362 otus_by_rank = self.otu_info.otus_by_rank 363 otus_cumulative = self.otu_info.otus_cumulative 364 otus_by_rank.to_csv(path_by_rank, index=False, sep='\t') 365 otus_cumulative.to_csv(path_cumulative, index=False, sep='\t') 366 # Print a success message # 367 msg = "Classification ran successfully. Results are placed in '%s'." 368 print(msg % self.out_file) 369 # Return # 370 return self.out_file
33class Classify: 34 """ 35 This is the main object offered by the `crest4` package. 36 It enables you to automatically assign taxonomic names to DNA sequences 37 obtained from environmental sequencing. 38 39 After creating a new instance of a `Classify` object, you can simply call 40 it to get your input data processed and the assignments output file 41 generated. Examples are included in the `README.md` file of this package 42 which can directly be seen on the GitHub page at: 43 44 https://github.com/xapple/crest4/ 45 """ 46 47 def __init__(self, 48 fasta, 49 search_algo = 'blast', 50 num_threads = 1, 51 search_db = 'ssuome', 52 output_dir = None, 53 search_hits = None, 54 min_score = None, 55 score_drop = 2.0, 56 min_smlrty = True, 57 otu_table = None, 58 ): 59 """ 60 Args: 61 62 fasta: The path to a single FASTA file as a string. 63 These are the sequences that will be taxonomically 64 classified. 65 66 search_algo: The algorithm used for the sequence similarity search 67 that will be run to match the sequences against the 68 database chosen. Either `blast` or `vsearch`. No 69 other values are currently supported. By default, 70 `blast`. 71 72 num_threads: The number of processors to use for the sequence 73 similarity search. By default, parallelism is turned 74 off and this value is 1. If you pass the value `True` 75 we will run as many processes as there are CPUs but 76 no more than 32. 77 78 search_db: The database used for the sequence similarity search. 79 Either `midori253darn`, `silvamod138pr2`, 'mitofish', 80 `silvamod128` or `ssuome`. 81 By default, `ssuome`. Optionally, the user can 82 provide a custom database by specifying the full path 83 to a directory containing all required files under 84 `search_db`. See the README for more information. 85 86 output_dir: The directory into which all the classification 87 results will be written to. This defaults to a 88 directory with the same name as the original FASTA 89 file and a `.crest4` suffix appended. 90 91 search_hits: The path where the search results will be stored. 92 This defaults to the output directory. However, 93 if the search operation has already been completed 94 beforehand, specify the path here to skip the 95 sequence similarity search step and go directly to 96 the taxonomy step. If a hits file exists in the output 97 directory and this option is not specified, it is 98 deleted and regenerated. 99 100 min_score: The minimum bit-score for a search hit to be considered 101 when using BLAST as the search algorithm. All hits below 102 this score are ignored. When using VSEARCH, this value 103 instead indicates the minimum identity between two 104 sequences for the hit to be considered. 105 The default is `155` for BLAST and `0.75` for VSEARCH. 106 107 score_drop: Determines the range of hits to retain and the range 108 to discard based on a drop in percentage from the score 109 of the best hit. Any hit below the following value: 110 "(100 - score_drop)/100 * best_hit_score" is ignored. 111 By default, `2.0`. 112 113 min_smlrty: Determines if the minimum similarity filter is turned 114 on or off. Pass any value like `False` to turn it off. 115 The minimum similarity filter prevents classification 116 to higher ranks when a minimum rank-identity is not 117 met. The default is `True`. 118 119 otu_table: Optionally, one can specify the path to an existing OTU 120 table in CSV or TSV format when running `crest4`. 121 The sequence names in the OTU table must be rows and 122 have to match the names in the FASTA file. The column, 123 on the other hand, provide your samples names. 124 When this option is used, then two extra output files 125 are generated. Firstly, a table summarizing the 126 assignment counts per taxa. Secondly, a table 127 propagating the sequence counts upwards 128 in a cumulative fashion. 129 """ 130 # Save attributes # 131 self.fasta = fasta 132 self.search_algo = search_algo 133 self.num_threads = num_threads 134 self.search_db = search_db 135 self.output_dir = output_dir 136 self.search_hits = search_hits 137 self.min_score = min_score 138 self.score_drop = score_drop 139 self.min_smlrty = min_smlrty 140 self.otu_table = otu_table 141 # Assign default values and change others # 142 self.transform() 143 # Validate attributes # 144 self.validate() 145 146 def transform(self): 147 """ 148 This method will replace empty attributes with defaults when this is 149 needed and will convert others to proper types. 150 """ 151 # The fasta should be a FASTA object # 152 if self.fasta is not None: 153 from fasta import FASTA 154 self.fasta = FASTA(self.fasta) 155 # Default for the number of threads # 156 if not isinstance(self.num_threads, int): 157 if self.num_threads is True: 158 self.num_threads = min(multiprocessing.cpu_count(), 32) 159 elif self.num_threads is False: 160 self.num_threads = 1 161 elif self.num_threads.lower() == 'true': 162 self.num_threads = min(multiprocessing.cpu_count(), 32) 163 # The database is always a string # 164 self.search_db = str(self.search_db) 165 # Default for the output directory # 166 if self.output_dir is None: 167 self.output_dir = self.fasta + '.crest4/' 168 self.output_dir = DirectoryPath(self.output_dir) 169 # The search hits is a file somewhere if passed # 170 if self.search_hits is not None: 171 self.search_hits = FilePath(self.search_hits) 172 # Default for the search hits file if not passed # 173 if self.search_hits is None: 174 self.search_hits = FilePath(self.output_dir + 'search.hits') 175 self.search_hits.remove() 176 # Default for the minimum score # 177 if self.min_score is None: 178 if self.search_algo == 'blast': 179 self.min_score = 155 180 if self.search_algo == 'vsearch': 181 self.min_score = 0.75 182 # The score drop has to be a float not a string # 183 try: 184 self.score_drop = float(self.score_drop) 185 except ValueError: 186 msg = "The score drop value must be numerical (not '%s')." 187 raise ValueError(msg % self.score_drop) 188 # Turn off the minimum similarity filter if the user passed any value # 189 if self.min_smlrty is not True: self.min_smlrty = False 190 # The OTU table is a file somewhere if passed # 191 if self.otu_table is not None: 192 self.otu_table = FilePath(self.otu_table) 193 self.otu_table.must_exist() 194 195 def validate(self): 196 """ 197 This method will raise an Exception if any of the arguments passed by 198 the user are illegal. 199 """ 200 # The fasta should exist if passed # 201 if self.fasta is not None: 202 self.fasta.must_exist() 203 # Either the FASTA file or the hits file has to contain something # 204 if not self.fasta and not self.search_hits: 205 msg = "Neither the FASTA file at '%s' nor the search hits file at" \ 206 " '%s' contain any data. Cannot proceed." 207 raise Exception(msg % (self.fasta, self.search_hits)) 208 # Check the search algorithm # 209 if self.search_algo not in ('blast', 'vsearch'): 210 msg = "The search algorithm '%s' is not supported." 211 raise ValueError(msg % self.search_algo) 212 # The search database is a known entry or exists on the filesystem # 213 if self.search_db not in all_db_choices: 214 if not os.path.exists(self.search_db): 215 msg = "The search database '%s' is not supported." 216 raise ValueError(msg % self.search_db) 217 # Check the minimum score value is above zero # 218 if self.min_score < 0.0: 219 msg = "The minimum score cannot be smaller than zero ('%s')." 220 raise ValueError(msg % self.min_score) 221 # Check the minimum score value is below one # 222 if self.min_score > 1.0: 223 if self.search_algo == 'vsearch': 224 msg = "The minimum score cannot be more than 1.0 when" \ 225 " using VSEARCH ('%s') because it represents the" \ 226 " the minimum identity between two sequences." 227 raise ValueError(msg % self.min_score) 228 # Check the score drop value # 229 if self.score_drop < 0.0: 230 msg = "The score drop value cannot be smaller than zero ('%s')." 231 raise ValueError(msg % self.min_score) 232 if self.score_drop > 100.0: 233 msg = "The score drop value cannot be over 100 ('%s')." 234 raise ValueError(msg % self.min_score) 235 236 def __repr__(self): 237 """A simple representation of this object to avoid memory addresses.""" 238 return "<%s object on '%s'>" % (self.__class__.__name__, self.fasta) 239 240 @property_cached 241 def database(self): 242 """ 243 Retrieve the database object that the user has selected. 244 This can be either one of the standard databases, or a custom 245 specified path. 246 """ 247 if self.search_db not in all_db_choices: 248 # Take the absolute path # 249 user_path = os.path.abspath(self.search_db) 250 # Make the object # 251 return CrestDatabase(custom_path=user_path) 252 else: 253 return getattr(crest4.databases, self.search_db) 254 255 #------------------------------ Searching --------------------------------# 256 @property_cached 257 def seqsearch(self): 258 """ 259 An object representing the sequence similarity search. 260 Makes use of the `seqsearch` module. For reference: 261 262 * Setting `-outfmt` to 5 means XML output. 263 * Setting `-outfmt` to 6 means tabular output. 264 * Setting `-outfmt` to 7 means tabular output with comments. 265 """ 266 # Initialize # 267 params = {} 268 # If the user chose BLAST then we have to specify tabular output # 269 if self.search_algo == 'blast': 270 params = {'-outfmt': '7 qseqid sseqid bitscore length nident'} 271 # In case the user chose VSEARCH we specify the minimum identify 272 # and the minimum sequence match length 273 if self.search_algo == 'vsearch': 274 params = {'--id': self.min_score, 275 '--mincols': 25} 276 # Build and return the object # 277 from seqsearch.search import SeqSearch 278 return SeqSearch(input_fasta = self.fasta, 279 database = self.database, 280 seq_type = 'nucl', 281 algorithm = self.search_algo, 282 filtering = {'max_targets': 100}, 283 num_threads = self.num_threads, 284 out_path = self.search_hits, 285 params = params) 286 287 def search(self): 288 """A method to launch the sequence similarity search.""" 289 # Launch the search algorithm # 290 return self.seqsearch.run() 291 292 #----------------------------- Assigning ---------------------------------# 293 @property_cached 294 def score_frac(self): 295 """ 296 Using the parameter `self.score_drop` which is a percentage (e.g. 2) 297 indicating a drop, we compute the minimum remaining amount of score 298 allowed, as a fraction (e.g. 0.98). 299 """ 300 return 1 - (self.score_drop / 100) 301 302 @property_cached 303 def queries(self): 304 """ 305 Parses the output of the sequence search program used which returns 306 a list of `QueryResult` objects from the `SearchIO.parse` module. 307 The list contains one Query object per sequence that was originally 308 inputted. Use these objects to access the taxonomic assignments. 309 """ 310 # Check if the search has been done already # 311 if not self.search_hits: self.search() 312 # Iterate on the sequence search results # 313 result = [Query(self, query) for query in self.seqsearch.results] 314 # VSEARCH entirely forgets about sequences that had no hits. 315 # Instead of still listing them in the output like BLAST. 316 # So we have to add them back to the list in this awkward manner 317 if self.search_algo == 'vsearch': 318 reported_names = set(query.name for query in result) 319 for seq in self.fasta: 320 if seq.id not in reported_names: 321 q = type('FakeQuery', (), {'hits': [], 'id': seq.id}) 322 result.append(Query(self, q)) 323 # Return # 324 return result 325 326 @property_cached 327 def queries_by_id(self): 328 """ 329 References the same Query objects as the `queries` property above, 330 except that this time they are in a dictionary with the query ids 331 (i.e. the original fasta ids) as keys instead of in a list. 332 """ 333 return {query.name: query for query in self.queries} 334 335 #------------------------------- Outputs ---------------------------------# 336 @property_cached 337 def out_file(self): 338 """ 339 The path to the file that will contain the taxonomic assignments 340 for every sequence. 341 """ 342 # Make sure that output directory exists # 343 self.output_dir.create_if_not_exists() 344 # Return # 345 return self.output_dir + "assignments.txt" 346 347 @property_cached 348 def otu_info(self): 349 """An object giving access to the OTU table information and methods.""" 350 from crest4.otu_tables import InfoFromTableOTUs 351 return InfoFromTableOTUs(self, self.otu_table) 352 353 def __call__(self): 354 """Generate outputs.""" 355 # Intro message # 356 print('Running crest4 version ' + crest4.__version__) 357 # Iterate # 358 self.out_file.writelines(query.tax_string for query in self.queries) 359 # Special case where an OTU table was passed # 360 if self.otu_table: 361 path_by_rank = self.output_dir + 'otus_by_rank.tsv' 362 path_cumulative = self.output_dir + 'otus_cumulative.tsv' 363 otus_by_rank = self.otu_info.otus_by_rank 364 otus_cumulative = self.otu_info.otus_cumulative 365 otus_by_rank.to_csv(path_by_rank, index=False, sep='\t') 366 otus_cumulative.to_csv(path_cumulative, index=False, sep='\t') 367 # Print a success message # 368 msg = "Classification ran successfully. Results are placed in '%s'." 369 print(msg % self.out_file) 370 # Return # 371 return self.out_file
This is the main object offered by the crest4
package.
It enables you to automatically assign taxonomic names to DNA sequences
obtained from environmental sequencing.
After creating a new instance of a Classify
object, you can simply call
it to get your input data processed and the assignments output file
generated. Examples are included in the README.md
file of this package
which can directly be seen on the GitHub page at:
47 def __init__(self, 48 fasta, 49 search_algo = 'blast', 50 num_threads = 1, 51 search_db = 'ssuome', 52 output_dir = None, 53 search_hits = None, 54 min_score = None, 55 score_drop = 2.0, 56 min_smlrty = True, 57 otu_table = None, 58 ): 59 """ 60 Args: 61 62 fasta: The path to a single FASTA file as a string. 63 These are the sequences that will be taxonomically 64 classified. 65 66 search_algo: The algorithm used for the sequence similarity search 67 that will be run to match the sequences against the 68 database chosen. Either `blast` or `vsearch`. No 69 other values are currently supported. By default, 70 `blast`. 71 72 num_threads: The number of processors to use for the sequence 73 similarity search. By default, parallelism is turned 74 off and this value is 1. If you pass the value `True` 75 we will run as many processes as there are CPUs but 76 no more than 32. 77 78 search_db: The database used for the sequence similarity search. 79 Either `midori253darn`, `silvamod138pr2`, 'mitofish', 80 `silvamod128` or `ssuome`. 81 By default, `ssuome`. Optionally, the user can 82 provide a custom database by specifying the full path 83 to a directory containing all required files under 84 `search_db`. See the README for more information. 85 86 output_dir: The directory into which all the classification 87 results will be written to. This defaults to a 88 directory with the same name as the original FASTA 89 file and a `.crest4` suffix appended. 90 91 search_hits: The path where the search results will be stored. 92 This defaults to the output directory. However, 93 if the search operation has already been completed 94 beforehand, specify the path here to skip the 95 sequence similarity search step and go directly to 96 the taxonomy step. If a hits file exists in the output 97 directory and this option is not specified, it is 98 deleted and regenerated. 99 100 min_score: The minimum bit-score for a search hit to be considered 101 when using BLAST as the search algorithm. All hits below 102 this score are ignored. When using VSEARCH, this value 103 instead indicates the minimum identity between two 104 sequences for the hit to be considered. 105 The default is `155` for BLAST and `0.75` for VSEARCH. 106 107 score_drop: Determines the range of hits to retain and the range 108 to discard based on a drop in percentage from the score 109 of the best hit. Any hit below the following value: 110 "(100 - score_drop)/100 * best_hit_score" is ignored. 111 By default, `2.0`. 112 113 min_smlrty: Determines if the minimum similarity filter is turned 114 on or off. Pass any value like `False` to turn it off. 115 The minimum similarity filter prevents classification 116 to higher ranks when a minimum rank-identity is not 117 met. The default is `True`. 118 119 otu_table: Optionally, one can specify the path to an existing OTU 120 table in CSV or TSV format when running `crest4`. 121 The sequence names in the OTU table must be rows and 122 have to match the names in the FASTA file. The column, 123 on the other hand, provide your samples names. 124 When this option is used, then two extra output files 125 are generated. Firstly, a table summarizing the 126 assignment counts per taxa. Secondly, a table 127 propagating the sequence counts upwards 128 in a cumulative fashion. 129 """ 130 # Save attributes # 131 self.fasta = fasta 132 self.search_algo = search_algo 133 self.num_threads = num_threads 134 self.search_db = search_db 135 self.output_dir = output_dir 136 self.search_hits = search_hits 137 self.min_score = min_score 138 self.score_drop = score_drop 139 self.min_smlrty = min_smlrty 140 self.otu_table = otu_table 141 # Assign default values and change others # 142 self.transform() 143 # Validate attributes # 144 self.validate()
Args:
fasta: The path to a single FASTA file as a string.
These are the sequences that will be taxonomically
classified.
search_algo: The algorithm used for the sequence similarity search
that will be run to match the sequences against the
database chosen. Either `blast` or `vsearch`. No
other values are currently supported. By default,
`blast`.
num_threads: The number of processors to use for the sequence
similarity search. By default, parallelism is turned
off and this value is 1. If you pass the value `True`
we will run as many processes as there are CPUs but
no more than 32.
search_db: The database used for the sequence similarity search.
Either `midori253darn`, `silvamod138pr2`, 'mitofish',
`silvamod128` or `ssuome`.
By default, `ssuome`. Optionally, the user can
provide a custom database by specifying the full path
to a directory containing all required files under
`search_db`. See the README for more information.
output_dir: The directory into which all the classification
results will be written to. This defaults to a
directory with the same name as the original FASTA
file and a `.crest4` suffix appended.
search_hits: The path where the search results will be stored.
This defaults to the output directory. However,
if the search operation has already been completed
beforehand, specify the path here to skip the
sequence similarity search step and go directly to
the taxonomy step. If a hits file exists in the output
directory and this option is not specified, it is
deleted and regenerated.
min_score: The minimum bit-score for a search hit to be considered
when using BLAST as the search algorithm. All hits below
this score are ignored. When using VSEARCH, this value
instead indicates the minimum identity between two
sequences for the hit to be considered.
The default is `155` for BLAST and `0.75` for VSEARCH.
score_drop: Determines the range of hits to retain and the range
to discard based on a drop in percentage from the score
of the best hit. Any hit below the following value:
"(100 - score_drop)/100 * best_hit_score" is ignored.
By default, `2.0`.
min_smlrty: Determines if the minimum similarity filter is turned
on or off. Pass any value like `False` to turn it off.
The minimum similarity filter prevents classification
to higher ranks when a minimum rank-identity is not
met. The default is `True`.
otu_table: Optionally, one can specify the path to an existing OTU
table in CSV or TSV format when running `crest4`.
The sequence names in the OTU table must be rows and
have to match the names in the FASTA file. The column,
on the other hand, provide your samples names.
When this option is used, then two extra output files
are generated. Firstly, a table summarizing the
assignment counts per taxa. Secondly, a table
propagating the sequence counts upwards
in a cumulative fashion.
146 def transform(self): 147 """ 148 This method will replace empty attributes with defaults when this is 149 needed and will convert others to proper types. 150 """ 151 # The fasta should be a FASTA object # 152 if self.fasta is not None: 153 from fasta import FASTA 154 self.fasta = FASTA(self.fasta) 155 # Default for the number of threads # 156 if not isinstance(self.num_threads, int): 157 if self.num_threads is True: 158 self.num_threads = min(multiprocessing.cpu_count(), 32) 159 elif self.num_threads is False: 160 self.num_threads = 1 161 elif self.num_threads.lower() == 'true': 162 self.num_threads = min(multiprocessing.cpu_count(), 32) 163 # The database is always a string # 164 self.search_db = str(self.search_db) 165 # Default for the output directory # 166 if self.output_dir is None: 167 self.output_dir = self.fasta + '.crest4/' 168 self.output_dir = DirectoryPath(self.output_dir) 169 # The search hits is a file somewhere if passed # 170 if self.search_hits is not None: 171 self.search_hits = FilePath(self.search_hits) 172 # Default for the search hits file if not passed # 173 if self.search_hits is None: 174 self.search_hits = FilePath(self.output_dir + 'search.hits') 175 self.search_hits.remove() 176 # Default for the minimum score # 177 if self.min_score is None: 178 if self.search_algo == 'blast': 179 self.min_score = 155 180 if self.search_algo == 'vsearch': 181 self.min_score = 0.75 182 # The score drop has to be a float not a string # 183 try: 184 self.score_drop = float(self.score_drop) 185 except ValueError: 186 msg = "The score drop value must be numerical (not '%s')." 187 raise ValueError(msg % self.score_drop) 188 # Turn off the minimum similarity filter if the user passed any value # 189 if self.min_smlrty is not True: self.min_smlrty = False 190 # The OTU table is a file somewhere if passed # 191 if self.otu_table is not None: 192 self.otu_table = FilePath(self.otu_table) 193 self.otu_table.must_exist()
This method will replace empty attributes with defaults when this is needed and will convert others to proper types.
195 def validate(self): 196 """ 197 This method will raise an Exception if any of the arguments passed by 198 the user are illegal. 199 """ 200 # The fasta should exist if passed # 201 if self.fasta is not None: 202 self.fasta.must_exist() 203 # Either the FASTA file or the hits file has to contain something # 204 if not self.fasta and not self.search_hits: 205 msg = "Neither the FASTA file at '%s' nor the search hits file at" \ 206 " '%s' contain any data. Cannot proceed." 207 raise Exception(msg % (self.fasta, self.search_hits)) 208 # Check the search algorithm # 209 if self.search_algo not in ('blast', 'vsearch'): 210 msg = "The search algorithm '%s' is not supported." 211 raise ValueError(msg % self.search_algo) 212 # The search database is a known entry or exists on the filesystem # 213 if self.search_db not in all_db_choices: 214 if not os.path.exists(self.search_db): 215 msg = "The search database '%s' is not supported." 216 raise ValueError(msg % self.search_db) 217 # Check the minimum score value is above zero # 218 if self.min_score < 0.0: 219 msg = "The minimum score cannot be smaller than zero ('%s')." 220 raise ValueError(msg % self.min_score) 221 # Check the minimum score value is below one # 222 if self.min_score > 1.0: 223 if self.search_algo == 'vsearch': 224 msg = "The minimum score cannot be more than 1.0 when" \ 225 " using VSEARCH ('%s') because it represents the" \ 226 " the minimum identity between two sequences." 227 raise ValueError(msg % self.min_score) 228 # Check the score drop value # 229 if self.score_drop < 0.0: 230 msg = "The score drop value cannot be smaller than zero ('%s')." 231 raise ValueError(msg % self.min_score) 232 if self.score_drop > 100.0: 233 msg = "The score drop value cannot be over 100 ('%s')." 234 raise ValueError(msg % self.min_score)
This method will raise an Exception if any of the arguments passed by the user are illegal.
Retrieve the database object that the user has selected. This can be either one of the standard databases, or a custom specified path.
An object representing the sequence similarity search.
Makes use of the seqsearch
module. For reference:
- Setting
-outfmt
to 5 means XML output. - Setting
-outfmt
to 6 means tabular output. - Setting
-outfmt
to 7 means tabular output with comments.
287 def search(self): 288 """A method to launch the sequence similarity search.""" 289 # Launch the search algorithm # 290 return self.seqsearch.run()
A method to launch the sequence similarity search.
Using the parameter self.score_drop
which is a percentage (e.g. 2)
indicating a drop, we compute the minimum remaining amount of score
allowed, as a fraction (e.g. 0.98).
Parses the output of the sequence search program used which returns
a list of QueryResult
objects from the SearchIO.parse
module.
The list contains one Query object per sequence that was originally
inputted. Use these objects to access the taxonomic assignments.
References the same Query objects as the queries
property above,
except that this time they are in a dictionary with the query ids
(i.e. the original fasta ids) as keys instead of in a list.