Filtering query results by taxonomic restrictions is a “must” for biological questions and is implemented in HitKeeper. The functionality is very similar to that of hit_query, with two exceptions: cla_parent and not_cla_parent. Some of the supported queries are:
- Retrieve the sequences that come from a species or group of organisms.
- Get a list of the species that belong to a species or to a group of organisms.
- Determine if a given species is a descendant of a group of organisms.
- Get the ordered list of the ancestors of a given species (the answer to this query makes it possible to display a local taxonomic distribution of a set of sequences).
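The descendant test and the ordered ancestor list can be illustrated with a minimal Python sketch. It does not show how HitKeeper stores or queries its taxonomy; the child-to-parent map, the deliberately shortened lineages, and the function names `ancestors`, `is_descendant`, and `species_in_group` are all assumptions made for illustration.

```python
# Hypothetical child -> parent map; the lineages are shortened on purpose
# and do not reproduce the real taxonomic tree.
PARENT = {
    "Homo_sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Gallus_gallus": "Gallus",
    "Gallus": "Aves",
}


def ancestors(taxon, parent=PARENT):
    """Return the ancestors of `taxon`, ordered from nearest to most distant."""
    chain = []
    while taxon in parent:
        taxon = parent[taxon]
        chain.append(taxon)
    return chain


def is_descendant(taxon, group, parent=PARENT):
    """True if `group` appears anywhere above `taxon` in the tree."""
    return group in ancestors(taxon, parent)


def species_in_group(group, parent=PARENT):
    """List the leaf taxa (treated here as species) that lie below `group`."""
    leaves = set(parent) - set(parent.values())
    return sorted(t for t in leaves if is_descendant(t, group, parent))


if __name__ == "__main__":
    print(ancestors("Homo_sapiens"))
    print(is_descendant("Gallus_gallus", "Aves"))
    print(species_in_group("Mammalia"))
```

Walking the parent links upwards keeps the ancestor list ordered for free, which is the property needed to display a taxonomic distribution level by level.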
The following query refers to all sequences that belong to the species Homo sapiens:
hat_query cla_name=Homo_sapiens -ref=$HUMAN
Thus, to retrieve all hits from a single organism, you can use the following construct, which re-uses the saved query $HUMAN:
hit_query seq_name=$HUMAN -out=/path/to/outfile.txt
The following two-step query first takes Homo_sapiens as the root and searches the taxonomy “downwards” from there, saving the result as $HUMAN. The second step then retrieves all sequences that belong to this taxonomic parent:
cla_query cla_parent=Homo_sapiens -ref=$HUMAN
hat_query cla_name=$HUMAN -ref=$HUMAN_SEQ
A similar example, retrieving all sequences from Swiss-Prot that belong to birds:
cla_query cla_parent=Aves -ref=$BIRDS
hat_query cla_name=$BIRDS seq_source=sw -ref=$BIRDSEQ
The following query searches for every organism that does not have Homo sapiens as a taxonomic parent:
cla_query not_cla_parent=Homo_sapiens -ref=$NOT_HUMAN
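As a rough mental model, cla_parent can be read as selecting a subtree of the taxonomy and not_cla_parent as selecting its complement. The Python sketch below illustrates that reading only. Whether HitKeeper includes the root itself in the result is not stated here, so including it is an assumption, as are the toy tree and the function names `subtree` and `not_subtree`.

```python
# Hypothetical child -> parent map with shortened lineages (illustration only).
PARENT = {
    "Homo_sapiens": "Homo",
    "Homo": "Mammalia",
    "Gallus_gallus": "Gallus",
    "Gallus": "Aves",
}


def all_taxa(parent):
    """Every taxon mentioned in the map, as child or as parent."""
    return set(parent) | set(parent.values())


def subtree(root, parent):
    """Taxa at or below `root` (including the root is an assumption)."""
    selected = set()
    for taxon in all_taxa(parent):
        node = taxon
        # Climb towards the top of the tree until `root` is met or the tree ends.
        while node is not None:
            if node == root:
                selected.add(taxon)
                break
            node = parent.get(node)
    return selected


def not_subtree(root, parent):
    """Complement of subtree(): every taxon that does not descend from `root`."""
    return all_taxa(parent) - subtree(root, parent)


if __name__ == "__main__":
    print(sorted(subtree("Aves", PARENT)))
    print(sorted(not_subtree("Aves", PARENT)))
```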
Caveat: Not all entries in a sequence database possess a reference to a taxonomic identifier, and some entries may possess more than one. Nevertheless, sequence databases and taxonomy data are synchronized “sooner or later”: even if the taxonomic tree changes (which happens frequently [1]), the sequence entries will nearly always reference a valid taxonomic identifier. With public databases such as Swiss-Prot, these changes take about one week to propagate.
The following constraints can be used in any combination and in any order. Please keep in mind that each constraint must be satisfied independently (they are joined with a logical AND): if any one of them is not fulfilled, the query will return an empty result.
- A non-empty list of sequence database names.
- A list of sequence entry names (given explicitly, or implicitly using query identifiers) to be included in the results.
- A list of sequence entry names to be included in the results (logical AND with the previous constraint).
- A list of sequence entry names to be excluded from the results (logical NOT to restrict the two previous constraints).
- A non-empty list of classification database names.
- A list of classification entry names (given explicitly, or implicitly using query identifiers) to be included in the results.
- A list of classification entry names to be included in the results (logical AND with the previous constraint).
- A list of classification entry names to be excluded from the results (logical NOT to restrict the two previous constraints).
- A hat list given using query identifiers to be included in the results.
- Maximum number of rows to be returned.
- A query identifier, i.e. a string that starts with "$" followed by a letter, possibly followed by more letters, digits or underscores. This is how a query can be saved to be re-used later in other operations. When supplied, this option prevents the query from being executed.
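Two of the rules above lend themselves to a short Python sketch: the syntax of a query identifier, and the way include and exclude lists combine with a logical AND. This is not HitKeeper's implementation. The names `is_query_identifier` and `apply_constraints` are hypothetical, and reading "letter" as an ASCII letter is an assumption.

```python
import re

# "$", then a letter, then any number of letters, digits or underscores.
# Restricting "letter" to ASCII is an assumption.
QUERY_ID = re.compile(r"\$[A-Za-z][A-Za-z0-9_]*")


def is_query_identifier(text):
    """True if `text` has the shape of a query identifier such as $HUMAN_SEQ."""
    return QUERY_ID.fullmatch(text) is not None


def apply_constraints(entries, include_a=None, include_b=None,
                      exclude=None, max_rows=None):
    """Keep entries that satisfy every supplied constraint (logical AND).

    A constraint left as None is treated as "not supplied" and is skipped.
    """
    result = []
    for name in entries:
        if include_a is not None and name not in include_a:
            continue  # fails the first include list
        if include_b is not None and name not in include_b:
            continue  # fails the second include list (AND with the first)
        if exclude is not None and name in exclude:
            continue  # explicitly excluded (logical NOT)
        result.append(name)
    if max_rows is not None:
        result = result[:max_rows]  # cap the number of returned rows
    return result


if __name__ == "__main__":
    print(is_query_identifier("$HUMAN_SEQ"))
    print(apply_constraints(["a", "b", "c", "d"],
                            include_a={"a", "b", "c"},
                            include_b={"b", "c", "d"},
                            exclude={"c"}))
```

Because every check must pass, a single constraint that matches nothing empties the whole result, which mirrors the behaviour described for combined constraints.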
[1] The taxonomy changes, for example, when a species is sequenced for the first time (it must be inserted), or when phylogenetic studies define a new classification of species. In addition, taxonomic identifiers may be deleted or merged.