11.1 Scope

Filtering query results by taxonomic restrictions is essential for biological questions and is implemented in HitKeeper. The functionality closely resembles that of hit_query, with two exceptions: cla_parent and not_cla_parent. Some of the supported queries are:

  • Retrieve the sequences that come from a species or group of organisms.
  • Get a list of species that belong to a species or to a group of organisms.
  • Determine if a given species is a descendant of a group of organisms.
  • Get the ordered list of the ancestors of a given species (the answer to this query allows us to display a local taxonomic distribution of a set of sequences).
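
The last two query types can be pictured as walks up a taxonomic tree. The following is a minimal sketch in Python, not HitKeeper code: the child-to-parent map PARENT, its contents, and the function names ancestors and is_descendant are assumptions made for illustration only.

```python
# Toy child -> parent map. The taxa and the data layout are illustrative
# assumptions; HitKeeper's own taxonomy storage is not described here.
PARENT = {
    "Homo_sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Vertebrata",
    "Gallus_gallus": "Aves",
    "Aves": "Vertebrata",
}


def ancestors(taxon, parent=PARENT):
    """Return the ancestors of `taxon`, ordered from nearest parent to root."""
    chain = []
    while taxon in parent:
        taxon = parent[taxon]
        chain.append(taxon)
    return chain


def is_descendant(taxon, group, parent=PARENT):
    """True if `group` appears somewhere above `taxon` in the tree."""
    return group in ancestors(taxon, parent)


human_lineage = ancestors("Homo_sapiens")
chicken_is_bird = is_descendant("Gallus_gallus", "Aves")
```

Counting how often each ancestor occurs across the lineages of a set of sequences would give the kind of local taxonomic distribution mentioned above.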

 

11.2 Querying

The following query refers to all sequences that belong to the species Homo sapiens:

 

hat_query cla_name=Homo_sapiens -ref=$HUMAN

Thus, to retrieve all hits from a single organism, you can use the following construct. Note that the query identifier $HUMAN is re-used:

 

hit_query seq_name=$HUMAN -out=/path/to/outfile.txt

The following query first declares Homo_sapiens as the root and searches “downwards” from there. In the second step, all sequences that belong to this taxonomic parent are queried:

 

cla_query cla_parent=Homo_sapiens -ref=$HUMAN 
hat_query cla_name=$HUMAN -ref=$HUMAN_SEQ

A similar example, retrieving all sequences from Swiss-Prot that belong to birds:

 

cla_query cla_parent=Aves -ref=$BIRDS 
hat_query cla_name=$BIRDS seq_source=sw -ref=$BIRDSEQ
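
The two-step pattern (first resolve a taxonomic subtree, then select the sequences that reference it) can be sketched in Python as follows. This is an illustration under assumptions, not HitKeeper's implementation: the toy data, the source name "tr", the record layout, and the function names subtree and select are hypothetical, and the sketch assumes that the root taxon itself is part of the result.

```python
# Hypothetical toy data; not HitKeeper's storage format.
PARENT = {
    "Gallus_gallus": "Galliformes",
    "Galliformes": "Aves",
    "Struthio_camelus": "Aves",
    "Aves": "Vertebrata",
    "Homo_sapiens": "Mammalia",
    "Mammalia": "Vertebrata",
}

SEQUENCES = [
    {"name": "seq_a", "source": "sw", "taxa": {"Gallus_gallus"}},
    {"name": "seq_b", "source": "tr", "taxa": {"Struthio_camelus"}},
    {"name": "seq_c", "source": "sw", "taxa": {"Homo_sapiens"}},
    {"name": "seq_d", "source": "sw", "taxa": set()},  # no taxonomic reference
]


def subtree(root, parent):
    """Step 1: collect `root` and every taxon below it."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def select(sequences, taxa, source):
    """Step 2: keep sequences from `source` that reference any taxon in `taxa`."""
    return [s["name"] for s in sequences
            if s["source"] == source and s["taxa"] & taxa]


birds = subtree("Aves", PARENT)
bird_seqs = select(SEQUENCES, birds, "sw")
```

Both conditions in select must hold at once, so seq_b (bird, but from another source) and seq_c (right source, but not a bird) are both dropped.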

The following query searches for every organism that does not have Homo sapiens as a taxonomic parent:

 

cla_query not_cla_parent=Homo_sapiens -ref=$NOT_HUMAN
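
One way to read not_cla_parent is as the complement of the subtree selected by cla_parent. The Python sketch below illustrates that reading only; it assumes that "parent" means an ancestor at any depth and that the named taxon itself is excluded as well, neither of which is confirmed here. The toy tree and the function names has_ancestor and outside are hypothetical.

```python
# Hypothetical toy tree (child -> parent); not real HitKeeper data.
PARENT = {
    "Homo_sapiens_neanderthalensis": "Homo_sapiens",
    "Homo_sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan_troglodytes": "Pan",
    "Pan": "Hominidae",
}


def has_ancestor(taxon, root, parent):
    """True if `taxon` is `root` itself or lies somewhere below it."""
    while taxon != root:
        if taxon not in parent:
            return False
        taxon = parent[taxon]
    return True


def outside(root, parent):
    """All known taxa that are neither `root` nor below it."""
    taxa = set(parent) | set(parent.values())
    return {t for t in taxa if not has_ancestor(t, root, parent)}


not_human = outside("Homo_sapiens", PARENT)
```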

Caveat: Not all entries in a sequence database reference a taxonomic identifier, and some entries reference more than one. Nevertheless, sequence databases and taxonomy data are synchronized “sooner or later”: even if the taxonomic tree changes (which happens frequently¹), the sequence entries will nearly always reference a valid taxonomic identifier. For public databases such as Swiss-Prot, these changes take about one week to propagate [4].

 

11.3 Options

The following constraints can be used in any combination and order. Please keep in mind that the constraints must be satisfied independently (they are joined with a logical AND): if any one of them is not fulfilled, the query returns an empty result.

seq_source=...
A non-empty list of sequence database names.
seq_name=...
A list of sequence entry names (given explicitly, or implicitly using query identifiers) to be included in the results.
and_seq_name=...
A list of sequence entry names to be included in the results (logical AND with the previous constraint).
not_seq_name=...
A list of sequence entry names to be excluded from the results (logical NOT to restrict the two previous constraints).
cla_source=...
A non-empty list of classification database names.
cla_name=...
A list of classification entry names (given explicitly, or implicitly using query identifiers) to be included in the results.
and_cla_name=...
A list of classification entry names to be included in the results (logical AND with the previous constraint).
not_cla_name=...
A list of classification entry names to be excluded from the results (logical NOT to restrict the two previous constraints).
hat_name=...
A hat list given using query identifiers to be included in the results.
-lim=...
Maximum number of rows to be returned.
-ref=...
A query identifier, i.e. a string that starts with "$" followed by a letter, possibly followed by more letters, digits or underscores. This is how a query can be saved to be re-used later in other operations. When supplied, this option prevents the query from being executed.
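
The naming rule for query identifiers given under -ref can be checked with a short regular expression. This is a minimal sketch in Python, not part of HitKeeper; it assumes that "letters" means ASCII letters, and the function name is_query_identifier is hypothetical.

```python
import re

# "$", then one letter, then any number of letters, digits or underscores.
# Assumption: only ASCII letters count as letters.
QUERY_ID = re.compile(r"\$[A-Za-z][A-Za-z0-9_]*")


def is_query_identifier(text):
    """True if `text` as a whole follows the query identifier rule."""
    return QUERY_ID.fullmatch(text) is not None


checks = {name: is_query_identifier(name)
          for name in ("$HUMAN", "$HUMAN_SEQ", "$1ABC", "HUMAN", "$_X")}
```

Under this rule $HUMAN and $HUMAN_SEQ are accepted, while $1ABC, HUMAN and $_X are rejected.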

¹ Taxonomy changes, e.g., when a species is sequenced for the first time (it must be inserted) or when phylogenetic studies define a new classification of species. In addition, taxonomic identifiers may be deleted or merged.