Index query¶
kmindex query¶
kmindex query allows to query all sequences in FASTA/Q file (gz/bz2) against all sub-indexes registered into a global index \(G\). For each sequence, the output is a list of either shared \(k\)-mer ratios (presence/absence mode) or abundance classes (abundance mode).
Options
kmindex query v0.6.0
DESCRIPTION
Query index.
USAGE
kmindex query -i/--index <STR> -q/--fastx <STR> [-n/--names <STR>] [-z/--zvalue <INT>]
[-r/--threshold <FLOAT>] [-o/--output <STR>] [-s/--single-query <STR>]
[-f/--format <STR>] [-b/--batch-size <INT>] [-t/--threads <INT>]
[-v/--verbose <STR>] [-a/--aggregate] [--fast] [-h/--help] [--version]
OPTIONS
[global]
-i --index - Global index path.
-n --names - Sub-indexes to query, comma separated. {all}
-z --zvalue - Index s-mers and query (s+z)-mers (findere algorithm). {0}
-r --threshold - Shared k-mers threshold. in [0.0, 1.0] {0.0}
-o --output - Output directory. {output}
-q --fastx - Input fasta/q file (supports gz/bzip2) containing the sequence(s) to query.
-s --single-query - Query identifier. All sequences are considered as a unique query.
-f --format - Output format [json|matrix|json_vec|jsonl|jsonl_vec] {json}
-b --batch-size - Size of query batches (0≈nb_seq/nb_thread). {0}
-a --aggregate - Aggregate results from batches into one file. [⚑]
--fast - Keep more pages in cache (see doc for details). [⚑]
[common]
-t --threads - Number of threads. {1}
-h --help - Show this message and exit. [⚑]
--version - Show version and exit. [⚑]
-v --verbose - Verbosity level [debug|info|warning|error]. {info}
--single-query <STR>
All sequences in an input file can be considered a unique query by using --single-query query_name. The output file contains then shared \(k\)-mer ratios between the input file and each indexed sample. This corresponds to the query behavior of kmindex-server (see server-query)
--batch-size <INT>
The number of queries in memory is actually batch-size\(\times\)threads.
Presence/Absence query¶
Main parameters
--index <STR>: Path to global index.--fastx <STR>:File containing queries.--names <[STR]>: Sub-indexes to query, comma separated.--zvalue <INT>: Usually in \([0,6]\), see findere algorithm.--threshold <FLOAT>: Report only ratios >threshold, in \([0.0,1.0]\).
Example: Querying sequence 1 and 2 from query.fasta against the presence/absence index \(D1\). The kmindex repository offers an examples directory where these commands can be tested.
- Several indexes can be queried at the same time by using
--names D1,D2,....
The results are shared ratios between queries and each sample indexed in \(D1\). See Output formats.
Output formats¶
JSONL (--format jsonl)
Abundance query¶
The query mode is automatically determined by the index type. The values reported in the output files are simply abundance classes instead of shared \(k\)-mer ratios.
About the z parameter¶
findere is a query-time technique allowing to reduce the false positive rate of an arbitrary AMQF (as Bloom filters) while reducing the number of lookups to resolve a query. Its scope concerns the indexing of objects that are decomposable into sub-objects, i.e. a sequence of characters. It is based on the following straightforward statement: if a sequence \(S \in E\), then its sub-words \(w_i \in E\).
To benefit from findere, small \(k\)-mers are actually indexed (\(s\)-mer), and (\(s+z\))-mers are considered are query time. As a result, for a given false positive rate of \(\epsilon\), the probability of a false match becomes \(\epsilon^z\) at query time, reducing the false postive rate by several orders of magnitude. Note that \(z\) cannot be arbitrary large, usually \(z \in [0, 6]\), see findere paper for more details.