The K-mer File Format: a standardized and compact disk representation of sets of k-mers.

Bioinformatics

Dufresne Y, Lemane T, Marijon P, Peterlongo P, Rahman A, Kokot M, Medvedev P, Deorowicz S, Chikhi R.

Sets of k-mers are widely used in DNA sequence analysis, for instance in genome assembly [e.g. SPAdes (Bankevich et al., 2012)], in indexes of sequence aligners [e.g. minimap2 (Li, 2018)] and in large-scale sequence search tools (Marchet et al., 2021). Often, bioinformatics tools are k-mer consumers, i.e. they take as input a k-mer set given by one of the k-mer producers, typically k-mer counters [e.g. KMC (Deorowicz et al., 2013), DSK (Rizk et al., 2013)]. Producers use ad hoc binary formats for storing k-mers on disk. This leads to inefficient development practices, as consumers need to write a specific parser for each producer format. Standard file formats greatly facilitate interoperability, e.g. in the case of the SAM/BAM formats (Cock et al., 2015) for sequence alignment and HDF5 (Folk et al., 2011) for general structured data.

We propose the K-mer File Format (KFF), an interoperable and efficient approach to store k-mer sets. We provide APIs in C++ and Rust, as well as file manipulation and conversion tools to facilitate inspection and integration into other tools. KFF has already been integrated in several tools: the KMC and DSK k-mer counters, the ESS-Compress (Rahman et al., 2020) compression tool and kmtricks (Lemane et al., 2022) for k-mer matrix construction. We present the rationale of our approach, the KFF 1.0 file format, and demonstrate the efficiency of KFF for storing k-mers from sequencing data.

More information at https://doi.org/10.1093/bioinformatics/btac528

Differential stress responsiveness determines intraspecies virulence heterogeneity and host adaptation in Listeria monocytogenes

Nature Microbiology Lukas Hafner, Enzo Gadin, Lei Huang, Arthur Frouin, Fabien Laporte, Charlotte Gaultier, Afonso Vieira, Claire Maudet,...

Assessing the effect of model specification and prior sensitivity on Bayesian tests of temporal signal

PLOS COMPUTATIONAL BIOLOGY John H. Tay, Arthur Kocher, Sebastian Duchene* Abstract Our understanding of the evolution of many microbes...

Expanding the diversity of origin of transfer-containing sequences in mobilizable plasmids

Nature Microbiology Manuel Ares-Arroyo, Amandine Nucci & Eduardo P. C. Rocha Abstract Conjugative plasmids are important drivers of...
