Dufresne Y, Lemane T, Marijon P, Peterlongo P, Rahman A, Kokot M, Medvedev P, Deorowicz S, Chikhi R.
Sets of k-mers are widely used in DNA sequence analysis, for instance in genome assembly [e.g. SPAdes (Bankevich et al., 2012)], indexes of sequence aligners [e.g. minimap2 (Li, 2018)], large-scale sequence search tools (Marchet et al., 2021). Often, bioinformatics tools are k-mer consumers, i.e. they take as input a k-mer set given by one of the k-mer producers, typically k-mer counters [e.g. KMC (Deorowicz et al., 2013), DSK (Rizk et al., 2013)]. Producers use ad hoc binary formats for storing k-mers on disk. This leads to inefficient development practices, as consumers need to write specific parsers for each producer format. Standard file formats greatly facilitate interoperability, e.g. in the case of the SAM/BAM formats (Cock et al., 2015) for sequence alignment and HDF5 (Folk et al., 2011) for general structured data.
We propose the K-mer File Format (KFF), an interoperable and efficient approach to store k-mer sets. We provide APIs in C++ and Rust, as well as file manipulation and conversion tools to facilitate inspection and integration into other tools. KFF has already been integrated in several tools: the KMC and DSK k-mer counters, the ESS-Compress (Rahman et al., 2020) compression tool and kmtricks (Lemane et al., 2022) for k-mer matrix construction. We present the rationale of our approach, the KFF 1.0 file format, and demonstrate the efficiency of KFF for storing k-mers from sequencing data.
More information at https://doi.org/10.1093/bioinformatics/btac528