top of page

Disk Compression of k-mer Sets

  • oueb70
  • Sep 7, 2020
  • 1 min read

Updated: Nov 24, 2022

WABI Rahman A, Chikhi R, Medvedev P



K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such largedatasets is a detriment to tool developers, tool users, and reproducibility efforts. General purposecompressors like gzip, or those designed for read data, are sub-optimal because they do not take intoaccount the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev,RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving stringset representation to compress a set of k-mers to disk. In this paper, we present two improvedmethods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from therepresentation of UST-Compress. We explore their behavior both theoretically and on real data.We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent,across a breadth of datasets. We also derive lower bounds on how well this type of compressionstrategy can hope to do.



Recent Posts

See All
The extended mobility of plasmids

Nucleic Acid Research Maria Pilar Garcillán-Barcia, Fernando de la Cruz, Eduardo P C Rocha Summary Plasmids play key roles in the...

 
 
 

Comments


bottom of page