WABI Rahman A, Chikhi R, Medvedev P
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such largedatasets is a detriment to tool developers, tool users, and reproducibility efforts. General purposecompressors like gzip, or those designed for read data, are sub-optimal because they do not take intoaccount the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev,RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving stringset representation to compress a set of k-mers to disk. In this paper, we present two improvedmethods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from therepresentation of UST-Compress. We explore their behavior both theoretically and on real data.We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent,across a breadth of datasets. We also derive lower bounds on how well this type of compressionstrategy can hope to do.
More info at: https://doi.org/10.4230/LIPIcs.WABI.2020.16