A Supporting Information for “Mapping-Friendly Sequence Reductions: Going Beyond Homopolymer Compression”

A.1 “TandemTools” dataset generation

This dataset was obtained by taking a human X chromosome higher-order repeat (HOR) sequence and concatenating it 500 times, with added mutations, to obtain a sequence approximately 1 Mbp long. Then, 1200 reads were simulated from this sequence using nanosim [318] and assembled using a centromere-tailored pipeline [743]. A 10 kbp deletion was then introduced into this assembly. We refer to the resulting sequence as the “Centromeric sequence”.
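The two sequence-editing steps described above (building a mutated tandem array, then removing a 10 kbp segment) can be illustrated with a minimal sketch. It is not the code used to produce the dataset: the function names, the substitution-only mutation model, the 1% mutation rate, the random 2 kbp placeholder unit and the deletion position are all assumptions made for illustration. The read simulation and assembly steps are skipped, so the deletion is applied directly to the array rather than to an assembly.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Replace each base, with probability `rate`, by a different base.

    Substitution-only model: an assumption, the source does not say
    which kinds of mutations were added.
    """
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def build_tandem_array(unit, copies, rate, seed=0):
    """Concatenate `copies` independently mutated copies of `unit`."""
    rng = random.Random(seed)
    return "".join(mutate(unit, rate, rng) for _ in range(copies))

def delete_region(seq, start, length):
    """Remove `length` characters from `seq`, starting at index `start`."""
    return seq[:start] + seq[start + length:]

# Placeholder repeat unit: a random 2 kbp string, NOT the real HOR sequence.
unit_rng = random.Random(1)
unit = "".join(unit_rng.choice(BASES) for _ in range(2000))

array = build_tandem_array(unit, copies=500, rate=0.01, seed=42)
shortened = delete_region(array, start=300_000, length=10_000)
print(len(array), len(shortened))
```

Because only substitutions are applied, the array length is exactly 500 times the unit length, which makes the effect of the deletion easy to check.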

A.2 MSR performance comparison

A.3 Analyzing read origin on whole human genome

Figure A.1: Origin of correctly (teal) and incorrectly (red) mapped raw reads.
Distribution of the origin of correctly (teal) and incorrectly (red) mapped simulated reads across the chromosomes of the whole human genome. The dark grey rectangle on each chromosome marks its centromere. The lighter grey rectangles on chromosomes 13, 14, 15, 21 and 22 correspond to satellites denoted as “stalk”, another repetitive region.

Figure A.2: Origin of correctly (teal) and incorrectly (red) mapped reads, transformed with HPC.
Distribution of the origin of correctly (teal) and incorrectly (red) mapped simulated reads across the chromosomes of the whole human genome. The dark grey rectangle on each chromosome marks its centromere. The lighter grey rectangles on chromosomes 13, 14, 15, 21 and 22 correspond to satellites denoted as “stalk”, another repetitive region.
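For readers unfamiliar with the transformation named in this caption, homopolymer compression (HPC), as the term is commonly used, replaces every run of identical nucleotides with a single copy of that nucleotide. The sketch below is a minimal illustration under that definition; the function name is hypothetical and the code is not taken from the paper's implementation. The MSR variants shown in the following figures are defined in the main text and are not reproduced here.

```python
from itertools import groupby

def homopolymer_compress(seq):
    """Collapse each run of identical characters into one character."""
    # groupby yields one (key, run) pair per maximal run of equal characters;
    # keeping only the key drops the repeated copies.
    return "".join(base for base, _run in groupby(seq))

print(homopolymer_compress("GATTACAAA"))  # prints GATACA
```

Applying the same function to both the reads and the reference before mapping keeps the two in the same compressed coordinate space.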

Figure A.3: Origin of correctly (teal) and incorrectly (red) mapped reads, transformed with MSR~E~.
Distribution of the origin of correctly (teal) and incorrectly (red) mapped simulated reads across the chromosomes of the whole human genome. The dark grey rectangle on each chromosome marks its centromere. The lighter grey rectangles on chromosomes 13, 14, 15, 21 and 22 correspond to satellites denoted as “stalk”, another repetitive region.

Figure A.4: Origin of correctly (teal) and incorrectly (red) mapped reads, transformed with MSR~P~.
Distribution of the origin of correctly (teal) and incorrectly (red) mapped simulated reads across the chromosomes of the whole human genome. The dark grey rectangle on each chromosome marks its centromere. The lighter grey rectangles on chromosomes 13, 14, 15, 21 and 22 correspond to satellites denoted as “stalk”, another repetitive region.

Figure A.5: Origin of correctly (teal) and incorrectly (red) mapped reads, transformed with MSR~F~.
Distribution of the origin of correctly (teal) and incorrectly (red) mapped simulated reads across the chromosomes of the whole human genome. The dark grey rectangle on each chromosome marks its centromere. The lighter grey rectangles on chromosomes 13, 14, 15, 21 and 22 correspond to satellites denoted as “stalk”, another repetitive region.

A.4 Performance of MSRs on the Drosophila genome

Figure A.6: Results of the `paftools mapeval` evaluation on reads simulated and mapped to whole *Drosophila melanogaster* and *Escherichia coli* genomes.
MSRs E, F and P are shown in different shades of blue to differentiate them from other MSRs. Reads were simulated with `nanosim` and mapped with `minimap2`. The *E. coli* genome was obtained from GenBank ID [U00096.2](https://www.ncbi.nlm.nih.gov/nuccore/U00096.2).

A.5 Key resource table

References

119. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
120. Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).
318. Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience 6, (2017).
743. Bzikadze, A. V. & Pevzner, P. A. Automated assembly of centromeres from ultra-long error-prone reads. Nat Biotechnol 38, 1309–1316 (2020).