# Chapter 2 Aligning Sequence Data

## 2.1 What is an alignment?

In biology, comparison is at the heart of many studies: between individuals, between species, between sequencing runs, *etc.* In order to do this at a fine-grained level and extract knowledge from it, we need to compare what is comparable; this is where sequence alignment steps in. In broad terms, during sequence alignment, we aim to find regions similar to each other in two or more sequences and group them together. When this process is done with only two sequences it is called a *pairwise alignment*. When three or more sequences are used it is called *multiple alignment*. We will first focus on pairwise alignment as it is used as the basis for the more complex multiple alignment.

### 2.1.1 Why align?

The first question we might ask ourselves is why align at all? If we want to compare two sequences there are plenty of distances and metrics out there to use. Something like the Hamming distance^{128} is very quick and easy to compute by comparing characters two by two. It is however ill-suited to our needs in biology: it can handle substitutions but indels induce very large Hamming distances. Indeed, insertions and/or deletions shift one of the sequences, compared to the other, and introduce many character-to-character differences that could be explained by a single indel.

For example, let us consider the two following sequences: `ATGTGCAGTA` and `AGTGCAGTAC`. If we count the differences character by character, all the characters except the first pair of A are different (cf. below). However, if we consider that the first T was deleted and a C was inserted at the end of the second sequence, then none of the characters are actually different. In order to represent insertions and deletions, *gaps* are inserted in the sequences as seen below:

`ATGTGCAGTA-`

`A-GTGCAGTAC`

This problem of comparing two sequences with insertions or deletions is a fairly well-studied one in text algorithmics: the string-edit problem^{129}. Some metrics like the Levenshtein distance^{130} and the edit distance^{129} exist and are closely related to the pairwise sequence alignment problem: they measure the minimal number of substitution, insertion, or deletion operations needed to transform one sequence into the other.
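To make the contrast concrete, here is a small illustrative Python sketch (not taken from any specific tool) comparing the Hamming distance with the edit distance on the two sequences above:

```python
# Sketch: Hamming vs. edit (Levenshtein) distance on the example above.
# A single deletion shifts every later character, inflating the Hamming
# distance, while the edit distance counts it as one operation.

def hamming(a: str, b: str) -> int:
    """Character-by-character differences (sequences of equal length)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a: str, b: str) -> int:
    """Minimal number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # (mis)match
        prev = curr
    return prev[-1]

s1, s2 = "ATGTGCAGTA", "AGTGCAGTAC"
print(hamming(s1, s2))        # 9: every position after the first A differs
print(edit_distance(s1, s2))  # 2: one deletion and one insertion
```

The Hamming distance of 9 versus an edit distance of 2 illustrates why character-by-character metrics are ill-suited when indels are involved.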

Sequence alignments have many downstream uses. They are the basis of comparative genomics^{131} and are used to infer evolutionary relationships; phylogenetic tree reconstruction methods usually take multiple sequence alignments as input^{132–136}. Sequence alignments have been used to study protein structure^{137,138} and function^{139,140}. They can be used to correct sequencing errors^{82,84,141} or detect structural variations in genomes^{142,143}. All this to say that they are absolutely fundamental to the field of computational biology, and errors in alignments can lead to errors somewhere down the line.

### 2.1.2 How to align two sequences?

There are two approaches for pairwise alignment^{144}: *global alignment,* where the entirety of both sequences is used, and *local alignment,* where we only seek to find regions in each sequence that are most similar to each other. Global alignment is used when the two sequences are expected to be quite similar (e.g. comparing two related proteins), whereas local alignment is mostly used when we expect the sequences to be fairly different but with highly similar regions, like genomes of two distantly related species that share a highly conserved region.

The seminal method for global pairwise alignment was the Needleman-Wunsch algorithm^{145}, based on dynamic programming. A decade later, the Smith-Waterman algorithm^{146} was developed with similar ideas to perform local alignment. Both are still used today for pairwise alignment.

Dynamic programming is often used to solve complex problems by breaking them into smaller sub-problems and solving each one optimally and separately^{147,148}; it is particularly useful when we wish to have a precise alignment between two sequences.

#### 2.1.2.1 Global alignment

The fundamental algorithm for globally aligning two sequences is the Needleman-Wunsch (NW) algorithm^{145}. The goal is to find the alignment with 1) the lowest edit distance or 2) the highest alignment score. The two formulations are equivalent, so in this section we will maximize the alignment score.

The first thing we need to know is how to compute a score on a given alignment. To do this, we assign costs to each operation. Usually matches (i.e. aligning two identical characters) are given a positive cost and mismatches or indels a negative cost. If we assign a cost of +1 to a match and a cost of -1 to mismatches and indels then the alignment presented above in Section 2.1.1 would have an alignment score of 9 - 2 = 7 *(9 matches and two indels)*.
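This scoring scheme can be written as a short illustrative Python function (the costs are the ones from the text; the function name is ours):

```python
# Minimal sketch: score a gapped pairwise alignment with the simple
# cost scheme from the text (+1 match, -1 mismatch, -1 per gap symbol).

def alignment_score(row1: str, row2: str,
                    match: int = 1, mismatch: int = -1, indel: int = -1) -> int:
    assert len(row1) == len(row2), "aligned rows must have equal length"
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# The alignment from Section 2.1.1: 9 matches and 2 indels -> 9 - 2 = 7
print(alignment_score("ATGTGCAGTA-", "A-GTGCAGTAC"))  # 7
```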

The NW algorithm is based on a simple recurrence relation: the optimal alignment score of two sequences \(S_1\) and \(S_2\) of lengths \(n\) and \(m\) respectively is the maximum of:

- The optimal alignment score of \(S_1[1,n-1]\) and \(S_2[1,m-1]\) plus the cost of a match or mismatch between the \(n^{th}\) character of \(S_1\) and the \(m^{th}\) character of \(S_2\)
- The optimal alignment score of \(S_1\) and \(S_2[1,m-1]\) plus the cost of an indel
- The optimal alignment score of \(S_1[1,n-1]\) and \(S_2\) plus the cost of an indel

This simple relation can be used to compute the optimal global alignment score for two sequences. However, if it is implemented naively it can be very inefficient: the number of scores to compute grows exponentially with sequence length, and many intermediary alignment scores are computed over and over. This is where dynamic programming comes in: these intermediary scores are pre-computed in an efficient manner, and the optimal alignment can then be deduced from them. This pre-computing step is usually represented as filling out a matrix, whose rows and columns represent the characters in each sequence to be aligned, with partial alignment scores.

If \(S_1\) represents the rows of the matrix, and \(S_2\) the columns, the value \(C(i,j)\) of a cell \((i,j)\) of this matrix represents the optimal alignment score between \(S_1[1,i]\) and \(S_2[1,j]\). In the recurrence relation described above, the alignment score was expressed as dependent on the optimal alignment scores of subsequences; when filling out the dynamic programming matrix we proceed in the inverse fashion, using the scores of short subsequences to build up the scores of progressively longer sequences.

We will now go through a short example showing how the NW algorithm is used to align two short sequences: \(S_1=\)`ACCTGA` and \(S_2=\)`ACGGA`. The first step is to set up the dynamic programming matrix: prefix each sequence with an empty character and label the rows of the matrix with one of the sequences and the columns with the other *(this extra row and column at the beginning of each sequence are indexed as row and column 0)*. In this matrix, due to the recurrence relation stated above, the score of a particular cell, \(C(i,j)\), is the maximum of:

- The score in the diagonally adjacent cell \(C(i-1,j-1)\) plus the cost of a match or mismatch between \(S_1[i]\) and \(S_2[j]\).
- The score of the cell to the left \(C(i,j-1)\) plus the cost of an indel
- The score of the cell on top \(C(i-1,j)\) plus the cost of an indel

Therefore, in order to compute \(C(i,j)\) we need to know the three values \(C(i-1,j-1)\), \(C(i-1,j)\) and \(C(i,j-1)\). This is the reason why we start with an extra column and row at the beginning of each sequence, which we can fill out with the increasing costs of indels. In our case, since the cost of an indel is -1, this row and column are filled out with decreasing negative integers, as can be seen in Figure 2.1A.

From this starting point we can fill out the dynamic programming matrix with all the alignment scores. To compute \(C(1,1)\) we have three possible values:

- \(C(0,0)\) plus the cost of a match between \(S_1[1]=A\) and \(S_2[1]=A\): \(0+1=1\)
- \(C(0,1)\) plus the cost of an indel: \(-1-1=-2\)
- \(C(1,0)\) plus the cost of an indel: \(-1-1=-2\)

By taking the maximum of these three values we can fill out the matrix cell with \(C(1,1)=1\). By continuing this process until the whole matrix is filled out, we obtain the scores visible in Figure 2.1A. This is enough if we only want to compute the optimal global alignment score between \(S_1\) and \(S_2\), contained in cell \((n,m)\). If we want to deduce the operations leading to it, and therefore the alignment itself, we need to keep track of which operation was used to obtain each score. The easiest way to do that is to also consider this matrix as a graph where each cell is a vertex. When we compute the score of cell \((i,j)\) we add an edge from this cell to the previous cell that was used to compute \(C(i,j)\). In our example above, we obtained \(C(1,1)\) from a match and \(C(0,0)\), so we can add an edge in our graph going from cell \((1,1)\) to cell \((0,0)\). The filled-out matrix, with the graph edges represented as arrows, can be seen in Figure 2.1B.
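The fill step can be sketched in Python as follows (an illustrative implementation of the recurrence with the costs used in the text; the traceback graph is omitted):

```python
# Sketch of the NW dynamic-programming fill for S1 = ACCTGA, S2 = ACGGA,
# with the costs used in the text (+1 match, -1 mismatch, -1 indel).
# C[i][j] holds the optimal score for aligning S1[1..i] with S2[1..j].

def nw_matrix(s1: str, s2: str, match=1, mismatch=-1, indel=-1):
    n, m = len(s1), len(s2)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # row/column 0: accumulating indels
        C[i][0] = i * indel
    for j in range(1, m + 1):
        C[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = C[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            C[i][j] = max(diag,
                          C[i][j - 1] + indel,   # gap in S1
                          C[i - 1][j] + indel)   # gap in S2
    return C

C = nw_matrix("ACCTGA", "ACGGA")
print(C[1][1])  # 1, as in the worked example
print(C[6][5])  # 2: the optimal global alignment score in cell (n, m)
```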

Once this matrix *(and corresponding graph)* is filled out, we can deduce the alignment by following a path through the graph starting at cell \((n,m)\) to cell \((0,0)\). A diagonal edge starting at \((i,j)\) indicates a match or mismatch between \(S_1[i]\) and \(S_2[j]\), a vertical edge indicates a gap in \(S_2\) and a horizontal edge a gap in \(S_1\). This can lead to several optimal alignments if there are several such paths in the graph. In our case, this algorithm yields three equally optimal global alignments shown in Figure 2.1C.

This algorithm, although guaranteed to result in an optimal alignment, has a time complexity of \(\mathcal{O}(nm)\) where \(n\) and \(m\) are the lengths of the sequences to align^{144}. Some methods have been proposed to speed it up^{149}; however, the complexity remains \(\mathcal{O}(nm/\log(n))\). Lower bounds have been studied, and there is not much optimization left to be done if an optimal exact alignment is needed^{150,151}. If we want to do better we have to rely on heuristics.

Another issue is space complexity: since we need to store the matrix, the space complexity is also \(\mathcal{O}(nm)\). If we wish to align two human genomes we would need to store \(\approx 10^{19}\) matrix cells, which would amount to 10 exabytes of storage *(i.e. the storage scale of a data center)* if we use 8-bit integers. However, in practice, we can do much better than that, and construct an optimal alignment in linear space complexity \(\mathcal{O}(n+m)\)^{152}, meaning we would only need a couple of gigabytes to store the matrix for two human genomes. This resulted in an improved global alignment algorithm, the Myers-Miller algorithm^{153}, implemented in the EMBOSS `stretcher` alignment software^{154}.

#### 2.1.2.2 Local alignment

In global alignment two full sequences are aligned to each other. In local alignment the goal is to find the optimal alignment of two subsequences from these parent sequences. The main algorithm for locally aligning is the Smith-Waterman (SW) algorithm^{146}, developed a decade later than NW.

The two algorithms are very similar: SW also relies on first building the dynamic programming matrix, with the same parametrizable costs for matches, mismatches and indels as NW. One key difference is that the scores in the matrix are bounded below by 0, so they cannot become negative. We also only store edges in the alignment graph if the starting cell has an alignment score \(>0\).

In this new formulation, the score in cell \(C(i,j)\) is the maximum of the following values:

- The score in the diagonally adjacent cell \(C(i-1,j-1)\) plus the cost of a match or mismatch between \(S_1[i]\) and \(S_2[j]\).
- The score of the cell to the left \(C(i,j-1)\) plus the cost of an indel.
- The score of the cell on top \(C(i-1,j)\) plus the cost of an indel.
- \(0\).

If we use the SW algorithm to locally align the two example sequences \(S_1\) and \(S_2\) and the same costs as used above, we obtain the dynamic programming matrix and graph shown in Figure 2.2.
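The modified recurrence can be sketched as follows (an illustrative, score-only implementation; the traceback is omitted):

```python
# Sketch of the SW fill: identical to NW except scores are floored at 0
# and the best local score is the maximum over the whole matrix.

def sw_best_score(s1: str, s2: str, match=1, mismatch=-1, indel=-1) -> int:
    n, m = len(s1), len(s2)
    C = [[0] * (m + 1) for _ in range(n + 1)]  # row/column 0 stay at 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = C[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            C[i][j] = max(0, diag, C[i][j - 1] + indel, C[i - 1][j] + indel)
            best = max(best, C[i][j])
    return best

print(sw_best_score("ACCTGA", "ACGGA"))  # 2, the maximal local score of Figure 2.2
```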

The traceback part to determine the optimal alignment is very similar to NW; however, instead of starting at cell \((n,m)\), we start at the cell with the maximal alignment score and follow the path back until we arrive at a cell with an alignment score of 0. In the example shown in Figure 2.2, two cells contain the maximal alignment score of 2. Tracing back from these cells gives two optimal local alignments between \(S_1\) and \(S_2\): `AC` to `AC` and `GA` to `GA`.

Since the SW algorithm is so similar to NW, it has the same quadratic time and space complexity. However, the same optimizations can be used to bring it down to a linear space complexity^{144}. These optimizations resulted in the Huang and Miller algorithm^{155}, implemented in the EMBOSS `Lalign` tool^{154}, and the Waterman-Eggert algorithm^{156}.

Both the NW and the SW algorithms are implemented in many different software tools and are used widely to perform pairwise alignments of short sequences^{154,157,158}. Some versions even benefit from hardware acceleration, with implementations for specific CPU instruction sets^{159} or GPUs^{160} that substantially speed up alignment.

### 2.1.3 Scoring and substitution models

In the examples used above to present the NW and SW algorithms, we used a very simple cost function: a match has a cost of +1 while mismatches and indels have a cost of -1. This is the simplest cost function we can use, but also the crudest. In many cases it may be interesting to infuse this cost function with biological knowledge. For example, some substitutions occur more rarely than others in nature, so it stands to reason to penalize those more than other, more common, substitutions.

These biology-aware cost functions usually take the form of a matrix, called a *scoring* or *substitution matrix*, often corresponding to an underlying evolutionary model. When using these matrices, matches and mismatches between specific pairs of characters are given specific scores: for example the cost of aligning an A and a G might be lower than aligning that same A with a T. A lot of different substitution matrices have been developed, especially for protein alignments^{161}, with different techniques, underlying models and intended applications.

The earliest and simplest substitution matrices are match/mismatch matrices. They are effectively what we used above, where all matches are given a fixed positive score and all mismatches a fixed negative score. In our examples above the corresponding substitution matrix would be a four-by-four matrix with +1 on the diagonal, indicating matches, and -1 everywhere else. These are simple and useful, but when dealing with proteins, they have a severe limitation as they ignore the biology of amino acids.

In order to reflect this biological reality of proteins, new substitution matrices were developed using Log-odds models, based on the fact that substitutions between amino acids are not equiprobable: some mutations between related amino acids *(e.g. I and L)* are much more common than others. Two of the most widely used families of substitution matrices, the PAM and BLOSUM matrices, were built this way. The score for aligning residue \(i\) with residue \(j\) is given by the matrix entry \(S_{i,j}\), computed from the background frequencies *(i.e. how often one expects to see a particular residue in a sequence)* of \(i\) and \(j\), denoted \(p_i\) and \(p_j\) respectively, and the frequency \(q_{ij}\) with which \(i\) and \(j\) are aligned in accurate biological alignments. With these values we can compute the substitution score \(S_{i,j}\) as a Log-odds ratio^{161}:

\[ S_{i,j}=\log\bigg(\frac{q_{ij}}{p_ip_j}\bigg) \]

This Log-odds formulation yields values with nice properties for sequence alignment. \(q_{ij}\) can be thought of as the probability of the alignment between amino acids \(i\) and \(j\) resulting from a substitution, and \(p_ip_j\) is the probability under the null hypothesis that both of these amino acids were aligned randomly. Therefore the log of the ratio is negative when the random alignment is more frequent (meaning the substitution is unlikely), and positive when the substitution is likely. Both \(p_i\) and \(p_j\) are easy to compute from available biological sequence data, the real work in developing a Log-odds based substitution matrix is to estimate \(q_{ij}\) values, and that is often done using biologically accurate protein sequence alignments.
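A toy numerical illustration of this formula (the frequencies below are invented for the example, not taken from any real PAM or BLOSUM data):

```python
# Toy illustration of the log-odds score S_ij = log(q_ij / (p_i * p_j)).
# All frequencies here are made up for the example.
from math import log

p_i, p_j = 0.05, 0.05   # background frequencies of residues i and j
q_common = 0.01         # i/j aligned more often than chance -> likely substitution
q_rare = 0.001          # i/j aligned less often than chance -> unlikely

s_common = log(q_common / (p_i * p_j))
s_rare = log(q_rare / (p_i * p_j))

print(round(s_common, 2))  # 1.39: positive, substitution more frequent than random
print(round(s_rare, 2))    # -0.92: negative, random pairing more frequent
```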

The PAM matrix, developed in 1978^{162}, is one such matrix. A *point accepted mutation* (PAM) is defined as the substitution of one amino acid by another that is accepted by natural selection *(i.e. visible along the branch of a phylogenetic tree)*. Dayhoff *et al.* also defined the PAM as an evolutionary distance: two sequences at a distance of one PAM are expected to have one amino acid substitution per one hundred residues, which is equivalent to expecting a substitution at 1% of positions. To develop their matrix, Dayhoff *et al.* used phylogenetic trees built on 71 families of closely related proteins and counted the PAMs that appeared in these trees. This resulted in a matrix \(A\) where \(A_{ij}=A_{ji}=\) the number of times a substitution between amino acids \(i\) and \(j\) was observed in the trees. By using trees built on closely related sequences, Dayhoff *et al.* could be fairly certain that the observed substitutions were the result of a single mutation and not of many subsequent mutations over long evolutionary times. From this matrix \(A\), Dayhoff *et al.* constructed the mutation probability matrix \(M_1\), where entry \(M_{1,ij}\) represents the probability of amino acid \(j\) being replaced by amino acid \(i\) after an interval of 1 PAM. Off-diagonal entries of this matrix are computed as follows:

\[ M_{1,ij} = \lambda\, m_j \frac{A_{ij}}{\sum_{i} A_{ij}} \]

where \(m_j\) is the observed mutability of amino acid \(j\), and \(\lambda\) is a constant factor used to tune the matrix so that it reflects mutation rates corresponding to 1 PAM, i.e. so that the expected proportion of unchanged positions, \(\sum_j p_j M_{1,jj}\), equals 0.99 *(with the diagonal entries set to \(M_{1,jj}=1-\lambda m_j\))*. By assuming that evolution follows a Markov process, it is simple to derive the mutation matrices for sequences separated by greater evolutionary distances: the \(M_n\) matrix, corresponding to a distance of \(n\) PAMs, is equal to \(M_1^n\). Finally the \(q_{ij}\) values can be derived as \(q_{ij}=p_j M_{n,ij}\). By choosing different values of \(n\) for the mutation matrix we can estimate scoring matrices for sequences at varying evolutionary distances from one another. The correspondence between PAMs and the observed proportion of different residues is not one to one: a distance of 250 PAMs corresponds to only about 20% identical residues, whereas a distance of 180 PAMs corresponds to around 27% identical residues^{161,162}. Therefore the PAM_{250} matrix, derived from \(M_{250}\), is suited to aligning more distantly related proteins than the PAM_{180}, for example. By changing the mathematical model underlying the estimate of mutation probabilities, PAM-like matrices^{163} were later developed based on the same principles.
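The extrapolation \(M_n = M_1^n\) can be illustrated on a toy two-letter alphabet (the matrix below is invented for the example; real PAM matrices are 20×20):

```python
# Toy illustration of extrapolating a 1-PAM mutation matrix by matrix
# powers (M_n = M_1^n). Columns sum to 1; off-diagonal mass grows with n.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(M, n):
    R = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 identity
    for _ in range(n):
        R = matmul(R, M)
    return R

M1 = [[0.99, 0.01],   # 99% of positions unchanged after 1 PAM
      [0.01, 0.99]]

M250 = matpow(M1, 250)
print(M1[0][0])              # 0.99
print(round(M250[0][0], 2))  # about 0.5: distant sequences look nearly random
```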

The other main type of substitution matrix is the BLOSUM matrix (BLOcks SUbstitution Matrix), developed in 1992^{164}. Instead of using whole, closely related protein sequences like the PAM matrices, the values of \(q_{ij}\) were estimated on highly conserved segments, called *blocks*, across whole protein families. The \(q_{ij}\) values are estimated as the number of times amino acids \(i\) and \(j\) are aligned divided by the total number of aligned amino acid pairs; \(q_{ij}\) is therefore the observed frequency of the aligned pair of amino acids \(i\) and \(j\) in all the conserved blocks. As with PAM matrices, several BLOSUM matrices were constructed, designed for aligning sequences at different evolutionary distances. The BLOSUM62 matrix was estimated on blocks of aligned sequences that are at most 62% identical, BLOSUM80 on sequences that are at most 80% identical. Therefore, inversely to the PAM matrices, the higher the number of the BLOSUM matrix, the more suited it is to aligning closely related sequences.

PAM and BLOSUM matrices have fairly broad use-cases and are widely used in alignment. However, many other protein substitution models exist. Instead of using log-odds, some substitution models were developed by estimating scores with maximum-likelihood approaches^{165,166}. Some matrices were developed with very specific usage conditions in mind, tailored to specific types of proteins like transmembrane^{167,168}, disordered^{169} or polar/non-polar^{170} proteins. Some matrices were developed to align sequences from specific organisms like *P. falciparum*^{171} *(responsible for malaria)* or HIV^{172}. A substitution matrix was even developed in 2005 specifically for global rather than local alignment^{173}.

This wealth of protein substitution matrices reflects the biological and evolutionary diversity of proteins; substitution matrices for aligning DNA sequences, however, are much less common. Some work has been done to derive matrices similar to PAM matrices from DNA alignments^{174}. Codon substitution matrices^{175,176} have been developed as well; although they are used in DNA sequence alignment, they ultimately rely on knowledge derived from protein alignments.

### 2.1.4 Dealing with gaps

In the NW and SW examples of Section 2.1.2, just as with the simplistic match/mismatch costs, we used a very simple cost for insertions and deletions: any indel has a cost of -1. As was the case with substitutions, this does not reflect the biological reality very well.

In biology, when insertions or deletions occur it is more likely that the indel will span several nucleotides rather than just one^{177}. This means that longer gap stretches are more likely than many individual gaps. For example, the two alignments below have the same number of matches, mismatches, and gaps. The second one is more likely since it is the result of a single insertion (or deletion) of `AGGT` rather than multiple independent indels.

`AGGAGGTTCG`
`A-G-G-T-CC`

`AGGAGGTTCG`
`AGG----TCC`

The first approach to take this into account was to try and optimize the gaps more generally^{178} over the whole aligned sequence. However, even with dynamic programming, this has at best a time complexity of \(\mathcal{O}(n^2m)\)^{179}. In 1982, Gotoh proposed affine gap costs^{180}. With this model there are two separate costs associated with indels: 1) the gap open cost and 2) the gap extend cost. Usually the costs are set so that opening a new gap is more costly than extending an existing one, meaning that longer gap stretches are favored over many short indels. The other major advantage is that with Gotoh’s algorithm the time complexity is back down to \(\mathcal{O}(nm)\). The algorithm was further refined by Altschul *et al.*^{181}.
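Gotoh's recurrences can be sketched as follows (an illustrative, score-only implementation; the cost values and the convention that a gap of length \(L\) costs \(\text{open} + L\cdot\text{extend}\) are our choices for the example):

```python
# Sketch of Gotoh's affine-gap global alignment score (O(nm) time).
# Three DP layers: M ends in a (mis)match, X in a gap in s2 (consuming
# s1), Y in a gap in s1 (consuming s2).
NEG = float("-inf")

def gotoh_score(s1, s2, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """A gap of length L costs gap_open + L * gap_extend."""
    n, m = len(s1), len(s2)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open + gap_extend,  # open a gap
                          X[i-1][j] + gap_extend)             # extend it
            Y[i][j] = max(M[i][j-1] + gap_open + gap_extend,
                          Y[i][j-1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# One contiguous 4-symbol gap (opened once) now beats four isolated gaps:
print(gotoh_score("AGGAGGTTCG", "AGGTCC"))  # -2
```

With these costs, the single grouped gap of the example above costs \(-2-4=-6\), whereas four isolated one-symbol gaps would cost \(4\times(-3)=-12\), so the grouped-gap alignment is preferred.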

Over the years different types of gap costs were developed and tested, like the logarithmic gap costs proposed by Waterman^{182} and improved by Miller and Myers^{183}, which turned out to be less accurate than affine gap costs^{184}. A bi-linear gap cost was also proposed to replace the affine cost^{185}, with a breakpoint at gaps of length three, the size of a codon. As more and more sequence data became available, similarly to what happened with substitution matrices, empirical profile-based models derived from this data were developed^{186}. Some of these penalties leverage structural information and context for proteins^{187,188}. A context-dependent gap penalty depending on the hydrophobicity of aligned residues is implemented in `Clustal X`^{189}, one of the most widely used sequence aligners. Although quite complex and empirically derived, these profile-based penalties show limited improvement over the affine and bi-linear penalties^{190}.

More recently, methodological and algorithmic developments have resulted in the WaveFront algorithm (WFA) for pairwise alignment^{191}. This algorithm computes a NW alignment with affine gap costs with a much lower time complexity of \(\mathcal{O}(ns)\), where \(s\) is the alignment score, reducing the quadratic relationship to sequence length to a linear one. This algorithm is also easily vectorizable and can take advantage of hardware acceleration, making its implementation run between 10 and 300 times faster than alternative methods depending on the testing context^{191}.

## 2.2 How to speed up pairwise alignment?

The NW and SW algorithms, as well as their improvements, are proven to be optimal^{192}. However, when dealing with large sequences, which are more and more common, or when having to perform many pairwise alignments, they become limiting due to their time and space complexity. In many cases, to get around these limitations, optimality is set aside in favor of heuristics and approximate methods that speed up alignment.

### 2.2.1 Changing the method

One of the early approaches to speed up alignment was to focus on speeding up the dynamic programming, which is the time- and space-consuming step of the NW and SW algorithms. Bounded dynamic programming^{193}, also called banded dynamic programming in some works, is one such approach. By assuming that the majority of alignment operations are matches and mismatches rather than indels, we can make a hypothesis about the alignment graph: most probably, the path in the graph corresponding to the optimal alignment will stay close to the diagonal of the dynamic programming matrix, so scores far from the diagonal are probably not needed. Under these assumptions many of the scores in the matrix do not need to be computed, speeding up execution and leading to a sparse dynamic programming matrix (shown in Figure 2.3). This approach was used to speed up alignment as early as 1984^{194}. The advantage of this method is that the optimal alignment can often be found very efficiently; however, if there are many indels in the optimal alignment, the algorithm is not guaranteed to run faster than NW.
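An illustrative sketch of a banded NW fill (the band width and costs below are arbitrary choices for the example):

```python
# Sketch of banded (bounded) NW: only cells within `band` of the diagonal
# are filled, so work drops from O(nm) toward O(n * band). The result is
# only guaranteed optimal if the optimal path stays inside the band.

def banded_nw_score(s1, s2, band=2, match=1, mismatch=-1, indel=-1):
    n, m = len(s1), len(s2)
    NEG = float("-inf")                   # marks cells outside the band
    C = [[NEG] * (m + 1) for _ in range(n + 1)]
    C[0][0] = 0
    for j in range(1, min(m, band) + 1):  # banded portion of row 0
        C[0][j] = j * indel
    for i in range(1, n + 1):
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:                       # banded portion of column 0
            C[i][0] = i * indel
        for j in range(max(1, lo), hi + 1):
            diag = C[i-1][j-1] + (match if s1[i-1] == s2[j-1] else mismatch)
            C[i][j] = max(diag, C[i][j-1] + indel, C[i-1][j] + indel)
    return C[n][m]

print(banded_nw_score("ACCTGA", "ACGGA", band=2))  # 2, same as the full NW fill
```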

More “exotic” methods have also been used successfully for sequence alignment. Fast Fourier Transforms (FFT) are used in the `MAFFT` aligner^{196} in order to quickly find homologous segments between two sequences. These homologous regions can then be used as the basis for an alignment. `MAFFT`, primarily a multiple sequence aligner *(cf. Section 2.4 below)*, can also be used for pairwise alignment.

### 2.2.2 Seed and extend with data structures

In parallel to the development of new alignment algorithms, another way of substantially speeding up pairwise alignment is the so-called “seed and extend” method. This is based on the observation that a pairwise alignment most likely has several short subsequences that are almost identical in both sequences to align. These homologous subsequences, *seeds*, can be used to initialize an alignment that can be extended in both directions with dynamic programming until we have a suitable alignment.

This method can be used for 1) local alignment, where seeds indicate possible local matches which can be extended in local alignments; or 2) for global alignment where the seeds anchor the dynamic programming matrix, limiting the number of cells to fill out as shown in Figure 2.4. In both cases this approach follows the divide and conquer philosophy and extending seeds or filling out the dynamic programming matrix between anchors can be done independently and in parallel.

This type of approach can also be used for many-to-one local alignments: either trying to find homologies between a query sequence and a database of sequences, or to find several local alignments in a large reference sequence like in read-mapping (see Section 2.3.1). In these many-to-one scenarios it is useful to index seeds in data structures that allow rapid querying and compact storage. This general framework has proven to be quite flexible with many different ways to pick seeds^{197} and many different data structures to index them^{198}.

#### 2.2.2.1 \(k\)-mers and hash tables

##### 2.2.2.1.1 The BLAST algorithm

One of the early methods for very quick heuristic alignment is the Basic Local Alignment Search Tool: `BLAST`^{199}. It is widely used to this day to find homologous sequences in large databases and, as such, is one of the most cited papers of all time with over 100,000 citations. It is available as a web service hosted by the NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Over the years, many different versions for different use cases have been developed, like `BLASTP` for protein sequences or `BLASTN` and `MEGABLAST` for nucleic acid sequences.

In our description of the `BLAST` algorithm, we will have a *target* and a *query* sequence that we wish to align.

- For both sequences we build a hash table that uses subsequences of length \(k\), called *\(k\)-mers*, as keys and their positions in the whole sequence as values.
- The hash tables are then scanned to check for exact matches between \(k\)-mers in the target and query sequences, called *hits*.
- The positions of the hits in the target and query sequences are used to seed a candidate local alignment.
- The candidate local alignments are extended in both directions from the seed with the SW algorithm. If the alignment score falls below a specified threshold, the extension stops and the candidate is discarded.

By selecting the right size \(k\) of the seeds (by default 11 when aligning nucleotides, 3 when aligning amino acids) as well as the alignment score threshold, one can adjust the sensitivity of the method at the cost of runtime.
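The indexing and hit-finding steps described above can be sketched as follows (a toy example: \(k=4\) and invented sequences rather than `BLAST`'s defaults; the SW extension step is omitted):

```python
# Sketch of the seeding stage: index target k-mers in a hash table,
# then scan the query for exact hits. Toy k = 4 rather than BLAST's
# default of 11 for nucleotides.
from collections import defaultdict

def kmer_index(seq: str, k: int) -> dict:
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)    # k-mer -> positions in seq
    return index

def find_hits(query: str, target: str, k: int):
    index = kmer_index(target, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))          # (query pos, target pos) seed
    return hits

target = "TTACGTGCAGTAGGC"
query = "ACGTGCAG"
print(find_hits(query, target, k=4))     # consecutive hits on one diagonal
```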

It might not seem very useful to pre-compute the target hash table for a single target. However, in practice `BLAST` is used to find local alignments between a query sequence and a very large number of target sequences; databases hosted by the NCBI contain hundreds of millions of target sequences (https://ftp.ncbi.nlm.nih.gov/blast/db/), and at these scales pre-computing the target database saves an enormous amount of time.

Over time, several improvements have been developed for `BLAST`. `PSI-BLAST`^{200} iteratively refines the alignments. Gapped `BLAST`^{200} and `BLASTZ`^{201} use spaced seeds, introduced in the PatternHunter method^{202}, corresponding to seeds where not all characters need to match, increasing sensitivity. By sorting the target sequences it is possible to stop earlier and gain some speed as well^{203}. The `Diamond` aligner^{204} increases alignment speed by using double indexing, thus leveraging the CPU cache and reducing time spent waiting for memory or disk access, improving alignment speed up to 360-fold over `BLAST` in later versions^{205}.

`FASTA`^{206}, an improvement on `FASTP`^{207}, is another method for local alignment, similar to `BLAST`. \(k\)-mers for the target and query sequences are indexed in a hash table and hits are found between the two sequences. The \(k\)-mers used in the `FASTA` tool are usually shorter than for `BLAST`, so instead of initializing an alignment at a single hit, `FASTA` identifies regions in both sequences that have a high density of hits, keeping the best 10. These regions are then scored using matrices discussed in Section 2.1.3 and high-scoring regions are combined to build an approximate alignment. An optimal version of this alignment is then computed using the SW algorithm with banded dynamic programming.

Both `FASTA` and `BLAST` are very fast: it only takes a couple of seconds to find approximate local alignments between a hundred query sequences^{208} and a database of over eighty million target sequences^{209}. Attempting this task with the standard SW or NW algorithms would be much slower^{210} but would yield more sensitive, optimal alignments^{211}.

##### 2.2.2.1.2 Other algorithms

One of the problems with such an approach is the size of the index: storing all the \(k\)-mers of a length-\(n\) sequence would require up to \((n-k+1)\cdot k\) characters as hash table keys, if all \(k\)-mers are distinct. This space requirement is acceptable for very large scale homology search tools on hosted web services such as NCBI BLAST, but on a personal computer it can easily exceed memory capacity. Storing the hash table on disk has drastic consequences on query times, therefore methods to reduce the storage needs of these data structures were developed.

One way to make everything fit in memory is to not store all \(k\)-mers, for instance through the use of so-called *minimizers*, introduced independently in 2003^{212} and 2004^{213}. Given a window of \(w\) consecutive \(k\)-mers and an ordering, a \((w,k)\) minimizer is the “smallest” \(k\)-mer in the window w.r.t. the chosen ordering. Let us consider the following window of \(3\)-mers with \(w=4\): `TGACAT`, yielding the following \(3\)-mers: `TGA`, `GAC`, `ACA`, `CAT`. Following a simple ordering, such as the lexicographical ordering (i.e. alphabetical order), the “smallest” \(3\)-mer, and our \((4,3)\) minimizer, would be `ACA`, and only this one would be sampled and added to our hash table. Minimizers have interesting properties: adjacent windows often share a minimizer (see Figure 2.5), and if two strings share a substring of length \(w+k-1\) then they are guaranteed to share a \((w,k)\) minimizer^{213}. These properties make minimizers very useful for the seed-and-extend alignment strategy, and they are used in several aligners such as `minimap`^{214} and `minimap2`^{119}, `mashmap2`^{215} and `winnowmap`^{120}.
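Minimizer sampling is short enough to sketch directly; the example from the text can serve as a check. This is a naive illustration (the function name `minimizers` is hypothetical, and production tools avoid recomputing the window minimum from scratch):

```python
def minimizers(seq, w, k):
    """Sample the (w,k)-minimizers of seq under the lexicographic ordering."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sampled = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        sampled.add(min(window))  # "smallest" k-mer w.r.t. the ordering
    return sampled
```

On the window `TGACAT` with \(w=4\) and \(k=3\) this samples exactly `ACA`, as in the worked example above.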

While the lexicographical ordering is easy to conceptualize, and is the one initially proposed by Roberts *et al.*, it has an undesirable characteristic: it tends to select low-complexity \(k\)-mers with repeated `A` at the beginning. As discussed in Section 1.4.2, repeated stretches of nucleotides are prone to sequencing errors and as such are not ideal for seeding alignments. Furthermore, when the window shifts, \(k\)-mers at the beginning of successive windows are likely to be selected as minimizers without being shared between windows, meaning that we sample more \(k\)-mers than needed. Roberts *et al.* proposed an alternative ordering based on nucleotide frequencies^{213}; however, this is also not ideal. Different orderings have been studied, and those based on universal hitting sets^{216} or random orderings (such as the ones defined by a hash function) have more desirable properties than the lexicographical ordering^{217}. A minimizer ordering based on the frequency of appearance of \(k\)-mers has also been shown to provide well-balanced partitioning of \(k\)-mer sets^{218}.
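Swapping the ordering is a one-line change: instead of comparing \(k\)-mers alphabetically, rank them by a hash of their contents, which behaves like a random ordering and removes the bias towards A-rich \(k\)-mers. A minimal sketch (hypothetical function names; real tools use fast non-cryptographic hashes rather than SHA-256):

```python
import hashlib

KMERS = ["TGA", "GAC", "ACA", "CAT"]  # the window from the example above

def lex_minimizer(kmers):
    """Lexicographic ordering: biased towards A-rich k-mers."""
    return min(kmers)

def hashed_minimizer(kmers):
    """Pseudo-random ordering: rank k-mers by a hash digest of their bytes."""
    return min(kmers, key=lambda s: hashlib.sha256(s.encode()).digest())
```

Both orderings are deterministic, so two sequences sharing a window still share a minimizer; only the sampling bias changes.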

Over the years more strategies have been developed to sample \(k\)-mers and reduce the data structure size for efficient sequence alignment, such as syncmers^{219}, strobemers^{220} or a combination of both^{221}. These novel seed sampling strategies allow for sparser seed sampling, smaller data structures and therefore faster alignment software.

#### 2.2.2.2 Exact matches and suffix trees

While \(k\)-mer seeds have shown success, they are not the only way to implement a seed and extend alignment method. Another way to seed alignments is through maximal exact matches (MEMs): exact matches between two sequences that cannot be extended in either direction. MEMs can be found with data structures like suffix trees^{222}, suffix arrays^{223,224} or FM-indices^{225}.

Suffix trees have long been used for pattern matching applications^{129}; the `AVID` aligner^{226} uses them to find maximal exact matches between two sequences to anchor a global alignment. `MUMmer2`^{227} uses suffix trees to find maximal unique matches (MUMs) to anchor alignments.

Suffix trees, although very useful, have quadratic space complexity w.r.t. the length of the indexed sequence^{129}. This is fine for small bacterial or viral genomes; however, in the age of whole-genome sequencing and the human genome project, it is inadequate. Therefore some aligners have switched data structures to use suffix arrays. In fact, it is possible to replace suffix trees with these more space-efficient suffix arrays in any algorithm^{228}. Newer versions of `MUMmer`^{229} have made this choice and now use suffix arrays for improved performance. It is important to note that compact versions of suffix trees have been created that are linear in space w.r.t. sequence length^{230}; however, in practice suffix arrays take up less space for comparable query times^{223}.
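The suffix array itself is conceptually simple: the start positions of all suffixes, in sorted order, which a binary search can then exploit to locate exact matches. A naive sketch (hypothetical names; this construction is \(O(n^2\log n)\), whereas real tools use linear-time algorithms):

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Locate one occurrence of pattern via binary search over the sorted suffixes."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        return sa[lo]  # start position of a match
    return -1
```

Each lookup costs \(O(|pattern|\cdot\log n)\) comparisons while the index itself is just an array of \(n\) integers, which is the space saving over suffix trees.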

Finally, another widely used data structure is the so-called FM-index, proposed in 2000^{225} and based on the Burrows-Wheeler transform^{231}. The FM-index is very memory efficient^{232}, but this comes at the cost of some efficiency in index lookup operations, although some work has been done to improve this^{233}. As such, FM-indices have been used in many aligners such as `BWT-SW`^{234}, `BWA`^{235} and `BWA-SW`^{236}, `BWA-MEM`^{237}, `CUSHAW`^{238} or `bowtie2`^{239}.
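The Burrows-Wheeler transform at the heart of the FM-index can be illustrated with the classic sorted-rotations construction. This sketch shows only the transform itself, not the rank structures an FM-index builds on top of it, and the function name is hypothetical:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations of text+sentinel."""
    s = text + sentinel  # sentinel marks the end and sorts before all other characters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

The transform groups identical characters from similar contexts together (e.g. `bwt("BANANA")` clusters the `A`s and `N`s), which is what makes the result both compressible and searchable.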

The seed and extend paradigm has been very useful in the field of genomics to deal with the scale of data and keep up with sequencing technologies. Some newer alignment algorithms, like the WFA algorithm mentioned above, have even been used in such a context^{240}.

Finally, some methodological developments have aimed at improving alignment sensitivity instead of speed. One family of methods, fairly well studied both in general and in the context of alignment, is hidden Markov models (HMMs). In certain circumstances PairHMMs, HMMs used for pairwise alignment, can be mathematically equivalent to NW^{241}. HMMs have been used for sequence alignment in many software tools like `HHsearch`^{242}, `HMMer`^{243}, which is used to efficiently search for alignments in large databases of sequences, or `MCALIGN2`^{244}.

## 2.3 The specificities of read-mapping

Read-mapping is a special case of pairwise alignment and the focus of Chapter 3, so it stands to reason that we use this section to explain the stakes and challenges of the read-mapping task.

### 2.3.1 What is read-mapping ?

*Read-mapping*, or sometimes read-alignment, is the process of comparing a sequencing read to a reference sequence and finding the region in the reference homologous to the read. Sometimes mappers only output the position where this region starts in the reference, but more often than not they output local or semi-global alignments between the reads and the reference. In *semi-global alignment*, two sequences are globally aligned but indels at the beginning and end of each sequence are not penalized; this can be useful to detect overlaps between two sequences, or to align two sequences of very different sizes.
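Semi-global alignment differs from NW only in its boundary conditions: the first row and column of the dynamic programming matrix start at zero (free leading gaps), and the result is taken over the last row and column (free trailing gaps). A score-only sketch with a hypothetical function name and toy scoring values:

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best semi-global alignment score: gaps at either end of a or b are free."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)       # first row at 0: leading gaps are not penalized
    best = 0                   # best score seen in the last column
    for i in range(1, n + 1):
        cur = [0] * (m + 1)    # first column at 0: leading gaps are not penalized
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        best = max(best, cur[m])
        prev = cur
    return max(best, max(prev))  # trailing gaps free: best over last row and column
```

With these parameters the overlapping ends of two sequences dangle for free, so the score of `AAATTT` against `TTTGGG` is 3: the three matched `T`s, with no penalty for the unaligned ends.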

Read-mapping is the first step of many bioinformatics analysis pipelines, and as such is often crucial. It is therefore no surprise that this is a very active field, with many reviews^{245–249} and benchmarking procedures^{250} to compare tools.

From a technical and algorithmic standpoint, the task of mapping many sequencing reads to a single reference lends itself very well to the “divide and conquer” approach presented in Section 2.2.2. Indexing the reference beforehand and using this index as a database to align against can lead to substantial execution speed gains. As a matter of fact, many of the aligners presented in Section 2.2.2 are actually read-mappers that can also do pairwise alignment. As such, most implement the seed-and-extend paradigm with hash tables like `minimap2`^{119}; FM-indices like `BWT-SW`^{234}, `bowtie2`^{239}, `BWA`^{235}, `BWA-SW`^{236}, `BWA-MEM`^{237} and `CUSHAW`^{238}; or even other divide-and-conquer approaches like `Kart`^{251}. As sequencing technologies yield longer and more numerous reads, these heuristics become ever more important if we wish to be able to analyze this data. The computational load can also be partly mitigated through hardware acceleration^{252–255}.

### 2.3.2 Challenges of read-mapping

Read-mapping, as one might expect, is no easy task. The length of recent sequencing reads and their numbers are of course challenging, but algorithmic tricks described above can help. There are other aspects of sequencing data that make read-mapping as hard as it is.

Sequencing technologies, although they have improved over time, can still make errors, and these errors lower the homology between reads and reference, making mapping harder^{129}. This is particularly true of long reads, where the error rate is higher. To mitigate this, some long-read mappers take these errors into account when aligning a read to the reference. Some mappers are tied to a specific sequencing technology, like `BLASR`^{256} or `lordFAST`^{257} for PacBio reads, and `GraphMap`^{258} for ONT. Others, like `NGMLR`^{259}, `MashMap`^{260} or `DuploMap`^{261}, are technology agnostic and can work with any type of long read. This might not be needed forever, though, as sequencing accuracy grows with every new generation of sequencers. Since homopolymer-linked indels are still common in long-read sequencing (c.f. Section 1.4.2), many modern read-mappers designed to work with long reads include an option to use homopolymer compression (c.f. Section 1.4.3.2).
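Homopolymer compression itself is a simple transformation: every run of identical bases collapses to a single base, so homopolymer-length errors vanish before seeding. A one-liner sketch (the function name `hpc` is hypothetical):

```python
from itertools import groupby

def hpc(seq):
    """Homopolymer-compress a sequence: collapse runs of identical bases to one base."""
    return "".join(base for base, _ in groupby(seq))
```

Mappers that use this typically seed on the compressed sequences and convert coordinates back to the original ones afterwards.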

While the technology producing reads can complicate the read-mapping task, some regions of the genome are intrinsically harder to map to. This is particularly true of repetitive regions like telomeres or centromeres^{249}. Repetitive regions mean many potential homologous regions between a read and the reference, producing a lot of seed hits, increasing the runtime of aligners and lowering the overall confidence in read placement. Some tools have been developed specifically to deal with such regions. `Winnowmap`^{120} and `Winnowmap2`^{262} assign a weight to \(k\)-mers that might be sampled as minimizers; by under-weighting frequently appearing \(k\)-mers they improve performance in repetitive regions. `TandemMapper`^{263} was designed to map long reads to the extra-long tandem repeats (ETRs) present in centromeric regions. It does not use minimizers; like `Winnowmap`, it selects less frequent \(k\)-mers as potential seeds to deal with the repetitiveness and improve mapping accuracy. Long reads are also much easier to map to repetitive regions since they can span over them, or overlap with more complex regions^{48,56}.

Some challenges, however, are linked to implementation rather than sequencing data. Efforts have been made to provide quality scores to mappings in order to easily assess their quality and therefore their usefulness. This score, called *mapping quality*, is defined as \(-10\log_{10}(p)\), usually rounded to the nearest integer, where \(p\) is the probability that the read is mismapped. It was introduced in the `MAQ` software^{264} but has been implemented in many read-mappers like `BWA`^{235}, `bowtie2`^{239} or `minimap2`^{119} since it was added as part of the widely used `SAM` file format specification^{265}.
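The Phred-style conversion from mismap probability to mapping quality is a one-line formula; a small sketch makes the scaling concrete. The function name and the cap of 60 are illustrative assumptions (each mapper caps its qualities differently, as discussed below):

```python
import math

def mapping_quality(p_mismap, cap=60):
    """Phred-scale a mismap probability: -10*log10(p), rounded and capped."""
    if p_mismap <= 0:
        return cap  # a probability of zero would give an infinite quality
    return min(cap, round(-10 * math.log10(p_mismap)))
```

A 1% mismap probability thus yields quality 20, and a coin-flip placement (\(p=0.5\)) yields quality 3, which shows how quickly the score penalizes ambiguous placements.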

While the mapping quality score is standardized, each read-mapper has a different way of estimating \(p\), the mismap probability. This creates differences in the reported qualities: *e.g.* the maximum quality that `bowtie2` can assign is 42, `BWA`’s is 37 and `minimap`’s is 60^{266}. This of course means that comparing mapping quality values between read-mappers is not necessarily meaningful. Furthermore, in some cases the mapping quality is not very reflective of the alignment accuracy^{245}; as such, alternative approaches have been explored: a new genome mappability score^{267}, simulations^{268} or even machine learning^{269}.

In conclusion, as a crucial step in many bioinformatics pipelines, read-mapping is a markedly active field, with much work on increasing mapping accuracy and speeding up alignment. However, despite all this work, some challenges remain. Further improving mapping is possible, and doing so could result in more accurate downstream analyses and avoid drawing erroneous conclusions.

## 2.4 Multiple sequence alignment

Up until now we have only considered pairwise alignment, where we want to find homologies between a pair of sequences. In many cases though, it is helpful to compare more than two sequences together; this is where *multiple sequence alignment* (MSA) steps in. It is an essential task in many bioinformatics and comparative biology analyses^{270}.

We saw earlier that with dynamic programming and algorithms like NW or SW it is possible to compute an optimal pairwise alignment. For MSA, the task of computing the optimal alignment is, unfortunately, NP-hard^{271,272}, with time and space complexity growing exponentially in the number of sequences to align. Therefore, heuristics and approximations are needed from the start in order to get anything meaningful.

An early and easy-to-conceptualize method is the so-called center star alignment method^{195}. In this approach, a single sequence is chosen to be the center sequence. Each other sequence is then aligned to the center sequence, and the pairwise alignments are merged, conserving the gaps that were inserted. The center sequence is often chosen to be as similar to the other sequences as possible; for this, all pairwise distances between sequences are needed, implying a quadratic distance computation step. The pairwise alignments are independent, so this approach is easy to parallelize. Some software, like `HAlign`^{273}, uses center star alignment to produce MSAs. This method, however, is quite sensitive to the choice of the center sequence: bad pairwise alignments can lower the accuracy of the overall MSA by conserving spurious gaps.
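Choosing the center sequence reduces to the quadratic distance computation mentioned above; a sketch with hypothetical names, using a toy Hamming distance (which assumes equal-length sequences, whereas a real implementation would use an edit or alignment-free distance):

```python
def hamming(a, b):
    """Toy distance for same-length sequences (an assumption of this sketch)."""
    return sum(x != y for x, y in zip(a, b))

def center_sequence(seqs, dist=hamming):
    """Pick the sequence minimizing the sum of distances to all the others."""
    totals = [sum(dist(s, t) for t in seqs) for s in seqs]
    return seqs[totals.index(min(totals))]
```

The chosen center then anchors all subsequent pairwise alignments, which is precisely why a poor choice degrades the whole MSA.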

### 2.4.1 Progressive alignment

One of the most widely used multiple sequence alignment approaches is progressive alignment^{274}. Similarly to the center star algorithm, the progressive algorithm reduces the MSA problem to independent pairwise alignments. The first step is to build a phylogenetic tree from the sequences to align, called the *guide tree*, representing the evolutionary relationships between sequences. Starting from the leaves, which correspond to single sequences, sequences are pairwise aligned and the alignment (or *profile*) is stored in the parent node. Going up from the leaves to the root, sequences are aligned together, then sequences to profiles if needed, and finally profiles together, merging alignments as we progress up the tree; profiles at inner nodes of the tree are aligned to each other so as to conserve gaps. The final multiple sequence alignment is obtained when this process reaches the root. A representation of this process is shown in Figure 2.6.

In many cases a matrix of pairwise distances is needed to construct the guide tree; if we choose the edit distance, \(n(n-1)/2\) pairwise alignments are needed to get this matrix. With a large number of sequences, or long sequences, this is not possible in a reasonable amount of time. Therefore, distance matrices computed through alignment-free methods, usually based on \(k\)-mers, are often used as input to the tree-building method^{275,276}.
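An alignment-free \(k\)-mer distance avoids the quadratic alignment step entirely: each sequence is reduced to its set of \(k\)-mers and compared set-wise. A minimal Jaccard-style sketch (hypothetical names; actual tools use more refined measures):

```python
from collections import Counter

def kmer_profile(seq, k):
    """Count every k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """Alignment-free distance: 1 minus the Jaccard similarity of the k-mer sets."""
    pa, pb = set(kmer_profile(a, k)), set(kmer_profile(b, k))
    if not (pa or pb):
        return 0.0  # two empty sequences are identical
    return 1.0 - len(pa & pb) / len(pa | pb)
```

Filling the \(n \times n\) matrix with this distance costs only linear work per sequence pair, which is what makes guide-tree construction tractable for large sets.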

Tree reconstruction methods from the distance matrix, like UPGMA^{277} or neighbor-joining^{278}, can be quite time consuming when dealing with a large set of sequences. To counteract this, some multiple sequence aligners use heuristic methods to approximate a guide tree. `MAFFT`^{196}, for example, uses the `PartTree`^{279} method to approximate the tree, and `Clustal Omega`^{280} uses an embedding method^{281} to do so.

Although this method is a good heuristic, as the guide tree can capture complex relationships between sequences, progressive alignment can still suffer from problems similar to center star alignment, mainly gap propagation. If an early alignment is erroneous and introduces spurious gaps, then these are propagated throughout the MSA. As it is said in the seminal progressive alignment paper: “once a gap, always a gap”^{274}. Iterative refinement of the MSA^{270} was proposed as a solution to this problem. A possible approach is to recompute a guide tree from the alignment and run the whole progressive alignment procedure on the new guide tree, but this is very time consuming and not practical with the large sequence sets available today. Instead, the iterative procedure consists of taking an MSA obtained through progressive alignment and splitting it horizontally into two alignments of \(n/2\) sequences each. In each half, the sites composed exclusively of gaps are removed, and the two alignments are then merged through profile alignment. After the realignment of both halves, a scoring metric is computed, and these steps are repeated as long as the metric improves. There are several such metrics; the most commonly used is probably the sum-of-pairs score^{282} or its weighted variant^{283}. Other scores exist, like log-odds and correlation^{284} or a consistency-based score^{285}.
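The column-cleaning step of this refinement loop is easy to make concrete: after splitting the MSA, each half may contain columns that are now all gaps and must be dropped before realignment. A sketch with a hypothetical name, representing the alignment as a list of equal-length gapped strings:

```python
def drop_gap_columns(msa):
    """Remove columns made up exclusively of gap characters ('-')."""
    keep = [j for j in range(len(msa[0])) if any(row[j] != "-" for row in msa)]
    return ["".join(row[j] for j in keep) for row in msa]
```

The refinement loop alternates this cleaning with profile realignment of the two halves, stopping once the chosen score no longer improves.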

Most of the widely used multiple sequence aligners implement some form of progressive alignment with iterative refinement: `T-Coffee`^{286}, which uses a consistency score for refinement, `MUSCLE`^{287,288}, `MAFFT`^{196}, `ProbCons`^{289}, which uses a formal HMM to compute consistency, and the various `CLUSTAL` incarnations^{280,290,291}, which are among the most cited papers of all time.

### 2.4.2 Other methods

While the progressive alignment algorithm has been at the root of some of the most widely used alignment software, other methods to produce MSAs have been explored over the years.

Another common method for creating multiple alignments, whether through profile-profile or sequence-profile alignments, is HMMs. Several tools use HMMs to generate an alignment, such as `HMMer`^{243}, `MSAProbs`^{292} or `COVID-align`^{293}. In some cases, the HMM-based approach has similar performance to `clustalW`^{294}.

Other methods have focused on speeding up the dynamic programming part of aligning multiple sequences. This can be done using simulated annealing^{295–297}, which can also be used to speed up HMM training^{294}. Genetic algorithms have also been used to construct MSAs^{298}, increasing the speed at which this is possible^{299}. Several tools use genetic algorithms, like `VGDA`^{300}, `GAPAM`^{301} and `SAGA`^{302}.

With the recent focus on SARS-CoV-2, some specific multiple sequence aligners have been developed to create very large multiple sequence alignments. They often take advantage of the fact that this virus mutates quite slowly, meaning that most of the available sequences are highly homologous. Furthermore, as the epidemic was tracked in near real time since its beginning, we know the original sequence at the root of the pandemic. Leveraging this knowledge, it is possible to build a profile by aligning new sequences to the ancestral sequence and to align new sequences to this profile using HMMs, as is done in `COVID-align`^{293}. The `NextAlign`^{303} software even forgoes aligning to a profile and creates massive MSAs (millions of sequences) by aligning new sequences to the ancestral sequence using banded SW alignment, with gap penalties enriched with biological knowledge and dependent on the position within the sequence.

Recently, Garriga *et al.* introduced the regressive alignment method^{304}, where instead of traversing a guide tree from leaf to root, it goes the other way, aligning the more distant sequences first before merging MSAs. Using this approach, they managed to create an MSA of 1.4 million sequences with improved accuracy over progressive methods.

Since multiple sequence alignments are so useful in comparative biology, and since there is such a vast array of methods to construct them, it stands to reason that there are many resources to help practitioners make their choice: reviews, benchmarking datasets and benchmarking procedures^{305–309}.

## 2.5 Conclusion

Sequence alignment, multiple or pairwise, is a fundamental part of the bioinformatician’s toolkit. Comparing sequences and finding homologies is at the root of many fields because of the wealth of evolutionary information contained in alignments. As such it is paramount to have the best possible sequence alignments in any situation.

As we have seen, although we have methods guaranteed to give us optimal pairwise and multiple sequence alignments, they are not practically useful for dealing with sequences at today’s scale. Therefore, most sequence aligners rely on, sometimes numerous, heuristics and approximations. From substitution models to seeding techniques, none of these necessarily reflect the biological reality contained within the sequences to align. Each of these heuristics or models is a step where biases and approximations can creep in, accumulating along and across sequences. Therefore, there must be room for improvement.

Having methods that are both fast and accurate is now more necessary than ever with the ever-growing scale and number of publicly available sequences. Furthermore, in the “age of pandemics”, accurate alignment methods are indispensable to track and keep an eye on disease spread across the globe, in real time.

### References

*Human Molecular Genetics*

**27**, R234–r241 (2018).

*Nat Rev Genet*

**21**, 597–614 (2020).

*Bioinformatics*

**30**, 3004–3011 (2014).

*et al.*Hybrid error correction and de novo assembly of single-molecule sequencing reads.

*Nat Biotechnol*

**30**, 693–700 (2012).

*et al.*Weighted minimizer sampling improves long read mapping.

*Bioinformatics*

**36**, i111–i118 (2020).

*Coding and Information Theory*. (Prentice-Hall, 1980).

*Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology*. (Cambridge University Press, 1997). doi:10.1017/cbo9780511574931.

*Soviet Physics Doklady*

**10**, 707 (1966).

*J Mol Evol*

**17**, 368–376 (1981).

*Plos One*

**5**, e9490 (2010).

*et al.*Highly accurate protein structure prediction with AlphaFold.

*Nature*

**596**, 583–589 (2021).

*et al.*Predicting protein structure using only sequence information.

*Proteins: Structure, Function, and Bioinformatics*

**37**, 121–125 (1999).

*Current Opinion in Structural Biology*

**15**, 275–284 (2005).

*Nat Rev Mol Cell Biol*

**8**, 995–1005 (2007).

*Bioinformatics*

**27**, 1455–1461 (2011).

*Nat Methods*

**6**, S13–s20 (2009).

*et al.*Structural variant calling: The long and the short of it.

*Genome Biol*

**20**, 246 (2019).

*Algorithms in Bioinformatics: A Practical Introduction*. (Chapman and Hall/CRC, 2011). doi:10.1201/9781420070347.

*Journal of Molecular Biology*

**48**, 443–453 (1970).

*Journal of Molecular Biology*

**147**, 195–197 (1981).

*Applied Mathematical Programming*. (Addison-Wesley Publishing Company, 1977).

*Journal of Computer and System Sciences*

**20**, 18–31 (1980).

*Journal of Machine Learning Research*

**11**, 18 (2010).

*J. Acm*

**23**, 1–12 (1976).

*Commun. ACM*

**18**, 341–343 (1975).

*Trends in genetics*

**16**, 276–277 (2000).

*Advances in Applied Mathematics*

**12**, 337–357 (1991).

*Journal of Molecular Biology*

**197**, 723–728 (1987).

*et al.*The Bioperl Toolkit: Perl Modules for the Life Sciences.

*Genome Res.*

**12**, 1611–1618 (2002).

*et al.*Bioconductor: Open software development for computational biology and bioinformatics.

*Genome Biol*

**5**, R80 (2004).

*BMC Bioinformatics*

**17**, 81 (2016).

*Bulletin of the Polish Academy of Sciences: Technical Sciences*

**60**, 491–494 (2012).

*eLS*(John Wiley & Sons, Ltd, 2013). doi:10.1002/9780470015902.a0005265.pub3.

*Atlas of Protein Sequence and Structure*345–352 (1978).

*Pnas*

**89**, 10915–10919 (1992).

*Mol Biol Evol*

**18**, 691–699 (2001).

*Molecular Biology and Evolution*

**25**, 1307–1320 (2008).

*Bioinformatics*

**17**, S182–s189 (2001).

*Bioinformatics*

**16**, 760–766 (2000).

*Sci Rep*

**9**, 16380 (2019).

*Proteins: Structure, Function, and Bioinformatics*

**71**, 910–919 (2008).

*Nucleic Acids Res*

**36**, 6664–6675 (2008).

*et al.*HIV-Specific Probabilistic Models of Protein Evolution.

*PLoS One*

**2**, e503 (2007).

*Phys Rev E Stat Nonlin Soft Matter Phys*

**72**, 061917 (2005).

*Biocomputing 2002*115–126 (World Scientific, 2001). doi:10.1142/9789812799623_0012.

*BMC Bioinformatics*

**6**, 134 (2005).

*Molecular Biology and Evolution*

**24**, 388–397 (2007).

*Molecular Biology and Evolution*

**26**, 473–480 (2009).

*Proceedings of the National Academy of Sciences*

**80**, 1382–1386 (1983).

*Advances in Mathematics*

**20**, 367–387 (1976).

*Journal of Molecular Biology*

**162**, 705–708 (1982).

*Bulletin of Mathematical Biology*

**48**, 603–616 (1986).

*Journal of Theoretical Biology*

**108**, 333–337 (1984).

*Bltn Mathcal Biology*

**50**, 97–120 (1988).

*BMC Bioinformatics*

**7**, 527 (2006).

*Nucleic Acids Research*

**32**, 2838–2843 (2004).

*Journal of Molecular Biology*

**229**, 1065–1082 (1993).

*Proteins: Structure, Function, and Bioinformatics*

**54**, 71–87 (2004).

*Plos One*

**3**, e2325 (2008).

*Trends in Biochemical Sciences*

**23**, 403–405 (1998).

*Computational Biology and Chemistry*

**35**, 308–318 (2011).

*Bioinformatics*(2020) doi:10.1093/bioinformatics/btaa777.

*Methods in Enzymology*vol. 210 575–601 (Academic Press, 1992).

*SIAM J. Appl. Math.*

**49**, 1552–1566 (1989).

*Biomolecules*

**12**, 546 (2022).

*Nucleic Acids Research*

**30**, 3059–3066 (2002).

*BMC Bioinformatics*

**7**, 133 (2006).

*Briefings in Bioinformatics*

**11**, 473–483 (2010).

*J Mol Biol*

**215**, 403–410 (1990).

*et al.*Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.

*Nucleic Acids Research*

**25**, 3389–3402 (1997).

*Bioinformatics*

**18**, 440–445 (2002).

*Bioinformatics*

**26**, 2460–2461 (2010).

*Nat Methods*

**12**, 59–60 (2015).

*Nat Methods*

**18**, 366–368 (2021).

*Proc Natl Acad Sci U S A*

**85**, 2444–2448 (1988).

*Science*

**227**, 1435–1441 (1985).

*Bioinformatics*

**32**, 2636 (2016).

*et al.*The Pfam protein families database: Towards a more sustainable future.

*Nucleic Acids Research*

**44**, D279 (2016).

*Bioinformation*

**2**, 166–168 (2007).

*et al.*Sensitivity and Selectivity in Protein Similarity Searches: A Comparison of Smith–Waterman in Hardware to BLAST and FASTA.

*Genomics*

**38**, 179–191 (1996).

*Proceedings of the 2003 ACM SIGMOD international conference on Management of data*76–85 (Association for Computing Machinery, 2003). doi:10.1145/872757.872770.

*Bioinformatics*

**20**, 3363–3369 (2004).

*Bioinformatics*

**32**, 2103–2110 (2016).

*Bioinformatics*

**34**, i748–i756 (2018).

*Algorithms in Bioinformatics*(eds. Frith, M. & Storm Pedersen, C. N.) 257–268 (Springer International Publishing, 2016). doi:10.1007/978-3-319-43681-4_21.

*et al.*Improving the performance of minimizers and winnowing schemes.

*Bioinformatics*

**33**, i110–i117 (2017).

*Research in Computational Molecular Biology*(ed. Sharan, R.) 35–55 (Springer International Publishing, 2014). doi:10.1007/978-3-319-05269-4_4.

*PeerJ*

**9**, e10805 (2021).

*Genome Res.*

**31**, 2080–2094 (2021).

*14th Annual Symposium on Switching and Automata Theory (swat 1973)*1–11 (1973). doi:10.1109/swat.1973.13.

*SIAM J. Comput.*

**22**, 935–948 (1993).

*Algorithms in Bioinformatics*(eds. Guigó, R. & Gusfield, D.) 449–463 (Springer, 2002). doi:10.1007/3-540-45784-4_35.

*Proceedings 41st Annual Symposium on Foundations of Computer Science*390–398 (2000). doi:10.1109/sfcs.2000.892127.

*Nucleic Acids Res*

**30**, 2478–2483 (2002).

*Journal of Discrete Algorithms*

**2**, 53–86 (2004).

*et al.*MUMmer4: A fast and versatile genome alignment system.

*PLOS Computational Biology*

**14**, e1005944 (2018).

*Journal of the ACM*

**23**, 262272 (1976).

*A Block-Sorting Lossless Data Compression Algorithm*. https://www.cs.jhu.edu/%7Elangmea/resources/burrows_wheeler.pdf (1994).

*Nucleic Acids Research*

**40**, 6993–7015 (2012).

*Bioinformatics*

**34**, 416–424 (2018).

*Bioinformatics*

**24**, 791–797 (2008).

*Bioinformatics*

**25**, 1754–1760 (2009).

*Bioinformatics*

**26**, 589–595 (2010).

*Bioinformatics*

**28**, i318–i324 (2012).

*Nat Methods*

**9**, 357–359 (2012).

*et al.*AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication.

*Proceedings of the National Academy of Sciences*

**119**, e2113075119 (2022).

*Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids*. (Cambridge University Press, 1998). doi:10.1017/cbo9780511790492.

*Nucleic Acids Res*

**39**, W29–w37 (2011).

*BMC Bioinformatics*

**7**, 292 (2006).

*Bioinformatics*

**27**, 2790–2796 (2011).

*et al.*Technology dictates algorithms: Recent developments in read alignment.

*Genome Biology*

**22**, 249 (2021).

*Bioinformatics*

**32**, 136–139 (2016).

*Bioinformatics*

**33**, 2281–2287 (2017).

*et al.*Hardware Acceleration of Short Read Mapping. in

*2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines*161–168 (2012). doi:10.1109/fccm.2012.36.

*et al.*LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment. in

*2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)*462–471 (2020). doi:10.1109/ipdps47924.2020.00055.

*BMC Bioinformatics*

**13**, 238 (2012).

*Bioinformatics*

**35**, 20–27 (2019).

*et al.*Fast and sensitive mapping of nanopore sequencing reads with GraphMap.

*Nat Commun*

**7**, 11307 (2016).

*et al.*Accurate detection of complex structural variations using single-molecule sequencing.

*Nat Methods*

**15**, 461–468 (2018).

*J Comput Biol*

**25**, 766–779 (2018).

*Nucleic Acids Research*

**48**, e114 (2020).

*Nat Methods*

**19**, 705–710 (2022).

*Bioinformatics*

**36**, i75–i83 (2020).

*Genome Res.*

**18**, 1851–1858 (2008).

*Bioinformatics*

**28**, 2097–2105 (2012).

*Genome Biology*

**18**, 152 (2017).

*Bioinformatics*

**28**, i349–i355 (2012).

*Multiple Sequence Alignment Methods*. vol. 1079 (Humana Press, 2014).

*Journal of Computational Biology*

**1**, 337–348 (1994).

*Journal of Computational Biology*

**8**, 615–623 (2001).

*et al.*HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.

*Molecular Biology and Evolution*

**39**, msac166 (2022).

*J Mol Evol*

**25**, 351–360 (1987).

*Bioinformatics*

**8**, 275–282 (1992).

*Proceedings of the National Academy of Sciences*

**83**, 5155–5159 (1986).

*Information Processing Letters*

**104**, 205–210 (2007).

*Molecular Biology and Evolution*

**4**, 406–425 (1987).

*Bioinformatics*

**23**, 372–374 (2007).

*et al.*Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

*Mol Syst Biol*

**7**, 539 (2011).

*Algorithms Mol Biol*

**5**, 21 (2010).

*Journal of Theoretical Biology*

**138**, 297–309 (1989).

*Journal of Molecular Biology*

**207**, 647–653 (1989).

*Bioinformatics*

**20**, 1301–1308 (2004).

*Bioinformatics*

**14**, 407–422 (1998).

*Journal of Molecular Biology*

**302**, 205–217 (2000).

*BMC Bioinformatics*

**5**, 113 (2004).

*Nucleic Acids Research*

**32**, 1792–1797 (2004).

*Genome Res*

**15**, 330–340 (2005).

*Nucleic Acids Res*

**22**, 4673–4680 (1994).

*Nucleic Acids Research*

**25**, 4876–4882 (1997).

*Bioinformatics*

**26**, 1958–1964 (2010).

*Bioinformatics*(2020) doi:10.1093/bioinformatics/btaa871.

*International Conference on Intelligent Systems for Molecular Biology*7 (1995).

*Bioinformatics*

**10**, 419–426 (1994).

*Third International Conference on Natural Computation (ICNC 2007)*vol. 2 270–274 (2007).

*Genomics*

**109**, 419–431 (2017).

*Bioinformatics*

**13**, 565–581 (1997).

*BMC Bioinformatics*

**12**, 353 (2011).

*IEEE Transactions on Evolutionary Computation*

**16**, 615–631 (2012).

*Nucleic Acids Res*

**24**, 1515–1524 (1996).

*Joss*

**6**, 3773 (2021).

*et al.*Large multiple sequence alignments with a root-to-leaf regressive method.

*Nat Biotechnol*

**37**, 1466–1470 (2019).

*PLOS Computational Biology*

**3**, e123 (2007).

*Bioinformatics*

**15**, 87–88 (1999).

Here I am using an index starting at 1 and inclusive, so \(S_1[1,n-1]\) represents the first \(n-1\) characters. If \(S_1 = ABCD\) then \(S_1[1,3]=ABC\)↩︎