Chapter 1 What is Sequence Data?
1.1 Biological sequences, a primer
To fully understand the work that was done during this thesis, as well as the choices that were made, some basic knowledge of molecular biology and genetics is needed. If you are already familiar with biological sequences, feel free to skip ahead to section 1.2.
1.1.1 What is DNA?
DeoxyriboNucleic Acid (DNA) is one of the most important molecules there is: without it, complex life as we know it would be impossible. It contains all the genetic information of a given organism, that is to say all the information necessary for the organism to: 1) function as a living being and 2) make a perfect copy of itself. This is the case for the overwhelming majority of living organisms on planet Earth, from elephants to potatoes, to micro-organisms like bacteria.
DNA is a polymer, composed of monomeric units called nucleotides. Each nucleotide is composed of deoxyribose (a five-carbon sugar) to which are attached a phosphate group as well as one of four nucleobases: Adenine (A), Cytosine (C), Guanine (G) or Thymine (T). These four types of nucleotide monomers link up with one another, through phosphate-sugar bonds, creating a single strand of DNA. The ordered sequence of these four types of nucleotides in a strand encodes all the genetic information necessary for the organism to function. Nucleotides in a strand form strong complementary bonds with nucleotides from another strand, A with T and C with G. These bonds allow two strands of DNA to form the double-helix structure of DNA1 shown in Figure 1.1. The specificity of nucleotide bonds ensures that the two strands of the double helix are complementary and that the information contained in one strand can be recovered from the other. This gives the DNA molecule a certain structural stability and a way to recover important information that could be lost due to a damaged strand.
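Since the pairing rules fix each base's partner, recovering one strand from the other is mechanical. Here is a minimal Python sketch of that recovery (the function name and the example sequence are mine, for illustration):

```python
# Complementarity rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Recover the complementary strand, read in the conventional 5'->3' order."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

print(reverse_complement("ATTGC"))  # GCAAT
```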
The amount of DNA necessary to encode the information varies greatly from organism to organism: 5,400 base pairs (5.4 kBp) for the \(\varphi X174\) phage2, 4.9 MBp for Escherichia coli3, 3.1 GBp for Homo sapiens4, all the way up to almost 150 GBp for Paris japonica, a Japanese mountain flowering plant5. While very small genome sizes tend to occur in smaller, simpler organisms, genome size does not correlate with organism complexity6.
1.1.2 From Information to action
1.1.2.1 Proteins, their structure and their role
The double stranded DNA molecules present in the cells of a living organism contain information only; in order for the organism to live, this information must be read and translated into actions. Most of the actions necessary for “life” are taken by large molecules called proteins, which have a very wide range of functions, from catalyzing reactions in the cell to giving it structure7.
Proteins are macromolecules that are made up of one or several chains of amino acids. These chains then link together and fold up into a specific three-dimensional structure, giving the protein the shape it needs to fulfill its role. This structure is determined by the sequence of amino acids, and a given protein can be identified by this amino acid sequence7.
This sequence is directly dependent on the information contained in the DNA. First the DNA is transcribed into a similar, but single-stranded, molecule called RNA (Ribonucleic Acid) which encodes the same sequence. This RNA molecule is then translated into a protein by the following process8:
- Nucleotides in the RNA sequence are read in groups of three called codons.
- These codons are read sequentially along the RNA molecule.
- Each codon corresponds to an amino acid, according to the genetic code.
- The sequence of codons in RNA (and by extension DNA) determines the sequence of amino acids.
- The translation process is stopped when a specific type of codon (a “Stop” codon) is read.
With four types of nucleotides and codons grouping three nucleotides there are \(4^3=64\) possible codons. However, proteins are built from only 20 different amino acids, meaning that several different codons correspond to the same amino acid. This gives the translation process a certain robustness to errors that can occur when the DNA is copied to create a new cell, or when it is transformed into RNA prior to protein translation.
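To make these rules concrete, here is a toy Python sketch of translation. It assumes the RNA has already been written with DNA letters (T instead of U), and it only includes a handful of real codon assignments out of the 64 in the standard genetic code:

```python
CODON_TABLE = {
    "ATG": "M",                                      # Methionine (start)
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # Alanine: four redundant codons
    "AAA": "K", "AAG": "K",                          # Lysine
    "TAA": "*", "TAG": "*", "TGA": "*",              # Stop codons
}

def translate(seq: str) -> str:
    protein = []
    for i in range(0, len(seq) - 2, 3):     # read codons sequentially
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":               # a stop codon halts translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCTGCCAAATAA"))  # MAAK: GCT and GCC both encode Alanine
```

Note how the redundancy shows up directly: a GCT-to-GCC substitution would leave the resulting protein unchanged.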
The portion of DNA that is read to create the protein is said to be “coding”, and is called a gene. There are several thousand genes in the human genome9, resulting in proteins executing thousands of different functions in a cell. In human beings, coding DNA represents only 1% to 2% of the total genome10,11. The large majority of the DNA in a human being is not translated into proteins; a portion of it has a regulatory role, controlling transcription and translation, but the role of the rest of the human genome remains unknown12,13.
1.1.2.2 Making mistakes
Going from DNA sequence to protein is quite a complicated process involving several steps; it is therefore possible for mistakes to happen. There are several mechanisms to avoid mistakes and alterations of the genetic information, among them: the complementary nature of the two strands of DNA, the redundant nature of the genetic code, as well as error-correction mechanisms in the molecules (called “polymerases”) that read and write DNA and RNA. Despite all that, some errors in the nucleic acid (DNA and RNA) or protein sequences still make it through; these are called mutations.
1.1.2.2.1 Where can mistakes happen?
There are several sources of error that can alter genetic information14:
DNA replication: When a cell divides, or when an organism reproduces, the DNA molecule must be copied in order to preserve and transmit genetic information. This process has a very low error rate, as low as one error per billion to hundred billion replicated base pairs15. This is due to the fact that the DNA polymerase (the protein responsible for copying DNA molecules) has a relatively low error rate to start with, but mostly to the error-correcting mechanisms present in certain cells and bacteria16.
RNA transcription: Since errors in RNA transcripts are less consequential than those in replicated DNA, RNA polymerases have a much higher error rate than their DNA counterparts. This error rate has been estimated at between four errors per million17 and two errors per hundred thousand18 transcribed bases.
Protein translation: The process of translating RNA into a protein is carried out by large molecular complexes called ribosomes. This is a very error-prone process, with a mistranslation rate estimated to be of the order of one error for every 10,000 codons translated19.
Other mutagenic events: Many external events and factors have been shown to provoke mutations in exposed DNA, such as ionizing radiation20, UV rays21, toxins22, heat stress23, cold stress24 or oxidative stress25.
While RNA transcription and protein translation are much more error-prone processes than DNA replication, the errors they induce only alter the expression of the genetic information. The effects of these errors are localized to the cells where they happen and are not transmitted to offspring. However, these transcription errors are not unimportant: increased transcription error rates have been hypothesized to cause severe neurological symptoms in pediatric cohorts26.
1.1.2.2.2 What kind of errors are possible?
In biological sequences (nucleic acids and proteins), mutations can result from one of three error modes:
- Substitutions, where the original base unit (nucleotide or amino acid) is mistakenly replaced by another one, for instance incorporating an A instead of a G during RNA transcription.
- Insertions, where a new base unit not present in the original sequence is added to the newly synthesized biological sequence.
- Deletions, where a base unit from the original sequence is skipped and not taken into account when synthesizing the new sequence.
While these three types of errors occur both in nucleic acids and proteins, there are some things to consider about the consequences of nucleic acid mutations on protein synthesis. Due to the redundant nature of the genetic code mentioned in Section 1.1.2.1, some substitutions in the nucleic acid sequence will result in the same protein sequence and therefore leave protein activity unaltered. Some mutations, however, will result in a substitution at the amino acid level, which could potentially lead to a physicochemically altered or even non-functional protein. Finally, insertion and deletion errors (collectively called indels) can have major consequences on the resulting proteins. Inserting or deleting nucleotides in multiples of three will result in the insertion/deletion of amino acids in the protein; any other length of indel will result in what is called a frameshift mutation27. These mutations cause changes in all the codons following the mutation, potentially resulting in a completely different amino-acid sequence, including the appearance of premature stop codons as shown in Figure 1.2.
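A small Python sketch makes the frameshift concrete; the 12-nucleotide “gene” below is invented, but the codon assignments are real entries of the genetic code:

```python
CODON_TABLE = {"ATG": "M", "GCA": "A", "AAA": "K", "TAA": "*",
               "TGC": "C", "ATA": "I"}

def codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

original = "ATGGCAAAATAA"                    # ATG GCA AAA TAA -> M A K stop
mutated = original[:3] + "T" + original[3:]  # insert a single T after the start codon

print([CODON_TABLE.get(c, "?") for c in codons(original)])  # ['M', 'A', 'K', '*']
print([CODON_TABLE.get(c, "?") for c in codons(mutated)])   # ['M', 'C', 'K', 'I']
```

Every codon after the insertion is shifted: the downstream amino acids change and, here, the stop codon disappears entirely.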
1.1.2.2.3 What effect can mutations have?
As we stated above, some mutations in DNA may have no repercussions, while others can lead to non-functional proteins. In some cases, mutations can be associated with a trait in the mutated individual. For example, a single mutation in a gene linked with coagulation can lead to pathological Leiden thrombophilia28, a single amino acid deletion in the CFTR protein leads to (the very deadly) cystic fibrosis29, and many mutations have been linked to complex diseases like type 2 diabetes30,31. Not all mutational effects are bad for the organism, though: mutations are essential for bacteria32 or viruses like HIV33 to develop resistance to treatment (more on that in Chapters 5 and 6).
While some mutations have had their mechanisms and consequences thoroughly studied, in many cases mutations are simply linked to a trait. Since it is easier to show correlation than causation, and the former does not necessarily imply the latter, it is important to further study notable mutations to understand their potential consequences.
1.2 Obtaining sequence data
In many fields, especially in computational biology, we need to know what genetic information the studied organism has. That is to say: what is the exact sequence of nucleotides that make up its DNA? The process of figuring out this sequence is, perhaps unsurprisingly, called sequencing. A sequence that is produced by this process is called a sequencing read or, more commonly, just a read.
1.2.1 Sanger sequencing, a breakthrough
The first widely used sequencing method was developed in 197734. Sanger et al. devised a simple method to read the sequence of nucleotides that make up a DNA molecule, known as chain-termination sequencing or simply Sanger sequencing (represented in Figure 1.3). Although this method is now mostly obsolete, it established some key concepts in sequencing, some of which are still in action in the most modern sequencers.
To understand Sanger sequencing, one must first understand how DNA is synthesized. As we stated in Section 1.1.1, DNA is built up from building blocks that we called nucleotides, more specifically deoxynucleotide triphosphates or dNTPs. These dNTPs are made up of a sugar (deoxyribose), a nucleobase (A, T, G or C) and 3 phosphate groups. By successively adding these dNTPs at the end of an existing DNA molecule, we extend it, linking one of the phosphates of the dNTP to an oxygen atom on the last nucleotide of the DNA molecule. Let us now consider a dideoxynucleotide triphosphate (ddNTP), which is identical to a dNTP except that a specific oxygen atom has been removed. This ddNTP can be added to the growing molecule of DNA like regular dNTPs, but since it is missing that one oxygen atom, no more dNTPs or ddNTPs can be added to the DNA molecule after it. The elongation is terminated, and we call these ddNTPs chain terminators. This combination of DNA synthesis followed by termination is at the heart of Sanger sequencing.
It is important to note that dNTPs and ddNTPs refer to nucleotides with any nucleobase. We can refer to specific dNTPs by replacing the “N” with the base of choice. For example, dATP refers to the dNTP that has adenine as a base. Similarly we have dCTP, dGTP and dTTP (as well as ddATP, ddCTP, ddGTP and ddTTP).
- The first step of Sanger sequencing (and most sequencing methods) is to amplify the DNA molecule we wish to sequence, i.e. make many copies of it (usually through a process called PCR). These clones of the sequence are then separated into their two complementary strands, one of which will be used as a template for the sequencing steps.
- The second step is to prepare 4 different sequencing environments (think of it as 4 test tubes). In each environment we introduce an equal mix of the 4 dNTPs, that will be used to elongate new DNA molecules from the amplified templates, and a single type of ddNTP. So in the first test tube we will have only ddATP, ddCTP in the second, et cetera. In addition, these ddNTPs are marked, at first with radioactive isotopes and later on, as the technology matured, with dyes. This marking means that we can observe the location of these ddNTPs later on.
- Then an equal portion of the template is introduced into each environment, along with DNA polymerases (which will add nucleotides to elongate a sequence complementary to the template) and short specific DNA molecules called primers, which are necessary for the polymerases to start synthesizing new DNA.
- During synthesis the chain is elongated with dNTPs by the polymerase and the reaction stops once a ddNTP is incorporated. At the end of this process we have plenty of fragments of DNA in each test tube, and we know that these fragments end with a specific base in a given environment. For example, in the test tube where we added ddATP, we know that all the fragments end with an A, and that we have all the possible fragments that start at the beginning of the template and end with an A. If the template is AACTA, then the fragments we would get in the ddATP test tube would be A, AA, and AACTA.
- Then, a sample from each environment is taken and deposited on a gel, each in its own lane. A process called electrophoresis is then used to separate the fragments according to their weight. By applying an electrical current to the gel, the fragments of DNA will migrate away from where they were deposited along their lane in the gel. Lighter, shorter DNA fragments will travel further than heavier ones. We then get clusters of fragments ordered by weight (and therefore by length) called bands. With the marked ddNTPs we can reveal these bands in the gel.
- We know that: 1) bands are ordered by length; 2) consecutive bands correspond to the addition of a single nucleotide; 3) in a specific lane, fragments corresponding to a band end with a specific base. This knowledge is enough to deduce the sequence of the template (a toy simulation of this deduction follows below). An example gel is shown in Figure 1.3.
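The deduction in the last two steps can be made concrete with a toy Python simulation. For simplicity it “sequences” the template itself rather than its complementary strand, and it represents each band directly by the length of its fragments:

```python
def sanger_bands(template: str) -> dict[str, list[int]]:
    """Map each ddNTP lane to the lengths of the fragments that end there."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(template, start=1):
        lanes[base].append(length)  # a fragment of this length ends with `base`
    return lanes

def read_gel(lanes: dict[str, list[int]], n: int) -> str:
    # Bands are ordered by length and each consecutive band adds one nucleotide,
    # so reading bands from shortest to longest spells out the sequence.
    length_to_base = {length: base
                      for base, lengths in lanes.items() for length in lengths}
    return "".join(length_to_base[i] for i in range(1, n + 1))

lanes = sanger_bands("AACTA")  # the example template from above
print(lanes["A"])              # [1, 2, 5] -> fragments A, AA, AACTA
print(read_gel(lanes, 5))      # AACTA
```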
This process allowed Sanger et al. to sequence the first genome, of a \(\varphi X174\) bacteriophage, in 19772. Although revolutionary, this method was costly, time consuming and labor intensive. Adjustments to this method were made in order to make it faster and less expensive. An important step was to change the way ddNTPs were marked. By using fluorescent markers, each base having a distinct “color”, we can eliminate the need for 4 different environments and lanes in the gel35,36. This also paved the way for automated sequencing: each fluorescently marked band can be excited with a laser, the resulting specific wavelength recorded by optical systems, and the corresponding base automatically deduced37 (see also Figure 1.3). Other improvements were made, such as using capillary electrophoresis instead of gel electrophoresis.
These gradual improvements to the Sanger sequencing protocol made it possible to sequence longer and more accurate reads, with the latest technologies producing reads reaching 1,000 base pairs with an accuracy of 99.999%38. These improvements also greatly decreased the cost of sequencing, from around $1000 per base pair39 to only $0.5 per kilobase38. Finally, these technological improvements also increased the throughput of sequencing machines from around 1 kilobase per day39 to 120 kilobases per hour40.
Despite these improvements, for ambitious endeavors such as the human genome project, sequencing was a massive undertaking: the first human genome is estimated to have cost between 500 million and 1 billion US dollars to sequence41.
1.2.2 Next-generation sequencing
Thanks to these large sequencing projects and the genomics field in general, the richness and usefulness of sequence data was made ever more apparent. This growing need for sequence data ushered in a new era of sequencing with the development of many new sequencing methods designed to have a higher throughput and a lower cost than Sanger sequencing. This second generation of sequencing technologies is often referred to as Next-Generation Sequencing (NGS) or Massively parallel sequencing. While there are different technologies, they share a few common key points42:
- As with Sanger sequencing, we first need to amplify and clone the DNA template. However, since these technologies result in shorter reads than Sanger sequencing, the DNA we want to sequence must first be randomly broken up into small template fragments before being amplified.
- The amplified template fragments are attached to some sort of solid support, resulting in a physical support with billions of template fragments attached to it.
- As in Sanger sequencing, DNA molecules, complementary to the template fragments, are elongated. This happens for billions of fragments at the same time (hence the “massively parallel” epithet).
- The addition of specific nucleotides to a chain is detected in real time, and there is no permanent chain termination, so there is no need for the long electrophoresis step. These detections happen simultaneously for all the molecules being elongated.
The result of these steps is a very large number of short reads. With data analysis these short reads can be used to deduce longer sequences and eventually a fragmented approximation of the original whole genome sequence through a process called assembly.
The main NGS method is called “sequencing by synthesis”, developed by the company Illumina; it is commonly referred to as Illumina sequencing. This method is based on reversible chain terminators, developed at the Institut Pasteur in the 1990s43. These are marked dNTPs that can be used to elongate DNA molecules, but that have an additional molecular group making them terminators by default. This terminating group can however be removed once the dNTP is incorporated into a DNA molecule, allowing the elongation process to continue. These dNTPs are fluorescently marked, and when excited with a laser they emit light with a distinctive color. During Illumina sequencing, these reversible chain terminators are incorporated into millions of fragments at the same time, stopping elongation. At this point all the fragments are excited with a laser and an optical system takes a picture of the emitted colors for all the fragments at once. In this image, a pixel loosely corresponds to a sequenced fragment, and its color to the most recently added dNTP. The terminating groups are then cleaved and the process can start over by incorporating a new batch of reversible terminators. By observing the successive images we can deduce the sequence of added nucleotides for each sequenced fragment and obtain all of our reads.
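The final deduction step can be sketched as follows. The color-to-base mapping and the toy per-cycle “images” are invented for illustration; a real pipeline works on raw fluorescence intensities rather than color names:

```python
COLOR_TO_BASE = {"red": "A", "green": "C", "blue": "G", "yellow": "T"}

# One entry per sequencing cycle; each entry records the color observed for
# every cluster on the flow cell (here, only two clusters).
cycles = [
    {"cluster_1": "red", "cluster_2": "green"},
    {"cluster_1": "red", "cluster_2": "yellow"},
    {"cluster_1": "blue", "cluster_2": "yellow"},
]

reads: dict[str, list[str]] = {}
for image in cycles:
    for cluster, color in image.items():
        reads.setdefault(cluster, []).append(COLOR_TO_BASE[color])

print({cluster: "".join(bases) for cluster, bases in reads.items()})
# {'cluster_1': 'AAG', 'cluster_2': 'CTT'}
```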
Another NGS method is called pyrosequencing, commercialized by 454 Life Sciences. Contrary to Illumina sequencing, this method does not use reversible chain terminators. Instead it uses a special enzyme called luciferase that emits light as specific dNTPs are added. This process is repeated for the 4 dNTPs (similarly to Sanger sequencing) and from the light emissions we can deduce the sequence of nucleotides44.
These technologies yield reads of around 150 nucleotides for Illumina and 400 nt for pyrosequencing45. This is much shorter than the 1,000-base reads obtainable with the latest Sanger sequencing technologies. However, the throughputs are much higher40: 2.5 to 12.5 gigabases per hour for Illumina and 30 megabases per hour for pyrosequencing. Costs are also quite low: $0.07 and $10 per megabase for Illumina and pyrosequencing respectively. The per-base sequencing accuracies are also quite high, up to 99.9% for both Illumina46 and pyrosequencing40. A summary of the key characteristics of various sequencing technologies can be found in Table 1.1. The lower cost and higher throughput have made Illumina the dominant sequencing technology: the company estimates that 90% of the world’s sequencing data was generated with Illumina machines in 201547.
1.2.3 Long read sequencing
Although NGS technologies revolutionized the sequencing world, recent efforts have been made to obtain longer reads. These third-generation methods generate reads of tens of kilobases and are commonly called long-read sequencing methods. Long reads have a host of applications48 for which short NGS reads might not be well suited: de novo assembly of large complex genomes, the study of complex repetitive regions such as centromeres or telomeres, or the detection of structural variants. They have recently been used to assemble the first truly complete human genome, including telomeric and centromeric regions4.
The two available long-read technologies are Single Molecule Real Time sequencing (SMRT), commercialized by Pacific Biosciences (PacBio), and Nanopore sequencing, commercialized by Oxford Nanopore Technologies (ONT). While these technologies are quite different, they both produce, in real time, reads much longer than even Sanger sequencing, without the need for chain terminators or separate sequencing reactions, all with a high throughput and at a reasonably low cost.
SMRT sequencing was first developed in 200949, before being commercialized and furthered by PacBio. The basic principle is as follows:
- Fragment and amplify DNA to obtain a very large number of DNA templates.
- Link both strands of each DNA template together with known sequences called bell adapters. Denature the DNA to create a single-stranded, circular DNA molecule.
- Primers and polymerases are attached to the circular molecule specifically on one of the bell adapters.
- Add the circular DNA template, primer and polymerase complexes to a SMRT chip. This chip is essentially a large aluminium surface with hundreds of thousands of microscopic wells called Zero-Mode Waveguides (ZMWs), only 100 nm in diameter50. The polymerases are chemically bonded to the bottom of each of these ZMWs, so we effectively get a single DNA template and polymerase per well.
- Fluorescently marked dNTPs are progressively incorporated in each of the wells. When a marked dNTP is incorporated into the newly synthesized DNA strand, light of a specific wavelength is emitted.
- The size of these ZMWs makes the detection of the fluorescence possible with an optical system. Incorporation of dNTPs in each ZMW can be detected simultaneously in a parallel fashion, and the resulting sequences deduced.
Nanopore sequencing, first conceived in the 1980s, further developed over the years51 and first commercialized by ONT in 201452, is completely different from all the sequencing technologies previously mentioned. Whereas all the others are based on synthesizing a complementary DNA strand and detecting specific dNTP incorporation in some way or another, there is no synthesis in nanopore sequencing. The principle relies on feeding a single strand of a DNA template through a small hole in a membrane, a nanopore, at a controlled speed. As the nucleotides go through the nanopore, they modulate an electric current flowing between both sides of the membrane. This current can be measured and is specific to the succession of 5 to 6 nucleotides inside the nanopore channel at any given time. By looking at the evolution of the electric current as the DNA strand goes through the nanopore, we can deduce the sequence of nucleotides through a process called base calling. Base calling is usually done with machine learning methods, mainly artificial neural networks53. In the flow cells used in ONT sequencers, hundreds of thousands of nanopores are spread out over a synthetic membrane, allowing for massively parallel sequencing as well. Theoretically, since this method is not based on synthesis, the read length is only limited by the length of the template, and in practice ONT sequencing produces the longest reads.
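The forward problem, going from sequence to signal, can be sketched in a few lines; the current levels below are fabricated stand-ins (real 5-mer current tables are measured experimentally), and base calling is the much harder inverse problem that neural networks solve:

```python
import hashlib

def fake_current(kmer: str) -> float:
    """Deterministic, made-up stand-in for a 5-mer's characteristic current (pA)."""
    digest = hashlib.md5(kmer.encode()).digest()
    return 60.0 + digest[0] / 255 * 40.0  # arbitrary 60-100 pA range

def signal(strand: str, k: int = 5) -> list[float]:
    # One current level per successive window of k nucleotides inside the pore.
    return [round(fake_current(strand[i:i + k]), 1)
            for i in range(len(strand) - k + 1)]

print(signal("AACTGGTAC"))  # five current levels for the five successive 5-mers
```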
Both technologies yield long reads, with median and highest read lengths of 10 kilobases and 60 kilobases respectively for PacBio sequencing54. For nanopore, the median read lengths of 10 to 12 kilobases55,56 are similar to PacBio, but it can also yield ultra-long reads of 1 up to 2.3 megabases57–59. The length of the reads and the parallel nature of these two technologies give these sequencers truly massive throughputs. PacBio sequencers can sequence between 2 and 11 gigabases per hour, and ONT from 12.5 gigabases per hour up to a staggering 260 gigabases per hour using the latest ONT PromethION machines56. The cost of sequencing with these machines, while higher than for Illumina sequencers, remains reasonably affordable at $0.32 and $0.13 per megabase for PacBio and ONT respectively60. These characteristics are summarized in Table 1.1 along with those of other sequencing technologies.
The length, throughput and sequencing cost of both these technologies paint a pretty picture, and indeed they have proved useful in many settings, but sequencing accuracy is a problem. The per-base sequencing accuracy has been estimated to be between 85% and 92% for PacBio sequencers and between 87% and 98% for ONT machines56,61,62. This is much lower than either Sanger sequencing or Illumina reads. Characterizing, correcting and accounting for these errors is widely studied, and will be discussed in more detail in Sections 1.3 and 1.4.
| technology | read length (nt) | throughput (nt/hour) | cost ($/Mb) | accuracy |
|---|---|---|---|---|
| Sanger | 1,000 | \(120 \times 10^3\) | $500 | 99.999% |
| Illumina | 150 | \(2.5-12.5 \times 10^9\) | $0.07 | 99.9% |
| Pyrosequencing | 400 | \(30 \times 10^6\) | $10 | 99.9% |
| PacBio SMRT | 10,000 (up to 60,000) | \(2-11 \times 10^9\) | $0.32 | 85-92% |
| Nanopore | 12,000 (up to \(2.5 \times 10^6\)) | \(12.5-260 \times 10^9\) | $0.13 | 87-98% |
While most of the mentioned technologies can also be adapted to sequence RNA instead of DNA63,64, directly sequencing proteins remains a challenge. The sequence of amino acids making up a protein is usually deduced from the codons in sequenced DNA or RNA after detection of potentially coding regions called open reading frames (ORFs). The development of methods to directly sequence protein molecules using mass spectrometry started not long after Sanger sequencing65 and has continued since66. New methods are still being developed67, but protein sequencing remains a challenge.
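A naive sketch of ORF detection in Python is shown below; the example sequence is invented, and real ORF finders also scan the reverse-complement strand and apply minimum-length cutoffs:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str) -> list[str]:
    """Return subsequences running from an in-frame ATG to the next stop codon."""
    orfs = []
    for frame in range(3):          # try the three possible reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i           # potential start of a coding region
            elif codon in STOP_CODONS and start is not None:
                orfs.append(seq[start:i + 3])
                start = None
    return orfs

print(find_orfs("CCATGGCAAAATAACC"))  # ['ATGGCAAAATAA']
```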
1.3 Sequencing errors, how to account for them?
Sequencing technologies are not perfect; they make mistakes, as we can see from the accuracy rates reported in Section 1.2. For technologies based on nucleic acid synthesis (i.e. everything except ONT), since they use polymerases, it stands to reason that the same three types of errors described in Section 1.1.2.2 occur: substitutions, insertions and deletions. For long-read technologies though, most of the errors do not come from the polymerase, but from the signal processing used to deduce the sequence. Since both technologies perform single-molecule sequencing, the signal-to-noise ratio is low68,69, making base calling more complicated.
This explains the discrepancy in error rates between short- and long-read sequencing technologies: the former getting as low as \(10^{-4}\) or \(10^{-5}\) after computational processing70, where the latter are between 10% and 15%. This high error rate in long reads is problematic, and many efforts have been made to lower it, both computationally and technologically.
1.3.1 Error correction methods
The long-read error-correction literature and toolset is rich and active71–73. There are two main ways to correct errors: 1) hybrid methods, where high-accuracy short reads are used, and 2) non-hybrid methods, where only the long reads are used.
In non-hybrid methods71,74, by finding regions that overlap fairly well between reads and taking the consensus of the overlapped regions (i.e. the majority nucleotide at each position), some errors can be eliminated. In many analyses and sequencing data processing pipelines, the first step is to break up the reads into all possible overlapping subsequences of length \(k\) called \(k\)-mers (e.g. the 3-mers of ATTGC are ATT, TTG and TGC). Rare \(k\)-mers in the read dataset, i.e. \(k\)-mers that appear only a handful of times in all the reads, are likely the result of an error, and filtering them out can improve analysis. One or both of these procedures are implemented in several pieces of commonly used software, such as the wtdbg275 and canu76 assemblers or standalone long-read correctors like daccord77. In some cases, errors are corrected not on the raw reads but after having assembled the long reads into long continuous sequences (contigs); this process is called polishing. The ntEdit polisher78 also filters out rare \(k\)-mers to correct errors. The Arrow79 and Nanopolish80 polishers correct the assembly using the raw PacBio and ONT long reads respectively, and Racon81 can use both types of long reads to polish assemblies.
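The rare-\(k\)-mer filtering idea fits in a few lines of Python; the values of \(k\) and of the count threshold below are illustrative, real tools choose them based on read length and coverage:

```python
from collections import Counter

def kmers(read: str, k: int):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def rare_kmers(reads: list[str], k: int = 3, min_count: int = 2) -> set[str]:
    """Flag k-mers seen fewer than min_count times as probable sequencing errors."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, count in counts.items() if count < min_count}

reads = ["ATTGCA", "ATTGCA", "ATTCCA"]  # the third read has a likely G->C error
print(rare_kmers(reads))                # {'TTC', 'TCC', 'CCA'}: the error k-mers
```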
Hybrid methods, as their name suggests, make use of short reads to correct errors in long reads. By finding similar regions between the short and long reads, we can use the higher accuracy of short reads to correct the long ones. This is implemented in many pieces of software such as proovread82, Jabba83, PBcR84 or LoRDEC85. Short reads can also be used to polish long-read assemblies with tools like Pilon86. The first complete human genome was assembled and polished using many different sequencing technologies, including PacBio, ONT and Illumina technologies4.
1.3.2 More accurate sequencing methods
While a lot of effort is being put into error correction, another angle of attack to lower the error rate of long reads is to improve the sequencing technology.
In 2019, PacBio introduced HiFi reads, based on a circular consensus sequencing (CCS) technique87. During SMRT sequencing the two strands are linked together by bell adapters to form a circular DNA template (c.f. Section 1.2.3); the central idea of CCS is to sequence this molecule multiple times by going over the circle more than once. In the resulting long sequence the known bell adapter sequences can be removed, and a consensus sequence can be built from the multiple passes over the same DNA template. This results in long-read accuracies of 99.8% to 99.9%56,87. This works because PacBio sequencing errors are mostly randomly distributed along the sequenced template (more on that in Section 1.4.2); it is therefore unlikely that the same error will appear in multiple passes over the same template portion.
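The consensus idea can be sketched as follows. Real CCS must first align the passes to one another, since indels shift positions; this toy therefore assumes equal-length, pre-aligned passes carrying only substitutions:

```python
from collections import Counter

def consensus(passes: list[str]) -> str:
    # Majority vote at each position across all passes.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*passes))

passes = ["ATTGCA",  # each pass carries a different random substitution,
          "ATCGCA",  # so errors rarely line up at the same position
          "ATTGCA",
          "ATTGAA"]
print(consensus(passes))  # ATTGCA: the isolated errors are voted out
```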
For ONT sequencing, most improvement efforts have focused on base callers. These tools were originally based on Hidden Markov Models88 (HMMs), but they have gradually shifted to neural-network-based deep learning methods53,74,89,90 with faster inference times and better performance.
Similarly to PacBio HiFi reads, ONT developed 2D and 1D2 sequencing. In 2D sequencing, both strands of the DNA molecule to sequence are linked with a hairpin adapter to form one long sequence. Each strand is sequenced once and a consensus is built from these two passes91. 1D2 sequencing operates in a similar fashion but without the need for a hairpin adapter92. 2D sequencing produces reads with 97% accuracy, albeit much shorter than standard 1D sequencing91. Recently, Oxford Nanopore Technologies announced the release of a new technology they call duplex: using new chemistry, a new base caller and the sequencing of both strands (similarly to 2D and 1D2), they announce raw read accuracies of 99.3%93. Preprint research seems to confirm these numbers, with one experiment yielding duplex reads with a 99.9% accuracy94.
A technology-agnostic method, using unique molecular identifiers added during the template preparation phase together with consensus sequencing, has been shown, in specific contexts, to improve the accuracies of both ONT and PacBio CCS long reads to 99.59% and 99.93% respectively95.
Finally, new sequencing technologies are being developed, like short-read technologies with built-in error correction yielding error-free reads of up to 200 nucleotides96. Illumina also announced its own high-throughput, high-accuracy long-read sequencing technology in 202297, although details about its performance and technology are scarce.
1.4 The special case of homopolymers
Despite improvements in error correction methods and sequencing technologies, certain genetic patterns remain particularly difficult to process; homopolymers are one such pattern.
1.4.1 Homopolymers and the human genome
Homopolymers consist of a stretch of repeated identical nucleotides (i.e. of length \(\geq 2\)) occurring at some point in the genome. For example, the sequence AAAA is a length-4 adenine homopolymer. In the complete human genome assembly (CHM13 v1.1 from the T2T consortium4), 50% of its three gigabases are in homopolymers of size 2 or more, and 10% are in homopolymers of length 4 or greater. As can be seen in Figure 1.4, short and medium length homopolymers make up a significant part of the genome. In the previous GRCh38 human genome assembly, more than 1.9 megabases are in homopolymers of length 8 or higher98, representing about 1‰ of that assembly. The longest homopolymer run in the CHM13 v1.1 assembly is 86 bases long (90 in GRCh3898).
In the human genome, homopolymers tend to occur more often as adenine and thymine runs than as guanine and cytosine runs. There are approximately twice as many nucleotides within A or T homopolymers (481 Mb and 484 Mb) than within G or C ones (278 Mb and 279 Mb). This discrepancy is even more pronounced when looking at homopolymers longer than four nucleotides (c.f. Figure 1.5).
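Such statistics can be computed with a linear scan over the assembly; here is a small Python sketch (the example sequence is invented, and a real computation would simply stream over each chromosome):

```python
from itertools import groupby

def homopolymer_fraction(seq: str, min_run: int = 2) -> float:
    """Fraction of bases lying inside runs of at least min_run identical nucleotides."""
    run_lengths = (sum(1 for _ in group) for _, group in groupby(seq))
    covered = sum(length for length in run_lengths if length >= min_run)
    return covered / len(seq)

seq = "AAAACGTTTGCCA"
print(homopolymer_fraction(seq, min_run=2))  # 9/13: the runs AAAA, TTT and CC
print(homopolymer_fraction(seq, min_run=4))  # 4/13: only AAAA qualifies
```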
1.4.2 Homopolymers and long reads
Unfortunately, homopolymers are a source of errors in sequencing, particularly for long-read technologies. While substitutions seem to be randomly distributed along the reads for PacBio and ONT, the main error mode seems to be indels in homopolymeric sections, i.e. reading the same nucleotide several times or skipping over one of the repeated nucleotides. Many studies show that homopolymeric indels are the main type of error for PacBio SMRT and ONT long-read sequencing68,99–101. This is even the case for PacBio HiFi reads: the circular consensus approach eliminates the randomly distributed substitutions, but homopolymer indels remain87. It seems that ONT reads are more prone to this type of error than PacBio ones56. The rate of these errors is independent of the length of the homopolymer for ONT, but it rises with homopolymer length for short-read and PacBio technologies102.
1.4.3 Accounting for homopolymers
The fact that they make up a significant part of the human genome, and that they are a source of errors for long-read technologies, means that homopolymers warrant special attention and care. Methods have been devised and implemented specifically to counter homopolymer-linked errors.
1.4.3.1 Specific error correction
Homopolymer errors are taken under special consideration during assembly polishing when using certain tools like HomoPolish103, NanoPolish80 or Pilon86. Methods to improve the base calling of homopolymer stretches have been developed for nanopore sequencing104,105, and implemented in state-of-the-art base callers such as guppy or scrappie53.
Steps before sequencing can also be taken in order to reduce the effect of these errors, like avoiding homopolymers in barcode sequences106,107 or during the development of DNA-based storage systems108.
Improving the sequencing technologies can also be a solution, reducing the number of homopolymer errors straight from the source. The latest ONT chemistry, R10, reportedly improves accuracy in homopolymer-rich regions74,109. Non-biological solid-state nanopores also reduce errors in homopolymers110,111.
1.4.3.2 Homopolymer compression, a nifty trick
Homopolymer errors can be harmful for downstream analyses such as read mapping (c.f. Chapter 3). However, in many cases, reads cannot be re-sequenced with newer technologies, or base-called with better base callers: only the read sequences, potentially containing homopolymer errors, are available for use. In order to account for this sort of error, a simple pre-processing trick was developed: homopolymer compression (HPC).
The idea is very simple: for any sequence, replace any repeated run of a nucleotide (i.e. a homopolymer) with a single occurrence of that nucleotide. This means that after going through HPC the sequence AAACTGGG will yield the sequence ACTG. This simple pre-processing step, applied to all the reads and sequences to analyze, removes all indels in homopolymers, and can resolve some ambiguities (c.f. Figure 1.6). It can also remove legitimate information contained in homopolymers, but the trade-off with the reduced error rate has been deemed advantageous.
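Since the text above fully specifies the transformation, the implementation is only a few lines; here is one way to write it with itertools.groupby, which yields one group per run of identical characters:

```python
from itertools import groupby

def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical nucleotides to a single occurrence."""
    return "".join(base for base, _ in groupby(seq))

print(homopolymer_compress("AAACTGGG"))  # ACTG, the example from the text
```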
HPC has been implemented in many sequence bioinformatics software tools. The HiCanu112, MDBG113, wtdbg275 and shasta114 assemblers all use HPC under the hood to provide better assemblies, and HPC was used to assemble the complete human genome sequence4. The first published usage of HPC was actually in the CABOG assembler115, developed for pyrosequencing reads. HPC has also been implemented for other tasks, like clustering116, long-read error correction with LSC117 and LSCPlus118, alignment with minimap2119 and winnowmap2120, or specific analysis pipelines for satellite tandem repeats121.
1.5 Conclusion
I hope, after reading this chapter, you will agree with me that sequencing is fundamental for furthering our knowledge of biological processes, organisms and Life in general. As such, the sequencing field is still very active, with new technologies being developed to improve on current ones in various aspects. Illumina promises high-accuracy long reads with Infinity97 and PacBio is developing its own short-read sequencing technology, moving away from sequencing by synthesis122,123. Finally, efforts are also being made to make sequencing more affordable and available in a greater number of settings, with Ultima Genomics promising accurate short reads for as low as $1 per gigabase124.
With all these technological improvements we are approaching an era where sequencing is easy and quick, opening the door for massive projects like Tara Oceans125 or the Earth BioGenome project126 to better understand biodiversity. Routine whole-genome sequencing could also usher in an era of personalized medicine127.
Despite all these advancements, sequencing errors remain an obstacle to certain analyses. This is particularly true for the ever more used and useful long reads, and for the important fraction of genomes made up of homopolymers. Detecting, removing or accounting for these errors in some way is a crucial step to improve any analysis based on sequencing data, and to make sure that no theory or conclusion is built upon erroneous sequence data.
Finally, it is important to note (at least for the remainder of this thesis) that, from a computational standpoint, a biological sequence is simply a succession of letters and a set of reads is simply a text file. Therefore, many analyses and data processing methods are inspired or directly transposed from the field of text algorithmics.
References
Homopolymer indels can be harmful in opposite circumstances as well. Let us consider, for example, a read that should correspond to several repetitions of a conserved motif. Homopolymer indels can artificially resolve an ambiguity by making the read unique, so that it is assigned to one specific repetition of the motif, or even misplaced entirely.