Global References

1.

Watson, J. D. & Crick, F. H. C. The Structure of DNA. Cold Spring Harb Symp Quant Biol 18, 123–131 (1953).

2.

Sanger, F. et al. Nucleotide sequence of bacteriophage φX174 DNA. Nature 265, 687–695 (1977).

3.

Archer, C. T. et al. The genome sequence of E. coli W (ATCC 9637): Comparative genome analysis and an improved genome-scale reconstruction of E. coli. BMC Genomics 12, 9 (2011).

4.

Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

5.

Pellicer, J., Fay, M. F. & Leitch, I. J. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society 164, 10–15 (2010).

6.

Macgregor, H. C. C-Value Paradox. in Encyclopedia of Genetics (eds. Brenner, S. & Miller, J. H.) 249–250 (Academic Press, 2001). doi:10.1006/rwgn.2001.0301.

7.

Alberts, B. et al. Molecular Biology of the Cell. 4th edition. (Garland Science, 2002).

8.

Crick, F. H. C., Barnett, L., Brenner, S. & Watts-Tobin, R. J. General Nature of the Genetic Code for Proteins. Nature 192, 1227–1232 (1961).

9.

International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

10.

Elkon, R. & Agami, R. Characterization of noncoding regulatory DNA in the human genome. Nat Biotechnol 35, 732–746 (2017).

11.

Omenn, G. S. Reflections on the HUPO Human Proteome Project, the Flagship Project of the Human Proteome Organization, at 10 Years. Mol Cell Proteomics 20, 100062 (2021).

12.

Shabalina, S. A. & Spiridonov, N. A. The mammalian transcriptome and the function of non-coding DNA sequences. Genome Biol 5, 105 (2004).

13.

The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489, 57–74 (2012).

14.

Chatterjee, N. & Walker, G. C. Mechanisms of DNA damage, repair, and mutagenesis: DNA Damage and Repair. Environ. Mol. Mutagen. 58, 235–263 (2017).

15.

Fijalkowska, I. J., Schaaper, R. M. & Jonczyk, P. DNA replication fidelity in Escherichia coli: A multi-DNA polymerase affair. FEMS Microbiol Rev 36, 1105–1121 (2012).

16.

Pray, L. DNA replication and causes of mutation. Nature Education 1, 214 (2008).

17.

Gout, J.-F., Thomas, W. K., Smith, Z., Okamoto, K. & Lynch, M. Large-scale detection of in vivo transcription errors. Proceedings of the National Academy of Sciences 110, 18584–18589 (2013).

18.

Gout, J.-F. et al. The landscape of transcription errors in eukaryotic cells. Sci Adv 3, e1701484 (2017).

19.

Shcherbakov, D. et al. Ribosomal mistranslation leads to silencing of the unfolded protein response and increased mitochondrial biogenesis. Commun Biol 2, 1–16 (2019).

20.

Desouky, O., Ding, N. & Zhou, G. Targeted and non-targeted effects of ionizing radiation. Journal of Radiation Research and Applied Sciences 8, 247–254 (2015).

21.

Kiefer, J. Effects of Ultraviolet Radiation on DNA. in Chromosomal Alterations: Methods, Results and Importance in Human Health (eds. Obe, G. & Vijayalaxmi) 39–53 (Springer, 2007). doi:10.1007/978-3-540-71414-9_3.

22.

Bennett, J. W. & Klich, M. Mycotoxins. Clin Microbiol Rev 16, 497–516 (2003).

23.

Kantidze, O. L., Velichko, A. K., Luzhin, A. V. & Razin, S. V. Heat Stress-Induced DNA Damage. Acta Naturae 8, 75–78 (2016).

24.

Gregory, C. D. & Milner, A. E. Regulation of cell survival in Burkitt lymphoma: Implications from studies of apoptosis following cold-shock treatment. Int J Cancer 57, 419–426 (1994).

25.

Gafter-Gvili, A. et al. Oxidative Stress-Induced DNA Damage and Repair in Human Peripheral Blood Mononuclear Cells: Protective Role of Hemoglobin. PLoS One 8, e68341 (2013).

26.

Anagnostou, M. E. et al. Transcription errors in aging and disease. Translational Medicine of Aging 5, 31–38 (2021).

27.

Roth, J. R. Frameshift mutations. Annu Rev Genet 8, 319–346 (1974).

28.

Kujovich, J. L. Factor V Leiden thrombophilia. Genetics in Medicine 13, 1–16 (2011).

29.

Cutting, G. R. Cystic fibrosis genetics: From molecular understanding to clinical application. Nat Rev Genet 16, 45–56 (2015).

30.

Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016).

31.

Morris, A. P. et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44, 981–990 (2012).

32.

Woodford, N. & Ellington, M. J. The emergence of antibiotic resistance by mutation. Clinical Microbiology and Infection 13, 5–18 (2007).

33.

Rhee, S.-Y. et al. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31, 298–303 (2003).

34.

Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74, 5463–5467 (1977).

35.

Smith, L. M., Fung, S., Hunkapiller, M. W., Hunkapiller, T. J. & Hood, L. E. The synthesis of oligonucleotides containing an aliphatic amino group at the 5′ terminus: Synthesis of fluorescent DNA primers for use in DNA sequence analysis. Nucleic Acids Research 13, 2399–2412 (1985).

36.

Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679 (1986).

37.

Ansorge, W., Sproat, B., Stegemann, J., Schwager, C. & Zenke, M. Automated DNA sequencing: Ultrasensitive detection of fluorescent bands during electrophoresis. Nucleic Acids Research 15, 4593–4602 (1987).

38.

Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat Biotechnol 26, 1135–1145 (2008).

39.

Collins, F. S., Morgan, M. & Patrinos, A. The Human Genome Project: Lessons from Large-Scale Biology. Science 300, 286–290 (2003).

40.

Liu, L. et al. Comparison of Next-Generation Sequencing Systems. Journal of Biomedicine and Biotechnology 2012, e251364 (2012).

41.

The Cost of Sequencing a Human Genome. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost.

42.

Metzker, M. L. Sequencing technologies — the next generation. Nat Rev Genet 11, 31–46 (2010).

43.

Canard, B. & Sarfati, R. S. DNA polymerase fluorescent substrates with reversible 3′-tags. Gene 148, 1–6 (1994).

44.

Nyren, P., Pettersson, B. & Uhlen, M. Solid Phase DNA Minisequencing by an Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay. Analytical Biochemistry 208, 171–175 (1993).

45.

Mardis, E. R. A decade’s perspective on DNA sequencing technology. Nature 470, 198–203 (2011).

46.

Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics 3, lqab019 (2021).

47.

Sequencing Technology | Sequencing by synthesis. https://www.illumina.com/science/technology/next-generation-sequencing/sequencing-technology.html.

48.

Pollard, M. O., Gurdasani, D., Mentzer, A. J., Porter, T. & Sandhu, M. S. Long reads: Their purpose and place. Human Molecular Genetics 27, R234–R241 (2018).

49.

Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133–138 (2009).

50.

Levene, M. J. et al. Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations. Science 299, 682–686 (2003).

51.

Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotech 4, 265–270 (2009).

52.

Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat Biotechnol 34, 518–524 (2016).

53.

Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 20, 129 (2019).

54.

Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics 13, 278–289 (2015).

55.

Ip, C. L. C. et al. MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Res 4, 1075 (2015).

56.

Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat Rev Genet 21, 597–614 (2020).

57.

Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338–345 (2018).

58.

Thar she blows! Ultra long read method for nanopore sequencing · Loman Labs. http://lab.loman.net/2017/03/09/ultrareads-for-nanopore/.

59.

Payne, A., Holmes, N., Rakyan, V. & Loose, M. BulkVis: A graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35, 2193–2198 (2019).

60.

Murigneux, V. et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9, giaa146 (2020).

61.

Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).

62.

Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: Delivery of nanopore sequencing to the genomics community. Genome Biol 17, 239 (2016).

63.

Hong, M. et al. RNA sequencing: New technologies and applications in cancer research. Journal of Hematology & Oncology 13, 166 (2020).

64.

Ozsolak, F. & Milos, P. M. RNA sequencing: Advances, challenges and opportunities. Nat Rev Genet 12, 87–98 (2011).

65.

Hunt, D. F., Yates, J. R., Shabanowitz, J., Winston, S. & Hauer, C. R. Protein sequencing by tandem mass spectrometry. Proceedings of the National Academy of Sciences 83, 6233–6237 (1986).

66.

Smith, B. J. Protein Sequencing Protocols. (Springer Science & Business Media, 2002). doi:10.1385/1592593429.

67.

Restrepo-Pérez, L., Joo, C. & Dekker, C. Paving the way to single-molecule protein sequencing. Nature Nanotech 13, 786–796 (2018).

68.

Weirather, J. L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 6, 100 (2017).

69.

Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39, 1348–1365 (2021).

70.

Ma, X. et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biology 20, 50 (2019).

71.

Lima, L. et al. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Briefings in Bioinformatics 21, 1164–1181 (2020).

72.

Fu, S., Wang, A. & Au, K. F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biology 20, 26 (2019).

73.

Zhang, H., Jain, C. & Aluru, S. A comprehensive evaluation of long read error correction methods. BMC Genomics 21, 889 (2020).

74.

Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biology 21, 30 (2020).

75.

Ruan, J. & Li, H. Fast and accurate long-read assembly with Wtdbg2. Nat Methods 17, 155–158 (2020).

76.

Koren, S. et al. Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

77.

Tischler, G. & Myers, E. W. Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly. 106252 (2017) doi:10.1101/106252.

78.

Warren, R. L. et al. ntEdit: Scalable genome sequence polishing. Bioinformatics 35, 4430–4432 (2019).

79.

Hepler, N. L. et al. An Improved Circular Consensus Algorithm with an Application to Detect HIV-1 Drug-Resistance Associated Mutations (DRAMs). in Conference on advances in genome biology and technology 1 (2016).

80.

Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods 14, 407–410 (2017).

81.

Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017).

82.

Hackl, T., Hedrich, R., Schultz, J. & Förster, F. Proovread: Large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004–3011 (2014).

83.

Miclotte, G. et al. Jabba: Hybrid error correction for long sequencing reads. Algorithms for Molecular Biology 11, 10 (2016).

84.

Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30, 693–700 (2012).

85.

Salmela, L. & Rivals, E. LoRDEC: Accurate and efficient long read error correction. Bioinformatics 30, 3506–3514 (2014).

86.

Walker, B. J. et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS One 9, e112963 (2014).

87.

Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).

88.

Timp, W., Comer, J. & Aksimentiev, A. DNA Base-Calling from a Nanopore Using a Viterbi Algorithm. Biophysical Journal 102, L37–L39 (2012).

89.

Perešíni, P., Boža, V., Brejová, B. & Vinař, T. Nanopore base calling on the edge. Bioinformatics 37, 4661–4667 (2021).

90.

Boža, V., Brejová, B. & Vinař, T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12, e0178751 (2017).

91.

Tyler, A. D. et al. Evaluation of Oxford Nanopore’s MinION Sequencing Device for Microbial Whole Genome Sequencing Applications. Sci Rep 8, 10931 (2018).

92.

Lin, B., Hui, J. & Mao, H. Nanopore Technology and Its Applications in Gene Sequencing. Biosensors 11, 214 (2021).

93.

Oxford Nanopore Tech Update: New Duplex method for Q30 nanopore single molecule reads, PromethION 2, and more. http://nanoporetech.com/about-us/news/oxford-nanopore-tech-update-new-duplex-method-q30-nanopore-single-molecule-reads-0.

94.

Sanderson, N. et al. Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. 2022.04.29.490057 (2022) doi:10.1101/2022.04.29.490057.

95.

Karst, S. M. et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat Methods 18, 165–169 (2021).

96.

Chen, Z. et al. Highly accurate fluorogenic DNA sequencing with information theory–based error correction. Nat Biotechnol 35, 1170–1178 (2017).

97.

High Performance Long Read Assay Enables Contiguous Data up to 10Kb on Existing Illumina Platforms. https://www.illumina.com/content/illumina-marketing/amr/en_US/science/genomics-research/articles/infinity-high-performance-long-read-assay.html.

98.

Booeshaghi, A. S. & Pachter, L. Pseudoalignment facilitates assignment of error-prone Ultima Genomics reads. 2022.06.04.494845 (2022) doi:10.1101/2022.06.04.494845.

99.

Delahaye, C. & Nicolas, J. Sequencing DNA with nanopores: Troubles and biases. PLoS One 16, e0257521 (2021).

100.

Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 25, 1750–1756 (2015).

101.

Dohm, J. C., Peters, P., Stralis-Pavese, N. & Himmelbauer, H. Benchmarking of long-read correction methods. NAR Genomics and Bioinformatics 2, (2020).

102.

Foox, J. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat Biotechnol 39, 1129–1140 (2021).

103.

Huang, Y.-T., Liu, P.-Y. & Shih, P.-W. Homopolish: A method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biology 22, 95 (2021).

104.

Rang, F. J., Kloosterman, W. P. & de Ridder, J. From squiggle to basepair: Computational approaches for improving nanopore sequencing read accuracy. Genome Biology 19, 90 (2018).

105.

Sarkozy, P., Jobbágy, Á. & Antal, P. Calling Homopolymer Stretches from Raw Nanopore Reads by Analyzing k-mer Dwell Times. in EMBEC & NBC 2017 (eds. Eskola, H., Väisänen, O., Viik, J. & Hyttinen, J.) 241–244 (Springer, 2018). doi:10.1007/978-981-10-5122-7_61.

106.

Hawkins, J. A., Jones, S. K., Finkelstein, I. J. & Press, W. H. Indel-correcting DNA barcodes for high-throughput sequencing. Proceedings of the National Academy of Sciences 115, E6217–E6226 (2018).

107.

Srivathsan, A. et al. A MinION™-based pipeline for fast and cost-effective DNA barcoding. Molecular Ecology Resources 18, 1035–1049 (2018).

108.

Wang, Y., Noor-A-Rahim, Md., Gunawan, E., Guan, Y. L. & Poh, C. L. Construction of Bio-Constrained Code for DNA Data Storage. IEEE Communications Letters 23, 963–966 (2019).

109.

R10.3: The newest nanopore for high accuracy nanopore sequencing – now available in store. http://nanoporetech.com/about-us/news/r103-newest-nanopore-high-accuracy-nanopore-sequencing-now-available-store.

110.

Zhou, L. et al. Detection of DNA homopolymer with graphene nanopore. Journal of Vacuum Science & Technology B 37, 061809 (2019).

111.

Goto, Y., Yanagi, I., Matsui, K., Yokoi, T. & Takeda, K. Identification of four single-stranded DNA homopolymers with a solid-state nanopore in alkaline CsCl solution. Nanoscale 10, 20844–20850 (2018).

112.

Nurk, S. et al. HiCanu: Accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

113.

Ekim, B., Berger, B. & Chikhi, R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems 12, 958–968.e6 (2021).

114.

Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044–1053 (2020).

115.

Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).

116.

Sahlin, K. & Medvedev, P. De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm. Journal of Computational Biology 27, 472–484 (2020).

117.

Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).

118.

Hu, R., Sun, G. & Sun, X. LSCplus: A fast solution for improving long read accuracy by short read alignment. BMC Bioinformatics 17, 451 (2016).

119.

Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

120.

Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).

121.

Van Neste, C., Van Nieuwerburgh, F., Van Hoofstat, D. & Deforce, D. Forensic STR analysis using massive parallel sequencing. Forensic Science International: Genetics 6, 810–818 (2012).

122.

Short-read sequencing by binding. https://www.pacb.com/technology/sequencing-by-binding/.

123.

Cetin, A. E. et al. Plasmonic Sensor Could Enable Label-Free DNA Sequencing. ACS Sens. 3, 561–568 (2018).

124.

Almogy, G. et al. Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform. 2022.05.29.493900 (2022) doi:10.1101/2022.05.29.493900.

125.

Sunagawa, S. et al. Tara Oceans: Towards global ocean ecosystems biology. Nat Rev Microbiol 18, 428–445 (2020).

126.

Lewin, H. A. et al. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences 115, 4325–4333 (2018).

127.

Lightbody, G. et al. Review of applications of high-throughput sequencing in personalized medicine: Barriers and facilitators of future progress in research and clinical application. Briefings in Bioinformatics 20, 1795–1811 (2019).

128.

Hamming, R. W. Coding and Information Theory. (Prentice-Hall, 1980).

129.

Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. (Cambridge University Press, 1997). doi:10.1017/cbo9780511574931.

130.

Levenshtein, V. I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966).

131.

Hardison, R. C. Comparative Genomics. PLOS Biology 1, e58 (2003).

132.

Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol 17, 368–376 (1981).

133.

Kumar, S., Tamura, K. & Nei, M. MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers. Bioinformatics 10, 189–191 (1994).

134.

Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).

135.

Guindon, S. et al. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Systematic Biology 59, 307–321 (2010).

136.

Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490 (2010).

137.

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

138.

Karplus, K. et al. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Bioinformatics 37, 121–125 (1999).

139.

Watson, J. D., Laskowski, R. A. & Thornton, J. M. Predicting protein function from sequence and structural data. Current Opinion in Structural Biology 15, 275–284 (2005).

140.

Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005 (2007).

141.

Salmela, L. & Schröder, J. Correcting errors in short reads by multiple alignments. Bioinformatics 27, 1455–1461 (2011).

142.

Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6, S13–S20 (2009).

143.

Mahmoud, M. et al. Structural variant calling: The long and the short of it. Genome Biol 20, 246 (2019).

144.

Sung, W.-K. Algorithms in Bioinformatics: A Practical Introduction. (Chapman and Hall/CRC, 2011). doi:10.1201/9781420070347.

145.

Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453 (1970).

146.

Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981).

147.

Bradley, S. P., Hax, A. C. & Magnanti, T. L. Applied Mathematical Programming. (Addison-Wesley Publishing Company, 1977).

148.

Bellman, R. The theory of dynamic programming. Bull. Amer. Math. Soc. 60, 503–515 (1954).

149.

Masek, W. J. & Paterson, M. S. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20, 18–31 (1980).

150.

Vinh, N. X., Epps, J. & Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research 11, 18 (2010).

151.

Ullman, J. D., Aho, A. V. & Hirschberg, D. S. Bounds on the Complexity of the Longest Common Subsequence Problem. J. ACM 23, 1–12 (1976).

152.

Hirschberg, D. S. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341–343 (1975).

153.

Myers, E. W. & Miller, W. Optimal alignments in linear space. Bioinformatics 4, 11–17 (1988).

154.

Rice, P., Longden, I. & Bleasby, A. EMBOSS: The European molecular biology open software suite. Trends in Genetics 16, 276–277 (2000).

155.

Huang, X. & Miller, W. A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12, 337–357 (1991).

156.

Waterman, M. S. & Eggert, M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology 197, 723–728 (1987).

157.

Stajich, J. E. et al. The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Res. 12, 1611–1618 (2002).

158.

Gentleman, R. C. et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004).

159.

Daily, J. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics 17, 81 (2016).

160.

Frohmberg, W., Kierzynka, M., Blazewicz, J. & Wojciechowski, P. G-PAS 2.0 – an improved version of protein alignment tool with an efficient backtracking routine on multiple GPUs. Bulletin of the Polish Academy of Sciences: Technical Sciences 60, 491–494 (2012).

161.

Altschul, S. F. Substitution Matrices. in eLS (John Wiley & Sons, Ltd, 2013). doi:10.1002/9780470015902.a0005265.pub3.

162.

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. A Model of Evolutionary Change in Proteins. in Atlas of Protein Sequence and Structure 345–352 (1978).

163.

Müller, T. & Vingron, M. Modeling amino acid replacement. J Comput Biol 7, 761–776 (2000).

164.

Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences 89, 10915–10919 (1992).

165.

Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18, 691–699 (2001).

166.

Le, S. Q. & Gascuel, O. An Improved General Amino Acid Replacement Matrix. Molecular Biology and Evolution 25, 1307–1320 (2008).

167.

Müller, T., Rahmann, S. & Rehmsmeier, M. Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17, S182–S189 (2001).

168.

Ng, P. C., Henikoff, J. G. & Henikoff, S. PHAT: A transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 16, 760–766 (2000).

169.

Trivedi, R. & Nagarajaram, H. A. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 9, 16380 (2019).

170.

Goonesekere, N. C. W. & Lee, B. Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins: Structure, Function, and Bioinformatics 71, 910–919 (2008).

171.

Paila, U., Kondam, R. & Ranjan, A. Genome bias influences amino acid choices: Analysis of amino acid substitution and re-compilation of substitution matrices exclusive to an AT-biased genome. Nucleic Acids Res 36, 6664–6675 (2008).

172.

Nickle, D. C. et al. HIV-Specific Probabilistic Models of Protein Evolution. PLoS One 2, e503 (2007).

173.

Sardiu, M. E., Alves, G. & Yu, Y.-K. Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Phys Rev E Stat Nonlin Soft Matter Phys 72, 061917 (2005).

174.

Chiaromonte, F., Yap, V. B. & Miller, W. Scoring pairwise genomic sequence alignments. in Biocomputing 2002 115–126 (World Scientific, 2001). doi:10.1142/9789812799623_0012.

175.

Schneider, A., Cannarozzi, G. M. & Gonnet, G. H. Empirical codon substitution matrix. BMC Bioinformatics 6, 134 (2005).

176.

Doron-Faigenboim, A. & Pupko, T. A Combined Empirical and Mechanistic Codon Model. Molecular Biology and Evolution 24, 388–397 (2007).

177.

Cartwright, R. A. Problems and Solutions for Estimating Indel Rates and Length Distributions. Molecular Biology and Evolution 26, 473–480 (2009).

178.

Fitch, W. M. & Smith, T. F. Optimal sequence alignments. Proceedings of the National Academy of Sciences 80, 1382–1386 (1983).

179.

Waterman, M. S., Smith, T. F. & Beyer, W. A. Some biological sequence metrics. Advances in Mathematics 20, 367–387 (1976).

180.

Gotoh, O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162, 705–708 (1982).

181.

Altschul, S. F. & Erickson, B. W. Optimal sequence alignment using affine gap costs. Bulletin of Mathematical Biology 48, 603–616 (1986).

182.

Waterman, M. S. Efficient sequence alignment algorithms. Journal of Theoretical Biology 108, 333–337 (1984).

183.

Miller, W. & Myers, E. W. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology 50, 97–120 (1988).

184.

Cartwright, R. A. Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 7, 527 (2006).

185.

Goonesekere, N. C. W. & Lee, B. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Research 32, 2838–2843 (2004).

186.

Benner, S. A., Cohen, M. A. & Gonnet, G. H. Empirical and Structural Models for Insertions and Deletions in the Divergent Evolution of Proteins. Journal of Molecular Biology 229, 1065–1082 (1993).

187.

Wrabl, J. O. & Grishin, N. V. Gaps in structurally similar proteins: Towards improvement of multiple sequence alignment. Proteins: Structure, Function, and Bioinformatics 54, 71–87 (2004).

188.

Zhang, W., Liu, S. & Zhou, Y. SP5: Improving Protein Fold Recognition by Using Torsion Angle Profiles and Profile-Based Gap Penalty Model. PLoS One 3, e2325 (2008).

189.

Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. & Gibson, T. J. Multiple sequence alignment with Clustal X. Trends in Biochemical Sciences 23, 403–405 (1998).

190.

Wang, C., Yan, R.-X., Wang, X.-F., Si, J.-N. & Zhang, Z. Comparison of linear gap penalties and profile-based variable gap penalties in profile–profile alignments. Computational Biology and Chemistry 35, 308–318 (2011).

191.

Marco-Sola, S., Moure, J. C., Moreto, M. & Espinosa, A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics (2020) doi:10.1093/bioinformatics/btaa777.

192.

Pearson, W. R. & Miller, W. [27] Dynamic programming algorithms for biological sequence comparison. in Methods in Enzymology vol. 210 575–601 (Academic Press, 1992).

193.

Spouge, J. L. Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths. SIAM J. Appl. Math. 49, 1552–1566 (1989).

194.

Fickett, J. W. Fast optimal alignment. Nucleic Acids Research 12, 175–179 (1984).

195.

Chao, J., Tang, F. & Xu, L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 12, 546 (2022).

196.

Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059–3066 (2002).

197.

Sun, Y. & Buhler, J. Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics 7, 133 (2006).

198.

Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 473–483 (2010).

199.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).

200.

Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997).

201.

Schwartz, S. et al. Human–Mouse Alignments with BLASTZ. Genome Res. 13, 103–107 (2003).

202.

Ma, B., Tromp, J. & Li, M. PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002).

203.

Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).

204.

Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60 (2015).

205.

Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).

206.

Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444–2448 (1988).

207.

Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).

208.

Saripella, G. V., Sonnhammer, E. L. L. & Forslund, K. Benchmarking the next generation of homology inference tools. Bioinformatics 32, 2636 (2016).

209.

Finn, R. D. et al. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Research 44, D279 (2016).

210.

Essoussi, N. & Fayech, S. A comparison of four pair-wise sequence alignment methods. Bioinformation 2, 166–168 (2007).

211.

Shpaer, E. G. et al. Sensitivity and Selectivity in Protein Similarity Searches: A Comparison of Smith–Waterman in Hardware to BLAST and FASTA. Genomics 38, 179–191 (1996).

212.

Schleimer, S., Wilkerson, D. S. & Aiken, A. Winnowing: Local algorithms for document fingerprinting. in Proceedings of the 2003 ACM SIGMOD international conference on Management of data 76–85 (Association for Computing Machinery, 2003). doi:10.1145/872757.872770.

213.

Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).

214.

Li, H. Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

215.

Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).

216.

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R. & Kingsford, C. Compact Universal k-mer Hitting Sets. in Algorithms in Bioinformatics (eds. Frith, M. & Storm Pedersen, C. N.) 257–268 (Springer International Publishing, 2016). doi:10.1007/978-3-319-43681-4_21.

217.

Marçais, G. et al. Improving the performance of minimizers and winnowing schemes. Bioinformatics 33, i110–i117 (2017).

218.

Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. & Medvedev, P. On the Representation of de Bruijn Graphs. in Research in Computational Molecular Biology (ed. Sharan, R.) 35–55 (Springer International Publishing, 2014). doi:10.1007/978-3-319-05269-4_4.

219.

Edgar, R. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ 9, e10805 (2021).

220.

Sahlin, K. Effective sequence similarity detection with strobemers. Genome Res. 31, 2080–2094 (2021).

221.

Sahlin, K. Flexible seed size enables ultra-fast and accurate read alignment. 2021.06.18.449070 (2022) doi:10.1101/2021.06.18.449070.

222.

Weiner, P. Linear pattern matching algorithms. in 14th Annual Symposium on Switching and Automata Theory (swat 1973) 1–11 (1973). doi:10.1109/swat.1973.13.

223.

Manber, U. & Myers, G. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22, 935–948 (1993).

224.

Abouelhoda, M. I., Kurtz, S. & Ohlebusch, E. The Enhanced Suffix Array and Its Applications to Genome Analysis. in Algorithms in Bioinformatics (eds. Guigó, R. & Gusfield, D.) 449–463 (Springer, 2002). doi:10.1007/3-540-45784-4_35.

225.

Ferragina, P. & Manzini, G. Opportunistic data structures with applications. in Proceedings 41st Annual Symposium on Foundations of Computer Science 390–398 (2000). doi:10.1109/sfcs.2000.892127.

226.

Bray, N., Dubchak, I. & Pachter, L. AVID: A Global Alignment Program. Genome Res 13, 97–102 (2003).

227.

Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30, 2478–2483 (2002).

228.

Abouelhoda, M. I., Kurtz, S. & Ohlebusch, E. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004).

229.

Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944 (2018).

230.

McCreight, E. M. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 262–272 (1976).

231.

Burrows, M. & Wheeler, D. A Block-Sorting Lossless Data Compression Algorithm. https://www.cs.jhu.edu/%7Elangmea/resources/burrows_wheeler.pdf (1994).

232.

Vyverman, M., De Baets, B., Fack, V. & Dawyndt, P. Prospects and limitations of full-text index structures in genome analysis. Nucleic Acids Research 40, 6993–7015 (2012).

233.

Cheng, H., Wu, M. & Xu, Y. FMtree: A fast locating algorithm of FM-indexes for genomic data. Bioinformatics 34, 416–424 (2018).

234.

Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K. & Yiu, S. M. Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008).

235.

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

236.

Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

237.

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. (2013).

238.

Liu, Y. & Schmidt, B. Long read alignment based on maximal exact match seeds. Bioinformatics 28, i318–i324 (2012).

239.

Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).

240.

Song, B. et al. AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sciences 119, e2113075119 (2022).

241.

Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (Cambridge University Press, 1998). doi:10.1017/cbo9780511790492.

242.

Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).

243.

Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res 39, W29–W37 (2011).

244.

Wang, J., Keightley, P. D. & Johnson, T. MCALIGN2: Faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution. BMC Bioinformatics 7, 292 (2006).

245.

Ruffalo, M., LaFramboise, T. & Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27, 2790–2796 (2011).

246.

Schbath, S. et al. Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis. J Comput Biol 19, 796–813 (2012).

247.

Hatem, A., Bozdağ, D., Toland, A. E. & Çatalyürek, Ü. V. Benchmarking short sequence mapping tools. BMC Bioinformatics 14, 184 (2013).

248.

Canzar, S. & Salzberg, S. L. Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE 105, 436–458 (2017).

249.

Alser, M. et al. Technology dictates algorithms: Recent developments in read alignment. Genome Biology 22, 249 (2021).

250.

Břinda, K., Boeva, V. & Kucherov, G. RNF: A general framework to evaluate NGS read mappers. Bioinformatics 32, 136–139 (2016).

251.

Lin, H.-N. & Hsu, W.-L. Kart: A divide-and-conquer algorithm for NGS read alignment. Bioinformatics 33, 2281–2287 (2017).

252.

Olson, C. B. et al. Hardware Acceleration of Short Read Mapping. in 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines 161–168 (2012). doi:10.1109/fccm.2012.36.

253.

Chen, P., Wang, C., Li, X. & Zhou, X. Accelerating the Next Generation Long Read Mapping with the FPGA-Based System. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11, 840–852 (2014).

254.

Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 45 (2018).

255.

Zeni, A. et al. LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment. in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 462–471 (2020). doi:10.1109/ipdps47924.2020.00055.

256.

Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): Application and theory. BMC Bioinformatics 13, 238 (2012).

257.

Haghshenas, E., Sahinalp, S. C. & Hach, F. lordFAST: Sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data. Bioinformatics 35, 20–27 (2019).

258.

Sović, I. et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun 7, 11307 (2016).

259.

Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15, 461–468 (2018).

260.

Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J Comput Biol 25, 766–779 (2018).

261.

Prodanov, T. & Bansal, V. Sensitive alignment using paralogous sequence variants improves long-read mapping and variant calling in segmental duplications. Nucleic Acids Research 48, e114 (2020).

262.

Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 19, 705–710 (2022).

263.

Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: Mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).

264.

Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).

265.

Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

266.

Understanding MAPQ scores in SAM files: Does 37 = 42? http://www.acgt.me/blog/2014/12/16/understanding-mapq-scores-in-sam-files-does-37-42.

267.

Lee, H. & Schatz, M. C. Genomic dark matter: The reliability of short read mapping illustrated by the genome mappability score. Bioinformatics 28, 2097–2105 (2012).

268.

Langmead, B. A tandem simulation framework for predicting mapping quality. Genome Biology 18, 152 (2017).

269.

Ruffalo, M., Koyutürk, M., Ray, S. & LaFramboise, T. Accurate estimation of short read mapping quality for next-generation genome sequencing. Bioinformatics 28, i349–i355 (2012).

270.

Multiple Sequence Alignment Methods. vol. 1079 (Humana Press, 2014).

271.

Wang, L. & Jiang, T. On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology 1, 337–348 (1994).

272.

Just, W. Computational Complexity of Multiple Sequence Alignment with SP-Score. Journal of Computational Biology 8, 615–623 (2001).

273.

Tang, F. et al. HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences. Molecular Biology and Evolution 39, msac166 (2022).

274.

Feng, D.-F. & Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25, 351–360 (1987).

275.

Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Bioinformatics 8, 275–282 (1992).

276.

Blaisdell, B. E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences 83, 5155–5159 (1986).

277.

Gronau, I. & Moran, S. Optimal implementations of UPGMA and other common clustering algorithms. Information Processing Letters 104, 205–210 (2007).

278.

Saitou, N. & Nei, M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425 (1987).

279.

Katoh, K. & Toh, H. PartTree: An algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374 (2007).

280.

Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7, 539 (2011).

281.

Blackshields, G., Sievers, F., Shi, W., Wilm, A. & Higgins, D. G. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 5, 21 (2010).

282.

Altschul, S. F. Gap costs for multiple sequence alignment. Journal of Theoretical Biology 138, 297–309 (1989).

283.

Altschul, S. F., Carroll, R. J. & Lipman, D. J. Weights for data related by a tree. Journal of Molecular Biology 207, 647–653 (1989).

284.

Edgar, R. C. & Sjölander, K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20, 1301–1308 (2004).

285.

Notredame, C., Holm, L. & Higgins, D. G. COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14, 407–422 (1998).

286.

Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302, 205–217 (2000).

287.

Edgar, R. C. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).

288.

Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32, 1792–1797 (2004).

289.

Do, C. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15, 330–340 (2005).

290.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680 (1994).

291.

Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. The CLUSTAL_X Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools. Nucleic Acids Research 25, 4876–4882 (1997).

292.

Liu, Y., Schmidt, B. & Maskell, D. L. MSAProbs: Multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26, 1958–1964 (2010).

293.

Lemoine, F., Blassel, L., Voznica, J. & Gascuel, O. COVID-Align: Accurate online alignment of hCoV-19 genomes using a profile HMM. Bioinformatics (2020) doi:10.1093/bioinformatics/btaa871.

294.

Eddy, S. R. Multiple Alignment Using Hidden Markov Models. in International Conference on Intelligent Systems for Molecular Biology 7 (1995).

295.

Kim, J., Pramanik, S. & Chung, M. J. Multiple sequence alignment using simulated annealing. Bioinformatics 10, 419–426 (1994).

296.

Ishikawa, M. et al. Multiple sequence alignment by parallel simulated annealing. Bioinformatics 9, 267–273 (1993).

297.

Huo, H. & Stojkovic, V. A simulated annealing algorithm for multiple sequence alignment with guaranteed accuracy. in Third International Conference on Natural Computation (ICNC 2007) vol. 2 270–274 (2007).

298.

Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).

299.

Zhang, C. & Wong, A. K. C. A genetic algorithm for multiple molecular sequence alignment. Bioinformatics 13, 565–581 (1997).

300.

Naznin, F., Sarker, R. & Essam, D. Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment. BMC Bioinformatics 12, 353 (2011).

301.

Naznin, F., Sarker, R. & Essam, D. Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment. IEEE Transactions on Evolutionary Computation 16, 615–631 (2012).

302.

Notredame, C. & Higgins, D. G. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res 24, 1515–1524 (1996).

303.

Aksamentov, I., Roemer, C., Hodcroft, E. & Neher, R. Nextclade: Clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software 6, 3773 (2021).

304.

Garriga, E. et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol 37, 1466–1470 (2019).

305.

Notredame, C. Recent Evolutions of Multiple Sequence Alignment Algorithms. PLOS Computational Biology 3, e123 (2007).

306.

Notredame, C. Recent progress in multiple sequence alignment: A survey. Pharmacogenomics 3, 131–144 (2002).

307.

Edgar, R. C. & Batzoglou, S. Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373 (2006).

308.

Pais, F. S.-M., Ruy, P. de C., Oliveira, G. & Coimbra, R. S. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol 9, 4 (2014).

309.

Thompson, J. D., Plewniak, F. & Poch, O. BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88 (1999).

310.

Bragg, L., Stone, G., Imelfort, M., Hugenholtz, P. & Tyson, G. W. Fast, accurate error-correction of amplicon pyrosequences using Acacia. Nat Methods 9, 425–426 (2012).

311.

Sahlin, K. & Medvedev, P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun 12, 2 (2021).

312.

Liu, H. et al. SMARTdenovo: A de novo assembler using long noisy reads. Gigabyte 2021, 1–9 (2021).

313.

Graham, R. L., Knuth, D. E. & Patashnik, O. Concrete mathematics: A foundation for computer science. (Addison-Wesley, 1994).

314.

Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).

315.

Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).

316.

Li, H. New strategies to improve Minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).

317.

Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15, 595–597 (2018).

318.

Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience 6, (2017).

319.

Martin, J. A. & Wang, Z. Next-generation transcriptome assembly. Nat Rev Genet 12, 671–682 (2011).

320.

Kyriakidou, M., Tai, H. H., Anglin, N. L., Ellis, D. & Strömvik, M. V. Current Strategies of Polyploid Plant Genome Sequence Assembly. Frontiers in Plant Science 9, (2018).

321.

Paszkiewicz, K. & Studholme, D. J. De novo assembly of short sequence reads. Briefings in Bioinformatics 11, 457–472 (2010).

322.

Sohn, J. & Nam, J.-W. The present and future of de novo whole-genome assembly. Briefings in Bioinformatics 19, 23–40 (2018).

323.

Sleator, R. D. & Walsh, P. An overview of in silico protein function prediction. Arch Microbiol 192, 151–155 (2010).

324.

Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Medicine 12, 91 (2020).

325.

Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat Rev Genet 12, 363–376 (2011).

326.

Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat Rev Genet 21, 171–189 (2020).

327.

Morrison, D. A. Phylogenetic tree-building. International Journal for Parasitology 26, 589–617 (1996).

328.

Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat Rev Genet 21, 428–444 (2020).

329.

Kuhlman, B. & Bradley, P. Advances in protein structure prediction and design. Nat Rev Mol Cell Biol 20, 681–697 (2019).

330.

Ammad-ud-din, M., Khan, S. A., Wennerberg, K. & Aittokallio, T. Systematic identification of feature combinations for predicting drug response with Bayesian multi-view multi-task linear regression. Bioinformatics 33, i359–i368 (2017).

331.

Steiner, M. C., Gibson, K. M. & Crandall, K. A. Drug Resistance Prediction Using Deep Learning Techniques on HIV-1 Sequence Data. Viruses 12, 560 (2020).

332.

Noé, F., De Fabritiis, G. & Clementi, C. Machine learning for protein folding and dynamics. Current Opinion in Structural Biology 60, 77–84 (2020).

333.

Pearce, R. & Zhang, Y. Toward the solution of the protein structure prediction problem. Journal of Biological Chemistry 297, (2021).

334.

Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).

335.

Cheng, J., Tegge, A. N. & Baldi, P. Machine Learning Methods for Protein Structure Prediction. IEEE Reviews in Biomedical Engineering 1, 41–49 (2008).

336.

AlQuraishi, M. Machine learning in protein structure prediction. Current Opinion in Chemical Biology 65, 1–8 (2021).

337.

Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Current Opinion in Structural Biology 69, 11–18 (2021).

338.

Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019).

339.

Li, G., Dong, Y. & Reetz, M. T. Can Machine Learning Revolutionize Directed Evolution of Selective Enzymes? Advanced Synthesis & Catalysis 361, 2377–2386 (2019).

340.

Xie, R., Wen, J., Quitadamo, A., Cheng, J. & Shi, X. A deep auto-encoder model for gene expression prediction. BMC Genomics 18, 845 (2017).

341.

Ortuño, F. M. et al. Comparing different machine learning and mathematical regression models to evaluate multiple sequence alignments. Neurocomputing 164, 123–136 (2015).

342.

Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Computational Biology 13, e1005324 (2017).

343.

Haga, H. et al. A machine learning-based treatment prediction model using whole genome variants of hepatitis C virus. PLOS ONE 15, e0242028 (2020).

344.

Zazzi, M. et al. Predicting Response to Antiretroviral Treatment by Machine Learning: The EuResist Project. Intervirology 55, 123–127 (2012).

345.

Ren, Y. et al. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics 38, 325–334 (2022).

346.

Kim, J. I. et al. Machine Learning for Antimicrobial Resistance Prediction: Current Practice, Limitations, and Clinical Perspective. Clinical Microbiology Reviews 35, e00179-21 (2022).

347.

Wang, Y. et al. Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks. Sci Rep 6, 19598 (2016).

348.

Rätsch, G., Sonnenburg, S. & Schäfer, C. Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics 7, S9 (2006).

349.

Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202 (1999).

350.

Alioto, T. Gene Prediction. in Evolutionary Genomics: Statistical and Computational Methods, Volume 1 (ed. Anisimova, M.) 175–201 (Humana Press, 2012). doi:10.1007/978-1-61779-582-4_6.

351.

Fang, Z. et al. PlasGUN: Gene prediction in plasmid metagenomic short reads using deep learning. Bioinformatics 36, 3239–3241 (2020).

352.

Wei, L., Ding, Y., Su, R., Tang, J. & Zou, Q. Prediction of human protein subcellular localization using deep learning. Journal of Parallel and Distributed Computing 117, 212–217 (2018).

353.

Wang, H., Yan, L., Huang, H. & Ding, C. From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14, 503–513 (2017).

354.

Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990–999 (2016).

355.

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. (Springer Science & Business Media, 2009).

356.

Kriventseva, E. V., Biswas, M. & Apweiler, R. Clustering and analysis of protein families. Current Opinion in Structural Biology 11, 334–339 (2001).

357.

Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).

358.

Balaban, M., Moshiri, N., Mai, U., Jia, X. & Mirarab, S. TreeCluster: Clustering biological sequences using phylogenetic trees. PLOS ONE 14, e0221068 (2019).

359.

Zorita, E., Cuscó, P. & Filion, G. J. Starcode: Sequence clustering based on all-pairs search. Bioinformatics 31, 1913–1919 (2015).

360.

Ondov, B. D. et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016).

361.

Baker, D. N. & Langmead, B. Dashing: Fast and accurate genomic distances with HyperLogLog. Genome Biology 20, 265 (2019).

362.

Corso, G. et al. Neural Distance Embeddings for Biological Sequences. in Advances in Neural Information Processing Systems vol. 34 18539–18551 (Curran Associates, Inc., 2021).

363.

Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat Biotechnol 35, 128–135 (2017).

364.

Castro, B. M., Lemes, R. B., Cesar, J., Hünemeier, T. & Leonardi, F. A model selection approach for multiple sequence segmentation and dimensionality reduction. Journal of Multivariate Analysis 167, 319–330 (2018).

365.

Haschka, T., Ponger, L., Escudé, C. & Mozziconacci, J. MNHN-Tree-Tools: A toolbox for tree inference using multi-scale clustering of a set of sequences. Bioinformatics 37, 3947–3949 (2021).

366.

Konishi, T. et al. Principal Component Analysis applied directly to Sequence Matrix. Sci Rep 9, 19297 (2019).

367.

Ben-Hur, A. & Guyon, I. Detecting Stable Clusters Using Principal Component Analysis. in Functional Genomics: Methods and Protocols (eds. Brownstein, M. J. & Khodursky, A. B.) 159–182 (Humana Press, 2003). doi:10.1385/1-59259-364-x:159.

368.

Ding, C. & He, X. K-means clustering via principal component analysis. in Proceedings of the twenty-first international conference on Machine learning 29 (Association for Computing Machinery, 2004). doi:10.1145/1015330.1015408.

369.

Casari, G., Sander, C. & Valencia, A. Sequencespace: A tool for family analysis. Nat. Struct. Biol 2, 171–178 (1995).

370.

Clamp, M., Cuff, J., Searle, S. M. & Barton, G. J. The Jalview Java alignment editor. Bioinformatics 20, 426–427 (2004).

371.

Xia, Z., Wu, L.-Y., Zhou, X. & Wong, S. T. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Systems Biology 4, S6 (2010).

372.

Tamposis, I. A., Tsirigos, K. D., Theodoropoulou, M. C., Kontou, P. I. & Bagos, P. G. Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics 35, 2208–2215 (2019).

373.

Elnaggar, A. et al. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence vol. PP, 1–1 (2021).

374.

Lu, A. X. et al. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLOS Computational Biology 18, e1010238 (2022).

375.

Townshend, R., Bedi, R., Suriana, P. & Dror, R. End-to-End Learning on 3D Protein Structure for Interface Prediction. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).

376.

Lee, B., Baek, J., Park, S. & Yoon, S. deepTarget: End-to-end Learning Framework for microRNA Target Prediction using Deep Recurrent Neural Networks. in Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 434–442 (Association for Computing Machinery, 2016). doi:10.1145/2975167.2975212.

377.

Goodfellow, I., Bengio, Y. & Courville, A. Deep learning. (MIT Press, 2016).

378.

Wang, Q., Ma, Y., Zhao, K. & Tian, Y. A Comprehensive Survey of Loss Functions in Machine Learning. Ann. Data. Sci. 9, 187–212 (2022).

379.

Jiao, Y. & Du, P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol 4, 320–330 (2016).

380.

Brodersen, K. H., Ong, C. S., Stephan, K. E. & Buhmann, J. M. The Balanced Accuracy and Its Posterior Distribution. in 2010 20th International Conference on Pattern Recognition 3121–3124 (2010). doi:10.1109/icpr.2010.764.

381.

Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: Formulation, detection, and avoidance. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining 556–563 (Association for Computing Machinery, 2011). doi:10.1145/2020408.2020496.

382.

Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 23, 169–181 (2022).

383.

Fisher, R. A. On the Interpretation of χ² from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85, 87–94 (1922).

384.

Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 157–175 (1900).

385.

Hoerl, A. E. & Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55–67 (1970).

386.

Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).

387.

Zhang, H. The Optimality of Naive Bayes. in Proceedings of the 17th international FLAIRS conference (FLAIRS2004) 6 (2004).

388.

Rish, I. An empirical study of the naive Bayes classifier. in IJCAI 2001 workshop on empirical methods in artificial intelligence vol. 3 6 (2001).

389.

Vapnik, V. Estimation of Dependences Based on Empirical Data. (Springer-Verlag, 1982).

390.

Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. in Proceedings of the fifth annual workshop on Computational learning theory 144–152 (Association for Computing Machinery, 1992). doi:10.1145/130385.130401.

391.

Cortes, C. & Vapnik, V. Support-vector networks. Mach Learn 20, 273–297 (1995).

392.

Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. & Vapnik, V. Support Vector Regression Machines. in Advances in Neural Information Processing Systems vol. 9 (MIT Press, 1996).

393.

Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

394.

Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and regression trees. (1983). doi:10.1201/9781315139470.

395.

Kingsford, C. & Salzberg, S. L. What are decision trees? Nat Biotechnol 26, 1011–1013 (2008).

396.

Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in Proceedings of the 23rd international conference on Machine learning 161–168 (Association for Computing Machinery, 2006). doi:10.1145/1143844.1143865.

397.

Yang, P., Yang, Y. H., Zhou, B. B. & Zomaya, A. Y. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics 5, 296–308 (2010).

398.

Potdar, K., Pardawala, T. S. & Pai, C. D. A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. International Journal of Computer Applications 175, 7–9 (2017).

399.

Hassani Saadi, H., Sameni, R. & Zollanvari, A. Interpretive time-frequency analysis of genomic sequences. BMC Bioinformatics 18, 154 (2017).

400.

Brouwer, R. K. A feed-forward network for input that is both categorical and quantitative. Neural Networks 15, 881–890 (2002).

401.

Kunanbayev, K., Temirbek, I. & Zollanvari, A. Complex Encoding. in 2021 International Joint Conference on Neural Networks (IJCNN) 1–6 (2021). doi:10.1109/ijcnn52387.2021.9534094.

402.

Dufresne, Y. et al. The K-mer File Format: A standardized and compact disk representation of sets of k-mers. Bioinformatics btac528 (2022) doi:10.1093/bioinformatics/btac528.

403.

Wright, E. S. Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R. The R Journal 8, 352–359 (2016).

404.

Zamani, M. & Kremer, S. C. Amino acid encoding schemes for machine learning methods. in 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW) 327–333 (2011). doi:10.1109/bibmw.2011.6112394.

405.

Singh, D., Singh, P. & Sisodia, D. S. Evolutionary based optimal ensemble classifiers for HIV-1 protease cleavage sites prediction. Expert Systems with Applications 109, 86–99 (2018).

406.

Qian, N. & Sejnowski, T. J. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 202, 865–884 (1988).

407.

Budach, S. & Marsico, A. Pysster: Classification of biological sequences by learning sequence and structure motifs with convolutional neural networks. Bioinformatics 34, 3035–3037 (2018).

408.

Choong, A. C. H. & Lee, N. K. Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. in 2017 International Conference on Computer and Drone Applications (IConDA) 60–65 (2017). doi:10.1109/iconda.2017.8270400.

409.

McGinnis, W. et al. Scikit-Learn-Contrib/Categorical-Encoding: Release For Zenodo. (2018) doi:10.5281/zenodo.1157110.

410.

Kawashima, S. et al. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36, D202–D205 (2008).

411.

Li, Z.-C., Zhou, X.-B., Dai, Z. & Zou, X.-Y. Prediction of protein structural classes by Chou’s pseudo amino acid composition: Approached using continuous wavelet transform and principal component analysis. Amino Acids 37, 415 (2008).

412.

Nanni, L. & Lumini, A. A new encoding technique for peptide classification. Expert Systems with Applications 38, 3185–3191 (2011).

413.

Chen, Z. et al. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502 (2018).

414.

Taylor, W. R. The classification of amino acid conservation. Journal of Theoretical Biology 119, 205–218 (1986).

415.

Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. Journal of Molecular Biology 195, 957–961 (1987).

416.

Kremer, S. & Lac, H. Method, system and computer program product for levinthal process induction from known structure using machine learning. (2009).

417.

Maetschke, S., Towsey, M. & Bodén, M. Blomap: An encoding of amino acids which improves signal peptide cleavage site prediction. in Proceedings of the 3rd Asia-Pacific Bioinformatics Conference vol. 1 141–150 (Imperial College Press, 2005).

418.

Gök, M. & Özcerit, A. T. A new feature encoding scheme for HIV-1 protease cleavage site prediction. Neural Comput & Applic 22, 1757–1761 (2013).

419.

Saha, S. & Bhattacharya, T. A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining. in International Conference on Innovative Computing and Communications (eds. Bhattacharyya, S., Hassanien, A. E., Gupta, D., Khanna, A. & Pan, I.) 101–108 (Springer, 2019). doi:10.1007/978-981-13-2354-6_12.

420.

Jeffrey, H. J. Chaos game representation of gene structure. Nucleic Acids Research 18, 2163–2170 (1990).

421.

Löchel, H. F. & Heider, D. Chaos game representation and its applications in bioinformatics. Computational and Structural Biotechnology Journal 19, 6263–6271 (2021).

422.

Cartes, J. A., Anand, S., Ciccolella, S., Bonizzoni, P. & Della Vedova, G. Accurate and Fast Clade Assignment via Deep Learning and Frequency Chaos Game Representation. 2022.06.13.495912 (2022) doi:10.1101/2022.06.13.495912.

423.

Ni, H., Mu, H. & Qi, D. Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses. Journal of Molecular Graphics and Modelling 107, 107942 (2021).

424.

Lwoff, A. The concept of virus. J Gen Microbiol 17, 239–253 (1957).

425.

Minor, P. D. Viruses. in eLS (John Wiley & Sons, Ltd, 2014). doi:10.1002/9780470015902.a0000441.pub3.

426.

Stapleton, J. T., Foung, S., Muerhoff, A. S., Bukh, J. & Simmonds, P. The GB viruses: A review and proposed classification of GBV-A, GBV-C (HGV), and GBV-D in genus Pegivirus within the family Flaviviridae. J Gen Virol 92, 233–246 (2011).

427.

Yamamoto, N. et al. Characterization of a non-pathogenic H5N1 influenza virus isolated from a migratory duck flying from Siberia in Hokkaido, Japan, in October 2009. Virology Journal 8, 65 (2011).

428.

Shi, M. et al. The evolutionary history of vertebrate RNA viruses. Nature 556, 197–202 (2018).

429.

Adams, J. R. & Bonami, J.-R. Atlas of Invertebrate Viruses. (CRC Press, 2017). doi:10.1201/9781315149929.

430.

Lefeuvre, P. et al. Evolution and ecology of plant viruses. Nat Rev Microbiol 17, 632–644 (2019).

431.

Wang, A. L. & Wang, C. C. Viruses of parasitic protozoa. Parasitology Today 7, 76–80 (1991).

432.

Fermin, G., Mazumdar-Leighton, S. & Tennant, P. Viruses of prokaryotes, protozoa, fungi, and chromista. in Viruses: Molecular Biology, Host Interactions, and Applications to Biotechnology 217 (Academic Press, 2018). doi:10.1016/B978-0-12-811257-1.00009-7.

433.

Sutela, S., Poimala, A. & Vainio, E. J. Viruses of fungi and oomycetes in the soil environment. FEMS Microbiology Ecology 95, fiz119 (2019).

434.

Twort, F. W. An Investigation On The Nature Of Ultra-microscopic Viruses. The Lancet 186, 1241–1243 (1915).

435.

Delbrück, M. Bacterial Viruses or Bacteriophages. Biological Reviews 21, 30–40 (1946).

436.

Clark, J. R. & March, J. B. Bacterial viruses as human vaccines? Expert Review of Vaccines 3, 463–476 (2004).

437.

van Kan-Davelaar, H. E., van Hest, J. C. M., Cornelissen, J. J. L. M. & Koay, M. S. T. Using viruses as nanomedicines. British Journal of Pharmacology 171, 4001–4009 (2014).

438.

Prangishvili, D., Basta, T., Garrett, R. A. & Krupovic, M. Viruses of the Archaea. in eLS 1–9 (John Wiley & Sons, Ltd, 2016). doi:10.1002/9780470015902.a0000774.pub3.

439.

Prangishvili, D., Forterre, P. & Garrett, R. A. Viruses of the Archaea: A unifying view. Nat Rev Microbiol 4, 837–848 (2006).

440.

Francki, R. I. B. Plant virus satellites. Annual Review of Microbiology (1985).

441.

Xu, P. & Roossinck, M. J. Plant Virus Satellites. in eLS (John Wiley & Sons, Ltd, 2011). doi:10.1002/9780470015902.a0000771.pub2.

442.

Lai, M. M. The molecular biology of hepatitis delta virus. Annu Rev Biochem 64, 259–286 (1995).

443.

Hughes, S. A., Wedemeyer, H. & Harrison, P. M. Hepatitis delta virus. The Lancet 378, 73–85 (2011).

444.

Desnues, C., Boyer, M. & Raoult, D. Chapter 3 - Sputnik, a Virophage Infecting the Viral Domain of Life. in Advances in Virus Research (eds. Łobocka, M. & Szybalski, W. T.) vol. 82 63–89 (Academic Press, 2012).

445.

Gaia, M. et al. Zamilon, a Novel Virophage with Mimiviridae Host Specificity. PLOS ONE 9, e94923 (2014).

446.

Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).

447.

Nasir, A., Romero-Severson, E. & Claverie, J.-M. Investigating the Concept and Origin of Viruses. Trends in Microbiology 28, 959–967 (2020).

448.

Forterre, P. & Prangishvili, D. The origin of viruses. Research in Microbiology 160, 466–472 (2009).

449.

Forterre, P. The origin of viruses and their possible roles in major evolutionary transitions. Virus Research 117, 5–16 (2006).

450.

Boeke, J. & Stoye, J. Retrotransposons, Endogenous Retroviruses, and the Evolution of Retroelements. in Retroviruses (eds. Coffin, J. M., Hughes, S. H. & Varmus, H. E.) (Cold Spring Harbor Laboratory Press, 1997).

451.

Kojima, S. et al. Virus-like insertions with sequence signatures similar to those of endogenous nonretroviral RNA viruses in the human genome. Proceedings of the National Academy of Sciences 118, e2010758118 (2021).

452.

Löwer, R., Löwer, J. & Kurth, R. The viruses in all of us: Characteristics and biological significance of human endogenous retrovirus sequences. Proceedings of the National Academy of Sciences 93, 5177–5184 (1996).

453.

Griffiths, D. J. Endogenous retroviruses in the human genome sequence. Genome Biol 2, reviews1017.1 (2001).

454.

Baltimore, D. Expression of animal virus genomes. Bacteriol Rev 35, 235–241 (1971).

455.

Koonin, E. V., Krupovic, M. & Agol, V. I. The Baltimore Classification of Viruses 50 Years Later: How Does It Stand in the Light of Virus Evolution? Microbiology and Molecular Biology Reviews 85, e00053-21 (2021).

456.

Domingo, E. & Perales, C. RNA Virus Genomes. in eLS 1–12 (John Wiley & Sons, Ltd, 2018). doi:10.1002/9780470015902.a0001488.pub3.

457.

McGeoch, D. J., Rixon, F. J. & Davison, A. J. Topics in herpesvirus genomics and evolution. Virus Research 117, 90–104 (2006).

458.

Boehmer, P. & Nimonkar, A. Herpes Virus Replication. IUBMB Life 55, 13–22 (2003).

459.

Brentjens, M. H., Yeung-Yue, K. A., Lee, P. C. & Tyring, S. K. Human papillomavirus: A review. Dermatologic Clinics 20, 315–331 (2002).

460.

Kay, A. & Zoulim, F. Hepatitis B virus genetic variability and evolution. Virus Research 127, 164–176 (2007).

461.

Parashar, U. D., Bresee, J. S., Gentsch, J. R. & Glass, R. I. Rotavirus. Emerg Infect Dis 4, 561–570 (1998).

462.

Simmonds, P. Variability of hepatitis C virus. Hepatology 21, 570–583 (1995).

463.

Wimmer, E., Hellen, C. U. T. & Cao, X. Genetics of poliovirus. Annual Review of Genetics 27, 353–437 (1993).

464.

Racaniello, V. R. One hundred years of poliovirus pathogenesis. Virology 344, 9–16 (2006).

465.

Palese, P., Zheng, H., Engelhardt, O. G., Pleschka, S. & García-Sastre, A. Negative-strand RNA viruses: Genetic engineering and applications. Proceedings of the National Academy of Sciences 93, 11354–11358 (1996).

466.

Domingo, E. & Perales, C. Virus Evolution. in eLS (John Wiley & Sons, Ltd, 2014). doi:10.1002/9780470015902.a0000436.pub3.

467.

V’kovski, P., Kratzel, A., Steiner, S., Stalder, H. & Thiel, V. Coronavirus biology and replication: Implications for SARS-CoV-2. Nat Rev Microbiol 19, 155–170 (2021).

468.

Bäck, A. T. & Lundkvist, Å. Dengue viruses – an overview. Infect Ecol Epidemiol 3, (2013). doi:10.3402/iee.v3i0.19839.

469.

Dustin, L. B., Bartolini, B., Capobianchi, M. R. & Pistello, M. Hepatitis C virus: Life cycle in cells, infection and host response, and analysis of molecular markers influencing the outcome of infection and response to therapy. Clin Microbiol Infect 22, 826–832 (2016).

470.

Kadaja, M., Silla, T., Ustav, E. & Ustav, M. Papillomavirus DNA replication — From initiation to genomic instability. Virology 384, 360–368 (2009).

471.

Weller, S. K. & Coen, D. M. Herpes Simplex Viruses: Mechanisms of DNA Replication. Cold Spring Harb Perspect Biol 4, a013011 (2012).

472.

Beck, J. & Nassal, M. Hepatitis B virus replication. World J Gastroenterol 13, 48–64 (2007).

473.

Pyle, J. D. & Scholthof, K.-B. G. Chapter 58 - Biology and Pathogenesis of Satellite Viruses. in Viroids and Satellites (eds. Hadidi, A., Flores, R., Randles, J. W. & Palukaitis, P.) 627–636 (Academic Press, 2017). doi:10.1016/b978-0-12-801498-1.00058-9.

474.

Raoult, D. et al. The 1.2-megabase genome sequence of Mimivirus. Science 306, 1344–1350 (2004).

475.

Campillo-Balderas, J. A., Lazcano, A. & Becerra, A. Viral Genome Size Distribution Does not Correlate with the Antiquity of the Host Lineages. Frontiers in Ecology and Evolution 3, (2015).

476.

Cann, A. J. Virus Structure. in eLS 1–9 (John Wiley & Sons, Ltd, 2015). doi:10.1002/9780470015902.a0000439.pub2.

477.

Hladik, F. & McElrath, M. J. Setting the stage: Host invasion by HIV. Nat Rev Immunol 8, 447–457 (2008).

478.

Shaw, G. M. & Hunter, E. HIV Transmission. Cold Spring Harb Perspect Med 2, a006965 (2012).

479.

Weiss, R. A. How Does HIV Cause AIDS? Science 260, 1273–1279 (1993).

480.

Melhuish, A. & Lewthwaite, P. Natural history of HIV and AIDS. Medicine 46, 356–361 (2018).

481.

Murray, J. F. et al. Pulmonary complications of the acquired immunodeficiency syndrome. New England Journal of Medicine 310, 1682–1688 (1984).

482.

Sampath, S. et al. Pandemics Throughout the History. Cureus 13, (2021).

483.

World Health Organization. Global report: UNAIDS report on the global AIDS epidemic 2010. (World Health Organization, 2010).

484.

Barré-Sinoussi, F. et al. Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS). Science 220, 868–871 (1983).

485.

Gallo, R. C. et al. Isolation of human T-cell leukemia virus in acquired immune deficiency syndrome (AIDS). Science 220, 865–867 (1983).

486.

Clavel, F. et al. Isolation of a New Human Retrovirus from West African Patients with AIDS. Science 233, 343–346 (1986).

487.

Gilbert, P. B. et al. Comparison of HIV-1 and HIV-2 infectivity from a prospective cohort study in Senegal. Statistics in Medicine 22, 573–593 (2003).

488.

van der Loeff, M. F. S. et al. Sixteen years of HIV surveillance in a West African research clinic reveals divergent epidemic trends of HIV-1 and HIV-2. Int J Epidemiol 35, 1322–1328 (2006).

489.

Gao, F. et al. Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397, 436–441 (1999).

490.

Hamel, D. J. et al. Twenty years of prospective molecular epidemiology in Senegal: Changes in HIV diversity. AIDS Res Hum Retroviruses 23, 1189–1196 (2007).

491.

Sharp, P. M. & Hahn, B. H. Origins of HIV and the AIDS Pandemic. Cold Spring Harb Perspect Med 1, a006841 (2011).

492.

Hirsch, V. M., Olmsted, R. A., Murphey-Corb, M., Purcell, R. H. & Johnson, P. R. An African primate lentivirus (SIVsm) closely related to HIV-2. Nature 339, 389–392 (1989).

493.

Gao, F. et al. Human infection by genetically diverse SIVSM-related HIV-2 in West Africa. Nature 358, 495–499 (1992).

494.

Chen, Z. et al. Genetic characterization of new West African simian immunodeficiency virus SIVsm: Geographic clustering of household-derived SIV strains with human immunodeficiency virus type 2 subtypes and genetically diverse viruses from a single feral sooty mangabey troop. J Virol 70, 3617–3627 (1996).

495.

Hemelaar, J. The origin and diversity of the HIV-1 pandemic. Trends in Molecular Medicine 18, 182–192 (2012).

496.

Worobey, M. et al. Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455, 661–664 (2008).

497.

Vidal, N. et al. Unprecedented degree of human immunodeficiency virus type 1 (HIV-1) group M genetic diversity in the Democratic Republic of Congo suggests that the HIV-1 pandemic originated in Central Africa. J Virol 74, 10498–10507 (2000).

498.

Faria, N. R. et al. The early spread and epidemic ignition of HIV-1 in human populations. Science 346, 56–61 (2014).

499.

Korber, B. et al. Timing the ancestor of the HIV-1 pandemic strains. Science 288, 1789–1796 (2000).

500.

Rambaut, A., Posada, D., Crandall, K. A. & Holmes, E. C. The causes and consequences of HIV evolution. Nat Rev Genet 5, 52–61 (2004).

501.

McCutchan, F. E. Global epidemiology of HIV. Journal of Medical Virology 78, S7–S12 (2006).

502.

Pérez-Losada, M., Arenas, M., Galán, J. C., Palero, F. & González-Candelas, F. Recombination in viruses: Mechanisms, methods of study, and evolutionary consequences. Infect Genet Evol 30, 296–307 (2015).

503.

Robertson, D. L., Hahn, B. H. & Sharp, P. M. Recombination in AIDS viruses. J Mol Evol 40, 249–259 (1995).

504.

HIV Circulating Recombinant Forms (CRFs). https://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html.

505.

Lau, K. A. & Wong, J. J. L. Current Trends of HIV Recombination Worldwide. Infect Dis Rep 5, e4 (2013).

506.

Posada, D., Crandall, K. A. & Holmes, E. C. Recombination in evolutionary genomics. Annu Rev Genet 36, 75–97 (2002).

507.

Taylor, B. S., Sobieszczyk, M. E., McCutchan, F. E. & Hammer, S. M. The Challenge of HIV-1 Subtype Diversity. New England Journal of Medicine 358, 1590–1602 (2008).

508.

Hemelaar, J., Gouws, E., Ghys, P. D. & Osmanov, S. Global trends in molecular epidemiology of HIV-1 during 2000–2007. AIDS 25, 679–689 (2011).

509.

Distribution of all HIV-1 sequences: WORLD. https://www.hiv.lanl.gov/components/sequence/HIV/geo/geo.comp.

510.

Freed, E. O. HIV-1 Replication. Somat Cell Mol Genet 26, 13–33 (2001).

511.

Ferguson, M. R., Rojo, D. R., von Lindern, J. J. & O’Brien, W. A. HIV-1 replication cycle. Clin Lab Med 22, 611–635 (2002).

512.

Gougeon, M. L., Laurent-Crawford, A. G., Hovanessian, A. G. & Montagnier, L. Direct and indirect mechanisms mediating apoptosis during HIV infection: Contribution to in vivo CD4 T cell depletion. Seminars in Immunology 5, 187–194 (1993).

513.

Vidya Vijayan, K. K., Karthigeyan, K. P., Tripathi, S. P. & Hanna, L. E. Pathophysiology of CD4+ T-Cell Depletion in HIV-1 and HIV-2 Infections. Front Immunol 8, 580 (2017).

514.

Frankel, A. D. & Young, J. A. HIV-1: Fifteen proteins and an RNA. Annu Rev Biochem 67, 1–25 (1998).

515.

Fossen, T. et al. Solution Structure of the Human Immunodeficiency Virus Type 1 p6 Protein. Journal of Biological Chemistry 280, 42515–42527 (2005).

516.

Göttlinger, H. G., Dorfman, T., Sodroski, J. G. & Haseltine, W. A. Effect of mutations affecting the p6 gag protein on human immunodeficiency virus particle release. Proceedings of the National Academy of Sciences 88, 3195–3199 (1991).

517.

Huang, M., Orenstein, J. M., Martin, M. A. & Freed, E. O. p6Gag is required for particle production from full-length human immunodeficiency virus type 1 molecular clones expressing protease. Journal of Virology 69, 6810–6818 (1995).

518.

Bour, S., Geleziunas, R. & Wainberg, M. A. The human immunodeficiency virus type 1 (HIV-1) CD4 receptor and its central role in promotion of HIV-1 infection. Microbiological Reviews 59, 63–93 (1995).

519.

Hernandez, L. D., Hoffman, L. R., Wolfsberg, T. G. & White, J. M. Virus-cell and cell-cell fusion. Annu Rev Cell Dev Biol 12, 627–661 (1996).

520.

Jones, K. & Peterlin, B. Control of RNA Initiation and Elongation at the HIV-1 Promoter. Annu. Rev. Biochem. 63, 717–743 (1994).

521.

Hope, T. J. Viral RNA export. Chemistry & Biology 4, 335–344 (1997).

522.

Mangasarian, A. & Trono, D. The multifaceted role of HIV Nef. Research in Virology 148, 30–33 (1997).

523.

Cohen, É. A., Subbramanian, R. A. & Göttlinger, H. G. Role of Auxiliary Proteins in Retroviral Morphogenesis. in Morphogenesis and Maturation of Retroviruses (ed. Kräusslich, H.-G.) 219–235 (Springer, 1996). doi:10.1007/978-3-642-80145-7_7.

524.

Lamb, R. A. & Pinto, L. H. Do Vpu and Vpr of Human Immunodeficiency Virus Type 1 and NB of Influenza B Virus Have Ion Channel Activities in the Viral Life Cycles? Virology 229, 1–11 (1997).

525.

Khan, N. & Geiger, J. D. Role of Viral Protein U (Vpu) in HIV-1 Infection and Pathogenesis. Viruses 13, 1466 (2021).

526.

Emerman, M. HIV-1, Vpr and the cell cycle. Current Biology 6, 1096–1103 (1996).

527.

Miller, R. H. Human immunodeficiency virus may encode a novel protein on the genomic DNA plus strand. Science 239, 1420–1422 (1988).

528.

Briquet, S. & Vaquero, C. Immunolocalization Studies of an Antisense Protein in HIV-1-Infected Cells and Viral Particles. Virology 292, 177–184 (2002).

529.

Cassan, E., Arigon-Chifolleau, A.-M., Mesnard, J.-M., Gross, A. & Gascuel, O. Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic. Proceedings of the National Academy of Sciences 113, 11537–11542 (2016).

530.

Savoret, J. et al. A Pilot Study of the Humoral Response Against the AntiSense Protein (ASP) in HIV-1-Infected Patients. Frontiers in Microbiology 11, (2020).

531.

Zardecki, C. et al. PDB-101: Educational resources supporting molecular explorations through biology and medicine. Protein Science 31, 129–140 (2022).

532.

Eisinger, R. W., Dieffenbach, C. W. & Fauci, A. S. HIV Viral Load and Transmissibility of HIV Infection: Undetectable Equals Untransmittable. JAMA 321, 451–452 (2019).

533.

Palella, F. J. et al. Declining Morbidity and Mortality among Patients with Advanced Human Immunodeficiency Virus Infection. New England Journal of Medicine 338, 853–860 (1998).

534.

Forsythe, S. S. et al. Twenty Years Of Antiretroviral Therapy For People Living With HIV: Global Costs, Health Achievements, Economic Benefits. Health Affairs 38, 1163–1172 (2019).

535.

Fischl, M. A. et al. The Efficacy of Azidothymidine (AZT) in the Treatment of Patients with AIDS and AIDS-Related Complex. New England Journal of Medicine 317, 185–191 (1987).

536.

Richman, D. D. Susceptibility to nucleoside analogues of zidovudine-resistant isolates of human immunodeficiency virus. The American Journal of Medicine 88, S8–S10 (1990).

537.

Yeo, J. Y., Goh, G.-R., Su, C. T.-T. & Gan, S. K.-E. The Determination of HIV-1 RT Mutation Rate, Its Possible Allosteric Effects, and Its Implications on Drug Resistance. Viruses 12, 297 (2020).

538.

Cuevas, J. M., Geller, R., Garijo, R., López-Aldeguer, J. & Sanjuán, R. Extremely High Mutation Rate of HIV-1 In Vivo. PLOS Biology 13, e1002251 (2015).

539.

Carvajal-Rodríguez, A., Crandall, K. A. & Posada, D. Recombination favors the evolution of drug resistance in HIV-1 during antiretroviral therapy. Infect Genet Evol 7, 476–483 (2007).

540.

Gulick, R. M. et al. Treatment with indinavir, zidovudine, and lamivudine in adults with human immunodeficiency virus infection and prior antiretroviral therapy. N Engl J Med 337, 734–739 (1997).

541.

Wensing, A. M. J., van Maarseveen, N. M. & Nijhuis, M. Fifteen years of HIV Protease Inhibitors: Raising the barrier to resistance. Antiviral Research 85, 59–74 (2010).

542.

Pedersen, O. S. & Pedersen, E. B. Non-Nucleoside Reverse Transcriptase Inhibitors: The NNRTI Boom. Antivir Chem Chemother 10, 285–314 (1999).

543.

Scarsi, K. K., Havens, J. P., Podany, A. T., Avedissian, S. N. & Fletcher, C. V. HIV-1 Integrase Inhibitors: A Comparative Review of Efficacy and Safety. Drugs 80, 1649–1676 (2020).

544.

Fletcher, C. V. Enfuvirtide, a new drug for HIV infection. The Lancet 361, 1577–1578 (2003).

545.

Esté, J. A. & Telenti, A. HIV entry inhibitors. The Lancet 370, 81–88 (2007).

546.

Kilby, J. M. & Eron, J. J. Novel Therapies Based on Mechanisms of HIV-1 Cell Entry. N Engl J Med 348, 2228–2238 (2003).

547.

Yeni, P. Update on HAART in HIV. Journal of Hepatology 44, S100–S103 (2006).

548.

Palmisano, L. & Vella, S. A brief history of antiretroviral therapy of HIV infection: Success and challenges. Ann Ist Super Sanita 47, 44–48 (2011).

549.

Pennings, P. S. HIV drug resistance: Problems and perspectives. Infectious Disease Reports 5, e5 (2013).

550.

Mehta, S., Moore, R. D. & Graham, N. M. H. Potential factors affecting adherence with HIV therapy. AIDS 11, 1665–1670 (1997).

551.

Miller, N. H. Compliance with treatment regimens in chronic asymptomatic diseases. The American Journal of Medicine 102, 43–49 (1997).

552.

Chesney, M. A., Morin, M. & Sherr, L. Adherence to HIV combination therapy. Social Science & Medicine 50, 1599–1605 (2000).

553.

Aldir, I., Horta, A. & Serrado, M. Single-tablet regimens in HIV: Does it really make a difference? Current Medical Research and Opinion 30, 89–97 (2014).

554.

Grant, R. M. et al. Preexposure Chemoprophylaxis for HIV Prevention in Men Who Have Sex with Men. N Engl J Med 363, 2587–2599 (2010).

555.

Baeten, J. M. et al. Antiretroviral prophylaxis for HIV prevention in heterosexual men and women. N Engl J Med 367, 399–410 (2012).

556.

Buchbinder, S. P. & Liu, A. Pre-exposure prophylaxis and the promise of combination prevention approaches. AIDS Behav 15 Suppl 1, S72–79 (2011).

557.

Riddell, J., IV, Amico, K. R. & Mayer, K. H. HIV Preexposure Prophylaxis: A Review. JAMA 319, 1261–1268 (2018).

558.

Truvada. https://www.ema.europa.eu/en/medicines/human/EPAR/truvada (2018).

559.

About PrEP | PrEP | HIV Basics | HIV/AIDS. https://www.cdc.gov/hiv/basics/prep/about-prep.html (2022).

560.

Zolopa, A. R. The evolution of HIV treatment guidelines: Current state-of-the-art of ART. Antiviral Research 85, 241–244 (2010).

561.

World Health Organization. Consolidated guidelines on HIV prevention, testing, treatment, service delivery and monitoring: Recommendations for a public health approach. https://www.who.int/publications-detail-redirect/9789240031593 (2021).

562.

Ammaranond, P. & Sanguansittianan, S. Mechanism of HIV antiretroviral drugs progress toward drug resistance. Fundamental & Clinical Pharmacology 26, 146–161 (2012).

563.

Clavel, F. & Hance, A. J. HIV Drug Resistance. New England Journal of Medicine 350, 1023–1035 (2004).

564.

Sarafianos, S. G. et al. Structure and function of HIV-1 reverse transcriptase: Molecular mechanisms of polymerization and inhibition. J Mol Biol 385, 693–713 (2009).

565.

Goodsell, D. S., Autin, L. & Olson, A. J. Illustrate: Software for Biomolecular Illustration. Structure 27, 1716–1720.e1 (2019).

566.

Esnouf, R. M. et al. Unique features in the structure of the complex between HIV-1 reverse transcriptase and the bis(heteroaryl)piperazine (BHAP) U-90152 explain resistance mutations for this nonnucleoside inhibitor. Proc Natl Acad Sci U S A 94, 3984–3989 (1997).

567.

Hang, J. Q. et al. Activity of the isolated HIV RNase H domain and specific inhibition by N-hydroxyimides. Biochemical and Biophysical Research Communications 317, 321–329 (2004).

568.

Klumpp, K. & Mirzadegan, T. Recent Progress in the Design of Small Molecule Inhibitors of HIV RNase H. Current Pharmaceutical Design 12, 1909–1922 (2006).

569.

Menéndez-Arias, L. Mechanisms of resistance to nucleoside analogue inhibitors of HIV-1 reverse transcriptase. Virus Research 134, 124–146 (2008).

570.

Sluis-Cremer, N., Arion, D. & Parniak, M. A. Molecular mechanisms of HIV-1 resistance to nucleoside reverse transcriptase inhibitors (NRTIs). CMLS, Cell. Mol. Life Sci. 57, 1408–1422 (2000).

571.

Sarafianos, S. G. et al. Lamivudine (3TC) resistance in HIV-1 reverse transcriptase involves steric hindrance with beta-branched amino acids. Proc Natl Acad Sci U S A 96, 10027–10032 (1999).

572.

Meyer, P. R., Matsuura, S. E., Mian, A. M., So, A. G. & Scott, W. A. A mechanism of AZT resistance: An increase in nucleotide-dependent primer unblocking by mutant HIV-1 reverse transcriptase. Mol Cell 4, 35–43 (1999).

573.

Boyer, P. L., Sarafianos, S. G., Arnold, E. & Hughes, S. H. Selective Excision of AZTMP by Drug-Resistant Human Immunodeficiency Virus Reverse Transcriptase. J Virol 75, 4832–4842 (2001).

574.

Deeks, S. G. Nonnucleoside Reverse Transcriptase Inhibitor Resistance. JAIDS Journal of Acquired Immune Deficiency Syndromes 26, S25 (2001).

575.

Ren, J. & Stammers, D. K. Structural basis for drug resistance mechanisms for non-nucleoside inhibitors of HIV reverse transcriptase. Virus Research 134, 157–170 (2008).

576.

Lloyd, S. B., Kent, S. J. & Winnall, W. R. The High Cost of Fidelity. AIDS Research and Human Retroviruses 30, 8–16 (2014).

577.

Pearl, L. H. & Taylor, W. R. A structural model for the retroviral proteases. Nature 329, 351–354 (1987).

578.

Gulnik, S., Erickson, J. W. & Xie, D. HIV protease: Enzyme function and drug resistance. in Vitamins & Hormones vol. 58 213–256 (Academic Press, 2000).

579.

Silva, A. M., Cachau, R. E., Sham, H. L. & Erickson, J. W. Inhibition and catalytic mechanism of HIV-1 aspartic protease. Journal of Molecular Biology 255, 321–340 (1996).

580.

Hornak, V., Okur, A., Rizzo, R. C. & Simmerling, C. HIV-1 protease flaps spontaneously open and reclose in molecular dynamics simulations. Proceedings of the National Academy of Sciences 103, 915–920 (2006).

581.

Freedberg, D. I. et al. Rapid structural fluctuations of the free HIV protease flaps in solution: Relationship to crystal structures and comparison with predictions of dynamics calculations. Protein Sci 11, 221–232 (2002).

582.

Yu, Y. et al. Structural insights into HIV-1 protease flap opening processes and key intermediates. RSC Adv. 7, 45121–45128 (2017).

583.

Roberts, N. A. et al. Rational Design of Peptide-Based HIV Proteinase Inhibitors. Science 248, 358–361 (1990).

584.

Lv, Z., Chu, Y. & Wang, Y. HIV protease inhibitors: A review of molecular selectivity and toxicity. HIV AIDS (Auckl) 7, 95–104 (2015).

585.

Prabu-Jeyabalan, M., Nalivaika, E. & Schiffer, C. A. Substrate Shape Determines Specificity of Recognition for HIV-1 Protease: Analysis of Crystal Structures of Six Substrate Complexes. Structure 10, 369–381 (2002).

586.

Prabu-Jeyabalan, M. et al. Substrate Envelope and Drug Resistance: Crystal Structure of RO1 in Complex with Wild-Type Human Immunodeficiency Virus Type 1 Protease. Antimicrob Agents Chemother 50, 1518–1521 (2006).

587.

Kurt Yilmaz, N., Swanstrom, R. & Schiffer, C. A. Improving Viral Protease Inhibitors to Counter Drug Resistance. Trends in Microbiology 24, 547–557 (2016).

588.

Chiu, T. K. & Davies, D. R. Structure and Function of HIV-1 Integrase. Current Topics in Medicinal Chemistry 4, 965–977 (2004).

589.

Esposito, D. & Craigie, R. HIV Integrase Structure and Function. in Advances in Virus Research (eds. Maramorosch, K., Murphy, F. A. & Shatkin, A. J.) vol. 52 319–333 (Academic Press, 1999).

590.

Delelis, O., Carayon, K., Saïb, A., Deprez, E. & Mouscadet, J.-F. Integrase and integration: Biochemical activities of HIV-1 integrase. Retrovirology 5, 114 (2008).

591.

Maertens, G. N., Engelman, A. N. & Cherepanov, P. Structure and function of retroviral integrase. Nat Rev Microbiol 20, 20–34 (2022).

592.

Pommier, Y., Johnson, A. A. & Marchand, C. Integrase inhibitors to treat HIV/AIDS. Nat Rev Drug Discov 4, 236–248 (2005).

593.

Blanco, J.-L., Varghese, V., Rhee, S.-Y., Gatell, J. M. & Shafer, R. W. HIV-1 Integrase Inhibitor Resistance and Its Clinical Implications. The Journal of Infectious Diseases 203, 1204–1214 (2011).

594.

Geretti, A. M., Armenia, D. & Ceccherini-Silberstein, F. Emerging patterns and implications of HIV-1 integrase inhibitor resistance. Current Opinion in Infectious Diseases 25, 677–686 (2012).

595.

Knox, D. C., Anderson, P. L., Harrigan, P. R. & Tan, D. H. S. Multidrug-Resistant HIV-1 Infection despite Preexposure Prophylaxis. N Engl J Med 376, 501–502 (2017).

596.

Hurt, C. B., Eron, J. J. & Cohen, M. S. Pre-exposure prophylaxis and antiretroviral resistance: HIV prevention at a cost? Clin Infect Dis 53, 1265–1270 (2011).

597.

Gibas, K. M., van den Berg, P., Powell, V. E. & Krakower, D. S. Drug Resistance During HIV Pre-Exposure Prophylaxis. Drugs 79, 609–619 (2019).

598.

Mourad, R. et al. A phylotype-based analysis highlights the role of drug-naive HIV-positive individuals in the transmission of antiretroviral resistance in the UK. AIDS 29, 1917–1925 (2015).

599.

Hué, S. et al. Demonstration of Sustained Drug-Resistant Human Immunodeficiency Virus Type 1 Lineages Circulating among Treatment-Naïve Individuals. Journal of Virology 83, 2645–2654 (2009).

600.

Drescher, S. M. et al. Treatment-Naive Individuals Are the Major Source of Transmitted HIV-1 Drug Resistance in Men Who Have Sex With Men in the Swiss HIV Cohort Study. Clinical Infectious Diseases 58, 285–294 (2014).

601.

Boerma, R. S. et al. High levels of pre-treatment HIV drug resistance and treatment failure in Nigerian children. Journal of the International AIDS Society 19, 21140 (2016).

602.

Clutter, D. S., Jordan, M. R., Bertagnolio, S. & Shafer, R. W. HIV-1 drug resistance and resistance testing. Infection, Genetics and Evolution 46, 292–307 (2016).

603.

Kühnert, D. et al. Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics. PLOS Pathogens 14, e1006895 (2018).

604.

Mesplède, T. et al. Viral fitness cost prevents HIV-1 from evading dolutegravir drug pressure. Retrovirology 10, 22 (2013).

605.

Castro, H. et al. Persistence of HIV-1 Transmitted Drug Resistance Mutations. J Infect Dis 208, 1459–1463 (2013).

606.

Blassel, L. et al. Drug resistance mutations in HIV: New bioinformatics approaches and challenges. Current Opinion in Virology 51, 56–64 (2021).

607.

UK CHIC Steering Committee. The creation of a large UK-based multicentre cohort of HIV-infected individuals: The UK Collaborative HIV Cohort (UK CHIC) Study. HIV Medicine 5, 115–124 (2004).

608.

Abeler-Dörner, L. et al. PANGEA-HIV 2: Phylogenetics And Networks for Generalised Epidemics in Africa. Current Opinion in HIV and AIDS 14, 173–180 (2019).

609.

Shafer, R. W. Rationale and Uses of a Public HIV Drug-Resistance Database. J Infect Dis 194, S51–S58 (2006).

610.

Kuiken, C., Korber, B. & Shafer, R. W. HIV Sequence Databases. AIDS Rev 5, 52–61 (2003).

611.

Wensing, A. M. et al. 2019 update of the drug resistance mutations in HIV-1. Top Antivir Med 27, 111–121 (2019).

612.

Clark, S. A., Calef, C. & Mellors, J. W. Mutations in retroviral genes associated with drug resistance. HIV sequence compendium 58–158 (2007).

613.

Liu, T. F. & Shafer, R. W. Web Resources for HIV Type 1 Genotypic-Resistance Test Interpretation. Clin Infect Dis 42, 1608–1618 (2006).

614.

Johnson, V. A. et al. Update of the Drug Resistance Mutations in HIV-1: March 2013. Top Antivir Med 21, 6–7 (2016).

615.

Villabona-Arenas, C. J. et al. In-depth analysis of HIV-1 drug resistance mutations in HIV-infected individuals failing first-line regimens in West and Central Africa. AIDS 30, 2577 (2016).

616.

Shulman, N. S., Bosch, R. J., Mellors, J. W., Albrecht, M. A. & Katzenstein, D. A. Genetic correlates of efavirenz hypersusceptibility. AIDS 18, 1781–1785 (2004).

617.

Miller, M. D. et al. Genotypic and Phenotypic Predictors of the Magnitude of Response to Tenofovir Disoproxil Fumarate Treatment in Antiretroviral-Experienced Patients. The Journal of Infectious Diseases 189, 837–846 (2004).

618.

Brown, B. W. & Russell, K. Methods correcting for multiple testing: Operating characteristics. Statistics in Medicine 16, 2511–2528 (1997).

619.

Austin, P. C., Mamdani, M. M., Juurlink, D. N. & Hux, J. E. Testing multiple statistical hypotheses resulted in spurious associations: A study of astrological signs and health. Journal of Clinical Epidemiology 59, 964–969 (2006).

620.

Hochberg, Y. & Tamhane, A. C. Multiple comparison procedures. (1987). doi:10.1002/9780470316672.

621.

Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300 (1995).

622.

Gonzales, M. J. et al. Extended spectrum of HIV-1 reverse transcriptase mutations in patients receiving multiple nucleoside analog inhibitors. AIDS 17, 791–799 (2003).

623.

Seoighe, C. et al. A Model of Directional Selection Applied to the Evolution of Drug Resistance in HIV-1. Molecular Biology and Evolution 24, 1025–1031 (2007).

624.

Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nature Reviews Genetics 15, 335–346 (2014).

625.

Alizon, S. et al. Phylogenetic Approach Reveals That Virus Genotype Largely Determines HIV Set-Point Viral Load. PLOS Pathogens 6, e1001123 (2010).

626.

Flynn, W. F. et al. Deep Sequencing of Protease Inhibitor Resistant HIV Patient Isolates Reveals Patterns of Correlated Mutations in Gag and Protease. PLOS Computational Biology 11, e1004249 (2015).

627.

Petropoulos, C. J. et al. A Novel Phenotypic Drug Susceptibility Assay for Human Immunodeficiency Virus Type 1. Antimicrobial Agents and Chemotherapy 44, 920–928 (2000).

628.

Hertogs, K. et al. A Rapid Method for Simultaneous Detection of Phenotypic Resistance to Inhibitors of Protease and Reverse Transcriptase in Recombinant Human Immunodeficiency Virus Type 1 Isolates from Patients Treated with Antiretroviral Drugs. Antimicrobial Agents and Chemotherapy 42, 269–276 (1998).

629.

Heilek-Snyder, G. & Bean, P. Role of HIV phenotypic assays in the management of HIV infection. Am Clin Lab 21, 40–43 (2002).

630.

Moyle, G. J. et al. Epidemiology and Predictive Factors for Chemokine Receptor Use in HIV-1 Infection. The Journal of Infectious Diseases 191, 866–872 (2005).

631.

Gartland, M. et al. Susceptibility of global HIV-1 clinical isolates to fostemsavir using the PhenoSense® Entry assay. Journal of Antimicrobial Chemotherapy 76, 648–652 (2021).

632.

Masquelier, B. et al. Genotypic and Phenotypic Resistance Patterns of Human Immunodeficiency Virus Type 1 Variants with Insertions or Deletions in the Reverse Transcriptase (RT): Multicenter Study of Patients Treated with RT Inhibitors. Antimicrobial Agents and Chemotherapy 45, 1836–1842 (2001).

633.

Larder, B. A. & Kemp, S. D. Multiple mutations in HIV-1 reverse transcriptase confer high-level resistance to zidovudine (AZT). Science 246, 1155–1158 (1989).

634.

de Vreese, K. et al. Resistance of human immunodeficiency virus type 1 reverse transcriptase to TIBO derivatives induced by site-directed mutagenesis. Virology 188, 900–904 (1992).

635.

Tambuyzer, L., Nijs, S., Daems, B., Picchio, G. & Vingerhoets, J. Effect of Mutations at Position E138 in HIV-1 Reverse Transcriptase on Phenotypic Susceptibility and Virologic Response to Etravirine. JAIDS Journal of Acquired Immune Deficiency Syndromes 58, 18–22 (2011).

636.

Katzenstein, D. A. et al. Phenotypic susceptibility and virological outcome in nucleoside-experienced patients receiving three or four antiretroviral drugs. AIDS 17, 821–830 (2003).

637.

Blassel, L. et al. Using machine learning and big data to explore the drug resistance landscape in HIV. PLOS Computational Biology 17, e1008873 (2021).

638.

Sheik Amamuddy, O., Bishop, N. T. & Tastan Bishop, Ö. Improving fold resistance prediction of HIV-1 against protease and reverse transcriptase inhibitors using artificial neural networks. BMC Bioinformatics 18, 369 (2017).

639.

Beerenwinkel, N. et al. Geno2pheno: Interpreting genotypic HIV drug resistance tests. IEEE Intelligent Systems 16, 35–41 (2001).

640.

Riemenschneider, M., Hummel, T. & Heider, D. SHIVA - a web application for drug resistance and tropism testing in HIV. BMC Bioinformatics 17, 314 (2016).

641.

Beerenwinkel, N. et al. Diversity and complexity of HIV-1 drug resistance: A bioinformatics approach to predicting phenotype from genotype. PNAS 99, 8271–8276 (2002).

642.

Heider, D., Senge, R., Cheng, W. & Hüllermeier, E. Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction. Bioinformatics 29, 1946–1952 (2013).

643.

Lepri, A. C. et al. Resistance Profiles in Patients with Viral Rebound on Potent Antiretroviral Therapy. J Infect Dis 181, 1143–1147 (2000).

644.

Verhofstede, C. et al. Detection of drug resistance mutations as a predictor of subsequent virological failure in patients with HIV-1 viral rebounds of less than 1,000 RNA copies/ml. Journal of Medical Virology 79, 1254–1260 (2007).

645.

Zhukova, A., Cutino-Moguel, T., Gascuel, O. & Pillay, D. The Role of Phylogenetics as a Tool to Predict the Spread of Resistance. J Infect Dis 216, S820–S823 (2017).

646.

Bennett, D. E. et al. Drug Resistance Mutations for Surveillance of Transmitted HIV-1 Drug-Resistance: 2009 Update. PLoS One 4, e4724 (2009).

647.

Hammond, J., Calef, C., Larder, B., Schinazi, R. & Mellors, J. W. Mutations in Retroviral Genes Associated with Drug Resistance. Human retroviruses and AIDS 11136–11179 (1998).

648.

Wensing, A. M. et al. 2017 Update of the Drug Resistance Mutations in HIV-1. Top Antivir Med 24, 132–133 (2016).

649.

Dudoit, S. & van der Laan, M. J. Multiple Testing Procedures with Applications to Genomics. (Springer Science & Business Media, 2007). doi:10.1007/978-0-387-49317-6.

650.

Maddison, W. P. & FitzJohn, R. G. The Unsolved Challenge to Phylogenetic Correlation Tests for Categorical Characters. Syst Biol 64, 127–136 (2015).

651.

Lengauer, T. & Sing, T. Bioinformatics-assisted anti-HIV therapy. Nat Rev Microbiol 4, 790–797 (2006).

652.

Zhang, J., Rhee, S.-Y., Taylor, J. & Shafer, R. W. Comparison of the Precision and Sensitivity of the Antivirogram and PhenoSense HIV Drug Susceptibility Assays. JAIDS Journal of Acquired Immune Deficiency Syndromes 38, 439–444 (2005).

653.

Beerenwinkel, N. et al. Geno2pheno: Estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Research 31, 3850–3855 (2003).

654.

Shen, C., Yu, X., Harrison, R. W. & Weber, I. T. Automated prediction of HIV drug resistance from genotype data. BMC Bioinformatics 17, 278 (2016).

655.

Yu, X., Weber, I. T. & Harrison, R. W. Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure. BMC Genomics 15, S1 (2014).

656.

Araya, S. T. & Hazelhurst, S. Support vector machine prediction of HIV-1 drug resistance using the viral nucleotide patterns. Transactions of the Royal Society of South Africa 64, 62–72 (2009).

657.

Riemenschneider, M., Senge, R., Neumann, U., Hüllermeier, E. & Heider, D. Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification. BioData Min 9, 10 (2016).

658.

Drăghici, S. & Potter, R. B. Predicting HIV drug resistance with neural networks. Bioinformatics 19, 98–107 (2003).

659.

Mooney, A. C. et al. Beyond Social Desirability Bias: Investigating Inconsistencies in Self-Reported HIV Testing and Treatment Behaviors Among HIV-Positive Adults in North West Province, South Africa. AIDS Behav 22, 2368–2379 (2018).

660.

Brier, G. W. Verification of Forecasts Expressed in Terms of Probability. Mon. Wea. Rev. 78, 1–3 (1950).

661.

Gascuel, O. et al. Twelve Numerical, Symbolic and Hybrid Supervised Classification Methods. Int. J. Patt. Recogn. Artif. Intell. 12, 517–571 (1998).

662.

Goeman, J. J. & Solari, A. Multiple hypothesis testing in genomics. Statistics in Medicine 33, 1946–1978 (2014).

663.

Rennie, J. D., Shih, L., Teevan, J. & Karger, D. R. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. in Proceedings of the 20th international conference on machine learning (ICML-03) 616–623 (2003).

664.

Alvarez Melis, D. & Jaakkola, T. Towards Robust Interpretability with Self-Explaining Neural Networks. in Advances in Neural Information Processing Systems 31 (eds. Bengio, S. et al.) 7775–7784 (Curran Associates, Inc., 2018).

665.

Zhang, Q., Wu, Y. N. & Zhu, S.-C. Interpretable Convolutional Neural Networks. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8827–8836 (2018). doi:10.1109/CVPR.2018.00920.

666.

Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. (2015).

667.

Rhee, S.-Y., Liu, T. F., Holmes, S. P. & Shafer, R. W. HIV-1 Subtype B Protease and Reverse Transcriptase Amino Acid Covariation. PLOS Computational Biology 3, e87 (2007).

668.

De Luca, A. et al. Improved Interpretation of Genotypic Changes in the HIV-1 Reverse Transcriptase Coding Region That Determine the Virological Response to Didanosine. J Infect Dis 196, 1645–1653 (2007).

669.

Marcelin, A.-G. et al. Impact of HIV-1 reverse transcriptase polymorphism at codons 211 and 228 on virological response to didanosine. Antiviral Therapy 8 (2006) doi:10.1177/135965350601100609.

670.

Brown, A. J. L. et al. Reduced Susceptibility of Human Immunodeficiency Virus Type 1 (HIV-1) from Patients with Primary HIV Infection to Nonnucleoside Reverse Transcriptase Inhibitors Is Associated with Variation at Novel Amino Acid Sites. J Virol 74, 10269–10273 (2000).

671.

Clark, S. A., Shulman, N. S., Bosch, R. J. & Mellors, J. W. Reverse transcriptase mutations 118I, 208Y, and 215Y cause HIV-1 hypersusceptibility to non-nucleoside reverse transcriptase inhibitors. AIDS 20, 981–984 (2006).

672.

Nebbia, G., Sabin, C. A., Dunn, D. T. & Geretti, A. M. Emergence of the H208Y mutation in the reverse transcriptase (RT) of HIV-1 in association with nucleoside RT inhibitor therapy. J Antimicrob Chemother 59, 1013–1016 (2007).

673.

Saracino, A. et al. Impact of unreported HIV-1 reverse transcriptase mutations on phenotypic resistance to nucleoside and non-nucleoside inhibitors. Journal of Medical Virology 78, 9–17 (2006).

674.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. & Lange, K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 714–721 (2009).

675.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408 (1958).

676.

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).

677.

Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 2, 183–197 (1991).

678.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989).

679.

Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989).

680.

Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251–257 (1991).

681.

LeCun, Y. et al. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1, 541–551 (1989).

682.

LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).

683.

Voznica, J. et al. Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks. Nat Commun 13, 3896 (2022).

684.

Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).

685.

He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).

686.

Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2016) doi:10.48550/arXiv.1409.0473.

687.

Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).

688.

How many words are there in English? | Merriam-Webster. https://www.merriam-webster.com/help/faq-how-many-english-words.

689.

Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space. (2013) doi:10.48550/arXiv.1301.3781.

690.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed Representations of Words and Phrases and their Compositionality. in Advances in Neural Information Processing Systems vol. 26 (Curran Associates, Inc., 2013).

691.

Goldberg, Y. & Levy, O. Word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. (2014) doi:10.48550/arXiv.1402.3722.

692.

Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. (2017) doi:10.48550/arXiv.1701.06279.

693.

Liang, Y. et al. Hyb4mC: A hybrid DNA2vec-based model for DNA N4-methylcytosine sites prediction. BMC Bioinformatics 23, 258 (2022).

694.

Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed Representations for Biological Sequence Analysis. (2016) doi:10.48550/arXiv.1608.05949.

695.

Asgari, E. & Mofrad, M. R. K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 10, e0141287 (2015).

696.

Kimothi, D., Shukla, A., Biyani, P., Anand, S. & Hogan, J. M. Metric learning on biological sequence embeddings. in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) 1–5 (2017). doi:10.1109/spawc.2017.8227769.

697.

Song, B. et al. Pretraining model for biological sequence data. Briefings in Functional Genomics 20, 181–195 (2021).

698.

Wang, H., Wu, H., He, Z., Huang, L. & Ward Church, K. Progress in Machine Translation. Engineering (2021) doi:10.1016/j.eng.2021.03.023.

699.

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL (2019) doi:10.48550/arXiv.1810.04805.

700.

Brown, T. et al. Language Models are Few-Shot Learners. in Advances in Neural Information Processing Systems vol. 33 1877–1901 (Curran Associates, Inc., 2020).

701.

Madani, A. et al. ProGen: Language Modeling for Protein Generation. bioRxiv (2020) doi:10.1101/2020.03.07.982272.

702.

Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: Exploring the Boundaries of Protein Language Models. arXiv (2022) doi:10.48550/arXiv.2206.13517.

703.

Bepler, T. & Berger, B. Learning the protein language: Evolution, structure, and function. Cell Systems 12, 654–669.e3 (2021).

704.

Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. 2020.12.15.422761 (2020) doi:10.1101/2020.12.15.422761.

705.

Rives, A. et al. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. bioRxiv 622803 (2019).

706.

Bhattacharya, N. et al. Single Layers of Attention Suffice to Predict Protein Contacts. 2020.12.21.423882 (2020) doi:10.1101/2020.12.21.423882.

707.

Hu, M. et al. Exploring evolution-based & -free protein language models as protein function predictors. (2022) doi:10.48550/arXiv.2206.06583.

708.

Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. in Advances in Neural Information Processing Systems vol. 34 29287–29303 (Curran Associates, Inc., 2021).

709.

Hie, B., Yang, K. K. & Kim, S. K. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Systems 13, 274–285.e6 (2022).

710.

Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).

711.

Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful zero-shot predictors of non-coding variant effects. 2022.08.22.504706 (2022) doi:10.1101/2022.08.22.504706.

712.

Cai, T. et al. Genome-wide Prediction of Small Molecule Binding to Remote Orphan Proteins Using Distilled Sequence Alignment Embedding. 2020.08.04.236729 (2020) doi:10.1101/2020.08.04.236729.

713.

Rao, R. et al. MSA Transformer. bioRxiv (2021) doi:10.1101/2021.02.12.430858.

714.

Sercu, T. et al. Neural Potts Model. 2021.04.08.439084 (2021) doi:10.1101/2021.04.08.439084.

715.

Sturmfels, P., Vig, J., Madani, A. & Rajani, N. F. Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models. (2020) doi:10.48550/arXiv.2012.00195.

716.

Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 1–7 (2022) doi:10.1038/s41587-022-01435-7.

717.

Ourmazd, A., Moffat, K. & Lattman, E. E. Structural biology is solved — now what? Nat Methods 19, 24–26 (2022).

718.

Vig, J. et al. BERTology Meets Biology: Interpreting Attention in Protein Language Models. (2021) doi:10.48550/arXiv.2006.15222.

719.

Gao, M. & Skolnick, J. A novel sequence alignment algorithm based on deep learning of the protein folding code. Bioinformatics 37, 490–496 (2021).

720.

Morton, J. T. et al. Protein Structural Alignments From Sequence. 2020.11.03.365932 (2020) doi:10.1101/2020.11.03.365932.

721.

Berman, H., Henrick, K., Nakamura, H. & Markley, J. L. The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Research 35, D301–D303 (2007).

722.

Guo, Y., Wu, J., Ma, H., Wang, S. & Huang, J. Comprehensive Study on Enhancing Low-Quality Position-Specific Scoring Matrix with Deep Learning for Accurate Protein Structure Property Prediction: Using Bagging Multiple Sequence Alignment Learning. Journal of Computational Biology 28, 346–361 (2021).

723.

Llinares-López, F., Berthet, Q., Blondel, M., Teboul, O. & Vert, J.-P. Deep embedding and alignment of protein sequences. 2021.11.15.468653 (2022) doi:10.1101/2021.11.15.468653.

724.

Suzek, B. E. et al. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).

725.

Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419 (2021).

726.

Petti, S. et al. End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman. 2021.10.23.465204 (2022) doi:10.1101/2021.10.23.465204.

727.

Dotan, E. et al. Harnessing machine translation methods for sequence alignment. 2022.07.22.501063 (2022) doi:10.1101/2022.07.22.501063.

728.

Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: Self-Attention with Linear Complexity. (2020) doi:10.48550/arXiv.2006.04768.

729.

Xiong, Y. et al. Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence 35, 14138–14148 (2021).

730.

Child, R., Gray, S., Radford, A. & Sutskever, I. Generating Long Sequences with Sparse Transformers. (2019) doi:10.48550/arXiv.1904.10509.

731.

Correia, G. M., Niculae, V. & Martins, A. F. T. Adaptively Sparse Transformers. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2174–2184 (Association for Computational Linguistics, 2019). doi:10.18653/v1/D19-1223.

732.

Sukhbaatar, S., Grave, E., Bojanowski, P. & Joulin, A. Adaptive Attention Span in Transformers. (2019) doi:10.48550/arXiv.1905.07799.

733.

Wu, Z., Liu, Z., Lin, J., Lin, Y. & Han, S. Lite Transformer with Long-Short Range Attention. (2020) doi:10.48550/arXiv.2004.11886.

734.

Kitaev, N., Kaiser, Ł. & Levskaya, A. Reformer: The Efficient Transformer. (2020) doi:10.48550/arXiv.2001.04451.

735.

Choromanski, K. et al. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. (2020) doi:10.48550/arXiv.2006.03555.

736.

Bhattacharya, N. et al. Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention. in Biocomputing 2022 34–45 (World Scientific, 2021). doi:10.1142/9789811250477_0004.

737.

Kraska, T., Beutel, A., Chi, E. H., Dean, J. & Polyzotis, N. The Case for Learned Index Structures. in Proceedings of the 2018 International Conference on Management of Data 489–504 (Association for Computing Machinery, 2018). doi:10.1145/3183713.3196909.

738.

Jung, Y. & Han, D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 38, 2404–2413 (2022).

739.

Kirsche, M., Das, A. & Schatz, M. C. Sapling: Accelerating suffix array queries with learned data models. Bioinformatics 37, 744–749 (2021).

740.

Ho, D. et al. LISA: Learned Indexes for Sequence Analysis. 2020.12.22.423964 (2021) doi:10.1101/2020.12.22.423964.

741.

Hoang, M., Zheng, H. & Kingsford, C. Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer. Journal of Computational Biology (2022) doi:10.1089/cmb.2022.0275.

742.

Min, S., Lee, B. & Yoon, S. TargetNet: Functional microRNA target prediction with deep neural networks. Bioinformatics 38, 671–677 (2022).

743.

Bzikadze, A. V. & Pevzner, P. A. Automated assembly of centromeres from ultra-long error-prone reads. Nat Biotechnol 38, 1309–1316 (2020).

744.

Castresana, J. Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Mol Biol Evol 17, 540–552 (2000).

745.

Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).

746.

Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272 (2020).

747.

Seabold, S. & Perktold, J. Statsmodels: Econometric and Statistical Modeling with Python. in Proceedings of the 9th Python in Science Conference (eds. van der Walt, S. & Millman, J.) 92–96 (2010). doi:10.25080/Majora-92bf1922-011.

748.

Vinh, N. X. & Epps, J. A Novel Approach for Automatic Number of Clusters Detection in Microarray Data Based on Consensus Clustering. in 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering 84–91 (2009). doi:10.1109/bibe.2009.19.

749.

Harremoes, P. Mutual information of contingency tables and related inequalities. in 2014 IEEE International Symposium on Information Theory 2474–2478 (IEEE, 2014). doi:10.1109/isit.2014.6875279.