Global References

1.
Watson, J. D. & Crick, F. H. C. The Structure of DNA. Cold Spring Harb Symp Quant Biol 18, 123–131 (1953).
2.
Sanger, F. et al. Nucleotide sequence of bacteriophage φX174 DNA. Nature 265, 687–695 (1977).
3.
4.
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
5.
Pellicer, J., Fay, M. F. & Leitch, I. J. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society 164, 10–15 (2010).
6.
Macgregor, H. C. C-Value Paradox. in Encyclopedia of Genetics (eds. Brenner, S. & Miller, J. H.) 249–250 (Academic Press, 2001). doi:10.1006/rwgn.2001.0301.
7.
Alberts, B. et al. Molecular Biology of the Cell. 4th edition. (Garland Science, 2002).
8.
Crick, F. H. C., Barnett, L., Brenner, S. & Watts-Tobin, R. J. General Nature of the Genetic Code for Proteins. Nature 192, 1227–1232 (1961).
9.
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
10.
Elkon, R. & Agami, R. Characterization of noncoding regulatory DNA in the human genome. Nat Biotechnol 35, 732–746 (2017).
11.
12.
Shabalina, S. A. & Spiridonov, N. A. The mammalian transcriptome and the function of non-coding DNA sequences. Genome Biol 5, 105 (2004).
13.
The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489, 57–74 (2012).
14.
Chatterjee, N. & Walker, G. C. Mechanisms of DNA damage, repair, and mutagenesis: DNA Damage and Repair. Environ. Mol. Mutagen. 58, 235–263 (2017).
15.
Fijalkowska, I. J., Schaaper, R. M. & Jonczyk, P. DNA replication fidelity in Escherichia coli: A multi-DNA polymerase affair. FEMS Microbiol Rev 36, 1105–1121 (2012).
16.
Pray, L. DNA replication and causes of mutation. Nature Education 1, 214 (2008).
17.
Gout, J.-F., Thomas, W. K., Smith, Z., Okamoto, K. & Lynch, M. Large-scale detection of in vivo transcription errors. Proceedings of the National Academy of Sciences 110, 18584–18589 (2013).
18.
Gout, J.-F. et al. The landscape of transcription errors in eukaryotic cells. Sci Adv 3, e1701484 (2017).
19.
20.
Desouky, O., Ding, N. & Zhou, G. Targeted and non-targeted effects of ionizing radiation. Journal of Radiation Research and Applied Sciences 8, 247–254 (2015).
21.
Kiefer, J. Effects of Ultraviolet Radiation on DNA. in Chromosomal Alterations: Methods, Results and Importance in Human Health (eds. Obe, G. & Vijayalaxmi) 39–53 (Springer, 2007). doi:10.1007/978-3-540-71414-9_3.
22.
Bennett, J. W. & Klich, M. Mycotoxins. Clin Microbiol Rev 16, 497–516 (2003).
23.
Kantidze, O. L., Velichko, A. K., Luzhin, A. V. & Razin, S. V. Heat Stress-Induced DNA Damage. Acta Naturae 8, 75–78 (2016).
24.
25.
26.
Anagnostou, M. E. et al. Transcription errors in aging and disease. Translational Medicine of Aging 5, 31–38 (2021).
27.
Roth, J. R. Frameshift mutations. Annu Rev Genet 8, 319–346 (1974).
28.
Kujovich, J. L. Factor V Leiden thrombophilia. Genetics in Medicine 13, 1–16 (2011).
29.
30.
Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016).
31.
32.
Woodford, N. & Ellington, M. J. The emergence of antibiotic resistance by mutation. Clinical Microbiology and Infection 13, 5–18 (2007).
33.
Rhee, S.-Y. et al. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31, 298–303 (2003).
34.
Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74, 5463–5467 (1977).
35.
Smith, L. M., Fung, S., Hunkapiller, M. W., Hunkapiller, T. J. & Hood, L. E. The synthesis of oligonucleotides containing an aliphatic amino group at the 5′ terminus: Synthesis of fluorescent DNA primers for use in DNA sequence analysis. Nucleic Acids Research 13, 2399–2412 (1985).
36.
Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679 (1986).
37.
Ansorge, W., Sproat, B., Stegemann, J., Schwager, C. & Zenke, M. Automated DNA sequencing: Ultrasensitive detection of fluorescent bands during electrophoresis. Nucleic Acids Research 15, 4593–4602 (1987).
38.
Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat Biotechnol 26, 1135–1145 (2008).
39.
Collins, F. S., Morgan, M. & Patrinos, A. The Human Genome Project: Lessons from Large-Scale Biology. Science 300, 286–290 (2003).
40.
Liu, L. et al. Comparison of Next-Generation Sequencing Systems. Journal of Biomedicine and Biotechnology 2012, e251364 (2012).
41.
42.
Metzker, M. L. Sequencing technologies — the next generation. Nat Rev Genet 11, 31–46 (2010).
43.
Canard, B. & Sarfati, R. S. DNA polymerase fluorescent substrates with reversible 3′-tags. Gene 148, 1–6 (1994).
44.
Nyren, P., Pettersson, B. & Uhlen, M. Solid Phase DNA Minisequencing by an Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay. Analytical Biochemistry 208, 171–175 (1993).
45.
Mardis, E. R. A decade’s perspective on DNA sequencing technology. Nature 470, 198–203 (2011).
46.
Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics 3, lqab019 (2021).
47.
48.
Pollard, M. O., Gurdasani, D., Mentzer, A. J., Porter, T. & Sandhu, M. S. Long reads: Their purpose and place. Human Molecular Genetics 27, R234–R241 (2018).
49.
Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133–138 (2009).
50.
Levene, M. J. et al. Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations. Science 299, 682–686 (2003).
51.
Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotech 4, 265–270 (2009).
52.
Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat Biotechnol 34, 518–524 (2016).
53.
Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 20, 129 (2019).
54.
Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics 13, 278–289 (2015).
55.
56.
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat Rev Genet 21, 597–614 (2020).
57.
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338–345 (2018).
58.
Thar she blows! Ultra long read method for nanopore sequencing · Loman Labs. http://lab.loman.net/2017/03/09/ultrareads-for-nanopore/.
59.
Payne, A., Holmes, N., Rakyan, V. & Loose, M. BulkVis: A graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35, 2193–2198 (2019).
60.
Murigneux, V. et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 9, giaa146 (2020).
61.
Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
62.
Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: Delivery of nanopore sequencing to the genomics community. Genome Biol 17, 239 (2016).
63.
Hong, M. et al. RNA sequencing: New technologies and applications in cancer research. Journal of Hematology & Oncology 13, 166 (2020).
64.
Ozsolak, F. & Milos, P. M. RNA sequencing: Advances, challenges and opportunities. Nat Rev Genet 12, 87–98 (2011).
65.
Hunt, D. F., Yates, J. R., Shabanowitz, J., Winston, S. & Hauer, C. R. Protein sequencing by tandem mass spectrometry. Proceedings of the National Academy of Sciences 83, 6233–6237 (1986).
66.
Smith, B. J. Protein Sequencing Protocols. (Springer Science & Business Media, 2002). doi:10.1385/1592593429.
67.
Restrepo-Pérez, L., Joo, C. & Dekker, C. Paving the way to single-molecule protein sequencing. Nature Nanotech 13, 786–796 (2018).
68.
69.
Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39, 1348–1365 (2021).
70.
Ma, X. et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biology 20, 50 (2019).
71.
Lima, L. et al. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Briefings in Bioinformatics 21, 1164–1181 (2020).
72.
Fu, S., Wang, A. & Au, K. F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biology 20, 26 (2019).
73.
Zhang, H., Jain, C. & Aluru, S. A comprehensive evaluation of long read error correction methods. BMC Genomics 21, 889 (2020).
74.
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biology 21, 30 (2020).
75.
Ruan, J. & Li, H. Fast and accurate long-read assembly with Wtdbg2. Nat Methods 17, 155–158 (2020).
76.
77.
Tischler, G. & Myers, E. W. Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly. 106252 (2017) doi:10.1101/106252.
78.
Warren, R. L. et al. ntEdit: Scalable genome sequence polishing. Bioinformatics 35, 4430–4432 (2019).
79.
Hepler, N. L. et al. An Improved Circular Consensus Algorithm with an Application to Detect HIV-1 Drug-Resistance Associated Mutations (DRAMs). in Conference on Advances in Genome Biology and Technology 1 (2016).
80.
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods 14, 407–410 (2017).
81.
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27, 737–746 (2017).
82.
Hackl, T., Hedrich, R., Schultz, J. & Förster, F. Proovread: Large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004–3011 (2014).
83.
Miclotte, G. et al. Jabba: Hybrid error correction for long sequencing reads. Algorithms for Molecular Biology 11, 10 (2016).
84.
Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30, 693–700 (2012).
85.
Salmela, L. & Rivals, E. LoRDEC: Accurate and efficient long read error correction. Bioinformatics 30, 3506–3514 (2014).
86.
87.
88.
Timp, W., Comer, J. & Aksimentiev, A. DNA Base-Calling from a Nanopore Using a Viterbi Algorithm. Biophysical Journal 102, L37–L39 (2012).
89.
Perešíni, P., Boža, V., Brejová, B. & Vinař, T. Nanopore base calling on the edge. Bioinformatics 37, 4661–4667 (2021).
90.
Boža, V., Brejová, B. & Vinař, T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12, e0178751 (2017).
91.
92.
Lin, B., Hui, J. & Mao, H. Nanopore Technology and Its Applications in Gene Sequencing. Biosensors 11, 214 (2021).
93.
Oxford Nanopore Tech Update: New Duplex method for Q30 nanopore single molecule reads, PromethION 2, and more. http://nanoporetech.com/about-us/news/oxford-nanopore-tech-update-new-duplex-method-q30-nanopore-single-molecule-reads-0.
94.
Sanderson, N. et al. Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. 2022.04.29.490057 (2022) doi:10.1101/2022.04.29.490057.
95.
96.
97.
High Performance Long Read Assay Enables Contiguous Data up to 10Kb on Existing Illumina Platforms. https://www.illumina.com/content/illumina-marketing/amr/en_US/science/genomics-research/articles/infinity-high-performance-long-read-assay.html.
98.
Booeshaghi, A. S. & Pachter, L. Pseudoalignment facilitates assignment of error-prone Ultima Genomics reads. 2022.06.04.494845 (2022) doi:10.1101/2022.06.04.494845.
99.
Delahaye, C. & Nicolas, J. Sequencing DNA with nanopores: Troubles and biases. PLoS One 16, e0257521 (2021).
100.
101.
Dohm, J. C., Peters, P., Stralis-Pavese, N. & Himmelbauer, H. Benchmarking of long-read correction methods. NAR Genomics and Bioinformatics 2, (2020).
102.
103.
Huang, Y.-T., Liu, P.-Y. & Shih, P.-W. Homopolish: A method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biology 22, 95 (2021).
104.
Rang, F. J., Kloosterman, W. P. & de Ridder, J. From squiggle to basepair: Computational approaches for improving nanopore sequencing read accuracy. Genome Biology 19, 90 (2018).
105.
Sarkozy, P., Jobbágy, Á. & Antal, P. Calling Homopolymer Stretches from Raw Nanopore Reads by Analyzing k-mer Dwell Times. in EMBEC & NBC 2017 (eds. Eskola, H., Väisänen, O., Viik, J. & Hyttinen, J.) 241–244 (Springer, 2018). doi:10.1007/978-981-10-5122-7_61.
106.
Hawkins, J. A., Jones, S. K., Finkelstein, I. J. & Press, W. H. Indel-correcting DNA barcodes for high-throughput sequencing. Proceedings of the National Academy of Sciences 115, E6217–E6226 (2018).
107.
Srivathsan, A. et al. A MinION™-based pipeline for fast and cost-effective DNA barcoding. Molecular Ecology Resources 18, 1035–1049 (2018).
108.
Wang, Y., Noor-A-Rahim, Md., Gunawan, E., Guan, Y. L. & Poh, C. L. Construction of Bio-Constrained Code for DNA Data Storage. IEEE Communications Letters 23, 963–966 (2019).
109.
R10.3: The newest nanopore for high accuracy nanopore sequencing – now available in store. http://nanoporetech.com/about-us/news/r103-newest-nanopore-high-accuracy-nanopore-sequencing-now-available-store.
110.
Zhou, L. et al. Detection of DNA homopolymer with graphene nanopore. Journal of Vacuum Science & Technology B 37, 061809 (2019).
111.
Goto, Y., Yanagi, I., Matsui, K., Yokoi, T. & Takeda, K. Identification of four single-stranded DNA homopolymers with a solid-state nanopore in alkaline CsCl solution. Nanoscale 10, 20844–20850 (2018).
112.
113.
Ekim, B., Berger, B. & Chikhi, R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems 12, 958–968.e6 (2021).
114.
115.
Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).
116.
Sahlin, K. & Medvedev, P. De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm. Journal of Computational Biology 27, 472–484 (2020).
117.
Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).
118.
Hu, R., Sun, G. & Sun, X. LSCplus: A fast solution for improving long read accuracy by short read alignment. BMC Bioinformatics 17, 451 (2016).
119.
Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
120.
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).
121.
Van Neste, C., Van Nieuwerburgh, F., Van Hoofstat, D. & Deforce, D. Forensic STR analysis using massive parallel sequencing. Forensic Science International: Genetics 6, 810–818 (2012).
122.
123.
Cetin, A. E. et al. Plasmonic Sensor Could Enable Label-Free DNA Sequencing. ACS Sens. 3, 561–568 (2018).
124.
Almogy, G. et al. Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform. 2022.05.29.493900 (2022) doi:10.1101/2022.05.29.493900.
125.
Sunagawa, S. et al. Tara Oceans: Towards global ocean ecosystems biology. Nat Rev Microbiol 18, 428–445 (2020).
126.
Lewin, H. A. et al. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences 115, 4325–4333 (2018).
127.
128.
Hamming, R. W. Coding and Information Theory. (Prentice-Hall, 1980).
129.
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. (Cambridge University Press, 1997). doi:10.1017/cbo9780511574931.
130.
Levenshtein, V. I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966).
131.
Hardison, R. C. Comparative Genomics. PLOS Biology 1, e58 (2003).
132.
Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol 17, 368–376 (1981).
133.
Kumar, S., Tamura, K. & Nei, M. MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers. Bioinformatics 10, 189–191 (1994).
134.
Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).
135.
136.
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490 (2010).
137.
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
138.
Karplus, K. et al. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Bioinformatics 37, 121–125 (1999).
139.
Watson, J. D., Laskowski, R. A. & Thornton, J. M. Predicting protein function from sequence and structural data. Current Opinion in Structural Biology 15, 275–284 (2005).
140.
Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005 (2007).
141.
Salmela, L. & Schröder, J. Correcting errors in short reads by multiple alignments. Bioinformatics 27, 1455–1461 (2011).
142.
Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6, S13–S20 (2009).
143.
Mahmoud, M. et al. Structural variant calling: The long and the short of it. Genome Biol 20, 246 (2019).
144.
Sung, W.-K. Algorithms in Bioinformatics: A Practical Introduction. (Chapman and Hall/CRC, 2011). doi:10.1201/9781420070347.
145.
Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453 (1970).
146.
Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981).
147.
Bradley, S. P., Hax, A. C. & Magnanti, T. L. Applied Mathematical Programming. (Addison-Wesley Publishing Company, 1977).
148.
Bellman, R. The theory of dynamic programming. Bull. Amer. Math. Soc. 60, 503–515 (1954).
149.
Masek, W. J. & Paterson, M. S. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20, 18–31 (1980).
150.
Vinh, N. X., Epps, J. & Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research 11, 2837–2854 (2010).
151.
Ullman, J. D., Aho, A. V. & Hirschberg, D. S. Bounds on the Complexity of the Longest Common Subsequence Problem. J. ACM 23, 1–12 (1976).
152.
Hirschberg, D. S. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341–343 (1975).
153.
Myers, E. W. & Miller, W. Optimal alignments in linear space. Bioinformatics 4, 11–17 (1988).
154.
Rice, P., Longden, I. & Bleasby, A. EMBOSS: The European molecular biology open software suite. Trends in Genetics 16, 276–277 (2000).
155.
Huang, X. & Miller, W. A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12, 337–357 (1991).
156.
Waterman, M. S. & Eggert, M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology 197, 723–728 (1987).
157.
Stajich, J. E. et al. The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Res. 12, 1611–1618 (2002).
158.
159.
160.
Frohmberg, W., Kierzynka, M., Blazewicz, J. & Wojciechowski, P. G-PAS 2.0 – an improved version of protein alignment tool with an efficient backtracking routine on multiple GPUs. Bulletin of the Polish Academy of Sciences: Technical Sciences 60, 491–494 (2012).
161.
Altschul, S. F. Substitution Matrices. in eLS (John Wiley & Sons, Ltd, 2013). doi:10.1002/9780470015902.a0005265.pub3.
162.
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. A Model of Evolutionary Change in Proteins. in Atlas of Protein Sequence and Structure 345–352 (1978).
163.
Müller, T. & Vingron, M. Modeling amino acid replacement. J Comput Biol 7, 761–776 (2000).
164.
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences 89, 10915–10919 (1992).
165.
166.
Le, S. Q. & Gascuel, O. An Improved General Amino Acid Replacement Matrix. Molecular Biology and Evolution 25, 1307–1320 (2008).
167.
Müller, T., Rahmann, S. & Rehmsmeier, M. Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17, S182–S189 (2001).
168.
Ng, P. C., Henikoff, J. G. & Henikoff, S. PHAT: A transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 16, 760–766 (2000).
169.
170.
Goonesekere, N. C. W. & Lee, B. Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins: Structure, Function, and Bioinformatics 71, 910–919 (2008).
171.
172.
Nickle, D. C. et al. HIV-Specific Probabilistic Models of Protein Evolution. PLoS One 2, e503 (2007).
173.
Sardiu, M. E., Alves, G. & Yu, Y.-K. Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Phys Rev E Stat Nonlin Soft Matter Phys 72, 061917 (2005).
174.
Chiaromonte, F., Yap, V. B. & Miller, W. Scoring pairwise genomic sequence alignments. in Biocomputing 2002 115–126 (World Scientific, 2001). doi:10.1142/9789812799623_0012.
175.
Schneider, A., Cannarozzi, G. M. & Gonnet, G. H. Empirical codon substitution matrix. BMC Bioinformatics 6, 134 (2005).
176.
Doron-Faigenboim, A. & Pupko, T. A Combined Empirical and Mechanistic Codon Model. Molecular Biology and Evolution 24, 388–397 (2007).
177.
Cartwright, R. A. Problems and Solutions for Estimating Indel Rates and Length Distributions. Molecular Biology and Evolution 26, 473–480 (2009).
178.
Fitch, W. M. & Smith, T. F. Optimal sequence alignments. Proceedings of the National Academy of Sciences 80, 1382–1386 (1983).
179.
Waterman, M. S., Smith, T. F. & Beyer, W. A. Some biological sequence metrics. Advances in Mathematics 20, 367–387 (1976).
180.
Gotoh, O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162, 705–708 (1982).
181.
Altschul, S. F. & Erickson, B. W. Optimal sequence alignment using affine gap costs. Bulletin of Mathematical Biology 48, 603–616 (1986).
182.
Waterman, M. S. Efficient sequence alignment algorithms. Journal of Theoretical Biology 108, 333–337 (1984).
183.
Miller, W. & Myers, E. W. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology 50, 97–120 (1988).
184.
Cartwright, R. A. Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 7, 527 (2006).
185.
Goonesekere, N. C. W. & Lee, B. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Research 32, 2838–2843 (2004).
186.
Benner, S. A., Cohen, M. A. & Gonnet, G. H. Empirical and Structural Models for Insertions and Deletions in the Divergent Evolution of Proteins. Journal of Molecular Biology 229, 1065–1082 (1993).
187.
Wrabl, J. O. & Grishin, N. V. Gaps in structurally similar proteins: Towards improvement of multiple sequence alignment. Proteins: Structure, Function, and Bioinformatics 54, 71–87 (2004).
188.
189.
Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. & Gibson, T. J. Multiple sequence alignment with Clustal X. Trends in Biochemical Sciences 23, 403–405 (1998).
190.
Wang, C., Yan, R.-X., Wang, X.-F., Si, J.-N. & Zhang, Z. Comparison of linear gap penalties and profile-based variable gap penalties in profile–profile alignments. Computational Biology and Chemistry 35, 308–318 (2011).
191.
Marco-Sola, S., Moure, J. C., Moreto, M. & Espinosa, A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics (2020) doi:10.1093/bioinformatics/btaa777.
192.
Pearson, W. R. & Miller, W. [27] Dynamic programming algorithms for biological sequence comparison. in Methods in Enzymology vol. 210 575–601 (Academic Press, 1992).
193.
Spouge, J. L. Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths. SIAM J. Appl. Math. 49, 1552–1566 (1989).
194.
Fickett, J. W. Fast optimal alignment. Nucleic Acids Research 12, 175–179 (1984).
195.
Chao, J., Tang, F. & Xu, L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 12, 546 (2022).
196.
Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059–3066 (2002).
197.
Sun, Y. & Buhler, J. Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics 7, 133 (2006).
198.
Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 473–483 (2010).
199.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).
200.
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997).
201.
Schwartz, S. et al. Human–Mouse Alignments with BLASTZ. Genome Res. 13, 103–107 (2003).
202.
Ma, B., Tromp, J. & Li, M. PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002).
203.
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
204.
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60 (2015).
205.
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).
206.
Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444–2448 (1988).
207.
Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).
208.
Saripella, G. V., Sonnhammer, E. L. L. & Forslund, K. Benchmarking the next generation of homology inference tools. Bioinformatics 32, 2636 (2016).
209.
Finn, R. D. et al. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Research 44, D279 (2016).
210.
Essoussi, N. & Fayech, S. A comparison of four pair-wise sequence alignment methods. Bioinformation 2, 166–168 (2007).
211.
212.
Schleimer, S., Wilkerson, D. S. & Aiken, A. Winnowing: Local algorithms for document fingerprinting. in Proceedings of the 2003 ACM SIGMOD international conference on Management of data 76–85 (Association for Computing Machinery, 2003). doi:10.1145/872757.872770.
213.
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
214.
215.
Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
216.
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R. & Kingsford, C. Compact Universal k-mer Hitting Sets. in Algorithms in Bioinformatics (eds. Frith, M. & Storm Pedersen, C. N.) 257–268 (Springer International Publishing, 2016). doi:10.1007/978-3-319-43681-4_21.
217.
Marçais, G. et al. Improving the performance of minimizers and winnowing schemes. Bioinformatics 33, i110–i117 (2017).
218.
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. & Medvedev, P. On the Representation of de Bruijn Graphs. in Research in Computational Molecular Biology (ed. Sharan, R.) 35–55 (Springer International Publishing, 2014). doi:10.1007/978-3-319-05269-4_4.
219.
220.
Sahlin, K. Effective sequence similarity detection with strobemers. Genome Res. 31, 2080–2094 (2021).
221.
Sahlin, K. Flexible seed size enables ultra-fast and accurate read alignment. 2021.06.18.449070 (2022) doi:10.1101/2021.06.18.449070.
222.
Weiner, P. Linear pattern matching algorithms. in 14th Annual Symposium on Switching and Automata Theory (SWAT 1973) 1–11 (1973). doi:10.1109/swat.1973.13.
223.
Manber, U. & Myers, G. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22, 935–948 (1993).
224.
Abouelhoda, M. I., Kurtz, S. & Ohlebusch, E. The Enhanced Suffix Array and Its Applications to Genome Analysis. in Algorithms in Bioinformatics (eds. Guigó, R. & Gusfield, D.) 449–463 (Springer, 2002). doi:10.1007/3-540-45784-4_35.
225.
Ferragina, P. & Manzini, G. Opportunistic data structures with applications. in Proceedings 41st Annual Symposium on Foundations of Computer Science 390–398 (2000). doi:10.1109/sfcs.2000.892127.
226.
Bray, N., Dubchak, I. & Pachter, L. AVID: A Global Alignment Program. Genome Res 13, 97–102 (2003).
227.
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30, 2478–2483 (2002).
228.
Abouelhoda, M. I., Kurtz, S. & Ohlebusch, E. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004).
229.
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944 (2018).
230.
McCreight, E. M. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 262–272 (1976).
231.
Burrows, M. & Wheeler, D. A Block-Sorting Lossless Data Compression Algorithm. https://www.cs.jhu.edu/%7Elangmea/resources/burrows_wheeler.pdf (1994).
232.
Vyverman, M., De Baets, B., Fack, V. & Dawyndt, P. Prospects and limitations of full-text index structures in genome analysis. Nucleic Acids Research 40, 6993–7015 (2012).
233.
Cheng, H., Wu, M. & Xu, Y. FMtree: A fast locating algorithm of FM-indexes for genomic data. Bioinformatics 34, 416–424 (2018).
234.
Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K. & Yiu, S. M. Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008).
235.
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
236.
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
237.
238.
Liu, Y. & Schmidt, B. Long read alignment based on maximal exact match seeds. Bioinformatics 28, i318–i324 (2012).
239.
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).
240.
241.
Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (Cambridge University Press, 1998). doi:10.1017/cbo9780511790492.
242.
Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).
243.
Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res 39, W29–W37 (2011).
244.
245.
Ruffalo, M., LaFramboise, T. & Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27, 2790–2796 (2011).
246.
247.
Hatem, A., Bozdağ, D., Toland, A. E. & Çatalyürek, Ü. V. Benchmarking short sequence mapping tools. BMC Bioinformatics 14, 184 (2013).
248.
Canzar, S. & Salzberg, S. L. Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE 105, 436–458 (2017).
249.
Alser, M. et al. Technology dictates algorithms: Recent developments in read alignment. Genome Biology 22, 249 (2021).
250.
Břinda, K., Boeva, V. & Kucherov, G. RNF: A general framework to evaluate NGS read mappers. Bioinformatics 32, 136–139 (2016).
251.
Lin, H.-N. & Hsu, W.-L. Kart: A divide-and-conquer algorithm for NGS read alignment. Bioinformatics 33, 2281–2287 (2017).
252.
Olson, C. B. et al. Hardware Acceleration of Short Read Mapping. in 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines 161–168 (2012). doi:10.1109/fccm.2012.36.
253.
Chen, P., Wang, C., Li, X. & Zhou, X. Accelerating the Next Generation Long Read Mapping with the FPGA-Based System. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11, 840–852 (2014).
254.
255.
Zeni, A. et al. LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment. in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 462–471 (2020). doi:10.1109/ipdps47924.2020.00055.
256.
257.
Haghshenas, E., Sahinalp, S. C. & Hach, F. lordFAST: Sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data. Bioinformatics 35, 20–27 (2019).
258.
Sović, I. et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun 7, 11307 (2016).
259.
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15, 461–468 (2018).
260.
Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J Comput Biol 25, 766–779 (2018).
261.
262.
Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 19, 705–710 (2022).
263.
Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: Mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).
264.
Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).
265.
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
266.
267.
268.
Langmead, B. A tandem simulation framework for predicting mapping quality. Genome Biology 18, 152 (2017).
269.
Ruffalo, M., Koyutürk, M., Ray, S. & LaFramboise, T. Accurate estimation of short read mapping quality for next-generation genome sequencing. Bioinformatics 28, i349–i355 (2012).
270.
Multiple Sequence Alignment Methods. vol. 1079 (Humana Press, 2014).
271.
Wang, L. & Jiang, T. On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology 1, 337–348 (1994).
272.
Just, W. Computational Complexity of Multiple Sequence Alignment with SP-Score. Journal of Computational Biology 8, 615–623 (2001).
273.
Tang, F. et al. HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences. Molecular Biology and Evolution 39, msac166 (2022).
274.
Feng, D.-F. & Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25, 351–360 (1987).
275.
Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Bioinformatics 8, 275–282 (1992).
276.
Blaisdell, B. E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences 83, 5155–5159 (1986).
277.
Gronau, I. & Moran, S. Optimal implementations of UPGMA and other common clustering algorithms. Information Processing Letters 104, 205–210 (2007).
278.
Saitou, N. & Nei, M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425 (1987).
279.
280.
281.
Blackshields, G., Sievers, F., Shi, W., Wilm, A. & Higgins, D. G. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 5, 21 (2010).
282.
Altschul, S. F. Gap costs for multiple sequence alignment. Journal of Theoretical Biology 138, 297–309 (1989).
283.
Altschul, S. F., Carroll, R. J. & Lipman, D. J. Weights for data related by a tree. Journal of Molecular Biology 207, 647–653 (1989).
284.
Edgar, R. C. & Sjölander, K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20, 1301–1308 (2004).
285.
Notredame, C., Holm, L. & Higgins, D. G. COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14, 407–422 (1998).
286.
Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302, 205–217 (2000).
287.
288.
Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32, 1792–1797 (2004).
289.
Do, C. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15, 330–340 (2005).
290.
291.
Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. The CLUSTAL_X Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools. Nucleic Acids Research 25, 4876–4882 (1997).
292.
293.
Lemoine, F., Blassel, L., Voznica, J. & Gascuel, O. COVID-Align: Accurate online alignment of hCoV-19 genomes using a profile HMM. Bioinformatics (2020) doi:10.1093/bioinformatics/btaa871.
294.
Eddy, S. R. Multiple Alignment Using Hidden Markov Models. in International Conference on Intelligent Systems for Molecular Biology 7 (1995).
295.
Kim, J., Pramanik, S. & Chung, M. J. Multiple sequence alignment using simulated annealing. Bioinformatics 10, 419–426 (1994).
296.
Ishikawa, M. et al. Multiple sequence alignment by parallel simulated annealing. Bioinformatics 9, 267–273 (1993).
297.
Huo, H. & Stojkovic, V. A simulated annealing algorithm for multiple sequence alignment with guaranteed accuracy. in Third International Conference on Natural Computation (ICNC 2007) vol. 2 270–274 (2007).
298.
Chowdhury, B. & Garai, G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109, 419–431 (2017).
299.
Zhang, C. & Wong, A. K. C. A genetic algorithm for multiple molecular sequence alignment. Bioinformatics 13, 565–581 (1997).
300.
Naznin, F., Sarker, R. & Essam, D. Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment. BMC Bioinformatics 12, 353 (2011).
301.
Naznin, F., Sarker, R. & Essam, D. Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment. IEEE Transactions on Evolutionary Computation 16, 615–631 (2012).
302.
Notredame, C. & Higgins, D. G. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res 24, 1515–1524 (1996).
303.
Aksamentov, I., Roemer, C., Hodcroft, E. & Neher, R. Nextclade: Clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software 6, 3773 (2021).
304.
Garriga, E. et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol 37, 1466–1470 (2019).
305.
Notredame, C. Recent Evolutions of Multiple Sequence Alignment Algorithms. PLOS Computational Biology 3, e123 (2007).
306.
Notredame, C. Recent progress in multiple sequence alignment: A survey. Pharmacogenomics 3, 131–144 (2002).
307.
Edgar, R. C. & Batzoglou, S. Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373 (2006).
308.
Pais, F. S.-M., Ruy, P. de C., Oliveira, G. & Coimbra, R. S. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol 9, 4 (2014).
309.
Thompson, J. D., Plewniak, F. & Poch, O. BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88 (1999).
310.
Bragg, L., Stone, G., Imelfort, M., Hugenholtz, P. & Tyson, G. W. Fast, accurate error-correction of amplicon pyrosequences using Acacia. Nat Methods 9, 425–426 (2012).
311.
312.
Liu, H. et al. SMARTdenovo: A de novo assembler using long noisy reads. Gigabyte 2021, 1–9 (2021).
313.
Graham, R. L., Knuth, D. E. & Patashnik, O. Concrete mathematics: A foundation for computer science. (Addison-Wesley, 1994).
314.
Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
315.
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).
316.
Li, H. New strategies to improve Minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
317.
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15, 595–597 (2018).
318.
Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience 6, (2017).
319.
Martin, J. A. & Wang, Z. Next-generation transcriptome assembly. Nat Rev Genet 12, 671–682 (2011).
320.
Kyriakidou, M., Tai, H. H., Anglin, N. L., Ellis, D. & Strömvik, M. V. Current Strategies of Polyploid Plant Genome Sequence Assembly. Frontiers in Plant Science 9, (2018).
321.
Paszkiewicz, K. & Studholme, D. J. De novo assembly of short sequence reads. Briefings in Bioinformatics 11, 457–472 (2010).
322.
Sohn, J. & Nam, J.-W. The present and future of de novo whole-genome assembly. Briefings in Bioinformatics 19, 23–40 (2018).
323.
Sleator, R. D. & Walsh, P. An overview of in silico protein function prediction. Arch Microbiol 192, 151–155 (2010).
324.
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Medicine 12, 91 (2020).
325.
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat Rev Genet 12, 363–376 (2011).
326.
Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat Rev Genet 21, 171–189 (2020).
327.
Morrison, D. A. Phylogenetic tree-building. International Journal for Parasitology 26, 589–617 (1996).
328.
Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat Rev Genet 21, 428–444 (2020).
329.
Kuhlman, B. & Bradley, P. Advances in protein structure prediction and design. Nat Rev Mol Cell Biol 20, 681–697 (2019).
330.
Ammad-ud-din, M., Khan, S. A., Wennerberg, K. & Aittokallio, T. Systematic identification of feature combinations for predicting drug response with Bayesian multi-view multi-task linear regression. Bioinformatics 33, i359–i368 (2017).
331.
Steiner, M. C., Gibson, K. M. & Crandall, K. A. Drug Resistance Prediction Using Deep Learning Techniques on HIV-1 Sequence Data. Viruses 12, 560 (2020).
332.
Noé, F., De Fabritiis, G. & Clementi, C. Machine learning for protein folding and dynamics. Current Opinion in Structural Biology 60, 77–84 (2020).
333.
Pearce, R. & Zhang, Y. Toward the solution of the protein structure prediction problem. Journal of Biological Chemistry 297, (2021).
334.
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
335.
Cheng, J., Tegge, A. N. & Baldi, P. Machine Learning Methods for Protein Structure Prediction. IEEE Reviews in Biomedical Engineering 1, 41–49 (2008).
336.
AlQuraishi, M. Machine learning in protein structure prediction. Current Opinion in Chemical Biology 65, 1–8 (2021).
337.
Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Current Opinion in Structural Biology 69, 11–18 (2021).
338.
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019).
339.
Li, G., Dong, Y. & Reetz, M. T. Can Machine Learning Revolutionize Directed Evolution of Selective Enzymes? Advanced Synthesis & Catalysis 361, 2377–2386 (2019).
340.
Xie, R., Wen, J., Quitadamo, A., Cheng, J. & Shi, X. A deep auto-encoder model for gene expression prediction. BMC Genomics 18, 845 (2017).
341.
342.
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Computational Biology 13, e1005324 (2017).
343.
344.
345.
346.
347.
348.
Rätsch, G., Sonnenburg, S. & Schäfer, C. Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics 7, S9 (2006).
349.
350.
Alioto, T. Gene Prediction. in (ed. Anisimova, M.) 175–201 (Humana Press, 2012). doi:10.1007/978-1-61779-582-4_6.
351.
Fang, Z. et al. PlasGUN: Gene prediction in plasmid metagenomic short reads using deep learning. Bioinformatics 36, 3239–3241 (2020).
352.
Wei, L., Ding, Y., Su, R., Tang, J. & Zou, Q. Prediction of human protein subcellular localization using deep learning. Journal of Parallel and Distributed Computing 117, 212–217 (2018).
353.
Wang, H., Yan, L., Huang, H. & Ding, C. From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14, 503–513 (2017).
354.
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990–999 (2016).
355.
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. (Springer Science & Business Media, 2009).
356.
Kriventseva, E. V., Biswas, M. & Apweiler, R. Clustering and analysis of protein families. Current Opinion in Structural Biology 11, 334–339 (2001).
357.
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
358.
Balaban, M., Moshiri, N., Mai, U., Jia, X. & Mirarab, S. TreeCluster: Clustering biological sequences using phylogenetic trees. PLOS ONE 14, e0221068 (2019).
359.
Zorita, E., Cuscó, P. & Filion, G. J. Starcode: Sequence clustering based on all-pairs search. Bioinformatics 31, 1913–1919 (2015).
360.
Ondov, B. D. et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016).
361.
Baker, D. N. & Langmead, B. Dashing: Fast and accurate genomic distances with HyperLogLog. Genome Biology 20, 265 (2019).
362.
Corso, G. et al. Neural Distance Embeddings for Biological Sequences. in Advances in Neural Information Processing Systems vol. 34 18539–18551 (Curran Associates, Inc., 2021).
363.
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat Biotechnol 35, 128–135 (2017).
364.
Castro, B. M., Lemes, R. B., Cesar, J., Hünemeier, T. & Leonardi, F. A model selection approach for multiple sequence segmentation and dimensionality reduction. Journal of Multivariate Analysis 167, 319–330 (2018).
365.
Haschka, T., Ponger, L., Escudé, C. & Mozziconacci, J. MNHN-Tree-Tools: A toolbox for tree inference using multi-scale clustering of a set of sequences. Bioinformatics 37, 3947–3949 (2021).
366.
Konishi, T. et al. Principal Component Analysis applied directly to Sequence Matrix. Sci Rep 9, 19297 (2019).
367.
Ben-Hur, A. & Guyon, I. Detecting Stable Clusters Using Principal Component Analysis. in Functional Genomics: Methods and Protocols (eds. Brownstein, M. J. & Khodursky, A. B.) 159–182 (Humana Press, 2003). doi:10.1385/1-59259-364-x:159.
368.
Ding, C. & He, X. K-means clustering via principal component analysis. in Proceedings of the twenty-first international conference on Machine learning 29 (Association for Computing Machinery, 2004). doi:10.1145/1015330.1015408.
369.
Casari, G., Sander, C. & Valencia, A. Sequencespace: A tool for family analysis. Nat. Struct. Biol 2, 171–178 (1995).
370.
Clamp, M., Cuff, J., Searle, S. M. & Barton, G. J. The Jalview Java alignment editor. Bioinformatics 20, 426–427 (2004).
371.
Xia, Z., Wu, L.-Y., Zhou, X. & Wong, S. T. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Systems Biology 4, S6 (2010).
372.
Tamposis, I. A., Tsirigos, K. D., Theodoropoulou, M. C., Kontou, P. I. & Bagos, P. G. Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics 35, 2208–2215 (2019).
373.
Elnaggar, A. et al. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 1–1 (2021).
374.
375.
Townshend, R., Bedi, R., Suriana, P. & Dror, R. End-to-End Learning on 3D Protein Structure for Interface Prediction. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).
376.
Lee, B., Baek, J., Park, S. & Yoon, S. deepTarget: End-to-end Learning Framework for microRNA Target Prediction using Deep Recurrent Neural Networks. in Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 434–442 (Association for Computing Machinery, 2016). doi:10.1145/2975167.2975212.
377.
Goodfellow, I., Bengio, Y. & Courville, A. Deep learning. (MIT Press, 2016).
378.
Wang, Q., Ma, Y., Zhao, K. & Tian, Y. A Comprehensive Survey of Loss Functions in Machine Learning. Ann. Data. Sci. 9, 187–212 (2022).
379.
380.
Brodersen, K. H., Ong, C. S., Stephan, K. E. & Buhmann, J. M. The Balanced Accuracy and Its Posterior Distribution. in 2010 20th International Conference on Pattern Recognition 3121–3124 (2010). doi:10.1109/icpr.2010.764.
381.
Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: Formulation, detection, and avoidance. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining 556–563 (Association for Computing Machinery, 2011). doi:10.1145/2020408.2020496.
382.
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 23, 169–181 (2022).
383.
Fisher, R. A. On the Interpretation of χ² from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85, 87–94 (1922).
384.
385.
Hoerl, A. E. & Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55–67 (1970).
386.
Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).
387.
Zhang, H. The Optimality of Naive Bayes. in Proceedings of the 17th international FLAIRS conference (FLAIRS2004) 6 (2004).
388.
Rish, I. An empirical study of the naive Bayes classifier. in IJCAI 2001 workshop on empirical methods in artificial intelligence vol. 3 6 (2001).
389.
Vapnik, V. Estimation of Dependences Based on Empirical Data. (Springer-Verlag, 1982).
390.
Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. in Proceedings of the fifth annual workshop on Computational learning theory 144–152 (Association for Computing Machinery, 1992). doi:10.1145/130385.130401.
391.
Cortes, C. & Vapnik, V. Support-vector networks. Mach Learn 20, 273–297 (1995).
392.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. & Vapnik, V. Support Vector Regression Machines. in Advances in Neural Information Processing Systems vol. 9 (MIT Press, 1996).
393.
Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
394.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and regression trees. (1983). doi:10.1201/9781315139470.
395.
Kingsford, C. & Salzberg, S. L. What are decision trees? Nat Biotechnol 26, 1011–1013 (2008).
396.
Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in Proceedings of the 23rd international conference on Machine learning 161–168 (Association for Computing Machinery, 2006). doi:10.1145/1143844.1143865.
397.
Yang, P., Yang, Y. H., Zhou, B. B. & Zomaya, A. Y. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics 5, 296–308 (2010).
398.
399.
Hassani Saadi, H., Sameni, R. & Zollanvari, A. Interpretive time-frequency analysis of genomic sequences. BMC Bioinformatics 18, 154 (2017).
400.
Brouwer, R. K. A feed-forward network for input that is both categorical and quantitative. Neural Networks 15, 881–890 (2002).
401.
Kunanbayev, K., Temirbek, I. & Zollanvari, A. Complex Encoding. in 2021 International Joint Conference on Neural Networks (IJCNN) 1–6 (2021). doi:10.1109/ijcnn52387.2021.9534094.
402.
Dufresne, Y. et al. The K-mer File Format: A standardized and compact disk representation of sets of k-mers. Bioinformatics btac528 (2022) doi:10.1093/bioinformatics/btac528.
403.
Wright, E. S. Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R. The R Journal 8, 352–359 (2016).
404.
Zamani, M. & Kremer, S. C. Amino acid encoding schemes for machine learning methods. in 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW) 327–333 (2011). doi:10.1109/bibmw.2011.6112394.
405.
Singh, D., Singh, P. & Sisodia, D. S. Evolutionary based optimal ensemble classifiers for HIV-1 protease cleavage sites prediction. Expert Systems with Applications 109, 86–99 (2018).
406.
Qian, N. & Sejnowski, T. J. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 202, 865–884 (1988).
407.
408.
Choong, A. C. H. & Lee, N. K. Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. in 2017 International Conference on Computer and Drone Applications (IConDA) 60–65 (2017). doi:10.1109/iconda.2017.8270400.
409.
McGinnis, W. et al. Scikit-Learn-Contrib/Categorical-Encoding: Release For Zenodo. (2018) doi:10.5281/zenodo.1157110.
410.
Kawashima, S. et al. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36, D202–D205 (2008).
411.
412.
Nanni, L. & Lumini, A. A new encoding technique for peptide classification. Expert Systems with Applications 38, 3185–3191 (2011).
413.
414.
Taylor, W. R. The classification of amino acid conservation. Journal of Theoretical Biology 119, 205–218 (1986).
415.
Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. Journal of Molecular Biology 195, 957–961 (1987).
416.
417.
Maetschke, S., Towsey, M. & Bodén, M. Blomap: An encoding of amino acids which improves signal peptide cleavage site prediction. in Proceedings of the 3rd Asia-Pacific Bioinformatics Conference vol. 1 141–150 (Imperial College Press, 2005).
418.
Gök, M. & Özcerit, A. T. A new feature encoding scheme for HIV-1 protease cleavage site prediction. Neural Comput & Applic 22, 1757–1761 (2013).
419.
Saha, S. & Bhattacharya, T. A Novel Approach to Find the Saturation Point of n-Gram Encoding Method for Protein Sequence Classification Involving Data Mining. in International Conference on Innovative Computing and Communications (eds. Bhattacharyya, S., Hassanien, A. E., Gupta, D., Khanna, A. & Pan, I.) 101–108 (Springer, 2019). doi:10.1007/978-981-13-2354-6_12.
420.
Jeffrey, H. J. Chaos game representation of gene structure. Nucleic Acids Research 18, 2163–2170 (1990).
421.
Löchel, H. F. & Heider, D. Chaos game representation and its applications in bioinformatics. Computational and Structural Biotechnology Journal 19, 6263–6271 (2021).
422.
Cartes, J. A., Anand, S., Ciccolella, S., Bonizzoni, P. & Vedova, G. D. Accurate and Fast Clade Assignment via Deep Learning and Frequency Chaos Game Representation. 2022.06.13.495912 (2022) doi:10.1101/2022.06.13.495912.
423.
Ni, H., Mu, H. & Qi, D. Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses. Journal of Molecular Graphics and Modelling 107, 107942 (2021).
424.
Lwoff, A. The concept of virus. J Gen Microbiol 17, 239–253 (1957).
425.
Minor, P. D. Viruses. in eLS (John Wiley & Sons, Ltd, 2014). doi:10.1002/9780470015902.a0000441.pub3.
426.
Stapleton, J. T., Foung, S., Muerhoff, A. S., Bukh, J. & Simmonds, P. The GB viruses: A review and proposed classification of GBV-A, GBV-C (HGV), and GBV-D in genus Pegivirus within the family Flaviviridae. J Gen Virol 92, 233–246 (2011).
427.
428.
Shi, M. et al. The evolutionary history of vertebrate RNA viruses. Nature 556, 197–202 (2018).
429.
Adams, J. R. & Bonami, J.-R. Atlas of Invertebrate Viruses. (CRC Press, 2017). doi:10.1201/9781315149929.
430.
Lefeuvre, P. et al. Evolution and ecology of plant viruses. Nat Rev Microbiol 17, 632–644 (2019).
431.
Wang, A. L. & Wang, C. C. Viruses of parasitic protozoa. Parasitology Today 7, 76–80 (1991).
432.
Fermin, G., Mazumdar-Leighton, S. & Tennant, P. Viruses of prokaryotes, protozoa, fungi, and chromista. in Viruses: Molecular Biology, Host Interactions, and Applications to Biotechnology 217 (Academic Press, 2018). doi:10.1016/B978-0-12-811257-1.00009-7.
433.
Sutela, S., Poimala, A. & Vainio, E. J. Viruses of fungi and oomycetes in the soil environment. FEMS Microbiology Ecology 95, fiz119 (2019).
434.
Twort, F. W. An Investigation On The Nature Of Ultra-microscopic Viruses. The Lancet 186, 1241–1243 (1915).
435.
Delbrück, M. Bacterial Viruses or Bacteriophages. Biological Reviews 21, 30–40 (1946).
436.
Clark, J. R. & March, J. B. Bacterial viruses as human vaccines? Expert Review of Vaccines 3, 463–476 (2004).
437.
van Kan-Davelaar, H. E., van Hest, J. C. M., Cornelissen, J. J. L. M. & Koay, M. S. T. Using viruses as nanomedicines. British Journal of Pharmacology 171, 4001–4009 (2014).
438.
Prangishvili, D., Basta, T., Garrett, R. A. & Krupovic, M. Viruses of the Archaea. in eLS 1–9 (John Wiley & Sons, Ltd, 2016). doi:10.1002/9780470015902.a0000774.pub3.
439.
Prangishvili, D., Forterre, P. & Garrett, R. A. Viruses of the Archaea: A unifying view. Nat Rev Microbiol 4, 837–848 (2006).
440.
Francki, R. I. B. Plant virus satellites. Annual Review of Microbiology (1985).
441.
Xu, P. & Roossinck, M. J. Plant Virus Satellites. in eLS (John Wiley & Sons, Ltd, 2011). doi:10.1002/9780470015902.a0000771.pub2.
442.
Lai, M. M. The molecular biology of hepatitis delta virus. Annu Rev Biochem 64, 259–286 (1995).
443.
Hughes, S. A., Wedemeyer, H. & Harrison, P. M. Hepatitis delta virus. The Lancet 378, 73–85 (2011).
444.
Desnues, C., Boyer, M. & Raoult, D. Chapter 3 - Sputnik, a Virophage Infecting the Viral Domain of Life. in Advances in Virus Research (eds. Łobocka, M. & Szybalski, W. T.) vol. 82 63–89 (Academic Press, 2012).
445.
Gaia, M. et al. Zamilon, a Novel Virophage with Mimiviridae Host Specificity. PLOS ONE 9, e94923 (2014).
446.
Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).
447.
Nasir, A., Romero-Severson, E. & Claverie, J.-M. Investigating the Concept and Origin of Viruses. Trends in Microbiology 28, 959–967 (2020).
448.
Forterre, P. & Prangishvili, D. The origin of viruses. Research in Microbiology 160, 466–472 (2009).
449.
450.
Boeke, J. & Stoye, J. Retrotransposons, Endogenous Retroviruses, and the Evolution of Retroelement. in Retroviruses (eds. Coffin, J. M., Hughes, S. H. & Varmus, H. E.) (Cold Spring Harbor Laboratory Press, 1997).
451.
Kojima, S. et al. Virus-like insertions with sequence signatures similar to those of endogenous nonretroviral RNA viruses in the human genome. Proceedings of the National Academy of Sciences 118, e2010758118 (2021).
452.
Löwer, R., Löwer, J. & Kurth, R. The viruses in all of us: Characteristics and biological significance of human endogenous retrovirus sequences. Proceedings of the National Academy of Sciences 93, 5177–5184 (1996).
453.
Griffiths, D. J. Endogenous retroviruses in the human genome sequence. Genome Biol 2, reviews1017.1 (2001).
454.
Baltimore, D. Expression of animal virus genomes. Bacteriol Rev 35, 235–241 (1971).
455.
Koonin, E. V., Krupovic, M. & Agol, V. I. The Baltimore Classification of Viruses 50 Years Later: How Does It Stand in the Light of Virus Evolution? Microbiology and Molecular Biology Reviews 85, e00053-21 (2021).
456.
Domingo, E. & Perales, C. RNA Virus Genomes. in eLS 1–12 (John Wiley & Sons, Ltd, 2018). doi:10.1002/9780470015902.a0001488.pub3.
457.
McGeoch, D. J., Rixon, F. J. & Davison, A. J. Topics in herpesvirus genomics and evolution. Virus Research 117, 90–104 (2006).
458.
Boehmer, P. & Nimonkar, A. Herpes Virus Replication. IUBMB Life 55, 13–22 (2003).
459.
Brentjens, M. H., Yeung-Yue, K. A., Lee, P. C. & Tyring, S. K. Human papillomavirus: A review. Dermatologic Clinics 20, 315–331 (2002).
460.
Kay, A. & Zoulim, F. Hepatitis B virus genetic variability and evolution. Virus Research 127, 164–176 (2007).
461.
Parashar, U. D., Bresee, J. S., Gentsch, J. R. & Glass, R. I. Rotavirus. Emerg Infect Dis 4, 561–570 (1998).
462.
Simmonds, P. Variability of hepatitis C virus. Hepatology 21, 570–583 (1995).
463.
Wimmer, E., Hellen, C. U. T. & Cao, X. Genetics of poliovirus. Annual Review of Genetics 27, 353–437 (1993).
464.
Racaniello, V. R. One hundred years of poliovirus pathogenesis. Virology 344, 9–16 (2006).
465.
Palese, P., Zheng, H., Engelhardt, O. G., Pleschka, S. & García-Sastre, A. Negative-strand RNA viruses: Genetic engineering and applications. Proceedings of the National Academy of Sciences 93, 11354–11358 (1996).
466.
Domingo, E. & Perales, C. Virus Evolution. in eLS (John Wiley & Sons, Ltd, 2014). doi:10.1002/9780470015902.a0000436.pub3.
467.
V’kovski, P., Kratzel, A., Steiner, S., Stalder, H. & Thiel, V. Coronavirus biology and replication: Implications for SARS-CoV-2. Nat Rev Microbiol 19, 155–170 (2021).
468.
Bäck, A. T. & Lundkvist, Å. Dengue viruses – an overview. Infect Ecol Epidemiol 3, 19839 (2013). doi:10.3402/iee.v3i0.19839.
469.
Dustin, L. B., Bartolini, B., Capobianchi, M. R. & Pistello, M. Hepatitis C virus: Life cycle in cells, infection and host response, and analysis of molecular markers influencing the outcome of infection and response to therapy. Clin Microbiol Infect 22, 826–832 (2016).
470.
Kadaja, M., Silla, T., Ustav, E. & Ustav, M. Papillomavirus DNA replication — From initiation to genomic instability. Virology 384, 360–368 (2009).
471.
Weller, S. K. & Coen, D. M. Herpes Simplex Viruses: Mechanisms of DNA Replication. Cold Spring Harb Perspect Biol 4, a013011 (2012).
472.
Beck, J. & Nassal, M. Hepatitis B virus replication. World J Gastroenterol 13, 48–64 (2007).
473.
Pyle, J. D. & Scholthof, K.-B. G. Chapter 58 - Biology and Pathogenesis of Satellite Viruses. in Viroids and Satellites (eds. Hadidi, A., Flores, R., Randles, J. W. & Palukaitis, P.) 627–636 (Academic Press, 2017). doi:10.1016/b978-0-12-801498-1.00058-9.
474.
Raoult, D. et al. The 1.2-megabase genome sequence of Mimivirus. Science 306, 1344–1350 (2004).
475.
Campillo-Balderas, J. A., Lazcano, A. & Becerra, A. Viral Genome Size Distribution Does not Correlate with the Antiquity of the Host Lineages. Frontiers in Ecology and Evolution 3, (2015).
476.
Cann, A. J. Virus Structure. in eLS 1–9 (John Wiley & Sons, Ltd, 2015). doi:10.1002/9780470015902.a0000439.pub2.
477.
Hladik, F. & McElrath, M. J. Setting the stage: Host invasion by HIV. Nat Rev Immunol 8, 447–457 (2008).
478.
Shaw, G. M. & Hunter, E. HIV Transmission. Cold Spring Harb Perspect Med 2, a006965 (2012).
479.
Weiss, R. A. How Does HIV Cause AIDS? Science 260, 1273–1279 (1993).
480.
Melhuish, A. & Lewthwaite, P. Natural history of HIV and AIDS. Medicine 46, 356–361 (2018).
481.
Murray, J. F. et al. Pulmonary complications of the acquired immunodeficiency syndrome. New England Journal of Medicine 310, 1682–1688 (1984).
482.
Sampath, S. et al. Pandemics Throughout the History. Cureus 13, (2021).
483.
World Health Organization. Global report: UNAIDS report on the global AIDS epidemic 2010. (World Health Organization, 2010).
484.
485.
486.
Clavel, F. et al. Isolation of a New Human Retrovirus from West African Patients with AIDS. Science 233, 343–346 (1986).
487.
Gilbert, P. B. et al. Comparison of HIV-1 and HIV-2 infectivity from a prospective cohort study in Senegal. Statistics in Medicine 22, 573–593 (2003).
488.
489.
Gao, F. et al. Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397, 436–441 (1999).
490.
Hamel, D. J. et al. Twenty years of prospective molecular epidemiology in Senegal: Changes in HIV diversity. AIDS Res Hum Retroviruses 23, 1189–1196 (2007).
491.
Sharp, P. M. & Hahn, B. H. Origins of HIV and the AIDS Pandemic. Cold Spring Harb Perspect Med 1, a006841 (2011).
492.
Hirsch, V. M., Olmsted, R. A., Murphey-Corb, M., Purcell, R. H. & Johnson, P. R. An African primate lentivirus (SIVsm) closely related to HIV-2. Nature 339, 389–392 (1989).
493.
494.
495.
Hemelaar, J. The origin and diversity of the HIV-1 pandemic. Trends in Molecular Medicine 18, 182–192 (2012).
496.
Worobey, M. et al. Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455, 661–664 (2008).
497.
498.
Faria, N. R. et al. The early spread and epidemic ignition of HIV-1 in human populations. Science 346, 56–61 (2014).
499.
Korber, B. et al. Timing the ancestor of the HIV-1 pandemic strains. Science 288, 1789–1796 (2000).
500.
Rambaut, A., Posada, D., Crandall, K. A. & Holmes, E. C. The causes and consequences of HIV evolution. Nat Rev Genet 5, 52–61 (2004).
501.
McCutchan, F. E. Global epidemiology of HIV. Journal of Medical Virology 78, S7–S12 (2006).
502.
Pérez-Losada, M., Arenas, M., Galán, J. C., Palero, F. & González-Candelas, F. Recombination in viruses: Mechanisms, methods of study, and evolutionary consequences. Infect Genet Evol 30, 296–307 (2015).
503.
Robertson, D. L., Hahn, B. H. & Sharp, P. M. Recombination in AIDS viruses. J Mol Evol 40, 249–259 (1995).
504.
HIV Circulating Recombinant Forms (CRFs). https://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html.
505.
Lau, K. A. & Wong, J. J. L. Current Trends of HIV Recombination Worldwide. Infect Dis Rep 5, e4 (2013).
506.
Posada, D., Crandall, K. A. & Holmes, E. C. Recombination in evolutionary genomics. Annu Rev Genet 36, 75–97 (2002).
507.
Taylor, B. S., Sobieszczyk, M. E., McCutchan, F. E. & Hammer, S. M. The Challenge of HIV-1 Subtype Diversity. New England Journal of Medicine 358, 1590–1602 (2008).
508.
Hemelaar, J., Gouws, E., Ghys, P. D. & Osmanov, S. Global trends in molecular epidemiology of HIV-1 during 2000–2007. AIDS 25, 679–689 (2011).
509.
Distribution of all HIV-1 sequences: WORLD. https://www.hiv.lanl.gov/components/sequence/HIV/geo/geo.comp.
510.
Freed, E. O. HIV-1 Replication. Somat Cell Mol Genet 26, 13–33 (2001).
511.
Ferguson, M. R., Rojo, D. R., von Lindern, J. J. & O’Brien, W. A. HIV-1 replication cycle. Clin Lab Med 22, 611–635 (2002).
512.
Gougeon, M. L., Laurent-Crawford, A. G., Hovanessian, A. G. & Montagnier, L. Direct and indirect mechanisms mediating apoptosis during HIV infection: Contribution to in vivo CD4 T cell depletion. Seminars in Immunology 5, 187–194 (1993).
513.
Vidya Vijayan, K. K., Karthigeyan, K. P., Tripathi, S. P. & Hanna, L. E. Pathophysiology of CD4+ T-Cell Depletion in HIV-1 and HIV-2 Infections. Front Immunol 8, 580 (2017).
514.
Frankel, A. D. & Young, J. A. HIV-1: Fifteen proteins and an RNA. Annu Rev Biochem 67, 1–25 (1998).
515.
Fossen, T. et al. Solution Structure of the Human Immunodeficiency Virus Type 1 p6 Protein. Journal of Biological Chemistry 280, 42515–42527 (2005).
516.
Göttlinger, H. G., Dorfman, T., Sodroski, J. G. & Haseltine, W. A. Effect of mutations affecting the p6 gag protein on human immunodeficiency virus particle release. Proceedings of the National Academy of Sciences 88, 3195–3199 (1991).
517.
Huang, M., Orenstein, J. M., Martin, M. A. & Freed, E. O. p6Gag is required for particle production from full-length human immunodeficiency virus type 1 molecular clones expressing protease. Journal of Virology 69, 6810–6818 (1995).
518.
Bour, S., Geleziunas, R. & Wainberg, M. A. The human immunodeficiency virus type 1 (HIV-1) CD4 receptor and its central role in promotion of HIV-1 infection. Microbiological Reviews 59, 63–93 (1995).
519.
Hernandez, L. D., Hoffman, L. R., Wolfsberg, T. G. & White, J. M. Virus-cell and cell-cell fusion. Annu Rev Cell Dev Biol 12, 627–661 (1996).
520.
Jones, K. & Peterlin, B. Control of RNA Initiation and Elongation at the HIV-1 Promoter. Annu. Rev. Biochem. 63, 717–743 (1994).
521.
Hope, T. J. Viral RNA export. Chemistry & Biology 4, 335–344 (1997).
522.
Mangasarian, A. & Trono, D. The multifaceted role of HIV Nef. Research in Virology 148, 30–33 (1997).
523.
Cohen, É. A., Subbramanian, R. A. & Göttlinger, H. G. Role of Auxiliary Proteins in Retroviral Morphogenesis. in Morphogenesis and Maturation of Retroviruses (ed. Kräusslich, H.-G.) 219–235 (Springer, 1996). doi:10.1007/978-3-642-80145-7_7.
524.
525.
Khan, N. & Geiger, J. D. Role of Viral Protein U (Vpu) in HIV-1 Infection and Pathogenesis. Viruses 13, 1466 (2021).
526.
Emerman, M. HIV-1, Vpr and the cell cycle. Current Biology 6, 1096–1103 (1996).
527.
528.
529.
Cassan, E., Arigon-Chifolleau, A.-M., Mesnard, J.-M., Gross, A. & Gascuel, O. Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic. Proceedings of the National Academy of Sciences 113, 11537–11542 (2016).
530.
531.
532.
Eisinger, R. W., Dieffenbach, C. W. & Fauci, A. S. HIV Viral Load and Transmissibility of HIV Infection: Undetectable Equals Untransmittable. JAMA 321, 451–452 (2019).
533.
Palella, F. J. et al. Declining Morbidity and Mortality among Patients with Advanced Human Immunodeficiency Virus Infection. New England Journal of Medicine 338, 853–860 (1998).
534.
535.
Fischl, M. A. et al. The Efficacy of Azidothymidine (AZT) in the Treatment of Patients with AIDS and AIDS-Related Complex. New England Journal of Medicine 317, 185–191 (1987).
536.
Richman, D. D. Susceptibility to nucleoside analogues of zidovudine-resistant isolates of human immunodeficiency virus. The American Journal of Medicine 88, S8–S10 (1990).
537.
Yeo, J. Y., Goh, G.-R., Su, C. T.-T. & Gan, S. K.-E. The Determination of HIV-1 RT Mutation Rate, Its Possible Allosteric Effects, and Its Implications on Drug Resistance. Viruses 12, 297 (2020).
538.
Cuevas, J. M., Geller, R., Garijo, R., López-Aldeguer, J. & Sanjuán, R. Extremely High Mutation Rate of HIV-1 In Vivo. PLOS Biology 13, e1002251 (2015).
539.
Carvajal-Rodríguez, A., Crandall, K. A. & Posada, D. Recombination favors the evolution of drug resistance in HIV-1 during antiretroviral therapy. Infect Genet Evol 7, 476–483 (2007).
540.
541.
Wensing, A. M. J., van Maarseveen, N. M. & Nijhuis, M. Fifteen years of HIV Protease Inhibitors: Raising the barrier to resistance. Antiviral Research 85, 59–74 (2010).
542.
Pedersen, O. S. & Pedersen, E. B. Non-Nucleoside Reverse Transcriptase Inhibitors: The NNRTI Boom. Antivir Chem Chemother 10, 285–314 (1999).
543.
Scarsi, K. K., Havens, J. P., Podany, A. T., Avedissian, S. N. & Fletcher, C. V. HIV-1 Integrase Inhibitors: A Comparative Review of Efficacy and Safety. Drugs 80, 1649–1676 (2020).
544.
Fletcher, C. V. Enfuvirtide, a new drug for HIV infection. The Lancet 361, 1577–1578 (2003).
545.
Esté, J. A. & Telenti, A. HIV entry inhibitors. The Lancet 370, 81–88 (2007).
546.
Kilby, J. M. & Eron, J. J. Novel Therapies Based on Mechanisms of HIV-1 Cell Entry. N Engl J Med 348, 2228–2238 (2003).
547.
Yeni, P. Update on HAART in HIV. Journal of Hepatology 44, S100–S103 (2006).
548.
Palmisano, L. & Vella, S. A brief history of antiretroviral therapy of HIV infection: Success and challenges. Ann Ist Super Sanita 47, 44–48 (2011).
549.
Pennings, P. S. HIV drug resistance: Problems and perspectives. Infectious Disease Reports 5, e5 (2013).
550.
Mehta, S., Moore, R. D. & Graham, N. M. H. Potential factors affecting adherence with HIV therapy. AIDS 11, 1665–1670 (1997).
551.
Miller, N. H. Compliance with treatment regimens in chronic asymptomatic diseases. The American Journal of Medicine 102, 43–49 (1997).
552.
Chesney, M. A., Morin, M. & Sherr, L. Adherence to HIV combination therapy. Social Science & Medicine 50, 1599–1605 (2000).
553.
Aldir, I., Horta, A. & Serrado, M. Single-tablet regimens in HIV: Does it really make a difference? Current Medical Research and Opinion 30, 89–97 (2014).
554.
Grant, R. M. et al. Preexposure Chemoprophylaxis for HIV Prevention in Men Who Have Sex with Men. N Engl J Med 363, 2587–2599 (2010).
555.
Baeten, J. M. et al. Antiretroviral prophylaxis for HIV prevention in heterosexual men and women. N Engl J Med 367, 399–410 (2012).
556.
Buchbinder, S. P. & Liu, A. Pre-exposure prophylaxis and the promise of combination prevention approaches. AIDS Behav 15 Suppl 1, S72–79 (2011).
557.
Riddell, J., IV, Amico, K. R. & Mayer, K. H. HIV Preexposure Prophylaxis: A Review. JAMA 319, 1261–1268 (2018).
558.
559.
About PrEP. https://www.cdc.gov/hiv/basics/prep/about-prep.html (2022).
560.
Zolopa, A. R. The evolution of HIV treatment guidelines: Current state-of-the-art of ART. Antiviral Research 85, 241–244 (2010).
561.
World Health Organization. Consolidated guidelines on HIV prevention, testing, treatment, service delivery and monitoring: Recommendations for a public health approach. https://www.who.int/publications-detail-redirect/9789240031593 (2021).
562.
Ammaranond, P. & Sanguansittianan, S. Mechanism of HIV antiretroviral drugs progress toward drug resistance. Fundamental & Clinical Pharmacology 26, 146–161 (2012).
563.
Clavel, F. & Hance, A. J. HIV Drug Resistance. New England Journal of Medicine 350, 1023–1035 (2004).
564.
565.
Goodsell, D. S., Autin, L. & Olson, A. J. Illustrate: Software for Biomolecular Illustration. Structure 27, 1716–1720.e1 (2019).
566.
567.
Hang, J. Q. et al. Activity of the isolated HIV RNase H domain and specific inhibition by N-hydroxyimides. Biochemical and Biophysical Research Communications 317, 321–329 (2004).
568.
Klumpp, K. & Mirzadegan, T. Recent Progress in the Design of Small Molecule Inhibitors of HIV RNase H. Current Pharmaceutical Design 12, 1909–1922 (2006).
569.
570.
Sluis-Cremer, N., Arion, D. & Parniak, M. A. Molecular mechanisms of HIV-1 resistance to nucleoside reverse transcriptase inhibitors (NRTIs). CMLS, Cell. Mol. Life Sci. 57, 1408–1422 (2000).
571.
Sarafianos, S. G. et al. Lamivudine (3TC) resistance in HIV-1 reverse transcriptase involves steric hindrance with beta-branched amino acids. Proc Natl Acad Sci U S A 96, 10027–10032 (1999).
572.
Meyer, P. R., Matsuura, S. E., Mian, A. M., So, A. G. & Scott, W. A. A mechanism of AZT resistance: An increase in nucleotide-dependent primer unblocking by mutant HIV-1 reverse transcriptase. Mol Cell 4, 35–43 (1999).
573.
Boyer, P. L., Sarafianos, S. G., Arnold, E. & Hughes, S. H. Selective Excision of AZTMP by Drug-Resistant Human Immunodeficiency Virus Reverse Transcriptase. J Virol 75, 4832–4842 (2001).
574.
Deeks, S. G. Nonnucleoside Reverse Transcriptase Inhibitor Resistance. JAIDS Journal of Acquired Immune Deficiency Syndromes 26, S25 (2001).
575.
576.
Lloyd, S. B., Kent, S. J. & Winnall, W. R. The High Cost of Fidelity. AIDS Research and Human Retroviruses 30, 8–16 (2014).
577.
Pearl, L. H. & Taylor, W. R. A structural model for the retroviral proteases. Nature 329, 351–354 (1987).
578.
Gulnik, S., Erickson, J. W. & Xie, D. HIV protease: Enzyme function and drug resistance. in Vitamins & Hormones vol. 58 213–256 (Academic Press, 2000).
579.
Silva, A. M., Cachau, R. E., Sham, H. L. & Erickson, J. W. Inhibition and catalytic mechanism of HIV-1 aspartic protease. Journal of Molecular Biology 255, 321–340 (1996).
580.
Hornak, V., Okur, A., Rizzo, R. C. & Simmerling, C. HIV-1 protease flaps spontaneously open and reclose in molecular dynamics simulations. Proceedings of the National Academy of Sciences 103, 915–920 (2006).
581.
582.
583.
Roberts, N. A. et al. Rational Design of Peptide-Based HIV Proteinase Inhibitors. Science 248, 358–361 (1990).
584.
Lv, Z., Chu, Y. & Wang, Y. HIV protease inhibitors: A review of molecular selectivity and toxicity. HIV AIDS (Auckl) 7, 95–104 (2015).
585.
586.
587.
Kurt Yilmaz, N., Swanstrom, R. & Schiffer, C. A. Improving Viral Protease Inhibitors to Counter Drug Resistance. Trends in Microbiology 24, 547–557 (2016).
588.
Chiu, T. K. & Davies, D. R. Structure and Function of HIV-1 Integrase. Current Topics in Medicinal Chemistry 4, 965–977 (2004).
589.
Esposito, D. & Craigie, R. HIV Integrase Structure and Function. in Advances in Virus Research (eds. Maramorosch, K., Murphy, F. A. & Shatkin, A. J.) vol. 52 319–333 (Academic Press, 1999).
590.
Delelis, O., Carayon, K., Saïb, A., Deprez, E. & Mouscadet, J.-F. Integrase and integration: Biochemical activities of HIV-1 integrase. Retrovirology 5, 114 (2008).
591.
Maertens, G. N., Engelman, A. N. & Cherepanov, P. Structure and function of retroviral integrase. Nat Rev Microbiol 20, 20–34 (2022).
592.
Pommier, Y., Johnson, A. A. & Marchand, C. Integrase inhibitors to treat HIV/AIDS. Nat Rev Drug Discov 4, 236–248 (2005).
593.
Blanco, J.-L., Varghese, V., Rhee, S.-Y., Gatell, J. M. & Shafer, R. W. HIV-1 Integrase Inhibitor Resistance and Its Clinical Implications. The Journal of Infectious Diseases 203, 1204–1214 (2011).
594.
Geretti, A. M., Armenia, D. & Ceccherini-Silberstein, F. Emerging patterns and implications of HIV-1 integrase inhibitor resistance. Current Opinion in Infectious Diseases 25, 677–686 (2012).
595.
Knox, D. C., Anderson, P. L., Harrigan, P. R. & Tan, D. H. S. Multidrug-Resistant HIV-1 Infection despite Preexposure Prophylaxis. N Engl J Med 376, 501–502 (2017).
596.
Hurt, C. B., Eron, J. J. & Cohen, M. S. Pre-exposure prophylaxis and antiretroviral resistance: HIV prevention at a cost? Clin Infect Dis 53, 1265–1270 (2011).
597.
Gibas, K. M., van den Berg, P., Powell, V. E. & Krakower, D. S. Drug Resistance During HIV Pre-Exposure Prophylaxis. Drugs 79, 609–619 (2019).
598.
599.
600.
601.
Boerma, R. S. et al. High levels of pre-treatment HIV drug resistance and treatment failure in Nigerian children. Journal of the International AIDS Society 19, 21140 (2016).
602.
Clutter, D. S., Jordan, M. R., Bertagnolio, S. & Shafer, R. W. HIV-1 drug resistance and resistance testing. Infection, Genetics and Evolution 46, 292–307 (2016).
603.
Kühnert, D. et al. Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics. PLOS Pathogens 14, e1006895 (2018).
604.
Mesplède, T. et al. Viral fitness cost prevents HIV-1 from evading dolutegravir drug pressure. Retrovirology 10, 22 (2013).
605.
Castro, H. et al. Persistence of HIV-1 Transmitted Drug Resistance Mutations. J Infect Dis 208, 1459–1463 (2013).
606.
Blassel, L. et al. Drug resistance mutations in HIV: New bioinformatics approaches and challenges. Current Opinion in Virology 51, 56–64 (2021).
607.
608.
Abeler-Dörner, L. et al. PANGEA-HIV 2: Phylogenetics And Networks for Generalised Epidemics in Africa. Current Opinion in HIV and AIDS 14, 173–180 (2019).
609.
Shafer, R. W. Rationale and Uses of a Public HIV Drug Resistance Database. J Infect Dis 194, S51–S58 (2006).
610.
Kuiken, C., Korber, B. & Shafer, R. W. HIV Sequence Databases. AIDS Rev 5, 52–61 (2003).
611.
Wensing, A. M. et al. 2019 update of the drug resistance mutations in HIV-1. Top Antivir Med 27, 111–121 (2019).
612.
Clark, S. A., Calef, C. & Mellors, J. W. Mutations in retroviral genes associated with drug resistance. HIV sequence compendium 58–158 (2007).
613.
Liu, T. F. & Shafer, R. W. Web Resources for HIV Type 1 Genotypic-Resistance Test Interpretation. Clin Infect Dis 42, 1608–1618 (2006).
614.
Johnson, V. A. et al. Update of the Drug Resistance Mutations in HIV-1: March 2013. Top Antivir Med 21, 6–7 (2016).
615.
616.
Shulman, N. S., Bosch, R. J., Mellors, J. W., Albrecht, M. A. & Katzenstein, D. A. Genetic correlates of efavirenz hypersusceptibility. AIDS 18, 1781–1785 (2004).
617.
618.
Brown, B. W. & Russell, K. Methods correcting for multiple testing: Operating characteristics. Statistics in Medicine 16, 2511–2528 (1997).
619.
Austin, P. C., Mamdani, M. M., Juurlink, D. N. & Hux, J. E. Testing multiple statistical hypotheses resulted in spurious associations: A study of astrological signs and health. Journal of Clinical Epidemiology 59, 964–969 (2006).
620.
Hochberg, Y. & Tamhane, A. C. Multiple comparison procedures. (1987). doi:10.1002/9780470316672.
621.
Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 57, 289–300 (1995).
622.
623.
Seoighe, C. et al. A Model of Directional Selection Applied to the Evolution of Drug Resistance in HIV-1. Molecular Biology and Evolution 24, 1025–1031 (2007).
624.
Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nature Reviews Genetics 15, 335–346 (2014).
625.
626.
627.
Petropoulos, C. J. et al. A Novel Phenotypic Drug Susceptibility Assay for Human Immunodeficiency Virus Type 1. Antimicrobial Agents and Chemotherapy 44, 920–928 (2000).
628.
629.
Heilek-Snyder, G. & Bean, P. Role of HIV phenotypic assays in the management of HIV infection. Am Clin Lab 21, 40–43 (2002).
630.
Moyle, G. J. et al. Epidemiology and Predictive Factors for Chemokine Receptor Use in HIV-1 Infection. The Journal of Infectious Diseases 191, 866–872 (2005).
631.
Gartland, M. et al. Susceptibility of global HIV-1 clinical isolates to fostemsavir using the PhenoSense® Entry assay. Journal of Antimicrobial Chemotherapy 76, 648–652 (2021).
632.
633.
634.
635.
Tambuyzer, L., Nijs, S., Daems, B., Picchio, G. & Vingerhoets, J. Effect of Mutations at Position E138 in HIV-1 Reverse Transcriptase on Phenotypic Susceptibility and Virologic Response to Etravirine. JAIDS Journal of Acquired Immune Deficiency Syndromes 58, 18–22 (2011).
636.
637.
Blassel, L. et al. Using machine learning and big data to explore the drug resistance landscape in HIV. PLOS Computational Biology 17, e1008873 (2021).
638.
Sheik Amamuddy, O., Bishop, N. T. & Tastan Bishop, Ö. Improving fold resistance prediction of HIV-1 against protease and reverse transcriptase inhibitors using artificial neural networks. BMC Bioinformatics 18, 369 (2017).
639.
Beerenwinkel, N. et al. Geno2pheno: Interpreting genotypic HIV drug resistance tests. IEEE Intelligent Systems 16, 35–41 (2001).
640.
Riemenschneider, M., Hummel, T. & Heider, D. SHIVA - a web application for drug resistance and tropism testing in HIV. BMC Bioinformatics 17, 314 (2016).
641.
642.
Heider, D., Senge, R., Cheng, W. & Hüllermeier, E. Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction. Bioinformatics 29, 1946–1952 (2013).
643.
Lepri, A. C. et al. Resistance Profiles in Patients with Viral Rebound on Potent Antiretroviral Therapy. J Infect Dis 181, 1143–1147 (2000).
644.
645.
Zhukova, A., Cutino-Moguel, T., Gascuel, O. & Pillay, D. The Role of Phylogenetics as a Tool to Predict the Spread of Resistance. J Infect Dis 216, S820–S823 (2017).
646.
647.
Hammond, J., Calef, C., Larder, B., Schinazi, R. & Mellors, J. W. Mutations in Retroviral Genes Associated with Drug Resistance. Human retroviruses and AIDS 11136–11179 (1998).
648.
649.
Dudoit, S. & van der Laan, M. J. Multiple Testing Procedures with Applications to Genomics. (Springer Science & Business Media, 2007). doi:10.1007/978-0-387-49317-6.
650.
Maddison, W. P. & FitzJohn, R. G. The Unsolved Challenge to Phylogenetic Correlation Tests for Categorical Characters. Syst Biol 64, 127–136 (2015).
651.
Lengauer, T. & Sing, T. Bioinformatics-assisted anti-HIV therapy. Nat Rev Microbiol 4, 790–797 (2006).
652.
Zhang, J., Rhee, S.-Y., Taylor, J. & Shafer, R. W. Comparison of the Precision and Sensitivity of the Antivirogram and PhenoSense HIV Drug Susceptibility Assays. JAIDS Journal of Acquired Immune Deficiency Syndromes 38, 439–444 (2005).
653.
Beerenwinkel, N. et al. Geno2pheno: Estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Research 31, 3850–3855 (2003).
654.
Shen, C., Yu, X., Harrison, R. W. & Weber, I. T. Automated prediction of HIV drug resistance from genotype data. BMC Bioinformatics 17, 278 (2016).
655.
Yu, X., Weber, I. T. & Harrison, R. W. Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure. BMC Genomics 15, S1 (2014).
656.
Araya, S. T. & Hazelhurst, S. Support vector machine prediction of HIV-1 drug resistance using the viral nucleotide patterns. Transactions of the Royal Society of South Africa 64, 62–72 (2009).
657.
658.
Drăghici, S. & Potter, R. B. Predicting HIV drug resistance with neural networks. Bioinformatics 19, 98–107 (2003).
659.
660.
Brier, G. W. Verification of Forecasts Expressed in Terms of Probability. Mon. Wea. Rev. 78, 1–3 (1950).
661.
Gascuel, O. et al. Twelve Numerical, Symbolic and Hybrid Supervised Classification Methods. Int. J. Patt. Recogn. Artif. Intell. 12, 517–571 (1998).
662.
Goeman, J. J. & Solari, A. Multiple hypothesis testing in genomics. Statistics in Medicine 33, 1946–1978 (2014).
663.
Rennie, J. D., Shih, L., Teevan, J. & Karger, D. R. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. in Proceedings of the 20th international conference on machine learning (ICML-03) 616–623 (2003).
664.
Alvarez Melis, D. & Jaakkola, T. Towards Robust Interpretability with Self-Explaining Neural Networks. in Advances in Neural Information Processing Systems 31 (eds. Bengio, S. et al.) 7775–7784 (Curran Associates, Inc., 2018).
665.
Zhang, Q., Wu, Y. N. & Zhu, S.-C. Interpretable Convolutional Neural Networks. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8827–8836 (2018). doi:10.1109/CVPR.2018.00920.
666.
Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. (2015).
667.
Rhee, S.-Y., Liu, T. F., Holmes, S. P. & Shafer, R. W. HIV-1 Subtype B Protease and Reverse Transcriptase Amino Acid Covariation. PLOS Computational Biology 3, e87 (2007).
668.
669.
Marcelin, A.-G. et al. Impact of HIV-1 reverse transcriptase polymorphism at codons 211 and 228 on virological response to didanosine. Antiviral Therapy 8 (2006) doi:10.1177/135965350601100609.
670.
671.
672.
Nebbia, G., Sabin, C. A., Dunn, D. T. & Geretti, A. M. Emergence of the H208Y mutation in the reverse transcriptase (RT) of HIV-1 in association with nucleoside RT inhibitor therapy. J Antimicrob Chemother 59, 1013–1016 (2007).
673.
674.
Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. & Lange, K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 714–721 (2009).
675.
676.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
677.
Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 2, 183–197 (1991).
678.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989).
679.
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989).
680.
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251–257 (1991).
681.
LeCun, Y. et al. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1, 541–551 (1989).
682.
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
683.
684.
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
685.
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
686.
Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2016) doi:10.48550/arXiv.1409.0473.
687.
Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
688.
How many words are there in English? | Merriam-Webster. https://www.merriam-webster.com/help/faq-how-many-english-words.
689.
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space. (2013) doi:10.48550/arXiv.1301.3781.
690.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed Representations of Words and Phrases and their Compositionality. in Advances in Neural Information Processing Systems vol. 26 (Curran Associates, Inc., 2013).
691.
Goldberg, Y. & Levy, O. Word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. (2014) doi:10.48550/arXiv.1402.3722.
692.
Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. (2017) doi:10.48550/arXiv.1701.06279.
693.
Liang, Y. et al. Hyb4mC: A hybrid DNA2vec-based model for DNA N4-methylcytosine sites prediction. BMC Bioinformatics 23, 258 (2022).
694.
Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed Representations for Biological Sequence Analysis. (2016) doi:10.48550/arXiv.1608.05949.
695.
696.
Kimothi, D., Shukla, A., Biyani, P., Anand, S. & Hogan, J. M. Metric learning on biological sequence embeddings. in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) 1–5 (2017). doi:10.1109/spawc.2017.8227769.
697.
Song, B. et al. Pretraining model for biological sequence data. Briefings in Functional Genomics 20, 181–195 (2021).
698.
Wang, H., Wu, H., He, Z., Huang, L. & Ward Church, K. Progress in Machine Translation. Engineering (2021) doi:10.1016/j.eng.2021.03.023.
699.
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL (2019) doi:10.48550/arXiv.1810.04805.
700.
Brown, T. et al. Language Models are Few-Shot Learners. in Advances in Neural Information Processing Systems vol. 33 1877–1901 (Curran Associates, Inc., 2020).
701.
Madani, A. et al. ProGen: Language Modeling for Protein Generation. bioRxiv (2020) doi:10.1101/2020.03.07.982272.
702.
Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: Exploring the Boundaries of Protein Language Models. arXiv (2022) doi:10.48550/arXiv.2206.13517.
703.
Bepler, T. & Berger, B. Learning the protein language: Evolution, structure, and function. Cell Systems 12, 654–669.e3 (2021).
704.
Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. bioRxiv 2020.12.15.422761 (2020) doi:10.1101/2020.12.15.422761.
705.
706.
Bhattacharya, N. et al. Single Layers of Attention Suffice to Predict Protein Contacts. bioRxiv 2020.12.21.423882 (2020) doi:10.1101/2020.12.21.423882.
707.
Hu, M. et al. Exploring evolution-based & -free protein language models as protein function predictors. (2022) doi:10.48550/arXiv.2206.06583.
708.
709.
Hie, B., Yang, K. K. & Kim, S. K. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Systems 13, 274–285.e6 (2022).
710.
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
711.
Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful zero-shot predictors of non-coding variant effects. bioRxiv 2022.08.22.504706 (2022) doi:10.1101/2022.08.22.504706.
712.
Cai, T. et al. Genome-wide Prediction of Small Molecule Binding to Remote Orphan Proteins Using Distilled Sequence Alignment Embedding. bioRxiv 2020.08.04.236729 (2020) doi:10.1101/2020.08.04.236729.
713.
Rao, R. et al. MSA Transformer. bioRxiv (2021) doi:10.1101/2021.02.12.430858.
714.
Sercu, T. et al. Neural Potts Model. bioRxiv 2021.04.08.439084 (2021) doi:10.1101/2021.04.08.439084.
715.
Sturmfels, P., Vig, J., Madani, A. & Rajani, N. F. Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models. (2020) doi:10.48550/arXiv.2012.00195.
716.
Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 1–7 (2022) doi:10.1038/s41587-022-01435-7.
717.
Ourmazd, A., Moffat, K. & Lattman, E. E. Structural biology is solved — now what? Nat Methods 19, 24–26 (2022).
718.
Vig, J. et al. BERTology Meets Biology: Interpreting Attention in Protein Language Models. (2021) doi:10.48550/arXiv.2006.15222.
719.
Gao, M. & Skolnick, J. A novel sequence alignment algorithm based on deep learning of the protein folding code. Bioinformatics 37, 490–496 (2021).
720.
Morton, J. T. et al. Protein Structural Alignments From Sequence. bioRxiv 2020.11.03.365932 (2020) doi:10.1101/2020.11.03.365932.
721.
Berman, H., Henrick, K., Nakamura, H. & Markley, J. L. The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Research 35, D301–D303 (2007).
722.
723.
Llinares-López, F., Berthet, Q., Blondel, M., Teboul, O. & Vert, J.-P. Deep embedding and alignment of protein sequences. bioRxiv 2021.11.15.468653 (2022) doi:10.1101/2021.11.15.468653.
724.
725.
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419 (2021).
726.
Petti, S. et al. End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman. bioRxiv 2021.10.23.465204 (2022) doi:10.1101/2021.10.23.465204.
727.
Dotan, E. et al. Harnessing machine translation methods for sequence alignment. bioRxiv 2022.07.22.501063 (2022) doi:10.1101/2022.07.22.501063.
728.
Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: Self-Attention with Linear Complexity. (2020) doi:10.48550/arXiv.2006.04768.
729.
Xiong, Y. et al. Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence 35, 14138–14148 (2021).
730.
Child, R., Gray, S., Radford, A. & Sutskever, I. Generating Long Sequences with Sparse Transformers. (2019) doi:10.48550/arXiv.1904.10509.
731.
Correia, G. M., Niculae, V. & Martins, A. F. T. Adaptively Sparse Transformers. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2174–2184 (Association for Computational Linguistics, 2019). doi:10.18653/v1/D19-1223.
732.
Sukhbaatar, S., Grave, E., Bojanowski, P. & Joulin, A. Adaptive Attention Span in Transformers. (2019) doi:10.48550/arXiv.1905.07799.
733.
Wu, Z., Liu, Z., Lin, J., Lin, Y. & Han, S. Lite Transformer with Long-Short Range Attention. (2020) doi:10.48550/arXiv.2004.11886.
734.
Kitaev, N., Kaiser, Ł. & Levskaya, A. Reformer: The Efficient Transformer. (2020) doi:10.48550/arXiv.2001.04451.
735.
Choromanski, K. et al. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. (2020) doi:10.48550/arXiv.2006.03555.
736.
Bhattacharya, N. et al. Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention. in Biocomputing 2022 34–45 (World Scientific, 2021). doi:10.1142/9789811250477_0004.
737.
Kraska, T., Beutel, A., Chi, E. H., Dean, J. & Polyzotis, N. The Case for Learned Index Structures. in Proceedings of the 2018 International Conference on Management of Data 489–504 (Association for Computing Machinery, 2018). doi:10.1145/3183713.3196909.
738.
Jung, Y. & Han, D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 38, 2404–2413 (2022).
739.
Kirsche, M., Das, A. & Schatz, M. C. Sapling: Accelerating suffix array queries with learned data models. Bioinformatics 37, 744–749 (2021).
740.
Ho, D. et al. LISA: Learned Indexes for Sequence Analysis. bioRxiv 2020.12.22.423964 (2021) doi:10.1101/2020.12.22.423964.
741.
Hoang, M., Zheng, H. & Kingsford, C. Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer. Journal of Computational Biology (2022) doi:10.1089/cmb.2022.0275.
742.
Min, S., Lee, B. & Yoon, S. TargetNet: Functional microRNA target prediction with deep neural networks. Bioinformatics 38, 671–677 (2022).
743.
Bzikadze, A. V. & Pevzner, P. A. Automated assembly of centromeres from ultra-long error-prone reads. Nat Biotechnol 38, 1309–1316 (2020).
744.
745.
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
746.
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272 (2020).
747.
Seabold, S. & Perktold, J. Statsmodels: Econometric and Statistical Modeling with Python. in Proceedings of the 9th Python in Science Conference (eds. van der Walt, S. & Millman, J.) 92–96 (2010). doi:10.25080/Majora-92bf1922-011.
748.
Vinh, N. X. & Epps, J. A Novel Approach for Automatic Number of Clusters Detection in Microarray Data Based on Consensus Clustering. in 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering 84–91 (2009). doi:10.1109/bibe.2009.19.
749.
Harremoes, P. Mutual information of contingency tables and related inequalities. in 2014 IEEE International Symposium on Information Theory 2474–2478 (IEEE, 2014). doi:10.1109/isit.2014.6875279.