General Introduction

This manuscript is the result of my years at the Institut Pasteur, where I built upon work initiated during an internship in 2018. During my time at the Institut Pasteur I have worked on two very distinct subjects:

  1. The study of drug resistance mutations in HIV sequences with machine learning.
  2. The study of sequence transformation functions to improve long-read mapping.

These two subjects, though distinct, do share some common characteristics: mainly that they are based on sequence data and specifically alignments. Although the research on drug resistance in HIV was conducted before that on long-read mapping, I have forgone the chronological ordering of my work in this manuscript for the sake of thematic coherence. Through the organization of this manuscript, I have tried to link all the facets of my PhD work, and it is my hope that readers will be able to follow the flow without too much jumping around.

This manuscript is organized into seven chapters, listed as follows:

  1. An introduction to biological sequence data, how it is obtained, and the specific characteristics and problems inherent to long reads.
  2. An introduction to sequence alignment, and how and why read-mapping is performed.
  3. A presentation of my work on sequence transformation functions to improve long-read mapping, which was written as a standalone research article.
  4. An introduction to machine learning on biological sequence data, with a focus on techniques used later in the manuscript.
  5. An introduction to viruses and HIV in particular, with a focus on proteins important to drug resistance.
  6. A presentation of my work on drug resistance in HIV, which was written and published as a standalone research article.
  7. A short introduction to deep learning in sequence alignment and perspectives on the work presented in Chapter 3.

Research output

During this thesis, my work on finding drug resistance mutations with machine learning resulted in two publications: a first-author article describing our method, published in PLOS Computational Biology, and a co-first-author review article published in Current Opinion in Virology.

The second half of my PhD work, on improving read-mapping, resulted in a first-author paper presented at the RECOMB-SEQ 2022 conference and to be published in the iScience proceedings of that conference.

In 2020, during the early stages of the COVID-19 pandemic and the lockdowns, I contributed to work that resulted in the COVID-Align web service and middle authorship of the corresponding Bioinformatics publication. This work also led to middle authorship of an article on the origins of SARS-CoV-2 in Comptes Rendus. Biologies, a journal of the French Academy of Sciences.

Journal publications

This list contains the formal references of the publications mentioned above, along with my contribution represented using the CRediT taxonomy.

  • Blassel, Luc, Paul Medvedev, and Rayan Chikhi. 2022. “Mapping-friendly sequence reductions: going beyond homopolymer compression.” In press as part of the RECOMB-SEQ 2022 proceedings in iScience. (Adapted as Chapter 3)
    Contributions: Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

  • Blassel, Luc¹, Anna Zhukova¹, Christian J Villabona-Arenas, Katherine E Atkins, Stéphane Hué, and Olivier Gascuel. 2021. “Drug Resistance Mutations in HIV: New Bioinformatics Approaches and Challenges.” Current Opinion in Virology 51 (December): 56–64. 10.1016/j.coviro.2021.09.009. (Used as the basis for Section 5.3.4)
    Contributions: Visualization, Writing – original draft, Writing – review & editing.

  • Blassel, Luc, Anna Tostevin, Christian Julian Villabona-Arenas, Martine Peeters, Stéphane Hué, and Olivier Gascuel. 2021. “Using Machine Learning and Big Data to Explore the Drug Resistance Landscape in HIV.” PLOS Computational Biology 17 (8): e1008873. 10.1371/journal.pcbi.1008873. (Adapted as Chapter 6)
    Contributions: Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

  • Zhukova, Anna, Luc Blassel, Frédéric Lemoine, Marie Morel, Jakub Voznica, and Olivier Gascuel. 2021. “Origin, Evolution and Global Spread of SARS-CoV-2.” Comptes Rendus. Biologies 344 (1): 57–75. 10.5802/crbiol.29.
    Contributions: Writing – review & editing.

  • Lemoine, Frédéric, Luc Blassel, Jakub Voznica, and Olivier Gascuel. 2020. “COVID-Align: accurate online alignment of hCoV-19 genomes using a profile HMM.” Bioinformatics 37 (12): 1761–1762. 10.1093/bioinformatics/btaa871.
    Contributions: Software, Writing – review & editing.

Presentations and posters

  • “Mapping-friendly sequence reductions: going beyond homopolymer compression.” Proceedings talk, RECOMB-SEQ 2022. San Diego, USA (May 21st 2022).

  • “Can we improve analyses by transforming DNA?” Joint RECOMB-SEQ RECOMB-CCB scientific communication session². San Diego, USA (May 21st 2022).

  • “Machine learning approaches to reveal resistance mutations in HIV.” Poster at MCEB 2019. Porquerolles, France (May 29th 2019).




  1. Co-first authors: Luc Blassel and Anna Zhukova.

  2. 2nd place prize awarded.