## About Me

I am currently a post-doc in Bioinformatics working with Ronny Lorenz in the TBI research group at the University of Vienna.
Our main goal is to predict RNA structure from sequence at different levels (secondary, pseudoknotted, tertiary) while integrating different sources of probing data.
I recently (Dec 2021) defended my Ph.D. thesis in Computer Science, "Local Decomposition in RNA Structural Design", co-supervised by Yann Ponty (École Polytechnique, France) and Jérôme Waldispühl (McGill University, Canada).
The work mainly focused on the **RNA design problem** from different angles, notably combinatorial and algorithmic.
Since a secondary structure can be seen as a combinatorial object, my research interests focus on the impact of undesignable motifs on the phenotype space.
In addition, as part of the Infrared project, I developed a negative design tool by means of a positive design strategy.
I also work on 2.5D RNA module identification in collaboration with Roman Sarrazin-Gendron.

- RNA Bioinformatics
- Analytic Combinatorics
- Algorithmics
- Graph Theory

## Research

- Counting Design
- BayesPairing2
- RNAPOND
- Infrared
- Exact Design

## Publications

## 2021

- Yao, H.-T. (2021).
*Local Decomposition in RNA Structural Design* (Number 2021IPPAX126) [Thesis, École Polytechnique (Palaiseau, France); Université McGill (Montréal)]. **Abstract:** The RNA positive structural design problem seeks RNA sequences achieving low free energy for the target secondary structure. In contrast, in negative design, solution sequences should adopt the target structure as their folding preferentially to any alternative structure, according to a given metric and energy model. Inverse folding, a typical negative design problem, requires the target to be the solution sequence’s MFE folding. Other metrics, like the ensemble defect, are also considered for design evaluation.

The additivity of the energy model suggests the existence of local properties for the RNA design problem. It was discovered in several works that, due to the presence of specific local motifs, some secondary structures are undesignable, i.e., no RNA sequence can fold into the target structure while satisfying the negative design objective. The sequence sampling approach is often used in positive design. Unwanted local structures, such as base pairs, repeatedly form while folding sampled sequences toward the negative design. In this thesis, we study the impact of such local nature on the combinatorial aspect and on the development of negative design methods.

From the combinatorial aspect, we show that the proportion of designable secondary structures decreases exponentially with the target structure length. Given a negative design metric, we propose an automated pipeline to identify all undesignable motifs. Enumerating secondary structures avoiding such local obstructions, followed by asymptotic analysis, yields an upper bound on the number of designable structures. In addition, we define a lower bound for the structural ensemble defect derived from the local motifs that occur. We show that the lower bound follows a Normal limiting distribution with a closed-form expression, implying also an exponential decrease.

We then present Infrared, a generic framework for efficient combinatorial sampling. We formalize the RNA design problem as a CSP with design objectives described as a set of constraints and a set of weighted functions. Assignments satisfying constraints are generated from a Boltzmann weighted distribution using a dynamic programming algorithm followed by stochastic backtracking. The approach is FPT for the treewidth of the dependency graph induced from the problem. We show that the framework can be easily employed for RNA positive design and flexible applications.

Finally, as an application of Infrared, we propose an original iterative sampling approach that captures negative design principles, implemented in RNA POsitive and Negative Design (RNAPOND). A set of Disruptive Base Pairs (DBPs) is identified at each round and subsequently prevented from pairing by introducing proper constraints into the sampling framework. Despite the NP-hardness of the associated decision problem, an efficient sequence sampling algorithm is ensured by the Infrared framework. Our approach achieves a similar or better success rate than state-of-the-art negative design tools while allowing for the generation of diverse, thermodynamically efficient designs, i.e., positive design principles.

One of the research directions of the works presented in this thesis is the extension to more complicated structures, such as pseudoknotted secondary structures. The flexibility of the Infrared framework opens a door for design tool development. For example, the success of RNAPOND suggests a potential approach for RNA negative structural design.

- Yao, H.-T., Waldispühl, J., Ponty, Y., & Will, S. (2021, April). Taming Disruptive Base Pairs to Reconcile Positive and Negative Structural Design of RNA.
*RECOMB 2021 - 25th International Conference on Research in Computational Molecular Biology*. **Abstract:** The negative structural design of RNAs, also called inverse folding, consists in building a synthetic nucleotide sequence adopting a targeted secondary structure as its Minimum Free Energy (MFE) structure. Computationally an NP-hard problem, it is mostly addressed as an optimization task and solved using (meta-)heuristics. Existing methods are frequently challenged by demanding instances, and typically produce a single design, hindering practical applications of design, where multiple candidates are desirable to circumvent the idealized nature of design models. In this work, we introduce RNA POsitive and Negative Design (RNAPOND), a sampling approach which generates design candidates exactly from a well-defined distribution influenced by positive design objectives, including affinity towards the target and GC-content. Negative design principles are captured by an original iterative approach, where a subset of Disruptive Base Pairs (DBPs) are identified at each step, and subsequently forbidden from pairing by the introduction of suitable constraints. Despite the NP-hardness of the associated decision problem, we propose a combinatorial sampling algorithm which is Fixed Parameter Tractable (FPT) for the tree-width of the constraint network. Our algorithm, coupled with a suitable rejection step and an automated inference of DBPs, achieves a similar or better level of success in comparison to the state of the art, while allowing for the generation of diverse designs. Interestingly, it also automatically recovers some of the strategies used by practitioners of RNA design. RNAPOND is an open source project, available at: https://gitlab.inria.fr/amibio/RNAPOND

## 2020

- Sarrazin-Gendron, R., Yao, H.-T., Reinharz, V., Oliver, C. G., Ponty, Y., & Waldispühl, J. (2020, May). Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Modules Identification.
*RECOMB 2020 - 24th Annual International Conference on Research in Computational Molecular Biology*. **Abstract:** RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are still too limited to benefit from all the knowledge accumulated in module databases. Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition, which reduces the computational complexity and improves predictions on new sequences. We benchmarked our methods on 75 modules and 6360 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.

- Ponty, Y., Hammer, S., Yao, H.-T., & Will, S. (2020). Advanced design of structural RNAs using RNARedPrint [Book Chapter]. In E. Picardi (Ed.),
*RNA Bioinformatics*. **Abstract:** RNA design addresses the need to build novel RNAs, e.g. for biotechnological applications in synthetic biology, equipped with desired functional properties. This chapter describes how to use the software RNARedPrint for the de novo rational design of RNA sequences adopting one or several desired secondary structures. Depending on the application, these structures could represent alternate configurations or kinetic pathways. The software makes such design convenient and sufficiently fast for practical routine, where it even overcomes notorious problems in the application of RNA design, e.g. it maintains realistic GC content.

## 2019

- Yao, H.-T., Chauve, C., Regnier, M., & Ponty, Y. (2019). Exponentially few RNA structures are designable [Conference].
*ACM-BCB 2019 - 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics*, 289–298. **Abstract:** The problem of RNA design attempts to construct RNA sequences that perform a predefined biological function, identified by several additional constraints. One of the foremost objectives of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metric and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed, and jointly analyzed to yield asymptotic upper bounds on the number of designable structures.

## 2018

- Feijao, P., Yao, H.-T., Fornika, D., Gardy, J., Hsiao, W., Chauve, C., & Chindelevitch, L. (2018). MentaLiST – A fast MLST caller for large MLST schemes [Journal].
*Microbial Genomics*, *4*(2). **Abstract:** MLST (multi-locus sequence typing) is a classic technique for genotyping bacteria, widely applied for pathogen outbreak surveillance. Traditionally, MLST is based on identifying sequence types from a small number of housekeeping genes. With the increasing availability of whole-genome sequencing data, MLST methods have evolved towards larger typing schemes, based on a few hundred genes [core genome MLST (cgMLST)] to a few thousand genes [whole genome MLST (wgMLST)]. Such large-scale MLST schemes have been shown to provide a finer resolution and are increasingly used in various contexts such as hospital outbreaks or foodborne pathogen outbreaks. This methodological shift raises new computational challenges, especially given the large size of the schemes involved. Very few available MLST callers are currently capable of dealing with large MLST schemes. We introduce MentaLiST, a new MLST caller, based on a k-mer voting algorithm and written in the Julia language, specifically designed and implemented to handle large typing schemes. We test it on real and simulated data to show that MentaLiST is faster than any other available MLST caller while providing the same or better accuracy, and is capable of dealing with MLST schemes with up to thousands of genes while requiring limited computational resources. MentaLiST source code and easy installation instructions using a Conda package are available at https://github.com/WGS-TB/MentaLiST.

## Trajectory

#### University of Vienna · Austria

#### École Polytechnique (IPP Paris) · France

#### McGill University · Canada

*Local Decomposition in RNA Structural Design* (defense version)

Co-supervised by Yann Ponty and Jérôme Waldispühl

#### CEA Grenoble · France

#### Simon Fraser University · Canada

*Model-based clustering on Tuberculosis strains during an outbreak*

Co-supervised by Cedric Chauve and Leonid Chindelevitch