About Me
Currently a postdoc in Bioinformatics working with Ronny Lorenz in the research group TBI, University of Vienna. Our main goal is to predict RNA structure from the sequence at different level (secondary, pseudoknotted, tertiary) while integrating different sources of probing data. I defended in Dec 2021 Ph.D. thesis in Computer Science "Local Decomposition in RNA Structural Design" cosupervised by Yann Ponty (École Polytechnique, France) and Jérôme Waldispühl (McGill University, Canada). The work was mainly focused on RNA Design problem from different angles, notably combinatorial and algorithmic. Since secondary structure can be seen as a combinatorial object, my research interests focus on the impact of undesignable motifs on phenotype space. In addition, being part of Infrared project, I developed a negative design tool by mean of positive design strategy. I also work on 2.5D RNA module identification in collaboration with Roman SarrazinGendron.
 RNA Bioinfomatics
 Analytic Combinatorics
 Algorithmics
 Graph Theory
Publications
2024

Sidl, L., Faissner, M., Uhlir, M., VelandiaHuerto, C. A., Waldl, M., Yao, H.T., Hofacker, I. L., & Stadler, P. F. (2024). A Relation between Natural Selection and Computational Complexity [Preprint].Abstract :
Biological systems are widely regarded as performing computations. It is much less clear, however, what exactly is computed and how biological computation fits within the framework of standard computer science. Here we investigate a possible connection, focussing on the notion that biological systems may utilize fixed algorithms that imply subsets of solvable problem instances. In simple simulation experiments we show that differences in solvable instance sets may have an impact on evolution.

Boury, T., Sidl, L., Hofacker, I. L., Ponty, Y., & Yao, H.T. (2024). Old dog, new tricks: Exact seeding strategy improves RNA design performances [Preprint].Abstract :
The Inverse Folding problem involves identifying RNA sequences that adopt a target structure with respect to freeenergy minimization, i.e. preferential to all alternative structures. The problem has historically been regarded as challenging, largely due to its proven NPcompleteness of an extended version where the base pair maximization energy model is used. In contrast, it has recently been shown that a large subset called mseparable structures, notably including those comprising helices of length 3+, can be solved in lineartime within the same energy model. This permits not only the identification of a single solution, but also the characterization of a language of solutions.
In this work, we seek to describe the “hardness” of Inverse Folding, bridging (at least heuristically) the gap between a simplified energy model and a more realistic Turner energy model. We used LinearBPDesign to generate seed sequences for RNAinverse, thereby improving the design process in a Turner energy model. To this end, we extended LinearBPDesign to accommodate biseparability and to handle non or high modulo separable structures by minimalist addition of base pairs.
Our study suggests that seeds generated by LinearBPDesign capture longrange interactions, thereby improving the performance of RNAinverse compared to seed focusing on refining the energy model itself. Most surprisingly, a significant number of LinearBPDesign seeds uniquely fold into the target structure in the Turner model, especially when helices are at least of length 2. This observation suggests that the “hardness” of design may arise from the intrinsic properties of the structures themselves.

von Löhneysen, S., Spicher, T., Varenyk, Y., Yao, H.T., Lorenz, R., Hofacker, I., & Stadler, P. F. (2024). Phylogenetic and Chemical Probing Information as Soft Constraints in RNA Secondary Structure Prediction [Journal]. Journal of Computational Biology, 31(6), 549–563.Abstract :
Extrinsic, experimental information can be incorporated into thermodynamicsbased RNA folding algorithms in the form of pseudoenergies. Evolutionary conservation of RNA secondary structure elements is detectable in alignments of phylogenetically related sequences and provides evidence for the presence of certain base pairs that can also be converted into pseudoenergy contributions. We show that the centroid base pairs computed from a consensus folding model such as RNAalifold result in a substantial improvement of the prediction accuracy for single sequences. Evidence for specific base pairs turns out to be more informative than a positionwise profile for the conservation of the pairing status. A comparison with chemical probing data, furthermore, strongly suggests that phylogenetic base pairing data are more informative than positionspecific data on (un)pairedness as obtained from chemical probing experiments. In this context we demonstrate, in addition, that the conversion of signal from probing data into pseudoenergies is possible using thermodynamic structure predictions as a reference instead of known RNA structures.

Waldl, M., Yao, H.T., & Hofacker, I. (2024). Sequence design for RNARNA interactions [Book Chapter].Abstract :
The design of RNA sequences with desired structural properties presents a challenging computational problem with promising applications in biotechnology and biomedicine. Most regulatory RNAs function by forming RNARNA interactions, e.g., in order to regulate mRNA expression. It is therefore natural to consider problems where a sequence is designed to form a desired RNARNA interaction and switch between structures upon binding. This contribution demonstrates the use of the Infrared framework to design interacting sequences. Specifically, we consider the regulation of the rpoS mRNA by the sRNA DsrA and design artificial 5’UTRs that place a downstream protein coding gene under control of DsrA. The design process is explained stepbystep in a Jupyter notebook, accompanied by Python code. The text discusses setting up design constraints for sampling sequences in Infrared, computing quality measures, constructing a suitable cost function, as well as the optimization procedure. We show that not only thermodynamic, but also kinetic folding features can be relevant. Kinetics of interaction formation can be estimated efficiently using the RRIkinDP tool, and the chapter explains how to include kinetic folding features from RRIkinDP directly in the cost function. The protocol implemented in our Jupyter notebook can easily be extended to consider additional requirements or adapted to novel design scenarios.

Yao, H.T., Ponty, Y., & Will, S. (2024). Developing Complex RNA Design Applications in the Infrared Framework [Book Chapter]. In R. Lorenz (Ed.), RNA Folding: Methods and Protocols (pp. 285–313). Springer US.Abstract :
Applications in biotechnology and biomedical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational design tools but at the same time put high demands on their flexibility and expressivity to model the applicationsspecific requirements. To address such demands, we present the computational framework Infrared. It supports developing advanced customized design tools, which generate RNA sequences with specific properties, often in a few lines of Python code. This text guides the reader in tutorialformat through the development of complex design applications. Thanks to the declarative, compositional approach of Infrared, we can describe this development as stepbystep extension of an elementary design task. Thus, we start with generating sequences that are compatible with a single RNA structure and go all the way to RNA design targeting complex positive and negative design objectives with respect to single or even multiple target structures. Finally, we present a ’realworld’ application of computational RNA design of a biotechnological device. We use Infrared to generate design candidates of an artificial ANDriboswitch, which could activate gene expression (only) in the simultaneous presence of two different metabolites.

Yao, H.T., Marchand, B., Berkemer, S. J., Ponty, Y., & Will, S. (2024). Infrared: a declarative tree decompositionpowered framework for bioinformatics [Journal]. Algorithms for Molecular Biology.Abstract :
Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.
Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its under‐ lying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency.
Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on a tree decompositionbased efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequencestructure alignment, parsimonydriven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework—together with our novel results—underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to the ones of specialized algorithms and implementations.
Availability: Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared/ with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
2023

Yao, H.T., Lorenz, R., Hofacker, I. L., & Stadler, P. F. (2023). Monovalent salt corrections for RNA secondary structures in the ViennaRNA package [Journal]. Algorithms for Molecular Biology, 18, 8.Abstract :
Background: RNA features a highly negatively charged phosphate backbone that attracts a of cloud counterions that reduce the electrostatic repulsion in a concentration dependent manner. Ion concentrations thus have a large influence on folding and stability of RNA structures. Despite their welldocumented effects, salt effects are not handled by currently available secondary stucture prediction algorithms. Combining DebyeHückel potentials for line charges and Manning’s counterion condensation theory, Einert et al. [Biophys. J. 100: 27452753 (2011)] modeled the energetic effects contributions monovalent cations on loops and helices.
Results: The model of Einert et al. is adapted to match the structure of the dynamic programming recursion of RNA secondary structure prediction algorithms. An empirical term describing the dependence salt dependence of the duplex initiation energy is added to improve cofolding predictions for two or more RNA strands. The slightly modified model is implemented in the ViennaRNA package in such way that only the energy parameters but not the algorithmic structure is affected. A comparison with data from the literature show that predicted free energies and melting temperatures are in reasonable agreement with experiments. Conclusion: The new feature in the ViennaRNA package makes it possible to study effects of salt concentrations on RNA folding in a systematic manner. Strictly speaking, the model pertains only to monovalent cations, and thus covers the most important parameter, i.e., the NaCl concentration. It remains a question for future research to what extent unspecific effects of biand trivalent cations can be approximated in a similar manner.
Availability: Corrections for the concentration of monovalent cations are available in the ViennaRNA package starting from version 2.6.0.

von Löhneysen, S., Spicher, T., Varenyk, Y., Yao, H.T., Lorenz, R., Hofacker, I., & Stadler, P. F. (2023). Phylogenetic Information as Soft Constraints in RNA Secondary Structure Prediction [Conference]. In Bioinformatics Research and Applications (pp. 267–279). Springer Nature Singapore.Abstract :
Pseudoenergies are a generic method to incorporate extrinsic information into energydirected RNA secondary structure predictions. Consensus structures of RNA families, usually predicted from multiple sequence alignments, can be treated as soft constraints in this manner. In this contribution we first revisit the theoretical framework and then show that pseudoenergies for the centroid base pairs of the consensus structure result in a substantial increase in folding accuracy. In contrast, only a moderate improvement can be achieved if only the information that a base is predominantly paired is utilized.
2021

Yao, H.T. (2021). Local Decomposition in RNA Structural Design (Number 2021IPPAX126) [Thesis, Ecole Polytechnique (Palaiseau, France) ; Université McGill [Montréal].Abstract :
RNA positive structural design problem attempts to find RNA sequences achieving low free energy of the target secondary structure. Differently, in the negative design, solution sequences should adopt the target structure as its folding preferentially to any alternative structure, according to the given metric and energy model. Inverse folding, a typical negative design, requires the target to be the solution sequence’s MFE folding. Other metrics, like the ensemble defect, are also considered for design evaluation.
The additivity of the energy model suggests the existence of local properties for the RNA design problem. It was discovered in several works that, due to the presence of specific local motifs, some secondary structures are undesignable, i.e., no RNA sequence can fold into the target structure while satisfying the negative design objective. The sequence sampling approach is often used in the positive design. Unwanted local structures, like base pairs, repeatedly form while folding sampled sequences toward the negative design. In this thesis, we study the impact of such local nature on the combinatorial aspect and on the development of negative design methods.
We show that the proportion of designable secondary structures decreases exponentially with the target structure length from the combinatorial aspect. Given a negative design metric, we propose an automated pipeline to identify all undesignable motifs. Enumerating secondary structures avoiding such local obstructions followed by asymptotic analysis yields an upperbounds on the number of designable structures. In addition, we define a lower bound for the structural ensemble defect derived from occurred local motifs. We show that the lower bound follows a Normal limiting distribution with a closedform expression, implying also an exponential decrease.
We then present Infrared, a generic framework for efficient combinatorial sampling. We formalize the RNA design problem as a CSP with design objectives described as a set of constraints and a set of weighted functions. Assignments satisfying constraints are generated from a Boltzmann weighted distribution using a dynamic programming algorithm followed by stochastic backtracking. The approach is FPT for the treewidth of the dependency graph induced from the problem. We show that the framework can be easily employed for RNA positive design and flexible applications.
Finally, as an application of Infrared, we propose an original iterative sampling approach that captures negative design principles implemented in RNAPOsitive and Negative Design (RNAPOND). A set of DBPs is identified at each round and subsequently prevented from pairing by introducing proper constraints into the sampling framework. Despite the NPhardness of the associated decision problem, an efficient sequence sampling algorithm is ensured by the Infrared framework. Our approach achieves a similar or better success rate than stateoftheart negative design tools while allowing for the generation of diverse, thermodynamically efficient designs, i.e., positive design principles.
One of the research directions of the works presented in this thesis is the extension to more complicated structures, such as pseudoknotted secondary structures. The flexibility of the Infrared framework opens a door for design tool development. For example, the success of RNAPOND suggests a potential approach for RNA negative structural design.

Yao, H.T., Waldispühl, J., Ponty, Y., & Will, S. (2021, April). Taming Disruptive Base Pairs to Reconcile Positive and Negative Structural Design of RNA. RECOMB 2021  25th International Conference on Research in Computational Molecular Biology.Abstract :
The negative structural design of RNAs, also called Inverse folding, consists in building a synthetic nucleotides sequence adopting a targeted secondary structure as its Minimum Free Energy (MFE) structure. Computationally an NP hard problem, it is mostly addressed as an optimization task and solved using (meta)heuristics. Existing methods are frequently challenged by demanding instances, and typically produce a single design, hindering practical applications of design, where multiple candidates are desirable to circumvent the idealized nature of design models. In this work, we introduce RNA POsitive and Negative Design (RNAPOND), a sampling approach which generates design candidates exactly from a welldefined distribution influenced by positive design objectives, including affinity towards the target and GCcontent. Negative design principles are captured by an original iterative approach, where a subset of Disruptive Base Pairs (DPBs) are identified at each step, and subsequently forbidden from pairing by the introduction of suitable constraints. Despite the NPhardness of the associated decision problem, we propose a combinatorial sampling algorithm which is Fixed Parameter Tractable (FPT) for the treewidth of the constraint network. Our algorithm, coupled with a suitable rejection step and an automated inference of DPBs, achieves a similar or better level of success in comparison to the state of the art, while allowing for the generation of diverse designs. Interestingly, it also automatically recovers some of the strategies used by practitioners of RNA design. RNAPOND is an open source project, available at: https://gitlab.inria.fr/amibio/RNAPOND
2020

SarrazinGendron, R., Yao, H.T., Reinharz, V., Oliver, C. G., Ponty, Y., & Waldispühl, J. (2020, May). Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Modules Identification. RECOMB 2020  24th Annual International Conference on Research in Computational Molecular Biology.Abstract :
RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. WatsonCrick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of noncanonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are too limited yet to benefit from all the knowledge accumulated in modules databases. Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition which allows to reduce the computational complexity and improve predictions on new sequences. We benchmarked our methods on 75 modules and 6360 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genomewide applications.

Ponty, Y., Hammer, S., Yao, H.T., & Will, S. (2020). Advanced design of structural RNAs using RNARedPrint [Book Chapter]. In E. Picardi (Ed.), RNA Bioinformatics.Abstract :
RNA design addresses the need to build novel RNAs, e.g. for biotechnological applications in synthetic biology, equipped with desired functional properties. This chapter describes how to use the software RNARedPrint for the de novo rational design of RNA sequences adopting one or several desired secondary structures. Depending on the application , these structures could represent alternate configurations or kinetic pathways. The software makes such design convenient and sufficiently fast for practical routine, where it even overcomes notorious problems in the application of RNA design, e.g. it maintains realistic GC content.
2019

Yao, H.T., Chauve, C., Regnier, M., & Ponty, Y. (2019). Exponentially few RNA structures are designable [Conference]. ACMBCB 2019  10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 289–298.Abstract :
The problem of RNA design attempts to construct RNA sequences that performs a predefined biological function, identified by several additional constraints. One of the foremost objective of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metrics and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed, and jointly analyzed to yield asymptotic upperbounds on the number of designable structures.
2018

Feijao, P., Yao, H.T., Fornika, D., Gardy, J., Hsiao, W., Chauve, C., & Chindelevitch, L. (2018). MentaLiST – A fast MLST caller for large MLST schemes [Journal]. Microbial Genomics, 4(2).Abstract :
MLST (multilocus sequence typing) is a classic technique for genotyping bacteria, widely applied for pathogen outbreak surveillance. Traditionally, MLST is based on identifying sequence types from a small number of housekeeping genes. With the increasing availability of wholegenome sequencing data, MLST methods have evolved towards larger typing schemes, based on a few hundred genes [core genome MLST (cgMLST)] to a few thousand genes [whole genome MLST (wgMLST)]. Such largescale MLST schemes have been shown to provide a finer resolution and are increasingly used in various contexts such as hospital outbreaks or foodborne pathogen outbreaks. This methodological shift raises new computational challenges, especially given the large size of the schemes involved. Very few available MLST callers are currently capable of dealing with large MLST schemes. We introduce MentaLiST, a new MLST caller, based on a kmer voting algorithm and written in the Julia language, specifically designed and implemented to handle large typing schemes. We test it on real and simulated data to show that MentaLiST is faster than any other available MLST caller while providing the same or better accuracy, and is capable of dealing with MLST schemes with up to thousands of genes while requiring limited computational resources. MentaLiST source code and easy installation instructions using a Conda package are available at https://github.com/WGSTB/MentaLiST.
Software
Trajectory
University of Vienna · Austria
École Polytechnique (IPP Paris) · France
McGill University · Canada
CEA Grenoble · France
supervised by Laurent Guyon
Simon Fraser University · Canada
cosupervised by Cedric Chauve and Leonid Chindelevitch