TBI @ University of Vienna
Currently a post-doc in Bioinformatics working with Ronny Lorenz in the research group TBI, University of Vienna. Our main goal is to predict RNA structure from the sequence at different level (secondary, pseudo-knotted, tertiary) while integrating different sources of probing data. I defended in Dec 2021 Ph.D. thesis in Computer Science "Local Decomposition in RNA Structural Design" co-supervised by Yann Ponty (École Polytechnique, France) and Jérôme Waldispühl (McGill University, Canada). The work was mainly focused on RNA Design problem from different angles, notably combinatorial and algorithmic. Since secondary structure can be seen as a combinatorial object, my research interests focus on the impact of undesignable motifs on phenotype space. In addition, being part of Infrared project, I developed a negative design tool by mean of positive design strategy. I also work on 2.5D RNA module identification in collaboration with Roman Sarrazin-Gendron.
- RNA Bioinfomatics
- Analytic Combinatorics
- Graph Theory
Yao, H.-T., Marchand, B., Berkemer, S. J., Ponty, Y., & Will, S. (2023). Infrared : a declarative tree decomposition-powered framework for bioinformatics [Preprint].Abstract :
Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.
Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential on the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency.
Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on a tree decomposition-based efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations (and extensions) for methods targeting RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and coding sequence design demonstrate multidimensional Boltzmann sampling. Previous work together with novel results demonstrate the practical relevance of the framework, whose complexity is typically equivalent or better than specialized algorithms and implementations.
Infrared is available at https://www.lix.polytechnique.fr/%7Ewill/Software/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
Yao, H.-T., Lorenz, R., Hofacker, I. L., & Stadler, P. F. (2023). Mono-valent salt corrections for RNA secondary structures in the ViennaRNA package [Journal]. Algorithms for Molecular Biology, 18, 8.Abstract :
Background: RNA features a highly negatively charged phosphate backbone that attracts a of cloud counter-ions that reduce the electrostatic repulsion in a concentration dependent manner. Ion concentrations thus have a large influence on folding and stability of RNA structures. Despite their well-documented effects, salt effects are not handled by currently available secondary stucture prediction algorithms. Combining Debye-Hückel potentials for line charges and Manning’s counter-ion condensation theory, Einert et al. [Biophys. J. 100: 2745-2753 (2011)] modeled the energetic effects contributions monovalent cations on loops and helices.
Results: The model of Einert et al. is adapted to match the structure of the dynamic programming recursion of RNA secondary structure prediction algorithms. An empirical term describing the dependence salt dependence of the duplex initiation energy is added to improve co-folding predictions for two or more RNA strands. The slightly modified model is implemented in the ViennaRNA package in such way that only the energy parameters but not the algorithmic structure is affected. A comparison with data from the literature show that predicted free energies and melting temperatures are in reasonable agreement with experiments. Conclusion: The new feature in the ViennaRNA package makes it possible to study effects of salt concentrations on RNA folding in a systematic manner. Strictly speaking, the model pertains only to mono-valent cations, and thus covers the most important parameter, i.e., the NaCl concentration. It remains a question for future research to what extent unspecific effects of bi-and tri-valent cations can be approximated in a similar manner.
Availability: Corrections for the concentration of monovalent cations are available in the ViennaRNA package starting from version 2.6.0.
von Löhneysen, S., Spicher, T., Varenyk, Y., Yao, H.-T., Lorenz, R., Hofacker, I., & Stadler, P. F. (2023). Phylogenetic Information as Soft Constraints in RNA Secondary Structure Prediction [Conference]. In Bioinformatics Research and Applications (pp. 267–279). Springer Nature Singapore.Abstract :
Pseudo-energies are a generic method to incorporate extrinsic information into energy-directed RNA secondary structure predictions. Consensus structures of RNA families, usually predicted from multiple sequence alignments, can be treated as soft constraints in this manner. In this contribution we first revisit the theoretical framework and then show that pseudo-energies for the centroid base pairs of the consensus structure result in a substantial increase in folding accuracy. In contrast, only a moderate improvement can be achieved if only the information that a base is predominantly paired is utilized.
Yao, H.-T., Ponty, Y., & Will, S. (2022). Developing complex RNA design applications in the Infrared framework [Book Chapter]. In RNA Folding - Methods and Protocols.Abstract :
Applications in biotechnology and bio-medical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational design tools but at the same time put high demands on their flexibility and expressivity to model the applications-specific requirements. To address such demands, we present the computational framework Infrared. It supports developing advanced customized design tools, which generate RNA sequences with specific properties, often in a few lines of Python code. This text guides the reader in tutorial-format through the development of complex design applications. Thanks to the declarative, compositional approach of Infrared, we can describe this development as step-by-step extension of an elementary design task. Thus, we start with generating sequences that are compatible with a single RNA structure and go all the way to RNA design targeting complex positive and negative design objectives with respect to single or even multiple target structures. Finally, we present a ’real-world’ application of computational RNA design of a biotechnological device. We use Infrared to generate design candidates of an artificial AND-riboswitch, which could activate gene expression (only) in the simultaneous presence of two different metabolites.
Yao, H.-T. (2021). Local Decomposition in RNA Structural Design (Number 2021IPPAX126) [Thesis, Ecole Polytechnique (Palaiseau, France) ; Université McGill [Montréal].Abstract :
RNA positive structural design problem attempts to find RNA sequences achieving low free energy of the target secondary structure. Differently, in the negative design, solution sequences should adopt the target structure as its folding preferentially to any alternative structure, according to the given metric and energy model. Inverse folding, a typical negative design, requires the target to be the solution sequence’s MFE folding. Other metrics, like the ensemble defect, are also considered for design evaluation.
The additivity of the energy model suggests the existence of local properties for the RNA design problem. It was discovered in several works that, due to the presence of specific local motifs, some secondary structures are undesignable, i.e., no RNA sequence can fold into the target structure while satisfying the negative design objective. The sequence sampling approach is often used in the positive design. Unwanted local structures, like base pairs, repeatedly form while folding sampled sequences toward the negative design. In this thesis, we study the impact of such local nature on the combinatorial aspect and on the development of negative design methods.
We show that the proportion of designable secondary structures decreases exponentially with the target structure length from the combinatorial aspect. Given a negative design metric, we propose an automated pipeline to identify all undesignable motifs. Enumerating secondary structures avoiding such local obstructions followed by asymptotic analysis yields an upper-bounds on the number of designable structures. In addition, we define a lower bound for the structural ensemble defect derived from occurred local motifs. We show that the lower bound follows a Normal limiting distribution with a closed-form expression, implying also an exponential decrease.
We then present Infrared, a generic framework for efficient combinatorial sampling. We formalize the RNA design problem as a CSP with design objectives described as a set of constraints and a set of weighted functions. Assignments satisfying constraints are generated from a Boltzmann weighted distribution using a dynamic programming algorithm followed by stochastic backtracking. The approach is FPT for the treewidth of the dependency graph induced from the problem. We show that the framework can be easily employed for RNA positive design and flexible applications.
Finally, as an application of Infrared, we propose an original iterative sampling approach that captures negative design principles implemented in RNAPOsitive and Negative Design (RNAPOND). A set of DBPs is identified at each round and subsequently prevented from pairing by introducing proper constraints into the sampling framework. Despite the NP-hardness of the associated decision problem, an efficient sequence sampling algorithm is ensured by the Infrared framework. Our approach achieves a similar or better success rate than state-of-the-art negative design tools while allowing for the generation of diverse, thermodynamically efficient designs, i.e., positive design principles.
One of the research directions of the works presented in this thesis is the extension to more complicated structures, such as pseudoknotted secondary structures. The flexibility of the Infrared framework opens a door for design tool development. For example, the success of RNAPOND suggests a potential approach for RNA negative structural design.
Yao, H.-T., Waldispühl, J., Ponty, Y., & Will, S. (2021, April). Taming Disruptive Base Pairs to Reconcile Positive and Negative Structural Design of RNA. RECOMB 2021 - 25th International Conference on Research in Computational Molecular Biology.Abstract :
The negative structural design of RNAs, also called Inverse folding, consists in building a synthetic nucleotides sequence adopting a targeted secondary structure as its Minimum Free Energy (MFE) structure. Computationally an NP hard problem, it is mostly addressed as an optimization task and solved using (meta-)heuristics. Existing methods are frequently challenged by demanding instances, and typically produce a single design, hindering practical applications of design, where multiple candidates are desirable to circumvent the idealized nature of design models. In this work, we introduce RNA POsitive and Negative Design (RNAPOND), a sampling approach which generates design candidates exactly from a well-defined distribution influenced by positive design objectives, including affinity towards the target and GC-content. Negative design principles are captured by an original iterative approach, where a subset of Disruptive Base Pairs (DPBs) are identified at each step, and subsequently forbidden from pairing by the introduction of suitable constraints. Despite the NP-hardness of the associated decision problem, we propose a combinatorial sampling algorithm which is Fixed Parameter Tractable (FPT) for the tree-width of the constraint network. Our algorithm, coupled with a suitable rejection step and an automated inference of DPBs, achieves a similar or better level of success in comparison to the state of the art, while allowing for the generation of diverse designs. Interestingly, it also automatically recovers some of the strategies used by practitioners of RNA design. RNAPOND is an open source project, available at: https://gitlab.inria.fr/amibio/RNAPOND
Sarrazin-Gendron, R., Yao, H.-T., Reinharz, V., Oliver, C. G., Ponty, Y., & Waldispühl, J. (2020, May). Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Modules Identification. RECOMB 2020 - 24th Annual International Conference on Research in Computational Molecular Biology.Abstract :
RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are too limited yet to benefit from all the knowledge accumulated in modules databases. Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition which allows to reduce the computational complexity and improve predictions on new sequences. We benchmarked our methods on 75 modules and 6360 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.
Ponty, Y., Hammer, S., Yao, H.-T., & Will, S. (2020). Advanced design of structural RNAs using RNARedPrint [Book Chapter]. In E. Picardi (Ed.), RNA Bioinformatics.Abstract :
RNA design addresses the need to build novel RNAs, e.g. for biotechnological applications in synthetic biology, equipped with desired functional properties. This chapter describes how to use the software RNARedPrint for the de novo rational design of RNA sequences adopting one or several desired secondary structures. Depending on the application , these structures could represent alternate configurations or kinetic pathways. The software makes such design convenient and sufficiently fast for practical routine, where it even overcomes notorious problems in the application of RNA design, e.g. it maintains realistic GC content.
Yao, H.-T., Chauve, C., Regnier, M., & Ponty, Y. (2019). Exponentially few RNA structures are designable [Conference]. ACM-BCB 2019 - 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 289–298.Abstract :
The problem of RNA design attempts to construct RNA sequences that performs a predefined biological function, identified by several additional constraints. One of the foremost objective of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metrics and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed, and jointly analyzed to yield asymptotic upper-bounds on the number of designable structures.
Feijao, P., Yao, H.-T., Fornika, D., Gardy, J., Hsiao, W., Chauve, C., & Chindelevitch, L. (2018). MentaLiST – A fast MLST caller for large MLST schemes [Journal]. Microbial Genomics, 4(2).Abstract :
MLST (multi-locus sequence typing) is a classic technique for genotyping bacteria, widely applied for pathogen outbreak surveillance. Traditionally, MLST is based on identifying sequence types from a small number of housekeeping genes. With the increasing availability of whole-genome sequencing data, MLST methods have evolved towards larger typing schemes, based on a few hundred genes [core genome MLST (cgMLST)] to a few thousand genes [whole genome MLST (wgMLST)]. Such large-scale MLST schemes have been shown to provide a finer resolution and are increasingly used in various contexts such as hospital outbreaks or foodborne pathogen outbreaks. This methodological shift raises new computational challenges, especially given the large size of the schemes involved. Very few available MLST callers are currently capable of dealing with large MLST schemes. We introduce MentaLiST, a new MLST caller, based on a k-mer voting algorithm and written in the Julia language, specifically designed and implemented to handle large typing schemes. We test it on real and simulated data to show that MentaLiST is faster than any other available MLST caller while providing the same or better accuracy, and is capable of dealing with MLST schemes with up to thousands of genes while requiring limited computational resources. MentaLiST source code and easy installation instructions using a Conda package are available at https://github.com/WGS-TB/MentaLiST.
University of Vienna · Austria
CEA Grenoble · France
supervised by Laurent Guyon