Hua-Ting Yao

Post-Doc

TBI @ University of Vienna


About Me

Currently a post-doc in Bioinformatics working with Ronny Lorenz in the research group TBI, University of Vienna. Our main goal is to predict RNA structure from the sequence at different level (secondary, pseudo-knotted, tertiary) while integrating different sources of probing data. I defended in Dec 2021 Ph.D. thesis in Computer Science "Local Decomposition in RNA Structural Design" co-supervised by Yann Ponty (École Polytechnique, France) and Jérôme Waldispühl (McGill University, Canada). The work was mainly focused on RNA Design problem from different angles, notably combinatorial and algorithmic. Since secondary structure can be seen as a combinatorial object, my research interests focus on the impact of undesignable motifs on phenotype space. In addition, being part of Infrared project, I developed a negative design tool by mean of positive design strategy. I also work on 2.5D RNA module identification in collaboration with Roman Sarrazin-Gendron.

My research interests include but not limited to:
  • RNA Bioinfomatics
  • Analytic Combinatorics
  • Algorithmics
  • Graph Theory

Publications

2024

  1. Sidl, L., Faissner, M., Uhlir, M., Velandia-Huerto, C. A., Waldl, M., Yao, H.-T., Hofacker, I. L., & Stadler, P. F. (2024). A Relation between Natural Selection and Computational Complexity [Preprint].

    Preprint

    Abstract :

    Biological systems are widely regarded as performing computations. It is much less clear, however, what exactly is computed and how biological computation fits within the framework of standard computer science. Here we investigate a possible connection, focussing on the notion that biological systems may utilize fixed algorithms that imply subsets of solvable problem instances. In simple simulation experiments we show that differences in solvable instance sets may have an impact on evolution.

  2. Boury, T., Sidl, L., Hofacker, I. L., Ponty, Y., & Yao, H.-T. (2024). Old dog, new tricks: Exact seeding strategy improves RNA design performances [Preprint].

    Preprint

    Abstract :

    The Inverse Folding problem involves identifying RNA sequences that adopt a target structure with respect to free-energy minimization, i.e. preferential to all alternative structures. The problem has historically been regarded as challenging, largely due to its proven NP-completeness of an extended version where the base pair maximization energy model is used. In contrast, it has recently been shown that a large subset called m-separable structures, notably including those comprising helices of length 3+, can be solved in linear-time within the same energy model. This permits not only the identification of a single solution, but also the characterization of a language of solutions.

    In this work, we seek to describe the “hardness” of Inverse Folding, bridging (at least heuristically) the gap between a simplified energy model and a more realistic Turner energy model. We used LinearBPDesign to generate seed sequences for RNAinverse, thereby improving the design process in a Turner energy model. To this end, we extended LinearBPDesign to accommodate biseparability and to handle non- or high modulo separable structures by minimalist addition of base pairs.

    Our study suggests that seeds generated by LinearBPDesign capture long-range interactions, thereby improving the performance of RNAinverse compared to seed focusing on refining the energy model itself. Most surprisingly, a significant number of LinearBPDesign seeds uniquely fold into the target structure in the Turner model, especially when helices are at least of length 2. This observation suggests that the “hardness” of design may arise from the intrinsic properties of the structures themselves.

  3. von Löhneysen, S., Spicher, T., Varenyk, Y., Yao, H.-T., Lorenz, R., Hofacker, I., & Stadler, P. F. (2024). Phylogenetic and Chemical Probing Information as Soft Constraints in RNA Secondary Structure Prediction [Journal]. Journal of Computational Biology, 31(6), 549–563.

    Journal

    Abstract :

    Extrinsic, experimental information can be incorporated into thermodynamics-based RNA folding algorithms in the form of pseudo-energies. Evolutionary conservation of RNA secondary structure elements is detectable in alignments of phylogenetically related sequences and provides evidence for the presence of certain base pairs that can also be converted into pseudo-energy contributions. We show that the centroid base pairs computed from a consensus folding model such as RNAalifold result in a substantial improvement of the prediction accuracy for single sequences. Evidence for specific base pairs turns out to be more informative than a position-wise profile for the conservation of the pairing status. A comparison with chemical probing data, furthermore, strongly suggests that phylogenetic base pairing data are more informative than position-specific data on (un)pairedness as obtained from chemical probing experiments. In this context we demonstrate, in addition, that the conversion of signal from probing data into pseudo-energies is possible using thermodynamic structure predictions as a reference instead of known RNA structures.

  4. Waldl, M., Yao, H.-T., & Hofacker, I. (2024). Sequence design for RNA-RNA interactions [Book Chapter].

    Book Chapter

    Abstract :

    The design of RNA sequences with desired structural properties presents a challenging computational problem with promising applications in biotechnology and biomedicine. Most regulatory RNAs function by forming RNA-RNA interactions, e.g., in order to regulate mRNA expression. It is therefore natural to consider problems where a sequence is designed to form a desired RNA-RNA interaction and switch between structures upon binding. This contribution demonstrates the use of the Infrared framework to design interacting sequences. Specifically, we consider the regulation of the rpoS mRNA by the sRNA DsrA and design artificial 5’UTRs that place a downstream protein coding gene under control of DsrA. The design process is explained step-by-step in a Jupyter notebook, accompanied by Python code. The text discusses setting up design constraints for sampling sequences in Infrared, computing quality measures, constructing a suitable cost function, as well as the optimization procedure. We show that not only thermodynamic, but also kinetic folding features can be relevant. Kinetics of interaction formation can be estimated efficiently using the RRIkinDP tool, and the chapter explains how to include kinetic folding features from RRIkinDP directly in the cost function. The protocol implemented in our Jupyter notebook can easily be extended to consider additional requirements or adapted to novel design scenarios.

  5. Yao, H.-T., Ponty, Y., & Will, S. (2024). Developing Complex RNA Design Applications in the Infrared Framework [Book Chapter]. In R. Lorenz (Ed.), RNA Folding: Methods and Protocols (pp. 285–313). Springer US.

    Book Chapter

    Abstract :

    Applications in biotechnology and bio-medical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational design tools but at the same time put high demands on their flexibility and expressivity to model the applications-specific requirements. To address such demands, we present the computational framework Infrared. It supports developing advanced customized design tools, which generate RNA sequences with specific properties, often in a few lines of Python code. This text guides the reader in tutorial-format through the development of complex design applications. Thanks to the declarative, compositional approach of Infrared, we can describe this development as step-by-step extension of an elementary design task. Thus, we start with generating sequences that are compatible with a single RNA structure and go all the way to RNA design targeting complex positive and negative design objectives with respect to single or even multiple target structures. Finally, we present a ’real-world’ application of computational RNA design of a biotechnological device. We use Infrared to generate design candidates of an artificial AND-riboswitch, which could activate gene expression (only) in the simultaneous presence of two different metabolites.

  6. Yao, H.-T., Marchand, B., Berkemer, S. J., Ponty, Y., & Will, S. (2024). Infrared: a declarative tree decomposition-powered framework for bioinformatics [Journal]. Algorithms for Molecular Biology.

    Journal

    Abstract :

    Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.

    Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its under‐ lying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency.

    Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on a tree decomposition-based efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework—together with our novel results—underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to the ones of specialized algorithms and implementations.

    Availability: Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared/ with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.

2023

  1. Yao, H.-T., Lorenz, R., Hofacker, I. L., & Stadler, P. F. (2023). Mono-valent salt corrections for RNA secondary structures in the ViennaRNA package [Journal]. Algorithms for Molecular Biology, 18, 8.

    Journal

    Abstract :

    Background: RNA features a highly negatively charged phosphate backbone that attracts a of cloud counter-ions that reduce the electrostatic repulsion in a concentration dependent manner. Ion concentrations thus have a large influence on folding and stability of RNA structures. Despite their well-documented effects, salt effects are not handled by currently available secondary stucture prediction algorithms. Combining Debye-Hückel potentials for line charges and Manning’s counter-ion condensation theory, Einert et al. [Biophys. J. 100: 2745-2753 (2011)] modeled the energetic effects contributions monovalent cations on loops and helices.

    Results: The model of Einert et al. is adapted to match the structure of the dynamic programming recursion of RNA secondary structure prediction algorithms. An empirical term describing the dependence salt dependence of the duplex initiation energy is added to improve co-folding predictions for two or more RNA strands. The slightly modified model is implemented in the ViennaRNA package in such way that only the energy parameters but not the algorithmic structure is affected. A comparison with data from the literature show that predicted free energies and melting temperatures are in reasonable agreement with experiments. Conclusion: The new feature in the ViennaRNA package makes it possible to study effects of salt concentrations on RNA folding in a systematic manner. Strictly speaking, the model pertains only to mono-valent cations, and thus covers the most important parameter, i.e., the NaCl concentration. It remains a question for future research to what extent unspecific effects of bi-and tri-valent cations can be approximated in a similar manner.

    Availability: Corrections for the concentration of monovalent cations are available in the ViennaRNA package starting from version 2.6.0.

  2. von Löhneysen, S., Spicher, T., Varenyk, Y., Yao, H.-T., Lorenz, R., Hofacker, I., & Stadler, P. F. (2023). Phylogenetic Information as Soft Constraints in RNA Secondary Structure Prediction [Conference]. In Bioinformatics Research and Applications (pp. 267–279). Springer Nature Singapore.

    Conference

    Abstract :

    Pseudo-energies are a generic method to incorporate extrinsic information into energy-directed RNA secondary structure predictions. Consensus structures of RNA families, usually predicted from multiple sequence alignments, can be treated as soft constraints in this manner. In this contribution we first revisit the theoretical framework and then show that pseudo-energies for the centroid base pairs of the consensus structure result in a substantial increase in folding accuracy. In contrast, only a moderate improvement can be achieved if only the information that a base is predominantly paired is utilized.

2021

  1. Yao, H.-T. (2021). Local Decomposition in RNA Structural Design (Number 2021IPPAX126) [Thesis, Ecole Polytechnique (Palaiseau, France) ; Université McGill [Montréal].

    Thesis

    Abstract :

    RNA positive structural design problem attempts to find RNA sequences achieving low free energy of the target secondary structure. Differently, in the negative design, solution sequences should adopt the target structure as its folding preferentially to any alternative structure, according to the given metric and energy model. Inverse folding, a typical negative design, requires the target to be the solution sequence’s MFE folding. Other metrics, like the ensemble defect, are also considered for design evaluation.

    The additivity of the energy model suggests the existence of local properties for the RNA design problem. It was discovered in several works that, due to the presence of specific local motifs, some secondary structures are undesignable, i.e., no RNA sequence can fold into the target structure while satisfying the negative design objective. The sequence sampling approach is often used in the positive design. Unwanted local structures, like base pairs, repeatedly form while folding sampled sequences toward the negative design. In this thesis, we study the impact of such local nature on the combinatorial aspect and on the development of negative design methods.

    We show that the proportion of designable secondary structures decreases exponentially with the target structure length from the combinatorial aspect. Given a negative design metric, we propose an automated pipeline to identify all undesignable motifs. Enumerating secondary structures avoiding such local obstructions followed by asymptotic analysis yields an upper-bounds on the number of designable structures. In addition, we define a lower bound for the structural ensemble defect derived from occurred local motifs. We show that the lower bound follows a Normal limiting distribution with a closed-form expression, implying also an exponential decrease.

    We then present Infrared, a generic framework for efficient combinatorial sampling. We formalize the RNA design problem as a CSP with design objectives described as a set of constraints and a set of weighted functions. Assignments satisfying constraints are generated from a Boltzmann weighted distribution using a dynamic programming algorithm followed by stochastic backtracking. The approach is FPT for the treewidth of the dependency graph induced from the problem. We show that the framework can be easily employed for RNA positive design and flexible applications.

    Finally, as an application of Infrared, we propose an original iterative sampling approach that captures negative design principles implemented in RNAPOsitive and Negative Design (RNAPOND). A set of DBPs is identified at each round and subsequently prevented from pairing by introducing proper constraints into the sampling framework. Despite the NP-hardness of the associated decision problem, an efficient sequence sampling algorithm is ensured by the Infrared framework. Our approach achieves a similar or better success rate than state-of-the-art negative design tools while allowing for the generation of diverse, thermodynamically efficient designs, i.e., positive design principles.

    One of the research directions of the works presented in this thesis is the extension to more complicated structures, such as pseudoknotted secondary structures. The flexibility of the Infrared framework opens a door for design tool development. For example, the success of RNAPOND suggests a potential approach for RNA negative structural design.

  2. Yao, H.-T., Waldispühl, J., Ponty, Y., & Will, S. (2021, April). Taming Disruptive Base Pairs to Reconcile Positive and Negative Structural Design of RNA. RECOMB 2021 - 25th International Conference on Research in Computational Molecular Biology.

    Conference

    Abstract :

    The negative structural design of RNAs, also called Inverse folding, consists in building a synthetic nucleotides sequence adopting a targeted secondary structure as its Minimum Free Energy (MFE) structure. Computationally an NP hard problem, it is mostly addressed as an optimization task and solved using (meta-)heuristics. Existing methods are frequently challenged by demanding instances, and typically produce a single design, hindering practical applications of design, where multiple candidates are desirable to circumvent the idealized nature of design models. In this work, we introduce RNA POsitive and Negative Design (RNAPOND), a sampling approach which generates design candidates exactly from a well-defined distribution influenced by positive design objectives, including affinity towards the target and GC-content. Negative design principles are captured by an original iterative approach, where a subset of Disruptive Base Pairs (DPBs) are identified at each step, and subsequently forbidden from pairing by the introduction of suitable constraints. Despite the NP-hardness of the associated decision problem, we propose a combinatorial sampling algorithm which is Fixed Parameter Tractable (FPT) for the tree-width of the constraint network. Our algorithm, coupled with a suitable rejection step and an automated inference of DPBs, achieves a similar or better level of success in comparison to the state of the art, while allowing for the generation of diverse designs. Interestingly, it also automatically recovers some of the strategies used by practitioners of RNA design. RNAPOND is an open source project, available at: https://gitlab.inria.fr/amibio/RNAPOND

2020

  1. Sarrazin-Gendron, R., Yao, H.-T., Reinharz, V., Oliver, C. G., Ponty, Y., & Waldispühl, J. (2020, May). Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Modules Identification. RECOMB 2020 - 24th Annual International Conference on Research in Computational Molecular Biology.

    Conference

    Abstract :

    RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are too limited yet to benefit from all the knowledge accumulated in modules databases. Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition which allows to reduce the computational complexity and improve predictions on new sequences. We benchmarked our methods on 75 modules and 6360 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.

  2. Ponty, Y., Hammer, S., Yao, H.-T., & Will, S. (2020). Advanced design of structural RNAs using RNARedPrint [Book Chapter]. In E. Picardi (Ed.), RNA Bioinformatics.

    Book Chapter

    Abstract :

    RNA design addresses the need to build novel RNAs, e.g. for biotechnological applications in synthetic biology, equipped with desired functional properties. This chapter describes how to use the software RNARedPrint for the de novo rational design of RNA sequences adopting one or several desired secondary structures. Depending on the application , these structures could represent alternate configurations or kinetic pathways. The software makes such design convenient and sufficiently fast for practical routine, where it even overcomes notorious problems in the application of RNA design, e.g. it maintains realistic GC content.

2019

  1. Yao, H.-T., Chauve, C., Regnier, M., & Ponty, Y. (2019). Exponentially few RNA structures are designable [Conference]. ACM-BCB 2019 - 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 289–298.

    Conference

    Abstract :

    The problem of RNA design attempts to construct RNA sequences that performs a predefined biological function, identified by several additional constraints. One of the foremost objective of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metrics and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed, and jointly analyzed to yield asymptotic upper-bounds on the number of designable structures.

2018

  1. Feijao, P., Yao, H.-T., Fornika, D., Gardy, J., Hsiao, W., Chauve, C., & Chindelevitch, L. (2018). MentaLiST – A fast MLST caller for large MLST schemes [Journal]. Microbial Genomics, 4(2).

    Journal

    Abstract :

    MLST (multi-locus sequence typing) is a classic technique for genotyping bacteria, widely applied for pathogen outbreak surveillance. Traditionally, MLST is based on identifying sequence types from a small number of housekeeping genes. With the increasing availability of whole-genome sequencing data, MLST methods have evolved towards larger typing schemes, based on a few hundred genes [core genome MLST (cgMLST)] to a few thousand genes [whole genome MLST (wgMLST)]. Such large-scale MLST schemes have been shown to provide a finer resolution and are increasingly used in various contexts such as hospital outbreaks or foodborne pathogen outbreaks. This methodological shift raises new computational challenges, especially given the large size of the schemes involved. Very few available MLST callers are currently capable of dealing with large MLST schemes. We introduce MentaLiST, a new MLST caller, based on a k-mer voting algorithm and written in the Julia language, specifically designed and implemented to handle large typing schemes. We test it on real and simulated data to show that MentaLiST is faster than any other available MLST caller while providing the same or better accuracy, and is capable of dealing with MLST schemes with up to thousands of genes while requiring limited computational resources. MentaLiST source code and easy installation instructions using a Conda package are available at https://github.com/WGS-TB/MentaLiST.



Trajectory

University of Vienna · Austria

Post-Doc
Mar 2022 -

École Polytechnique (IPP Paris) · France

McGill University · Canada

Joint Ph.D. in Computer Science
Jan 2019 - Dec 2021

CEA Grenoble · France

Visiting Intern · M2 internship
Gene selection with network based a prior
supervised by Laurent Guyon
Mar 2018 - Aug 2018

Paris-Saclay University · France

Analysis · Modeling
Sep 2017 - Sep 2018

Simon Fraser University · Canada

Visiting Intern · M1 internship
Model-based clustering on Tuberculosis strains during an outbreak
co-supervised by Cedric Chauve and Leonid Chindelevitch
Apr 2017 - Aug 2017

École Polytechnique · France

Cycle Ingénieur
Bioinformatics · Computer Science · Applied Mathematics
Sep 2014 - Aug 2018

Lycée Louis-Le-Grand · France

Mathematics · Physics · Computer Science
Sep 2012 - Jul 2014