Computational methods for comparing and integrating multiple probing assays to predict RNA secondary structure

Afaf Saaidi; Delphine M Allouche; Yann Ponty; Bruno M Sargueil; Mireille M Regnier

Abstract

1- Introduction: RNA structure is key to understanding the mechanisms of retroviruses such as HIV. Many prediction approaches yield accurate structures, but they could be further improved by taking advantage of both next-generation sequencing technology and new experimental probing techniques (enzymatic and SHAPE).

2- Experimental probing data: In this poster, we present an integrative approach that uses several sets of experimental data, obtained through sequencing, to direct predictions, with the aim of finding an accurate structure lying at the intersection of the different experimental sources. On the one hand, reactivity profiles obtained with SHAPE technology, which report on individual nucleotides, were used as “soft constraints”, meaning that the reactivity values were translated into pseudo-energies as described in (Lorenz et al, 2016). On the other hand, RNase cleavage was performed with two enzymes, V and T, which target paired and unpaired nucleotides respectively. Reactivity scores resulting from these two experiments are used as hard constraints, forcing positions that exceed a specific threshold to be paired (Enzymatic-V) or unpaired (Enzymatic-T).

3- Stochastic sampling: At thermodynamic equilibrium, a given RNA can adopt many alternative structures, each characterized by a probability within the space of all possible conformations (Boltzmann ensemble). This probability is related to the energy of the structure: the more energy is needed to break the base pairs present in a structure, the higher its probability in the ensemble. We assume that the optimal structure(s) should be energetically stable and supported by several sets of experimental data. For this reason, we coupled stochastic sampling from the Boltzmann ensembles associated with the experimentally derived constraints with a clustering across experimental conditions, in order to generate structural models that are well supported by the available data.

4- Workflow description:

   1. Experimental data from different conditions (SHAPE, Enzymatic-T, Enzymatic-V) were analysed to extract reactivity profiles that serve as constraints.

   2. We sampled 2000 structures per condition. We performed a Boltzmann sampling (Ding et al, 2005) to generate a predefined number of stable structures compatible with the constraints derived for each condition. We used the stochastic sampling mode of RNAsubopt (-p option) to generate energetically stable structures that are either fully compliant with the constraints derived from enzymatic data (hard constraints (Mathews et al, 2004)), or constitute reasonable trade-offs between thermodynamic stability and compatibility with SHAPE data (soft constraints, using the pseudo-potentials of Deigan et al. (Deigan et al, 2009); see (Lorenz et al, 2016) for details).

   3. We merged the sampled structures, keeping a label that records the condition of origin of each structure.

   4. In order to detect structures that are similar to each other, the merged sets of models were clustered using the base-pair distance as a measure of dissimilarity. The distance between two structures corresponds to the number of base pairs that must be broken or formed to go from one structure to the other. A clustering algorithm (affinity propagation (Wang et al, 2007), as implemented in the scikit-learn Python package (Pedregosa et al, 2011)) was used to group the structures and identify recurrent ones. One of the advantages of affinity propagation resides in its low computational requirements.

   5. The next step consists in identifying clusters that are homogeneous, stable and well supported by experimental evidence. This led us to the following objective criteria:
      - Represented conditions, which inform us about the diversity of the cluster: Our primary targets are clusters compatible with multiple experimental conditions. However, the larger sampled sets required for reproducibility tend to populate each cluster with structures from all conditions. We thus associated with each cluster the number of represented conditions, defined as the number of conditions for which the accumulated Boltzmann probability in the cluster exceeds a predefined threshold.
      - Boltzmann weight, which is a measure of stability: Structures found in a given cluster may be unstable, and should then be treated as outliers. For this reason, we computed the cumulated normalized Boltzmann probabilities within the cluster, to favor stable clusters consisting of stable structures.
      - Average cluster distance, which accounts for coherence: We observed a general tendency of clustering algorithms to create heterogeneous clusters when faced with noisy data. We thus associated with each cluster the mean distance between pairs of structures, estimated for the sake of efficiency as the average distance to the MEA structure (Lu et al, 2009), in order to discard clusters that were too diverse.

   6. The next step consists in choosing the cluster(s) with high coherence, diversity and stability. For this purpose, we restricted our analysis to clusters found on the 3D Pareto frontier (Mattson and Messac, 2005) with respect to the three criteria mentioned above.

   7. After detecting the Pareto-optimal cluster(s), we need to identify a representative structure for each of them. We chose the maximum expected accuracy (MEA) structure (Lu et al, 2009), which is defined as the secondary structure whose structural elements have the highest accumulated Boltzmann probability within the cluster.

5- Results: This resulted in 2 structures, which we narrowed down to a single candidate using compatibility with the 1M7 SHAPE data as a final discriminatory criterion.
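As an illustration of how reactivity profiles can be turned into constraints, the following minimal Python sketch converts SHAPE reactivities into per-nucleotide pseudo-energies of the form m * ln(reactivity + 1) + b, following the functional form proposed by Deigan et al. (2009), and builds a hard-constraint string from enzymatic scores. The slope and intercept (2.6 and -0.8 kcal/mol), the threshold value, the handling of missing data and the function names are assumptions made for this example; they are not stated in the abstract. The constraint symbols ('|' for a position forced to pair, 'x' for a position forced to stay unpaired, '.' for no constraint) follow the notation used by the ViennaRNA tools.

```python
import math

# Slope and intercept in kcal/mol, as published by Deigan et al. (2009).
# Assumed here for illustration; the abstract does not state the values used.
DEIGAN_M = 2.6
DEIGAN_B = -0.8


def shape_pseudo_energy(reactivity, m=DEIGAN_M, b=DEIGAN_B):
    """Return the pseudo-energy m * ln(reactivity + 1) + b for one nucleotide.

    Missing data (None or a negative reactivity) contributes no bonus or
    penalty; this convention is an assumption of the sketch.
    """
    if reactivity is None or reactivity < 0:
        return 0.0
    return m * math.log(reactivity + 1.0) + b


def hard_constraint_string(scores, threshold, symbol):
    """Mark every position whose score exceeds the threshold with `symbol`.

    Use symbol '|' for Enzymatic-V (forced paired) and 'x' for
    Enzymatic-T (forced unpaired); other positions stay unconstrained ('.').
    """
    return "".join(symbol if s > threshold else "." for s in scores)


# Hypothetical scores for a 5-nucleotide fragment.
v_constraint = hard_constraint_string([0.1, 0.9, 0.8, 0.2, 0.0], 0.5, "|")
t_constraint = hard_constraint_string([0.7, 0.0, 0.1, 0.6, 0.9], 0.5, "x")
```

With these parameters, weakly reactive nucleotides receive a negative pseudo-energy (a bonus for being paired), while strongly reactive ones receive a positive penalty.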
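The base-pair distance used for clustering can be sketched in a few lines of Python. The example assumes that structures are given in dot-bracket notation without pseudoknots; the function names are illustrative and do not come from the original workflow.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def base_pair_distance(a, b):
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(a) ^ base_pairs(b))


def distance_matrix(structures):
    """Symmetric matrix of pairwise base-pair distances."""
    pair_sets = [base_pairs(s) for s in structures]
    return [[len(p ^ q) for q in pair_sets] for p in pair_sets]


# Three toy structures of length 6.
matrix = distance_matrix(["((..))", "(....)", "......"])
```

Affinity propagation operates on similarities rather than distances, so one possible choice is to pass the negated distance matrix to scikit-learn's AffinityPropagation with affinity="precomputed"; whether the original workflow used exactly this transformation is not stated in the abstract.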
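The cluster criteria and the Pareto filtering of steps 5 and 6 could be computed along the following lines. This is a sketch under several assumptions that the abstract does not specify: Boltzmann probabilities are normalized separately within each condition, each sample entry is a distinct structure, the temperature is 37 °C, and the threshold is a free parameter. All names are hypothetical.

```python
import math
from collections import defaultdict

RT = 0.0019872 * 310.15  # gas constant (kcal/mol/K) times 37 °C in kelvin


def cluster_criteria(samples, threshold):
    """Compute (represented conditions, Boltzmann weight) per cluster.

    samples: list of (cluster_id, condition, energy in kcal/mol).
    Probabilities are normalized within each condition (an assumption).
    """
    e_min = min(e for _, _, e in samples)  # shift energies to avoid overflow
    weights = [math.exp(-(e - e_min) / RT) for _, _, e in samples]
    total = defaultdict(float)
    for (_, cond, _), w in zip(samples, weights):
        total[cond] += w
    prob = defaultdict(float)  # (cluster, condition) -> accumulated probability
    for (cl, cond, _), w in zip(samples, weights):
        prob[(cl, cond)] += w / total[cond]
    result = {}
    for cl in {c for c, _, _ in samples}:
        per_cond = [p for (c, _), p in prob.items() if c == cl]
        n_conditions = sum(1 for p in per_cond if p > threshold)
        result[cl] = (n_conditions, sum(per_cond))
    return result


def pareto_front(criteria):
    """Return the cluster ids that no other cluster dominates.

    criteria: {cluster_id: (n_conditions, boltzmann_weight, mean_distance)}.
    The first two values are maximized and the third is minimized.
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
        return no_worse and a != b

    return sorted(
        cid for cid, value in criteria.items()
        if not any(dominates(other, value) for other in criteria.values())
    )


# Toy example: cluster "B" is dominated by "A"; "C" survives because it is tighter.
front = pareto_front({"A": (3, 0.9, 5.0), "B": (2, 0.5, 8.0), "C": (1, 0.2, 2.0)})
```

A cluster stays on the frontier as long as no other cluster is at least as good on all three criteria and strictly better on one, which is why several clusters may remain and a further criterion (here, compatibility with the 1M7 SHAPE data) is needed to select a single candidate.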
