Research topics
Descriptors
The design of new, chemically relevant molecular descriptors is a core activity of our group.
ISIDA Fragment Descriptors encode occurrence frequencies of substructural patterns in a molecule. The patterns are extracted from the 2D structure using graph algorithms, in an open-ended way: the user specifies the nature of the patterns to be monitored (sequences, circular fragments, atom pairs and atom triplets, including or ignoring bond order information), as well as the minimal and maximal sizes to be considered (expressed as topological distances, i.e. graph path lengths). Every pattern matching these specifications that is present in at least one of the molecules in the processed set is added to the list of monitored patterns in the final – open-ended – descriptor vector.
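As a rough illustration of the principle (not the actual ISIDA implementation), the following Python sketch counts atom-symbol sequences – simple paths of bounded topological length – in a toy molecular graph; the atom labels and the example molecule are hypothetical.

```python
# Illustrative sketch of sequence-fragment counting (not the ISIDA code).
from collections import Counter

def sequence_fragments(atoms, bonds, min_len=2, max_len=4):
    """Count atom-label sequences (simple paths) of bounded topological length."""
    adj = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)

    counts = Counter()

    def walk(path):
        if min_len <= len(path) <= max_len:
            seq = tuple(atoms[i] for i in path)
            counts[min(seq, seq[::-1])] += 1     # canonical reading direction
        if len(path) == max_len:
            return
        for nxt in adj[path[-1]] - set(path):    # simple paths only
            walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    # each undirected path is enumerated once from each of its two ends
    return Counter({k: v // 2 for k, v in counts.items()})

# Example: ethanol heavy-atom skeleton C-C-O
print(sequence_fragments(["C", "C", "O"], [(0, 1), (1, 2)]))
# Counter({('C', 'C'): 1, ('C', 'O'): 1, ('C', 'C', 'O'): 1})
```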
In addition to the ‘default’ rendering of atoms by their symbol, various coloring schemes can be used to generate alternative, information-rich atom labels. Of particular importance are “fuzzy” pharmacophore type and force-field type labels, which can be assigned to atoms prior to the actual fragmentation process. This is done according to a procedure that relies on the ChemAxon protonation equilibrium model (pKa plugin) in order to predict expected relative populations of anionic and cationic forms in equilibrium with the neutral associated acids/bases, and to assign pharmacophore flagging weights. For example, an amine/ammonium group may count as a Cation and Hydrogen Bond Donor to a degree of 80%, whilst considered as a neutral Hydrogen Bond Acceptor to a degree of 20% – the relative flagging weights being used to modulate the final occurrence levels of the substructures containing the amino/ammonium moiety.
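The weighting idea can be shown with a minimal sketch: each protonation-state rendering of a fragment contributes its population weight, rather than a full unit, to the occurrence count. The 80/20 split reuses the amine/ammonium example above; the labels are hypothetical.

```python
# Toy illustration of fuzzy pharmacophore weighting; real weights come
# from the ChemAxon pKa-based protonation model, not from this sketch.
from collections import Counter

def fuzzy_counts(fragment_alternatives):
    """fragment_alternatives: list of (label_sequence, population_weight)."""
    counts = Counter()
    for seq, weight in fragment_alternatives:
        counts[seq] += weight
    return counts

# An aliphatic amine next to a carbon: 80% protonated (Cation + Donor),
# 20% neutral (Acceptor); both labelings of the same atom pair are counted.
pair_alternatives = [
    (("C", "N[Cat,Don]"), 0.8),
    (("C", "N[Acc]"),     0.2),
]
print(fuzzy_counts(pair_alternatives))
# Counter({('C', 'N[Cat,Don]'): 0.8, ('C', 'N[Acc]'): 0.2})
```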
The above-mentioned high level of chemical realism renders ‘colored’ ISIDA descriptors very information-rich and thus well suited to support both Quantitative Structure-Activity Relationships and similarity-based searching for active analogues. Note that the supported ISIDA fragmentation schemes are diverse enough to allow wide tuning of the nature of the “chemical similarity” one may capture. With circular fragment types using atom-symbol-based coloring, retrieved analogues will display a strong similarity in terms of their skeleton, matching the innate “feeling” of chemical similarity of medicinal chemists. By contrast, with fuzzy atom pair or triplet counts colored by pharmacophore type, returned analogues may feature radically novel, unrelated scaffolds that nevertheless embody a global pharmacophore pattern closely matching that of the reference compound – ‘scaffold hopping’ is thus possible. To make similarity searching even more powerful, our web server supports, in addition to the classical Tanimoto and Euclidean similarity scores, a series of less often used but very powerful alternatives – in particular, the asymmetric Tversky approach.
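For illustration, here is a small sketch of Tanimoto versus asymmetric Tversky scoring on binary fragment sets (a simplification: ISIDA descriptors are counts, for which analogous min/max-based formulas apply); the example sets and the α/β values are arbitrary.

```python
# Set-based Tanimoto and Tversky similarity; fragment strings are made up.
def tanimoto(a, b):
    return len(a & b) / len(a | b)

def tversky(query, candidate, alpha=0.9, beta=0.1):
    """Asymmetric: alpha weights features unique to the query, beta those
    unique to the candidate; alpha > beta favors candidates that contain
    most of the query's features (superstructure-like matches)."""
    inter = len(query & candidate)
    only_q = len(query - candidate)
    only_c = len(candidate - query)
    return inter / (inter + alpha * only_q + beta * only_c)

q = {"CC", "CO", "CN", "CCO"}
c = {"CC", "CO", "CCO", "CCN", "CNC"}
print(tanimoto(q, c), tversky(q, c))   # 0.5  vs  ~0.73
```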
This combination of the diversity of fragmentation schemes with the information richness of graph coloring approaches makes it very likely that some of the possible description schemes will eventually prove convenient choices for virtually any classical structure-property problem. Furthermore, fragment counts are intuitive – which greatly enhances the interpretability of structure-activity models.
QSAR (Quantitative Structure-Activity Relationships) Methodology
The laboratory develops and adapts machine learning methods for QSAR modelling. Examples of concepts studied in the laboratory are:
- Applicability domain assessment: establishing a set of rules used to identify compounds for which the QSAR model may provide unreliable predictions, so as to discard as many mispredicted compounds as possible while sacrificing a minimum of correct predictions (see the sketch after this list).
- Ensemble learning: strategies to combine different QSAR models aiming to increase the performance of predictions.
- Inductive transfer: an approach to design efficient QSAR models for some property, by incorporating information about related properties.
- Semi-supervised learning: machine-learning approaches using both labeled and unlabeled data. This helps to obtain QSAR models specifically oriented toward a given chemical library.
- Modeling of non-additive mixtures: a challenging task for QSAR, involving the design of special "mixture" descriptors and kernels.
- "Divide and conquer" strategy: for a given property, several global and local models are generated, then combined into one consensus model.
- Dimensionality reduction methods: these attempt to simplify the representation of the data, which becomes easier to manipulate; the resulting models are computationally more efficient and often easier to interpret.
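As a minimal illustration of two of the concepts above – ensemble (consensus) prediction and a simple distance-based applicability domain check – the following sketch uses generic placeholder models and data, not the laboratory's actual implementations.

```python
# Generic sketch: consensus of several models + nearest-neighbor-distance
# applicability domain. Models, data and threshold are all made up.
import numpy as np

def consensus_predict(models, x):
    """Ensemble learning: average the predictions of several models."""
    return np.mean([m(x) for m in models])

def in_domain(x, X_train, threshold):
    """Trust the query only if it lies within `threshold` of its nearest
    training-set neighbor in descriptor space."""
    return np.linalg.norm(X_train - x, axis=1).min() <= threshold

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
models = [lambda x: x.sum(), lambda x: 2 * x[0]]
x = np.array([0.9, 0.9])
print(consensus_predict(models, x), in_domain(x, X_train, threshold=0.5))
```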
QSAR models are often difficult to interpret. Yet, their interpretation is useful, for it may – ideally – reveal mechanistic hints about the way in which, for example, ligands interact with a target, or else, equally important, highlight biases induced by the finite size of the training dataset, or discrepancies induced by the presence of outliers. The laboratory exploits the characteristics of the ISIDA fragment descriptors to compute atomic contributions to a model: first, the contribution of a descriptor is estimated by a numerical estimate of the first-order partial derivative of the model; this value is then shared among all atoms covered by the descriptor. The atomic contributions can then be rendered on a picture using a color code.
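The atom-coloring procedure may be sketched as follows; the model, the fragments and the atom memberships below are hypothetical stand-ins.

```python
# Sketch of atomic contributions: numerically estimate the model's partial
# derivative w.r.t. each fragment count, then share that contribution
# equally among the atoms matched by the fragment.
import numpy as np

def atomic_contributions(model, counts, frag_atoms, n_atoms, eps=1.0):
    """counts: fragment count vector of one molecule.
    frag_atoms[k]: list of atom indices matched by fragment k."""
    contrib = np.zeros(n_atoms)
    base = model(counts)
    for k, atoms in enumerate(frag_atoms):
        bumped = counts.copy()
        bumped[k] += eps
        dfdk = (model(bumped) - base) / eps      # forward finite difference
        for a in atoms:                          # share among member atoms
            contrib[a] += dfdk * counts[k] / len(atoms)
    return contrib

model = lambda c: 0.5 * c[0] - 1.2 * c[1]        # toy linear model
counts = np.array([2.0, 1.0])                    # two fragment counts
frag_atoms = [[0, 1], [1, 2]]                    # atoms in each fragment
print(atomic_contributions(model, counts, frag_atoms, n_atoms=3))
# -> [ 0.5 -0.1 -0.6]
```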
Chemical Space Analysis
Rendering a molecule as a dot, positioned by its high-dimensional descriptor vector in some abstract chemical space, is a perfect strategy for computer-aided processing of chemical information, but not for the human mind. The latter yearns for lower-dimensional representations of chemical space – preferably, as a planar map. This, of course, cannot be achieved without information loss. However, the pioneering idea of our group to adapt Generative Topographic Mapping (GTM) for chemoinformatics applications has retrospectively proven rather fruitful: this non-linear projection technique can propose a 2D view of chemical space while preserving enough of the original chemical information to still support robust quantitative property prediction, in spite of the dimensionality reduction. GTMs are built by fitting a 2D manifold (a “rubber sheet”) so as to approach as closely as possible the given “frame” compounds in the original descriptor space. Then, the “rubber sheet” is “flattened” out onto a 2D square grid of nodes, and each molecule is assigned calculated probabilities of “residence” in each node (technically, its “responsibility” vector).
The position of a molecule on the map, as a pair of (x,y) coordinates in the plane, can be inferred as the mean of the node (x,y) coordinates, weighted by the node responsibility values (the probability of the molecule residing in each node). Related compounds therefore appear as neighbors on the map. As a consequence, the map can be “colored” on the basis of any known property of the mapped molecules – the “color” (for example, the activity level) at any map point (x,y) can be interpolated from the activity values of molecules projected in the nearest neighborhood of that point. Such an “activity landscape” can be used as a predictive tool: position any new compound on the map, and “read” the local “color” (activity value) associated with that point. This can be done both with continuous activity values (“activity landscape”) and with classification labels (“class landscape”). Thus, any GTM is easily interpreted as a (Q)SAR model: a model and a simple representation of that model are generated together.
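The position-reading step can be sketched in a few lines; the grid size and the responsibility values below are made up for illustration.

```python
# Reading a molecule's 2D position as the responsibility-weighted mean of
# the GTM node coordinates; grid and responsibilities are illustrative.
import numpy as np

def map_position(responsibilities, grid_side):
    """responsibilities: vector over grid_side*grid_side nodes, sums to 1."""
    xs, ys = np.meshgrid(np.arange(grid_side), np.arange(grid_side))
    nodes = np.stack([xs.ravel(), ys.ravel()], axis=1)   # node (x, y) list
    return responsibilities @ nodes                      # weighted mean

r = np.zeros(16)           # a 4 x 4 map
r[[5, 6]] = [0.7, 0.3]     # molecule "resides" mostly in node 5
print(map_position(r, 4))  # -> [1.3 1.0], a point between nodes 5 and 6
```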
Each node of the map can be colored according to the most probable class (classification map), an average property (property landscape), or cumulated responsibilities (density map). Here, the property landscape and density map values were interpolated using kriging.
Furthermore, activity/class landscapes also provide information about the relative occupancy of map zones by the residing molecules. One may be confident that in map zones richly populated by the training set compounds (i.e. the compounds used to “color” the map) the activity landscape is accurate – less so in sparsely populated map regions, where the interpolation of activity landscape values may be based on too remote experimental examples. Thus, a GTM implicitly provides a definition of its own applicability domain. Below, misclassified compounds can be seen to fall, for the most part, in sparsely populated map zones. On the left, map zones with any minimal non-zero occupancy are colored by the statistically dominant class of their residents. A black dot in a colored area marks a compound that was predicted to be of the class (“color”) of the zone it fell into, but belongs in reality to another class. In the right-hand pictures, the minimal occupancy threshold at which a map pixel is assigned a color has been increased: pixels sparsely populated by training set compounds are now colorless. Indeed, it is apparent that most mispredictions happened exactly in those areas that were “whitened” as a consequence of raising the minimum occupancy threshold.
Note that map building is, per se, an unsupervised learning process, and that, once constructed, a map can be challenged with diverse regression and classification problems. In most cases, maps built on the basis of focused compound sets associated with a well-defined property/activity will likely fail when confronted with molecules of completely unrelated chemical series, associated with unrelated properties. However, might training on a diverse set of representative drug-like molecules lead to some “pluricompetent” map, capable of hosting robust regression/classification models for compound sets and properties never encountered at the training stage? To what extent would such a “universal” map allow separation of actives from inactives, not for a single given target, but for an entire profile of biologically diverse targets? Based on an evolutionary algorithm, such “universal” maps have been successfully grown and extensively challenged with respect to their active/inactive discrimination power, over a vast majority of the ChEMBL-available structure-activity information.
“Universal” GTMs can be built that successfully discriminate actives from inactives for many unrelated targets and molecular structures (an external challenge featuring > 450 targets – other than the selection targets used to grow the maps – ~150,000 individual compound structures and ~300,000 experimental activity values).
Finally, more innovative techniques such as “Stargate” GTMs, which directly connect the descriptor space to an activity profile space (defined by the vector of activities with respect to several targets), are under development, with encouraging results.
Schematic representation of the Stargate GTM approach: the S-GTM model links different spaces (Space 1 and Space 2) and acts as a “gate” to map molecules from one space into the other, using two different manifolds fitted to the data in each space.
Reaction Mining
In reaction mining studies we use the concept of the Condensed Graph of Reaction (CGR). A CGR is obtained by fusing the reactant and product graphs into one single molecular graph, characterized by both conventional chemical bonds (single, double, aromatic, etc.) and “dynamical” bonds representing chemical transformations (single to double, double to single, broken single, created double, etc.).
Example of a chemical reaction containing two reactants and two products, encoded as a Condensed Graph of Reaction.
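A toy sketch of the fusion step described above, assuming reactant and product bonds are keyed by atom-mapped indices; the bond tables and the ‘before>after’ label syntax are illustrative, not the actual CGR toolkit.

```python
# Toy CGR construction: unchanged bonds keep their label; changed, broken
# or formed bonds become "dynamical" bonds labeled 'before>after'.
def condensed_graph(reactant_bonds, product_bonds):
    """Inputs: {frozenset({i, j}): bond_order_label} over mapped atoms."""
    cgr = {}
    for pair in reactant_bonds.keys() | product_bonds.keys():
        before = reactant_bonds.get(pair, "none")
        after = product_bonds.get(pair, "none")
        cgr[pair] = before if before == after else f"{before}>{after}"
    return cgr

reactant = {frozenset({1, 2}): "single", frozenset({2, 3}): "double"}
product  = {frozenset({1, 2}): "single", frozenset({2, 3}): "single",
            frozenset({3, 4}): "single"}
print(condensed_graph(reactant, product))
# {1,2}: 'single' (unchanged), {2,3}: 'double>single', {3,4}: 'none>single'
```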
A CGR can be considered as a pseudomolecule for which molecular descriptors or fingerprints can be generated. This approach has been used for the development of predictive models for:
- rates of SN2, E2 and Diels-Alder reactions,
- optimal reaction conditions for Michael reactions and deprotection reactions,
- tautomeric equilibrium constants,
- activity cliffs in protein-ligand interactions.
Flexible Docking & Protein Folding
S4MPLE (Sampler for Multi Protein-Ligand Entities) is a modular molecular modeling tool designed for the conformational sampling of various classes of compounds. It is based on a hybrid genetic algorithm, combining force field-based optimization with classical evolutionary sampling strategies. Allowing full control over the considered degrees of freedom, S4MPLE is a completely general approach for visiting the conformational space of arbitrary molecules or molecular complexes.
It may be used equally well for conformational sampling of organic compounds and small peptides, and for docking. The docking mode can simulate the sampling of multiple ligands in the presence of a rigid or flexible binding site.
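In the spirit of this approach – and only as a highly simplified sketch, omitting the force-field local optimization of the true hybrid algorithm – a genetic search over rotatable dihedral angles might look as follows; the toy torsional energy function is an assumption.

```python
# Simplified genetic-algorithm conformational sampling: a chromosome is a
# vector of dihedral angles; crossover/mutation rates are illustrative.
import random
import math

def energy(dihedrals):
    """Toy torsional energy; real use would call a force field."""
    return sum(1.0 + math.cos(3 * a) for a in dihedrals)

def sample(n_dihedrals, pop_size=20, generations=50):
    pop = [[random.uniform(-math.pi, math.pi) for _ in range(n_dihedrals)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=energy)
        survivors = pop[: pop_size // 2]               # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_dihedrals) if n_dihedrals > 1 else 0
            child = a[:cut] + b[cut:]                  # one-point crossover
            i = random.randrange(n_dihedrals)          # mutate one angle
            child[i] += random.gauss(0.0, 0.3)
            children.append(child)
        pop = survivors + children
    return min(pop, key=energy)

best = sample(n_dihedrals=4)
print([round(a, 2) for a in best], round(energy(best), 3))
```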
Concept: folding, docking and protein loop modeling are all particular instances of conformational sampling.
WEB services
The laboratory develops and provides various WEB services for public use.
The WEB platform for virtual screening is accessible at http://infochim.u-strasbg.fr/webserv/VSEngine.html
The service allows mining for similar compounds (active analogue retrieval) using different kinds of molecular descriptors and – classical or original – similarity metrics. A Self-Organizing Map based strategy is used to hash the database.
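The hashing idea can be sketched as follows: database compounds are binned by their nearest SOM prototype, and a query is compared only against compounds in its own bin rather than against the whole database. The prototypes and descriptor vectors below are made up; a production system would also search neighboring cells.

```python
# Sketch of SOM-based database hashing for fast similarity search.
import numpy as np

def nearest_cell(x, prototypes):
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # trained SOM cells
database = np.random.rand(1000, 2) * 2                       # descriptor vectors

bins = {}
for idx, vec in enumerate(database):                         # index once
    bins.setdefault(nearest_cell(vec, prototypes), []).append(idx)

query = np.array([0.9, 1.1])
candidates = bins.get(nearest_cell(query, prototypes), [])   # search one bin
best = min(candidates, key=lambda i: np.linalg.norm(database[i] - query))
print(len(candidates), best)
```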
The server provides access to the QSAR/QSPR models published by the laboratory. It performs automated standardization of user-input structures, descriptor calculation and property prediction, in association with various applicability domain assessment schemes, and issues a final “trustworthiness” verdict for each prediction.
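Schematically, the workflow may be sketched as below; every function body here is a stand-in for the server's actual components.

```python
# Assumed-shape sketch of the prediction pipeline with a trust verdict.
import numpy as np

def standardize(smiles):          # stand-in for real structure normalization
    return smiles.strip()

def descriptors(smiles):          # stand-in for ISIDA fragment counts
    return np.array([float(smiles.count(c)) for c in "CNO"])

def predict_with_verdict(smiles, model, X_train, ad_radius):
    x = descriptors(standardize(smiles))
    y = model(x)
    trusted = np.linalg.norm(X_train - x, axis=1).min() <= ad_radius
    return y, ("trustworthy" if trusted else "out of applicability domain")

X_train = np.array([[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]])
model = lambda x: float(x @ np.array([0.4, -0.2, 0.1]))
print(predict_with_verdict("CCNC", model, X_train, ad_radius=1.5))
# -> (1.0, 'trustworthy')
```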
The laboratory also maintains a legacy server for QSAR/QSPR models; in practice, its engine is incorporated into the present virtual screening service.
Finally, the laboratory supports and promotes the Mobyle project: we participate in the Mobyle Network to provide several chemoinformatics tools.
The Mobyle project is supported by the Institut Pasteur.
Databases
- Solvent eXtraction Database (SXD)
- Database of drug-like compounds ScreenDB