Context and motivation of the project
Project Description
Expected Scientifical and Technical Achievments
Expected Industrial and Economical Achievments
Context and motivation of the project
The computational prediction of energetically stable molecular conformations is at the core of computeraided molecular property and biological activity predictions^{1}. Such predictions are of obvious economic interest, as their success would significantly alleviate the need to randomly synthesise and test molecules in order to discover the “needle in the haystack”  the compound with desired properties. The enumeration of all the relevant minima of the potential energy function of the geometry is a NPcomplete problem^{2,3,4} with tens to thousands of degrees of freedom and would, per se, justify the use of distributed computing technologies. Furthermore, this energy landscape is extremely rugged, with steep “walls” due to van der Waals repulsions^{5,6}, where insignificant shifts of atomic positions may cause dramatic changes in energy. Therefore, most of the multimodal optimisation tools currently used in scientific and engineering problems^{7,8,9} are challenged when applied to conformational sampling. Traditionally, the conformational sampling problem has been differently addressed in function of the molecular size: small organic compounds with tens of degrees of freedom (torsional axes) were typically processed by Monte Carlo or systematic sampling programs, while the sampling of macromolecules has been typically performed by molecular dynamics or Monte Carlo simulations starting from an initial fold built on hand of experimental constraints. Ab initio (ignoring experimental constraints) folding of modest size proteins are still a big technical challenge, both because molecular dynamics simulations are not efficient at energy barrier crossing and because they are not easy to run in parallel (several short simulations will not visit the same conformational space as a long one). Recent work^{10} showed that evolutionary algorithms and in particular Genetic Algorithm for Conformational Sampling (CSGA) can be rendered much more effective by hybridization with other optimisation heuristics, leading to a tool with both excellent barrier crossing skills, a controllable diversity of retrieved solutions and relatively rapid convergence to lowenergy zones. Genetic algorithms are inherently parallel and then very well suited for distributed computing^{11}, as they require relatively little communication between the nodes. However, the deployment of the CSGA (currently running on a multiprocessor – a SMP machine) on the GRID brings up a series of algorithmic and technical challenges concerning the management of the global solution diversity (how to efficiently manage and distribute to active nodes the lists of “taboo” conformations already discovered and not to be revisited, for example). Several potential deployment schemes will have to be explored.
To summarize, the main goal of this project is to come up with viable technical solutions for a multimodal optimisation procedure specifically designed to work on the GRID and specifically adapted to conformational sampling. While several docking programs have been ported on the GRID^{12}, there is still much to do in order to define an optimal deployment architecture of these approaches. Therefore, the planned collaboration with scientists from CEA Grenoble (DSV/iRTSV/BIM and DSV/DRDCiRTSV/CMBA), which plan to use the reference docking programs AutoDock^{13} and FlexX^{14} on the GRID (European FP6 EGEE project “Proposal for a First Data Challenge on In Silico drug discovery) is beneficial for both partners: it allows us to benchmark our results with respect to the performance of state of the art technology, while the herein developed software will be used by the CEA.
The second, more fundamental problem of the accurate definition of the molecular energy surface will be addressed by this project after the solving of technical and scientific issues addressed above. Classical “force fields”^{15,16,17} (parameterized empirical energy functions of molecular internal coordinates) being either too specialized (for proteins) or inaccurate^{18}, there is a risk to see the software correctly finding the local energy minima of an energy surface which fails to describe the actual behaviour of the studied molecules. Another, more subtle problem is the relationship between the “point” energies calculated for every given geometry and the free energy characterizing the average occupancy of a given conformer or docking mode. The latter free energies are the one directly related to experimental properties (binding affinities, conformational preferences) and they might, in principle, be obtained from the ensemble of “point” energy levels of all the states in the immediate vicinity of the energy minimum of the potential landscape. Classical docking algorithms^{19,20,21} typically ignore this relationship and rely on a “docking” function to position the ligand in the protein binding site, but use a different empirical “scoring” function to estimate binding affinity on hand of the previously obtained single pose. This is incoherent, but necessary unless fully flexible docking is performed (rigid docking cannot provide a binding free energy). Here, we would like to explore a novel “bootstrapping” force field calibration approach, by running, at given force field parameter setup, fullblown simulations of several test molecules (conformational sampling of structured peptides and sugars) or proteinligand complexes. The appropriateness of the choice of force field parameters (which determine, of course, the relative depth of the relevant energy minima, but also the ruggedness of the energy landscape and thus the success of the sampling procedure) will be evaluated on hand of the relevance of the “free energies” derived from the partition functions of the sampled ensembles. The optimal force field resulting from several such optimisation cycles would (a) provide a correct relative scoring of the energy minima of a series of peptides, sugars and organic ligands and (b) define an as smooth as possible energy landscape, such as to maximize the chances of success of the conformational sampling approach.
Project Description
Conformation sampling and docking require searching an optimal configuration (minimizing the energy) among exponentially large spaces of possible configurations with a number of degrees of freedom ranging from hundreds to a few thousands. To cope with this challenge, a double action is required. On one hand, the use of Grids permits the exploitation of the intrinsic parallelism which is present in many aspects of the problem. On the other hand, sophisticated and efficient algorithms are needed to exploit fully the deep mathematical properties of the physics involved. Brute force methods are at a loss here and the way is open for effective sampling methodologies, algorithms for combinatorial optimization, and strategies for searching over (hyper)surfaces with multiple local minima. Innovative methodologies will be used in order to push forward the state of the art. The objectives of this proposal are mainly:
 Specification and validation of mathematical models for conformational sampling and docking: The objective here is the modelization of the problem as a combinatorial optimization problem. The analysis of different models is carried out. New models based on multiobjective formulation of the problem will also be explored. For our knowledge, this will be the first multiobjective model for the problem. Preliminary results have been obtained in the master research study of S. Hoareau.
 Design of parallel cooperative metaheuristics to solve the NPcomplete problem: The design of metaheuristics based on evolutionary algorithms and local search will be carried out. The main criteria in the design are : generality, efficiency, scalability, and robustness. Hence, some advanced search mechanisms will be introduced such as hybridization, adaptive control.
 Grid Deployment: The designed distributed and cooperative optimization methods will be implemented on Grids. Different models of Grid computing will be explored. Evaluation of the algorithm’s implementation in terms of efficiency, scalability, and faulttolerance will be realized. Another important objective is the dissemination of this work to a large community in academia and industry in terms of users and developers.
 Validation and application: Here, we are dealing with the biological validation of the obtained results on well known benchmarks of the literature and on some reallife problems. A comparison with other well known projects and software using different models and different optimization approaches (EGEE FP6 project, CHARMM, Decrypton, …) will be performed.
The contributions to key actions objectives are the following:
1. Modeling aspect: The main objective of this important part is the specification of the mathematical models for the problem. An important point to keep in mind is the flexibility of the PARADISEO software for optimization which allows to solve different models of the problem (http://www.lifl.fr/OPAC). Hence, we will evaluate, analyse and explore different combinatorial models for the problem. An extension towards a multiobjective model for the problem will be considered. In our knowledge, this is a first explored multiobjective model for the problem. The PARADISEO software already integrates algorithms to solve multiobjective problems.
2. Parallel Cooperative Optimization aspect: Two complementary optimization approaches based on metaheuristics will be considered : evolutionary algorithms (populationbased search) and local search (single solution search). Those metaheuristics have to be adapted to solve the considered model for the problem. Hence, specific coding and operators will be dedicated to the problem. Then, the development of efficient parallel cooperative methods to solve the problem will be considered. It is important to integrate adaptive techniques for the automatic generation of the different parameters of the methods (crossover and mutation rate, hybridization, …). Some advanced multiobjective search mechanisms will be realized for the multiobjective model of the problem. Let’s note that all the algorithms will be implemented under the PARADISEO framework which is a generic open source software for optimization problems.
A simple prototype of the hybrid Conformational Sampling Genetic Algorithm is already operating on a lowscale (SMP multiprocessor). At this point, it accepts only intramolecular degrees of freedom (torsional axes) – its generalization to introduce intermolecular translation and rotation coordinates is in progress. The existing algorithm includes a “metaoptimization” loop that iteratively relaunches conformational sampling at various combinations of the operational parameters controlling the functioning of the GA (population size, mutation frequency, selection strategies, etc.). Sampling success at a given operational parameter set is measured by the depth and number of the diverse energy minima that were sampled during that run. The method is autoadaptive and successive CSGA runs have increasing chances to perform well due to a judicious choice of operational parameters.
3.GRID problem solving aspect: First, we have to gridify in terms of design and implementation the framework PARADISEO on Grids (PARADISEO@GRID). This task has been largely initiated in the group OPACLIFL. We will use standard middleware freely available and open source such as Globus and Condor. This work is independent from the docking problem. So, the results here may be used for any optimization problem and then for different communities. Then, the evaluation of the deployment of PARADISEO@GRID will be done on different platforms (GRID’5000, Desktop computing) solving classical combinatorial optimization problems.
Once the distributed cooperative optimization algorithms are designed for the docking problem, they will be deployed on a Grid using the PARADISEO@GRID framework and giving the software Docking@Grid. Finally, the evaluation of the framework Docking@Grid on different Grid platforms (GRID’5000, Desktop computing) will be realized using metrics such as speedup, scalability, faulttolerance, etc.
4. Validation and application aspect: The biological validation of the obtained results will be performed on well known academic benchmarks. The training set, currently featuring 4 small structured proteins and a cyclic sugar molecule (cyclodextrine) will be extended to include larger proteins (for which only a loop will be submitted to sampling) and proteinligand complexes. Starting from the set of CVFF force field parameters, force field will be optimised by (a) randomly changing force field parameters, (b) sampling of all the test cases according to the new force field setups (c) checking whether “native” conformations were retained among the most stable of the generated conformers and (d) if so, evaluating the free folding or docking energies from the ensembles of “correct” and “wrong” geometries. These free energies should be, at least qualitatively, in agreement with experiment (e.g. ensembles of wellfolded conformers should have a lower free energy than misfolded ones). Docking of (virtual) molecules into active sites of interesting biological targets will then be performed. A direct application could be flexible docking of organic molecules into the tubulin binding site, in collaboration with our CEA partners, in order to detect compounds with potential anticancer effects. Tubulin binders^{22} such as Taxol (paclitaxel) and Taxotere (docetaxel) act through an antimitotic mechanism of action based on stabilization of microtubules, thus disrupting the mitosis of cancer cells and eventually forcing their apoptosis.
Expected Scientifical and Technical Achievments
The current project addresses two innovative elements :
 The design and optimisation of a GRIDbased platform dedicated to conformational sampling and docking, with an architecture aimed at maximising the benefits of massive parallel deployment and maximising robustness with respect to the pitfalls of GRID computing (efficient scheduling, fault tolerance, etc.). This software will be based on an optimization framework (PARADISEO@GRID) which allows a flexibility in terms of the used mathematical models and optimization algorithms. The developed software will be flexible in terms of the mathematical model used. Indeed, many models are actually experimented in the consortium^{23}. A multiobjective combinatorial model will also be considered^{24}.
 The fitting and validation of an empirical force field with respect to the properties of the sets of stable conformers that it generates. Such a force field has to be accurate and to allow an easy sampling of the potential energy surface it describes. There are no guarantees that such a force field – covering the typical functional groups in organic ligands, as well as in macromolecules – will actually be found, but the fundamental question whether or not GRID computation may allow the treatment of the docking problem as a generalized simultaneous conformational sampling approach of two molecules is worth asking anyway. If there exists a selfconsistent approach combining (a) a rigorous, welldefined sampling procedure and (b) a force field adapted to this procedure, with the property of returning reasonable estimates of the free energies of the sampled ensembles, then this would represent a major breakthrough in molecular modelling. In the worst case, if the force field fitting appears infeasible or leads to no conclusive results, the project would still have lead to a powerful sampling tool using the CVFF parameters, as valid as any other forcefield based method (and which could, with relatively little effort, adapted to run with other molecular force fields).
There are potentially many users of such a tool in both academia:
 Network of Genopoles in France and particularly Genopole of Lille (www.genopolelille.fr).
 The medicinal chemistry groups of UMR 8525 and the Faculty of Pharmacy in Lille – Prof. Deprez.
 CEA Grenoble – DSV/iRTSV (S. Roy and L. Lafanechère).
 The group of Dr. Sergeant, INSERM Lille – applicant of a Decrypthon project for GRID docking in search of inhibitors of targets involved in neuromuscular diseases.
 etc.
and industry (Cerep, etc.).
In terms of the docking software Docking@Grid, the development of a web portail for remote job submission by interested users can be considered and the project could be later integrated into the larger platform of chemoinformatics tools under construction around the site of the “Chimiothèque Nationale”  project of the CNRS (Prof. Hibert, Strasbourg). This platform, designed as a portail for the display of the collections of molecules synthesised in French academic labs might offer predicted affinities of these compounds with respect to various biologically interesting targets, in order to facilitate compound selection.
In terms of the optimization framework PARADISEO@GRID, the PARADISEO framework (cluster and SMP version) is currently provided as an open source and we have a well established community using PARADISEO. More the 50 users (academic and industrials) of more than 15 countries are using the optimization software for solving different problems in diverse domains. A forum and a user group are dedicated to the framework. The software is the heart of the INRIA project “DOLPHIN” (www.inria.fr/recherche/equipes/dolphin.fr.html).
Expected Industrial and Economical Achievments
Virtual screening, the generic research field encompassing all the various computational computeraided drug design strategies, is nowadays well established at the core of the technology of rational drug design. Unlike high throughput screening, which needs to be “fed” existing molecule collection and comes with a high cost in compounds, biological targets and other consumable reagents, virtual screening is much cheaper and may access huge libraries of virtual molecules, thus allowing the selection of best suited candidates for synthesis. It must be kept in mind that, even if virtual screening is based on simplified working hypotheses that will forcibly trigger inaccurate predictions, it remains nevertheless an important way to reduce costs in drug discovery. Supposing that a computerbased prediction is wrong in 9 cases out of 10. A medicinal chemist trusting the output of the algorithm will find that out of 10 synthesized and tested molecules, one is active and nine were unnecessary – this may look deceiving, but it must be kept in mind that, in a random strategy without the input from the algorithm, a hundred to one thousand compounds must have been synthesised and tested until one active would have been found by pure chance. The herein described approach is a computerintensive method that accounts for subtle effects usually ignored in the “industry standard” docking procedures – therefore, higher quality results are expected. If these promises are met, in spite of the intrinsic limitations of this forcefield based method, then a technology transfer towards big pharmaceutical companies can be envisaged. All leading companies in the field already possess massively parallel computers and/or GRID architectures, and as they are reluctant to send their proprietary molecule structures for docking on thirdparty computers, they might nevertheless express their interest in purchasing and internal use of this original docking algorithm.
Virtual Screening is even more interesting for academic research, which has little access to high throughput screening. Molecule synthesis and purchase strategies could be driven by docking results of virtual libraries of chemically feasible molecules. The availability of online docking results for French academic molecules in the “Chimiothèque Nationale” would increase the interest of potential industrial purchasers for compounds predicted to bind at certain targets. In case of successful solving of all the algorithmic and force field setup challenges, the use of the GRID platform for tubulin (or other anticancer target) binder research in the context of the CEAiRTSV program may potentially lead to novel anticancer drugs.
1
Neumaier, A., (1997). Molecular modelling of proteins and
mathematical prediction of protein structure. SIAM Review,
39:407460.
2
Crescenzi, P., Goldman, D., Papadimitriou, C.H., Piccolboni, A. et
Yannakakis, M. (1998). On the Complexity of Protein Folding. Journal
of Computational Biology, 5(3):423466.
3
Atkins, J. et Hart W.E. (1999).On the
intractability of the protein folding with finite alphabet of
aminoacids.
4
Calland P.Y. (2003). On the structural complexity of a protein.
Protein Engineering, 16(2):7686.
5
Hagler, A.T., Huler, E. et Lifson, S., (1974). Energy functions for
peptides and proteins. i. derivation of a consistent force field
including the hydrogen bond from amide crystals. Journal of
American Chemical Society, 96(17):53195327.
6
Hagler, A.T. et Lifson, S. (1974). Energy
functions for peptides and proteins. ii. The amine hydrogen bond and
calculation ofamide crystal properties. Journal of American
Chemical Society, 96(17):53275335.
7
Neumaier, A., (2004). Acta Numerica 2004, chapter Complete
Search in Continuous Global Optimization and Constraint
Satisfaction, pp. 271369. Cambridge University Press.
8
Schulze Kremer, S. et Tiedmann, U. (1994). Parameterizing
genetic algorithms for protein folding simulation. HICSS (5),
pp.345354.
9
Damsbo, M., Kinnear, B.S., Hartings, M.R., Ruhoff, P.T., Jarrold,
M.F., et Ratner, M.A. (2004). Application of Evolutionary algorithm
methods to polypeptide folding: comparison with experimental results
for unsolvated Ac(AlaGlyGly)_{5}Lys^{+}. PNAS,
101(19)72157222.
10
Parent, B., Kökösy, A. et Horvath, D., (2005). Optimized
Evolutionary Strategies in Conformational Sampling. Soumis à
publication
11
EG. Talbi, "Parallel Combinatorial Optimization", John
Wiley & Sons, USA, 2005.
12
Thormann, M. et Pons, M. (2001). Massive docking of flexible ligands
using environmental niches in parallelized genetic algorithms.
Journal of Computational Chemistry, 22(16):19711982.
13
Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E.,
Belew, R.K. et Olson, A.J. (1998). Automated docking using a
Lamarckian genetic algorithm and an empirical binding free energy
function. Journal of Computational Chemistry,
19(14):16391662.
14
Claussen, H., Buning, C., Rarey, M., et Lengauer, T. (2001). FlexE:
Efficient molecule docking considering protein structure variations.
Journal of Molecula Biology, 308(2):377395.
15
Herges, T. et Wenzel, W. (2004). An
allatom ForceField for tertiary structure prediction of helical
proteins. Biophysical Journal, 87:122.
16
Liwo, A., Lee, J., Ripoll, D.R., Pillardy, J. et Scheraga, H.A.
(1999). Protein structure prediction by global optimisation of a
protein energy function. PNAS, 96(10):54825485.
17
Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M.,
Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W., et
Kollman, P.A. (1995). A second generation force field for the
simulation of proteins, nucleic acids, and organic molecules.
Journal of American Chemical Society, 117:51795197.
18
Okur, A., Strockbine, B., Hornak, V. et Simmerling, C, (2003). Using
PC Clusters to evaluate the transferabiblity of molecular mechanics
force fields for proteins. Journal of Computational Chemistry,
24(1):2131.
19
de Magalhães, C.S., Barbosa, H.J.C., Dardenne, L.E., A
genetic algorithm for the ligandprotein docking problem. Genetic
and Molecular Biology, 27(4):605610.
20
Carlson, H.A. et McCammon, J.A. (2000). Accommodating
Protein Flexibility in Computational Drug Design. Molecular
Pharmacology, 57(2)213218.
21
Given, J.A. et Gilson, M.K. (1998). A hierarchical method for
generating lowenergy conformers of a proteinligand complex.
Proteins: Structure Function and Genetics, 33(4):475495.
22
Geney R., Sun L., Pera P., Bernacki R.J., Xia S., Horwitz
S.B., Simmerling C.L., Ojima I. (2005) Use of the Tubulin Bound
Paclitaxel Conformation for StructureBased Rational Drug Design.
Chemistry & Biology, 12:339–348
23
Parent B.. Echantillonage conformationnel par
algorithmes génétiques. Rapport de DEA, IBL, Juin
2004.
24
Hoareau S.. Protein folding sur Grilles. Rapport de DEA, LIFL,
Juil 2005.
