EBERHARD-KARLS-UNIVERSITÄT TÜBINGEN
Wilhelm-Schickard-Institut für Informatik
Lehrstuhl Rechnerarchitektur

Diplomarbeit (Diploma Thesis)

Active Structure Learning using Genetic Algorithms and Kernel Functions

Christoph Schwörer

Advisors:
Prof. Dr. rer. nat. Andreas Zell, Wilhelm-Schickard-Institut für Informatik
Prof. Dr. rer. nat. Karl-Heinz Wiesmüller, EMC Microcollections GmbH

Started: 13th January 2010
Completed: 12th July 2010

Declaration
I hereby certify that I have written this thesis independently and have used no sources other than those cited.

Tübingen, 12th July 2010
Christoph Schwörer
Abstract.
Current 3D QSAR approaches attempt to build models based not only on 1D or 2D descriptors of molecules such as weight, charge, or molecular graphs, but also on 3D-sensitive information such as the conformation or properties of the molecular surface. A basic assumption in building these 3D QSAR models is that the best results are attained by using the best available (i.e., the obtained active structure) data. In this work I tried to find such a best achievable 3D QSAR model by means of optimizing a model over a set of conformations using a genetic algorithm and three different kernel methods. The intent was to see whether the resulting models would include the active structures. For the generation of the sets of conformations I used two different approaches: the first a precomputation of the conformations, the second an implicit generation concurrent with the optimization. The results will show that the models with the best generalization and prediction accuracy in most cases do not include the active conformation, but rather conformations with a minimal average pairwise distance to all other possible conformations of the respective molecules.
Contents

1 Introduction  1

2 Background Information  4
2.1 Kernel Functions  4
2.2 Support Vector Regression  7
2.3 Rotation with quaternions  7
2.4 RMSD calculation with quaternions  7
2.5 Genetic Algorithm  9
2.6 Quantitative Structure-Activity Relationship  11

3 Materials and Methods  13
3.1 Overall process  13
3.2 Radial Distribution Function  15
3.3 Kernel  17
3.3.1 Probability Product Kernel  17
3.3.2 Radial Basis Function Kernel  18
3.3.3 Atom Pair Kernel  19
3.4 Dataset  21
3.5 Conformation Sampling  21
3.5.1 Precomputed Conformation Sampling  21
3.5.2 Implicit Conformation Sampling  24
3.6 SVR  27

4 Results  28
4.1 Precomputed Conformation Sampling  28
4.1.1 Initial Runs  31
4.1.2 Reduced Dataset with PPK and RBF Kernel  33
4.1.3 Reduced Dataset with Atom Pair Kernel  35
4.1.4 Alternative Parameters for the Product Probability Kernel  37
4.1.5 Alternative Parameters for APK  39
4.1.6 Increased Mutation Rate  41
4.1.7 Alternative Conformation Sampling  42
4.1.8 Alternative Mutation Operator  45
4.1.9 Reruns  47
4.2 Implicit Conformation Sampling  47
4.2.1 Initial Runs  49
4.2.2 Reduced Dataset and Fixed Conformation  51

5 Discussion  52

6 Prospects  54

Bibliography  55
1 Introduction

In drug design one of the major goals is to find new lead structures. Lead structures already show a certain affinity towards the intended target but express unwanted side effects or lack certain properties; for example, they may be toxic or have a low bioavailability. Without a detailed understanding of the biochemical processes responsible for the activity, the search for such a new lead structure is non-trivial.
The usual process is simply to try a huge number of different chemical compounds in vitro and observe their activity. But the combinatorial possibilities of this strategy can explode even for small systems. For instance, the number of compounds needed to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000.

Therefore this classical screening process was automated and combinatorially optimized over the last decades into high throughput screening (HTS), which allows a systematic search in large databases with hundreds of thousands of entries. Still, this process makes up a large share of the development costs and time. Further, the chemical compounds needed for the synthesis are often rare and hard to obtain in the purity needed for reliable results.
One way to optimize this exhaustive search, and to give an indication of the right direction, is to develop a model that quantitatively relates variations in biological activity to changes in molecular properties which can be easily obtained for each compound. One of the first to build such a model was Corwin Hansch, correlating lipophilicity and polarity with biological activity in his Hansch method [Han69]. But there exist many other approaches to this Quantitative Structure-Activity Relationship (QSAR) principle, which mostly differ in their choice of molecular descriptors and mathematical models such as Partial Least Squares or Principal Component Analysis. The QSAR models developed in this work are based on kernels which are evaluated by Support Vector Regression.
In recent years several models have been developed using 3D descriptors of molecules. These 3D descriptors are important because, to build a model and gain understanding of the binding process, it is not enough to know the single components and values of a molecule; one also has to know their three-dimensional arrangement. One can see this, for example, in figure 1.1, where a single molecule can take on several conformations. Knowing which of these conformations is the active conformation can improve the modeling process and the understanding of the chemical processes leading to the activity.
One thing all QSAR methods have in common is the basic assumption that the biological activity is an additive function of the molecular properties (2D or 3D) of the substituents and groups of the respective structure. Not only the mere presence of those groups is essential, but also their three-dimensional arrangement.
This leads to the expectation that, when using 3D descriptors, only good and correct training data including 3D information of the molecules leads to a good model of the activity. But what if this does not hold true? What if a better model can be created without using the actually correct physical data? The question is whether the reverse of this expectation is always valid, i.e. whether model quality maps one-to-one onto training data quality.
Figure 1.1: This figure shows an overlay of six conformations of the same thrombin inhibitor. One can see the high flexibility of the lower ring system.
(a) The standard process for building a QSAR model.

(b) The reverse process of optimizing a QSAR model with respect to the training data quality.

Figure 1.2: These figures show the standard process and the process used in this work to build a QSAR model. To optimize the model quality with respect to the varying training data, the process has been reversed.
To test whether one can find such a model, I will reverse the QSAR approach (see figure 1.2), optimizing the QSAR model to predict activity by means of altering the training dataset, where for each data point several values are given, including the actual physical ones. While the input data points (i.e. their molecular descriptors) vary, their target function value (i.e. the activity) stays the same.
The intention of this experiment is to see whether the best attainable model includes the actual active structure or an artificially created one. To this end I compiled a data set and created a set of conformers for each molecule. Then I optimized the activity prediction concurrently over the whole training dataset, so as not to favor one molecule over another by optimizing them successively one after the other. A good way to handle multidimensional optimization with several datasets is to use a genetic algorithm, which I did in this case.

In this work I will present the methods used for the generation of the dataset, the optimization, and the evaluation. Further, I will present the results and discuss their significance. Finally, I will give a perspective on further work which can be done on this topic.
2 Background Information

2.1 Kernel Functions

Detecting linear relations has been the focus of much research in statistics and machine learning for the last few decades, and the resulting algorithms are well understood, well developed and efficient [Sew07]. However, many models of natural processes are not linear. So, if a problem is non-linear, instead of trying to fit a non-linear model one can map the problem from the input space $X$ to a new, higher-dimensional space called the feature space $F$ and then use a linear model in the feature space. This mapping can be achieved by a non-linear transformation. For example, the function $\varphi$ can be given as

$$\varphi : \mathbb{R}^2 \to \mathbb{R}^3 \quad \text{with} \quad \varphi(x_1, x_2) = \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right) \tag{2.1}$$

While this function is a very simple one, other functions can easily become computationally impracticable for both polynomial features and higher dimensionality. This is grounded in the fact that the number of different monomial features of degree $p$ is $\binom{d+p-1}{p}$, with $d = \dim X$ [Vap95] (e.g. $p = 7$, $d = 28 \cdot 28 = 784$ corresponds to a total of approximately $3.7 \cdot 10^{16}$ features).

The key to an efficient computation is the observation made by [BGV92] that

$$\left\langle \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2 \tag{2.2}$$

which allows the use of kernel functions, where $\varphi$ need not be explicitly known as long as the function corresponds to a dot product in the feature space $F$:

$$k(x, x') := \langle \varphi(x), \varphi(x') \rangle \tag{2.3}$$
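The identity in eq. 2.2 is easy to check numerically. The following sketch (my own illustration, not code from this work) evaluates both sides of the equation for the degree-2 polynomial feature map:

```java
// Sketch: the kernel trick for the degree-2 polynomial kernel (eq. 2.2).
// phi maps R^2 into R^3; the dot product in feature space equals <x,x'>^2.
public class KernelTrickDemo {
    // Explicit feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
    static double[] phi(double x1, double x2) {
        return new double[] { x1 * x1, Math.sqrt(2) * x1 * x2, x2 * x2 };
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Kernel evaluation without ever computing phi: k(x, x') = <x, x'>^2
    static double k(double[] x, double[] xp) {
        double d = dot(x, xp);
        return d * d;
    }

    public static void main(String[] args) {
        double[] x = { 1.5, -2.0 }, xp = { 0.5, 3.0 };
        double viaFeatureMap = dot(phi(x[0], x[1]), phi(xp[0], xp[1]));
        double viaKernel = k(x, xp);
        System.out.println(viaFeatureMap + " == " + viaKernel); // equal up to rounding
    }
}
```

The kernel side costs one 2-dimensional dot product, whereas the explicit side requires building the feature vectors first; for high degree and dimension only the kernel side stays tractable.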
2.2 Support Vector Regression

Many multivariate methods assume that there is a linear relation between $X$ and $Y$ which holds for all samples. In chemoinformatics this assumption does not hold true and causes a variety of problems in the prediction of unknown data points. One way to solve these problems is to use non-linear learning methods such as support vector regression (SVR). The support vector algorithm is a non-linear generalization of the Generalized Portrait algorithm developed in Russia in the 1960s [VL63] [VC64]. Its groundwork, the statistical learning theory or VC theory, has been developed over the last half century by Vapnik and Chervonenkis [VC74] [Vap82] [Vap95]. The VC theory defines properties of learning machines which enable them to generalize to unseen data.

Given a set of training data $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \chi \times \mathbb{R}$, with $\chi$ denoting the space of input patterns (e.g. $\chi = \mathbb{R}^d$), the goal of $\varepsilon$-SV regression is to find a function $f(x)$ with a maximum deviation of $\varepsilon$ from the actually received targets $y_i$ for all the training data. In addition, $f$ should be as flat as possible. The form of a linear function $f$ is given as

$$f(x) = \langle w, x \rangle + b \quad \text{with } w \in \chi,\ b \in \mathbb{R} \tag{2.4}$$

with $\langle \cdot, \cdot \rangle$ denoting the dot product in $\chi$ and flatness meaning a small $w$. To attain this we minimize the Euclidean norm $\frac{1}{2}\|w\|^2$, which can be formally written as a convex optimization problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon \\ \langle w, x_i \rangle + b - y_i \le \varepsilon \end{cases} \tag{2.5}$$

The above formulation is viable for all problems where a function $f$ actually exists that approximates all pairs $(x_i, y_i)$ with precision $\varepsilon$. If this is not the case, or if we want to allow some errors, according to [CV95] one can introduce slack variables $\xi_i, \xi_i^*$, leading to:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases} \tag{2.6}$$

where the constant $C > 0$ defines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. This is the same as dealing with an $\varepsilon$-insensitive loss function $|\xi|_\varepsilon$ denoted by:

$$|\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases} \tag{2.7}$$

Figure 2.1 depicts the use of $\xi$ and $\varepsilon$. Extending support vector machines to solve non-linear problems is possible by using a standard dualization approach utilizing Lagrange multipliers, as described in [Fle89], leading to the following formula:

$$\begin{aligned} L := \frac{1}{2}\|w\|^2 &+ C \sum_{i=1}^{n} (\xi_i + \xi_i^*) - \sum_{i=1}^{n} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\ &- \sum_{i=1}^{n} \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) \\ &- \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b) \end{aligned} \tag{2.8}$$

with $L$ being the Lagrangian and $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ the Lagrange multipliers. Thus they have to satisfy the constraints

$$\eta_i, \eta_i^*, \alpha_i, \alpha_i^* \ge 0 \tag{2.9}$$
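The $\varepsilon$-insensitive loss of eq. 2.7 can be stated in a few lines of code. A minimal sketch (the class and method names are mine, for illustration only):

```java
// Sketch of the epsilon-insensitive loss |xi|_eps from eq. (2.7):
// deviations up to eps are free, larger ones are penalized linearly.
public class EpsilonLoss {
    static double loss(double residual, double eps) {
        double abs = Math.abs(residual);
        return abs <= eps ? 0.0 : abs - eps;
    }

    public static void main(String[] args) {
        double eps = 0.5;
        System.out.println(loss(0.3, eps));  // inside the tube: 0.0
        System.out.println(loss(-1.2, eps)); // outside the tube: 0.7
    }
}
```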
Figure 2.1: The image shows the use of $\xi$ and $\varepsilon$ in a support vector regression. Data points with a distance smaller than $\varepsilon$ are not considered an error. For data points with a distance larger than $\varepsilon$ the parameter $\xi$ decides whether they are tolerated or not.

To gain an optimal result one can infer from the saddle point condition that the partial derivatives of $L$ have to vanish:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0, \qquad \frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0, \qquad \frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0 \tag{2.10}$$

Substituting eq. 2.10 into eq. 2.8 leads to the dual optimization problem:

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C] \end{aligned} \tag{2.11}$$

Having already eliminated $\eta_i, \eta_i^*$ via $\eta_i^{(*)} = C - \alpha_i^{(*)}$, we can further reformulate eq. 2.8 so that

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i, \quad \text{thus} \quad f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b. \tag{2.12}$$

The fact that the data $x_i$ only contribute in the form of dot products allows the introduction of kernel functions in such a way that

$$k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{2.13}$$

This allows the prediction of unknown data points via

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) k(x_i, x) + b \tag{2.14}$$
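Eq. 2.14 means that prediction requires only kernel evaluations against the training points, never the feature map itself. A hypothetical sketch (the RBF kernel, coefficients and data below are arbitrary illustration values, not from this work):

```java
// Sketch of the SVR prediction function f(x) = sum_i (a_i - a_i^*) k(x_i, x) + b
// from eq. (2.14). Training points, coefficients and the kernel bandwidth are
// made-up illustration values.
public class SvrPredict {
    // Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    static double rbf(double[] a, double[] b, double gamma) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.exp(-gamma * s);
    }

    // x[i]: training inputs, coef[i] = alpha_i - alpha_i^*
    static double predict(double[][] x, double[] coef, double b, double gamma, double[] query) {
        double f = b;
        for (int i = 0; i < x.length; i++)
            f += coef[i] * rbf(x[i], query, gamma);
        return f;
    }

    public static void main(String[] args) {
        double[][] x = { { 0.0 }, { 1.0 }, { 2.0 } };
        double[] coef = { 0.5, -0.2, 0.8 };  // alpha_i - alpha_i^*
        double f = predict(x, coef, 0.1, 1.0, new double[] { 1.0 });
        System.out.println("f(1.0) = " + f);
    }
}
```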
2.3 Rotation with quaternions

Quaternions are an extension of the complex numbers invented by William Rowan Hamilton in 1843 [Ham66] and formally introduced to computer graphics by the publication of Shoemake [Sho85] [Har94].

Quaternions encode rotations by a set of 4 real numbers (or 2 complex numbers), while a linear representation of a rotation requires a $3 \times 3$ matrix, thus 9 numbers. Further, quaternions occupy a smooth, seamless, isotropic space which is the generalization of the surface of a sphere. This means that one does not need to take special care to avoid singularities (e.g. the gimbal lock, where two rotation axes collapse into one, making the interpolation irreversible).

The four-dimensional space $\mathbb{H}$ is spanned by the real axis and three additional orthogonal axes, spanned by the vectors $\mathbf{i}, \mathbf{j}, \mathbf{k}$ called the principal imaginaries, which obey Hamilton's rule

$$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1 \tag{2.15}$$

where the three-dimensional vectors $\mathbf{i}, \mathbf{j}, \mathbf{k}$ signify

$$\mathbf{i} = (1,0,0), \quad \mathbf{j} = (0,1,0), \quad \mathbf{k} = (0,0,1). \tag{2.16}$$

A quaternion $q = r + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ consists of a real part $r$ and a pure part $x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ and can be written as a scalar and a three-dimensional vector

$$q = (a, \mathbf{b}) \tag{2.17}$$

The sum of two quaternions is given as

$$q_1 + q_2 = (a_1 + a_2,\ \mathbf{b}_1 + \mathbf{b}_2) \tag{2.18}$$

and their product as

$$q_1 q_2 = (a_1 a_2 - \mathbf{b}_1 \cdot \mathbf{b}_2,\ a_1 \mathbf{b}_2 + a_2 \mathbf{b}_1 + \mathbf{b}_1 \times \mathbf{b}_2) \tag{2.19}$$

For a quaternion $q_1$ with unit length (i.e. absolute value 1) and a pure quaternion $q_2$ (i.e. with $a_2 = 0$), the conjugation $q_1 q_2 q_1^c$ rotates $\mathbf{b}_2$ around the axis described by $\mathbf{b}_1$ through the angle $\varphi = 2\cos^{-1}(a_1)$, where $\varphi$ is the desired rotation angle.
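The quaternion product of eq. 2.19 and the conjugation $q x q^c$ used in the next section can be sketched directly. The class below is my own illustration (not code from this work); its main method rotates the vector (1, 0, 0) by 90° about the z-axis:

```java
// Sketch: quaternion product (eq. 2.19) and rotation by conjugation q v q^c.
// A unit quaternion with real part cos(phi/2) and vector part sin(phi/2)*axis
// rotates a pure quaternion (0, v) by the angle phi about that axis.
public class QuaternionDemo {
    final double r, x, y, z; // real part r, pure part (x, y, z)

    QuaternionDemo(double r, double x, double y, double z) {
        this.r = r; this.x = x; this.y = y; this.z = z;
    }

    // Hamilton product: (a1, b1)(a2, b2) = (a1a2 - b1.b2, a1b2 + a2b1 + b1 x b2)
    QuaternionDemo mul(QuaternionDemo q) {
        return new QuaternionDemo(
            r * q.r - x * q.x - y * q.y - z * q.z,
            r * q.x + q.r * x + y * q.z - z * q.y,
            r * q.y + q.r * y + z * q.x - x * q.z,
            r * q.z + q.r * z + x * q.y - y * q.x);
    }

    QuaternionDemo conj() { return new QuaternionDemo(r, -x, -y, -z); }

    // Rotate vector v by angle phi (radians) about the unit axis (ux, uy, uz).
    static double[] rotate(double[] v, double phi, double ux, double uy, double uz) {
        double c = Math.cos(phi / 2), s = Math.sin(phi / 2);
        QuaternionDemo q = new QuaternionDemo(c, s * ux, s * uy, s * uz);
        QuaternionDemo p = new QuaternionDemo(0, v[0], v[1], v[2]); // pure quaternion
        QuaternionDemo rot = q.mul(p).mul(q.conj());
        return new double[] { rot.x, rot.y, rot.z };
    }

    public static void main(String[] args) {
        // 90 degrees about the z-axis maps (1, 0, 0) to (0, 1, 0).
        double[] v = rotate(new double[] { 1, 0, 0 }, Math.PI / 2, 0, 0, 1);
        System.out.printf("(%.3f, %.3f, %.3f)%n", v[0], v[1], v[2]);
    }
}
```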
2.4 RMSD calculation with quaternions

In various cheminformatic situations the problem arises of finding the best superposition of one rigid object onto another, for example to give a similarity measure for two proteins or, in the case of this work, for two conformations of the same molecule. One method is finding the best rotation and translation to minimize the root mean square deviation (RMSD) [Kab76]; examples are given by [Dia76] and [McL72]. A prerequisite for this method is a given assignment of the points matched onto each other. Usually such an assignment is already given (e.g. the canonical atom numbering of two different conformations).

The mathematical problem can then be stated as follows [Cou04]: "given an ordered set of vectors $y_k$ (target) and a second set $x_k$ (model), $1 \le k \le N$, find an orthogonal transformation $U$ and a translation $r$ such that the residual $E$ (weighted by $w_k$)

$$E := \frac{1}{N} \sum_{k=1}^{N} w_k \left| U x_k + r - y_k \right|^2 \tag{2.20}$$

is minimized," where the weight factor $w_k$ allows one to lay the emphasis on certain parts of the structure in question.

While Kabsch's method uses Lagrange multipliers, Mackay proposed a method in 1984 [Mac84] using quaternions to calculate the rotation matrix. One disadvantage of Mackay's method was that, using a linear form of the least square errors, the results could be false where objects had different relative orientations in space. In 1989 Kearsley developed a method solving the non-linear least square error problem with an eigenvalue determination through the use of quaternions [Kea89]. The proof that both Kabsch's and Kearsley's methods lead to the same result was given by Coutsias et al. in 2005 [Cou05].

If $x_k$ and $y_k$ are considered as pure quaternions, with $x_k := (0, \mathbf{x}_k)$ and $x_k^c = -x_k$, the rotation $U(q)$ can be written as

$$(0, U(q)\,\mathbf{x}_k) = q x_k q^c \tag{2.21}$$

and the residual function is transformed using quaternions to

$$E_q = \frac{1}{N} \sum_{k=1}^{N} (q x_k q^c - y_k)(q x_k q^c - y_k)^c \tag{2.22}$$

An expansion and a multiplication by $N$ leads to

$$N E_q = \sum_{k=1}^{N} \left[ (q x_k q^c)(q x_k q^c)^c + y_k y_k^c - (q x_k q^c) y_k^c - y_k (q x_k q^c)^c \right] = \sum_{k=1}^{N} \left[ x_k x_k^c + y_k y_k^c + (q x_k q^c) y_k + y_k (q x_k q^c) \right] \tag{2.23}$$

where the normalization $q q^c = 1$ and the property of pure quaternions $x^c = -x$ have been used. With $q x_k q^c$ and $y_k$ being pure quaternions, and with $a, b$ pure $ab + ba = 2(-\mathbf{a} \cdot \mathbf{b}, 0) = 2([ab]_0, 0)$, the last two terms in eq. 2.23 can be combined as follows:

$$(q x_k q^c) y_k + y_k (q x_k q^c) = 2\left([y_k (q x_k q^c)]_0, 0\right) \tag{2.24}$$

This means that only the 0th component is non-zero. Because of the associativity of the quaternions one can write $y_k (q x_k q^c) = (y_k q x_k) q^c$ and define $z_k := y_k q x_k$, which leads to the 4-vector form of $z_k$, $Z_k$, with $Z_k = A_L(y_k) A_R(x_k) Q$, with $A_L$, $A_R$ defined as follows:

$$A_R(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & p_3 & -p_2 \\ p_2 & -p_3 & p_0 & p_1 \\ p_3 & p_2 & -p_1 & p_0 \end{pmatrix}, \qquad A_L(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & -p_3 & p_2 \\ p_2 & p_3 & p_0 & -p_1 \\ p_3 & -p_2 & p_1 & p_0 \end{pmatrix} \tag{2.25}$$

All together we can write

$$-2\,y_k^T U(q)\,x_k = 2[y_k (q x_k q^c)]_0 = 2[z_k q^c]_0 = 2(z_{k0} q_0 + \mathbf{z}_k \cdot \mathbf{q}) = 2\,Q^T Z_k = 2\,Q^T A_L(y_k) A_R(x_k)\,Q \tag{2.26}$$

followed by the residue

$$N E_q = \sum_{k=1}^{N} \left( |x_k|^2 + |y_k|^2 \right) - 2\,Q^T F Q \tag{2.27}$$

with

$$F := -\sum_{k=1}^{N} A_L(y_k) A_R(x_k) \tag{2.28}$$

leading to the full form of the matrix $F$ in terms of the correlation matrix $R$:

$$F = \begin{pmatrix} R_{11} + R_{22} + R_{33} & R_{23} - R_{32} & R_{31} - R_{13} & R_{12} - R_{21} \\ R_{23} - R_{32} & R_{11} - R_{22} - R_{33} & R_{12} + R_{21} & R_{13} + R_{31} \\ R_{31} - R_{13} & R_{12} + R_{21} & -R_{11} + R_{22} - R_{33} & R_{23} + R_{32} \\ R_{12} - R_{21} & R_{13} + R_{31} & R_{23} + R_{32} & -R_{11} - R_{22} + R_{33} \end{pmatrix} \tag{2.29}$$

In this way the problem can be reduced to finding the extremum of the quadratic form $Q^T F Q$ for the four variables $q_i$, $i \in \{0, 1, 2, 3\}$, subject to the constraint $Q^T Q = 1$. Here $Q^T F Q$ is the standard Rayleigh quotient for the symmetric matrix $F$, whose maximum value is equal to its largest eigenvalue, which leads to the eigenvalue problem

$$F Q = \lambda Q \tag{2.30}$$

which in turn leads to the following expression for the best RMSD value:

$$e_q = \sqrt{\min_{\|q\|=1} E_q} = \sqrt{\frac{\sum_{k=1}^{N} \left( |x_k|^2 + |y_k|^2 \right) - 2\lambda_{\max}}{N}} \tag{2.31}$$
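The matrix $F$ of eq. 2.29 can be assembled mechanically from a $3 \times 3$ correlation matrix $R$. A sketch (my own naming; the thesis does not show this code) that also illustrates two properties visible in eq. 2.29, symmetry and zero trace:

```java
// Sketch: assemble the 4x4 matrix F of eq. (2.29) from a 3x3 correlation
// matrix R (0-based indices: R[0][0] corresponds to R11). The largest
// eigenvalue of F then yields the optimal RMSD via eq. (2.31).
public class QuaternionRmsd {
    static double[][] buildF(double[][] R) {
        return new double[][] {
            { R[0][0] + R[1][1] + R[2][2], R[1][2] - R[2][1], R[2][0] - R[0][2], R[0][1] - R[1][0] },
            { R[1][2] - R[2][1], R[0][0] - R[1][1] - R[2][2], R[0][1] + R[1][0], R[0][2] + R[2][0] },
            { R[2][0] - R[0][2], R[0][1] + R[1][0], -R[0][0] + R[1][1] - R[2][2], R[1][2] + R[2][1] },
            { R[0][1] - R[1][0], R[0][2] + R[2][0], R[1][2] + R[2][1], -R[0][0] - R[1][1] + R[2][2] }
        };
    }

    public static void main(String[] args) {
        double[][] R = { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 10 } };
        double[][] F = buildF(R);
        // F is symmetric and traceless by construction, as eq. (2.29) shows.
        double trace = F[0][0] + F[1][1] + F[2][2] + F[3][3];
        System.out.println("trace = " + trace);
    }
}
```

Because $F$ is symmetric, its largest eigenvalue can be found with any standard symmetric eigensolver (or a simple power iteration after shifting).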
2.5 Genetic Algorithm

In cheminformatics one often encounters optimization problems with several variable parameters. Traditional optimization methods such as steepest descent often fail at this task because they tend to run into a local optimum. To get around this problem, John Holland developed the class of Genetic Algorithms (GAs) at the University of Michigan during the 1960s and 70s [Hol75].

Genetic algorithms belong to the class of stochastic search methods. Their distinctive feature is that, instead of operating on a single solution like most other stochastic search methods, they operate on a whole set of solutions. The term Genetic Algorithm is a tribute to their basic operations, which derive from natural evolutionary processes such as inheritance, mutation, selection, and crossover.

Given a problem P with parameters x_1, ..., x_n, the first step is to initialize a first set of solutions, called the population M(0). Each single solution is called an individual m and is represented by a bit string called a chromosome (see figure 2.2). The initial value of each parameter is chosen at random within its predefined range.

Figure 2.2: This figure shows two individuals with parameters x_1, ..., x_4 encoded as a series of binary representations of different length.

The second step is to evaluate each individual (i.e. solution) in the current population M(t) for its fitness. This is done by applying the individual's parameter values to a fitness function (which in most cases is the initial problem function) and assigning the function result as fitness value u(m). This means that the parameters of individuals with higher fitness values lead to a better result of the problem function.

The third step is to assign each of the current individuals a selection probability p(m) which depends on the individual's fitness value u(m). This selection probability determines whether an individual is chosen for mating. There are several methods of assigning selection probabilities, such as roulette wheel selection (the likelihood of picking an individual is proportional to the individual's score), tournament selection (a number of individuals are picked using roulette wheel selection, then the best of these are chosen for mating), and rank selection (individuals are picked with a probability based on their fitness rank). It is important not to use a method which always picks the individuals with the best fitness, because then the population will quickly converge to these individuals, narrowing the search space.

The fourth step is to generate a new population M(t + 1), using the individuals selected in step three to produce offspring by applying the already mentioned genetic operators mutation and crossover with a predefined probability (see figures 2.3(a) and 2.3(b) for the genetic operators).

(a) Mutation of the fifth bit from 0 to 1

(b) Crossover after the third bit

Figure 2.3: This figure shows the two genetic operators mutation and crossover. These allow new individuals to be generated from already existing ones and introduce new sets of parameters with possibly better fitness values.

Steps two to four are then repeated until one of three conditions occurs: the best fitness in the current population reaches a given limit, the best fitness does not increase over several predefined generations, or steps two to four have been repeated a specific number of times.
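Steps one to four can be condensed into a short program. The sketch below, a toy GA maximizing the number of set bits in a 16-bit chromosome with roulette wheel selection, one-point crossover, bit-flip mutation and elitism, only illustrates the loop structure described above; all parameter values are arbitrary, and it is not the GA used in this work:

```java
import java.util.Random;

// Toy genetic algorithm: maximize the number of 1-bits in a 16-bit chromosome.
// Illustrates the loop of section 2.5: evaluate, select, crossover, mutate.
public class ToyGa {
    static final Random RNG = new Random(42);
    static final int BITS = 16, POP = 20, GENERATIONS = 60;
    static final double MUTATION_RATE = 0.02;

    static int fitness(boolean[] m) {
        int u = 0;
        for (boolean bit : m) if (bit) u++;
        return u;
    }

    // Roulette-wheel selection: probability proportional to fitness.
    static boolean[] select(boolean[][] pop) {
        int total = 0;
        for (boolean[] m : pop) total += fitness(m);
        int r = RNG.nextInt(total + 1);
        for (boolean[] m : pop) {
            r -= fitness(m);
            if (r <= 0) return m;
        }
        return pop[pop.length - 1];
    }

    static boolean[] offspring(boolean[] a, boolean[] b) {
        boolean[] child = new boolean[BITS];
        int cut = RNG.nextInt(BITS); // one-point crossover
        for (int i = 0; i < BITS; i++) {
            child[i] = (i < cut ? a[i] : b[i]);
            if (RNG.nextDouble() < MUTATION_RATE) child[i] = !child[i]; // mutation
        }
        return child;
    }

    static boolean[] run() {
        boolean[][] pop = new boolean[POP][BITS];
        for (boolean[] m : pop)                       // step 1: random initialization
            for (int i = 0; i < BITS; i++) m[i] = RNG.nextBoolean();
        for (int t = 0; t < GENERATIONS; t++) {
            boolean[] best = pop[0];                  // step 2: evaluate fitness
            for (boolean[] m : pop) if (fitness(m) > fitness(best)) best = m;
            boolean[][] next = new boolean[POP][];
            next[0] = best;                           // elitism: keep the best
            for (int i = 1; i < POP; i++)             // steps 3+4: select and breed
                next[i] = offspring(select(pop), select(pop));
            pop = next;
        }
        boolean[] best = pop[0];
        for (boolean[] m : pop) if (fitness(m) > fitness(best)) best = m;
        return best;
    }

    public static void main(String[] args) {
        System.out.println("best fitness = " + fitness(run()));
    }
}
```

With elitism the best fitness is monotonically non-decreasing over the generations, which corresponds to the stopping criteria above (limit reached, stagnation, or a fixed number of iterations).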
2.6 Quantitative Structure-Activity Relationship

For the development of a new drug it is important to know not only its chemical formula but also its conformation. The underlying principle for this is the so-called lock and key principle postulated by Emil Fischer in 1894 [Fis94], stating that an active compound has to be spatially complementary to its target to form a complex. But as we know today, there are several other factors that influence the formation of an active complex. These can be direct features of the molecules, like hydrophobicity, partial atomic charge, binding sites etc., or there can be influences from the surrounding solution (e.g. water), so that a ligand changes its conformation in the binding process. These considerations led to the expansion of the lock and key principle to the induced fit theory in 1958 [Kos58] [Kos94].

Figure 2.4: This figure shows a Thrombin-Hirudin complex, the Hirudin (magenta) being the key to the Thrombin (blue) lock.

Also new to this theory was the introduction of flexible binding sites, which can account for differences in specificity and affinity. This leads to the conclusion that the biological activity is a direct function of the ligand's three-dimensional structure, which in turn is the fundamental premise for the quantitative structure-activity relationship (QSAR) [SOW04]. QSAR methods attempt to represent the relationship between structural attributes of molecules and their biological activity. In the beginning, QSAR models were used to retrospectively analyze the activity modulation of molecules in a specific subset. But in the last decade QSAR models have been increasingly used for predictions on novel derivatives of well-known ligands [Eki04]. To be applicable to such a use, the applied QSAR models must be able to generalize and predict activities correctly beyond the chemical space defined by the given training data.

To that end, a large number of methods has been described in the literature since the beginning of research on QSAR. The early methods implemented only 2D features of molecules (e.g. the connection table of a molecule), while newer ones often include 3D features like the chemical properties of molecules in their bioactive conformation [SJ93] [OW91].
3 Materials and Methods
|
||
|
||
In this chapter the two main strategies applied to the problem and the overall process will be explained in detail and their function exemplified. The parameters used for the experiments and their progression will be given. The implementation of the algorithms and the use of external programs and code will be described. All algorithms were written in Java.

3.1 Overall process

Because this work consists of a concatenation of different machine learning and chemoinformatical methods, I will first give an overview of the whole process and then explain the individual methods in depth.

The aim was to see whether the best models for an activity prediction included the actual active structures of the given molecules or whether a better model could be found without them.

In this work I used two different approaches. The first was to precompute a set of conformers for each molecule, maximizing the coverage of the conformer space; the second was to create random new conformers during the optimization process. From the set of precomputed conformers, 100 (or the maximum available if fewer than 100) were chosen per molecule, equally distributed over the calculated relative energy range, and used as the training set. In both approaches the optimization was done by a genetic algorithm. The deciding factors for using a heuristic (in this case the genetic algorithm) were that a full search of the optimization space is not feasible for 100 molecules, each with at least 100 conformations, and that the solution hyperplane is very jagged, with no information about a starting point available.

The information about the molecules' conformations was encoded in the GA's genes, either as a direct reference to the whole conformation (in the precomputed approach) or as single dihedral angles for each rotatable bond in each molecule (in the implicit approach).
After each generation of the GA the fitness of its individuals was calculated. Each individual corresponded to a set of conformers, for which a kernel matrix was computed using one of the following kernel methods:

• Probability Product Kernel (PPK)

• Radial Basis Function (RBF) kernel

• Atom Pair Kernel (APK)

The first two work on the RDF of a given molecule, while the third works directly on the 3D model of the conformation. Each kernel matrix therefore consisted of similarity measures between the molecules, and for each molecule the pKi value was known. This information was used to build an SVR model predicting the activity of an unknown molecule in relation to its similarity to the molecules in the training set. For each model a set of best parameters was searched using 5 repetitions of leave-one-out convoluted with a 5-fold cross-validation. These best parameters were used to compute the MSE of the model, which in turn served as the fitness value for each individual.

The next generation of individuals in the GA was then generated using standard GA operators such as mutation and crossover, with the individuals selected to mate for the next generation according to their fitness value.
(a) In the first approach the conformer sampling was done before the optimization process

(b) In the second approach the conformer sampling was done implicitly as part of the optimization process by mutating the conformers

Figure 3.1: These two figures show the different procedures of the two approaches. Both consist of four frameworks indicated by the different colors: the conformer generation (either with MacroModel or implicit), the GA (with JavaEva2), which runs the optimization loop, the kernel matrix computation and the SVR modeling (with libsvm).
The process of generating new generations, calculating each kernel matrix and evaluating via the SVR was then repeated 200 times. The development of the MSE for each individual and of the RMSD between the conformations of the individual and the known active structure was recorded.

3.2 Radial Distribution Function

An important prerequisite for the computation of active structures with respect to the different conformations is keeping some kind of knowledge about the 3D structure of the molecules throughout the whole process. Therefore a molecular representation is needed that guarantees 3D sensitivity. This imposes some prerequisites on a structure code:

• independence from the number of atoms, i.e. the size of the molecule,

• unambiguity regarding the three-dimensional arrangement of the atoms and

• invariance against translation and rotation of the entire molecule
(a) Overlay of three different conformations of the same molecule

(b) The RDF for the three molecules shown on the left

Figure 3.2: These figures show an overlay of three conformations of the same Thrombin inhibitor and their RDF. While the internal distances of the ring systems stay the same (i.e. the peaks representing the ring systems at r ≈ 1.5 and r ≈ 2.6 overlap for all three molecules), their relative spatial positions vary (i.e. the peaks representing the distances of the ring systems among themselves at r ≈ 6 to r ≈ 12 are set off).

One method that meets all of the above requirements, and which I used in this work, is a derivation of the 3D-Molecule Representation based on Electron diffraction (3D-MoRSE) [Sch96] [Sel97]: the radial distribution function [Gas96] [Gas97]. In general this function gives the probability of finding a pair of atoms with similar properties in the given molecule at distance r to each other:

$$g(r) = f \sum_{i}^{N-1} \sum_{j>i}^{N} A_i A_j \, e^{-B(r - r_{ij})^2} \qquad (3.1)$$

where f is a scaling factor and N is the number of atoms. The exponential term consists of the distance r_ij between two atoms i, j and the smoothing factor B for the probability distribution, which will be explained later. A_i and A_j are the characteristic atom properties. The properties used in this work are standard properties of the JoeLib2 framework, for example:
• Electro-topological state

• Electronegativity (Pauling)

• Partial charge

• Atom mass

• Electron affinity

• Intrinsic state

• Free electron count

• Hybridisation

• Van-der-Waals volume

• Heavy atom valence

• Electrogeometrical state

• Implicit valence
This distribution function makes it possible to embed a lot of additional information, e.g. bond distances, ring types, planar and non-planar systems and atom types, all of which are important in calculating the similarity of two molecules or, as in this case, the similarity of two conformers of the same molecule.

An important factor in using the radial distribution function is the resolution of the 3D model of the molecule on which the formula is applied. Using exact distances stands in contrast to physical reality and further restricts any ability to interpolate for better results. Moreover, if one wants to compute the similarity of two conformers using paired atomic distances, a certain amount of fuzziness is necessary to account for flexibility and errors in the initial measurement. Therefore the width of the peaks in the radial distribution function is determined by the factor B. As an approximation, the value of B can be given as a relation between B and the chosen step size ∆r [Hem99] by

$$B \simeq (\Delta r)^{-2} \qquad (3.2)$$

In this work I started with a value of B = 1000 for my computations. But on realizing that even slight changes had a large effect on similarity values, I successively lowered it to a value of B = 10, where only rotations of whole ring systems had a noticeable effect on similarity. The step size ∆r was always set to ∆r = 0.1 Å.
Implementation

In this implementation the function was internally represented by a vector of double values, each representing the value of the RDF at a point g(r) with r ∈ 0.1 · ℕ. The length of the vector, and therefore the range of the function with y values ≥ 0, was predetermined by measuring the longest atom pair distance over all molecules in the dataset and adding 2 Å as a security margin. The preceding scaling factor f was not used (i.e. it was always set to f = 1).
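The vector computation described above can be sketched as follows. This is a minimal illustration of equation (3.1) with f = 1; the class name and the flat coordinate/property arrays are hypothetical, not the thesis code:

```java
// Sketch of the RDF vector of eq. (3.1) with f = 1 and step size 0.1 Å.
public class RdfSketch {
    /** g(r) sampled at r = 0, 0.1, ..., (len-1)*0.1 for atoms given as (x,y,z) triples. */
    public static double[] rdf(double[][] pos, double[] prop, double b, int len) {
        double[] g = new double[len];
        for (int i = 0; i < pos.length - 1; i++) {
            for (int j = i + 1; j < pos.length; j++) {
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                double rij = Math.sqrt(dx * dx + dy * dy + dz * dz);
                // add the Gaussian-smoothed contribution of the pair (i, j) at each grid point
                for (int k = 0; k < len; k++) {
                    double r = 0.1 * k;
                    g[k] += prop[i] * prop[j] * Math.exp(-b * (r - rij) * (r - rij));
                }
            }
        }
        return g;
    }
}
```

For the real descriptor, `prop` would hold one of the JoeLib2 atom properties listed above and `b` would be the chosen smoothing factor (here B = 10).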
Figure 3.3: This figure shows the overlay of three RDF diagrams of the same molecule with three different values for B: 10, 100 and 1000. One can see that with increasing B the smoothness decreases but the information value increases.

3.3 Kernel

3.3.1 Probability Product Kernel

One of the two methods used in this work to give a 3D sensitive representation of a molecule was the radial distribution function (RDF). This function can be regarded as a distinct distribution of atom pairs in the given molecule.

Typical kernels compute a generalized inner product between two input objects χ and χ′, which is equivalent to applying a mapping function φ to each object and then computing a dot product between φ(χ) and φ(χ′) in a Hilbert space [Jeb04]. This kernel considers the case of the mapping φ(χ) being a probability distribution p(x|χ), restricting the Hilbert space to the space of distributions embedded in it.

In this work the probability distribution p(x|χ) is given by the RDF, which leads to the following definition of the probability product kernel.
Definition. Let p and p′ be probability distributions on a space X and ρ a positive constant. Assume that p^ρ, p′^ρ ∈ L²(X), i.e. that ∫_X p(x)^{2ρ} dx and ∫_X p′(x)^{2ρ} dx are well defined (not infinite). The probability product kernel (PPK) between the distributions p and p′ is defined as

$$k_{prob}(p, p') = \int_X p(x)^{\rho}\, p'(x)^{\rho}\, dx = \langle p^{\rho}, p'^{\rho} \rangle_{L^2}. \qquad (3.3)$$
Furthermore, it is well known that L²(X) is a Hilbert space. Hence the defined kernel is positive definite for any set P of probability distributions over X such that ∫_X p(x)^{2ρ} dx is finite for any p ∈ P.

Implementation

The first idea was to implement the computation of the probability product kernel with the numerical integration of the given RDF functions via Simpson's rule (see figure 3.4):
$$\int_a^b f(x)\,dx \approx \frac{b-a}{6} \left[ f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right]. \qquad (3.4)$$

Figure 3.4: This figure shows the approximation of a function f(x) by a quadratic interpolation P(x).

The RDF was integrated via Simpson's rule in steps of 0.01, which led to an exact calculation of the integral up to the 6th decimal place and also allowed the factor ρ in the PPK formula to be chosen freely.
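Applied piecewise over many small sub-intervals, the three-point rule (3.4) gives the usual composite Simpson scheme; a generic sketch with hypothetical names:

```java
// Composite Simpson integration, the piecewise application of eq. (3.4).
import java.util.function.DoubleUnaryOperator;

public class SimpsonSketch {
    /** Integrates f over [a, b] with n sub-intervals (n is assumed to be even). */
    public static double integrate(DoubleUnaryOperator f, double a, double b, int n) {
        double h = (b - a) / n;
        double sum = f.applyAsDouble(a) + f.applyAsDouble(b);
        for (int i = 1; i < n; i++) {
            // interior points alternate between weight 4 (odd) and weight 2 (even)
            sum += (i % 2 == 0 ? 2 : 4) * f.applyAsDouble(a + i * h);
        }
        return sum * h / 3.0;
    }
}
```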
But the first tests of this implementation showed that the computation of a single kernel value could take up to 10 seconds, resulting in a maximum total of 1.5 hours per kernel matrix. As this was unfeasible due to the enormous amount of computational power needed, I decided to fix the parameter ρ at ρ = 1. With this the kernel takes the form of the expectation of one distribution under the other:

$$k(p, p') = \int p(x)\, p'(x)\, dx = E_p[p'(x)] = E_{p'}[p(x)] \qquad (3.5)$$

This is also called the expected likelihood kernel.
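With ρ = 1 and both RDFs sampled on the same grid, the integral in (3.5) reduces to a step-size-weighted dot product of the two vectors. A sketch under that assumption (not the thesis implementation, which used Simpson integration):

```java
// Expected likelihood kernel (3.5) on two discretized RDF vectors.
public class ExpectedLikelihoodKernel {
    /** Approximates the integral of p*q by a Riemann sum with step size deltaR. */
    public static double kernel(double[] p, double[] q, double deltaR) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += p[i] * q[i];
        }
        return sum * deltaR;
    }
}
```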
3.3.2 Radial Basis Function Kernel

Another method of measuring the similarity between two result vectors A and B of the RDF is the use of a radial basis function. A radial basis function (RBF) kernel, also known as an isotropic stationary kernel [HG04], is defined by a function ψ : [0, ∞) → ℝ such that

$$k(x, x') = \psi(\|x - x'\|) \qquad (3.6)$$

where x, x′ ∈ X and ‖·‖ denotes the Euclidean norm. The use of a special RBF kernel, the Gaussian RBF kernel, has been suggested in [Guy93] with

$$k(x, x') = \exp\!\left( - \frac{\sum_{i=1}^{n} \|x_i - x'_i\|^2}{2\sigma^2} \right) \qquad (3.7)$$

where x_i and x′_i are the single data points in the result vectors of the RDF, and σ defines the width of the sphere surrounding the corresponding training pattern [Cha05].
The issue in implementing this kernel was finding a viable value for the σ parameter in the above formula. Choosing σ too low lets the patterns appear very dissimilar, over-fitting the model and taking away its ability to generalize outside its bounds, while choosing σ too high has the opposite effect, letting the patterns appear very similar and under-fitting the model. So finding an optimal value for σ amounts to finding an acceptable trade-off between over-fitting in dense areas and under-fitting in sparse areas.
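Equation (3.7) on two RDF vectors is short enough to state directly; a sketch with a hypothetical class name:

```java
// Gaussian RBF kernel of eq. (3.7) on two equally sized vectors.
public class RbfKernel {
    public static double kernel(double[] x, double[] y, double sigma) {
        double sq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d; // accumulate the squared Euclidean distance
        }
        return Math.exp(-sq / (2.0 * sigma * sigma));
    }
}
```

Identical vectors give a kernel value of 1; the value decays toward 0 as the distance grows, with σ controlling how fast.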
3.3.3 Atom Pair Kernel

While the preceding kernels were based on an RDF representation, another method to compare the 3D structure of two molecules, or of different conformations of the same molecule, is to represent the molecule as a trie data structure. For that I use a derivative of the optimal assignment of atom pairs [Jah09].

This method is based on a matrix D = (⌊d_ij / b⌋) of binned geometrical distances between the three-dimensional coordinates of atoms i, j, where d_ij are the atomic distances and b is the binning factor. The matrix D is used as a lookup table for the information needed to build a trie containing the geometrical information for all atom pairs from a fixed atom i to any other atom, where a trie is a prefix-based search tree that can be applied to any symbolic pattern with a reading direction.

At the beginning the trie of atom i consists only of the root, labeled with the hash code of the atomic symbol of i. To fill the trie, patterns of the form

$$\left( hash(symbol(i)),\ \lfloor d_{ij}/b \rfloor,\ hash(symbol(j)) \right) \qquad (3.8)$$

are inserted successively as ordered triplets. An example of a local atom pair environment and the corresponding trie is shown in figure 3.5.
Figure 3.5: Binned geometrical distances, spheres and trie. The upper left figure shows the spheres of the binned geometrical distances 1.0, 2.0 and 3.0 Å for the centered carbon atom. The sphere of the binned geometrical distance of 0.0 Å (distances in the range [0.0; 1.0)) is not visualized as an individual sphere because it contains no atoms. The upper right figure illustrates the resulting local atom pair environment of binned geometrical distances. For simplicity, only the distances to non-carbon atoms are displayed. The lower figure visualizes the corresponding trie of geometric atomic distances of the annotated atom in the upper figures. The root and leaves are labeled with the corresponding atom type. The leaves additionally contain the total number of occurrences in the local atom pair environment. [JZ10]

The representation of local atom environments as tries allows the comparison of two local atom environments by comparing the tries. This can be achieved by applying a well known similarity measure like the Tanimoto coefficient

$$T(A, B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B} \qquad (3.9)$$

In this case let L_A, L_B be two sets of local atom pair environments of two molecular graphs A, B and l_Ai ∈ L_A, l_Bj ∈ L_B the tries of the nominal features (atom pair environments) of atoms i, j.
Then the Tanimoto coefficient can be defined as

$$Sim(l_{Ai}, l_{Bj}) = \frac{|l_{Ai} \cap l_{Bj}|}{|l_{Ai} \cup l_{Bj}|} \qquad (3.10)$$
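If each trie is flattened into the set of the triplets it stores, equation (3.10) becomes a plain set Tanimoto. A sketch under that assumption; the string encoding of the triplets is illustrative, not the representation of [Jah09]:

```java
// Set-based Tanimoto of eq. (3.10) over flattened trie contents.
import java.util.HashSet;
import java.util.Set;

public class TrieTanimoto {
    /** |a ∩ b| / |a ∪ b|; two empty tries are treated as identical. */
    public static double sim(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}
```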
Implementation

The implementation used in this work was based on the Chemistry Development Kit (CDK) [Ste03] [Ste06] and implemented by [Jah09].

The single arbitrary parameter b was initially set to b = 0.1 and subsequently set to b = 0.2 to account for errors in the measurement of the crystal structure.

3.4 Dataset

The dataset used in the experiments consisted of two parts: a precompiled set of 88 molecules taken from [Boe99] and a smaller set of 12 molecules compiled for this work. All of the molecules in the dataset were thrombin inhibitors with a known pKi value. However, only the 12 molecules in the compiled dataset had crystallographically determined active structures. The active structures were obtained by taking the crystal structure analysis of thrombin with the respective ligand and extracting the bound ligand from the whole structure.

The first step therefore was to search for all potential thrombin inhibitors in the scPDB 1 [Kel06].

The second step was to find an entry with the identical structural formula in the Binding Database 2 [XG02] for information about pKi values and publications.

The third and final step was to download the crystallographic analysis given by the PDB ID from the Protein Data Bank 3 [Ber77] and to extract the bound ligand with Schrödinger's Maestro program. These 12 ligands will from now on be referenced by their originating PDB ID. They are depicted in figure 3.6 and their data and publications are shown in table 3.1. The 88 precompiled structures were only available as structural formulas, so they had to be converted into a valid 3D conformation. To achieve this they were converted with the CORINA program [Sad94].

Thrombin inhibitors were chosen both for their high flexibility and for the fact that the interactions of inhibitors and thrombin are well investigated, with several well documented studies including crystal structures.
3.5 Conformation Sampling

3.5.1 Precomputed Conformation Sampling

The first strategy pursued was to precompute a set of conformers for all molecules, pick a subset of 100 of these conformers per molecule (or fewer, if fewer than 100 were available) and use the genes in the GA as indices for the conformations to choose from. Therefore a mutation operation in the GA led not only to a single change in the conformation but could lead to a wholly different one.
1http://bioinfo-pharma.u-strasbg.fr/scPDB/
2http://www.bindingdb.org
3http://www.rcsb.org/pdb
(a) 1a4w (b) 1c5n (c) 1ghy (d) 1gj4 (e) 1gj5 (f) 1o2g (g) 1o5g (h) 2zc9 (i) 2zda (j) 2zgx (k) 2zo3 (l) 3dhk

Figure 3.6: These figures show the 12 molecules with known active structure used in this work. They are labeled with the PDB ID they were extracted from.
| PDB ID | pKi value | resolution (in Å) | first published in |
|--------|-----------|-------------------|--------------------|
| 1a4w | 7.796 | 1.80 | [Mat96] |
| 1c5n | 4.699 | 1.50 | [Kat00] |
| 1ghy | 5.071 | 1.85 | [Kat01a] |
| 1gj4 | 4.222 | 1.81 | [Kat01b] |
| 1gj5 | 6.347 | 1.73 | [Kat01b] |
| 1o2g | 6.495 | 1.58 | [Kat03] |
| 1o5g | 4.957 | 1.75 | [Kat04] |
| 2zc9 | 7.327 | 1.58 | [Bau09] |
| 2zda | 8.398 | 1.73 | [Bau09] |
| 2zgx | 6.745 | 1.80 | [Bau09] |
| 2zo3 | 10 | 1.70 | [Bau09] |
| 3dhk | 6.744 | 1.73 | [Bau09] |

Table 3.1: This table gives information on all used molecules for which the crystal structure was known.
The conformations themselves were generated with the ConfGen program [WS10], which is based on the molecular modeling program MacroModel [SI08b].

The first step the program takes is to identify variable features, which are rotatable bonds, flexible ring systems and invertible nitrogens. ConfGen generally identifies a bond as rotatable if the following criteria are met:

• It is a single bond

• It doesn't lie within a ring

• Neither of the atoms connected by the bond is terminal (i.e. has no other bonds to it)

• Neither end of the bond is a CH3, NH2 or NH3+ group

• Neither atom in the bond is bonded to two or three atoms that are all equivalent and are arranged with two- or three-fold rotational symmetry.

Ring conformers are generated using the same template-based facility available in LigPrep [SI08a], Glide [Fri04], MacroModel [SI08b], or Phase [Dix06]. It is designed to generate a complete set of accurate, low energy ring conformations, identifying individual rings with a smallest set of smallest rings (SSSR) method [Zam76]. When a ring system is identified it is compared to a set of 1252 templates to find the most similar template. This template is then used to calculate the relative energies of the ring within the molecule. There are N_ri combinations of ring conformations for a whole molecule:
$$N_{ri} = 2^{N_i} \prod_{r} N_{cr} \qquad (3.11)$$

where N_i is the number of invertible nitrogen atoms, r runs over all flexible ring systems and N_cr is the number of templates selected for each individual ring system.
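Equation (3.11) translates directly into code; a sketch with a hypothetical helper:

```java
// Eq. (3.11): number of ring-conformation combinations for a molecule,
// 2^Ni for the invertible nitrogens times the template count of each flexible ring.
public class RingConformerCount {
    public static long count(int invertibleNitrogens, int[] templatesPerRing) {
        long n = 1L << invertibleNitrogens; // 2^Ni
        for (int ncr : templatesPerRing) {
            n *= ncr;
        }
        return n;
    }
}
```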
Each generated set of ring conformers is then processed as follows. First the potentials of the rotatable bonds connecting the ring systems are calculated using a derivative of OPLS [Jor88] [Jor96], including a quick check of the Lennard-Jones potentials of all atoms on one side of the bond against all on the other side to avoid local Van-der-Waals clashes. Then the potential
| parameter | intermediate | comprehensive |
|-----------|--------------|---------------|
| maximum number of search steps | 1000 | 1000 |
| search steps per rotatable bond | 75 | 75 |
| minimum heavy atom RMSD (Å) for distinct conformer | 1 | 0.5 |
| minimum dihedral angle difference for polar hydrogens (°) | 60 | 60 |
| maximum relative energy for flexible rings (kcal/mol) | 2.39 | 23.9 |
| maximum number of ring conformations per ligand | 16 | 128 |
| maximum number of ring conformations per ring | 8 | 64 |
| maximum relative ConfGen energy (kcal/mol) | 25 | 119.5 |
| energy threshold for periodic torsions (kcal/mol) | 5.74 | 5.74 |
| restraint potentials for weak torsions in MacroModel (kcal/mol) | 239 | 239 |
| restraint potential half width (°) | 10 | 10 |
| suppress hydrogen-bond electrostatics in MacroModel | Yes | Yes |
| maximum relative all-atom energy in MacroModel (kcal/mol) | 25 | 119.5 |

Table 3.2: This table shows the parameters used to generate the two datasets. The intermediate parameter set is more restrictive and almost certainly only picks energetic minima, while the comprehensive parameter set allows the algorithm to pick a conformation lying between two optima.
minima are computed and used to create sets of rotational bonds surrounding the molecular core (i.e. the part of the molecule remaining if every outer rotational bond is severed).

For each combination of ring system conformation, invertible nitrogen atom geometry and rotatable bond dihedral angle minimum, a molecule conformation is compiled and, if the sum of all potential energies relative to the lowest-energy conformation doesn't exceed a preset limit, the conformation is added to the resulting set of conformers.

In this work I used two sets of parameters for the algorithm described above: one restrictive and one permissive. While the restrictive parameter set only generated conformers where the rotatable bonds took up a local minimum energetic state, the permissive one allowed more freedom. Thus conformations were also picked that lie in between local optima, allowing the GA to change more easily from one conformation to another. For the exact parameters used in the conformer sampling see table 3.2.
3.5.2 Implicit Conformation Sampling

The second strategy pursued was not to use precomputed conformation sampling but to generate a new set of conformations from generation to generation in the genetic algorithm. Therefore the encoding of the single individuals in the GA had to be different. In contrast to the optimization over the precomputed conformation set, where each 'gene' represented the conformation ID of a whole molecule, here a 'gene' only represented a single rotatable bond within a molecule. While a mutation of one gene in the precomputed approach meant picking a whole new conformation of the concerned molecule, with possibly every rotatable bond affected, the mutation of a gene in the implicit conformation sampling only meant the alteration of a single rotatable bond. In addition it had to be ensured that the crossover operator didn't cut in the middle of the encoding of a molecule but only at the end of one and the beginning of the other. Doing a crossover in the middle of a molecule could lead to an invalid conformation, because it couldn't be guaranteed that the molecule wouldn't fold back on itself, overlapping one or more atoms.
Figure 3.7: The figure shows an example of a molecule with nine rotatable bonds and the corresponding encoding as a gene for the GA. The denoted angles are the dihedral angles for a unique set of deterministically calculated atoms 'surrounding' the bond.
But before encoding a molecule in the GA one had to know the exact number of rotatable bonds. For that, each bond was inspected and had to meet a list of criteria to count as rotatable. The criteria used were the ones already implemented in the JoeLib2 framework:

• The atom at the beginning of the bond has to have a heavy atom valence of > 1

• The atom at the end of the bond has to have a heavy atom valence of > 1

• The bond order has to be 1

• The bond mustn't lie in a ring system

• The atom at the beginning of the bond mustn't have a hybridization of 1

• The atom at the end of the bond mustn't have a hybridization of 1

If these criteria were met, the bond was added to the molecule's rotatable bond list.
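The criteria above combine into a single predicate. The sketch below works on bare values rather than JoeLib2's own bond objects, whose API differs; the parameter names are illustrative:

```java
// The JoeLib2-style rotatable-bond criteria listed above as one predicate.
public class RotatableBond {
    public static boolean isRotatable(int beginHeavyValence, int endHeavyValence,
                                      int bondOrder, boolean inRing,
                                      int beginHybridization, int endHybridization) {
        return beginHeavyValence > 1      // begin atom bound to more than one heavy atom
            && endHeavyValence > 1        // end atom bound to more than one heavy atom
            && bondOrder == 1             // single bond
            && !inRing                    // not part of a ring system
            && beginHybridization != 1    // begin atom not sp-hybridized
            && endHybridization != 1;     // end atom not sp-hybridized
    }
}
```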
The unit in which the rotations were encoded was 1° (degrees), where degree refers to the dihedral angle. An angle of 0° refers to the original crystallographic conformation.

For the first generation of the GA an initial set of conformations was computed by picking a random value for the dihedral angle of each rotatable bond of each molecule in the dataset. After each mutation occurring in the GA the affected molecule was computed again with the new angle value, where the new value was in reference to the original 0° value and not to the currently applied one.
To compute a rotation around a rotatable bond one has to rotate each atom belonging to one of the two partial graphs formed by splitting the molecular graph at the designated bond. This partial graph was calculated using a stack, adding the beginning atom of the bond and then recursively adding every atom bound to the ones already on the stack (except for the atom at the end of the designated rotatable bond) until no new atoms could be found. This was possible because no bonds were allowed to be rotatable if they lay in a ring system and no molecules with macrocycles were in the dataset.
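The stack-based traversal that collects the atoms to rotate might look like this; the adjacency-list encoding and names are illustrative, and the logic relies on the rotatable bond not lying in a ring, as guaranteed above:

```java
// Collects the side of the molecular graph that will be rotated.
import java.util.ArrayDeque;
import java.util.Deque;

public class BondSideCollector {
    /**
     * Marks all atoms reachable from 'begin' without crossing the bond (begin, end).
     * adjacency[i] lists the neighbour indices of atom i.
     */
    public static boolean[] side(int[][] adjacency, int begin, int end) {
        boolean[] visited = new boolean[adjacency.length];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(begin);
        visited[begin] = true;
        while (!stack.isEmpty()) {
            int atom = stack.pop();
            for (int nb : adjacency[atom]) {
                if (atom == begin && nb == end) {
                    continue; // do not cross the rotatable bond itself
                }
                if (!visited[nb]) {
                    visited[nb] = true;
                    stack.push(nb);
                }
            }
        }
        return visited;
    }
}
```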
The actual rotation was achieved by applying a quaternion to each of the atoms in the partial graph, with the center of the coordinate system being the atom at the beginning of the bond.
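A quaternion rotation of a single atom position about an axis through the origin (i.e. after translating coordinates so the begin atom sits at the origin, with the bond vector as axis) can be sketched as follows. This is the textbook q v q* formula expanded to a rotation matrix, not the thesis implementation:

```java
// Rotation of a point about a unit axis through the origin via a quaternion.
public class QuaternionRotation {
    /** Rotates point p around the unit axis (ux, uy, uz) by 'rad' radians. */
    public static double[] rotate(double[] p, double ux, double uy, double uz, double rad) {
        double s = Math.sin(rad / 2), w = Math.cos(rad / 2);
        double x = ux * s, y = uy * s, z = uz * s; // quaternion (w, x, y, z)
        // the sandwich product q v q* expanded to a rotation matrix application
        double[] r = new double[3];
        r[0] = (w*w + x*x - y*y - z*z) * p[0] + 2*(x*y - w*z) * p[1] + 2*(x*z + w*y) * p[2];
        r[1] = 2*(x*y + w*z) * p[0] + (w*w - x*x + y*y - z*z) * p[1] + 2*(y*z - w*x) * p[2];
        r[2] = 2*(x*z - w*y) * p[0] + 2*(y*z + w*x) * p[1] + (w*w - x*x - y*y + z*z) * p[2];
        return r;
    }
}
```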
(a) molecule in basic conformation

(b) molecule with bond no. 9 rotated by 90°

Figure 3.8: These two figures show an example of a rotation around one rotatable bond.

Both at the initial random initialization of the conformations and at every mutation event it had to be ensured that the generated conformation was valid (i.e. that no atoms or bonds overlap or lie too close to each other). Therefore for every new conformation all pairwise atom distances had to be calculated. The chosen value for the lower bound (gathered by calculating the average minimal distance of non-bonded atoms over the whole original dataset) was 2 Å, while distances of covalently bonded atoms were ignored.
3.6 SVR
In this work I used the libSVM implementation by Chang and Lin [CL01] 4. To compute the MSE a leave-one-out approach was applied: a model of the dataset (i.e. represented by the kernel matrix) was built n times (with n being the size of the dataset), each time with a different data point left out. For these datasets a five-fold cross-validation (inner fold) was run 5 times (inner runs). The inner fold was used to determine the best parameters of the regression (i.e. the values of the parameters ε and c yielding the best performance on the validation dataset). The inner runs were used to find the best model with the just computed best parameters. The best model was then used to predict the currently left out data point. The set of parameters can be seen in table 3.3.
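The outer leave-one-out loop can be sketched as below. The `predict` function stands in for the inner cross-validated libSVM SVR (for testing, a trivial mean predictor can be plugged in); the class and parameter names are hypothetical:

```java
// Skeleton of the outer leave-one-out loop that yields the fitness MSE.
import java.util.function.BiFunction;

public class LeaveOneOutMse {
    /**
     * targets[i] is the pKi of molecule i; predict(trainIndices, leftOutIndex)
     * returns a prediction for the left-out molecule.
     */
    public static double mse(double[] targets, BiFunction<int[], Integer, Double> predict) {
        double err = 0.0;
        int n = targets.length;
        for (int left = 0; left < n; left++) {
            int[] train = new int[n - 1];
            for (int i = 0, k = 0; i < n; i++) {
                if (i != left) train[k++] = i; // all molecules except the left-out one
            }
            double d = predict.apply(train, left) - targets[left];
            err += d * d;
        }
        return err / n;
    }
}
```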
|
||
|
||
| computation method | inner folds | inner repetitions | c begin | c end | ε begin | ε end |
|--------------------|-------------|-------------------|---------|-------|---------|-------|
| leave one out      | 5           | 5                 | -1      | 5     | -7      | -2    |

Table 3.3: This table shows the parameters used to calculate the MSE for the regression on the datasets.
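The nested evaluation scheme above can be sketched as follows. This is a hypothetical reconstruction: the grid values assume the begin/end columns of table 3.3 are base-2 exponents, and `fit` is a placeholder for the libSVM ε-SVR on the precomputed kernel matrix:

```python
import itertools
import random
import statistics

# Assumed parameter grid (base 2 is an assumption, not stated in the text).
C_GRID = [2.0 ** e for e in range(-1, 6)]      # c begin -1 .. c end 5
EPS_GRID = [2.0 ** e for e in range(-7, -1)]   # eps begin -7 .. eps end -2

def loo_mse(data, fit, inner_folds=5, inner_runs=5):
    """Leave-one-out MSE with an inner repeated 5-fold CV picking (c, eps).

    `data` is a list of (x, y) pairs; `fit(train, c, eps)` returns a
    predictor function. Both stand in for the kernel-matrix SVR.
    """
    def inner_cv(train, c, eps):
        errs = []
        for _ in range(inner_runs):            # repeated inner runs
            shuffled = train[:]
            random.shuffle(shuffled)
            for f in range(inner_folds):       # 5-fold split
                val = shuffled[f::inner_folds]
                trn = [p for k, p in enumerate(shuffled) if k % inner_folds != f]
                model = fit(trn, c, eps)
                errs += [(model(x) - y) ** 2 for x, y in val]
        return statistics.mean(errs)

    outer_errors = []
    for i, (x_test, y_test) in enumerate(data):        # leave one out
        train = data[:i] + data[i + 1:]
        c, eps = min(itertools.product(C_GRID, EPS_GRID),
                     key=lambda p: inner_cv(train, *p))
        model = fit(train, c, eps)
        outer_errors.append((model(x_test) - y_test) ** 2)
    return statistics.mean(outer_errors)
```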
4http://www.csie.ntu.edu.tw/~cjlin/libsvm/
4 Results

In this chapter I will present, interpret and discuss the results. The results are divided into precomputed and implicit conformation sampling. Each of those two parts is further split by the kernel methods and parameters used. The results are mostly presented in chronological order, to replicate my line of thought.
Evaluation Method

To explain the values shown in the following diagrams I will give a short explanation for each of them. The meaning of the values remains the same for every 'run' diagram in this work.

MSE

The 'avg MSE' value shown in each diagram is the average MSE value of one generation (i.e. 100 individuals), where MSE is the best Mean Square Error found for each regression. In the corresponding table I will show the respective numerical values. The 'Best individual MSE' relates to the absolute minimum found by at least one individual.
RMSD

The 'avg RMSD' value is the RMSD of the conformations encoded by the current individual for the 12 molecules with known active structure, averaged against their respective active structures. The 'Best individual RMSD' relates to the individual with the lowest such RMSD, averaged over the 12 molecules in my dataset.
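The RMSD between an encoded conformation and its active structure can be computed as below; this sketch assumes the two structures are already superimposed (a full comparison would first align them, e.g. with the Kabsch algorithm):

```python
import math

def rmsd(conf_a, conf_b):
    """Root mean square deviation between two conformations.

    `conf_a` and `conf_b` are equally long lists of (x, y, z) atom
    positions with matching atom order.
    """
    assert len(conf_a) == len(conf_b)
    sq = [math.dist(p, q) ** 2 for p, q in zip(conf_a, conf_b)]
    return math.sqrt(sum(sq) / len(sq))
```

The 'avg RMSD' of an individual is then the mean of this value over the 12 molecules with known active structure.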
5% Quantile

The 5% quantile marks the value below which a point lies in the lowest 5% of all possible values for the average RMSD. Its value is exactly 1.629. This was computed by picking 20000 random combinations from the conformer sets of each molecule and building the average RMSD to their respective active structures. From this normal distribution the p-quantile with p = 0.05 was calculated using the standard formula x(p) = µ + σ · z(p), where µ is the expectation, σ² the variance, and z(0.05) was looked up in the normal distribution table.
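A sketch of this quantile estimate; `sample_avg_rmsd` is a placeholder for drawing one random conformer combination and averaging its RMSDs to the active structures, and the z-value is the tabulated z(0.05) ≈ -1.6449:

```python
import statistics

Z_005 = -1.6449  # z(0.05) from the standard normal table

def quantile_5(sample_avg_rmsd, n=20000):
    """Estimate x(p) = mu + sigma * z(p) for p = 0.05 from random samples.

    `sample_avg_rmsd()` returns the average RMSD of one random
    combination of conformers, assumed to be normally distributed.
    """
    samples = [sample_avg_rmsd() for _ in range(n)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return mu + sigma * Z_005
```

With Python 3.8+ the table lookup could be replaced by `statistics.NormalDist().inv_cdf(0.05)`.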
4.1 Precomputed Conformation Sampling

For the first experiment, the optimization of the QSAR model, I used the dataset described earlier. The dataset initially included all conformers produced with ConfGen, where every molecule had a different number of conformers created. To reduce the combinatorial size I picked the lower of either 100 or the number of conformers originally created. The selection method was to pick the conformers equally distributed over their relative energy to the conformer with the absolute lowest energy, to guarantee an equal distribution over the conformational space of each molecule. Because the output sets for each individual molecule were already sorted by their relative energy I simply had to pick every n-th conformer, where n = number of conformers available / 100.
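This selection step can be sketched as follows (a simplified illustration, not the exact script used):

```python
def select_conformers(conformers, limit=100):
    """Pick at most `limit` conformers from an energy-sorted conformer list.

    Because the ConfGen output is already sorted by relative energy,
    taking every n-th entry spreads the selection evenly over the
    energy range of the molecule.
    """
    if len(conformers) <= limit:
        return list(conformers)
    n = len(conformers) // limit
    return list(conformers[::n])[:limit]
```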
(a) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 01, where the PPK kernel was used with parameter B = 1000

(b) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 02, where the RBF kernel was used with parameters B = 1000, sigma = 100

(c) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 03, where the APK was used with smoothing factor = 0.1

Figure 4.1: These figures show the results of the first three runs. One can see that the optimization works, as the MSE declines, while the average RMSD only declines in Run 03 using the APK and still does not reach the 5% quantile.
| Parameter / Run Nr.   | 01           | 02           | 03           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | PPK          | RBF          | APK          |
| RDF B factor          | 1000         | 1000         | -            |
| RBF Sigma factor      | -            | 100          | -            |
| Smoothing factor      | -            | -            | 0.1          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Start avg. MSE        | 1.256        | 1.191        | 1.039        |
| End avg. MSE          | 0.118        | 0.536        | 0.337        |
| Diff. Start/End       | 1.138        | 0.655        | 0.702        |
| Best avg. MSE         | 0.109        | 0.536        | 0.337        |
| Best individual MSE   | 0.117        | 0.526        | 0.335        |
| Start avg. RMSD       | 1.893        | 1.915        | 1.907        |
| End avg. RMSD         | 1.878        | 1.900        | 1.790        |
| Diff. Start/End       | 0.015        | 0.015        | 0.117        |
| Best avg. RMSD        | 1.822        | 1.808        | 1.724        |
| Best individual RMSD  | 1.525        | 1.505        | 1.465        |

Table 4.1: This table shows the parameters and the results for Run 01, Run 02 and Run 03. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.1 Initial Runs

In the first runs (run 01-03) of the experiment I used the PPK and the RBF kernel on the RDF of the molecules, and the APK, to generate the kernel matrix. The parameters were set to their default values to check the overall function of the optimization (see table 4.1).

The results of these three runs are depicted in figure 4.1. One can see that the basic optimization is functional. The average MSE declines from the values of 1.256, 1.191 and 1.039 to values of 0.118, 0.536 and 0.337. But, while the average RMSD declines slightly in run 03 with the use of the APK, it remains at the same level with the use of the PPK and RBF. Although some isolated individuals get below the 5% quantile mark they are dismissed in the next generation, implying that the individuals with a higher average RMSD result in better models with lower MSEs.

My first considerations on evaluating these results were twofold. Either the use of the large dataset of molecules with unknown active structures impeded the decline of the ones with known active structures, because their weight in the model building process was too large, or the parameters used were not fit for this kind of optimization.

Therefore I consecutively lowered the size of the dataset to 56, 41 and 34 molecules, always including the 12 known active structures, and changed the parameters of the kernels used: the B parameter of the RDF, resulting in a smoother RDF function, and the smoothing factor of the APK, both in the hope of a better generalization. These changes are shown in the next sections.
(a) This figure shows the results for the PPK with B = 1000 on the dataset with 56 molecules

(b) This figure shows the results for the PPK with B = 1000 on the dataset with 41 molecules

(c) This figure shows the results for the PPK with B = 1000 on the dataset with 34 molecules

(d) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 56 molecules

(e) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 41 molecules

(f) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 34 molecules

Figure 4.2: These figures show the results of the runs with reduced datasets for the PPK and RBF kernel.
| Parameter / Run Nr.   | 04      | 05      | 06      | 07      | 08      | 09      |
|-----------------------|---------|---------|---------|---------|---------|---------|
| Kernel method         | PPK     | PPK     | PPK     | RBF     | RBF     | RBF     |
| RDF B factor          | 1000    | 1000    | 1000    | 1000    | 1000    | 1000    |
| RBF Sigma factor      | -       | -       | -       | 100     | 100     | 100     |
| Mutation Probability  | 0.1     | 0.1     | 0.1     | 0.1     | 0.1     | 0.1     |
| Mutate first 12 only  | no      | no      | no      | no      | no      | no      |
| Conformation Sampling | interm. | interm. | interm. | interm. | interm. | interm. |
| Dataset size          | 56      | 41      | 34      | 56      | 41      | 34      |
| Start avg. MSE        | 1.3470  | 1.4283  | 1.5836  | 1.3432  | 1.4391  | 1.5611  |
| End avg. MSE          | 0.1184  | 0.0917  | 0.1366  | 0.3663  | 0.3422  | 0.4955  |
| Diff. Start/End MSE   | 1.2286  | 1.3366  | 1.447   | 0.9769  | 1.0969  | 1.0656  |
| Best avg. MSE         | 0.1128  | 0.0885  | 0.1343  | 0.3663  | 0.3422  | 0.4897  |
| Best individual MSE   | 0.1091  | 0.0856  | 0.1334  | 0.3578  | 0.3393  | 0.4862  |
| Start avg. RMSD       | 1.8626  | 1.8348  | 1.8924  | 1.8239  | 1.8206  | 1.8717  |
| End avg. RMSD         | 1.7738  | 1.6095  | 1.9664  | 1.7765  | 1.8151  | 1.9408  |
| Diff. Start/End RMSD  | 0.0888  | 0.2253  | -0.074  | 0.0474  | 0.0109  | -0.0691 |
| Best avg. RMSD        | 1.5744  | 1.5977  | 1.5805  | 1.7443  | 1.8120  | 1.8688  |
| Best individual RMSD  | 1.4630  | 1.2913  | 1.5148  | 1.2660  | 1.3552  | 1.2588  |

Table 4.2: This table shows the parameters and the results for Run 04 through Run 09. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.2 Reduced Dataset with PPK and RBF Kernel

To see if reducing the dataset size would yield models with a lower average RMSD I ran the PPK and the RBF on datasets where only every 2nd, 3rd and 4th molecule with unknown active structure was included. The hypothesis was that the overall influence of the known active structures on the model would be higher and, if the general assumption holds that good models consist of good data (i.e. the active structures), the RMSD would be lower.

The results of these runs are shown in figure 4.2 and table 4.2. With the use of the PPK (runs 04-06) the average RMSD gets below the 5% quantile at some point. In runs 04 and 06 the average RMSD gets below the 5% quantile within the first 50 generations but returns to its starting level shortly after and stagnates. In run 05, however, the average RMSD stays at a relatively high value in comparison to runs 04 and 06 but declines to a value below the 5% quantile mark after 50 generations. In addition, the MSE of the best model found for run 05 was the lowest of all three runs with a final value of 0.0917, in contrast to 0.1184 and 0.1366 for runs 04 and 06.

With the use of the RBF kernel (runs 07, 08 and 09) the average RMSD did not get below the 5% quantile in any of the three runs, although run 08 on the dataset with 41 molecules showed a steady decline of the average RMSD similar to the results of run 05. It is noticeable that the initial generation of all three runs consisted of at least one individual with a very low average RMSD, and considering the low starting RMSD, more than one. These individuals, however, were dismissed in the first 25 generations, resulting in an increased average RMSD. Furthermore, the best models found using the RBF kernel had MSE values of 0.3578, 0.3393 and 0.4862, which is in each case more than three times the MSE value of the best PPK model on the corresponding dataset size, where the values are 0.1091, 0.0856 and 0.1334.

Considering this direct comparison of the PPK and the RBF kernel, the PPK shows better results, both for the modeling and for the recovery of the actual active structure.
(a) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 56 molecules

(b) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 41 molecules

(c) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 34 molecules

Figure 4.3: These figures show the results of the runs with reduced datasets for the APK. Run 12 is the model with the lowest final average RMSD of all runs.
| Parameter / Run Nr.   | 10           | 11           | 12           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | APK          | APK          | APK          |
| RDF B factor          | -            | -            | -            |
| RBF Sigma factor      | -            | -            | -            |
| Smoothing factor      | 0.1          | 0.1          | 0.1          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size          | 56           | 41           | 34           |
| Start avg. MSE        | 1.0026       | 1.2949       | 1.2255       |
| End avg. MSE          | 0.3420       | 0.5053       | 0.5067       |
| Diff. Start/End MSE   | 0.6606       | 0.7896       | 0.7188       |
| Best avg. MSE         | 0.3420       | 0.5051       | 0.5023       |
| Best individual MSE   | 0.3390       | 0.5028       | 0.5012       |
| Start avg. RMSD       | 1.8399       | 1.8381       | 1.9241       |
| End avg. RMSD         | 1.6799       | 2.0098       | 1.3813       |
| Diff. Start/End RMSD  | 0.16         | -0.1717      | 0.5428       |
| Best avg. RMSD        | 1.5922       | 1.7233       | 1.2618       |
| Best individual RMSD  | 1.4729       | 1.3699       | 1.1870       |

Table 4.3: This table shows the parameters and the results for Run 10, Run 11 and Run 12. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.3 Reduced Dataset with Atom Pair Kernel

The reduction of the dataset had a similar effect on the use of the APK as it had on the PPK. The starting average RMSD was 1.8399, 1.8381 and 1.9241, and while runs 10 and 12 had a considerably lower end RMSD with 1.6799 and 1.3813, the final RMSD of run 11 was 2.0098, which is 0.1717 higher than its start RMSD.

The first and third run, 10 and 12, show a similar development as the earlier runs 04, 05 and 06, with the average RMSD dropping by several percent around generation 50. But in contrast to all previous runs the RMSD of run 12 declines further, giving an indication that the optimization reaches a point where it can drop into several minima, one of them being a model that includes structures more likely to be near the conformation of the active structure.

Further, one can see that for all three kernel methods the overall end MSE value rises with descending dataset size. This can be traced back to the loss of information with decreased dataset size. But both the APK and especially the PPK mostly lead to better models than the RBF does with the full dataset of 100 molecules. The APK and PPK differ in that using the APK leads to models which have a lower RMSD to the active structures but a higher MSE, while the use of the PPK leads to very good models with the lowest MSE of all models created but with higher RMSD values.

Because most of the resulting models using the APK and PPK with the reduced datasets were as good as the ones using the full dataset, due to their equal or lower MSE, I decided to use the reduced dataset for future runs. Since the SVR is contained in O(n³) and the GA is contained in O(n), this measure cut the computation time for a complete run by at least 50%.
(a) This figure shows the results for the PPK with B = 10 on the dataset with 34 molecules

(b) This figure shows the results for the PPK with B = 10 on the dataset with 56 molecules

(c) This figure shows the results for the PPK with B = 100 on the dataset with 56 molecules

(d) This figure shows the results for the PPK with B = 500 on the dataset with 34 molecules

Figure 4.4: These figures show the results of the runs with reduced datasets and altered parameters for the PPK.
| Parameter / Run Nr.   | 13           | 14           | 15           | 16           |
|-----------------------|--------------|--------------|--------------|--------------|
| Kernel method         | PPK          | PPK          | PPK          | PPK          |
| RDF B factor          | 10           | 10           | 100          | 500          |
| RBF Sigma factor      | -            | -            | -            | -            |
| Smoothing factor      | -            | -            | -            | -            |
| Mutation Probability  | 0.1          | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 56           | 56           | 34           |
| Start avg. MSE        | 1.5727       | 1.3344       | 1.3834       | 1.5559       |
| End avg. MSE          | 0.1701       | 0.0853       | 0.0813       | 0.1387       |
| Diff. Start/End MSE   | 1.4026       | 1.2491       | 1.3021       | 1.4172       |
| Best avg. MSE         | 0.1645       | 0.0838       | 0.0777       | 0.1306       |
| Best individual MSE   | 0.1595       | 0.0814       | 0.0763       | 0.1285       |
| Start avg. RMSD       | 1.8585       | 1.8323       | 1.8402       | 1.9009       |
| End avg. RMSD         | 1.8107       | 1.8826       | 1.7181       | 2.0711       |
| Diff. Start/End RMSD  | 0.0478       | -0.0503      | 0.1221       | -0.1702      |
| Best avg. RMSD        | 1.6172       | 1.7662       | 1.7118       | 1.6644       |
| Best individual RMSD  | 1.4992       | 1.3194       | 1.0926       | 1.6265       |

Table 4.4: This table shows the parameters and the results for run 13, run 14, run 15 and run 16. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.4 Alternative Parameters for the Product Probability Kernel

In addition to reducing the dataset I changed the parameters of the PPK and APK. The results for the PPK with the B parameter of the RDF set to 10, 100 and 500, in relation to 1000 in the previous runs, are shown in figure 4.4 and table 4.4. One can see that run 16 still shows the low RMSD values around generation 50, with a strong increase and stagnation afterwards. Runs 13, 14 and 15 also show the decrease of the RMSD around generation 50, but not as strongly as runs with a higher parameter.

The average RMSD values of runs 13 and 15 only decrease slightly, by 0.0478 and 0.1221, from 1.8585 and 1.8402 to 1.8107 and 1.7181, while the average RMSD values of runs 14 and 16 even increase, by 0.0503 and 0.1702, from 1.8323 and 1.9009 to 1.8826 and 2.0711. The MSE, though, reaches the lowest values of all runs, with run 15 at a value of 0.0777 and the second lowest at run 14 with 0.0838.

The fact that runs with a lowered B parameter render the best resulting models can be traced back to the fact that the B parameter describes the 'smoothness' and distinctness of an RDF. With declining B the RDF becomes more of a general description of the respective molecule and its conformation instead of an exact characterization. In this case the presence of distinct chemical groups or pharmacophores and their arrangement relative to each other is more important than their individual orientation. This leads to a better generalization of the model at the cost of the discrimination of the conformations of each molecule.
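The role of B can be illustrated with a simplified radial distribution function. This sketch assumes the common unweighted form g(r) = Σ over atom pairs of exp(-B (r - r_ij)²) and omits the atom-property weighting a full RDF descriptor would carry:

```python
import math

def rdf(coords, r_values, B=10.0):
    """Sample a simplified RDF of one conformation at the given radii.

    Each pair distance r_ij contributes a Gaussian peak around r_ij;
    a smaller B broadens the peaks, i.e. yields a smoother and less
    conformation-specific descriptor.
    """
    n = len(coords)
    dists = [math.dist(coords[i], coords[j])
             for i in range(n) for j in range(i + 1, n)]
    return [sum(math.exp(-B * (r - d) ** 2) for d in dists) for r in r_values]
```

Evaluating the same geometry at an off-peak radius with B = 1 and B = 10 shows the smoothing effect: the smaller B gives the larger (less sharply discriminating) response.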
(a) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 34 molecules

(b) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 41 molecules

(c) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 56 molecules

(d) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 100 molecules

Figure 4.5: These figures show the results of the runs with reduced datasets and altered parameters for the APK.
| Parameter / Run Nr.   | 17           | 18           | 19           | 20           |
|-----------------------|--------------|--------------|--------------|--------------|
| Kernel method         | APK          | APK          | APK          | APK          |
| RDF B factor          | -            | -            | -            | -            |
| RBF Sigma factor      | -            | -            | -            | -            |
| Smoothing factor      | 0.2          | 0.2          | 0.2          | 0.2          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 41           | 56           | 100          |
| Start avg. MSE        | 1.1934       | 1.3013       | 0.9636       | 1.0234       |
| End avg. MSE          | 0.3901       | 0.3813       | 0.2805       | 0.2878       |
| Diff. Start/End MSE   | 0.8033       | 0.92         | 0.6831       | 0.7356       |
| Best avg. MSE         | 0.3857       | 0.3813       | 0.2796       | 0.2878       |
| Best individual MSE   | 0.3830       | 0.3807       | 0.2774       | 0.2864       |
| Start avg. RMSD       | 1.9084       | 1.8337       | 1.8434       | 1.9025       |
| End avg. RMSD         | 1.8879       | 1.7185       | 1.6696       | 1.9787       |
| Diff. Start/End RMSD  | 0.0205       | 0.1152       | 0.1738       | -0.0762      |
| Best avg. RMSD        | 1.7840       | 1.5997       | 1.6470       | 1.8900       |
| Best individual RMSD  | 1.4992       | 1.3194       | 1.0926       | 1.6265       |

Table 4.5: This table shows the parameters and the results for run 17, run 18, run 19 and run 20. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.5 Alternative Parameters for APK

Figure 4.5 and table 4.5 show the results for runs 17, 18, 19 and 20, using the APK with the full and the reduced datasets and a smoothing factor of 0.2. Runs 18 and 19 show the first deep decline of the RMSD at 50 generations, to global minima of the average RMSD at 1.5997 and 1.6470, while run 17 shows a steady decline and run 20 an overall stagnation at an RMSD of approximately 2.0.

As with the alteration of the B parameter of the PPK, setting the smoothing factor to a value of 0.2 for the APK changes the generalization of the model, resulting in lower MSE values than the previous runs using the APK, for all four runs. While the APK only encodes atom types, distances and binding modes, doubling the smoothing factor still retains enough information to fit the model. It further allows the GA to hold more individuals with a wider RMSD range. This can be seen in figure 4.5, with the best individual RMSD values being distinctly lower than the average RMSD values over several generations in all four runs. The average MSE values of the final models were 0.3901, 0.3813, 0.2805 and 0.2878, which is approximately 0.15 below previous runs. But in exchange for the better generalization and lower MSE values the overall RMSD stagnated, with only run 18 showing a decline to a final value of 1.7185, which is still above the 5% quantile.
(a) This figure shows the results for the PPK with B = 10 on 34 molecules and a mutation probability of 0.5

(b) This figure shows the results for the PPK with B = 10 on 56 molecules and a mutation probability of 0.5

(c) This figure shows the results for the APK with smoothing factor = 0.1 on 34 molecules and a mutation probability of 0.5

Figure 4.6: These figures show the results of the runs with reduced datasets and increased mutation probability for the PPK and APK.
| Parameter / Run Nr.   | 21           | 22           | 23           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | PPK          | PPK          | APK          |
| RDF B factor          | 10           | 10           | -            |
| RBF Sigma factor      | -            | -            | -            |
| Smoothing factor      | -            | -            | 0.1          |
| Mutation Probability  | 0.5          | 0.5          | 0.5          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 56           | 34           |
| Start avg. MSE        | 1.5839       | 1.3608       | 1.2421       |
| End avg. MSE          | 0.1584       | 0.1012       | 0.4708       |
| Diff. Start/End MSE   | 1.4255       | 1.2596       | 0.7713       |
| Best avg. MSE         | 0.132        | 0.0909       | 0.4651       |
| Best individual MSE   | 0.1148       | 0.0736       | 0.4497       |
| Start avg. RMSD       | 1.8889       | 1.8520       | 1.8823       |
| End avg. RMSD         | 1.8699       | 1.9711       | 2.0111       |
| Diff. Start/End RMSD  | 0.0190       | -0.1191      | -0.1288      |
| Best avg. RMSD        | 1.7506       | 1.6712       | 1.7752       |
| Best individual RMSD  | 1.4206       | 1.3559       | 1.4107       |

Table 4.6: This table shows the parameters and the results for Run 21, Run 22 and Run 23. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.6 Increased Mutation Rate

After studying the results of decreasing the dataset and altering the kernel parameters, the next step was to change the parameters and overall process of the GA. The easier of the two changes was to set the mutation probability to 0.5 instead of the standard value of 0.1. The mutation probability defines the rate at which mutations occur during the mating process from one generation to the next. Increasing the mutation probability allows the GA to search in a broader range and increases the chance of the optimization to jump out of a local minimum, but it also decreases the optimization rate and may lead to more diverse results.

As one can see in all three runs depicted in figure 4.6, the increased mutation probability leads to at least one individual in each generation with a significantly lower RMSD than the average. Further noticeable is the fact that the progression of the average and especially the best individual RMSD includes more peaks.

While changing the mutation probability still leads to good models with an average MSE of 0.1584, 0.1012 and 0.4708, in two of the runs the average end RMSD was even 0.1191 and 0.1288 higher than their start RMSD of 1.8520 and 1.8823.
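A sketch of such a per-gene mutation operator, assuming an individual is encoded as one conformer index per molecule (function and argument names are hypothetical):

```python
import random

def mutate(individual, conformer_counts, p=0.1):
    """Mutate an individual encoded as one conformer index per molecule.

    With probability `p` each gene is replaced by a random conformer
    index of the same molecule; p = 0.5 widens the search at the cost
    of a slower convergence of the population.
    """
    return [random.randrange(conformer_counts[m]) if random.random() < p else g
            for m, g in enumerate(individual)]
```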
(a) This figure shows the results for the APK with smoothing factor = 0.2 on 56 molecules with alternative conformation sampling and a mutation probability of 0.1

(b) This figure shows the results for the APK with smoothing factor = 0.2 on 56 molecules with alternative conformation sampling and a mutation probability of 0.1

(c) This figure shows the results for the PPK with B = 10 on 56 molecules with alternative conformation sampling and a mutation probability of 0.5

(d) This figure shows the results for the PPK with B = 10 on 56 molecules with alternative conformation sampling and a mutation probability of 0.5

Figure 4.7: These figures show the results of the runs with reduced datasets of the alternative conformation sampling and increased mutation probability for the PPK.
4.1.7 Alternative Conformation Sampling

A second way of increasing the chance of the optimization to jump out of a local minimum was to change the conformation sampling of the dataset. While the intermediate parameters for the ConfGen algorithm only allow local minima of relative molecular energy, the comprehensive parameters allow ConfGen to output molecules with dihedral angles and flexible ring energies that are not in a local minimum.
| Parameter / Run Nr.   | 24            | 25            | 26            | 27            |
|-----------------------|---------------|---------------|---------------|---------------|
| Kernel method         | APK           | APK           | PPK           | PPK           |
| RDF B factor          | -             | -             | 10            | 10            |
| RBF Sigma factor      | -             | -             | -             | -             |
| Smoothing factor      | 0.2           | 0.2           | -             | -             |
| Mutation Probability  | 0.1           | 0.1           | 0.5           | 0.5           |
| Mutate first 12 only  | no            | no            | no            | no            |
| Dataset size          | 56            | 56            | 56            | 56            |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE        | 0.9265        | 0.9252        | 1.3531        | 1.3671        |
| End avg. MSE          | 0.3775        | 0.3192        | 0.1084        | 0.1191        |
| Diff. Start/End MSE   | 0.549         | 0.606         | 1.2447        | 1.248         |
| Best avg. MSE         | 0.3775        | 0.3192        | 0.1025        | 0.1059        |
| Best individual MSE   | 0.3756        | 0.3169        | 0.0816        | 0.0934        |
| Start avg. RMSD       | 1.7999        | 1.8047        | 1.794         | 1.7862        |
| End avg. RMSD         | 1.7873        | 1.7608        | 1.6181        | 1.7728        |
| Diff. Start/End RMSD  | 0.0126        | 0.0439        | 0.1759        | 0.0134        |
| Best avg. RMSD        | 1.7175        | 1.7385        | 1.5084        | 1.6867        |
| Best individual RMSD  | 1.376         | 1.3512        | 1.2488        | 1.3333        |

Table 4.7: This table shows the parameters and the results for Run 24, Run 25, Run 26 and Run 27. Parameters denoted by '-' are not available for the chosen kernel method.
The thought was to allow the GA to successively get out of local minima, due to the differences in relative molecular energies not being as great as with the intermediate conformation sampling. Therefore the error of a change from one conformation to the one with the nearest relative energy would not be as great for the comprehensive conformation sampling as for the intermediate one.

Another reason for using the comprehensive conformation sampling was that the relative energy of an active structure is not necessarily a local minimum, due to the interaction of the molecule with its target and the solvent. So by allowing non-minimum structures in the dataset I reduced the minimal RMSD between the conformers in the dataset and the active structures.

The results for the experiments with the dataset produced by conformation sampling with comprehensive parameters are shown in figure 4.7 and table 4.7. As one can see in runs 24 and 25, which used the APK and a smoothing factor of 0.2, the first decline of the average RMSD with its subsequent increase between generations 35 and 50 still occurs. But instead of stagnating at the same average level as in most previous runs, the average RMSD declines again in later generations. The final average RMSD, however, was only 0.0126 and 0.0439 lower than the starting average RMSD, at 1.7873 and 1.7608, while the final average MSE of 0.3775 and 0.3192 was better than in most of the previous runs with the APK. Therefore the final models were more precise but still did not include conformations near the active structure.

The results for runs 26 and 27 are also shown in figure 4.7 and table 4.7. Both runs used the PPK and a mutation probability of 0.5. In run 26 the average RMSD almost always lies within the 5% quantile. In run 27 the average RMSD declines to a value of approximately 1.7 and stagnates for the second half of the optimization. The runs have final average RMSD values of 1.6181 and 1.7728 and final average MSE values of 0.1084 and 0.1191.

Figure 4.8: These figures show the results of the runs with reduced datasets of the alternative conformation sampling and the altered mutation operator, which allows mutation only on the conformers of the molecules with known active structure. (a) APK with smoothing factor = 0.1 on 56 molecules; (b) APK with smoothing factor = 0.1 on 34 molecules; (c) PPK with B = 10 on 56 molecules; (d) PPK with B = 10 on 34 molecules (each with alternative conformation sampling and mutation allowed only on the 12 known active structures).

4.1 Precomputed Conformation Sampling

| Parameter / Run Nr. | 28 | 29 | 30 | 31 |
|---|---|---|---|---|
| Kernel method | APK | APK | PPK | PPK |
| RDF B factor | - | - | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | 0.1 | 0.1 | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Dataset size | 56 | 34 | 56 | 34 |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE | 0.856 | 1.1405 | 1.311 | 1.5813 |
| End avg. MSE | 0.682 | 0.9029 | 0.7738 | 0.7589 |
| Diff. Start/End MSE | 0.174 | 0.2376 | 0.5372 | 0.8224 |
| Best avg. MSE | 0.6815 | 0.9022 | 0.7706 | 0.7255 |
| Best individual MSE | 0.6815 | 0.9022 | 0.7656 | 0.7207 |
| Start avg. RMSD | 1.7972 | 1.7881 | 1.7808 | 1.7808 |
| End avg. RMSD | 1.7187 | 1.8774 | 1.7369 | 1.8078 |
| Diff. Start/End RMSD | 0.0785 | -0.0893 | 0.0439 | -0.027 |
| Best avg. RMSD | 1.5975 | 1.727 | 1.7098 | 1.7052 |
| Best individual RMSD | 1.3621 | 1.5028 | 1.3915 | 1.3247 |

Table 4.8: This table shows the parameters and the results for run 28, run 29, run 30 and run 31. Parameters denoted by '-' are not available for the chosen kernel method.

4.1.8 Alternative Mutation Operator

The final change was to alter the mutation operator so that only the conformers of the molecules with known active structures could be mutated during the mating process at the end of a generation. For the remaining molecules, those with unknown active structures, the conformation with the minimal relative energy was fixed. The reason for this change was to reduce the search space of the optimization to the conformations of the molecules with known active structure, thereby increasing the chance of finding a model with a low MSE that includes conformations similar to the active structures and thus has a lower RMSD.
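
This restricted operator can be sketched as follows. It is a minimal illustration, not the thesis implementation: the encoding (an individual as one conformer index per molecule, with the known-active molecules occupying the first positions) and all names are assumptions.

```python
import random

def mutate(individual, conformer_counts, n_known_active=12, p_mut=0.1, rng=random):
    """Return a mutated copy of an individual (one conformer index per
    molecule).  Only the first n_known_active positions -- the molecules
    with known active structure -- may change; all other molecules stay
    fixed at their minimum-energy conformation."""
    child = list(individual)
    for i in range(n_known_active):
        if rng.random() < p_mut:
            # draw a new conformer index for molecule i
            child[i] = rng.randrange(conformer_counts[i])
    return child
```

With p_mut = 1.0 every known-active position is resampled, while every other position is guaranteed to stay untouched, which is exactly the search-space reduction described above.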

The results of the four runs, 28, 29, 30 and 31, with the altered mutation operator are shown in figure 4.8 and table 4.8. One can see that, while the average MSE rapidly declines in the first 25 generations, the average RMSD remains at the same level throughout the whole run for all four runs. Further, the average MSE only reaches values of 0.682 to 0.9029, which is significantly higher than in previous runs, due to the fact that the remaining fixed molecules do not allow a better model.

This means that, even when optimizing only over the generated conformers of the known active structures, the best models found still do not include conformations similar (i.e., with a low RMSD) to those structures. Possible reasons for this are manifold and will be discussed in the next chapter.

Figure 4.9: These figures show the average results of four runs with the APK (smoothing factor = 0.1) on conformers of 34 molecules, created with (a) the intermediate parameter set and (b) the comprehensive parameter set.

| Parameter / Run Nr. | avg. of runs 32-35 | avg. of runs 36-39 |
|---|---|---|
| Kernel method | APK | APK |
| RDF B factor | - | - |
| RBF Sigma factor | - | - |
| Smoothing factor | 0.1 | 0.1 |
| Mutation Probability | 0.1 | 0.1 |
| Mutate first 12 only | no | no |
| Dataset size | 34 | 34 |
| Conformation Sampling | intermediate | comprehensive |
| Start avg. MSE | 1.2388 | 1.2199 |
| End avg. MSE | 0.5018 | 0.5338 |
| Diff. Start/End MSE | 0.7370 | 0.6862 |
| Best avg. MSE | 0.4999 | 0.5338 |
| Best individual MSE | 0.4965 | 0.5316 |
| Start avg. RMSD | 1.8967 | 1.8022 |
| End avg. RMSD | 1.8979 | 1.8070 |
| Diff. Start/End RMSD | -0.0012 | -0.0048 |
| Best avg. RMSD | 1.8294 | 1.6579 |
| Best individual RMSD | 1.5099 | 1.3405 |

Table 4.9: This table shows the parameters and the results for the averages of runs 32-35 and runs 36-39. Parameters denoted by '-' are not available for the chosen kernel method.

4.1.9 Reruns

The only run resulting in a considerably lower average RMSD than all other runs was run 12, with a final average RMSD of 1.3813 (see figure 4.3 and table 4.3). To check whether this was a random result or whether the constellation of kernel, dataset and parameters leads to models using conformations with a low RMSD to the active structure, I reran the specific parameter set of run 12 four times with each of the two conformation sampling parameter sets, intermediate and comprehensive. The averaged results of these runs are shown in figure 4.9 and table 4.9.

As one can see, the average RMSD stagnates at 1.8, which is also the mean value for the RMSD of all possible combinations of conformers. The decline and immediate return to the mean RMSD between generation 25 and 50 is also visible in both results.

This indicates that run 12 was a random result, with the GA finding a local minimum. With a value of 0.567, the MSE of run 12 is even higher than the average MSE of both sets of four runs, at 0.5018 and 0.5338.

4.2 Implicit Conformation Sampling

The runs of the optimization with implicit conformation sampling were done in parallel to the runs with precomputed conformation sampling. Therefore the results of the runs with precomputed conformation sampling influenced the decisions made for the parameters and the dataset size of the runs with implicit conformation sampling. One run with implicit conformation sampling on the full dataset took up to two weeks on a Xeon quad-core server, which is why there are fewer results for the implicit conformation sampling.

Figure 4.10: These figures show the results of the runs with implicit conformation sampling on the full dataset using the RBF kernel and the PPK. (a) RBF kernel with B = 1000 and sigma = 100; (b) RBF kernel with B = 1000; (c) PPK with B = 10; (d) PPK with B = 10.

| Parameter / Run Nr. | 01i | 02i | 03i | 04i |
|---|---|---|---|---|
| Kernel method | RBF | PPK | PPK | PPK |
| RDF B factor | 1000 | 10 | 10 | 10 |
| RBF Sigma factor | 100 | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.1411 | 1.0375 | 1.1491 | 1.0389 |
| End avg. MSE | 0.5331 | 0.171 | 0.519 | 0.3874 |
| Diff. Start/End MSE | 0.608 | 0.8665 | 0.6301 | 0.6515 |
| Best avg. MSE | 0.5311 | 0.171 | 0.519 | 0.3874 |
| Best individual MSE | 0.5285 | 0.1638 | 0.5142 | 0.3808 |
| Start avg. RMSD | 1.8724 | 1.8826 | 1.9039 | 1.8894 |
| End avg. RMSD | 1.7978 | 1.9955 | 1.8523 | 1.8909 |
| Diff. Start/End RMSD | 0.0746 | -0.1129 | 0.0516 | -0.0015 |
| Best avg. RMSD | 1.6873 | 1.8246 | 1.8425 | 1.8498 |
| Best individual RMSD | 1.5876 | 1.5268 | 1.4942 | 1.6142 |

Table 4.10: This table shows the parameters and the results for run 01i, run 02i, run 03i and run 04i. Parameters denoted by '-' are not available for the chosen kernel method.

4.2.1 Initial Runs

The results for the initial runs, 01i and 02i, are shown in figure 4.10 and table 4.10. Runs 03i and 04i are later runs with the same parameters as run 02i. As one can see, the average MSE is decreasing, which shows that the optimization is functional. But in comparison to the runs with precomputed conformation sampling shown in the preceding section, the average MSE decreases more slowly and has not reached a minimum at the end of the run. This can be assumed because the average MSE is still decreasing between generations 150 and 200 and has not reached an even level. The final average MSE of the runs was 0.5331, 0.1710, 0.5190 and 0.3874, which is in the range of the results for the average MSE from the runs with precomputed conformation sampling.

Further noticeable is the fact that the average RMSD shows almost no change after generation 50 in all four runs, stagnating for many generations. The mean RMSD over all generations of all four runs is 1.859, which is near the overall mean of 1.81 of all possible conformations. In addition to the best individual RMSD, this can be explained by the small chance of a mutation occurring at the rotatable bonds of the molecules with known active structure. This low chance of a mutation is a result of the molecules with known active structure having fewer rotatable bonds than the molecules without known active structure. Therefore the chance of a mutation occurring on a molecule without known active structure is increased relative to the runs with precomputed conformation sampling, where the mutation chances were equally distributed.
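
The size of this bias can be illustrated with a short calculation. Assuming a mutation picks one rotatable bond uniformly over all rotatable bonds encoded in an individual (the bond counts below are purely hypothetical and only chosen to mirror the situation described above), the probability that it lands on a molecule with known active structure is simply the fraction of bonds those molecules contribute:

```python
def known_active_hit_probability(bonds_known, bonds_other):
    """Probability that a uniformly drawn rotatable-bond mutation hits
    one of the molecules with known active structure."""
    total = sum(bonds_known) + sum(bonds_other)
    return sum(bonds_known) / total

# hypothetical counts: 12 fairly rigid known-active molecules with 4
# rotatable bonds each vs. 88 flexible molecules with 8 bonds each
p = known_active_hit_probability([4] * 12, [8] * 88)
print(round(p, 3))  # -> 0.064, i.e. most mutations miss the known-active molecules
```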

In addition, all four runs lack the initial decrease of the average RMSD between generation 25 and 50 seen in the runs with precomputed conformation sampling.

Figure 4.11: These figures show the results of the runs with reduced datasets, implicit conformation sampling and the use of the PPK. (a) First run with the PPK and B = 10 on the reduced dataset of 34 molecules; (b) second run with the PPK and B = 10 on the reduced dataset of 34 molecules; (c) first run with the PPK and B = 10 on the reduced dataset of 56 molecules; (d) second run with the PPK and B = 10 on the reduced dataset of 56 molecules.

| Parameter / Run Nr. | 05i | 06i | 07i | 08i |
|---|---|---|---|---|
| Kernel method | PPK | PPK | PPK | PPK |
| RDF B factor | 10 | 10 | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.687 | 1.73 | 1.5033 | 1.5079 |
| End avg. MSE | 0.6932 | 0.7425 | 0.9929 | 0.9555 |
| Diff. Start/End MSE | 0.9938 | 0.9875 | 0.5104 | 0.5524 |
| Best avg. MSE | 0.673 | 0.7398 | 0.989 | 0.9475 |
| Best individual MSE | 0.6693 | 0.7352 | 0.9827 | 0.9459 |
| Start avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.8546 |
| End avg. RMSD | 2.0149 | 2.0571 | 1.9715 | 1.8685 |
| Diff. Start/End RMSD | -0.1557 | -0.1966 | -0.1228 | -0.0139 |
| Best avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.7359 |
| Best individual RMSD | 1.5489 | 1.5543 | 1.5698 | 1.5687 |

Table 4.11: This table shows the parameters and the results for run 05i, run 06i, run 07i and run 08i. Parameters denoted by '-' are not available for the chosen kernel method.

4.2.2 Reduced Dataset and Fixed Conformation

In runs 05i to 08i I combined several changes. First, I reduced the dataset to 56 and 34 molecules, including the 12 with known active structure. The second change was to fix the conformation of the molecules with unknown active structure to the conformation with the lowest relative energy, allowing mutation only at rotatable bonds of the molecules with known active conformation. Run 07i was interrupted at generation 116 due to a server crash and could not be resumed.

The results for the four runs with this configuration are shown in figure 4.11 and table 4.11. One can see that the average MSE declines faster (within the first 50 generations) than in the previous runs with implicit conformation sampling, to final values of 0.6932, 0.7425, 0.9929 and 0.9555. These higher average MSE values can be explained by the fixed conformations and the resulting missing possibility for optimization.

One can also see the effect of allowing mutation only on the 12 molecules with known conformation: the best individual RMSD is lower than the average RMSD in almost every generation in all four runs. This can be explained by the higher mutation rate on these molecules, resulting in individuals with conformations with a lower RMSD to the active structures.

5 Discussion

The hypothesis that the best achievable models for predicting activity include the active structures of the training molecules cannot be confirmed in this work. The average RMSD over all runs is almost exactly the average RMSD over all possible sets of conformations to the active structure (see figure 5.1). While some runs show an RMSD below the 5% quantile, they are still within the normal distribution and are countered by runs with an RMSD above average. Though models with a low average RMSD were found, the optimization in most cases returned to models with an RMSD near the average value. The reasons for these results can come from two directions: they can be qualified either chemically or mathematically.

One possible reason is that too many factors determining the active structure are missing from the models I created. For example, the solvent of the molecules, in this case water, is not included in the model at all. But as studies have shown, the solvent often has a great influence on the activity and the active structure of a molecule. It can change the molecule's conformation to a more fitting one or even be part of the active site itself by filling the space not occupied by the ligand. Therefore, disregarding the solvent may lead to a model whose feature space does not carry enough information about the active complex.

Figure 5.1: This figure shows the diagram of the mean values over all runs. One can clearly see the average RMSD stagnating at 1.81 over the whole run, while the MSE declines to a value of about 0.4.

A second reason may be that only part of a molecule's conformation is critical for the activity, while another, possibly larger part can take up a random, probably energetically minimal, conformation. However, this would only account for part of the normal distribution of the RMSD values. The 'active' part of the molecule conformations chosen for the model would have a distinctly lower average RMSD to the active conformation, resulting in the overall RMSD being lower than the mean of the normal distribution. For the dataset used in this work, this can be ruled out because all 12 molecules with known active structure are entirely integrated in the active process and have no parts which can take on free conformations.

Another reason for the models consisting of conformations with an average RMSD to the active structure may come from the kernel methods used, in combination with the molecules in the dataset. Regardless of their chemical properties, the molecules in the dataset often consist of several ring systems. These ring systems contribute the same partial results to all kernel values for all different combinations of conformations, due to the fact that they are rigid and do not change partial results between different conformations. To rule this out, one would have to repeat the experiments with a dataset of molecules with fewer rigid parts or with a kernel method that prioritizes longer distances within the conformations.

The most important and apparent reason, though, is based on the principle of the SVR. When optimizing the activity prediction with the activity value being the same for each conformer of a given molecule, it is clear that to achieve maximal generalization the process will pick the conformation that best represents the whole conformational space of the molecule. In the conformational hypersphere this will be a conformation near the center of the sphere, which in this case means a conformation with a maximal 'similarity' to all other conformations of that molecule, or, in other words, a conformation with a minimal average RMSD to all other conformers. In most cases this will not be the active conformation. For the 12 molecules with known active structure used in this work, the average distance (i.e., RMSD) from the active conformation to the center of the conformational hypersphere was 1.81 Å.
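
This selection criterion can be made concrete: among a molecule's conformers, the one favored by the argument above is the one minimizing the average pairwise RMSD, i.e. the medoid of the conformer set. The following is a minimal sketch using a plain coordinate RMSD on already superimposed (n_atoms, 3) coordinate arrays; in practice each pair would first be aligned, e.g. with the Kabsch or quaternion method, and all names here are illustrative.

```python
import numpy as np

def rmsd(a, b):
    """Coordinate RMSD between two (n_atoms, 3) arrays; assumes the
    conformers are already superimposed."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def central_conformer(conformers):
    """Index of the conformer with minimal average RMSD to all other
    conformers -- the 'center' of the conformational hypersphere in
    the sense used above."""
    n = len(conformers)
    avg = [sum(rmsd(conformers[i], conformers[j]) for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    return int(np.argmin(avg))
```

Computing this exactly requires all pairwise RMSDs, which is the O(n^2) cost referred to in the prospects chapter.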

Not regarding the resulting average RMSD values, the PPK yielded the best MSE, with values as low as 0.01, in contrast to the RDF kernel with MSE values only in the range of 0.5 and the APK with MSE values in the range of 0.3. Furthermore, the PPK had the steepest descent of the MSE values, reaching an almost even level at generation 25-50, whereas the APK needed more generations. The resulting models of the implicit conformation sampling were less significant because in most runs the optimization process was not finished. This can be explained by the mutation probability only being set to 0.1 and the number of points being about 10 times as high as with the precomputed conformation sampling. One can see that by reducing the possible mutations, as in the final runs with implicit conformation sampling and fixed conformers, the MSE also decreases more rapidly.

6 Prospects

Following the results of this work, models for activity prediction provide the best results not by using the active structures but by using a conformation with minimal distance to all possible conformations of the respective molecule. To confirm these results, one would have to run further experiments with other kernels and datasets. In these experiments one would have to calculate the distances of the best resulting conformations for activity prediction not only to the active conformation but to all possible conformations, or at least to an equally distributed set over the conformational space. These new models would be expected to provide the best results if they were based on these 'average' conformations and not on the active structures.

If proven correct, one would have to rethink the use of active conformations in 3D QSAR models in favor of more generalized conformations. Further, this would suggest a method of finding a conformation near the center of the conformational hypersphere without calculating all pairwise distances of all possible conformations.
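
One simple candidate for such a method, sketched here purely as an illustration and not proposed in this work, is to estimate each conformer's average RMSD against only k randomly drawn reference conformers instead of all n - 1, reducing the cost from O(n^2) to O(n*k) RMSD evaluations (the RMSD used below assumes superimposed coordinates):

```python
import random
import numpy as np

def rmsd(a, b):
    """Coordinate RMSD of two (n_atoms, 3) arrays (assumed superimposed)."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def approx_central_conformer(conformers, k=10, rng=random):
    """Estimate the conformer nearest the hypersphere center by scoring
    every candidate against only k random reference conformers."""
    refs = rng.sample(range(len(conformers)), min(k, len(conformers)))
    scores = [np.mean([rmsd(c, conformers[j]) for j in refs]) for c in conformers]
    return int(np.argmin(scores))
```

Whether such a sampled estimate is accurate enough for conformer selection would itself have to be validated experimentally.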

If proven wrong, one would have to revise the results of this work and further investigate the reasons why the average RMSD of the best achievable models is constantly the exact RMSD of the active conformations to the middle of the proposed conformational hypersphere. For this, one would have to compile a new set of molecules with known active structures, where the set would include more and diverse active conformations to cover a wider range of the chemical space.

Bibliography
|
||
|
||
[Bau09] L.; Heine-A.; Smolinski M.; Hangauer D.; Klebe G. Baum, B.; Muley. Think
|
||
twice: understanding the high potency of bis(phenyl)methane inhibitors of throm-
|
||
bin. J.Biol.Mol, 391:552–564, 2009.
|
||
|
||
[Ber77]
|
||
|
||
T.F.; Williams-G.J. Meyer E.E. Jr.; Brice M.D.; Rodgers J.R.; Kennard O.; Shi-
|
||
manouchi T.; Tasumi M. Bernstein, F.C.; Koetzle. The protein data bank: A
|
||
J. of Mol. Biol.,
|
||
computer-based archival file for macromolecular structures.
|
||
112:535, 1977.
|
||
|
||
[BGV92] B.E. Boser, I.M. Guyon, and Vapnik V.N. Annual Workshop on Computational
|
||
Learning Theory, chapter Proceedings of the fifth annual workshop on Computa-
|
||
tional learning theory, pages 144–152. ACM, 1992.
|
||
|
||
[Boe99]
|
||
|
||
J.; Klebe G Boehm, M.;Stuerzebecher. Three-dimensional qantitive structure activ-
|
||
ity relationship analyses using comparative molecular file analysis and comparative
|
||
molecular similarity indices analysis to elucidate selectivity differences of inhibitors
|
||
binding to trypsin, thrombin, and factor xa. Journal of Medical Chemistry, 42:458–
|
||
477, 1999.
|
||
|
||
[Cha05] Q Chang. Scaling gaussian rbf kernel width to improve svm classification. Neural
|
||
Networks and Brain, 2005. ICNN&B ’05. International Conference on, pages 19–
|
||
22, 2005.
|
||
|
||
[CL01]
|
||
|
||
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector
|
||
machines, 2001. Software available at http://www.csie.ntu.edu.tw/
|
||
˜cjlin/libsvm.
|
||
|
||
[Cou04] Chaok; Dill Ken A. Coutsias, Evangelos A.; Seok. Using quaternions usings rmsd.
|
||
|
||
J.Comput. Chem, 25:1849–1857, 2004.
|
||
|
||
[Cou05] Chaok; Dill Ken A. Coutsias, Evangelos A.; Seok. Rotational superposition and
|
||
least sequares: the svd and quaternions approach yield identical results. reply to the
|
||
preceeding comment by g. kneller. J. Comput. Chem., 26:1663–1665, 2005.
|
||
|
||
[CV95] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297,
|
||
|
||
1995.
|
||
|
||
[Dia76] R. Diamond. On the comparison of conformations using linear and quadratic trans-
|
||
|
||
formations. Acta Cryst, 32:1–10, 1976.
|
||
|
||
[Dix06] A.; Knoll E.; Rao S.; Schaw D. Friesner R.A. Dixon, S.; Smondyrev. Phase: a new
|
||
engine for pharmacophore perception, 3d qsar model developement and 3d database
|
||
screening. J. Comput.-Aided Mol. Design, 20(10):647–671, 2006.
|
||
|
||
55
|
||
|
||
Bibliography
|
||
|
||
[Eki04]
|
||
|
||
S. Ekins. Predicting undesirable drug interactions with promisscuous proteins in
|
||
silico. Drug Dicovery Today, 9:276–285, 2004.
|
||
|
||
[Fis94]
|
||
|
||
E. Fischer. Einfluss der konfiguration auf die wirkung der enzyme. Berichte der
|
||
deutschen chemischen Gesellschaft, 27:2985–2993, 1894.
|
||
|
||
[Fle89]
|
||
|
||
R. Fletcher. Practical Mehods of Optimization. John Wiley and Sons, New York,
|
||
1989.
|
||
|
||
[Fri04]
|
||
|
||
J.L.; Murphy R.B.; Halgren T.A.; Klicic J.J.; Mainz D.T.; Repasky M.P.; Knoll E.H.
|
||
Shelley M.; Perry J.K.; shaw D.E.; Francis P.; Shenkin P.S. Friesner, R.A.; Banks.
|
||
Glide: a new approach for rapid, accurate doachin and coring. 1- method and as-
|
||
sessment of doching accuracy. J.Med.Chem, 47(7):1739–49, 2004.
|
||
|
||
[Gas96]
|
||
|
||
J.; Schuur J.; Selzer P.; Steinhauer L.; Steinhauer V. Gasteiger, J.; Sadowski. Chem-
|
||
ical information in 3d space. J. Chem. Inf. Comput. Sci ., 36:1030–1037, 1996.
|
||
|
||
[Gas97]
|
||
|
||
J.; Selzer P.; Steinhauer L.; Steinhauer V. Gasteiger, J.; Schuur. Finding the 3d
|
||
structure of a molecule in its ir spectrum. Fresenius J. Anal. Chem., 359:50–55,
|
||
1997.
|
||
|
||
[Guy93] B.; Vapnik V.N. Guyon, I.; Boser. Advances in Neural Information Processing
|
||
Systems, chapter Automtic capactiy tuning of very large VC-dimension classifiers,
|
||
pages 147–155. Morgan Kaufmann, San Mateo, CA, 1993.
|
||
|
||
[Ham66] Sir. Hamilton, William Rowan. Elements of Quaternions. Longmans, Green & Co.,
|
||
|
||
London, 1866.
|
||
|
||
[Han69] C. Hansch. A quantitative approach to biochemical structure-activity relationships.
|
||
|
||
Acc. Chem. Res., 2:232–239, 1969.
|
||
|
||
[Har94] George K.; Kauffman Louis H. Hart, John C.; Francis. Visualizing quaternion rota-
|
||
|
||
tion. Transactions on Graphics, 13:256–276, 1994.
|
||
|
||
[Hem99] Markus C.; Steinhauer V.; Gasteiger J. Hemmer. Deriving the 3d structure of organic
|
||
molecules from their infrared spectra. Vibrational Spectroscopy, 19:151–164, 1999.
|
||
|
||
[HG04] H.Z. Hao and M. Genton. Compactly supported radial basis function kernels. 2004.
|
||
|
||
[Hol75]
|
||
|
||
John H. Holland. Adaptation in Natural and Artificial Systems. Univ. Michigan
|
||
Press., 1975.
|
||
|
||
[Jah09] G.; Fechner N.; Zell A. Jahn, A.; Hinselmann. Optimal assignment methods for
|
||
ligand-based virtual screening. Journal of Chemoinformatics, 1:14, 2009.
|
||
|
||
[Jeb04]
|
||
|
||
R.; Howard A. Jebara, T.; Kondor. Probability product kernels. Journal of Machine
|
||
Learning Research, 5:819–844, 2004.
|
||
|
||
[Jor88]
|
||
|
||
J.T. Jorgensen, T.L.; Tirado-Rives. The opls potential functions for proteins. energy
|
||
minimization for crystals of cyclic peptides and crambin. J.Am.Chem.Soc., 110:165,
|
||
1988.
|
||
|
||
56
|
||
|
||
Bibliography
|
||
|
||
[Jor96]
|
||
|
||
[JZ10]
|
||
|
||
D.S.; Tirado-Rives J. Jorgensen, W.L.; Maxwell. Development and testing of the
|
||
opls all-atom force field on cornformational energetcs and properties of organic liq-
|
||
uids. J.Am.Chem.Soc., 118:11225–11235, 1996.
|
||
|
||
G.; Fechner-N.; Henneges C. Jahn, A.; Hinselmann and A. Zell. Probabilistic model-
|
||
ing of conformational space for 3d machine learning approaches. Mol. Inf., 29:441–
|
||
455, 2010.
|
||
|
||
[Kab76] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta
|
||
|
||
Crystallographica, 32(5)A:922–923, 1976.
|
||
|
||
[Kat00] R.; Luong-C.; Radika K.; Martelli A.; Sprengeler P.A.; Wang J.; Chan H.;
|
||
Wong L Katz, B.A.; Mackman.
|
||
Structural basis for selectivity of a small
|
||
molecule, s1-binding, submicromolar inhibitor of urokinase-type plasminogen ac-
|
||
tivator. Chem.Biol., 7:299–312, 2000.
|
||
|
||
[Kat01a] K.; Luong-C.; Rice M.J.; Mackman R.L.; Sprengeler P.A.; Spencer J.; Hataye J.;
|
||
Janc J.; Link J.; Litvak J.; Rai R.; Rice K.; Sideris S.; Verner E.; Young W. Katz,
|
||
B.A.; Elrod. A novel serine protease inhibition motif involving a multi-centered
|
||
short hydrogen bonding network at the active site. J.Biol.Mol, 307:1451–1486,
|
||
2001.
|
||
|
||
[Kat01b] P.A.; Luong-C.; Verner E.; Elrod K.; Kirtley M.; Janc J.; Spencer J.R.; Breit-
|
||
enbucher J.G.; Hui H.; McGee D.; Allen D.; Martelli A.; Mackman R.L. Katz,
|
||
B.A.; Sprengeler. Engineering inhibitors highly selective for the s1 sites of ser190
|
||
trypsin-like serine protease drug targets. Chem.Biol., 8:1107–1121, 2001.
|
||
|
||
[Kat03] K.; Verner-E.; Mackman R.L.; Luong C.; Shrader W.D.; Sendzik M.; Spencer J.R.;
|
||
Sprengeler P.A.; Kolesnikov A.; Tai V.W.-F.; Hui H.C.; Breitenbucher J.G.; Allen
|
||
D.; Janc J.W. Katz, B.A.; Elrod. Elaborate manifold of short hydrogen bond ar-
|
||
rays mediating binding of active site-directed serine protease inhibitors. J.Biol.Mol,
|
||
329:93–120, 2003.
|
||
|
||
[Kat04] C.; Ho-J.D.; Somoza J.R.; Gjerstad E.; Tang J.; Williams S.R.; Verner E.; Mackman
|
||
R.L.; Young W.B.; Sprengeler P.A.; Chan H.; Mortara K.; Janc J.W.; McGrath M.E.
|
||
Katz, B.A.; Luong. Dissecting and designing inhibitor selectivity determinants at
|
||
the s1 site using an artificial ala190 protease (ala190 upa). J.Biol.Mol, 344:527–
|
||
547, 2004.
|
||
|
||
[Kea89] Simon K. Kearsley. On the orthogonal transformation used for structural compari-
|
||
|
||
son. Acta Crystallographica, 45(2)A:208–210, 1989.
|
||
|
||
[Kel06]
|
||
|
||
P.; Schalon-C.; Bret G.; Foata N.; Rognan D. Kellenberg, E.; Muller. sc-pdb: an
|
||
annotated database of druggable binding sites from the protein data bank. Journal
|
||
of Chemical Information and Modeling, 46(2):717–727, 2006.
|
||
|
||
[Kos58]
|
||
|
||
Jr. Koshland, D. E. Application of a theory of enzyme specificity to protein synthe-
|
||
sis. Proc. Natl. Acad. Sci. U.S.A., 44:98–104, 1958.
|
||
|
||
[Kos94]
|
||
|
||
Jr. Koshland, D. E.
|
||
Angew.Chem.Int.Ed.Engl, 33:2375–2378, 1994.
|
||
|
||
The key and lock theory and the induced fit
|
||
|
||
theory.
|
||
|
||
57
|
||
|
||
Bibliography
|
||
|
||
[Mac84] A. L. Mackay. Quaternion transformation of molecular orientation. Acta Crystallographica Section A, 40(2):165–166, Mar 1984.

[Mat96] J. H. Matthews, R. Krishnan, M. J. Costanzo, B. E. Maryanoff, and A. Tulinsky. Crystal structures of thrombin with thiazole-containing inhibitors: probes of the S1' binding site. Biophys. J., 71:2830–2839, 1996.

[McL72] A. D. McLachlan. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Cryst., 28:656–657, 1972.

[OW91] T. I. Oprea and C. L. Walter. Reviews in Computational Chemistry, chapter Theoretical and practical aspects of three-dimensional quantitative structure-activity relationships, pages 127–182. Wiley-VCH: New York, 1991.

[Sad94] J. Sadowski, M. Wagener, and J. Gasteiger. Corina: automatic generation of high-quality 3D molecular models for application in QSAR. In 10th European Symposium on Structure-Activity Relationships: QSAR and Molecular Modelling, 1994.

[Sch96] J. H. Schuur, P. Selzer, and J. Gasteiger. The coding of the three-dimensional structure of molecules by molecular transforms and its application to structure-spectra correlations and studies of biological activity. J. Chem. Inf. Comput. Sci., 36:334–344, 1996.

[Sel97] P. Selzer, J. H. Schuur, and J. Gasteiger. Software Development in Chemistry 10, volume 10, chapter Simulation of IR Spectra with Neural Networks Using the 3D-MoRSE Code, page 293. Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1997.

[Sew07] Martin Sewell. Kernel methods. Technical report, Department of Computer Science, University College London, 2007.

[Sho85] K. Shoemake. Animating rotation with quaternion curves. Comput. Graph., 19:245–254, 1985.

[SI08a] Schroedinger Inc., New York. LigPrep, V2.1, 2008.

[SI08b] Schroedinger Inc., New York. MacroModel, V9.6, 2008.

[SJ93] M. Stone and P. Jonathan. Statistical thinking and techniques for QSAR related studies. 1. General theory. J. Chemom., 7:455–475, 1993.

[SOW04] Jeffrey J. Sutherland, Lee A. O'Brien, and Donald F. Weaver. A comparison of methods for modeling quantitative structure-activity relationships. J. Med. Chem., 47:5541–5554, 2004.

[Ste03] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, and E. Willighagen. The Chemistry Development Kit (CDK): an open source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43(2):493–500, 2003.

[Ste06] C. Steinbeck, C. Hoppe, S. Kuhn, M. Floris, and R. Guha. Recent developments of the Chemistry Development Kit (CDK): an open source library for chemo- and bioinformatics. Curr. Pharm. Des., 12(17):2111–2120, 2006.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.

[Vap95] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[VC64] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[VC74] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka (Russia), 1974.

[VL63] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[WS10] K. S. Watts, P. Dalal, R. B. Murphy, W. Sherman, R. A. Friesner, and J. C. Shelley. ConfGen: a conformational search method for efficient generation of bioactive conformers. J. Chem. Inf. Model., 50:534–546, 2010.

[XG02] C. Xi, Y. Lin, L. Ming, and K. Gilson. The Binding Database: data management and interface design. Bioinformatics, 18(1):130–139, 2002.

[Zam76] A. Zamora. An algorithm for finding the smallest set of smallest rings. J. Chem. Inf. Comput. Sci., 16(1):40–43, 1976.
List of Figures

1.1 Overlay of Thrombin inhibitors
1.2 QSAR process
2.1 SVR
2.2 GA individuals
2.3 GA mutation operators
2.4 Thrombin-Hirudin complex
3.1 flowchart of the overall process
3.2 examples for RDF
3.3 overlay of RDF functions
3.4 curve approximation
3.5 Atom Pair Kernel
3.6 12 structures with known active conformation
3.7 example for implicit conformation sampling encoding
3.8 example of rotatable bonds
4.1 results of the initial runs
4.2 results of reduced dataset with PPK and RBF
4.3 results of reduced dataset with APK
4.4 results for alternative parameters for the PPK
4.5 results for alternative parameters for the APK
4.6 results of increased mutation rate
4.7 results of alternative conformation sampling
4.8 results of alternative mutation operator
4.9 results for the reruns
4.10 results for initial runs with implicit conformation sampling
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation
5.1 mean over all runs
List of Tables

3.1 table of compiled structures
3.2 table of parameters for conformer generation
3.3 table of parameters for the SVR
4.1 results of the initial runs
4.2 results of reduced dataset with PPK and RBF
4.3 results of reduced dataset with APK
4.4 results for alternative parameters for the PPK
4.5 results for alternative parameters for the APK
4.6 results of increased mutation rate
4.7 results of alternative conformation sampling
4.8 results of alternative mutation operator
4.9 results for the reruns
4.10 results for initial runs with implicit conformation sampling
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation