EBERHARD-KARLS-UNIVERSITÄT TÜBINGEN
Wilhelm-Schickard-Institut für Informatik
Lehrstuhl Rechnerarchitektur

Diplomarbeit (Diploma Thesis)

Active Structure Learning using Genetic Algorithms and Kernel Functions

Christoph Schwörer

Advisors:
Prof. Dr. rer. nat. Andreas Zell, Wilhelm-Schickard-Institut für Informatik
Prof. Dr. rer. nat. Karl-Heinz Wiesmüller, EMC Microcollections GmbH

Started: 13th January 2010
Completed: 12th July 2010

Declaration
I hereby certify that I have written this thesis independently and have used no sources other than those cited.

Tübingen, 12th July 2010
Christoph Schwörer
Abstract.
Current 3D QSAR approaches attempt to build models based not only on 1D or 2D descriptors of molecules such as weight, charge, or molecular graphs, but also on 3D-sensitive information such as the conformation or properties of the molecular surface. A basic assumption in building these 3D QSAR models is that the best results are attained by using the best available (i.e., the obtained active structure) data. In this work I tried to find such a best achievable 3D QSAR model by means of optimizing a model over a set of conformations using a genetic algorithm and three different kernel methods. The intent was to see whether the resulting models would include the active structures. For the generation of the sets of conformations I used two different approaches: the first a precomputation of the conformations, the second an implicit generation concurrent with the optimization. The results will show that the models with the best generalization and prediction accuracy in most cases do not include the active conformation, but rather conformations with a minimal average pairwise distance to all other possible conformations of the respective molecules.
Contents

1 Introduction  1

2 Background Information  4
2.1 Kernel Functions  4
2.2 Support Vector Regression  7
2.3 Rotation with quaternions  7
2.4 RMSD calculation with quaternions  7
2.5 Genetic Algorithm  9
2.6 Quantitative Structure-Activity Relationship  11

3 Materials and Methods  13
3.1 Overall process  13
3.2 Radial Distribution Function  15
3.3 Kernel  17
3.3.1 Probability Product Kernel  17
3.3.2 Radial Basis Function Kernel  18
3.3.3 Atom Pair Kernel  19
3.4 Dataset  21
3.5 Conformation Sampling  21
3.5.1 Precomputed Conformation Sampling  21
3.5.2 Implicit Conformation Sampling  24
3.6 SVR  27

4 Results  28
4.1 Precomputed Conformation Sampling  28
4.1.1 Initial Runs  31
4.1.2 Reduced Dataset with PPK and RBF Kernel  33
4.1.3 Reduced Dataset with Atom Pair Kernel  35
4.1.4 Alternative Parameters for the Product Probability Kernel  37
4.1.5 Alternative Parameters for APK  39
4.1.6 Increased Mutation Rate  41
4.1.7 Alternative Conformation Sampling  42
4.1.8 Alternative Mutation Operator  45
4.1.9 Reruns  47
4.2 Implicit Conformation Sampling  47
4.2.1 Initial Runs  49
4.2.2 Reduced Dataset and Fixed Conformation  51

5 Discussion  52

6 Prospects  54

Bibliography  55
1 Introduction

In drug design one of the major goals is to find new lead structures. Lead structures already show a certain affinity towards the intended target but express unwanted side effects or lack certain properties; for example, they may be toxic or have a low bioavailability. Without a detailed understanding of the biochemical processes responsible for the activity, the search for such a new lead structure is non-trivial.
The usual process is simply to try a huge number of different chemical compounds in vitro and observe their activity. But the combinatorial possibilities of this strategy can explode even for small systems. For instance, the number of compounds needed to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000.

Therefore this classical screening process was automated and combinatorially optimized over the last decades into high throughput screening (HTS), which allows a systematic search in large databases with hundreds of thousands of entries. Still, this process makes up a large share of the development costs and time. Further, the chemical compounds needed for the synthesis are often rare and hard to obtain in the purity needed for reliable results.
One way to optimize this exhaustive search, and to give an indication of the right direction, is to develop a model that quantitatively relates variations in biological activity to changes in molecular properties which can be easily obtained for each compound. One of the first to build such a model was Corwin Hansch, correlating lipophilicity and polarity with biological activity in his Hansch method [Han69]. But there exist many other approaches to this Quantitative Structure-Activity Relationship (QSAR) principle, which mostly differ in their choice of molecular descriptors and mathematical models such as Partial Least Squares or Principal Component Analysis. The QSAR models developed in this work are based on kernels which are evaluated by Support Vector Regression.
In recent years several models have been developed using 3D descriptors of molecules. These 3D descriptors are important because, to build a model and gain understanding of the binding process, it is not enough to know the single components and values of a molecule; one also has to know their three-dimensional arrangement. One can see this, for example, in figure 1.1, where a single molecule can take on several conformations. Knowing which of these conformations is the active conformation can improve the modeling process and the understanding of the chemical processes leading to the activity.
One thing all QSAR methods have in common is the basic assumption that the biological activity is an additive function of the molecular properties (2D or 3D) of the substituents and groups of the respective structure. Not only the mere presence of those groups is essential, but also their three-dimensional arrangement.
This leads to the expectation that, when using 3D descriptors, only good and correct training data including 3D information of the molecules leads to a good model of the activity. But what if this does not hold true? What if a better model can be created without using the actually correct physical data? The question is whether the reverse of this expectation is always valid, i.e. whether model quality maps one-to-one onto training data quality.
Figure 1.1: This figure shows an overlay of six conformations of the same thrombin inhibitor. One can see the high flexibility of the lower ring system.
(a) The standard process for building a QSAR model.

(b) The reverse process of optimizing a QSAR model with respect to the training data quality.

Figure 1.2: These figures show the standard process and the process used in this work to build a QSAR model. To optimize the model quality with respect to the varying training data, the process has been reversed.
To test whether one can find such a model, I will reverse the QSAR approach (see figure 1.2), optimizing the QSAR model to predict activity by means of altering the training dataset, where for each data point several values are given, including the actual physical ones. While the input data points (i.e. their molecular descriptors) vary, their target function value (i.e. the activity) stays the same.
The intention of this experiment is to see whether the best attainable model includes the actual active structure or an artificially created one. To this end I compiled a data set and created a set of conformers for each molecule. Then I optimized the activity prediction concurrently over the whole training dataset, so as not to favor one molecule over another by optimizing them successively one after the other. A good way to handle multidimensional optimization with several datasets is to use a genetic algorithm, which I did in this case.

In this work I will present the methods used for the generation of the dataset, the optimization, and the evaluation. Further, I will present the results and discuss their significance. Finally, I will give a perspective on further work which can be done on this topic.
2 Background Information

2.1 Kernel Functions

Detecting linear relations has been the focus of much research in statistics and machine learning for the last few decades, and the resulting algorithms are well understood, well developed and efficient [Sew07]. However, many models of natural processes are not linear. So, if a problem is non-linear, instead of trying to fit a non-linear model one can map the problem from the input space $X$ to a new, higher-dimensional space called the feature space $F$ and then use a linear model in the feature space. This mapping can be achieved by a non-linear transformation. For example, the function $\varphi$ can be given as

$$\varphi : \mathbb{R}^2 \to \mathbb{R}^3 \quad \text{with} \quad \varphi(x_1, x_2) = \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right) \tag{2.1}$$

While this function is a very simple one, other functions can easily become computationally impracticable for both polynomial features and higher dimensionality. This is grounded in the fact that the number of different monomial features of degree $p$ is $\binom{d+p-1}{p}$, with $d = \dim X$ [Vap95] (e.g. $p = 7$, $d = 28 \cdot 28 = 784$ corresponds to a total of approximately $3.7 \cdot 10^{16}$ features).

The key to an efficient computation is the observation made by [BGV92] that

$$\left\langle \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2 \tag{2.2}$$

which allows the use of kernel functions, where $\varphi$ need not be explicitly known as long as the function corresponds to a dot product in the feature space $F$:

$$k(x, x') := \langle \varphi(x), \varphi(x') \rangle \tag{2.3}$$
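The identity in eq. 2.2 is easy to check numerically. The following sketch (my own illustration, not code from this work) evaluates both sides of the equation for the degree-2 polynomial feature map:

```java
// Sketch: the kernel trick for the degree-2 polynomial kernel (eq. 2.2).
// phi maps R^2 into R^3; the dot product in feature space equals <x,x'>^2.
public class KernelTrickDemo {
    // Explicit feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
    static double[] phi(double x1, double x2) {
        return new double[] { x1 * x1, Math.sqrt(2) * x1 * x2, x2 * x2 };
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Kernel evaluation without ever computing phi: k(x, x') = <x, x'>^2
    static double k(double[] x, double[] xp) {
        double d = dot(x, xp);
        return d * d;
    }

    public static void main(String[] args) {
        double[] x = { 1.5, -2.0 }, xp = { 0.5, 3.0 };
        double viaFeatureMap = dot(phi(x[0], x[1]), phi(xp[0], xp[1]));
        double viaKernel = k(x, xp);
        System.out.println(viaFeatureMap + " == " + viaKernel); // equal up to rounding
    }
}
```

The kernel side costs one 2-dimensional dot product, whereas the explicit side requires building the feature vectors first; for high degree and dimension only the kernel side stays tractable.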
2.2 Support Vector Regression

Many multivariate methods assume that there is a linear relation between $X$ and $Y$ which holds for all samples. In chemoinformatics this assumption does not hold true and causes a variety of problems in the prediction of unknown data points. One way to solve these problems is to use non-linear learning methods such as support vector regression (SVR). The support vector algorithm is a non-linear generalization of the Generalized Portrait algorithm developed in Russia in the 1960s [VL63] [VC64]. Its groundwork, the statistical learning theory or VC theory, has been developed over the last half century by Vapnik and Chervonenkis [VC74] [Vap82] [Vap95]. The VC theory defines properties of learning machines which enable them to generalize to unseen data.

Given a set of training data $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \chi \times \mathbb{R}$, with $\chi$ denoting the space of input patterns (e.g. $\chi = \mathbb{R}^d$), the goal of $\varepsilon$-SV regression is to find a function $f(x)$ with a maximum deviation of $\varepsilon$ from the actually received targets $y_i$ for all the training data. In addition, $f$ should be as flat as possible. The form of a linear function $f$ is given as

$$f(x) = \langle w, x \rangle + b \quad \text{with } w \in \chi,\ b \in \mathbb{R} \tag{2.4}$$

with $\langle \cdot, \cdot \rangle$ denoting the dot product in $\chi$ and flatness meaning a small $w$. To attain this we minimize the Euclidean norm $\frac{1}{2}\|w\|^2$, which can be formally written as a convex optimization problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon \\ \langle w, x_i \rangle + b - y_i \le \varepsilon \end{cases} \tag{2.5}$$

The above formulation is viable for all problems where a function $f$ actually exists that approximates all pairs $(x_i, y_i)$ with precision $\varepsilon$. If this is not the case, or if we want to allow some errors, according to [CV95] one can introduce slack variables $\xi_i, \xi_i^*$, leading to:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases} \tag{2.6}$$

where the constant $C > 0$ defines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. This is the same as dealing with an $\varepsilon$-insensitive loss function $|\xi|_\varepsilon$ denoted by:

$$|\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases} \tag{2.7}$$

Figure 2.1 depicts the use of $\xi$ and $\varepsilon$. Extending support vector machines to solve non-linear problems is possible by using a standard dualization approach utilizing Lagrange multipliers, as described in [Fle89], leading to the following formula:

$$\begin{aligned} L := \frac{1}{2}\|w\|^2 &+ C \sum_{i=1}^{n} (\xi_i + \xi_i^*) - \sum_{i=1}^{n} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\ &- \sum_{i=1}^{n} \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) \\ &- \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b) \end{aligned} \tag{2.8}$$

with $L$ being the Lagrangian and $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ the Lagrange multipliers. Thus they have to satisfy the constraints

$$\eta_i, \eta_i^*, \alpha_i, \alpha_i^* \ge 0 \tag{2.9}$$
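The $\varepsilon$-insensitive loss of eq. 2.7 can be stated in a few lines of code. A minimal sketch (the class and method names are mine, for illustration only):

```java
// Sketch of the epsilon-insensitive loss |xi|_eps from eq. (2.7):
// deviations up to eps are free, larger ones are penalized linearly.
public class EpsilonLoss {
    static double loss(double residual, double eps) {
        double abs = Math.abs(residual);
        return abs <= eps ? 0.0 : abs - eps;
    }

    public static void main(String[] args) {
        double eps = 0.5;
        System.out.println(loss(0.3, eps));  // inside the tube: 0.0
        System.out.println(loss(-1.2, eps)); // outside the tube: 0.7
    }
}
```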
Figure 2.1: The image shows the use of $\xi$ and $\varepsilon$ in a support vector regression. Data points with a distance smaller than $\varepsilon$ are not considered an error. For data points with a distance larger than $\varepsilon$ the parameter $\xi$ decides whether they are tolerated or not.

To gain an optimal result one can infer from the saddle point condition that the partial derivatives of $L$ have to vanish:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0, \qquad \frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0, \qquad \frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0 \tag{2.10}$$

Substituting eq. 2.10 into eq. 2.8 leads to the dual optimization problem:

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C] \end{aligned} \tag{2.11}$$

Having already eliminated $\eta_i, \eta_i^*$ via $\eta_i^{(*)} = C - \alpha_i^{(*)}$, we can further reformulate eq. 2.8 so that

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) x_i, \quad \text{thus} \quad f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b. \tag{2.12}$$

The fact that the data $x_i$ only contribute in the form of dot products allows the introduction of kernel functions in such a way that

$$k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{2.13}$$

This allows the prediction of unknown data points via

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) k(x_i, x) + b \tag{2.14}$$
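Eq. 2.14 means that prediction requires only kernel evaluations against the training points, never the feature map itself. A hypothetical sketch (the RBF kernel, coefficients and data below are arbitrary illustration values, not from this work):

```java
// Sketch of the SVR prediction function f(x) = sum_i (a_i - a_i^*) k(x_i, x) + b
// from eq. (2.14). Training points, coefficients and the kernel bandwidth are
// made-up illustration values.
public class SvrPredict {
    // Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    static double rbf(double[] a, double[] b, double gamma) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.exp(-gamma * s);
    }

    // x[i]: training inputs, coef[i] = alpha_i - alpha_i^*
    static double predict(double[][] x, double[] coef, double b, double gamma, double[] query) {
        double f = b;
        for (int i = 0; i < x.length; i++)
            f += coef[i] * rbf(x[i], query, gamma);
        return f;
    }

    public static void main(String[] args) {
        double[][] x = { { 0.0 }, { 1.0 }, { 2.0 } };
        double[] coef = { 0.5, -0.2, 0.8 };  // alpha_i - alpha_i^*
        double f = predict(x, coef, 0.1, 1.0, new double[] { 1.0 });
        System.out.println("f(1.0) = " + f);
    }
}
```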
2.3 Rotation with quaternions

Quaternions are an extension of the complex numbers invented by William Rowan Hamilton in 1843 [Ham66] and formally introduced to computer graphics by the publication of Shoemake [Sho85] [Har94].

Quaternions encode rotations by a set of 4 real numbers (or 2 complex numbers), while a linear representation of a rotation requires a $3 \times 3$ matrix, thus 9 numbers. Further, quaternions occupy a smooth, seamless, isotropic space which is the generalization of the surface of a sphere. This means that one does not need to take special care to avoid singularities (e.g. the gimbal lock, where two rotation axes collapse into one, making the interpolation irreversible).

The four-dimensional space $\mathbb{H}$ is spanned by the real axis and three additional orthogonal axes, spanned by the vectors $\mathbf{i}, \mathbf{j}, \mathbf{k}$ called the principal imaginaries, which obey Hamilton's rule

$$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1 \tag{2.15}$$

where the three-dimensional vectors $\mathbf{i}, \mathbf{j}, \mathbf{k}$ signify

$$\mathbf{i} = (1,0,0), \quad \mathbf{j} = (0,1,0), \quad \mathbf{k} = (0,0,1). \tag{2.16}$$

A quaternion $q = r + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ consists of a real part $r$ and a pure part $x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ and can be written as a scalar and a three-dimensional vector

$$q = (a, \mathbf{b}) \tag{2.17}$$

The sum of two quaternions is given as

$$q_1 + q_2 = (a_1 + a_2,\ \mathbf{b}_1 + \mathbf{b}_2) \tag{2.18}$$

and their product as

$$q_1 q_2 = (a_1 a_2 - \mathbf{b}_1 \cdot \mathbf{b}_2,\ a_1 \mathbf{b}_2 + a_2 \mathbf{b}_1 + \mathbf{b}_1 \times \mathbf{b}_2) \tag{2.19}$$

For a quaternion $q_1$ with unit length (i.e. absolute value 1) and a pure quaternion $q_2$ (i.e. with $a_2 = 0$), the conjugation $q_1 q_2 q_1^c$ rotates $\mathbf{b}_2$ around the axis described by $\mathbf{b}_1$ through the angle $\varphi = 2\cos^{-1}(a_1)$, where $\varphi$ is the desired rotation angle.
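The quaternion product of eq. 2.19 and the conjugation $q x q^c$ used in the next section can be sketched directly. The class below is my own illustration (not code from this work); its main method rotates the vector (1, 0, 0) by 90° about the z-axis:

```java
// Sketch: quaternion product (eq. 2.19) and rotation by conjugation q v q^c.
// A unit quaternion with real part cos(phi/2) and vector part sin(phi/2)*axis
// rotates a pure quaternion (0, v) by the angle phi about that axis.
public class QuaternionDemo {
    final double r, x, y, z; // real part r, pure part (x, y, z)

    QuaternionDemo(double r, double x, double y, double z) {
        this.r = r; this.x = x; this.y = y; this.z = z;
    }

    // Hamilton product: (a1, b1)(a2, b2) = (a1a2 - b1.b2, a1b2 + a2b1 + b1 x b2)
    QuaternionDemo mul(QuaternionDemo q) {
        return new QuaternionDemo(
            r * q.r - x * q.x - y * q.y - z * q.z,
            r * q.x + q.r * x + y * q.z - z * q.y,
            r * q.y + q.r * y + z * q.x - x * q.z,
            r * q.z + q.r * z + x * q.y - y * q.x);
    }

    QuaternionDemo conj() { return new QuaternionDemo(r, -x, -y, -z); }

    // Rotate vector v by angle phi (radians) about the unit axis (ux, uy, uz).
    static double[] rotate(double[] v, double phi, double ux, double uy, double uz) {
        double c = Math.cos(phi / 2), s = Math.sin(phi / 2);
        QuaternionDemo q = new QuaternionDemo(c, s * ux, s * uy, s * uz);
        QuaternionDemo p = new QuaternionDemo(0, v[0], v[1], v[2]); // pure quaternion
        QuaternionDemo rot = q.mul(p).mul(q.conj());
        return new double[] { rot.x, rot.y, rot.z };
    }

    public static void main(String[] args) {
        // 90 degrees about the z-axis maps (1, 0, 0) to (0, 1, 0).
        double[] v = rotate(new double[] { 1, 0, 0 }, Math.PI / 2, 0, 0, 1);
        System.out.printf("(%.3f, %.3f, %.3f)%n", v[0], v[1], v[2]);
    }
}
```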
2.4 RMSD calculation with quaternions

In various cheminformatic situations the problem arises of finding the best superposition of one rigid object onto another, for example to give a similarity measure for two proteins or, in the case of this work, for two conformations of the same molecule. One method is finding the best rotation and translation to minimize the root mean square deviation (RMSD) [Kab76]; examples are given by [Dia76] and [McL72]. A prerequisite for this method is a given assignment of the points matched onto each other. Usually such an assignment is already given (e.g. the canonical atom numbering of two different conformations).

The mathematical problem can then be stated as follows [Cou04]: "given an ordered set of vectors $y_k$ (target) and a second set $x_k$ (model), $1 \le k \le N$, find an orthogonal transformation $U$ and a translation $r$ such that the residual $E$ (weighted by $w_k$)

$$E := \frac{1}{N} \sum_{k=1}^{N} w_k \left| U x_k + r - y_k \right|^2 \tag{2.20}$$

is minimized," where the weight factor $w_k$ allows one to lay the emphasis on certain parts of the structure in question.

While Kabsch's method uses Lagrange multipliers, Mackay proposed a method in 1984 [Mac84] using quaternions to calculate the rotation matrix. One disadvantage of Mackay's method was that, using a linear form of the least square errors, the results could be false where objects had different relative orientations in space. In 1989 Kearsley developed a method solving the non-linear least square error problem with an eigenvalue determination through the use of quaternions [Kea89]. The proof that both Kabsch's and Kearsley's methods lead to the same result was given by Coutsias et al. in 2005 [Cou05].

If $x_k$ and $y_k$ are considered as pure quaternions, with $x_k := (0, \mathbf{x}_k)$ and $x_k^c = -x_k$, the rotation $U(q)$ can be written as

$$(0, U(q)\,\mathbf{x}_k) = q x_k q^c \tag{2.21}$$

and the residual function is transformed using quaternions to

$$E_q = \frac{1}{N} \sum_{k=1}^{N} (q x_k q^c - y_k)(q x_k q^c - y_k)^c \tag{2.22}$$

An expansion and a multiplication by $N$ leads to

$$N E_q = \sum_{k=1}^{N} \left[ (q x_k q^c)(q x_k q^c)^c + y_k y_k^c - (q x_k q^c) y_k^c - y_k (q x_k q^c)^c \right] = \sum_{k=1}^{N} \left[ x_k x_k^c + y_k y_k^c + (q x_k q^c) y_k + y_k (q x_k q^c) \right] \tag{2.23}$$

where the normalization $q q^c = 1$ and the property of pure quaternions $x^c = -x$ have been used. With $q x_k q^c$ and $y_k$ being pure quaternions, and with $a, b$ pure $ab + ba = 2(-\mathbf{a} \cdot \mathbf{b}, 0) = 2([ab]_0, 0)$, the last two terms in eq. 2.23 can be combined as follows:

$$(q x_k q^c) y_k + y_k (q x_k q^c) = 2\left([y_k (q x_k q^c)]_0, 0\right) \tag{2.24}$$

This means that only the 0th component is non-zero. Because of the associativity of the quaternions one can write $y_k (q x_k q^c) = (y_k q x_k) q^c$ and define $z_k := y_k q x_k$, which leads to the 4-vector form of $z_k$, $Z_k$, with $Z_k = A_L(y_k) A_R(x_k) Q$, with $A_L$, $A_R$ defined as follows:

$$A_R(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & p_3 & -p_2 \\ p_2 & -p_3 & p_0 & p_1 \\ p_3 & p_2 & -p_1 & p_0 \end{pmatrix}, \qquad A_L(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & -p_3 & p_2 \\ p_2 & p_3 & p_0 & -p_1 \\ p_3 & -p_2 & p_1 & p_0 \end{pmatrix} \tag{2.25}$$

All together we can write

$$-2\,y_k^T U(q)\,x_k = 2[y_k (q x_k q^c)]_0 = 2[z_k q^c]_0 = 2(z_{k0} q_0 + \mathbf{z}_k \cdot \mathbf{q}) = 2\,Q^T Z_k = 2\,Q^T A_L(y_k) A_R(x_k)\,Q \tag{2.26}$$

followed by the residue

$$N E_q = \sum_{k=1}^{N} \left( |x_k|^2 + |y_k|^2 \right) - 2\,Q^T F Q \tag{2.27}$$

with

$$F := -\sum_{k=1}^{N} A_L(y_k) A_R(x_k) \tag{2.28}$$

leading to the full form of the matrix $F$ in terms of the correlation matrix $R$:

$$F = \begin{pmatrix} R_{11} + R_{22} + R_{33} & R_{23} - R_{32} & R_{31} - R_{13} & R_{12} - R_{21} \\ R_{23} - R_{32} & R_{11} - R_{22} - R_{33} & R_{12} + R_{21} & R_{13} + R_{31} \\ R_{31} - R_{13} & R_{12} + R_{21} & -R_{11} + R_{22} - R_{33} & R_{23} + R_{32} \\ R_{12} - R_{21} & R_{13} + R_{31} & R_{23} + R_{32} & -R_{11} - R_{22} + R_{33} \end{pmatrix} \tag{2.29}$$

In this way the problem can be reduced to finding the extremum of the quadratic form $Q^T F Q$ for the four variables $q_i$, $i \in \{0, 1, 2, 3\}$, subject to the constraint $Q^T Q = 1$. Here $Q^T F Q$ is the standard Rayleigh quotient for the symmetric matrix $F$, whose maximum value is equal to its largest eigenvalue, which leads to the eigenvalue problem

$$F Q = \lambda Q \tag{2.30}$$

which in turn leads to the following expression for the best RMSD value:

$$e_q = \sqrt{\min_{\|q\|=1} E_q} = \sqrt{\frac{\sum_{k=1}^{N} \left( |x_k|^2 + |y_k|^2 \right) - 2\lambda_{\max}}{N}} \tag{2.31}$$
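The matrix $F$ of eq. 2.29 can be assembled mechanically from a $3 \times 3$ correlation matrix $R$. A sketch (my own naming; the thesis does not show this code) that also illustrates two properties visible in eq. 2.29, symmetry and zero trace:

```java
// Sketch: assemble the 4x4 matrix F of eq. (2.29) from a 3x3 correlation
// matrix R (0-based indices: R[0][0] corresponds to R11). The largest
// eigenvalue of F then yields the optimal RMSD via eq. (2.31).
public class QuaternionRmsd {
    static double[][] buildF(double[][] R) {
        return new double[][] {
            { R[0][0] + R[1][1] + R[2][2], R[1][2] - R[2][1], R[2][0] - R[0][2], R[0][1] - R[1][0] },
            { R[1][2] - R[2][1], R[0][0] - R[1][1] - R[2][2], R[0][1] + R[1][0], R[0][2] + R[2][0] },
            { R[2][0] - R[0][2], R[0][1] + R[1][0], -R[0][0] + R[1][1] - R[2][2], R[1][2] + R[2][1] },
            { R[0][1] - R[1][0], R[0][2] + R[2][0], R[1][2] + R[2][1], -R[0][0] - R[1][1] + R[2][2] }
        };
    }

    public static void main(String[] args) {
        double[][] R = { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 10 } };
        double[][] F = buildF(R);
        // F is symmetric and traceless by construction, as eq. (2.29) shows.
        double trace = F[0][0] + F[1][1] + F[2][2] + F[3][3];
        System.out.println("trace = " + trace);
    }
}
```

Because $F$ is symmetric, its largest eigenvalue can be found with any standard symmetric eigensolver (or a simple power iteration after shifting).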
2.5 Genetic Algorithm

In cheminformatics one often encounters optimization problems with several variable parameters. Traditional optimization methods such as steepest descent often fail at this task because they tend to run into a local optimum. To get around this problem, John Holland developed the class of Genetic Algorithms (GAs) at the University of Michigan during the 1960s and 70s [Hol75].

Genetic algorithms belong to the class of stochastic search methods. Their distinctive feature is that, instead of operating on a single solution like most other stochastic search methods, they operate on a whole set of solutions. The term Genetic Algorithm is a tribute to their basic operations, which derive from natural evolutionary processes such as inheritance, mutation, selection, and crossover.

Given a problem P with parameters x_1, ..., x_n, the first step is to initialize a first set of solutions, called the population M(0). Each single solution is called an individual m and is represented by a bit string called a chromosome (see figure 2.2). The initial value of each parameter is chosen at random within its predefined range.

Figure 2.2: This figure shows two individuals with parameters x_1, ..., x_4 encoded as a series of binary representations of different length.

The second step is to evaluate each individual (i.e. solution) in the current population M(t) for its fitness. This is done by applying the individual's parameter values to a fitness function (which in most cases is the initial problem function) and assigning the function result as fitness value u(m). This means that the parameters of individuals with higher fitness values lead to a better result of the problem function.

The third step is to assign each of the current individuals a selection probability p(m) which depends on the individual's fitness value u(m). This selection probability determines whether an individual is chosen for mating. There are several methods of assigning selection probabilities, such as roulette wheel selection (the likelihood of picking an individual is proportional to the individual's score), tournament selection (a number of individuals are picked using roulette wheel selection, then the best of these are chosen for mating), and rank selection (individuals are picked with a probability based on their fitness rank). It is important not to use a method which always picks the individuals with the best fitness, because then the population will quickly converge to these individuals, narrowing the search space.

The fourth step is to generate a new population M(t + 1), using the individuals selected in step three to produce offspring by applying the already mentioned genetic operators mutation and crossover with a predefined probability (see figures 2.3(a) and 2.3(b) for the genetic operators).

(a) Mutation of the fifth bit from 0 to 1

(b) Crossover after the third bit

Figure 2.3: This figure shows the two genetic operators mutation and crossover. These allow new individuals to be generated from already existing ones and introduce new sets of parameters with possibly better fitness values.

Steps two to four are then repeated until one of three conditions occurs: the best fitness in the current population reaches a given limit, the best fitness does not increase over several predefined generations, or steps two to four have been repeated a specific number of times.
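Steps one to four can be condensed into a short program. The sketch below, a toy GA maximizing the number of set bits in a 16-bit chromosome with roulette wheel selection, one-point crossover, bit-flip mutation and elitism, only illustrates the loop structure described above; all parameter values are arbitrary, and it is not the GA used in this work:

```java
import java.util.Random;

// Toy genetic algorithm: maximize the number of 1-bits in a 16-bit chromosome.
// Illustrates the loop of section 2.5: evaluate, select, crossover, mutate.
public class ToyGa {
    static final Random RNG = new Random(42);
    static final int BITS = 16, POP = 20, GENERATIONS = 60;
    static final double MUTATION_RATE = 0.02;

    static int fitness(boolean[] m) {
        int u = 0;
        for (boolean bit : m) if (bit) u++;
        return u;
    }

    // Roulette-wheel selection: probability proportional to fitness.
    static boolean[] select(boolean[][] pop) {
        int total = 0;
        for (boolean[] m : pop) total += fitness(m);
        int r = RNG.nextInt(total + 1);
        for (boolean[] m : pop) {
            r -= fitness(m);
            if (r <= 0) return m;
        }
        return pop[pop.length - 1];
    }

    static boolean[] offspring(boolean[] a, boolean[] b) {
        boolean[] child = new boolean[BITS];
        int cut = RNG.nextInt(BITS); // one-point crossover
        for (int i = 0; i < BITS; i++) {
            child[i] = (i < cut ? a[i] : b[i]);
            if (RNG.nextDouble() < MUTATION_RATE) child[i] = !child[i]; // mutation
        }
        return child;
    }

    static boolean[] run() {
        boolean[][] pop = new boolean[POP][BITS];
        for (boolean[] m : pop)                       // step 1: random initialization
            for (int i = 0; i < BITS; i++) m[i] = RNG.nextBoolean();
        for (int t = 0; t < GENERATIONS; t++) {
            boolean[] best = pop[0];                  // step 2: evaluate fitness
            for (boolean[] m : pop) if (fitness(m) > fitness(best)) best = m;
            boolean[][] next = new boolean[POP][];
            next[0] = best;                           // elitism: keep the best
            for (int i = 1; i < POP; i++)             // steps 3+4: select and breed
                next[i] = offspring(select(pop), select(pop));
            pop = next;
        }
        boolean[] best = pop[0];
        for (boolean[] m : pop) if (fitness(m) > fitness(best)) best = m;
        return best;
    }

    public static void main(String[] args) {
        System.out.println("best fitness = " + fitness(run()));
    }
}
```

With elitism the best fitness is monotonically non-decreasing over the generations, which corresponds to the stopping criteria above (limit reached, stagnation, or a fixed number of iterations).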
2.6 Quantitative Structure-Activity Relationship

For the development of a new drug it is important to know not only its chemical formula but also its conformation. The underlying principle for this is the so-called lock and key principle postulated by Emil Fischer in 1894 [Fis94], stating that an active compound has to be spatially complementary to its target to form a complex. But as we know today, there are several other factors that influence the formation of an active complex. These can be direct features of the molecules, like hydrophobicity, partial atomic charge, binding sites etc., or there can be influences from the surrounding solution (e.g. water), so that a ligand changes its conformation in the binding process. These considerations led to the expansion of the lock and key principle to the induced fit theory in 1958 [Kos58] [Kos94].

Figure 2.4: This figure shows a Thrombin-Hirudin complex, the Hirudin (magenta) being the key to the Thrombin (blue) lock.

Also new to this theory was the introduction of flexible binding sites, which can account for differences in specificity and affinity. This leads to the conclusion that the biological activity is a direct function of the ligand's three-dimensional structure, which in turn is the fundamental premise for the quantitative structure-activity relationship (QSAR) [SOW04]. QSAR methods attempt to represent the relationship between structural attributes of molecules and their biological activity. In the beginning, QSAR models were used to retrospectively analyze the activity modulation of molecules in a specific subset. But in the last decade QSAR models have been increasingly used for predictions on novel derivatives of well-known ligands [Eki04]. To be applicable to such a use, the applied QSAR models must be able to generalize and predict activities correctly beyond the chemical space defined by the given training data.

To that end, a large number of methods has been described in the literature since the beginning of research on QSAR. The early methods implemented only 2D features of molecules (e.g. the connection table of a molecule), while newer ones often include 3D features like the chemical properties of molecules in their bioactive conformation [SJ93] [OW91].
3 Materials and Methods
|
||
|
||
In this chapter the two main strategies applied to the problem and the overall process will be explained in detail and their function exemplified. The parameters used for the experiments and their progression will be given. The implementation of the algorithms and the use of external programs and code will be described. All algorithms were written in Java.

3.1 Overall process

Because this work consists of a concatenation of different machine learning and chemoinformatical methods, I will first give an overview of the whole process and then explain the individual methods in depth.

The aim was to see whether the best models for an activity prediction included the actual active structures of the given molecules or whether a better model could be found without them.

In this work I used two different approaches. The first was to precompute a set of conformers for each molecule, maximizing the coverage of the conformer space; the second was to create random new conformers during the optimization process. From the set of precomputed conformers, 100 (or the maximum available if fewer than 100) were chosen per molecule, equally distributed over the calculated relative energy range, and used as the training set. In both approaches the optimization was done by a genetic algorithm. The deciding factors for using a heuristic (in this case the genetic algorithm) were that a full search of the optimization space is not feasible for 100 molecules, each with at least 100 conformations, and that the solution hyperplane is very jagged, with no information about a starting point available.

The information about the molecules' conformations was encoded in the GA's genes, either as a direct reference to the whole conformation (in the precomputed approach) or as single dihedral angles for each rotatable bond in each molecule (in the implicit approach).
After each generation of the GA the fitness of its individuals was calculated. Each individual corresponded to a set of conformers, for which a kernel matrix was computed using one of the following kernel methods:

• Probability Product Kernel (PPK)

• Radial Basis Function (RBF) kernel

• Atom Pair Kernel (APK)

The first two work on the RDF of a given molecule, while the third works directly on the 3D model of the conformation. Each kernel matrix therefore consisted of similarity measures between the molecules, and for each molecule the pKi value was known. This information was used to build an SVR model predicting the activity of an unknown molecule in relation to its similarity to the molecules in the training set. For each model a set of best parameters was searched using 5 repetitions of leave-one-out convoluted with a 5-fold cross-validation. These best parameters were used to compute the MSE of the model, which in turn served as the fitness value for each individual.

The next generation of individuals in the GA was then generated using standard GA operators such as mutation and crossover, with the individuals selected to mate for the next generation according to their fitness value.
(a) In the first approach the conformer sampling was done before the optimization process

(b) In the second approach the conformer sampling was done implicitly as part of the optimization process by mutating the conformers

Figure 3.1: These two figures show the different procedures of the two approaches. Both consist of four frameworks indicated by the different colors: the conformer generation (either with MacroModel or implicit), the GA (with JavaEva2), which runs the optimization loop, the kernel matrix computation and the SVR modeling (with libsvm).
The process of generating new generations, calculating each kernel matrix and evaluating via the SVR was then repeated 200 times. The development of the MSE for each individual and of the RMSD between the conformations of the individual and the known active structure was recorded.

3.2 Radial Distribution Function

An important prerequisite for the computation of active structures with respect to the different conformations is keeping some kind of knowledge about the 3D structure of the molecules throughout the whole process. Therefore a molecular representation is needed that guarantees 3D sensitivity. This imposes some prerequisites on a structure code:

• independence from the number of atoms, i.e. the size of the molecule,

• unambiguity regarding the three-dimensional arrangement of the atoms and

• invariance against translation and rotation of the entire molecule
(a) Overlay of three different conformations of the same molecule

(b) The RDF for the three molecules shown on the left

Figure 3.2: These figures show an overlay of three conformations of the same Thrombin inhibitor and their RDF. While the internal distances of the ring systems stay the same (i.e. the peaks representing the ring systems at r ≈ 1.5 and r ≈ 2.6 overlap for all three molecules), their relative spatial positions vary (i.e. the peaks representing the distances of the ring systems among themselves at r ≈ 6 to r ≈ 12 are set off).

One method that meets all of the above requirements, and which I used in this work, is a derivation of the 3D-Molecule Representation based on Electron diffraction (3D-MoRSE) [Sch96] [Sel97]: the radial distribution function [Gas96] [Gas97]. In general this function gives the probability of finding a pair of atoms with similar properties in the given molecule at distance r to each other:

$$g(r) = f \sum_{i}^{N-1} \sum_{j>i}^{N} A_i A_j \, e^{-B(r - r_{ij})^2} \qquad (3.1)$$

where f is a scaling factor and N is the number of atoms. The exponential term consists of the distance r_ij between two atoms i, j and the smoothing factor B for the probability distribution, which will be explained later. A_i and A_j are the characteristic atom properties. The properties used in this work are standard properties of the JoeLib2 framework, for example:
• Electro-topological state

• Electronegativity (Pauling)

• Partial charge

• Atom mass

• Electron affinity

• Intrinsic state

• Free electron count

• Hybridisation

• Van-der-Waals volume

• Heavy atom valence

• Electrogeometrical state

• Implicit valence
This distribution function makes it possible to embed a lot of additional information, e.g. bond distances, ring types, planar and non-planar systems and atom types, all of which are important in calculating the similarity of two molecules or, as in this case, the similarity of two conformers of the same molecule.

An important factor in using the radial distribution function is the resolution of the 3D model of the molecule on which the formula is applied. Using exact distances stands in contrast to physical reality and further restricts any ability to interpolate for better results. Moreover, if one wants to compute the similarity of two conformers using paired atomic distances, a certain amount of fuzziness is necessary to account for flexibility and errors in the initial measurement. Therefore the width of the peaks in the radial distribution function is determined by the factor B. As an approximation, the value of B can be given as a relation between B and the chosen step size ∆r [Hem99] by

$$B \simeq (\Delta r)^{-2} \qquad (3.2)$$

In this work I started with a value of B = 1000 for my computations. But on realizing that even slight changes had a large effect on similarity values, I successively lowered it to a value of B = 10, where only rotations of whole ring systems had a noticeable effect on similarity. The step size ∆r was always set to ∆r = 0.1 Å.
Implementation

In this implementation the function was internally represented by a vector of double values, each representing the value of the RDF at a point g(r) with r ∈ 0.1 · ℕ. The length of the vector, and therefore the range of the function with y values ≥ 0, was predetermined by measuring the longest atom pair distance over all molecules in the dataset and adding 2 Å as a security margin. The preceding scaling factor f was not used (i.e. it was always set to f = 1).
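The vector computation described above can be sketched as follows. This is a minimal illustration of equation (3.1) with f = 1; the class name and the flat coordinate/property arrays are hypothetical, not the thesis code:

```java
// Sketch of the RDF vector of eq. (3.1) with f = 1 and step size 0.1 Å.
public class RdfSketch {
    /** g(r) sampled at r = 0, 0.1, ..., (len-1)*0.1 for atoms given as (x,y,z) triples. */
    public static double[] rdf(double[][] pos, double[] prop, double b, int len) {
        double[] g = new double[len];
        for (int i = 0; i < pos.length - 1; i++) {
            for (int j = i + 1; j < pos.length; j++) {
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                double rij = Math.sqrt(dx * dx + dy * dy + dz * dz);
                // add the Gaussian-smoothed contribution of the pair (i, j) at each grid point
                for (int k = 0; k < len; k++) {
                    double r = 0.1 * k;
                    g[k] += prop[i] * prop[j] * Math.exp(-b * (r - rij) * (r - rij));
                }
            }
        }
        return g;
    }
}
```

For the real descriptor, `prop` would hold one of the JoeLib2 atom properties listed above and `b` would be the chosen smoothing factor (here B = 10).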
Figure 3.3: This figure shows the overlay of three RDF diagrams of the same molecule with three different values for B: 10, 100 and 1000. One can see that with increasing B the smoothness decreases but the information value increases.

3.3 Kernel

3.3.1 Probability Product Kernel

One of the two methods used in this work to give a 3D sensitive representation of a molecule was the radial distribution function (RDF). This function can be regarded as a distinct distribution of atom pairs in the given molecule.

Typical kernels compute a generalized inner product between two input objects χ and χ′, which is equivalent to applying a mapping function φ to each object and then computing a dot product between φ(χ) and φ(χ′) in a Hilbert space [Jeb04]. This kernel considers the case of the mapping φ(χ) being a probability distribution p(x|χ), restricting the Hilbert space to the space of distributions embedded in it.

In this work the probability distribution p(x|χ) is given by the RDF, which leads to the following definition of the probability product kernel.
Definition. Let p and p′ be probability distributions on a space X and ρ a positive constant. Assume that p^ρ, p′^ρ ∈ L²(X), i.e. that ∫_X p(x)^{2ρ} dx and ∫_X p′(x)^{2ρ} dx are well defined (not infinite). The probability product kernel (PPK) between the distributions p and p′ is defined as

$$k_{prob}(p, p') = \int_X p(x)^{\rho}\, p'(x)^{\rho}\, dx = \langle p^{\rho}, p'^{\rho} \rangle_{L^2}. \qquad (3.3)$$
Furthermore, it is well known that L²(X) is a Hilbert space. Hence the defined kernel is positive definite for any set P of probability distributions over X such that ∫_X p(x)^{2ρ} dx is finite for any p ∈ P.

Implementation

The first idea was to implement the computation of the probability product kernel with the numerical integration of the given RDF functions via Simpson's rule (see figure 3.4):
$$\int_a^b f(x)\,dx \approx \frac{b-a}{6} \left[ f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right]. \qquad (3.4)$$

Figure 3.4: This figure shows the approximation of a function f(x) by a quadratic interpolation P(x).

The RDF was integrated via Simpson's rule in steps of 0.01, which led to an exact calculation of the integral up to the 6th decimal place and also allowed the factor ρ in the PPK formula to be chosen freely.
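Applied piecewise over many small sub-intervals, the three-point rule (3.4) gives the usual composite Simpson scheme; a generic sketch with hypothetical names:

```java
// Composite Simpson integration, the piecewise application of eq. (3.4).
import java.util.function.DoubleUnaryOperator;

public class SimpsonSketch {
    /** Integrates f over [a, b] with n sub-intervals (n is assumed to be even). */
    public static double integrate(DoubleUnaryOperator f, double a, double b, int n) {
        double h = (b - a) / n;
        double sum = f.applyAsDouble(a) + f.applyAsDouble(b);
        for (int i = 1; i < n; i++) {
            // interior points alternate between weight 4 (odd) and weight 2 (even)
            sum += (i % 2 == 0 ? 2 : 4) * f.applyAsDouble(a + i * h);
        }
        return sum * h / 3.0;
    }
}
```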
But the first tests of this implementation showed that the computation of a single kernel value could take up to 10 seconds, resulting in a maximum total of 1.5 hours per kernel matrix. As this was unfeasible due to the enormous amount of computational power needed, I decided to fix the parameter ρ at ρ = 1. With this the kernel takes the form of the expectation of one distribution under the other:

$$k(p, p') = \int p(x)\, p'(x)\, dx = E_p[p'(x)] = E_{p'}[p(x)] \qquad (3.5)$$

This is also called the expected likelihood kernel.
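With ρ = 1 and both RDFs sampled on the same grid, the integral in (3.5) reduces to a step-size-weighted dot product of the two vectors. A sketch under that assumption (not the thesis implementation, which used Simpson integration):

```java
// Expected likelihood kernel (3.5) on two discretized RDF vectors.
public class ExpectedLikelihoodKernel {
    /** Approximates the integral of p*q by a Riemann sum with step size deltaR. */
    public static double kernel(double[] p, double[] q, double deltaR) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += p[i] * q[i];
        }
        return sum * deltaR;
    }
}
```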
3.3.2 Radial Basis Function Kernel

Another method of measuring the similarity between two result vectors A and B of the RDF is the use of a radial basis function. A radial basis function (RBF) kernel, also known as an isotropic stationary kernel [HG04], is defined by a function ψ : [0, ∞) → ℝ such that

$$k(x, x') = \psi(\|x - x'\|) \qquad (3.6)$$

where x, x′ ∈ X and ‖·‖ denotes the Euclidean norm. The use of a special RBF kernel, the Gaussian RBF kernel, has been suggested in [Guy93] with

$$k(x, x') = \exp\!\left( - \frac{\sum_{i=1}^{n} \|x_i - x'_i\|^2}{2\sigma^2} \right) \qquad (3.7)$$

where x_i and x′_i are the single data points in the result vectors of the RDF, and σ defines the width of the sphere surrounding the corresponding training pattern [Cha05].
The issue in implementing this kernel was finding a viable value for the σ parameter in the above formula. Choosing σ too low lets the patterns appear very dissimilar, over-fitting the model and taking away its ability to generalize outside its bounds, while choosing σ too high has the opposite effect, letting the patterns appear very similar and under-fitting the model. So finding an optimal value for σ amounts to finding an acceptable trade-off between over-fitting in dense areas and under-fitting in sparse areas.
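Equation (3.7) on two RDF vectors is short enough to state directly; a sketch with a hypothetical class name:

```java
// Gaussian RBF kernel of eq. (3.7) on two equally sized vectors.
public class RbfKernel {
    public static double kernel(double[] x, double[] y, double sigma) {
        double sq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d; // accumulate the squared Euclidean distance
        }
        return Math.exp(-sq / (2.0 * sigma * sigma));
    }
}
```

Identical vectors give a kernel value of 1; the value decays toward 0 as the distance grows, with σ controlling how fast.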
3.3.3 Atom Pair Kernel

While the preceding kernels were based on an RDF representation, another method to compare the 3D structure of two molecules, or of different conformations of the same molecule, is to represent the molecule as a trie data structure. For that I use a derivative of the optimal assignment of atom pairs [Jah09].

This method is based on a matrix D = (⌊d_ij / b⌋) of binned geometrical distances between the three-dimensional coordinates of atoms i, j, where d_ij are the atomic distances and b is the binning factor. The matrix D is used as a lookup table for the information needed to build a trie containing the geometrical information for all atom pairs from a fixed atom i to any other atom, where a trie is a prefix-based search tree that can be applied to any symbolic pattern with a reading direction.

At the beginning the trie of atom i consists only of the root, labeled with the hash code of the atomic symbol of i. To fill the trie, patterns of the form

$$\left( hash(symbol(i)),\ \lfloor d_{ij}/b \rfloor,\ hash(symbol(j)) \right) \qquad (3.8)$$

are inserted successively as ordered triplets. An example of a local atom pair environment and the corresponding trie is shown in figure 3.5.
Figure 3.5: Binned geometrical distances, spheres and trie. The upper left figure shows the spheres of the binned geometrical distances 1.0, 2.0 and 3.0 Å for the centered carbon atom. The sphere of the binned geometrical distance of 0.0 Å (distances in the range [0.0; 1.0)) is not visualized as an individual sphere because it contains no atoms. The upper right figure illustrates the resulting local atom pair environment of binned geometrical distances. For simplicity, only the distances to non-carbon atoms are displayed. The lower figure visualizes the corresponding trie of geometric atomic distances of the annotated atom in the upper figures. The root and leaves are labeled with the corresponding atom type. The leaves additionally contain the total number of occurrences in the local atom pair environment. [JZ10]

The representation of local atom environments as tries allows the comparison of two local atom environments by comparing the tries. This can be achieved by applying a well known similarity measure like the Tanimoto coefficient

$$T(A, B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B} \qquad (3.9)$$

In this case let L_A, L_B be two sets of local atom pair environments of two molecular graphs A, B and l_Ai ∈ L_A, l_Bj ∈ L_B the tries of the nominal features (atom pair environments) of atoms i, j.
Then the Tanimoto coefficient can be defined as

$$Sim(l_{Ai}, l_{Bj}) = \frac{|l_{Ai} \cap l_{Bj}|}{|l_{Ai} \cup l_{Bj}|} \qquad (3.10)$$
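If each trie is flattened into the set of the triplets it stores, equation (3.10) becomes a plain set Tanimoto. A sketch under that assumption; the string encoding of the triplets is illustrative, not the representation of [Jah09]:

```java
// Set-based Tanimoto of eq. (3.10) over flattened trie contents.
import java.util.HashSet;
import java.util.Set;

public class TrieTanimoto {
    /** |a ∩ b| / |a ∪ b|; two empty tries are treated as identical. */
    public static double sim(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}
```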
Implementation

The implementation used in this work was based on the Chemistry Development Kit (CDK) [Ste03] [Ste06] and implemented by [Jah09].

The single arbitrary parameter b was initially set to b = 0.1 and subsequently set to b = 0.2 to account for errors in the measurement of the crystal structure.

3.4 Dataset

The dataset used in the experiments consisted of two parts: a precompiled set of 88 molecules taken from [Boe99] and a smaller set of 12 molecules compiled for this work. All of the molecules in the dataset were thrombin inhibitors with a known pKi value. However, only the 12 molecules in the compiled dataset had crystallographically determined active structures. The active structures were obtained by taking the crystal structure analysis of thrombin with the respective ligand and extracting the bound ligand from the whole structure.

The first step therefore was to search for all potential thrombin inhibitors in the scPDB 1 [Kel06].

The second step was to find an entry with the identical structural formula in the Binding Database 2 [XG02] for information about pKi values and publications.

The third and final step was to download the crystallographic analysis given by the PDB ID from the Protein Data Bank 3 [Ber77] and to extract the bound ligand with Schrödinger's Maestro program. These 12 ligands will from now on be referenced by their originating PDB ID. They are depicted in figure 3.6 and their data and publications are shown in table 3.1. The 88 precompiled structures were only available as structural formulas, so they had to be converted into a valid 3D conformation. To achieve this they were converted with the CORINA program [Sad94].

Thrombin inhibitors were chosen both for their high flexibility and for the fact that the interactions of inhibitors and thrombin are well investigated, with several well documented studies including crystal structures.
3.5 Conformation Sampling

3.5.1 Precomputed Conformation Sampling

The first strategy pursued was to precompute a set of conformers for all molecules, pick a subset of 100 of these conformers per molecule (or fewer, if fewer than 100 were available) and use the genes in the GA as indices for the conformations to choose from. Therefore a mutation operation in the GA led not only to a single change in the conformation but could lead to a wholly different one.
1http://bioinfo-pharma.u-strasbg.fr/scPDB/
2http://www.bindingdb.org
3http://www.rcsb.org/pdb
(a) 1a4w (b) 1c5n (c) 1ghy (d) 1gj4 (e) 1gj5 (f) 1o2g (g) 1o5g (h) 2zc9 (i) 2zda (j) 2zgx (k) 2zo3 (l) 3dhk

Figure 3.6: These figures show the 12 molecules with known active structure used in this work. They are labeled with the PDB ID they were extracted from.
| PDB ID | pKi value | resolution (in Å) | first published in |
|--------|-----------|-------------------|--------------------|
| 1a4w | 7.796 | 1.80 | [Mat96] |
| 1c5n | 4.699 | 1.50 | [Kat00] |
| 1ghy | 5.071 | 1.85 | [Kat01a] |
| 1gj4 | 4.222 | 1.81 | [Kat01b] |
| 1gj5 | 6.347 | 1.73 | [Kat01b] |
| 1o2g | 6.495 | 1.58 | [Kat03] |
| 1o5g | 4.957 | 1.75 | [Kat04] |
| 2zc9 | 7.327 | 1.58 | [Bau09] |
| 2zda | 8.398 | 1.73 | [Bau09] |
| 2zgx | 6.745 | 1.80 | [Bau09] |
| 2zo3 | 10 | 1.70 | [Bau09] |
| 3dhk | 6.744 | 1.73 | [Bau09] |

Table 3.1: This table gives information on all used molecules for which the crystal structure was known.
The conformations themselves were generated with the ConfGen program [WS10], which is based on the molecular modeling program MacroModel [SI08b].

The first step the program takes is to identify variable features, which are rotatable bonds, flexible ring systems and invertible nitrogens. ConfGen generally identifies a bond as rotatable if the following criteria are met:

• It is a single bond

• It doesn't lie within a ring

• Neither of the atoms connected by the bond is terminal (i.e. has no other bonds to it)

• Neither end of the bond is a CH3, NH2 or NH3+ group

• Neither atom in the bond is bonded to two or three atoms that are all equivalent and are arranged with two- or three-fold rotational symmetry.

Ring conformers are generated using the same template-based facility available in LigPrep [SI08a], Glide [Fri04], MacroModel [SI08b], or Phase [Dix06]. It is designed to generate a complete set of accurate, low energy ring conformations, identifying individual rings with a smallest set of smallest rings (SSSR) method [Zam76]. When a ring system is identified it is compared to a set of 1252 templates to find the most similar template. This template is then used to calculate the relative energies of the ring within the molecule. There are N_ri combinations of ring conformations for a whole molecule:
$$N_{ri} = 2^{N_i} \prod_{r} N_{cr} \qquad (3.11)$$

where N_i is the number of invertible nitrogen atoms, r runs over all flexible ring systems and N_cr is the number of templates selected for each individual ring system.
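Equation (3.11) translates directly into code; a sketch with a hypothetical helper:

```java
// Eq. (3.11): number of ring-conformation combinations for a molecule,
// 2^Ni for the invertible nitrogens times the template count of each flexible ring.
public class RingConformerCount {
    public static long count(int invertibleNitrogens, int[] templatesPerRing) {
        long n = 1L << invertibleNitrogens; // 2^Ni
        for (int ncr : templatesPerRing) {
            n *= ncr;
        }
        return n;
    }
}
```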
Each generated set of ring conformers is then processed as follows. First the potentials of the rotatable bonds connecting the ring systems are calculated using a derivative of OPLS [Jor88] [Jor96], including a quick check of the Lennard-Jones potentials of all atoms on one side of the bond against all on the other side to avoid local Van-der-Waals clashes. Then the potential
| parameter | intermediate | comprehensive |
|-----------|--------------|---------------|
| maximum number of search steps | 1000 | 1000 |
| search steps per rotatable bond | 75 | 75 |
| minimum heavy atom RMSD (Å) for distinct conformer | 1 | 0.5 |
| minimum dihedral angle difference for polar hydrogens (°) | 60 | 60 |
| maximum relative energy for flexible rings (kcal/mol) | 2.39 | 23.9 |
| maximum number of ring conformations per ligand | 16 | 128 |
| maximum number of ring conformations per ring | 8 | 64 |
| maximum relative ConfGen energy (kcal/mol) | 25 | 119.5 |
| energy threshold for periodic torsions (kcal/mol) | 5.74 | 5.74 |
| restraint potentials for weak torsions in MacroModel (kcal/mol) | 239 | 239 |
| restraint potential half width (°) | 10 | 10 |
| suppress hydrogen-bond electrostatics in MacroModel | Yes | Yes |
| maximum relative all-atom energy in MacroModel (kcal/mol) | 25 | 119.5 |

Table 3.2: This table shows the parameters used to generate the two datasets. The intermediate parameter set is more restrictive and almost certainly only picks energetic minima, while the comprehensive parameter set allows the algorithm to pick a conformation lying between two optima.
minima are computed and used to create sets of rotational bonds surrounding the molecular core (i.e. the part of the molecule remaining if every outer rotational bond is severed).

For each combination of ring system conformation, invertible nitrogen atom geometry and rotatable bond dihedral angle minimum, a molecule conformation is compiled and, if the sum of all potential energies relative to the lowest-energy conformation doesn't exceed a preset limit, the conformation is added to the resulting set of conformers.

In this work I used two sets of parameters for the algorithm described above: one restrictive and one permissive. While the restrictive parameter set only generated conformers where the rotatable bonds took up a local minimum energetic state, the permissive one allowed more freedom. Thus conformations were also picked that lie in between local optima, allowing the GA to change more easily from one conformation to another. For the exact parameters used in the conformer sampling see table 3.2.
3.5.2 Implicit Conformation Sampling

The second strategy pursued was not to use precomputed conformation sampling but to generate a new set of conformations from generation to generation in the genetic algorithm. Therefore the encoding of the single individuals in the GA had to be different. In contrast to the optimization over the precomputed conformation set, where each 'gene' represented the conformation ID of a whole molecule, here a 'gene' only represented a single rotatable bond within a molecule. While a mutation of one gene in the precomputed approach meant picking a whole new conformation of the concerned molecule, with possibly every rotatable bond affected, the mutation of a gene in the implicit conformation sampling only meant the alteration of a single rotatable bond. In addition it had to be ensured that the crossover operator didn't cut in the middle of the encoding of a molecule but only at the end of one and the beginning of the other. Doing a crossover in the middle of a molecule could lead to an invalid conformation, because it couldn't be guaranteed that the molecule wouldn't fold back on itself, overlapping one or more atoms.
Figure 3.7: The figure shows an example of a molecule with nine rotatable bonds and the corresponding encoding as a gene for the GA. The denoted angles are the dihedral angles for a unique set of deterministically calculated atoms 'surrounding' the bond.
But before encoding a molecule in the GA one had to know the exact number of rotatable bonds. For that, each bond was inspected and had to meet a list of criteria to count as rotatable. The criteria used were the ones already implemented in the JoeLib2 framework:

• The atom at the beginning of the bond has to have a heavy atom valence of > 1

• The atom at the end of the bond has to have a heavy atom valence of > 1

• The bond order has to be 1

• The bond mustn't lie in a ring system

• The atom at the beginning of the bond mustn't have a hybridization of 1

• The atom at the end of the bond mustn't have a hybridization of 1

If these criteria were met, the bond was added to the molecule's rotatable bond list.
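The criteria above combine into a single predicate. The sketch below works on bare values rather than JoeLib2's own bond objects, whose API differs; the parameter names are illustrative:

```java
// The JoeLib2-style rotatable-bond criteria listed above as one predicate.
public class RotatableBond {
    public static boolean isRotatable(int beginHeavyValence, int endHeavyValence,
                                      int bondOrder, boolean inRing,
                                      int beginHybridization, int endHybridization) {
        return beginHeavyValence > 1      // begin atom bound to more than one heavy atom
            && endHeavyValence > 1        // end atom bound to more than one heavy atom
            && bondOrder == 1             // single bond
            && !inRing                    // not part of a ring system
            && beginHybridization != 1    // begin atom not sp-hybridized
            && endHybridization != 1;     // end atom not sp-hybridized
    }
}
```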
The unit in which the rotations were encoded was 1° (degrees), where degree refers to the dihedral angle. An angle of 0° refers to the original crystallographic conformation.

For the first generation of the GA an initial set of conformations was computed by picking a random value for the dihedral angle of each rotatable bond of each molecule in the dataset. After each mutation occurring in the GA the affected molecule was computed again with the new angle value, where the new value was in reference to the original 0° value and not to the currently applied one.
To compute a rotation around a rotatable bond one has to rotate each atom belonging to one of the two partial graphs formed by splitting the molecular graph at the designated bond. This partial graph was calculated using a stack, adding the beginning atom of the bond and then recursively adding every atom bound to the ones already on the stack (except for the atom at the end of the designated rotatable bond) until no new atoms could be found. This was possible because no bonds were allowed to be rotatable if they lay in a ring system and no molecules with macrocycles were in the dataset.
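The stack-based traversal that collects the atoms to rotate might look like this; the adjacency-list encoding and names are illustrative, and the logic relies on the rotatable bond not lying in a ring, as guaranteed above:

```java
// Collects the side of the molecular graph that will be rotated.
import java.util.ArrayDeque;
import java.util.Deque;

public class BondSideCollector {
    /**
     * Marks all atoms reachable from 'begin' without crossing the bond (begin, end).
     * adjacency[i] lists the neighbour indices of atom i.
     */
    public static boolean[] side(int[][] adjacency, int begin, int end) {
        boolean[] visited = new boolean[adjacency.length];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(begin);
        visited[begin] = true;
        while (!stack.isEmpty()) {
            int atom = stack.pop();
            for (int nb : adjacency[atom]) {
                if (atom == begin && nb == end) {
                    continue; // do not cross the rotatable bond itself
                }
                if (!visited[nb]) {
                    visited[nb] = true;
                    stack.push(nb);
                }
            }
        }
        return visited;
    }
}
```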
The actual rotation was achieved by applying a quaternion to each of the atoms in the partial graph, with the center of the coordinate system being the atom at the beginning of the bond.
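A quaternion rotation of a single atom position about an axis through the origin (i.e. after translating coordinates so the begin atom sits at the origin, with the bond vector as axis) can be sketched as follows. This is the textbook q v q* formula expanded to a rotation matrix, not the thesis implementation:

```java
// Rotation of a point about a unit axis through the origin via a quaternion.
public class QuaternionRotation {
    /** Rotates point p around the unit axis (ux, uy, uz) by 'rad' radians. */
    public static double[] rotate(double[] p, double ux, double uy, double uz, double rad) {
        double s = Math.sin(rad / 2), w = Math.cos(rad / 2);
        double x = ux * s, y = uy * s, z = uz * s; // quaternion (w, x, y, z)
        // the sandwich product q v q* expanded to a rotation matrix application
        double[] r = new double[3];
        r[0] = (w*w + x*x - y*y - z*z) * p[0] + 2*(x*y - w*z) * p[1] + 2*(x*z + w*y) * p[2];
        r[1] = 2*(x*y + w*z) * p[0] + (w*w - x*x + y*y - z*z) * p[1] + 2*(y*z - w*x) * p[2];
        r[2] = 2*(x*z - w*y) * p[0] + 2*(y*z + w*x) * p[1] + (w*w - x*x - y*y + z*z) * p[2];
        return r;
    }
}
```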
(a) molecule in basic conformation

(b) molecule with bond no. 9 rotated by 90°

Figure 3.8: These two figures show an example of a rotation around one rotatable bond.

Both at the initial random initialization of the conformations and at every mutation event it had to be ensured that the generated conformation was valid (i.e. that no atoms or bonds overlap or lie too close to each other). Therefore for every new conformation all pairwise atom distances had to be calculated. The chosen value for the lower bound (gathered by calculating the average minimal distance of non-bonded atoms over the whole original dataset) was 2 Å, while distances of covalently bonded atoms were ignored.
3.6 SVR
In this work I used the libSVM implementation by Chang and Lin [CL01] 4. To compute the MSE a leave-one-out approach was applied: a model of the dataset (i.e. represented by the kernel matrix) was built n times (with n being the size of the dataset), each time with a different data point left out. For these datasets a five-fold cross-validation (inner fold) was run 5 times (inner runs). The inner fold was used to determine the best parameters of the regression (i.e. the values of the parameters ε and c yielding the best performance on the validation dataset). The inner runs were used to find the best model with the just computed best parameters. The best model was then used to predict the currently left out data point. The set of parameters can be seen in table 3.3.
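The outer leave-one-out loop can be sketched as below. The `predict` function stands in for the inner cross-validated libSVM SVR (for testing, a trivial mean predictor can be plugged in); the class and parameter names are hypothetical:

```java
// Skeleton of the outer leave-one-out loop that yields the fitness MSE.
import java.util.function.BiFunction;

public class LeaveOneOutMse {
    /**
     * targets[i] is the pKi of molecule i; predict(trainIndices, leftOutIndex)
     * returns a prediction for the left-out molecule.
     */
    public static double mse(double[] targets, BiFunction<int[], Integer, Double> predict) {
        double err = 0.0;
        int n = targets.length;
        for (int left = 0; left < n; left++) {
            int[] train = new int[n - 1];
            for (int i = 0, k = 0; i < n; i++) {
                if (i != left) train[k++] = i; // all molecules except the left-out one
            }
            double d = predict.apply(train, left) - targets[left];
            err += d * d;
        }
        return err / n;
    }
}
```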
|
||
|
||
| computation method | inner folds | inner repetitions | c begin | c end | ε begin | ε end |
|--------------------|-------------|-------------------|---------|-------|---------|-------|
| leave one out      | 5           | 5                 | -1      | 5     | -7      | -2    |

Table 3.3: This table shows the parameters used to calculate the MSE for the regression on the datasets.
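The nested evaluation scheme above can be sketched as follows. This is a hypothetical reconstruction: the grid values assume the begin/end columns of table 3.3 are base-2 exponents, and `fit` is a placeholder for the libSVM ε-SVR on the precomputed kernel matrix:

```python
import itertools
import random
import statistics

# Assumed parameter grid (base 2 is an assumption, not stated in the text).
C_GRID = [2.0 ** e for e in range(-1, 6)]      # c begin -1 .. c end 5
EPS_GRID = [2.0 ** e for e in range(-7, -1)]   # eps begin -7 .. eps end -2

def loo_mse(data, fit, inner_folds=5, inner_runs=5):
    """Leave-one-out MSE with an inner repeated 5-fold CV picking (c, eps).

    `data` is a list of (x, y) pairs; `fit(train, c, eps)` returns a
    predictor function. Both stand in for the kernel-matrix SVR.
    """
    def inner_cv(train, c, eps):
        errs = []
        for _ in range(inner_runs):            # repeated inner runs
            shuffled = train[:]
            random.shuffle(shuffled)
            for f in range(inner_folds):       # 5-fold split
                val = shuffled[f::inner_folds]
                trn = [p for k, p in enumerate(shuffled) if k % inner_folds != f]
                model = fit(trn, c, eps)
                errs += [(model(x) - y) ** 2 for x, y in val]
        return statistics.mean(errs)

    outer_errors = []
    for i, (x_test, y_test) in enumerate(data):        # leave one out
        train = data[:i] + data[i + 1:]
        c, eps = min(itertools.product(C_GRID, EPS_GRID),
                     key=lambda p: inner_cv(train, *p))
        model = fit(train, c, eps)
        outer_errors.append((model(x_test) - y_test) ** 2)
    return statistics.mean(outer_errors)
```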
4http://www.csie.ntu.edu.tw/~cjlin/libsvm/
4 Results

In this chapter I will present, interpret and discuss the results. The results are divided into precomputed and implicit conformation sampling. Each of those two parts is further split by the kernel methods and parameters used. The results are mostly presented in chronological order, to replicate my line of thought.
Evaluation Method

To explain the values shown in the following diagrams I will give a short explanation for each of them. The meaning of the values remains the same for every 'run' diagram in this work.

MSE

The 'avg MSE' value shown in each diagram is the average MSE value of one generation (i.e. 100 individuals), where MSE is the best Mean Square Error found for each regression. In the corresponding table I will show the respective numerical values. The 'Best individual MSE' relates to the absolute minimum found by at least one individual.
RMSD

The 'avg RMSD' value is the RMSD of the conformations encoded by the current individual for the 12 molecules with known active structure, averaged against their respective active structures. The 'Best individual RMSD' relates to the individual with the lowest such RMSD, averaged over the 12 molecules in my dataset.
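The RMSD between an encoded conformation and its active structure can be computed as below; this sketch assumes the two structures are already superimposed (a full comparison would first align them, e.g. with the Kabsch algorithm):

```python
import math

def rmsd(conf_a, conf_b):
    """Root mean square deviation between two conformations.

    `conf_a` and `conf_b` are equally long lists of (x, y, z) atom
    positions with matching atom order.
    """
    assert len(conf_a) == len(conf_b)
    sq = [math.dist(p, q) ** 2 for p, q in zip(conf_a, conf_b)]
    return math.sqrt(sum(sq) / len(sq))
```

The 'avg RMSD' of an individual is then the mean of this value over the 12 molecules with known active structure.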
5% Quantile

The 5% quantile marks the value below which a point lies in the lowest 5% of all possible values for the average RMSD. Its value is exactly 1.629. This was computed by picking 20000 random combinations from the conformer sets of each molecule and building the average RMSD to their respective active structures. From this normal distribution the p-quantile with p = 0.05 was calculated using the standard formula x(p) = µ + σ · z(p), where µ is the expectation, σ² the variance, and z(0.05) was looked up in the normal distribution table.
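A sketch of this quantile estimate; `sample_avg_rmsd` is a placeholder for drawing one random conformer combination and averaging its RMSDs to the active structures, and the z-value is the tabulated z(0.05) ≈ -1.6449:

```python
import statistics

Z_005 = -1.6449  # z(0.05) from the standard normal table

def quantile_5(sample_avg_rmsd, n=20000):
    """Estimate x(p) = mu + sigma * z(p) for p = 0.05 from random samples.

    `sample_avg_rmsd()` returns the average RMSD of one random
    combination of conformers, assumed to be normally distributed.
    """
    samples = [sample_avg_rmsd() for _ in range(n)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return mu + sigma * Z_005
```

With Python 3.8+ the table lookup could be replaced by `statistics.NormalDist().inv_cdf(0.05)`.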
4.1 Precomputed Conformation Sampling

For the first experiment, the optimization of the QSAR model, I used the dataset described earlier. The dataset initially included all conformers produced with ConfGen, where every molecule had a different number of conformers created. To reduce the combinatorial size I picked the lower of either 100 or the number of conformers originally created. The selection method was to pick the conformers equally distributed over their relative energy to the conformer with the absolute lowest energy, to guarantee an equal distribution over the conformational space of each molecule. Because the output sets for each individual molecule were already sorted by their relative energy I simply had to pick every n-th conformer, where n = number of conformers available / 100.
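This selection step can be sketched as follows (a simplified illustration, not the exact script used):

```python
def select_conformers(conformers, limit=100):
    """Pick at most `limit` conformers from an energy-sorted conformer list.

    Because the ConfGen output is already sorted by relative energy,
    taking every n-th entry spreads the selection evenly over the
    energy range of the molecule.
    """
    if len(conformers) <= limit:
        return list(conformers)
    n = len(conformers) // limit
    return list(conformers[::n])[:limit]
```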
(a) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 01, where the PPK kernel was used with parameter B = 1000

(b) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 02, where the RBF kernel was used with parameters B = 1000, sigma = 100

(c) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 03, where the APK was used with smoothing factor = 0.1

Figure 4.1: These figures show the results of the first three runs. One can see that the optimization works, as the MSE declines, while the average RMSD only declines in Run 03 using the APK and still does not reach the 5% quantile.
| Parameter / Run Nr.   | 01           | 02           | 03           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | PPK          | RBF          | APK          |
| RDF B factor          | 1000         | 1000         | -            |
| RBF Sigma factor      | -            | 100          | -            |
| Smoothing factor      | -            | -            | 0.1          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Start avg. MSE        | 1.256        | 1.191        | 1.039        |
| End avg. MSE          | 0.118        | 0.536        | 0.337        |
| Diff. Start/End       | 1.138        | 0.655        | 0.702        |
| Best avg. MSE         | 0.109        | 0.536        | 0.337        |
| Best individual MSE   | 0.117        | 0.526        | 0.335        |
| Start avg. RMSD       | 1.893        | 1.915        | 1.907        |
| End avg. RMSD         | 1.878        | 1.900        | 1.790        |
| Diff. Start/End       | 0.015        | 0.015        | 0.117        |
| Best avg. RMSD        | 1.822        | 1.808        | 1.724        |
| Best individual RMSD  | 1.525        | 1.505        | 1.465        |

Table 4.1: This table shows the parameters and the results for Run 01, Run 02 and Run 03. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.1 Initial Runs

In the first runs (run 01-03) of the experiment I used the PPK and the RBF kernel on the RDF of the molecules, and the APK, to generate the kernel matrix. The parameters were set to their default values to check the overall function of the optimization (see table 4.1).

The results of these three runs are depicted in figure 4.1. One can see that the basic optimization is functional. The average MSE declines from the values of 1.256, 1.191 and 1.039 to values of 0.118, 0.536 and 0.337. But, while the average RMSD declines slightly in run 03 with the use of the APK, it remains at the same level with the use of the PPK and RBF. Although some isolated individuals get below the 5% quantile mark they are dismissed in the next generation, implying that the individuals with a higher average RMSD result in better models with lower MSEs.

My first considerations on evaluating these results were twofold. Either the use of the large dataset of molecules with unknown active structures impeded the decline of the ones with known active structures, because their weight in the model building process was too large, or the parameters used were not fit for this kind of optimization.

Therefore I consecutively lowered the size of the dataset to 56, 41 and 34 molecules, always including the 12 known active structures, and changed the parameters of the kernels used: the B parameter of the RDF, resulting in a smoother RDF function, and the smoothing factor of the APK, both in the hope of a better generalization. These changes are shown in the next sections.
(a) This figure shows the results for the PPK with B = 1000 on the dataset with 56 molecules

(b) This figure shows the results for the PPK with B = 1000 on the dataset with 41 molecules

(c) This figure shows the results for the PPK with B = 1000 on the dataset with 34 molecules

(d) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 56 molecules

(e) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 41 molecules

(f) This figure shows the results for the RBF kernel with sigma = 100 on the dataset with 34 molecules

Figure 4.2: These figures show the results of the runs with reduced datasets for the PPK and RBF kernel.
| Parameter / Run Nr.   | 04      | 05      | 06      | 07      | 08      | 09      |
|-----------------------|---------|---------|---------|---------|---------|---------|
| Kernel method         | PPK     | PPK     | PPK     | RBF     | RBF     | RBF     |
| RDF B factor          | 1000    | 1000    | 1000    | 1000    | 1000    | 1000    |
| RBF Sigma factor      | -       | -       | -       | 100     | 100     | 100     |
| Mutation Probability  | 0.1     | 0.1     | 0.1     | 0.1     | 0.1     | 0.1     |
| Mutate first 12 only  | no      | no      | no      | no      | no      | no      |
| Conformation Sampling | interm. | interm. | interm. | interm. | interm. | interm. |
| Dataset size          | 56      | 41      | 34      | 56      | 41      | 34      |
| Start avg. MSE        | 1.3470  | 1.4283  | 1.5836  | 1.3432  | 1.4391  | 1.5611  |
| End avg. MSE          | 0.1184  | 0.0917  | 0.1366  | 0.3663  | 0.3422  | 0.4955  |
| Diff. Start/End MSE   | 1.2286  | 1.3366  | 1.447   | 0.9769  | 1.0969  | 1.0656  |
| Best avg. MSE         | 0.1128  | 0.0885  | 0.1343  | 0.3663  | 0.3422  | 0.4897  |
| Best individual MSE   | 0.1091  | 0.0856  | 0.1334  | 0.3578  | 0.3393  | 0.4862  |
| Start avg. RMSD       | 1.8626  | 1.8348  | 1.8924  | 1.8239  | 1.8206  | 1.8717  |
| End avg. RMSD         | 1.7738  | 1.6095  | 1.9664  | 1.7765  | 1.8151  | 1.9408  |
| Diff. Start/End RMSD  | 0.0888  | 0.2253  | -0.074  | 0.0474  | 0.0109  | -0.0691 |
| Best avg. RMSD        | 1.5744  | 1.5977  | 1.5805  | 1.7443  | 1.8120  | 1.8688  |
| Best individual RMSD  | 1.4630  | 1.2913  | 1.5148  | 1.2660  | 1.3552  | 1.2588  |

Table 4.2: This table shows the parameters and the results for Run 04 through Run 09. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.2 Reduced Dataset with PPK and RBF Kernel

To see if reducing the dataset size would yield models with a lower average RMSD I ran the PPK and the RBF on datasets where only every 2nd, 3rd and 4th molecule with unknown active structure was included. The hypothesis was that the overall influence of the known active structures on the model would be higher and, if the general assumption holds that good models consist of good data (i.e. the active structures), the RMSD would be lower.

The results of these runs are shown in figure 4.2 and table 4.2. With the use of the PPK (runs 04-06) the average RMSD gets below the 5% quantile at some point. In runs 04 and 06 the average RMSD gets below the 5% quantile within the first 50 generations but returns to its starting level shortly after and stagnates. In run 05, however, the average RMSD stays at a relatively high value in comparison to runs 04 and 06 but declines to a value below the 5% quantile mark after 50 generations. In addition, the MSE of the best model found for run 05 was the lowest of all three runs with a final value of 0.0917, in contrast to 0.1184 and 0.1366 for runs 04 and 06.

With the use of the RBF kernel (runs 07, 08 and 09) the average RMSD did not get below the 5% quantile in any of the three runs, although run 08 on the dataset with 41 molecules showed a steady decline of the average RMSD similar to the results of run 05. It is noticeable that the initial generation of all three runs consisted of at least one individual with a very low average RMSD, and considering the low starting RMSD, more than one. These individuals, however, were dismissed in the first 25 generations, resulting in an increased average RMSD. Furthermore, the best models found using the RBF kernel had MSE values of 0.3578, 0.3393 and 0.4862, which is in each case more than three times the MSE value of the best PPK model on the corresponding dataset size, where the values are 0.1091, 0.0856 and 0.1334.

Considering this direct comparison of the PPK and the RBF kernel, the PPK shows better results, both for the modeling and for the recovery of the actual active structure.
(a) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 56 molecules

(b) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 41 molecules

(c) This figure shows the results for the APK with smoothing factor = 0.1 on the dataset with 34 molecules

Figure 4.3: These figures show the results of the runs with reduced datasets for the APK. Run 12 is the model with the lowest final average RMSD of all runs.
| Parameter / Run Nr.   | 10           | 11           | 12           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | APK          | APK          | APK          |
| RDF B factor          | -            | -            | -            |
| RBF Sigma factor      | -            | -            | -            |
| Smoothing factor      | 0.1          | 0.1          | 0.1          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size          | 56           | 41           | 34           |
| Start avg. MSE        | 1.0026       | 1.2949       | 1.2255       |
| End avg. MSE          | 0.3420       | 0.5053       | 0.5067       |
| Diff. Start/End MSE   | 0.6606       | 0.7896       | 0.7188       |
| Best avg. MSE         | 0.3420       | 0.5051       | 0.5023       |
| Best individual MSE   | 0.3390       | 0.5028       | 0.5012       |
| Start avg. RMSD       | 1.8399       | 1.8381       | 1.9241       |
| End avg. RMSD         | 1.6799       | 2.0098       | 1.3813       |
| Diff. Start/End RMSD  | 0.16         | -0.1717      | 0.5428       |
| Best avg. RMSD        | 1.5922       | 1.7233       | 1.2618       |
| Best individual RMSD  | 1.4729       | 1.3699       | 1.1870       |

Table 4.3: This table shows the parameters and the results for Run 10, Run 11 and Run 12. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.3 Reduced Dataset with Atom Pair Kernel

The reduction of the dataset had a similar effect on the use of the APK as it had on the PPK. The starting average RMSD was 1.8399, 1.8381 and 1.9241, and while runs 10 and 12 had a considerably lower end RMSD with 1.6799 and 1.3813, the final RMSD of run 11 was 2.0098, which is 0.1717 higher than its start RMSD.

The first and third run, 10 and 12, show a similar development as the earlier runs 04, 05 and 06, with the average RMSD dropping by several percent around generation 50. But in contrast to all previous runs the RMSD of run 12 declines further, giving an indication that the optimization reaches a point where it can drop into several minima, one of them being a model that includes structures more likely to be near the conformation of the active structure.

Further, one can see that for all three kernel methods the overall end MSE value rises with descending dataset size. This can be traced back to the loss of information with decreased dataset size. But both the APK and especially the PPK mostly lead to better models than the RBF does with the full dataset of 100 molecules. The APK and PPK differ in that using the APK leads to models which have a lower RMSD to the active structures but a higher MSE, while the use of the PPK leads to very good models with the lowest MSE of all models created but with higher RMSD values.

Because most of the resulting models using the APK and PPK with the reduced datasets were as good as the ones using the full dataset, due to their equal or lower MSE, I decided to use the reduced dataset for future runs. Since the SVR is contained in O(n³) and the GA is contained in O(n), this measure cut the computation time for a complete run by at least 50%.
(a) This figure shows the results for the PPK with B = 10 on the dataset with 34 molecules

(b) This figure shows the results for the PPK with B = 10 on the dataset with 56 molecules

(c) This figure shows the results for the PPK with B = 100 on the dataset with 56 molecules

(d) This figure shows the results for the PPK with B = 500 on the dataset with 34 molecules

Figure 4.4: These figures show the results of the runs with reduced datasets and altered parameters for the PPK.
| Parameter / Run Nr.   | 13           | 14           | 15           | 16           |
|-----------------------|--------------|--------------|--------------|--------------|
| Kernel method         | PPK          | PPK          | PPK          | PPK          |
| RDF B factor          | 10           | 10           | 100          | 500          |
| RBF Sigma factor      | -            | -            | -            | -            |
| Smoothing factor      | -            | -            | -            | -            |
| Mutation Probability  | 0.1          | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 56           | 56           | 34           |
| Start avg. MSE        | 1.5727       | 1.3344       | 1.3834       | 1.5559       |
| End avg. MSE          | 0.1701       | 0.0853       | 0.0813       | 0.1387       |
| Diff. Start/End MSE   | 1.4026       | 1.2491       | 1.3021       | 1.4172       |
| Best avg. MSE         | 0.1645       | 0.0838       | 0.0777       | 0.1306       |
| Best individual MSE   | 0.1595       | 0.0814       | 0.0763       | 0.1285       |
| Start avg. RMSD       | 1.8585       | 1.8323       | 1.8402       | 1.9009       |
| End avg. RMSD         | 1.8107       | 1.8826       | 1.7181       | 2.0711       |
| Diff. Start/End RMSD  | 0.0478       | -0.0503      | 0.1221       | -0.1702      |
| Best avg. RMSD        | 1.6172       | 1.7662       | 1.7118       | 1.6644       |
| Best individual RMSD  | 1.4992       | 1.3194       | 1.0926       | 1.6265       |

Table 4.4: This table shows the parameters and the results for run 13, run 14, run 15 and run 16. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.4 Alternative Parameters for the Product Probability Kernel

In addition to reducing the dataset I changed the parameters of the PPK and APK. The results for the PPK with the B parameter of the RDF set to 10, 100 and 500, in relation to 1000 in the previous runs, are shown in figure 4.4 and table 4.4. One can see that run 16 still shows the low RMSD values around generation 50, with a strong increase and stagnation afterwards. Runs 13, 14 and 15 also show the decrease of the RMSD around generation 50, but not as strongly as runs with a higher parameter.

The average RMSD values of runs 13 and 15 only decrease slightly, by 0.0478 and 0.1221, from 1.8585 and 1.8402 to 1.8107 and 1.7181, while the average RMSD values of runs 14 and 16 even increase, by 0.0503 and 0.1702, from 1.8323 and 1.9009 to 1.8826 and 2.0711. The MSE, though, reaches the lowest values of all runs, with run 15 at a value of 0.0777 and the second lowest at run 14 with 0.0838.

The fact that runs with a lowered B parameter render the best resulting models can be traced back to the fact that the B parameter describes the 'smoothness' and distinctness of an RDF. With declining B the RDF becomes more of a general description of the respective molecule and its conformation instead of an exact characterization. In this case the presence of distinct chemical groups or pharmacophores and their arrangement relative to each other is more important than their individual orientation. This leads to a better generalization of the model at the cost of the discrimination of the conformations of each molecule.
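The role of B can be illustrated with a simplified radial distribution function. This sketch assumes the common unweighted form g(r) = Σ over atom pairs of exp(-B (r - r_ij)²) and omits the atom-property weighting a full RDF descriptor would carry:

```python
import math

def rdf(coords, r_values, B=10.0):
    """Sample a simplified RDF of one conformation at the given radii.

    Each pair distance r_ij contributes a Gaussian peak around r_ij;
    a smaller B broadens the peaks, i.e. yields a smoother and less
    conformation-specific descriptor.
    """
    n = len(coords)
    dists = [math.dist(coords[i], coords[j])
             for i in range(n) for j in range(i + 1, n)]
    return [sum(math.exp(-B * (r - d) ** 2) for d in dists) for r in r_values]
```

Evaluating the same geometry at an off-peak radius with B = 1 and B = 10 shows the smoothing effect: the smaller B gives the larger (less sharply discriminating) response.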
(a) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 34 molecules

(b) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 41 molecules

(c) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 56 molecules

(d) This figure shows the results for the APK with smoothing factor = 0.2 on the dataset with 100 molecules

Figure 4.5: These figures show the results of the runs with reduced datasets and altered parameters for the APK.
| Parameter / Run Nr.   | 17           | 18           | 19           | 20           |
|-----------------------|--------------|--------------|--------------|--------------|
| Kernel method         | APK          | APK          | APK          | APK          |
| RDF B factor          | -            | -            | -            | -            |
| RBF Sigma factor      | -            | -            | -            | -            |
| Smoothing factor      | 0.2          | 0.2          | 0.2          | 0.2          |
| Mutation Probability  | 0.1          | 0.1          | 0.1          | 0.1          |
| Mutate first 12 only  | no           | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 41           | 56           | 100          |
| Start avg. MSE        | 1.1934       | 1.3013       | 0.9636       | 1.0234       |
| End avg. MSE          | 0.3901       | 0.3813       | 0.2805       | 0.2878       |
| Diff. Start/End MSE   | 0.8033       | 0.92         | 0.6831       | 0.7356       |
| Best avg. MSE         | 0.3857       | 0.3813       | 0.2796       | 0.2878       |
| Best individual MSE   | 0.3830       | 0.3807       | 0.2774       | 0.2864       |
| Start avg. RMSD       | 1.9084       | 1.8337       | 1.8434       | 1.9025       |
| End avg. RMSD         | 1.8879       | 1.7185       | 1.6696       | 1.9787       |
| Diff. Start/End RMSD  | 0.0205       | 0.1152       | 0.1738       | -0.0762      |
| Best avg. RMSD        | 1.7840       | 1.5997       | 1.6470       | 1.8900       |
| Best individual RMSD  | 1.4992       | 1.3194       | 1.0926       | 1.6265       |

Table 4.5: This table shows the parameters and the results for run 17, run 18, run 19 and run 20. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.5 Alternative Parameters for APK

Figure 4.5 and table 4.5 show the results for runs 17, 18, 19 and 20, using the APK with the full and the reduced datasets and a smoothing factor of 0.2. Runs 18 and 19 show the first deep decline of the RMSD at 50 generations, to global minima of the average RMSD at 1.5997 and 1.6470, while run 17 shows a steady decline and run 20 an overall stagnation at an RMSD of approximately 2.0.

As with the alteration of the B parameter of the PPK, setting the smoothing factor to a value of 0.2 for the APK changes the generalization of the model, resulting in lower MSE values than the previous runs using the APK, for all four runs. While the APK only encodes atom types, distances and binding modes, doubling the smoothing factor still retains enough information to fit the model. It further allows the GA to hold more individuals with a wider RMSD range. This can be seen in figure 4.5, with the best individual RMSD values being distinctly lower than the average RMSD values over several generations in all four runs. The average MSE values of the final models were 0.3901, 0.3813, 0.2805 and 0.2878, which is approximately 0.15 below previous runs. But in exchange for the better generalization and lower MSE values the overall RMSD stagnated, with only run 18 showing a decline to a final value of 1.7185, which is still above the 5% quantile.
(a) This figure shows the results for the PPK with B = 10 on 34 molecules and a mutation probability of 0.5

(b) This figure shows the results for the PPK with B = 10 on 56 molecules and a mutation probability of 0.5

(c) This figure shows the results for the APK with smoothing factor = 0.1 on 34 molecules and a mutation probability of 0.5

Figure 4.6: These figures show the results of the runs with reduced datasets and increased mutation probability for the PPK and APK.
| Parameter / Run Nr.   | 21           | 22           | 23           |
|-----------------------|--------------|--------------|--------------|
| Kernel method         | PPK          | PPK          | APK          |
| RDF B factor          | 10           | 10           | -            |
| RBF Sigma factor      | -            | -            | -            |
| Smoothing factor      | -            | -            | 0.1          |
| Mutation Probability  | 0.5          | 0.5          | 0.5          |
| Mutate first 12 only  | no           | no           | no           |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size          | 34           | 56           | 34           |
| Start avg. MSE        | 1.5839       | 1.3608       | 1.2421       |
| End avg. MSE          | 0.1584       | 0.1012       | 0.4708       |
| Diff. Start/End MSE   | 1.4255       | 1.2596       | 0.7713       |
| Best avg. MSE         | 0.132        | 0.0909       | 0.4651       |
| Best individual MSE   | 0.1148       | 0.0736       | 0.4497       |
| Start avg. RMSD       | 1.8889       | 1.8520       | 1.8823       |
| End avg. RMSD         | 1.8699       | 1.9711       | 2.0111       |
| Diff. Start/End RMSD  | 0.0190       | -0.1191      | -0.1288      |
| Best avg. RMSD        | 1.7506       | 1.6712       | 1.7752       |
| Best individual RMSD  | 1.4206       | 1.3559       | 1.4107       |

Table 4.6: This table shows the parameters and the results for Run 21, Run 22 and Run 23. Parameters denoted by '-' are not available for the chosen kernel method.
4.1.6 Increased Mutation Rate

After studying the results of decreasing the dataset and altering the kernel parameters, the next step was to change the parameters and overall process of the GA. The easier of the two changes was to set the mutation probability to 0.5 instead of the standard value of 0.1. The mutation probability defines the rate at which mutations occur during the mating process from one generation to the next. Increasing the mutation probability allows the GA to search in a broader range and increases the chance of the optimization to jump out of a local minimum, but it also decreases the optimization rate and may lead to more diverse results.

As one can see in all three runs depicted in figure 4.6, the increased mutation probability leads to at least one individual in each generation with a significantly lower RMSD than the average. Further noticeable is the fact that the progression of the average and especially the best individual RMSD includes more peaks.

While changing the mutation probability still leads to good models with an average MSE of 0.1584, 0.1012 and 0.4708, in two of the runs the average end RMSD was even 0.1191 and 0.1288 higher than their start RMSD of 1.8520 and 1.8823.
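A sketch of such a per-gene mutation operator, assuming an individual is encoded as one conformer index per molecule (function and argument names are hypothetical):

```python
import random

def mutate(individual, conformer_counts, p=0.1):
    """Mutate an individual encoded as one conformer index per molecule.

    With probability `p` each gene is replaced by a random conformer
    index of the same molecule; p = 0.5 widens the search at the cost
    of a slower convergence of the population.
    """
    return [random.randrange(conformer_counts[m]) if random.random() < p else g
            for m, g in enumerate(individual)]
```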
(a) This figure shows the results for the APK with smoothing factor = 0.2 on 56 molecules with alternative conformation sampling and a mutation probability of 0.1

(b) This figure shows the results for the APK with smoothing factor = 0.2 on 56 molecules with alternative conformation sampling and a mutation probability of 0.1

(c) This figure shows the results for the PPK with B = 10 on 56 molecules with alternative conformation sampling and a mutation probability of 0.5

(d) This figure shows the results for the PPK with B = 10 on 56 molecules with alternative conformation sampling and a mutation probability of 0.5

Figure 4.7: These figures show the results of the runs with reduced datasets of the alternative conformation sampling and increased mutation probability for the PPK.
4.1.7 Alternative Conformation Sampling

A second way of increasing the chance of the optimization to jump out of a local minimum was to change the conformation sampling of the dataset. While the intermediate parameters for the ConfGen algorithm only allow local minima of relative molecular energy, the comprehensive parameters allow ConfGen to output molecules with dihedral angles and flexible ring energies that are not in a local minimum.
| Parameter / Run Nr.   | 24            | 25            | 26            | 27            |
|-----------------------|---------------|---------------|---------------|---------------|
| Kernel method         | APK           | APK           | PPK           | PPK           |
| RDF B factor          | -             | -             | 10            | 10            |
| RBF Sigma factor      | -             | -             | -             | -             |
| Smoothing factor      | 0.2           | 0.2           | -             | -             |
| Mutation Probability  | 0.1           | 0.1           | 0.5           | 0.5           |
| Mutate first 12 only  | no            | no            | no            | no            |
| Dataset size          | 56            | 56            | 56            | 56            |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE        | 0.9265        | 0.9252        | 1.3531        | 1.3671        |
| End avg. MSE          | 0.3775        | 0.3192        | 0.1084        | 0.1191        |
| Diff. Start/End MSE   | 0.549         | 0.606         | 1.2447        | 1.248         |
| Best avg. MSE         | 0.3775        | 0.3192        | 0.1025        | 0.1059        |
| Best individual MSE   | 0.3756        | 0.3169        | 0.0816        | 0.0934        |
| Start avg. RMSD       | 1.7999        | 1.8047        | 1.794         | 1.7862        |
| End avg. RMSD         | 1.7873        | 1.7608        | 1.6181        | 1.7728        |
| Diff. Start/End RMSD  | 0.0126        | 0.0439        | 0.1759        | 0.0134        |
| Best avg. RMSD        | 1.7175        | 1.7385        | 1.5084        | 1.6867        |
| Best individual RMSD  | 1.376         | 1.3512        | 1.2488        | 1.3333        |

Table 4.7: This table shows the parameters and the results for Run 24, Run 25, Run 26 and Run 27. Parameters denoted by '-' are not available for the chosen kernel method.
The thought was to allow the GA to successively get out of local minima, due to the differences in relative molecular energies not being as great as with the intermediate conformation sampling. Therefore the error of a change from one conformation to the one with the nearest relative energy would not be as great for the comprehensive conformation sampling as for the intermediate one.

Another reason for using the comprehensive conformation sampling was that the relative energy of an active structure is not necessarily a local minimum, due to the interaction of the molecule with its target and the solvent. So by allowing non-minimum structures in the dataset I reduced the minimal RMSD between the conformers in the dataset and the active structures.

The results for the experiments with the dataset produced by conformation sampling with comprehensive parameters are shown in figure 4.7 and table 4.7. As one can see in runs 24 and 25, which used the APK and a smoothing factor of 0.2, the first decline of the average RMSD with its subsequent increase between generations 35 and 50 still occurs. But instead of stagnating at the same average level as in most previous runs, the average RMSD declines again in later generations. The final average RMSD, however, was only 0.0126 and 0.0439 lower than the starting average RMSD, at 1.7873 and 1.7608, while the final average MSE of 0.3775 and 0.3192 was better than in most of the previous runs with the APK. Therefore the final models were more precise but still did not include conformations near the active structure.

The results for runs 26 and 27 are also shown in figure 4.7 and table 4.7. Both runs used the PPK and a mutation probability of 0.5. In run 26 the average RMSD almost always lies within the 5% quantile. In run 27 the average RMSD declines to a value of approximately 1.7 and stagnates for the second half of the optimization. The runs have final average RMSD values of 1.6181 and 1.7728 and final average MSE values of 0.1084 and 0.1191.

Figure 4.8: These figures show the results of the runs with reduced datasets of the alternative conformation sampling and the altered mutation operator, which allows mutation only on the conformers of the molecules with known active structure. (a) APK with smoothing factor = 0.1 on 56 molecules; (b) APK with smoothing factor = 0.1 on 34 molecules; (c) PPK with B = 10 on 56 molecules; (d) PPK with B = 10 on 34 molecules (each with alternative conformation sampling and mutation allowed only on the 12 known active structures).

4.1 Precomputed Conformation Sampling

| Parameter / Run Nr. | 28 | 29 | 30 | 31 |
|---|---|---|---|---|
| Kernel method | APK | APK | PPK | PPK |
| RDF B factor | - | - | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | 0.1 | 0.1 | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Dataset size | 56 | 34 | 56 | 34 |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE | 0.856 | 1.1405 | 1.311 | 1.5813 |
| End avg. MSE | 0.682 | 0.9029 | 0.7738 | 0.7589 |
| Diff. Start/End MSE | 0.174 | 0.2376 | 0.5372 | 0.8224 |
| Best avg. MSE | 0.6815 | 0.9022 | 0.7706 | 0.7255 |
| Best individual MSE | 0.6815 | 0.9022 | 0.7656 | 0.7207 |
| Start avg. RMSD | 1.7972 | 1.7881 | 1.7808 | 1.7808 |
| End avg. RMSD | 1.7187 | 1.8774 | 1.7369 | 1.8078 |
| Diff. Start/End RMSD | 0.0785 | -0.0893 | 0.0439 | -0.027 |
| Best avg. RMSD | 1.5975 | 1.727 | 1.7098 | 1.7052 |
| Best individual RMSD | 1.3621 | 1.5028 | 1.3915 | 1.3247 |

Table 4.8: This table shows the parameters and the results for run 28, run 29, run 30 and run 31. Parameters denoted by '-' are not available for the chosen kernel method.

4.1.8 Alternative Mutation Operator

The final change was to alter the mutation operator so that only the conformers of the molecules with known active structures could be mutated during the mating process at the end of a generation. For the remaining molecules, those with unknown active structures, the conformation with the minimal relative energy was fixed. The reason for this change was to reduce the search space of the optimization to the conformations of the molecules with known active structure, thereby increasing the chance of finding a model with a low MSE that includes conformations similar to the active structures and thus has a lower RMSD.
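
This restricted operator can be sketched as follows. It is a minimal illustration, not the thesis implementation: the encoding (an individual as one conformer index per molecule, with the known-active molecules occupying the first positions) and all names are assumptions.

```python
import random

def mutate(individual, conformer_counts, n_known_active=12, p_mut=0.1, rng=random):
    """Return a mutated copy of an individual (one conformer index per
    molecule).  Only the first n_known_active positions -- the molecules
    with known active structure -- may change; all other molecules stay
    fixed at their minimum-energy conformation."""
    child = list(individual)
    for i in range(n_known_active):
        if rng.random() < p_mut:
            # draw a new conformer index for molecule i
            child[i] = rng.randrange(conformer_counts[i])
    return child
```

With p_mut = 1.0 every known-active position is resampled, while every other position is guaranteed to stay untouched, which is exactly the search-space reduction described above.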

The results of the four runs, 28, 29, 30 and 31, with the altered mutation operator are shown in figure 4.8 and table 4.8. One can see that, while the average MSE rapidly declines in the first 25 generations, the average RMSD remains at the same level throughout the whole run for all four runs. Further, the average MSE only reaches values of 0.682 to 0.9029, which is significantly higher than in previous runs, due to the fact that the remaining fixed molecules do not allow a better model.

This means that, even when optimizing only over the generated conformers of the known active structures, the best models found still do not include conformations similar (i.e., with a low RMSD) to those structures. Possible reasons for this are manifold and will be discussed in the next chapter.

Figure 4.9: These figures show the average results of four runs with the APK (smoothing factor = 0.1) on conformers of 34 molecules, created with (a) the intermediate parameter set and (b) the comprehensive parameter set.

| Parameter / Run Nr. | avg. of runs 32-35 | avg. of runs 36-39 |
|---|---|---|
| Kernel method | APK | APK |
| RDF B factor | - | - |
| RBF Sigma factor | - | - |
| Smoothing factor | 0.1 | 0.1 |
| Mutation Probability | 0.1 | 0.1 |
| Mutate first 12 only | no | no |
| Dataset size | 34 | 34 |
| Conformation Sampling | intermediate | comprehensive |
| Start avg. MSE | 1.2388 | 1.2199 |
| End avg. MSE | 0.5018 | 0.5338 |
| Diff. Start/End MSE | 0.7370 | 0.6862 |
| Best avg. MSE | 0.4999 | 0.5338 |
| Best individual MSE | 0.4965 | 0.5316 |
| Start avg. RMSD | 1.8967 | 1.8022 |
| End avg. RMSD | 1.8979 | 1.8070 |
| Diff. Start/End RMSD | -0.0012 | -0.0048 |
| Best avg. RMSD | 1.8294 | 1.6579 |
| Best individual RMSD | 1.5099 | 1.3405 |

Table 4.9: This table shows the parameters and the results for the averages of runs 32-35 and runs 36-39. Parameters denoted by '-' are not available for the chosen kernel method.

4.1.9 Reruns

The only run resulting in a considerably lower average RMSD than all other runs was run 12, with a final average RMSD of 1.3813 (see figure 4.3 and table 4.3). To check whether this was a random result or whether the constellation of kernel, dataset and parameters leads to models using conformations with a low RMSD to the active structure, I reran the specific parameter set of run 12 four times with each of the two conformation sampling parameter sets, intermediate and comprehensive. The averaged results of these runs are shown in figure 4.9 and table 4.9.

As one can see, the average RMSD stagnates at 1.8, which is also the mean value for the RMSD of all possible combinations of conformers. The decline and immediate return to the mean RMSD between generation 25 and 50 is also visible in both results.

This indicates that run 12 was a random result, with the GA finding a local minimum. With a value of 0.567, the MSE of run 12 is even higher than the average MSE of both sets of four runs, at 0.5018 and 0.5338.

4.2 Implicit Conformation Sampling

The runs of the optimization with implicit conformation sampling were done in parallel to the runs with precomputed conformation sampling. Therefore the results of the runs with precomputed conformation sampling influenced the decisions made for the parameters and the dataset size of the runs with implicit conformation sampling. One run with implicit conformation sampling on the full dataset took up to two weeks on a Xeon quad-core server, which is why there are fewer results for the implicit conformation sampling.

Figure 4.10: These figures show the results of the runs with implicit conformation sampling on the full dataset using the RBF kernel and the PPK. (a) RBF kernel with B = 1000 and sigma = 100; (b) RBF kernel with B = 1000; (c) PPK with B = 10; (d) PPK with B = 10.

| Parameter / Run Nr. | 01i | 02i | 03i | 04i |
|---|---|---|---|---|
| Kernel method | RBF | PPK | PPK | PPK |
| RDF B factor | 1000 | 10 | 10 | 10 |
| RBF Sigma factor | 100 | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.1411 | 1.0375 | 1.1491 | 1.0389 |
| End avg. MSE | 0.5331 | 0.171 | 0.519 | 0.3874 |
| Diff. Start/End MSE | 0.608 | 0.8665 | 0.6301 | 0.6515 |
| Best avg. MSE | 0.5311 | 0.171 | 0.519 | 0.3874 |
| Best individual MSE | 0.5285 | 0.1638 | 0.5142 | 0.3808 |
| Start avg. RMSD | 1.8724 | 1.8826 | 1.9039 | 1.8894 |
| End avg. RMSD | 1.7978 | 1.9955 | 1.8523 | 1.8909 |
| Diff. Start/End RMSD | 0.0746 | -0.1129 | 0.0516 | -0.0015 |
| Best avg. RMSD | 1.6873 | 1.8246 | 1.8425 | 1.8498 |
| Best individual RMSD | 1.5876 | 1.5268 | 1.4942 | 1.6142 |

Table 4.10: This table shows the parameters and the results for run 01i, run 02i, run 03i and run 04i. Parameters denoted by '-' are not available for the chosen kernel method.

4.2.1 Initial Runs

The results for the initial runs, 01i and 02i, are shown in figure 4.10 and table 4.10. Runs 03i and 04i are later runs with the same parameters as run 02i. As one can see, the average MSE is decreasing, which shows that the optimization is functional. But in comparison to the runs with precomputed conformation sampling shown in the preceding section, the average MSE decreases more slowly and has not reached a minimum at the end of the run. This can be assumed because the average MSE is still decreasing between generations 150 and 200 and has not reached an even level. The final average MSE of the runs was 0.5331, 0.1710, 0.5190 and 0.3874, which is in the range of the results for the average MSE from the runs with precomputed conformation sampling.

Further noticeable is the fact that the average RMSD shows almost no change after generation 50 in all four runs, stagnating for many generations. The mean RMSD over all generations of all four runs is 1.859, which is near the overall mean of 1.81 of all possible conformations. In addition to the best individual RMSD, this can be explained by the small chance of a mutation occurring at the rotatable bonds of the molecules with known active structure. This low chance of a mutation is a result of the molecules with known active structure having fewer rotatable bonds than the molecules without known active structure. Therefore the chance of a mutation occurring on a molecule without known active structure is increased relative to the runs with precomputed conformation sampling, where the mutation chances were equally distributed.
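
The size of this bias can be illustrated with a short calculation. Assuming a mutation picks one rotatable bond uniformly over all rotatable bonds encoded in an individual (the bond counts below are purely hypothetical and only chosen to mirror the situation described above), the probability that it lands on a molecule with known active structure is simply the fraction of bonds those molecules contribute:

```python
def known_active_hit_probability(bonds_known, bonds_other):
    """Probability that a uniformly drawn rotatable-bond mutation hits
    one of the molecules with known active structure."""
    total = sum(bonds_known) + sum(bonds_other)
    return sum(bonds_known) / total

# hypothetical counts: 12 fairly rigid known-active molecules with 4
# rotatable bonds each vs. 88 flexible molecules with 8 bonds each
p = known_active_hit_probability([4] * 12, [8] * 88)
print(round(p, 3))  # -> 0.064, i.e. most mutations miss the known-active molecules
```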

In addition, all four runs lack the initial decrease of the average RMSD between generation 25 and 50 seen in the runs with precomputed conformation sampling.

Figure 4.11: These figures show the results of the runs with reduced datasets, implicit conformation sampling and the use of the PPK. (a) First run with the PPK and B = 10 on the reduced dataset of 34 molecules; (b) second run with the PPK and B = 10 on the reduced dataset of 34 molecules; (c) first run with the PPK and B = 10 on the reduced dataset of 56 molecules; (d) second run with the PPK and B = 10 on the reduced dataset of 56 molecules.

| Parameter / Run Nr. | 05i | 06i | 07i | 08i |
|---|---|---|---|---|
| Kernel method | PPK | PPK | PPK | PPK |
| RDF B factor | 10 | 10 | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.687 | 1.73 | 1.5033 | 1.5079 |
| End avg. MSE | 0.6932 | 0.7425 | 0.9929 | 0.9555 |
| Diff. Start/End MSE | 0.9938 | 0.9875 | 0.5104 | 0.5524 |
| Best avg. MSE | 0.673 | 0.7398 | 0.989 | 0.9475 |
| Best individual MSE | 0.6693 | 0.7352 | 0.9827 | 0.9459 |
| Start avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.8546 |
| End avg. RMSD | 2.0149 | 2.0571 | 1.9715 | 1.8685 |
| Diff. Start/End RMSD | -0.1557 | -0.1966 | -0.1228 | -0.0139 |
| Best avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.7359 |
| Best individual RMSD | 1.5489 | 1.5543 | 1.5698 | 1.5687 |

Table 4.11: This table shows the parameters and the results for run 05i, run 06i, run 07i and run 08i. Parameters denoted by '-' are not available for the chosen kernel method.

4.2.2 Reduced Dataset and Fixed Conformation

In runs 05i to 08i I combined several changes. First, I reduced the dataset to 56 and 34 molecules, including the 12 with known active structure. The second change was to fix the conformation of the molecules with unknown active structure to the conformation with the lowest relative energy, allowing mutation only at rotatable bonds of the molecules with known active conformation. Run 07i was interrupted at generation 116 due to a server crash and could not be resumed.

The results for the four runs with this configuration are shown in figure 4.11 and table 4.11. One can see that the average MSE declines faster (within the first 50 generations) than in the previous runs with implicit conformation sampling, to final values of 0.6932, 0.7425, 0.9929 and 0.9555. These higher average MSE values can be explained by the fixed conformations and the resulting missing possibility for optimization.

One can also see the effect of allowing mutation only on the 12 molecules with known conformation: the best individual RMSD is lower than the average RMSD in almost every generation in all four runs. This can be explained by the higher mutation rate on these molecules, resulting in individuals with conformations with a lower RMSD to the active structures.

5 Discussion

The hypothesis that the best achievable models for predicting activity include the active structures of the training molecules cannot be confirmed in this work. The average RMSD over all runs is almost exactly the average RMSD over all possible sets of conformations to the active structure (see figure 5.1). While some runs show an RMSD below the 5% quantile, they are still within the normal distribution and are countered by runs with an RMSD above average. Though models with a low average RMSD were found, the optimization in most cases returned to models with an RMSD near the average value. The reasons for these results can come from two directions: they can be qualified either chemically or mathematically.

One possible reason is that too many factors determining the active structure are missing from the models I created. For example, the solvent of the molecules, in this case water, is not included in the model at all. But as studies have shown, the solvent often has a great influence on the activity and the active structure of a molecule. It can change the molecule's conformation to a more fitting one or even be part of the active site itself by filling the space not occupied by the ligand. Therefore, disregarding the solvent may lead to a model whose feature space does not carry enough information about the active complex.

Figure 5.1: This figure shows the diagram of the mean values over all runs. One can clearly see the average RMSD stagnating at 1.81 over the whole run, while the MSE declines to a value of about 0.4.

A second reason may be that only part of a molecule's conformation is critical for the activity, while another, possibly larger part can take up a random, probably energetically minimal, conformation. However, this would only account for part of the normal distribution of the RMSD values. The 'active' part of the molecule conformations chosen for the model would have a distinctly lower average RMSD to the active conformation, resulting in the overall RMSD being lower than the mean of the normal distribution. For the dataset used in this work, this can be ruled out because all 12 molecules with known active structure are entirely integrated in the active process and have no parts which can take on free conformations.

Another reason for the models consisting of conformations with an average RMSD to the active structure may come from the kernel methods used, in combination with the molecules in the dataset. Regardless of their chemical properties, the molecules in the dataset often consist of several ring systems. These ring systems contribute the same partial results to all kernel values for all different combinations of conformations, due to the fact that they are rigid and do not change partial results between different conformations. To rule this out, one would have to repeat the experiments with a dataset of molecules with fewer rigid parts or with a kernel method that prioritizes longer distances within the conformations.

The most important and apparent reason, though, is based on the principle of the SVR. When optimizing the activity prediction with the activity value being the same for each conformer of a given molecule, it is clear that to achieve maximal generalization the process will pick the conformation that best represents the whole conformational space of the molecule. In the conformational hypersphere this will be a conformation near the center of the sphere, which in this case means a conformation with a maximal 'similarity' to all other conformations of that molecule, or, in other words, a conformation with a minimal average RMSD to all other conformers. In most cases this will not be the active conformation. For the 12 molecules with known active structure used in this work, the average distance (i.e., RMSD) from the active conformation to the center of the conformational hypersphere was 1.81 Å.
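
This selection criterion can be made concrete: among a molecule's conformers, the one favored by the argument above is the one minimizing the average pairwise RMSD, i.e. the medoid of the conformer set. The following is a minimal sketch using a plain coordinate RMSD on already superimposed (n_atoms, 3) coordinate arrays; in practice each pair would first be aligned, e.g. with the Kabsch or quaternion method, and all names here are illustrative.

```python
import numpy as np

def rmsd(a, b):
    """Coordinate RMSD between two (n_atoms, 3) arrays; assumes the
    conformers are already superimposed."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def central_conformer(conformers):
    """Index of the conformer with minimal average RMSD to all other
    conformers -- the 'center' of the conformational hypersphere in
    the sense used above."""
    n = len(conformers)
    avg = [sum(rmsd(conformers[i], conformers[j]) for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    return int(np.argmin(avg))
```

Computing this exactly requires all pairwise RMSDs, which is the O(n^2) cost referred to in the prospects chapter.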

Not regarding the resulting average RMSD values, the PPK yielded the best MSE, with values as low as 0.01, in contrast to the RDF kernel with MSE values only in the range of 0.5 and the APK with MSE values in the range of 0.3. Furthermore, the PPK had the steepest descent of the MSE values, reaching an almost even level at generation 25-50, whereas the APK needed more generations. The resulting models of the implicit conformation sampling were less significant because in most runs the optimization process was not finished. This can be explained by the mutation probability only being set to 0.1 and the number of points being about 10 times as high as with the precomputed conformation sampling. One can see that by reducing the possible mutations, as in the final runs with implicit conformation sampling and fixed conformers, the MSE also decreases more rapidly.

6 Prospects

Following the results of this work, models for activity prediction provide the best results not by using the active structures but by using a conformation with minimal distance to all possible conformations of the respective molecule. To confirm these results, one would have to run further experiments with other kernels and datasets. In these experiments one would have to calculate the distances of the best resulting conformations for activity prediction not only to the active conformation but to all possible conformations, or at least to an equally distributed set over the conformational space. These new models would be expected to provide the best results if they were based on these 'average' conformations and not on the active structures.

If proven correct, one would have to rethink the use of active conformations in 3D QSAR models in favor of more generalized conformations. Further, this would suggest a method of finding a conformation near the center of the conformational hypersphere without calculating all pairwise distances of all possible conformations.
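
One simple candidate for such a method, sketched here purely as an illustration and not proposed in this work, is to estimate each conformer's average RMSD against only k randomly drawn reference conformers instead of all n - 1, reducing the cost from O(n^2) to O(n*k) RMSD evaluations (the RMSD used below assumes superimposed coordinates):

```python
import random
import numpy as np

def rmsd(a, b):
    """Coordinate RMSD of two (n_atoms, 3) arrays (assumed superimposed)."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def approx_central_conformer(conformers, k=10, rng=random):
    """Estimate the conformer nearest the hypersphere center by scoring
    every candidate against only k random reference conformers."""
    refs = rng.sample(range(len(conformers)), min(k, len(conformers)))
    scores = [np.mean([rmsd(c, conformers[j]) for j in refs]) for c in conformers]
    return int(np.argmin(scores))
```

Whether such a sampled estimate is accurate enough for conformer selection would itself have to be validated experimentally.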

If proven wrong, one would have to revise the results of this work and further investigate the reasons why the average RMSD of the best achievable models is constantly the exact RMSD of the active conformations to the middle of the proposed conformational hypersphere. For this, one would have to compile a new set of molecules with known active structures, where the set would include more and diverse active conformations to cover a wider range of the chemical space.

Bibliography
|
||
|
||
[Bau09] L.; Heine-A.; Smolinski M.; Hangauer D.; Klebe G. Baum, B.; Muley. Think
|
||
twice: understanding the high potency of bis(phenyl)methane inhibitors of throm-
|
||
bin. J.Biol.Mol, 391:552–564, 2009.
|
||
|
||
[Ber77]
|
||
|
||
T.F.; Williams-G.J. Meyer E.E. Jr.; Brice M.D.; Rodgers J.R.; Kennard O.; Shi-
|
||
manouchi T.; Tasumi M. Bernstein, F.C.; Koetzle. The protein data bank: A
|
||
J. of Mol. Biol.,
|
||
computer-based archival file for macromolecular structures.
|
||
112:535, 1977.
|
||
|
||
[BGV92] B.E. Boser, I.M. Guyon, and Vapnik V.N. Annual Workshop on Computational
|
||
Learning Theory, chapter Proceedings of the fifth annual workshop on Computa-
|
||
tional learning theory, pages 144–152. ACM, 1992.
|
||
|
||
[Boe99]
|
||
|
||
J.; Klebe G Boehm, M.;Stuerzebecher. Three-dimensional qantitive structure activ-
|
||
ity relationship analyses using comparative molecular file analysis and comparative
|
||
molecular similarity indices analysis to elucidate selectivity differences of inhibitors
|
||
binding to trypsin, thrombin, and factor xa. Journal of Medical Chemistry, 42:458–
|
||
477, 1999.
|
||
|
||
[Cha05] Q Chang. Scaling gaussian rbf kernel width to improve svm classification. Neural
|
||
Networks and Brain, 2005. ICNN&B ’05. International Conference on, pages 19–
|
||
22, 2005.
|
||
|
||
[CL01]
|
||
|
||
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector
|
||
machines, 2001. Software available at http://www.csie.ntu.edu.tw/
|
||
˜cjlin/libsvm.
|
||
|
||
[Cou04] Chaok; Dill Ken A. Coutsias, Evangelos A.; Seok. Using quaternions usings rmsd.
|
||
|
||
J.Comput. Chem, 25:1849–1857, 2004.
|
||
|
||
[Cou05] Chaok; Dill Ken A. Coutsias, Evangelos A.; Seok. Rotational superposition and
|
||
least sequares: the svd and quaternions approach yield identical results. reply to the
|
||
preceeding comment by g. kneller. J. Comput. Chem., 26:1663–1665, 2005.
|
||
|
||
[CV95] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297,
|
||
|
||
1995.
|
||
|
||
[Dia76] R. Diamond. On the comparison of conformations using linear and quadratic trans-
|
||
|
||
formations. Acta Cryst, 32:1–10, 1976.
|
||
|
||
[Dix06] A.; Knoll E.; Rao S.; Schaw D. Friesner R.A. Dixon, S.; Smondyrev. Phase: a new
|
||
engine for pharmacophore perception, 3d qsar model developement and 3d database
|
||
screening. J. Comput.-Aided Mol. Design, 20(10):647–671, 2006.
|
||
|
||
55
|
||
|
||
Bibliography
|
||
|
||
[Eki04]
|
||
|
||
S. Ekins. Predicting undesirable drug interactions with promisscuous proteins in
|
||
silico. Drug Dicovery Today, 9:276–285, 2004.
|
||
|
||
[Fis94]
|
||
|
||
E. Fischer. Einfluss der konfiguration auf die wirkung der enzyme. Berichte der
|
||
deutschen chemischen Gesellschaft, 27:2985–2993, 1894.
|
||
|
||
[Fle89]
|
||
|
||
R. Fletcher. Practical Mehods of Optimization. John Wiley and Sons, New York,
|
||
1989.
|
||
|
||
[Fri04]
|
||
|
||
J.L.; Murphy R.B.; Halgren T.A.; Klicic J.J.; Mainz D.T.; Repasky M.P.; Knoll E.H.
|
||
Shelley M.; Perry J.K.; shaw D.E.; Francis P.; Shenkin P.S. Friesner, R.A.; Banks.
|
||
Glide: a new approach for rapid, accurate doachin and coring. 1- method and as-
|
||
sessment of doching accuracy. J.Med.Chem, 47(7):1739–49, 2004.
|
||
|
||
[Gas96]
|
||
|
||
J.; Schuur J.; Selzer P.; Steinhauer L.; Steinhauer V. Gasteiger, J.; Sadowski. Chem-
|
||
ical information in 3d space. J. Chem. Inf. Comput. Sci ., 36:1030–1037, 1996.
|
||
|
||
[Gas97]
|
||
|
||
J.; Selzer P.; Steinhauer L.; Steinhauer V. Gasteiger, J.; Schuur. Finding the 3d
|
||
structure of a molecule in its ir spectrum. Fresenius J. Anal. Chem., 359:50–55,
|
||
1997.
|
||
|
||
[Guy93] B.; Vapnik V.N. Guyon, I.; Boser. Advances in Neural Information Processing
|
||
Systems, chapter Automtic capactiy tuning of very large VC-dimension classifiers,
|
||
pages 147–155. Morgan Kaufmann, San Mateo, CA, 1993.
|
||
|
||
[Ham66] Sir. Hamilton, William Rowan. Elements of Quaternions. Longmans, Green & Co.,
|
||
|
||
London, 1866.
|
||
|
||
[Han69] C. Hansch. A quantitative approach to biochemical structure-activity relationships.
|
||
|
||
Acc. Chem. Res., 2:232–239, 1969.
|
||
|
||
[Har94] George K.; Kauffman Louis H. Hart, John C.; Francis. Visualizing quaternion rota-
|
||
|
||
tion. Transactions on Graphics, 13:256–276, 1994.
|
||
|
||
[Hem99] Markus C.; Steinhauer V.; Gasteiger J. Hemmer. Deriving the 3d structure of organic
|
||
molecules from their infrared spectra. Vibrational Spectroscopy, 19:151–164, 1999.
|
||
|
||
[HG04] H.Z. Hao and M. Genton. Compactly supported radial basis function kernels. 2004.
|
||
|
||
[Hol75]
|
||
|
||
John H. Holland. Adaptation in Natural and Artificial Systems. Univ. Michigan
|
||
Press., 1975.
|
||
|
||
[Jah09] G.; Fechner N.; Zell A. Jahn, A.; Hinselmann. Optimal assignment methods for
|
||
ligand-based virtual screening. Journal of Chemoinformatics, 1:14, 2009.
|
||
|
||
[Jeb04]
|
||
|
||
R.; Howard A. Jebara, T.; Kondor. Probability product kernels. Journal of Machine
|
||
Learning Research, 5:819–844, 2004.
|
||
|
||
[Jor88]
|
||
|
||
J.T. Jorgensen, T.L.; Tirado-Rives. The opls potential functions for proteins. energy
|
||
minimization for crystals of cyclic peptides and crambin. J.Am.Chem.Soc., 110:165,
|
||
1988.
|
||
|
||
56
|
||
|
||
Bibliography
|
||
|
||
[Jor96]
|
||
|
||
[JZ10]
|
||
|
||
D.S.; Tirado-Rives J. Jorgensen, W.L.; Maxwell. Development and testing of the
|
||
opls all-atom force field on cornformational energetcs and properties of organic liq-
|
||
uids. J.Am.Chem.Soc., 118:11225–11235, 1996.
|
||
|
||
G.; Fechner-N.; Henneges C. Jahn, A.; Hinselmann and A. Zell. Probabilistic model-
|
||
ing of conformational space for 3d machine learning approaches. Mol. Inf., 29:441–
|
||
455, 2010.
|
||
|
||
[Kab76] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta
|
||
|
||
Crystallographica, 32(5)A:922–923, 1976.
|
||
|
||
[Kat00] R.; Luong-C.; Radika K.; Martelli A.; Sprengeler P.A.; Wang J.; Chan H.;
|
||
Wong L Katz, B.A.; Mackman.
|
||
Structural basis for selectivity of a small
|
||
molecule, s1-binding, submicromolar inhibitor of urokinase-type plasminogen ac-
|
||
tivator. Chem.Biol., 7:299–312, 2000.
|
||
|
||
[Kat01a] K.; Luong-C.; Rice M.J.; Mackman R.L.; Sprengeler P.A.; Spencer J.; Hataye J.;
|
||
Janc J.; Link J.; Litvak J.; Rai R.; Rice K.; Sideris S.; Verner E.; Young W. Katz,
|
||
B.A.; Elrod. A novel serine protease inhibition motif involving a multi-centered
|
||
short hydrogen bonding network at the active site. J.Biol.Mol, 307:1451–1486,
|
||
2001.
|
||
|
||
[Kat01b] P.A.; Luong-C.; Verner E.; Elrod K.; Kirtley M.; Janc J.; Spencer J.R.; Breit-
|
||
enbucher J.G.; Hui H.; McGee D.; Allen D.; Martelli A.; Mackman R.L. Katz,
|
||
B.A.; Sprengeler. Engineering inhibitors highly selective for the s1 sites of ser190
|
||
trypsin-like serine protease drug targets. Chem.Biol., 8:1107–1121, 2001.
|
||
|
||
[Kat03] K.; Verner-E.; Mackman R.L.; Luong C.; Shrader W.D.; Sendzik M.; Spencer J.R.;
|
||
Sprengeler P.A.; Kolesnikov A.; Tai V.W.-F.; Hui H.C.; Breitenbucher J.G.; Allen
|
||
D.; Janc J.W. Katz, B.A.; Elrod. Elaborate manifold of short hydrogen bond ar-
|
||
rays mediating binding of active site-directed serine protease inhibitors. J.Biol.Mol,
|
||
329:93–120, 2003.
|
||
|
||
[Kat04] C.; Ho-J.D.; Somoza J.R.; Gjerstad E.; Tang J.; Williams S.R.; Verner E.; Mackman
|
||
R.L.; Young W.B.; Sprengeler P.A.; Chan H.; Mortara K.; Janc J.W.; McGrath M.E.
|
||
Katz, B.A.; Luong. Dissecting and designing inhibitor selectivity determinants at
|
||
the s1 site using an artificial ala190 protease (ala190 upa). J.Biol.Mol, 344:527–
|
||
547, 2004.
|
||
|
||
[Kea89] Simon K. Kearsley. On the orthogonal transformation used for structural compari-
|
||
|
||
son. Acta Crystallographica, 45(2)A:208–210, 1989.
|
||
|
||
[Kel06]
|
||
|
||
P.; Schalon-C.; Bret G.; Foata N.; Rognan D. Kellenberg, E.; Muller. sc-pdb: an
|
||
annotated database of druggable binding sites from the protein data bank. Journal
|
||
of Chemical Information and Modeling, 46(2):717–727, 2006.
|
||
|
||
[Kos58]
|
||
|
||
Jr. Koshland, D. E. Application of a theory of enzyme specificity to protein synthe-
|
||
sis. Proc. Natl. Acad. Sci. U.S.A., 44:98–104, 1958.
|
||
|
||
[Kos94]
|
||
|
||
Jr. Koshland, D. E.
|
||
Angew.Chem.Int.Ed.Engl, 33:2375–2378, 1994.
|
||
|
||
The key and lock theory and the induced fit
|
||
|
||
theory.
|
||
|
||
57
|
||
|
||
Bibliography
|
||
|
||
[Mac84] A. L. Mackay. Quaternion transformation of molecular orientation. Acta Crystallographica Section A, 40(2):165–166, Mar 1984.

[Mat96] J. H. Matthews, R. Krishnan, M. J. Costanzo, B. E. Maryanoff, and A. Tulinsky. Crystal structures of thrombin with thiazole-containing inhibitors: probes of the S1' binding site. Biophys. J., 71:2830–2839, 1996.

[McL72] A. D. McLachlan. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Cryst., 28:656–657, 1972.

[OW91] T. I. Oprea and C. L. Walter. Reviews in Computational Chemistry, chapter Theoretical and practical aspects of three-dimensional quantitative structure-activity relationships, pages 127–182. Wiley-VCH: New York, 1991.

[Sad94] J. Sadowski, M. Wagener, and J. Gasteiger. Corina: automatic generation of high-quality 3D molecular models for application in QSAR. In 10th European Symposium on Structure-Activity Relationships: QSAR and Molecular Modelling, 1994.

[Sch96] J. H. Schuur, P. Selzer, and J. Gasteiger. The coding of the three-dimensional structure of molecules by molecular transforms and its application to structure-spectra correlations and studies of biological activity. J. Chem. Inf. Comput. Sci., 36:334–344, 1996.

[Sel97] P. Selzer, J. H. Schuur, and J. Gasteiger. Software Development in Chemistry 10, volume 10, chapter Simulation of IR Spectra with Neural Networks Using the 3D-MoRSE Code, page 293. Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1997.

[Sew07] Martin Sewell. Kernel methods. Technical report, Department of Computer Science, University College London, 2007.

[Sho85] K. Shoemake. Animating rotation with quaternion curves. Comput. Graph., 19:245–254, 1985.

[SI08a] Schroedinger Inc., New York. LigPrep, V2.1, 2008.

[SI08b] Schroedinger Inc., New York. MacroModel, V9.6, 2008.

[SJ93] M. Stone and P. Jonathan. Statistical thinking and techniques for QSAR related studies. 1. General theory. J. Chemom., 7:455–475, 1993.

[SOW04] Jeffrey J. Sutherland, Lee A. O'Brien, and Donald F. Weaver. A comparison of methods for modeling quantitative structure-activity relationships. J. Med. Chem., 47:5541–5554, 2004.

[Ste03] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, and E. Willighagen. The Chemistry Development Kit (CDK): an open source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43(2):493–500, 2003.

[Ste06] C. Steinbeck, C. Hoppe, S. Kuhn, M. Floris, and R. Guha. Recent developments of the Chemistry Development Kit (CDK): an open source library for chemo- and bioinformatics. Curr. Pharm. Des., 12(17):2111–2120, 2006.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.

[Vap95] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[VC64] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[VC74] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka (Russia), 1974.

[VL63] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[WS10] K. S. Watts, P. Dalal, R. B. Murphy, W. Sherman, R. A. Friesner, and J. C. Shelley. ConfGen: a conformational search method for efficient generation of bioactive conformers. J. Chem. Inf. Model., 50:534–546, 2010.

[XG02] C. Xi, Y. Lin, L. Ming, and K. Gilson. The Binding Database: data management and interface design. Bioinformatics, 18(1):130–139, 2002.

[Zam76] A. Zamora. An algorithm for finding the smallest set of smallest rings. J. Chem. Inf. Comput. Sci., 16(1):40–43, 1976.
List of Figures

1.1 Overlay of Thrombin inhibitors
1.2 QSAR process
2.1 SVR
2.2 GA individuals
2.3 GA mutation operators
2.4 Thrombin-Hirudin complex
3.1 flowchart of the overall process
3.2 examples for RDF
3.3 overlay of RDF functions
3.4 curve approximation
3.5 Atom Pair Kernel
3.6 12 structures with known active conformation
3.7 example for implicit conformation sampling encoding
3.8 example of rotatable bonds
4.1 results of the initial runs
4.2 results of reduced dataset with PPK and RBF
4.3 results of reduced dataset with APK
4.4 results for alternative parameters for the PPK
4.5 results for alternative parameters for the APK
4.6 results of increased mutation rate
4.7 results of alternative conformation sampling
4.8 results of alternative mutation operator
4.9 results for the reruns
4.10 results for initial runs with implicit conformation sampling
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation
5.1 mean over all runs
List of Tables

3.1 table of compiled structures
3.2 table of parameters for conformer generation
3.3 table of parameters for the SVR
4.1 results of the initial runs
4.2 results of reduced dataset with PPK and RBF
4.3 results of reduced dataset with APK
4.4 results for alternative parameters for the PPK
4.5 results for alternative parameters for the APK
4.6 results of increased mutation rate
4.7 results of alternative conformation sampling
4.8 results of alternative mutation operator
4.9 results for the reruns
4.10 results for initial runs with implicit conformation sampling
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation