
EBERHARD-KARLS-UNIVERSITÄT TÜBINGEN
Wilhelm-Schickard-Institut für Informatik
Lehrstuhl Rechnerarchitektur

Diplomarbeit

Active Structure Learning using Genetic Algorithms and Kernel Functions

Christoph Schwörer

Advisors:

Prof. Dr. rer. nat. Andreas Zell
Wilhelm-Schickard-Institut für Informatik

Prof. Dr. rer. nat. Karl-Heinz Wiesmüller
EMC Microcollections GmbH

Begun on: 13th January 2010

Completed on: 12th July 2010

Declaration

I hereby declare that I have written this thesis independently and have used only the cited sources.

Tübingen, 12th July 2010

Christoph Schwörer

Abstract. Current 3D QSAR approaches attempt to build models based not only on 1D or 2D descriptors of molecules like weight, charge or molecular graphs, but also on 3D-sensitive information like the conformation or information about the molecule's surface. A basic assumption in building these 3D QSAR models is that the best results are attained by using the best available data (i.e., the actual active structure). In this work I tried to find such a best achievable 3D QSAR model by optimizing a model over a set of conformations using a genetic algorithm and three different kernel methods. The intent was to see whether the resulting models would include the active structures. For the generation of the sets of conformations I used two different approaches: the first a precomputation of the conformations, the second an implicit generation concurrent with the optimization. The results will show that the models with the best generalization and prediction accuracy in most cases do not include the active conformation, but conformations with a minimal average pairwise distance to all other possible conformations of the respective molecules.

Contents

1 Introduction 1
2 Background Information 4
  2.1 Kernel Functions 4
  2.2 Support Vector Regression 4
  2.3 Rotation with quaternions 7
  2.4 RMSD calculation with quaternions 7
  2.5 Genetic Algorithm 9
  2.6 Quantitative Structure-Activity Relationship 11
3 Materials and Methods 13
  3.1 Overall process 13
  3.2 Radial Distribution Function 15
  3.3 Kernel 17
    3.3.1 Probability Product Kernel 17
    3.3.2 Radial Basis Function Kernel 18
    3.3.3 Atom Pair Kernel 19
  3.4 Dataset 21
  3.5 Conformation Sampling 21
    3.5.1 Precomputed Conformation Sampling 21
    3.5.2 Implicit Conformation Sampling 24
  3.6 SVR 27
4 Results 28
  4.1 Precomputed Conformation Sampling 28
    4.1.1 Initial Runs 31
    4.1.2 Reduced Dataset with PPK and RBF Kernel 33
    4.1.3 Reduced Dataset with Atom Pair Kernel 35
    4.1.4 Alternative Parameters for the Product Probability Kernel 37
    4.1.5 Alternative Parameters for APK 39
    4.1.6 Increased Mutation Rate 41
    4.1.7 Alternative Conformation Sampling 42
    4.1.8 Alternative Mutation Operator 45
    4.1.9 Reruns 47
  4.2 Implicit Conformation Sampling 47
    4.2.1 Initial Runs 49
    4.2.2 Reduced Dataset and Fixed Conformation 51
5 Discussion 52
6 Prospects 54
Bibliography 55

1 Introduction

In drug design one of the major goals is to find new lead structures. Lead structures already show a certain affinity towards the intended target but express unwanted side effects or lack certain properties; for example, they may be toxic or have a low bioavailability. Without a detailed understanding of the biochemical processes responsible for the activity, the search for such a new lead structure is non-trivial.

The usual process is to simply try a huge number of different chemical compounds in vitro and observe their activity. But the combinatorial possibilities of this strategy can explode even for small systems. For instance, the number of compounds needed to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000.

Therefore this classical screening process was automated and combinatorially optimized over the last decades into high throughput screening (HTS), which allows for a systematic search in large databases with hundreds of thousands of entries. But this process still makes up a large share of the development costs and time. Further, the chemical compounds needed for the synthesis are often rare and hard to come by in the purity needed for reliable results.

One way to optimize this exhaustive search and give an indication of the right direction is to develop a model that quantitatively relates variations in biological activity to changes in molecular properties which can easily be obtained for each compound. One of the first to build such a model was Corwin Hansch, correlating lipophilicity and polarity with biological activity in his Hansch method [Han69]. But there exist many other approaches to this Quantitative Structure-Activity Relationship (QSAR) principle, which mostly differ in their use of molecular descriptors and mathematical models such as Partial Least Squares or Principal Component Analysis. The QSAR models developed in this work are based on kernels which are evaluated by Support Vector Regression.

In recent years several models have been developed using 3D descriptors of molecules. These 3D descriptors are important because, to build a model and gain understanding of the binding process, it is not enough to know the individual components and values of a molecule; one also has to know their three-dimensional arrangement. As figure 1.1 shows, a single molecule can take on several conformations. Knowing which of these conformations is the active conformation can improve the modeling process and the understanding of the chemical processes leading to the activity.

One thing all QSAR methods have in common is the basic assumption that the biological activity is an additive function of the molecular properties (2D or 3D) of the substituents and groups of the respective structure. Not only the mere presence of those groups is essential but also their three-dimensional arrangement.

This leads to the expectation that, when using 3D descriptors, only good and correct training data including 3D information of the molecules leads to a good model of the activity. But what if this doesn't hold true? What if a better model can be created without using the actually correct physical data? The question is whether the reverse of this expectation is always valid, i.e. whether the model quality maps one-to-one to the training data quality.


Figure 1.1: This figure shows an overlay of six conformations of the same thrombin inhibitor.

One can see the high flexibility of the lower ring system.

(a) This figure shows the standard process for building a QSAR model.

(b) This figure shows the reverse process of optimizing a QSAR model with respect to the training data quality.

Figure 1.2: These figures show the standard process and the process used in this work to build a QSAR model. To optimize the model quality with respect to the varying training data, the process has been reversed.

To test whether one can find such a model, I will reverse the QSAR approach (see figure 1.2), therefore optimizing the QSAR model to predict activity by altering the training dataset, where for each data point several values are given, including the actual physical ones. While the input data points vary (i.e. their molecular descriptors), their target function value


(i.e. the activity) stays the same.

The intention of this experiment is to see whether the best attainable model includes the actual active structure or an artificially created one. To this end I compiled a dataset and created a set of conformers for each molecule. I then optimized the activity prediction concurrently over the whole training dataset, so as not to favor one molecule over another by successively optimizing one after the other. A good way to handle multidimensional optimization over several datasets is to use a genetic algorithm, which I did in this case.

In this work I will show the methods used for the generation of the dataset, the optimization and the evaluation. Further, I will present the results and discuss their significance. Finally, I will give a perspective on further work which can be done on this topic.


2 Background Information

2.1 Kernel Functions

Detecting linear relations has been the focus of much research in statistics and machine learning for the last few decades, and the resulting algorithms are well understood, well developed and efficient [Sew07]. However, many models of natural processes aren't linear. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the input space X to a new higher-dimensional space called the feature space F and then use a linear model in the feature space. This mapping can be achieved by a non-linear transformation. For example, the function φ can be given as

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3 \quad\text{with}\quad \phi(x_1, x_2) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right) \tag{2.1}$$

While this function is a very simple one, other functions can easily become computationally impracticable for both polynomial features and higher dimensionality. This is grounded in the fact that the number of different monomial features of degree $p$ is $\binom{d+p-1}{p}$, with $d = \dim X$ [Vap95] (e.g. $p = 7$, $d = 28 \cdot 28 = 784$ corresponds to a total of approximately $3.7 \cdot 10^{16}$ features).

The key to an efficient computation is the observation made by [BGV92] that

$$\left\langle \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right),\ \left(x_1'^2,\ \sqrt{2}\,x_1' x_2',\ x_2'^2\right) \right\rangle = \langle x, x' \rangle^2 \tag{2.2}$$

which allows the use of kernel functions, where φ need not be known explicitly as long as the function corresponds to a dot product in the feature space F:

$$k(x, x') := \langle \phi(x), \phi(x') \rangle \tag{2.3}$$
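To make the kernel-trick identity of eq. 2.2 concrete, a small numerical check (an illustrative sketch; the function names are mine, not from the thesis) confirms that the explicit map φ of eq. 2.1 and the squared dot product agree:

```python
# Numerical sanity check of eq. 2.2: the kernel <x, x'>^2 equals the
# dot product of the explicitly mapped features phi(x), phi(x').
import math

def phi(x1, x2):
    """Explicit feature map R^2 -> R^3 from eq. 2.1."""
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel(x, xp):
    """Kernel function k(x, x') = <x, x'>^2 from eq. 2.2/2.3."""
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

x, xp = (1.0, 2.0), (3.0, -1.0)
lhs = sum(a * b for a, b in zip(phi(*x), phi(*xp)))  # <phi(x), phi(x')>
rhs = kernel(x, xp)                                  # <x, x'>^2
print(lhs, rhs)  # both approximately 1.0
```

The kernel evaluation needs only the 2D dot product, while the left-hand side works in the 3D feature space; for high-degree polynomial features this difference is what makes kernel methods tractable.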

2.2 Support Vector Regression

Many multivariate systems assume that there is a linear relation between X and Y which holds for all samples. In chemoinformatics this assumption does not hold true and causes a variety of problems in the prediction of unknown data points. One way to solve these problems is to use non-linear learning methods such as support vector regression (SVR). The support vector algorithm is a non-linear generalization of the Generalized Portrait algorithm developed in Russia in the 1960s [VL63] [VC64]. Its groundwork, the statistical learning theory, or VC theory, has been developed over the last half century by Vapnik and Chervonenkis [VC74] [Vap82] [Vap95]. The VC theory defines properties of learning machines enabling them to generalize to unseen data.

Given a set of training data $\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \chi \times \mathbb{R}$, with χ denoting the space of input patterns (e.g. $\chi = \mathbb{R}^d$), the goal of ε-SV regression is to find a function f(x) with a maximum


deviation of ε from the actually received targets yi for all the training data. In addition f should be as flat as possible. The form of a linear function f is given as

$$f(x) = \langle w, x \rangle + b \quad \text{with } w \in \chi,\ b \in \mathbb{R} \tag{2.4}$$

with ⟨·, ·⟩ denoting the dot product in χ and flatness meaning a small w. To attain this we minimize the Euclidean norm $\frac{1}{2}\|w\|^2$, which can be formally written as a convex optimization problem:

$$\begin{aligned}
&\text{minimize} && \tfrac{1}{2}\|w\|^2 \\
&\text{subject to} && \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon \\ \langle w, x_i \rangle + b - y_i \le \varepsilon \end{cases}
\end{aligned} \tag{2.5}$$

The above formulation is feasible for all problems where a function f actually exists that approximates all pairs $(x_i, y_i)$ with precision ε. If this is not the case, or if we want to allow some errors, according to [CV95] one can introduce slack variables $\xi_i, \xi_i^*$, leading to the formulation:

$$\begin{aligned}
&\text{minimize} && \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
&\text{subject to} && \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}
\end{aligned} \tag{2.6}$$

Here the constant C > 0 defines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This is the same as dealing with an ε-insensitive loss function $|\xi|_\varepsilon$ denoted by:

$$|\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases} \tag{2.7}$$

Figure 2.1 depicts the use of ξ and ε. Extending support vector machines to solve non-linear problems is possible by using a standard dualization approach utilizing Lagrange multipliers as described in [Fle89], leading to the following Lagrangian:

$$\begin{aligned}
L := \tfrac{1}{2}\|w\|^2 &+ C \sum_{i=1}^{n}(\xi_i + \xi_i^*) - \sum_{i=1}^{n}(\eta_i \xi_i + \eta_i^* \xi_i^*) \\
&- \sum_{i=1}^{n} \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) \\
&- \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b)
\end{aligned} \tag{2.8}$$

with L being the Lagrangian and $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ the Lagrange multipliers, which thus have to satisfy the constraints

$$\eta_i, \eta_i^*, \alpha_i, \alpha_i^* \ge 0 \tag{2.9}$$

Figure 2.1: The image shows the use of ξ and ε in a support vector regression. Data points with a distance smaller than ε are not considered an error. For data points with a distance larger than ε the parameter ξ decides whether they are tolerated or not.

To obtain an optimal result one can infer from the saddle point condition that the partial derivatives of L have to vanish:

$$\begin{aligned}
\frac{\partial L}{\partial b} &= \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0 \\
\frac{\partial L}{\partial w} &= w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i = 0 \\
\frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \eta_i = 0 \\
\frac{\partial L}{\partial \xi_i^*} &= C - \alpha_i^* - \eta_i^* = 0
\end{aligned} \tag{2.10}$$

Substituting these conditions into eq. 2.8 leads to the dual optimization problem:

$$\begin{aligned}
&\text{maximize} && -\tfrac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \\
&\text{subject to} && \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]
\end{aligned} \tag{2.11}$$

Having already eliminated $\eta_i, \eta_i^*$ via the reformulation $\eta_i^{(*)} = C - \alpha_i^{(*)}$ of eq. 2.10, it follows that

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i, \quad \text{thus} \quad f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b. \tag{2.12}$$

The fact that the data $x_i$ only contribute in the form of dot products allows the introduction of kernel functions in such a way that

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \tag{2.13}$$

This allows the prediction of unknown data points via

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b \tag{2.14}$$
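The two ingredients above can be illustrated with a short sketch: the ε-insensitive loss of eq. 2.7 and the kernel-expansion predictor of eq. 2.14. The RBF kernel, support points and α coefficients below are made-up illustrations, not trained values or the thesis's actual model:

```python
# Sketch of the eps-insensitive loss (eq. 2.7) and the kernel-expansion
# predictor (eq. 2.14). The alpha coefficients are invented for
# illustration; a real SVR obtains them by solving the dual (eq. 2.11).
import math

def eps_insensitive(xi, eps):
    """|xi|_eps from eq. 2.7: deviations inside the eps tube cost nothing."""
    return max(0.0, abs(xi) - eps)

def rbf(x, xp, gamma=0.5):
    """A one-dimensional RBF kernel k(x, x') = exp(-gamma (x - x')^2)."""
    return math.exp(-gamma * (x - xp) ** 2)

def predict(x, support, alphas, b):
    """f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b  (eq. 2.14)."""
    return sum((a - a_star) * rbf(xi, x)
               for xi, (a, a_star) in zip(support, alphas)) + b

support = [0.0, 1.0, 2.0]                       # hypothetical support vectors
alphas = [(0.8, 0.0), (0.0, 0.3), (0.5, 0.0)]   # pairs (alpha_i, alpha_i*)
y_hat = predict(1.0, support, alphas, b=0.1)

print(eps_insensitive(0.05, eps=0.1))  # 0.0: inside the tube
print(eps_insensitive(0.30, eps=0.1))  # ~0.2: linear cost outside the tube
```

Note how only the support vectors enter the prediction; all other training points have $\alpha_i = \alpha_i^* = 0$ and drop out of the sum.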


2.3 Rotation with quaternions

Quaternions are an extension of the complex numbers invented by William Rowan Hamilton in 1843 [Ham66] and formally introduced to computer graphics by the publication of Shoemake [Sho85] [Har94].

Quaternions encode rotations by a set of 4 real numbers (or 2 complex numbers), while a linear representation of a rotation requires a 3 × 3 matrix, thus 9 numbers. Further, quaternions occupy a smooth, seamless, isotropic space which is the generalization of the surface of a sphere. This means that one doesn't need to take special care to avoid singularities (e.g., the gimbal lock, where two rotation axes collapse into one, making the interpolation irreversible).

The four-dimensional space H is spanned by the real axis and three additional orthogonal axes, spanned by the vectors i, j, k called the principal imaginaries, which obey Hamilton's rule

$$i^2 = j^2 = k^2 = ijk = -1 \tag{2.15}$$

where the three-dimensional vectors i, j, k signify

$$i = (1,0,0) \quad j = (0,1,0) \quad k = (0,0,1). \tag{2.16}$$

A quaternion q = r + xi + yj + zk consists of a real part r and a pure part xi + yj + zk and can be written as a scalar and a three-dimensional vector

$$q = (a, \mathbf{b}) \tag{2.17}$$

The sum of two quaternions is given as

$$q_1 + q_2 = (a_1 + a_2,\ \mathbf{b}_1 + \mathbf{b}_2) \tag{2.18}$$

and their product as

$$q_1 q_2 = (a_1 a_2 - \mathbf{b}_1 \cdot \mathbf{b}_2,\ a_1 \mathbf{b}_2 + a_2 \mathbf{b}_1 + \mathbf{b}_1 \times \mathbf{b}_2) \tag{2.19}$$

where the conjugation $q_1 q_2 q_1^c$ of a unit quaternion $q_1$ (i.e. absolute value 1) with a pure quaternion $q_2$ (i.e. with $a_2 = 0$) causes a rotation of $\mathbf{b}_2$ around the axis described by $\mathbf{b}_1$ by the angle φ, where $a_1 = \cos\frac{1}{2}\varphi$ and φ is the desired rotation angle.
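A minimal sketch of eqs. 2.17–2.19 in Python (function names are mine, not from the thesis): quaternions as (scalar, vector) pairs, with the rotation performed by conjugating the pure quaternion (0, v) with a unit quaternion:

```python
# Quaternion arithmetic per eqs. 2.17-2.19 and rotation by conjugation.
import math

def qmul(q1, q2):
    """Quaternion product of eq. 2.19: (a1 a2 - b1.b2, a1 b2 + a2 b1 + b1 x b2)."""
    a1, b1 = q1
    a2, b2 = q2
    a = a1 * a2 - sum(u * v for u, v in zip(b1, b2))
    cross = (b1[1] * b2[2] - b1[2] * b2[1],
             b1[2] * b2[0] - b1[0] * b2[2],
             b1[0] * b2[1] - b1[1] * b2[0])
    b = tuple(a1 * v + a2 * u + c for u, v, c in zip(b1, b2, cross))
    return a, b

def conj(q):
    """Quaternion conjugate: negate the pure part."""
    a, b = q
    return a, tuple(-x for x in b)

def rotate(v, axis, phi):
    """Rotate v around a unit axis by phi radians via q (0, v) q^c."""
    q = (math.cos(phi / 2), tuple(math.sin(phi / 2) * x for x in axis))
    return qmul(qmul(q, (0.0, v)), conj(q))[1]

# Rotating the x unit vector 90 degrees about z yields the y unit vector.
print(rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2))  # ~ (0, 1, 0)
```

The half-angle in the unit quaternion is exactly the $a_1 = \cos\frac{1}{2}\varphi$ relation stated above.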

2.4 RMSD calculation with quaternions

In various cheminformatic situations the problem arises of finding the best superposition of one rigid object onto another, for example to give a similarity measure for two proteins or, in the case of this work, two conformations of the same molecule. One method is finding the best rotation and translation to minimize the root mean square deviation (RMSD) [Kab76], with examples given by [Dia76] and [McL72]. A prerequisite for this method is a given assignment of the points matched onto each other. Usually such an assignment is already given (e.g., the canonical atom numbering of two different conformations).

The mathematical problem can then be stated as follows: [Cou04]


“given an ordered set of vectors $y_k$ (target) and a second set $x_k$ (model), $1 \le k \le N$, find an orthogonal transformation U and a translation r such that the residual E (weighted by $w_k$)

$$E := \frac{1}{N} \sum_{k=1}^{N} w_k \left| U x_k + r - y_k \right|^2 \tag{2.20}$$

is minimized.” The weight factor $w_k$ allows one to lay emphasis on certain parts of the structure in question.

While Kabsch's method uses Lagrange multipliers, Mackay proposed a method in 1984 [Mac84] using quaternions to calculate the rotation matrix. One disadvantage of Mackay's method was that, using a linear form of the least square errors, the results could be false where objects had different relative orientations in space. In 1989 Kearsley developed a method solving the non-linear least square error problem with an eigenvalue determination through the use of quaternions [Kea89]. The proof that both Kabsch's and Kearsley's methods lead to the same result was given by Coutsias et al. in 2005 [Cou05].

If $x_k$ and $y_k$ are considered as pure quaternions, with $x_k := (0, \mathbf{x}_k)$ and $x_k^c = -x_k$, the rotation U(q) can be written as

$$(0, U(q)\mathbf{x}_k) = q x_k q^c \tag{2.21}$$

and the residual function is transformed using quaternions to

$$E_q = \frac{1}{N} \sum_{k=1}^{N} (q x_k q^c - y_k)(q x_k q^c - y_k)^c \tag{2.22}$$

An expansion and a multiplication by N leads to

$$\begin{aligned}
N E_q &= \sum_{k=1}^{N} \left[ (q x_k q^c)(q x_k q^c)^c + y_k y_k^c - (q x_k q^c)\, y_k^c - y_k (q x_k q^c)^c \right] \\
&= \sum_{k=1}^{N} \left[ x_k x_k^c + y_k y_k^c + (q x_k q^c)\, y_k + y_k (q x_k q^c) \right]
\end{aligned} \tag{2.23}$$

where the normalization $q q^c = 1$ and the property of pure quaternions $x^c = -x$ have been used. With $q x_k q^c$ and $y_k$ being pure quaternions, and since for pure a, b one has $ab + ba = -2(a \cdot b,\ 0) = 2([ab]_0,\ 0)$, the last two terms in eq. 2.23 can be combined as follows

$$(q x_k q^c)\, y_k + y_k (q x_k q^c) = 2\left( [y_k (q x_k q^c)]_0,\ 0 \right) \tag{2.24}$$

This means that only the 0th component is non-zero. Because of the associativity of the quaternions one can write $y_k (q x_k q^c) = (y_k q x_k) q^c$ and define $z_k := y_k q x_k$, which leads to the 4-vector form of $z_k$, $Z_k$, with $Z_k = A_L(y_k) A_R(x_k) Q$ and $A_L, A_R$ defined as follows

$$A_R(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & p_3 & -p_2 \\ p_2 & -p_3 & p_0 & p_1 \\ p_3 & p_2 & -p_1 & p_0 \end{pmatrix}, \quad A_L(p) = \begin{pmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & -p_3 & p_2 \\ p_2 & p_3 & p_0 & -p_1 \\ p_3 & -p_2 & p_1 & p_0 \end{pmatrix} \tag{2.25}$$


Altogether we can write

$$-2\mathbf{y}_k^T U(q)\mathbf{x}_k = 2[y_k(q x_k q^c)]_0 = 2[z_k q^c]_0 = 2(z_{k0} q_0 + \mathbf{z}_k \cdot \mathbf{q}) = 2 Q^T Z_k = 2 Q^T A_L(y_k) A_R(x_k)\, Q \tag{2.26}$$

followed by the residual

$$N E_q = \sum_{k=1}^{N} \left(|x_k|^2 + |y_k|^2\right) - 2 Q^T F Q \tag{2.27}$$

with

$$F := -\sum_{k=1}^{N} A_L(y_k) A_R(x_k) \tag{2.28}$$

leading to the full form of the matrix F in terms of the correlation matrix R

$$F = \begin{pmatrix}
R_{11}+R_{22}+R_{33} & R_{23}-R_{32} & R_{31}-R_{13} & R_{12}-R_{21} \\
R_{23}-R_{32} & R_{11}-R_{22}-R_{33} & R_{12}+R_{21} & R_{13}+R_{31} \\
R_{31}-R_{13} & R_{12}+R_{21} & -R_{11}+R_{22}-R_{33} & R_{23}+R_{32} \\
R_{12}-R_{21} & R_{13}+R_{31} & R_{23}+R_{32} & -R_{11}-R_{22}+R_{33}
\end{pmatrix} \tag{2.29}$$

In this way the problem can be reduced to finding the extremum of a quadratic form $Q^T F Q$ for the four variables $q_i$, $i \in \{0, 1, 2, 3\}$, subject to the constraint $Q^T Q = 1$. Here $Q^T F Q$ is the standard Rayleigh quotient for a symmetric matrix F, whose maximum value is equal to its largest eigenvalue, which leads to the eigenvalue problem

$$F Q = \lambda Q \tag{2.30}$$

which in turn leads to the following expression for the best RMSD value

$$e_q = \min_{\|q\|=1} \sqrt{E_q} = \sqrt{\frac{\sum_{k=1}^{N} \left(|x_k|^2 + |y_k|^2\right) - 2\lambda_{max}}{N}} \tag{2.31}$$
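The procedure can be sketched numerically as follows, assuming NumPy is available; the correlation-matrix convention $R_{ij} = \sum_k x_{ki} y_{kj}$, the function name and the toy point sets are my choices, not the thesis's implementation. F is assembled as in eq. 2.29 and the minimal RMSD follows from eq. 2.31:

```python
# Kearsley-style quaternion RMSD: largest eigenvalue of the 4x4 matrix F
# (eq. 2.29) gives the minimal residual (eq. 2.31). Illustrative sketch.
import numpy as np

def quaternion_rmsd(x, y):
    """Minimal RMSD between two N x 3 point sets with known atom mapping."""
    x = x - x.mean(axis=0)          # remove the translation r by centering
    y = y - y.mean(axis=0)
    R = x.T @ y                     # correlation matrix R_ij = sum_k x_ki y_kj
    F = np.array([
        [R[0,0]+R[1,1]+R[2,2], R[1,2]-R[2,1],        R[2,0]-R[0,2],         R[0,1]-R[1,0]],
        [R[1,2]-R[2,1],        R[0,0]-R[1,1]-R[2,2], R[0,1]+R[1,0],         R[0,2]+R[2,0]],
        [R[2,0]-R[0,2],        R[0,1]+R[1,0],        -R[0,0]+R[1,1]-R[2,2], R[1,2]+R[2,1]],
        [R[0,1]-R[1,0],        R[0,2]+R[2,0],        R[1,2]+R[2,1],         -R[0,0]-R[1,1]+R[2,2]],
    ])
    lam_max = np.linalg.eigvalsh(F)[-1]   # largest eigenvalue of F Q = lambda Q
    n = len(x)
    return np.sqrt(max(0.0, ((x**2).sum() + (y**2).sum() - 2.0*lam_max) / n))

# A point set rotated 90 degrees about z superposes exactly (RMSD ~ 0).
pts = np.array([[1.0, 0, 0], [0, 2.0, 0], [0, 0, 3.0], [1.0, 1.0, 1.0]])
rot = np.array([[0.0, -1.0, 0], [1.0, 0.0, 0], [0, 0, 1.0]])
print(round(quaternion_rmsd(pts, pts @ rot.T), 6))  # 0.0
```

Because only a single 4 × 4 eigenvalue problem is solved per pair of conformations, this is cheap enough to evaluate for every individual in every GA generation.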

2.5 Genetic Algorithm

In cheminformatics one often encounters optimization problems with several variable parameters. Traditional optimization methods such as steepest descent often fail at this task because they tend to run into a local optimum. To get around this problem, John Holland developed the class of Genetic Algorithms (GAs) at the University of Michigan during the 1960s and 70s [Hol75].

Genetic algorithms belong to the class of stochastic search methods. Their distinctive feature is that, instead of operating on a single solution like most other stochastic search methods, they operate on a whole set of solutions. The term Genetic Algorithm is a tribute to their basic operations, which derive from natural evolutionary processes such as inheritance, mutation, selection, and crossover.


Given a problem P with parameters $x_1, ..., x_n$, the first step is to initialize a first set of solutions, called the population M(0). Each single solution is called an individual m and is represented by a bit string called a chromosome (see figure 2.2). The initial value of each parameter is chosen at random within its predefined range.

Figure 2.2: This figure shows two individuals with parameters $x_1, ..., x_4$ encoded as a series of binary representations of different length.

The second step is to evaluate each individual (i.e. solution) in the current population M(t) for its fitness. This is done by applying the individual's parameter values to a fitness function (which in most cases is the initial problem function) and assigning the function result as fitness value u(m). This means that the parameters of individuals with higher fitness values lead to a better result of the problem function.

The third step is to assign each of the current individuals a selection probability p(m) which depends on the individual's fitness value u(m). This selection probability determines whether an individual is chosen for mating. There are several methods of assigning selection probabilities, like roulette wheel selection (the likelihood of picking an individual is proportional to the individual's score), tournament selection (a number of individuals are picked using roulette wheel selection, then the best of these are chosen for mating), and rank selection (pick the best individual every time). Moreover, it is important not to use a method which always picks the individuals with the best fitness, because then the population will quickly converge to these individuals, narrowing the search space.

The fourth step is to generate a new population M(t + 1), using the individuals selected in step three to produce offspring by applying the already mentioned genetic operators mutation and crossover with a predefined probability (see figures 2.3(a) and 2.3(b) for the genetic operators).


(a) Mutation of the fifth bit from 0 to 1

(b) Crossover after the third bit

Figure 2.3: This figure shows the two genetic operators mutation and crossover. These allow generating new individuals from already existing ones and introducing new sets of parameters with possibly better fitness values.

Steps two to four are then repeated until one of three conditions occurs: the best fitness in the current population reaches a given limit, the best fitness does not increase over a predefined number of generations, or steps two to four have been repeated a specific number of times.
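The four steps above can be sketched as a minimal GA on a toy problem (maximizing the number of 1-bits in a chromosome), using roulette-wheel selection, one-point crossover and bit-flip mutation. All names and parameter values are illustrative, not those of the thesis:

```python
# Minimal GA sketch: bit-string chromosomes, roulette-wheel selection,
# one-point crossover, bit-flip mutation. Toy fitness: count of 1-bits.
import random

random.seed(0)
BITS, POP, GENS, P_MUT = 20, 30, 40, 0.02

def fitness(ind):                       # step two: evaluate u(m)
    return sum(ind)

def select(pop):                        # step three: roulette-wheel p(m)
    weights = [fitness(m) + 1 for m in pop]   # +1 avoids all-zero weights
    return random.choices(pop, weights=weights, k=2)

def offspring(p1, p2):                  # step four: crossover, then mutation
    cut = random.randrange(1, BITS)     # one-point crossover
    child = p1[:cut] + p2[cut:]
    return [b ^ (random.random() < P_MUT) for b in child]  # bit flips

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]  # step one: M(0)
start = best = max(fitness(m) for m in pop)
for _ in range(GENS):                   # repeat steps two to four
    pop = [offspring(*select(pop)) for _ in range(POP)]
    best = max(best, max(fitness(m) for m in pop))
print(start, "->", best)
```

Without elitism the best fitness can fluctuate between generations, which is why the sketch tracks the best value seen so far.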

2.6 Quantitative Structure-Activity Relationship

For the development of a new drug it is important to know not only its chemical formula but also its conformation. The underlying principle for this is the so-called lock and key principle postulated by Emil Fischer in 1894 [Fis94], stating that an active compound has to be spatially complementary to its target to form a complex. But as we know today there are several other factors that influence the formation of an active complex. Those can be direct features of the molecules, like hydrophobicity, partial atomic charge, binding sites etc., or there can be influences from the surrounding solution (e.g., water), so that a ligand changes its conformation in the binding process. These considerations led to the expansion of the lock and key principle to the induced fit theory in 1958 [Kos58] [Kos94].


Figure 2.4: This figure shows a Thrombin-Hirudin complex, the Hirudin (magenta) being the key to the Thrombin (blue) lock.

Also new to this theory was the introduction of flexible binding sites, which can account for differences in specificity and affinity. This leads to the conclusion that the biological activity is a direct function of the ligand's three-dimensional structure, which in turn is the fundamental premise for the quantitative structure-activity relationship (QSAR) [SOW04]. QSAR methods attempt to represent the relationship between structural attributes of molecules and their biological activity. In the beginning, QSAR models were used to retrospectively analyze the activity modulation of molecules in a specific subset. But in the last decade QSAR models have been increasingly used for predictions on novel derivatives of well-known ligands [Eki04]. To be applicable to such a use, the applied QSAR models must be able to generalize and predict activities correctly beyond the chemical space defined by the given training data.

To that end a large number of methods have been described in the literature since the beginning of the research on QSAR. The early methods implemented only 2D features of molecules (e.g. the connection table of a molecule), while newer ones often include 3D features like the chemical properties of molecules in their bioactive conformation [SJ93] [OW91].


3 Materials and Methods

In this chapter the two main strategies applied to the problem and the overall process will be explained in detail and their function will be exemplified. The parameters used for the experiments and their progress will be given. The implementation of the algorithms or the use of external programs or code will be described. All algorithms were written in Java.

3.1 Overall process

Because this work consists of a concatenation of different machine learning and chemoinformatic methods, I will first give an overview of the whole process and then explain the individual methods in depth.

The aim was to see if the best models for an activity prediction included the actual active structures of the given molecules or if a better model could be found without them.

In this work I used two different approaches. The first was to precompute a set of conformers for each molecule, maximizing the coverage of the conformer space; the second was to create random new conformers during the optimization process. From the set of precomputed conformers, 100 (or the maximum available if lower than 100) were chosen equally distributed over the calculated relative energy range for each molecule and used as the training set. In both approaches the optimization was done by a genetic algorithm. The deciding facts for using a heuristic (in this case the genetic algorithm) were that a full search of the optimization space isn't feasible for 100 molecules, each with at least 100 conformations, that the solution hyperplane is very jagged, and that there was no information about a starting point. The information about the molecules' conformations was encoded in the GA's genes, either as a direct reference to the whole conformation (in the precomputed approach) or as single dihedral angles for each rotatable bond in each molecule (in the implicit approach).

After each generation of the GA the fitness of its individuals was calculated. In this case each individual corresponded to a set of conformers, for which a kernel matrix was computed using one of the following kernel methods:

• Probability Product Kernel (PPK)

• Radial Basis Function Kernel (RBF)

• Atom Pair Kernel (APK)

The first two of these work on the RDF of a given molecule, while the third works directly on the 3D model of the conformation. Each kernel matrix therefore consisted of similarity measures between the molecules, and for each molecule the pKi value was known. This information was used to build an SVR model to predict the activity of an unknown molecule in relation to its similarity to the molecules in the training set. For each model a set of best


parameters was searched using 5 repetitions of leave-one-out convoluted with a 5-fold cross-validation. These best parameters were used to compute the MSE of the model, which in turn served as the fitness value for each individual.

The next generation of individuals in the GA was then generated using standard GA operators such as mutation and crossover. The individuals were selected to mate for the next generation according to their fitness values.

(a) In the first approach the conformer sampling was done before the optimization process

(b) In the second approach the conformer sampling was done implicitly as part of the optimization process by mutating the conformers

Figure 3.1: These two figures show the different procedures of the two approaches. Both consist of four frameworks indicated by the different colors: the conformer generation (either with MacroModel or implicitly), the GA (with JavaEva2) which runs the optimization loop, the kernel matrix computation, and the SVR modeling (with libsvm).

The process of generating new generations, the calculation of each kernel matrix and the evaluation via the SVR was then repeated 200 times. The development of the MSE for each individual and the RMSD between the conformations of the individual and the known active structure were calculated.
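Schematically, the evaluation loop described in this section can be sketched as follows. The one-dimensional "descriptors", the RBF similarity and the leave-one-out kernel predictor are toy stand-ins for the RDF descriptors, the kernel matrices and the libsvm SVR models actually used; only the structure of the loop (individual → selected conformers → similarity matrix → cross-validated error → fitness) mirrors the thesis:

```python
# Schematic fitness evaluation: an individual picks one conformer per
# molecule; a leave-one-out error of a simple kernel predictor over the
# selected conformers serves as the (negated) fitness. Toy stand-ins only.
import math, random

random.seed(1)
N_MOL, N_CONF = 6, 5
# toy "descriptor" per (molecule, conformer) and a known pKi per molecule
desc = [[random.gauss(m, 0.3) for _ in range(N_CONF)] for m in range(N_MOL)]
pki = [float(m) for m in range(N_MOL)]

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def fitness(individual):
    """Negative leave-one-out MSE of a kernel-weighted predictor."""
    x = [desc[m][c] for m, c in enumerate(individual)]  # chosen conformers
    mse = 0.0
    for i in range(N_MOL):                               # leave molecule i out
        w = [rbf(x[i], x[j]) for j in range(N_MOL) if j != i]
        y = [pki[j] for j in range(N_MOL) if j != i]
        pred = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        mse += (pred - pki[i]) ** 2
    return -mse / N_MOL   # the GA maximizes fitness, so negate the error

ind = [random.randrange(N_CONF) for _ in range(N_MOL)]  # one conformer index per molecule
print(fitness(ind))
```

In the thesis, this fitness evaluation is the inner loop executed for every individual of every one of the 200 GA generations.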


3.2 Radial Distribution Function

An important prerequisite for the computation of active structures with respect to the different conformations is keeping some kind of knowledge about the 3D structure of the molecules throughout the whole process. Therefore a molecular representation is needed that guarantees 3D sensitivity. Such a structure code has some prerequisites:

• independence from the number of atoms, i.e. the size of the molecule,

• unambiguity regarding the three-dimensional arrangement of the atoms and

• invariance against translation and rotation of the entire molecule

(a) Overlay of three different conformations of the same molecule

(b) The RDF for the three molecules shown on the left

Figure 3.2: These figures show an overlay of three conformations of the same Thrombin inhibitor and their RDF. While the internal distances of the ring systems stay the same (i.e. the peaks representing the ring systems at r ≈ 1.5 and r ≈ 2.6 overlap for all three molecules), their relative spatial positions vary (i.e. the peaks representing the distances between the ring systems at r ≈ 6 to r ≈ 12 are set off).

One method that meets all of the above requirements and which I used in this work is a derivative of the 3D-Molecule Representation based on Electron diffraction (3D-MoRSE) [Sch96] [Sel97], the radial distribution function (RDF) [Gas96] [Gas97]. In general this function gives the probability of finding a pair of atoms with similar properties at a distance r from each other in the given molecule.

$$g(r) = f \sum_{i}^{N-1} \sum_{j>i}^{N} A_i A_j \, e^{-B(r-r_{ij})^2} \qquad (3.1)$$


3 Materials and Methods

where f is the scaling factor and N is the number of atoms. The exponential term consists of the distance r_ij between two atoms i, j and the smoothing factor B for the probability distribution, which will be explained later. A_i and A_j are the characteristic atom properties. The properties used in this work are standard properties of the JoeLib2 framework, for example:

• Electro-topological state

• Electronegativity (Pauling)

• Partial charge

• Atom mass

• Electron affinity

• Intrinsic state

• Free electron count

• Hybridisation

• Van-der-Waals volume

• Heavy atom valence

• Electrogeometrical state

• Implicit valence

This distribution function allows embedding a lot of additional information, e.g. bond distances, ring types, planar and non-planar systems and atom types, all of which are important in calculating the similarity of two molecules or, as in this case, the similarity of two conformers of the same molecule.

An important factor in using the radial distribution function is the resolution of the 3D model of the molecule to which the formula is applied. Using exact distances contradicts physical reality and restricts any ability to interpolate for better results. Moreover, if one wants to compute the similarity of two conformers using paired atomic distances, a certain amount of fuzziness is necessary to account for flexibility and for errors in the initial measurement. Therefore the width of the peaks in the radial distribution function is determined by the factor B. As an approximation, the relation between B and the chosen step size ∆r [Hem99] is given by

$$B \simeq \frac{1}{(\Delta r)^2} \qquad (3.2)$$

In this work I started my computations with a value of B = 1000. But on realizing that even slight changes had a large effect on similarity values, I successively lowered it to a value of B = 10, where only rotations of whole ring systems had a noticeable effect on similarity. The step size ∆r was always set to ∆r = 0.1 ˚A.

Implementation

In this implementation the function was internally represented by a vector of double values, each holding the value g(r) of the RDF at a point r ∈ 0.1 · ℕ. The length of the vector, and therefore the range of the function, was predetermined by measuring the longest atom pair distance over all molecules in the dataset and adding 2 ˚A as a safety margin. The preceding scaling factor f was not used (i.e. always set to f = 1).
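A minimal sketch of this representation (not the thesis's JoeLib2-based Java code): atom positions and a single precomputed property value per atom are assumed as inputs, and g(r) is sampled on the fixed grid r = k · ∆r described above.

```python
import math

def rdf(coords, props, n_bins, dr=0.1, B=10.0, f=1.0):
    """Equation 3.1 sampled at r = k*dr, stored as a plain list of
    doubles.  coords: 3D atom positions; props: atom property A_i."""
    g = [0.0] * n_bins
    n = len(coords)
    for k in range(n_bins):
        r = k * dr
        s = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                r_ij = math.dist(coords[i], coords[j])
                s += props[i] * props[j] * math.exp(-B * (r - r_ij) ** 2)
        g[k] = f * s
    return g
```

For a toy two-atom "molecule" at distance 1 Å the resulting function peaks at r = 1, i.e. at bin 10.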



Figure 3.3: This figure shows the overlay of three RDF diagrams of the same molecule with three different values for B: 10, 100 and 1000. One can see that with increasing B the smoothness decreases but the information value increases.

3.3 Kernel

3.3.1 Probability Product Kernel

One of the two methods used in this work to give a 3D sensitive representation of a molecule was the radial distribution function (RDF). This function can be regarded as a distinct distribution of atom pairs in the given molecule.

Typical kernels compute a generalized inner product between two input objects χ and χ′, which is equivalent to applying a mapping function φ to each object and then computing a dot product between φ(χ) and φ(χ′) in a Hilbert space [Jeb04]. This kernel considers the case of the mapping φ(χ) being a probability distribution p(x|χ), restricting the Hilbert space to the space of probability distributions.

In this work the probability distribution p(x|χ) is given by the RDF, which leads to the following definition of the probability product kernel:

Definition. Let p and p′ be probability distributions on a space X and ρ a positive constant. Assume that $p^\rho, p'^\rho \in L_2(X)$, i.e. that $\int_X p(x)^{2\rho}\,dx$ and $\int_X p'(x)^{2\rho}\,dx$ are well defined (not infinite). The probability product kernel (PPK) between the distributions p and p′ is defined as

$$k_{prob}(p, p') = \int_X p(x)^\rho \, p'(x)^\rho \, dx = \langle p^\rho, p'^\rho \rangle_{L_2} \qquad (3.3)$$

Furthermore it is well known that L2(X) is a Hilbert space. Hence the defined kernel is positive definite for any set P of probability distributions over X such that $\int_X p(x)^{2\rho}\,dx$ is finite for any p ∈ P.

Implementation

The first idea was to implement the computation of the probability product kernel with the numerical integration of the given RDF functions via Simpson's rule (see figure 3.4):

$$\int_a^b f(x)\,dx \approx \frac{b-a}{6} \left[ f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right] \qquad (3.4)$$

Figure 3.4: This figure shows the approximation of a function f(x) by a quadratic interpolation P(x).

The RDF was interpolated by Simpson's rule in steps of 0.01, which led to an exact calculation of the integral up to the 6th decimal place and also allowed to freely choose the factor ρ in the PPK formula.
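The thesis's exact integration code is not given; as a sketch, the textbook composite form of Simpson's rule over a stepwise grid looks like this:

```python
def simpson(f, a, b, n=100):
    """Composite Simpson's rule for the integral of f over [a, b];
    n is the (even) number of subintervals."""
    assert n % 2 == 0, "n must be even"
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        # interior points alternate with weights 4 and 2
        s += (4 if k % 2 == 1 else 2) * f(a + k * h)
    return s * h / 3
```

Since the rule is exact for polynomials up to degree three, a cubic integrand reproduces the analytic result to machine precision.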

But the first tests of this implementation showed that the computation of a single kernel value could take up to 10 seconds, resulting in a maximum total of 1.5 hours per kernel matrix. As this was infeasible due to the enormous amount of computational power needed, I decided to fix the parameter ρ at ρ = 1. With this the kernel takes the form of the expectation of one distribution under the other:

$$k(p, p') = \int p(x) \, p'(x) \, dx = E_p[p'(x)] = E_{p'}[p(x)] \qquad (3.5)$$

This is also called the expected likelihood kernel.
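On sampled RDF vectors the integral reduces to a quadrature over the grid; as a sketch (with the trapezoidal rule standing in for the thesis's exact integration scheme):

```python
def expected_likelihood_kernel(p, q, dr=0.1):
    """Equation 3.5 with rho = 1: k(p, q) = integral of p(x)*q(x) dx,
    approximated by the trapezoidal rule on two RDF vectors sampled
    with step dr."""
    n = min(len(p), len(q))
    prod = [p[i] * q[i] for i in range(n)]
    # trapezoid: full weight on interior points, half weight on the ends
    return dr * (sum(prod) - 0.5 * (prod[0] + prod[n - 1]))
```

For two constant functions equal to 1 on [0, 1] the kernel value is exactly 1.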

3.3.2 Radial Basis Function Kernel

Another method of measuring the similarity between the two result vectors of the RDF is the use of a radial basis function. A radial basis function (RBF) kernel, also known as an isotropic stationary kernel [HG04], is defined by a function ψ : [0, ∞) → ℝ such that

$$k(x, x') = \psi(\|x - x'\|) \qquad (3.6)$$


where x, x′ ∈ X and ‖·‖ denotes the Euclidean norm. The use of a special RBF kernel, the Gaussian RBF kernel, has been suggested in [Guy93] with


$$k(x, x') = \exp\left( -\sum_{i=1}^{n} \frac{\|x_i - x'_i\|^2}{2\sigma^2} \right) \qquad (3.7)$$

where x_i and x′_i are the single data points in the result vectors of the RDF, and σ defines the width of the sphere surrounding the corresponding training pattern [Cha05].

The issue in implementing this kernel was to find a viable value for the parameter σ in the above formula. Choosing σ too low makes each pattern appear similar only to itself, over-fitting the model and taking away its ability to generalize outside its bounds, while choosing σ too high has the opposite effect, letting all patterns appear very similar and under-fitting the model. So finding an optimal value for σ is about finding an acceptable trade-off between over-fitting in dense areas and under-fitting in sparse areas.
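Equation 3.7 applied to two RDF vectors is a one-liner; this sketch makes the role of σ explicit:

```python
import math

def gaussian_rbf_kernel(x, y, sigma=100.0):
    """Gaussian RBF kernel (equation 3.7) between two RDF vectors x, y.
    sigma controls the width of the similarity sphere."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Identical vectors give k = 1, and increasing σ moves the value of any fixed pair toward 1, which is the over-smoothing effect discussed above.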

3.3.3 Atom Pair Kernel

While the preceding kernels were based on an RDF representation, another method to compare the 3D structure of two molecules, or the different conformations of the same molecule, is to represent the molecule as a trie data structure. For that I use a derivative of the optimal assignment of atom pairs [Jah09]. This method is based on a matrix $D = \left( \lfloor d_{ij}/b \rfloor \right)$ of binned geometrical distances between the three-dimensional coordinates of atoms i, j, where d_ij are the atomic distances and b is the binning factor. The matrix D is used as a lookup table for the information needed to build a trie containing the geometrical information for all atom pairs from a fixed atom i to any other atom, where a trie is a prefix based search tree that can be applied to any symbolic pattern with a reading direction. At the beginning the trie of atom i only consists of the root labeled with the hash code of the atomic symbol of i. To fill the trie, patterns of the form

$$\left( \mathrm{hash}(\mathrm{symbol}(i)),\; d_{ij},\; \mathrm{hash}(\mathrm{symbol}(j)) \right) \qquad (3.8)$$

with the binned distance d_ij taken from D are inserted successively as ordered triplets. An example of a local atom pair environment and the corresponding trie is shown in figure 3.5.



Figure 3.5: Binned geometrical distances, spheres and trie. The upper left figure shows the spheres of the binned geometrical distances 1.0, 2.0 and 3.0 ˚A for the centered carbon atom. The sphere of the binned geometrical distance of 0.0 ˚A (distances in the range [0.0; 1.0)) is not visualized as an individual sphere because it contains no atoms. The upper right figure illustrates the resulting local atom pair environment of binned geometrical distances. For simplicity, only the distances to non-carbon atoms are displayed. The lower figure visualizes the corresponding trie of geometric atomic distances of the annotated atom in the upper figures. The root and leaves are labeled with the corresponding atom type. The leaves additionally contain the total number of occurrences in the local atom pair environment. [JZ10]

The representation of local atom environments as tries allows the comparison of two local atom environments by comparing the tries. This can be achieved by applying a well known similarity measure like the Tanimoto coefficient

$$T(A, B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B} \qquad (3.9)$$

In this case let L_A, L_B be two sets of local atom pair environments of two molecular graphs A, B and l_Ai ∈ L_A, l_Bj ∈ L_B the tries i, j of the nominal features (atom pair environments of atoms i, j).


Then the Tanimoto coefficient can be defined as

$$Sim(l_{A_i}, l_{B_j}) = \frac{\left| l_{A_i} \cap l_{B_j} \right|}{\left| l_{A_i} \cup l_{B_j} \right|} \qquad (3.10)$$

Implementation

The implementation used in this work was based on the Chemistry Development Kit (CDK) [Ste03] [Ste06] and implemented by [Jah09].

The single arbitrary parameter b was initially set to b = 0.1 and subsequently set to b = 0.2 to account for errors in the measurement of the crystal structure.
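A simplified sketch of the two ingredients above, not the CDK-based implementation: the local environment of one atom is built as a set of binned triplets (the element symbol stands in for its hash code), and two environments are compared with the set-based Tanimoto coefficient of equation 3.10.

```python
import math

def atom_pair_patterns(symbols, coords, i, b=0.2):
    """Local atom pair environment of atom i as the set of patterns
    from equation 3.8, with distances binned by factor b and the
    element symbol used directly in place of its hash code."""
    return {(symbols[i],
             math.floor(math.dist(coords[i], coords[j]) / b),
             symbols[j])
            for j in range(len(symbols)) if j != i}

def tanimoto(la, lb):
    """Equation 3.10: set-based Tanimoto similarity of two environments."""
    union = la | lb
    return len(la & lb) / len(union) if union else 1.0
```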

3.4 Dataset

The dataset used in the experiments consisted of two parts: a precompiled set of 88 molecules taken from [Boe99] and a smaller set of 12 molecules compiled for this work. All of the molecules in the dataset were thrombin inhibitors with a known pKi value. However, only the 12 molecules in the compiled dataset had crystallographically determined active structures. The active structures were obtained by taking the crystal structure analysis of thrombin with the respective ligand and extracting the bound ligand from the whole structure.

The first step therefore was to search for all potential thrombin inhibitors in the scPDB 1 [Kel06].

The second step was to find an entry with the identical structural formula in the Binding Database 2 [XG02] for information about pKi values and publications.

The third and final step was to download the crystallographic analysis given by the PDB ID from the Protein Data Bank 3 [Ber77] and to extract the bound ligand with Schrödinger's Maestro program. These 12 ligands will from now on be referenced by their originating PDB ID. They are depicted in figure 3.6 and their data and publications are shown in table 3.1. The 88 precompiled structures were only available as structural formulas, so they had to be converted into a valid 3D conformation. To achieve this they were converted with the CORINA program [Sad94].

Thrombin inhibitors were chosen both for their high flexibility and for the fact that the interactions of inhibitors and thrombin are well investigated, with several well documented studies including crystal structures.

3.5 Conformation Sampling

3.5.1 Precomputed Conformation Sampling

The first strategy to be pursued was to precompute a set of conformers for all molecules, pick a subset of 100 of these conformers per molecule (or less, if fewer than 100 were available) and use the genes in the GA as indices for the conformers to choose from. Therefore a mutation operation in the GA led not only to a single change in the conformation but could lead to a wholly different one.

1 http://bioinfo-pharma.u-strasbg.fr/scPDB/
2 http://www.bindingdb.org
3 http://www.rcsb.org/pdb



(a) 1a4w (b) 1c5n (c) 1ghy (d) 1gj4 (e) 1gj5 (f) 1o2g (g) 1o5g (h) 2zc9 (i) 2zda (j) 2zgx (k) 2zo3 (l) 3dhk

Figure 3.6: These figures show the 12 molecules with known active structure used in this work. They are labeled with the PDB ID they were extracted from.



| PDB ID | pKi value | resolution (˚A) | first published in |
| --- | --- | --- | --- |
| 1a4w | 7.796 | 1.80 | [Mat96] |
| 1c5n | 4.699 | 1.50 | [Kat00] |
| 1ghy | 5.071 | 1.85 | [Kat01a] |
| 1gj4 | 4.222 | 1.81 | [Kat01b] |
| 1gj5 | 6.347 | 1.73 | [Kat01b] |
| 1o2g | 6.495 | 1.58 | [Kat03] |
| 1o5g | 4.957 | 1.75 | [Kat04] |
| 2zc9 | 7.327 | 1.58 | [Bau09] |
| 2zda | 8.398 | 1.73 | [Bau09] |
| 2zgx | 6.745 | 1.80 | [Bau09] |
| 2zo3 | 10 | 1.70 | [Bau09] |
| 3dhk | 6.744 | 1.73 | [Bau09] |

Table 3.1: This table gives information on all used molecules for which the crystal structure was known.

The conformations themselves were generated with the ConfGen program [WS10], which is based on the molecular modeling program MacroModel [SI08b].

The first step the program takes is to identify variable features which are rotatable bonds, flexible ring systems and invertible nitrogens. ConfGen generally identifies a bond as rotatable if the following criteria are met:

• It is a single bond

• It doesn't lie within a ring

• Neither of the atoms connected by the bond is terminal (i.e. has no other bonds to it)

• Neither end of the bond is a CH3, NH2 or NH3+ group

• Neither atom in the bond is bonded to two or three atoms that are all equivalent and are arranged with two- or three-fold rotational symmetry.

Ring conformers are generated using the same template based facility available in LigPrep [SI08a], Glide [Fri04], MacroModel [SI08b], or Phase [Dix06]. It is designed to generate a complete set of accurate, low energy ring conformations, identifying individual rings with a smallest set of smallest rings (SSSR) method [Zam76]. When a ring system is identified it is compared to a set of 1252 templates to find the most similar template. This template is then used to calculate the relative energies of the ring within the molecule. There are N_ri combinations of ring conformations for a whole molecule:

$$N_{ri} = 2^{N_i} \prod_{r} N_{cr} \qquad (3.11)$$

where N_i is the number of invertible nitrogen atoms, r runs over all flexible ring systems and N_cr is the number of templates selected for each individual ring system.

Each of the generated sets of ring conformers is then processed as follows. First the potential of each rotatable bond connecting the ring systems is calculated using a derivative of OPLS [Jor88] [Jor96], including a quick check of the Lennard-Jones potentials of all atoms on one side of the bond against all on the other side to avoid local Van-der-Waals clashes. Then the potential minima are computed and used to create sets of rotational bonds surrounding the molecular core (i.e. the part of the molecule remaining if every outer rotational bond is severed).

| parameter | intermediate | comprehensive |
| --- | --- | --- |
| maximum number of search steps | 1000 | 1000 |
| search steps per rotatable bond | 75 | 75 |
| minimum heavy atom RMSD (˚A) for distinct conformer | 1 | 0.5 |
| minimum dihedral angle difference for polar hydrogens (◦) | 60 | 60 |
| maximum relative energy for flexible rings (kcal/mol) | 2.39 | 23.9 |
| maximum number of ring conformations per ligand | 16 | 128 |
| maximum number of ring conformations per ring | 8 | 64 |
| maximum relative ConfGen energy (kcal/mol) | 25 | 119.5 |
| energy threshold for periodic torsions (kcal/mol) | 5.74 | 5.74 |
| restraint potentials for weak torsions in MacroModel (kcal/mol) | 239 | 239 |
| restraint potential half width (◦) | 10 | 10 |
| suppress hydrogen-bond electrostatics in MacroModel | Yes | Yes |
| maximum relative all-atom energy in MacroModel (kcal/mol) | 25 | 119.5 |

Table 3.2: This table shows the parameters used to generate the two datasets. The intermediate parameter set is more restrictive and almost certainly only picks energetic minima, while the comprehensive parameter set allows the algorithm to pick a conformation lying between two optima.

For each combination of ring system conformation, invertible nitrogen atom geometry and minimum of the rotatable bond dihedral angles all molecule conformations are compiled and, if the sum of all potential energies relative to the one with the least energy doesn't exceed a preset limit, the conformation is added to the resulting set of conformers.

In this work I used two sets of parameters for the algorithm described above, one restrictive and one permissive. While the restrictive parameter set only generated conformers where the rotatable bonds took up a local minimum energetic state, the permissive one allowed more freedom. Thus conformations were picked lying in between local optima, allowing the GA to change more easily from one conformation to another. For the exact parameters used in the conformer sampling see table 3.2.

3.5.2 Implicit Conformation Sampling

The second strategy to be pursued was to not use the precomputed conformation sampling but to generate a new set of conformations from generation to generation in the genetic algorithm. Therefore the encoding of the single individuals in the GA had to be different. In contrast to the optimization of the precomputed conformation set, where each gene represented the conformation ID of a whole molecule, here a gene only represented a single rotatable bond within a molecule. While a mutation of one gene there meant picking a whole new conformation of the concerned molecule, with possibly every rotatable bond affected, the mutation of a gene in the implicit conformation sampling only meant the alteration of a single rotatable bond. In addition it had to be ensured that the crossover operator didn't cut in the middle of the encoding of a molecule but only at the end of one and the beginning of the other. Doing a crossover in the middle of a molecule could lead to an invalid conformation because it couldn't be guaranteed that the molecule wouldn't fold back on itself, overlapping one or more atoms.

Figure 3.7: The figure shows an example of a molecule with nine rotatable bonds and the corresponding encoding as a gene for the GA. The denoted angles are the dihedral angles for a unique set of deterministically calculated atoms surrounding the bond.

But before encoding a molecule in the GA one had to know the exact number of rotatable bonds. For that each bond was inspected and had to meet a list of criteria to count as rotatable. The criteria used were the ones already implemented in the JoeLib2 framework:

• The atom at the beginning of the bond has to have a heavy atom valence of > 1

• The atom at the end of the bond has to have a heavy atom valence of > 1

• The bond order has to be 1

• The bond mustn't lie in a ring system

• The atom at the beginning of the bond mustn't have a hybridization of 1

• The atom at the end of the bond mustn't have a hybridization of 1

If these criteria were met, the bond was added to the molecule's rotatable bond list.
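The six criteria can be collapsed into one predicate; as a sketch, with plain dicts standing in for JoeLib2's bond and atom objects (the key names are illustrative, not JoeLib2's API):

```python
def is_rotatable(bond):
    """JoeLib2-style rotatable-bond test.  `bond` is a dict with keys
    'order', 'in_ring', 'begin' and 'end'; each atom is a dict with
    'heavy_valence' and 'hybridization'."""
    begin, end = bond["begin"], bond["end"]
    return (begin["heavy_valence"] > 1
            and end["heavy_valence"] > 1
            and bond["order"] == 1
            and not bond["in_ring"]
            and begin["hybridization"] != 1
            and end["hybridization"] != 1)
```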

The unit with which the rotations were encoded was 1◦ (degree), where degree refers to the dihedral angle. An angle of 0◦ refers to the original crystallographic conformation.

For the first generation of the GA an initial set of conformations was computed by picking a random value for the dihedral angle of each rotatable bond of each molecule in the dataset.



After each mutation occurring in the GA the corresponding molecule was recomputed with the new degree value, where the new value was relative to the original 0◦ value and not to the currently applied one.

To compute a rotation around a rotatable bond one has to rotate each atom belonging to one of the two subgraphs formed by splitting the molecular graph at the designated bond. This subgraph was calculated using a stack, adding the atom at the beginning of the bond and then recursively adding every atom bound to the ones already on the stack (except for the atom at the end of the designated rotatable bond) until no new atoms could be found. This was possible because no bonds were allowed to be rotatable if they were in a ring system and no molecules with macrocycles were in the dataset.
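The stack-based traversal can be sketched as follows, with atoms as integers and bonds as pairs. Because the rotatable bond is never part of a ring, removing it splits the molecular graph into exactly two components, so excluding the one blocked neighbour is sufficient:

```python
def rotating_side(bonds, start, blocked):
    """Collect every atom reachable from `start` without crossing the
    rotatable bond (start, blocked), using an explicit stack."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    seen = {start}
    stack = [start]
    while stack:
        atom = stack.pop()
        for neighbour in adjacency.get(atom, []):
            if neighbour != blocked and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen
```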

The actual rotation was achieved by applying a quaternion to each of the atoms in the subgraph, with the center of the coordinate system being the atom at the beginning of the bond.
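A sketch of the quaternion rotation itself (see section 2.3), rotating a single atom position about the bond axis via p′ = q p q*; the bond's first atom serves as the origin:

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def rotate_about_bond(point, origin, axis, angle_deg):
    """Rotate `point` by angle_deg degrees about the axis through
    `origin` with direction `axis`, using p' = q p q*."""
    norm = math.sqrt(sum(c * c for c in axis))
    half = math.radians(angle_deg) / 2.0
    q = (math.cos(half),) + tuple(math.sin(half) * c / norm for c in axis)
    q_conj = (q[0], -q[1], -q[2], -q[3])
    # translate so the axis origin is the coordinate origin, embed as pure quaternion
    p = (0.0,) + tuple(pc - oc for pc, oc in zip(point, origin))
    _, x, y, z = qmul(qmul(q, p), q_conj)
    return tuple(r + oc for r, oc in zip((x, y, z), origin))
```

Rotating (1, 0, 0) by 90° about the z-axis yields (0, 1, 0), and a full 360° rotation returns the original point.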

(a) molecule in basic conformation

(b) molecule with bond Nr. 9 rotated by 90◦

Figure 3.8: These two figures show an example of a rotation around one rotatable bond.

Both at the initial random initialization of the conformations and at every mutation event it had to be ensured that the generated conformation was valid (i.e. that no atoms or bonds overlap or lie too close to each other). Therefore, for every new conformation all pairwise atom distances had to be calculated. The chosen value for the lower bound (obtained by calculating the average minimal distance of non-bonded atoms over the whole original dataset) was 2 ˚A, while distances of covalently bonded atoms were ignored.
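A sketch of this validity check, with atoms as coordinate triples and covalent bonds as index pairs:

```python
import math

def is_valid_conformation(coords, bonded_pairs, min_dist=2.0):
    """Reject a conformation if any two atoms that do not share a
    covalent bond lie closer than min_dist (here 2 Angstrom)."""
    bonded = {frozenset(p) for p in bonded_pairs}
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if frozenset((i, j)) in bonded:
                continue  # distances of covalent bonds are ignored
            if math.dist(coords[i], coords[j]) < min_dist:
                return False
    return True
```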



3.6 SVR

In this work I used the libSVM implementation by Chang and Lin [CL01] 4. To compute the MSE a leave-one-out approach was applied: a model of the dataset (i.e. represented by the kernel matrix) was built n times (with n being the size of the dataset), each time with a different data point left out. For these datasets a five-fold cross-validation (inner fold) was run 5 times (inner runs). The inner fold was used to determine the best parameters of the regression (i.e. the values of the parameters ε and c yielding the best performance on the validation dataset). The inner runs were used to select the best model with the just computed best parameters. The best model was then used to predict the currently left out data point. The set of parameters can be seen in table 3.3.
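The (C, ε) candidate grid for the inner-fold search can be sketched as follows. Note that reading the table's begin/end values as base-2 exponents is my assumption (the usual libSVM grid-search convention); the thesis does not state the base:

```python
def parameter_grid(c_range=(-1, 5), eps_range=(-7, -2)):
    """Candidate (C, epsilon) pairs for the inner-fold grid search,
    interpreting the begin/end values of table 3.3 as exponents of 2."""
    return [(2.0 ** c, 2.0 ** e)
            for c in range(c_range[0], c_range[1] + 1)
            for e in range(eps_range[0], eps_range[1] + 1)]
```

With the default ranges this yields 7 × 6 = 42 parameter combinations to evaluate per inner fold.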

| parameter | value |
| --- | --- |
| computation method | leave one out |
| inner folds | 5 |
| inner repetitions | 5 |
| c begin | -1 |
| c end | 5 |
| ε begin | -7 |
| ε end | -2 |

Table 3.3: This table shows the parameters used to calculate the MSE for the regression on the datasets.

4 http://www.csie.ntu.edu.tw/~cjlin/libsvm/


4 Results

In this chapter I will present, interpret and discuss the results. The results are divided by precomputed and implicit conformation sampling. Each of those two parts is further split by the kernel methods and parameters used. The results are mostly presented in chronological order to replicate my line of thought.

Evaluation Method

I will give a short explanation of each of the values shown in the following diagrams. The meaning of the values remains the same for every run diagram in this work.

MSE

The avg MSE value shown in each diagram is the average MSE value for one generation (i.e. 100 individuals), where MSE is the best found mean square error for each regression. In the corresponding tables I will show the respective numerical values. The Best individual MSE relates to the absolute minimum found by at least one individual.

RMSD

The avg RMSD value is the average RMSD value for the conformations of the 12 molecules with known active structure encoded by the current individual, measured to their respective active structures. The Best individual RMSD relates to the individual with the lowest RMSD to the respective active structures, averaged over the 12 molecules in my dataset.

5% Quantile

The 5% quantile marks the value below which a point lies in the lowest 5% of all possible values for the average RMSD. Its value is 1.629. It was computed by picking 20000 random combinations from the conformer sets of each molecule and building the average RMSD to their respective active structures. From this normal distribution the p-quantile with p = 0.05 was calculated using the standard formula x(p) = µ + σ · z(p), where µ is the expectation, σ² the variance and z(0.05) was looked up in the normal distribution table.
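This estimate can be sketched as follows. The RMSD computation itself is omitted: the function takes, per molecule, a list of precomputed RMSD values of its conformers to the active structure, which is an assumption about the data layout rather than the thesis's actual code:

```python
import random
import statistics

def rmsd_quantile(conformer_rmsds, n_samples=20000, z=-1.6449, seed=0):
    """Monte-Carlo estimate of the 5% quantile of the average RMSD:
    draw random conformer combinations, average their RMSDs, fit a
    normal distribution and return x(p) = mu + sigma * z(p), with
    z(0.05) = -1.6449 from the normal distribution table."""
    rng = random.Random(seed)
    samples = [statistics.fmean(rng.choice(rmsds) for rmsds in conformer_rmsds)
               for _ in range(n_samples)]
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu + sigma * z
```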

4.1 Precomputed Conformation Sampling

For the first experiment, the optimization of the QSAR model, I used the dataset described earlier. The dataset first included all conformers produced with ConfGen, where every molecule had a different number of conformers created. To reduce the combinatorial size I picked the lower of either 100 or the number of conformers originally created. The selection method was to pick the conformers equally distributed over their relative energy to the conformer with the absolute lowest energy, to guarantee an equal distribution over the conformational space of each molecule. Because the output sets for each molecule were already sorted by their relative energy I simply had to pick every n-th conformer, where n = number of conformers available / 100.
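The selection of every n-th conformer can be sketched as:

```python
def subsample_conformers(conformers, target=100):
    """Pick at most `target` conformers, evenly spaced over an
    energy-sorted list (every n-th entry, n = len/target)."""
    if len(conformers) <= target:
        return list(conformers)
    n = len(conformers) / target
    return [conformers[int(i * n)] for i in range(target)]
```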



(a) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 01, where the PPK kernel was used with parameter B = 1000

(b) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 02, where the RBF kernel was used with parameters B = 1000 and sigma = 100

(c) This figure shows the avg. MSE, avg. RMSD, best MSE and best RMSD for Run 03, where the APK was used with smoothing factor = 0.1

Figure 4.1: These figures show the results of the first three runs. One can see that the optimization works fine, as the MSE declines, while the average RMSD only declines in Run 03 using the APK but still doesn't reach the 5% quantile.



| Parameter / Run Nr. | 01 | 02 | 03 |
| --- | --- | --- | --- |
| Kernel method | PPK | RBF | APK |
| RDF B factor | 1000 | 1000 | - |
| RBF Sigma factor | - | 100 | - |
| Smoothing factor | - | - | 0.1 |
| Mutation Probability | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Start avg. MSE | 1.256 | 1.191 | 1.039 |
| End avg. MSE | 0.118 | 0.536 | 0.337 |
| Diff. Start/End | 1.138 | 0.655 | 0.702 |
| Best avg. MSE | 0.109 | 0.536 | 0.337 |
| Best individual MSE | 0.117 | 0.526 | 0.335 |
| Start avg. RMSD | 1.893 | 1.915 | 1.907 |
| End avg. RMSD | 1.878 | 1.900 | 1.790 |
| Diff. Start/End | 0.015 | 0.015 | 0.117 |
| Best avg. RMSD | 1.822 | 1.808 | 1.724 |
| Best individual RMSD | 1.525 | 1.505 | 1.465 |

Table 4.1: This table shows the parameters and the results for Run 01, Run 02 and Run 03. Parameters denoted by - are not available for the chosen kernel method.

4.1.1 Initial Runs

In the first runs (run 01-03) of the experiment I used the PPK and the RBF kernel on the RDF of the molecules and the APK to generate the kernel matrices. The parameters were set to their default values to check the overall function of the optimization (see table 4.1).

The results of these three runs are depicted in figure 4.1. One can see that the basic optimization is functional: the average MSE declines from the values of 1.256, 1.191 and 1.039 to values of 0.109, 0.536 and 0.337. But while the average RMSD declines slightly in run 03 with the use of the APK, it remains at the same level with the use of the PPK and RBF. Although some isolated individuals get below the 5% quantile mark, they are dismissed in the next generation, implying that the individuals with a higher average RMSD result in better models with lower MSEs.

My first considerations on evaluating these results were twofold: either the use of the large dataset of molecules with unknown active structures impeded the decline of the ones with known active structures because their weight in the model building process was too large, or the parameters used were not fit for this kind of optimization.

Therefore I consecutively lowered the size of the dataset to 56, 41 and 34 molecules, always including the 12 known active structures, and changed the parameters of the kernels used: the B parameter of the RDF, resulting in a smoother RDF function, and the smoothing factor of the APK, both in the hope of a better generalization. These changes are shown in the next sections.



(a) This figure shows the results for the PPK with B = 1000 on the dataset with 56 molecules

(b) This figure shows the results for the PPK with B = 1000 on the dataset with 41 molecules

(c) This figure shows the results for the PPK with B = 1000 on the dataset with 34 molecules

(d) This figure shows the results for the RBF Kernel with sigma = 100 on the dataset with 56 molecules

(e) This figure shows the results for the RBF Kernel with sigma = 100 on the dataset with 41 molecules

(f) This figure shows the results for the RBF Kernel with sigma = 100 on the dataset with 34 molecules

Figure 4.2: These figures show the results of the runs with reduced datasets for the PPK and RBF Kernel.



| Parameter / Run Nr. | 04 | 05 | 06 | 07 | 08 | 09 |
| --- | --- | --- | --- | --- | --- | --- |
| Kernel method | PPK | PPK | PPK | RBF | RBF | RBF |
| RDF B factor | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| RBF Sigma factor | - | - | - | 100 | 100 | 100 |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no | no | no |
| Conformation Sampling | interm. | interm. | interm. | interm. | interm. | interm. |
| Dataset size | 56 | 41 | 34 | 56 | 41 | 34 |
| Start avg. MSE | 1.3470 | 1.4283 | 1.5836 | 1.3432 | 1.4391 | 1.5611 |
| End avg. MSE | 0.1184 | 0.0917 | 0.1366 | 0.3663 | 0.3422 | 0.4955 |
| Diff. Start/End MSE | 1.2286 | 1.3366 | 1.447 | 0.9769 | 1.0969 | 1.0656 |
| Best avg. MSE | 0.1128 | 0.0885 | 0.1343 | 0.3663 | 0.3422 | 0.4897 |
| Best individual MSE | 0.1091 | 0.0856 | 0.1334 | 0.3578 | 0.3393 | 0.4862 |
| Start avg. RMSD | 1.8626 | 1.8348 | 1.8924 | 1.8239 | 1.8206 | 1.8717 |
| End avg. RMSD | 1.7738 | 1.6095 | 1.9664 | 1.7765 | 1.8151 | 1.9408 |
| Diff. Start/End RMSD | 0.0888 | 0.2253 | -0.074 | 0.0474 | 0.0109 | -0.0691 |
| Best avg. RMSD | 1.5744 | 1.5977 | 1.5805 | 1.7443 | 1.8120 | 1.8688 |
| Best individual RMSD | 1.4630 | 1.2913 | 1.5148 | 1.2660 | 1.3552 | 1.2588 |

Table 4.2: This table shows the parameters and the results for Run 04 through Run 09. Parameters denoted by - are not available for the chosen kernel method.

4.1.2 Reduced Dataset with PPK and RBF Kernel

To see if reducing the dataset size would yield models with a lower average RMSD, I ran the PPK and the RBF on datasets where only every 2nd, 3rd and 4th molecule with unknown active structure was included. The hypothesis was that, because the overall influence of the known active structures on the model is higher, and if the general assumption that good models consist of good data (i.e. the active structures) holds, the RMSD would be lower.

The results of these runs are shown in figure 4.2 and table 4.2. With the use of the PPK (runs 04-06) the average RMSD gets below the 5% quantile at some point. In run 04 and run 06 the average RMSD gets below the 5% quantile within the first 50 generations but returns to its starting level shortly after and stagnates. In run 05 however the average RMSD stays at a relatively high value in comparison to runs 04 and 06 but declines to a value below the 5% quantile mark after 50 generations. In addition the MSE of the best model found for run 05 was the lowest of all three runs with a final value of 0.0917, in contrast to 0.1184 and 0.1366 for runs 04 and 06.

With the use of the RBF kernel (runs 07, 08 and 09) the average RMSD didn't get below the 5% quantile in any of the three runs, although run 08 on the dataset with 41 molecules showed a steady decline of the average RMSD similar to the results of run 05. It is noticeable that the initial generation of all three runs consisted of at least one individual with a very low average RMSD, and considering the low starting RMSD more than one. These individuals however were dismissed in the first 25 generations, resulting in a higher average RMSD. Furthermore the best models found by using the RBF kernel had MSE values of 0.3578, 0.3393 and 0.4862, which is in each case more than three times the MSE value of the best model for the PPK with corresponding dataset size, where the values are 0.1091, 0.0856 and 0.1334.

Considering this direct comparison, the PPK shows better results than the RBF kernel, both for the modeling and for the recovery of the actual active structure.


4 Results

Figure 4.3: These figures show the results of the runs with reduced datasets for the APK (smoothing factor = 0.1): (a) 56 molecules, (b) 41 molecules, (c) 34 molecules. Run 12 is the model with the lowest final average RMSD of all runs.


4.1 Precomputed Conformation Sampling

| Parameter / Run Nr. | 10 | 11 | 12 |
| --- | --- | --- | --- |
| Kernel method | APK | APK | APK |
| RDF B factor | - | - | - |
| RBF Sigma factor | - | - | - |
| Smoothing factor | 0.1 | 0.1 | 0.1 |
| Mutation Probability | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size | 56 | 41 | 34 |
| Start avg. MSE | 1.0026 | 1.2949 | 1.2255 |
| End avg. MSE | 0.3420 | 0.5053 | 0.5067 |
| Diff. Start/End MSE | 0.6606 | 0.7896 | 0.7188 |
| Best avg. MSE | 0.3420 | 0.5051 | 0.5023 |
| Best individual MSE | 0.3390 | 0.5028 | 0.5012 |
| Start avg. RMSD | 1.8399 | 1.8381 | 1.9241 |
| End avg. RMSD | 1.6799 | 2.0098 | 1.3813 |
| Diff. Start/End RMSD | 0.16 | -0.1717 | 0.5428 |
| Best avg. RMSD | 1.5922 | 1.7233 | 1.2618 |
| Best individual RMSD | 1.4729 | 1.3699 | 1.1870 |

Table 4.3: This table shows the parameters and the results for run 10, run 11 and run 12. Parameters denoted by - are not available for the chosen kernel method.

4.1.3 Reduced Dataset with Atom Pair Kernel

The reduction of the dataset had a similar effect on the use of the APK as it had on the PPK. The starting average RMSD was 1.8399, 1.8381 and 1.9241, and while run 10 and run 12 had a considerably lower end RMSD with 1.6799 and 1.3813, the final RMSD of run 11 was 2.0098, which is 0.1717 higher than the start RMSD.

The first and third run, 10 and 12, show a similar development as the earlier runs 04, 05 and 06, with the average RMSD dropping by several percent around generation 50. But in contrast to all other previous runs, the RMSD of run 12 declines further, giving an indication that the optimization reaches a point where it can drop into several minima, one of them being a model that includes structures more likely to be near the conformation of the active structure.

Further, one can see that for all three kernel methods the overall end MSE value rises with descending dataset size. This can be attributed to the loss of information with decreased dataset size. But both the APK and especially the PPK mostly lead to better models than the RBF does with the full dataset of 100 molecules. The APK and PPK differ in that the APK leads to models with a lower RMSD to the active structures but a higher MSE, while the PPK leads to very good models with the lowest MSE of all models created but with higher RMSD values.

Because most of the resulting models using the APK and PPK with the reduced datasets were as good as the ones using the full dataset, judging by their equal or lower MSE, I decided to use the reduced datasets for future runs. Since the SVR scales in O(n³) and the GA in O(n), this measure cut the computation time for a complete run by at least 50%.
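Under the stated scaling assumptions, the saving can be estimated directly. The sketch below compares only the dominant cubic SVR term; the actual speed-up additionally depends on the linear GA overhead:

```python
# Rough cost estimate under the assumption stated above: SVR training
# dominates the run time and scales with O(n^3) in the dataset size n.
def relative_svr_cost(n_reduced, n_full=100):
    return (n_reduced / n_full) ** 3

for n in (56, 41, 34):
    print(f"{n:3d} molecules -> {relative_svr_cost(n):.1%} of the full-dataset SVR cost")
```

Even the largest reduced set of 56 molecules needs under 20% of the cubic term, which makes a cut of more than 50% of the total run time plausible once the O(n) GA overhead is added back in.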


Figure 4.4: These figures show the results of the runs with reduced datasets and altered parameters for the PPK: (a) B = 10 on 34 molecules, (b) B = 10 on 56 molecules, (c) B = 100 on 56 molecules, (d) B = 500 on 34 molecules.


| Parameter / Run Nr. | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- |
| Kernel method | PPK | PPK | PPK | PPK |
| RDF B factor | 10 | 10 | 100 | 500 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size | 34 | 56 | 56 | 34 |
| Start avg. MSE | 1.5727 | 1.3344 | 1.3834 | 1.5559 |
| End avg. MSE | 0.1701 | 0.0853 | 0.0813 | 0.1387 |
| Diff. Start/End MSE | 1.4026 | 1.2491 | 1.3021 | 1.4172 |
| Best avg. MSE | 0.1645 | 0.0838 | 0.0777 | 0.1306 |
| Best individual MSE | 0.1595 | 0.0814 | 0.0763 | 0.1285 |
| Start avg. RMSD | 1.8585 | 1.8323 | 1.8402 | 1.9009 |
| End avg. RMSD | 1.8107 | 1.8826 | 1.7181 | 2.0711 |
| Diff. Start/End RMSD | 0.0478 | -0.0503 | 0.1221 | -0.1702 |
| Best avg. RMSD | 1.6172 | 1.7662 | 1.7118 | 1.6644 |
| Best individual RMSD | 1.4992 | 1.3194 | 1.0926 | 1.6265 |

Table 4.4: This table shows the parameters and the results for run 13, run 14, run 15 and run 16. Parameters denoted by - are not available for the chosen kernel method.

4.1.4 Alternative Parameters for the Product Probability Kernel

In addition to reducing the dataset I changed the parameters of the PPK and APK. The results for the PPK with the B parameter of the RDF set to 10, 100 and 500, compared to 1000 in the previous runs, are shown in figure 4.4 and table 4.4. One can see that run 16 still shows the low RMSD values around generation 50 with a strong increase and stagnation afterwards. The runs 13, 14 and 15 also show the decrease of the RMSD around generation 50, but not as strongly as the runs with a higher B value.

The average RMSD values of runs 13 and 15 only decrease slightly, by 0.0478 and 0.1221, from 1.8585 and 1.8402 to 1.8107 and 1.7181, while the average RMSD values of runs 14 and 16 even increase by 0.0503 and 0.1702, from 1.8323 and 1.9009 to 1.8826 and 2.0711. The MSE though reaches the lowest values of all runs, with run 15 at a value of 0.0777 and the second lowest at run 14 with 0.0838.

The fact that runs with a lowered B parameter render the best resulting models can be attributed to the B parameter describing the smoothness and distinctness of an RDF. With declining B the RDF becomes more of a general description of the respective molecule and its conformation instead of an exact characterization. In this case the presence of distinct chemical groups or pharmacophores and their arrangement relative to each other is more important than their individual orientation. This leads to a better generalization of the model at the cost of a weaker discrimination between the conformations of each molecule.
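The effect of B can be illustrated with a minimal Gaussian-smoothed RDF; the pair distances and the plain `exp(-B (r - d)^2)` form are simplifying assumptions, as a real RDF descriptor additionally weights each atom pair by atomic properties:

```python
import numpy as np

def rdf(distances, r_grid, B):
    """Gaussian-smoothed radial distribution function over a grid of radii r.
    Large B -> narrow peaks (exact characterization of a conformation),
    small B -> broad peaks (general description of the molecule)."""
    r = np.asarray(r_grid)[:, None]      # shape (grid, 1)
    d = np.asarray(distances)[None, :]   # shape (1, pairs)
    return np.exp(-B * (r - d) ** 2).sum(axis=1)

grid = np.linspace(0.0, 5.0, 251)        # radii from 0 to 5 Angstrom
pair_distances = [1.5, 1.6, 3.9]         # hypothetical interatomic distances
sharp  = rdf(pair_distances, grid, B=1000)  # resolves 1.5 vs 1.6 as two peaks
smooth = rdf(pair_distances, grid, B=10)    # merges them into one broad peak
```

With B = 1000 the two nearby distances remain distinguishable, while with B = 10 their Gaussians overlap into a single broad feature, which matches the interpretation above.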


Figure 4.5: These figures show the results of the runs with reduced datasets and altered parameters for the APK (smoothing factor = 0.2): (a) 34 molecules, (b) 41 molecules, (c) 56 molecules, (d) 100 molecules.


| Parameter / Run Nr. | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- |
| Kernel method | APK | APK | APK | APK |
| RDF B factor | - | - | - | - |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | 0.2 | 0.2 | 0.2 | 0.2 |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no |
| Conformation Sampling | intermediate | intermediate | intermediate | intermediate |
| Dataset size | 34 | 41 | 56 | 100 |
| Start avg. MSE | 1.1934 | 1.3013 | 0.9636 | 1.0234 |
| End avg. MSE | 0.3901 | 0.3813 | 0.2805 | 0.2878 |
| Diff. Start/End MSE | 0.8033 | 0.92 | 0.6831 | 0.7356 |
| Best avg. MSE | 0.3857 | 0.3813 | 0.2796 | 0.2878 |
| Best individual MSE | 0.3830 | 0.3807 | 0.2774 | 0.2864 |
| Start avg. RMSD | 1.9084 | 1.8337 | 1.8434 | 1.9025 |
| End avg. RMSD | 1.8879 | 1.7185 | 1.6696 | 1.9787 |
| Diff. Start/End RMSD | 0.0205 | 0.1152 | 0.1738 | -0.0762 |
| Best avg. RMSD | 1.7840 | 1.5997 | 1.6470 | 1.8900 |
| Best individual RMSD | 1.4992 | 1.3194 | 1.0926 | 1.6265 |

Table 4.5: This table shows the parameters and the results for run 17, run 18, run 19 and run 20. Parameters denoted by - are not available for the chosen kernel method.

4.1.5 Alternative Parameters for APK

Figure 4.5 and table 4.5 show the results for runs 17, 18, 19 and 20 using the APK with the full and the reduced datasets and a smoothing factor of 0.2. Runs 18 and 19 show the first deep decline of the RMSD at 50 generations to a global minimum of an average RMSD of 1.5997 and 1.6470, while run 17 shows a steady decline and run 20 an overall stagnation at an RMSD of approximately 2.0.

As with the alteration of the B parameter of the PPK, setting the smoothing factor to a value of 0.2 for the APK changes the generalization of the model, resulting in lower MSE values than in previous runs with the APK for all four runs. While the APK only encodes atom types, distances and binding modes, doubling the smoothing factor still retains enough information to fit the model. It further allows the GA to hold more individuals with a wider RMSD range. This can be seen in figure 4.5, with the best individual RMSD values being distinctly lower than the average RMSD values over several generations in all four runs. The average MSE values of the final models were 0.3901, 0.3813, 0.2805 and 0.2878, which is approximately 0.15 below previous runs. But in exchange for the better generalization and lower MSE values the overall RMSD stagnated, with only runs 18 and 19 showing a notable decline, to final values of 1.7185 and 1.6696, which are still above the 5% quantile.


Figure 4.6: These figures show the results of the runs with reduced datasets and increased mutation probability (0.5) for the PPK (runs 21 and 22, B = 10, on 34 and 56 molecules) and the APK (run 23, smoothing factor = 0.1, on 34 molecules); see table 4.6.


| Parameter / Run Nr. | 21 | 22 | 23 |
| --- | --- | --- | --- |
| Kernel method | PPK | PPK | APK |
| RDF B factor | 10 | 10 | - |
| RBF Sigma factor | - | - | - |
| Smoothing factor | - | - | 0.1 |
| Mutation Probability | 0.5 | 0.5 | 0.5 |
| Mutate first 12 only | no | no | no |
| Conformation Sampling | intermediate | intermediate | intermediate |
| Dataset size | 34 | 56 | 34 |
| Start avg. MSE | 1.5839 | 1.3608 | 1.2421 |
| End avg. MSE | 0.1584 | 0.1012 | 0.4708 |
| Diff. Start/End MSE | 1.4255 | 1.2596 | 0.7713 |
| Best avg. MSE | 0.132 | 0.0909 | 0.4651 |
| Best individual MSE | 0.1148 | 0.0736 | 0.4497 |
| Start avg. RMSD | 1.8889 | 1.8520 | 1.8823 |
| End avg. RMSD | 1.8699 | 1.9711 | 2.0111 |
| Diff. Start/End RMSD | 0.0190 | -0.1191 | -0.1288 |
| Best avg. RMSD | 1.7506 | 1.6712 | 1.7752 |
| Best individual RMSD | 1.4206 | 1.3559 | 1.4107 |

Table 4.6: This table shows the parameters and the results for run 21, run 22 and run 23. Parameters denoted by - are not available for the chosen kernel method.

4.1.6 Increased Mutation Rate

After studying the results of reducing the dataset and altering the kernel parameters, the next step was to change the parameters and the overall process of the GA. The simpler of the two changes was to set the mutation probability to 0.5 instead of the standard value of 0.1. The mutation probability defines the rate at which mutations occur during the mating process from one generation to the next. Increasing it allows the GA to search in a broader range and increases the chance of the optimization jumping out of a local minimum, but it also decreases the optimization rate and may lead to more diverse results.
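A minimal sketch of such a mutation operator, assuming an individual is encoded as one conformer index per molecule (a simplification of the encoding used in this work):

```python
import random

def mutate(individual, n_conformers, p_mut=0.5):
    """Replace each molecule's chosen conformer index by a random one
    with probability p_mut (0.1 standard, 0.5 in runs 21-23)."""
    return [
        random.randrange(n_conformers[i]) if random.random() < p_mut else gene
        for i, gene in enumerate(individual)
    ]

random.seed(0)
# hypothetical individual: one conformer index per molecule
n_conf = [8, 12, 5, 20]          # number of conformers per molecule
child = mutate([0, 3, 1, 7], n_conf, p_mut=0.5)
```

With p_mut = 0.5 roughly half of the genes change per mating step, which is what drives both the broader search and the noisier RMSD progression described above.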

As one can see, in all three runs depicted in figure 4.6 the increased mutation probability leads to at least one individual in each generation with a significantly lower RMSD than the average. Further noticeable is the fact that the progression of the average and especially the best individual RMSD includes more peaks.

While changing the mutation probability still leads to good models with average MSE values of 0.1584, 0.1012 and 0.4708, in two of the runs the average end RMSD was even 0.1191 and 0.1288 higher than the start RMSD of 1.8520 and 1.8823.


Figure 4.7: These figures show the results of the runs with reduced datasets of the alternative conformation sampling: (a, b) APK with smoothing factor = 0.2 and mutation probability 0.1, (c, d) PPK with B = 10 and increased mutation probability 0.5, all on 56 molecules.

4.1.7 Alternative Conformation Sampling

A second way of increasing the chance of the optimization jumping out of a local minimum was to change the conformation sampling of the dataset. While the intermediate parameters for the ConfGen algorithm only allow local minima of relative molecular energy, the comprehensive parameters allowed ConfGen to output conformations whose dihedral angles and flexible ring energies are not at a local minimum.


| Parameter / Run Nr. | 24 | 25 | 26 | 27 |
| --- | --- | --- | --- | --- |
| Kernel method | APK | APK | PPK | PPK |
| RDF B factor | - | - | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | 0.2 | 0.2 | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.5 | 0.5 |
| Mutate first 12 only | no | no | no | no |
| Dataset size | 56 | 56 | 56 | 56 |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE | 0.9265 | 0.9252 | 1.3531 | 1.3671 |
| End avg. MSE | 0.3775 | 0.3192 | 0.1084 | 0.1191 |
| Diff. Start/End MSE | 0.549 | 0.606 | 1.2447 | 1.248 |
| Best avg. MSE | 0.3775 | 0.3192 | 0.1025 | 0.1059 |
| Best individual MSE | 0.3756 | 0.3169 | 0.0816 | 0.0934 |
| Start avg. RMSD | 1.7999 | 1.8047 | 1.794 | 1.7862 |
| End avg. RMSD | 1.7873 | 1.7608 | 1.6181 | 1.7728 |
| Diff. Start/End RMSD | 0.0126 | 0.0439 | 0.1759 | 0.0134 |
| Best avg. RMSD | 1.7175 | 1.7385 | 1.5084 | 1.6867 |
| Best individual RMSD | 1.376 | 1.3512 | 1.2488 | 1.3333 |

Table 4.7: This table shows the parameters and the results for run 24, run 25, run 26 and run 27. Parameters denoted by - are not available for the chosen kernel method.

The thought was to allow the GA to successively get out of local minima because the differences in relative molecular energies are not as great as with the intermediate conformation sampling. Therefore the error of a change from one conformation to the one with the nearest relative energy would not be as great for the comprehensive conformation sampling as for the intermediate one.

Another reason for using the comprehensive conformation sampling was that the relative energy of an active structure is not necessarily a local minimum, due to the interaction of the molecule with its target and the solvent. So by allowing non-minimum structures in the dataset I reduced the minimal RMSD between the conformers in the dataset and the active structures.

The results for the experiments with the dataset produced by conformation sampling with comprehensive parameters are shown in figure 4.7 and table 4.7. As one can see in run 24 and run 25, which used the APK and a smoothing factor of 0.2, the first decline of the average RMSD with its concurrent increase between generation 35 and 50 still occurs. But instead of stagnating at the same average level as in most previous runs, the average RMSD declines again in later generations. The final average RMSD, however, was only 0.0126 and 0.0439 lower than the starting average RMSD, at 1.7873 and 1.7608, while the final average MSE of 0.3775 and 0.3192 was better than in most of the previous runs with the APK. Therefore the final models were more precise but still did not include conformations near the active structure.

The results for run 26 and run 27 are also shown in figure 4.7 and table 4.7. Both runs used the PPK and a mutation probability of 0.5. In run 26 the average RMSD almost always lies within the 5% quantile. In run 27 the average RMSD declines to a value of approximately 1.7 and stagnates for the second half of the optimization. The runs have final average RMSD values of 1.6181 and 1.7728 and final average MSE values of 0.1084 and 0.1191.


Figure 4.8: These figures show the results of the runs with reduced datasets of the alternative conformation sampling and an altered mutation operator that allows mutation only on the conformers of the 12 molecules with known active structure: (a) APK with smoothing factor = 0.1 on 56 molecules, (b) APK with smoothing factor = 0.1 on 34 molecules, (c) PPK with B = 10 on 56 molecules, (d) PPK with B = 10 on 34 molecules.


| Parameter / Run Nr. | 28 | 29 | 30 | 31 |
| --- | --- | --- | --- | --- |
| Kernel method | APK | APK | PPK | PPK |
| RDF B factor | - | - | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | 0.1 | 0.1 | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Dataset size | 56 | 34 | 56 | 34 |
| Conformation Sampling | comprehensive | comprehensive | comprehensive | comprehensive |
| Start avg. MSE | 0.856 | 1.1405 | 1.311 | 1.5813 |
| End avg. MSE | 0.682 | 0.9029 | 0.7738 | 0.7589 |
| Diff. Start/End MSE | 0.174 | 0.2376 | 0.5372 | 0.8224 |
| Best avg. MSE | 0.6815 | 0.9022 | 0.7706 | 0.7255 |
| Best individual MSE | 0.6815 | 0.9022 | 0.7656 | 0.7207 |
| Start avg. RMSD | 1.7972 | 1.7881 | 1.7808 | 1.7808 |
| End avg. RMSD | 1.7187 | 1.8774 | 1.7369 | 1.8078 |
| Diff. Start/End RMSD | 0.0785 | -0.0893 | 0.0439 | -0.027 |
| Best avg. RMSD | 1.5975 | 1.727 | 1.7098 | 1.7052 |
| Best individual RMSD | 1.3621 | 1.5028 | 1.3915 | 1.3247 |

Table 4.8: This table shows the parameters and the results for run 28, run 29, run 30 and run 31. Parameters denoted by - are not available for the chosen kernel method.

4.1.8 Alternative Mutation Operator

The final change was to alter the mutation operator in such a way that it only allowed the conformers of the molecules with known active structures to be mutated during the mating process at the end of a generation. For the rest of the molecules, i.e. the ones with unknown active structures, the conformation with the minimal relative energy was fixed. The reason for this change was to reduce the search space of the optimization to the conformations of the molecules with known active structure, thereby increasing the chance of finding a model with low MSE that includes conformations similar to the active structures, resulting in a lower RMSD.
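The restricted operator can be sketched by masking which genes may mutate; the molecule and conformer counts below are hypothetical:

```python
import random

def mutate_restricted(individual, n_conformers, mutable, p_mut=0.1):
    """Mutate only the conformer indices of the molecules in `mutable`
    (here: the 12 molecules with known active structure); all other
    molecules stay fixed at their minimal-energy conformer."""
    child = list(individual)
    for i in mutable:
        if random.random() < p_mut:
            child[i] = random.randrange(n_conformers[i])
    return child

random.seed(1)
parent = [0] * 20            # hypothetical: 20 molecules, conformer 0 each
n_conf = [10] * 20
child = mutate_restricted(parent, n_conf, mutable=range(12), p_mut=1.0)
# molecules 12..19 (unknown active structure) are guaranteed untouched
```

Fixing the remaining genes shrinks the search space from the product of all conformer counts to the product over the 12 known actives only.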

The results of the four runs, 28, 29, 30 and 31, with the altered mutation operator are shown in figure 4.8 and table 4.8. One can see that, while the average MSE rapidly declines in the first 25 generations, the average RMSD remains at the same level throughout the whole run in all four runs. Further, the average MSE only reaches values of 0.682 to 0.9029, which is significantly higher than in previous runs due to the fact that the remaining fixed molecules do not allow a better model.

This means that, while only optimizing over the generated conformers of the known active structures, the best models found still do not include conformations similar (i.e. with a low RMSD) to those structures. Possible reasons for that are manifold and will be discussed in the next chapter.


Figure 4.9: These figures show the average results of four runs each with the APK (smoothing factor = 0.1) and reduced datasets of 34 molecules: (a) conformers created with the intermediate parameter set, (b) conformers created with the comprehensive parameter set.


| Parameter / Run Nr. | avg. of runs 32-35 | avg. of runs 36-39 |
| --- | --- | --- |
| Kernel method | APK | APK |
| RDF B factor | - | - |
| RBF Sigma factor | - | - |
| Smoothing factor | 0.1 | 0.1 |
| Mutation Probability | 0.1 | 0.1 |
| Mutate first 12 only | no | no |
| Dataset size | 34 | 34 |
| Conformation Sampling | intermediate | comprehensive |
| Start avg. MSE | 1.2388 | 1.2199 |
| End avg. MSE | 0.5018 | 0.5338 |
| Diff. Start/End MSE | 0.7370 | 0.6862 |
| Best avg. MSE | 0.4999 | 0.5338 |
| Best individual MSE | 0.4965 | 0.5316 |
| Start avg. RMSD | 1.8967 | 1.8022 |
| End avg. RMSD | 1.8979 | 1.8070 |
| Diff. Start/End RMSD | -0.0012 | -0.0048 |
| Best avg. RMSD | 1.8294 | 1.6579 |
| Best individual RMSD | 1.5099 | 1.3405 |

Table 4.9: This table shows the parameters and the results for the average of runs 32-35 and 36-39. Parameters denoted by - are not available for the chosen kernel method.

4.1.9 Reruns

The only run resulting in a considerably lower average RMSD than all other runs was run 12, with a final average RMSD of 1.3813 (see figure 4.3 and table 4.3). To check whether this was a random result or whether the constellation of kernel, dataset and parameters leads to models using conformations with a low RMSD to the active structure, I reran the specific parameter set of run 12 four times with each of the two conformation sampling parameter sets, intermediate and comprehensive. The averaged results of these runs are shown in figure 4.9 and table 4.9.

As one can see, the average RMSD stagnates at 1.8, which is also the mean value of the RMSD over all possible combinations of conformers. The decline and immediate return to the mean RMSD between generation 25 and 50 is also visible in both results.

This strongly suggests that run 12 was a random result with the GA finding a local minimum. With a value of 0.567 the MSE of run 12 is even higher than the average MSE of both sets of four runs, 0.5018 and 0.5338.

4.2 Implicit Conformation Sampling

The runs of the optimization with implicit conformation sampling were done in parallel with the runs with precomputed conformation sampling. Therefore the results of the runs with precomputed conformation sampling influenced the decisions made for the parameters and dataset size of the runs with implicit conformation sampling. One run with implicit conformation sampling on the full dataset took up to two weeks on a Xeon quadcore server, which is why there are fewer results for the implicit conformation sampling.


Figure 4.10: These figures show the results of the runs with implicit conformation sampling on the full dataset: (a) RBF kernel with B = 1000 and sigma = 100, (b) RBF kernel with B = 1000, (c, d) PPK with B = 10.


| Parameter / Run Nr. | 01i | 02i | 03i | 04i |
| --- | --- | --- | --- | --- |
| Kernel method | RBF | PPK | PPK | PPK |
| RDF B factor | 1000 | 10 | 10 | 10 |
| RBF Sigma factor | 100 | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | no | no | no | no |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.1411 | 1.0375 | 1.1491 | 1.0389 |
| End avg. MSE | 0.5331 | 0.171 | 0.519 | 0.3874 |
| Diff. Start/End MSE | 0.608 | 0.8665 | 0.6301 | 0.6515 |
| Best avg. MSE | 0.5311 | 0.171 | 0.519 | 0.3874 |
| Best individual MSE | 0.5285 | 0.1638 | 0.5142 | 0.3808 |
| Start avg. RMSD | 1.8724 | 1.8826 | 1.9039 | 1.8894 |
| End avg. RMSD | 1.7978 | 1.9955 | 1.8523 | 1.8909 |
| Diff. Start/End RMSD | 0.0746 | -0.1129 | 0.0516 | -0.0015 |
| Best avg. RMSD | 1.6873 | 1.8246 | 1.8425 | 1.8498 |
| Best individual RMSD | 1.5876 | 1.5268 | 1.4942 | 1.6142 |

Table 4.10: This table shows the parameters and the results for run 01i, run 02i, run 03i and run 04i. Parameters denoted by - are not available for the chosen kernel method.

4.2.1 Initial Runs

The results for the initial runs, 01i and 02i, are shown in figure 4.10 and table 4.10. Runs 03i and 04i are later runs with the same parameters as run 02i. As one can see, the average MSE is decreasing, which shows that the optimization is functional. But in comparison to the runs with precomputed conformation sampling shown in the preceding section, the average MSE decreases more slowly and has not reached a minimum at the end of the run. This can be seen from the average MSE still decreasing in generations 150 to 200 without reaching an even level. The final average MSE of the runs was 0.5331, 0.1710, 0.5190 and 0.3874, which is in the range of the results for the average MSE from the runs with precomputed conformation sampling.

Further noticeable is the fact that the average RMSD shows almost no change after generation 50 in all four runs, stagnating for many generations. The mean RMSD over all generations of all four runs is 1.859, which is near the overall mean of 1.81 of all possible conformations. Together with the best individual RMSD this can be explained by the small chance of a mutation occurring at the rotatable bonds of the molecules with known active structure. This low chance of a mutation is a result of the molecules with known active structure having fewer rotatable bonds than the molecules without known active structure. Therefore the chance of a mutation occurring on a molecule without known active structure is increased in relation to the runs with precomputed conformation sampling, where the mutation chances were equally distributed.
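This effect can be made explicit with a small sketch: if every rotatable bond is equally likely to be hit by a mutation, a molecule's share of the mutations is proportional to its bond count (the counts below are hypothetical):

```python
def mutation_share(rotatable_bonds):
    """Probability that a uniformly chosen mutated bond belongs to each
    molecule, assuming per-bond mutation is equally likely."""
    total = sum(rotatable_bonds.values())
    return {name: n / total for name, n in rotatable_bonds.items()}

# hypothetical counts: known actives tend to have fewer rotatable bonds
share = mutation_share({"known_active": 4, "unknown_1": 12, "unknown_2": 9})
print(share["known_active"])  # 4/25 = 0.16
```

The known active here receives only 16% of the mutations, whereas a per-molecule mutation operator, as used with the precomputed sampling, would give each molecule an equal third.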

In addition all four runs lack the initial decrease of the average RMSD between generation 25 and 50 seen in the runs with precomputed conformation sampling.


Figure 4.11: These figures show the results of the runs with reduced datasets, implicit conformation sampling and the use of the PPK (B = 10): (a) first run on 34 molecules, (b) second run on 34 molecules, (c) first run on 56 molecules, (d) second run on 56 molecules.


| Parameter / Run Nr. | 05i | 06i | 07i | 08i |
| --- | --- | --- | --- | --- |
| Kernel method | PPK | PPK | PPK | PPK |
| RDF B factor | 10 | 10 | 10 | 10 |
| RBF Sigma factor | - | - | - | - |
| Smoothing factor | - | - | - | - |
| Mutation Probability | 0.1 | 0.1 | 0.1 | 0.1 |
| Mutate first 12 only | yes | yes | yes | yes |
| Conformation Sampling | implicit | implicit | implicit | implicit |
| Dataset size | 100 | 100 | 100 | 100 |
| Start avg. MSE | 1.687 | 1.73 | 1.5033 | 1.5079 |
| End avg. MSE | 0.6932 | 0.7425 | 0.9929 | 0.9555 |
| Diff. Start/End MSE | 0.9938 | 0.9875 | 0.5104 | 0.5524 |
| Best avg. MSE | 0.673 | 0.7398 | 0.989 | 0.9475 |
| Best individual MSE | 0.6693 | 0.7352 | 0.9827 | 0.9459 |
| Start avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.8546 |
| End avg. RMSD | 2.0149 | 2.0571 | 1.9715 | 1.8685 |
| Diff. Start/End RMSD | -0.1557 | -0.1966 | -0.1228 | -0.0139 |
| Best avg. RMSD | 1.8592 | 1.8605 | 1.8487 | 1.7359 |
| Best individual RMSD | 1.5489 | 1.5543 | 1.5698 | 1.5687 |

Table 4.11: This table shows the parameters and the results for run 05i, run 06i, run 07i and run 08i. Parameters denoted by - are not available for the chosen kernel method.

4.2.2 Reduced Dataset and Fixed Conformation

In runs 05i to 08i I combined several changes. First I reduced the dataset to 56 and 34 molecules, including the 12 with known active structure. The second change was to fix the conformation of the molecules with unknown active structure to the conformation with the lowest relative energy and to allow mutation only at rotatable bonds of the molecules with known active conformation. Run 07i was interrupted at generation 116 due to a server crash and could not be resumed.

The results for the four runs with this configuration are shown in figure 4.11 and table 4.11. One can see that the average MSE declines faster (within the first 50 generations) than in the previous runs with implicit conformation sampling, to final values of 0.6932, 0.7425, 0.9929 and 0.9555. These higher average MSE values can be explained by the fixed conformations and the resulting missing possibility for optimization.

One can see the effect of allowing mutation only on the 12 molecules with known conformation: the best individual RMSD is lower than the average RMSD in almost every generation in all four runs. This can be explained by the higher per-molecule mutation rate resulting in individuals with conformations with a lower RMSD to the active structures.


5 Discussion

The hypothesis that the best achievable models to predict activity include the active structures of the training molecules cannot be confirmed in this work. The average RMSD over all runs is almost exactly the average RMSD of all possible sets of conformations to the active structures (see figure 5.1). While some runs show an RMSD below the 5% quantile, they still lie within the normal distribution and are countered by the runs with an RMSD above average. Though models with a low average RMSD were found, the optimization in most cases returned to models with an RMSD near the average value. The reasons for these results can come from two directions: they can be either chemical or mathematical in nature.

One possible reason is that too many factors determining the active structure are missing from the models I created. For example the solvent of the molecules, in this case water, is not included in the model at all. But as studies have shown, the solvent often has a great influence on the activity and the active structure of a molecule. It can change the molecule's conformation to a more fitting one or even be part of the active site itself by filling the space not occupied by the ligand. Therefore disregarding the solvent may lead to a model whose feature space does not hold enough information about the active complex.

Figure 5.1: This figure shows the diagram for the mean values over all runs. One can clearly see the average RMSD stagnating at 1.81 over the whole run while the MSE declines to a value of about 0.4

A second reason may be that only part of a molecule's conformation is critical for the activity, while another, possibly larger part can take up a random, probably energetically minimal, conformation. However this would only account for a part of the normal distribution of the RMSD values. The active parts of the molecule conformations chosen for the model would have a distinctly lower average RMSD to the active conformation, resulting in the overall RMSD being lower than the mean of the normal distribution. For the dataset used in this work this can be ruled out, because all 12 molecules with known active structure are entirely integrated in the active process and have no parts which can take on free conformations.


Another reason for the models consisting of conformations with an average RMSD to the active structure may come from the kernel methods used in combination with the molecules in the dataset. Regardless of their chemical properties, the molecules in the dataset often consist of several ring systems. These ring systems contribute the same partial results to the kernel values for all different combinations of conformations, because they are rigid and do not change between conformations. To rule this out one would have to repeat the experiments with a dataset of molecules with fewer rigid parts or with a kernel method that prioritizes longer distances within the conformations.

The most important and apparent reason though is based on the principle of the SVR. Since the activity value is the same for each conformer of a given molecule, it is clear that to achieve maximal generalization the optimization will pick the conformation that best represents the whole conformational space of the molecule. In the conformational hypersphere this is a conformation near the center of the sphere, which means a conformation with a maximal similarity to all other conformations of that molecule, or in other words, a conformation with a minimal average RMSD to all other conformers. In most cases this will not be the active conformation. For the 12 molecules with known active structure used in this work, the average distance (i.e. RMSD) from the active conformation to the center of the conformational hypersphere was 1.81 Å.
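Picking the conformation with minimal average RMSD to all other conformers is exactly the medoid of the pairwise-distance matrix; a sketch, with `rmsd` as a placeholder for the quaternion-based RMSD of chapter 2 and a toy 1-D stand-in for conformations:

```python
import numpy as np

def medoid_index(conformers, rmsd):
    """Return the index of the conformer with minimal average RMSD
    to all other conformers of the same molecule."""
    n = len(conformers)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = rmsd(conformers[i], conformers[j])
    return int(d.mean(axis=1).argmin())

# toy 1-D stand-in for conformations, with absolute difference as "RMSD"
confs = [0.0, 1.0, 1.2, 1.4, 5.0]
print(medoid_index(confs, lambda a, b: abs(a - b)))  # -> 2
```

The medoid (here the conformation 1.2, which sits in the dense middle of the toy set) is exactly the kind of "center of the hypersphere" conformation the SVR favors, while an outlying active conformation such as 5.0 would never be chosen.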

Leaving aside the resulting average RMSD values, the PPK yielded the best MSE, with values as low as 0.01, in contrast to the RDF kernel with MSE values only in the range of 0.5 and the APK with MSE values in the range of 0.3. Furthermore, the PPK showed the steepest descent of the MSE values, reaching an almost even level at generation 25-50, whereas the APK needed more generations. The resulting models of the implicit conformation sampling were less significant because in most runs the optimization process did not finish. This can be explained by the mutation probability being set to only 0.1 while the number of points was about 10 times as high as with the precomputed conformation sampling. One can see that by reducing the possible mutations, as in the final runs with implicit conformation sampling and fixed conformers, the MSE also decreases more rapidly.


6 Prospects

Following the results of this work, models for activity prediction provide the best results not by using the active structures but a conformation with minimal distance to all possible conformations of the respective molecules. To confirm these results, one would have to run further experiments with other kernels and datasets. In these experiments one would have to calculate the distances of the best resulting conformations for activity prediction not only to the active conformation but to all possible conformations, or at least to a set distributed evenly over the conformational space. These new models would be expected to provide the best results if they were based on these average conformations and not on the active structures.

If proven correct, one would have to rethink the use of active conformations in 3D QSAR models in favor of more generalized conformations. Furthermore, this would call for a method of finding a conformation near the center of the conformational hypersphere without calculating all pairwise distances of all possible conformations.
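One cheap heuristic in that direction (a hypothetical sketch, not a method developed in this work) would estimate each conformer's average distance against a small random reference subset rather than against all conformers, cutting the cost from O(n²) to O(n · n_refs) RMSD evaluations:

```python
import random
import numpy as np

def rmsd(a, b):
    # Plain RMSD between two (n_atoms, 3) arrays; assumes prior superposition.
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def approx_center_conformer(conformers, n_refs=20, seed=0):
    """Approximate the conformer nearest the center of the conformational
    hypersphere using a random reference subset instead of all pairs."""
    rng = random.Random(seed)
    refs = rng.sample(range(len(conformers)), min(n_refs, len(conformers)))
    def avg_dist(i):
        return float(np.mean([rmsd(conformers[i], conformers[j]) for j in refs]))
    return min(range(len(conformers)), key=avg_dist)
```

With `n_refs` equal to the number of conformers this reduces to the exact medoid; smaller subsets trade accuracy for speed, which is the kind of shortcut such a method would have to validate.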

If proven wrong, one would have to revise the results of this work and further investigate why the average RMSD of the best achievable models consistently equals the exact RMSD of the active conformations to the center of the proposed conformational hypersphere. For this, one would have to compile a new set of molecules with known active structures, including more and more diverse active conformations to cover a wider range of the chemical space.


Bibliography

[Bau09] Baum, B.; Muley, L.; Heine, A.; Smolinski, M.; Hangauer, D.; Klebe, G. Think twice: understanding the high potency of bis(phenyl)methane inhibitors of thrombin. J. Mol. Biol., 391:552-564, 2009.

[Ber77] Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.; Meyer, E.E., Jr.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. of Mol. Biol., 112:535, 1977.

[BGV92] Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM, 1992.

[Boe99] Boehm, M.; Stuerzebecher, J.; Klebe, G. Three-dimensional quantitative structure-activity relationship analyses using comparative molecular field analysis and comparative molecular similarity indices analysis to elucidate selectivity differences of inhibitors binding to trypsin, thrombin, and factor Xa. Journal of Medicinal Chemistry, 42:458-477, 1999.

[Cha05] Chang, Q. Scaling Gaussian RBF kernel width to improve SVM classification. In Neural Networks and Brain, 2005. ICNN&B '05. International Conference on, pages 19-22, 2005.

[CL01] Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Cou04] Coutsias, Evangelos A.; Seok, Chaok; Dill, Ken A. Using quaternions to calculate RMSD. J. Comput. Chem., 25:1849-1857, 2004.

[Cou05] Coutsias, Evangelos A.; Seok, Chaok; Dill, Ken A. Rotational superposition and least squares: the SVD and quaternions approaches yield identical results. Reply to the preceding comment by G. Kneller. J. Comput. Chem., 26:1663-1665, 2005.

[CV95] Cortes, C.; Vapnik, V. Support-vector networks. Machine Learning, 20:273-297, 1995.

[Dia76] Diamond, R. On the comparison of conformations using linear and quadratic transformations. Acta Cryst., 32:1-10, 1976.

[Dix06] Dixon, S.; Smondyrev, A.; Knoll, E.; Rao, S.; Shaw, D.; Friesner, R.A. PHASE: a new engine for pharmacophore perception, 3D QSAR model development and 3D database screening. J. Comput.-Aided Mol. Design, 20(10):647-671, 2006.


[Eki04] Ekins, S. Predicting undesirable drug interactions with promiscuous proteins in silico. Drug Discovery Today, 9:276-285, 2004.

[Fis94] Fischer, E. Einfluss der Konfiguration auf die Wirkung der Enzyme. Berichte der deutschen chemischen Gesellschaft, 27:2985-2993, 1894.

[Fle89] Fletcher, R. Practical Methods of Optimization. John Wiley and Sons, New York, 1989.

[Fri04] Friesner, R.A.; Banks, J.L.; Murphy, R.B.; Halgren, T.A.; Klicic, J.J.; Mainz, D.T.; Repasky, M.P.; Knoll, E.H.; Shelley, M.; Perry, J.K.; Shaw, D.E.; Francis, P.; Shenkin, P.S. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem., 47(7):1739-1749, 2004.

[Gas96] Gasteiger, J.; Sadowski, J.; Schuur, J.; Selzer, P.; Steinhauer, L.; Steinhauer, V. Chemical information in 3D space. J. Chem. Inf. Comput. Sci., 36:1030-1037, 1996.

[Gas97] Gasteiger, J.; Schuur, J.; Selzer, P.; Steinhauer, L.; Steinhauer, V. Finding the 3D structure of a molecule in its IR spectrum. Fresenius J. Anal. Chem., 359:50-55, 1997.

[Guy93] Guyon, I.; Boser, B.; Vapnik, V.N. Automatic capacity tuning of very large VC-dimension classifiers. In Advances in Neural Information Processing Systems, pages 147-155. Morgan Kaufmann, San Mateo, CA, 1993.

[Ham66] Hamilton, Sir William Rowan. Elements of Quaternions. Longmans, Green & Co., London, 1866.

[Han69] Hansch, C. A quantitative approach to biochemical structure-activity relationships. Acc. Chem. Res., 2:232-239, 1969.

[Har94] Hart, John C.; Francis, George K.; Kauffman, Louis H. Visualizing quaternion rotation. Transactions on Graphics, 13:256-276, 1994.

[Hem99] Hemmer, Markus C.; Steinhauer, V.; Gasteiger, J. Deriving the 3D structure of organic molecules from their infrared spectra. Vibrational Spectroscopy, 19:151-164, 1999.

[HG04] Hao, H.Z.; Genton, M. Compactly supported radial basis function kernels. 2004.

[Hol75] Holland, John H. Adaptation in Natural and Artificial Systems. Univ. Michigan Press, 1975.

[Jah09] Jahn, A.; Hinselmann, G.; Fechner, N.; Zell, A. Optimal assignment methods for ligand-based virtual screening. Journal of Cheminformatics, 1:14, 2009.

[Jeb04] Jebara, T.; Kondor, R.; Howard, A. Probability product kernels. Journal of Machine Learning Research, 5:819-844, 2004.

[Jor88] Jorgensen, W.L.; Tirado-Rives, J. The OPLS potential functions for proteins. Energy minimizations for crystals of cyclic peptides and crambin. J. Am. Chem. Soc., 110:165, 1988.


[Jor96] Jorgensen, W.L.; Maxwell, D.S.; Tirado-Rives, J. Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc., 118:11225-11235, 1996.

[JZ10] Jahn, A.; Hinselmann, G.; Fechner, N.; Henneges, C.; Zell, A. Probabilistic modeling of conformational space for 3D machine learning approaches. Mol. Inf., 29:441-455, 2010.

[Kab76] Kabsch, Wolfgang. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica, A32(5):922-923, 1976.

[Kat00] Katz, B.A.; Mackman, R.; Luong, C.; Radika, K.; Martelli, A.; Sprengeler, P.A.; Wang, J.; Chan, H.; Wong, L. Structural basis for selectivity of a small molecule, S1-binding, submicromolar inhibitor of urokinase-type plasminogen activator. Chem. Biol., 7:299-312, 2000.

[Kat01a] Katz, B.A.; Elrod, K.; Luong, C.; Rice, M.J.; Mackman, R.L.; Sprengeler, P.A.; Spencer, J.; Hataye, J.; Janc, J.; Link, J.; Litvak, J.; Rai, R.; Rice, K.; Sideris, S.; Verner, E.; Young, W. A novel serine protease inhibition motif involving a multi-centered short hydrogen bonding network at the active site. J. Mol. Biol., 307:1451-1486, 2001.

[Kat01b] Katz, B.A.; Sprengeler, P.A.; Luong, C.; Verner, E.; Elrod, K.; Kirtley, M.; Janc, J.; Spencer, J.R.; Breitenbucher, J.G.; Hui, H.; McGee, D.; Allen, D.; Martelli, A.; Mackman, R.L. Engineering inhibitors highly selective for the S1 sites of Ser190 trypsin-like serine protease drug targets. Chem. Biol., 8:1107-1121, 2001.

[Kat03] Katz, B.A.; Elrod, K.; Verner, E.; Mackman, R.L.; Luong, C.; Shrader, W.D.; Sendzik, M.; Spencer, J.R.; Sprengeler, P.A.; Kolesnikov, A.; Tai, V.W.-F.; Hui, H.C.; Breitenbucher, J.G.; Allen, D.; Janc, J.W. Elaborate manifold of short hydrogen bond arrays mediating binding of active site-directed serine protease inhibitors. J. Mol. Biol., 329:93-120, 2003.

[Kat04] Katz, B.A.; Luong, C.; Ho, J.D.; Somoza, J.R.; Gjerstad, E.; Tang, J.; Williams, S.R.; Verner, E.; Mackman, R.L.; Young, W.B.; Sprengeler, P.A.; Chan, H.; Mortara, K.; Janc, J.W.; McGrath, M.E. Dissecting and designing inhibitor selectivity determinants at the S1 site using an artificial Ala190 protease (Ala190 uPA). J. Mol. Biol., 344:527-547, 2004.

[Kea89] Kearsley, Simon K. On the orthogonal transformation used for structural comparison. Acta Crystallographica, A45(2):208-210, 1989.

[Kel06] Kellenberger, E.; Muller, P.; Schalon, C.; Bret, G.; Foata, N.; Rognan, D. sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank. Journal of Chemical Information and Modeling, 46(2):717-727, 2006.

[Kos58] Koshland, D. E., Jr. Application of a theory of enzyme specificity to protein synthesis. Proc. Natl. Acad. Sci. U.S.A., 44:98-104, 1958.

[Kos94] Koshland, D. E., Jr. The key and lock theory and the induced fit theory. Angew. Chem. Int. Ed. Engl., 33:2375-2378, 1994.


[Mac84] Mackay, A. L. Quaternion transformation of molecular orientation. Acta Crystallographica Section A, 40(2):165-166, Mar 1984.

[Mat96] Matthews, J.H.; Krishnan, R.; Costanzo, M.J.; Maryanoff, B.E.; Tulinsky, A. Crystal structures of thrombin with thiazole-containing inhibitors: probes of the S1 binding site. Biophys. J., 71:2830-2839, 1996.

[McL72] McLachlan, A.D. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Cryst., 28:656-657, 1972.

[OW91] Oprea, T.I.; Walter, C.L. Theoretical and practical aspects of three-dimensional quantitative structure-activity relationships. In Reviews in Computational Chemistry, pages 127-182. Wiley-VCH: New York, 1991.

[Sad94] Sadowski, J.; Wagener, M.; Gasteiger, J. CORINA: automatic generation of high-quality 3D molecular models for application in QSAR. In 10th European Symposium on Structure-Activity Relationships: QSAR and Molecular Modelling, 1994.

[Sch96] Schuur, J.H.; Selzer, P.; Gasteiger, J. The coding of the three-dimensional structure of molecules by molecular transforms and its application to structure-spectra correlations and studies of biological activity. J. Chem. Inf. Comput. Sci., 36:334-344, 1996.

[Sel97] Selzer, P.; Schuur, J.H.; Gasteiger, J. Simulation of IR spectra with neural networks using the 3D-MoRSE code. In Software Development in Chemistry 10, volume 10, page 293. Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1997.

[Sew07] Sewell, Martin. Kernel methods. Technical report, Department of Computer Science, University College London, 2007.

[Sho85] Shoemake, K. Animating rotation with quaternion curves. Comput. Graph., 19:245-254, 1985.

[SI08a] Schroedinger Inc., New York. LigPrep, v2.1, 2008.

[SI08b] Schroedinger Inc., New York. MacroModel, v9.6, 2008.

[SJ93] Stone, M.; Jonathan, P. Statistical thinking and techniques for QSAR related studies. 1. General theory. J. Chemom., 7:455-475, 1993.

[SOW04] Sutherland, Jeffrey J.; O'Brian, Lee A.; Weaver, Donald F. A comparison of methods for modeling quantitative structure-activity relationships. J. Med. Chem., 47:5541-5554, 2004.

[Ste03] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): an open source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43(2):493-500, 2003.

[Ste06] Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R. Recent development of the Chemistry Development Kit (CDK) - an open source library for chemo- and bioinformatics. Curr. Pharm. Des., 12(17):2111-2120, 2006.


[Vap82] Vapnik, V. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.

[Vap95] Vapnik, V. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[VC64] Vapnik, V.; Chervonenkis, A. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[VC74] Vapnik, V.; Chervonenkis, A. Theory of Pattern Recognition. Nauka (Russia), 1974.

[VL63] Vapnik, V.; Lerner, A. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[WS10] Watts, K.S.; Dalal, P.; Murphy, R.B.; Sherman, W.; Friesner, R.A.; Shelley, J.C. ConfGen: a conformational search method for efficient generation of bioactive conformers. J. Chem. Inf. Model., 50:534-546, 2010.

[XG02] Xi, C.; Lin, Y.; Ming, L.; Gilson, K. The Binding Database: data management and interface design. Bioinformatics, 18(1):130-139, 2002.

[Zam76] Zamora, A. An algorithm for finding the smallest set of smallest rings. J. Chem. Inf. Comput. Sci., 16(1):40-43, 1976.


List of Figures

1.1 Overlay of Thrombin inhibitors . . . 2
1.2 QSAR process . . . 2

2.1 SVR . . . 6
2.2 GA individuals . . . 10
2.3 GA mutation operators . . . 11
2.4 Thrombin-Hirudin complex . . . 12

3.1 flowchart of the overall process . . . 14
3.2 examples for RDF . . . 15
3.3 overlay of RDF functions . . . 17
3.4 curve approximation . . . 18
3.5 Atom Pair Kernel . . . 20
3.6 12 structures with known active conformation . . . 22
3.7 example for implicit conformation sampling encoding . . . 25
3.8 example of rotatable bonds . . . 26

4.1 results of the initial runs . . . 30
4.2 results of reduced dataset with ppk and rbf . . . 32
4.3 results of reduced dataset with APK . . . 34
4.4 results for alternative parameters for the PPK . . . 36
4.5 results for alternative parameters for the APK . . . 38
4.6 results of increased mutation rate . . . 40
4.7 results of alternative conformation sampling . . . 42
4.8 results of alternative mutation operator . . . 44
4.9 results for the reruns . . . 46
4.10 results for initial runs with implicit conformation sampling . . . 48
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation . . . 50

5.1 mean over all runs . . . 52


List of Tables

3.1 table of compiled structures . . . 23
3.2 table of parameters for conformer generation . . . 24
3.3 table of parameters for the SVR . . . 27

4.1 results of the initial runs . . . 31
4.2 results of reduced dataset with ppk and rbf . . . 33
4.3 results of reduced dataset with APK . . . 35
4.4 results for alternative parameters for the PPK . . . 37
4.5 results for alternative parameters for the APK . . . 39
4.6 results of increased mutation rate . . . 41
4.7 results of alternative conformation sampling . . . 43
4.8 results of alternative mutation operator . . . 45
4.9 results for the reruns . . . 47
4.10 results for initial runs with implicit conformation sampling . . . 49
4.11 results for runs with implicit conformation sampling, reduced dataset and fixed conformation . . . 51
