Gryffin: Bayesian Optimization of Categorical Variables

We formulate a chemical reaction as an optimization problem. The associated parameters can be either continuous or categorical. Methods such as PHOENICS exist for optimizing continuous variables, but it is considerably harder to apply the same approach to categorical variables.

This is where GRYFFIN steps in: it tries to extend the methods used for continuous variables to categorical variables.

Thus, to understand GRYFFIN we would have to first understand PHOENICS better.

PHOENICS - Continuous Variable Optimization

Let the true function be represented by $f$, and our estimate of the function so far (the surrogate function) be $\hat{f}$. A three-layer Bayesian neural network is used to build $\hat{f}$.

The surrogate is built from sampled values of $f$, and the acquisition function $\alpha (z)$ tells us where to sample next.

This is an iterative procedure with the following steps:

$\alpha (z)$ proposes the optimal value at which to sample, $x'$
Sample at $x'$, and rebuild $\hat{f}$ using the Bayesian network
We stop after a set number of iterations, or when convergence is reached
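
The loop above can be sketched in a few lines. This is only a toy illustration and not the PHOENICS implementation: the objective `true_f`, the kernel-weighted `surrogate` (used here in place of the Bayesian neural network), the `acquisition` rule and every parameter value are assumptions made for the example.

```python
import numpy as np

def true_f(x):
    # Hypothetical objective standing in for an expensive experiment.
    return (x - 0.3) ** 2

def surrogate(x_query, x_obs, y_obs, bandwidth=0.2):
    # Kernel-weighted average of the observations: a crude stand-in for f-hat
    # (assumption: the post describes a Bayesian neural network instead).
    diff = (x_query[:, None] - x_obs[None, :]) / bandwidth
    w = np.exp(-0.5 * diff ** 2)
    density = w.sum(axis=1)           # large where we have already sampled
    mean = (w @ y_obs) / density
    return mean, density

def acquisition(x_query, x_obs, y_obs, explore=0.1):
    # Lower is better: predicted value plus a penalty for well-explored regions.
    mean, density = surrogate(x_query, x_obs, y_obs)
    return mean + explore * density

def optimize(n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)      # candidate sampling locations
    x_obs = rng.uniform(0.0, 1.0, size=3)  # initial random samples
    y_obs = true_f(x_obs)
    for _ in range(n_iter):
        # Step 1: the acquisition function proposes where to sample next.
        x_next = grid[np.argmin(acquisition(grid, x_obs, y_obs))]
        # Step 2: sample there; the surrogate is rebuilt from the enlarged data set.
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, true_f(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best]

best_x, best_y = optimize()
print(best_x, best_y)
```

Here the stopping rule is simply a fixed iteration budget; a convergence check on `best_y` could be added in the same loop.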

Extending to Categorical

For example, suppose there are three types of ligands: A, B and C. We can represent each of these ligands as a one-hot encoded vector, such as $(1,0,0)$ for ligand A.

We now assume the following:

The ligands A, B and C are sufficiently independent and descriptive of the “Ligand Space”.

We then try to represent the rest of the ligands as vectors in the same space. That is, another ligand D might be represented by the vector $(0.2, 0.3, 0.5)$, using the Gumbel-Softmax distribution.

This is called soft one-hot encoding.
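
As a rough illustration of how a soft one-hot vector can be drawn, here is a minimal NumPy sketch of Gumbel-Softmax sampling. The function name `gumbel_softmax_sample`, the logits and the temperatures are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    # Gumbel(0, 1) noise via the inverse transform of uniform samples.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, then apply a softmax.
    z = (logits + gumbel) / temperature
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
# Hypothetical preferences over the three ligands A, B and C.
logits = np.log(np.array([0.2, 0.3, 0.5]))

soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)
print(soft)   # entries are non-negative and sum to 1
print(sharp)  # lower temperatures push samples towards a one-hot vector
```

Each sample lies on the probability simplex, so it can be read as a relaxed version of the hard one-hot vectors used for A, B and C.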

This method is not foolproof, though: our original assumption may not be valid. If A and B are similar in nature, there is redundancy in the Ligand Space!

The issue of similarity is addressed by introducing a descriptor space.

Descriptor Space

Incorporate domain knowledge to measure similarity! For instance, we can place the ligands on a 2D plot whose axes correspond to the HOMO-LUMO densities, and use the “distance” of a ligand D from each of the ligands A, B and C to get its soft one-hot vector.
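
One simple way to turn such distances into a soft one-hot vector is a softmax over negative distances. This particular mapping is an assumption made for illustration and may differ from what the paper does; the descriptor coordinates below are made up, and `soft_one_hot_from_descriptors` is a hypothetical helper.

```python
import numpy as np

def soft_one_hot_from_descriptors(query, references, scale=1.0):
    # Euclidean distance from the query ligand to each reference ligand.
    distances = np.linalg.norm(references - query, axis=1)
    # Closer references get larger weights (softmax over negative distances).
    logits = -distances / scale
    logits = logits - logits.max()    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Made-up 2D descriptor coordinates for the reference ligands.
references = np.array([
    [0.0, 0.0],   # ligand A
    [1.0, 0.0],   # ligand B
    [0.0, 1.0],   # ligand C
])
ligand_d = np.array([0.2, 0.1])       # hypothetical ligand D, closest to A

weights = soft_one_hot_from_descriptors(ligand_d, references, scale=0.5)
print(weights)
```

The `scale` parameter controls how quickly the weight falls off with distance: small values give vectors that are nearly one-hot, large values give nearly uniform vectors.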

We would like the set of descriptors to have:

High correlation with the objective function
A small number of descriptors
Low pairwise correlation with each other
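
The first and third criteria can be checked numerically with a correlation matrix. The sketch below uses synthetic data in which the second descriptor is deliberately a near-copy of the first; the helper `descriptor_report` and all of the data are assumptions for the example.

```python
import numpy as np

def descriptor_report(descriptors, objective):
    # descriptors: (n_samples, n_descriptors); objective: (n_samples,)
    n = descriptors.shape[1]
    stacked = np.column_stack([descriptors, objective])
    corr = np.corrcoef(stacked, rowvar=False)   # columns are the variables
    with_objective = corr[:n, n]                # descriptor vs. objective
    pairwise = corr[:n, :n]                     # descriptor vs. descriptor
    return with_objective, pairwise

rng = np.random.default_rng(0)
d1 = rng.normal(size=500)
d2 = d1 + 0.05 * rng.normal(size=500)           # redundant: almost equal to d1
d3 = rng.normal(size=500)                       # unrelated to the objective
objective = 2.0 * d1 + 0.1 * rng.normal(size=500)

with_objective, pairwise = descriptor_report(np.column_stack([d1, d2, d3]), objective)
print(with_objective)   # d1 and d2 correlate strongly with the objective, d3 does not
print(pairwise[0, 1])   # close to 1, so one of d1 or d2 is redundant
```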

It is quite difficult to manually select a set of descriptors that satisfies all three of the criteria above.

Instead, the paper passes the descriptor vector through a neural network with a single hidden layer and softsign activation, which outputs a smaller descriptor vector.
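
A forward pass of such a network might look like the sketch below. The layer sizes, the random (untrained) weights and the choice to apply softsign on the hidden layer only are all assumptions; in practice the weights would be learned.

```python
import numpy as np

def softsign(x):
    # Softsign activation: x / (1 + |x|), bounded in (-1, 1).
    return x / (1.0 + np.abs(x))

def transform_descriptors(descriptors, w_hidden, b_hidden, w_out, b_out):
    # Single hidden layer with softsign, followed by a linear output layer.
    hidden = softsign(descriptors @ w_hidden + b_hidden)
    return hidden @ w_out + b_out

rng = np.random.default_rng(0)
n_ligands, n_in, n_hidden, n_out = 3, 5, 4, 2   # hypothetical sizes

descriptors = rng.normal(size=(n_ligands, n_in))
w_hidden = rng.normal(size=(n_in, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=(n_hidden, n_out))
b_out = np.zeros(n_out)

reduced = transform_descriptors(descriptors, w_hidden, b_hidden, w_out, b_out)
print(reduced.shape)   # 5 input descriptors per ligand reduced to 2
```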