Notes for Chemical Catalysis
Random Forest Algorithm
This is a very versatile algorithm, as it can act as both a classifier and a regressor. It is used in various fields, such as stock market analysis and banking.
Working
This algorithm builds multiple decision trees that act as an ensemble. Each decision tree has access to a random subset of the features. Let the number of trees in the ensemble be N. When data is provided for classification, each individual tree produces a result, and the majority vote is given as the result of the entire classifier. (For regression, the outputs of the trees are averaged instead.)
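The voting step can be sketched in a few lines of Python. This is only an illustration: the `majority_vote` helper and the labels are made up for the example, and ties are resolved in favour of the label that was seen first.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical outputs of N = 5 trees for a single sample.
votes = ["active", "inactive", "active", "active", "inactive"]
print(majority_vote(votes))
```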
This is considered effective because, while any one tree might have trouble with a certain type of data, the majority of the trees are likely to classify it correctly. For this to work, however, the trees must have next to no correlation with each other.
This is achieved in the following ways:
-
Bagging:
Let the training data set be [1,2,3,4]. Each tree is trained on a bootstrap sample of this data, drawn at random with replacement, so that every tree in the classifier sees a different variation of the same data. For example, one tree might get [1,1,2,4] and another [1,2,2,3], and so on.
-
Random Features:
As discussed already, each tree has access to a random subset of features. This further decreases the chance of two trees being similar.
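Both ideas can be tried out with the standard library alone. The sketch below assumes sampling with replacement for bagging and one feature subset per tree, as described above. The helper names and the feature names are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return rng.choices(data, k=len(data))


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [1, 2, 3, 4]
features = ["mass", "charge", "ring_count", "logp"]  # hypothetical feature names

for tree_id in range(3):
    sample = bootstrap_sample(data, rng)
    subset = random_feature_subset(features, 2, rng)
    print(tree_id, sample, subset)
```

Because the sample is drawn with replacement, some values usually repeat and others are left out, which is what makes the trees differ from one another.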
Important Hyperparameters
n_estimators: Number of trees in the ensemble. Larger values generally improve learning but slow down training and classification.
max_features: Maximum number of features considered when splitting a node. Larger values improve each individual tree, but the correlation between trees increases as well.
max_depth: Maximum depth of a tree in the ensemble. Large values can cause overfitting.
min_samples_split: The minimum number of samples that must be present in a node before it can be split.
min_samples_leaf: The minimum number of samples that a leaf can have.
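These names match constructor arguments of scikit-learn's RandomForestClassifier. Assuming that library is the one in use (the notes do not say so), a minimal sketch of setting them looks like this. The data is synthetic and the chosen values are arbitrary, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real data set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # features considered at each split
    max_depth=5,           # cap on tree depth to limit overfitting
    min_samples_split=4,   # a node needs at least 4 samples to be split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    random_state=0,        # fixed seed so the run is repeatable
)
model.fit(X, y)
print(len(model.estimators_))
```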
SMILES Notation
SMILES stands for Simplified Molecular Input Line Entry System. This is somewhat akin to the IUPAC nomenclature, but is designed to be compact and use ASCII characters. Five basic syntax rules are to be followed, and they have been listed below.
-
Atom and Bond Nomenclature
Atoms are represented using their atomic symbols, and hydrogens are usually omitted from the string. That is, C refers to methane and CC refers to ethane.
Capital letters denote aliphatic atoms and small letters denote aromatic atoms. That is, C1CCCCC1 is cyclohexane, whereas c1ccccc1 is benzene (the digits mark ring closures, covered under Rings). Atoms outside the common organic subset (B, C, N, O, P, S, F, Cl, Br, I) must be written in []. For example, scandium is represented as [Sc] and not Sc, as the latter refers to sulphur attached to an aromatic carbon.
The symbols used for bonds are given below. Single bonds are usually not written explicitly.
Symbol    Meaning
=         Double bond
#         Triple bond
:         Aromatic bond
.         Disconnected structures
-
Chains
As explained earlier, hydrogens need not be written down for the structure to be generated. That is, CC#C refers to propyne. Hydrogens can also be written explicitly as [H] atoms; for example, [H]C([H])=C([H])[H] is ethene. Inside a bracketed atom, however, no hydrogens are implied, so every hydrogen on that atom must be stated, as in [CH4] for methane.
-
Branches
Branches in molecules are represented using parentheses. The bond by which the branch is attached to the “main” chain is given right after the opening parenthesis. For example, CC(=C)C and CC(C)=C both represent 2-methylprop-1-ene.
-
Rings
Rings are represented in SMILES by marking the “start” and “end” atoms of a ring with the same number. That is, cyclohexane is represented as C1CCCCC1, and benzene is c1ccccc1. If the start and end of a ring are connected via a double or triple bond, the bond symbol is written before the number on the “start” atom. That is, C=1CCCCC1 is cyclohexene.
-
Charge on Atoms
The charge on an atom is written inside square brackets, after the atomic symbol, as + or -, optionally followed by a digit (for example +2). That is, the enolate ion of propan-2-one is given by CC([O-])=C.
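To see how these rules determine the way a string is read, here is a rough tokenizer sketch in plain Python. It is not a full SMILES parser: it only covers the symbols discussed in these notes, and the `tokenize` helper is a name made up for this example.

```python
import re

# One alternative per token type; Cl and Br must be tried before single letters.
TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [Sc]
    r"|Br|Cl"          # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"     # one-letter aliphatic atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[=#:.\-]"       # bond symbols and the disconnection dot
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise ValueError on leftovers."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


print(tokenize("CC(=C)C"))
print(tokenize("C=1CCCCC1"))
print(tokenize("[Sc]"), tokenize("Sc"))
```

The last line shows why the brackets matter: [Sc] comes out as a single atom, while Sc is read as a sulphur followed by an aromatic carbon.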