I recently attended a UK QSAR event where Keishi Kohara from AstraZeneca gave a great talk on REINVENT4, using the Mol2Mol transformer. I was vaguely aware of the option but had never used it. From Keishi’s talk it looked a lot simpler than the other “whack-a-molecule” options, which involve iteratively editing input TOML files. Here is the relevant description in the paper.
Molecular optimization [22, 52]. A molecule is supplied to the generator as restraint. The generator will find a second molecule within a defined similarity. Depending on the similarity radius the molecule will be relatively similar to the supplied molecule but, importantly, the scaffold can change within the limits of the given similarity. (Mol2Mol)
Basically, you just pick a pre-trained prior, supply a list of project molecules along with a few settings, and sample the model.
Install
Install is really easy: switch the Colab runtime to a GPU, check which version of CUDA is installed (12.5 in this case), and set up conda for Colab.
!nvcc --version
!pip install -q condacolab
import condacolab
condacolab.install_miniforge()
!git clone https://github.com/MolecularAI/REINVENT4.git
%cd REINVENT4
!python install.py cu125
That’s it; I wish installing other Python packages was so easy! The next step is writing a TOML file. We’ll create analogues around osimertinib, whose SMILES is saved in “osimertinib.smi”.
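The input file is just plain text with one SMILES per line. As a minimal sketch, it could be created from a notebook cell like this. The helper name `write_smiles_file` is hypothetical, and the osimertinib SMILES is quoted from memory, so treat it as an assumption and check it against a trusted database before relying on it.

```python
from pathlib import Path

# Assumption: SMILES for osimertinib; verify against a trusted source before use.
OSIMERTINIB_SMILES = (
    "CN1C=C(C2=CC=CC=C21)C3=NC(=NC=C3)NC4=C(C=C(C(=C4)NC(=O)C=C)N(C)CCN(C)C)OC"
)


def write_smiles_file(path, smiles_list):
    """Write one SMILES per line, skipping blank entries; return the count written."""
    cleaned = [s.strip() for s in smiles_list if s.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n")
    return len(cleaned)


n_written = write_smiles_file("osimertinib.smi", [OSIMERTINIB_SMILES])
print(f"wrote {n_written} SMILES to osimertinib.smi")
```

To explore analogues of several project molecules at once, pass a longer list; the file name only has to match `smiles_file` in the TOML.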
# REINVENT4 TOML input example for sampling
#
run_type = "sampling"
device = "cuda:0" # set torch device e.g. "cpu"
json_out_config = "_sampling.json" # write this TOML to JSON
[parameters]
## Mol2Mol: find molecules similar to the provided molecules
model_file = "priors/mol2mol_medium_similarity.prior"
smiles_file = "osimertinib.smi" # 1 compound per line
sample_strategy = "beamsearch" # multinomial or beamsearch (deterministic)
temperature = 1.0 # temperature in multinomial sampling
#tb_logdir = "tb_logs" # name of the TensorBoard logging directory
output_file = 'sampling.csv' # sampled SMILES and NLL in CSV format
num_smiles = 100000 # number of SMILES to be sampled, 1 per input SMILES
unique_molecules = true # if true, canonicalize SMILES and remove all duplicates
randomize_smiles = true # if true shuffle atoms in SMILES randomly
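Before launching a long run it can be worth checking the configuration for obvious slips such as a missing prior or input file. The following is only a sketch of such a check, not part of REINVENT4: the function name `check_sampling_config` is hypothetical, and the set of keys it looks for is an assumption based solely on the config shown above.

```python
from pathlib import Path

# The two strategies named in the config comment above.
ALLOWED_STRATEGIES = {"multinomial", "beamsearch"}


def check_sampling_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    if config.get("run_type") != "sampling":
        problems.append("run_type should be 'sampling'")
    params = config.get("parameters", {})
    # Assumption: these are the keys this particular walkthrough relies on.
    for key in ("model_file", "smiles_file", "output_file", "num_smiles"):
        if key not in params:
            problems.append(f"missing parameters.{key}")
    strategy = params.get("sample_strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        problems.append(f"unknown sample_strategy: {strategy!r}")
    # Both input files must exist on disk before sampling can start.
    for key in ("model_file", "smiles_file"):
        if key in params and not Path(params[key]).is_file():
            problems.append(f"{key} not found on disk: {params[key]}")
    return problems


example = {
    "run_type": "sampling",
    "parameters": {"sample_strategy": "beamsearch", "num_smiles": 100},
}
for problem in check_sampling_config(example):
    print(problem)
```

On Python 3.11 or later, the dict can come straight from the saved file via the standard-library `tomllib`, for example `check_sampling_config(tomllib.load(open("sampling.toml", "rb")))`.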
With the configuration saved as sampling.toml, running the code is very simple:
!reinvent -l sampling.log sampling.toml
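Once the run finishes, the sampled molecules are in the CSV named by `output_file`. A minimal sketch for loading them and ranking by negative log-likelihood (lower NLL means the model considers the molecule more likely) is shown here. The helper `load_samples` is hypothetical, and the column name `NLL` is an assumption, so check the header of your own file.

```python
import csv


def load_samples(path, nll_column="NLL"):
    """Read the sampling CSV and return its rows sorted by ascending NLL.

    The column name is an assumption; inspect the header of your own output.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if rows and nll_column not in rows[0]:
        raise KeyError(f"column {nll_column!r} not in {list(rows[0])}")
    # CSV values are strings, so convert to float for a numeric sort.
    return sorted(rows, key=lambda row: float(row[nll_column]))
```

For example, `load_samples("sampling.csv")[:10]` would give the ten most likely samples as dictionaries keyed by column name.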











