Introducing ChemTSv2: Toward Democratizing Functional Molecule Generators

In this blog post, we introduce ChemTSv2, an open-source AI-based molecule generator, along with a short tutorial on how you can use it today. ChemTSv2 lets users focus on specifying the molecular functionality they want, and it integrates easily with external software packages such as AutoDock Vina, Gromacs, and Gaussian.

Users can get started with ChemTSv2 here.

In fields ranging from drug discovery to materials design, there is constant demand for discovering and optimizing novel molecules with desired properties, such as high potency against a target protein. However, even when only small organic molecules are considered, suitable drug-like molecules must be identified within a vast chemical space whose size easily exceeds 10^60. This makes the search exceptionally challenging.

Molecular design can be considered an inverse problem: the desired property (e.g., inhibitory activity against a protein) is given as the input, and the challenge is to output a molecular structure that possesses that property. In recent years, various deep-learning-based molecule generators have been developed to tackle this problem. However, adapting these generators to optimize the properties a third-party user cares about, which often lie outside the developers' original focus, requires a deep understanding of and expertise in the underlying AI methods. This has posed a significant barrier for non-experts who want to use AI for molecular design.

Introducing ChemTSv2

ChemTSv2 [1] is a refined and extended version of ChemTS [2], a molecule generator based on Monte Carlo tree search (MCTS) and a recurrent neural network (RNN) that was developed by our co-author Tsuda and colleagues in 2017. ChemTSv2 is designed to let users carry out functional molecular design for their desired properties with little effort. It has three main features:

An easy-to-run interface requiring only a configuration file
A flexible framework allowing users to define any reward function and molecular filter for use during the molecular design process
A massive parallelization mode accommodating computationally expensive rewards

In this blog post, we won’t cover how to use parallel mode. Readers interested in that feature are encouraged to refer to our paper [1] and GitHub repository.

Molecular design with ChemTSv2 involves four steps, as illustrated in Fig. 1:

Preparation of Reward File: Users define how ChemTSv2 should evaluate the desired molecular properties. Various software packages, such as RDKit, Gaussian, and AutoDock Vina, can be used for this evaluation. Detailed setup instructions are given later in this post.
Setup of Configuration File: Users set the run conditions, such as the number of molecules to generate, the exploration parameter of MCTS, and, if needed, the molecular filters used to keep only valid molecules. The reward file prepared in the previous step is also specified in this file. Detailed setup instructions are given later in this post.
Execution of ChemTSv2: Users run the chemtsv2 command with the configuration file prepared in the previous step. The output is written in CSV format.
Analysis: Users can analyze the designed molecules with any tool that supports the CSV format.

Figure 1. Whole workflow of ChemTSv2. Adapted with permission from [1] under the terms of a Creative Commons Attribution License 4.0 (CC BY-NC).
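
As an illustration of the analysis step, the sketch below ranks generated molecules by their reward using only Python's standard library. The column names smiles and reward and the sample rows are assumptions made for this example; the actual columns in a ChemTSv2 output file may differ, so check the header of your own CSV first.

```python
import csv
import io


def top_molecules(csv_text, score_column="reward", n=3):
    """Return the n rows with the highest value in score_column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:n]


# Hypothetical output: the column names and values are made up for this sketch.
example_csv = """smiles,reward
CCO,0.12
c1ccccc1,0.48
CCCCCCCC,0.71
"""

best = top_molecules(example_csv, n=2)
for row in best:
    print(row["smiles"], row["reward"])
```

For a real run, the same function can be applied to the text read from the CSV file in the output directory.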

Getting Started

ChemTSv2 is a Python package and can be installed using the pip command as follows:

pip install chemtsv2

To try an example molecule generation, please follow the steps below:

git clone git@github.com:molecule-generator-collection/ChemTSv2.git
cd ChemTSv2
chemtsv2 -c config/setting.yaml

In this example, the molecules will be designed to have a high LogP value.

In the following sections, we explain how to prepare the reward file and then the configuration file.

Preparing Reward File

Any user-defined reward class should inherit from the Reward base class provided by ChemTSv2. This base class specifies two static methods:

get_objective_functions(): This method accepts a configuration parameter in dictionary form. It returns a list of functions, each computing a float objective value from an RDKit Mol object.
calc_reward_from_objective_values(): Serving as an aggregator, this method takes a list of objective values, computed from the functions returned by get_objective_functions(), and the configuration dictionary. It processes these inputs to yield a single floating-point reward value.

The configuration dictionary holds the keys and values defined in the configuration file. Any parameter that a user wants to use in the reward calculation is therefore available in both methods, provided it is defined in the configuration file.
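
To make the division of labor between the two methods concrete, here is a minimal, dependency-free sketch. Everything in it is a stand-in: the Reward class only mimics the shape described above, the "molecule" is a plain string rather than an RDKit Mol object, and the evaluate helper, the LengthReward class, and the target_length parameter are hypothetical names that are not part of ChemTSv2.

```python
from abc import ABC, abstractmethod


class Reward(ABC):
    """Stand-in with the same two-method shape as the base class described above."""

    @staticmethod
    @abstractmethod
    def get_objective_functions(conf):
        ...

    @staticmethod
    @abstractmethod
    def calc_reward_from_objective_values(values, conf):
        ...


class LengthReward(Reward):
    """Toy reward: prefer 'molecules' whose string length is close to a target."""

    @staticmethod
    def get_objective_functions(conf):
        def Length(mol):
            # A real objective would take an RDKit Mol; a string is used here.
            return float(len(mol))
        return [Length]

    @staticmethod
    def calc_reward_from_objective_values(values, conf):
        # conf carries a user-defined parameter, as it would from the config file.
        diff = values[0] - conf["target_length"]
        return 1.0 / (1.0 + abs(diff))


def evaluate(reward_cls, mol, conf):
    """Hypothetical driver: compute every objective value, then aggregate them."""
    values = [f(mol) for f in reward_cls.get_objective_functions(conf)]
    return reward_cls.calc_reward_from_objective_values(values, conf)


conf = {"target_length": 5}
reward = evaluate(LengthReward, "CCCCO", conf)
print(reward)  # 1.0, because the string length equals the target
```

The point of the split is that each objective function can be computed and logged on its own, while the aggregation into a single reward stays in one place.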

Two examples are listed below: one uses only Python packages, and the other uses a non-Python software package, AutoDock Vina. If you want to use non-Python packages for the reward calculation, you can call them through Python's subprocess module. These examples have been simplified for clarity, so please refer to the GitHub repository for more details.

# This reward file aims to design molecules with high Jscore.
# ChemTSv2/reward/Jscore_reward.py

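import sys
from rdkit.Chem import Descriptors
# sascorer is the SA_Score script distributed with RDKit Contrib; it must be
# importable (on sys.path) for this file to load.
import sascorer
from reward.reward import Reward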

class Jscore_reward(Reward):
    def get_objective_functions(conf):
        def LogP(mol):
            return Descriptors.MolLogP(mol)
        def SAScore(mol):
            return sascorer.calculateScore(mol)
        def RingSizePenalty(mol):
            ri = mol.GetRingInfo()
            max_ring_size = max((len(r) for r in ri.AtomRings()), default=0)
            return max_ring_size - 6
        return [LogP, SAScore, RingSizePenalty]

    def calc_reward_from_objective_values(values, conf):
        logP, sascore, ring_size_penalty = values
        jscore = logP - sascore - ring_size_penalty
        return jscore / (1 + abs(jscore))
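
The last line of the reward above rescales the raw Jscore with x / (1 + |x|). As a small worked example (the squash helper name is ours, not part of ChemTSv2), the following shows that this transformation keeps the sign of the score while bounding it to the open interval (-1, 1):

```python
def squash(x):
    """Map any real number into the open interval (-1, 1), preserving its sign."""
    return x / (1 + abs(x))


for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"{raw:6.1f} -> {squash(raw):+.3f}")
# -10.0 -> -0.909
#  -1.0 -> -0.500
#   0.0 -> +0.000
#   1.0 -> +0.500
#  10.0 -> +0.909
```

Large raw scores are compressed toward the bounds, so a single extreme objective value cannot push the reward outside a fixed range.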

# This reward file aims to design molecules with high docking score.
# ChemTSv2/reward/Vina_binary_reward.py

import os
import subprocess
import shutil
import tempfile

from meeko import MoleculePreparation, PDBQTMolecule
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms
from rdkit.Geometry import Point3D

from reward.reward import Reward

class Vina_reward(Reward):
    def get_objective_functions(conf):
        def VinaScore(mol):
            verbosity = 1 if conf['debug'] else 0
            temp_dir = tempfile.mkdtemp()
            temp_ligand_fname = os.path.join(temp_dir, 'ligand_temp.pdbqt')
            pose_dir = os.path.join(conf['output_dir'], "3D_pose")
            os.makedirs(pose_dir, exist_ok=True)
            output_ligand_fname = os.path.join(pose_dir, f"mol_{conf['gid']}_out.pdbqt")

            mol = Chem.AddHs(mol)
            AllChem.EmbedMolecule(mol)
            mol_conf = mol.GetConformer(-1)
            centroid = list(rdMolTransforms.ComputeCentroid(mol_conf))
            tr = [conf['vina_center'][i] - centroid[i] for i in range(3)]
            for i, p in enumerate(mol_conf.GetPositions()):
                mol_conf.SetAtomPosition(i, Point3D(p[0]+tr[0], p[1]+tr[1], p[2]+tr[2]))
            mol_prep = MoleculePreparation()
            mol_prep.prepare(mol)
            mol_prep.write_pdbqt_file(temp_ligand_fname)

            cmd = [
                conf['vina_bin_path'],
                '--receptor', conf['vina_receptor'],
                '--ligand', temp_ligand_fname,
                '--center_x', str(conf['vina_center'][0]),
                '--center_y', str(conf['vina_center'][1]),
                '--center_z', str(conf['vina_center'][2]),
                '--size_x', str(conf['vina_box_size'][0]),
                '--size_y', str(conf['vina_box_size'][1]),
                '--size_z', str(conf['vina_box_size'][2]),
                '--cpu', str(conf['vina_cpus']),
                '--exhaustiveness', str(conf['vina_exhaustiveness']),
                '--max_evals', str(conf['vina_max_evals']),
                '--num_modes', str(conf['vina_num_modes']),
                '--min_rmsd', str(conf['vina_min_rmsd']),
                '--energy_range', str(conf['vina_energy_range']),
                '--out', output_ligand_fname,
                '--spacing', str(conf['vina_spacing']),
                '--verbosity', str(verbosity)]

            subprocess.run(cmd, check=True)
            pdbqt_mols = PDBQTMolecule.from_file(output_ligand_fname, skip_typing=True)
            min_affinity_score = pdbqt_mols[0].score
            return min_affinity_score
        return [VinaScore]

    def calc_reward_from_objective_values(values, conf):
        min_inter_score = values[0]
        if min_inter_score is None:
            return -1
        score_diff = min_inter_score - conf['vina_base_score']
        return - score_diff * 0.1 / (1 + abs(score_diff) * 0.1)
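
The simplified example above lets any failure of the external program raise an exception, whereas the aggregator checks for a None objective value. One possible way to connect the two, sketched here under our own assumptions, is a small wrapper that returns None whenever the external call fails. The run_external_score name is hypothetical, and the Python interpreter itself stands in for a real external tool such as a docking program.

```python
import subprocess
import sys


def run_external_score(cmd, timeout=60):
    """Run an external program and parse its stdout as a float.

    Returns None when the program exits with an error, times out, cannot be
    started, or prints something that is not a number.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
        return float(result.stdout.strip())
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            OSError, ValueError):
        return None


# Stand-ins for a real external tool: short scripts run by the Python interpreter.
ok_cmd = [sys.executable, "-c", "print(-7.5)"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_external_score(ok_cmd))   # -7.5
print(run_external_score(bad_cmd))  # None
```

With such a wrapper, an objective function can pass None on to the aggregator, which can then assign a fixed penalty instead of aborting the whole run.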

Preparing Configuration File

The configuration file is written in YAML format. The basic and filter settings are shown below; please refer to our paper [1] for a description of each parameter. The search parameter c_val is particularly important when using ChemTSv2. With a small value such as c_val = 0.1, once promising molecules are found during the search, the algorithm prioritizes exploring molecules similar to them. Conversely, with a larger value such as c_val = 1.0, the search puts more emphasis on broadly exploring unvisited molecules rather than only the neighborhood of the best molecules found so far. For the full list of available parameters, please refer to the actual configuration file. To enable or disable a pre-defined filter, such as the Lipinski filter or the SAScore filter, set the corresponding use_*_filter key to True or False. If you wish to define and use custom filters, please refer to the documentation.

### Basic setting
c_val: 1.0
threshold_type: generation_num
generation_num: 300
output_dir: result/example01
model_setting:
  model_json: model/model.tf25.json
  model_weight: model/model.tf25.best.ckpt.h5
token: model/tokens.pkl
reward_setting: 
  reward_module: reward.logP_reward
  reward_class: LogP_reward

### Filter setting
use_lipinski_filter: True
lipinski_filter:
  module: filter.lipinski_filter
  class: LipinskiFilter
  type: rule_of_5
use_sascore_filter: True
sascore_filter:
  module: filter.sascore_filter
  class: SascoreFilter
  threshold: 3.5
use_ring_size_filter: True
ring_size_filter:
  module: filter.ring_size_filter
  class: RingSizeFilter
  threshold: 6
use_pains_filter: False
pains_filter:
  module: filter.pains_filter
  class: PainsFilter
  type: [pains_a]
include_filter_result_in_reward: False
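
The role of c_val as an exploration weight can be illustrated with a textbook UCB1-style selection rule. This is a generic sketch of how such a constant trades off exploitation against exploration in MCTS; the exact scoring formula used inside ChemTSv2 may differ, and the two child nodes with their statistics are invented for the example.

```python
import math


def ucb_score(mean_reward, child_visits, parent_visits, c_val):
    """UCB1-style score: mean reward plus a c_val-weighted exploration bonus."""
    exploration = math.sqrt(2 * math.log(parent_visits) / child_visits)
    return mean_reward + c_val * exploration


# Two hypothetical child nodes under a parent that has been visited 100 times.
children = {
    "well_explored": {"mean_reward": 0.8, "visits": 60},
    "rarely_visited": {"mean_reward": 0.5, "visits": 5},
}


def pick(c_val, parent_visits=100):
    """Return the name of the child with the highest score for a given c_val."""
    return max(
        children,
        key=lambda name: ucb_score(
            children[name]["mean_reward"],
            children[name]["visits"],
            parent_visits,
            c_val,
        ),
    )


print(pick(0.1))  # well_explored: the known good branch is refined
print(pick(1.0))  # rarely_visited: the exploration bonus dominates
```

A small constant keeps the search near branches that already scored well, while a larger one sends it toward branches that have been tried only a few times.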

Users can add their own parameters that they want to use in the reward calculation to the configuration file. For instance, some of the configuration parameters used in the above reward calculation using AutoDock Vina are as follows:

# User setting
vina_bin_path: /home/app/vina_1.2.3_linux_x86_64
vina_sf_name: vina
vina_cpus: 8
vina_receptor: data/1iep_receptor.pdbqt
vina_center: [15.190, 53.903, 16.917]

Please refer to this configuration file for all user-defined parameters.
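
Because a reward reads these values with plain dictionary lookups, a parameter that is missing from the configuration file only surfaces as a KeyError once the first molecule is evaluated. A small pre-flight check, sketched below, can catch this earlier. The missing_keys helper is hypothetical and not part of ChemTSv2; the required list is a subset of the keys read by the simplified Vina reward shown earlier.

```python
def missing_keys(conf, required):
    """Return the required keys that are absent from the configuration dictionary."""
    return [key for key in required if key not in conf]


# A subset of the keys read by the simplified Vina reward shown earlier.
required = [
    "vina_bin_path", "vina_receptor", "vina_center",
    "vina_box_size", "vina_cpus", "vina_base_score",
]

# The user settings listed above, as a dictionary after loading the YAML file.
conf = {
    "vina_bin_path": "/home/app/vina_1.2.3_linux_x86_64",
    "vina_sf_name": "vina",
    "vina_cpus": 8,
    "vina_receptor": "data/1iep_receptor.pdbqt",
    "vina_center": [15.190, 53.903, 16.917],
}

missing = missing_keys(conf, required)
print(missing)  # ['vina_box_size', 'vina_base_score']
```

Here the check reports the two keys that the excerpt above leaves out, which is exactly what would need to be added before a docking-based run.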

Example Application for Drug Discovery

ChemTSv2 can be used in both ligand-based (LBDD) and structure-based (SBDD) drug design, as shown in Fig. 2. Yoshizawa et al. [3] used ChemTS and LightGBM to design selective inhibitors for kinase homologs, and Ma et al. [4] combined ChemTS with rDock to perform structure-aware molecule generation. Please refer to the respective papers for more detailed information.

Figure 2. Application of ChemTS in LBDD and SBDD. Adapted with permission from [1] under the terms of a Creative Commons Attribution License 4.0 (CC BY-NC).

Future Plan

With the current version of ChemTSv2, users still need to know how to evaluate the desired properties of the designed molecules. Looking ahead, we anticipate the development of an AI that can autonomously determine how to evaluate such properties. Such advances would pave the way for a molecular design AI that is truly accessible and easy for anyone to use.

About the Authors

Shoichi Ishida

PostDoc @ Yokohama City University.

Shoichi Ishida completed his Ph.D. in Pharmaceutical Sciences at Kyoto University, Japan, in 2021. He is currently a postdoctoral researcher at Yokohama City University, working on AI applications for drug discovery and medical science.

Kei Terayama

Associate Professor @ Yokohama City University

Kei Terayama received a Doctor of Human and Environmental Studies degree from Kyoto University, Kyoto, Japan, in 2016. From 2016 to 2018, he was a researcher at the Graduate School of Frontier Sciences, the University of Tokyo. In 2018, he moved to the RIKEN Center for Advanced Intelligence Project. Since 2020, he has been an associate professor at the Graduate School of Medical Life Science, Yokohama City University. His research interests include the development of machine learning and computer vision techniques for drug discovery, materials science, chemistry, and underwater monitoring.

Acknowledgements

This research [1] was conducted as part of "Development of a Next-generation Drug Discovery AI through Industry-academia Collaboration (DAIIA)", supported by the Japan Agency for Medical Research and Development (AMED) under Grant Number JP22nk0101111.

References

[1] S. Ishida et al., “ChemTSv2: Functional molecular design using de novo molecule generator,” WIREs Comput. Mol. Sci., p. e1680, doi: 10.1002/wcms.1680.

[2] X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, and K. Tsuda, “ChemTS: an efficient python library for de novo molecular generation,” Sci. Technol. Adv. Mater., vol. 18, no. 1, pp. 972–976, Jan. 2017, doi: 10.1080/14686996.2017.1401424.

[3] T. Yoshizawa, S. Ishida, T. Sato, M. Ohta, T. Honma, and K. Terayama, “Selective Inhibitor Design for Kinase Homologs Using Multiobjective Monte Carlo Tree Search,” J. Chem. Inf. Model., Jan. 2022, doi: 10.1021/acs.jcim.2c00787.

[4] B. Ma et al., “Structure-Based de Novo Molecular Generator Combined with Artificial Intelligence and Docking Simulations,” J. Chem. Inf. Model., vol. 61, no. 7, pp. 3304–3313, Jul. 2021, doi: 10.1021/acs.jcim.1c00679.