Introducing molfeat: the Open Source Hub of Molecular Featurizers

We are excited to announce the release of molfeat, an open-source hub that makes molecular featurizers easy to access and use! Together with the datamol package, the launch of molfeat marks the inception of datamol.io, an open-source toolkit that aims to simplify molecular processing and featurization workflows for ML scientists working in drug discovery.

Users can get started with molfeat here and follow us on Twitter for regular updates.

Challenges with molecular featurization

Molecular featurization, the process of transforming a molecule into a vector, poses unique challenges in machine learning. Unlike in other domains, there is no default approach, and it remains unclear how best to represent the richness of molecular data in a unified format. The effectiveness of each approach varies greatly with the modeling task and its constraints. To achieve the best performance, it is therefore important to experiment with a variety of featurization techniques, from structural fingerprints to physico-chemical descriptors to neural-network-based embeddings and beyond.
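To make "transforming a molecule into a vector" concrete, here is a minimal sketch in plain Python. It is a toy and rests on loud assumptions: the function names `toy_fingerprint` and `tanimoto` are hypothetical, and the hashing works on the characters of the SMILES string rather than on the molecular graph, so it is not ECFP or any other real fingerprint. It only illustrates the general idea of mapping a variable-size molecule to a fixed-length vector that can be compared or fed to a model.

```python
import hashlib
from typing import List


def toy_fingerprint(smiles: str, n_bits: int = 64, max_ngram: int = 3) -> List[int]:
    """Hash every character n-gram of a SMILES string into a fixed-length bit vector.

    Toy stand-in only (an assumption for illustration): real structural
    fingerprints such as ECFP operate on the molecular graph, not on text.
    """
    bits = [0] * n_bits
    for n in range(1, max_ngram + 1):
        for start in range(len(smiles) - n + 1):
            fragment = smiles[start:start + n]
            # hashlib is stable across Python processes, unlike the
            # built-in hash() for strings, so vectors are reproducible.
            digest = hashlib.sha1(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto (Jaccard) similarity between two equal-length bit vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
benzene = toy_fingerprint("c1ccccc1")

# Every molecule ends up with the same vector length, whatever its size.
print(len(ethanol), len(propanol), len(benzene))
print(tanimoto(ethanol, propanol), tanimoto(ethanol, benzene))
```

Even in this toy, choices such as the vector length and the fragment size change which molecules look similar, which is a small-scale version of why the best featurizer depends on the task.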

Unfortunately, these featurizers are scattered across the scientific literature and various code repositories. This fragmentation makes experimenting with new featurizers time-consuming, as each approach may come with its own interface and codebase. The sheer number of available options, combined with the steady stream of new featurizers proposed by the field, further adds to the challenge.

Working with the community

Over the past few months, the datamol package has been actively maintained by a core group of open source developers. These developers originate from the Molecular Modeling and Drug Discovery (M2D2) community at Mila. This is an online community of researchers, engineers, ML scientists, students, and professors with a shared goal of accelerating innovation in the field of AI-enabled drug discovery. With over 3,000 members, M2D2 is becoming the de facto platform for the growing AI & drug discovery communities to come together, spark new perspectives, provoke discussions, and collaborate on interdisciplinary projects.

The next step in these efforts is to focus on creating open-source, community-driven tools that benefit the industry more broadly. As we progressed in developing datamol, we held countless discussions with active users of the package. We quickly discovered that the difficulty and time-consuming nature of molecular featurization was a common issue. This is what inspired us to create molfeat.

Introducing molfeat

molfeat is a hub that unifies a diverse range of molecular featurizers into a single package. It offers a growing list of the most popular and most recent state-of-the-art (SOTA) featurizers, including:

Pre-trained embeddings (e.g. ChemBERTa, ChemGPT, Graphormer, and pre-trained Graph Isomorphism Networks)
Structural fingerprints (e.g. ECFP and MACCS)
Physico-chemical descriptors (e.g. 2D RDKit descriptors and Mordred)

Why should you use molfeat?

With molfeat, you no longer have to spend countless hours searching for the right featurizer. We centralize hundreds of featurizers into a single hub. By offering descriptions of featurizers and references to relevant papers, molfeat makes it easy to find and learn about the best option for your use case.
Get familiar with molfeat once and never waste time digging through unorganized codebases ever again. ML research in drug discovery is progressing exceptionally fast, and an overwhelming number of featurizers are constantly being updated and released. We know from experience how frustrating it is to climb a new learning curve, with a different codebase and interface, every time you want to try a new featurizer. molfeat aims to abstract away the underlying complexities of each individual package.
Use molfeat to easily compare performance across different featurizers. With our extensive documentation and tutorials, it’s easy to see where and how molfeat can fit into your existing workflows.
molfeat is open source and supported by the M2D2 community, meaning it’s constantly evolving. Do you think we’re missing certain featurizers? It’s incredibly easy for the community to contribute something new. You can learn how through this tutorial.
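The idea of a hub with one shared interface can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not molfeat's internals or API: `REGISTRY`, `register`, `featurize`, and the two toy featurizers are hypothetical names invented for this example, and the "descriptors" are simple text statistics on SMILES strings.

```python
from typing import Callable, Dict, List

# Shared signature that every featurizer in this toy hub must follow.
Featurizer = Callable[[str], List[float]]

# Hypothetical registry mapping a featurizer name to its implementation.
REGISTRY: Dict[str, Featurizer] = {}


def register(name: str) -> Callable[[Featurizer], Featurizer]:
    """Decorator that adds a featurizer to the registry under `name`."""
    def decorator(func: Featurizer) -> Featurizer:
        REGISTRY[name] = func
        return func
    return decorator


@register("char_counts")
def char_counts(smiles: str) -> List[float]:
    """Toy descriptor: how often the characters C, N and O occur (case-insensitive)."""
    text = smiles.upper()
    return [float(text.count(ch)) for ch in "CNO"]


@register("length_stats")
def length_stats(smiles: str) -> List[float]:
    """Toy descriptor: string length and number of ring-closure digits."""
    return [float(len(smiles)), float(sum(ch.isdigit() for ch in smiles))]


def featurize(name: str, molecules: List[str]) -> List[List[float]]:
    """Look up a featurizer by name and apply it to every molecule."""
    return [REGISTRY[name](smiles) for smiles in molecules]


molecules = ["CCO", "c1ccccc1", "CC(=O)N"]

# Because every featurizer shares one signature, swapping them is a one-word change.
for name in sorted(REGISTRY):
    print(name, featurize(name, molecules))
```

With a shared signature, comparing featurizers reduces to looping over names, and contributing a new one means adding a single registered function.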

Getting started

molfeat is a Python library and can be easily installed through conda or pip.

conda install -c conda-forge molfeat

# or, with pip
pip install molfeat

After installation, our intuitive API makes it easy to access various molecular featurizers.

import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from molfeat.store.modelstore import ModelStore

# Load some dummy data
data = dm.data.freesolv().sample(100).smiles.values

# Featurize a single molecule
calc = FPCalculator("ecfp")
calc(data[0])

# Define a parallelized featurization pipeline
mol_transf = MoleculeTransformer(calc, n_jobs=-1)
mol_transf(data)

# Easily save and load featurizers
mol_transf.to_state_yaml_file("state_dict.yml")
mol_transf = MoleculeTransformer.from_state_yaml_file("state_dict.yml")
mol_transf(data)

# List all available featurizers
store = ModelStore()
store.available_models

# Find a featurizer and learn how to use it
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()

Learn more about molfeat

We are excited to introduce molfeat into the datamol.io toolkit and continue accelerating the adoption of SOTA molecular machine learning tools across the drug discovery industry.

You can get started by viewing our documentation and tutorials. We welcome your feedback on the GitHub repository, on the forum, or on Twitter. You can also join the M2D2 Slack community where users can share their work, ask questions, and collaborate on projects using molfeat.