
Genetic algorithms are strong baselines for molecule generation

You've heard the narrative pretty often: experts worked on hand-crafted methods for many decades with some success, and then in the last 10 years deep learning has blown them all away. In fields like computer vision and natural language processing I think this narrative is pretty accurate, and skimming the abstracts of ML conference papers creates the impression that every problem is this way. However, I think this is not true for a lot of small-data problems, for example generating molecules with desirable properties. In this post I will give a quick explanation of why genetic algorithms, a class of methods popular in the 1970s and 1980s, remain important algorithms for molecule generation today.

Problem setup

Let's assume you have a noiseless real-valued molecular property function f (e.g. binding affinity to some protein) and a small dataset of (m, f(m)) pairs for various molecules m. If you had to maximize f by hand with a limited budget to evaluate f, what would you do?
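To make the setup concrete, here is one way to frame it in code. This is only an illustrative sketch: the names, the numbers, and the use of SMILES strings to represent molecules are my own assumptions, not tied to any particular library.

```python
from typing import Callable

# A small labelled dataset of (molecule, f(molecule)) pairs.
# Molecules are written as SMILES strings purely for illustration; scores are made up.
dataset: dict[str, float] = {
    "CCO": 0.12,
    "c1ccccc1": 0.47,
    "CC(=O)O": 0.31,
}

class BudgetedObjective:
    """Wraps the expensive property function f and enforces a fixed evaluation budget."""

    def __init__(self, f: Callable[[str], float], budget: int):
        self.f = f
        self.remaining = budget
        self.cache: dict[str, float] = {}  # avoid paying twice for the same molecule

    def __call__(self, molecule: str) -> float:
        if molecule not in self.cache:
            if self.remaining <= 0:
                raise RuntimeError("Evaluation budget exhausted")
            self.remaining -= 1
            self.cache[molecule] = self.f(molecule)
        return self.cache[molecule]
```

The question is then: given the dataset and a budgeted oracle like this, which molecules do you choose to spend your remaining evaluations on?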

Most people’s intuitive strategy is to evaluate molecules which "look like" the better molecules in the dataset and do not "look like" the worse ones. At a high level this is what most “fancy” molecule generation algorithms like reinforcement learning and Bayesian optimization try to do (although the implementation details are very different). However, from this perspective a simpler idea is evident: what if one ignored the worse molecules and instead proposed random variants of the best molecules?

Genetic algorithms are strong baselines

Key idea: GAs create random variations of the best molecules

This idea is the essence of genetic algorithms (GAs), which create random mutations of the best molecules encountered so far. The “mutation” operation is typically hand-crafted, and often involves combining multiple molecules together (a nod to biological reproduction). Unlike domains like images, where it is nearly impossible to hand-craft a mutation which retains image quality, it is possible to hand-design reasonable operators for molecules because the validity rules for 2D molecular graphs are simple and well-characterized (the Jensen and Aspuru-Guzik groups have done nice work on this). Although anything hand-crafted opens the door to human biases, I thought it was worth testing GAs as a baseline algorithm.
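To be explicit about what this looks like, here is a minimal sketch of such a GA loop. This is my own illustrative code, not the actual mol_ga implementation; the `mutate` and `crossover` callables stand in for the hand-crafted operators discussed above.

```python
import random
from typing import Callable, Sequence

def run_ga(
    start_population: Sequence[str],        # e.g. SMILES strings from the dataset
    score: Callable[[str], float],          # the expensive objective f
    mutate: Callable[[str], str],           # hand-crafted mutation operator
    crossover: Callable[[str, str], str],   # hand-crafted recombination operator
    budget: int = 10_000,                   # total number of f evaluations
    population_size: int = 100,
    offspring_per_generation: int = 200,
    max_generations: int = 1_000,           # safeguard against looping forever
) -> dict[str, float]:
    """Minimal GA loop: keep the best molecules seen so far and propose random variants of them."""
    scores = {m: score(m) for m in start_population}
    evaluations = len(scores)

    for _ in range(max_generations):
        if evaluations >= budget:
            break

        # Keep only the top-scoring molecules as the current population.
        population = sorted(scores, key=scores.get, reverse=True)[:population_size]

        # Propose offspring by mutating / recombining random members of the population.
        offspring = set()
        for _ in range(offspring_per_generation):
            if random.random() < 0.5:
                offspring.add(mutate(random.choice(population)))
            else:
                offspring.add(crossover(*random.sample(population, 2)))

        # Evaluate only molecules we have not scored before, respecting the budget.
        for child in offspring:
            if child not in scores and evaluations < budget:
                scores[child] = score(child)
                evaluations += 1

    return scores  # best molecule: max(scores, key=scores.get)
```

A real implementation has more moving parts (operator choices, caching, parallel scoring), but the core idea behind GAs is essentially this loop.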

To do this, I wrote a small Python package called mol_ga (check it out here) and did some experiments (currently sitting on arXiv here). The results were somewhat surprising: on the Practical Molecular Optimization (PMO) benchmark, mol_ga actually beat all the baseline methods, including many fancy algorithms based on deep learning. This result mirrors the findings of the GuacaMol paper published almost 5 years ago. If this result seems counterintuitive to you, consider the following:

  • This benchmark is data-limited (10k evaluations), so learning-based approaches could be expected to suffer.

  • It is plausible that other algorithms also end up proposing variants of known molecules, just by different means. For example, learning a predictive model and optimizing with respect to that model might just yield molecules similar to the best in the training set.

  • The objectives in this benchmark are fairly smooth with respect to molecular structure, and therefore avoid a well-understood “mode-hopping” failure mode of GAs.

Conclusions and takeaways

What should we take away from this? I don’t think we should just conclude that GAs are the best algorithm and give up on deep learning. Rather, I think that people interested in molecular optimization should take a bit of time to reflect on the following points:

  1. The “bold table” convention in conference papers creates a strong incentive to test only a few baseline algorithms (or to tune them poorly). This is not unique to molecule generation. Although it may help get papers published, poor baselines create an illusion of progress and ultimately detract from the ML field’s credibility among chemists. I don’t have a solution to this, but I think researchers need to be aware of it.

  2. As mentioned above, maybe fancier algorithms actually “act” like GAs? (I am not aware of any works which explicitly address this question, and I think someone should look into this)

  3. Some of the most important aspects of molecule generation are not measured in existing benchmarks. For example, “synthesizability” is extremely important yet hard to define quantitatively, so most benchmarks just omit it (and measures like SAScore have well-known flaws and biases). This means the “true” utility of all these methods is very unclear.

Unfortunately none of these points is very actionable, so I will end with one actionable suggestion for researchers working on molecular generation, which I call the GA criterion:

Does your method have any advantage over GAs? If not, maybe reconsider the project.

This advantage could be empirical, but given the flawed nature of benchmarks in this space I think it can also be conceptual (e.g. in which failure modes of GAs does your algorithm do well?). This is not meant to give any kind of primacy or special treatment to GAs, but given their conceptual simplicity and strong performance on existing benchmarks I think they are a very useful reference point when thinking about algorithm development.

Hopefully I’ve convinced you that GAs are still relevant and effective, at least for molecule generation. If you find this interesting, check out my preprint, the mol_ga repo, and feel free to reach out to me by email or Twitter.
