A More Natural Interpretability of Molecular Property Predictions Through Contextual Explanations of Molecular Graphical Depictions

Jonathan Hsu · Product @ Valence Labs

Introduction

In this blog post, we tackle the question: can we do better than atomic attributions when designing explainability methods for deep learning models for molecular property prediction? For more details, you can read the preprint of the article here.


Atomic attribution consists of assigning a numerical value to each atom in a molecule, with the magnitude reflecting how important the atom is for the prediction. The sign of the attribution reflects whether the atom has a positive or negative impact on the model output: a positive contribution means that the presence of the atom increases the model's prediction, while a negative contribution has the opposite effect.

Various methods exist to extract these attributions. Most of them rely on the concept of the gradient as a measure of "importance". Let us consider a function Φ : X → R describing a neural network, mapping the data space X to the real line, which we can imagine describes a regression task. Then, for each data sample x ∈ X we can compute the gradient

∇Φ(x) = (∂Φ(x)/∂x₁, …, ∂Φ(x)/∂xₙ),

which is a vector representing how much Φ changes along the various components of x (we call these components features). If the gradient is close to zero along some of these components, it means that Φ(x) is rather flat along these directions, and changing the value of these components in the input does not have a significant impact on the output. Thus, we conclude that these features are not very relevant for the prediction. On the other hand, if the gradient assumes high values along a certain direction, the function Φ(x) varies strongly as we move away from the original input along that component. Thus, the prediction is highly sensitive to the given component.
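As a minimal sketch of this idea (not the paper's implementation), the gradient with respect to the input can be obtained with a single backward pass; the toy model and feature dimension below are placeholders.

```python
import torch

# Toy stand-in for a property-prediction model Phi: X -> R
# (placeholder; any differentiable regressor would do).
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

x = torch.randn(1, 16, requires_grad=True)  # one data sample with 16 features
y = model(x)                                # scalar prediction Phi(x)
y.backward()                                # populates x.grad with dPhi/dx

attribution = x.grad                        # sign: direction of the effect,
                                            # magnitude: sensitivity of Phi to each feature
print(attribution.shape)                    # torch.Size([1, 16])
```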

What are these “components” in the context of deep learning models for small molecules? Or, in other words, what is the data space X?

This depends on the representation and the model type chosen. For Graph Neural Networks, X consists of graphs, that is, collections of nodes and edges representing the atoms and bonds, respectively. Another popular representation, which will be key in our discussion, is given by SMILES. These are strings that encode the structure of the molecule: in essence, they consist of a sequence of characters representing the molecule's atoms, plus a few special characters that encode the molecule's topology. These two representations are inherently local in nature. The main components of a data sample are atoms, represented as nodes in the graph or characters in the string. Explanations based on these representations therefore lead to atomic attributions, as mentioned above. Another possible way to represent a molecule is via its graphical depiction. In this case, X consists of the pixels of a fixed-size image.
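To make the different choices of X concrete, here is a small sketch using RDKit to go from a SMILES string to both a graph view (atoms and bonds) and a pixel view (a fixed-size depiction); the example molecule and image size are arbitrary, and this is not necessarily the depiction pipeline used for Img2Mol.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CCO"                      # example molecule (ethanol); placeholder input
mol = Chem.MolFromSmiles(smiles)    # SMILES string -> molecular graph (atoms + bonds)

# Graph view: X is a collection of nodes (atoms) and edges (bonds).
atoms = [a.GetSymbol() for a in mol.GetAtoms()]
bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# Pixel view: X is a fixed-size image of the molecule's 2D depiction.
img = Draw.MolToImage(mol, size=(224, 224))   # PIL image
pixels = np.asarray(img)                      # e.g. a (224, 224, 3) array

print(atoms, bonds, pixels.shape)
```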

CDDD and Img2Mol

Here we introduce the main ingredients on which our method relies.

The first is the CDDD space. This is constructed as the 512-dimensional bottleneck layer of a translation autoencoder network (sequence to sequence) trained to translate SMILES representations of molecules to their canonical SMILES. The bottleneck layer of the autoencoder defines a continuous molecular descriptor, which has been shown to be superior to other descriptors (e.g., fingerprints) when utilized as an input for training downstream tasks.
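As a rough illustration of the translation-autoencoder idea behind CDDD (only a sketch: the layer sizes, tokenization, and architecture below are illustrative stand-ins, not the original model), the descriptor is simply the bottleneck vector of a sequence-to-sequence network:

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Toy translation autoencoder: arbitrary SMILES in, canonical SMILES out.
    The 512-d bottleneck plays the role of the CDDD descriptor (sketch only)."""

    def __init__(self, vocab_size=64, hidden=256, bottleneck=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_bottleneck = nn.Linear(hidden, bottleneck)   # continuous molecular descriptor
        self.from_bottleneck = nn.Linear(bottleneck, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def encode(self, tokens):                  # tokens: (batch, seq_len) integer ids
        h, _ = self.encoder(self.embed(tokens))
        return self.to_bottleneck(h[:, -1])    # (batch, 512) CDDD-like embedding

    def forward(self, tokens):
        z = self.encode(tokens)
        h0 = self.from_bottleneck(z).unsqueeze(0)            # initial decoder state
        out, _ = self.decoder(self.embed(tokens), h0)
        return self.to_vocab(out)              # logits over canonical-SMILES tokens

model = Seq2SeqAutoencoder()
z = model.encode(torch.randint(0, 64, (2, 30)))  # two tokenized SMILES of length 30
print(z.shape)  # torch.Size([2, 512])
```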

The second key ingredient is the Img2Mol network. This recently proposed model tackles the molecular optical recognition problem: it aims at automatically recovering a molecule's SMILES from a graphical depiction of it. The key point for us is that Img2Mol is not trained end-to-end; rather, it solves its task by mapping molecular graphical depictions to their CDDD embeddings.

Setup

Our idea is to leverage the fact that we have two encoders at our disposal, mapping different representations (SMILES and images) to the same embedding space. Given a downstream model trained on CDDD descriptors (to be clear, when we refer to CDDD descriptors we always mean those derived from SMILES), we can replace the CDDD encoder with the Img2Mol encoder and obtain a perfectly sensible model! We can then use this model, rather than the original RNN-based one, to extract explanations.
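A minimal sketch of this encoder swap is shown below; the encoders and the downstream head are placeholder stubs, and the only property that matters is that both encoders output vectors in the same 512-dimensional space.

```python
import torch
import torch.nn as nn

# Placeholder encoders: both map their modality into the same 512-d embedding space.
def cddd_encoder(smiles_tokens):                 # stands in for the RNN-based CDDD encoder
    return torch.randn(smiles_tokens.shape[0], 512)

img2mol_encoder = nn.Sequential(                 # stands in for the Img2Mol CNN encoder
    nn.Flatten(), nn.Linear(3 * 224 * 224, 512)
)

# Downstream property model trained on CDDD descriptors (never retrained).
head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

def predict_from_smiles(smiles_tokens):
    return head(cddd_encoder(smiles_tokens))     # original model: CDDD encoder + head

def predict_from_image(image):
    return head(img2mol_encoder(image))          # swapped encoder: same head, used only to
                                                 # extract explanations in pixel space

print(predict_from_image(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1])
```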

There are several reasons for adopting this seemingly counterintuitive strategy:

  • We already mentioned above that SMILES-based models lead, by construction, to local explainability in the form of atomic contributions. Chemists, instead, tend to think in terms of larger, chemically meaningful substructures (rings, functional groups, etc.), whose effect often does not simply reduce to the average of their atomic constituents. In pixel space we avoid this issue: our features are not restricted to atoms and bonds, and we have access to the full substructure of the molecule.

  • CDDD-based models (based on SMILES) tend to fail to produce explanations that are invariant with respect to the molecule's symmetries. This is, to some degree, expected, as SMILES string representations explicitly break such symmetries. The graphical depiction of a molecule, on the other hand, respects its symmetries to a higher degree. It is therefore expected that this higher fidelity will carry over to the explanations as well.

  • Finally, explainability techniques for image analysis and CNNs are substantially more advanced than those for other network types, like RNNs. This is probably because images constitute a natural data type for which ground truth is easy for humans to assess, making the validation of explainability methods easier. Using Img2Mol as our auxiliary explanation network, we can leverage all the techniques that the ML community has developed for CNNs.

Contextual Explanations

Our strategy is named contextual explanations, and it relies on two facts:

  1. deep layers in neural networks learn high-level concepts while shallow layers are activated by simpler concepts, and

  2. for pure convolutional neural networks, the value of each “superpixel” is determined by its receptive field in input space.

This is illustrated explicitly in Figure 1b, where we depict some activation outputs of the various convolutional layers in the Img2Mol encoder. While filters in early layers are activated by "simple", local concepts, like nodes, angles, and edges, filters in deeper layers detect larger substructures in the molecule's image, e.g., rings and functional groups. We wish to adopt these layer activations as a chemically meaningful dictionary for our explanations.
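One standard way to record such per-layer activations in PyTorch is via forward hooks; the small convolutional stack below is a stand-in for the Img2Mol encoder, not its actual architecture.

```python
import torch
import torch.nn as nn

# Stand-in convolutional encoder (the real Img2Mol encoder is deeper).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # shallow: nodes, angles, edges
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # deeper: larger substructures
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output           # feature maps of this layer
    return hook

for idx, layer in enumerate(encoder):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(save_activation(f"relu_{idx}"))

image = torch.randn(1, 3, 224, 224)          # dummy pixel depiction of a molecule
_ = encoder(image)
for name, act in activations.items():
    print(name, act.shape)                   # each "superpixel" covers a receptive field in the image
```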

To achieve this, given a layer activation, we can compute the corresponding "importance" for the prediction by computing the gradient of the neural network function with respect to the activation itself. That is, we consider the layer activation as an input for the remaining downstream network, and the corresponding gradient quantifies the relevance of these layer features for the model's prediction. Thus, we obtain a layer-wise explanation of the prediction by multiplying the layer activations (which depict chemically relevant features learned by the network) with their gradients (which weigh how relevant such features are for the prediction).
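A sketch of this gradient-times-activation step for a single layer is given below; the convolutional block and regression head are toy stand-ins for the Img2Mol encoder and the CDDD-based downstream model.

```python
import torch
import torch.nn as nn

# Toy pixel-space model: one conv block followed by a regression head.
conv = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

image = torch.randn(1, 3, 224, 224)

activation = conv(image)                    # layer activation A (learned visual features)
activation.retain_grad()                    # keep dPhi/dA after the backward pass
prediction = head(activation)               # treat A as the input of the remaining network
prediction.backward()

layer_explanation = activation * activation.grad   # gradient-weighted activation
print(layer_explanation.shape)              # (1, 16, 112, 112): one map per filter
```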

To obtain the final contextual explanation, we aggregate the layer explanations into a single attribution map. Figure 1 depicts the various steps needed to compute contextual explanations.
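The aggregation below (summing channels per layer, upsampling each layer map to the input resolution, and summing across layers) is one simple choice for illustration, not necessarily the exact scheme used in the paper.

```python
import torch
import torch.nn.functional as F

def aggregate(layer_explanations, size=(224, 224)):
    """Combine per-layer explanation maps into one pixel-space attribution map.
    layer_explanations: list of tensors of shape (1, C_l, H_l, W_l)."""
    total = torch.zeros(1, 1, *size)
    for expl in layer_explanations:
        per_layer = expl.sum(dim=1, keepdim=True)                        # collapse channels
        total += F.interpolate(per_layer, size=size, mode="bilinear",
                               align_corners=False)                      # back to input resolution
    return total[0, 0]                                                   # (H, W) attribution map

# Dummy per-layer maps at decreasing spatial resolutions.
maps = [torch.randn(1, 16, 112, 112), torch.randn(1, 32, 56, 56), torch.randn(1, 64, 28, 28)]
print(aggregate(maps).shape)   # torch.Size([224, 224])
```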

Properties

As the example in Figure 2 illustrates, contextual explanation attributions consist of both atomic and structural features, originating from different layer explanations. In this specific case, our method assigns positive contributions (green overlay) to a Cl atom and a methyl group, and negative contributions (pink overlay) to an N atom and the triazine ring.

We can now qualitatively observe and quantitatively measure whether our explanations are more robust with respect to the symmetries of the molecule. We start with molecules that exhibit symmetry with respect to a reflection across the vertical axis, x ↔ −x. In Figure 3 we visually observe that our contextual explanations capture the symmetry better than the SMILES-based ones. We can also quantify this through a symmetry score. Let T be a transformation of the molecule's graphical representation x, and let a be the corresponding attribution heat map. We define the score

s_T = ⟨ |â(T(x)) − T(â(x))| ⟩,

where â is obtained from a upon normalization to the range [−1, 1], and ⟨·⟩ denotes the average in pixel space. The lower the score, the more the transformation commutes with the attribution map. In the plot in Figure 3, we see that the score assumes a much lower value on average for our contextual explanations than for the SMILES-based explanations, confirming our intuition that explanations in pixel space better capture the symmetries of the molecule.
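A sketch of how such a score can be computed is given below; the attribution maps are dummy arrays, and the transformation is the vertical-axis reflection discussed in the text.

```python
import numpy as np

def normalize(a):
    """Rescale an attribution map to the range [-1, 1]."""
    return a / np.max(np.abs(a))

def symmetry_score(attr_original, attr_transformed, transform):
    """Mean absolute difference, over pixels, between the attribution of the transformed
    image and the transformed attribution of the original image.
    Lower is better: the transformation (nearly) commutes with the attribution map."""
    a_hat = normalize(attr_original)
    b_hat = normalize(attr_transformed)
    return np.mean(np.abs(b_hat - transform(a_hat)))

reflect = lambda a: a[:, ::-1]                 # T: reflection across the vertical axis
attr_original = np.random.randn(224, 224)      # dummy attribution of the original depiction
attr_transformed = np.random.randn(224, 224)   # dummy attribution of the reflected depiction
print(symmetry_score(attr_original, attr_transformed, reflect))
```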

We can also check the robustness of the contextual explanations with respect to transformations of the images. In Figure 4 we list some examples for T = R(30), a rotation by a 30-degree angle in the plane of the image, as well as T = G, representing the collection of different graphical molecule depictions. Both the average score and the visual impression from the depicted examples reveal a high level of agreement between the original explanations and the transformed ones.

Summary

In this blog post, we showed, using the example of molecular property prediction, how to leverage multi-modal data for model explanation purposes. Specifically, while the model is trained on SMILES data, we can extract explanations in pixel space from graphical depictions of the molecules corresponding to the same SMILES. This is possible since both feature extraction models (from SMILES and from images) are trained to share a common embedding space. Note that the model is not retrained; we simply swap the feature extraction model when computing explanations in pixel space.

There are at least two important reasons that motivate this approach:

  1. Explanations might be easier to extract, or more powerful, in one data modality than in the other. In our case, we could leverage the extensive literature on XAI applied to image analysis, which has developed more rapidly than that for RNN models. Building on this, we developed the concept of contextual explanations, which extract explanations from all layers of the network. The final attribution map contains both local and global information about the molecule, and is easier to interpret than simple atomic attributions.

  2. One data modality might manifest the symmetry of the data more faithfully. In our case, SMILES break the symmetry of the molecule by construction, while molecular graphical depictions respect the symmetry of the molecule to a higher degree. In our analysis, we showed that extracting explanations in pixel space leads to attribution maps that respect the symmetries of the molecule better than atomic attributions derived from SMILES data.

Finally, one advantage of using pixel space to represent molecules is that it yields a fixed-length representation, whose dimension is determined by the number of pixels; SMILES representations, instead, vary in length with the number of atoms in the molecule. This makes pixel space a potentially better fit for driving generative models in XAI. In this space, attribution maps can also shed light on missing molecular groups: the explanation can highlight that the model's prediction was based on a specific structure being absent, compared to the examples it was trained on. Such reasoning cannot be achieved with atomic attributions, which only provide saliency values for existing atoms. Further exploration is needed in this area, but it has the potential to lead to an XAI-driven optimization approach for molecular property prediction.


About the Author

Marco Bertolini is a machine learning researcher in the field of drug discovery with a background in theoretical physics. After obtaining his Ph.D. from Duke University, where his dissertation focused on the mathematical formulation of string theory, Marco joined the University of Tokyo as a postdoctoral fellow. During his postdoctoral research, he discovered the potential of machine learning in solving problems in physics and mathematics.

In 2020, Marco joined Bayer as a Machine Learning Researcher and became part of the team led by Djork-Arné Clevert. During his time at Bayer, he has developed machine learning models on histopathology whole-slide images and interpretability/explainability methods for deep learning models. He has also researched state-of-the-art graph-based deep learning models for physicochemical property prediction on small molecules. Currently, Marco is working on representation learning techniques for quantum chemistry systems. Starting in March 2023, Marco will join Pfizer's Machine Learning Team, based in Berlin.