VN-EGNN: Protein Binding Site Identification Using Graph Neural Networks with Virtual Nodes

The full paper is available here. A Hugging Face application is available here.

Drug Development

Developing new drugs is an expensive process. It starts with the identification of a drug target: a molecule in the body of an organism that is associated with a particular disease process. If a drug can act on this target, it may produce the desired therapeutic effect. Drug targets are usually proteins, such as receptors, ion channels, or enzymes. For the rest of this post I will only talk about protein drug targets.

Binding Sites of Drug Targets (Proteins)

Even once a drug target has been identified, it is often not easy to act on it directly. An important step is to identify specific regions on the protein's surface, referred to as binding sites. Binding of a ligand to these sites can alter the cellular function of the protein. Once these sites are known, drugs can be designed to bind to them, changing the protein’s behavior and potentially providing a therapeutic effect.

In this post, I want to talk about different methods for detecting binding sites, from early methods to newer machine learning-based approaches, as well as a method we developed at our institute.

Binding Site Identification - Methods

Traditional Methods

Early methods for detecting binding sites include geometry-based and energy-based approaches. Geometry-based approaches analyze the shape of a molecular surface to locate cavities, taking into account the 3D spatial arrangement of atoms on the protein surface. Energy-based approaches, on the other hand, record interactions of probes or molecular fragments with the protein, assigning favorable energetic responses to binding sites.

Both strategies can be performed on a Cartesian grid-based representation of the protein (i.e. checking the environment per grid point) or without (i.e. grid-free).

Ligsite scans the grid in seven directions (along the x, y, and z axes and the four cubic diagonals). For each grid point it records Protein-Solvent-Protein (PSP) events, i.e. scan lines that are bounded by the protein on both sides, giving 0-7 events per grid point. Grid points whose number of PSP events exceeds a given threshold (often 2) are retained and clustered into binding sites.
Surfnet places probe spheres midway between all pairs of atoms on the protein surface. If a sphere clashes with any nearby atom, its radius is reduced until there is no overlap. The resulting spheres then define the cavities.
Drugsite places carbon probes on each grid point and calculates the van der Waals energies between the probe and the protein environment within a distance of 8 Å. Grid points with unfavorable energies, i.e. energies above a cut-off derived from the mean and standard deviation of the energies across the entire grid, are discarded. The remaining grid points are then merged into binding sites.
Docking (not a single algorithm; several algorithms exist) places fragments or small molecules on the protein of interest and scores them. Binding sites are then identified based on the number of fragments that bind to specific areas of the protein. These methods were originally based on scoring functions, but deep learning methods dedicated to docking have recently been developed.

Overview of the different traditional binding site identification methods. (https://projects.volkamerlab.org/teachopencadd/talktorials/T014_binding_site_detection.html)
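To make the Ligsite idea more concrete, here is a minimal sketch of the PSP scan on a toy cubic grid. It is not the original Ligsite implementation: the grid is a plain set of occupied integer cells, and the names `hits_protein`, `psp_counts`, and `pocket_points` are made up for this illustration. The final clustering step is left out.

```python
from itertools import product

# Seven scan directions: the three axes plus the four cubic diagonals.
DIRECTIONS = [
    (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 1), (1, 1, -1), (1, -1, 1), (-1, 1, 1),
]

def hits_protein(occupied, start, step, size):
    """Walk from `start` along `step`; True if a protein cell is met before leaving the grid."""
    x, y, z = start
    dx, dy, dz = step
    while True:
        x, y, z = x + dx, y + dy, z + dz
        if not (0 <= x < size and 0 <= y < size and 0 <= z < size):
            return False
        if (x, y, z) in occupied:
            return True

def psp_counts(occupied, size):
    """Count protein-solvent-protein events (0-7) for every solvent grid point."""
    counts = {}
    for point in product(range(size), repeat=3):
        if point in occupied:
            continue  # protein cells are not scored
        events = 0
        for direction in DIRECTIONS:
            opposite = tuple(-c for c in direction)
            # A PSP event needs protein on both sides of the scan line.
            if (hits_protein(occupied, point, direction, size)
                    and hits_protein(occupied, point, opposite, size)):
                events += 1
        counts[point] = events
    return counts

def pocket_points(occupied, size, threshold=2):
    """Keep solvent points whose PSP count exceeds the threshold (value taken from the text)."""
    return [p for p, n in psp_counts(occupied, size).items() if n > threshold]
```

On a real protein, `occupied` would be filled by marking grid cells that lie within some atom radius, and the retained points would then be clustered into pockets.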

Machine Learning based Methods

In contrast to traditional methods, modern approaches leverage machine learning to improve the accuracy and efficiency of binding site identification. Deep learning methods typically rely on convolutional neural networks (CNNs) or graph neural networks (GNNs).

DeepSite employs a voxel-based representation of proteins, where the protein is divided into a 3D grid. A 3D CNN is then applied to this grid to predict the binding sites.
P2Rank is both simple and powerful. This method involves creating a Connolly surface around the protein. For each point on this surface, features based on the local environment are calculated. A random forest classifier is then trained on these points to predict whether a point lies within a binding site. Finally, points in proximity that are predicted to be part of binding sites are clustered together to form the final prediction.
DeepSurf calculates the solvent-accessible surface of a protein and places small grids along the normal vectors of this surface. Features based on the local environment are computed for each grid cell. Similar to DeepSite, DeepSurf uses 3D CNNs to predict binding sites.
EquiPocket was among the first methods to use Message Passing Graph Neural Networks (MPGNNs) for binding site prediction. In this approach, the entire protein is encoded as a graph that combines surface and atomic graph representations. Some nodes in the graph represent features on the protein surface, while others represent atomic interactions. This hybrid representation allows EquiPocket to capture both surface-level and atomic-level information. EquiPocket classifies each node as being part of the binding site or not, and then takes the geometric center of the positively classified nodes as the binding site center.

CNN and graph based methods for binding site identification
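Several of these methods end with the same post-processing idea: keep the points (or nodes) that were predicted to belong to a binding site, group the ones that lie close together, and report one center per group. The sketch below illustrates that step with single-linkage clustering in plain Python. The score threshold of 0.5 and the distance cut-off of 3 Å are assumptions chosen for illustration, not values taken from P2Rank or EquiPocket, and the function names are hypothetical.

```python
import math

def cluster_points(points, cutoff=3.0):
    """Single-linkage clustering: points within `cutoff` of a cluster member join that cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= cutoff]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

def pocket_centers(points, scores, score_threshold=0.5, cutoff=3.0):
    """Keep points scored above the threshold, cluster them, return one center per cluster."""
    kept = [p for p, s in zip(points, scores) if s > score_threshold]
    centers = []
    for cluster in cluster_points(kept, cutoff):
        members = [kept[i] for i in cluster]
        # Geometric center: coordinate-wise mean of the cluster members.
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers
```

Here `points` is a list of 3D coordinates and `scores` holds the per-point predictions of whatever classifier is used upstream.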

Enhancing GNNs with Virtual Nodes (VN-EGNN)

Our work also builds on GNNs. Despite their strengths, these models can suffer from limited expressiveness, oversmoothing, or oversquashing, which can hinder learning and reduce predictive accuracy. To improve the accuracy of GNNs for binding site identification, we introduce virtual nodes into E(n)-equivariant GNNs (EGNNs). These virtual nodes represent the binding sites themselves.

Each amino acid in the protein is represented by a node in the graph, with the position of its alpha carbon as the node position. To identify regions where binding sites are located, we add virtual nodes to the graph, which are connected to all the other nodes. We train our model to shift the virtual nodes to the positions of potential binding sites.
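As a rough illustration of this graph layout, the sketch below builds an edge list from alpha carbon coordinates. Two things are assumptions made for the example and are not stated in this post: residue nodes are linked when their alpha carbons lie within a distance `radius`, and virtual nodes are linked to every residue node but not to each other. The function name `build_edges` is hypothetical.

```python
import math

def build_edges(ca_positions, num_virtual, radius=10.0):
    """Return directed edges (source, target) for a residue graph with virtual nodes.

    Residue nodes are indexed 0 .. n-1, virtual nodes n .. n+num_virtual-1.
    """
    n = len(ca_positions)
    edges = []
    # Residue-residue edges: alpha carbons within `radius` (assumed cut-off).
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(ca_positions[i], ca_positions[j]) <= radius:
                edges.append((i, j))
    # Every virtual node exchanges messages with every residue node.
    for v in range(n, n + num_virtual):
        for i in range(n):
            edges.append((i, v))
            edges.append((v, i))
    return edges
```

Because each virtual node touches every residue, information from distant parts of the protein can reach it in a single message passing step.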

The virtual nodes are initially positioned on a sphere surrounding the protein. To ensure an even distribution of the virtual nodes across the sphere, we use the Fibonacci grid algorithm.
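A common way to write a Fibonacci grid is shown in the sketch below: heights are spaced evenly along one axis, and each successive point is rotated by the golden angle. How the sphere's `center` and `radius` are derived from the protein is not specified here, so both are plain parameters in this example, and the function name is hypothetical.

```python
import math

def fibonacci_sphere(n, radius=1.0, center=(0.0, 0.0, 0.0)):
    """Return n roughly evenly spread points on a sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n   # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)      # radius of the circle at that height
        theta = golden_angle * i        # rotate by the golden angle each step
        points.append((
            center[0] + radius * r * math.cos(theta),
            center[1] + radius * r * math.sin(theta),
            center[2] + radius * z,
        ))
    return points
```

For example, `fibonacci_sphere(8, radius=30.0, center=protein_centroid)` would give eight starting positions around a protein, where `protein_centroid` is a placeholder for a coordinate you compute yourself.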

VN-EGNN predicts binding sites. The final positions of the virtual nodes are clustered, resulting in four distinct predictions. Each of these predictions is accompanied by a confidence score.

Because our model predicts multiple binding sites, depending on the chosen number of virtual nodes, we rank the virtual nodes by their predicted druggability. Since different virtual nodes can converge to similar locations, we use the Mean Shift algorithm to cluster virtual nodes that are in close spatial proximity. By averaging their self-confidence scores and positions, we treat each cluster as a single prediction.
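The merging step can be sketched as follows. This is a minimal flat-kernel mean shift written in plain Python for illustration, not the implementation used for VN-EGNN (which may rely on a library version with different settings). The `bandwidth` value and the rule for grouping converged modes are assumptions, and the function names are hypothetical.

```python
import math

def mean_shift(points, bandwidth, iterations=50):
    """Flat-kernel mean shift: move each mode to the mean of the points within `bandwidth`."""
    modes = [tuple(p) for p in points]
    for _ in range(iterations):
        new_modes = []
        for m in modes:
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            if not near:  # guard against rounding at the boundary
                new_modes.append(m)
                continue
            new_modes.append(tuple(sum(c) / len(near) for c in zip(*near)))
        if new_modes == modes:
            break
        modes = new_modes
    return modes

def merge_virtual_nodes(positions, confidences, bandwidth=4.0):
    """Group virtual nodes whose modes coincide, then average position and confidence."""
    modes = mean_shift(positions, bandwidth)
    groups, reps = [], []
    for i, mode in enumerate(modes):
        for group, rep in zip(groups, reps):
            if math.dist(mode, rep) <= bandwidth / 2:  # assumed merge tolerance
                group.append(i)
                break
        else:
            groups.append([i])
            reps.append(mode)
    merged = []
    for group in groups:
        pos = tuple(sum(positions[i][k] for i in group) / len(group) for k in range(3))
        conf = sum(confidences[i] for i in group) / len(group)
        merged.append((pos, conf))
    # Highest averaged confidence first, so the list doubles as a ranking.
    return sorted(merged, key=lambda pc: pc[1], reverse=True)
```

The returned list contains one `(position, confidence)` pair per cluster, already ordered from most to least confident.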

The number of virtual nodes is a hyperparameter and is chosen to balance computational complexity with the biologically meaningful number of binding sites in proteins.

Representation of the binding site

Each virtual node has a feature vector that is updated during prediction and that we use to rank the binding sites by their potential druggability. We believe future work could focus on improving these binding site representations. To our knowledge, no prior work in drug discovery has established a direct vector representation of the binding site, and developing one could be beneficial for downstream tasks like virtual screening. It could also be used to analyze similarities between binding sites. Below you can see a visualization of the binding site representations.

Visualization of the learned virtual node feature embeddings, grouped by the corresponding protein’s target classification according to the ChEMBL database, to analyze whether these representations contain relevant information about the protein/binding site. The feature vectors were projected to lower dimensions using t-SNE.
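If such feature vectors were used to compare binding sites, one simple option would be to rank them by cosine similarity. The post does not specify any similarity measure, so the choice of cosine similarity, the function names, and the dictionary layout below are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query, embeddings):
    """Rank named binding site embeddings by similarity to `query`, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Here `embeddings` would map an identifier of your choice (for example a PDB ID plus pocket index) to the virtual node feature vector of that binding site.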

Expressivity of Geometric Graph Neural Networks

The expressive power of Graph Neural Networks can be analyzed through the Weisfeiler-Leman (WL) graph isomorphism test, a framework that has also been extended to geometric graphs. The general idea is to ask how many message passing layers are required to distinguish two non-isomorphic geometric graphs.

In our work we show that after inserting a virtual node that is connected to all other nodes in the graph, a single layer of message passing is enough to distinguish k-hop distinct graphs. This increases the expressivity of Graph Neural Networks. More details and a formal proof can be found in our paper.

Conclusion

We were able to achieve state-of-the-art performance on three benchmark datasets for binding site identification: COACH420, HOLO4K, and PDBBind 2020.
The introduction of virtual nodes enhances the expressivity of geometric graph neural networks.
Our method allows for the learning of potentially useful representations of binding sites, which could be beneficial for downstream tasks like virtual screening.
Example prediction of our model. Left: PDB: 1ODI Right: PDB: 3LPK