Local Euler Curvature (LEC) in Protein Structure and Interaction Analysis

In the ever-evolving landscape of bioinformatics and structural biology, researchers are continually seeking innovative ways to decode the language of proteins. Unraveling the complex three-dimensional structures of proteins holds the key to understanding their functions, interactions, and potential applications in drug discovery. In this blog post, we will explore a cutting-edge approach called Local Euler Curvature (LEC) and its integration with simplicial complexes. Buckle up, machine learning specialists, as we delve into a realm where mathematical elegance meets the intricacies of protein structures, paving the way for new advancements in drug discovery.

Graph Filtrations

Given a distribution of points in space and some radius, we can form a geometric graph that has the points as vertices and an edge between any two points that are closer together than the chosen radius. Associating a graph to a collection of points in this manner allows us to use tools from graph theory to study geometric structures. For example, one can choose the atoms of a protein as points.
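As a minimal sketch of the construction just described, the snippet below builds the edge set of a geometric graph from a handful of made-up 3D points. The function name `geometric_graph`, the sample coordinates, and the choice of a strict "closer than" comparison are our own illustrative assumptions.

```python
from itertools import combinations
from math import dist


def geometric_graph(points, radius):
    """Return the edge set of the geometric graph on `points`.

    Vertices are the indices 0..len(points)-1. An edge (i, j) with i < j
    is present when the two points are closer together than `radius`.
    """
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) < radius
    }


# Four hypothetical points in 3D (stand-ins for atom coordinates).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (3.0, 3.0, 3.0)]

edges_small = geometric_graph(points, 1.2)  # only the two unit-length pairs
edges_large = geometric_graph(points, 1.5)  # also the pair at distance sqrt(2)
print(sorted(edges_small))
print(sorted(edges_large))
```

The all-pairs loop is quadratic in the number of points, which is fine for illustration; for a full protein one would more likely use a spatial index to find neighbors within the radius.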

If we look at the series of graphs as the radius (also known as the filtration cutoff) increases, we can capture the object as a nested sequence of graphs with the same vertices and an increasing number of edges. This forms, in mathematical language, a filtration of graphs. Studying geometric objects in this way can give extra insight in particular contexts.

Some sample graphs from the graph filtration associated to a distribution of points in 2D space. The density of the graphs increases with radius (or threshold distance) r.
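The nested sequence of graphs can be sketched in a few lines. The version below is an illustration under our own assumptions (the name `graph_filtration`, the toy square of points, and the strict distance comparison are not from the original work): it sorts all pairwise distances once and then adds edges as the radius grows, so each graph contains the previous one by construction.

```python
from itertools import combinations
from math import dist


def graph_filtration(points, radii):
    """Yield (radius, edges) for each radius, in increasing order of radius."""
    # Compute every pairwise distance once and sort by distance.
    pairs = sorted(
        (dist(p, q), i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    edges, k = set(), 0
    for r in sorted(radii):
        # Add all edges that become admissible at this radius.
        while k < len(pairs) and pairs[k][0] < r:
            edges.add((pairs[k][1], pairs[k][2]))
            k += 1
        yield r, frozenset(edges)


# Corners of a unit square: sides have length 1, diagonals sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
steps = list(graph_filtration(square, [0.5, 1.1, 1.5]))
for r, edges in steps:
    print(r, len(edges))
```

For the square, the three radii give graphs with no edges, the four sides, and finally all six pairs.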

For instance, when we care only about the relative distances between points and want the description to be invariant under small perturbations of the points, such graphs capture this while discarding extraneous detail. By associating graph-theoretic measurements to the resulting graphs, we can form different measures of similarity between the geometric objects.

Simplicial complexes and Topological Invariants

Some measures of similarity between graphs can be constructed using tools from topology. In graph theory and topology, a simplicial complex is a mathematical construct that represents the connectivity between nodes (vertices) in a network (graph), recording not only pairwise links but also higher-order groupings such as triangles and tetrahedra. In simpler terms, it captures the relationships between different elements, such as different atoms or residues in a protein.

This approach transforms protein structures into a network of interconnected nodes using a mathematical construction, forming the basis of our exploration into Euler Characteristics (ECs). The EC, a topological invariant, serves as the basis for defining Local Euler Characteristics (LECs), which decompose the EC into local components. This is a particularly interesting case, as the EC is the only numerical homotopy invariant of finite complexes that can be determined locally [1].

The Euler Characteristic is defined as the alternating sum of the numbers of simplices of each dimension d. It can be partitioned into a sum of Local Euler Characteristics, one at each of the N nodes of the graph.
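Written out, the two statements above read as follows. The symbols are our own notation, introduced only for illustration: k_d is the number of d-simplices, and χ_i is the Local Euler Characteristic of node i.

```latex
\chi \;=\; \sum_{d \ge 0} (-1)^{d}\, k_d \;=\; \sum_{i=1}^{N} \chi_i
```

For a single filled triangle, for example, k_0 = 3, k_1 = 3 and k_2 = 1, so χ = 3 - 3 + 1 = 1.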

The LEC of a node is expressed using a weighted sum over simplices in the neighborhood of that node. Some theorems of the geometry of manifolds have translations into the discrete realm of graphs, showing that LECs can be interpreted as providing a local measure of curvature for each node [2]. Using LECs as a “fingerprint” of a geometric graph, we can form a measure that allows us to infer how similar different structures are in terms of local curvatures.
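To make the weighted sum concrete, here is a small sketch. It assumes the simplices are the cliques of the graph and uses the weighting from the Gauss-Bonnet-type result in [2], where each d-simplex contributes (-1)^d / (d + 1) to each of its d + 1 nodes. The function names and the toy graph are ours, and the brute-force clique enumeration is only meant for small examples.

```python
from fractions import Fraction


def cliques(n_vertices, edges):
    """Yield every clique as a tuple of increasing vertex indices."""
    adj = {v: set() for v in range(n_vertices)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    def extend(clique, candidates):
        yield clique
        for v in sorted(candidates):
            # Grow only with larger indices so each clique appears once.
            yield from extend(clique + (v,), {u for u in candidates & adj[v] if u > v})

    for v in range(n_vertices):
        yield from extend((v,), {u for u in adj[v] if u > v})


def euler_characteristic(n_vertices, edges):
    """Alternating sum over simplices: a d-simplex has d + 1 vertices."""
    return sum((-1) ** (len(c) - 1) for c in cliques(n_vertices, edges))


def local_euler_characteristics(n_vertices, edges):
    """Share each simplex's contribution equally among its vertices."""
    lec = [Fraction(0)] * n_vertices
    for c in cliques(n_vertices, edges):
        d = len(c) - 1
        for v in c:
            lec[v] += Fraction((-1) ** d, d + 1)
    return lec


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
edges = {(0, 1), (0, 2), (1, 2), (2, 3)}
chi = euler_characteristic(4, edges)
lec = local_euler_characteristics(4, edges)
print(chi)                    # 4 vertices - 4 edges + 1 triangle = 1
print([str(x) for x in lec])  # ['1/3', '1/3', '-1/6', '1/2']
```

The per-node values sum to the global EC, and the node where the pendant edge meets the triangle gets a negative value, in line with the curvature interpretation.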

LECs: Fingerprints for Protein Structures 

The local nature of LECs is particularly relevant when attempting to categorize protein secondary structures, which are dominated by local hydrogen bonds. In our recent work [3], we associated protein secondary structures to a series of geometric graphs, using the atoms of the protein as the distribution of points and examining the resulting filtration of geometric graphs.

The correspondence between a protein and the associated geometric graph of some radius.

Our dataset comprised 3D protein structures sourced from the latest version of the CATH database [4], providing a diverse range of non-redundant domains. For efficiency, we take advantage of the fact that a filtration of graphs also gives a simplicial filtration. Therefore, using the LEC profile of the previous graph in the series can significantly simplify the calculation of the LEC profile of the next graph.
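One way such reuse could work is sketched below. This is our own illustration, not necessarily the procedure used in [3], and it again assumes that simplices are cliques weighted as in [2]. When an edge (u, v) is added, the only new simplices are those containing both u and v, which are the edge itself joined with any clique of common neighbors. The existing LEC values can therefore be patched rather than recomputed from scratch. The names `cliques_within` and `add_edge` are hypothetical.

```python
from fractions import Fraction


def cliques_within(vertices, adj):
    """Yield every clique (including the empty one) inside `vertices`."""
    def extend(clique, candidates):
        yield clique
        for v in sorted(candidates):
            yield from extend(clique + (v,), {u for u in candidates & adj[v] if u > v})

    yield from extend((), set(vertices))


def add_edge(adj, lec, u, v):
    """Insert edge (u, v) and update the per-node LEC values in place."""
    common = adj[u] & adj[v]
    # Every new simplex is {u, v} plus a clique of common neighbors.
    for extra in cliques_within(common, adj):
        simplex = (u, v) + extra
        d = len(simplex) - 1
        for w in simplex:
            lec[w] += Fraction((-1) ** d, d + 1)
    adj[u].add(v)
    adj[v].add(u)


# Start from 4 isolated vertices: each is a 0-simplex contributing 1.
adj = {i: set() for i in range(4)}
lec = [Fraction(1)] * 4

# Add edges in the order a growing radius might reveal them.
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    add_edge(adj, lec, u, v)

print([str(x) for x in lec])  # ['1/3', '1/3', '-1/6', '1/2']
```

Each update only touches the neighborhood of the new edge, so the cost depends on local density rather than on the size of the whole structure.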

Examining proteins via this interpretable topological invariant can potentially lead to further insights into the nature of the local structure, something that is hard to obtain from the black-box machine learning methods that have previously been applied successfully to the study of protein structure, for instance in [5] and [6]. Plotting the sum of the LECs for each protein against increasing filtration cutoff results in a curve we call the LEC profile. Some remarkable patterns become immediately apparent.

A collection of LEC profiles of different clusters (given as different colors) of proteins. The shaded area represents the standard deviation of the LEC profiles within each cluster, showing that individual clusters have highly correlated LEC profiles with easily identifiable differences between clusters.
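A toy version of such a profile can be computed directly. Because the LECs sum to the EC, the sketch below evaluates the EC of the geometric graph at each cutoff. As before, treating cliques as simplices, the function names, and the four sample points are our own assumptions for illustration.

```python
from itertools import combinations
from math import dist


def euler_characteristic(points, cutoff):
    """EC of the clique complex of the geometric graph at `cutoff`."""
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) < cutoff:
            adj[i].add(j)
            adj[j].add(i)

    def count(size, candidates):
        # Signed count of the current clique plus all of its extensions.
        total = (-1) ** (size - 1)
        for v in candidates:
            total += count(size + 1, {u for u in candidates & adj[v] if u > v})
        return total

    return sum(count(1, {u for u in adj[v] if u > v}) for v in range(n))


def lec_profile(points, cutoffs):
    """Sum of LECs (equal to the EC) at each filtration cutoff."""
    return [euler_characteristic(points, r) for r in cutoffs]


# Corners of a unit square: sides have length 1, diagonals sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
profile = lec_profile(square, [0.5, 1.1, 1.5])
print(profile)  # [4, 0, 1]
```

Even in this tiny example the curve is not monotonic in the cutoff: it drops as the sides of the square appear and rises again once triangles and the tetrahedron fill in.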

Takeaways 

PCA on the LEC profiles of the proteins shows high correlation, indicating that such LEC curves evolve in a relatively stable and predictable manner. We also clustered and scored the data and found that the clustering reproduced broad divisions in the categorisation of proteins, as well as identifying separate and particularly dense subclusters, interesting as a subject of future research.

We use a Random Forest Classifier to explore the potential applications of LEC in protein structure classification. The classifier is trained on the so-called consensus classification of residues, i.e. those proteins on which the two major protein structure classification databases, STRIDE and CATH, agree. We then evaluated the classifier using various metrics to demonstrate its effectiveness in classifying protein structures.

A schema of the interacting elements of the methodology.

LEC proves to be a powerful feature in accurately describing and classifying protein structures, outperforming traditional methods. The unsupervised clustering reveals distinct structural patterns, especially in the natural splitting of secondary structure categories, suggesting potential avenues for further investigation.

Possible applications in Drug Discovery

Now that we have a grasp of LEC's elegance, let's explore its potential applications in the realm of drug discovery when combined with machine learning.

  1. Protein Structure Classification: LEC offers a powerful metric for classifying different secondary structures within proteins. Its ability to discern subtle variations in local geometry makes it a promising candidate for refining existing methods or even introducing new approaches to classify proteins accurately. Machine learning models, particularly those using random forests, can leverage LEC as a feature space, contributing to more robust and precise protein structure predictions.

  2. Drug Target Identification: Identifying potential drug targets often involves understanding the nuances of protein structures. LEC provides a granular view of the local environment around specific residues, offering valuable insights into regions that might be suitable for drug binding. Machine learning algorithms can be trained to recognize patterns in LEC profiles associated with successful drug-binding sites, aiding in the identification of novel therapeutic targets.

  3. Predicting Protein-Ligand Interactions: The success of drug discovery hinges on our ability to predict and understand how proteins interact with potential drug candidates. LEC, with its local curvature information, can enhance our predictive capabilities in modeling protein-ligand interactions. Integrating LEC features into machine learning models allows for a more nuanced understanding of binding sites and the potential efficacy of different ligands.

  4. Structural Analysis of Protein Networks: Proteins seldom function in isolation; they interact within intricate networks. LEC, applied to simplicial complexes, can unravel the topological features of these protein networks [3,7]. Machine learning specialists can develop models that leverage LEC information to identify critical nodes, understand network dynamics, and predict the impact of perturbations within these complex biological systems.

As we journey through the realm of Local Euler Curvature and its integration with simplicial complexes, it becomes evident that the marriage of mathematical elegance and protein structures has far-reaching implications. Machine learning specialists are poised at the forefront of this exciting intersection, where computational methods meet the challenges of understanding and predicting protein behavior.

In the context of drug discovery, LEC emerges as a fingerprint, providing a level of detail and precision that was previously challenging to achieve. This is due to its intrinsically oscillatory behavior, which captures structural changes more readily than monotonic functions of the distances, such as traditional "contact maps" or potentials. Compared to traditional methodologies like "contact maps" and/or Voronoi diagrams, our methodology is also suited to future multiscale, coarse-grained applications [8].

By incorporating LEC into machine learning workflows, we open new avenues for unraveling the mysteries of protein structures, accelerating drug discovery, and ushering in a new era of precision medicine.

As you continue your exploration into the world of machine learning and bioinformatics, keep an eye on the evolving landscape of LEC and its transformative potential in decoding the language of proteins. The future of drug discovery might just be one LEC profile away.

_______________________

Interested in learning more about this space? Check out these articles!

[1] Levitt, N. The Euler characteristic is the unique locally determined numerical homotopy invariant of finite complexes. Discrete & Computational Geometry 7, 59-67 (1992).

[2] Knill, O. A graph theoretical Gauss-Bonnet-Chern theorem. arXiv preprint arXiv:1111.5395 (2011).

[3] Moreira, R. A., et al. Discovering Secondary Protein Structures via Local Euler Curvature. bioRxiv (2023).

[4] Knudsen, M. and Wiuf, C. The CATH database. Human Genomics 4, 1-6 (2010).

[5] Kulkarni, P. et al. Intrinsically disordered proteins: critical components of the wetware. Chemical Reviews 122, 6614-6633 (2022).

[6] Guo, Z., Liu, J., Skolnick, J. and Cheng, J. Prediction of inter-chain distance maps of protein complexes with 2d attention-based deep neural networks. Nature Communications, 13, 6963 (2022).

[7] de Amorim Filho, E. C., Moreira, R. A. and Santos, F. A. The Euler characteristic and topological phase transitions in complex systems. Journal of Physics: Complexity 3, 025003 (2022).

[8] Cofas-Vargas, L. F., Moreira, R. A., Poblete, S., Chwastyk, M. and Poma, A. B. The GoMartini approach: Revisiting the concept of contact maps and the modeling of protein complexes. arXiv preprint arXiv:2311.08174 (2023).
