Tue, Oct 17, 3:00pm

Large Language Molecular Representation and Learning

Talk Abstract: Molecular machine learning shows significant potential for enhancing molecular property prediction and drug discovery. Nevertheless, obtaining labelled molecular data can be a costly and time-consuming endeavour. This scarcity of labelled data poses a formidable challenge for supervised machine learning models, which struggle to generalize effectively across the vast expanse of chemical space.

A further challenge pertains to the representation of molecules: is a graph neural network a suitable means of representing a molecule, or does a textual description such as SMILES offer a superior alternative? In this presentation, I will begin by introducing MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks). MolCLR employs a self-supervised learning framework that capitalizes on a large volume of unlabelled data, totalling approximately 10 million distinct molecules. During the pre-training phase of MolCLR, we construct molecular graphs and develop graph neural network encoders to learn differentiable representations. A contrastive estimator maximizes the agreement between augmentations derived from the same molecule while minimizing the agreement between augmentations of distinct molecules.

Following that, I will delve into our work on the textual representation of polymers, Metal Organic Frameworks (MOFs), catalysis systems, and organic molecules, and assess its effectiveness relative to graph representations. To this end, we leverage pre-trained language models such as BERT and RoBERTa to propose a framework that capitalizes on rich textual and qualitative representations of molecules for property prediction. An analysis of attention scores reveals how these models prioritize tokens associated with words and semantics that play pivotal roles in the representation. Furthermore, I will demonstrate how multimodal learning applied to molecules can yield synergistic improvements in the learning process.
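To make the contrastive objective concrete, here is a minimal sketch of an NT-Xent-style loss of the kind used in graph contrastive pre-training such as MolCLR. The function name, tensor shapes, and temperature value are illustrative assumptions rather than the exact implementation discussed in the talk; the embeddings would come from a GNN encoder applied to two augmented views of each molecular graph.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.1):
    """Contrastive (NT-Xent-style) loss over a batch of molecule embeddings.

    z_i, z_j: [N, d] embeddings of two augmentations of the same N molecules
    (e.g. produced by a GNN encoder). The two views of a molecule form the
    positive pair; every other pairing in the batch acts as a negative.
    """
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # [2N, d], unit norm
    sim = z @ z.t() / temperature                          # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # a view is not its own positive
    # For row k, the positive sits n rows away (the other view of the same molecule).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Placeholder embeddings standing in for GNN outputs on augmented molecular graphs.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = nt_xent_loss(z1, z2)
```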
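For the text-based route, the sketch below shows how a pre-trained RoBERTa encoder could be fine-tuned to map SMILES strings to a scalar property with the Hugging Face transformers library. The `roberta-base` checkpoint, regression head, and example molecules are placeholder assumptions for illustration; the framework presented in the talk may use different checkpoints, tokenization, and textual descriptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Generic checkpoint used purely for illustration; a chemistry-specific model
# pre-trained on SMILES or molecular text could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression"
)

smiles = ["CCO", "c1ccccc1O"]  # ethanol and phenol as toy inputs
batch = tokenizer(smiles, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    preds = model(**batch).logits.squeeze(-1)  # one predicted property per molecule
```

Attention scores of the kind analysed in the talk can be inspected by passing `output_attentions=True` when calling the model, which returns per-layer attention weights over the input tokens.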


Previous Talks