ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Improving the rigour and reproducibility of bio-related ML will require more standardization of datasets, processing, and model evaluation. This paper presents a software package that simplifies dataset creation and model evaluation for deep learning on protein structures, and provides a set of pre-processed datasets derived from the Protein Data Bank (PDB) and AlphaFoldDB. The authors benchmark the prediction tasks associated with each dataset and evaluate model generalization, with an eye towards real-world implications.