Application of machine learning and structural biology to understand protein-ligand interaction

Date
2023
Authors
Muellers, Samantha N.
Embargo Date
2026-02-05
Abstract
Machine learning (ML) comprises an array of methods, powered by statistics or artificial intelligence (AI), for analyzing large, complex sets of data. ML techniques can be applied to a wide range of biochemical problems, including the understanding of protein-ligand binding. Much of the data from experimental biochemistry goes unused, but ML methods can be applied to these data to identify patterns and extract useful information. In this work, we develop and demonstrate several applications of ML techniques to data from X-ray crystallographic structures of protein-ligand complexes, molecular dynamics (MD) simulations of ligand structures, binding affinity assays, enzymatic assays, and protein sequence alignments. In one application of ML, we studied the protein-ligand interactions between Kelch-like ECH-associated protein 1 (KEAP1) and a series of linear and cyclic peptide (CP) inhibitors. Despite modest variance in the conformations of these inhibitors bound to KEAP1, a wide range of binding affinities was observed. The ML techniques of partial least squares regression (PLSR) and principal component analysis (PCA) identified that this variance in affinity resulted from changes in conformational preorganization and strain in the unbound and bound states of the peptides, respectively. These techniques also elucidated the contributions of cyclization to the variance in binding affinity between each pair of linear and cyclic peptides in the dataset. MD simulations of some of these CPs were used to observe the conformations of each peptide in the unbound state. Another ML technique, k-medoids clustering, was applied to the data from the MD simulations to identify correlated conformational motions within these peptides, which may explain the non-linear structure-activity relationships (SAR) that these cyclically constrained peptides exhibited in our study and that such peptides exhibit in general.
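As an illustration of the k-medoids step mentioned above, the following is a minimal sketch, not the procedure used in this work. It assumes each MD frame has already been reduced to a short numeric feature vector and uses plain Euclidean distance, which ignores the periodicity of dihedral angles. The function names, the toy `frames` data, and the simple alternating update scheme are all hypothetical choices made for the example.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(dist, medoids):
    """Label every point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating k-medoids: assign points, then re-pick each cluster's medoid."""
    rng = random.Random(seed)
    n = len(points)
    # Precompute the full pairwise distance matrix once.
    dist = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical toy "frames": two well-separated conformational states.
frames = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
medoids, labels = k_medoids(frames, k=2)
print("medoid frames:", [frames[m] for m in medoids])
print("cluster labels:", labels)
```

Unlike k-means, each cluster center here is an actual observed frame, so a medoid can be inspected directly as a representative conformation.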
The mechanisms of CP membrane permeability are also poorly understood, posing challenges for medicinal chemists interested in improving the oral bioavailability of these modalities. Therefore, in a separate study, we used PCA to identify physicochemical properties of CPs related to permeability. We also demonstrate the use of structures predicted by the AI program AlphaFold as molecular replacement models for determining crystallographic phases or as homology models to identify trends in structure-function relationships. Lastly, we develop a new method using partial least squares discriminant analysis (PLS-DA) to classify bacterial protein sequences based on species growth temperatures. This novel use of PLS-DA successfully identified amino acid substitutions that impart stability to the target protein, as validated by a prospective experimental analysis of a polyamine oxidase. Taken together, these studies demonstrate the power of ML for the quantitative analysis of physicochemical properties to elucidate underlying biochemical effects.
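The idea of classifying aligned sequences with PLS-DA can be sketched as follows. This is a simplified illustration under stated assumptions, not the method developed in this work: it uses a single PLS component, one-hot encoding of residues, and a 0.5 decision threshold. The sequences, labels (1 for a high growth temperature, 0 for a low one), and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Encode an aligned sequence as 20 indicator features per position."""
    return [1.0 if res == aa else 0.0 for res in seq for aa in AMINO_ACIDS]


def fit_pls_da(seqs, labels):
    """Fit a one-component PLS model to 0/1 class labels."""
    X = [one_hot(s) for s in seqs]
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(labels) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [y - y_mean for y in labels]
    # Weight vector: direction in feature space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores of the training samples, then regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}


def predict(model, seq):
    """Return 1 if the predicted response is at least 0.5, else 0."""
    x = one_hot(seq)
    t = sum((xj - m) * wj for xj, m, wj in zip(x, model["x_mean"], model["w"]))
    return 1 if model["y_mean"] + model["q"] * t >= 0.5 else 0


def top_features(model, count):
    """List the (position, residue) features with the largest positive weights."""
    order = sorted(range(len(model["w"])), key=lambda j: model["w"][j], reverse=True)
    return [(j // 20, AMINO_ACIDS[j % 20]) for j in order[:count]]


# Hypothetical aligned toy sequences; labels are illustrative only.
low_temp = ["AKDGS", "AKDGT", "AKNGS"]
high_temp = ["AREPS", "AREPT", "ARDPS"]
model = fit_pls_da(low_temp + high_temp, [0, 0, 0, 1, 1, 1])

print("prediction for ARDPT:", predict(model, "ARDPT"))
print("prediction for AKNGT:", predict(model, "AKNGT"))
print("residues most associated with class 1:", top_features(model, 2))
```

In this toy setting, the features with the largest weights point to the positions and residues that best separate the two classes, which is the kind of signal one might use to propose candidate substitutions for experimental testing.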