Filippo Grazioli, Anja Mösch, Pierre Machart, Kai Li, Israa Alqassem, Timothy J. O’Donnell, Martin Renqiang Min: "On TCR binding predictors failing to generalize to unseen peptides", Frontiers in Immunology, 2022
Several recent studies investigate TCR-peptide/-pMHC binding prediction using machine or deep learning approaches. Many of these methods achieve impressive results on test sets which include peptide sequences that are also included in the training set. In this work, we investigate how state-of-the-art deep learning models for TCR-peptide/-pMHC binding prediction generalize to unseen peptides. We create a dataset called TChard, which includes positive samples from IEDB, VDJdb, McPAS-TCR and the MIRA set, as well as negative samples from both randomization and 10X Genomics assays. We propose the hard split, a simple heuristic for training/test split, which ensures that test samples exclusively present peptides that do not belong to the training set. We investigate the effect of different training/test splitting techniques on the models’ test performance, as well as the effect of training and testing the models using mismatched negative samples generated randomly, in addition to the negative samples derived from assays. Our results show that modern deep learning methods fail to generalize to unseen peptides. We provide an explanation for why this happens and verify our hypothesis on the TChard dataset. We then conclude that robust prediction of TCR recognition is still far from being solved.
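The idea behind the hard split can be illustrated with a minimal Python sketch. It is not the procedure used to build TChard: the record layout, the `hard_split` function, the `test_fraction` parameter and the toy identifiers are all assumptions made for illustration. The only property taken from the abstract is that no peptide in the test set may also occur in the training set.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical record layout: one TCR-peptide pair with a binding label.
Sample = Dict[str, str]


def hard_split(samples: List[Sample], test_fraction: float = 0.2,
               seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split by peptide so that test peptides never appear in training."""
    peptides = sorted({s["peptide"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(peptides)
    # Hold out whole peptides, not individual samples.
    n_test = max(1, round(test_fraction * len(peptides)))
    test_peptides = set(peptides[:n_test])
    train = [s for s in samples if s["peptide"] not in test_peptides]
    test = [s for s in samples if s["peptide"] in test_peptides]
    return train, test


# Toy data with placeholder identifiers (not real sequences).
samples = [
    {"tcr": "tcr1", "peptide": "pep1", "label": "1"},
    {"tcr": "tcr2", "peptide": "pep1", "label": "0"},
    {"tcr": "tcr3", "peptide": "pep2", "label": "1"},
    {"tcr": "tcr4", "peptide": "pep2", "label": "0"},
    {"tcr": "tcr5", "peptide": "pep3", "label": "1"},
    {"tcr": "tcr6", "peptide": "pep3", "label": "0"},
]
train, test = hard_split(samples, test_fraction=0.34)
train_peptides = {s["peptide"] for s in train}
test_peptides = {s["peptide"] for s in test}
```

A random split over samples would instead place pairs that share a peptide on both sides, which is the setting in which, according to the abstract, many methods report impressive results.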
Published in: Frontiers in Immunology, 2022
Research partners: Icahn School of Medicine at Mount Sinai, NEC Laboratories America
Full paper download: TCR_Binding_Predictors_Failing_to_Generalize_to_Unseen_Peptides.pdf
J. Cheng, K. Ritter, K. Bendjama, B. Malone, “BERTMHC: improved MHC–peptide class II interaction prediction with transformer and multiple instance learning”, Bioinformatics 2021
Motivation: Increasingly comprehensive characterization of cancer-associated genetic alterations has paved the way for the development of highly specific therapeutic vaccines. Predicting precisely the binding and presentation of peptides to major histocompatibility complex (MHC) alleles is an important step toward such therapies. Recent data suggest that presentation of both class I and II epitopes is critical for the induction of a sustained effective immune response. However, the prediction performance for MHC class II has been limited compared to class I.
Results: We present a transformer neural network model which leverages self-supervised pretraining from a large corpus of protein sequences. We also propose a multiple instance learning (MIL) framework to deconvolve mass spectrometry data where multiple potential MHC alleles may have presented each peptide. We show that pretraining boosted the performance for these tasks. Combining pretraining and the novel MIL approach, our model outperforms state-of-the-art models based on peptide and MHC sequence only for both binding and cell surface presentation predictions.
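As a rough illustration of the multiple instance learning setting described above, the following Python sketch treats each peptide together with its candidate alleles as a bag and scores the bag by its best instance. Max pooling is one standard MIL aggregation; the abstract does not say which aggregation the authors use, so this choice, the function names and the toy scores are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple


def bag_presentation_score(peptide: str, alleles: Sequence[str],
                           instance_score: Callable[[str, str], float]) -> float:
    """Max-pooling MIL: a bag is as positive as its best (peptide, allele) instance."""
    if not alleles:
        raise ValueError("a bag needs at least one candidate allele")
    return max(instance_score(peptide, allele) for allele in alleles)


# Hypothetical per-instance scores standing in for a trained model's output.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("pepA", "allele1"): 0.10,
    ("pepA", "allele2"): 0.90,
    ("pepA", "allele3"): 0.30,
}


def toy_instance_score(peptide: str, allele: str) -> float:
    # Unknown pairs get a score of 0.0 in this toy lookup.
    return TOY_SCORES.get((peptide, allele), 0.0)


bag_score = bag_presentation_score(
    "pepA", ["allele1", "allele2", "allele3"], toy_instance_score
)
```

With this aggregation the bag label from mass spectrometry ("some allele presented this peptide") supervises only the highest-scoring allele, which is what lets the per-allele scores be deconvolved during training.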
Availability and implementation: Our source code is available at github.com/s6juncheng/BERTMHC under a noncommercial license. A webserver is available at bertmhc.privacy.nlehd.de
Published in: Bioinformatics
Full paper download: BERTMHC_improved_MHC–peptide_class_II_interaction_prediction.pdf
F. Grazioli, R. Siarheyeu, I. Alqassem, A. Henschel, G. Pileggi, A. Meiser: "Microbiome-based disease prediction with multimodal variational information bottlenecks", PLOS Computational Biology, April 2022
Scientific research is shedding light on the interaction of the gut microbiome with the human host and on its role in human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. Most of them leverage shotgun metagenomic sequencing to extract gut microbial species-relative abundances or strain-level markers. Each of these gut microbial profiling modalities showed diagnostic potential when tested separately; however, no existing approach combines them in a single predictive framework. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model capable of learning a joint representation of multiple heterogeneous data modalities. MVIB achieves competitive classification performance while being faster than existing methods. Additionally, MVIB offers interpretable results. Our model adopts an information theoretic interpretation of deep neural networks and computes a joint stochastic encoding of different input data modalities. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundances and strain-level markers. MVIB is evaluated on human gut metagenomic samples from 11 publicly available disease cohorts covering 6 different diseases. We achieve high performance (0.80 < ROC AUC < 0.95) on 5 cohorts and at least medium performance on the remaining ones. We adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to the model’s predictions. We also perform cross-study generalisation experiments, where we train and test MVIB on different cohorts of the same disease, and overall we achieve comparable results to the baseline approach, i.e. the Random Forest. Further, we evaluate our model by adding metabolomic data derived from mass spectrometry as a third input modality. 
Our method is scalable with respect to input data modalities and has an average training time of < 1.4 seconds. The source code and the datasets used in this work are publicly available.
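One common way to build a joint stochastic encoding from several modalities is a product of Gaussian experts, in which each modality contributes a diagonal Gaussian over the latent space and the precisions add up. The abstract does not state how MVIB combines modalities, so the sketch below is an assumption made for illustration, as are the function name and the toy numbers.

```python
from typing import List, Sequence, Tuple


def product_of_gaussian_experts(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    include_prior: bool = True,
) -> Tuple[List[float], List[float]]:
    """Combine per-modality diagonal Gaussians into one joint Gaussian."""
    if not means or len(means) != len(variances):
        raise ValueError("need one mean and one variance vector per modality")
    dim = len(means[0])
    joint_mean: List[float] = []
    joint_var: List[float] = []
    for d in range(dim):
        # A standard normal prior expert has mean 0 and precision 1.
        precision = 1.0 if include_prior else 0.0
        weighted_mean = 0.0
        for mu, var in zip(means, variances):
            precision += 1.0 / var[d]
            weighted_mean += mu[d] / var[d]
        joint_var.append(1.0 / precision)
        joint_mean.append(weighted_mean / precision)
    return joint_mean, joint_var


# Two hypothetical modalities with a one-dimensional latent space.
mean, var = product_of_gaussian_experts(
    means=[[1.0], [3.0]], variances=[[1.0], [1.0]], include_prior=False
)
```

Because the combination is a sum over experts, a missing modality can simply be left out of the product, and a third modality such as metabolomic data adds one more term without changing the rest.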
Published in: PLOS Computational Biology, April 2022
Paper available at: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010050
Full paper download: Microbiome-based_Disease_Prediction_with_Multimodal_Variational_Information_Bottlenecks.pdf
B. Malone, C. Tosch, B. Grellier, K. Onoue, T. Sztyler, K. Ritter, Y. Yamashita, E. Quemeneur, K. Bendjama: "Performance of neoantigen prediction for the design of TG4050, a patient specific neoantigen cancer vaccine", American Association for Cancer Research Annual Meeting AACR, April 2020
The development of therapeutic cancer vaccines to immunize against tumor antigens constitutes a promising modality. Mutation associated antigens are considered major targets given their specificity to tumor cells. These mutations are specific to the patients and require a tailor-made vaccine targeting mutations identified in each tumor. Many mutations are identified in the tumoral genome in most patients, but only a small fraction (around 1%) is suitable as vaccine target. Herein, we report data documenting the prediction performance of the algorithm used for the design of TG4050, a clinical stage patient specific viral-based neoantigen vaccine.
We have trained a set of independent machine learning algorithms to score each candidate neoantigen for several steps of the MHC antigen presentation pathway, including MHC binding, intracellular processing, similarity to self, and likelihood to elicit a T-cell response in peptide stimulated ELISPOT. Further, we have developed a novel graph neural network to combine all these scores to predict the likelihood that a neoantigen will elicit a T-cell response while also incorporating patient-specific factors, such as expression level and conservation of the mutation across different clones. To validate the system, we collected samples from 6 patients diagnosed with NSCLC, sequenced healthy and tumor tissue, identified mutations and ranked them using our algorithm; then, to evaluate immunogenicity, we focused our analysis on CD8+ T cells and measured the frequency of IFN γ+ cells against predicted peptides in autologous PBMC. Immunogenicity of peptides was assayed in 5 pools, then deconvoluted against individual peptides.
Between 3339 and 4782 somatic variants were detected in tumor tissue samples. After applying technical filtering, removing synonymous mutations, and filtering on transcript expression, we detected a median of 281 (192-471) expressed tumor mutations, resulting in a median of 2767 candidate class I epitopes (1769-4573). The model achieved high accuracy, allowing us to identify peptides with pre-existing ex vivo immunogenic responses in 5 out of 6 patients. Immunogenicity of peptide pools was correlated with ranking by the algorithm. Immunogenicity of the 6 top ranking individual epitopes in each patient showed a median of 5 (2-6) immunogenic peptides, resulting in a true positive rate (TP) of 77%. It should be noted that when no response was detected, it cannot be excluded that a response could be primed by a vaccine. In a similar setting, the netMHC 4.0 algorithm yielded a TP of 30% and only identified 39% of positive calls of our algorithm.
We demonstrate that the prediction algorithm is accurate in identifying immunogenic cancer mutations even among a large set of candidates. Ongoing TG4050 clinical studies (NCT03839524 and NCT04183166) will allow further validation of the antitumor activity of the elicited immune response.
Presented at: American Association for Cancer Research Annual Meeting AACR, April 2020
Paper available at: Cancer Res 2020;80(16 Suppl):Abstract nr 4566
Brandon Malone, Boris Simovski, Clément Moliné, Jun Cheng, Marius Gheorghe, Hugues Fontenelle, Ioannis Vardaxis, Simen Tennøe, Jenny-Ann Malmberg, Richard Stratford, Trevor Clancy: "Artificial intelligence predicts the immunogenic landscape of SARS-CoV-2: toward universal blueprints for vaccine designs", Scientific Reports 2020
The global population is at present suffering from a pandemic of Coronavirus disease 2019 (COVID-19), caused by the novel coronavirus Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The goals of this study were to use artificial intelligence (AI) to predict blueprints for designing universal vaccines against SARS-CoV-2 that contain a sufficiently broad repertoire of T-cell epitopes capable of providing coverage and protection across the global population. To help achieve these aims, we profiled the entire SARS-CoV-2 proteome across the 100 most frequent HLA-A, HLA-B and HLA-DR alleles in the human population, using host-infected cell surface antigen presentation and immunogenicity predictors from the NEC Immune Profiler suite of tools, and generated comprehensive epitope maps. We then used these epitope maps as input for a Monte Carlo simulation designed to identify statistically significant “epitope hotspot” regions in the virus that are most likely to be immunogenic across a broad spectrum of HLA types. We then removed epitope hotspots that shared significant homology with proteins in the human proteome to reduce the chance of inducing off-target autoimmune responses. We also analyzed the antigen presentation and immunogenic landscape of all the nonsynonymous mutations across 3400 different sequences of the virus, to identify a trend whereby SARS-CoV-2 mutations are predicted to have reduced potential to be presented by host-infected cells, and consequently detected by the host immune system. A sequence conservation analysis then removed epitope hotspots that occurred in less-conserved regions of the viral proteome.
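A Monte Carlo test for hotspot regions can be sketched as follows. This is not the simulation used in the study, whose statistic and null model are not given in the abstract. The sketch assumes a per-position epitope score along a protein, uses the mean score in a sliding window as the statistic, and builds the null distribution by shuffling positions and recording the best window of each shuffle. All names and numbers are hypothetical.

```python
import random
from typing import List, Sequence


def window_means(scores: Sequence[float], width: int) -> List[float]:
    """Mean score of every contiguous window of the given width."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]


def hotspot_p_values(scores: Sequence[float], width: int,
                     n_sim: int = 1000, seed: int = 0) -> List[float]:
    """Empirical p-value per window against a shuffled-position null."""
    rng = random.Random(seed)
    observed = window_means(scores, width)
    exceed = [0] * len(observed)
    shuffled = list(scores)
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        # Comparing against the best shuffled window accounts for
        # testing many windows at once.
        null_max = max(window_means(shuffled, width))
        for i, obs in enumerate(observed):
            if null_max >= obs:
                exceed[i] += 1
    # Add-one correction keeps the estimate strictly above zero.
    return [(count + 1) / (n_sim + 1) for count in exceed]


# Toy score track: a low background with one cluster of high scores.
scores = [0.1] * 10 + [0.9] * 5 + [0.1] * 15
p_values = hotspot_p_values(scores, width=5, n_sim=200, seed=0)
significant = [i for i, p in enumerate(p_values) if p < 0.05]
```

In this toy track the window that starts at position 10 covers the whole high-scoring cluster, so shuffled tracks rarely match it, while background windows are matched by every shuffle.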
Finally, we used a database of the HLA genotypes of approximately 22 000 individuals to develop a “digital twin” type simulation to model how effectively different combinations of hotspots would work in a diverse human population, and used the approach to identify an optimal constellation of epitope hotspots that could provide maximum coverage in the global population. By combining the antigen presentation to the infected-host cell surface and immunogenicity predictions of the NEC Immune Profiler with a robust Monte Carlo and digital twin simulation, we have managed to profile the entire SARS-CoV-2 proteome and identify a subset of epitope hotspots that could be harnessed in a vaccine formulation to provide a broad coverage across the global population.
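Choosing a set of hotspots that covers as many genotyped individuals as possible is a maximum coverage problem. The sketch below uses a greedy heuristic, which is an assumption and not necessarily the optimisation used in the study. It also assumes that an individual counts as covered when at least one of their alleles is predicted to present at least one chosen hotspot. Hotspot names, allele names and the toy population are placeholders.

```python
from typing import Dict, List, Set


def greedy_hotspot_selection(hotspot_alleles: Dict[str, Set[str]],
                             population: List[Set[str]],
                             budget: int) -> List[str]:
    """Pick up to `budget` hotspots, each time adding the one that covers
    the most individuals who are still uncovered."""
    uncovered = set(range(len(population)))
    chosen: List[str] = []
    for _ in range(budget):
        best, best_gain = None, 0
        for name in sorted(hotspot_alleles):
            if name in chosen:
                continue
            gain = sum(1 for i in uncovered
                       if population[i] & hotspot_alleles[name])
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # no remaining hotspot covers anyone new
        chosen.append(best)
        uncovered -= {i for i in uncovered
                      if population[i] & hotspot_alleles[best]}
    return chosen


# Alleles predicted to present each hypothetical hotspot.
hotspot_alleles = {
    "H1": {"a2", "b7"},
    "H2": {"a1"},
    "H3": {"a2"},
}
# Each simulated individual is represented by a set of alleles.
population = [{"a2", "b8"}, {"a1", "b7"}, {"a1", "b8"}, {"a3", "b7"}]
chosen = greedy_hotspot_selection(hotspot_alleles, population, budget=2)
covered = sum(1 for person in population
              if any(person & hotspot_alleles[h] for h in chosen))
```

Greedy selection is a standard approximation for maximum coverage and scales to tens of thousands of genotypes, but it does not guarantee the optimal set.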
"Learning Representations of Missing Data using Graph Neural Networks for Predicting Patient Outcomes," AAAI Workshop 2021
Extracting actionable insight from Electronic Health Records (EHRs) poses several challenges for traditional machine learning approaches. Patients are often missing data relative to each other; the data comes in a variety of modalities, such as multivariate time series, free text, and categorical demographic information; important relationships among patients can be difficult to detect; and many others. We propose a novel approach to address these first three challenges using a representation learning scheme based on graph neural networks. Our proposed approach is competitive with or outperforms the state of the art for predicting in-hospital mortality (binary classification), the length of hospital visits (regression) and the discharge destination (multiclass classification).
Timo Sztyler, Brandon Malone: “Learning Embeddings from a Biomedical Knowledge Graph for Predicting Novel Relations”, GCB2019
Timo Sztyler, Carolin Lawrence, Brandon Malone: “Building a Biomedical Knowledge Graph and Predicting Novel Relations”, AKBC 2019
Alberto García Durán, Mathias Niepert, Brandon Malone: “MULTI-modal Knowledge Graph Completion to Predict Polypharmacy Side Effects”, DILS 2018