The Latest Tools for Proteogenomics
How mass spectrometry, next-generation sequencing, and bioinformatics tools are revolutionizing personalized medicine
Precision medicine is defined by the Precision Medicine Initiative—a nationwide initiative launched by President Barack Obama in 2015—as “an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person.”1
To drill down to the level of detail that precision medicine demands, it helps to step back and see the bigger picture. Proteogenomics can provide a systems perspective that helps scientists interpret genotype-phenotype correlations. This integrated, holistic view, spanning genome sequences, RNA transcription, protein synthesis, and post-translational modifications, has helped answer questions such as why only 30 percent of changes in mRNA translate into corresponding changes at the protein level.
Proteogenomics provides a holistic approach to help scientists envision what may be happening within the human body. Not only does proteogenomics help in the prevention, diagnosis, and treatment of disease, but it can also be leveraged to identify biomarkers to understand disease mechanisms, support drug discovery, and enable patient stratification for precision medicine.
High-sensitivity, high-performance liquid chromatography-tandem mass spectrometry (LC-MS/MS) is the method of choice for protein identification in proteogenomics, as it allows faster and easier analysis of a larger number of analytes than the more conventional gas chromatography-mass spectrometry.2 Alternatively, matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS), in which a laser beam vaporizes and ionizes the sample, can be used to identify proteins. MALDI-MS is a bottom-up approach for protein identification that uses peptide mass fingerprinting and is used mainly in the identification of microbes. Signature peptides of potential protein biomarkers can be identified using multiple reaction monitoring or selected reaction monitoring.
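As an illustration of what a reaction monitoring assay targets, the sketch below computes precursor and fragment m/z pairs (transitions) for a signature peptide. It is a minimal example under stated assumptions: the peptide TGDVR is made up, the function names are hypothetical, residues are unmodified, and only singly charged y-ions are considered. Real assays are normally designed with dedicated software.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the summed residues
PROTON = 1.007276   # mass of one charge-carrying proton


def precursor_mz(peptide, charge=2):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def y_ion_mz(peptide, n):
    """m/z of the singly charged y-ion formed by the last n residues."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER + PROTON


def transitions(peptide, charge=2):
    """List (fragment label, precursor m/z, fragment m/z) for y1..y(n-1)."""
    q1 = round(precursor_mz(peptide, charge), 4)
    return [(f"y{n}", q1, round(y_ion_mz(peptide, n), 4))
            for n in range(1, len(peptide))]


if __name__ == "__main__":
    # Hypothetical signature peptide used only for illustration.
    for label, q1, q3 in transitions("TGDVR"):
        print(label, q1, q3)
```

Each printed row pairs the precursor m/z with one fragment m/z, which is the kind of pair a targeted instrument method monitors.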
While MS is a powerful tool used in proteogenomics for the relative quantification of changes in a cell’s protein content, a targeted proteomics approach using synthetic proteotypic peptides (PTPs) can enable the quantification of selected proteins of interest. Targeted proteomics can also provide quantitative information to support specific and sensitive modeling of perturbations in biochemical systems. PTPs mimic peptides produced by the proteolytic cleavage of target analyte proteins. Deep learning algorithms have been developed to predict peptide detectability. These algorithms can interrogate deep neural networks that have learned from vast protein databases and use tools such as SVMLight, RankNet, or LambdaMART to solve peptide ranking problems.3 This methodology is complemented by next-generation sequencing (NGS), also called massively parallel sequencing (MPS), which enables the sequencing of millions of DNA fragments per run.
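The idea that PTPs mirror the products of proteolytic cleavage can be sketched with an in silico digest. The example below assumes trypsin as the protease (cleavage after K or R unless the next residue is P, a common simplification) and treats a peptide as a proteotypic candidate when it occurs in only one protein of the set. The protein names, sequences, and the `min_len` cutoff are invented for illustration, and detectability in the mass spectrometer is not modeled.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def unique_peptides(proteins, min_len=6):
    """Map each protein to its peptides that occur in no other protein."""
    owners = defaultdict(set)
    for name, seq in proteins.items():
        for pep in tryptic_digest(seq):
            if len(pep) >= min_len:  # very short peptides are rarely informative
                owners[pep].add(name)
    result = {name: [] for name in proteins}
    for pep, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].append(pep)
    return result


if __name__ == "__main__":
    # Toy sequences (hypothetical); both contain the shared peptide TAYIAK.
    proteins = {
        "protA": "MKTAYIAKQRQISFVK",
        "protB": "MRTAYIAKLLPEDGRPSWK",
    }
    for name, peps in unique_peptides(proteins).items():
        print(name, peps)
```

In this toy set the shared peptide is excluded for both proteins, leaving one candidate each. A ranking model of the kind described above would then order such candidates by predicted detectability.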
Proteogenomics establishes the correlation between mRNA and protein pairs in samples, revealing mutations, post-translational modifications, and signaling pathways. Direct associations between markers of genetic variation and gene expression levels, typically measured in tens or hundreds of individuals, can be analyzed using expression quantitative trait loci (eQTL) analysis, together with data on microRNAs (miRNAs) and copy number aberrations.4 eQTLs are genomic loci that explain all or a fraction of the variation in expression levels of mRNAs. miRNAs are small, noncoding RNA molecules found in plants, animals, and some viruses. They pair with complementary sequences within mRNA molecules to affect their action and silence gene expression. Single nucleotide polymorphisms (SNPs) and translocations in DNA can be identified using NGS. These SNPs can be translated into proteoforms and added to vast protein databases, which can be used to interpret MS data. Similarly, RNA-seq, which uses high-precision, deep-sequencing technology to convert RNA to cDNA and then amplify and sequence it in a high-throughput manner, can be used to analyze and quantify the ever-evolving transcriptome.5 The resulting sequence data can be integrated with proteomic data to identify novel peptides and derive meaningful insights into the mechanism of action of drugs.
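The step of translating an SNP into a variant protein sequence for a search database can be sketched as follows. This is a simplified illustration, not a production pipeline: the coding sequence, the SNP position, and the function names are hypothetical, the standard genetic code is assumed, and only a single-base substitution in a known reading frame is handled. The variant-specific peptides it reports are the entries a customized database would add so that MS spectra can be matched to them.

```python
import re
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snp(cds, pos, ref, alt):
    """Substitute one base at 0-based position `pos`, checking the reference."""
    if cds[pos] != ref:
        raise ValueError(f"expected {ref} at {pos}, found {cds[pos]}")
    return cds[:pos] + alt + cds[pos + 1:]


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def variant_peptides(cds, pos, ref, alt):
    """Peptides present in the variant protein but absent from the reference."""
    reference = set(tryptic_digest(translate(cds)))
    variant = tryptic_digest(translate(apply_snp(cds, pos, ref, alt)))
    return [p for p in variant if p not in reference]


if __name__ == "__main__":
    # Toy coding sequence (hypothetical), written codon by codon for readability.
    cds = "ATG TCT CTG GAA AAA ACC GGT GAT GTT CGT TAA".replace(" ", "")
    print(translate(cds))                       # reference protein
    print(variant_peptides(cds, 19, "G", "A"))  # peptides unique to the variant
```

In this toy case the substitution changes a glycine codon (GGT) to an aspartate codon (GAT), so one tryptic peptide differs between the reference and the variant proteoform.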
Bioinformatics software and seamless data integration are crucial for effective, real-time analysis of the continuous feedback loop between genomic, proteomic, and transcriptomic data. The proteogenomic suite of bioinformatics solutions includes Ingenuity Pathway Analysis, Progenesis, the Library of Integrated Network-Based Cellular Signatures, Skyline, PGTools, and the R language with packages such as DESeq, limma, edgeR, and MSstats; some of these are open source. Skyline, for example, supports targeted acquisition methods such as parallel reaction monitoring.
Proteogenomic data is voluminous and spans multiple genomic and proteomic platforms. Information-rich, AI-driven data visualizations leverage mixed reality to integrate physicochemical properties and predictive models. They have the potential to bridge these platforms and provide unprecedented insights into the molecular mechanisms of drug action.
One of the key therapeutic areas where proteogenomics is being used is oncology, where the discovery of biomarkers is leading the way for innovation. The National Cancer Institute, in collaboration with the National Human Genome Research Institute, has led major initiatives driving the growth of precision medicine in oncology. For example, the Cancer Genome Atlas (https://cancergenome.nih.gov), launched in 2006, helped pave the way for the FDA to approve, or add new indications to, 47 oncology drugs or biologics in 2018 and 20 by mid-2019. However, while extensive research has been done in the genomics space, genomics alone has been inadequate to establish firm linkages between tumor biology and patient outcomes. Research has also shown that not one but multiple mutations may cause phenotypic changes and that, owing to protein modifications and configurational changes, genotypic changes do not necessarily result in corresponding phenotypic changes. This realization has heightened appreciation of the significance of proteogenomics and led to the launches of the Clinical Proteomic Tumor Analysis Consortium in 2006 and the International Cancer Proteogenome Consortium in 2016.6
Proteogenomics is an integrative approach that leverages genomic, transcriptomic, and proteomic data and computational power to analyze and interpret the molecular basis of disease. It continues to revolutionize precision medicine by providing a robust, interconnected framework for the previously disconnected omics disciplines.
References
1. Ferryman, Kadija, Mikaela Pitcan. “What is precision medicine? Contemporary issues and concerns primer.” Data & Society (2018).
2. Loo, Joseph A. “The tools of proteomics.” Advances in Protein Chemistry 65 (2003): 25-56.
3. Zimmer, David, et al. “Artificial intelligence understands peptide observability and assists with absolute protein quantification.” Frontiers in Plant Science 9 (2018): 1559.
4. Ruggles, Kelly V., et al. “Methods, tools and current perspectives in proteogenomics.” Molecular & Cellular Proteomics 16.6 (2017): 959-981.
5. Wang, Zhong, Mark Gerstein, and Michael Snyder. “RNA-Seq: A revolutionary tool for transcriptomics.” Nature Reviews Genetics 10.1 (2009): 57.
6. Rodriguez, Henry, and Stephen R. Pennington. “Revolutionizing precision oncology through collaborative proteogenomics and data sharing.” Cell 173.3 (2018): 535-539.