Methods for detection and measurement of virulence factors of pathogenic bacteria
by George Rosenberger (MSc, completed 2011-09-02)
Pathogenic bacteria like M. tuberculosis and S. pyogenes are responsible for many severe human diseases. The strategies of these bacteria to infect hosts are not yet completely resolved, but many host-pathogen interactions involve protein-protein recognition. The involved bacterial proteins are called virulence factors. Since many of them are secreted or located on the cell surface, their protein-protein interaction (PPI) interfaces are under heavy evolutionary pressure for host-specific adaptations.
To investigate and improve current methods for detection and measurement of virulence factors of pathogenic bacteria, two independent projects were defined as part of this work:
The goal of the first project was the development of a method to detect putative virulence factors in silico. The problem was approached by estimation of residue conservation of orthologous protein sequences of the proteomes of human pathogenic and non-pathogenic species and strains of the genera Mycobacterium and Streptococcus. This revealed protein hotspots, which were projected to known and modeled protein structures. To filter only those hotspots required for host-specific adaptation, a method was developed to detect clustering hotspots and thus putative PPI sites on the protein surface. The method was evaluated by a statistical analysis and literature research of known virulence factors.
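The residue conservation step can be illustrated with a minimal sketch. The abstract does not state which conservation score was used; per-column Shannon entropy is assumed here purely for illustration, and the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(alignment):
    """Return the Shannon entropy (in bits) of every column of an alignment.

    Assumption: entropy stands in for the conservation score of the thesis.
    Low entropy marks a conserved column, high entropy a variable one.
    Gap characters ('-') are ignored when counting residues.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values())
        h = 0.0
        for n in counts.values():
            p = n / total
            h -= p * math.log2(p)
        entropies.append(h)
    return entropies


# Toy alignment of four hypothetical orthologs (three columns).
toy = ["MKV", "MKL", "MRL", "MRI"]
print(column_entropy(toy))
```

In this toy alignment the first column is fully conserved, so its entropy is zero, while the third column is the most variable. Columns that are variable between pathogenic and non-pathogenic orthologs would be the candidates for the hotspots described above.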
The goal of the second, independent project was the development of a method to improve current approaches for absolute label-free quantification of proteins by mass spectrometry experiments. Current methods rely on an empirical selection of the peptides from which the protein quantity is inferred. The problem was approached by correction for the peptide sequence-specific portion of the measured peptide intensity, the peptide response. Peptide response predictors were trained by machine learning regression algorithms on experimental data. Different models were created to integrate the predicted peptide responses and the measured peptide intensities of all peptides of a protein to estimate the theoretical protein intensity. The prediction and estimation performance was evaluated using internal and external validation methods.
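One way to combine predicted responses with measured intensities is sketched below. The abstract does not specify the integration models; the multiplicative model (measured intensity = response × protein intensity) and the least-squares estimator are assumptions made for illustration, and the function name is hypothetical.

```python
def estimate_protein_intensity(intensities, responses):
    """Estimate the protein intensity A under the assumed model
    intensity_i = response_i * A, by least squares through the origin.

    intensities: measured intensities of the peptides of one protein
    responses:   predicted sequence-specific responses of the same peptides
    """
    if not intensities or len(intensities) != len(responses):
        raise ValueError("need one predicted response per measured intensity")
    denominator = sum(r * r for r in responses)
    if denominator == 0:
        raise ValueError("at least one response must be non-zero")
    numerator = sum(r * i for r, i in zip(responses, intensities))
    return numerator / denominator


# Three hypothetical peptides of one protein with a true intensity of 100.
print(estimate_protein_intensity([50.0, 100.0, 200.0], [0.5, 1.0, 2.0]))
```

Unlike an empirical selection of a few peptides, an estimator of this kind uses all peptides of the protein, weighting strongly responding peptides more heavily.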
For both projects, the proof of concept was demonstrated and software packages were implemented to enable automated analysis workflows and rapid evaluation of the results.
Prediction of peptide observance in mass-spectrometry experiments
by George Rosenberger (Undergraduate research project, completed 2010-05-25)
Prediction of peptide observance in mass-spectrometry experiments using a random forest classifier and physicochemical features was evaluated and implemented in an automatic and unsupervised module to extend the 2DDB proteomics workflow. The predictor was applied to multiple problems, including modeling with a selected training dataset and evaluation of prediction under varying conditions; first approaches to quantitative prediction were also explored.
An algorithm was developed to select subsets of peptidomes according to selected constraints, such as occurrence in different genomes or singularity in the transcriptome. The 2DDB proteomics workflow was extended to integrate genomic precursors. Rapid and automated evaluation of genomes was enabled and different predictors were applied to identify a most-likely selection of observable peptides of these genomes.
These approaches were used to provide support in purchasing synthetic peptides for further investigation and experiments.
Measuring yeast kinase activity: Computational tools for generating SRM-assays to quantify phosphorylation states of proteins
by Hannes Roest (MSc, completed 2010-04-26)
Protein phosphorylation is a post-translational modification that regulates many of the dynamic changes in the cell. In order to understand the cell's response to internal and external stimuli, the quantitative study of phosphorylation states in temporal resolution is necessary. Here we describe an approach using computational tools to generate quantitative SRM-based assays to study such phosphorylation events using tandem mass spectrometry. For this study, we focused on inferring the activity of eukaryotic protein kinases (ePKs) in yeast via their phosphorylation state.
We combined publicly available MS-generated phosphoidentification data from PhosphoPep with sequencing data and annotations from kinase.com to generate multiple sequence alignments of the yeast kinase domains. We used evolutionary conservation together with aligned phosphosite identifications to find potential regulatory phosphorylation sites in the activation segment of the kinase domains. The corresponding phosphopeptides were synthesized externally and then analyzed by tandem MS in order to generate reference spectra for SRM-assay development.
In the course of this preliminary study, we developed several computational tools that were used to address the challenges encountered:
A database framework termed compep was developed in order to analyze tryptic peptidomes and select proteotypic peptides efficiently. It was used to generate lists of peptides that were synthesized based on predictions of external tools (MS1 signal intensity and retention time predictions) and criteria such as occurrence in another peptidome, occurrence in other gene loci or other splice variants from the same locus.
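The idea of selecting proteotypic peptides from a tryptic peptidome can be sketched as follows. This is not the compep implementation: it is a small in-memory illustration with hypothetical function names and toy sequences, assuming the common trypsin rule (cleavage after K or R unless the next residue is P) and no missed cleavages.

```python
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


def select_unique_peptides(proteome, background=frozenset(), min_length=6):
    """Keep peptides that map to exactly one protein of `proteome` and do
    not occur in `background` (a set of peptides from another peptidome)."""
    owners = defaultdict(set)
    for protein_id, sequence in proteome.items():
        for peptide in tryptic_digest(sequence, min_length):
            owners[peptide].add(protein_id)
    selected = defaultdict(list)
    for peptide, ids in owners.items():
        if len(ids) == 1 and peptide not in background:
            selected[next(iter(ids))].append(peptide)
    return {protein_id: sorted(peps) for protein_id, peps in selected.items()}


# Toy proteome: both proteins share the peptide GGGGGGR.
proteome = {
    "P1": "MAAAAAKGGGGGGRLLLLLLK",
    "P2": "MCCCCCKGGGGGGRTTTTTRPK",
}
print(select_unique_peptides(proteome, background={"LLLLLLK"}))
```

The shared peptide is dropped because it maps to two proteins, and LLLLLLK is dropped because it occurs in the background peptidome. Further criteria named above, such as predicted signal intensity or retention time, would be applied as additional filters.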
An interface between MySQL and the spectral search program SpectraST was implemented which allows the use of powerful SQL queries to create custom spectral libraries. We show the feasibility and efficacy of this approach by creating chimeric spectral libraries that contain spectra from five different species and classifying 15 000 .mzXML files by species using this library.
To address the problem of the uniqueness of transitions in an SRM-assay in a complex background, the SRMCollider was developed. This tool compares all transitions of a given peptide to all potential transitions in the selected background. It reports an interference when another transition is in proximity in the RT-Q1-Q3 space which might distort the result of the assay.
A web interface is available that allows the user to supply the query peptide with a web browser and retrieve the results directly or in an OpenOffice.org compatible CSV file.
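The proximity check in RT-Q1-Q3 space can be sketched in a few lines. This is not the SRMCollider code: the data structure, function name and tolerance values are hypothetical and chosen only to illustrate the idea of flagging background transitions that fall within a window around each query transition.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    q1: float  # precursor m/z
    q3: float  # fragment m/z
    rt: float  # retention time (arbitrary units)


def find_interferences(query, background, q1_tol=0.7, q3_tol=0.7, rt_tol=2.0):
    """Map each query transition to the background transitions that lie
    within all three tolerances (assumed values, not from the thesis)."""
    hits = {}
    for t in query:
        near = [
            b for b in background
            if abs(b.q1 - t.q1) <= q1_tol
            and abs(b.q3 - t.q3) <= q3_tol
            and abs(b.rt - t.rt) <= rt_tol
        ]
        if near:
            hits[t] = near
    return hits


query = [Transition(500.3, 700.4, 30.0), Transition(500.3, 850.5, 30.0)]
background = [
    Transition(500.5, 700.1, 31.0),  # close in Q1, Q3 and RT
    Transition(500.5, 700.1, 45.0),  # same masses, but elutes much later
    Transition(620.0, 850.5, 30.0),  # same Q3 and RT, different precursor
]
for transition, colliders in find_interferences(query, background).items():
    print(transition, "collides with", colliders)
```

Only the first query transition is reported here, since a collision requires proximity in all three dimensions at once. A brute-force scan like this grows with the product of query and background size; for a whole peptidome an indexed lookup would be needed.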
We implemented the workflow described above, present a theoretical analysis of each individual step and provide a computational toolkit that facilitates its execution on any set of target proteins. We developed a program that can select proteotypic peptides according to several criteria. We map the RT-Q1-Q3 space of a human peptidome and provide insights into SRM-assay development. We used the developed MySQL-SpectraST interface to create an organism-specific classifier-library. In conclusion, we provide a computational framework whose potential applications extend beyond the scope of the work presented here.
A Software Pipeline for High-Throughput Quantification of Mass Spectrometry-Based Proteomics Data: Applications to Streptococcus pyogenes
by Hendrik Weisser (Ph.D., completed 2013-01-10)
All living organisms are composed of cells, of which proteins are the primary structural building blocks. Proteins also form the molecular machines (enzymes) with which cells are able to accomplish their many functions. The study of proteins is thus of central importance in biology. The scientific discipline that investigates the entirety of the proteins in a cell or in a set of cells (the proteome) is called proteomics. As the key technology responsible for its success, modern proteomics applies liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) to identify and quantify the protein contents of complex biological samples. Label-free shotgun mass spectrometry is one quantitative proteomics technique that is especially suited for high-throughput applications. The label-free shotgun approach relies heavily on the computational processing of LC-MS/MS data to identify proteins and determine their relative abundances in different samples.
This thesis describes the development of a software pipeline for the quantification of peptides and proteins in LC-MS/MS data from label-free shotgun experiments. New algorithms for a number of data processing tasks are introduced. Implemented in the OpenMS software framework, the label-free pipeline improves upon existing alternatives by being highly flexible, applicable to large datasets (50+ samples), and amenable to automation. Based on two datasets that provide ground truths for the quantification, the performance of the pipeline was evaluated and found to be at the state of the art, with high accuracy and good coverage.
To realize its full potential for processing large datasets in a high-throughput fashion, the label-free quantification pipeline was adapted for use on a powerful computing cluster. Based on a software framework for workflow management, an automated workflow implementing the label-free pipeline, applicable to distributed computing, was created. The connection to a data management system gives this workflow access to LC-MS/MS raw data and allows the traceable storage of processing results. Further, a web interface for the configuration and submission of label-free analyses was designed, and a software module that supports aspects of workflow development was implemented.
This thesis also presents two biological studies in which the label-free quantification pipeline was applied. Both studies investigate the proteome of the bacterium Streptococcus pyogenes. S. pyogenes is an important pathogen that causes a variety of diseases ranging from mild (superficial skin and throat infections) to severe and life-threatening (necrotizing fasciitis, toxic shock syndrome). The use of shotgun mass spectrometry in connection with the label-free quantification pipeline allowed the reliable analysis of a large part of the S. pyogenes proteome (over 800 proteins in both studies). The first investigation focused on the adaptation of S. pyogenes to growth in the presence of human blood plasma, and found a marked downregulation of proteins required for fatty acid biosynthesis. The uptake of fatty acids from plasma, mediated by the binding of a fatty acid-carrying human protein (albumin) to specific proteins on the bacterial surface, allows S. pyogenes to conserve energy by reducing its internal production, as follow-up experiments could clarify.
The second study compared two S. pyogenes strains, a virulent wild-type and a hypervirulent mutant strain, under 26 different growth conditions. The analysis focused on the impact of increased virulence on the proteome of the bacterium. The results highlight specific protein expression patterns in the hypervirulent strain for virulence factors and for transcriptional regulators, largely confirming previous findings. Further, a remarkable downregulation of the protein biosynthesis machinery in the hypervirulent mutant strain was observed for a broad range of conditions, matching an apparent growth deficit of the mutant bacteria.
These applications illustrate the power and utility of mass spectrometry-based proteomics strategies in general, and of shotgun approaches combined with the OpenMS label-free quantification pipeline in particular.
Inference of CMGC Kinase Interaction Network Topology
by Ke Li (MSc, completed 2013-06-11)
Protein complexes are the basic functional modules carrying out a variety of fundamental cellular functions. Identification of protein complexes is of central importance in current biological research for interpreting the information encoded in genomes and understanding many fundamental cellular processes. The CMGC kinase group, consisting of 9 subfamilies, has been found to play critical roles in cell signaling, cell cycle regulation, metabolic and splicing control, etc. In the CMGC kinase group, some subfamilies such as MAPKs and CDKs are among the most highly studied protein groups, whereas other subfamilies such as HIPKs and RCKs are only poorly understood. The first global proteomic analysis of complexes of the human CMGC kinase group, accomplished by Matthias Gstaiger, provides valuable information for many poorly studied CMGC kinases, including the 652 high-confidence kinase-protein interactions identified from AP-MS experiments with the help of computational tools. Because of the limitations of AP-MS experiments and current computational methods, these 652 identified interactions may not necessarily be actual physical interactions, and there is no clear way to distinguish different complexes formed by the same kinase based only on these interactions. These issues cannot be resolved using AP-MS experimental data alone. Therefore, this project was launched to address these difficulties.
In order to identify physical interactions and possible protein complexes formed by CMGC kinases, we combined information from the PrePPI database, which is mainly a structure-based protein interaction database, and then applied machine learning techniques to the integrated protein interaction data. Machine learning, a booming interdisciplinary field from computer science and statistics with extensive applications in scientific research and engineering, focuses on making predictions for new observations based on known properties (also known as features or attributes) learned from the training data. After the generation of a list of features from various bioinformatic sources for all proteins present in the 652 identified interactions, a feature selection procedure was carried out to determine the features to be included in the model training step. After that, three different classes of machine learning models were trained and tested individually and finally compared to select a final model for prediction. The three classes of machine learning models applied here are logistic regression, random forests (RF) and support vector machines (SVM). An SVM with a radial basis function (RBF) kernel was finally chosen and applied to predict protein-protein interactions. After the modeling and prediction step, an overall interaction network was constructed from the 652 high-confidence kinase-protein interactions identified by Matthias Gstaiger, the PrePPI interactions and our predicted interactions. In this integrated global network, first a manual study was performed and several interesting results were found. Because the large scale of this network makes further manual detection impossible, a graph density based protein complex detection algorithm was proposed and applied to the network.
It was found that the decomposition of a complicated network centered on one kinase into several dense subnetworks is possible, and that the scale and interaction intensity of the subnetworks depend largely on the choice of graph density and degree thresholds.
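The role of the density and degree thresholds can be illustrated with a small sketch. The abstract does not describe the detection algorithm itself; the greedy peeling procedure below, which starts from the neighborhood of one kinase and removes weakly connected proteins, is an assumption for illustration only, and all names and threshold values are hypothetical.

```python
from itertools import combinations


def density(nodes, edge_set):
    """Graph density 2E / (V * (V - 1)) of the subgraph induced by `nodes`.
    `edge_set` holds undirected edges as frozensets of two node names."""
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(1 for u, v in combinations(sorted(nodes), 2)
            if frozenset((u, v)) in edge_set)
    return 2 * m / (n * (n - 1))


def dense_subnetwork(seed, edges, min_density=0.7, min_degree=2):
    """Start from `seed` and its neighbors, then repeatedly drop the
    lowest-degree non-seed node until both thresholds are satisfied."""
    edge_set = {frozenset(e) for e in edges}
    nodes = {seed} | {v for e in edge_set if seed in e for v in e}
    while len(nodes) > 2:
        degree = {u: sum(1 for v in nodes
                         if v != u and frozenset((u, v)) in edge_set)
                  for u in nodes}
        weakest = min((u for u in nodes if u != seed),
                      key=lambda u: (degree[u], u))
        if density(nodes, edge_set) >= min_density and degree[weakest] >= min_degree:
            break
        nodes.remove(weakest)
    return nodes


# Toy network: kinase K forms a clique with A, B, C; D and E bind K only.
edges = [("K", "A"), ("K", "B"), ("K", "C"), ("A", "B"), ("A", "C"),
         ("B", "C"), ("K", "D"), ("K", "E")]
print(sorted(dense_subnetwork("K", edges)))
```

In the toy network the two proteins that interact only with the kinase are peeled away and the four-member clique remains. Lowering `min_density` or `min_degree` keeps larger, looser subnetworks, which mirrors the threshold dependence noted above.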
In conclusion, we utilized a variety of bioinformatic resources and machine learning techniques to overcome the limitations of AP-MS experimental data and to infer the network topology of CMGC kinases. The results suggest that combining AP-MS experimental data with protein interaction data from other bioinformatic resources and using machine learning methods for prediction can help to construct a refined and more realistic protein interaction network topology. The complex detection algorithm designed in this project can serve as a computational aid for analyzing large-scale, complicated networks.