Nature Methods 4, 787–797 (2007)
Published online: 27 September 2007 | doi:10.1038/nmeth1088
Analysis and validation of proteomic data generated by tandem mass spectrometry
Alexey I Nesvizhskii1, Olga Vitek2 & Ruedi Aebersold3,4
1 University of Michigan, Department of Pathology and Center for Computational Medicine and Biology, Ann Arbor, Michigan 48105, USA.
2 Purdue University, Departments of Statistics and Computer Science, West Lafayette, Indiana 47107, USA.
3 Institute of Molecular Systems Biology, Swiss Federal Institute of Technology (ETH) Zurich, CH-8093 Zurich, Switzerland and Faculty of Sciences, University of Zurich, CH-8006 Zurich, Switzerland.
4 Institute for Systems Biology, Seattle, Washington 98103, USA.
Correspondence should be addressed to Ruedi Aebersold (rudolf.aebersold@imsb.biol.ethz.ch).
The analysis of the large amount of data generated in mass spectrometry–based proteomics experiments represents a significant challenge and is currently a bottleneck in many proteomics projects. In this review we discuss critical issues related to data processing and analysis in proteomics and describe available methods and tools. We place special emphasis on the elaboration of results that are supported by sound statistical arguments.
Introduction
A main goal of proteomics has been the complete and in most cases quantitative analysis of the proteome of a species or, in multicellular organisms, a particular cell or tissue type. Although this goal has remained elusive, significant progress has been made in the development of an array of technologies for proteome analysis and their application to biological and clinical research1. At present, the vast majority of proteomic data are being generated by mass spectrometry, more specifically by tandem mass spectrometers of ever increasing performance2. These instruments and the diverse workflows they support have in common that they generate hundreds to tens of thousands of fragment ion spectra per hour of data acquisition. The assignment of these fragment ion spectra to peptide sequences, the inference of the proteins represented by the identified peptides and the determination of their abundances in the analyzed sample present complex computational and statistical challenges. It is essential for proteomics to develop and generally apply tools and solutions to these problems that provide accurate and reproducible results. Failure to do so introduces and propagates errors in the literature, makes it difficult for reviewers and readers to evaluate the conclusions of manuscripts or to meaningfully compare the results of different studies, and renders databases containing proteomic data essentially useless3.
In this review we discuss critical problems facing the analysis of mass spectrometry–derived proteomic datasets and present the currently available solutions.
Assignment of fragment ion spectra to peptide sequences
The currency of information for tandem mass spectrometry (MS/MS) based proteomics is the fragment ion spectrum (MS/MS spectrum) of a specific peptide ion that is fragmented, typically in the collision cell of a tandem mass spectrometer. The correct assignment of such a spectrum to a peptide sequence is a first and central step in proteomic data processing. A large number of computational approaches and software tools have been developed to automatically assign peptide sequences to fragment ion spectra. These can be classified into three categories: (i) database searching, where peptide sequences are identified by correlating acquired fragment ion spectra with theoretical spectra predicted for each peptide contained in a protein sequence database, or by correlating acquired fragment ion spectra with libraries of experimental MS/MS spectra identified in previous experiments (spectral library searching); (ii) de novo sequencing, where peptide sequences are explicitly read out directly from fragment ion spectra; and (iii) hybrid approaches, such as those based on the extraction of short sequence tags of 3–5 residues in length, followed by 'error-tolerant' database searching. For large-scale proteomics studies database searching remains the most frequently used peptide identification method. However, the other strategies provide attractive alternatives in specific situations, as discussed below.
Spectral identification by sequence database searching. Several MS/MS database search programs have been developed (Table 1), and their basic functionality is illustrated in Figure 1. The programs take the fragment ion spectrum of a peptide as input and score it against theoretical fragmentation patterns constructed for peptides from the searched database. The pool of candidate peptides is restricted based on user-specified criteria such as mass tolerance, proteolytic enzyme constraint and types of post-translational modification allowed (see Supplementary Notes online for discussion of the most important criteria). The output from the program is a list of fragment ion spectra matched to peptide sequences, ranked according to the search score. Typically, only the best scoring peptide match is considered during the subsequent statistical analysis step (see below). The search score measures the degree of similarity between the experimental spectrum and the theoretical spectrum, and therefore serves as the primary discriminating parameter for separating correct from incorrect identifications.
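The scoring step described above can be sketched with one of the simplest possible scores, a shared fragment count. This is a minimal illustration and not the scoring function of any tool listed in Table 1; the function names, the 0.5 m/z tolerance, the candidate labels and all m/z values are hypothetical.

```python
def shared_peak_count(observed_mz, theoretical_mz, tol=0.5):
    """Count theoretical fragment m/z values matched by at least one observed peak."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in theoretical_mz
    )


def rank_candidates(observed_mz, candidates, tol=0.5):
    """Rank candidate peptides (name -> theoretical m/z list) by descending score."""
    scored = [
        (shared_peak_count(observed_mz, theo, tol), name)
        for name, theo in candidates.items()
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical observed peaks and theoretical fragment lists (illustrative values only).
observed = [175.1, 262.2, 375.2, 504.3]
candidates = {
    "PEPTIDE_A": [175.1, 262.1, 375.3, 504.3],
    "PEPTIDE_B": [147.1, 262.1, 400.0, 610.4],
}
ranking = rank_candidates(observed, candidates)
best_score, best_peptide = ranking[0]
```

Only the top entry of `ranking` would normally be carried forward to the statistical analysis, mirroring the practice described in the text.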
A number of scoring schemes have been described in the literature, including spectral correlation functions (for example, SEQUEST) or related concepts such as shared fragment counts and dot product (for example, TANDEM, OMSSA, MASCOT). Scoring functions can also be based on empirically observed rules (for example, SpectrumMill) or statistically derived fragmentation frequencies (for example, PHENYX). The score that is actually reported by the tool can be based on a somewhat arbitrary scale (for example, Xcorr score in SEQUEST), or converted to a statistical measure called the expectation value (E value), which is the expected number of peptides with scores equal to or better than the observed score under the assumption that peptides match the experimental spectrum by random chance (OMSSA, TANDEM and more recently MASCOT). The E value is computed either by assuming that the database search score follows a certain (for example, Poisson) distribution4,5, or by empirical fitting of the observed distribution of scores6 (see Fig. 1). The E value is largely invariant under different scoring methods and gives a clearer interpretation of goodness of match across different instrument platforms and search algorithms. It should be stressed, however, that neither being the best match nor having a high search score (or low E value) is a reliable indicator of a true match. Discriminating true from false matches is therefore a critical next step in proteomic data analysis.
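The empirical-fitting route to an E value can be sketched as follows. The sketch assumes that the logarithm of the number of candidate peptides scoring at or above a given score falls off roughly linearly with the score, fits that line to the scores of all candidates for one spectrum, and extrapolates it to the top score. This is a simplified illustration, not the exact procedure of any particular search tool; the function names and the background scores are hypothetical, and at least two distinct scores are required for the fit.

```python
import math


def fit_log_survival(scores):
    """Least-squares fit of log(#scores >= s) = a + b * s over the distinct scores."""
    xs = sorted(set(scores))
    ys = [math.log(sum(1 for v in scores if v >= s)) for s in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def e_value(score, a, b):
    """Expected number of random matches scoring at or above `score`."""
    return math.exp(a + b * score)


# Hypothetical scores of all candidate peptides tested against one spectrum.
background = [1] * 64 + [2] * 32 + [3] * 16 + [4] * 8 + [5] * 4 + [6] * 2 + [7]
a, b = fit_log_survival(background)
e_moderate = e_value(8, a, b)   # a top score slightly beyond the background
e_strong = e_value(12, a, b)    # a top score far beyond the background
```

A top score far out in the tail receives a much smaller E value than one sitting just beyond the bulk of random matches, which is the behaviour the text describes.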
Spectral identification by spectral matching. A notable inefficiency of shotgun proteomics experiments lies in the repeated rediscovery of the same identifiable peptides by sequence database searching methods, which often are time consuming and error prone. With the availability of large amounts of proteomic data, part of which are organized in generally accessible databases (Table 1), it can be anticipated that all the proteins of a species that are detectable by mass spectrometry will eventually have been discovered. In fact, systematic sequencing of proteins produced by microbes and eukaryotic species7,8 has already reached remarkable depth of proteome coverage. Such extensive proteome maps now open the possibility of inferring the sequence of a peptide by matching its fragment ion patterns against a library of spectra representing the peptide sequences contained in the proteome map9,10,11,12.
In this approach, a spectral library is compiled meticulously from a large collection of experimentally observed mass spectra of correctly identified peptides. An unknown spectrum can then be identified by comparing it to all the candidates in the spectral library to determine the match with the highest spectral similarity13. Recently, a number of tools have been developed that support peptide identification by spectral matching (Table 1). The spectral matching approach substantially outperforms classical sequence database searching in speed, error rate and sensitivity characteristics of the results12 and has the advantage that the statistical models developed for assessing the output of database search tools (see below) are easily adaptable to the method12. However, no peptides will be identified that were not previously entered into the respective spectral library. At this time, when no proteome map has been completed, spectral matching approaches might be used most effectively as a rapid first pass in an incremental search strategy.
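Spectral matching as described above reduces to a similarity search. The following minimal sketch bins each spectrum by m/z and uses the normalized dot product (cosine similarity) as the similarity measure. The choice of measure, the 1.0 m/z bin width, the library entry names and the peak lists are assumptions made for illustration, not details of the tools in Table 1.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_similarity(a, b):
    """Normalized dot product of two binned spectra (0 = unrelated, 1 = identical)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_library_match(query_peaks, library):
    """Return (similarity, entry name) of the most similar library spectrum."""
    query = bin_spectrum(query_peaks)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks)), name)
        for name, peaks in library.items()
    ]
    return max(scored)


# Hypothetical library and query spectra as (m/z, intensity) pairs.
library = {
    "LIB_PEPTIDE_1": [(200.1, 10.0), (300.2, 50.0), (400.3, 20.0)],
    "LIB_PEPTIDE_2": [(150.1, 30.0), (300.2, 5.0), (500.4, 40.0)],
}
query = [(200.2, 12.0), (300.1, 45.0), (400.4, 18.0)]
similarity, match = best_library_match(query, library)
```

As the text notes, a query spectrum from a peptide absent from `library` would still receive a best match, so the similarity score needs the same statistical assessment as a database search score.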
Spectral identification by de novo sequencing. In the de novo sequencing approach the amino acid sequence of a peptide is explicitly read from a fragment ion spectrum. Initially this was accomplished manually. More recently, an array of tools has been developed that assist the researcher with this task (Table 1). The main advantage of de novo sequencing over the database search method is that it allows identification of spectra for which the exact peptide sequence is not present in the searched sequence database, such as peptides containing sequence polymorphisms and modified peptides. It is therefore mainly used for protein analysis in species for which no or limited genome sequence information is available or for identifying modified peptides. However, de novo analysis is computationally intensive and requires high quality fragment ion spectra. Furthermore, researchers analyzing proteomic data are usually more interested in knowing which proteins are present in the sample. This means that peptide sequences extracted from MS/MS spectra using de novo algorithms need to be matched, using for example BLAST, against the sequences of known proteins present in the sequence databases, a strategy that is tedious in a high-throughput proteomics environment. Thus, a more effective strategy may be to start with database searching, and apply de novo sequencing tools to the remaining unassigned high quality spectra14.
Spectral identification with hybrid approaches. Spectral identification can also be carried out using hybrid approaches that combine elements of both de novo sequencing and database searching. The analysis starts with inference of short sequence tags (partial sequences) from MS/MS spectra, followed by an error-tolerant database search: that is, a search that allows one or more mismatches between the sequence of the peptide that produced the MS/MS spectrum and the database sequence. First pioneered in ref. 15, this approach has been recently extended by several groups16,17 (see Table 1). By limiting the search space to only those database peptides that contain the sequence tag extracted from the spectrum (or one of the several sequence tags, if more than one per spectrum is extracted), the database search time can be significantly reduced. Hybrid approaches are also potentially very powerful for the systematic analysis of post-translationally modified peptides, or peptides containing artifactual modifications. Allowing all possible types of modifications at all possible sites leads to a combinatorial explosion of the database search space and is therefore poorly compatible with sequence database searching. The use of sequence tags, or related approaches such as look-up peaks18 can reduce the size of the space to be searched back to manageable levels.
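The reduction of the search space by sequence tags can be sketched as a simple filter. Only the tag-filtering step is shown; the subsequent error-tolerant scoring of the surviving candidates is omitted. The function name, the peptide sequences and the tags are hypothetical examples.

```python
def filter_by_tags(database_peptides, tags):
    """Keep only database peptides that contain at least one extracted sequence tag."""
    return [
        peptide
        for peptide in database_peptides
        if any(tag in peptide for tag in tags)
    ]


# Hypothetical database peptides and tags inferred from one MS/MS spectrum.
database_peptides = ["LVNELTEFAK", "AEFVEVTK", "YLYEIAR", "HLVDEPQNLIK"]
tags = ["TEF", "VEV"]
candidates = filter_by_tags(database_peptides, tags)
```

Because the expensive comparison against the spectrum is then run only on `candidates`, a wider range of modifications or mismatches can be allowed without the search space growing out of hand.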
Statistical assessment of peptide assignments in large-scale datasets
Database and spectral matching search tools typically produce a peptide match for each input spectrum, some of which may be true matches and some false. In some experiments, the best-scoring peptide assignment produced by a database search program is incorrect for the majority of searched MS/MS spectra. Some of the reasons for the high failure rate are listed in Supplementary Notes. Early on in proteomics it was customary to generate a list of 'high confidence' identifications according to an ad hoc cutoff value of the score provided by the search engine, often in conjunction with visual inspection of peptide assignments to fragment ion spectra by an expert. However, the score distributions produced by a search tool depend on a multitude of factors, including the performance of the mass spectrometer, data quality, and the size of the database. Thus, application of the same thresholds to data from different experiments would result in different (and unknown) error rates, making comparison between datasets practically impossible. Manual inspection by an expert cannot be regarded as a viable validation process because it is time-consuming, incompatible with the high numbers of fragment ion spectra acquired in proteomics, and subjective, with results that depend on the level of expertise of the validating individual. Therefore, modern proteomics has gradually moved away from manual inspection of the data and ad hoc scoring schemes, and toward probabilistic approaches that provide statistical measures of confidence and estimates of error rates. See Box 1 and Table 2 for statistical terminology relevant to the assessment of database search results.
Recently, several approaches that translate the database search tool output scores into probabilities or estimated false discovery rates (FDRs)19 have been introduced. These global approaches (as illustrated in Fig. 2) are concerned with modeling the distribution of search scores constructed by taking the top-scoring peptide assignment for each experimental spectrum in the whole dataset ('global distribution'). This distinguishes them from the expectation value calculation, which models the single-spectrum distribution of scores constructed for each experimental spectrum separately from all peptides in the searched sequence database that were scored against that particular spectrum. In fact, the global and single-spectrum–based approaches are complementary: that is, whole-dataset modeling and FDR analysis can be performed using E values in place of the original search scores. The global statistical approaches can be broadly grouped into two categories: target-decoy searching and empirical Bayes approaches.
Target-decoy searching. The methods of the first group rely solely on searching target-decoy databases, and compute an optimized cut-off score for each dataset. The target-decoy search strategy20 involves two steps. In the first step MS/MS spectra are searched against a target database of protein sequences augmented with the reversed (or randomized, or shuffled) sequences of the same database. The approach assumes that matches to decoy peptide sequences and false matches to sequences from the original database follow the same distribution. The plausibility of these assumptions is discussed in ref. 20. In the second step, peptide assignments are filtered using various score cut-offs, and the corresponding FDR for each cut-off is estimated as 2Nd/N, where N is the number of peptide matches with scores above the cut-off and Nd is the number of matches to decoy sequences among them.
The advantage of this FDR estimation method is that it is simple to implement and requires minimal distributional assumptions, which makes it easily applicable in a variety of situations. The drawbacks of this approach include doubling the database search time. A more fundamental issue arises of whether reversing or randomizing sequences can provide an accurate assessment of the distribution of false peptide matches when many of those are known to be sequences homologous to the true peptides rather than completely random sequences.
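The second step of the target-decoy strategy can be sketched directly from the estimate 2Nd/N given above. The factor of 2 reflects the assumption that false matches to target sequences are about as numerous as matches to decoy sequences. In this minimal sketch the function names and the example scores are hypothetical, a score equal to the cut-off is treated as passing, and the estimate is capped at 1.

```python
def target_decoy_fdr(matches, cutoff):
    """Estimate FDR as 2 * Nd / N among matches scoring at or above the cutoff.

    `matches` is a list of (score, is_decoy) pairs from a search against the
    combined target-decoy database.
    """
    accepted = [is_decoy for score, is_decoy in matches if score >= cutoff]
    n = len(accepted)
    if n == 0:
        return 0.0
    n_decoy = sum(accepted)
    return min(1.0, 2.0 * n_decoy / n)


def lowest_cutoff_for_fdr(matches, max_fdr):
    """Return the lowest observed score whose estimated FDR is within max_fdr."""
    for cutoff in sorted({score for score, _ in matches}):
        if target_decoy_fdr(matches, cutoff) <= max_fdr:
            return cutoff
    return None


# Hypothetical top-scoring matches: (search score, matched a decoy sequence?).
matches = [
    (5.0, False), (4.5, False), (4.0, False), (3.5, False), (3.0, True),
    (2.5, False), (2.0, True), (1.5, True), (1.0, False), (0.5, True),
]
fdr_at_3 = target_decoy_fdr(matches, 3.0)        # 5 matches, 1 decoy
cutoff = lowest_cutoff_for_fdr(matches, 0.35)
```

With so few matches the estimate is not monotonic in the cut-off, which is one reason the approach is best suited to large datasets.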
Empirical Bayes approaches. The methods in the second category are exemplified by PeptideProphet21, which employs a so-called empirical Bayes22 approach that models the distributions of database search scores and auxiliary information (see below) observed for all peptide assignments in the dataset as a two-component mixture of distributions representing correct and incorrect identifications. Before that step, PeptideProphet combines multiple search score–related parameters (for example, the search score itself, Xcorr, and its derivative, ΔCn score, in the case of SEQUEST) into a single score, called the discriminant search score. The discriminant score coefficients and the functional form of the resulting discriminant score distributions are determined for each search engine using training datasets. The parameters of those distributions, however, are estimated anew for each dataset using the expectation-maximization algorithm, leading to posterior probabilities of correct identification as inferential indicators. These probabilities are then used to estimate the FDR for any minimum probability used as a cut-off. In contrast to the target-decoy database search approach, appending decoys is not necessary for deriving the distribution of incorrect identifications. Furthermore, additional modeling in PeptideProphet results in an increase of statistical power compared to threshold-based approaches.
The limitations of PeptideProphet are largely related to the parametric assumptions and, to a lesser degree, to the use of fixed coefficients in computing the discriminant search score. These limitations are currently being addressed in several ways. First, the probabilistic modeling approach of PeptideProphet and the decoy strategy can be combined within a single framework through a semi-supervised expectation-maximization algorithm that explicitly incorporates the class label available for decoy peptide matches. Second, parametric specification of continuous mixture components in the model can be relaxed, for example, by using multiple components to model each class of peptides, correct and incorrect. These new developments result in improved robustness and higher accuracy of computed probabilities even in the case of the most challenging datasets (H. Choi and A.I.N., unpublished data).
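The mixture-model idea can be illustrated with a deliberately simplified sketch: a two-component mixture fitted to discriminant scores by expectation-maximization, followed by an FDR estimate from the resulting posterior probabilities. Both components are modeled as Gaussians purely for brevity. That is an assumption of this sketch and not the distributional form used by PeptideProphet, and the function names, starting values and scores are hypothetical. Degenerate inputs (for example, all scores identical) are not handled.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(x, model):
    """Posterior probability that a match with discriminant score x is correct."""
    pi1, mu, sd = model
    p1 = pi1 * normal_pdf(x, mu[1], sd[1])
    p0 = (1.0 - pi1) * normal_pdf(x, mu[0], sd[0])
    return p1 / (p0 + p1) if p0 + p1 > 0 else 0.5


def fit_mixture(scores, iterations=100, min_sd=0.05):
    """Fit (weight of correct class, [mean0, mean1], [sd0, sd1]) by EM."""
    lo, hi = min(scores), max(scores)
    model = (0.5, [lo, hi], [(hi - lo) / 4.0 or 1.0] * 2)
    for _ in range(iterations):
        # E-step: responsibility of the "correct" component for each score.
        resp = [posterior_correct(x, model) for x in scores]
        w1 = sum(resp)
        w0 = len(scores) - w1
        # M-step: re-estimate the class weight, means and standard deviations.
        mu1 = sum(r * x for r, x in zip(resp, scores)) / w1
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, scores)) / w0
        sd1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, scores)) / w1)
        sd0 = math.sqrt(sum((1.0 - r) * (x - mu0) ** 2 for r, x in zip(resp, scores)) / w0)
        model = (w1 / len(scores), [mu0, mu1], [max(sd0, min_sd), max(sd1, min_sd)])
    return model


def fdr_at_min_probability(probabilities, min_probability):
    """Expected fraction of incorrect matches among those passing the cut-off."""
    accepted = [p for p in probabilities if p >= min_probability]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)


# Hypothetical discriminant scores: a low-scoring and a high-scoring population.
scores = [0.6, 0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.4, 3.5, 3.8, 4.0, 4.1, 4.3, 4.6]
model = fit_mixture(scores)
probabilities = [posterior_correct(x, model) for x in scores]
fdr = fdr_at_min_probability(probabilities, 0.9)
```

No decoy sequences are needed here: the distribution of incorrect identifications is learned from the data themselves, which is the contrast with target-decoy searching drawn in the text.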
Which scoring method when? The statistical power of all the identification procedures is strongly influenced by a number of factors, including the discriminative ability of the database search score, the quality of the spectra and the size of the database. Although there is currently no theory on the optimality of the score, empirical evidence suggests that some scores perform better than others in different settings23,24. Combining several search scores produced by the same search tool improves the overall performance21,25,26,27,28. Several programs (for example, TANDEM and SpectrumMill) allow an efficient multistep analysis, starting with an enzyme-constrained search, followed by a second search for peptides with modifications, nonspecific cleavage or missed cleavage sites. The statistical power of the identification procedure can also indirectly benefit from further processing of MS/MS spectra performed before database search29,30, clustering of redundant spectra31,32, recognition of spectra produced by cofragmentation of two or more peptides33, removal of low quality spectra14,34,35,36,37 and application of automated charge state determination algorithms38,39. Furthermore, improved discrimination can be achieved by combining the output from two or more different database search tools40,41,42,43 or by combining data from multiple consecutive stages of mass spectrometry (for example, MS/MS and MS/MS/MS (MS3))44.
Use of auxiliary information to improve spectral identification
The database search score (or the composite of multiple scores) measuring the degree of similarity between the experimental and theoretical spectra represents only one set of discriminant features useful for separating correct from incorrect identifications. Using this information alone, it may be difficult to accurately separate true from false identifications, even if optimal statistical methods are being used. The discrimination can be further improved if auxiliary information that may be generated coincidentally in the course of a proteomics experiment is also included in the analysis. Such types of information include mass accuracy, that is, the difference between the measured and calculated mass of the peptide ions (available from the first stage of mass spectrometry, MS1), and peptide separation coordinates, for example, retention time45,46 or pI value47,48,49 (peptide separation step). Other useful peptide properties include the number of termini consistent with the type of enzymatic cleavage used and the number of missed cleavage sites (digestion step). In some cases, additional information such as presence of a specific amino acid or sequence motif can be used as further constraints: for example, cysteine in the case of avidin affinity purification of peptides containing biotinylated cysteines1, or the sequence motif N-X-S/T for peptides containing N-linked glycosylation sites50 (peptide enrichment step).
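Two of the auxiliary features listed above are simple to compute. The sketch below expresses mass accuracy as a relative error in parts per million and counts missed cleavage sites. It assumes that the digestion enzyme is trypsin, cleaving after K or R except when the next residue is P; the function names and example values are hypothetical.

```python
def mass_error_ppm(measured_mass, calculated_mass):
    """Difference between measured and calculated peptide mass, in parts per million."""
    return (measured_mass - calculated_mass) / calculated_mass * 1e6


def missed_cleavages(peptide):
    """Count internal K/R residues not followed by P (assumed trypsin specificity).

    The C-terminal residue is excluded because cleavage there produced the peptide.
    """
    return sum(
        1
        for i in range(len(peptide) - 1)
        if peptide[i] in "KR" and peptide[i + 1] != "P"
    )


# Hypothetical example values.
error = mass_error_ppm(1000.005, 1000.0)
fully_cleaved = missed_cleavages("AKPLR")  # K is followed by P, so it is not a cleavage site
one_missed = missed_cleavages("AKLGR")     # internal K followed by L
```

Features such as these can then be supplied alongside the search score to whichever statistical model is used to separate correct from incorrect identifications.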
These types of auxiliary information provide evidence that can be used to incrementally augment the search score(s) generated by the search engine. Their availability and information content depend strongly on the experiment that was carried out to generate the data. For example, the contribution of the mass accuracy parameter to differentiating true from false identifications depends on the mass accuracy of the mass spectrometer used, and the pI value is only useful if isoelectric focusing was used as one of the peptide separation tools. Although it is possible to take into account auxiliary information in the threshold-based approaches46,49,51,52,53, handling experimental variations (for example, a bias in the mass measurement, or inaccurate determination of the pH value in each peptide fraction) can be problematic. Application of threshold-based approaches also requires datasets of sufficiently large size owing to the need to subdivide peptide assignments into subcategories based on the search score and all extra parameters. At the same time, auxiliary information can be effectively used in PeptideProphet21,47,54. As it models all data types simultaneously, it has the inherent flexibility to detect and correct for measurement bias, and to weigh the contributions of the different types of information as a function of the experiment in computing posterior peptide probabilities.
Inferring protein identifications from spectral identifications
The purpose of most proteomic experiments is not the identification of peptides, but the identification of the proteins present in the sample before digestion55. Thus, the peptide sequences of the identified fragment ion spectra need to be grouped according to their corresponding protein, and the confidence measures need to be recomputed at the level of proteins. This process is not straightforward owing to several challenges, and it is a likely source of significant errors in the proteomics literature.
The first challenge is related to the fact that many correctly identified peptides tend to group into a relatively small number of proteins56. This is particularly obvious in the analysis of human serum samples, where the dominant peptide identifications come from a dozen of the most abundant serum proteins, and the total number of identified proteins is typically less than a thousand57. At the same time, incorrect spectral identifications match randomly to the much larger number of proteins in the searched sequence database (for example, more than 40,000 in human IPI database; Box 2). Thus, almost every high-scoring incorrect spectral assignment introduces one additional incorrect protein identification, resulting in an increase in the false discovery rates when going from the spectral to the protein level.
The second challenge arises because of shared peptides: that is, peptides whose sequence is present in more than a single entry in the protein sequence database. Such cases most often result from the presence of homologous proteins, splicing variants or redundant entries in the protein sequence database. This problem is particularly serious in the case of higher eukaryote organisms55,58. As a result, in shotgun proteomics it is often not possible to differentiate between different protein isoforms. In general, this is less of a problem when proteins are first separated using a multidimensional protein separation technique (for example, using two-dimensional gels), where additional information such as the molecular weight of the sample proteins can assist in the determination of the protein identities. A detailed discussion of the difficulties in interpreting the results of shotgun proteomics experiments at the protein level can be found in ref. 55.
Most frequently, protein identification is performed by determining peptide sequence identity in MS/MS spectra as described in the previous section, and by grouping peptide sequences into proteins, deterministically40,59,60 or probabilistically (for example, by apportioning peptides to proteins with some weights41,55,56). An alternative approach61 sidesteps the process of spectral identification, and combines overlapping uninterpreted MS/MS spectra into longer chains, then maps the chains to protein sequences directly. With both approaches, combining MS/MS spectra into proteins is often insufficient for unambiguous protein identification owing to a large number of shared peptides, in particular in cases when the protein database contains many homologous proteins and isoforms. Thus the issue is what it means for a protein to be identified. Some publications report all proteins identified with at least one distinct peptide, or select one representative protein among isoforms and homologs62. A nomenclature based on the parsimony principle (also called Occam's razor), which consists of determining the smallest number of proteins that can account for all observed peptides, has been described in ref. 55 and provides a consistent and concise way of representing the results of a proteomic experiment.
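The parsimony principle can be sketched as a set-cover problem: choose proteins until every observed peptide is explained. The greedy heuristic below is an approximation chosen for brevity, since finding the exact minimum set is computationally hard in general. It is not the procedure of ref. 55, which also distinguishes cases such as indistinguishable proteins. The function name and the protein and peptide identifiers are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily pick proteins until all observed peptides are accounted for.

    `protein_to_peptides` maps a protein name to the set of identified peptides
    whose sequence occurs in that protein. Ties are broken alphabetically.
    """
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & uncovered),
        )
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


# Hypothetical mapping with shared peptides (pep2, pep3, pep4 occur in several entries).
protein_to_peptides = {
    "PROT_A": {"pep1", "pep2", "pep3"},
    "PROT_B": {"pep2", "pep3"},
    "PROT_C": {"pep4"},
    "PROT_D": {"pep3", "pep4"},
}
minimal_list = parsimonious_proteins(protein_to_peptides)
```

Here PROT_B is dropped because all of its peptides are explained by PROT_A, whereas the choice between PROT_C and PROT_D rests only on the tie-break, which illustrates why shared peptides often prevent unambiguous identification.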
Once peptides are grouped into proteins, the plausibility of the protein identification is quantified with a score. On one hand, protein identifications with low spectral coverage are likely to be spurious. On the other hand, the number of identified peptides mapped to a protein sequence is strongly correlated with length and abundance of the protein, and one can hardly expect good peptide coverage in complex mixtures, or with experimental designs that enrich for a particular class of peptides. Scoring functions attempt to distinguish between false and true protein identifications in a number of ways. For example, the Bayes rule–based scoring scheme in ProteinProphet includes the concept of the number of sibling peptides56. Other approaches are based on Poisson distribution–based statistics that take into account the protein length62,63 or model the protein abundance as a latent variable41.
The final goal of the protein-level analysis is to derive a list of proteins with a controlled FDR. FDR-controlling procedures that are analogous to the ones used at the spectral level are frequently used for proteins. For example, protein-level FDR can be again estimated using the target-decoy strategy20,60, or as a sum of posterior probabilities of correct identifications56. In addition, P values can be derived directly from the distributional assumptions of protein identification followed by Bonferroni adjustment to control family-wise error rate62, a more conservative criterion than FDR.
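Two of the error-control options mentioned above can be written in a few lines. In this sketch, the FDR of an accepted protein list is estimated as one minus the mean posterior probability of correct identification, and Bonferroni adjustment multiplies each P value by the number of tests. The function names and all probabilities and P values are hypothetical.

```python
def fdr_from_posteriors(probabilities):
    """Estimated FDR of an accepted list, given posterior probabilities of being correct."""
    if not probabilities:
        return 0.0
    expected_correct = sum(probabilities)
    return 1.0 - expected_correct / len(probabilities)


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted P values, controlling the family-wise error rate."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical protein-level probabilities and P values.
protein_fdr = fdr_from_posteriors([1.0, 0.9, 0.8, 0.5])
adjusted = bonferroni_adjust([0.001, 0.02, 0.5])
significant = [p <= 0.05 for p in adjusted]
```

The second protein illustrates why the family-wise criterion is the more conservative one: a raw P value of 0.02 no longer passes a 0.05 threshold after adjustment for three tests.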
As in the case of spectral-level analysis, the statistical power of protein identification depends on the scoring function and the method used to control FDR. The power can be improved by incorporating more information into the scoring function: for example, predicted detectability of peptides64 or similarity of quantitative profiles of peptides mapped to the same protein. As the levels of analysis (MS/MS spectra, distinct peptide sequences, proteins, and so forth) and the methods used to compute data summaries at each level become more complex, proving the appropriateness of data analysis procedures becomes more difficult. At the protein level, the ultimate validation of the results can be obtained by independent technical and biological replication of the experiment using the same or a different (for example, targeted) experimental strategy.
Quantitative proteomics
Mass spectrometry is increasingly used for relative or absolute quantification of peptides and proteins1,65. A typical analysis involves extraction of quantitative information from mass spectra at various levels of summarization, such as MS1 spectrum features (peaks in the MS1 spectrum characterized by their intensity, m/z value, and the time of acquisition of the spectrum), peptide features (that is, groups of isotopic mass peaks originating from the same peptide ion), or peptides (that is, multiple peptide features corresponding to different charge states of the same peptide). The goal of the experiment is to quantify changes in the abundance of those features across the samples that are being compared, and to provide a maximal list of differentially abundant features with a controlled FDR. Quantitative proteomics workflows can be generally divided into three categories (Fig. 3): stable isotope labeling, spectral counting and spectral feature analysis.
Figure 3. Quantitative proteomics workflows. (a) In the stable isotope labeling workflow, proteins are labeled using a light (sample 1, red) or heavy (sample 2, yellow) mass tag, mixed, digested into peptides and analyzed using tandem mass spectrometry. Spectral features observed in MS1 data (indicated as black dots in the m/z retention time plots) are identified from the acquired MS/MS spectra. Identified peptides are quantified from the signal intensities of MS1 features, and this information is used to infer the identity and relative quantification of their corresponding protein (protein A). Spectral features for which no MS/MS spectrum was acquired (blue dots), or for which no high probability peptide assignment was obtained (indicated by X) are not further analyzed. (b) In the spectral counting strategy, unlabeled protein samples are analyzed separately using the same protocol as each other, and the relative protein quantification is established by comparing the number of MS/MS spectra identified for each protein. (c) In the spectral feature analysis strategy, the analysis starts with alignment of MS1 data from different samples, extraction of spectral features and their quantification, all of which is done before the identification step. Spectral features showing differential expression are then identified using a targeted MS/MS-based workflow.
Stable isotope labeling. One commonly used approach is based on stable isotope labeling of proteins, in which samples are labeled chemically (for example, in isotope coded affinity tag, ICAT; or isobaric tags for relative and absolute quantification, iTRAQ) or metabolically (for example, in stable isotope labeling with amino acids in cell culture, SILAC), mixed together, and digested into peptides1,65 (Fig. 3a). Because of the mass shift introduced by the reagent, MS1 spectrum features corresponding to the same peptide can be quantified separately in the same mass spectrometry run, and their ratio represents the relative abundance of the corresponding peptide. Representative tools supporting this type of analysis are listed in Table 1. The correspondence between the spectral features representing the same peptide is established through the identification of the peptide sequence from acquired MS/MS spectra. In addition to the increased complexity posed by the labeling steps, this workflow is limited by the need to acquire and interpret the MS/MS spectra.
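The ratio calculation in the labeling workflow can be sketched as follows. Each identified peptide contributes a heavy-to-light intensity ratio, and the peptide ratios mapped to one protein are summarized into a protein ratio. Summarizing by the median of the log ratios is an assumption made here for robustness, not a method prescribed by the text; the function names and intensities are hypothetical.

```python
import math
from statistics import median


def peptide_ratios(intensity_pairs):
    """Heavy/light ratio for each peptide with positive (light, heavy) intensities."""
    return [
        heavy / light
        for light, heavy in intensity_pairs
        if light > 0 and heavy > 0
    ]


def protein_ratio(intensity_pairs):
    """Summarize peptide ratios into one protein ratio via the median log2 ratio."""
    log_ratios = [math.log2(r) for r in peptide_ratios(intensity_pairs)]
    return 2 ** median(log_ratios)


# Hypothetical (light, heavy) MS1 intensities for three peptides of one protein.
pairs = [(100.0, 200.0), (50.0, 100.0), (80.0, 320.0)]
ratio = protein_ratio(pairs)
```

The third peptide disagrees with the other two, and the median keeps it from dominating the protein-level estimate.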
Spectral counting. Quantification can also be done without isotopic labeling by means of spectrum counting (from MS/MS data) or integrated ion intensities (MS1)62,66,67,68,69,70 (Fig. 3b). In this strategy, the samples that are being compared are analyzed in the mass spectrometer separately but using the same data acquisition protocol. A separate list of proteins is created for each of the samples, and the lists are then compared to find differentially expressed proteins. The protein abundance in each sample is estimated from the number of MS/MS spectra identified for each protein, normalized to account for protein length or expected number of tryptic peptides. As a variation of this strategy, peptide abundance can be determined from the intensity of the corresponding spectrum features. This method suffers from an inability to quantify low abundance proteins identified from only one or two peptides, and in general is less accurate than the methods based on stable isotope labeling. Still, the practical utility of this method has been demonstrated in a number of applications66,67,68,69,71,72.
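A length-normalized spectral count of the kind described above could be computed as follows. This is a sketch under stated assumptions: the text only says that counts are normalized for protein length or expected number of tryptic peptides, so the specific choice here (count divided by length, rescaled to sum to one within a run) is illustrative, and the protein names, counts and lengths are invented.

```python
def normalized_spectral_counts(spectral_counts, protein_lengths):
    """Divide each protein's MS/MS spectrum count by its length, then
    rescale so the values sum to one within the run (assumed normalization)."""
    per_length = {p: spectral_counts[p] / protein_lengths[p] for p in spectral_counts}
    total = sum(per_length.values())
    if total == 0:
        return {p: 0.0 for p in per_length}
    return {p: value / total for p, value in per_length.items()}

# Invented counts of identified MS/MS spectra per protein in two separate runs
lengths = {"protA": 400, "protB": 100}
counts_run1 = {"protA": 40, "protB": 10}
counts_run2 = {"protA": 20, "protB": 30}

norm_run1 = normalized_spectral_counts(counts_run1, lengths)
norm_run2 = normalized_spectral_counts(counts_run2, lengths)
for protein in lengths:
    print(protein, norm_run1[protein], norm_run2[protein])
```

Note that protA has four times as many spectra as protB in run 1, yet the two receive equal normalized values because protA is four times longer.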
Spectral feature analysis. The third kind of workflow differs from the first two in that it does not require identification of the peptide sequence corresponding to each observed spectrum feature before quantification50,73 (Fig. 3c). In this label-free strategy, biological samples are analyzed in separate mass spectrometry runs, and the correspondence between spectral features across the runs is established by means of computational tools and with at most a minimal amount of information from MS/MS spectra50,73,74,75. This workflow allows analysis of a large number of spectrum features at higher data throughput, and is compatible with applications that require profiling of multiple biological samples, such as proteomics-based candidate biomarker discovery. The drawbacks include increased computational complexity owing to the presence of a large number of spurious features and noise, and more stringent requirements in robustness and reproducibility at various data acquisition steps76,77. Subsequent or parallel experiments using, for example, a targeted workflow2,78 are typically necessary to verify the presence of these features and their changes in abundance, and to determine their identity.
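Establishing correspondence between features across runs can be illustrated with a simple tolerance-based matcher. This is a deliberately naive sketch, not the algorithm of any cited tool: it assumes retention times have already been aligned between runs, and the tolerances, the greedy nearest-neighbor rule, and the feature values are all assumptions made for the example.

```python
def match_features(run_a, run_b, mz_tol_ppm=10.0, rt_tol_min=1.0):
    """Pair each feature in run_a with the closest unused feature in run_b.

    Features are (mz, retention_time_minutes, intensity) tuples. A pair is
    accepted only if both the m/z and retention time differences fall
    within the given tolerances. Returns a list of (index_a, index_b).
    """
    matches = []
    used = set()
    for i, (mz_a, rt_a, _) in enumerate(run_a):
        best_j, best_dist = None, None
        for j, (mz_b, rt_b, _) in enumerate(run_b):
            if j in used:
                continue
            ppm = abs(mz_a - mz_b) / mz_a * 1e6
            drt = abs(rt_a - rt_b)
            if ppm <= mz_tol_ppm and drt <= rt_tol_min:
                # Combine both differences, each scaled by its tolerance
                dist = ppm / mz_tol_ppm + drt / rt_tol_min
                if best_dist is None or dist < best_dist:
                    best_j, best_dist = j, dist
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches

# Invented MS1 features from two runs
run_a = [(500.2500, 20.1, 1.0e5), (750.4000, 35.0, 3.0e5)]
run_b = [(750.4010, 35.3, 6.0e5), (500.2510, 20.4, 1.1e5), (620.3000, 28.0, 2.0e4)]
pairs = match_features(run_a, run_b)
for i, j in pairs:
    print(run_a[i][0], "intensity ratio b/a:", run_b[j][2] / run_a[i][2])
```

The third feature in run_b has no partner in run_a and stays unmatched; in real data such features may be genuine signals present in only one sample or the spurious features and noise mentioned above.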
Regardless of the workflow used, a typical output consists of a list of detected proteins (or spectral features for which the identity may not be known) and their absolute or relative abundances across all samples or runs. The resulting information is similar to the information from other high-throughput experiments, such as gene expression microarrays. Determining which changes in abundance are significant requires statistical methods that take advantage of the large number of features to compensate for the small sample size76. A number of such methods have been implemented as a part of the computational tools that perform quantification. Alternatively, data can be exported to an external tool developed more generally in support of high-throughput data processing, such as those available as a part of the Bioconductor project79. Furthermore, the methods described in Box 1 can be used to control the FDR in the list of differentially abundant features.
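As an illustration of FDR control over a list of differentially abundant features, the sketch below applies the Benjamini-Hochberg adjustment to per-feature p-values. Box 1 is not reproduced in this section, so the choice of this particular procedure is an assumption; the p-values and the 0.05 threshold are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values from per-feature tests of differential abundance
p_values = [0.001, 0.02, 0.03, 0.5]
q_values = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(q_values) if q <= 0.05]
print(q_values, significant)
```

Features whose adjusted value falls at or below the chosen threshold are reported as differentially abundant, with the expected proportion of false discoveries among them held at that level.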
Although data from quantitative proteomic experiments have similarities to other data, such as those from gene expression experiments, they present many specific challenges. Proteomic data are more complex than gene expression data owing to the large span of protein concentrations, and to the fact that the identities (peptide sequences) of spectral features are not always known, or may be determined incorrectly. Another complication relates to the ambiguity in assigning peptides to proteins55. In the case of a shared peptide, its quantification may not be a reliable measure of the abundance of any of its corresponding proteins. In fact, the procedures for peptide identification and quantification are interdependent and complementary, and the power of both procedures can be increased by summarizing the data at different levels, such as at the level of protein identity. For example, shared quantitative profiles of peptides corresponding to the same protein increase the confidence in the identification. Conversely, observing changes in abundance across different peptides from the same protein may suggest the presence of several protein isoforms having differential expression55.
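One simple way to act on the shared-peptide caveat is to quantify each protein only from peptides that map to it uniquely. The sketch below shows that rule; it is one possible policy rather than a recommendation from the text, and the peptide names, mappings, log ratios and the median summary are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def protein_ratios_from_unique_peptides(peptide_ratios, peptide_to_proteins):
    """Summarize protein log ratios using only peptides assigned to one protein.

    Peptides shared between proteins are skipped, because their signal may
    mix contributions from several proteins.
    """
    per_protein = defaultdict(list)
    for peptide, log_ratio in peptide_ratios.items():
        proteins = peptide_to_proteins[peptide]
        if len(proteins) == 1:
            per_protein[next(iter(proteins))].append(log_ratio)
    return {protein: median(values) for protein, values in per_protein.items()}

# Invented peptide-level log2 ratios and peptide-to-protein assignments
peptide_ratios = {"PEPTIDEA": 1.0, "PEPTIDEB": 1.2, "SHAREDK": 0.1, "PEPTIDEC": -0.5}
peptide_to_proteins = {
    "PEPTIDEA": {"protA"},
    "PEPTIDEB": {"protA"},
    "SHAREDK": {"protA", "protB"},
    "PEPTIDEC": {"protB"},
}
protein_ratios = protein_ratios_from_unique_peptides(peptide_ratios, peptide_to_proteins)
print(protein_ratios)
```

In this toy data the shared peptide's ratio lies between the two proteins' values, which is the pattern expected if it reports a mixture of both.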
Conclusions and outlook
Mass spectrometry–based proteomics, specifically proteome analysis by a shotgun approach, has reached a high level of maturity with respect to sample processing, data acquisition and data analysis. However, a number of significant challenges remain. They are primarily related to the complexity of proteomes, which has so far precluded true proteomic analyses (that is, the analysis of all the components of a proteome) and generated partially overlapping datasets from identical samples, suggesting poor reproducibility of the technology. Secondarily, these challenges are related to the analysis of the information contained in proteomic datasets. In combination, these problems have created the impression that published proteomic data are at times of dubious quality.
It can be expected that incremental improvements of tools and methods such as the ones described in this review will further increase the quality of published proteomics data. The most significant improvements, however, will come from the skilled and systematic application of the most advanced available tools. It is encouraging that leading journals publishing proteomic studies have recognized this fact and started to request that authors follow specific guidelines and that the raw data supporting the conclusions of a paper be made accessible. The practical implementation of these guidelines is facilitated by the development of data sharing mechanisms such as Tranche (Table 1) and common file formats (Box 3). We must recognize, however, that some of the informatics issues facing shotgun proteomics datasets can be completely resolved by neither expert validation nor statistical arguments. These include the problem of inferring the identities of the proteins, protein isoforms and differentially modified proteins in a sample from confidently identified peptides. This and similar problems, in our opinion, can only be rigorously solved by the development of alternative proteomic workflows.
Two such alternatives are becoming apparent. The first, referred to as top-down proteomics, is focused on the analysis of intact proteins rather than peptides and therefore has the potential to resolve populations of proteins into their components80,81,82. The second alternative is based on targeted analysis of specific peptides of high information content, termed proteotypic peptides, that collectively represent the proteome, thus eliminating to a large extent the redundancy of current methods2,64,83. Although substantial progress has been achieved in both directions, significant technology development, including development of new algorithms and analysis tools, is still required before these technologies can be implemented routinely.
Note: Supplementary information is available on the Nature Methods website.
Published online: 27 September 2007.
REFERENCES
- Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
- Domon, B. & Aebersold, R. Mass spectrometry and protein analysis. Science 312, 212–217 (2006).
- Carr, S. et al. The need for guidelines in publication of peptide and protein identification data. Mol. Cell. Proteomics 3, 531–533 (2004).
- Geer, L.Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–964 (2004).
- Sadygov, R.G. & Yates, J.R. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 75, 3792–3798 (2003).
- Fenyo, D. & Beavis, R.C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).
- King, N.L. et al. Analysis of the Saccharomyces cerevisiae proteome with PeptideAtlas. Genome Biol. [online] 7, R106 (2006).
- Brunner, E. et al. A high-quality catalog of the Drosophila melanogaster proteome. Nat. Biotechnol. 25, 576–583 (2007).
- Yates, J.R., Morgan, S.F., Gatlin, C.L., Griffin, P.R. & Eng, J.K. Method to compare collision-induced dissociation spectra of peptides: potential for library searching and subtractive analysis. Anal. Chem. 70, 3557–3565 (1998).
- Craig, R., Cortens, J.C., Fenyo, D. & Beavis, R.C. Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 5, 1843–1849 (2006).
- Frewen, B.E., Merrihew, G.E., Wu, C.C., Noble, W.S. & MacCoss, M.J. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Anal. Chem. 78, 5678–5684 (2006).
- Lam, H. et al. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7, 655–667 (2007).
- Stein, S.E. & Scott, D.R. Optimization and testing of mass-spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
- Nesvizhskii, A.I. et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 5, 652–670 (2006).
- Mann, M. & Wilm, M. Error tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66, 4390–4399 (1994).
- Tabb, D.L., Saraf, A. & Yates, J.R. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem. 75, 6415–6421 (2003).
- Tanner, S. et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 77, 4626–4639 (2005).