With the new genomic
data bases of model species, such as Escherichia coli, Saccharomyces cerevisiae, mouse, and human, the
sequences of many proteins of biological interest will in principle be
known, and the problem of characterizing a protein's primary structure will be
reduced to identifying it in the data base.
Within the past few
years several research groups have demonstrated how MS can be used for
identification of proteins in sequence data bases. One approach is to cleave the
protein with a sequence-specific proteolytic enzyme, measure molecular weight
values for the resulting peptide mixture by mass spectrometry, and search a
sequence data base for proteins that should yield
these values. Search algorithms have also been implemented recently that utilize
low-resolution tandem mass spectra of selected peptides (<3 kDa) from the protein
degradation. Yates and coworkers compared the MS/MS sequence data to the sequences predicted for each of the
peptides that would be generated from each protein in the data
base. In the PEPTIDESEARCH sequence tag approach of Mann and
Wilm, a partial sequence of 2–3 amino acids is assigned from
the fragment mass differences in the MS/MS spectrum. This partial sequence
and its mass distance from each end of the peptide (based on the
masses of the fragment and molecular ions) are used for the data base
search.
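
The first approach in the paragraph above (cleave the protein, measure the peptide masses, search for proteins that should yield those masses) can be illustrated with a minimal sketch. It is not the algorithm of any of the cited groups. It assumes trypsin as the sequence-specific enzyme (cleavage after K or R, but not before P), complete digestion with no missed cleavages, monoisotopic residue masses, and a simple count of matched masses as the score. The function names, the tolerance of 0.5 Da, and the data base layout (a mapping from protein name to sequence) are all hypothetical.

```python
# Monoisotopic amino acid residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def digest_trypsin(sequence):
    """Split a sequence after K or R unless the next residue is P.

    Assumption: complete digestion, no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def rank_proteins(measured_masses, database, tolerance=0.5):
    """Rank data base entries by how many measured masses they explain.

    `database` is a hypothetical dict of protein name -> sequence.
    The score is a plain hit count, chosen only for illustration.
    """
    scores = []
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in digest_trypsin(sequence)]
        hits = sum(
            any(abs(m - p) <= tolerance for p in predicted)
            for m in measured_masses
        )
        scores.append((name, hits))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Called with a list of measured peptide masses and a data base dictionary, `rank_proteins` returns (name, hit count) pairs with the best-explained protein first. A practical search would also have to allow for missed cleavages, modifications, and the mass accuracy of the instrument.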
For
three of the proteins studied, a single sequence tag retrieved only the correct protein from the data
base; a fourth protein required the input of two sequence tags.
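
The idea behind a sequence tag search can be sketched as follows. This is an illustration under stated assumptions, not the PEPTIDESEARCH implementation. It assumes the data base has already been digested into peptides (a hypothetical dict of protein name to peptide list), expresses the mass distances from each end of the peptide as plain sums of residue masses (ignoring terminal groups and ion-type offsets), and treats I and L as identical because they have the same mass. All function names and the 0.5 Da tolerance are hypothetical.

```python
# Monoisotopic amino acid residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_sum(fragment):
    """Sum of residue masses; 0.0 for an empty fragment."""
    return sum(RESIDUE_MASS[r] for r in fragment)


def tag_matches(peptide, tag, n_mass, c_mass, tolerance=0.5):
    """True if `tag` occurs in `peptide` with the given flanking masses.

    n_mass / c_mass: residue mass before / after the tag (simplified
    stand-ins for the mass distances to each end of the peptide).
    """
    pep = peptide.replace("I", "L")  # I and L are isobaric
    tag = tag.replace("I", "L")
    start = pep.find(tag)
    while start != -1:
        end = start + len(tag)
        if (abs(residue_sum(pep[:start]) - n_mass) <= tolerance
                and abs(residue_sum(pep[end:]) - c_mass) <= tolerance):
            return True
        start = pep.find(tag, start + 1)
    return False


def search_by_tag(peptide_db, tag, n_mass, c_mass, tolerance=0.5):
    """Names of proteins with at least one peptide matching the tag."""
    return sorted(
        name for name, peptides in peptide_db.items()
        if any(tag_matches(p, tag, n_mass, c_mass, tolerance)
               for p in peptides)
    )
```

The partial sequence alone can occur in many peptides; requiring the flanking masses to agree as well is what makes a short tag of 2–3 residues selective enough to retrieve a single protein in favorable cases.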