Tutorial: How do I identify an enzyme using peptide fragment data?

Occasionally you will find that a paper, patent, or database contains peptide fragment data for an orphan enzyme. This is most common in papers published before mass spectrometry-based sequencing became widely available.

The typical case is that the researchers sequenced the first 10 to 20 amino acids of the enzyme (its amino-terminal sequence). Even relatively short peptides like these are often enough to uniquely identify an orphan enzyme. This type of identification needs two ingredients: some kind of peptide fragment data and a sequenced genome for the source organism.

Aryl-aldehyde dehydrogenase (NADP+) (EC was an enzyme for which we had identified peptide fragment data. The source paper contained the sequence of the enzyme’s first sixteen amino acids (sort of):




This was the amino-terminal sequence data listed in the paper:


The “X” in the 15th slot is actually a placeholder. It’s important to watch out for these placeholders and other potential “variables” in your sequence data. It is also reasonably common to see a residue in parentheses, like this:


This notation means that, because of limitations in the sequencing method used, it is uncertain whether that residue (in this case, histidine) is actually present.

In the case of our example, we know that there was a residue in the 15th slot, but its identity was too unclear to call.
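Before searching, it can help to list exactly which positions in a fragment are uncertain. The following is a minimal sketch, assuming the fragment is written in one-letter code, with "X" for a residue of unknown identity and a single parenthesized letter for a tentative residue. The function name `describe_fragment` and the example fragment are hypothetical and are not taken from the paper.

```python
import re


def describe_fragment(fragment):
    """Return (sequence, notes) for a fragment written in one-letter code.

    Assumed notation (an illustration, not a formal standard):
      'X'   marks a residue whose identity could not be called
      '(H)' marks a residue whose presence is tentative
    Each note is a tuple of (1-based position, residue, status).
    """
    residues = []
    notes = []
    # Each token is either a parenthesized letter or a bare letter.
    for token in re.findall(r"\([A-Za-z]\)|[A-Za-z]", fragment):
        position = len(residues) + 1
        if token.startswith("("):
            residue = token[1].upper()
            notes.append((position, residue, "tentative"))
        else:
            residue = token.upper()
            if residue == "X":
                notes.append((position, residue, "unknown"))
        residues.append(residue)
    return "".join(residues), notes


# Made-up fragment, used only to show the notation.
sequence, notes = describe_fragment("MKTAX(H)LVG")
print(sequence)
for position, residue, status in notes:
    print(position, residue, status)
```

Positions are counted from 1 to match the way the text refers to "the 15th slot".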

Once you have peptide fragment data, you can use it for a BLAST search against the full NCBI non-redundant protein sequences database (nr).

Set up a protein BLAST (blastp) search on the NCBI BLAST website.

Just copy the peptide fragment sequence and paste it into the search window:


In cases where there is a residue in parentheses, it is usually best to search both possible variations separately. For example, we would BLAST one version of the sequence that includes the parenthesized residue and one that omits it.
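Writing out the variations by hand gets error-prone when a fragment has more than one parenthesized residue. As a rough sketch, the queries can be enumerated programmatically. This assumes that the two variations for each parenthesized residue are "residue present" and "residue absent"; the function name `expand_variants` and the example fragment are hypothetical.

```python
import itertools
import re


def expand_variants(fragment):
    """Return every query made by keeping or dropping each '(R)' residue."""
    # re.split with a capture group alternates fixed text and the
    # captured tentative residues: ['MKTA', 'H', 'LVG'].
    parts = re.split(r"\(([A-Za-z])\)", fragment)
    fixed = parts[0::2]
    tentative = parts[1::2]
    variants = []
    # One True/False choice (keep/drop) per tentative residue.
    for choices in itertools.product((True, False), repeat=len(tentative)):
        pieces = [fixed[0]]
        for keep, residue, following in zip(choices, tentative, fixed[1:]):
            if keep:
                pieces.append(residue)
            pieces.append(following)
        variants.append("".join(pieces).upper())
    return variants


# Made-up fragment with one tentative histidine.
print(expand_variants("MKTA(H)LVG"))  # ['MKTAHLVG', 'MKTALVG']
```

Each returned string can then be pasted into the BLAST search window as its own query. A fragment with n parenthesized residues yields 2^n queries, so this is only practical for a handful of uncertain positions.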




The BLAST software will automatically adjust your search parameters to account for the shortness of the sequence you are using.

Although it is possible to limit your BLAST search to a specific organism, we recommend not doing so. Organism names can and do change over time, and you are highly unlikely to find a perfect match in anything other than the correct source organism anyway.

Your search results may look something like this:


In this case, we do not actually have any results with 100% identity. A closer look at the lone result with a good E-value reveals why:




That “X” residue in the 15th slot will naturally never have identity with any other residue. However, given the perfect match of the remaining residues, it is reasonable to conclude that the correct identity of X is actually Q (glutamine).
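The reasoning above amounts to comparing the fragment with the start of the matched protein position by position and treating X as a wildcard. Here is a minimal sketch of that check, assuming an ungapped alignment that begins at the first residue of the matched protein. The function name `resolve_placeholders` and both sequences are hypothetical and are not the real data from this example.

```python
def resolve_placeholders(query, subject_start, placeholder="X"):
    """Compare a query fragment with the start of a matched protein.

    Returns (mismatches, inferred):
      mismatches: list of (position, query_residue, subject_residue)
                  for real disagreements
      inferred:   dict mapping each placeholder position to the residue
                  found at that position in the matched protein
    Positions are 1-based. Assumes an ungapped alignment from residue 1.
    """
    if len(subject_start) < len(query):
        raise ValueError("matched sequence is shorter than the query")
    mismatches = []
    inferred = {}
    for position, (q, s) in enumerate(zip(query, subject_start), start=1):
        if q == placeholder:
            inferred[position] = s
        elif q != s:
            mismatches.append((position, q, s))
    return mismatches, inferred


# Made-up sequences: a fragment with one placeholder and a longer hit.
mismatches, inferred = resolve_placeholders("MKTAXLVG", "MKTAQLVGRS")
print(mismatches)  # []
print(inferred)    # {5: 'Q'}
```

An empty mismatch list together with an inferred residue for each placeholder corresponds to the situation described above: every called residue agrees, so the placeholder can reasonably be assigned the residue found in the matched protein.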

Clicking through to the matched protein’s database entry reveals something else important:




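This sequence is from Nocardia iowensis. The original paper describes purification and characterization of the enzyme from Nocardia strain NRRL 5646, which was later renamed Nocardia iowensis. This is one reason not to restrict the search by organism name: the name in the paper may no longer match the name in the database.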

With the combination of a short peptide sequence, a sequenced genome for the source organism, and the BLAST search tool, we were able to uniquely identify the sequence of a previously orphan enzyme.