Tutorial: How do I evaluate an orphan enzyme?

In our paper “Finding sequences for over 270 orphan enzymes” we describe the process by which we evaluated 1,122 putative orphan enzymes. We have two goals when we evaluate an orphan enzyme:

  • Find sequence data
  • Collect identification information

The best-case scenario is that we find sequence data for our enzyme during the evaluation process. This happens about 25% of the time. The rest of the time we want to collect identification information: details that make it easier to identify a sequence for the orphan, either in the lab or via computational prediction.

We outline a basic recommended procedure for evaluating an orphan enzyme in our paper. Here’s a slightly modified version of the paper’s figure, giving an overview of the identification process:



Let’s take a look at each phase of the overall process to explain it a bit more.

Collect all known names / Check databases for sequence data

The first thing we do when evaluating a potential orphan enzyme is to make sure that it actually is an orphan. We define an orphan enzyme as an enzyme that has been experimentally characterized but lacks sequence data in any major sequence database.

The “Is this an orphan enzyme?” tutorial explains in detail how to collect the names and synonyms for a possible orphan enzyme, and then how to search all the major sequence databases.

During this step you may discover that you don’t really have an orphan enzyme. You are also reasonably likely to find that the enzyme’s sequence exists in one or more of these sequence databases, but you couldn’t find it because of a typo or other database error.

Keep that idea in mind as you search databases for your enzyme. About half of the orphan enzymes we recovered were “lost” due to database entry errors (that’s about 12% of all the orphans!).
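Because small spelling differences can hide an entry, it can help to search with several spellings of each collected name. The snippet below is a minimal sketch of that idea, not the procedure from our paper: the function names and the example enzyme names are illustrative assumptions, and the only variants it generates are lowercase forms with hyphens kept, replaced by spaces, or removed.

```python
def name_variants(name):
    """Return simple spelling variants of one enzyme name (hypothetical helper)."""
    base = " ".join(name.split()).lower()       # collapse whitespace, lowercase
    candidates = {
        base,                                   # hyphens kept
        base.replace("-", " "),                 # hyphens turned into spaces
        base.replace("-", ""),                  # hyphens removed
    }
    # Collapse any double spaces created by the replacements above.
    return {" ".join(c.split()) for c in candidates}


def all_search_terms(names):
    """Merge the variants of every collected name into one sorted list."""
    terms = set()
    for name in names:
        terms |= name_variants(name)
    return sorted(terms)


if __name__ == "__main__":
    # Illustrative names only; substitute the names you collected.
    for term in all_search_terms(["HMG-CoA reductase", "Hmg2"]):
        print(term)
```

Each resulting term can then be pasted into a database search box. This does not catch genuine typos in a database entry, so browsing near-miss results by eye is still worthwhile.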

Collect documents

In all those cases where you don’t get lucky and find a sequence in a database, collecting documents related to the orphan enzyme is key.

Documents to collect include scientific publications as well as patents. We want to collect all documents that contain either sequence data or identification data.

It is easiest to collect documents as you search. The database entries you looked at in the last step will often cite potentially relevant documents. In addition, this step includes searching PubMed and the US Patent and Trademark Office database directly with all of your collected names for the enzyme.
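Searching PubMed with every collected name at once can be done by joining the names with OR. The following is a small sketch of how such a query might be assembled for NCBI's E-utilities esearch endpoint. It only builds the query string and URL and does not contact NCBI; the helper names, the example enzyme names, and the choice of the [All Fields] tag are assumptions made for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_term(names):
    """Join all collected names into one OR query of quoted phrases."""
    quoted = ['"{}"[All Fields]'.format(n.strip()) for n in names if n.strip()]
    return " OR ".join(quoted)


def esearch_url(names, retmax=100):
    """Build (but do not send) an esearch request URL for PubMed."""
    params = {
        "db": "pubmed",
        "term": pubmed_term(names),
        "retmax": retmax,      # maximum number of PubMed IDs to return
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Illustrative names only; substitute the names you collected.
    names = ["HMG-CoA reductase", "Hmg2"]
    print(pubmed_term(names))
    print(esearch_url(names))
```

The same OR-joined term can also be pasted straight into the PubMed search box. Patent databases use their own query syntax, so this sketch does not carry over to them unchanged.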

Look out for these especially helpful items:

    • Accession numbers
    • Gene or protein names
    • Amino acid or nucleotide sequence data

Obviously, finding an accession number or actual sequence data ends your search immediately – your orphan is no longer an orphan!
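When you have the text of many documents, a rough pattern scan can flag strings that look like accession numbers so that you can check them by hand. The sketch below is an assumption-laden illustration rather than part of our procedure: the patterns cover only a few common UniProt, GenBank, and RefSeq shapes, the categories overlap, and ordinary strings such as catalogue numbers will produce false positives. The accessions in the example are placeholders in the right shape, not entries tied to any enzyme discussed here.

```python
import re

# Approximate patterns for common accession shapes (not exhaustive).
PATTERNS = {
    # UniProt accession format, e.g. P12345
    "uniprot_like": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
    # GenBank: 1 letter + 5 digits, 2 letters + 6 digits, or 3 letters + 5 digits
    "genbank_like": re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{3}\d{5})\b"),
    # RefSeq: two-letter prefix, underscore, digits, optional version
    "refseq_like": re.compile(r"\b[A-Z]{2}_\d{6,9}(?:\.\d+)?\b"),
}


def find_accession_candidates(text):
    """Return candidate accession strings grouped by the pattern they matched."""
    return {
        kind: sorted(set(pattern.findall(text)))
        for kind, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    sample = "Deposited as X12345; protein P12345; see also NP_000001.1."
    for kind, hits in find_accession_candidates(sample).items():
        print(kind, hits)
```

Note that a string can match more than one pattern (a six-character UniProt accession also has a GenBank-like shape), so every candidate still needs to be looked up in the relevant database.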

Gene or protein names sometimes lead to sequence data. When you find a gene or protein name associated with your enzyme activity, return to the major sequence databases and search with it as well. Sometimes the enzyme was deposited only under a somewhat cryptic protein name (e.g. Hmg2) and was never labeled with any of its full enzymatic names or synonyms.

Identification information

The blue inset box shows the many kinds of identification information. These are valuable to collect because they make it much easier for us to find a sequence for an orphan enzyme. For example, knowing how an enzyme was purified (#2) and how it is assayed (#3) are both extremely important if a lab wants to try purifying the enzyme again so it can be sequenced. In some cases, identification information alone is sufficient to puzzle out an enzyme’s sequence.

It is also quite helpful to catch other incidental notes such as whether or not an enzyme is extremely unstable. A quick note now can save a lot of pain later.

You can see examples of real-world identification information in our database.

Peptide fragment data

Some papers, especially those that precede the advent of mass spectrometry-based protein sequencing, will include the first 10 to 20 amino-terminal residues of the protein. If the protein is from an organism with a sequenced genome, this kind of peptide fragment data is often sufficient to uniquely identify the enzyme’s full sequence via BLAST.

We discuss the process of identifying a sequence from peptide fragment data in more detail in this tutorial.
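To make the idea concrete, here is a toy sketch of matching an N-terminal fragment against a set of predicted proteins. It is not BLAST and is not the method from our paper: it is a plain mismatch-counting scan over the first few positions of each protein, and the function name, the parameters, and the made-up sequences are all assumptions for illustration. The scan allows a small start offset because the mature protein's first residue may not be the first residue of the predicted sequence (for example, when the initiator methionine or a signal peptide has been removed).

```python
def n_terminal_hits(fragment, proteome, max_mismatches=1, max_offset=30):
    """Find proteins whose N-terminal region matches a peptide fragment.

    proteome maps a protein name to its amino acid sequence. Returns a list
    of (name, start, mismatches) tuples, best matches first, keeping only
    the first acceptable window found in each protein.
    """
    fragment = fragment.upper()
    hits = []
    for name, seq in proteome.items():
        seq = seq.upper()
        # Last start position that still leaves room for the whole fragment.
        last_start = min(max_offset, len(seq) - len(fragment))
        for start in range(0, last_start + 1):
            window = seq[start:start + len(fragment)]
            mismatches = sum(a != b for a, b in zip(fragment, window))
            if mismatches <= max_mismatches:
                hits.append((name, start, mismatches))
                break
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    toy_proteome = {
        "protA": "MSTKQLLEAVGRRPWTNHDE",
        "protB": "MKKLLPTAAAGLLLLAAQPA",
    }
    print(n_terminal_hits("STKQLLEAVG", toy_proteome))
```

For a real search, BLAST against the organism's proteome is the better tool, since it handles gaps, substitution scoring, and statistical significance, none of which this sketch attempts.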

What happens if I don’t find a sequence?

If you don’t find sequence data for your orphan enzyme, this process still leaves you with all the identification data you have collected. This can sometimes be enough to: