
Curator’s Guide for Pathway/Genome Databases

Pathway Tools Version 11.0

Ron Caspi, Carol Fulcher, John Ingraham, Ingrid Keseler, Markus Krummenacker, Suzanne Paley, A
SRI International
333 Ravenswood Ave.
Menlo Park, CA 94025

ptools-support@ai.sri.com

July 3, 2007

Previous Contributors: Martha Arnaud, Cynthia Krieger

Contents

1 Introduction
1.1 Definitions

2 Literature Search
2.1 Recommended Databases for Literature Search
2.1.1 PubMed
2.1.2 MEDLINE
2.1.3 BIOSIS via LANL
2.1.4 SciSearch

3 PGDB Curation
3.1 Naming Issues
3.2 Overview of PGDB Content
3.2.1 Pathways
3.2.2 Chemical Compounds
3.2.3 Compound Classes for Broad Substrate Specificity and Polymerization
3.2.4 Reactions
3.2.5 Proteins
3.2.6 Genes
3.3 Summaries and History Notes
3.3.1 Writing Style Guidelines
3.3.2 Formatting Summaries
3.3.3 Say It in Your Own Words
3.3.4 Citation Guidelines
3.4 Saving Changes
3.5 Evidence Codes

4 Pathway Curation
4.1 Pathway Definitions
4.2 Defining Pathway Start and End Points
4.3 Pathway Links
4.4 Limitations
4.5 Database Searching Strategies
4.6 Pathway Entry

5 EcoCyc-Specific Information
5.1 E. coli Gene Frame Names
5.2 Interrupted Genes

6 MetaCyc-Specific Information
6.1 Database Searching Strategies for MetaCyc
6.2 Species Information
6.2.1 Taxonomic Range Information
6.3 E. coli Pathways in MetaCyc
6.4 Pathways from Other Pathway Databases
6.5 Proteins as Substrates in MetaCyc
6.6 Curation with Classification Systems
6.6.1 Gene Classes

7 Organism Summary

8 Update Propagation among KBs
8.1 Invoking KB Updating
8.2 Overview of the Updating Procedure

9 Database Release Process
9.1 Correctify
9.2 Release Notes
9.2.1 Database Statistics
9.2.2 Updates to the Notes
9.3 Updates to the General KB Information

10 Programming Hints

1 Introduction

This guide contains information for curators of Pathway/Genome Databases (PGDBs) such as EcoCyc
[4] and MetaCyc [3]. PGDBs are created, updated, and queried using the Pathway Tools software
system [2].
This curator’s guide addresses issues regarding PGDB conventions, literature search and review, PGDB
entry and editing, and a variety of programs that may be run periodically on PGDBs. Since the roles
of curators may vary, only parts of this guide may be relevant to any particular reader. For example,
some curators may not conduct literature research and others may never run any of the automated
programs. Furthermore, sections of this guide are specific to the curation of a particular PGDB,
such as the organism-specific PGDB EcoCyc, and thereby may not be applicable to other PGDBs.
For example, there is an organism-specific section for EcoCyc and it addresses such issues as the
conventions for naming E. coli genes. Organism-specific sections are labeled as such.
Another important source of information for curators is the Pathway Tools User’s Guide document.
This guide is a work in progress. Please mail suggestions for improvements to
ptools-support@ai.sri.com.

1.1 Definitions

The following terms are used in this manual.


Pathway/Genome Database: A PGDB describes the genome of an organism (its chromosome(s),
genes, and genome sequence), the product of each gene, the biochemical reaction(s) catalyzed by each
gene product, the substrates of each reaction, and the organization of reactions into pathways. A
PGDB can also describe the genetic network of an organism: its promoters, operons, transcription
factors, and transcription-factor binding sites. A PGDB is a type of MOD (Model Organism Database).
EcoCyc Database: A PGDB for the organism E. coli. The majority of the information in EcoCyc
is derived from the biomedical literature.
MetaCyc Database: A PGDB containing metabolic data for many different organisms. The goal
of MetaCyc is to contain broad coverage of experimentally elucidated metabolic pathways from many
different organisms, rather than to attempt to model the complete pathway complement of any par-
ticular organism. MetaCyc contains a broad base of well-established pathways that are used by the
PathoLogic program to predict the pathway complement of a particular organism, which is modeled
within a separate PGDB for that organism. The majority of the information in MetaCyc is derived
from the biomedical literature.
BioCyc Knowledge Library: The collection of PGDBs at URL BioCyc.org is called the BioCyc
Knowledge Library. EcoCyc, MetaCyc, and BsubCyc are all component databases within the BioCyc
Library.

2 Literature Search

2.1 Recommended Databases for Literature Search

2.1.1 PubMed

This database is a quick and easy starting point for researching metabolic pathways. Regardless of
which database you use for your literature search, you will want the PubMed or MEDLINE citation
number for each article you reference, so that you can add a web link from your PGDB to PubMed. For
articles that have no PubMed or MEDLINE citation, you will have to create a frame for the reference (see
the Pathway Tools User’s Guide). For a description of PubMed searching strategies see the following URL:
http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching. Adding the
word “coli” is useful for restricting EcoCyc reference searches that would otherwise return a huge number of
irrelevant results. Using the “Limits” option to limit searching to titles and abstracts also helps
restrict searches in cases where the gene name also happens to be the name of a person or institution,
or part of an institution’s address.
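
The two restrictions described above can also be written directly into a query string. The following is a minimal Python sketch and is not part of Pathway Tools. It assumes PubMed's [tiab] (title/abstract) field tag and the NCBI E-utilities esearch endpoint; the function names are hypothetical. It only builds the query and the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities "esearch" service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(gene_name, organism_word="coli"):
    """Restrict the gene name to titles/abstracts and add the word "coli"."""
    return f"{gene_name}[tiab] AND {organism_word}"


def esearch_url(term, retmax=20):
    """Build (but do not fetch) an esearch URL for the given query term."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


if __name__ == "__main__":
    query = pubmed_query("folC")
    print(query)
    print(esearch_url(query))
```

The same query string can be pasted into the PubMed search box; the URL form is only useful if searches are scripted.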

2.1.2 MEDLINE

MEDLINE, when accessed directly rather than through PubMed, offers more search functions, such as
combining different searches and subtracting one search from another.

2.1.3 BIOSIS via LANL

This database is accessible free of charge to Stanford University so you can access it from any terminal
connected to Stanford. If you have access to Lane library there are computers there that can be used
for database searching.

2.1.4 SciSearch

This database allows you to search for articles that cite an article of interest, which enables you to
search for related articles that have been published after the article of interest.

3 PGDB Curation

This section discusses curation of specific PGDB datatypes.


Figure 1 provides an overview of the relationships among some of the main PGDB datatypes. A
pathway is connected to its component reactions, which in turn are connected to the enzymes that
catalyze them. Enzymes are connected to the genes that encode them, which are connected to the
replicons (chromosomes or plasmids) on which they reside. Enzymes can be active as monomers or as
complexes; complexes are connected to objects that represent their polypeptide subunits. Reactions
are also connected to objects representing their substrates (such as Metabolite-1 in this example).
Genes are also connected to objects representing the transcription units (operons) containing them.
Transcription units are also connected to their promoters.

Figure 1: Relationships among PGDB datatypes. Each name in this drawing denotes a PGDB object
of a given datatype, for example, Reaction-1 is an object that represents a reaction. A line between two
objects indicates that a DB relationship exists between the objects, for example, a slot (attribute) of
Reaction-1 called In-Pathway has Pathway-1 as a value, thus linking the reaction to the pathway that
contains it. All of these links are bi-directional, for example, a slot of Pathway-1 called Reaction-List
has Reaction-1 as a value.
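
The bi-directional links described in the caption can be illustrated with a small Python sketch. This is not the Pathway Tools API: the FrameStore class and its methods are hypothetical, and only the slot pair In-Pathway / Reaction-List is taken from the caption above. The point is that storing a value in one slot also stores the inverse value on the other object.

```python
from collections import defaultdict


class FrameStore:
    """Toy frame store in which some slots have an inverse slot (hypothetical)."""

    # Slot pair taken from the Figure 1 caption; other inverse pairs are omitted.
    INVERSE = {"In-Pathway": "Reaction-List", "Reaction-List": "In-Pathway"}

    def __init__(self):
        # frame ID -> slot name -> list of values
        self.slots = defaultdict(lambda: defaultdict(list))

    def _add(self, frame, slot, value):
        values = self.slots[frame][slot]
        if value not in values:
            values.append(value)

    def add_value(self, frame, slot, value):
        """Add a slot value and, if the slot has an inverse, the reverse link."""
        self._add(frame, slot, value)
        inverse = self.INVERSE.get(slot)
        if inverse is not None:
            self._add(value, inverse, frame)

    def get(self, frame, slot):
        return list(self.slots[frame][slot])


if __name__ == "__main__":
    store = FrameStore()
    store.add_value("Reaction-1", "In-Pathway", "Pathway-1")
    print(store.get("Pathway-1", "Reaction-List"))
```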

3.1 Naming Issues

A variety of naming issues arise when developing PGDBs. We strive for consistency in naming. One
reason consistency is important is so that users of a database can rely on a naming scheme being used
in a systematic way. For example, if a user wishes to find all degradation pathways, they can do a
substring search for the term “degradation,” and we can ensure that they will not miss some pathways
that use the term “catabolism” instead.
Although we advocate using a single consistency criterion for the common names of different object
types, such as chemicals, enzymes, and pathways, it is reasonable, and indeed preferable, to apply
several naming schemes consistently when creating synonyms if you believe that different communities of
users are likely to use different sorts of names when querying. The
addition of synonyms for different object types increases the flexibility and robustness of the database.
Synonyms 1) enable users to search the database using alternative names and readily
find the information, and 2) prevent the addition of redundant information to the database under
alternative names. Furthermore, if a reaction does not have a specific Enzyme Commission (EC)
number, the PathoLogic program [2] relies on enzyme names to correlate an annotated gene
with an enzyme, which underscores the importance of enzyme names.
Try not to put commas inside the common names for objects, because the Pathway Tools software uses
commas as separators between the names of objects when it displays a list of objects. One alternative
is to use hyphens instead of commas.
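
A check for commas in common names could be sketched as follows. This is an illustrative Python snippet, not a Pathway Tools function; the function name and the example names are hypothetical. The hyphenated form is only a suggestion for the curator to review, since a mechanical replacement is not always the best name.

```python
import re


def flag_comma_names(common_names):
    """Return (name, suggestion) pairs for every common name containing a comma.

    The suggestion replaces each comma, and any spaces around it, with a hyphen.
    """
    flagged = []
    for name in common_names:
        if "," in name:
            flagged.append((name, re.sub(r"\s*,\s*", "-", name)))
    return flagged


if __name__ == "__main__":
    # Hypothetical names used only to exercise the check.
    for original, suggestion in flag_comma_names(["example name, variant", "plain name"]):
        print(f"{original!r} -> {suggestion!r}")
```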
More detailed naming conventions are presented in sections that follow on each datatype.

3.2 Overview of PGDB Content

The following sections discuss curation of different PGDB datatypes, such as the naming conventions
for each datatype. Frequently, it is beneficial to know the type of information collected in PGDBs
prior to curating a given pathway or other object type, so these sections summarize the information
to be gathered for each datatype.

3.2.1 Pathways

Please note that Section 4 discusses curation of pathways in more detail.


Summary of information collected for pathways:

• Common name

• Synonym(s) of pathway name

• Superclass(es) of pathway

• Names of species in which the pathway has been experimentally demonstrated (only relevant to
MetaCyc since other PGDBs are specific to a given organism).

• Summary

– General description of the pathway and its significance.


– Statement regarding the initial and end substrates. If the pathway is defined as the degra-
dation of substrate A to substrate E, but E is further degraded, comment on how E is
further degraded and to what end products in the species noted.
– Statement regarding whether the pathway is shared among different types of species.
– Relationship to similar pathways in the same or different species.
– Relationship to linked pathways (preceding and subsequent pathways), and sub- and super-
pathways (see definitions), if applicable.
– Experimental evidence for the pathway.
– Highlight interesting or novel reactions/enzymes in the pathway.
– If the pathway contains proposed intermediates, or hypothetical reactions, discuss the rel-
evant circumstantial or preliminary evidence.

• Links to other pathways within the same PGDB.

• Links to other DBs, such as other PGDBs, the WIT DB, and The University of Minnesota
Biocatalysis/Biodegradation DB (UM-BBD).

• Label hypothetical reactions as such.

• Net reaction equation, which specifies the net chemical transformation of a pathway, including
stoichiometry, as defined in the PGDB. This can be added via the frame editor.

• Citations

Pathway Naming. When adding a new pathway name, try to use the format and style of other,
similar pathway names.
For consistency, please use the terms “degradation” and “biosynthesis” for pathway common names
when appropriate. Do not use the terms “catabolism,” “anabolism,” or “utilization.” If “degradation”
or “biosynthesis” is not appropriate, “metabolism” may be used.
Pathway variants should be enumerated with roman numerals, not arabic numerals. For example,
TCA cycle variation I, TCA cycle variation II, etc.
Begin the name of any superpathway with Superpathway of...
Explicitly note pathways that are anaerobic by inclusion of (anaerobic) at the end of the pathway
name. The one exception is the anaerobic energy pathways (“anaerobic respiration,” etc.), in which
the word “anaerobic” is found at the beginning of the pathway name.
Write out the names of amino acids within all pathway names, rather than using full names within
some pathway names and abbreviations within others.
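
Several of the pathway naming rules above are mechanical enough to check automatically. The following Python sketch is hypothetical and not part of Pathway Tools; it covers only three of the rules (discouraged terms, roman numerals for variants, and the "Superpathway of" prefix) and leaves the others to the curator.

```python
import re

# Terms the guidelines above ask curators to avoid in pathway common names.
DISCOURAGED_TERMS = ("catabolism", "anabolism", "utilization")


def check_pathway_name(name):
    """Return a list of naming problems found in a pathway common name."""
    problems = []
    lowered = name.lower()
    for term in DISCOURAGED_TERMS:
        if term in lowered:
            problems.append(
                f'uses "{term}"; prefer "degradation", "biosynthesis", or "metabolism"'
            )
    # A trailing arabic number suggests a variant that should use I, II, III, ...
    if re.search(r"\s\d+$", name):
        problems.append("variant is numbered with digits; use roman numerals")
    if "superpathway" in lowered and not lowered.startswith("superpathway of"):
        problems.append('superpathway names should begin with "Superpathway of"')
    return problems


if __name__ == "__main__":
    for candidate in ("TCA cycle variation II", "lysine catabolism 2"):
        print(candidate, "->", check_pathway_name(candidate))
```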

3.2.2 Chemical Compounds

Summary of information collected for chemical compounds:

• Chemical name and synonym(s)

• Superclass(es) of chemical compound

• Chemical structure

• Chemical formula and molecular weight are calculated from the chemical structure.

• Gibbs free energy of formation. This can be added via the frame editor. Add it if you come across
this information, but do not spend time searching for it.

• Links to other databases, such as other PGDBs, Chemical Abstracts Service (CAS), etc.

• Summary

• Citation(s)

Chemical names, synonyms and structures may be found in a variety of sources, such as the following:

• On-line databases

– Kyoto Encyclopedia of Genes and Genomes (KEGG):
(http://www.genome.ad.jp/kegg/ligand.html). Go to Search Compound.
– The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD):
(http://umbbd.ahc.umn.edu/index.html). Use compound search.
– Google and Google Image
– Online literature databases
– Sigma-Aldrich product catalog: (http://www.sigmaaldrich.com/).

– Klotho Biochemical Compounds Declarative Database:
(http://www.biocheminfo.org/klotho/).
– IUBMB Enzyme Nomenclature Database: (http://www.chem.qmul.ac.uk/iubmb/enzyme/).
This DB can be helpful if the chemical of interest is a substrate or product of an EC
reaction, and there is a link from the reaction to a reaction diagram, displaying the chemical
structure.
– WIT Database: (http://wit.mcs.anl.gov/WIT2/).
– Enzymes and Metabolic Pathways (EMP) Database: (http://emp.mcs.anl.gov/).
– Agency for Toxic Substances and Disease Registry (ATSDR):
(http://www.atsdr.cdc.gov/toxfaq.html).

• Books and chemical catalogs

– Merck Index
– General chemistry and biochemistry textbooks if the chemical is common or is a central
metabolite
– Sigma catalog
– Aldrich catalog

Chemical structures and synonyms may also be found by searching the scientific literature for research
articles. PubMed can be searched readily on-line. If you are lucky you will find a citation for a
paper that is available on line and contains a figure with the chemical structure. Articles cited in the
References section of MetaCyc enzyme and pathway pages are good candidates to try searching.
If a compound cannot be found by name-based searching, it can sometimes be found by searching for
a reaction in which it is known to be a substrate, or by searching for an enzyme that is known to
catalyze a reaction in which it is a substrate.
It is desirable but not essential to find each compound in at least two sources, both to ensure correctness
of its structure, and to aid in finding additional synonyms.
It is often faster to enter a structure by copying and modifying the structure of an existing similar
compound rather than entering the structure from scratch.
If a structure is not found for a compound, use the frame editor to record this fact in the
Comment-Internal slot, in this form: “No structure found in search on DATE by PERSON”.

Chemical Naming. Common names for chemical compounds should not be capitalized except
where uppercase characters are strictly required, e.g., use “L-tryptophan,” not “L-Tryptophan.” In
addition, for consistency, when naming organic acids, please use the name of the conjugate base as the
common name and the conjugate acid as a synonym. For example, the common names for acetic acid
and benzoic acid should be acetate and benzoate, respectively, and their respective synonyms should
be acetic acid and benzoic acid.
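
The organic acid convention can be sketched in Python as follows. This is a naive, hypothetical heuristic that only handles names ending in "ic acid"; it is not a Pathway Tools function, it will not handle every acid name correctly, and its output should always be reviewed by the curator.

```python
def conjugate_base_name(acid_name):
    """Turn a name ending in "ic acid" into the "-ate" form, or return None."""
    suffix = "ic acid"
    if acid_name.lower().endswith(suffix):
        return acid_name[: -len(suffix)] + "ate"
    return None


def name_entry(acid_name):
    """Propose a common name (conjugate base) and a synonym (conjugate acid)."""
    base = conjugate_base_name(acid_name)
    if base is None:
        # The heuristic does not apply; keep the name and propose no synonym.
        return {"common_name": acid_name, "synonyms": []}
    return {"common_name": base, "synonyms": [acid_name]}


if __name__ == "__main__":
    for acid in ("acetic acid", "benzoic acid"):
        print(name_entry(acid))
```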
Many chemicals have a common name as well as a trivial name, and an IUPAC (International Union
of Pure and Applied Chemistry) name. For example, dimethyl ketone (common name) has the trivial
name of acetone, and the IUPAC name of 2-propanone. It is also known as beta-ketopropane and
2-oxopropane. Different communities of users should be able to query the chemical using any one of these
names and find the chemical entry. To make this possible, each chemical entry in the PGDB should
include the chemical common name(s), trivial name(s), and IUPAC name(s) as either the common
name or synonym. Furthermore, some functional groups have multiple prefixes and/or suffixes. For
example, the functional group of ketones, >C=O, has two different prefixes, oxo- (IUPAC name) and
keto- (common name). The bifunctional compound 4-oxohexanoate is equivalent to 4-oxohexanoic
acid, 4-ketohexanoate and 4-ketohexanoic acid, and thus the chemical entry for this compound should
include all of these names as either the common name or synonym. Furthermore, if you think that
many people are likely to use names such as both “glucose-6-P” and “glucose-6-phosphate,” you may
wish to consistently use one style for the common name, and one style as a synonym. Note that
the Pathway Tools matching software for chemical names eliminates ambiguity due to capitalization,
hyphenation, and spacing. Therefore the names “glucose-6-P” and “glucose 6-p” will match one
another during a user query, so there is no need to enter both of these names as synonyms.
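
The effect of this matching behavior on synonym entry can be illustrated with a short Python sketch. The normalization below is an assumption made for illustration (lowercase the name and drop hyphens and whitespace); the actual Pathway Tools matching code may differ in detail. The function names are hypothetical.

```python
import re


def normalize_name(name):
    """Ignore capitalization, hyphenation, and spacing (assumed matching rule)."""
    return re.sub(r"[\s\-]+", "", name).lower()


def prune_redundant(names):
    """Keep only the first of any names that normalize to the same string."""
    seen = set()
    kept = []
    for name in names:
        key = normalize_name(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(prune_redundant(["glucose-6-P", "glucose 6-p", "glucose-6-phosphate"]))
```

Under this assumption, “glucose 6-p” adds nothing beyond “glucose-6-P”, whereas “glucose-6-phosphate” is a distinct name that is still worth entering as a synonym.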

3.2.3 Compound Classes for Broad Substrate Specificity and Polymerization

Some enzymes (and their reactions) have broad and/or ill-characterized substrate specificities. This
section explains the rationale and philosophy for representing these complexities faithfully.
Most reactions in small-molecule metabolism convert distinct and well-defined compounds, which are
represented by compound instance frames that reside under the Compounds
class hierarchy. However, a substantial number of reaction equations, including many that were
imported from the EC enzyme classification system, refer to broad compound classes. The question
arises of how to represent these classes so that much of the chemical logic of metabolic pathways
can be captured and made computationally accessible.
One of the strengths of the PGDB schema is that the concept of broad substrate specificity can be
faithfully captured, by using a compound class frame instead of a compound instance as a substrate
in the reaction equation. Such a compound class can have multiple specific instances, i.e. example
compounds for which the reaction equation is known or presumed to hold true.

Organization of the Compound Class Hierarchy. Compounds are organized in a class hierarchy
for several reasons, including to specify groups of compounds that are interchangeable as enzyme
substrates, and to allow user navigation and retrieval of sets of compounds that share functional groups
or that share metabolic purposes.
Underneath the class Compounds, compound classes describing functional groups include All-Amines,
All-Carboxy-Acids, All-Amino-Acids, and All-Carbohydrates. Classes describing metabolic purposes
include Coenzymes, Hormones, Vitamins, and Secondary-Metabolites. These classification hierarchies are
not yet as fully developed as they could and should be, and may be subject to additional refinement
and rearrangement.
The special class Unclassified-Compounds serves as the catch-all bin for compounds that have not
yet been carefully placed in an appropriate category. Newly created compound instances (when the
curator uses the Compound Editor [2]) are created in this class by default, unless another class is
explicitly specified.
For now, please create new compound classes in the KB SCHEMABASE, using the Taxonomy Editor; from
there they should be propagated to organism-specific PGDBs. We plan to improve the mechanism for
creating new classes.

Broad Substrate Specificity. Compound classes allow a PGDB to represent broad substrate
specificities in reaction equations. By using a compound class as a substrate in a reaction equation, you
are effectively stating that all instances of that class are interchangeable within that reaction. As an
example in EcoCyc, the reaction LYSOPHOSPHOLIPASE-RXN (EC 3.1.1.5) contains two classes:
a 2-acylglycerophosphocholine + H2O = a fatty acid + L-1-glycero-3-phosphocholine
The class 2-Acylglycero-Phosphocholines on the left side stands for a group of lysolecithin compounds
that have a variable-length fatty acid tail. On the right side, the fatty acids class stands for the
corresponding hydrolysis products. In contrast, H2O and L-1-glycero-3-phosphocholine are specific
compounds, represented as instances.
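
The statement that all instances of a class are interchangeable within a reaction can be sketched in Python. This is an illustration only, not the PGDB schema or the Pathway Tools API. The class name is taken from the example above, but the instance IDs are placeholders and the function name is hypothetical.

```python
# Placeholder instance IDs; these are not real EcoCyc frames.
CLASS_INSTANCES = {
    "2-Acylglycero-Phosphocholines": {"lysolecithin-example-A", "lysolecithin-example-B"},
}

# Left side of LYSOPHOSPHOLIPASE-RXN: one compound class and one specific compound.
LEFT = ["2-Acylglycero-Phosphocholines", "H2O"]


def can_be_substrate(compound, side, class_instances):
    """True if the compound is listed directly or is an instance of a listed class."""
    for substrate in side:
        if compound == substrate:
            return True
        if compound in class_instances.get(substrate, set()):
            return True
    return False


if __name__ == "__main__":
    for candidate in ("lysolecithin-example-A", "H2O", "ATP"):
        print(candidate, can_be_substrate(candidate, LEFT, CLASS_INSTANCES))
```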
What usually determines the exact extent of the specificity is the enzyme that catalyzes the
reaction, and this may even differ depending on the isozyme or organism. Because the classes used for
representing substrate specificity do not necessarily coincide with the classification based on the
chemical viewpoint, the curator often has to make an informed decision about how much can be safely
assumed about the likely extent of specificity in a given reaction equation.
As a more complicated example, in EcoCyc the class Nucleosides is a substrate in the transport
reaction TRANS-RXN-108. However, the narrower subclass Ribonucleosides is a substrate in the
3-NUCLEOTID-RXN phosphatase reaction, which appears to apply only to RNA-derived substrates
and not to the other major subclass, Deoxy-Ribonucleosides. For these classes, the grouping by
functional group agrees well with the grouping by role as a broad reaction substrate. One
subtle difference is that the class Ribonucleosides includes the compound INOSINE, because this class
definition clearly makes sense from the functional group viewpoint. In 3-NUCLEOTID-RXN, however,
it is unlikely that INOSINE will ever be a substrate that occurs biologically in the RNA material
that is processed. Such classification decisions often have to be made, and more precise guidelines will
likely evolve for some time to come.
In summary, to represent broad substrate specificity in reaction equations, classes can be used that
will generally contain specific compound instances, and could in principle acquire additional future
instances as more new compounds are added to the database and as the enzymes for the reactions are
investigated in greater biochemical detail. Maintaining detailed information about compound classes
and their instances allows inferences to be made about compound conversions mediated by enzymes
with broad specificity, which would not become apparent in the metabolic network otherwise.

Ill-characterized Substrates. Some EC reactions describe very broad conversions of functional
groups, such as the reduction of aldehydes to alcohols (EC 1.1.1.2):
NADP + an alcohol = NADPH + an aldehyde
This is so broad that it would be dangerous to make assumptions about which compound instances are
truly affected. Often, the description of such reaction equations is so vague that it is very difficult to
search for and obtain information on the scope of the reaction. In other cases, the biochemical experiments
that could delineate the extent of the specificity more precisely simply have not yet been performed.
For example, in HumanCyc, a substantial number of function assignments to enzymes were made by
sequence similarity to a family of enzymes that is known to convert a general class of compounds, but
the true substrate in humans has not yet been determined for every putative enzyme.
In these cases, it seems best for now to create compound classes that contain no compound instances.
This is in contrast to the classes that represent broad substrate specificity, which generally will contain
compound instances. The classes representing ill-characterized substrates serve as placeholders until
more information becomes available that allows a better classification.
In the special class Pseudo-Compounds, ultra-ill-defined compound anomalies can be recorded, until
they can be resolved more satisfactorily.

Polymerization. A number of EC reactions try to describe what in truth are polymerization
reactions. Various polymerases, nucleases, and also fatty acid synthase and the corresponding degradation
can all be viewed as causing polymerizations or de-polymerizations. Polymerization under a high
degree of enzymatic control is what sets cells apart from a simple bag of small molecules and
their direct interconversions. Because of the key role that polymerization plays in biology, and to close
the loop computationally for the full metabolic flow of materials, a representation had to be found
that captures the essence of polymerization reactions. As far as we know, this feature is unique to
the PGDB schema; other metabolic pathway databases do not yet have concise, high-level ways
of representing polymerizations.
This can be done in the PGDB schema by looking at a growing polymer as if it were a compound class,
and as if its hypothetical instances (which do not necessarily have to be explicitly enumerated) repre-
sent the intermediate products as the polymer is being grown. Internally, PGDB reaction equations
that add a monomer to a polymer will generally list the compound class representing the polymer on
both sides of the reaction equation. But when the name of that polymer is displayed to the user by
Pathway Tools in a reaction or pathway drawing, it generates different names for different lengths of
the polymer, depending on the side of the equation the class appears on.
For example, the reaction FOLYLPOLYGLUTAMATESYNTH-RXN (EC 6.3.2.17) refers to the same,
identical compound class THF-GLU-N on both sides of the equation, but the display of this reaction
shows the appropriate (n) and (n+1) versions of the polymer name:
L-glutamate + H4PteGlu(n) + ATP = H4PteGlu(n+1) + phosphate + ADP
These additional names are stored in three special slots in the compound class frame, namely N-NAME,
N+1-NAME, and N-1-NAME. Each of these slots can be edited to contain a single string, which can be
selected as an alternative to COMMON-NAME in reaction and pathway displays.
In a reaction frame that refers to a polymer, the appropriate name is selected by
attaching the NAME-SLOT annotation to the polymer compound class frame where it appears as a value
of the LEFT or RIGHT slot in the reaction frame. The value of the NAME-SLOT annotation is
the symbol that stands for one of the three slots mentioned above.
Using the “Show Frame” menu command to print the frame contents in the example above shows
these annotations:

--- Instance FOLYLPOLYGLUTAMATESYNTH-RXN ---
Types: EC-6.3.2
COMMENT: "This reaction is involved in the conversion of folates to
          polyglutamate derivatives."
COMMON-NAME: "FOLYLPOLYGLUTAMATE SYNTHASE"
CREATION-DATE: 3000588944
CREATOR: |mriley|
EC-NUMBER: "6.3.2.17"
ENZYMATIC-REACTION: FOLYLPOLYGLUTAMATESYNTH-ENZRXN
IN-PATHWAY: FOLSYN-PWY
LEFT:
    GLT
    THF-GLU-N
    ---NAME-SLOT: N-NAME
    ATP
NAMES: "FOLYLPOLYGLUTAMATE SYNTHASE",
       "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE"
OFFICIAL-EC?: T
RIGHT:
    THF-GLU-N
    ---NAME-SLOT: N+1-NAME
    |Pi|
    ADP
SCHEMA?: T
SUBSTRATES: ADP, |Pi|, GLT, THF-GLU-N, ATP
SYNONYMS: "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE"
TEMPLATE-FILE: "/home/bolinas2/ecocyc/templates/new/folsyn/folC.template"
____________________________________________
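
How the NAME-SLOT annotations select the displayed polymer names can be sketched in Python. This is an illustration of the idea, not Pathway Tools code. The slot names, frame IDs, and the N-NAME and N+1-NAME strings come from the example above; the dictionary layout, the function names, and the COMMON-NAME placeholder for THF-GLU-N are assumptions.

```python
# Display names per frame. The COMMON-NAME of THF-GLU-N is a placeholder.
FRAMES = {
    "GLT": {"COMMON-NAME": "L-glutamate"},
    "ATP": {"COMMON-NAME": "ATP"},
    "ADP": {"COMMON-NAME": "ADP"},
    "Pi": {"COMMON-NAME": "phosphate"},
    "THF-GLU-N": {
        "COMMON-NAME": "placeholder common name",
        "N-NAME": "H4PteGlu(n)",
        "N+1-NAME": "H4PteGlu(n+1)",
    },
}

# Each substrate is paired with its NAME-SLOT annotation (None if absent).
LEFT = [("GLT", None), ("THF-GLU-N", "N-NAME"), ("ATP", None)]
RIGHT = [("THF-GLU-N", "N+1-NAME"), ("Pi", None), ("ADP", None)]


def display_name(frame_id, name_slot=None):
    """Use the annotated name slot if it is filled, else fall back to COMMON-NAME."""
    frame = FRAMES[frame_id]
    return frame.get(name_slot or "COMMON-NAME", frame["COMMON-NAME"])


def render_equation(left, right):
    """Render both sides of a reaction using the selected display names."""
    def side(items):
        return " + ".join(display_name(fid, slot) for fid, slot in items)
    return side(left) + " = " + side(right)


if __name__ == "__main__":
    print(render_equation(LEFT, RIGHT))
```

With these inputs the rendered string matches the equation shown earlier for FOLYLPOLYGLUTAMATESYNTH-RXN, even though both sides refer to the same class frame THF-GLU-N.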

In a pathway that contains polymerization reactions, the appropriate names are selected
by the POLYMERIZATION-LINKS slot of the pathway frame. Each value of this slot is a list of
frame ID symbols of the form (cpd-class product-rxn reactant-rxn). When both reactions are non-nil,
an identity link is created between the polymer compound class cpd-class, serving as a product of
product-rxn, and the same compound class serving as a reactant of reactant-rxn. The identity link
is displayed to the user as a white dashed line in the pathway drawing. The PRODUCT-NAME-SLOT
and REACTANT-NAME-SLOT annotations on this list specify which slot should be used for
displaying the compound label in product-rxn and reactant-rxn, respectively; if one or both
are omitted, COMMON-NAME is assumed. Either reaction may be nil, in which case no identity
link is created. This form is used solely in conjunction with one of the name-slot annotations to
specify a name slot other than COMMON-NAME for a polymer compound class in a reaction of the
pathway.

As an example, this is what the POLYMERIZATION-LINKS slot looks like in the pathway FOLSYN-PWY,
as printed out by “Show Frame”:

POLYMERIZATION-LINKS:
(THF-GLU-N FOLYLPOLYGLUTAMATESYNTH-RXN FOLYLPOLYGLUTAMATESYNTH-RXN)
---PRODUCT-NAME-SLOT: N+1-NAME ---REACTANT-NAME-SLOT: N-NAME
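
The structure of a POLYMERIZATION-LINKS value can be modeled in Python as follows. This is a sketch for illustration, not the Pathway Tools representation: the class and attribute names are hypothetical, None stands in for nil, and the defaults follow the rule that an omitted name-slot annotation means COMMON-NAME.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolymerizationLink:
    """One value of the form (cpd-class product-rxn reactant-rxn) plus annotations."""

    cpd_class: str
    product_rxn: Optional[str]   # None stands in for nil
    reactant_rxn: Optional[str]  # None stands in for nil
    product_name_slot: str = "COMMON-NAME"
    reactant_name_slot: str = "COMMON-NAME"

    def has_identity_link(self) -> bool:
        """An identity link is drawn only when both reactions are given."""
        return self.product_rxn is not None and self.reactant_rxn is not None


# The FOLSYN-PWY value shown above.
folsyn_link = PolymerizationLink(
    cpd_class="THF-GLU-N",
    product_rxn="FOLYLPOLYGLUTAMATESYNTH-RXN",
    reactant_rxn="FOLYLPOLYGLUTAMATESYNTH-RXN",
    product_name_slot="N+1-NAME",
    reactant_name_slot="N-NAME",
)

# A label-only value: no identity link, just a name slot for one reaction.
label_only = PolymerizationLink(
    cpd_class="THF-GLU-N",
    product_rxn=None,
    reactant_rxn="FOLYLPOLYGLUTAMATESYNTH-RXN",
    reactant_name_slot="N-NAME",
)

if __name__ == "__main__":
    print(folsyn_link.has_identity_link(), label_only.has_identity_link())
```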

The Pathway Editor [2] supports the selection of these alternate name labels. Right-clicking on a
compound in the pathway editor will bring up a menu with commands, if applicable, that allow
creating and deleting polymerization links, and selecting from the name labels that the polymer
compound class makes available in the name slots described above.
The Reaction Editor [2] still lacks support for name label selection as of June 2003. For reaction
frames, the Frame Editor can be used to manually add the appropriate annotations.

Macromolecules as Substrates. Numerous reaction equations refer to proteins or nucleic acids
that are enzymatically modified. One prominent example is protein phosphorylation in a regulatory
context. In a sense, proteins and nucleic acids are polymers, but they are generally treated differently
in the PGDB schema from the polymerization compound class concept discussed above.
Historically, the PGDB schema has made the high-level distinction between small molecule compounds
and macromolecules. Proteins and nucleic acids are in the latter category. Usually, when a macromolecule
is listed as a substrate in a reaction, a corresponding modified form of the macromolecule
must be specified on the other side of the equation. The unmodified macromolecule generally will be
a gene product, such as a protein monomer or tRNA. For the modified variant, a new frame generally
needs to be created, which represents the modified macromolecule, and which points to the unmodi-
fied version in its UNMODIFIED-FORM slot. The modified and unmodified variants should be made
instances in a class of their own. Also see the description in [2].
Redox-active proteins are also treated in this manner, with the corresponding oxidized and reduced
variants paired up.
Many reaction equations accept a broad range of macromolecules, and in these cases, macromolecule
classes can be used as the substrates, in analogy to how compound classes are utilized to represent
broad substrate specificity. A typical example is a tRNA-charging reaction, such as
LEUCINE--TRNA-LIGASE-RXN (EC 6.1.1.4):
tRNAleu + L-leucine + ATP = L-leucyl-tRNAleu + pyrophosphate + AMP
On the left side, the class LEU-tRNAs contains 8 distinct gene products in EcoCyc.
There are also numerous reaction equations coming from the EC system that are very generic modi-
fications of macromolecule structure, such as DNA methylation at one type of base, wherever it may
occur in a long DNA polymer. For now, the best way to represent such modifications at particular
macromolecule residues, which are likely to apply to many different gene products that are left other-
wise unspecified, is by using macromolecule classes, which in analogy to ill-specified compound classes
will contain no instances. At a future time, these placeholder classes might be replaced by a better
mechanism for representing residue- and site-specific modifications on macromolecules.

Naming Conventions of Compound Classes. Like all class names, the frame IDs for compound
classes should be capitalized plurals, with words separated by dashes. Example: All-Amines. The
“All-” prefix means this is a high-level class from a chemist’s viewpoint that will contain more detailed
distinctions. [PK: I wouldn’t rule out what the next sentence states.] This type of organizational class
is generally not intended to serve as a substrate for a reaction.
The COMMON-NAME slot should be left empty, so that the Taxonomy Editor will simply display
the frame IDs.
However, the SYNONYMS slot should list as its first value a name string that makes good sense when
written in a reaction equation, which generally will be a lowercase singular name, usually beginning
with “a” or “an” and otherwise essentially the same as the frame ID, but with spaces used instead of
dashes to separate words. Example: the class Aliphatic-Amines has as its first synonym “an aliphatic amine”.
(This name will be generated automatically by the software that displays compound names in reaction
equations.)
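
As a rough illustration of the convention, the following Python sketch derives the reaction-equation name from a frame ID. It is not the routine used by Pathway Tools: the function name is hypothetical, the singular is formed by simply dropping a trailing “s”, and the article is chosen from the first letter only, so irregular plurals and unusual pronunciations would need manual correction.

```python
def reaction_equation_name(frame_id: str) -> str:
    """Derive a reaction-equation name from a compound-class frame ID.

    Naive assumptions: the plural ends in a plain "s", and the article
    depends only on whether the first letter is a vowel.
    """
    words = frame_id.lower().split("-")
    if words[-1].endswith("s"):
        words[-1] = words[-1][:-1]   # crude singularization
    phrase = " ".join(words)
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase}"
```

A curator could use such a helper to check whether the first synonym of a class follows the convention, for example by comparing it with reaction_equation_name("Aliphatic-Amines").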

Chemical Structures of Compound Classes. Compound classes can also have molecular structures,
just like instances. The main difference is that a class needs to show in its structure that
variations exist, which if explicitly enumerated, would correspond to the compound instances populat-
ing this class. In a class structure, this variation can be represented by an “R” group. The Compound
Structure Editor allows “R” to be entered as a special atom type. Some class structures need
to represent two or more different “R” groups that must be distinguished from one another. Proper software
support for this does not yet exist.

3.2.4 Reactions

Summary of information collected for reactions:

• EC number. There is a slot to label whether this is an official EC number or not. The default is
that it is the official number. The EC number for the reaction may not be official if the reaction
is not in the exact form as defined by the Enzyme Commission.

• EC name

• Spontaneous reaction?

• Change in Gibbs Free Energy for the reaction in the direction of the reaction as written. DeltaG0
can be added via the frame editor. Add it if you come across this information, but do not spend
time searching for it.

• Summary

– Brief description of the reaction. Highlight any interesting or unusual aspects of the reac-
tion.
– If the reaction is novel, describe why.
– If the reaction is hypothetical, explain the supporting evidence.
– If the reaction is spontaneous, describe under what conditions. In vivo? In vitro?

• Citation(s)

Reaction Directionality. By way of background, it is important to know that the Enzyme Commission
dictates a preferred direction for every enzymatic reaction that it classifies. All PGDBs
should store reactions in this preferred direction; therefore, you should not change the direction in
which a reaction is stored in the DB for reactions that have EC numbers. Reactions that do not have
EC numbers should be stored in the same direction as comparable EC-classified reactions, if known.
The direction in which the Enzyme Commission writes the reaction often differs from the direction in
which the reaction occurs physiologically. If the reaction is irreversible, or a strong preference for its
directionality exists in the cell, a Reaction Direction can be specified using the Enzyme Editor (see
documentation for the Reaction-Direction slot).
Note that the Navigator software chooses which direction to display a reaction in reaction windows and
enzyme windows based on several criteria. Please see the Navigator User’s Guide for more information.

Hypothetical Reactions. Reactions within pathways may be indicated as hypothetical if there
is insufficient experimental evidence. For example, the last few reactions in anaerobic toluene degradation are
marked as hypothetical because the intermediates, 2-carboxymethyl-3-hydroxyphenylpropionyl-CoA
and benzoylsuccinyl-CoA, have yet to be observed in the lab. These intermediates and their respective
reactions were proposed based on analogous biochemical reactions. There is also genetic evidence for
the hypothesized enzymes. If a reaction is marked as hypothetical it should be further explained in
the summary of the pathway and possibly in the reaction or enzyme windows.

Reaction Balancing. Ideally, all biochemical reactions within a PGDB should be exactly balanced
with respect to mass and atoms, meaning that the sums of the masses and atoms for the reactants of
the reaction should equal that for the reaction products. In practice, some reactions are unbalanced.
Here we discuss how to find and correct unbalanced reactions.
The Correctify-KB process will print a list of reactions that it finds to be unbalanced with respect
to mass or atoms. Note that it ignores reactions that are unbalanced with respect to hydrogens
because the chemical structures within BioCyc PGDBs are stored with inconsistent ionization states
that cannot be expected to balance across every reaction.
Reactions are typically unbalanced for one of several reasons. (1) The literature itself contains an
unbalanced reaction, for example because the full reaction equation has not been reliably
determined. Such reactions should be given a summary indicating
that the reaction is expected to be unbalanced. (2) The reaction involves polymerization, and therefore
involves complexity that the reaction balancer cannot handle properly. (3) The reaction equation is
in error, for example because a water molecule is missing, and should be corrected. (4) One or more chemical
structures for substrates of the reaction are in error and should be corrected. When errors in reactions
or compound structures are corrected, those corrections should be logged in a history note for the
reaction or compound, created using the command Right-Click → Notes → Add to History. If an
error is found in a reaction that has an EC number, please alert the Enzyme Commission (IUBMB).
The information printed for each reaction will aid the curator in identifying the reason the reaction
is unbalanced. For example, if a large fraction of all reactions containing a given compound are
unbalanced, there is a good chance that this is because the structure of that compound is in error.
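
The kind of check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Correctify-KB implementation: formulas are given directly as element-count dictionaries, only atom counts (not masses) are compared, hydrogens are ignored as described above, and all function names are hypothetical.

```python
from collections import Counter


def side_totals(side):
    """Sum atom counts over a list of (coefficient, formula) pairs."""
    total = Counter()
    for coeff, formula in side:
        for element, count in formula.items():
            total[element] += coeff * count
    return total


def imbalance(left, right, ignore=("H",)):
    """Return {element: left_count - right_count} for each unbalanced element."""
    l, r = side_totals(left), side_totals(right)
    return {
        el: l[el] - r[el]
        for el in sorted(set(l) | set(r))
        if el not in ignore and l[el] != r[el]
    }


def fraction_unbalanced(compound, substrates_by_rxn, unbalanced):
    """Fraction of the reactions containing compound that are unbalanced."""
    containing = [r for r, subs in substrates_by_rxn.items() if compound in subs]
    if not containing:
        return 0.0
    return sum(r in unbalanced for r in containing) / len(containing)


SUCROSE = {"C": 12, "H": 22, "O": 11}
HEXOSE = {"C": 6, "H": 12, "O": 6}   # glucose and fructose share this formula
WATER = {"H": 2, "O": 1}

# Sucrose hydrolysis written without water (reason 3 above), then corrected.
missing_water = imbalance([(1, SUCROSE)], [(1, HEXOSE), (1, HEXOSE)])
corrected = imbalance([(1, SUCROSE), (1, WATER)], [(1, HEXOSE), (1, HEXOSE)])
```

A high value from fraction_unbalanced for one compound corresponds to the heuristic above: the structure of that compound is the likely source of the error.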

3.2.5 Proteins

Summary of information collected for proteins:

• General Protein Information

– Species name
– Common name and synonym(s) of protein
– Cellular location (e.g., membrane, cytoplasm, chloroplast, etc.)
– Native molecular weight in kilodaltons
– Neidhardt Spot Number (reflects the protein’s electrophoretic behavior in 2-dimensional
electrophoresis). (Can be added via the frame editor, but not typical. Add it if you come
across this information, but do not search for it.)
– Summary
∗ General description of the protein, highlighting any interesting aspects. The com-
ment on each protein page should include a summary of the most important infor-
mation known about the protein; for example, its function and role within the cell,
any phenotypes resulting from its mutation or absence, protein domains, its partici-
pation as a component of a larger structure, its similarity to other proteins (includ-
ing any functional complementation studies), membership in protein families, and in-
formation about its structure. If the enzyme is novel, explain why. If the enzyme
has different isoforms, describe the substrate specificity of the isoforms and the cell-
type/tissue/developmental specificity of the isoforms. Any reviews may be noted specif-
ically.
∗ If a search fails to turn up any information about a particular object, it is useful to
add a summary to this effect. It should include the date of the search, for example,
“No information about this protein was found by a literature search conducted on 18
June 2003.”
– Citation(s). In EcoCyc, we are assembling comprehensive reference lists for each protein.
– Last-Curated. A PGDB can track a “last-curated” date for a gene product. Checking the
“last-curated” box in the protein editor will cause this field to be updated. The last-curated
date is the date on which a systematic literature search was last performed by curators for
this gene product. This date can be used by both curators and database users to determine
how up to date the entry is. Do not check this box if only partial curation is performed for
the gene product, as this would interfere with the purpose of this field, which is to record
the last date on which full curation was performed.

• Enzyme Activity

– Enzyme activity name and synonym(s) (name based on the activity of the reaction)
– Summary
∗ General description of the enzymatic activity, highlighting any interesting aspects. Ki-
netic data, such as Michaelis constants (Km(s)) for the substrates, and Ki(s) for in-
hibitors should be noted, if available.
– Inhibitors (physiologically relevant or not?)

∗ Competitive inhibitors: inhibit enzyme activity by binding reversibly to the enzyme
and thereby preventing the substrate from binding.
∗ Noncompetitive inhibitors: inhibit enzyme activity by binding reversibly to either the
free enzyme or the enzyme-substrate complex. The substrate is not prevented from
binding, but the enzyme with the inhibitor bound is not catalytically active. This
category was added to the Pathway Tools schema for the February 2004 release.
∗ Uncompetitive inhibitors: inhibit enzyme activity by binding reversibly to the enzyme-
substrate complex. This category was added to the Pathway Tools schema for the
February 2004 release.
∗ Allosteric inhibitors: inhibit enzyme activity by binding reversibly to the enzyme and
inducing a conformational change that decreases the affinity of the enzyme to its sub-
strates. Allosteric inhibitors can be competitive or noncompetitive, therefore those
inhibition categories may be used in conjunction with this one.
∗ Irreversible inhibitors: irreversibly inhibit the enzyme activity by binding to the enzyme
and dissociating so slowly as to be considered irreversible. This category was added to
the Pathway Tools schema for the February 2004 release.
∗ Other inhibitors: inhibit the enzyme activity by a mechanism that has been charac-
terized but does not fall cleanly into one of the above categories. This category was
added to the Pathway Tools schema for the February 2004 release. It replaces a sim-
ilar category in previous versions for all inhibitors that were neither competitive nor
allosteric.
∗ Inhibitors of unknown mechanism: inhibit enzyme activity, but the mechanism of ac-
tion is unknown either because it has not yet been elucidated, or because it has not
been curated. This category was added to the Pathway Tools schema for the Febru-
ary 2004 release. It combines and replaces two old categories from previous versions
that distinguished between mechanisms that were unknown because they had not been
extensively curated versus ones that remained unknown after a substantial literature
search.
– Activators (physiologically relevant or not?)
Categories of enzyme activation were similarly reexamined at the time of the February
2004 release, and documentation was updated. Although no new activation categories were
added, the two categories for activators whose mechanism was unknown after a minimal
versus a substantial literature search were combined into a single category for activators of
unknown mechanism.
– Prosthetic Groups/Cofactors (e.g., metals, FAD, NAD(H), etc.) (Note: Prosthetic groups
are defined as being covalently or tightly bound to an enzyme, whereas cofactors are not.)
– Alternative substrates
For a description of some of these terms see Appendix A of Pathway Tools User’s Guide.
– Citation(s)

• Enzyme Subunit Composition

– Subunit composition. Specifies the number of copies of each monomer subunit of a mul-
timeric protein. In cases where sub-complexes of a large multimer have been observed,
those sub-complexes can be created as PGDB objects that are sub-complexes of a larger
super-complex.

– Subunit name(s) and synonym(s)
– Subunit molecular weight (experimental and computed from sequence)
– pI (isoelectric point). Can be added via the frame editor. Add it if you come across
the information, but do not search for it.
– SWISS-PROT primary accession number of subunit, if available
(http://us.expasy.org/sprot/). Links the protein subunit to SWISS-PROT.
– Summary
∗ Brief description of the enzyme subunit, such as its function, if known or proposed.
For example, a subunit may be known to house the catalytic active site, it may have
an FAD binding motif and may be proposed to be involved in electron transport, or it
may be proposed to be the membrane anchor subunit of a membrane-bound enzyme.
– Citation(s)

Protein Naming. The word “monomer” may be used to refer to a polypeptide that has intrinsic
activity or to a polypeptide that acts as a member of a homo-multimeric complex. The words “subunit”
or “component” may be used to refer to a member of a hetero-multimeric complex.
When naming a protein complex, use the name that is commonly recognized by the scientific commu-
nity. Any other names for the complex should be added as synonyms.
Avoid extra wordiness (“complex”, for example) unless it is necessary to avoid confusion. For example,
the long name of the pyruvate dehydrogenase multienzyme complex distinguishes it from pyruvate
dehydrogenase, which is one of its components.
Any protein product of a named locus without an associated gene should be named using the guidelines
for protein complex names: “acetate kinase B” should be named “acetate kinase B” with “AckB” as
a synonym, rather than “AckB,” because the ackB mutation has not been localized to a particular
gene.

Enzyme Naming. Be sure not to assign non-specific enzyme names to enzymes. For example, do not
use names such as “oxidoreductase” or “hydrogenase” because these names refer to nonspecific classes
of enzyme activity, not to specific enzyme activities. Use of these names will be very problematic for the
PathoLogic program, because genome annotations often use these nonspecific enzyme names for genes
whose specific functions cannot be inferred. Thus, if a MetaCyc enzyme were called “oxidoreductase,”
PathoLogic would assign a gene annotated with the name “oxidoreductase” to the corresponding
reaction in MetaCyc, which is not correct.
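
The problem can be illustrated with a toy name matcher in Python. This sketch does not reproduce the PathoLogic algorithm; it assumes simple case-insensitive exact matching, and the reaction IDs, the function names, and the short list of nonspecific names are hypothetical.

```python
# Hypothetical examples of nonspecific names; a real list would be much longer.
NONSPECIFIC_NAMES = {"oxidoreductase", "hydrogenase"}


def match_by_name(annotation, enzyme_names):
    """Return the reactions whose enzyme name equals the gene annotation.

    enzyme_names maps an enzyme name to a reaction ID; matching is
    case-insensitive and exact, which is an assumption of this sketch.
    """
    key = annotation.strip().lower()
    return [rxn for name, rxn in enzyme_names.items() if name.lower() == key]


def is_nonspecific(name):
    """Flag a proposed enzyme name that denotes a whole class of activities."""
    return name.strip().lower() in NONSPECIFIC_NAMES


# A reference database in which one enzyme was given a nonspecific name.
reference_names = {"pyruvate kinase": "RXN-A", "oxidoreductase": "RXN-B"}

# A gene annotated only as "Oxidoreductase" is wrongly tied to RXN-B.
spurious = match_by_name("Oxidoreductase", reference_names)
```

Running is_nonspecific on a name before entering it is one simple way to catch such cases during curation.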
The following procedures should be followed regarding entry of enzyme names ending in roman nu-
merals and arabic numbers, such as “pyruvate kinase II” and “tagatose-1,6-bisphosphate aldolase
2.”
Enzyme names ending in roman numerals fall into two categories: those in which the roman numerals
are an integral part of the enzyme activity, and those in which the roman numerals differentiate
isozymes that have the same enzyme activity.
PGDBs allow two sets of names and synonyms to be defined for enzymes: names for the enzyme
activity, and names for the protein. Consider that E. coli has two proteins with pyruvate kinase
activity, designated pyruvate kinase I and pyruvate kinase II. The name of the enzyme activity for
both of these enzymes is pyruvate kinase; however, the names of the proteins are pyruvate kinase I
and pyruvate kinase II.
To encode this situation in a PGDB, enter the protein name, pyruvate kinase I, in the top section
of the protein editor, by adding a common name to the protein. Enter the enzyme activity name,
pyruvate kinase, in the second section of the protein editor, under the first horizontal line, in the box
labeled “Enzyme activity name.” Note that every enzyme should have at least one enzyme activity
name, but protein names should be assigned less frequently, such as when different isozymes exist. In
addition, protein names should be defined when names exist for an enzyme that are specific to that
protein. Arabic numbers at the ends of enzyme names are also used to differentiate isozymes, and
should be treated in the same manner.
In contrast, consider the enzymes exodeoxyribonuclease I and exodeoxyribonuclease III. These enzymes
catalyze different reactions, and the roman numerals designate exactly which enzyme activity an
enzyme has. Therefore, these names should be entered in the “Enzyme activity name” field of a
PGDB, because the names are specific to the enzyme activity, not to the protein.
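
The distinction between the two sets of names can be summarized with a small Python model. The class EnzymeNames is a hypothetical illustration and does not correspond to the PGDB schema; it only encodes the rule that every enzyme has at least one activity name, while a protein name is optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnzymeNames:
    """Hypothetical record of the two kinds of names an enzyme can carry."""
    activity_names: List[str]            # at least one is expected
    protein_name: Optional[str] = None   # set only when a protein-specific name exists

    def __post_init__(self):
        if not self.activity_names:
            raise ValueError("every enzyme should have at least one enzyme activity name")


# Isozymes: same activity name, distinct protein names.
pyk1 = EnzymeNames(activity_names=["pyruvate kinase"], protein_name="pyruvate kinase I")
pyk2 = EnzymeNames(activity_names=["pyruvate kinase"], protein_name="pyruvate kinase II")

# Different activities: the roman numeral belongs to the activity name.
exo1 = EnzymeNames(activity_names=["exodeoxyribonuclease I"])
exo3 = EnzymeNames(activity_names=["exodeoxyribonuclease III"])
```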

3.2.6 Genes

Summary of information collected for genes:

• Common name and synonym(s)

• Superclass(es) of gene

• Gene product type (e.g., enzyme, regulator, leader, etc.). Evidence for product type (experimental
or predicted based on sequence analysis)

• Transcription direction (unspecified, forward or reverse)

• Left and right end position of gene on chromosome or plasmid

• Link to other DBs

• Summary

• Citation(s)

Gene Naming. For bacterial databases, automated programs will periodically
ensure that the capitalized gene name is a synonym for the name of the gene product (e.g.,
“TrpA” will become a synonym for the product of “trpA”). For most organisms, the frame name
for each gene frame will be the unique identifier assigned to the gene by the genome project (e.g.,
“HP0001” for an H. pylori gene). Therefore, there is no need to add this same identifier as a synonym
for the gene name.
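
The capitalization rule for bacterial gene names amounts to the following one-line Python function. This is only a sketch of the rule as stated above, with a hypothetical function name; it is not the automated program itself.

```python
def product_synonym(gene_name: str) -> str:
    """Capitalize the first letter of a gene name and leave the rest unchanged."""
    # str.capitalize() is avoided because it would lowercase the final letter ("Trpa").
    return gene_name[:1].upper() + gene_name[1:]
```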

3.3 Summaries and History Notes

Text stored in the Comment slot of a PGDB will be seen by the public under the heading “Summary”.
Should you want to record information that will not be seen by the public but that will be visible in
the Navigator to curators, put that information in the Comment-Internal slot using GKB-Editor.

Curators are encouraged to record historical information about PGDB objects to provide explanation
and justification of edits to a PGDB. Such commentary should be stored in a history note for the
object, created using the command Right-Click → Notes → Add to History. Example commentary
could describe the reasons for changing a gene function, a chemical structure, or a pathway definition.
History note contents are also public, and will be displayed with the date they were created and the
username of the creating curator.

3.3.1 Writing Style Guidelines

Summaries should be written in full sentences rather than in sentence fragments. Use multiple para-
graphs within summaries where the extra whitespace adds clarity and separates ideas. Embed citations
within summaries. Other than general commentary, most of the text of a summary should consist of
an assertion followed by a citation, then another assertion followed by a citation. Do not lump all the
citations at the end of the summary.

Guidelines for Gene-Product Summaries. We typically store summaries in the protein (or RNA)
product of a gene, rather than in the gene itself. The first sentence of the summary should summarize
the function of the gene product.
Except in rare cases, in which the gene product is particularly complex or physiologically significant, an
effort should be made to keep the length of “Summaries” (including references) to less than 500 words.
In longer summaries, the first paragraph should summarize in a few sentences what is known about
the gene product. Experimental support for these conclusions and information regarding protein
structure and regulation of expression should come later. In other words, “Summaries” should be
organized more like a news story than a scientific paper: conclusions should precede rather than
follow detailed information and evidence; the user should be able to gain the essential facts without
necessarily having to read the entire summary.
In those cases in which background information is considered to be an important aid to some users,
such information can be added but it should not be included in the first paragraph.
If relevant reviews are available, they should be included, and grouped together at the end of the
summary, such as: “Reviews: [Smith95,Jones98].”
Literature citations are an important and valuable part of summaries, but some effort should be made
to restrict their numbers to the most significant ones — try not to exceed 10–20 references within
most summaries (in some cases of extremely well-studied genes, more references may be appropriate).
Sentences, particularly those in the first paragraph, should not be interrupted by long listings of tens
of references.

3.3.2 Formatting Summaries

Pathway Tools supports a subset of HTML tags for encoding special characters and formatting (such
as italics) within summaries and names. Following is a list of tags that are accepted.
For example, to encode the name α-D-glucose, enter the characters: “&alpha;-D-glucose”.

• Greek letters

– alpha symbol: &alpha;
– beta symbol: &beta;
– delta symbol: &delta;
– gamma symbol: &gamma;
– omega: &omega;
– mu (micro): &mu;

• Text

– italicized text: <i>italicized text</i>
– bold text: <b>bold text</b>
– underlined text: <u>underlined text</u>
– supertext: <sup>supertext</sup>
– subtext: <sub>subtext</sub>

A few simple HTML tags, such as <P> for paragraph and <BR> for hard line break, are detected
by Pathway Tools but are removed from the displayed text and have no effect. Thus, to display a new
paragraph, hit the return key twice instead of using <P>.
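
To preview roughly how an encoded name or summary reads as plain text, a curator could use a short Python script such as the one below. It is a sketch that relies only on the Python standard library and is not part of Pathway Tools; the function name is hypothetical, and the script simply discards the formatting tags rather than rendering them.

```python
import html
import re


def preview_plain_text(encoded: str) -> str:
    """Approximate plain-text preview of an encoded summary or name.

    Drops <P> and <BR>, strips the supported formatting tags, and decodes
    HTML character entities such as &alpha;.
    """
    without_breaks = re.sub(r"<\s*(p|br)\s*/?>", "", encoded, flags=re.IGNORECASE)
    without_tags = re.sub(r"</?(i|b|u|sup|sub)>", "", without_breaks, flags=re.IGNORECASE)
    return html.unescape(without_tags)
```

For example, preview_plain_text("&alpha;-D-glucose") returns the string “α-D-glucose”.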

3.3.3 Say It in Your Own Words

Avoid word-for-word duplication from papers, and enclose any small duplication (one to three sen-
tences) within quotes and cite the source.

3.3.4 Citation Guidelines

Citations should be used within summaries to cite the source of the information just conveyed. Cita-
tions can also be entered independently of summaries, in which case they are stored in the Citations
slot of the object (such as a protein), and they provide general references for the object.

3.4 Saving Changes

If changes have been made to a database, an asterisk will appear to the left of the database name at
the top of the navigator window, for example, *MetaCyc. Changes are not saved automatically. It is
important to remember to use the “Save KB” command to save changes.

3.5 Evidence Codes

PGDBs include an evidence ontology that is designed to encode information about why we believe
certain assertions in a PGDB, the sources of those assertions, and the degree of confidence scientists
hold in those assertions.

A detailed description of the evidence ontology can be found in [1]. The evidence ontology can
be browsed within a PGDB by running the GKB Taxonomy viewer on the PGDB class hierar-
chy rooted at class Evidence. An HTML version of the evidence ontology is available at URL
http://bioinformatics.ai.sri.com/ptools/evidence-ontology.html.

An assertion could be the existence of a biological object described in a PGDB. For example, we would
like to be able to encode the evidence supporting the existence of a gene, an operon, or a pathway
that is described within a PGDB.
Curators can assign evidence codes by clicking on the “Evidence Code” button within most Pathway
Tools Editors. For example, within the protein editor, there is an Evidence Code button below the
Enzyme Activity box. This button allows the curator to assign an evidence code; a citation to the
source of the evidence should also be added when available (the citation is optional since some evidence
codes such as EV-AS-NAS are used when no citation is available). Whenever an evidence code has been
assigned, a new “Evidence Code” button will be drawn to allow assignment of an additional evidence
code. It is proper to assign multiple evidence codes if multiple types of evidence support a given
conclusion.
We offer several guidelines as to how to apply the evidence ontology to different classes of PGDB
objects.
Pathways. Assign code EV-Exp or one of its sub-codes to a pathway if some experimental evidence
supports the existence of the pathway. Code EV-Comp or its sub-codes should be used for pathways
whose presence is inferred computationally, such as by the PathoLogic program. We expect
that all pathways in the MetaCyc PGDB will be assigned an EV-Exp code because MetaCyc contains
experimentally elucidated pathways.
Proteins. Evidence codes should be assigned to a protein to define the evidence supporting the
function of the protein. For example, was the function of the protein elucidated using sequence
analysis, or using experimental methods; if the latter, what class of method was used?
Enzymes. Several evidence codes specific to enzymes capture evidence from experimental methods that are
specific to elucidating enzyme activities, such as EV-Exp-IMP-Reaction-Enhanced. These evidence
codes are actually stored within instances of the Enzymatic-Reactions class.

4 Pathway Curation

This section discusses curation of pathways in more depth.

4.1 Pathway Definitions

A metabolic pathway is a set of one or more enzymatic transformations (such as biosynthesis,
degradation, conversion, or utilization), as it occurs in a particular organism. Identical pathways that
exist in other organisms are not repeated in the MetaCyc database; instead a single pathway is labeled
by the multiple organisms in which it occurs. Metabolism of a substrate exogenously supplied to cells,
such as a vitamin, or drug, can also constitute a pathway.
Pathways can be classified into base pathways and superpathways. A base pathway is considered
a lowest-level pathway in the sense that it is not subdivided into smaller component pathways. Base
pathways can be linear, circular, or branched.

A superpathway is an aggregation of two or more pathways that are related in some way. A pathway
component of a superpathway is referred to as a subpathway. A subpathway is part of a superpathway,
and a superpathway is composed of subpathways. The subpathways of a superpathway can be base
pathways, or can themselves be superpathways. Some superpathways will contain additional reactions
and enzymes not found within the base pathways, such as reactions that connect two base pathways
together. PGDBs always contain links between associated base pathways and superpathways, and
those links are displayed by the Pathway/Genome Navigator toward the bottom of a pathway display
page.
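
Because the subpathways of a superpathway can themselves be superpathways, the relationship is recursive. The following Python sketch shows one way to collect the base pathways beneath a superpathway. The pathway IDs and the dictionary representation are hypothetical and are not how a PGDB stores these links.

```python
from typing import Dict, List, Set

# Hypothetical pathway IDs; each superpathway maps to its direct subpathways.
SUBPATHWAYS: Dict[str, List[str]] = {
    "SUPER-AB": ["BASE-A", "SUPER-B"],
    "SUPER-B": ["BASE-B1", "BASE-B2"],
}


def base_pathways(pathway: str, subpathways: Dict[str, List[str]]) -> Set[str]:
    """Return the base pathways reachable from pathway.

    A pathway with no entry in subpathways is treated as a base pathway.
    """
    children = subpathways.get(pathway)
    if not children:
        return {pathway}
    result: Set[str] = set()
    for child in children:
        result |= base_pathways(child, subpathways)
    return result
```

Note that this sketch ignores the additional connecting reactions that a superpathway may contain beyond those of its base pathways.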
There are two main types of superpathways: those whose subpathways are related by a common
substrate, and those whose subpathways are related by being analogous base pathways from different
organisms. For the first type of superpathway, its subpathways could be derived all from the same
organism, or they could be derived from multiple organisms. Multispecies superpathways that are
connected via a common substrate are potentially useful in metabolic engineering. The steps used in
creating superpathways from subpathways are described in the Pathway Tools User’s Guide, Volume
II, section 2.3.5.3.
More specifically, superpathways can be created based on the following types of relationships among
their subpathways: (1) subpathways that are physically connected through a common substrate (that
is, one subpathway produces the substrate, and another subpathway consumes it); (2) subpathways
that are unconnected, but that metabolize the same substrates (e.g. MetaCyc Superpathway of as-
partate and asparagine biosynthesis: interconversion of aspartate and asparagine); and (3) analogous
subpathways consisting of an analogous series of reactions catalyzed by the same enzymes, or by
analogous enzymes (e.g. MetaCyc Superpathway of isoleucine and valine biosynthesis).
Additionally, pathway variants exist in which the same substrate is synthesized or degraded using
different enzymes and/or cofactors, in the same or different organisms. Pathway variants share iden-
tical pathway names followed by a roman numeral. Many examples of pathway variants can be found
in MetaCyc by browsing the pathway ontology. These pathway variants can potentially be combined
into superpathways.

4.2 Defining Pathway Start and End Points

Several considerations guide the questions of how to define the start and end points of a pathway, and
of whether a given published pathway should be encoded in a PGDB as a single base pathway, or as
a set of base pathways within a common superpathway. The following rules should be used to guide
the creation and editing of base pathways and superpathways, when possible.

1. The substrate biosynthesized or degraded by a pathway should be a stable substrate, as opposed
to a transient intermediate. However, a pathway could show the biosynthesis of a stable
intermediate that is a precursor for the biosynthesis of other substrates.

2. Biosynthetic pathways should begin with an intermediate of central metabolism. These inter-
mediates are the 13 precursor metabolites: glucose-6-phosphate, fructose-6-phosphate, ribose-5-
phosphate, erythrose-4-phosphate, triose phosphate, 3-phosphoglycerate, phosphoenolpyruvate,
pyruvate, acetyl CoA, alpha-oxoglutarate, succinyl CoA, oxaloacetate, and sedoheptulose-7-
phosphate. A pathway link (see Section 4.3) should be created to indicate the pathway that
produces the precursor metabolite at the start of the pathway.

3. Degradative pathways that produce an intermediate of central metabolism should stop at that
point. Some degradative pathways may not produce intermediates of central metabolism, but
instead produce compounds that are excreted from the cell. If appropriate, a pathway link (see
Section 4.3) should be created to indicate the pathway that processes the resulting metabolite
at the end of the pathway.

4. Another class of pathways applies where compounds are metabolized in a dissimilatory
manner for the production of energy. Examples of these pathways include sulfate
reduction and ammonia oxidation. Such pathways are unlikely to consume or
produce intermediates of central metabolism. They should start with the natural form
of the compound being used as an electron donor or acceptor, and end with the compound
generated at the end of the electron transport process, which would generally be secreted by the
organism.

5. Very large or complex pathways should usually be defined as superpathways that combine several
smaller base pathways, where those base pathways are divided at breakpoints. Dividing a large
or complex pathway in this fashion is particularly useful to optimize the accuracy of PathoLogic
predictions, especially in cases where it is the base pathways, rather than the entire pathways,
that tend to be present as units across different organisms. If the pathway was defined as one
large base pathway, rather than as a set of base pathways connected through a superpathway,
PathoLogic would be unable to predict the presence of the smaller base pathways independently
in different organisms. Breakpoints for large base pathways can be chosen based on various
criteria such as: branch point substrates; substrates involved in regulation; a major metabolite
that is further metabolized; the cellular compartment in which the reactions occur (organelle
or cytosol); a transport segment or a utilization segment; or the type of reaction (oxidative or
non-oxidative).

6. If a published pathway contains several pathways that are already defined in a PGDB as base
pathways, it should be represented as a superpathway.

7. If the pathway contains too many reactions to conveniently represent in one base pathway, it
should be broken into two or more base pathways, which should be linked together by pathway
links (see Section 4.3).
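
As an illustration of rule 2, the following minimal sketch checks whether the first substrate of a biosynthetic pathway is one of the 13 precursor metabolites. It is not part of Pathway Tools: the function name is hypothetical, and compounds are compared by plain name strings, whereas a real PGDB would identify them by frame ID.

```python
# The 13 precursor metabolites of central metabolism, as listed in rule 2.
PRECURSOR_METABOLITES = frozenset({
    "glucose-6-phosphate", "fructose-6-phosphate", "ribose-5-phosphate",
    "erythrose-4-phosphate", "triose phosphate", "3-phosphoglycerate",
    "phosphoenolpyruvate", "pyruvate", "acetyl CoA", "alpha-oxoglutarate",
    "succinyl CoA", "oxaloacetate", "sedoheptulose-7-phosphate",
})

# Lower-cased copy so that the comparison ignores capitalization.
_PRECURSORS_LOWER = frozenset(name.lower() for name in PRECURSOR_METABOLITES)


def starts_at_precursor(start_compound):
    """Return True if the named start compound is one of the 13 precursors.

    Hypothetical helper: matches on name strings only (an assumption).
    """
    return start_compound.strip().lower() in _PRECURSORS_LOWER
```

A pathway whose start compound fails this check is not necessarily wrong, but it is a candidate for a pathway link to the pathway that produces its starting substrate.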

4.3 Pathway Links

Pathway links are a mechanism for indicating substrate connections among pathways. Pathway links
are displayed as arrows connecting an input or output substrate in a pathway to the name of a second
pathway in which that substrate is metabolized. Pathway links can illustrate the source pathway for an
input substrate, or the destination pathway for an output substrate (assuming that it is not completely
metabolized). Clicking on the second pathway name takes the user to that pathway's display page.
Links can be created to another base pathway, a superpathway, or to a class of substrates that derive
from the pathway (e.g. MetaCyc glycolysis I). The steps used in creating pathway links are described
in the Pathway Tools User’s Guide, Volume II, section 2.3.5.3.
Note that although it is helpful to explain the origin or fate of substrates in the pathway summaries
field, this unstructured text is not computationally useful, and thus cannot replace the use of pathway
links.

4.4 Limitations

Metabolic pathways involving macromolecules and cellular structures may be difficult to represent
in PGDBs. This is a factor in pathway selection. Pathways that involve reactions that synthesize,
degrade, or modify small molecule components of macromolecules and cellular structures can be repre-
sented. However, some processes may be beyond the scope of PGDBs, which focus on small molecule
metabolism.

4.5 Database Searching Strategies

For those curating MetaCyc, please see Section 6.1 for additional information regarding MetaCyc
specifically.
To search available databases for information regarding a particular pathway, begin with general
keywords related to the pathway name (e.g., creatinine degradation to
formate methionine biosynthesis, etc.). You may need to search using several alternative names,
for example: toluene degradation, toluene oxidation, toluene catabolism, etc. Adding
search terms such as anaerobic will help avoid irrelevant hits. As in all searches, if
you get too many hits, narrow your search by adding keywords, such as bacteria to
limit the search to bacterial pathways, or a species name to limit the search to pathways
in a particular species. Some databases allow the use of wildcards, which are truncated
names followed by a special character such as * that stands for any ending
of the word. For example, bacter* would include bacteria, bacterial, bacterium, etc. Different
databases may use different wildcard characters, so it is always useful to consult the database's search
description/overview. For a description of PubMed searching strategies see the following URL:
http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching.
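
The effect of a trailing wildcard can be previewed locally before running a search. The sketch below uses Python's standard fnmatch module to show which words a truncated term such as bacter* would cover. It only illustrates the idea: the helper name is hypothetical, and each literature database applies its own wildcard rules.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, words):
    """Return the words matched by a shell-style wildcard pattern.

    Matching is case-insensitive here; real databases may behave differently.
    """
    lowered = pattern.lower()
    return [word for word in words if fnmatchcase(word.lower(), lowered)]


candidates = ["bacteria", "bacterial", "bacterium", "archaea"]
print(wildcard_matches("bacter*", candidates))
# ['bacteria', 'bacterial', 'bacterium']
```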

If you know a little bit about the pathway, such as the names of intermediates, or enzymes involved in
the pathway, you should also search for articles using these keywords. Furthermore, if you know the
names of the researchers who studied the pathway, you can search for articles using a combination of
their names and the substrate or enzyme they are working on. Once you have some articles of interest,
you can usually find other related articles by 1) looking at their references, and 2) using SciSearch
to find articles that cite them. Often an article's full text is available online in HTML format. These
HTML-formatted articles often have web links to the articles that they reference as well as to articles that
cite them, which is a very convenient way of finding additional articles. Once you have identified and
gathered the relevant articles for a pathway, try to find their PubMed ID (PMID) numbers and label
the articles with them, as these numbers are the easiest way to enter reference information into PGDBs.

4.6 Pathway Entry

Once you have enough papers to put a pathway together, draw out the pathway, making note of the
chemical reactions, chemical names and structures of the metabolites, and enzyme names if known.
Try to identify any EC numbers that may have been assigned to the reactions in the pathway. It is
very important to find out whether the chemicals and/or reactions already exist in the database. This
may not be straightforward, as chemicals may have many different names. MetaCyc already includes
all of the reactions which have been assigned EC numbers, so you may want to search the IUBMB web
site carefully to find out whether existing EC reactions fit any of the reactions in the new pathway.
Often authors are not aware of such EC numbers and do not include them in their publications. Make
sure you do not create duplicate chemicals and/or reactions in the database, as duplicates will cause
problems in the future. After you have identified all existing reactions, you may need to create new
reactions and chemicals. Write down the frame IDs of both existing and new reactions and add them
to the drawing of the pathway which you have prepared. This will greatly facilitate the creation of
the new pathway.
Once you have finished these steps, you are ready to define the new pathway. Do not forget to assign
appropriate class and evidence codes. An ideal reference for the pathway evidence code is a recent review
article that cites all the relevant experimental literature. In such a case, use the code EV-EXP-TAS.
Make sure you assign the appropriate organisms to the pathway, and mark any hypothetical reaction
as such. See section xxx for guidelines for writing summaries for new pathways. Once the pathway
is defined in the PGDB, you need to enter enzymes and genes for the various reactions. Review the
papers in greater detail while taking notes of the relevant information (see Section 3.2). As
mentioned earlier, it is best if you are already aware of the type of information you will enter into the
PGDB. This way, you can skim or read the papers for the relevant information and take notes in a
form that mirrors how the PGDB is organized, which expedites entering the information into the
database. In addition, you will need to cite the sources of the information you enter into PGDBs. If your
papers have PMID reference numbers, the full reference information will be imported
automatically.

5 EcoCyc-Specific Information

5.1 E. coli Gene Frame Names

All frame IDs for E. coli genes are derived using numeric sequences. Some gene frame names are the
same as the identifier used in Rudd’s EcoGene database; they use the prefix “EG”. Genes with prefix
“G” were either (a) created by the MBL group when they encountered new genes in the literature, or
(b) created by SRI — virtually all of these genes were created in the course of including data from
the Blattner GenBank entry for the full E. coli sequence into EcoCyc. Examples: EG10115, G1001.
Identifiers used by other databases are often listed in the synonyms slot for the gene, e.g., the Blattner
“b” identifiers and later assigned “EG” identifiers.
Dr. K. Rudd developed a naming scheme for E. coli ORFs, in which ORF genes are named beginning
with “y.” The rest of the name encodes the position of the gene on the E. coli chromosome. These
names are retained as synonyms for genes whose functions are later determined.
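
The two prefixes can be told apart mechanically. The sketch below classifies a gene frame ID as “EG” or “G” with a regular expression. The pattern (prefix followed only by digits) is an assumption inferred from the two examples above, EG10115 and G1001, and the function name is hypothetical.

```python
import re

# Assumed shape of an E. coli gene frame ID: "EG" or "G" followed by digits.
_GENE_FRAME_ID = re.compile(r"^(EG|G)(\d+)$")


def gene_frame_prefix(frame_id):
    """Return "EG" or "G" for a matching frame ID, or None otherwise."""
    match = _GENE_FRAME_ID.match(frame_id)
    return match.group(1) if match else None


print(gene_frame_prefix("EG10115"))  # EG
print(gene_frame_prefix("G1001"))    # G
```

Identifiers from other databases, such as the Blattner “b” numbers, do not match this pattern and yield None.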

5.2 Interrupted Genes

Interrupted genes in E. coli are represented as follows. For each piece of the coding region, a separate gene
frame is created. The same EG number is stored as a synonym for each interrupted gene, but the
two interrupted genes have different b-numbers. In addition, a name of the form “ilvG” is stored as a
synonym, but a name of the form “ilvG_1” is stored as the common-name. The interrupted? slot
is set to T, and the start-base and end-base delimit the segment of the coding region defined by that
gene frame.

6 MetaCyc-Specific Information

MetaCyc describes the union of pathways across a range of different organisms.

6.1 Database Searching Strategies for MetaCyc

The database searching strategies discussed below are unique to curating MetaCyc since MetaCyc
is currently the only PGDB covering metabolic pathways from multiple species. Section 4.5 above
should be reviewed in addition to this section prior to embarking on database searching.
While researching a pathway you may find that the pathway was studied in many different species,
just a couple, or maybe only one. If the pathway has been studied in more than one species, choose
the model species in which the pathway has been studied the most and thereby is best defined. This
way you can build the most complete picture of a pathway. Pathways in MetaCyc may be associated
with one or more species. In order for a species to be associated with a pathway, each of the reactions
composing the pathway must have been either experimentally demonstrated or hypothesized to occur
in the given species. Enzymes in MetaCyc, on the other hand, are associated with only one species
because it is assumed that the same enzyme in different species will have slightly different properties.
For an enzyme to be associated with a pathway (i.e. displayed in the pathway navigator window over
the reaction it catalyzes) the enzyme and the pathway both must be associated with the same species.
This further emphasizes the reason for studying all of the reactions and enzymes for a pathway in
one particular species. Adding many different pathways to MetaCyc is a higher priority than
finding all of the species in which a given pathway occurs. However, if during your research you
discover that many species have the same pathway, list all of those species for the pathway. To
reduce redundancy in MetaCyc, the same pathway in multiple species is entered only once. If you find
that other species use a slightly different pathway, a new pathway can be created for these species.
For example, you may note that the glutamate fermentation-the hydroxyglutarate pathway includes
a list of several species in which the pathway is found, but it lists only one example for each enzyme.
The priority was to find enzyme information, for each of the pathway reactions, from one of the species
listed in the pathway, and not to include enzyme information for each of the species.

6.2 Species Information

Unlike the EcoCyc KB, which describes pathways for E. coli only, MetaCyc describes pathways for
many different organisms. When entering a pathway into MetaCyc, you should record the one or more
species in which the pathway is known to occur in the Species slot of the pathway, which can be done
using the Species field of the Pathway Info Editor. The list of species recorded for a pathway should be
considered a representative list, not necessarily an exhaustive list of all species in which the pathway
occurs.
When creating the enzymes that catalyze steps in the pathway, specify the species from which the
enzyme was obtained. Each enzyme in MetaCyc should define a protein from a single species. You
may create enzymes that catalyze a reaction even if those enzymes are from other species besides those
in which the pathway occurs, but those enzymes will not be displayed within the pathway diagram.
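
The display rule described above (an enzyme appears in a pathway diagram only when its species is among the pathway's species) can be sketched as a simple filter. This is an illustration under stated assumptions, not the Pathway Tools display code: the function name and the enzyme names in the example are hypothetical, and species are compared as plain strings.

```python
def enzymes_shown_in_pathway(pathway_species, enzymes):
    """Return the names of enzymes whose species is recorded for the pathway.

    pathway_species: iterable of species names recorded in the pathway.
    enzymes: iterable of (enzyme_name, species) pairs; each MetaCyc enzyme
             is associated with exactly one species.
    """
    allowed = set(pathway_species)
    return [name for name, species in enzymes if species in allowed]


shown = enzymes_shown_in_pathway(
    ["Escherichia coli"],
    [("enzyme-A", "Escherichia coli"), ("enzyme-B", "Bacillus subtilis")],
)
print(shown)  # ['enzyme-A']
```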

6.2.1 Taxonomic Range Information

Should the curator be reasonably certain that a pathway should be expected to occur in a limited
set of taxa (e.g., plants only, or animals only) a high-level taxonomic classification for its expected
taxonomic range should be entered in the Taxonomic-Range slot of the pathway frame by using the
Expected Taxonomic Range field in the Pathway Info Editor. Here you can limit the taxonomic
range of organisms in which the pathway occurs by selecting the appropriate taxonomic domain(s) or
subdomain(s). You can determine the most specific higher level domain for each genus and species
by using the NCBI Taxonomy Browser. The higher level domain will usually be a phylum or class.
If the species is not present in the NCBI taxonomy, select only the highest level, a superkingdom
(Archaea, Bacteria, or Eukaryota). Only enter this information if you are reasonably confident that
the distribution of the pathway will be limited.
The Taxonomic-Range slot is primarily intended for use by the PathoLogic pathway prediction program
during generation of new PGDBs. When an expected taxonomic range has been assigned to a
MetaCyc pathway, the domains or subdomains in that range
can be compared with the taxonomy of the organism in the new PGDB. PathoLogic could then assign a
lower probability score to the pathway if the organism for the PGDB is not in the expected taxonomic
range of the MetaCyc pathway. This form of reasoning can help to exclude inappropriate pathways
from the new PGDB. Expected Taxonomic Range also gives MetaCyc users a taxonomic perspective
of the pathway that they might not otherwise have if they are not familiar with some of the organisms
in the species slot.
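
The comparison that the Taxonomic-Range slot makes possible can be sketched as a lineage test. The code below is a hypothetical illustration only. The text above says just that PathoLogic could lower a pathway's score when the organism falls outside the expected range; how PathoLogic actually scores pathways is not described here, so the sketch stops at the membership test. The lineage shown is abbreviated for illustration.

```python
def in_expected_range(organism_lineage, taxonomic_range):
    """Return True if the organism falls inside the expected taxonomic range.

    organism_lineage: taxon names from superkingdom down to the organism.
    taxonomic_range: set of taxon names recorded for the pathway; an empty
                     set means no restriction was recorded.
    """
    if not taxonomic_range:
        return True
    return any(taxon in taxonomic_range for taxon in organism_lineage)


# Abbreviated, illustrative lineage for E. coli.
ecoli_lineage = ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]
print(in_expected_range(ecoli_lineage, {"Viridiplantae"}))  # False
print(in_expected_range(ecoli_lineage, {"Bacteria"}))       # True
```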

6.3 E. coli Pathways in MetaCyc

MetaCyc contains many E. coli enzymes and pathways. Pathways and enzymes of E. coli K–12 are
automatically imported from EcoCyc into MetaCyc periodically. To ensure that updates are not lost
as a result of that automatic import process, please observe these rules when updating E. coli K–12
enzymes and pathways in MetaCyc:

• Please consider EcoCyc to be the authoritative source for K–12 enzymes. Do not update K–12
enzymes in MetaCyc. Update them in EcoCyc.

• For K–12 pathways, the only slots you should update in MetaCyc are:

– Species, Summary, Common-Name, Synonyms, Citations, History

You should not change the pathway topology (set of reactions within a pathway or their relative
arrangements). The most likely change to a K–12 pathway in MetaCyc is extending that
pathway to another species.
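
A curator or a review script could check a planned edit against the list above before making it. The following sketch is a hypothetical helper, not a Pathway Tools function; slot names are compared as plain strings, and the Reaction-List slot in the example stands for a topology change.

```python
# Slots of a K-12 pathway that may be updated in MetaCyc, per the rule above.
K12_EDITABLE_SLOTS = frozenset({
    "Species", "Summary", "Common-Name", "Synonyms", "Citations", "History",
})


def disallowed_k12_edits(changed_slots):
    """Return, sorted, the changed slots that should be edited in EcoCyc instead."""
    return sorted(set(changed_slots) - K12_EDITABLE_SLOTS)


print(disallowed_k12_edits(["Summary", "Reaction-List"]))  # ['Reaction-List']
```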

6.4 Pathways from Other Pathway Databases

When entering pathways into MetaCyc from other pathway DBs, please:

• Enter a citation in the Citations slot of the pathway, for the appropriate DB

• Enter a web link from the MetaCyc pathway to the other pathway DB

• Describe in the summary field of the pathway any changes that you made to the pathway, and
why

6.5 Proteins as Substrates in MetaCyc

Some reactions in MetaCyc pathways involve protein substrates such as acyl carrier protein (ACP)
or thioredoxin. These protein substrates should be entered as protein class frames within MetaCyc.
The protein class frames are generic, species-independent descriptions of these proteins. The reason
for this approach is that a number of different MetaCyc reactions may reference a particular protein
as a substrate, but those reactions may be from different pathways in different species, leading to
confusion as to exactly which version of the protein is intended. In addition, when MetaCyc pathways
are predicted in organism-specific PGDBs, these reactions are used in the context of that specific
organism and must not refer to protein instances from other organisms (e.g. a human pathway
utilizing an E. coli protein as a substrate).
If organism-specific instances of such proteins exist in MetaCyc (e.g. specific E. coli thioredoxins),
these instances should reside within the appropriate protein class. This will enable the user to see a
list of all such instances by clicking on a protein substrate in a pathway diagram.

6.6 Curation with Classification Systems

PGDBs contain several classification systems to which individual objects can be assigned. For example,
the classification system for pathways divides all metabolic pathways into over 100 different
categories and sub-categories, such as biosynthesis, biosynthesis of amino acids, and biosynthesis of
carbohydrates. Individual pathways can be assigned to those classes to facilitate retrieval and drill-down
by users into categories and sub-categories.
Classification systems exist for pathways, chemical compounds, reactions, and genes. In some cases
it is appropriate to assign an object to more than one class, such as in the case of a multifunctional
protein.
When updating an object, please consider whether its assignment within a classification system should
be revised.

6.6.1 Gene Classes

The gene class hierarchy is the MultiFun system developed by Monica Riley and her colleagues [5].
The gene class hierarchy has several classes specific for classification of uncharacterized genes. The
general gene class “ORF” has three child classes as of the most recent updates to the hierarchy (in
July 2003):

1. conserved ORF
2. conserved hypothetical ORF
3. nonconserved ORF

As we describe them to the users: “Conserved ORFs have homologs, usually in other organisms, where
one or more of those homologs have known functions, but the sequence similarity of a conserved ORF
to its homologs is not strong enough to permit inference of function of the conserved ORF. In contrast
conserved hypothetical ORFs have homologs, but none of those homologs have known functions.” The
remaining category, nonconserved ORF, is used to refer to genes that do not have assigned gene class
functions and which do not have significant similarity to other genes.
As guidelines for naming the products of these uncharacterized genes, we suggest that a product of
a “conserved ORF” be called “conserved protein,” a product of a “conserved hypothetical ORF”
be called “conserved hypothetical protein,” and a product of a “nonconserved ORF” be
called “hypothetical protein.” In cases where some additional information has been recorded, but this
information is insufficient for prediction of protein function, the additional information may be appended
to the standard name; for example, “conserved protein with an unusual xyz domain.”
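
The naming guideline amounts to a lookup from ORF class to standard product name, with optional appended detail. A minimal sketch follows; the function and dictionary names are hypothetical, and the class names are written as the plain strings used in the text.

```python
# Suggested product names for the three child classes of "ORF".
PRODUCT_NAME_BY_ORF_CLASS = {
    "conserved ORF": "conserved protein",
    "conserved hypothetical ORF": "conserved hypothetical protein",
    "nonconserved ORF": "hypothetical protein",
}


def product_name(orf_class, extra_info=None):
    """Build a product name, appending any additional information."""
    base = PRODUCT_NAME_BY_ORF_CLASS[orf_class]
    return f"{base} {extra_info}" if extra_info else base


print(product_name("conserved ORF", "with an unusual xyz domain"))
# conserved protein with an unusual xyz domain
```
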
The following explanations of selected gene class definitions were provided by Dr. Gretta Serres:

• The “cell structure” category contains both genes encoding components of the structure and
genes encoding products involved in the biosynthesis of the structure.

• The terms “trigger” (3.4) and “modulator” (3.5) refer to specific compounds/proteins that either
trigger a response or modulate a response. For example, in the case of lacI, lactose acts as the
trigger and CRP acts as a modulator of the regulation.

• Any transcriptional activator or repressor should be assigned a class of 3.1.2.2 or 3.1.2.3, or both
for dual regulators.

• The “adaptation to stress” class is intended to cover the classical inducers of stress such as
osmotic pressure, temperature, and starvation.

• The “protection” class is intended to cover those inducers related to cell killing or stress induced
by compounds or chemicals.

• The “defense/survival” category is intended to cover virulence factors.

• The “extrachromosomal origin” category is intended to include chromosomal genes affecting
functions of genes of extrachromosomal origin as well as genes that are of extrachromosomal
origin.

7 Organism Summary

The organism summary display differs from most of the other displayed information in that it is not
computed directly from the KB at runtime. Rather, it is created by a part-manual, part-automated
procedure, and the resulting statistics information is cached to allow fast re-display.
The information for every organism lives in a separate KB. Each KB has a class Organisms, which
tends to have exactly one instance in a given KB, namely the frame that describes the organism. The
frame name for that frame is the PGDB unique ID for the organism, such as ECOLI for E. coli. In
its Genome slot, that frame will point to the genetic elements, which often is just one chromosome.
Both the organism and chromosome frames have a slot called Cached-Statistics, in which the values are
stored that will be displayed in the organism summary. For example, the numbers of genes coding for
proteins and coding for RNAs are stored. Most of these values are filled in automatically by running
a stats-collecting function over the KB.

8 Update Propagation among KBs

Many KB updates are non-species-specific, and therefore are intended to apply to all KBs. It is
convenient to make these changes only once in one KB, and rely on an automated mechanism for
propagating such updates to all the other KBs. In general, it is recommended that all such updates be
performed in MetaCyc. Since this is not always convenient, the update propagation mechanism will
take care of propagating certain kinds of updates from clone KBs (including EcoCyc) “up” to MetaCyc.
However, all schema changes must be performed in MetaCyc — otherwise they will be undone during
update propagation. For instance-level updates, update propagation uses the transaction logs collected
in Oracle to determine which of two updated values is more recent, and therefore in which
direction the update should be performed.
Update propagation is invoked manually and, for the time being, relatively infrequently. It is a time-
intensive operation, which can take on the order of a half hour per KB. The update propagation
mechanism does not save any of the updated KBs — it is up to the user to examine the output to
make sure that it looks reasonable, and to save the affected KBs.

8.1 Invoking KB Updating

To invoke the update propagation mechanism, call (propagate-kb-updates). This is the normal
mode of update propagation and will cause updates to be propagated among all known KBs. Alter-
natively, to propagate only among a subset of KBs, the above function can be called with a list of
orgkb-descriptors (an orgkb-descriptor can be found by calling find-org on the org-id). For example,
to propagate updates only between EcoCyc and MetaCyc (MetaCyc is always included automatically
in update propagation), you would type (propagate-kb-updates (list (find-org 'ecoli))).
Update propagation should be done with care, and the output examined fairly carefully before any
KBs are saved (update propagation does not automatically save the changed KBs). There may still
be bugs in the mechanism, and we want to take pains to avoid losing any data or undoing important
changes.

8.2 Overview of the Updating Procedure

Update propagation involves the following sequence of events:

• Any schema classes (classes can be explicitly specified to be either schema or data classes) in
MetaCyc that are not present in the clone KBs are copied to the clones.

• All of MetaCyc’s slotunits are copied over to all other clones. This will overwrite the correspond-
ing slotunits in the clone KBs. Slotunits present in a clone KB but not in MetaCyc will not be
deleted, but a warning will be printed.

• All instances of the Databases class will be copied from MetaCyc to all clone KBs, overwriting
the existing frames if they exist.

• The class hierarchy of each clone KB is “molded” to match that of MetaCyc, i.e. each class
will have the same set of parents in all KBs. This operation doesn’t copy any classes, it just
potentially alters the parents of existing classes.

• For each clone KB, we loop through all instances of designated classes, looking at certain slots to
see which updates need to be propagated “upwards” to MetaCyc. The set of considered classes
and slots is listed below. The mechanism of this is as follows:

– If a frame in a clone KB is not present in MetaCyc, and either it has never been deleted
from MetaCyc or the creation date in the clone KB is more recent than the deletion date
in MetaCyc, then the frame is copied to MetaCyc.
– If the frame’s parents are different in the clone KB and MetaCyc, and the last change to
the parents is more recent in the clone KB than in MetaCyc, then the parents in MetaCyc
are changed to match those in the clone KB. Any parents not present in MetaCyc that are
necessary for this operation (presumably, these are all data classes) will first be copied to
MetaCyc.
– For each considered slot, if the slot values and/or annotations are different between the
two KBs and the modification date is more recent for the clone than for MetaCyc, then all
slot values and annotations (except for citation and comment annotations, which are often
species-specific) for that slot in that frame are copied from the clone KB to MetaCyc.

While looping through these frames, we collect a list of frames and slots that either were updated
in MetaCyc or that differ between the two KBs (and were not updated because the MetaCyc
changes were more recent). This list is used in the next phase of the update.

• After updates have been propagated “upwards” from all clones to MetaCyc, we iterate through
all the clone KBs again, propagating changes “downwards” from MetaCyc to the clones:

– We “mold” all instances of the designated classes in the clone KB so that each instance’s
parents match those in MetaCyc (because any more recent changes in parentage in any of
the clone KBs have already been copied to MetaCyc, we cannot lose any changes here). If
an instance exists in a clone KB, but its new parents do not (these would have to be data
classes, otherwise they would already have been copied), we copy those parents over from
MetaCyc.
– We iterate through the list that we have collected of frames and slots that either have been
updated in MetaCyc or potentially need updating in one or more clones, comparing values
between the clone and MetaCyc. If there is any difference, the values and annotations
(except for citation and comment annotations, which are often species-specific) are copied
over from MetaCyc to the clone. If the frame does not exist in MetaCyc, then it must have
been deleted recently (otherwise it would already have been copied there), so it is deleted
from the clone KB.
– For all slot values that are copied to a clone KB and that represent frames in MetaCyc, we
check to see if the corresponding frame exists in the clone KB. If not, we copy that frame to
the clone KB also. The exception is if the frame is a modified protein, and the unmodified
form is not present in the clone KB either, in which case we do not import either frame.

While update propagation is being performed, messages about any changes are printed to the ter-
minal window. The user should examine these messages to make sure that the changes seem
reasonable before saving the KBs. It is also good practice to save this output to a file in the
$GPROOT/ecocyc/metacyc/released/kb/ directory, so that it can be examined in case changes have
been lost or updates fail to be propagated (more so that we can determine if there is a bug in the
propagation code rather than to recover any data, which can be retrieved from the change logs stored
in Oracle). One known failing of the update propagation mechanism is that if a different change has
been made to some slot of a frame in two different clone KBs (but not in MetaCyc), then the value that
prevails is that from the last clone in the orgkb list (i.e. when updates are propagated upward from a
clone, they may overwrite updates previously propagated upward from earlier clone KBs), and then
of course that value is propagated downwards to all other clones. It is largely to avoid this problem
that we recommend that the majority of changes be performed in MetaCyc rather than in any of the
clones.
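
The failing described above can be reproduced with a toy model of the upward phase for a single slot of a single frame. The sketch below is not the propagation code: the function name and the dictionary layout are hypothetical, timestamps are plain integers, and it assumes that MetaCyc's recorded modification time is not advanced when a clone value is copied upward. That assumption is consistent with the behavior described but is not confirmed by this guide.

```python
def propagate_upward(metacyc, clones):
    """Return the MetaCyc slot value after processing every clone in order.

    metacyc: dict with "value" and "modified" (the MetaCyc slot state).
    clones:  list of such dicts, in orgkb-list order.
    """
    value = metacyc["value"]
    for clone in clones:
        differs = clone["value"] != value
        # Assumption: compared against MetaCyc's original modification time.
        newer = clone["modified"] > metacyc["modified"]
        if differs and newer:
            value = clone["value"]  # may overwrite an earlier clone's update
    return value


metacyc = {"value": "old name", "modified": 1}
clone_a = {"value": "name A", "modified": 5}
clone_b = {"value": "name B", "modified": 3}
print(propagate_upward(metacyc, [clone_a, clone_b]))  # name B
print(propagate_upward(metacyc, [clone_b, clone_a]))  # name A
```

In both runs the clone processed last wins, even when its change is the older of the two, which is why making such changes directly in MetaCyc is the safer practice.
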
The classes and slots for which update propagation is done are listed below. This set of classes and
slots is currently hard-coded into the update propagation code.

• Compounds: common-name, synonyms, overview-node-shape, smiles, pka3, pka2, pka1,
structure-atoms, structure-bonds, superatoms, systematic-name, n-1-name, n+1-name,
molecular-weight, comment, citations, dblinks, chemical-formula, display-coords-2d, charge,
cas-registry-numbers, atom-chirality, atom-charges, gibbs-0, and aromatic-rings.

• EC-Reactions and Unclassified-Reactions (i.e. this excludes Transport-Reactions, which are not
propagated): common-name, synonyms, requirements, right, left, substrates, dblinks, deltago,
ec-number, ec-number-old, and official-ec?.

• Super-Pathways: common-name, synonyms, primaries, polymerization-links, layout-advice,
net-reaction-equation, class-instance-links and disable-display.

• All Pathways except for Super-Pathways (which are covered above) and 2Comp-Pathways
(which are not propagated): reaction-list, predecessors, common-name, synonyms, primaries,
polymerization-links, layout-advice, net-reaction-equation, class-instance-links and disable-
display.

9 Database Release Process

The database release process is performed by BioCyc project staff at SRI. At each release, SRI curators
will need to run the Correctify program and fix any errors identified, update the database release notes,
and update general database information that is displayed to the users.

9.1 Correctify

The correctify program performs checks and corrections of the data. This program is run prior to each
release. Correctify builds indexes, checks citation format, checks correspondence between polypeptides
and genes, changes compound names to the corresponding frame ID in some slots, updates references
to compound name strings to refer to the corresponding frames, checks that physiological regulators
are on the master list of regulators, checks various cross-references, checks that all reactions listed
within pathways exist, checks pathway links, checks various types of reaction information, checks for
enzymatic reactions that are lacking a link to the reaction or the enzyme, checks compound structure
information and removes redundant bonds, calculates sub- and superpathways, checks links to modified
proteins, checks computed slot values, verifies replicon components and positions, and, finally, checks
various constraints.

To run the program, open the database (e.g., EcoCyc or MetaCyc) in the Pathway Tools Navigator
and then select Exit to get back to the Lisp prompt. Type (correctify-ecobase) at the Lisp prompt
and let it run. This usually takes a few minutes, but occasionally takes an hour or two. When
correctify has finished, type (eco) at the Lisp prompt to return to the Navigator, and use the
Revert KB command to undo any changes made during the run. Then examine the correctify-kb output
in the Lisp window and fix any errors that have been identified.
Repeat the process until all the errors have been addressed. Then run correctify-kb one last time,
return to the Navigator, and save the changes in the Navigator.
It is advisable to save a copy of the output of the correctify, kb-stats, and new-pathways programs in
case they are needed for reference at a future time.
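One way to keep such a copy is to redirect a program's printed output to a file. The following is a
minimal sketch in standard Common Lisp and is not part of Pathway Tools: save-output-to-file is a
hypothetical helper name, the file names are placeholders, and the sketch assumes that the program
being run prints its report to *standard-output*, which may not hold for every program.

```lisp
;; Hypothetical helper (not a Pathway Tools function): call THUNK with
;; *standard-output* redirected to the file PATH, then return PATH.
;; Assumption: the report is printed to *standard-output*.
(defun save-output-to-file (path thunk)
  (with-open-file (out path
                       :direction :output
                       :if-exists :supersede      ; overwrite an older copy
                       :if-does-not-exist :create)
    (let ((*standard-output* out))
      (funcall thunk)))
  path)

;; Self-contained demonstration, using a stand-in for a real report function.
(save-output-to-file "demo-report.txt"
                     (lambda () (format t "pathways: 3~%")))
```

Within a Pathway Tools Lisp session, a call such as (save-output-to-file "kb-stats-7.5.txt" #'kb-stats)
would then capture the statistics, provided the assumption about *standard-output* holds. If it does
not, copy the output from the Lisp window instead.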
The correctify program may also make some changes that will show up in the history record displayed
by invoking the Show Changes command in the editor software. This can cause confusion while
trying to understand the history of a particular object, and curators are advised to keep in mind the
possibility that a series of otherwise mysterious changes to an object may be due to the correctify
program and not to intentional manipulations of the object in question.

9.2 Release Notes

Updates to the EcoCyc and MetaCyc release notes are made by EcoCyc and MetaCyc curators at
SRI.

9.2.1 Database Statistics

The (kb-stats) and (new-pathways) queries are used to gather database statistics for the release notes.
If all updates to the database prior to the release have been completed, and the database has not yet
been saved to a file, open EcoCyc in the Pathway Tools Navigator and then select Exit to get back
to the Lisp prompt. At the Lisp prompt type (kb-stats) and let it run. The output generated will
contain the statistics that we record in the release notes, which are displayed on our web site. It is
helpful to save the output to a file for future reference.
To generate a list of new pathways entered since the last release, type (new-pathways) and let it run.
The output will provide the list for inclusion in the release notes. Do not include superpathways
among new pathways listed in the notes.
After the new pathway list has been gathered, run (new-pathways T) to reset the pathway list. Do
this only once per release.
It is advisable to save a copy of the output of each of these programs for reference.
The statistics may be gathered after the database has been frozen and saved to a file, even after
curation into the Oracle database has resumed. At a Lisp prompt, type
(select-organism :org-id 'insert organism here :version "insert version number here" :dbms-type :file)
with the correct organism and version number inserted, for example,
(select-organism :org-id 'ecoli :version "7.5" :dbms-type :file)
Then, at the next Lisp prompt, type (kb-stats) or (new-pathways) and proceed as previously described.
The statistics will be computed using the file, which includes the data that will be displayed in the
new released version of that database.

9.2.2 Updates to the Notes

The text of the release notes is found in the file
/home/hapuna4/aic/ecocyc/genopath/[version number]/biocyc.org:80/[KB-NAME]/release-notes.shtml
for example,
/home/hapuna4/aic/ecocyc/genopath/7.5/biocyc.org:80/ecocyc/release-notes.shtml
Copy the format of previous release notes; mention the names of newly curated pathways and other
significant improvements. Use a horizontal rule (HTML tag <hr>) to separate notes from different
releases.

9.3 Updates to the General KB Information

There is a special frame that contains general information about each database, such as the list of
authors, the copyright information, and citation information. This information should be checked and
updated, if necessary, prior to each release. To update this information, edit the frame using the GKB
editor. At a Lisp prompt type (efr 'ecoli) and the EcoCyc information frame will open in the frame
editor. To open the MetaCyc KB information frame, type the command (efr 'meta) and then edit
the frame as necessary.

10 Programming Hints

• Frame fetching hangs: If frame fetching hangs, try breaking and then calling
(gfp::reset-prefetching)

References

[1] P.D. Karp, S. Paley, C.J. Krieger, and P. Zhang. An evidence ontology for use in pathway/genome
databases. In R. Altman and T. Klein, editors, Proc Pacific Symposium on Biocomputing, pages
190–201, Singapore, 2004. World Scientific.

[2] P.D. Karp, S. Paley, and P. Romero. The Pathway Tools Software. Bioinformatics, 18:S225–S232,
2002.

[3] P.D. Karp, M. Riley, S. Paley, and A. Pellegrini-Toole. The MetaCyc database. Nuc Acids Res,
30(1):59–61, 2002.

[4] P.D. Karp, M. Riley, M. Saier, I.T. Paulsen, S. Paley, and A. Pellegrini-Toole. The EcoCyc
database. Nuc Acids Res, 30(1):56–8, 2002.

[5] M.H. Serres and M. Riley. MultiFun, a multifunctional classification scheme for Escherichia coli
K-12 gene products. Microb Comp Genomics, 5(4):205–222, 2000.

