March 2004 Newsletter

Well, the SEED Developers' meeting in San Diego has completed. With the start of these meetings, the SEED Project is now moving from a development stage into what might be called an "initial deployment stage". The development stage succeeded in producing a system that has some remarkable properties:

  1. It does support comparative analysis of a rich set of genomes. The nonredundant protein database used to develop similarities now has 1.92 million entries, and the entire database contains over 270 more or less complete genomes.
  2. It is being used in at least two sequencing/annotation efforts as a framework to analyze the data.
  3. It has been used successfully in several short courses as a framework for students to explore genomic data. We found that within two days, the central functionality could be conveyed, and students were able to focus on meaningful analysis issues. In one case, several potential research topics were exposed.
  4. It is good enough to justify initiating what we consider a key FIG effort: the Project to Annotate 1000 Genomes.
  5. The system can be easily installed. The process is not yet perfect, and problems arise in almost every case, but we can usually have a system up with only a few minutes of human effort (the time to transmit copies over the network and to build the database is substantial, but normally requires no intervention). It runs on both Macs and Linux systems using either Postgres or MySQL.
  6. It supports a rudimentary p2p update/data exchange capability. This part still needs work, but not too much. It will, like the configuration/installation process, become solid fairly rapidly as deployment begins.

We view the next stage as potentially quite exciting, and certainly pretty demanding. The most important development has been our increase in manpower: the SEED effort now includes four part-time researchers and we have increasing help from collaborators (perhaps, "participants in the project" is a better term). We are receiving help in enhancing the system from a number of individuals, and in some cases efforts have already begun to make it possible for independently developed tools to be easily integrated within the SEED. So, development and manpower seem likely to explode. Controlling the results to produce a reliable, distributable system that is useful to everyone will be the challenge.

The Initial Deployment Stage

So, what is going to happen during "initial deployment", and how will we know when that stage completes? Here is how we see it now:

  1. We will put up a public server in March. We plan on cross-linking with all of the larger integrations (and as many of the useful smaller efforts as we can). The original WIT effort at Argonne National Lab had to be brought down due to issues that arose in a hacker attack, and we plan on using this initial public SEED as a replacement to which other projects will link. This requires defining a protocol for exposing links. This should be completed in March.
  2. We will finally offer versions to any sequencing/annotation efforts that wish to use it (with the understanding that there will be some initial learning and help required to get things going smoothly). The ANL/FIG team will do its best to help this deployment, but we want everyone to realize that this effort is largely a volunteer one. We welcome the experience, since exposing shortcomings and correcting them is exactly what is needed at this stage, but we also expect users to view themselves as participating in the development.
  3. We will begin the Project to Annotate 1000 Genomes.
  4. A number of individuals cooperating within the SEED effort will together construct a web service to provide gene calls for prokaryotic sequence. We believe that these will be relatively accurate. The service is planned to reside at the University of Bielefeld in Germany. A second server to provide computation of similarities for genomes to be added to the SEED is planned at Argonne National Lab. Along with the SEED itself, these servers will make it possible for university sequencing/annotation projects to have access to state-of-the-art annotation tools.
  5. We will call an end to "initial deployment" when
    • we have successfully installed five systems in a row by having users just follow instructions (i.e., without our help),
    • sequencing/annotation efforts can routinely add new genomes and update versions of old ones,
    • over 20 users synchronize weekly using p2p operations, and
    • a set of at least 10 annotated systems is routinely distributed and updated via p2p operations.

The Project to Annotate 1000 Genomes

There are differences of opinion concerning what makes the SEED Project important. Ross' view is that the SEED should be thought of as a workbench for producing "subsystem annotations", and that these annotations will eventually be understood to be the most important development growing out of the SEED Project. The roles of

  1. supporting initial annotations and
  2. helping individual researchers explore genomic data

are important, but not nearly as important as the role of supporting subsystem annotations. At this point, we estimate that roughly 50% of genes in the public archives have solid function assignments, about 20% are completely uncharacterized, and about 30% have either very broad class characterizations or are overspecified. Many of the genes within this last 30% have been given accurate characterizations in review articles, but the contents of these review articles often fail to reach the public sequence archives. We propose to use the SEED as a framework for supporting development of "subsystem annotations", which can be viewed as the organized data to be included in a review article. From this perspective, reviews form the essential cutting edge for annotations, and the SEED should become a vehicle for reducing the effort to produce a review. By providing this service, along with the capability to easily exchange and export subsystem annotations, the SEED will facilitate the flow of assignments from reviews to sequence annotations. A researcher basing his career on analysis of a specific subsystem will have a framework for producing a sequence of reviews, using tools that support maintenance of the subsystem annotation (largely automating the addition of new genomes as they become available).

We believe that there are many, many individuals with extensive experience in a given subsystem who would be willing to produce and maintain one of these "subsystem annotations". Each such curated subsystem would amount to the raw data standing behind a detailed review article. The key issues that must be addressed are roughly these:

  • Once a detailed subsystem has been carefully constructed, it must not be lost. This is the most common worry. An expert using a system like the SEED is always concerned that some shift of IDs, new release or whatever will result in loss of the expert's work. By making it straightforward to export annotated subsystems and exchange them via p2p operations, we believe that we have addressed this issue.
  • The system must significantly reduce the effort required to extract and relate relevant data. It is not unusual for an expert to spend years developing a detailed picture of a subsystem. We believe that this can be dramatically reduced by the development of appropriate tools, but it is essential that the individuals developing tools participate closely in the annotation process; otherwise, it is likely that powerful tools will be developed that do not address the rate-limiting operations.

A minimal notion of "curated subsystem" would include

  1. a list of the functional roles included in the subsystem (for metabolic subsystems, this amounts to a list of catalytic domains) and
  2. a spreadsheet with genomes along one axis, and functional roles along the other. Each cell would contain a list of the genes in the given organism that implement a specific functional role.

The rows of the spreadsheet each represent the genes implementing the set of functional roles in a given organism, while each column presents a reliable set of genes implementing a specific functional role.
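
As a rough illustration of this minimal notion, the spreadsheet can be sketched in Python as a nested mapping from genome to functional role to a list of genes. This is only a sketch under our own assumptions: the class name, the method names, and the genome, role and gene identifiers are all hypothetical, and nothing here is taken from the SEED code itself.

```python
from collections import defaultdict


class CuratedSubsystem:
    """Hypothetical minimal curated subsystem: a role list plus a spreadsheet."""

    def __init__(self, name, roles):
        self.name = name
        self.roles = list(roles)  # ordered functional roles (the columns)
        # cells[genome][role] -> list of gene IDs implementing that role
        self.cells = defaultdict(lambda: defaultdict(list))

    def add_gene(self, genome, role, gene_id):
        """Record that gene_id implements role in genome."""
        if role not in self.roles:
            raise ValueError(f"unknown functional role: {role}")
        self.cells[genome][role].append(gene_id)

    def row(self, genome):
        """One spreadsheet row: the genes for every role in one organism."""
        cells = self.cells.get(genome, {})
        return {role: list(cells.get(role, [])) for role in self.roles}

    def column(self, role):
        """One spreadsheet column: the genes for one role, keyed by genome."""
        if role not in self.roles:
            raise ValueError(f"unknown functional role: {role}")
        return {
            genome: list(cells[role])
            for genome, cells in self.cells.items()
            if role in cells
        }


if __name__ == "__main__":
    # All identifiers below are made up for illustration.
    demo = CuratedSubsystem("demo subsystem", ["role_a", "role_b"])
    demo.add_gene("genome_1", "role_a", "gene_1")
    demo.add_gene("genome_1", "role_a", "gene_2")
    demo.add_gene("genome_2", "role_b", "gene_3")
    print(demo.row("genome_1"))
    print(demo.column("role_b"))
```

Each cell holds a list rather than a single gene because, as described above, more than one gene in an organism may implement the same functional role; an empty list in a row marks a role with no known gene in that organism.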

An extended notion would include a number of other items as well:

  1. a diagram representing the relationships between the functional roles (in the case of a pathway, a depiction of the reactions that make up the pathway),
  2. a discussion of the "variants" represented by the entries in the spreadsheet, and
  3. a detailed commentary on each set of genes implementing a functional role (describing what can be inferred about the evolutionary origins of the set of genes).

Recently a major effort has been launched to include within the SEED the capability of curating these subsystem annotations and exchanging them via p2p operations.

In March, Ross, Andrei, Veronika and Gary Olsen will begin curation of specific subsystems, synchronizing all assignments and annotations on a weekly basis. Once this initial effort is running smoothly, we intend to rapidly expand the set of individuals participating.

So, in the next newsletter, expect a detailed discussion of the gene-calling and similarity servers, along with a discussion of the outcome of the initial efforts to begin annotation of subsystems.

Finally, our sincere thanks to Dusko Ehrlich and Barny Whitman for sending statements supporting the utility of the SEED. We owe you.
