Late January 2005 Newsletter

A great deal has happened during the last three months, so we decided to write another newsletter.

Development of the NMPDR

As we mentioned in our last newsletter, FIG received a sizable portion of a 5-year grant to build a National Microbial Pathogen Database Resource. This grant is providing a welcome stability. In addition, the NMPDR is a completely open source project, and the extensions to the existing SEED technology that are under development will benefit everyone. Many aspects of the basic design are being refined and cleaned up. The first release is only a few weeks away. While the NMPDR is based on SEED technology, the team doing the development has introduced a number of key innovations that we will discuss in the next newsletter.

The Manifesto

A discussion has been taking place relating to the appropriate objectives of FIG and the SEED project. Now that FIG has solid funding due to its participation in the development of the NMPDR, there is a welcome chance to re-examine goals. There are a number of ways things might develop, and as one might expect, different views are constantly being expressed. Ross attempted to formulate his view, which was posted at http://TheSEED.uchicago.edu/FIG/Html/1KG.html. The focus he advocates is largely based on annotations: success should be measured by FIG's ability to advance the development of subsystems and detailed encodings of metabolism (in the form of stoichiometric matrices).
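For readers who have not worked with stoichiometric matrices, here is a minimal sketch of the idea. It is not the SEED's encoding or file format; the function names and the data layout are our own assumptions, and the two glycolysis steps were chosen only as a familiar illustration. Each row is a metabolite, each column is a reaction, and each entry is the number of molecules produced (positive) or consumed (negative).

```python
# Illustrative only: a toy stoichiometric matrix, not the SEED's actual encoding.
# Each reaction maps metabolite -> coefficient (negative = consumed, positive = produced).
reactions = {
    "hexokinase": {"glucose": -1, "ATP": -1, "G6P": 1, "ADP": 1},
    "phosphoglucose_isomerase": {"G6P": -1, "F6P": 1},
}


def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_names, matrix) with rows = metabolites."""
    metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
    names = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def net_change(matrix, fluxes):
    """Multiply the matrix by a flux vector: net production of each metabolite."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in matrix]


metabolites, names, S = stoichiometric_matrix(reactions)

# Run both reactions once and report what is produced or consumed overall.
for metabolite, change in zip(metabolites, net_change(S, [1, 1])):
    print(f"{metabolite:8s} {change:+d}")
```

Running both reactions at equal rates leaves G6P balanced (it is made by the first step and used by the second), which is the kind of bookkeeping such an encoding makes mechanical.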

The SEED Developers Meeting in Chicago in October

The SEED Developers meeting was held on Oct 24-25. It focused on preparing a distribution version of a merged environment that would support both the SEED and GenDB. Teams at both Bielefeld and Argonne National Lab put substantial effort into merging the systems. A prototype was produced, DVDs were created (shortly after the meeting), and the result was shown at Supercomputing 2004, which occurred in early November. The merged system is not completely distributable yet for two reasons:

  1. The installation scripts are not solid. We can manually do an install by working through the steps, but working out the details needed to make things work properly in arbitrary environments is taking time.
  2. It became clear that we wanted to include a number of genomes in the GenDB side that had all of the precomputed data needed to support analysis (rather than making users begin by doing a fairly substantial amount of computation to prepare their genomes of interest).

The effort to make both systems available in an integrated environment will continue during the first quarter of 2005. If anyone really wants a copy right away for production use (as opposed to just evaluation), we will make an effort to help you install it. However, the major release scheduled for the end of the first quarter will include hundreds of genomes already preloaded into the GenDB component, tens of thousands of newly-called genes, and on the order of a hundred additional genomes. We sincerely believe that it will be worth the wait.

Workshops, Tutorials, etc.

We held tutorials/workshops at MIT and in Mexico, and things went extremely well. We have another scheduled in early March at the University of Florida, and Andrei will be giving one in the Netherlands in May. We have been reconsidering the most effective way to spread the technology, given that everyone is already hugely overcommitted.

One observation has been that, in the few cases in which the SEED technology for annotation of genes and subsystems has actually been used in classrooms, we got more benefit for our efforts. Most notably, we had a really productive experience helping introduce the technology in Bernhard Palsson's graduate class at UCSD. The students took it very seriously, they built a number of well-done subsystems, and it established the foundation for ongoing use and collaboration.

The position that is emerging is that advancing the effort to produce accurate, well-supported subsystems should be the main goal; all tutorial efforts would be judged on the odds that the expended effort would advance that goal. Hence, tutorials given to experts planning on using the technology to support development of reviews or in their own research would get highest priority, graduate classes would get next highest priority, etc.

New Releases, etc.

The effort required to add hundreds of genomes at an increasing pace is draining. Just trying to help people make new installations and update their old ones is time consuming. It is becoming very clear that we need to establish a strategy that scales, and we need to implement it very quickly. The current plan is really pretty exciting. The key points are as follows:

  1. The major release that we will make about the end of the first quarter will be the "first and only more-or-less official release of the SEED/GenDB data". Code will be constantly updated and installed over the network as it is done now. However, we will stop constructing "current releases".
  2. Rather than new releases of an ever-increasing collection of genomes, we will support incremental addition of single genomes. Users can download genomes, similarities, improved gene calls, etc. from the clearinghouse. Users will download and install genomes the way they now download and install subsystems.
  3. A number of participants will be helping to prepare genomes for addition, and they will all be depositing them into the clearinghouse.
  4. For new participants, we will provide the single release, and we will implement the ability to "clone" an existing SEED/GenDB system. This allows a new user to acquire a "clone" from someone who has made the investment to download whatever genomes are desired, install whatever subsystems they trust, etc. It also frees us from having to worry about how to help people get started.

There are numerous aspects of this strategy that have not been properly implemented yet. We will do our best to have everything fully functional by the big release.

The More Important Release: the Subsystems

The big release will include not just the SEED/GenDB integration, but an initial set of subsystems, as well. These represent the output of the "prototyping and evaluation" stage. There will be on the order of 100-150 subsystems in differing stages of development. Some were done by experts, have lovely diagrams, and reflect major amounts of effort. Many were done by enthusiastic participants with less knowledge and they reflect it. There are many ways that one might measure the utility of this initial batch. We believe that it represents a major milestone. Much of the next two months will be spent organizing and doing quality control for this initial release.

Well, that is about it for now. There are probably many things that we forgot to cover, but we did hit the major topics. We will try to get out another newsletter next month, but it is not clear that we will take the time to do so before "the big release", which should be about the start of March. In any event, we wish you well and hope you prosper,

the team at FIG
