“…where the buffalo roam and the data are rounded up-up all day….”
A few weeks ago, from Sep 27 to Oct 1, we met in the Black Hills of South Dakota with a group of guest data curators and outside advisors to curate high-priority papers, refine the curation workflow and Phenex interface, and evaluate the first prototypes for the web-based user interface to the database. Not only was the workshop highly productive (see below), but we also had a chance to observe the annual roundup of the largest herd of buffalo in North America, swim in cold Sylvan Lake, and see Mount Rushmore one evening.
After participant introductions, the day began with several presentations focused on the utility of synthetic research databases (Monte Westerfield) and the way that phenotype-genotype relationships are handled in model organism databases (Suzanna Lewis, Judy Blake). For the curious, all presentations are available from the Jamboree agenda page. After a brief description of the Phenoscape data policy (Todd Vision), Jim Balhoff introduced the curation software tool, Phenex, and Wasila Dahdul followed with a hands-on curation exercise that involved using Phenex and associated ontologies to annotate phenotypes (characters) using EQ syntax.
We then paired up our guest data curators with project personnel and got them started. To focus them on EQ syntax and ontology development, we gave each of our guest curators pre-curated Phenex files containing taxon lists, character matrices, and free-text character and state descriptions from several important publications that were either authored by the guest data curators themselves or within their area of specialty.
Inevitably, curators spent a significant portion of the time on ontology development. Many new anatomical and quality entities, along with their relationships and definitions, needed to be added to either the Teleost Anatomy Ontology (TAO) or the Phenotype and Trait Ontology (PATO) before characters could be expressed in EQ syntax. In the process of training curators to submit new entities to the term trackers for these ontologies, we recognized the need for small, dedicated ontology development meetings (e.g. focused on one anatomical region such as the lateral line system in fishes, or on all geometrical or shape qualities). The better the ontology, the more efficient curation will be.
The curation workflow in Phenex differs significantly from that of the Phenote tool, which we used in the first Data Jamboree. It was the issues brought up there that guided the development of Phenex, and indeed this time we noticed a significant improvement in curatorial efficiency. We were very pleased that by the end of the workshop, the curators had curated 156 characters (~312 character states) for approximately 150 taxa, resulting in over 45,000 EQ annotations.
One frequent topic of discussion was the level of granularity at which characters should be curated. For example, many systematic characters pertain to shape, frequently with complex descriptors. These can be curated at a coarse level (e.g. fin: shape) or at a more granular level (e.g. fin: anterior margin rounded). Judy Blake, one of our outside advisors, pointed out that there is a continuum between use of a structured vocabulary and free text. A reasonable guideline is that data specific to an individual study may not warrant the extra effort of granular annotation and should be left as free text, while data that can be compared across studies should be annotated with ontologies at a rather detailed level.
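To make the two levels concrete, here is a minimal sketch of how the fin example might be represented at both granularities. It is not how Phenex stores annotations; the `EQ` class, the `part` field, and the `QUALITY_PARENT` table are hypothetical names made up for illustration, and the parent links are not taken from PATO.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical parent links for quality terms (illustration only, not PATO content).
QUALITY_PARENT = {"rounded": "shape"}


@dataclass(frozen=True)
class EQ:
    entity: str                  # anatomical structure, e.g. "fin"
    quality: str                 # quality term, e.g. "shape" or "rounded"
    part: Optional[str] = None   # sub-region of the entity, if one is specified

    def label(self) -> str:
        target = f"{self.part} of {self.entity}" if self.part else self.entity
        return f"{target}: {self.quality}"


def coarsen(eq: EQ) -> EQ:
    """Drop the sub-region and replace the quality with its parent, if one is known."""
    return EQ(entity=eq.entity, quality=QUALITY_PARENT.get(eq.quality, eq.quality))


granular = EQ(entity="fin", quality="rounded", part="anterior margin")
coarse = EQ(entity="fin", quality="shape")

print(granular.label())
print(coarse.label())
print(coarsen(granular) == coarse)
```

The point of the sketch is that a granular annotation can always be reduced to the coarse one, while the reverse is not possible. That asymmetry is one argument for detailed annotation of data meant for cross-study comparison.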
One observation was that the higher the level of granularity, the more often post-composition is required. The mechanics of using multiple post-composition windows in Phenex were somewhat confusing to guest curators. Another important topic was the lack of a universal standard across systematic studies; for example, descriptors pertaining to size and shape cannot easily be extended from one study to another. This same issue came up at our previous data jamboree, too. John Lundberg suggested that for comparisons of size within a study, an internal grading of character states (a numbered scale from 1 for smallest and higher numbers corresponding to larger sizes) could be implemented, and this is something that we plan to try.
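The internal grading idea can be sketched in a few lines. This is only an illustration of the numbered scale described above, not an implemented Phenex feature; the function name `grade_states` and the example state descriptions are hypothetical.

```python
def grade_states(states_smallest_to_largest):
    """Assign grade 1 to the smallest state and increasing grades to larger ones.

    The curator supplies the states of one character, already ordered by size.
    """
    if len(set(states_smallest_to_largest)) != len(states_smallest_to_largest):
        raise ValueError("each state may appear only once")
    return {state: grade
            for grade, state in enumerate(states_smallest_to_largest, start=1)}


# Hypothetical states of a single size character from one study.
grades = grade_states(["small", "moderately enlarged", "greatly enlarged"])
print(grades)
```

Grades assigned this way are ordinal and only meaningful within the study they come from: grade 2 in one matrix says nothing about grade 2 in another.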
Following two days of curation, we conducted an experiment to assess curation consistency among our group of curators and to identify areas where consistency can be improved, for example by better curator training, upfront ontology development, or user interface enhancements. We wanted to determine how often, and for what reasons, curators choose divergent EQ conceptualizations for the same character and character states. Five curators encoded EQ annotations for the same 10 character and state descriptions, using Phenex as the software tool. Immediately afterward we reviewed the results with the group. Notably, only two of the 10 characters were annotated identically by all curators. Where the annotations differed, the main causes were different interpretations of shape descriptors, inexperience with the ontologies and software, and a lack of adequate terms in the ontologies (e.g. for shape). These are the major hurdles to consistency between curators, and identifying them helps us prioritize our efforts on visualization tools, ontology development, workflow, and Phenex development.
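For readers who want to run a similar comparison, the following is a rough sketch of how agreement could be tallied. It is not the procedure we used at the workshop, where results were reviewed by hand; the data layout, the function names, and the demo annotations are all hypothetical.

```python
from itertools import combinations


def unanimous_characters(annotations):
    """Return the characters that every curator annotated identically.

    annotations: {curator: {character_id: frozenset of (entity, quality) pairs}}
    """
    curators = list(annotations)
    characters = annotations[curators[0]].keys()
    return [c for c in characters
            if len({annotations[cur][c] for cur in curators}) == 1]


def pairwise_agreement(annotations, character):
    """Fraction of curator pairs whose annotations for one character match exactly."""
    pairs = list(combinations(annotations, 2))
    same = sum(annotations[a][character] == annotations[b][character]
               for a, b in pairs)
    return same / len(pairs)


# Made-up demo data: three curators, two characters.
demo = {
    "curator_a": {"c1": frozenset({("fin", "shape")}),
                  "c2": frozenset({("fin", "rounded")})},
    "curator_b": {"c1": frozenset({("fin", "shape")}),
                  "c2": frozenset({("fin", "rounded")})},
    "curator_c": {"c1": frozenset({("fin", "shape")}),
                  "c2": frozenset({("fin", "shape")})},
}

print(unanimous_characters(demo))
print(pairwise_agreement(demo, "c2"))
```

Exact matching is a strict criterion. Two annotations that differ only in granularity count as a disagreement here, so a measure that takes the ontology hierarchy into account would likely report higher agreement.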
In parallel with the curation activities, we discussed the Teleost Taxonomy Ontology (TTO), taxon concepts, and intermediate synonyms. Peter Midford has added the intermediate synonyms from the Catalog of Fishes (CoF), but many additional synonyms present in the literature still need to be added. We determined that associating synonyms with their references/publications will require an OBO request for one or more database identifiers. At first we considered using the CoF publication database to generate dbxrefs rather than hunting for DOIs or generating our own, but CoF doesn’t contain many of our publications. Thus we will need to maintain references and unique identifiers for them in our own database.
In another parallel activity, pairs of participants met with Jim Balhoff and Cartik Kothari for demonstrations of the web user interface. Jim and Cartik are creating the database and web application for the research community to explore and query the data. The central gateways for entering the database that emerged from those consultations are querying by gene, anatomy, taxon, or publication. The use-case-level query to be implemented first is to find evolutionary phenotypes that match a mutant ZFIN phenotype, or, going in the reverse direction, to find ZFIN mutants for a set of phenotypes that differ between taxa. We plan to be able to demonstrate the first version of the interface at the educational outreach workshop at the SICB meeting on Jan. 6, where we also plan to conduct a first series of usability tests. (For those curious, we received much more feedback and many more suggestions, all of which we compiled on the project wiki.)
Working our way towards that meeting over the coming months, we will put into place the first components of a production-level database and a fleshed-out query interface. Aside from the software development efforts, we are planning ontology development workshops in small groups in the coming year, for example for TAO and for PATO, and we are tentatively targeting ASIH 2009 for our next outreach workshop. In the meantime, we will need to review our data curation priorities in light of our high-priority use cases and proof-of-concept experiments for queries across ZFIN and Phenoscape.