I attended the Evolution 2014 meeting a few months ago in Raleigh, NC, and presented a poster on Phenoscape’s curation effort: “Moving the mountain: How to transform comparative anatomy into computable anatomy?”, with coauthors A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee. In this work, we assessed the efficiency of our workflow for curating evolutionary phenotypes from the matrix-based phylogenetic literature. We identified the bottlenecks and areas for improvement in data preparation, phenotype annotation, and ontology development. Gains in efficiency, such as through improved community data practices and the development of text-mining tools, are critical if we are to translate evolutionary phenotypes from an ever-growing literature. The poster was well received, and several researchers at the meeting were interested in learning more about open source tools for phenotype annotation.
There is a wealth of phenotypic information in the evolutionary literature that comes in the form of semi-structured character state descriptions. Getting that information into computable form is, right now, an awfully slow process. In Phenoscape I, we estimated that it took about five person-years in total to curate semantic phenotype annotations from 47 papers. If we are to get computable evolutionary phenotypes from a larger slice of the literature, we really need to figure out ways to speed this up.
One promising approach is to use text-mining. This could contribute in a few different ways. First, one could efficiently identify all the terms in the text that are not currently represented in ontologies and add them en masse, so that data curation does not have to stop and resume whenever such terms are encountered. Second, one could present a human curator with suggestions for what terms to use and what relations those terms have to one another, speeding the process of composing an annotation.
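As a rough illustration of the first idea, here is a minimal Python sketch that counts words in character descriptions that do not occur in any ontology label. It is not how CharaParser or Phenex works; the function name `missing_terms`, the tiny stop list, and the word-level matching are all assumptions made for the example, and a real pipeline would match multi-word phrases and synonyms as well.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "with", "and", "or", "in", "on", "is"}


def missing_terms(descriptions, ontology_labels):
    """Count words in the descriptions that match no word of any ontology label."""
    # Collect every word that appears in some label (labels may be multi-word).
    known = set()
    for label in ontology_labels:
        known.update(re.findall(r"[a-z]+", label.lower()))

    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in known and word not in STOPWORDS:
                counts[word] += 1
    return counts


# Made-up inputs, for illustration only.
descriptions = [
    "Supraneural anterior to dorsal fin",
    "Shape of the supraneural",
]
labels = ["dorsal fin", "shape", "anterior to"]

for term, n in missing_terms(descriptions, labels).most_common():
    print(term, n)
```

Sorting the candidates by frequency gives ontology editors a batch of terms to review up front, rather than one interruption per term during curation.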
CharaParser, developed by Hong Cui at the University of Arizona, is an expert-based system that decomposes character descriptions into recognizable grammatical components, and it is now being used in several different biodiversity informatics projects. Baseline evaluation results from BioCreative III showed that a naive workflow combining CharaParser and Phenex, the software curators use to compose ontological annotations and relate them to character states, was capable of identifying candidate entity and quality phrases (it outperformed biocurators by 20% in recall on average) but had difficulty translating those into ontological annotations. This first-iteration workflow also did not yet reduce curation time.
In March, a small contingent from NESCent (Jim Balhoff, Hilmar Lapp and Todd Vision) visited Hong Cui’s group in Tucson. We talked through improvements to CharaParser and the curation workflow, brainstormed plans for a more thorough set of evaluation tests, began refactoring the code so that it can be more easily shared across projects, and gained a better understanding of which features make a character difficult to curate for humans vs. text-mining. We made substantial progress on all fronts, and are looking forward to seeing how much the accuracy and efficiency of curation improve in the next round of testing.
We are also pleased to report that the CharaParser codebase will now be available from GitHub under an open source (MIT) license.
A new bugfix release of Phenex is available. Phenex 1.4.2 addresses the following issues:
- Fixed missing “not” relationship in post-composition editor, https://github.com/phenoscape/Phenex/issues/14
- Fixed term filters to allow choosing provisional terms, https://github.com/phenoscape/Phenex/issues/13
- Fixed “freezing” panels display anomalies
- Fixed some ontology loading issues by updating internal OBO-Edit components to latest versions
On 15–16 February 2012, I visited NESCent to work with Peter Midford, Jim Balhoff, and, especially, Wasila Dahdul. The focus of my trip was to push forward on the continued development of the Amphibian Anatomical Ontology and the integration of phenotypic data for amphibians into the larger Phenoscape project.
With Peter Midford, I worked to make a significant update to the Amphibian Taxonomy Ontology based largely on a recent revision to the higher-level taxonomy used on AmphibiaWeb (for which I am part of the steering committee). AmphibiaWeb provides an excellent resource for Phenoscape and other related projects because it provides a list of currently recognized species of living amphibians and is updated daily.
The majority of my visit was spent working with Wasila Dahdul on issues related to the Amphibian Anatomy Ontology (AAO) and on curating our first evolutionary dataset related to the fin–limb transition (Ruta et al., 2003). During this work, we plowed through a significant portion of AAO terms lacking parent terms (either adding parents or synonymizing the terms with others in either the Vertebrate Anatomy Ontology (VAO) or the AAO). We also evaluated whether to add terms to the AAO that are present in the Xenopus Anatomy Ontology (XAO; Xenopus is a genus of African frogs used as a model system) but absent in the AAO. In some cases, this led to recommending that those terms be removed from the XAO. As we started to curate morphological characters related to the limbs from the study by Ruta et al. (2003), we encountered many terms not present in existing anatomy ontologies, such as the AAO or the VAO. Some terms had been slated for inclusion in the Amniote Anatomy Ontology (AmAO) being developed by Nizar Ibrahim and Paul Sereno (University of Chicago). Because these terms are also present in non-amniotes, we are recommending that they be migrated from the AmAO to the higher-level VAO.
As we start to focus on curating phenotypes from the literature of vertebrate paleontology, a few issues are emerging. One important issue is that curating data from paleontological studies will likely require adding a field to our specimen records to accommodate free text alongside museum abbreviations and catalog numbers. The reason is that paleontological studies can rely on a combination of materials, including both specimens and examination of literature. We will also need to add to and refine the collection of museum codes used to curate specimen data. These last points about accurately curating data on the specimens examined are important if we are to use the Phenoscape knowledgebase to point to records for those same specimens in online databases, or if databases (such as those for museum collections) want to point to records of specimens in the Phenoscape knowledgebase.
In the original Phenoscape project, our focus was on asking comparative questions regarding living taxa. Although we added fossil taxa to the Teleost Taxonomy Ontology (TTO) when our publications included them, we had no general need to add fossil taxa to the contemporary groups provided by the Catalog of Fishes. However, in our renewal, the focus has both expanded taxonomically (to all vertebrates) and narrowed to the evolution of fins and limbs. The evolution of limbs from fins occurred over 300 million years ago, meaning the morphological data for this transition exists only in the fossil record. Therefore, including fossil data and taxonomy has become essential.
These fossil taxa are not available in the major online sources of names, whether taxon-specific, such as Catalog of Fishes, or general, such as Catalog of Life or the NCBI taxonomy. Although NCBI includes some fossil taxa, taxa are only included when a related molecular sequence is submitted, which will never be the case for the vast majority of fossil taxa. These latter taxa will only ever be represented as morphological remains.
This need for fossil data, along with the absence of names from recognized sources, requires us to either add names (and hopefully plausible taxonomy) as curators encounter them in papers, or find an alternative source for names of fossil taxa. Although we have added and will continue to add fossil taxa to our taxonomy, we do not intend, and never intended, to become a name or taxonomy authority in our own right. In light of the strengths and weaknesses of the Phenoscape team, allying with a recognized source of fossil taxonomy seems the best option.
The Paleobiology Database, also called PaleoDB or simply PBDB, is an online repository covering a wide range of paleontological data across all taxa represented in the fossil record. These data include names as well as taxonomic opinions appearing in paleontology publications. They can be queried on the PBDB website and are also available for bulk download. As part of developing the Vertebrate Taxonomy Ontology (VTO), an expansion of the TTO to cover all vertebrates and several chordate groups of interest, I have implemented a tool that adds the content of these bulk downloads to a taxonomy ontology. The process of updating from PBDB was designed to minimize disruption to the existing taxonomy by only adding new taxa from PBDB along with whatever taxonomic lineage is required to link each new taxon to a taxon already known to the existing taxonomy. This way, updating from PBDB does not disrupt any existing taxonomic hierarchy that we either incorporated from other resources or built through prior curators’ efforts.
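The merge rule described above can be sketched in a few lines of Python. This is not the actual VTO tool; the function name `merge_new_taxa`, the representation of each taxonomy as a child-to-parent dictionary, and the placeholder taxon names are assumptions made for illustration. The sketch also assumes the source lineage contains no cycles.

```python
def merge_new_taxa(existing_parent, source_parent):
    """Add taxa from a source taxonomy without altering existing placements.

    Both arguments map a taxon name to its parent name (None for a root).
    A new taxon is added only if its source lineage reaches a taxon that the
    existing taxonomy already knows; only the linking ancestors are added.
    Assumes the source lineage contains no cycles.
    """
    merged = dict(existing_parent)
    for taxon in source_parent:
        if taxon in existing_parent:
            continue  # never touch taxa we already have

        # Walk up the source lineage until we hit a taxon already in the merge.
        path = []
        node = taxon
        while node is not None and node not in merged:
            path.append(node)
            node = source_parent.get(node)

        if node is None:
            continue  # lineage never connects to the existing taxonomy; skip

        for new_taxon in path:
            merged[new_taxon] = source_parent[new_taxon]
    return merged


# Placeholder names, for illustration only.
existing = {"Vertebrata": None, "Tetrapoda": "Vertebrata"}
source = {
    "GenusA": "FamilyA",
    "FamilyA": "Tetrapoda",
    "Tetrapoda": "SomeOtherParent",  # conflicting placement in the source
    "Orphan": "UnknownClade",
    "UnknownClade": None,
}

merged = merge_new_taxa(existing, source)
print(merged["GenusA"], merged["FamilyA"], merged["Tetrapoda"])
```

In this example, `GenusA` and its linking ancestor `FamilyA` are added, `Tetrapoda` keeps its existing parent despite the conflicting source placement, and `Orphan` is skipped because its lineage never reaches a known taxon.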
However, no taxonomic resource is ever complete. As our team of curators annotates publications, they are encountering fossil taxa unknown to PBDB, and have begun contributing the publication and taxonomy information back to the PBDB. John Alroy and the PBDB board have accepted several project members as authorizers and enterers of data into the PBDB. This allows us to give back to the PBDB as well as simplify the process of adding fossil taxa to our vertebrate taxonomy. We have developed a workflow where a curator can enter publications, names, and taxonomic opinions directly into the PBDB. This immediately makes our additions visible to a wider community and gives us the opportunity to engage expertise we may not have known existed. Subsequent PBDB bulk downloads will include these new names and reflect any changes to the taxonomic opinions entered during curation. These will then be added to the next update of the VTO.
With the help of Phenoscape and DeepFin intern Ben Frable, I recently finished adding 117 French anatomical terms and synonyms from Chanet & Desoutter-Meniger’s (2008) glossary to the Teleost Anatomy Ontology (TAO). These authors spent many years defining and translating Paul Chabanaud’s anatomical analyses of flatfishes into modern French and English to help researchers understand his important publications. Adding these terms to the TAO takes their translation one step further, enabling computers to link Chabanaud’s unusual terms to an ontology ID for each anatomical ‘concept’, which in turn enables connections among all phenotypic and related data that reference this ID.
These synonyms can now be used in searches of the Phenoscape Knowledgebase. For example, you can see the French synonyms for ‘paired fin’. One can imagine ultimately being able to select a preferred language or term label when browsing the ontology in the Knowledgebase.
This was the first set of non-English terms to be added to the teleost ontology, and we had to tweak the Phenoscape Knowledgebase interface to display the diacritical marks correctly. We are ready to accept more! Please send anything you’d like added or changed through the TAO term tracker.
Chanet, B., & Desoutter-Meniger, M. (2008). French-English glossary of terms found in Chabanaud’s published works on Pleuronectiformes. Cybium, Electronic Publication no 1:1-23.