I attended the Evolution 2014 meeting a few months ago in Raleigh, NC, and presented a poster on Phenoscape’s curation effort: “Moving the mountain: How to transform comparative anatomy into computable anatomy?”, with coauthors A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee. In this work, we assessed the efficiency of our workflow for the curation of evolutionary phenotypes from the matrix-based phylogenetic literature. We identified the bottlenecks and areas for improvement in data preparation, phenotype annotation, and ontology development. Gains in efficiency, such as through improved community data practices and development of text-mining tools, are critical if we are to translate evolutionary phenotypes from an ever-growing literature. The poster was well received, and several researchers at the meeting were interested in learning more about open source tools for phenotype annotation.
There is a wealth of phenotypic information in the evolutionary literature that comes in the form of semi-structured character state descriptions. To get that information into computable form is, right now, an awfully slow process. In Phenoscape I, we estimated that it took about five person-years in total to curate semantic phenotype annotations from 47 papers. If we are to get computable evolutionary phenotypes from a larger slice of the literature, we really need to figure out ways to speed this up.
One promising approach is to use text-mining. This could contribute in a few different ways. First, one could efficiently identify all the terms in the text that are not currently represented in ontologies and add them en masse, so that data curation does not have to stop and resume whenever such terms are encountered. Second, one could present a human curator with suggestions for what terms to use and what relations those terms have to one another, speeding the process of composing an annotation.
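To make the first idea concrete, here is a minimal sketch in Python of scanning character descriptions for words that no ontology label or synonym covers. It is only an illustration under simplifying assumptions: the function name `unmatched_terms`, the word-level matching, and the toy data are all hypothetical, and this is not how CharaParser or any Phenoscape tool works.

```python
import re
from collections import Counter

# Hypothetical helper, for illustration only: a word counts as "covered"
# if it occurs anywhere in an ontology label or synonym.
WORD = re.compile(r"[a-z]+(?:-[a-z]+)*")


def unmatched_terms(descriptions, ontology_labels):
    """Return (word, count) pairs for words absent from every ontology label.

    Results are sorted by descending count, then alphabetically, so the
    most frequent gaps can be reviewed (and added to an ontology) first.
    """
    covered = set()
    for label in ontology_labels:
        covered.update(WORD.findall(label.lower()))

    counts = Counter()
    for text in descriptions:
        for word in WORD.findall(text.lower()):
            if word not in covered:
                counts[word] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy character-state descriptions and labels (made up for this example).
descriptions = [
    "Supraneural present, anterior to dorsal fin",
    "Supraneural absent",
]
labels = ["dorsal fin", "present", "absent", "anterior to"]

for word, count in unmatched_terms(descriptions, labels):
    print(word, count)
```

A real pipeline would need phrase-level matching, stemming, and a curator in the loop, but even a crude list like this could be reviewed in one batch rather than one term at a time during curation.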
CharaParser, developed by Hong Cui at the University of Arizona, is an expert-based system that decomposes character descriptions into recognizable grammatical components, and it is now being used in several different biodiversity informatics projects. Baseline evaluation results from BioCreative III showed that a naive workflow combining CharaParser and Phenex, the software curators use to compose ontological annotations and relate them to character states, was capable of identifying candidate entity and quality phrases (it outperformed biocurators by 20% in recall on average) but had difficulty translating those into ontological annotations. This first-iteration workflow was also not yet reducing curation time.
In March, a small contingent from NESCent (Jim Balhoff, Hilmar Lapp and Todd Vision) visited Hong Cui’s group in Tucson. We talked through improvements to CharaParser and the curation workflow, brainstormed plans for a more thorough set of evaluation tests, began refactoring of the code so that it can be more easily shared across projects, and gained a better understanding of what features make a character difficult to curate for humans vs. text-mining. We made substantial progress on all fronts, and are looking forward to seeing how much improvement in the accuracy and efficiency of curation will be achieved in the next round of testing.
We are also pleased to report that the CharaParser codebase will now be available from GitHub under an open source (MIT) license.
A new bugfix release of Phenex is available. Phenex 1.4.2 addresses the following issues:
- Fixed missing “not” relationship in post-composition editor, https://github.com/phenoscape/Phenex/issues/14
- Fixed term filters to allow choosing provisional terms, https://github.com/phenoscape/Phenex/issues/13
- Fixed “freezing” panels display anomalies
- Fixed some ontology loading issues by updating internal OBO-Edit components to latest versions
On 15–16 February 2012, I visited NESCent to work with Peter Midford, Jim Balhoff, and, especially, Wasila Dahdul. The focus of my trip was to push forward on the continued development of the Amphibian Anatomical Ontology and the integration of phenotypic data for amphibians into the larger Phenoscape project.
With Peter Midford, I worked to make a significant update to the Amphibian Taxonomy Ontology based largely on a recent revision to the higher-level taxonomy used on AmphibiaWeb (for which I am part of the steering committee). AmphibiaWeb provides an excellent resource for Phenoscape and other related projects because it provides a list of currently recognized species of living amphibians and is updated daily.
The majority of my visit was spent working with Wasila Dahdul on issues related to the Amphibian Anatomy Ontology (AAO) and on curating our first evolutionary dataset related to the fin–limb transition (Ruta et al., 2003). During this work, we plowed through a significant portion of AAO terms lacking parent terms (either adding parents or synonymizing the terms with others in either VAO or AAO). We also evaluated whether to add terms to the AAO that are present in the Xenopus Anatomy Ontology (XAO; Xenopus is a genus of African frogs used as a model system) but absent in the AAO. In some cases, this led to recommending that those terms be removed from the XAO. As we have started to curate morphological characters related to the limbs from the study by Ruta et al. (2003), we encountered many terms not present in existing anatomy ontologies, such as AAO or the Vertebrate Anatomy Ontology. Some terms had been slated for inclusion in the Amniote Anatomy Ontology (AmAO) being developed by Nizar Ibrahim and Paul Sereno (University of Chicago). Because these terms are also present in non-amniotes, we are recommending that they be migrated from the AmAO to the higher-level VAO.
As we start to focus on curating phenotypes from the literature of vertebrate paleontology, a few issues are emerging. One important issue is that curation of data from paleontological studies will likely necessitate adding a field to our information for specimens to accommodate free text alongside museum abbreviations and catalog numbers. The reason for this is that paleontological studies can rely on a combination of materials, including both specimens and examination of literature. We will also need to add to and refine the collection of museum codes used to curate specimen data. These last points about accurately curating data related to specimens examined are important if we are to use the Phenoscape knowledgebase to point to records for those same specimens in on-line databases, or if databases (such as those for museum collections) want to point to records of specimens in the Phenoscape knowledgebase.
In the original Phenoscape project, our focus was on asking comparative questions regarding living taxa. Although we added fossil taxa to the Teleost Taxonomy Ontology (TTO) when our publications included them, we had no general need to add fossil taxa to the contemporary groups provided by the Catalog of Fishes. However, in our renewal, the focus has both expanded taxonomically (to all vertebrates) and narrowed to the evolution of fins and limbs. The evolution of limbs from fins occurred over 300 million years ago, meaning the morphological data for this transition exists only in the fossil record. Therefore, including fossil data and taxonomy has become essential.
These fossil taxa are not available in the major online sources of names, whether taxon-specific, such as Catalog of Fishes, or general, such as Catalog of Life or the NCBI taxonomy. Although NCBI includes some fossil taxa, taxa are only included when a related molecular sequence is submitted, which will never be the case for the vast majority of fossil taxa. These latter taxa will only ever be represented as morphological remains.
This need for fossil data, along with the absence of names from recognized sources, requires us to either add names (and hopefully plausible taxonomy) as curators encounter them in papers, or find an alternative source for names of fossil taxa. Although we have added and will continue to add fossil taxa to our taxonomy, we do not, and did not, intend to become a name or taxonomy authority in our own right. In light of the strengths and weaknesses of the Phenoscape team, allying with a recognized source of fossil taxonomy seems the best option.
The Paleobiology Database (also called PaleoDB or simply PBDB) is an online repository covering a wide range of paleontological data across all taxa represented in the fossil record. These data include names as well as taxonomic opinions appearing in paleontology publications. These data are available and queryable on the PBDB website and are also available for bulk download. As part of developing the Vertebrate Taxonomy Ontology (VTO), an expansion of the TTO to cover all vertebrates and several chordate groups of interest, I have implemented a tool that adds the content of these bulk downloads to a taxonomy ontology. The process of updating from PBDB was designed to minimize disruption to the existing taxonomy by only adding new taxa from PBDB along with whatever taxonomic lineage is required to link each new taxon to a taxon already known to the existing taxonomy. This way, updating from PBDB does not disrupt any existing taxonomic hierarchy that we have either incorporated from other resources or built through prior curators’ efforts.
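The update strategy described above can be sketched roughly as follows. This is a simplified illustration, not the actual VTO tool: the function name `merge_new_taxa`, the child-to-parent dictionaries, and the toy taxa are assumptions made for the example.

```python
def merge_new_taxa(existing_parent, source_parent):
    """Add taxa from a source taxonomy without altering existing links.

    Both arguments map taxon name -> parent name (None for a root).
    A new taxon is added together with just enough of its source lineage
    to reach a taxon the existing taxonomy already knows. Existing
    child -> parent links are never overwritten.
    """
    merged = dict(existing_parent)
    for taxon in source_parent:
        if taxon in merged:
            continue  # already known (or added as part of an earlier lineage)

        # Walk up the source lineage until we reach a known taxon.
        chain, seen = [], set()
        node = taxon
        while node is not None and node not in merged and node not in seen:
            chain.append(node)
            seen.add(node)
            node = source_parent.get(node)

        if node is None or node in seen:
            continue  # no anchor in the existing taxonomy; leave for a curator

        for child in chain:
            merged[child] = source_parent[child]
    return merged


# Toy data for illustration only.
existing = {
    "Vertebrata": None,
    "Sarcopterygii": "Vertebrata",
}
source = {
    "Tiktaalik": "Elpistostegalia",
    "Elpistostegalia": "Sarcopterygii",
    "Sarcopterygii": "Osteichthyes",  # differs from existing; must not win
    "Osteichthyes": None,
    "Orphanus": "Unknownia",          # hypothetical taxon with no known ancestor
    "Unknownia": None,
}

merged = merge_new_taxa(existing, source)
print(merged["Tiktaalik"], merged["Sarcopterygii"])
```

Because only previously unknown names are written, the parent of `Sarcopterygii` stays as it was in the existing taxonomy, and taxa with no path to a known taxon are skipped rather than attached arbitrarily.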
However, no taxonomic resource is ever complete. As our team of curators annotates publications, they are encountering fossil taxa unknown to PBDB, and have begun contributing the publication and taxonomy information back to the PBDB. John Alroy and the PBDB board have accepted several project members as authorizers and enterers of data into the PBDB. This allows us to give back to the PBDB as well as simplify the process of adding fossil taxa to our vertebrate taxonomy. We have developed a workflow where a curator can enter publications, names, and taxonomic opinions directly into the PBDB. This immediately makes our additions visible to a wider community and gives us the opportunity to engage expertise we may not have known existed. Subsequent PBDB bulk downloads will include these new names and reflect any changes to the taxonomic opinions entered during curation. These will then be added to the next update of the VTO.
With the help of Phenoscape and DeepFin intern Ben Frable, I recently finished adding 117 French anatomical terms and synonyms from Chanet & Desoutter’s glossary publication to the Teleost Anatomy Ontology (TAO). These authors spent many years defining and translating Paul Chabanaud’s anatomical analyses of flatfishes into modern French and English to help researchers understand his important publications. Adding these terms to the TAO takes their translation one step further, enabling computers to link Chabanaud’s unusual terms to an ontology ID for each anatomical ‘concept’, which in turn enables connections among all phenotypic and related data that reference this ID.
These synonyms can now be used in searches of the Phenoscape Knowledgebase. For example, you can see the French synonyms for ‘paired fin’. One can imagine ultimately being able to select a preferred language or term label when browsing the ontology in the Knowledgebase.
This was the first set of foreign-language terms to be added to the teleost ontology, and we had to tweak the Phenoscape Knowledgebase interface to display the diacritical marks correctly. We are ready to accept more! Please submit anything you’d like added or changed to the TAO term tracker.
Chanet, B., & Desoutter-Meniger, M. (2008). French-English glossary of terms found in Chabanaud’s published works on Pleuronectiformes. Cybium, Electronic Publication no 1:1-23.
While I was working with Carole Baldwin at the Smithsonian National Museum of Natural History to describe two species of lizardfish (Synodus), she received an email from Paula Mabee asking if she knew of or had any students interested in working on the Phenoscape Project. I had realized that, with advances in technology and communication, evolutionary biology and all of science were headed towards a future of large-scale interdisciplinary collaborations to help address big questions and make tools and data readily available. Therefore, I immediately jumped on the opportunity to work on Phenoscape!
With the support of funding from DeepFin, I started my internship with Phenoscape at the National Evolutionary Synthesis Center (NESCent) in August 2011. My three months here at NESCent have flown by and even though it is my last day, I am just as excited about the project as the day I started! Working with Wasila Dahdul, Peter Midford and Jim Balhoff has enabled me to learn and understand a great deal about databases, collaboration and morphology. Phenoscape has completely changed the way I think about phenotypic characters. Breaking them down into logical statements in Phenex really allows you to understand a character as it fits in the bigger picture. I was able to work with Wasila in forging interdisciplinary ties by contributing to other ontologies and databases, such as PATO and PaleoDB. Additionally, working to assist in the expansion of Phenoscape to incorporate all vertebrates taught me a lot about the origins of vertebrates and the plethora of prehistoric life I did not realize existed, including my new personal favorite prehistoric fish, Jagorina!
NESCent is an amazing place. Being one of the few people here without a higher degree or a long list of publications under their belt, I was initially a little intimidated. However, the informatics group, post-docs and professors have been great and pushed me to participate in seminars and intellectual discussion. This is a stimulating environment that facilitates thinking outside the box and looking at bigger picture issues in evolutionary biology.
I am excited to continue my work on Phenoscape offsite back at the Smithsonian and I hope to contribute throughout my graduate career in Dr. Brian Sidlauskas’s (former NESCentian and Phenoscape tester and contributor) lab at Oregon State University.
Graduate Student, Oregon State University
Student Researcher, Smithsonian National Museum of Natural History
We’re happy to report that a paper describing the Phenex curation tool has just recently been published in PLoS ONE:
Balhoff JP, Dahdul WM, Kothari CR, Lapp H, Lundberg JG, et al. (2010) Phenex: Ontological Annotation of Phenotypic Diversity. PLoS ONE 5(5): e10500. doi:10.1371/journal.pone.0010500.
Abstract: Phenotypic differences among species have long been systematically itemized and described by biologists in the process of investigating phylogenetic relationships and trait evolution. Traditionally, these descriptions have been expressed in natural language within the context of individual journal publications or monographs. As such, this rich store of phenotype data has been largely unavailable for statistical and computational comparisons across studies or integration with other biological knowledge. Here we describe Phenex, a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic similarities and differences using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Phenex can be configured to load only those ontologies pertinent to a taxonomic group of interest. The graphical user interface was optimized for evolutionary biologists accustomed to working with lists of taxa, characters, character states, and character-by-taxon matrices. Annotation of phenotypic data using ontologies and globally unique taxonomic identifiers will allow biologists to integrate phenotypic data from different organisms and studies, leveraging decades of work in systematics and comparative morphology.
In early November, Wasila and I attended the AmphibAnat workshop in Kansas City, MO (Nov. 5-8) that was organized by Anne Maglia. As you may know, Phenoscape has a close relationship with this group, not only because they work on herps (ichthyologists and herpetologists have a long tradition of working together…), but because they are also developing ontologies to annotate the published comparative anatomical literature. I presented the status of our work in Phenoscape to the large group (~40) of amphibian development and anatomy experts who were present. As these folks added new terms, synonyms, and images to the amphibian ontologies over the course of the next few days, we solicited comments on the prototypes of three new interfaces for the Phenoscape Knowledgebase. Using both images and paper copies of these prototypes, we invited people to sit down with us on a one-on-one basis and describe in detail what worked and what was missing or unclear. The feedback was extremely useful, and we appreciated the time the AmphibAnat participants gave us. We have now gone over all the comments within Phenoscape and logged them individually to FogBugz, our internal tracking system. We’ll be generating new versions of these prototypes through early February, when we plan a formal round of usability testing.