Curation of evolutionary phenotypes from the systematic evolutionary literature of fishes is central to accomplishing our goal, which is to prototype an ontology-based informatics system to integrate evolutionary, anatomical, developmental, and genetics data. This summer we accomplished a fair chunk of curation, and I’m summarizing it here (see graph below).
During the past year we’ve asked taxon experts to suggest comparative systematic treatments and to prioritize them based on generality of taxonomic coverage, the authors’ enumeration of characters, and the presence of a published character X taxon matrix. You can see the 76 papers that we ranked as top priority (“A”) and our curation priority in a publicly available Google spreadsheet and in the graph below. We use this document internally to record the nitty-gritty of our progress. (If you see any papers that we’ve missed, please let us know.)
First things first: we needed an OCR-formatted PDF of each of the 76 “A” papers. It took a summer of searching, requesting interlibrary loans, downloading documents from university libraries and the Biodiversity Heritage Library, and scanning hard copies of papers that were unavailable electronically. We completed this work for our “A” papers last week.
To populate a database with skeletal phenotypes from many species, we need information on the taxonomic distribution of particular phenotypes (i.e., which species show which phenotype). Using Phenote, the software curation tool customized by Jim Balhoff (NESCent) for curation of evolutionary biology literature, we entered taxon lists for each paper by cutting and pasting from the PDFs into a “Publication Taxon” column and into a “Valid Taxon” column. Where a published taxon name is no longer valid, it is sometimes designated as a synonym of the valid taxon in the Teleost Taxonomy Ontology, and in those cases we replaced it with the valid name. Because some intermediate synonyms are not yet in the Teleost Taxonomy Ontology, taxonomic expertise is still required to find the valid taxon name for some of the published names. We have taxon lists for 59 of the 76 papers; many of these still require expert review.
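The name-resolution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Phenote works internally: the synonym table, the taxon names, and the functions `resolve_taxon` and `build_taxon_list` are all hypothetical placeholders, and a real lookup would draw on the Teleost Taxonomy Ontology itself.

```python
# Hypothetical data: placeholder names, not real Teleost Taxonomy Ontology entries.
VALID_NAMES = {"Newgenus alpha", "Newgenus beta"}
SYNONYMS = {"Oldgenus alpha": "Newgenus alpha"}  # published synonym -> valid name


def resolve_taxon(published_name, synonyms=SYNONYMS, valid_names=VALID_NAMES):
    """Return (valid_name, needs_review) for one published taxon name."""
    # Collapse stray whitespace left over from copying out of a PDF.
    name = " ".join(published_name.split())
    if name in valid_names:
        return name, False
    if name in synonyms:
        return synonyms[name], False
    # No valid name or synonym on record: leave blank for a taxon expert.
    return None, True


def build_taxon_list(published_names):
    """Build rows mirroring the "Publication Taxon" and "Valid Taxon" columns."""
    rows = []
    for published in published_names:
        valid, needs_review = resolve_taxon(published)
        rows.append({
            "publication_taxon": published,
            "valid_taxon": valid,
            "needs_review": needs_review,
        })
    return rows
```

Names that cannot be resolved automatically are flagged rather than guessed, which corresponds to the cases where taxonomic expertise is still needed.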
Character X taxon matrices record the association between taxa and character states. To save the time required for manual entry, we contacted authors directly for their matrices, and some had them available to send to us (thank you!). It’s interesting that about a third of the “A” papers (24/76) do not have published character matrices. We may be able to reconstruct matrices from the text in some cases. We currently have 27 character matrices, and 25 remain to be entered manually.
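To make the idea of a character X taxon matrix concrete, here is a minimal Python sketch that expands a small matrix into individual taxon/character/state associations. The taxa, the second character, the state symbols, and the function `expand_matrix` are hypothetical, and treating "?" and "-" as missing data is an assumption about the coding convention rather than something taken from any particular paper.

```python
# Symbols assumed to mean "missing or inapplicable" in this sketch.
MISSING = {"?", "-"}


def expand_matrix(matrix, characters):
    """Turn rows of state symbols into (taxon, character, state) tuples.

    matrix: dict mapping a taxon name to a string with one symbol per character.
    characters: list of (character_name, {symbol: state_label}) pairs, in column order.
    """
    associations = []
    for taxon, row in matrix.items():
        if len(row) != len(characters):
            raise ValueError(f"{taxon}: expected {len(characters)} states, got {len(row)}")
        for (char_name, states), symbol in zip(characters, row):
            if symbol in MISSING:
                continue  # nothing is asserted for this taxon and character
            associations.append((taxon, char_name, states[symbol]))
    return associations


# Illustrative data only: placeholder taxa and a made-up second character.
characters = [
    ("basibranchial 1", {"0": "absent", "1": "present"}),
    ("hypothetical character 2", {"0": "state a", "1": "state b"}),
]
matrix = {"Taxon A": "01", "Taxon B": "1?"}
associations = expand_matrix(matrix, characters)
```

Each resulting tuple says which taxon shows which state of which character, which is the information needed to populate a phenotype database.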
Annotating the skeletal phenotypes in these papers using EQ syntax and ontologies is our current challenge, along with merging old Phenote files into the new Phenex format. We are taking a first pass through the characters, initially coding the easy ones (e.g., basibranchial 1 present or absent) and submitting needed terms to the Teleost Anatomy Ontology (to see these terms and comment on them as proposed, please subscribe). We currently have 11 papers curated at this level; our goal is to do a first pass through most if not all of the “A” papers before our second data jamboree, where taxon experts will review and edit the data.
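As a rough illustration of what an EQ (entity plus quality) annotation looks like for an easy presence/absence character, consider the Python sketch below. It is not the Phenote or Phenex data model: the `EQ` class and both helper functions are hypothetical, plain labels stand in for ontology term identifiers, and the set of known anatomy terms is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQ:
    """One phenotype annotation: an anatomical entity plus a quality."""
    entity: str   # label standing in for an anatomy ontology term
    quality: str  # label standing in for a quality term


def annotate_presence_absence(entity):
    """Code a simple present/absent character as one EQ per character state."""
    return {
        "present": EQ(entity, "present"),
        "absent": EQ(entity, "absent"),
    }


def missing_terms(annotations, known_entities):
    """List entities with no matching anatomy term, i.e. terms to request."""
    return sorted({eq.entity for eq in annotations if eq.entity not in known_entities})


states = annotate_presence_absence("basibranchial 1")
# Hypothetical set of terms already in the anatomy ontology.
to_request = missing_terms(states.values(), known_entities={"basibranchial 2"})
```

Here `to_request` collects the entities that would need to be submitted as new terms before the annotation could be completed.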
I want to thank our Deep-Fin sponsored Phenoscape intern this summer, Sonia Crichigno, who worked with our Ontology Curator Dr. Wasila Dahdul to enter taxon lists and character EQs. I particularly want to thank our other summer students Carin Rambow and Leah Mabee, who gathered PDFs, scanned and OCR’d documents, organized and updated the Google spreadsheet, typed in character matrices, and made initial taxon lists. I also want to thank Dr. Miles Coburn, a close colleague of mine and friend of all of us involved in Phenoscape. Miles, an expert on cypriniform fishes, was reviewing and updating taxon lists and beginning character curation when he was killed in a bicycle accident on August 16th. We all miss him.