Report from Tucson: from characters to annotations with text mining

March 30, 2013

There is a wealth of phenotypic information in the evolutionary literature that comes in the form of semi-structured character state descriptions. To get that information into computable form is, right now, an awfully slow process. In Phenoscape I, we estimated that it took about five person-years in total to curate semantic phenotype annotations from 47 papers. If we are to get computable evolutionary phenotypes from a larger slice of the literature, we really need to figure out ways to speed this up.

One promising approach is to use text-mining.  This could contribute in a few different ways.  First, one could efficiently identify all the terms in the text that are not currently represented in ontologies and add them en masse, so that data curation does not have to stop and resume whenever such terms are encountered. Second, one could present a human curator with suggestions for what terms to use and what relations those terms have to one another, speeding the process of composing an annotation.
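To make the first idea concrete, here is a minimal sketch in Python of finding terms that have no ontology match. It is not how CharaParser works: the function name `missing_terms`, the sample descriptions, and the label set are all hypothetical, and it only compares single lowercase words, whereas real ontology labels are often multi-word phrases.

```python
import re
from collections import Counter


def missing_terms(descriptions, ontology_labels):
    """Count words in the descriptions that match no known ontology label."""
    known = {label.lower() for label in ontology_labels}
    counts = Counter()
    for text in descriptions:
        # Naive tokenization: runs of letters only (an assumption of this sketch).
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in known:
                counts[token] += 1
    return counts


# Hypothetical inputs, for illustration only.
descriptions = ["Dorsal fin present", "Dorsal fin absent; pelvic fin reduced"]
labels = {"dorsal", "fin", "present", "absent"}

# Terms a curator might want added to the ontology in one batch.
print(missing_terms(descriptions, labels).most_common())
```

Counting occurrences, rather than just listing the unmatched words, would let the most frequent gaps be handled first.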

CharaParser, developed by Hong Cui at the University of Arizona, is an expert-based system that decomposes character descriptions into recognizable grammatical components, and it is now being used in several different biodiversity informatics projects. Baseline evaluation results from BioCreative III showed that a naive workflow combining CharaParser and Phenex, the software curators use to compose ontological annotations and relate them to character states, was capable of identifying candidate entity and quality phrases (it outperformed biocurators by 20% in recall on average) but had difficulty translating those into ontological annotations.  This first-iteration workflow also did not yet reduce curation time.

In March, a small contingent from NESCent (Jim Balhoff, Hilmar Lapp and Todd Vision) visited Hong Cui’s group in Tucson. We talked through improvements to CharaParser and the curation workflow, brainstormed plans for a more thorough set of evaluation tests, began refactoring of the code so that it can be more easily shared across projects, and gained a better understanding of what features make a character difficult to curate for humans vs. text-mining.  We made substantial progress on all fronts, and are looking forward to seeing how much improvement in the accuracy and efficiency of curation will be achieved in the next round of testing.

We are also pleased to report that the CharaParser codebase will now be available from GitHub under an open source (MIT) license.

Phenex 1.6 released

October 10, 2012

Phenex 1.6 has been released. Updates:

  • Support for entry of polymorphic values in matrix cells (documentation).
  • Improvements to the tab-delimited export format.

Download for Mac, Windows, or Unix.

DILS 2012

August 28, 2012

In June I had the opportunity to attend DILS 2012 (Data Integration in the Life Sciences), at the University of Maryland in College Park. I presented a poster on Phenoscape, “The Phenoscape Knowledgebase: Integrating phenotypic data across taxonomy, from biodiversity to developmental genetics”. The poster highlighted some of the new directions the Phenoscape project is heading, such as broadening taxonomic coverage and adoption of semantic web technologies. DILS was a small conference but had several talks discussing the applications of ontologies to biological data. I’m looking forward to DILS 2013 in Montreal, in conjunction with ICBO and the Canadian Semantic Web conference.

Phenex 1.4.2 released

August 16, 2012

A new bugfix release of Phenex is available. Phenex 1.4.2 addresses the following issues:


Phenoscape goes mobile

July 9, 2012

Previous layout of the KB faceted browsing page on the iPhone. Text is tiny and must be zoomed and panned.

The NESCent Informatics group periodically holds “hack days”, one-day mini-hackathons where we take a break from our usual schedule and push forward on a specific topic of interest. Most recently, the topic was support for the mobile web. I took a look at the Phenoscape Knowledgebase layout on the iPad and iPhone. In general, the site did not adapt well to small screen sizes.

In order to avoid serving different layouts to specific devices, I applied techniques from the Responsive Web Design approach, which uses new functionality from CSS 3 to dynamically adjust the page layout based on the size of the browser window. In the new layout, when the window is small, controls move from the side to the top, allowing both the controls and the content table to use the full screen width.

Using responsive web design, the controls and content become stacked on small screens.

The new layout works across most of the pages on the Knowledgebase site. In general, it is a big improvement on mobile devices. However, there are a few remaining glitches to address, such as controls that appear upon mouse hover, which are difficult to use on a touchscreen device, where there is no mouse.

Collaborative editing in Phenex 1.2

February 13, 2012

We have recently released version 1.2.1 of our Phenex annotation software. This release adds some functionality for easier collaborative editing of data files. While our curators have used Subversion revision control software in the past, the new features make it more reliable to share Phenex data files with user-friendly file synchronization software such as Dropbox. While a NeXML document is open in Phenex, the application monitors for changes to the document file in the background. If the file is being shared via Dropbox and is simultaneously edited by someone else, Phenex will alert the user that the file has changed and offer to load the new version. If there are no unsaved edits then Phenex will reload the file automatically. Phenex 1.2 also provides an autosave feature which saves the document after every edit—this reduces the chance that the file might be edited elsewhere while one has unsaved changes, avoiding complicated file merges.
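The reload-or-alert behavior described above can be sketched roughly as follows. This is only an illustration in Python, not Phenex's actual code: the function name `check_for_external_change` is hypothetical, and using the file's modification time to detect changes is an assumption of the sketch.

```python
import os


def check_for_external_change(path, last_seen_mtime, has_unsaved_edits):
    """Return (action, mtime), where action is 'none', 'reload', or 'prompt'.

    'reload' means the file changed on disk and nothing is unsaved, so it is
    safe to load the new version silently. 'prompt' means the file changed
    while there are unsaved edits, so the user should be asked what to do.
    """
    current = os.path.getmtime(path)
    if current == last_seen_mtime:
        return "none", last_seen_mtime
    if has_unsaved_edits:
        return "prompt", current
    return "reload", current
```

A background timer could call this every few seconds for the open document. With autosave on, `has_unsaved_edits` is almost always false, so the "prompt" branch should rarely be reached.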

Notes from ISWC 2011

November 3, 2011

Last week, I attended the 10th International Semantic Web Conference (ISWC) in Bonn, Germany. A tremendous variety of sophisticated work is going on both in academia and industry to improve the technology for, and take advantage of, the ever-growing network of data and concepts published, through open standards, on the web.

You might say it is the best of times and the worst of times for semantic web enthusiasts, in that reasoning and query engines that can be used on large collections of RDF have in the last few years become a reality (one of the Challenge Tracks provided contestants with a *billion* triples to work with).  But some see clouds on the horizon. The web search titans (Bing, Google and Yahoo!) are now pushing a microformat and vocabulary standard for web content that some worry may threaten the development of richer semantic web technology.  Still, most treated the news positively, happy to know that these organizations now seem to agree on the importance of semantics.  In fact, Yahoo! described at the conference how they are trying to build a “Web of Objects” that takes advantage of this standard, together with more extensive internal vocabularies, to regroup knowledge pieces that are scattered around the Web.

Conference chair Natasha Noy showed a revealing pair of tag clouds comparing the abstracts from the first year of the conference in 2001 to today: the terms “semantic” and “web” have shrunk in importance and “data” is now king!

Ivan Herman’s blog gives a good sampling of the flavor of talks presented at the meeting.  I especially enjoyed the Industry Track, since these applications are less familiar to me than the academic/scientific ones, and I was particularly impressed by the importance of semantic technologies to the news media and other content industries.  These technologies are being deployed by news organizations with great enthusiasm (e.g. the BBC).  I also came away with a strong sense that semantic technologies are helping to create demand for, and drive a revolution in the use of, Open Government Data; there were a number of demonstrations of useful real-world applications, particularly to environmental monitoring.

With my Phenoscape hat on, I attended a Linked Open Data for Science (LISC) satellite workshop prior to the main conference.  The event included both presentations and discussions from a variety of perspectives about the opportunities and challenges of this new technology.  A diversity of fields were represented (social science, linguistics, geosciences, biomedicine, etc.).  But it is clear that uptake of linked open data as an alternative means of publication is still in its infancy within the sciences.  This is despite the fact that the bioinformatics data centers account for nearly a quarter of the real estate in the famous linked data cloud diagram.  Some of the most exciting opportunities, in my opinion, come from the ability to allow radically decentralized data publication, and this is something that we might wish to pilot in a modestly distributed data curation environment like Phenoscape.  Another observation: I was surprised to discover at the meeting how much the utility of the linked data cloud (and, by extension, the semantic web) depends on the social convention by which everyone provides links into a relatively small number of large ‘concept repositories’ like DBpedia (which was originally a Master’s project, BTW).

The breakout discussion sessions at LISC  highlighted how scientific practice will place difficult demands on linked data with respect to provenance, context, granularity, distributed authority, etc.  This resonated with the message of our own contribution to the workshop, which outlined some of the particular challenges in making context-dependent links between scientific objects, when the descriptions of those objects are scattered across different resources, and when the similarities between objects are spread weakly over many properties [1].  Another important question that hit home for a number of us coming from the bioinformatics and biodiversity informatics world is how scientists are going to be able to take advantage of the innovations now going on in the commercial sector (including some of the exhibitors at the main conference) within the constraints and DIY culture of small individual university-based research grants.

There is no denying the explosion in linked data resources out there (comparisons of the growth in the cloud diagram are about as common as graphs showing the growth in sequence data at a biology conference).  But another recurrent theme of the meeting was that, unfortunately, much of that content is missing semantics (i.e. ontologies are unused or unavailable for many concepts, and links between content at different endpoints are lacking), and generating semantically annotated triples needs to be easier than it currently is (a message certainly relevant to those of us developing curation tools).

One of the keynotes, from Frank van Harmelen, generated quite a bit of buzz.  He looked back on 10 years of the semantic web, asking what theoretical principles we can learn from the experience so far, and his annotated slides are well worth a look.

The conference was a great mix of different formats.  In addition to the keynotes and regular talks, there were a host of workshops and tutorials, challenges, panel discussions (including one billed as a ‘Death Match’), and even a special competition for the best “Outrageous Ideas”.  The winner of that one was a proposal to bring linked data to the non-networked portion of humanity.  A particularly nice feature of the meeting was the ‘Minute Madness’ preceding the poster session, in which each of the poster presenters gave a short timed pitch to all the attendees. It was a very entertaining and informative way to ‘see’ every poster and allowed everyone to quickly pick out which ones to hit during the session.

For more, see the excellent day-by-day summary of the meeting from Juan Sequeda, where there are links to all the winning presentations and challenge entries.  [Ironically, the conference website is down temporarily while it is being moved, so come back later if the links to the papers hang].  The next ISWC will be November 11-15, 2012 in Boston.


[1] Vision T, Blake J, Lapp H, Mabee P, Westerfield M (2011) Similarity Between Semantic Description Sets: Addressing Needs Beyond Data Integration, in Proceedings of the First International Workshop on Linked Science, Bonn, Germany, October 24, 2011, Tomi Kauppinen, Line C. Pouchard, Carsten Kessler (eds), published in CEUR Workshop Proceedings, Volume 783.

Phenoscape visits Xenbase for Anatomy Ontology Update

September 23, 2011

Last month I visited Xenbase and Aaron Zorn’s lab at the Cincinnati Children’s Hospital for a couple of days (August 21-23, 2011) to work with Xenbase curators in preparing the Xenopus Anatomy Ontology (XAO) for its next big release.  Xenbase curators Christina James Zorn and VG Ponferrada have been leading the effort, and Erik Segerdell, the ontology development coordinator for the Phenotype RCN and former Xenbase curator, was also visiting for the week and helping with the update. Erik and I provided training in ontology editing and synchronization tools.

Postdoctoral Opportunity: Semantic Reasoning for Biological Phenotypes

July 28, 2011

We seek a postdoctoral researcher in computational biology for Phenoscape.  This person will contribute to two important research strands within the project:

  1. Development of computational and statistical methodology for measuring semantic similarity between sets of phenotypes, in order to support searches within extremely large phenotype datasets.
  2. Development and testing of methods for automatically generating ontologically based phenotype expressions from structured excerpts of natural language.

The position is based in the informatics group at the National Evolutionary Synthesis Center (NESCent), and will be administered through the University of North Carolina at Chapel Hill (UNC-CH) under the supervision of Hilmar Lapp at NESCent and Dr. Todd Vision at UNC-CH.  The research will be in collaboration with Dr. Chris Mungall at Lawrence Berkeley National Lab and Dr. Hong Cui at the University of Arizona.  The project also includes biologists and bioinformaticists from the University of South Dakota, the University of Chicago, and the University of Kansas, in addition to the model organism databases for mouse (MGD), zebrafish (ZFIN), and Xenopus (Xenbase).

Applicants should have a PhD in bioinformatics, computational biology or a related field. Prior experience with machine reasoning using ontologies is strongly preferred. The position is for two years, pending satisfactory performance and availability of funds.  To apply, please provide a cover letter, CV, and contact information for three references.  Inquiries and applications may be sent to Hilmar Lapp.  The post is open immediately and will remain open until filled.


March 9, 2011

I recently attended the Conference on Semantics in Healthcare and Life Sciences (CSHALS), in Cambridge, MA. The CSHALS meeting was a change for me in that it’s much more healthcare-oriented than other venues in which I’ve presented work from Phenoscape. This was a great opportunity to see how far the healthcare community has pushed semantic web technologies, and also to become more familiar with some of the more commercial packages which are available for storing and querying very large knowledgebases based on RDF (for example, AllegroGraph and Gruff from Franz, Inc., and Sentient Knowledge Explorer from IO Informatics). A particularly interesting talk was the keynote by Toby Segaran, of Metaweb Technologies, advocating semantic techniques as a more agile approach to data. Slideshows from the conference presentations are available for download here, including my own.


