The Challenge

BRIT holds over 1,000,000 specimens, and their labels vary widely in quality and style. Typing the label data for one specimen into the appropriate fields takes an average of five minutes. At that rate, one full-time person can enter data from 24,960 specimens in a year. An efficient workflow with machine assistance is necessary.

[Image: A woman in a lab coat looks in dismay at several tall stacks of specimen sheets. Text reads "24,960 specimens."]

Herbarium Sheet Example

Herbarium specimens such as the plant specimen pictured here present unique challenges for data digitization, parsing, and preservation. Each mark, each label, and even the location of a label on the sheet carries meaning. Careful capture and description of these data are essential for providing a true digital representation of the specimen and for widening access to digital collections.

At one time there were high hopes that optical character recognition (OCR) software could transform label data without human intervention and create machine-processable data from digital images. However, while the kinds of data included on and associated with a herbarium specimen are fairly standard, the labels themselves are products of individual plant collectors spanning 250 years. The placement of data fields and the explicitness of the data provided vary widely, which makes automatic parsing very difficult. The most significant issue, however, is that the majority of labels were not produced in a format that is easily machine-readable. The problem is compounded because specimens with handwritten labels that OCR cannot read are often the most valuable; these older specimens can tell us the most about human effects on the Earth’s vegetation over the last 250 years, including climate change, the movement of invasive species, and the loss of endangered species over time.

Text Parsing

Label text must be parsed into appropriate categories based on the characteristics of the information. Several classes of data give a broad structure to the information gleaned from labels. These classes form the basis for metadata about the specimens, which can then be ingested by a database and presented to users through an interface.
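As a rough illustration of what parsing into categories can look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `LabelRecord` fields, the `FIELD_KEYS` mapping, the `parse_label` function, and the sample label text are all hypothetical and do not describe the actual BRIT data model or workflow. It only handles tidy, typed "Key: value" lines, and it sets aside anything it does not recognize for a person to review.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LabelRecord:
    """Hypothetical container for a few label fields (not an official schema)."""
    collector: Optional[str] = None
    collection_date: Optional[str] = None
    locality: Optional[str] = None
    # Lines the sketch could not classify are kept for human review.
    unparsed: List[str] = field(default_factory=list)


# Hypothetical mapping from label keywords to record fields.
FIELD_KEYS = {
    "collector": "collector",
    "date": "collection_date",
    "locality": "locality",
}

# Matches simple "Key: value" lines, as found on some typed labels.
KEY_VALUE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def parse_label(text: str) -> LabelRecord:
    """Sort each non-empty line into a known field or into `unparsed`."""
    record = LabelRecord()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = KEY_VALUE.match(line)
        target = FIELD_KEYS.get(match.group(1).lower()) if match else None
        if target and getattr(record, target) is None:
            setattr(record, target, match.group(2))
        else:
            record.unparsed.append(line)
    return record


# Invented sample text, for illustration only.
sample = "Flora of Texas\nCollector: J. Smith\nDate: 12 May 1932\nLocality: Tarrant County"
record = parse_label(sample)
print(record)
```

A rule-based approach like this breaks down quickly on real labels, where field placement and explicitness vary from collector to collector. That is why the sketch keeps an `unparsed` list rather than guessing.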

A key challenge faced by all natural history collections is finding a process for transforming the physical into the digital that yields high-quality results in a cost- and time-efficient manner. The laborious manual keystroking required to enter each part of the label data into the correct field is the most costly step in terms of staff time and expense (for training, actual data entry, and checking and cleaning the data). It takes an average of five minutes to type the label data for one specimen into the respective fields. In one year, one full-time person could enter data from 24,960 specimens. We simply can’t afford enough people and time to database the more than 90 million herbarium specimens in the United States (conservatively, this would take 3,606 person-years). New tools and timesaving workflows are a necessity to increase access to these valuable data.
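The figures above can be reproduced with a few lines of arithmetic. The sketch below is in Python; the 40-hour week and 52-week year are assumptions, chosen because they reproduce the 24,960 figure, and the text itself does not state the working hours used.

```python
import math

MINUTES_PER_SPECIMEN = 5       # from the text: average entry time per specimen
HOURS_PER_YEAR = 40 * 52       # assumption: 40 hours/week for 52 weeks = 2,080 hours
US_SPECIMENS = 90_000_000      # from the text: "90-million+" specimens in the United States

specimens_per_hour = 60 // MINUTES_PER_SPECIMEN            # 12 specimens per hour
specimens_per_year = specimens_per_hour * HOURS_PER_YEAR   # one full-time person
person_years = math.ceil(US_SPECIMENS / specimens_per_year)

print(specimens_per_year)  # 24960
print(person_years)        # 3606
```

Because 90 million is a lower bound on the number of specimens, the person-year figure is a lower bound too, which matches the "conservatively" in the text.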