Cultural institutions count among their missions the long-term digital preservation of their collections and the provision of access to them. The whole digitization process, from selecting the collections to digitize through to online publication, is long and complex to manage. Many institutions lack the funding or the staff to digitize all of the documents in their archives, and the images are often published in an online database without any in-depth metadata enrichment. Today, an increasing number of institutions are turning to volunteers and adopting crowdsourcing to complement their internal digitization operations, particularly for data entry.
What about Crowdsourcing?
Crowdsourcing can be defined as the practice of obtaining services or ideas for a project by soliciting the participation of a large number of people, especially through the internet. In digitization, crowdsourcing tools allow the public to transcribe and tag digital documents, making them easier to search, use and read, and thus more widely accessible.
Crowdsourcing can be used for transcription projects involving old or recent handwriting (manuscripts, diaries, war or service reports, letters, postcards, civil registers, biodiversity specimen labels, etc.).
For instance, Europeana, the European digital library, has developed a dedicated website, “Europeana transcribe”, and launched several crowdsourcing projects to transcribe digitized World War I documents, such as postcards and letters, which are handwritten and often very hard to decipher.
Europeana also runs competitions, known as “Transcribathons”, in which teams compete with each other to transcribe handwritten texts online. Europeana calls on the public not only to transcribe documents but also to contribute stories, memories, and any other historical documents that families still hold from World War I. Another example is the National Museum of Natural History (MNHN) in Paris, which used crowdsourcing to transcribe specimen labels across a collection of six million scanned images.
Crowdsourcing can also be a valuable instrument for correcting OCR results, especially for collections of old digitized newspapers. Indeed, Optical Character Recognition software regularly makes mistakes on old printed materials. The National Library of Australia (Trove project) and the National Library of Finland (Digitalkoot project) have both involved volunteers in correcting the digitized text of historic newspapers.
Another use of crowdsourcing is tagging and commenting on digitized documents. The Library of Congress started one of its crowdsourcing projects back in 2008, inviting the public to tag, comment on, and identify people and places in historical photos using the Flickr platform. With geographic location tools, it is possible to compare an old photo or painting with the present-day location. The New York Public Library has undertaken several crowdsourcing projects for geotagging street maps, drawings, photographs and illustrations of New York City.
Many large cultural institutions have already developed their own platforms to enable crowdsourcing participation. There are also commercial digital library solutions that provide the necessary tools (annotation, tagging, OCR correction, metadata correction, geographic location, clipping, etc.) so that owners of digitized content can invite internet users to enrich their metadata and thus improve access to their digitized collections.
By transcribing, editing, tagging or adding comments to digitized documents, the public helps create a highly valuable database for researchers and academics, making the information more accessible, accurate and interesting. Crowdsourcing is a viable way to support digitization efforts that would otherwise be overwhelming.