Crowdsourcing can help reduce the bias of the search engine of the ‘Koninklijke Bibliotheek’ (KB), the national library of the Netherlands. CWI researchers will present this finding at the Joint Conference on Digital Libraries in Texas on 4 June 2018. The search engine favors certain articles, while other documents rarely surface in search results. When users manually correct errors introduced during the digitization of texts, the search engine immediately returns more varied results from the collection. This holds even when only a limited number of documents are corrected.
The KB has digitized many old publications, including millions of newspaper pages, using optical character recognition (OCR). OCR software searches an image for letter patterns. Whenever the system recognizes the shape of a certain letter, it saves that letter in a file. In this way, a picture of a text is converted, letter by letter, into a digital, editable and searchable text.
In documents digitized by OCR software, errors are fairly common. For example, an OCR system may read an ‘o’ as an ‘a’ because the two letters have similar shapes, which produces spelling mistakes in the digital text. As a consequence, search engines may not recognize such words in the digitized document – and not bring the document up in a search.
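A minimal sketch can make this concrete. The naive exact-term matcher below is an illustration, not the KB's actual search engine: a single misread character is enough to make a document invisible to a query for the original word.

```python
# Illustrative only: a toy exact-term search, not the KB's engine.

def tokenize(text):
    """Split a text into lowercase terms."""
    return text.lower().split()

def matches(query, document):
    """Return True if every query term appears verbatim in the document."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))

original = "the harbor of amsterdam"
ocr_output = "the harbar of amsterdam"  # OCR misread the second 'o' as an 'a'

print(matches("harbor", original))    # True
print(matches("harbor", ocr_output))  # False: the misread word no longer matches
```

Real search engines use fuzzier matching than this, but the underlying effect is the same: systematic OCR errors push the affected documents down or out of the result list.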
Researchers from CWI’s Information Access group collaborated with the KB to find out to what extent OCR errors influence the search engine’s bias. The team looked at an archive containing both documents with OCR errors and documents in which OCR errors had been manually corrected. The project included over 800 publications from the 17th century and the Second World War. Interested readers had manually corrected these documents at an earlier stage. The researchers looked at users’ search behavior, and analyzed which documents often, and which documents rarely or never appeared in search results. They saw that the ‘retrievability scores’ of the corrected documents were significantly higher than those of the uncorrected publications.
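The retrievability measure the study builds on can be sketched as follows. This is a simplified, assumed formulation (in the spirit of Azzopardi and Vickers' retrievability metric): a document's score counts how often it appears in the top ranks across a set of queries; the rankings and cutoff below are toy stand-ins, not the KB's data.

```python
# Hedged sketch of a retrievability score: for each document d,
# r(d) = number of queries for which d appears in the top `cutoff` results.
# The query rankings here are invented for illustration.

from collections import Counter

def retrievability(ranked_results_per_query, cutoff=2):
    """Count, per document, how many query rankings place it in the top `cutoff`."""
    scores = Counter()
    for ranking in ranked_results_per_query:
        for doc in ranking[:cutoff]:
            scores[doc] += 1
    return scores

# Toy result rankings for three queries over documents A-D.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "D", "C"],
]
scores = retrievability(rankings)
print(scores)
# Document D never reaches the top 2, so its retrievability is 0 —
# the kind of imbalance between documents that signals search-engine bias.
```

Comparing such scores between corrected and uncorrected documents is, in essence, how one can quantify the gap the researchers observed.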
Minor selection, major impact
“Of course, those results were in line with our expectations,” says CWI group leader Jacco van Ossenbruggen. “But our research also showed a second, much more surprising result. In large collections such as the KB collection, only a very small part of the corpus has been manually corrected. Yet our research showed that manually correcting even a small part of the collection immediately makes the search results less biased: a greater variety of documents is found through different search queries, so the retrievability difference between documents becomes smaller. In other words: manual OCR corrections, for example through crowdsourcing, increase the retrievability of the corrected publications without compromising the retrievability of the rest of the KB collection.”
Traub, Myriam, Jacco van Ossenbruggen, Thaer Samar, and Lynda Hardman. Impact of Crowdsourcing OCR Improvements on Retrievability Bias. In JCDL ’18: The 18th ACM/IEEE Joint Conference on Digital Libraries, 2018. doi:10.1145/3197026.3197046.