Optical Character Recognition Pipeline For Maori Archival Documents

Abstract

With the growing expectation of online access to documents, there has been increasing demand for optical character recognition (OCR) software that can reliably transform archival documents into searchable electronic files. However, there is still a lack of OCR software that can interpret the meaning behind specific documents. The Maori Land Court holds thousands of scanned documents that are not being fully utilised. The format of these documents is an obstacle for people trying to find and extract the relevant information effectively. This thesis presents a Python-based OCR software package that allows the underlying meaning of a structured document to be incorporated into the digitisation process and allows the user to tailor the OCR process to specific documents. As part of this research, we propose a novel method for page layout interpretation and show that this algorithm is sufficiently flexible to work with different document structures. In addition, our software provides an interface that allows the user to design the output based on their future processing needs. Our software has been used to convert two scanned historical Maori Land Court documents for the Parininihi ki Waitotora Incorporation into electronic text. During this research, we also created many Python programs and software tools that improve the user experience when using our OCR system on Maori documents.

Publication
UoA Archive