Typical System & Application
The client is a national library in the process of establishing an electronic library. Some materials printed nearly 100 years ago, such as Latvian newspapers and magazines set in Gothic font, needed to be converted to electronic files. OCR, which is responsible for recognizing large documents, will be an important part of digital library projects.
OCR Functions & Problems
OCR technology is an important part of any document management system, where it is mainly used to recognize characters in an image and so reduce manual entry time. The following problems often occur during recognition.
- Degraded documents: The newspapers to be recognized are nearly 100 years old, which leads to touching characters, ink bleed-through from the reverse side, excessive noise and other issues. Since the recognition results will be used for content research, the required accuracy rate is more than eighty percent, which is almost impossible to achieve with conventional scanning tools.
- Ancient font text: In the beginning of the 20th century, the first attempts were made to create a Latvian writing system based on German influences. Like German texts of that time, Latvian books and newspapers were printed using Gothic fonts that differ considerably from the Latin fonts used in modern times. However, the Gothic fonts for printed Latvian texts were abandoned in the 1930s, and no similar fonts have been in use for many years. For such an archaic font, no OCR software was available to recognize the outdated characters.
- Special national characters: In the early 20th century, many Latvian printing houses invented their own character "dialects"; Gothic fonts were supplemented with irregular variations for specific words of that time. Collecting all the different characters that appeared in that period was the primary task before recognition, and it required significant manual labor, time and effort. In addition, the need to recognize all of these characters increased the difficulty of recognition, which cannot be handled unless the OCR engine is customized.
- Special translation tool: At that time, Latvian orthography was quite different from the modern system, so translation rules were needed to compare the recognition results for the old characters against a database of modern words and provide an auto-correction function. Since the translation rules are special, no off-the-shelf software can solve this problem unless the OCR vendor provides a customization service.
- System integration: The library (NLL) decided to use Olive Active Paper Archive software as its information retrieval, access and management tool. Although the software has OCR functions, it does not support recognition of the special Latvian Gothic font. The new OCR software therefore had to be integrated with NLL's application, and the recognition results had to be shared with it.
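The translation-rule auto-correction described in the list above can be illustrated with a minimal Python sketch. It is not the tool built for the client. The rule table, the tiny word list, the similarity cutoff and the function names (`modernize`, `autocorrect`) are all assumptions made for illustration; a real system would need the full rule set collected for the period and a complete modern dictionary.

```python
import difflib

# Illustrative old-to-modern spelling rules (assumed, not the project's rules).
# Order matters: longer patterns are applied before their substrings.
RULES = [
    ("tsch", "č"),
    ("sch", "š"),
    ("w", "v"),
    ("ee", "ie"),
]

# Hypothetical stand-in for the modern words database.
LEXICON = ["latviešu", "skola", "vārds"]


def modernize(word, rules):
    """Apply ordered substring replacements to an old-orthography word."""
    result = word.lower()
    for old, new in rules:
        result = result.replace(old, new)
    return result


def autocorrect(ocr_word, rules, lexicon, cutoff=0.7):
    """Map an OCR result to a modern dictionary word, or None if no match.

    First the translation rules are applied. If the result is not an exact
    dictionary entry, the closest entry above the similarity cutoff is used,
    which absorbs small recognition errors.
    """
    candidate = modernize(ocr_word, rules)
    if candidate in lexicon:
        return candidate
    matches = difflib.get_close_matches(candidate, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(autocorrect("Latweeschu", RULES, LEXICON))  # latviešu
    print(autocorrect("xyz", RULES, LEXICON))         # None
```

Applying the rules before the dictionary lookup keeps the two concerns separate: the rules capture systematic spelling differences, while the fuzzy match only has to cover residual recognition errors. The cutoff of 0.7 is an arbitrary choice here and would need tuning against verified samples.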
To address all of the above problems and meet our client's needs, we customized our standard RTK, gearing it to the client's image samples.
- Customized scan tool to improve image quality
- Grouped training to improve OCR accuracy
- Incremental software development process for effectiveness and efficiency
- Collaborative research & development drawing on both parties' strengths
- Leverage of process, technology and human factors