Commissioned by the U.S. Department of Energy, the Information Science Research Institute (ISRI) conducted the authoritative Annual Test of OCR Accuracy for five consecutive years in the mid-1990s.
About Information Science Research Institute (ISRI)
The Information Science Research Institute (ISRI) is a research and development unit of the [University of Nevada, Las Vegas (UNLV)]. ISRI was established in 1990 with funding from the [U.S. Department of Energy]. Its mission is to foster the improvement of automated technologies for understanding machine-printed documents.
To pursue this goal, ISRI has established the following programs:
1. ISRI conducts a program of applied research in the recognition of information from machine-printed documents. ISRI's research focuses on developing new metrics of recognition performance, measures of print quality, document image enhancement, and the characterization of document analysis techniques.
2. ISRI conducts a program of applied research in information retrieval. This research focuses on issues related to the combined use of recognition and retrieval technologies. For example, ISRI evaluates the effectiveness of different retrieval models in the presence of OCR errors, and investigates improvements that can be made in the retrieval environment to reduce the effects that recognition errors have on retrieval. Further, ISRI is developing systems to automatically tag the physical and logical structure of documents to establish a mapping between the text and the image. This mapping can be exploited in various ways to improve both the retrieval and the display of documents.
3. Each year, ISRI sponsors the “Symposium on Document Analysis and Information Retrieval” (SDAIR). This symposium provides a forum for presenting the results of research into improved technologies for document understanding, with emphasis on both recognition and retrieval from machine-printed documents.
4. ISRI conducts an annual “OCR Technology Assessment” program. Each year, using its automated test facilities, ISRI prepares an in-depth, independent comparison of the performance characteristics of all available technologies for character recognition from machine-printed documents. The results of this test are first made public at the SDAIR symposium.
These programs interact strongly. ISRI expects that the continued development of new measures of OCR system performance will contribute to a better understanding of recognition problems.
ISRI’s Technology Assessment program provides an opportunity each year to apply new metrics.
Metrics such as non-stopword accuracy and phrase accuracy reflect the ability to retrieve information. ISRI's view is that new measures of recognition technologies are needed and that goal-directed measures may be the most important. Finally, SDAIR is a natural forum not only for presenting and discussing detailed test results but also for stimulating interaction between recognition and retrieval researchers. ISRI's goals are to promote an improved understanding of the current state of the art in both recognition and retrieval and to promote the exchange of information among the user, vendor, and academic communities.
Annual Test of OCR Accuracy
Commissioned by the [U.S. Department of Energy], ISRI conducted the authoritative Annual Test of [Optical character recognition] (OCR) Accuracy for five consecutive years in the mid-1990s.
The Information Science Research Institute (ISRI) at the [University of Nevada, Las Vegas (UNLV)] has conducted an experiment to determine the accuracy of six commercially available OCR devices: Caere [OmniPage] Professional, [Calera] RS 9000, [ExperVision] TypeReader, [Kurzweil] 5200, [Recognita] Plus, and [Toshiba] ExpressReader.
2 Test Data
Test data consisted of 240 pages that were selected at random from the GT1 database. For each page in the GT1 database, there is a 300 dpi binary image file. These images were produced under conditions that are typical for a large-scale data conversion operation. The operators who performed the scanning had adequate training, but were not “experts.”
Each device processed exactly the same zoned portions of the same images. All processing, including the determination and tabulation of errors, was carried out entirely under computer control, i.e., there was no human interaction with the devices during the experiment. The software tools that were used are part of the ISRI OCR experimental environment.
3.2 Error Counting
Accuracy was determined on a character basis. Each character insertion, substitution, or deletion required to correct the generated text was counted as an error. Any “reject characters” that were generated were not treated specially, but were counted as errors.
4 Results and Analysis
Accuracy Statistics for the Entire Sample
2. [ExperVision] 99.72%
ISRI has conducted its second annual assessment of the accuracy of devices for [optical character recognition] (OCR) of machine-printed, English-language documents. This year’s test featured more devices, more data, and more measures of performance than the test conducted a year ago [Rice 92a].
ISRI has attempted to acquire one copy of every OCR technology available. Only one version was tested from each vendor, but vendors were allowed to submit their latest, most accurate version. In many cases, this was a “pre-release” or “beta” version. The deadline for submissions was January 31, 1993. Table 1 lists the versions that were evaluated: Caere [OmniPage] Corp., [Calera] Recognition Systems, Inc., Cognitive Technology [Cuneiform] Corp., [CTA], Inc., [ExperVision], Inc., [OCRON] Inc., [Recognita] Corp., [Xerox] Imaging Systems, Inc.
The data used in the test consisted of 500 pages selected at random from a collection of approximately 2,500 documents containing 100,000 pages. The documents in this collection were chosen by the [U.S. Department of Energy] (DOE) to represent the kinds of documents from which the DOE plans to build large, full-text retrieval databases using OCR for document conversion. The documents are mostly scientific and technical papers [Nartker 92].
Each device processed the same zoned portions of the same binary images. This processing was carried out in an entirely automated manner, i.e., there was no human interaction with the devices.
5 Character Accuracy
Each character insertion, substitution or deletion required to correct the generated text is counted as an error. This metric is attributed to Levenshtein [Levenshtein 66]; the number of errors has been termed edit distance by Wagner and Fischer [Wagner 74].
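The edit distance named above can be computed with the Wagner-Fischer dynamic-programming algorithm. The sketch below is illustrative only (it is not part of the ISRI test software); it counts the minimum number of character insertions, deletions, and substitutions needed to turn the OCR-generated text into the correct text:

```python
def edit_distance(generated: str, correct: str) -> int:
    """Levenshtein edit distance between the OCR output and the
    correct text, via the Wagner-Fischer dynamic program."""
    # prev[j] holds the distance between the current prefix of
    # `generated` and the first j characters of `correct`.
    prev = list(range(len(correct) + 1))
    for i, g in enumerate(generated, start=1):
        curr = [i]  # deleting the first i generated characters
        for j, c in enumerate(correct, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (0 if g == c else 1),  # substitution/match
            ))
        prev = curr
    return prev[-1]

# One extra character and one misread character -> 2 errors:
print(edit_distance("OCRR tcst", "OCR test"))  # 2
```

Each table cell depends only on the previous row, so a single rolling row suffices and the memory cost stays linear in the length of the correct text.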
ISRI has conducted its third annual test of the accuracy of OCR systems. Vendors submitted their latest technology for recognizing machine-printed English text from page images. This year’s test re-used the 460-page sample from [U.S. Department of Energy] (DOE) documents that was used a year ago [Rice 93a]. In addition, a new 200-page sample, randomly selected from popular magazines, was utilized.
Eleven vendors elected to participate and submitted a version by the deadline, January 18, 1994. Caere [OmniPage] Corporation, [Calera] Recognition Systems, Inc., [Electronic Document Technology], [ExperVision], Inc., Recognita [Cuneiform] Corp., [Xerox] Imaging Systems, Inc.
2 Test Data and Methodology
Two sets of test data were utilized: the “[U.S. Department of Energy] (DOE) sample” and the “Magazine sample.” The DOE sample consists of the same 460 pages that were used in last year’s test. These pages were selected at random from a collection of approximately 2,500 scientific and technical documents (about 100,000 pages). This collection is described in [Nartker 92].
3 Character Accuracy
Each character insertion, substitution or deletion required to correct the text generated by an OCR system is counted as an error.
6. [EDT Image Reader] 95.52%
For four years, ISRI has conducted an annual test of [optical character recognition] (OCR) systems known as “page readers.” These systems accept as input a bitmapped image of any document page, and attempt to identify the machine-printed characters on the page. In the annual test, we measure the accuracy of this process by comparing the text that is produced as output with the correct text.
2 Test Data
Five test samples were used in this year’s test.
1) The Business Letter Sample contains a variety of letters received by businesses and individuals and donated to ISRI.
2) The [U.S. Department of Energy] (DOE) Sample is the third and largest sample we have prepared by randomly selecting pages from a DOE collection of scientific and technical documents.
3) The Magazine Sample, which was used in the third annual test, consists of pages selected at random from the 100 U.S. magazines having the largest circulation.
4) The English Newspaper Sample contains articles selected at random from the 50 U.S. newspapers having the largest circulation.
5) The Spanish Newspaper Sample contains articles selected at random from 12 popular newspapers from Argentina, Mexico, and Spain.
3 Character Accuracy
While there are many ways of quantifying the deviation between the OCR-generated text and the correct text, our most fundamental measure reflects the effort required by a human editor to correct the OCR-generated text. Specifically, we compute the minimum number of edit operations (character insertions, deletions, and substitutions) needed to fully correct the text. We refer to this quantity as the number of errors made by the OCR system. Expressing it relative to the total number of characters, we obtain the character accuracy: (n − #errors) / n, where n is the number of characters in the correct text.
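Given the error count, the computation is a one-liner; this small sketch (the function name is illustrative, not taken from the ISRI software) turns it into a percentage:

```python
def character_accuracy(n_chars: int, n_errors: int) -> float:
    """Character accuracy as a percentage, where n_chars is the
    number of characters in the correct text and n_errors is the
    minimum number of edit operations needed to correct the
    OCR-generated text."""
    return 100.0 * (n_chars - n_errors) / n_chars

# A 10,000-character sample corrected with 28 edit operations:
print(character_accuracy(10_000, 28))  # 99.72
```

Note that the accuracy can be negative when a system produces more errors than there are characters in the correct text, e.g. by hallucinating large amounts of spurious output.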
The Information Science Research Institute (ISRI) at the [University of Nevada], Las Vegas, conducts an annual test of page-reading systems. A page-reading system, or “page reader,” accepts as input a bitmapped image of any document page. This image is a digitized representation of the page and is produced by a scanner. Through a process known as [optical character recognition] (OCR), the page reader analyzes the image and attempts to locate and identify the machine-printed characters on the page. The output is a text file containing a coded representation of the characters, which may be edited or searched.
2 Test Description
Any organization may participate in the annual test by submitting a page-reading system by the established deadline, which was December 15, 1995 for the fifth annual test. The system must be able to run unattended on a PC or Sun SPARCstation. Participation in the test is voluntary and free, but only one entry is allowed per organization, and participants must sign an agreement regarding the use of the test results in advertising. Table 1 lists the participants in this year’s test.
3 Character Accuracy
The text generated by a page-reading system is matched with the correct text to determine the minimum number of edit operations (character insertions, deletions, and substitutions) needed to correct the generated text. This quantity is termed the number of errors.