OCR Shop XTR Tips for Better Recognition and OCR Processing

From working with customers over the past several years, we have identified
the most common issues that arise when processing images with OCR Shop XTR,
and have outlined them below along with methods for improving results and
processing times.

===============================================================================

Contents:

* Input Image Resolution
* Improving Results with Non-binary Input Images
* Automatic Processing and Filtering
* Output filesize
* PDF and PS Input: Bit-depth, Memory Usage, and Processing Speed
* Non-square Fax Images
* TIFF Fill-order Bit
* OCR Processing Speed
* Understanding OCR Processing Using the PDF "normal" Output Format
* Output of Non-Latin1 Character Sets

===============================================================================

* Input Image Resolution

Make sure the resolution of the input image, as well as the font size with
respect to that resolution, is within normal limits. OCR Shop XTR accepts:

  - Image resolutions from 72dpi to 900dpi
  - Fonts from 5 to 72 points

The resolution of the input image determines what one "point" means in the
font point size. The resolution is specified in the input image file, or,
when not specified, is assumed to be 300dpi by default.

  - There are 72 points per inch.
  - The point size of a font is measured from the top of the highest
    ascender to the bottom of the lowest descender.
  - The dpi specifies the number of pixels per inch.

If the type in your image is particularly large or small, it might fall
outside the accepted font point sizes, depending on the image resolution.
OCR Shop XTR allows you to adjust how the OCR engine interprets the font
size through the "in_res" option.

For example, if your font is 15 pixels high and the image resolution is
300dpi, then the font point size is approximately 3.6 points, too small for
the engine to recognize well.
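
The arithmetic behind this example is easy to script. The Python sketch
below is illustrative only and is not part of OCR Shop XTR; the function
names "point_size" and "in_accepted_range" are made up for this example. It
applies the definitions above (72 points per inch, dpi as pixels per inch)
to a measured font height and checks the result against the 5 to 72 point
range quoted earlier.

```python
# Limits quoted in this document for accepted font sizes, in points.
MIN_POINTS, MAX_POINTS = 5, 72


def point_size(height_px, dpi=300):
    """Approximate font point size from the font height in pixels.

    dpi defaults to 300, the resolution assumed when the image file
    does not specify one.
    """
    return height_px * 72 / dpi


def in_accepted_range(height_px, dpi=300):
    """True if the font would be interpreted as 5 to 72 points."""
    return MIN_POINTS <= point_size(height_px, dpi) <= MAX_POINTS


print(point_size(15, 300))         # 3.6
print(in_accepted_range(15, 300))  # False
print(in_accepted_range(15, 200))  # True
```

Passing a different dpi to the helper mirrors what an "in_res" override
does to the interpreted point size: the pixels stay the same, only the
assumed resolution changes.
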
For a 15-pixel font at 300dpi, we recommend setting the "in_res" option to
200dpi or 100dpi so the font will be interpreted as approximately 5 or 11
points in size, respectively.

Similarly, if your font is 80 pixels high and the image resolution is
72dpi, then the font point size is 80 points, too large for the engine to
recognize well. In this case, we recommend trying an "in_res" of 100, for
instance, to have the font interpreted at a point size of approximately 58
points.

You may approximate the point size of your font with the equation:

  [height of font in pixels] * 72 points/inch / [image dpi]
      = [point size of font]

Remember that the height of the font is measured from the top of the
highest ascender to the bottom of the lowest descender. If you count the
pixels, make sure you view that portion of your image at full resolution on
your screen, sometimes referred to as "actual pixels".

* Improving Results with Non-binary Input Images

When a grayscale or color image is sent to OCR Shop XTR as input, the OCR
engine converts it to 1-bit black and white image data before processing.
You can control this transformation using the "black_threshold" option.

The default conversion to 1-bit image data is optimized for black text on a
white background. If your input image is low in contrast, adjusting the
black_threshold can often improve the results dramatically. The default
value of black_threshold is 60. Valid values are 0-100, plus the special
values 101 and 102, which select the random threshold and Floyd-Steinberg
algorithms.

A good way to understand the effect of the black_threshold option is to
generate a debug output file that shows you the 1-bit black and white image
data that is sent to the OCR engine for processing. Try running this
command-line:

  ocrxtr -out_debug_files=y image.tif

An image file called "converted_input_file" will be created.
This file holds the image after it has been converted to 1-bit black and
white; it is the data that will be recognized by the engine. View
"converted_input_file" in an image viewer to see what the OCR engine is
attempting to recognize. Try adjusting the black_threshold and regenerating
this file; notice how "converted_input_file" looks different depending on
how you set the black_threshold option. Remove "converted_input_file"
before generating it again, because ocrxtr appends to it rather than
overwriting it.

There are a few different approaches to handling the conversion to 1-bit
over a large number of images:

First, if all of your images come from the same source material and were
scanned on the same scanner, then you can simply adjust the black_threshold
to the best value for one of the pages, then use that value when
recognizing the entire set of documents.

However, if you are planning to recognize a large number of documents from
a variety of sources, you might want to take a different approach. If you
will be calling ocrxtr from within another program, then you could, for
instance, adjust the black_threshold programmatically for each logical set
of similar documents: recognize the first page, evaluate the quality of the
results, adjust the black_threshold if needed, and recognize again. OCR Shop
XTR provides an output format called "XDOC", which can include confidence
values for each word or character and could help you make this judgement
programmatically.

Alternatively, you could convert the input images to 1-bit prior to
submitting them to ocrxtr. This would give you more control over the
conversion, and you would know exactly what image data the OCR engine is
operating on.

One other method you can use to try to improve recognition results with a
multi-bit input image is increasing the image resolution before using it
with OCR Shop XTR.
Be aware that increasing image resolution will increase the image's file
size, resulting in longer OCR processing time and increased memory usage.
You may reduce this side effect by converting the input image to 1-bit
depth after increasing its resolution and before using it with OCR Shop
XTR.

* Automatic Processing and Filtering

Turn on the options "auto_process" and "auto_filter" in order to have the
OCR engine determine which filters will provide the best results for your
input images. Both of these options are on by default.

When you first try OCR Shop XTR, it is best to try it with all default
options to observe the basic behavior and results. Then, if you turn off
auto_process and auto_filter, you can try the different filter options to
determine if any will improve your results: fax_filter, newspaper_filter,
and dotmatrix_filter. You may also leave auto_process and auto_filter on,
and turn off the specific processing and filter options individually.

If you turn off auto_process, be careful to set auto_orient to "correct"
and auto_flip to "Y" if any image might need to have its orientation
automatically detected and corrected. Similarly, setting
"photometric_interp" to "correct" is important if some areas of the input
image are black on white and other areas are white on black.

* Output filesize

To control the filesize of an output format that contains image data (PDF,
HTML, and graphics output), set the bit-depth for the output image data
using the "out_depth" parameter. For instance, to create a 1-bit output PDF
file, run the command-line:

  ocrxtr -out_depth=1 -out_text_format=pdf image.tif

By default, the bit-depth of the output matches the bit-depth of the input.
For PDF and PS input, this default is 1 bit per pixel, because the
bit-depth of an input PDF or PS file is unknown.
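
To get a feel for how much the bit-depth matters, the Python sketch below
estimates the uncompressed size of one page of image data at each
bit-depth. It is a back-of-the-envelope calculation, not a description of
how OCR Shop XTR stores or compresses image data; the function name
"raw_page_bytes" and the sample page (8.5 x 11 inches at 300dpi) are
assumptions chosen for illustration.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page of raster image data.

    Each row of pixels is padded up to a whole number of bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Hypothetical sample page: 8.5 x 11 inches scanned at 300dpi.
for depth in (1, 8, 24):
    size = raw_page_bytes(8.5, 11, 300, depth)
    print(f"{depth:>2} bits per pixel: {size / 1_000_000:.1f} MB per page")
```

Compressed output files will normally be smaller than these figures, but
the ratio between bit-depths gives a rough sense of the cost of keeping
grayscale or color image data.
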

The "out_depth" option may have these values:

  input   Bit depth of the input image (default)
  1       1 bit per pixel
  8       8 bits per pixel
  24      24 bits per pixel

* PDF and PS Input: Bit-depth, Memory Usage, and Processing Speed

PDF and PS input files are rendered by default to 1-bit image data prior to
OCR processing. When using a PDF or PS input file, if you need to retain 8
or 24-bit image data in PDF, HTML, or graphics output, set the "out_depth"
parameter to 8 or 24 bits per pixel.

WARNING: If your input PDF or PS file contains a large number of pages,
setting "out_depth" to 8 or 24 may result in excessive memory usage, slow
processing times, and potential swap errors if memory is exceeded. For
large input PDF and PS files, we recommend using the default "out_depth" of
1 bit per pixel.

Details: OCR Shop XTR treats PDF and PS input files differently from other
input formats such as JPEG and TIFF, because PDF and PS input files must be
rendered into image data. For maximum efficiency, OCR Shop XTR renders PDF
and PS input files as 1-bit image data by default. However, when the user
sets the "out_depth" option, OCR Shop XTR must render the input PDF or PS
file at the specified bit-depth so that this level of graphical information
is maintained through to the PDF, HTML, or graphics output.

* Non-square Fax Images

Some fax images have resolutions that are not square, meaning the
horizontal and vertical resolutions differ. If your input image is a fax
and you suspect this is the case, try setting the option
"-double_dimension=y". When "double_dimension" is set and the image
resolution is not square, the engine internally doubles either the columns
or the rows to make the image more square and improve recognition. Turning
on this flag does not guarantee that the image will be doubled. Image
output, either in an embedded document or with plain graphics output, is
not affected by image doubling.
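
As an illustration of what "not square" means in practice, the Python
sketch below compares the horizontal and vertical resolution of an image
and reports which dimension would need to be doubled to bring the two
closer together. This is a hypothetical helper and does not describe the
engine's internal decision logic; the function name, the 1.5 ratio cutoff,
and the sample resolutions (204 x 98 dpi is a common standard-resolution
fax mode) are assumptions made for this example.

```python
def dimension_to_double(x_dpi, y_dpi, cutoff=1.5):
    """Return "rows", "columns", or None for the given resolution.

    Doubling the rows compensates for a low vertical resolution;
    doubling the columns compensates for a low horizontal resolution.
    The cutoff ratio is an arbitrary choice for this sketch.
    """
    if x_dpi / y_dpi >= cutoff:
        return "rows"
    if y_dpi / x_dpi >= cutoff:
        return "columns"
    return None


print(dimension_to_double(204, 98))   # rows
print(dimension_to_double(204, 196))  # None
```
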

* TIFF Fill-order Bit

If your input image is a TIFF file and your results are much worse than you
expect, given the quality and properties of the input image, it is possible
that the "fill-order" bit is set in the input image file.

Most TIFF images do not use the fill-order bit; in fact, many programs that
create TIFF files write the fill-order bit incorrectly. As a result, by
default, OCR Shop XTR ignores the TIFF fill-order bit. In the rare case
where an image has the fill-order bit set correctly, you will need to
instruct OCR Shop XTR to obey it.

To determine if OCR Shop XTR should obey the fill-order bit for your input
TIFF image, run this command-line with your image:

  ocrxtr -out_debug_files=y image.tif

An image file called "converted_input_file" will be created. This is the
actual image data that will be recognized by the engine. View the file in
an image viewer. Does it look odd, as though each byte of image data has
the bits reversed? If so, then the fill-order bit in the image is probably
set and should be obeyed.

To have OCR Shop XTR obey the TIFF fill-order bit, set the command-line
option "-ignore_tiff_fillorder=n". Alternatively, you may set an
environment variable, VV_IGNORE_FILLORDER, to "n".

Note on the "out_debug_files" flag: This flag instructs OCR Shop XTR to
create two debug TIFF files: "converted_input_file" and
"unconv_input_file". OCR Shop XTR creates "unconv_input_file" immediately
after reading in the input image, and it should be an exact copy of the
input image in TIFF format. OCR Shop XTR creates "converted_input_file"
after converting the input image data to 1-bit prior to OCR processing. OCR
Shop XTR appends to these debug files, instead of overwriting them.
As a result, with "out_debug_files" set, running OCR Shop XTR multiple
times, passing multiple images on the command-line, or passing a multipage
input file will result in a new page of image data appended to
"converted_input_file" and "unconv_input_file" for each page of input image
data. We recommend that you delete "converted_input_file" and
"unconv_input_file" between runs, and/or view them with an application
designed to handle multipage TIFF images.

* OCR Processing Speed

The main variables that affect how fast OCR Shop XTR will process an image
are:

- Filesize of the input image

  Large input files require more memory and can result in longer processing
  times. PDF and PS input files may require more memory than anticipated at
  first glance, because they are rendered into image data before being
  loaded into the OCR engine. By default, PDF and PS input files are
  rendered into 1-bit image data, which is small. If the user specifies an
  "out_depth" of 8 or 24 bits, however, the input PDF and PS files will be
  rendered at 8 or 24 bits per pixel, and the amount of image data may be
  large. This typically only presents complications if the input PDF or PS
  file has a large number of pages; see the section "PDF and PS Input:
  Bit-depth, Memory Usage, and Processing Speed". Be aware that setting
  "out_depth" to a value lower than the input image's bit-depth by
  definition reduces the colorspace in PDF, HTML, or graphics output.

- Quality of the input image

  Lower quality input images generally take longer to process. In the
  preprocessing step of OCR, cleaning up and deskewing lower quality images
  can be time consuming. In the recognition step, the engine simply takes
  longer to recognize less clear text. Similarly, extraneous marks on the
  input images, such as handwriting, stamps, or scribbles, will cause the
  engine to take longer; note that distinct image regions are much easier
  for the engine to understand than amorphous marks.
  If you can ensure your input images will be high quality, with clear
  text, no image skew, and no extraneous marks, OCR Shop XTR will run
  fastest.

- Command-line options

  During the preprocessing step of OCR, certain options automatically
  detect image properties such as orientation, fax images, or skew. If you
  know, for example, that your images will always be oriented correctly,
  that they aren't faxes, and that they aren't skewed, you can turn these
  options off and improve processing time.

- Your machine (CPU speed, RAM)

  A faster processor will result in faster OCR processing. Sufficient RAM
  will help you achieve the fastest results, and is especially important
  for complicated images, large input files, and large combined output
  documents. If you notice your machine thrashing, where it spends
  excessive time reading and writing to swap, then more memory is likely to
  improve performance.

  OCR Shop XTR is not multithreaded, so multiple CPUs will not
  significantly improve performance unless you are running multiple
  instances of OCR Shop XTR concurrently. Multiple instances of OCR Shop
  XTR may run concurrently if you purchase multiple OCR Shop XTR licenses
  for the same machine. Trial users may request a demo license key that
  permits multiple instances.

* Understanding OCR Processing Using the PDF "normal" Output Format

When you first use OCR Shop XTR, the PDF output format "normal" can be
helpful in understanding how OCR Shop XTR recognizes the text and
reconstructs the formatting, as well as in providing visual feedback on how
the command-line options you choose affect the processing of your image:

  ocrxtr -out_text_format=pdf -pdf_format=normal image.tif

While this format is typically not as useful for archiving images as
"-pdf_format=img_txt", it makes a good experimentation and debugging tool.
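
When experimenting, it can help to generate a series of command-lines that
differ in a single option. The Python sketch below only builds and prints
such command-lines; it does not run ocrxtr. The helper name "build_command"
is made up for this example, and writing the black_threshold option as
"-black_threshold=N" is an assumption that follows the "-name=value"
pattern of the other command-lines in this document. How output files are
named is not covered here, so check that successive runs do not overwrite
each other's output.

```python
import shlex


def build_command(image, **options):
    """Build an ocrxtr argument list that requests PDF "normal" output.

    Extra keyword arguments become "-name=value" options (assumed syntax).
    """
    args = ["ocrxtr", "-out_text_format=pdf", "-pdf_format=normal"]
    args += [f"-{name}={value}" for name, value in options.items()]
    args.append(image)
    return args


# Print one command-line per threshold value to try.
for threshold in (40, 60, 80):
    print(shlex.join(build_command("image.tif", black_threshold=threshold)))
```
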

The PDF "normal" format contains the recognized text, reconstructions of
tables and lineart, plus small images that correspond to the image regions
identified by the OCR engine. This information is laid out in the output
PDF document in an attempt to approximate the original image's formatting
as closely as possible.

See how changing different command-line parameters affects the appearance
of an output PDF "normal" document. Consider using this format to find the
optimal settings for a particular set of your input images, then create
final output in the format you wish.

* Output of Non-Latin1 Character Sets

If you generate output using a character set other than Latin1, be careful
which output format you choose, because not all output formats support
non-Latin1 characters. For example, Russian cannot be represented by ASCII
text ("iso"), but can be represented by Unicode ("unicode"). Also be aware
that the viewer with which you open output files must support the character
set and format generated.
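
The underlying limitation is a property of the encodings themselves, which
the short Python sketch below demonstrates. It is a general illustration
and does not involve OCR Shop XTR; the codec names "ascii", "latin-1", and
"utf-8" are Python's names, not ocrxtr output format names, and the helper
"can_encode" is made up for this example.

```python
sample = "Привет"  # Russian for "hello"


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, can_encode(sample, encoding))
```

Only the Unicode encoding can hold the Cyrillic sample; the same reasoning
applies when choosing an output format for any non-Latin1 language.
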