OCR Shop XTR Tips for Better Recognition and OCR Processing


From working with customers over the past several years, we have identified
the most common issues that arise when processing images with OCR Shop XTR,
and have outlined them below along with methods for improving results and
processing times.


===============================================================================
Contents:

    * Input Image Resolution
    * Improving Results with Non-binary Input Images
    * Automatic Processing and Filtering
    * Output filesize
    * PDF and PS Input:  Bit-depth, Memory Usage, and Processing Speed
    * Non-square Fax Images
    * TIFF Fill-order Bit
    * OCR Processing Speed
    * Understanding OCR Processing Using the PDF "normal" Output Format
    * Output of Non-Latin1 Character Sets 
===============================================================================


* Input Image Resolution

  Make sure the resolution of the input image, as well as the font size with
  respect to that resolution, are within normal limits.
  
  OCR Shop XTR accepts:
  
    - Image resolutions from 72dpi to 900dpi
    - Fonts from 5 to 72 points

  The resolution of the input image determines what one "point" means in the
  font point size.  The resolution of the input image is specified in the
  input image file, or, when not specified, is assumed to be 300 dpi by
  default.
  
    - There are 72 points per inch.
    - The point size of a font is measured from the top of the highest
      ascender to the bottom of the lowest descender.
    - The dpi specifies the number of pixels per inch.
  
  If the type in your image is particularly large or small, it might fall
  outside the accepted font point sizes, depending on the image resolution.
  OCR Shop XTR allows you to adjust how the OCR engine interprets the font
  size through the "in_res" option.

  For example, if your font size is 15 pixels high and the image resolution is
  300dpi, then the font point size is approximately 3.6 points, too small for
  the engine to recognize well.  In this case, we recommend setting the
  "in_res" option to 200dpi or 100dpi so the font will be interpreted as
  approximately 5 or 11 points in size, respectively.

  Similarly, if your font size is 80 pixels high and the image resolution is
  72dpi, then the font point size is 80 points, too large for the engine to
  recognize well.  In this case, we recommend trying an "in_res" of 100, for
  instance, to have the font interpreted at a point size of approximately 58
  points.

  You may approximate the point size of your font with the equation:

  [height of font in pixels] * 72 points/inch / [image dpi] = [point size of font]

  Remember the height of the font is measured from the top of the highest
  ascender to the bottom of the lowest descender.  If you count the pixels,
  make sure you view that portion of your image at full resolution on your
  screen, sometimes referred to as "actual pixels".
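
  The equation above can be sketched in Python as a quick check.  The function
  names are illustrative and are not part of OCR Shop XTR; the 5 and 72 point
  limits are the ones listed at the start of this section.

```python
POINTS_PER_INCH = 72
MIN_POINTS, MAX_POINTS = 5, 72  # font size limits listed in this section


def point_size(height_px, dpi):
    """Approximate font point size from glyph height in pixels and image dpi."""
    return height_px * POINTS_PER_INCH / dpi


def in_res_for(height_px, target_points):
    """Resolution at which a glyph of height_px is interpreted as target_points."""
    return height_px * POINTS_PER_INCH / target_points


if __name__ == "__main__":
    for height_px, dpi in [(15, 300), (15, 200), (80, 72), (80, 100)]:
        size = point_size(height_px, dpi)
        ok = MIN_POINTS <= size <= MAX_POINTS
        print(f"{height_px}px at {dpi}dpi: {size:.1f}pt, within limits: {ok}")
```

  Solving the same equation for the resolution, as in_res_for() does, gives a
  starting value for "in_res" when you know the pixel height of your type and
  the point size you want the engine to assume.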


* Improving Results with Non-binary Input Images

  When a grayscale or color image is sent to OCR Shop XTR as input, the OCR
  engine converts it to 1-bit black and white image data before processing.
  You can control this transformation using the "black_threshold" option.

  The default conversion to 1-bit image data is optimized for black text on a
  white background.  If your input image is low in contrast, you can probably
  dramatically improve the results by adjusting the black_threshold.

  The default value of black_threshold is 60.  Its range is 0-100, plus the
  two values 101 and 102, which select special algorithms:  random threshold
  and Floyd-Steinberg, respectively.

  A good way to understand the effect of the black_threshold option is to
  generate a debug output file that shows you the 1-bit black and white image
  data that is sent to the OCR engine for processing.  Try running this
  command-line:

    ocrxtr -out_debug_files=y image.tif

  An image file called "converted_input_file" will be created.  This is the
  image data after it has been converted to 1-bit black and white image data;
  it is the data that will be recognized by the engine.  View
  "converted_input_file" in an image viewer to see what the OCR engine is
  attempting to recognize.  Try adjusting the black_threshold and regenerating
  this file; notice how "converted_input_file" looks different depending on how
  you set the black_threshold option.

  Note that you should remove "converted_input_file" before generating it again,
  because ocrxtr will append to it and not overwrite it.
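
  The regenerate-and-compare cycle described above could be scripted roughly
  as follows.  This is a sketch under two assumptions that are not confirmed
  by this document:  that black_threshold is passed in the same -option=value
  form as -out_debug_files, and that the debug files are written to the
  current working directory.

```python
import os
import subprocess

# Debug file names given in this document.
DEBUG_FILES = ("converted_input_file", "unconv_input_file")


def build_command(image, threshold):
    """Command line for one debug run.

    Assumption: black_threshold uses the same -option=value form as
    -out_debug_files; check the spelling against your installation.
    """
    return ["ocrxtr", "-out_debug_files=y", f"-black_threshold={threshold}", image]


def run_with_threshold(image, threshold):
    """Remove stale debug files, run ocrxtr once, keep the converted image."""
    for name in DEBUG_FILES:
        if os.path.exists(name):
            os.remove(name)  # ocrxtr appends to these files, so start clean
    subprocess.run(build_command(image, threshold), check=True)
    # Keep one copy per threshold so the results can be compared side by side.
    kept = f"converted_{threshold}.tif"
    os.replace("converted_input_file", kept)
    return kept
```

  Calling run_with_threshold("image.tif", t) for a few values of t, such as
  40, 60, and 80, leaves one converted image per value to inspect in an image
  viewer.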


  There are a few different approaches to handling the conversion to 1-bit over
  a large number of images:

  First, if all of your images come from the same source material and were
  scanned on the same scanner, then you can simply adjust the black_threshold
  to the best value for one of the pages, then use that value when recognizing
  the entire set of documents.

  However, if you are planning to recognize a large number of documents from a
  variety of sources, you might want to take a different approach.  If you will
  be calling ocrxtr from within another program, then you could, for instance,
  adjust the black_threshold programmatically for each logical set of similar
  documents:  recognize the first page, evaluate the quality of the results,
  adjust the black_threshold if needed, and recognize again.  OCR Shop XTR
  provides an output format called "XDOC" which can include confidence values
  for each word or character, which could help you make this judgement
  programmatically.
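
  One possible shape for that recognize-evaluate-adjust loop is sketched
  below.  The scoring function is hypothetical and supplied by the caller:
  how to run ocrxtr and how to read confidence values out of XDOC output are
  deliberately left out, because the XDOC layout is not described here.

```python
def pick_threshold(candidates, score_page):
    """Return (threshold, score) for the candidate with the highest score.

    score_page is a caller-supplied function.  It should recognize the first
    page of a document set at the given black_threshold and return a quality
    score, for example a mean word confidence taken from XDOC output.
    Returns None if candidates is empty.
    """
    best = None
    for threshold in candidates:
        score = score_page(threshold)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    # Stand-in scorer for demonstration only: pretends 50 is the ideal value.
    def fake_score(threshold):
        return -abs(threshold - 50)

    print(pick_threshold([40, 50, 60, 70], fake_score))
```

  The chosen threshold can then be reused for the remaining pages of the same
  logical set, so the extra recognition passes are paid only once per set.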

  Alternatively, you could convert the input images to 1-bit prior to
  submitting them to ocrxtr.  This would give you more control over the
  conversion and then you would know exactly what image data the OCR engine is
  operating on.
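
  To make the idea of converting to 1-bit yourself concrete, here is a minimal
  sketch of the simplest method, a fixed global cutoff, applied to plain lists
  of 8-bit gray values.  The cutoff of 128 and the convention that 1 means
  black are assumptions for illustration; this is not a description of the
  conversion OCR Shop XTR performs, and in practice you would likely use an
  imaging tool or library.

```python
def to_one_bit(gray_rows, cutoff=128):
    """Map 8-bit grayscale rows (0 = black, 255 = white) to 1-bit rows.

    Pixels darker than cutoff become 1 (black); all others become 0 (white).
    """
    return [[1 if value < cutoff else 0 for value in row] for row in gray_rows]


if __name__ == "__main__":
    # A low-contrast strip: "text" pixels at 100, "paper" pixels at 160.
    strip = [[160, 100, 100, 160, 100, 160]]
    print(to_one_bit(strip))             # default cutoff separates the two
    print(to_one_bit(strip, cutoff=90))  # cutoff too low: the text vanishes
```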


  One other method you can use to try to improve recognition results with a
  multi-bit input image is increasing the image resolution before using it
  with OCR Shop XTR.
  
  Be aware that increasing image resolution will increase the image's file
  size, with the result of longer OCR processing time and increased memory
  usage.  You may reduce this side effect by converting the input image to
  1-bit depth, after increasing its resolution and before using it with OCR
  Shop XTR.


* Automatic Processing and Filtering

  Leave the options "auto_process" and "auto_filter" turned on to have the
  OCR engine determine which filters will provide the best results for your
  input images.  Both of these options are on by default.

  When you first try OCR Shop XTR, it is best to try it with all default
  options to observe the basic behavior and results.  Then, if you turn off
  auto_process and auto_filter, you can try the different filter options to
  determine if any will improve your results:  fax_filter, newspaper_filter,
  and dotmatrix_filter.  You may also leave auto_process and auto_filter on,
  and turn off the specific processing and filter options individually.

  If you turn off auto_process, you should be careful to set auto_orient
  to "correct" and auto_flip to "Y" if any image might need to have its
  orientation automatically detected and corrected.  Similarly, setting
  "photometric_interp" to "correct" is important if some areas of the input
  image are black on white and other areas are white on black.


* Output filesize

  To control the filesize of an output format that contains image data (PDF,
  HTML, and graphics output), set the bit-depth for the output image data
  using the parameter, "out_depth".  For instance, to create a 1-bit output
  PDF file, run the command-line:

    ocrxtr -out_depth=1 -out_text_format=pdf image.tif

  By default, the bit-depth of the output matches the bit-depth of the input.
  For PDF and PS input, this default is 1 bit per pixel, because the bit-depth
  of an input PDF or PS file is unknown.

  The "out_depth" option may have these values:

    input        Bit depth of the input image (default)
    1            1 bit per pixel
    8            8 bits per pixel
    24           24 bits per pixel
  

* PDF and PS Input:  Bit-depth, Memory Usage, and Processing Speed

  PDF and PS input files are rendered by default to 1-bit image data prior to
  OCR processing.
  
  When using a PDF or PS input file, if you need to retain 8 or 24-bit image
  data in PDF, HTML, or graphics output, set the "out_depth" parameter to 8 or
  24 bits per pixel.
  
  WARNING:  If your input PDF or PS file contains a large number of pages,
  setting "out_depth" to 8 or 24 may result in excessive memory usage, slow
  processing times, and potential swap errors if memory is exceeded.  For
  large input PDF and PS files, we recommend using the default "out_depth" of
  1 bit per pixel.
  
  Details:

  OCR Shop XTR treats PDF and PS input files differently from other input
  formats such as JPEG and TIFF, because PDF and PS input files must be
  rendered into image data.  For maximum efficiency, OCR Shop XTR renders PDF
  and PS input files as 1-bit image data by default.  However, in cases where
  the user sets the "out_depth" option, OCR Shop XTR must render the input PDF
  or PS file at the bit-depth specified so that level of graphical information
  is maintained through to the PDF, HTML, or graphics output.
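
  A rough sense of why bit-depth matters can be had from the uncompressed
  size of a rendered page.  The sketch below assumes a US Letter page (8.5 x
  11 inches) rendered at 300 dpi, which gives 2550 x 3300 pixels; the
  resolution OCR Shop XTR actually renders at, and how many pages it holds in
  memory at once, are not stated here, so treat the figures as an
  order-of-magnitude guide only.

```python
def raw_page_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of one rendered page, rows padded to whole bytes."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    width_px, height_px = 2550, 3300  # assumed: US Letter at 300 dpi
    for depth in (1, 8, 24):
        size = raw_page_bytes(width_px, height_px, depth)
        print(f"{depth:2d} bits per pixel: {size / 1e6:.1f} MB per page")
```

  Under these assumptions a 24-bit page is about 24 times the size of a 1-bit
  page, which is why the default of 1 bit per pixel is the safer choice for
  input files with many pages.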
  

* Non-square Fax Images

  Some fax images have resolutions that are not square.  If your input image
  is a fax and you suspect this is the case, try setting the option
  "-double_dimension=y".

  When "double_dimension" is set, the engine internally doubles either the
  columns or the rows to make the image more square and improve recognition,
  if the image's resolution is not square.  Turning on this flag does not
  guarantee that the image will be doubled.

  Image output, either in an embedded document or with plain graphics output,
  is not affected by image doubling.
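
  If you can read the horizontal and vertical resolution from your image
  files, a check like the following can help you decide whether
  "double_dimension" is worth trying.  Standard-resolution fax images are
  commonly about 204 x 98 dpi.  The 1.5 ratio is an illustrative assumption,
  not the rule the engine itself applies.

```python
def dimension_to_double(x_dpi, y_dpi, ratio=1.5):
    """Return "rows", "columns", or None for a pair of resolutions.

    If one resolution is at least `ratio` times the other, the dimension
    with the lower resolution is the candidate for doubling.
    """
    if x_dpi >= ratio * y_dpi:
        return "rows"     # vertical resolution is lower: too few rows
    if y_dpi >= ratio * x_dpi:
        return "columns"  # horizontal resolution is lower: too few columns
    return None


if __name__ == "__main__":
    for x_dpi, y_dpi in [(204, 98), (204, 196), (300, 300)]:
        print(x_dpi, y_dpi, dimension_to_double(x_dpi, y_dpi))
```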


* TIFF Fill-order Bit

  If your input image is a TIFF file and your results are much worse than you
  expect, given the quality and properties of the input image, it is possible
  that the "fill-order" bit is set in the input image file.
  
  Most TIFF images do not use the fill-order bit; in fact, many programs that
  create TIFF files write the fill-order bit incorrectly.  As a result, by
  default, OCR Shop XTR ignores the TIFF fill-order bit.  In the rare case
  where an image has the fill-order bit set correctly, you will need to
  instruct OCR Shop XTR to obey it.
  
  To determine if OCR Shop XTR should obey the fill-order bit for your input
  TIFF image, run this command-line with your image:

    ocrxtr -out_debug_files=y image.tif

  An image file called "converted_input_file" will be created.  This is the
  actual image data that will be recognized by the engine.  View the file in an
  image viewer.  Does it look odd, as though each byte of image data has the
  bits reversed?  If so, then the fill-order bit in the image is probably
  set and should be obeyed.

  To have OCR Shop XTR obey the TIFF fill-order bit, set the command-line
  option, "-ignore_tiff_fillorder=n".  Alternatively, you may set an
  environment variable, VV_IGNORE_FILLORDER, to "n".
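
  To illustrate what "bits reversed" means, the following sketch mirrors the
  bit order within each byte, which is the difference between the two fill
  orders defined for TIFF.  In 1-bit image data each byte holds eight
  horizontally adjacent pixels, so reading with the wrong fill order mirrors
  every group of eight pixels in place.  The function names are illustrative.

```python
def reverse_bits(byte):
    """Reverse the bit order within one byte (0-255)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result


def reverse_fill_order(data):
    """Apply the reversal to every byte of a bytes object."""
    return bytes(reverse_bits(b) for b in data)


if __name__ == "__main__":
    row = bytes([0b11110000, 0b00000001])
    for b in reverse_fill_order(row):
        print(f"{b:08b}")
```

  Applying the reversal twice returns the original data, so a file decoded
  with the wrong fill order is scrambled but not damaged.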

  Note on the "out_debug_files" flag:

    This flag instructs OCR Shop XTR to create two debug TIFF files:
    "converted_input_file" and "unconv_input_file".  OCR Shop XTR creates
    "unconv_input_file" immediately after reading in the input image, and it
    should be an exact copy of the input image in TIFF format.  OCR Shop XTR
    creates "converted_input_file" after converting the input image data to
    1-bit prior to OCR processing.
  
    OCR Shop XTR appends to these debug files, instead of overwriting them.
    As a result, with "out_debug_files" set, running OCR Shop XTR multiple
    times, passing multiple images on the command-line, or passing a multipage
    input file will result in a new page of image data appended to
    "converted_input_file" and "unconv_input_file" for each page of input
    image data.  We recommend that you delete "converted_input_file" and
    "unconv_input_file" between each run, and/or view them with an application
    designed to handle multipage TIFF images.


* OCR Processing Speed

  The main variables that affect how fast OCR Shop XTR will process an image
  are:

  - Filesize of the input image

    Large input files require more memory and can result in longer
    processing times.

    PDF and PS input files may require a larger amount of memory than
    anticipated at first glance, because PDF and PS input files are rendered
    into image data before being loaded into the OCR engine.  By default, PDF
    and PS input files are rendered into 1-bit image data, which is small.  If
    the user specifies an "out_depth" of 8 or 24 bits, however, the input PDF
    and PS files will be rendered at 8 or 24 bits per pixel, and the
    amount of image data may be large.  This typically only presents
    complications if the input PDF or PS file has a large number of pages; see
    the section "PDF and PS Input:  Bit-depth, Memory Usage, and Processing
    Speed".
    
    Be aware that setting the "out_depth" to a value lower than the input
    image's bit-depth by definition reduces the colorspace in PDF, HTML, or
    graphics output.

  - Quality of the input image

    Lower quality input images generally take longer to process.  In the
    preprocessing step of OCR, cleaning up and deskewing lower quality images
    can be time consuming.  In the recognition step, the engine simply takes
    longer to recognize less clear text.  Similarly, extraneous marks on the
    input images, such as handwriting, stamps, or scribbles, will cause the
    engine to take longer; note that distinct image regions are much easier
    for the engine to understand than amorphous marks.

    If you can ensure your input images will be high quality, with clear text,
    no image skew, and no extraneous marks, OCR Shop XTR will run fastest.

  - Command-line options

    During the preprocessing step of OCR, certain options detect image
    properties such as orientation, fax images, or skew automatically.  If you
    know, for example, that your images will always be oriented correctly, that
    they aren't faxes, and that they aren't skewed, you can turn these options
    off and improve processing time.

  - Your machine (CPU speed, RAM)

    A faster processor will result in faster OCR processing.  Sufficient RAM
    will help you achieve the fastest results, and is especially important for
    complicated images, large input files, and large combined output
    documents.  If you notice your machine thrashing, where it spends
    excessive time reading and writing to swap, then more memory is
    likely to improve performance.
  
    OCR Shop XTR is not multithreaded, so multiple CPUs will not significantly
    improve performance unless you are running multiple instances of OCR Shop
    XTR concurrently.  Multiple instances of OCR Shop XTR may run concurrently
    if you purchase multiple OCR Shop XTR licenses for the same machine.
    Trial users may request a demo license key that permits multiple
    instances.
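
    If your license permits concurrent instances, a driver script along these
    lines can keep several CPUs busy.  It is a sketch only:  it runs ocrxtr
    with default options, and output options and error handling are left to
    you.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_ocrxtr(image):
    """Run one ocrxtr instance with default options; return its exit status."""
    return subprocess.run(["ocrxtr", image]).returncode


def process_all(images, instances, run=run_ocrxtr):
    """Process images with at most `instances` ocrxtr processes at a time.

    Set `instances` no higher than the number of concurrent instances your
    license permits.  Returns a dict mapping each image to its exit status.
    """
    with ThreadPoolExecutor(max_workers=instances) as pool:
        return dict(zip(images, pool.map(run, images)))
```

    Threads are sufficient here because each worker only waits for an external
    ocrxtr process; the recognition work itself runs in those processes.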


* Understanding OCR Processing Using the PDF "normal" Output Format

  When you first use OCR Shop XTR, the PDF output format "normal" can be
  helpful in understanding how OCR Shop XTR recognizes the text and
  reconstructs the formatting, as well as in providing visual feedback on how
  the command-line options you choose affect the processing of your
  image:

    ocrxtr -out_text_format=pdf -pdf_format=normal image.tif

  While this format is typically not as useful for archiving images as
  "-pdf_format=img_txt", it makes a good experimentation and debugging tool.

  The PDF "normal" format contains the recognized text, reconstructions of
  tables and lineart, plus small images that correspond to the image regions
  identified by the OCR engine.  This information is laid out in the output
  PDF document in an attempt to approximate the original image's formatting as
  closely as possible.

  See how changing different command-line parameters affects the appearance of
  an output PDF "normal" document.  Consider using this format to find the
  optimal settings for a particular set of your input images, then create
  final output in the format you wish.


* Output of Non-Latin1 Character Sets 

  If you generate output using a character set other than Latin1, be careful
  which output format you choose because not all output formats support
  non-Latin1 characters.  For example, Russian cannot be represented by ASCII
  text ("iso"), but can be represented by Unicode ("unicode").

  Also be aware that the viewer with which you open output files must support
  the character set and format generated.
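
  The limitation on Russian text can be demonstrated with Python's standard
  codecs.  The codec names below are Python's, used only as stand-ins; they
  are not OCR Shop XTR output format names.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    sample = "Привет"  # Russian, Cyrillic script
    for encoding in ("ascii", "latin-1", "utf-8", "utf-16"):
        print(encoding, can_encode(sample, encoding))
```

  A check of this kind on a sample of your expected text can tell you in
  advance whether a given character set is able to hold your output.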