OCR Shop XTR is a powerful optical character recognition (OCR) application for UNIX and Linux systems. Using an efficient command-line interface (CLI), OCR Shop XTR quickly and accurately converts printed documents into readable text in a wide variety of formats using ScanSoft recognition technology.
Numerous command-line options give the user control over the input, preprocessing, recognition, and output steps of the conversion. OCR Shop XTR supports forms processing through region description files, and additionally provides output in the detailed XDOC format when word coordinates and confidence values are needed.
Current Version: OCR Shop XTR 5.5
Operating System Support:
| Sun Solaris™ SPARC® (Solaris™ 2.7+) |
| Linux™ x86 (Kernel 2.0 and higher) |
For questions about support of other Linux™ and UNIX® operating systems, including Mac OS® X, please Contact Sales.
Input Formats Supported:
- Graphics Interchange Format (GIF)
- Joint Photographic Experts Group File Interchange Format (JPEG)
- Portable BitMap (PBM)
- Portable Document Format (PDF) - available with add-on module
- PostScript® - available with add-on module
- Portable Network Graphics (PNG)
- Portable PixMap (PPM)
- Rasterfile
- Silicon Graphics image file format (SGI®-RGB)
- Tagged Image File Format (TIFF)
- XWD
- X11
Languages Supported:
OCR Shop XTR comes with one language pack and is available with add-on language packs for other languages.
| Afrikaans | Albanian | Aymara |
| Basque | Breton | Bulgarian |
| Byelorussian | Catalan | Croatian |
| Czech | Danish | Dutch |
| English | Estonian | Faroese |
| Finnish | Flemish | French |
| Frisian - West | Friulian | Gaelic |
| Galician | German | Greek |
| Greenlandic | Hawaiian | Hungarian |
| Icelandic | Indonesian | Italian |
| Kurdish (Latin) | Latin | Latvian |
| Lithuanian | Macedonian (Cyrillic) | Malaysian |
| Norwegian | Pidgin English | Polish |
| Portuguese | Romanian | Russian |
| Serbian | Serbo-Croatian | Slovak |
| Slovenian | Sorbian - Lower | Sorbian - Upper |
| Spanish | Swahili | Swedish |
| Tahitian | Turkish | Ukrainian |
| Welsh | Zulu |
Output Formats Supported:
| iso text | Standard ASCII text |
| 8bit text | ASCII characters as 8 bit values |
| Unicode | Full 2-byte Unicode |
| HTML* | HTML formatted output; plain, with styles, or with frames |
| PDF* | Portable Document Format |
| XDOC | The XDOC format is a ScanSoft® text output format which provides detailed information about the text, images, and formatting in a recognized document. See Overview of the XDOC Output Format below. |
* Available through an optional add-on.
Synopsis of Supported Command-line Options and Usage
ocrxtr [-<parameter>=<value>]* [<filename>]*
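The synopsis can be sketched in Python as a small helper that assembles the argument list. This is an illustration only: `build_ocrxtr_command` is a hypothetical helper, not part of OCR Shop XTR, and the file names are placeholders. The parameter names and values used are taken from the parameter tables in this section.

```python
import shlex


def build_ocrxtr_command(params, filenames, executable="ocrxtr"):
    """Assemble an argument list in the synopsis order:
    zero or more -<parameter>=<value> pairs, then zero or more filenames.

    Hypothetical helper for illustration; it does not run anything.
    """
    args = [executable]
    for name, value in params.items():
        args.append(f"-{name}={value}")
    args.extend(filenames)
    return args


# Parameter names and values come from the tables; file names are placeholders.
cmd = build_ocrxtr_command(
    {"out_text_format": "xdoc", "auto_orient": "correct", "overwrite": "Y"},
    ["page1.tif", "page2.tif"],
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

On a system where OCR Shop XTR is installed, the resulting list could be handed to `subprocess.run(cmd)`; passing a list rather than a single string avoids shell quoting problems with unusual file names.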
| Help and Informational Parameters | |
| Parameter | Description |
| help | Print out a list of command-line options |
| version | Print current version number of OCR Shop XTR™ |
| Input Functionality | ||
| Parameter | Value | Description |
| black_threshold | <0-102> | Threshold to binarize input images |
| image_list | <filename> | File containing list of files to process |
| image_rdiff_list | <filename> | File containing list of input image files and rdiff files |
| in_res | <dpi> OR <dpi>x<dpi> | Input image resolution |
| ignore_tiff_fillorder | <Y|N> | Ignore TIFF FillOrder flag in input TIFF image (default Y) |
| Basic Pre-processing Options | ||
| Parameter | Value | Description |
| auto_process | <full|preprocess_only|N> | Automate processing |
| auto_filter | <Y|N> | Automate preprocessing filters |
| rotate | <0|90|180|270> | Explicitly rotate image clockwise |
| auto_orient | <detect|correct|N> | Detect/correct image orientation automatically |
| fax_filter | <Y|N|auto> | Fax filter |
| newspaper_filter | <Y|N> | Newspaper filter |
| dotmatrix_filter | <Y|N|auto> | Dotmatrix filter |
| deskew | <correct|N|manual> | Correct image skew |
| deskew_confidence | <0-100> | Manual deskew confidence level |
| deskew_upper_angular_thresh | <float> | Manual deskew upper angular threshold |
| deskew_lower_angular_thresh | <float> | Manual deskew lower angular threshold |
| invert | <Y|N> | Invert the input image |
| analyze_layout | <Y|N> | Analyze page layout |
| Advanced Pre-processing Options | ||
| Parameter | Value | Description |
| double_dimension | <Y|N> | Double lines or columns to make image more square |
| auto_segment | <Y|N> | Segment into text and image regions |
| segment_lineart | <Y|N> | Distinguish between lineart and halftone image regions |
| single_col_autoseg | <Y|N> | Improve layout analysis for single column pages |
| one_column | <Y|N> | Force single column interpretation |
| two_page_mode | <Y|N> | Input image consists of two facing pages |
| photometric_interp | <detect|correct|n> | Automatically invert a reverse video image |
| reverse_video | <Y|N> | Automatically invert reverse video regions |
| Recognition Options | ||
| Parameter | Value | Description |
| language | <language name> | Language pack to load (see above) |
| english_chars | <Y|N> | Include the English character set (for use with character sets other than Latin1) |
| char_set | <string> | Constrain recognition to specified characters |
| min_point | <5-72> | Minimum point size recognized |
| max_point | <5-72> | Maximum point size recognized |
| format_analysis | <Y|N> | Run format analysis for PDF output |
| recognize_region | <region id> | Recognize only the region specified |
| user_lexicon | <filename> | File with user-lexicon words in the format: word <whitespace> lexclass <newline> |
| timeout | <int> | Set a timeout in seconds for recognition |
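The user-lexicon file format given for `user_lexicon` (word, whitespace, lexclass, newline) can be sketched as a small formatter. This is a minimal sketch: `format_user_lexicon` is a hypothetical helper, and `CLASS_A` is a placeholder because the valid lexclass values are not listed here.

```python
def format_user_lexicon(entries):
    """Render (word, lexclass) pairs as 'word<TAB>lexclass' lines.

    A tab is used as the whitespace separator; any whitespace should
    satisfy the documented format. Hypothetical helper for illustration.
    """
    lines = []
    for word, lexclass in entries:
        # A word containing whitespace would be ambiguous in this format.
        if not word or any(ch.isspace() for ch in word):
            raise ValueError(f"word must be non-empty with no whitespace: {word!r}")
        lines.append(f"{word}\t{lexclass}\n")
    return "".join(lines)


# CLASS_A is a placeholder, not a documented lexclass value.
text = format_user_lexicon([("Vividata", "CLASS_A"), ("XDOC", "CLASS_A")])
print(text, end="")
```

The returned text would be written to a file whose name is then passed as `-user_lexicon=<filename>`.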
| Output Functionality | ||
| Parameter | Value | Description |
| out_text_name | <out_filename> OR info_log | Template of output filename |
| start_filenum | <0-999> | Starting file number used with out_text_name |
| combine_docs | <Y|N> | Create one multipage output document file |
| out_text_format | <format> | Output format for recognized text (see above) |
| out_graphics_name | <out_filename> | Template of output filename for image region output |
| out_graphics_format | <format> | Format of output graphics data |
| overwrite | <Y|N> | Indicate whether to overwrite existing files |
| Advanced Output Options | ||
| Parameter | Value | Description |
| pdf_format | <normal|img_text|img_only> | PDF output format |
| out_image_scale | <1-500> | Scale the output images |
| out_res | <dpi> OR <dpi>x<dpi> | Output image resolution |
| reject_char | <char> | Character to represent unrecognized characters |
| out_rdiff | <out_filename> | Output region description information |
| out_prerec_rdiff | <out_filename> | Output region description information before recognition |
| out_regions_as_graphics | <Y|N> | Output text regions as images |
| output_text_by_region | <Y|N> | Recognize and output text by region |
| remove_halftone | <Y|N> | Remove halftone image regions from output |
| photometric_invert | <Y|N> | Invert photometric interpretation regions |
| xdoc_word_confidence | <Y|N> | Output word confidences in XDOC |
| xdoc_char_confidence | <Y|N> | Output character confidences in XDOC |
| xdoc_word_coords | <Y|N> | Output word bounding boxes in XDOC |
| xdoc_char_coords | <Y|N> | Output character bounding boxes in XDOC |
| out_depth | <1|8|24|input> | Bit depth of output image data |
| Resource and Settings Files | ||
| Parameter | Value | Description |
| read_params | <filename> | Read parameters from specified file |
| write_params | <filename> | Write parameters to specified file |
| reset_resource_file | <Y|N> | Reset user resource file to defaults |
| write_resource_file | <Y|N> | Write the current settings to the user resource file |
| Debug and Log Options | ||
| Parameter | Value | Description |
| info_log | <filename> or stdout | Control where diagnostic, status, and debugging information is directed |
| error_log | <filename> or stderr | Control where error information is directed |
| error_level | <0-5> | Level to filter error messages (5 most detailed) |
| info_level | <0-3> | Level to filter informational messages (3 most detailed) |
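Because many parameters accept only a numeric range or a fixed set of values, it can help to check values before invoking the command. The sketch below copies a subset of the ranges and choices from the tables above; `check_param` and the two lookup tables are hypothetical names, and how OCR Shop XTR itself reports invalid values is not covered here.

```python
# Subset of the documented ranges and choices, copied from the tables above.
INT_RANGES = {
    "deskew_confidence": (0, 100),
    "min_point": (5, 72),
    "max_point": (5, 72),
    "start_filenum": (0, 999),
    "out_image_scale": (1, 500),
    "error_level": (0, 5),
    "info_level": (0, 3),
}
CHOICES = {
    "rotate": {"0", "90", "180", "270"},
    "auto_orient": {"detect", "correct", "N"},
    "deskew": {"correct", "N", "manual"},
    "pdf_format": {"normal", "img_text", "img_only"},
    "out_depth": {"1", "8", "24", "input"},
}


def check_param(name, value):
    """Return an error message if value falls outside the documented
    range or choices for name, otherwise None.

    Parameters not listed in the two tables are not checked.
    """
    text = str(value)
    if name in INT_RANGES:
        low, high = INT_RANGES[name]
        try:
            number = int(text)
        except ValueError:
            return f"{name}: expected an integer, got {text!r}"
        if not low <= number <= high:
            return f"{name}: {number} is outside {low}-{high}"
    elif name in CHOICES and text not in CHOICES[name]:
        return f"{name}: {text!r} is not one of {sorted(CHOICES[name])}"
    return None


for name, value in [("error_level", 5), ("info_level", 4), ("rotate", 45)]:
    print(name, value, "->", check_param(name, value) or "ok")
```

Keeping the limits in plain dictionaries makes it easy to extend the check to other parameters as needed.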
Overview of the XDOC Output Format
The XDOC output format is a ScanSoft text format which provides detailed information about the text, fonts, images, and formatting in a recognized document, as well as coordinate and confidence values for both characters and words.
OCR Shop XTR offers three types of XDOC output, as specified in the "out_text_format" parameter:
| xdoc | Enhanced XDOC format |
| xdoclite | XDOC format with no format analysis |
| xdocplus | XDOC format with style sheet data |
To use the XDOC output format, set "out_text_format" to "xdoc", "xdoclite", or "xdocplus":
ocrxtr -out_text_format=xdoc image.tif
Use the following parameters to include confidence and bounding box information in the XDOC output:
| xdoc_word_confidence | Output word confidences in XDOC |
| xdoc_char_confidence | Output character confidences in XDOC |
| xdoc_word_coords | Output word bounding boxes in XDOC |
| xdoc_char_coords | Output character bounding boxes in XDOC |
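As a rough sketch, the XDOC variant and the four confidence and bounding-box parameters can be combined into one command line. `xdoc_command` is a hypothetical helper, `image.tif` is a placeholder, and the `Y` value follows the `<Y|N>` convention shown for these parameters under Advanced Output Options. Parameters left disabled are simply omitted, since their defaults are not stated here.

```python
XDOC_VARIANTS = ("xdoc", "xdoclite", "xdocplus")


def xdoc_command(image, variant="xdoc", word_confidence=False,
                 char_confidence=False, word_coords=False, char_coords=False):
    """Build an ocrxtr argument list for XDOC output.

    Hypothetical helper for illustration; it only builds the list.
    """
    if variant not in XDOC_VARIANTS:
        raise ValueError(f"variant must be one of {XDOC_VARIANTS}, got {variant!r}")
    flags = {
        "xdoc_word_confidence": word_confidence,
        "xdoc_char_confidence": char_confidence,
        "xdoc_word_coords": word_coords,
        "xdoc_char_coords": char_coords,
    }
    args = ["ocrxtr", f"-out_text_format={variant}"]
    # Only requested options are added; the rest keep their current settings.
    for name, enabled in flags.items():
        if enabled:
            args.append(f"-{name}=Y")
    args.append(image)
    return args


print(" ".join(xdoc_command("image.tif", word_confidence=True, word_coords=True)))
```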
Additional parameters related to XDOC output include:
| xdoc_word_pixels | Use pixel coordinates for word bounding boxes in XDOC |
| no_header_footer | Do not label headers and footers |
| accept_thresh | Acceptability threshold (number corresponds to the confidence values seen in XDOC output) |
| quest_thresh | Questionability threshold |
Please contact Vividata Support for detailed information about the XDOC format and the documentation needed for XDOC parsing.

