Vividata: OCR Shop XTR

OCR Shop XTR is a powerful optical character recognition (OCR) application for UNIX and Linux systems. Through an efficient command-line interface (CLI), OCR Shop XTR quickly and accurately converts images of printed documents into readable text in a wide variety of formats, using ScanSoft recognition technology.

Numerous command-line options give the user control over the input, preprocessing, recognition, and output steps of the conversion. OCR Shop XTR supports forms processing through region description files, and additionally provides output in the detailed XDOC format when word coordinates and confidence values are needed.

Current Version: OCR Shop XTR 5.5

Operating System Support:

Sun Solaris™ SPARC® (Solaris™ 2.7+)

Linux™ x86 (Kernel 2.0 and higher)

For questions about support of other Linux™ and UNIX® operating systems, including Mac OS® X, please contact Sales.

Input Formats Supported:

Graphics Interchange Format (GIF)
Joint Photographic Experts Group File Interchange Format (JPEG)
Portable BitMap (PBM)
Portable Document Format (PDF) - available with add-on module
PostScript® - available with add-on module
Portable Network Graphics (PNG)
Portable PixMap (PPM)
Rasterfile
Silicon Graphics image file format (SGI®-RGB)
Tagged Image File Format (TIFF)
XWD
X11

Languages Supported:

OCR Shop XTR comes with one language pack and is available with add-on language packs for other languages.

Afrikaans	Albanian	Aymara
Basque	Breton	Bulgarian
Byelorussian	Catalan	Croatian
Czech	Danish	Dutch
English	Estonian	Faroese
Finnish	Flemish	French
Frisian - West	Friulian	Gaelic
Galician	German	Greek
Greenlandic	Hawaiian	Hungarian
Icelandic	Indonesian	Italian
Kurdish (Latin)	Latin	Latvian
Lithuanian	Macedonian (Cyrillic)	Malaysian
Norwegian	Pidgin English	Polish
Portuguese	Romanian	Russian
Serbian	Serbo-Croatian	Slovak
Slovenian	Sorbian - Lower	Sorbian - Upper
Spanish	Swahili	Swedish
Tahitian	Turkish	Ukrainian
Welsh	Zulu

Output Formats Supported:

iso text	Standard ASCII text
8bit text	ASCII characters as 8 bit values
Unicode	Full 2-byte Unicode
HTML*	HTML formatted output; plain, with styles, or with frames
PDF*	Portable Document Format
XDOC	The XDOC format is a ScanSoft® text output format which provides detailed information about the text, images, and formatting in a recognized document. See Overview of the XDOC Output Format below.

* Available through an optional add-on.

Synopsis of Supported Command-line Options and Usage

ocrxtr [-<parameter>=<value>]* [<filename>]*
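
As a minimal sketch of the synopsis, the helper below assembles an argument list in the -parameter=value form, followed by the input files. The function name build_command is illustrative and not part of OCR Shop XTR; only the executable name "ocrxtr" and the parameters "out_text_format" and "deskew" come from this page.

```python
import shlex


def build_command(params, files, executable="ocrxtr"):
    """Return an argv list: executable, -name=value options, then filenames."""
    argv = [executable]
    for name, value in params.items():
        # Each option is a single argument of the form -<parameter>=<value>.
        argv.append(f"-{name}={value}")
    argv.extend(files)
    return argv


if __name__ == "__main__":
    cmd = build_command(
        {"out_text_format": "xdoc", "deskew": "correct"},
        ["image.tif"],
    )
    # Print a shell-quoted form for inspection only; to execute it,
    # pass cmd to subprocess.run on a machine where ocrxtr is installed.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems when filenames contain spaces.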

Help and Informational Parameters
Parameter	Description
help	Print out a list of command-line options
version	Print current version number of OCR Shop XTR™

Input Functionality
Parameter	Value	Description
black_threshold	<0-102>	Threshold to binarize input images
image_list	<filename>	File containing list of files to process
image_rdiff_list	<filename>	File containing list of input image files and rdiff files
in_res	<dpi> OR <dpi>x<dpi>	Input image resolution
ignore_tiff_fillorder	<Y|N>	Ignore TIFF FillOrder flag in input TIFF image (default Y)
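
For batch runs, a list file can be generated and passed through the "image_list" parameter. The sketch below assumes the list file holds one image path per line, which this page does not specify, so check the product documentation for the exact layout. The helper name write_image_list is hypothetical.

```python
import os
import tempfile


def write_image_list(image_paths, list_path):
    """Write one path per line (an assumed layout) and return the option string."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in image_paths:
            handle.write(f"{path}\n")
    return f"-image_list={list_path}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        list_file = os.path.join(tmp, "batch.txt")
        option = write_image_list(["page1.tif", "page2.tif"], list_file)
        # The resulting argument list; nothing is executed here.
        print(["ocrxtr", option])
```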

Basic Pre-processing Options
Parameter	Value	Description
auto_process	<full|preprocess_only|N>	Automate processing
auto_filter	<Y|N>	Automate preprocessing filters
rotate	<0|90|180|270>	Explicitly rotate image clockwise
auto_orient	<detect|correct|N>	Detect/correct image orientation automatically
fax_filter	<Y|N|auto>	Fax filter
newspaper_filter	<Y|N>	Newspaper filter
dotmatrix_filter	<Y|N|auto>	Dot-matrix filter
deskew	<correct|N|manual>	Correct image skew
deskew_confidence	<0-100>	Manual deskew confidence level
deskew_upper_angular_thresh	<float>	Manual deskew upper angular threshold
deskew_lower_angular_thresh	<float>	Manual deskew lower angular threshold
invert	<Y|N>	Invert the input image
analyze_layout	<Y|N>	Analyze page layout

Advanced Pre-processing Options
Parameter	Value	Description
double_dimension	<Y|N>	Double lines or columns to make image more square
auto_segment	<Y|N>	Segment into text and image regions
segment_lineart	<Y|N>	Distinguish between lineart and halftone image regions
single_col_autoseg	<Y|N>	Improve layout analysis for single column pages
one_column	<Y|N>	Force single column interpretation
two_page_mode	<Y|N>	Input image consists of two facing pages
photometric_interp	<detect|correct|n>	Automatically invert a reverse video image
reverse_video	<Y|N>	Automatically invert reverse video regions

Recognition Options
Parameter	Value	Description
language	<language name>	Language pack to load (see above)
english_chars	<Y|N>	Include the English character set (for use with character sets other than Latin1)
char_set	<string>	Constrain recognition to specified characters
min_point	<5-72>	Minimum point size recognized
max_point	<5-72>	Maximum point size recognized
format_analysis	<Y|N>	Run format analysis for PDF output
recognize_region	<region id>	Recognize only the region specified
user_lexicon	<filename>	File with user-lexicon words in the format: word <whitespace> lexclass <newline>
timeout	<int>	Set a timeout in seconds for recognition
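
The "user_lexicon" file format given above (word, whitespace, lexclass, newline) can be produced with a short script. This is a sketch only: the helper name format_lexicon is hypothetical, and the lexclass value "1" used in the example is a placeholder, since this page does not list the valid lexclass values.

```python
def format_lexicon(entries):
    """Render (word, lexclass) pairs as 'word<TAB>lexclass' lines."""
    lines = []
    for word, lexclass in entries:
        word, lexclass = str(word), str(lexclass)
        # Whitespace separates the two fields, so neither field may contain any.
        if not word or not lexclass or any(ch.isspace() for ch in word + lexclass):
            raise ValueError(f"invalid lexicon entry: {word!r} {lexclass!r}")
        lines.append(f"{word}\t{lexclass}\n")
    return "".join(lines)


if __name__ == "__main__":
    # "1" is a placeholder lexclass, not a documented value.
    text = format_lexicon([("Vividata", "1"), ("ScanSoft", "1")])
    with open("user_lexicon.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
    print(["ocrxtr", "-user_lexicon=user_lexicon.txt", "image.tif"])
```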

Output Functionality
Parameter	Value	Description
out_text_name	<out_filename> OR info_log	Template of output filename
start_filenum	<0-999>	Starting file number used with out_text_name
combine_docs	<Y|N>	Create one multipage output document file
out_text_format	<format>	Output format for recognized text (see above)
out_graphics_name	<out_filename>	Template of output filename for image region output
out_graphics_format	<format>	Format of output graphics data
overwrite	<Y|N>	Indicate whether to overwrite existing files

Advanced Output Options
Parameter	Value	Description
pdf_format	<normal|img_text|img_only>	PDF output format
out_image_scale	<1-500>	Scale the output images
out_res	<dpi> OR <dpi>x<dpi>	Output image resolution
reject_char	<char>	Character to represent unrecognized characters
out_rdiff	<out_filename>	Output region description information
out_prerec_rdiff	<out_filename>	Output region description information before recognition
out_regions_as_graphics	<Y|N>	Output text regions as images
output_text_by_region	<Y|N>	Recognize and output text by region
remove_halftone	<Y|N>	Remove halftone image regions from output
photometric_invert	<Y|N>	Invert photometric interpretation regions
xdoc_word_confidence	<Y|N>	Output word confidences in XDOC
xdoc_char_confidence	<Y|N>	Output character confidences in XDOC
xdoc_word_coords	<Y|N>	Output word bounding boxes in XDOC
xdoc_char_coords	<Y|N>	Output character bounding boxes in XDOC
out_depth	<1|8|24|input>	Bit depth of output image data
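
Both "in_res" and "out_res" take a value written as <dpi> or <dpi>x<dpi>. The sketch below checks a string against that pattern before it is placed on the command line. The function name parse_resolution is illustrative, and treating a single value as applying to both dimensions is an assumption; this page does not say which dimension comes first in the two-value form.

```python
import re

# One or two positive integers, the second introduced by a lowercase "x".
_RESOLUTION = re.compile(r"(\d+)(?:x(\d+))?")


def parse_resolution(value):
    """Parse '<dpi>' or '<dpi>x<dpi>' into a (first, second) tuple of ints."""
    match = _RESOLUTION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a <dpi> or <dpi>x<dpi> value: {value!r}")
    first = int(match.group(1))
    # Assumption: a single value is used for both dimensions.
    second = int(match.group(2)) if match.group(2) else first
    if first == 0 or second == 0:
        raise ValueError(f"resolution must be positive: {value!r}")
    return first, second


if __name__ == "__main__":
    for text in ("300", "200x100"):
        print(text, parse_resolution(text))
```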

Resource and Settings Files
Parameter	Value	Description
read_params	<filename>	Read parameters from specified file
write_params	<filename>	Write parameters to specified file
reset_resource_file	<Y|N>	Reset user resource file to defaults
write_resource_file	<Y|N>	Write the current settings to the user resource file

Debug and Log Options
Parameter	Value	Description
info_log	<filename> or stdout	Control where diagnostic, status, and debugging information is directed
error_log	<filename> or stderr	Control where error information is directed
error_level	<0-5>	Level to filter error messages (5 most detailed)
info_level	<0-3>	Level to filter informational messages (3 most detailed)
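
Several parameters in the tables above accept bounded integers, and a wrapper script can check them before starting a long batch. The following sketch copies the ranges from this page and assumes both endpoints are inclusive. The names INT_RANGES and check_int_params are hypothetical.

```python
# Integer ranges copied from the parameter tables on this page.
INT_RANGES = {
    "black_threshold": (0, 102),
    "deskew_confidence": (0, 100),
    "min_point": (5, 72),
    "max_point": (5, 72),
    "start_filenum": (0, 999),
    "out_image_scale": (1, 500),
    "error_level": (0, 5),
    "info_level": (0, 3),
}


def check_int_params(params):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, value in params.items():
        if name not in INT_RANGES:
            continue  # not a bounded integer parameter; nothing to check
        low, high = INT_RANGES[name]
        try:
            number = int(value)
        except (TypeError, ValueError):
            problems.append(f"{name}={value} is not an integer")
            continue
        # Assumption: the documented endpoints are inclusive.
        if not low <= number <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    print(check_int_params({"error_level": 6, "info_level": 3}))
```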

Overview of the XDOC Output Format

The XDOC output format is a ScanSoft text format which provides detailed information about the text, fonts, images, and formatting in a recognized document, as well as coordinate and confidence values for both characters and words.

OCR Shop XTR offers three types of XDOC output, as specified in the "out_text_format" parameter:

xdoc	Enhanced XDOC format
xdoclite	XDOC format with no format analysis
xdocplus	XDOC format with style sheet data

To use the XDOC output format, set "out_text_format" to "xdoc", "xdoclite", or "xdocplus":

ocrxtr -out_text_format=xdoc image.tif

Use the following parameters to include confidence and bounding box information in the XDOC output:

xdoc_word_confidence	Output word confidences in XDOC
xdoc_char_confidence	Output character confidences in XDOC
xdoc_word_coords	Output word bounding boxes in XDOC
xdoc_char_coords	Output character bounding boxes in XDOC
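
Combining the format choice with the four switches above, a command line can be assembled as follows. This is a sketch under stated assumptions: the function name xdoc_command and its keyword arguments are illustrative, while the parameter names, the Y/N values, and the three format names are taken from this page.

```python
XDOC_VARIANTS = ("xdoc", "xdoclite", "xdocplus")


def xdoc_command(image, variant="xdoc", word_conf=True, char_conf=False,
                 word_coords=True, char_coords=False):
    """Build an ocrxtr argument list that requests XDOC output."""
    if variant not in XDOC_VARIANTS:
        raise ValueError(f"unknown XDOC variant: {variant!r}")

    def yn(flag):
        # The switches take Y or N, as listed in the Advanced Output Options table.
        return "Y" if flag else "N"

    return [
        "ocrxtr",
        f"-out_text_format={variant}",
        f"-xdoc_word_confidence={yn(word_conf)}",
        f"-xdoc_char_confidence={yn(char_conf)}",
        f"-xdoc_word_coords={yn(word_coords)}",
        f"-xdoc_char_coords={yn(char_coords)}",
        image,
    ]


if __name__ == "__main__":
    # Printed for inspection only; nothing is executed.
    print(" ".join(xdoc_command("image.tif", variant="xdocplus")))
```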

Additional parameters related to XDOC output include:

xdoc_word_pixels	Use pixel coordinates for word bounding boxes in XDOC
no_header_footer	Do not label headers and footers
accept_thresh	Acceptability threshold (number corresponds to the confidence values seen in XDOC output)
quest_thresh	Questionability threshold

Please contact Vividata Support for detailed information about the XDOC format and documentation needed for XDOC parsing.