2text is a search engine plugin that calls external programs to extract text content from files. Because it launches a separate executable for each document, it has both advantages and disadvantages:
Advantages:
1. External parsers are executed as separate processes, so a fatal bug in a parser does not affect the search engine. This means that even unstable programs can be used as text extractors.
2. A text extractor can be implemented in Java, Visual Basic, Perl, or even scripting languages such as Python.
3. Debugging is easy.
Disadvantages:
1. It is slow, because a new process is started for each document.
2. The whole text content is extracted at once, whereas ordinary content plugins can extract text page by page (e.g. for DjVu).
3. Text extractors cannot access the grammar and search engine services through the IGrammarEngine and ISearchEngine interfaces.
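To illustrate the second advantage, here is a minimal sketch of an extractor written in Python. It assumes the plugin passes the source document path and the result file path as two command-line arguments (the {1} and {2} placeholders of the args node described in the configuration tables below). The function names and the crude "runs of printable ASCII" extraction are hypothetical and merely stand in for a real format parser.

```python
import re
import sys

# Runs of at least four printable ASCII characters are treated as text.
# This is a stand-in for a real format parser, not part of 2text itself.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_text(src_path, dst_path):
    """Write the printable ASCII runs found in src_path to dst_path as UTF-8."""
    with open(src_path, "rb") as src:
        data = src.read()
    lines = [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write("\n".join(lines))
    return len(lines)


def main(argv):
    # Assumed calling convention: extractor.py <input document> <result file>
    if len(argv) != 3:
        print("usage: extractor.py INPUT OUTPUT", file=sys.stderr)
        return 2
    extract_text(argv[1], argv[2])
    return 0

# When saved as a script, the entry point would be: sys.exit(main(sys.argv))
```

Because the sketch writes its result as UTF-8, a matching configuration entry would presumably declare utf8 in its encoding node.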
This plugin typically resides in \plugins\formats in the search system installation directory. Subplugins can be placed in any other location.
Source code is included in the SDK installation package; see \demo\ai\solarix\search_engine\Filetype_plugin\2text.
Source code for a 2text subplugin, the DjVu text extractor, is also included in the SDK.
Rules for file type recognition are stored in the XML configuration file 2text.xml. Each external extractor is described by an XML entry <filter>...</filter>:
| XML node | description | obligatory |
|---|---|---|
| type | extractor type: "external" for external executables, "internal" for the built-in general text extractor | no |
| format | first part of the MIME type | yes |
| subformat | second part of the MIME type | yes |
| maxsize | maximum size of files to handle, to prevent hang-ups (default is 10 MB) | no |
| ext | file name extension(s), delimited by ';' | yes |
| exe | path to the extractor executable | yes |
| args | startup command line; {1} stands for the input (source) document path, {2} for the extraction result file | yes |
| format | first part of the MIME type | yes |
| encoding | encoding of the result text (for out_format=text); utf8 is allowed, as are many other codepage names; the current session codepage is used by default | no |
| timeout | maximum elapsed time in milliseconds; the external program is aborted if this value is exceeded; default is 10 minutes (600000 ms) | no |
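The way exe, args, encoding and timeout fit together can be sketched in Python as follows. This is not the plugin's actual implementation: the function names are hypothetical, and splitting the template with POSIX-style quoting rules is an assumption (backslashes inside the template itself would need escaping under those rules).

```python
import re
import shlex
import subprocess


def build_command(exe, args_template, src_path, dst_path):
    """Expand {1} and {2} in the args template and return an argv list."""
    paths = {"1": src_path, "2": dst_path}
    # Split first, substitute second, so paths containing spaces stay intact.
    tokens = shlex.split(args_template)
    expanded = [
        re.sub(r"\{([12])\}", lambda m: paths[m.group(1)], token)
        for token in tokens
    ]
    return [exe] + expanded


def run_extractor(exe, args_template, src_path, dst_path,
                  timeout_ms=600000, encoding=None):
    """Run the extractor and return the extracted text, or None on failure.

    encoding=None lets Python pick the locale's preferred encoding, which
    loosely mirrors the "current session codepage" default of the table.
    """
    command = build_command(exe, args_template, src_path, dst_path)
    try:
        # subprocess.run kills the child process when the timeout expires.
        subprocess.run(command, check=True, timeout=timeout_ms / 1000.0)
        with open(dst_path, encoding=encoding) as result:
            return result.read()
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
```

Returning None for a timeout, a non-zero exit code or a missing result file keeps a misbehaving extractor from affecting the caller, which is the isolation property the plugin relies on.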
Example entry:
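As a hedged illustration, the Python sketch below embeds a hypothetical entry and checks it against the "obligatory" column of the table above. The node names come from the table, but the nesting (child elements rather than attributes), the executable path and all values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: every value here is invented for illustration.
EXAMPLE_ENTRY = r"""
<filter>
  <type>external</type>
  <format>image</format>
  <subformat>vnd.djvu</subformat>
  <ext>djvu;djv</ext>
  <exe>C:\tools\djvu2text.exe</exe>
  <args>{1} {2}</args>
  <encoding>utf8</encoding>
  <timeout>60000</timeout>
</filter>
"""

# Nodes marked "yes" in the obligatory column of the tables.
REQUIRED = ("format", "subformat", "ext", "exe", "args")


def parse_filter(xml_text):
    """Parse one <filter> entry into a dict and validate obligatory nodes."""
    node = ET.fromstring(xml_text)
    entry = {child.tag: (child.text or "").strip() for child in node}
    missing = [name for name in REQUIRED if not entry.get(name)]
    if missing:
        raise ValueError("missing obligatory nodes: " + ", ".join(missing))
    # Extensions are delimited by ';'.
    entry["ext"] = [e for e in entry["ext"].split(";") if e]
    # Documented default timeout is 600000 ms (10 minutes).
    entry["timeout"] = int(entry.get("timeout", 600000))
    return entry
```

Calling parse_filter(EXAMPLE_ENTRY) yields a dictionary whose ext value is a list of extensions and whose timeout is an integer number of milliseconds.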
| XML node | description | obligatory |
|---|---|---|
| startpos_type | "begin" (default) if the start position is counted from the beginning of the file, "end" if it is relative to the end of the file, "signature" to locate the position by a byte sequence | no |
| start_pos | start position in bytes (see startpos_type); default is 0 | no |
| start_signature | signature byte sequence, given as decimal or hexadecimal values (e.g. 0xab) delimited by spaces or commas | no |
| block_len | length of the text block, in bytes | no |
| extract_encoding | text encoding: may be "utf8", "utf16le", "utf16be", or an ASCII codepage name (used by default); "acp" means the current session ASCII codepage | no |
Example entry:
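As a hedged illustration, the Python sketch below embeds a hypothetical entry for the built-in extractor and shows how its fields might select a text block. The nesting of the nodes, all values, the function names, the choice of "acp" as the fallback encoding, and the reading that text starts right after the signature are assumptions, not documented behavior.

```python
import locale
import re
import xml.etree.ElementTree as ET

# Hypothetical entry: every value here is invented for illustration.
EXAMPLE_ENTRY = """
<filter>
  <type>internal</type>
  <format>text</format>
  <subformat>plain</subformat>
  <ext>dat</ext>
  <startpos_type>signature</startpos_type>
  <start_signature>0x54, 0x58 84</start_signature>
  <block_len>5</block_len>
  <extract_encoding>utf8</extract_encoding>
</filter>
"""

CODECS = {"utf8": "utf-8", "utf16le": "utf-16-le", "utf16be": "utf-16-be"}


def parse_signature(text):
    """Turn '0xab 171,10' style text into a bytes object."""
    tokens = [t for t in re.split(r"[\s,]+", text.strip()) if t]
    return bytes(
        int(t, 16) if t.lower().startswith("0x") else int(t) for t in tokens
    )


def extract_block(data, startpos_type="begin", start_pos=0,
                  start_signature="", block_len=None, extract_encoding="acp"):
    """Decode the block of data selected by the given settings."""
    if startpos_type == "begin":
        start = start_pos
    elif startpos_type == "end":
        # Assumption: start_pos counts bytes back from the end of the file.
        start = max(len(data) - start_pos, 0)
    elif startpos_type == "signature":
        signature = parse_signature(start_signature)
        found = data.find(signature)
        if found < 0:
            return ""
        # Assumption: text begins right after the signature, plus start_pos.
        start = found + len(signature) + start_pos
    else:
        raise ValueError("unknown startpos_type: " + startpos_type)
    end = len(data) if block_len is None else start + block_len
    if extract_encoding == "acp":
        # Stand-in for the current session codepage.
        codec = locale.getpreferredencoding(False)
    else:
        codec = CODECS.get(extract_encoding, extract_encoding)
    return data[start:end].decode(codec, errors="replace")


def extract_with_entry(xml_text, data):
    """Apply the settings of one <filter> entry to raw file contents."""
    fields = {c.tag: (c.text or "").strip() for c in ET.fromstring(xml_text)}
    return extract_block(
        data,
        startpos_type=fields.get("startpos_type", "begin"),
        start_pos=int(fields.get("start_pos", 0)),
        start_signature=fields.get("start_signature", ""),
        block_len=int(fields["block_len"]) if "block_len" in fields else None,
        extract_encoding=fields.get("extract_encoding", "acp"),
    )
```

The signature in the hypothetical entry mixes hexadecimal and decimal values and both delimiters on purpose, to exercise the format described for start_signature.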
© Mental Computing 2009