Content plugins API

A content plugin is a search engine module that extracts plain text from documents.

There are two main types of content plugins.

First, text extractors for the 2text metaplugin. These extractors are ordinary executables, not DLLs.

Second, plugin DLLs. This type is described below.

Placement

DLLs are usually stored in c:\program files\integra\plugins\formats. The search engine initialization code scans this folder for *.dll files and tries to load each of them as a plugin.

API

1. Plugin instance construction and initialization

 void* Constructor(void)

This procedure is called once per search engine session. Its main purpose is to load all necessary DLLs, read configuration files and so on.

It returns a pointer to the plugin object, which is passed as the This argument in all subsequent calls.

2. Plugin instance destruction

void Destructor( void *This )

It frees the resources allocated by the plugin instance during the Constructor call.

This procedure is called on search engine termination.
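The Constructor and Destructor pair described above could be sketched as follows. This is a minimal illustration, not the reference implementation: DemoPlugin and its fields are hypothetical, and export decorations and the calling convention are omitted because this text does not specify them.

```cpp
#include <string>

// Hypothetical per-session plugin state. The search engine only ever sees
// it as an opaque void* and passes it back as the This argument.
struct DemoPlugin {
    std::wstring name;         // example of state prepared once per session
    bool initialized = false;  // set after the (simulated) setup work
};

// Called once per search engine session.
void* Constructor(void) {
    DemoPlugin* plugin = new DemoPlugin;
    plugin->name = L"Demo reader";
    // A real plugin would load helper DLLs and read configuration files here.
    plugin->initialized = true;
    return plugin;
}

// Called on search engine termination; releases what Constructor allocated.
void Destructor(void* This) {
    delete static_cast<DemoPlugin*>(This);
}
```

Keeping all session state inside one heap object means the Destructor has a single thing to release.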

3. Plugin features and options retrieval

const wchar_t* GetSolarixPluginProperty( void *This, int iProp, int iSub )

The search engine uses this procedure to determine the features and characteristics of the plugin. iProp and iSub are the id and sub-id of the requested property. The string value of the property is returned if possible. If the property is not supported, the procedure returns NULL.

The following minimal set of properties is required:

iProp  iSub  Meaning
0            Must be "filetype_plugin" for content plugins
1            Human-readable plugin name, e.g. "OpenOffice reader"
2            Copyright string
3            List of detected file extensions, separated by ; for example "aaa;bbb;ccc"

Future versions of the search engine may request other properties. Return NULL if you do not recognize the requested property.

4. Document opening and extraction initialization

void* StartExtraction( void *This, const wchar_t *Filename, const wchar_t **Block, unsigned int *Count, const PluginOptions *Options, IGrammarEngine *IGrammarEnginePtr, ISearchEngine *ISearchEnginePtr )

It opens the document Filename and creates an extraction context object. A pointer to this context is returned and is passed as the Ctx argument in subsequent calls. The plugin implementation is free to define the details of the context object, because the search engine handles only the pointer to it.

IGrammarEnginePtr is an interface to grammar engine services (morphology analyzer etc.).

ISearchEnginePtr is an interface to search engine services.

Options is a set of file processing flags.

On failure, it returns NULL.

It is convenient to return the first available portion of text through Block and Count. The memory block pointed to by *Block must stay valid until the next ExtractNextChunk or ExtractionComplete call. You can store the internal pointers in the extraction context structure.

The plugin can return the whole document contents during the StartExtraction call if the document is relatively small. For a big document it is preferable to return a small portion of the document on each ExtractNextChunk call.
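The small-document case described above can be sketched as follows. Several things here are assumptions: DemoContext and LoadDocumentText are hypothetical, the loader is a stand-in that fabricates text instead of parsing a real file format, and PluginOptions, IGrammarEngine and ISearchEngine are only forward-declared because their definitions are not given in this text.

```cpp
#include <cstddef>
#include <string>

// Opaque here: only pointers to these types are passed through.
struct PluginOptions;
class IGrammarEngine;
class ISearchEngine;

// Hypothetical extraction context; it owns the text that Block points into.
struct DemoContext {
    std::wstring text;
};

// Hypothetical stand-in for real format parsing. It pretends the document
// text is "Contents of <name>" and fails for a null or empty file name.
static bool LoadDocumentText(const wchar_t* Filename, std::wstring& out) {
    if (Filename == NULL || *Filename == L'\0') return false;
    out = L"Contents of ";
    out += Filename;
    return true;
}

void* StartExtraction(void* This, const wchar_t* Filename,
                      const wchar_t** Block, unsigned int* Count,
                      const PluginOptions* Options,
                      IGrammarEngine* IGrammarEnginePtr,
                      ISearchEngine* ISearchEnginePtr) {
    (void)This; (void)Options; (void)IGrammarEnginePtr; (void)ISearchEnginePtr;
    DemoContext* ctx = new DemoContext;
    if (!LoadDocumentText(Filename, ctx->text)) {
        delete ctx;
        return NULL;  // the document could not be opened
    }
    // Small document: hand back everything at once. The buffer belongs to
    // the context, so it stays valid until the context is released.
    *Block = ctx->text.c_str();
    *Count = static_cast<unsigned int>(ctx->text.size());
    return ctx;
}
```

Because the returned block lives inside the context, its lifetime matches the validity rule for Block without any extra bookkeeping.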

5. Return the format of open document

void GetMime( void *This, void *Ctx, const char **Format, const char **Subformat )

It passes the parts of the MIME format through the *Format and *Subformat pointers.

Ctx argument is a pointer returned by StartExtraction call.

6. Extract next portion of text from document

bool ExtractNextChunk( void *This, void *Ctx, const wchar_t **Block, unsigned int *Count, const PluginOptions *Options, IGrammarEngine *IGrammarEnginePtr, ISearchEngine *ISearchEnginePtr )

Ctx argument is a pointer returned by StartExtraction call.

IGrammarEnginePtr is an interface to grammar engine services (morphology analyzer etc.).

ISearchEnginePtr is an interface to search engine services.

Options is a set of file processing flags.

This procedure sets *Block to the next chunk of text content and returns true. The number of characters in the chunk is stored through Count.

If no more text is available, then it returns false.
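Chunked extraction can be sketched as a cursor moving over text held in the context. In this illustration DemoContext, its fields and the tiny chunk size are all hypothetical, and the three engine types are forward-declared only because their definitions are not part of this text.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

struct PluginOptions;
class IGrammarEngine;
class ISearchEngine;

// Hypothetical extraction context: the text plus a read cursor.
struct DemoContext {
    std::wstring text;
    std::size_t cursor = 0;      // index of the next unread character
    std::size_t chunk_size = 4;  // deliberately tiny to make chunking visible
};

bool ExtractNextChunk(void* This, void* Ctx,
                      const wchar_t** Block, unsigned int* Count,
                      const PluginOptions* Options,
                      IGrammarEngine* IGrammarEnginePtr,
                      ISearchEngine* ISearchEnginePtr) {
    (void)This; (void)Options; (void)IGrammarEnginePtr; (void)ISearchEnginePtr;
    DemoContext* ctx = static_cast<DemoContext*>(Ctx);
    if (ctx->cursor >= ctx->text.size()) {
        return false;  // no more text is available
    }
    const std::size_t n =
        std::min(ctx->chunk_size, ctx->text.size() - ctx->cursor);
    *Block = ctx->text.data() + ctx->cursor;  // points into context memory
    *Count = static_cast<unsigned int>(n);
    ctx->cursor += n;
    return true;
}
```

A caller would keep invoking ExtractNextChunk and consuming Count characters from *Block until the function returns false.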

7. Rewind the document read stream and set cursor to the beginning

void Rewind( void *This, void *Ctx )

The search engine calls this procedure when it needs to restart extraction for the document. The plugin must reset all file cursors, cache buffers and so on.
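For a context that tracks its position with a simple cursor, Rewind reduces to resetting that cursor. The DemoContext below is hypothetical; a plugin that reads from a file handle would also seek the handle back to the start and drop any cached data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical extraction context: the text plus a read cursor.
struct DemoContext {
    std::wstring text;
    std::size_t cursor = 0;  // index of the next unread character
};

// Restarts extraction so the next chunk request begins at the first character.
void Rewind(void* This, void* Ctx) {
    (void)This;  // no per-instance state is needed in this sketch
    static_cast<DemoContext*>(Ctx)->cursor = 0;
}
```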

8. Finish document extraction

void ExtractionComplete( void *This, void *Ctx, wchar_t *Block )

Closes all file handles and releases all memory buffers associated with the document extraction context. The Ctx pointer becomes invalid after this call.
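One way to make this cleanup hard to get wrong is to let the context release its own resources in its destructor, as in the sketch below. DemoContext and the g_live_contexts counter are hypothetical (the counter is only a debugging aid), and the Block argument is ignored because this text does not describe its role.

```cpp
#include <cstdio>
#include <string>

// Hypothetical debugging aid: number of contexts currently alive.
static int g_live_contexts = 0;

// Hypothetical extraction context that owns a file handle and a buffer.
struct DemoContext {
    std::FILE* file = nullptr;
    std::wstring buffer;
    DemoContext() { ++g_live_contexts; }
    ~DemoContext() {
        if (file != nullptr) std::fclose(file);  // close the document file
        --g_live_contexts;
    }
};

// Releases everything tied to the context; Ctx is invalid afterwards.
void ExtractionComplete(void* This, void* Ctx, wchar_t* Block) {
    (void)This;
    (void)Block;  // not described in the text, so left untouched here
    delete static_cast<DemoContext*>(Ctx);  // deleting a null pointer is a no-op
}
```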


© Mental Computing 2009