Search domain


Introduction
Folders
Archives and compressed files
File lists
File masks
Selecting files by size
Selecting files by modification time
File categories
URL (web sites)
Meta options - 'my documents', 'my computer', 'network neighborhood'
Searching indexes

Introduction

The search engine is designed to work with files, although it can also search text strings through API functions. The files to be processed can be located in different places:

 1. local file system and connected network disks (local spider or crawler);

 2. MS Windows local area network (taking into account authorization);

 3. web sites (including hyper-referenced - so called www crawler);

 4. archives and compressed files (most of them are processed without an external tool);

 5. named pipes and sockets.

Search domain (or search area) also includes file filters - masks for file names, conditions on file size, modification and creation dates. Folders are scanned (or crawled) recursively. Archives and compressed files are processed on-the-fly.

File masks are of two types: with wildcards (e.g. *.txt or *.htm?) and regular expressions.

Web sites are located by URL. Additionally URL mask can be used to filter documents to download.

The contents of files can be in a wide variety of formats, including plain ASCII text, UTF-8 and UTF-16 text for national characters, HTML, XML, Acrobat PDF, Microsoft Word and some others (the full list of supported formats is given separately).

One of the most useful features of the search engine is searching indexes. It allows searching in indexed sets of files on removable media (CD/DVD) without physical access to the files.

Folders

The folder to be scanned must be the first argument in command line:

faind c:\ ...

There is another way to declare the folder name to be processed:

faind -dir c:\

The -dir option followed by one or more folder names adds those directories, together with their subdirectories, to the search. This option may appear anywhere in the command line, not only in the first position.

By default subfolders are crawled recursively. Use the -recurse option to control this:

-recurse=on - subfolders are scanned recursively

-recurse=off - subfolders ARE NOT scanned

-r - equivalent to -recurse=on

It may be necessary to process several folders. In this case you can list all of them in a single -dir option:

-dir "folder1;folder2;..."

Folder names are separated by semicolon ";".

A list of folders can also be stored in a text file and then used in the command line (note the @ character):

-dir @list_file

CD/DVD drives can be selected with the -cdrom command.

Files

The name of file to process can be declared as the very first argument in the command line:

faind e:\docs\cats.txt ...

You can issue the -file command anywhere in the command line:

faind ... -file e:\docs\cat.txt ...

Several files can be named in one query. You can either repeat the -file command for each file or issue one composite -file command followed by a list of files:

faind ... -file "aaa;bbb;ccc" ...

The file names are separated by semicolons. The length of the list is limited only by the OS shell.

The file names can also be stored in a text file, which is then passed as the argument of the -file command:

faind ... -file @eee ...

Archives and compressed files

The compression formats supported by the search engine include ARJ, GZIP, TAR, BZIP, RAR, 7ZIP, etc. The full list of supported formats can be obtained with the command:

faind -help=5

When the search engine encounters an archive in a supported format, it unpacks it automatically.

-unpack=on - scan archive contents. To skip archives, use -unpack=off. The default value for this option is defined in the ini file.

Archives and compressed files from web sites are also automatically downloaded, unpacked and scanned. 

File lists

If the first argument of the command line is the name of a file, then this file is processed.

For example, the command:

faind CAT.RAR -sample "dog"

searches for the pattern dog in the file CAT.RAR (a compressed file).

Sometimes the files to process are listed in a text file. For example, the list may be the output of another program:

dir /b *.txt > list

for MS Windows or

ls *.txt > list

for Linux.

Such a file list can be processed with the option:

-file @list

Files can also be enumerated directly in the command line:

-file "filename1;filename2;..."
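The two ways of passing a list, inline with semicolons or through an @-prefixed text file, can be sketched in Python as follows. This only illustrates the convention described above and is not FAIND's own parser: the function name expand_list_argument is hypothetical, and the one-name-per-line file layout is an assumption based on the dir /b and ls examples.

```python
from pathlib import Path


def expand_list_argument(arg):
    """Turn a -file / -dir style argument into a list of names.

    Assumption: an @-prefixed argument names a text file holding one
    entry per line; anything else is a semicolon-separated inline list.
    """
    if arg.startswith("@"):
        lines = Path(arg[1:]).read_text().splitlines()
        # Skip blank lines and trim surrounding whitespace.
        return [line.strip() for line in lines if line.strip()]
    return [name.strip() for name in arg.split(";") if name.strip()]
```

For instance, expand_list_argument("aaa;bbb;ccc") yields the three names aaa, bbb and ccc, while expand_list_argument("@list") reads them from the file named list.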

The -flist option loads a list of files from an XML file. This XML file has a simple format (described separately). It can be generated by a previous run of the FAIND search tool (see option -listfiles:xml). For example, in the first round

faind c:\ -name *.txt -listfiles my_files ...

the search engine collects the files matching the broad conditions and writes them to an XML file. In the second round the command

faind -flist my_files ...

starts scanning the files selected at first step.

File masks

File masks help select the files by their name (by file name extension, usually).

-name xxx - ordinary file mask with wildcards - symbols * and ? (or list of masks divided by semicolon ';').

-name:rx xxx - regular expression (or list of expressions divided by semicolon ';').

-iname xxx - same as -name, but case-insensitive.

Examples:

faind c:\ -name *.txt

faind \home -iname "*.txt;*.htm*"

faind e:\docs -name:rx "cat(\w*).(.*)"

MS Windows ignores case in file names, so -iname and -name are equivalent there, whereas GNU/Linux takes case into account.
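The difference between case-sensitive and case-insensitive wildcard masks can be illustrated with Python's standard fnmatch module. This is a sketch of general wildcard semantics, not of FAIND's matcher: the function name matches_masks is hypothetical, and fnmatch treats ? as exactly one character, which may differ from what FAIND does.

```python
import fnmatch


def matches_masks(filename, masks, ignore_case=False):
    """Return True if filename matches any mask in a ';'-separated list."""
    if ignore_case:
        # Emulate -iname by comparing lower-cased strings.
        filename, masks = filename.lower(), masks.lower()
    return any(
        fnmatch.fnmatchcase(filename, mask)
        for mask in masks.split(";")
        if mask
    )
```

With these rules NOTES.TXT is rejected by the mask *.txt unless ignore_case is set, and index.html is accepted by the list "*.txt;*.htm*".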

A list of masks can be stored in a text file and referred to later this way:

faind c:\ -name @text_files

Use regular expression masks only if you really understand the syntax of regular expressions, which is neither easy nor obvious. For example, the option

-iname:rx "(.+)/a(.+)\.txt"

selects text files (extension .txt) whose names begin with the letter 'a'. The leading part of the regular expression, (.+)/, skips the absolute path, because the filter receives the full path to the file.
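The following Python sketch shows why such masks need care. It assumes that the mask must match the whole path and that the regular expression dialect behaves like Python's re module; neither is confirmed for FAIND. Under these assumptions the mask above also accepts a file whose name does not begin with 'a' when one of its parent folders does, because (.+) can match across '/'. The stricter variant named strict is a suggestion, not part of FAIND.

```python
import re

# The mask from the text, applied to the whole path (assumed semantics).
loose = re.compile(r"(.+)/a(.+)\.txt", re.IGNORECASE)

# Stricter variant: no '/' is allowed between the 'a' and the extension,
# so the 'a' must start the file name itself.
strict = re.compile(r"(.+)/a[^/]+\.txt", re.IGNORECASE)


def accepts(mask, path):
    """Return True if the compiled mask matches the entire path."""
    return mask.fullmatch(path) is not None
```

Here both masks accept /home/user/apple.txt, but only the loose one accepts /home/abc/banana.txt.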

Selecting files by size

To select files of a certain size, use the -size option, followed by the condition and the file size to match.

General syntax is:

-size "CCCSSS"

where CCC is a condition, SSS is a size.

Condition is a character (or two):

+ or >

greater than the given size

>=

greater than or equal to the given size

- or <

less than the given size

<=

less than or equal to the given size

= or ==

equal to the size

!=

not equal to the given size

Size may be given in three scales:

1. as bytes - by default

2. as kilobytes - when the size is followed by K

3. as megabytes - when the size is followed by M

Examples:

-size "<=100K"    search the files whose size is less than or equal to 100 Kb

-size "+1M"          search the files whose size is greater than 1 Mb

-size "!=10000"   search the files whose size is not equal to 10000 bytes
 

The search engine allows two filter options to be used to limit the range:

-size ">1K" -size "<100K"  search the files whose size is between 1 and 100 Kb 

-empty selects the empty files (it is implemented for compatibility with GNU find). 
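The -size conditions described above can be modelled in a few lines of Python. This is a minimal sketch, not FAIND's implementation: the function name size_filter is hypothetical, and the multipliers K = 1024 and M = 1024 * 1024 are assumptions, since the text does not say whether decimal or binary units are meant.

```python
import operator
import re

# Condition signs from the text mapped to comparison functions.
_OPS = {
    "+": operator.gt, ">": operator.gt, ">=": operator.ge,
    "-": operator.lt, "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
}
# Assumed binary multipliers; FAIND may use 1000-based units instead.
_SCALES = {"": 1, "K": 1024, "M": 1024 * 1024}
# Two-character conditions are listed first so they win over one-character ones.
_SPEC = re.compile(r"(>=|<=|==|!=|[+\-<>=])(\d+)([KM]?)")


def size_filter(spec):
    """Build a predicate on file size (in bytes) from a 'CCCSSS' string."""
    match = _SPEC.fullmatch(spec.strip())
    if match is None:
        raise ValueError(f"bad size condition: {spec!r}")
    sign, digits, scale = match.groups()
    limit = int(digits) * _SCALES[scale]
    compare = _OPS[sign]
    return lambda size: compare(size, limit)
```

Two predicates combined with a logical AND reproduce the range form, for example size_filter(">1K") together with size_filter("<100K").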

Selecting files by modification time 

The following command filters files by their last modification time:

-modif "CCCdate time" "date_format time_format"

-modif "CCCdate" "date_format"

CCC is a condition sign:

+ or >

greater than

>=

greater or equal

- or <

less than

<=

less or equal

= or ==

equal

!=

not equal

The date and time format string contains the following control characters:

DD - day number

MM - month number for date or minutes for time

YYYY - 4-digit year number

MMM - 3-letter month name (JAN-FEB-...DEC)

HH - hour

SS - seconds

Examples:

-modif ">=12-01-2003" "dd-mm-yyyy"

-modif "==12.01.2003 15:00:00" "dd.mm.yyyy hh:mm:ss"

Another syntax of the command is possible:

-modif 0

stands for files modified today,

-modif 1

stands for files modified yesterday, and so on.
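The format strings used by -modif can be related to the directives of Python's datetime.strptime. The sketch below is illustrative only: the helper names are hypothetical, the rule "MM means month before the space and minutes after it" is an assumption drawn from the examples, and the MMM month names are parsed according to the current locale.

```python
from datetime import datetime


def to_strptime(fmt):
    """Translate a 'dd.mm.yyyy hh:mm:ss' style format into strptime directives."""
    date_part, sep, time_part = fmt.upper().partition(" ")
    date_part = (date_part.replace("YYYY", "%Y")
                          .replace("MMM", "%b")
                          .replace("MM", "%m")
                          .replace("DD", "%d"))
    time_part = (time_part.replace("HH", "%H")
                          .replace("MM", "%M")
                          .replace("SS", "%S"))
    return date_part + sep + time_part


def parse_modif(value, fmt):
    """Parse the date (and optional time) given to -modif."""
    return datetime.strptime(value, to_strptime(fmt))


def modified_days_ago(mtime, days, now):
    """Model of '-modif N': True if mtime falls on the calendar day N days before now."""
    return (now.date() - mtime.date()).days == days
```

A condition such as ">=12-01-2003" then amounts to comparing a file's modification time with parse_modif("12-01-2003", "dd-mm-yyyy").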

File categories

By default the search engine processes only text documents. Use the -store_all_files=true command to process all files in the search domain. This command is used by Integra, for example, to store the whole list of files on a CD/DVD.

-allow_raw=true activates a heuristic algorithm which extracts text from binary files of unknown format. The command must be used in combination with -raw_ext "aaa;bbb;ccc", which sets the file extensions to be processed by the text extraction algorithm.

-allow_audio=true enables the extraction of tags from some audio files (mp3, for example).

-allow_gfx=true enables the extraction of text comments from picture files (JPEG, for example).

-allow_video=true enables the extraction of text comments from video files.

-allow_exec=true enables the extraction of the version number and developer name from executables.

URL (web sites)

Use option

-url address

to scan a web site. The command argument is a URL: the address of a web site or of a web page. For example:

-url http://www.somedomain.ru

or

-url http://127.0.0.1:8080/default.shtml

If access to the internet requires a proxy server, it is configured in the ini file with the variable proxy in the section internet:

[internet]
proxy = "http://172.168.1.222:3120"

The search engine follows hyper references if the option -href=true is used. Hyper references are ignored by default, because following them can have a surprising effect: the search engine starts scanning more and more web sites. There are three ways to limit uncontrolled surfing.

First, search engine can be forbidden to leave the original site:

-same_domain=true

This option tells the search algorithm to follow only those hyper references which lead to the same site.

Second, it is possible to limit the depth of web search, that is the number of jumps from one reference to another:

-maxdepth=NN

Consider the case -maxdepth=2, with the search engine starting from -uri=http://www.solarix.ru. The scanner first accesses the title page of the site (the index.shtml file). There it finds a hyper reference pointing to www.solarix.ru/for_users/dowsload_them/faind/faind.shtml. Following it is the first jump. The search engine loads that page and analyses it. This page also contains hyper references, and following any of them is a second jump. All hyper references on .../faind.shtml are processed this way, and that is all: no deeper jumps are made.
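The interplay of the depth limit and the same-site restriction can be sketched as a breadth-first traversal. The Python code below works on an in-memory link table instead of real downloads, and it is only a model of the behaviour described above, not FAIND's crawler; the function name crawl and its parameters are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start, links, max_depth, same_domain=False):
    """Return pages in visiting order, at most max_depth jumps from start.

    links maps a page URL to the hyper references found on that page.
    """
    start_host = urlsplit(start).hostname
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow references further
        for target in links.get(url, []):
            if target in visited:
                continue
            if same_domain and urlsplit(target).hostname != start_host:
                continue  # reference leaves the original site
            visited.add(target)
            queue.append((target, depth + 1))
    return order
```

With max_depth=2 the start page, the pages it references and the pages those reference are visited, and nothing deeper.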

Third, it is possible to use masks to web address:

-urlmask "(.+)\.gov"

A URL mask is a regular expression. Each hyper reference is checked against the masks, and it is followed if any of the masks matches. A set of masks can be declared in one -urimask.

-urinotmask can be used to prevent the crawler from following the matching hyper references:

-urlnotmask "(.+)banner(.+);(.+)\.xxx"
 

All options listed above can be used in arbitrary combination.

Additional features are as follows.

List of masks can be stored in text file and later used like this:

-name @urls_masks

A list of web addresses can be stored in a text file (as usual for FAIND!) and then used at any moment:

-uri @urls_file

The command:

-maxtraffic=XXX

makes it possible to limit the internet traffic when scanning web sites. The limit value can be bytes, kilobytes (suffix K) or megabytes (suffix M), e.g.:

-maxtraffic=500K

Downloaded documents can be stored in files on the local host. This allows browsing them offline without downloading them again. Download mode is switched on by the option:

-store_download=true

Default value for this parameter is defined in ini file.

Downloaded documents are saved in special folder defined in ini file - variable download_dir in section internet.

Note that there is no simple correspondence between the URL of an original document and the name of the saved file. There are two ways to solve this problem.

First, every downloaded file has a companion file with the same name and the extension 'uri'. This file contains a description of the document source.

Second, the XML file with search results stores both the original name of the document (plus its source) and the name of the file in the download directory.

The HTML result file (generated by the -listfiles:html option) does all the work: it contains clickable references to the downloaded documents.
 

Metaoptions - 'my documents', 'my computer', 'network neighborhood'

These options simplify the search in the user's home folder (documents folder), on all disks and in the local area network. We call them 'metaoptions' because the search engine resolves these options into absolute folder names before searching. Note that searching in network neighbourhood can take significant time to prepare the list of available network resources.

Option

-mydocs

searches the files in the 'MyDocuments' folder (on most modern operating systems, including MS Windows and Linux, each user has a personal folder for documents).

Option

-mycomp

searches the files in every directory of all hard disks, so you don't have to enumerate the drives by hand. Be aware of file access permissions: when the search engine is denied access to a file, it prints a diagnostic warning on the screen (in the console version of the search tools) and continues searching.

Option

-lan

starts crawling the local area network. Every host (shared resource to be precise) is opened (if possible) and scanned for files. 

Using indexes

Another type of search domain definition is index:

-index domain "CD science fiction" -sample "Stanislaw lem"

More information about the zone is in the "Indexer" chapter.

Accent stripping

Some languages (French is one of them) use special signs to modify Latin alphabet letters:

L'HÔTEL

Such signs (diacritics) make it harder to compare words directly, because they change the character code.

Command

-strip_accents=true

strips the accents, so é becomes e and so on. It is recommended to issue this command when indexing the files in order to decrease the number of keywords in index database.
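A common way to strip accents, shown here as a Python sketch, is to decompose each character into a base letter plus combining marks and then drop the marks. Whether FAIND works this way internally is not stated in the text; the function name strip_accents is chosen only for this illustration.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applied to L'HÔTEL this gives L'HOTEL, so accented and unaccented spellings map to the same keyword.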

HTML tags stripping

HTML and XML formats store some information inside tags <...>. Usually this information is of no interest, so the tag internals are discarded when files are processed. You can change this behavior with the command:

-stripdecor=false
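What "eliminating tag internals" means can be illustrated with Python's standard html.parser module: only the text between tags is kept, while tag names and attributes are dropped. This is a sketch of the idea, not of FAIND's HTML handling; the names TextOnly and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    """Return the markup with all tags (and their attributes) removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

For the input <p class="note">Hello <b>world</b></p> only the text Hello world remains, so the word "note" inside the tag cannot be found by a search.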

Document character encodings

Command

-cp NNN

sets the only legal coding for documents. If a document declares another coding (HTML, XML) then it is ignored.

Command

-prefer_cp NNN

sets the encoding to use when a document carries no information about its own encoding.

Extended syntax

-prefer_cp "MMM;NNN;KKK"

sets the list of codepages to be used by codepage guesser.
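The text does not describe how the codepage guesser chooses among the listed codepages. One naive strategy, given here purely as an assumption, is to try the candidates in order and keep the first one that decodes the data without errors. The function name decode_with_candidates is hypothetical, and Python codec names such as cp1251 stand in for the NNN codepage identifiers.

```python
def decode_with_candidates(data, codepages):
    """Decode bytes with the first candidate encoding that raises no error."""
    for name in codepages:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings fits the data")
```

The order of candidates matters with this strategy: single-byte codepages accept almost any byte sequence, so stricter encodings such as UTF-8 should come first.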

External search engines

External search engines are implemented as plugins. Usage syntax:

-engine "plugin_name"

-engine "plugin_name?param1=value1&param2=value2..."

Additional parameters are appended after the ? character.
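The -engine argument has the same shape as the query part of a URL, so it can be split with Python's standard urllib.parse module. The sketch below assumes URL-style rules (including percent-escapes in values), which the text does not confirm; the function name parse_engine_argument is hypothetical.

```python
from urllib.parse import parse_qsl


def parse_engine_argument(arg):
    """Split 'plugin_name?param1=value1&param2=value2' into name and parameters."""
    name, _, query = arg.partition("?")
    return name, dict(parse_qsl(query))
```

An argument without the ? part simply yields the plugin name and an empty parameter dictionary.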

