Usenet FAQ Archive Search Help

[ By Archive-name | By Author | By Category | By Newsgroup ]
[ Home | Latest Updates | Archive Stats | Search | Help ]


The FAQ Archive Search facility lets you perform different kinds of text searches across the entire *.answers FAQ database (over 4000 FAQs). It uses the Glimpse search engine. Additional information is available on glimpse and glimpseindex.

Search Methods

FAQ Search Options

Formulating Queries



Search FAQs - Show References

Search FAQs - Show References is the default search available on many of the pages. It searches the full text of the FAQs for the string you entered and lists each matching document together with the lines that reference your search string. This lets you review the references without having to read each document, and quickly bring up the documents that interest you. Below is sample output of a Search FAQs - Show References search for the string "rkive".

----------

File name (modification date), and list of matched lines

1. alt-sources-intro, ( Dec 30 1996)

2. usenet/software/part1, ( Dec 29 1996)

3. ftp-list/sitelist/part14, ( Dec 14 1996)
...


Search Subject/Archive Names

Search Subject/Archive Names allows you to search for an FAQ by its Archive-name or by any information contained in the Subject: line. Below is sample output of a Search Subject/Archive Names search.

----------

File name and article Subject:

  1. ftp-list/faq- Anonymous FTP: Frequently Asked Questions (FAQ) List
  2. ftp-list/sitelist/part1- Anonymous FTP: Sitelist Part 1 of 23 [01/23]
  3. ftp-list/sitelist/part2- Anonymous FTP: Sitelist Part 2 of 23 [02/23]
  4. ...


Search Article Headers

Search Article Headers allows you to search for information contained in the RFC 1036 news headers of the posted document. For example, if you enter the Archive-name of an FAQ, you will get back the headers of that FAQ (minus a couple of headers dropped for administrative purposes). Below is sample output of a Search Article Headers search.

------------

File name and header references


1. computer-security/anonymous-ftp-faq

2. ftp-list/faq

3. ftp-list/sitelist/part1

....

FAQ Search Options

These options give you some control over how the query is specified and how the results are presented. They apply only to searches of the FAQ database (Search FAQs - ...), not to the "Archive-name" or "Article Header" database searches.

Search Case Sensitivity

This allows you to select whether or not you want the 'case' of the search to matter. By default, case does not matter so "Joe Smith" is the same as "joe smith". If case is important then select "Case Sensitive Search".

Whole Word Matching

By default, keywords match only on word boundaries. With whole word matching turned off, a keyword also matches part of a word (or phrase). For example, "network" will match "networking", "sensitive" will match "insensitive", and "Arizona desert" will match "Arizona desertness".
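
The effect of the case sensitivity and whole word options can be illustrated with a minimal Python sketch. It uses Python's re module as a stand-in rather than Glimpse itself, and the function name matches and its parameters are hypothetical.

```python
import re

def matches(keyword, text, whole_word=True, case_sensitive=False):
    """Return True if keyword occurs in text under the chosen options."""
    pattern = re.escape(keyword)  # treat the keyword as a literal string
    if whole_word:
        # \b requires a word boundary on both sides of the keyword.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None

# Whole word mode rejects "networking"; partial mode accepts it.
print(matches("network", "a networking guide"))
print(matches("network", "a networking guide", whole_word=False))
# Case only matters when case_sensitive is switched on.
print(matches("joe smith", "Posted by Joe Smith"))
print(matches("joe smith", "Posted by Joe Smith", case_sensitive=True))
```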

Misspellings Allowed

A match is allowed to contain a number of errors. An error is either a deletion, insertion, or substitution of a single character. The default is 0 (zero) errors. You can choose from 0 to 4 errors. Allowing errors in the match requires more time and can slow down the search by a factor of 2-4. Be very careful when specifying more than one error, as the number of matches tends to grow very quickly.
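
To make the notion of an "error" concrete, the sketch below counts the minimum number of single-character deletions, insertions, and substitutions needed to turn one word into another. It is the textbook dynamic-programming formulation, offered only as an illustration and not as the algorithm Glimpse uses internally; the names edit_distance and within_errors are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character deletions, insertions,
    or substitutions that turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def within_errors(keyword, word, max_errors=0):
    """True if word matches keyword with at most max_errors errors."""
    return edit_distance(keyword.lower(), word.lower()) <= max_errors

print(within_errors("glimpse", "glimse", max_errors=1))
```

Each extra error allowed widens the set of words that qualify, which is why the number of matches grows so quickly.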

Show Number of References

This allows you to specify the number of references shown per file. References are simply the lines in the document that match your search criteria. If the limit is exceeded, a message is displayed indicating that the limit was reached and that there are more references in that specific file. You can choose to have 10, 30, 50, 100, or all references found in a file displayed. The default is 30 references.

Total Number of Files Returned

This allows you to specify the number of files you wish returned. If the limit is exceeded, a message is displayed indicating that the limit was reached and that there may be more files that match your search criteria. You can choose to have 10, 50, 100, or 1000 files returned. The default is 50 files.
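
The two limits work independently: one caps the lines shown per file, the other caps the number of files. The following is a rough Python sketch of that behavior, assuming the raw results arrive as a mapping from file name to matched lines; the function limit_results and its return shape are hypothetical and do not describe the search form's actual implementation.

```python
from typing import Dict, List, Optional, Tuple

def limit_results(
    found: Dict[str, List[str]],
    max_refs: Optional[int] = 30,   # None stands for "all references"
    max_files: int = 50,
) -> Tuple[List[Tuple[str, List[str], bool]], bool]:
    """Apply the per-file and total-file limits to raw search results.

    Returns (shown, more_files). Each entry of shown is
    (file name, kept lines, whether lines were cut off for that file).
    """
    shown = []
    for name, lines in list(found.items())[:max_files]:
        kept = lines if max_refs is None else lines[:max_refs]
        shown.append((name, kept, len(kept) < len(lines)))
    more_files = len(found) > max_files
    return shown, more_files

raw = {"ftp-list/faq": ["line"] * 40, "usenet/software/part1": ["line"]}
shown, more_files = limit_results(raw, max_refs=30, max_files=1)
print(len(shown[0][1]), shown[0][2], more_files)
```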

Meta Character in Search String

This option either enables or disables Meta Character interpretation in the search string you entered. This allows you to search for strings that have Meta characters in them.


Formulating Queries

Overview

The simplest query is a single keyword, such as:

	edi

Searching for common words (like "computer" or "html") may take a lot of time.

It is often helpful to use more powerful queries. The supported query types, and how to use them, are discussed below.

The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus on the search form.

Keyword searches can be combined using Boolean operators (AND and OR) to form complex queries. In the absence of parentheses, the operators are evaluated left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g.,

	"internet resource discovery"
or
	"discov.*"

Examples

Simple keyword search query:

	Internet
This query will return all indexed objects containing the word Internet.

Boolean query:

	Internet AND EDI
This query will return all indexed objects that contain both words anywhere in the object in any order.

Phrase query:

	"Internet Security"
This query will return all indexed objects that contain Internet Security as a phrase. Notice that you need to put double quotes around the phrase.
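
The way keywords, quoted phrases, and Boolean operators combine can be sketched in a few lines of Python. This is an illustration only: the function evaluate is hypothetical, it uses plain case-insensitive substring tests instead of Glimpse, and it assumes queries without parentheses so that operators are simply applied left to right.

```python
import shlex

def evaluate(query, text):
    """Evaluate a query such as: "Internet Security" AND Firewall
    against text, applying AND/OR strictly left to right."""
    tokens = shlex.split(query)      # keeps a double-quoted phrase as one token
    lowered = text.lower()
    result = tokens[0].lower() in lowered
    # Walk the remaining (operator, term) pairs in order, with no precedence.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hit = term.lower() in lowered
        if op == "AND":
            result = result and hit
        elif op == "OR":
            result = result or hit
        else:
            raise ValueError("unknown operator: " + op)
    return result

print(evaluate("Internet AND EDI", "EDI over the Internet"))
print(evaluate('"Internet Security" AND Firewall', "A firewall FAQ on Internet security"))
```

Because evaluation is left to right, a query of the form A OR B AND C is treated as (A OR B) AND C.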

Boolean queries with phrases:

	"Internet Security" AND Firewall

Patterns

glimpse supports a large variety of patterns, including simple strings, strings with classes of characters, sets of strings, wild cards, and regular expressions. (See Limitations.)

Strings

A string is any sequence of characters, including the special symbols `^' for beginning of line and `$' for end of line. The special characters `$', `^', `*', `[', `|', `(', `)', `!', and `\', as well as the meta characters special to glimpse (and agrep), `;', `,', `#', `<', `>', `-', and `.', should be preceded by `\' if they are to be matched as regular characters. For example, \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds to the string abc at the beginning of a line.
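
If you build search strings in a program, the escaping rule above is easy to automate. The helper below is a small sketch, not part of Glimpse: the name escape_literal is hypothetical, and the character set is taken only from the list in this section.

```python
# Characters this help page lists as needing a preceding backslash.
GLIMPSE_SPECIAL = set("$^*[|()!\\;,#<>-.")

def escape_literal(text):
    """Prefix every special or meta character with a backslash so the
    whole string is searched for literally."""
    return "".join("\\" + ch if ch in GLIMPSE_SPECIAL else ch for ch in text)

print(escape_literal("ab.de"))
print(escape_literal("^abc\\"))
```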

Classes of characters

A list of characters inside [] (in order) corresponds to any character from the list. For example, [a-ho-z] is any character between a and h or between o and z. The symbol `^' inside [] complements the list. For example, [^i-n] denotes any character in the character set except the characters `i' through `n'. The symbol `^' thus has two meanings, but this is consistent with egrep. The symbol `.' (don't care) stands for any symbol (except for the newline symbol).
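
Python's re module happens to use the same bracket notation, so it can serve as a quick way to experiment with character classes. The sketch below is an illustration under that assumption and does not call Glimpse; the helper chars_matching is hypothetical.

```python
import re

in_ranges = re.compile(r"[a-ho-z]")   # a through h, or o through z
not_i_to_n = re.compile(r"[^i-n]")    # anything except i through n
any_char = re.compile(r".")           # any character except newline

def chars_matching(pattern, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the characters of alphabet that the class accepts."""
    return "".join(ch for ch in alphabet if pattern.fullmatch(ch))

print(chars_matching(in_ranges))
print(chars_matching(not_i_to_n, alphabet="hijklmno5"))
```

Note that a complemented class such as [^i-n] also accepts digits and punctuation, not just the remaining letters.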

Boolean operations

Glimpse supports an `AND' operation denoted by the symbol `;', an `OR' operation denoted by the symbol `,', or any combination of the two. For example, glimpse `pizza;cheeseburger' will output all lines containing both patterns. glimpse -F `gnu;\.c$' `define;DEFAULT' will output all lines containing both `define' and `DEFAULT' (anywhere in the line, not necessarily in order) in files whose name contains `gnu' and ends with .c. glimpse `{political,computer};science' will match `political science' or `science of computers'.
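
The meaning of `;', `,', and {} on a single line of text can be sketched with a small recursive parser. This is a simplified model, not Glimpse's implementation: the function line_matches is hypothetical, terms are tested as plain substrings, and mixed operators outside braces are assumed to apply left to right.

```python
def line_matches(pattern, line):
    """Evaluate a pattern such as {political,computer};science on one line.
    ';' means AND, ',' means OR, and {} groups a sub-expression."""
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(pattern) and pattern[pos] in ";,":
            op = pattern[pos]
            pos += 1
            right = parse_term()   # always parse, so pos keeps advancing
            result = (result and right) if op == ";" else (result or right)
        return result

    def parse_term():
        nonlocal pos
        if pos < len(pattern) and pattern[pos] == "{":
            pos += 1               # consume "{"
            value = parse_expr()
            pos += 1               # consume "}"
            return value
        start = pos
        while pos < len(pattern) and pattern[pos] not in ";,{}":
            pos += 1
        return pattern[start:pos] in line   # plain substring test

    return parse_expr()

print(line_matches("{political,computer};science", "science of computers"))
```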

Wild cards

The symbol `#' is used to denote a sequence of any number (including 0) of arbitrary characters (see Limitations). The symbol # is equivalent to .* in egrep. In fact, .* will work too, because it is a valid regular expression (see below), but unless this is part of an actual regular expression, # will work faster. (Currently glimpse is experiencing some problems with #.)
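
The stated equivalence between # and .* can be tried out in Python. The sketch below is an assumption-laden illustration: the helper wildcard_to_regex is hypothetical, and it treats everything other than # as literal text, which is narrower than what Glimpse accepts.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern that uses # as a wild card into a Python
    regular expression, treating all other characters literally."""
    return ".*".join(re.escape(part) for part in pattern.split("#"))

regex = wildcard_to_regex("net#ing")
print(regex)
print(re.search(regex, "networking") is not None)
```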

Combination of exact and approximate matching

Any pattern inside angle brackets <> must match the text exactly even if the match is with errors. For example, <mathemat>ics matches mathematical with one error (replacing the last s with an a), but mathe<matics> does not match mathematical no matter how many errors are allowed. (This option is buggy at the moment.)


Regular Expressions

Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower than other searches. The following is a partial list of possible patterns. (For more details see the Glimpse manual pages.)

Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, "lin.*ing" will find "linking" or "flinching," but not "linear programming."

Since the index is word based, a regular expression must match words that appear in the index for it to be found. Glimpse first strips all non-alphabetic characters from the regular expression and searches the index for all remaining words. It then applies the regular expression matching algorithm to the files found in the index. For example, glimpse `abc.*xyz' will search the index for all files that contain both `abc' and `xyz', and then search directly for `abc.*xyz' in those files. (If you use glimpse -w `abc.*xyz', then `abcxyz' will not be found, because glimpse will think that abc and xyz need to be matched as whole words.)

The syntax of regular expressions in glimpse is in general the same as that for agrep. The union operation `|', Kleene closure `*', and parentheses () are all supported. Currently `+' is not supported. Some options (-d, -w, -t, -x, -D, -I, -S) do not currently work with regular expressions. The maximal number of errors for regular expressions that use `*' or `|' is 4. (See Limitations.)
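
The two-phase lookup described above (consult the word index first, then run the regular expression only on the candidate files) can be modeled in Python. This is a simplified sketch, not Glimpse's code: the functions build_index, candidate_files, and regex_search are hypothetical, the index is an in-memory dictionary, and Python's re module stands in for agrep.

```python
import re

MAX_WORD_LENGTH = 64  # per this help page, longer words are not indexed

def build_index(files):
    """Map each lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) <= MAX_WORD_LENGTH:
                index.setdefault(word, set()).add(name)
    return index

def candidate_files(regex, index):
    """Phase 1: keep files whose indexed words contain every alphabetic
    fragment left after stripping non-alphabetic characters."""
    fragments = re.findall(r"[a-z]+", regex.lower())
    candidates = None
    for fragment in fragments:
        hits = {name for word, names in index.items()
                if fragment in word for name in names}
        candidates = hits if candidates is None else candidates & hits
    if candidates is None:  # no alphabetic fragment: nothing can be ruled out
        candidates = {name for names in index.values() for name in names}
    return candidates

def regex_search(regex, files):
    """Phase 2: run the real regular expression on the candidates only."""
    index = build_index(files)
    return sorted(name for name in candidate_files(regex, index)
                  if re.search(regex, files[name]))

docs = {"a.txt": "see abcxyz here", "b.txt": "only abc", "c.txt": "xyz alone"}
print(regex_search("abc.*xyz", docs))
```

The index pass is cheap and discards most files, so the expensive regular expression scan touches only a small subset.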


Limitations

The index of glimpse is word based. A pattern that contains more than one word cannot be found in the index. The way glimpse overcomes this weakness is by splitting any multi-word pattern into its set of words and looking for all of them in the index. For example, glimpse `linear programming' will first consult the index to find all files containing both linear and programming, and then apply agrep to find the combined pattern. This is usually an effective solution, but it can be slow for cases where both words are very common, but their combination is not.

As was mentioned in the section on Patterns above, some characters serve as meta characters for glimpse and need to be preceded by `\' to search for them. The most common examples are the characters `.' (which stands for a wild card), and `*' (the Kleene closure). So, "glimpse ab.de" will match abcde, but "glimpse ab\.de" will not (it matches only the literal string ab.de), and "glimpse ab*de" will not match ab*de, but "glimpse ab\*de" will. The meta character - is translated automatically to a hyphen unless it appears between [] (in which case it denotes a range of characters).

The index of glimpse stores all patterns in lower case. When glimpse searches the index it first converts all patterns to lower case, finds the appropriate files, and then searches the actual files using the original patterns. So, for example, glimpse ABCXYZ will first find all files containing abcxyz in any combination of lower and upper cases, and then search these files directly, so only matches with the right case will be found. One problem with this approach is discovering misspellings that are caused by wrong cases. For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because the pattern is converted to lower case); it will find that there are matches with no errors, and will go to those files to search them directly, this time with the original upper cases. If the closest match is, say, AbcXYZ, glimpse may miss it, because it doesn't expect an error. Another problem is speed. If you search for "ATT", it will look at the index for "att". Unless you use -w to match the whole word, glimpse may have to search all files containing, for example, "Seattle", which has "att" in it.
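
The "ATT" versus "Seattle" speed problem can be reproduced with a small model of the lower-case index. The sketch below is illustrative only: the function case_sensitive_search is hypothetical, it assumes a single alphabetic keyword, and it rebuilds the word list on the fly instead of using a stored index.

```python
import re

def case_sensitive_search(pattern, files, whole_word=False):
    """Return (candidates, matches): files selected by the lower-case
    index lookup, and those that still match with the original case."""
    needle = pattern.lower()
    candidates = []
    for name, text in files.items():
        words = re.findall(r"[a-z]+", text.lower())   # what the index stores
        if whole_word:
            hit = needle in words
        else:
            hit = any(needle in word for word in words)
        if hit:
            candidates.append(name)
    # Second pass: search the candidate files with the original case.
    if whole_word:
        exact = re.compile(r"\b" + re.escape(pattern) + r"\b")
    else:
        exact = re.compile(re.escape(pattern))
    matches = [name for name in candidates if exact.search(files[name])]
    return candidates, matches

docs = {"phone.txt": "ATT long distance", "city.txt": "Seattle weather"}
print(case_sensitive_search("ATT", docs))
print(case_sensitive_search("ATT", docs, whole_word=True))
```

Without whole word matching, both files have to be scanned even though only one contains "ATT"; with it, the index lookup alone rules out the second file.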

There is no size limit for simple patterns and simple patterns within Boolean expressions. More complicated patterns, such as regular expressions, are currently limited to approximately 30 characters. Lines are limited to 1024 characters.

Words longer than 64 characters are not indexed.



© Copyright The Internet FAQ Consortium, 1996
All rights reserved