See question 'How can I index HTML files?' first.
Let's assume your server's pages reside in the directory
'/home/robots/www/pages' and your server's URL is
'http://myserver/'. The database will be named 'www-pages'.
A simple format file (www-pages.fmt) would be:
    record-sep: /\n\n/ # never matches
    layout:
        headline: /<[Tt][Ii][Tt][Ll][Ee]>/ /<\/[Tt][Ii][Tt][Ll][Ee]>/ 80
                  /<[Tt][Ii][Tt][Ll][Ee]> *./
    end:
    region: /<[Hh][Tt][Mm][Ll]>/
        stemming TEXT GLOBAL
    end: /<.[Bb][Oo][Dd][Yy]>/
Then call:
    waisindex -t URL /home/robots/www/pages http://myserver \
        -d www-pages -t fields \
        `find /home/robots/www/pages -type f -name "*.html" -print`
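To see which URL each indexed file should end up with, the following
sketch may help. It assumes that '-t URL' strips its first argument from
the front of every file path and prepends its second argument; the
function name path_to_url is hypothetical and is not part of freeWAIS-sf.

```shell
#!/bin/sh
# Hypothetical helper (not part of freeWAIS-sf): mimics the path-to-URL
# mapping that "-t URL <trim> <add>" is assumed to perform.
path_to_url() {
    trim=$1   # prefix to strip, e.g. /home/robots/www/pages
    add=$2    # prefix to prepend, e.g. http://myserver
    file=$3   # full path of an indexed file
    # Remove the leading $trim from $file, then put $add in front.
    printf '%s%s\n' "$add" "${file#"$trim"}"
}

path_to_url /home/robots/www/pages http://myserver \
    /home/robots/www/pages/docs/index.html
```

Under that assumption the remaining path still starts with '/', which
would explain why the command passes http://myserver without a trailing
slash.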
If the modified URL handling is not compiled in, the headline
always contains the URL. With the modified handling, the headline
contains the title string of the HTML document, if it has one.
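To get a rough idea of the headline a page would produce before
indexing, one could extract its title by hand. This is only an
approximation built on the title patterns and the width of 80 from the
format file; it is not how waisindex works internally. It assumes the
opening and closing title tags sit on the same line, and the function
name preview_headline is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of freeWAIS-sf): print an approximation
# of the headline for one HTML file.
# Assumption: <title> and </title> are on the same line.
preview_headline() {
    # Keep the text between the title tags (case-insensitive match,
    # leading blanks skipped), take the first hit, cut to 80 characters.
    sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee]> *\(.*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p' "$1" |
        head -n 1 | cut -c1-80
}
```

For example, 'preview_headline /home/robots/www/pages/index.html' prints
the title of that page, or nothing if no title is found on a single line.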
An example database is running at
http://ls6-www.informatik.uni-dortmund.de/SFgate/www-pages or, via WAIS, at
wais://ls6-www.informatik.uni-dortmund.de/www-pages.
Last Update May 13 2007 @ 00:24 AM