
Top Document: freeWAIS-sf Frequently Asked Questions [FAQ] with answers
Previous Document: 5.4) How can I index HTML files?
Next Document: 5.6) How can I index my ftp server?



5.5) How can I index my http server?



    See question 'How can I index HTML files?' first. 

    Let's assume your server's pages reside in the directory 
    '/home/robots/www/pages'. Your server's URL might be 
    'http://myserver/'. The database will be named 'www-pages'. 

    A simple format file (www-pages.fmt) would be: 

        record-sep: /\n\n/ # never matches

        
        layout:
        headline: /<[Tt][Ii][Tt][Ll][Ee]>/ /<\/[Tt][Ii][Tt][Ll][Ee]>/ 80 
           /<[Tt][Ii][Tt][Ll][Ee]> *./
        end:

        
        region: /<[Hh][Tt][Mm][Ll]>/
        stemming TEXT GLOBAL
        end: /<.[Bb][Oo][Dd][Yy]>/
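
    The headline patterns above delimit the text between the opening 
    and closing title tags. As a rough preview of what such a headline 
    might look like for a given page, the following shell sketch 
    approximates the extraction with standard tools. It is not how 
    waisindex itself parses the file: the function name extract_title 
    and the sample page are made up for illustration, and treating 80 
    as the maximum headline width is an assumption. 

```shell
#!/bin/sh
# Hypothetical helper (not part of freeWAIS-sf): print the text between
# <title> and </title>, matched case-insensitively like the headline
# patterns, cut to 80 characters (assumed to be the headline width).
extract_title() {
    # Join all lines so a title is found wherever it sits in the file;
    # the echo restores the final newline that tr turned into a space.
    { tr '\n' ' ' < "$1"; echo; } |
    sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee]> *\([^<]*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p' |
    cut -c1-80
}

# Demo on a throwaway sample page; prints "My Page".
sample=$(mktemp)
printf '<html><head>\n<TITLE> My Page</TITLE>\n</head><body>Hi</body></html>\n' > "$sample"
extract_title "$sample"
rm -f "$sample"
```

    A page without a title prints nothing here, which corresponds to 
    the case where no title string is available for the headline. 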

    Then call 

        waisindex -t URL /home/robots/www/pages http://myserver \
                -d www-pages -t fields \
                `find /home/robots/www/pages -type f -name "*.html" -print`

    If you do not have the modified URL handling compiled in, the 
    headline always contains the URL. With the modified handling, 
    headlines contain the title string of the HTML document, if there 
    is one. 
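
    The two arguments after '-t URL' in the call above are the path 
    prefix to trim and the URL prefix to put in its place. The shell 
    sketch below imitates that mapping, so you can check which URL a 
    given file would be listed under. The function path_to_url is 
    hypothetical and only illustrates the idea; what it does with files 
    outside the trimmed prefix is an assumption. 

```shell
#!/bin/sh
# Hypothetical helper: imitate the path-to-URL mapping requested by
# "-t URL <trim> <add>". An illustration, not waisindex's own code.
path_to_url() {
    trim=$1   # prefix to strip, e.g. /home/robots/www/pages
    add=$2    # prefix to prepend, e.g. http://myserver
    file=$3   # path of an indexed file
    case $file in
        "$trim"*) printf '%s%s\n' "$add" "${file#"$trim"}" ;;
        *)        printf '%s\n' "$file" ;;  # prefix absent: unchanged (assumption)
    esac
}

# Prints "http://myserver/docs/index.html".
path_to_url /home/robots/www/pages http://myserver \
    /home/robots/www/pages/docs/index.html
```

    Note that the trim prefix is given without a trailing slash, so the 
    slash that follows it in the file path ends up right after the 
    server name. 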

    An example database is running at 
    http://ls6-www.informatik.uni-dortmund.de/SFgate/www-pages or, via 
    WAIS, at wais://ls6-www.informatik.uni-dortmund.de/www-pages. 




Send corrections/additions to the FAQ Maintainer:
pfeifer@ls6.informatik.uni-dortmund.de (Ulrich Pfeifer)

Last Update May 13 2007 @ 00:24 AM