HOW-TO: Mnogosearch for MidCOM powered sites

Fri 10th September 2004 00:29 EEST

Mnogosearch is an open-source Web search engine that uses an indexer to build a database of all the words on a given Web site, thus providing full-text search. It ships with several search front-ends; Perl and PHP are the ones I've worked with.

I must say that the PHP front-end used to lack substring search altogether, and some of its other advanced search capabilities were quite limited. I haven't checked the status of the PHP front-end for over a year, so this might have changed. You decide which front-end to use, but the rest of this article concentrates on setting up Mnogosearch with the Perl front-end on a MidCOM powered site.

So, first of all, install Mnogosearch. I strongly recommend the 3.2.x series, and this article concentrates on those versions; earlier versions lack many of the advanced search capabilities, such as language filters. After the installation you'll have two moving parts: the indexer and the Perl front-end.

The indexer is pretty simple to configure if you follow the instructions provided in the default indexer.conf file - just fill in the blanks. Some extra points for MidCOM sites:

#You don't want to index AIS, now do you?
Disallow */midcom-admin/*

Also, go through the Disallow lists carefully! By default, they exclude Word documents, PDFs and other non-HTML files from indexing.
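
As a rough illustration of the points above, a MidCOM-specific part of the indexer configuration might look like the fragment below. This is only a sketch: the host name is a placeholder, and the extension list is an illustrative subset rather than the actual default list. If you do take *.doc or *.pdf out of the Disallow list, the indexer will most likely also need an external parser configured for those file types; check the Mnogosearch documentation for that.

```
# Index only this site (www.example.com is a placeholder host name)
Server http://www.example.com/

# Keep the MidCOM admin interface (AIS) out of the index
Disallow */midcom-admin/*

# Illustrative exclusion list, not the shipped default.
# *.doc and *.pdf are left out on purpose so that they get indexed.
Disallow *.tar *.zip *.tgz *.gz *.exe
```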

The Perl front-end requires a bit more work, as the default template file (search.htm-dist) is quite crowded. By default, the template provides "simple" and "advanced" search interfaces - usually your clients want something in between. Edit the template file as you see fit and save it under the same base name as your CGI file (e.g. if you have search.cgi, save the template as search.htm). Some noteworthy points for MidCOM sites in the template are:

  • Remove the opening and closing <html> tags; the template will be included in the middle of a page.
  • Remove all $(self) references from links and forms, especially the ones in <form action=""> and in the search page navigation. Otherwise the links will point to the CGI file and not to your fancy search page.
  • Remove all target="_blank" attributes from the result listing; you are indexing your own site only, right?

See the full list of search parameters in the Mnogosearch documentation, or check my default template.

A MidCOM site generally rules out the old-school way of using Mnogosearch (old-school = using pages to include the template) unless you want to modify your MidCOM template (don't tell me you're not using the MidCOM template). So, we will do this with the help of the MidCOM style engine.

  1. Create a child style (e.g. "search").
  2. Create the <(show-article)> style element (we will be using the de.linkm.taviewer component). You should include at least the following in the style where you want the Mnogosearch template to kick in:
    <?php
    // Pull in the output of the search CGI, passing the visitor's query string through.
    include_once("http://" . $GLOBALS['_SERVER']['SERVER_NAME']
        . ":" . $GLOBALS['_SERVER']['SERVER_PORT']
        . "/cgi-bin/search.cgi?" . $GLOBALS['_SERVER']['QUERY_STRING']);
    ?>
  3. Create a subtopic called "search". Use de.linkm.taviewer for the component (using any other component makes very little sense).
  4. Edit the topic to use the search style you just created.
  5. Create an index article that you can use for short instructions and heading.
  6. Create a new host record with the prefix "/cgi-bin". Otherwise you may not be able to see the CGI script, as the active MidCOM template root page will take over the cgi-bin address and return a 404.
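
The include in step 2 runs whatever the CGI returns through the PHP parser, and on newer PHP versions including an http:// URL only works when allow_url_include is enabled. If that is a concern, one possible alternative is to fetch the CGI output and print it as-is. The following is a minimal sketch of that idea, not something MidCOM or Mnogosearch provides; the function name show_search_results and the fallback message are made up for this example.

```php
<?php
// Hypothetical helper: fetch the output of the search CGI and print it verbatim.
// Returns true on success, false if the output could not be read.
function show_search_results($url)
{
    // file_get_contents() needs allow_url_fopen to read http:// URLs.
    // The @ hides the PHP warning so we can show our own message instead.
    $html = @file_get_contents($url);
    if ($html === false) {
        echo "<p>Search is temporarily unavailable.</p>";
        return false;
    }
    // Print the CGI output without executing it as PHP.
    echo $html;
    return true;
}
```

In the style element you would then call show_search_results() with the same URL that step 2 passes to include_once().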

The include in step 2 is suitable for most public Web sites, as it passes the whole query string from the search form to the CGI file. However, if you want to restrict usage, say in a hosting environment or on an intranet, you can add parameters such as a URL limit (to a specific host) to the include URL. For example:

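<?php
// Same include as in step 2, with a fixed URL limit (ul) appended to the query string.
include_once("http://" . $GLOBALS['_SERVER']['SERVER_NAME'] . ":" . $GLOBALS['_SERVER']['SERVER_PORT']
    . "/cgi-bin/search.cgi?" . $GLOBALS['_SERVER']['QUERY_STRING'] . "&ul=www.kaukolaweb.com");
?>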

The above restricts the search to URLs on www.kaukolaweb.com only.
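
Note that simply appending a parameter does not remove a ul value the visitor may have put in the query string themselves. If you want the forced value to be the only one, you could rebuild the query string before including the CGI. Below is a minimal sketch of that; build_search_url is a hypothetical helper written for this article, not part of MidCOM or Mnogosearch, and www.example.com is a placeholder host.

```php
<?php
// Hypothetical helper: builds the URL of the search CGI, forwarding the
// visitor's query but forcing our own values for selected parameters.
function build_search_url($host, $port, $query_string, array $forced = array())
{
    // Split the raw query string into an array of parameters.
    // Caveat: for a repeated parameter name, parse_str() keeps only the last value.
    parse_str($query_string, $params);

    // Forced values overwrite anything the visitor supplied (e.g. "ul").
    foreach ($forced as $name => $value) {
        $params[$name] = $value;
    }

    // http_build_query() URL-encodes the names and values.
    return "http://" . $host . ":" . $port . "/cgi-bin/search.cgi?"
        . http_build_query($params);
}

echo build_search_url("www.example.com", "80", "q=midcom&ul=evil.example.org",
    array("ul" => "www.kaukolaweb.com"));
// prints http://www.example.com:80/cgi-bin/search.cgi?q=midcom&ul=www.kaukolaweb.com
```

On a real site you would pass in the SERVER_NAME, SERVER_PORT and QUERY_STRING values from the examples above and hand the resulting URL to the include.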