Technical document
This is a short introduction to some of the technical aspects of
IntraSeek. If you are not interested in the inner workings of
IntraSeek, you can comfortably skip this chapter.
Storage of databases
IntraSeek currently uses Yabu as its database handler. No logic is
placed in the database: all boolean search operations, tree structure
calculations and wildcard matching are done by IntraSeek. The database
is used only in a basic way, to store data and to retrieve it.
When collecting new information, the crawler uses a separate
database, so that web users can keep searching while the crawler
collects new information. When the crawl is finished, a flag file is
written to tell the search engine to swap the new database in,
overwriting the old one.
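
The flag-file handover described above can be sketched as follows. This
is only an illustration in Python, not IntraSeek's actual code: the
file names (swap.flag, new.db, live.db) and both function names are
hypothetical, and a single file stands in for the Yabu database.

```python
import os

FLAG_NAME = "swap.flag"  # hypothetical flag file name


def crawler_finish(work_dir):
    """Crawler side: signal that the new database is complete."""
    with open(os.path.join(work_dir, FLAG_NAME), "w") as f:
        f.write("ready\n")


def maybe_swap(work_dir, live_db="live.db", new_db="new.db"):
    """Search-engine side: swap the new database in if the flag exists.

    Returns True if a swap took place, False otherwise.
    """
    flag = os.path.join(work_dir, FLAG_NAME)
    if not os.path.exists(flag):
        return False  # crawler still running, keep serving the old data
    # os.replace overwrites the destination, so the old database is gone
    # once the new one is in place.
    os.replace(os.path.join(work_dir, new_db),
               os.path.join(work_dir, live_db))
    os.remove(flag)
    return True
```

The search engine would call maybe_swap between queries. Until the flag
appears, searches keep using the old database untouched.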
Memory usage
The memory used by Pike and IntraSeek varies a great deal, depending
on your operating system.
However, the more pages IntraSeek collects, the more memory it uses,
as it keeps the site structure, its errors and two stacks in memory:
one with the pages to visit and one with the pages already visited.
The index of words is also kept in memory, but is written to disc at
certain intervals, called safety saves. A safety save dumps the index
to a Yabu database, then clears the memory it used. The disc databases
are also reorganized to keep their size down. While a reorganization
is running, you may notice files whose names start and end with "#" in
the temporary storage directory.
By default, these saves occur every 500 pages. You can lower this
value if you run into memory problems. You should not increase it,
however: if you do, the structure database will grow faster and
faster, and the crawler will consume more and more memory.
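
The safety save mechanism can be pictured with the following minimal
Python sketch. It is not IntraSeek's implementation: the class
IndexBuffer and its method names are invented for illustration, a
plain dict stands in for the Yabu database on disc, and the
reorganization step is left out.

```python
from collections import defaultdict

SAVE_INTERVAL = 500  # default number of pages between safety saves


class IndexBuffer:
    """In-memory word index that is flushed at fixed page intervals."""

    def __init__(self, store, save_interval=SAVE_INTERVAL):
        self.store = store              # dict standing in for the disc database
        self.save_interval = save_interval
        self.words = defaultdict(set)   # word -> URLs seen since the last save
        self.pages_since_save = 0
        self.saves = 0

    def add_page(self, url, text):
        """Index one page and trigger a safety save when the interval is hit."""
        for word in text.lower().split():
            self.words[word].add(url)
        self.pages_since_save += 1
        if self.pages_since_save >= self.save_interval:
            self.safety_save()

    def safety_save(self):
        """Dump the in-memory index to the store, then free the memory."""
        for word, urls in self.words.items():
            self.store.setdefault(word, set()).update(urls)
        self.words.clear()
        self.pages_since_save = 0
        self.saves += 1
```

A lower save_interval means the in-memory index is emptied more often,
which is why lowering the value helps when memory is short.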
If you are not interested in statistics for the site, such as the log
of broken links, you can disable this feature to save memory. To do
this, go to the profile configuration and set the variable Site
structure logs to no.
To limit the size of the index both in memory and on disc, stop
lists are supported. A stop list contains short, "meaningless" words
that are filtered out. For example, the English stop list contains
words like "the", "and" and "it". Use one or several stop lists
covering the languages used on the pages you crawl.
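
As an illustration of stop list filtering, here is a short Python
sketch. The function names are hypothetical, the English list holds
only the three example words from the text, the Swedish list is made
up for the example, and splitting on whitespace is a simplification of
whatever tokenization IntraSeek actually performs.

```python
ENGLISH_STOP_WORDS = {"the", "and", "it"}   # example words from the text
SWEDISH_STOP_WORDS = {"och", "det", "en"}   # made-up second list


def combine_stop_lists(*stop_lists):
    """Merge one or several stop lists into a single set."""
    combined = set()
    for stop_list in stop_lists:
        combined |= set(stop_list)
    return combined


def index_terms(text, stop_words):
    """Return the words of text that should go into the index."""
    return [w for w in text.lower().split() if w not in stop_words]


stop_words = combine_stop_lists(ENGLISH_STOP_WORDS, SWEDISH_STOP_WORDS)
terms = index_terms("The crawler and the index", stop_words)
```

Here terms holds only "crawler" and "index"; the stop words never reach
the index, so they take up no space in memory or on disc.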
Max download per document defines, in characters, how much of a
document is downloaded. This is used to limit the index size.
Normally, the first 100000 characters of a document are very likely to
contain enough terms to cover its content. This will of course vary
depending on the type of information present on your site. If all
words are important (even those at the end of large documents), you
should increase this value, for instance to 999999.
Log files are stored on disc and also take up some memory. Make
sure to delete them every now and then by using the configuration
interface.
Rejects and accepts
Reject and accept patterns are specific to each profile. By default,
the reject pattern comes with a number of standard entries, mainly to
reject files ending with .gif, .gz and so on. The accept pattern is
empty by default.
The reject and accept rules are applied when IntraSeek is about to
schedule a new URL for visit.
- First, IntraSeek matches the reject patterns. If any of them
match, IntraSeek will not visit the URL, and no further match-checks
are made on this URL.
- After that, the accept patterns are matched. If any match is
found, the URL will be accepted, and no further match-checks are made
on this URL.
- Finally, if the URL was neither rejected nor accepted, IntraSeek
will reject the URL by default.
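
The three steps above can be sketched in Python as follows. The
function name should_visit is hypothetical, and the use of shell-style
glob patterns (via the standard fnmatch module) is an assumption made
for the example; IntraSeek's real pattern syntax may differ.

```python
from fnmatch import fnmatchcase


def should_visit(url, reject_patterns, accept_patterns):
    """Decide whether a URL may be scheduled for a visit."""
    # Step 1: any matching reject pattern ends the check immediately.
    if any(fnmatchcase(url, p) for p in reject_patterns):
        return False
    # Step 2: any matching accept pattern accepts the URL.
    if any(fnmatchcase(url, p) for p in accept_patterns):
        return True
    # Step 3: neither rejected nor accepted, so reject by default.
    return False


reject = ["*.gif", "*.gz"]
accept = ["http://www.example.com/*"]
```

With these example patterns, http://www.example.com/index.html is
visited, while http://www.example.com/logo.gif is rejected in the first
step even though it also matches the accept pattern. Note that with an
empty accept list every URL ends up rejected by the default rule.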