SWISH::Split - Perl interface to split index variant of Swish-e
use SWISH::Split;
This is an alternative interface for indexing data with swish-e. It is designed to split an index over multiple files (slices) so that records can be updated by reindexing only the slices that changed.
Data is stored in the index using an interface which is somewhat similar to the Plucene::Simple manpage. This could make your migration (or supporting two index engines) easier.
In the background, it will fork swish-e binaries (one for each index slice) and produce UTF-8 encoded XML files for them. So, if your input charset isn't ISO-8859-1, you will have to specify it.
Create new object for index.
my $i = SWISH::Split->open_index({
    index => '/path/to/index',
    slice_name => \&slice_on_path,
    slices => 30,
    merge => 0,
    codepage => 'ISO-8859-2',
    swish_config => qq{
        PropertyNames from date
        PropertyNamesDate date
    },
    memoize_to_xml => 0,
});

# split index on first component of path
sub slice_on_path {
    return (split(/\//, $_[0]))[0];
}
Options to open_index are as follows:
index
slice_name
slices
merge
codepage
    Input charset, needed if it is not ISO-8859-1.
swish_config
    Contents for the swish-e configuration file. See swish-config.
memoize_to_xml
Add document to index.
$i->add($swishpath, { headline => 'foobar result', property => 'data', });
Delete documents from index.
$i->delete(@swishpath);
Finish indexing and close index file(s).
$i->done;
This is the most time-consuming operation. When it's called, it will re-index all entries which haven't changed in all slices.
Returns number of slices updated.
This method should really be called close or finish, but both of those are already used.
These methods return statistics about your index.
Return array of swishpaths in index.
my @p = $i->swishpaths;
Return array with updated swishpaths.
my @d = $i->swishpaths_updated;
Return array with deleted swishpaths.
my @n = $i->swishpaths_deleted;
Return array with all slice names.
my @s = $i->slices;
These methods are used internally, but they might be useful.
Takes a path and returns the slice to which this path belongs.
my $s = $i->in_slice('path/to/document/in/index');
If there is a slices parameter to open_index, it will use an MD5 hash to spread documents across slices. That will produce a random distribution of your documents in slices, which might or might not be best for your data. If you have to re-index a large number of slices on each run, think about creating your own slice function and distributing documents manually across slices.
Slice number must always be a true value or various sanity checks will fail.
This function is Memoized for performance reasons.
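The MD5-based distribution described above can be sketched roughly as follows. This is only an illustration and not the module's actual code: the helper name md5_slice, the use of the first eight hex digits of the digest, and the 1-based numbering (chosen so that the slice number is always a true value) are all assumptions.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Memoize;

# Hypothetical helper, not part of SWISH::Split: map a path to a slice
# number in the range 1..$slices.
sub md5_slice {
    my ($path, $slices) = @_;
    # Read the first 8 hex digits of the MD5 digest as an integer.
    my $n = hex(substr(md5_hex($path), 0, 8));
    # Add 1 so the result is never 0, which would be a false value.
    return $n % $slices + 1;
}

# Cache results so repeated lookups of the same path skip the hashing.
memoize('md5_slice');

my $slice = md5_slice('path/to/document/in/index', 30);
print "slice: $slice\n";
```

Because the digest depends only on the path, the same document always lands in the same slice, but related documents are scattered, which is why a hand-written slice function can reduce the number of slices touched per run.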
Return array of swishpaths for given swish-e query.
my @p = $i->find_paths("headline=test*");
Useful for combining with delete to remove documents which haven't changed in a while (and so have expired).
Create swish-e configuration file for given slice.
my $config_filename = $i->make_config('slice name');
It returns the configuration filename. If no swish_config was defined in open_index, the default swish-e configuration will be used. It will index all data for searching, but none for properties.
If you want to see what is already defined for swish-e in the configuration, take a look at the source code for DEFAULT_SWISH_CONF.
It uses stdin as IndexDir to communicate with swish-e.
On first run, starts swish-e. On subsequent calls just returns its handles using Memoize.
my $s = $i->create_slice('/path/to/document');
You shouldn't need to call create_slice directly because it will be called from put_slice when needed.
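The start-once, reuse-afterwards behaviour described above can be sketched with Memoize and a pipe to a child process. This is a sketch under assumptions, not the module's code: open_slice and %started are hypothetical names, and a trivial Perl child that just reads its stdin stands in for the swish-e binary.

```perl
use strict;
use warnings;
use Memoize;

my %started;    # how many times a child was really started per slice

# Hypothetical stand-in for create_slice: start a child process that
# reads documents from its stdin and return the write handle.
sub open_slice {
    my ($slice) = @_;
    $started{$slice}++;
    # $^X is the running perl; "-ne 1" reads stdin to EOF and exits.
    open(my $fh, '|-', $^X, '-ne', '1')
        or die "cannot start child for slice $slice: $!";
    return $fh;
}

# After the first call per slice, Memoize hands back the cached handle.
memoize('open_slice');

my $first = open_slice('first');
my $again = open_slice('first');    # cached, no second child is started
print {$first} "<xml>data</xml>\n";
close($first) or die "child for slice first failed: $!";
print "children started for 'first': $started{first}\n";
```

Both calls are made in scalar context on purpose, since Memoize keeps separate caches for scalar and list context.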
Pass XML data to swish.
my $slice = $i->put_slice('/swish/path', '<xml>data</xml>');
Returns slice in which XML ended up.
Prints output and errors from swish-e to STDERR.
my $slice = $i->slice_output($s);
Normally, you don't need to call it.
This is a dummy placeholder function for very old code that assumes this module is using IPC::Run, which it isn't any more.
Close slice (terminates swish-e process for that slice).
my $closed = $i->close_slice($s);
Returns true if slice is closed, false otherwise.
Convert (binary safe, I hope) your data into XML for swish-e.
Data will not yet be recoded to UTF-8. put_slice will do that.
my $xml = $i->to_xml({ foo => 'bar' });
This function is extracted from the add method so that you can Memoize it.
If your data set has a lot of repeated data, and memory is not a problem, you can add the memoize_to_xml option to open_index.
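The two steps described above, building XML from a hash and recoding it to UTF-8 afterwards, might look roughly like this. It is a minimal sketch under assumptions: hash_to_xml, xml_escape and recode_to_utf8 are hypothetical names, and the element layout (one element per key inside an xml root) is a guess, not the module's real output format.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical stand-in for to_xml: one element per hash key.
# The data is left in the input codepage at this point.
sub hash_to_xml {
    my ($data) = @_;
    my $xml = '<xml>';
    for my $key (sort keys %$data) {
        $xml .= "<$key>" . xml_escape($data->{$key}) . "</$key>";
    }
    return $xml . '</xml>';
}

# Recode from the input codepage to UTF-8, the step the text above
# assigns to put_slice.
sub recode_to_utf8 {
    my ($bytes, $codepage) = @_;
    return encode('UTF-8', decode($codepage, $bytes));
}

# "\xB9" is a single non-ASCII byte in ISO-8859-2.
my $xml  = hash_to_xml({ headline => 'foo & bar', property => "\xB9" });
my $utf8 = recode_to_utf8($xml, 'ISO-8859-2');
print "$utf8\n";
```

Keeping the conversion free of side effects is what makes it a reasonable candidate for Memoize when many records share the same field values.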
Searching is still conducted using the SWISH::API manpage, but you have to glob index names.
use SWISH::API;
my $swish = SWISH::API->new( glob('index.swish-e/*') );
Alternatively, you can create a merged index (using the merge option) and not change your source code at all.
That would also benefit performance, but it increases indexing time because merged indexes must be re-created on each indexing run.
Nothing by default.
The test script for this module uses all parts of the API. It's also a nice example of how to use SWISH::Split.
the SWISH::API manpage, http://www.swish-e.org/
Dobrica Pavlinusic, <dpavlin@rot13.org>
Copyright (C) 2004 by Dobrica Pavlinusic
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.