SWISH::Split perl module with test and essential documentation
WARNING: this is an experimental version. It requires swish-e Perl bindings compiled with --enable-incremental.
SWISH-Split-0.03.tar.gz (8.5 Kb)
Latest source is always available from the Subversion repository.
SWISH::Split - Perl interface to split index variant of Swish-e
use SWISH::Split;
This is an alternative interface for indexing data with swish-e. It is designed to split the index over multiple files (slices), so that records can be updated by re-indexing only the slices that changed.
Data is stored in the index using an interface which is somewhat similar to the Plucene::Simple manpage. This could make migration (or supporting two index engines) easier.
In the background, it will fork swish-e binaries (one for each index slice) and produce UTF-8 encoded XML files for them. So, if your input charset isn't ISO-8859-1, you will have to specify it.
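The recoding step mentioned above can be illustrated with Perl's core Encode module. This is only a sketch of the general technique, not the module's own code; the helper name recode_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert octets in a given codepage to UTF-8 octets.
sub recode_to_utf8 {
    my ($octets, $codepage) = @_;
    $codepage ||= 'ISO-8859-1';                # default named in the text
    my $chars = decode($codepage, $octets);    # octets -> Perl characters
    return encode('UTF-8', $chars);            # characters -> UTF-8 octets
}

# 0xE8 is "c with caron" in ISO-8859-2; UTF-8 needs more than one octet for it.
my $utf8 = recode_to_utf8("\xE8", 'ISO-8859-2');
printf "%d octet(s) after recoding\n", length $utf8;
```

Plain ASCII input passes through unchanged, so data that is already 7-bit clean is not affected by the choice of codepage.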
Create new object for index.
my $i = SWISH::Split->open_index({
    index => '/path/to/index',
    slice_name => \&slice_on_path,
    slices => 30,
    merge => 0,
    codepage => 'ISO-8859-2',
    swish_config => qq{
        PropertyNames from date
        PropertyNamesDate date
    },
    memoize_to_xml => 0,
});
sub slice_on_path {
    # use the first path component as the slice name
    return (split(/\//, $_[0]))[0];
}
Options to open_index are as follows:
index
path to (existing) directory in which index slices will be created.
slice_name
coderef to a function which derives the slice name from a path.
slices
maximum number of index slices. See in_slice for more explanation.
merge
(planned) option to merge indexes into one at end.
codepage
data codepage (needed for conversion to UTF-8). By default, it's ISO-8859-1.
swish_config
additional parameters which will be inserted into the swish-e configuration file. See swish-config.
memoize_to_xml
speed up repeatable data, see to_xml.
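As an illustration of the slice_name option, here is a minimal sketch of a slicing function. It assumes, as the synopsis suggests, that the function receives the path as its first argument and returns a slice name; the function name slice_on_first_component and the fallback name 'default' are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical slice function: name the slice after the first non-empty
# path component, so '/news/2005/a.html' and 'news/2005/a.html' both
# land in slice 'news'.
sub slice_on_first_component {
    my ($path) = @_;
    my ($first) = grep { length $_ } split m{/}, $path;
    # Fall back to a fixed, true-valued name when nothing usable is found.
    return (defined($first) && $first ne '0') ? $first : 'default';
}

print slice_on_first_component('/news/2005/a.html'), "\n";
```

Skipping empty components matters for absolute paths, where a plain split would yield an empty (false) first element.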
Add document to index.
$i->add($swishpath, { headline => 'foobar result', property => 'data', });
Delete documents from index.
$i->delete(@swishpath);
This function is not implemented.
Finish indexing and close index file(s).
$i->done;
This is the most time-consuming operation. When it's called, it will re-index all entries which haven't changed in all slices.
Returns number of slices updated.
This method should really be called close or finish, but both of those are already used.
These methods return statistics about your index.
Return array of swishpaths in index.
my @p = $i->swishpaths;
Return array with updated swishpaths.
my @d = $i->swishpaths_updated;
Return array with deleted swishpaths.
my $n = $i->swishpaths_deleted;
Return array with all slice names.
my @s = $i->slices;
These methods are used internally, but they might be useful.
Takes a path and returns the slice to which this path belongs.
my $s = $i->in_slice('path/to/document/in/index');
If the slices parameter was passed to open_index, it will use an MD5 hash to spread documents across slices. That will produce a random distribution of your documents in slices, which might or might not be best for your data. If you have to re-index a large number of slices on each run, think about creating your own slice function and distributing documents manually across slices.
Slice number must always be a true value or various sanity checks will fail.
This function is Memoized for performance reasons.
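The MD5-based distribution described above could look roughly like the following sketch. It is not taken from the module source; the helper name md5_slice and the choice of the first 32 bits of the digest are assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: map a path to a slice number in 1..$slices.
# Numbering starts at 1 so the result is always a true value.
sub md5_slice {
    my ($path, $slices) = @_;
    # Use the first 8 hex digits (32 bits) of the digest as an integer.
    my $n = hex(substr(md5_hex($path), 0, 8));
    return 1 + $n % $slices;
}

for my $path (qw(path/to/document/in/index another/document)) {
    printf "%s -> slice %d\n", $path, md5_slice($path, 30);
}
```

The same path always maps to the same slice, but neighbouring paths are scattered, which is why a hand-written slice function can reduce the number of slices touched per run.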
Return array of swishpaths for given swish-e query.
my @p = $i->find_paths("headline=test*");
Useful for combining with delete_documents to delete documents which haven't changed in a while (and so have expired).
Create swish-e configuration file for given slice.
my $config_filename = $i->make_config('slice name');
It returns the configuration filename. If no swish_config was defined in open_index, the default swish-e configuration will be used. It will index all data for searching, but none for properties.
If you want to see what is already defined for swish-e in the configuration, take a look at the source code for DEFAULT_SWISH_CONF.
It uses stdin as IndexDir to communicate with swish-e.
On first run, starts swish-e. On subsequent calls, it just returns its handles using Memoize.
my $s = $i->create_slice('/path/to/document');
You shouldn't need to call create_slice directly because it will be called from put_slice when needed.
Pass XML data to swish.
my $slice = $i->put_slice('/swish/path', '<xml>data</xml>');
Returns slice in which XML ended up.
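For readers unfamiliar with how documents are fed to swish-e over stdin, the sketch below wraps one document in the Path-Name and Content-Length headers that swish-e documents for its "-S prog" input method. Whether put_slice emits exactly these headers is an assumption, and the helper name prog_document is hypothetical. The XML is assumed to be UTF-8 octets already.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap one XML document in the header block used by
# the swish-e "-S prog" input stream. $xml is assumed to hold octets, so
# length() gives the byte count that Content-Length requires.
sub prog_document {
    my ($swishpath, $xml) = @_;
    return "Path-Name: $swishpath\n"
         . 'Content-Length: ' . length($xml) . "\n"
         . "\n"            # blank line separates headers from content
         . $xml;
}

print prog_document('/swish/path', '<xml>data</xml>'), "\n";
```

Because each document carries its own length, many documents can be written back to back on the same pipe.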
Prints output and errors from swish-e to STDERR.
my $slice = $i->slice_output($s);
Normally, you don't need to call it.
This is a dummy placeholder function for very old code that assumes this module is using IPC::Run, which it isn't any more.
Close slice (terminates swish-e process for that slice).
my $closed = $i->close_slice($s);
Returns true if slice is closed, false otherwise.
Convert (binary safe, I hope) your data into XML for swish-e.
Data will not yet be recoded to UTF-8. put_slice will do that.
my $xml = $i->to_xml({ foo => 'bar' });
This function is extracted from the add method so that you can Memoize it.
If your data set has a lot of repeated data, and memory is not a problem, you can add the memoize_to_xml option to open_index.
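One detail worth knowing when memoizing a function that takes a hash reference: by default Memoize builds its cache key from the stringified arguments, so two different references with identical contents do not share a cache entry. The sketch below shows one way around that with a NORMALIZER. The functions to_xml_sketch and normalize_hashref are hypothetical stand-ins and do not reproduce the module's to_xml.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;    # counts real (non-cached) conversions

# Hypothetical stand-in for to_xml: escape values and wrap them in tags.
sub to_xml_sketch {
    my ($data) = @_;
    $calls++;
    my $xml = '';
    for my $tag (sort keys %$data) {
        my $value = $data->{$tag};
        $value =~ s/&/&amp;/g;
        $value =~ s/</&lt;/g;
        $value =~ s/>/&gt;/g;
        $xml .= "<$tag>$value</$tag>\n";
    }
    return $xml;
}

# Build the cache key from the hash contents, not the reference address.
sub normalize_hashref {
    my ($data) = @_;
    return join "\x1e", map { $_ . "\x1f" . $data->{$_} } sort keys %$data;
}

memoize('to_xml_sketch', NORMALIZER => \&normalize_hashref);

my $first  = to_xml_sketch({ foo => 'bar & baz' });
my $second = to_xml_sketch({ foo => 'bar & baz' });    # served from the cache
print "conversions performed: $calls\n";
```

The trade-off is the one the text names: every distinct record stays in memory for the lifetime of the process.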
Searching is still conducted using the SWISH::API manpage, but you have to glob index names.
use SWISH::API;
my $swish = SWISH::API->new( glob('index.swish-e/*') );
Alternatively, you can create a merged index (using the merge option) and not change your source code at all.
That would also benefit performance, but it increases indexing time because merged indexes must be re-created on each indexing run.
Nothing by default.
The test script for this module uses all parts of the API. It's also a nice example of how to use SWISH::Split.
the SWISH::API manpage, http://www.swish-e.org/
Dobrica Pavlinusic, <dpavlin@rot13.org>
Copyright (C) 2004 by Dobrica Pavlinusic
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
2005-04-29 23:25:02 dpavlin r13
/trunk/Split.pm: added warning about unimplemented delete
2005-04-29 22:51:58 dpavlin r12
/trunk: added some files to ignore
2005-04-29 22:50:16 dpavlin r11
/trunk/MANIFEST, /trunk/Split.pm: some cleanups
2005-04-29 22:38:00 dpavlin r10
/trunk/Changes: created from Subversion log
2005-04-29 22:35:21 dpavlin r9
/trunk/Split.pm: API 0.03: works with current swish-e version from CVS (with --enable-incremental!), but still doesn't support incremental indexing
2004-12-19 03:06:01 dpavlin r8
/trunk/Split.pm, /trunk/t/01api.t: new api:
- renamed open to open_index
- removed dependency on IPC::Run
- tests which all pass
2004-12-17 18:32:34 dpavlin r7
/trunk/Split.pm, /trunk/Makefile.PL, /trunk/t/01api.t: a lot of changes:
- better testing framework
- changed put_slice API (to actually confirm with documentation)
- use swish-e stdin instead of external cat utility
- added tags target
2004-12-08 20:35:49 dpavlin r6
/trunk/MANIFEST, /trunk/Split.pm, /trunk/Makefile.PL, /trunk/t/99pod.t, /trunk/MANIFEST.SKIP,
/trunk/t/SWISH-Split.t, /trunk/t/01api.t: better distribution packaging and html target
2004-08-11 14:28:40 dpavlin r5
/trunk/Split.pm, /trunk/t/SWISH-Split.t: smaller improvements
2004-08-08 19:22:56 dpavlin r4
/trunk/Split.pm, /trunk/t/SWISH-Split.t: first version which passes 51 test. It still doesn't update documents, just insert.
2004-08-08 10:53:04 dpavlin r3
/trunk/Split.pm: one more planned call: find_paths
2004-08-08 10:27:27 dpavlin r2
/trunk/t/SWISH-Split.t: better tests
2004-08-08 10:09:55 dpavlin r1
/trunk/t/SWISH-Split.t, /trunk/README, /trunk, /trunk/t, /trunk/MANIFEST, /trunk/Split.pm, /trunk/Makefile.PL, /trunk/Changes: initial import of SWISH::Split. Lot of documentation, less code.