This package is essentially composed of two different parts:
PostgreSQL function to search Hyper Estraier full-text index, using full-text queries and attribute filtering to return user-specified table of results.
This function can mimic SQL LIMIT
, OFFSET
and ORDER BY
functionality much faster than using those SQL constructs on search
results.
PostgreSQL trigger function to keep Hyper Estraier in sync with PostgreSQL. It triggers after insert, update or delete and update full-text index accordingly.
Both functions are written in C, while test framework and supporting utilities are written in perl.
You can use just one of those functions. If you want just to search existing Hyper Estraier index or generate it off-line (after nightly batch jobs, for example), just use search function.
On the other hand, if you want just to keep your Hyper Estraier index in sync with PostgreSQL data, you can use just trigger function to achieve that.
Aside from providing single query language (SQL) to RDBMS and full text index (using any language that has PostgreSQL client libraries), real power is hidden in ability to join results from full text index and structured data in RDBMS.
For simple real-life example which address problem
WHERE name LIKE '%foo%' OR surname LIKE '%foo%'
is slow see Tutorial and pgest-index documentation.
Installation should be simple. However, you will have to have following software already installed before you try this functions:
If you want to use helper script to create consistency triggers to keep Hyper Estraier in sync with PostgreSQL database, you will also need:
Search::Estraier
, DBI
and DBD::Pg
To run tests you will also need:
Test::More
trivia.list.gz
from Internet Movie Database in data/
directory.
You can download it from http://www.imdb.com/interfaces test
with permissions for current user estmaster
running with permissions for admin
user
to create trivia
node.If you have all that, you should be able to type
make
and see sample results. You will be asked your password once (via sudo) to
install pgest.so
shared library in system-wide location so that PostgreSQL
could access it.
Perl bindings for Hyper Estraier are available at CPAN:
http://search.cpan.org/~dpavlin/Search-Estraier/
After installing Search::Estraier
you can create index using following commands:
cd data make index cd ..
To run tests (which require that you have estcmd in your $PATH) issue
make test
See also included file test.sql
for more examples of usage.
pgest
PostgreSQL function tries to mimic usage of normal database tables (with support for attribute filtering, limit and offset) in following way:
SELECT -- columns to return (defined later) id,title,size FROM pgest( -- node URI, login, password and depth of search 'http://localhost:1978/node/trivia', 'admin', 'admin', 42, -- query 'blade runner', -- additional attributes, use NULL or '' to disable -- multiple attributes conditions can be separated by {{!}} '@title ISTRINC blade', -- order results by '@title STRA', -- limit, use NULL or 0 to disable null, -- offset, use NULL or 0 to disable null, -- attributes to return as columns ARRAY['@id','@title','@size'] ) AS ( -- specify names and types of returned attributes id text, title text, size text );
You should note that Hyper Estraier uses UTF-8 encoding, while your
PostgreSQL installation might use different encoding. To fix that, use
convert
function in PostgreSQL to convert encodings.
estmaster
server processThis is default and recommended way to use pgest
functionality. In this
case, pgest
will use node API and access index through estmaster
process which should be running on (local or remote) machine.
This will remove database opening overhead, at a cost of (small) additional network
traffic. However, you can have Hyper Estraier estmaster
process running on
different machine or update index while doing searches, so benefits of this
approach are obvious.
Please note that direct access to database is depreciated. As such, it's
not stated in example, and it's kept just for backward compatibility, but it
will probably be removed in future versions of pgest
.
If you want to access database directly (without running estmaster
process), you
have to replace node URI, login, password and depth with full path to database file.
Have in mind that postgres
user under which PostgreSQL is running must
have read permission on Hyper Estraier database files.
This will work a bit faster on really small indexes. However, when your index grows bigger, you might consider using node API to remove overhead of database opening on each query.
Let's first say that I really suggest that you use dbi-index.pl
helper script to
create triggers because it already supports following steps automatically:
Transaction is needed to catch updates which might happen while creation of full-text index is in progress (and on huge collections this can take a while, just like normal index creation in PostgreSQL).
This will be done directly from PostgreSQL database to Hyper Estraier index. This is somewhat faster than waiting for trigger to fire for each existing row.
Which will keep data in sync later
If you still want to do that manually, you will need to know format of
pgest_trigger
function:
CREATE TRIGGER pgest_trigger_insert AFTER INSERT ON table FOR EACH ROW EXECUTE PROCEDURE pgest_trigger( -- node URI, login and password 'http://localhost:1978/node/trivia', 'admin', 'admin', -- name of primary key column 'id', -- names of all other columns to index (one or more) 'column', 'another_one', 'and_another' )
You have to create triggers for UPDATE
and DELETE
in similar way.
Hyper Estraier is written by Mikio Hirabayashi.
PostgreSQL is written by hackers calling themselves PostgreSQL Global Development Group.
This small C functions are written by Dobrica Pavlinusic, dpavlin@rot13.org.
pgestraier
is using noraml queries (with
AND
, OR
etc.) and not simplified queryies (with |
).This tutorial will try to show how easy it is to replace SQL
constructs that use where column like '%something%'
with external full-text index using
Hyper Estraier.
In less then 10 minutes from installation to fill-text search-able table (speed may vary somewhat, but it's really fast).
$ sudo apt-get install postgresql-8.1 postgresql-server-dev-8.1 \ libdbd-pg-perl $ sudo cpan Search::Estraier
$ cd pgestraier $ sudo make install
This will just install library at right place without running tests or examples.
Let's assume you have database called content
in which you have table
new
which has fields title, lead, content, keywords and source which
should be search-able.
You will have to do:
$ ./bin/pgest-index.pl content --create news > search.sql
That's it. Really. You will even get example SQL query to shoehorn into your application.
You can now connect to database using psql, edit search string in
search.sql
and try your new full-text index.
However, you will notice that we indexed every field in table news, which is wasteful (because other fields are ids of various things, booleans and fields which aren't interesting for full-text search).
So, we can do better:
$ ./bin/pgest-index.pl content --create news \ --sql="select id,title,lead,content,keywords from news"
We added custom SQL query which will be used to produce full-text index and triggers so that we touch only columns in which we are really interested.
This time around that it! Enjoy.
For all available options see pgest-index documentation.
pgest-index.pl - create full-text index of some columns in your database
pgest-index.pl --create movies --sql "select id,title,year from movies" pgsql-index.pl --drop movies
Options:
Create index name
and create triggers on table with same name
Remove triggers from table name
and node with same name
Full URI to node. If it's not specified, it's assumed that you are using
Hyper Estraier on http://localhost:1978/
.
SQL query which will return names of columns which are included in full-text index. Have in mind that you can't use aliases (as something) in this SQL query (or triggers will be created with wrong fields).
If SQL query isn't specified, default one select * from movies
will
be created. That will be serious performance hit if all columns are
not needed for search.
Specify name of primary key column in SQL query. If you allready have primary key on table or unique index and it consists of simgle column (compaund keys are not supported) it will be picked up automatically.
If you specify value which is not unique, you will get just last occurence
of that item in index (which might be what you want). That's because specified
pk
column will be used for @uri
in Hyper Estraier.
If name of primary key begins with _
it will not be added into text
indexing (so you won't be able to find prmary key value, but it will still
be available as attribute value).
Username and password to use when connecting to Hyper Estraier. If not specified,
admin
and admin
will be used.
Dump debugging output. It may be specified multiple times for more verbose debugging.
At end of each run, this script will output example search SQL query on STDOUT.
You can use it to quickly construct queries for your application.
Dobrica Pavlinusic <dpavlin@rot13.org>
This product is licensed under GNU Public License (GPL) v2 or later.
2008-03-16 20:55:56 dpavlin r88
/trunk/bin/pgest-index.pl: better error message when run without arguments
2007-07-13 10:12:54 dpavlin r87
/trunk/README.pod: added IMDB download link for test database (trivia)
2007-04-18 15:55:47 dpavlin r86
/trunk/data/indexer.pl: support for multiple quotes
2007-02-03 17:13:07 dpavlin r85
/trunk/pgest.c: add PG_MODULE_MAGIC for PostgreSQL 8.2
2007-01-18 11:02:57 dpavlin r84
/trunk/data/trivia.sql: added indexes
2007-01-17 22:30:14 dpavlin r83
/trunk/data/convert.pl, /trunk/data/trivia.sql, /trunk/data/Makefile, /trunk/data/trivia, /trunk/data/db.pl, /trunk/data/parse_trivia.pm, /trunk/data/indexer.pl: restructure IMDB trivia parser, added db target to create trivia database in PostgreSQL
2006-11-26 21:21:46 dpavlin r82
/trunk/bin/pgest-index.pl: don't add to full-text index primary keys with names beginning with _
2006-08-09 17:25:31 dpavlin r81
/trunk/README.pod: finished see also, with note about queryies
2006-08-08 12:15:55 dpavlin r80
/trunk/pgest.c: bugfix: offset implementation didn't return all records
2006-08-08 11:42:01 dpavlin r79
/trunk/pgest.c: update trigger now uses est_node_edit_doc (as it should)
2006-08-08 11:40:09 dpavlin r78
/trunk/bin/pgest-index.pl: delete just existing triggers
2006-08-08 11:01:00 dpavlin r77
/trunk/bin/pgest-index.pl: try unique index if primary key doesn't exits
2006-08-07 23:24:54 dpavlin r76
/trunk/Makefile: re-generate documentation on cvs target
2006-08-07 23:22:29 dpavlin r75
/trunk/Tutorial.pod, /trunk/Makefile, /trunk/README.pod: minor modificaton to cross-link documentation and actually transfer all of it to pgfoundry.org
2006-08-07 22:53:46 dpavlin r74
/trunk/bin/pgest-index.pl: turn debug off
2006-08-07 22:47:41 dpavlin r73
/trunk/pgest.c: hush size_t warning
2006-08-07 22:44:21 dpavlin r72
/trunk/pgest.c: hush compiler warning
2006-08-07 22:07:49 dpavlin r71
/trunk/Tutorial.pod: tutorial is very short, thanks to helper script :-)
2006-08-07 21:08:41 dpavlin r70
/trunk/bin/pgest-index.pl: finish (hopefully) documentation. perldoc pgsql-index.pl is now really useful :-)
2006-08-07 17:30:56 dpavlin r69
/trunk/bin/pgest-index.pl: single-line progress bar (if it fits), only thing going to STDOUT is example SQL query (so it can be redirected to file)
2006-08-07 17:05:07 dpavlin r68
/trunk/bin/pgest-index.pl: auto-discover primary key (if it's single column and sutable), auto-generate SQL select * from table which is *NOT OPTIMAL* if you are not going to search all columns in your table
2006-08-07 16:38:24 dpavlin r67
/trunk/bin/pgest-index.pl: added chunk of documentation and ability to --create or --drop indexes
2006-08-07 14:56:08 dpavlin r66
/trunk/bin/pgest-index.pl: don't change encoding when inserting in HyperEstraier (since pgest_trigger doesn't do that). This might mean that Hyper Estraier web UI won't have correct encoding if your database isn't in UTF-8 encoding (which might get fixed, but that would require all functions to have encoding parametar which is, IMHO, overkill)
2006-08-07 14:47:00 dpavlin r65
/trunk/bin/pgest-index.pl: print example SQL search query at end
2006-08-07 14:32:21 dpavlin r64
/trunk/README.pod: improved documentation, now this project is full-text index for PostgreSQL and not just search function :-)
2006-08-07 14:30:55 dpavlin r63
/trunk/bin/pgest-index.pl: drop triggers at beginning, work inside transaction, create pgest and pgest_trigger functions
2006-08-07 13:24:49 dpavlin r62
/trunk/bin, /trunk/bin/pgest-index.pl: script to create pgest index on PostgreSQL database (modification of dbi-indexer.pl from Search::Estraier distribution)
2006-08-07 11:14:52 dpavlin r61
/trunk/pgest.sql, /trunk/pgest.c: added pgest_trigger function which allows updateable indexes
2006-07-11 14:11:42 dpavlin r60
/trunk/README.pod: make it in sync with relity
2006-05-25 18:46:49 dpavlin r59
/trunk/Makefile, /trunk/README.pod, /trunk/data/indexer.pl, /trunk/Tutorial.pod: begin tutorial
2006-05-25 17:50:13 dpavlin r58
/trunk/test.sql: example how to return hints (which require de-normalized schema)
2006-05-11 16:19:38 dpavlin r57
/trunk/test.sql, /trunk/pgest.c: initial implementation of hints using (fake) atribute names in form HINTS.something (e.g. HITS.HIT or HINTS.WORDNUM...)
2006-05-11 15:52:50 dpavlin r56
/trunk/Makefile, /trunk/test.sql, /trunk/pgest.c: fix multiple attributes delimited with {{!}}, fix warnings
2006-05-11 11:43:36 dpavlin r55
/trunk/Makefile: use pkg-config hyperestraier if estconfig is not found (this should help on Debian installations with Hyper Estraier installed from package... you will also need libestraier-dev)
2006-05-11 10:19:46 dpavlin r54
/trunk/test.sql, /trunk/README.pod: updated documentation and example for convert
2006-05-11 10:18:32 dpavlin r53
/trunk/data/Makefile: script now creates index on it's own
2006-05-11 10:18:08 dpavlin r52
/trunk/data/indexer.pl: create node if needed
2006-05-09 22:55:42 dpavlin r51
/trunk/t/pgest.t, /trunk/test.sql, /trunk/README.pod, /trunk/pgest.c: depriciated usage of direct access to index. use node API and estmaster instead.
2006-04-14 16:17:19 dpavlin r50
/trunk/data/Makefile, /trunk/data/indexer.pl: remove direct database creation and use node API always via Search::Estraier
2005-10-29 18:54:40 dpavlin r49
/trunk/test.sql, /trunk/README.pod, /trunk/pgest.sql, /trunk/pgest.c, /trunk/t/pgest.t: added depth to node API version of pgest, note that you have to use modified perl wrapper with node API
2005-10-20 16:24:26 dpavlin r48
/trunk/pgest.c: moved argument offsets into #define(s)
2005-09-11 21:44:56 dpavlin r47
/trunk/pgest.c: few more check to keep PostgreSQL from segfaulting in case of errors
2005-09-11 21:44:11 dpavlin r46
/trunk/data/Makefile: drop and re-create trivia node
2005-09-10 23:08:47 dpavlin r45
/trunk/t/pgest.t: test node API version of pgest
2005-09-10 22:51:03 dpavlin r44
/trunk/pgest.c: check if search failed and report status
2005-09-10 22:33:36 dpavlin r43
/trunk/README.pod: few more changes to documentation
2005-09-10 21:48:46 dpavlin r42
/trunk/README.pod: updated documentation to reflect two different ways to call pgest with example
2005-09-10 20:35:09 dpavlin r41
/trunk/data/indexer.pl, /trunk/pgest.c, /trunk/test.sql, /trunk/pgest.sql: added node API
2005-09-10 18:51:13 dpavlin r40
/trunk/test.sql, /trunk/pgest.c: add support for multiple attributes delimited by {{!}}
2005-07-08 15:26:04 dpavlin r39
/trunk/pgest.c: removed misleading INFO message (hint: it's not a bug)
2005-07-08 12:47:49 dpavlin r38
/trunk/pgest.c: added est_cond_set_max to return just minumum needed resutls from Hyper Estraier
2005-07-07 17:24:22 dpavlin r37
/trunk/Makefile, /trunk/test.sql: misc cleanups
2005-07-07 17:21:50 dpavlin r36
/trunk/Makefile: nicer dependency, doesn't recompile pgest.so if not needed
2005-07-07 14:18:49 dpavlin r35
/trunk/data/indexer.pl: added sync()
2005-07-06 16:44:46 dpavlin r34
/trunk/Makefile: update project home page with documentation
2005-07-06 16:35:51 dpavlin r33
/trunk/README, /trunk/Makefile, /trunk/README.pod: converted documentation to pod
2005-07-06 15:15:04 dpavlin r32
/trunk/t/pgest.t: test ordering
2005-07-06 14:47:56 dpavlin r31
/trunk/t/pgest.t, /trunk/test.sql, /trunk/pgest.sql, /trunk/pgest.c: added results order (see test.sql for example) to function arguments, fixed limit (it really returns limit rows, and not limit - offset :-\), added est_cond_delete(cond) so that we don't leek memory
2005-07-06 14:34:10 dpavlin r30
/trunk/data/indexer.pl: small cleanup
2005-07-06 11:47:56 dpavlin r29
/trunk/data/Makefile, /trunk/README, /trunk/data/indexer.pl: added indexer.pl written using perl Hyper Estraier bindings
2005-07-01 12:50:53 dpavlin r28
/trunk/Makefile: cvs target for pgfoundry
2005-06-30 20:01:36 dpavlin r27
/trunk/pgest.c: fix for compilation on PostgreSQL 8.0
2005-05-29 19:32:28 dpavlin r26
/trunk/misc/bench.pl, /trunk/misc/bench.txt: rewrite benchmark to use just new function, add bits of documentation
2005-05-27 21:06:01 dpavlin r25
/trunk/pgest.sql, /trunk/pgest.c, /trunk/misc/bench.pl, /trunk/t/pgest.t, /trunk/misc/bench.txt, /trunk/estseek/estseek.conf, /trunk/misc, /trunk/test.sql: removed obsolete implementeation without attribute specification which was also slower according to benchmarks.
2005-05-27 12:13:28 dpavlin r24
/trunk/test.sql: fix path to index
2005-05-26 20:22:44 dpavlin r23
/trunk/README, /trunk/Makefile, /trunk/test.sql: added small snippet of documentation
2005-05-26 20:16:34 dpavlin r22
/trunk/t/pgest.t: test new function instead of old one
2005-05-26 19:43:56 dpavlin r21
/trunk/data/convert.pl: Generate dumps of used years, titles and quotes. Some parsing fixes
2005-05-26 19:42:36 dpavlin r20
/trunk/pgest.c: moved all debug output to DEBUG1 instead of INFO, fixed compilation warning
2005-05-26 17:56:53 dpavlin r19
/trunk/test.sql, /trunk/pgest.sql, /trunk/pgest.c: Created new function which allows users to specify attributes which should be returned like this:
select * from pgest('/index/path', 'query', '@title ISTRINC foo', limit, offset, {'@id','@title','@size'}) as (id text, title text, size text);
2005-05-26 14:07:04 dpavlin r18
/trunk/estseek/estseek.tmpl, /trunk/estseek, /trunk/estseek/estseek.cgi, /trunk/estseek/estseek.top, /trunk/estseek/estseek.conf: simple estseek configuration for testing searches
2005-05-26 13:57:32 dpavlin r17
/trunk/test.sql, /trunk/data/convert.pl: added year to index data
2005-05-26 00:16:48 dpavlin r16
/trunk/data/convert.pl, /trunk/pgest.c: added size of each trivia to index (and back to output)
2005-05-26 00:15:57 dpavlin r15
/trunk/Makefile: add html target to re-generate ChangeLog.html
2005-05-26 00:06:10 dpavlin r14
/trunk/pgest.c, /trunk/t/pgest.t: tests and fix for limit and offset bug introduced with null handling
2005-05-26 00:05:33 dpavlin r13
/trunk/test.sql: example how to find references to Blade Runner in other movies :-)
2005-05-25 23:38:37 dpavlin r12
/trunk/test.sql, /trunk/pgest.sql, /trunk/pgest.c, /trunk/t/pgest.t: added NULL handling for optional arguments
2005-05-25 23:28:15 dpavlin r11
/trunk/data/Makefile: decrease memory usage to 64Mb (useful when running from tmpfs)
2005-05-25 23:27:35 dpavlin r10
/trunk/data/convert.pl: add title rank to 3, small fix in index generation
2005-05-22 21:18:11 dpavlin r9
/trunk/t/pgest.t, /trunk/pgest.c: support query in which only attribute is specified
2005-05-20 22:09:16 dpavlin r8
/trunk/t/pgest.t, /trunk/data: tests, some of which fail (which is whole point of having tests, I guess)
2005-05-20 22:01:19 dpavlin r7
/trunk/t/limit_offset.sh, /trunk/data/convert.pl, /trunk/pgest.c, /trunk/data/Makefile, /trunk/Makefile, /trunk/test.sql: more work on testing framework
2005-05-20 21:29:39 dpavlin r6
/trunk/data/convert.pl, /trunk/data/Makefile, /trunk/data, /trunk/Makefile: make test database from IMDB's trivia
2005-05-20 19:44:09 dpavlin r5
/trunk/pgest.sql, /trunk/pgest.c, /trunk/test.sql: added attribute to function, fixed warnings, fixed limit and offset
2005-05-20 18:45:01 dpavlin r4
/trunk/pgest.c: bugfix: return null for 0 sized attributes
2005-05-20 16:11:38 dpavlin r3
/trunk/t/limit_offset.sh, /trunk/pgest.sql, /trunk/t, /trunk/pgest.c, /trunk/test.sql: added limit and offset
2005-05-20 13:00:46 dpavlin r2
/trunk, /trunk/pgest.c: changed debug level, ignore some files
2005-05-20 12:19:05 dpavlin r1
/trunk/Makefile, /trunk/test.sql, /trunk, /trunk/pgest.sql, /trunk/pgest.c: initial import of API between Estraier and PostgreSQL