Installing Sphinx with Lithuanian stemmer

August 21, 2013

This page was initially written in Lithuanian. The examples contain Lithuanian phrases.

Sphinx

Sphinx is an open source text search engine.

Although one of the main features of Sphinx is speed, another valuable feature is the ability to build a text index in which word endings are cut off (stemmed).

This makes it possible to search for a word without knowing its exact ending.

For instance, the Lithuanian word stalas is reduced to stal, so it can be found using other forms of the word: stalą, stalo, stalų, etc.

Normally Sphinx can be installed from a repository, but the stemmer is not a standard Sphinx feature; the Snowball Libstemmer library is used instead, so Sphinx needs to be compiled from source.

Let’s begin :)

We will need these programs (can be installed via repository):

Download the Sphinx source from http://sphinxsearch.com/downloads/ (version 2.0.8 in this case).

Download the Libstemmer C version from my repo https://github.com/plutzilla/sphinx-libstemmer (git clone git@github.com:plutzilla/sphinx-libstemmer.git .) and put it in the libstemmer_c directory of the Sphinx source tree.

It is not necessary to compile Libstemmer separately (it will be compiled together with Sphinx), but if we do, the stemwords program will be created. It can be used to check how the stemmer works, e.g.:

./stemwords -l lt
stalas
stal
stalą
stal

Of course, some words are stemmed poorly:

./stemwords -l lt
šienas
š
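To check several word forms at once, the manual session above can be wrapped in a small helper. This is only a sketch: the function name stem_pairs and the STEMWORDS variable are made up for illustration, and it assumes stemwords reads words from standard input, as in the examples above.

```shell
# STEMWORDS is an assumption: path to the stemwords binary built from libstemmer_c.
STEMWORDS="${STEMWORDS:-./stemwords}"

# Print one "word -> stem" line for every word given as an argument.
stem_pairs() {
    for word in "$@"; do
        # Feed a single word to the stemmer and capture the stem it prints.
        stem=$(printf '%s\n' "$word" | "$STEMWORDS" -l lt)
        printf '%s -> %s\n' "$word" "$stem"
    done
}
```

Running stem_pairs stalas stalą stalo stalų prints one pair per form, which makes it easy to spot forms that do not reduce to the same stem.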

Configure Sphinx:

./configure --with-mysql-includes=/usr/include/mysql --with-mysql-libs=/usr/lib/mysql --with-libstemmer

By default Sphinx is installed to /usr/local. To change the installation path, pass --prefix=/path/to/installation to configure.

As usual, compile with make and install with make install.

To use Sphinx from other libraries, we need to compile the Sphinx client library:

cd api/libsphinxclient
./configure
make install

To use Sphinx from PHP, we need to install the Sphinx PECL extension:

pecl install sphinx

To be able to use PECL (to install packages from the PECL repository), the following packages must be installed beforehand: php-pecl and php5-dev.

After installing the Sphinx PECL extension, we need to enable it in the php.ini file (add the line extension=sphinx.so) and reload PHP: if PHP runs as an Apache module, restart Apache; if it runs as FastCGI, restart the FastCGI or FPM service.
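A quick way to confirm that the extension is loaded is to look for it in the module list printed by php -m. The helper below is a minimal sketch; the function name php_has_extension and the PHP_BIN variable are made up for illustration. Keep in mind that the PHP command-line binary may read a different php.ini than Apache or FPM, so this only checks the CLI configuration.

```shell
# PHP_BIN is an assumption: the PHP command-line binary to inspect.
PHP_BIN="${PHP_BIN:-php}"

# Exit status 0 if the named extension is listed by `php -m`, non-zero otherwise.
php_has_extension() {
    # -i: ignore case, -x: match the whole line only
    "$PHP_BIN" -m | grep -ix "$1" >/dev/null
}
```

For example, php_has_extension sphinx && echo "sphinx extension is loaded".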

To start Sphinx on system boot, we need to create an init script. Create the file /etc/init.d/searchd with the following content:

#!/bin/bash

case "${1:-''}" in
'start')
/usr/local/bin/searchd
;;
'stop')
/usr/local/bin/searchd --stop
;;
'restart')
/usr/local/bin/searchd --stop && /usr/local/bin/searchd
;;
*)
echo "Usage: $0 start|stop|restart"
exit 1
;;
esac

If we want to keep the Sphinx configuration file somewhere other than the default location (/usr/local/sphinx/etc/sphinx.conf), we need to pass the parameter --config /path/to/sphinx.conf to searchd.

Also, if we want to run Sphinx as a non-root user, searchd can be started with the following command (put it in the init script):

su - <unix-user> -c "/usr/local/bin/searchd --config /path/to/sphinx.conf"

After creating the init script, we need to make it executable and update the rc.d configuration:

sudo chmod +x /etc/init.d/searchd
sudo update-rc.d searchd defaults

To use the Lithuanian stemmer, we need to add this setting to the index section of the Sphinx configuration file:

morphology = libstemmer_lt
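For context, this is roughly where the setting lives. The fragment below is a sketch, not a complete configuration: the index name, source name and path are hypothetical placeholders, the matching source block is omitted, and charset_type = utf-8 assumes the indexed text is UTF-8 (in Sphinx 2.0.x the default is sbcs).

```
index lt_documents
{
    # hypothetical source name; the source block itself is not shown here
    source       = lt_documents_src
    # hypothetical location of the index files
    path         = /var/lib/sphinx/lt_documents
    # assumes UTF-8 input
    charset_type = utf-8
    morphology   = libstemmer_lt
}
```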

It is also useful to convert Lithuanian characters to their Latin counterparts (transliterate), so that a query typed without diacritics still matches. To enable this, add the following to the index config:

charset_table     = 0..9, A..Z->a..z, _, a..z, \
    U+104->a, U+105->a, \
    U+10C->c, U+10D->c, \
    U+116->e, U+117->e, \
    U+118->e, U+119->e, \
    U+12E->i, U+12F->i, \
    U+160->s, U+161->s, \
    U+16A->u, U+16B->u, \
    U+172->u, U+173->u, \
    U+17D->z, U+17E->z
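To get a feeling for what this folding does to a word, the same mapping can be imitated in the shell. This is only an illustration of the idea (lowercase ASCII letters, then replace both cases of the nine Lithuanian letters with plain Latin ones), not how Sphinx implements it; the function name fold_lt is made up for this example.

```shell
# Fold a string the way the charset_table is meant to: A..Z -> a..z,
# and Lithuanian letters with diacritics -> plain Latin letters.
fold_lt() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed \
        -e 's/Ą/a/g' -e 's/ą/a/g' \
        -e 's/Č/c/g' -e 's/č/c/g' \
        -e 's/Ę/e/g' -e 's/ę/e/g' \
        -e 's/Ė/e/g' -e 's/ė/e/g' \
        -e 's/Į/i/g' -e 's/į/i/g' \
        -e 's/Š/s/g' -e 's/š/s/g' \
        -e 's/Ų/u/g' -e 's/ų/u/g' \
        -e 's/Ū/u/g' -e 's/ū/u/g' \
        -e 's/Ž/z/g' -e 's/ž/z/g'
}
```

For example, fold_lt 'STALĄ' and fold_lt 'stala' produce the same string, which is why a search typed without diacritics can match the indexed text.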

I am not covering how to use Sphinx, create indices or index text here; you can find that in the Sphinx documentation or in the man pages: man search, man searchd, man indexer.

Huge thanks to the lt stemmer initiative and to Linas Valiukas.