16 June 2011

How to create a Sphinx wordform dictionary

To build a wordform file for Sphinx, you can start from a spellchecker dictionary such as myspell, ispell, pspell or aspell. Let's build a wordform file for Russian from the myspell-russian package on openSUSE.

Install myspell-russian and list the files it provides:
$ sudo zypper in myspell-russian
$ rpm -ql myspell-russian
/usr/share/doc/packages/myspell-russian
/usr/share/doc/packages/myspell-russian/descr_en.txt
/usr/share/doc/packages/myspell-russian/descr_ru.txt
/usr/share/doc/packages/myspell-russian/description.xml
/usr/share/doc/packages/myspell-russian/dictionaries.xcu
/usr/share/doc/packages/myspell-russian/icon.png
/usr/share/doc/packages/myspell-russian/licence.txt
/usr/share/myspell
/usr/share/myspell/ru_RU.aff
/usr/share/myspell/ru_RU.dic

Build wordforms:
$ spelldump /usr/share/myspell/ru_RU.dic /usr/share/myspell/ru_RU.aff wordforms_myspell_ru_RU.txt
spelldump, an ispell dictionary dumper

Loading dictionary...
Loading affix file...
Using MySpell affix file format
Dictionary words processed: 146265

Detect the encoding of the result:
$ enca -L ru < wordforms_myspell_ru_RU.txt
KOI8-R Cyrillic
  LF line terminators

Convert to UTF-8:
$ iconv -f KOI8-R -t UTF-8 -o wordforms_myspell_ru_RU_UTF8.txt wordforms_myspell_ru_RU.txt
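The conversion step can also be wrapped in a small function that checks the result. This is only a sketch: to_utf8 is a hypothetical helper name, it assumes the source file really is KOI8-R as enca reported above, and it redirects stdout rather than using -o because -o is not part of POSIX iconv.

```shell
#!/bin/sh
# to_utf8 SRC DST: convert a KOI8-R file to UTF-8 and sanity-check the result.
# Hypothetical helper, not part of Sphinx or iconv.
to_utf8() {
    src=$1
    dst=$2
    # Convert; fail if SRC contains bytes iconv cannot map.
    iconv -f KOI8-R -t UTF-8 "$src" > "$dst" || return 1
    # A UTF-8 to UTF-8 pass fails on invalid sequences, so it works as a validity check.
    iconv -f UTF-8 -t UTF-8 "$dst" > /dev/null || return 1
    # Conversion should neither add nor drop lines.
    [ $(wc -l < "$src") -eq $(wc -l < "$dst") ]
}

# Usage:
#   to_utf8 wordforms_myspell_ru_RU.txt wordforms_myspell_ru_RU_UTF8.txt && echo OK
```

The function returns non-zero if either iconv pass fails or the line counts differ, so it can be chained with && in a build script.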

Finally, in sphinx.conf set:
wordforms     = /path/to/wordforms_myspell_ru_RU_UTF8.txt
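
Before reindexing, it may be worth checking that the converted file has the layout Sphinx expects. The sketch below assumes every line should look like "wordform > normalform" with single-word sides (for example "walks > walk"); count_bad_lines is a hypothetical helper name, and the pattern is an assumption about what spelldump writes, so adjust it if your file differs.

```shell
#!/bin/sh
# count_bad_lines FILE: print how many lines do not match "wordform > normalform".
# Hypothetical helper; LC_ALL=C makes awk compare raw bytes, so the
# check does not depend on the current locale.
count_bad_lines() {
    LC_ALL=C awk '!/^[^ >]+ > [^ >]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Usage:
#   count_bad_lines wordforms_myspell_ru_RU_UTF8.txt
```

A result of 0 means every line matched the assumed pattern; anything else points at lines to inspect by hand.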
