29 July 2011

Mapping special characters in Sphinx configuration

Sphinx treats any character that is not listed in charset_table as a word separator. For instance, the Russian letters 'е' and 'ё' are closely related, and some printed publications even replace the latter with the former. However, Sphinx treats 'ё' as a word separator. To prevent this, look up the Unicode code points of both letters and append a mapping to charset_table in the index config:
charset_type  = utf-8
charset_table  = 0..9, A..Z->a..z, _, a..z, U+A8->U+E5, U+B8->U+E5, U+410..U+42F->U+430..U+44F, U+430..U+44F, \
U+0451->U+0435
Here U+0451->U+0435 maps 'ё' (U+0451) to the Cyrillic 'е' (U+0435), so both spellings are indexed as the same word. The Unicode code points can be found here.
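
If looking the code points up by hand is inconvenient, they can also be computed. The following is a minimal Python sketch, not part of Sphinx: the helper names code_point and mapping are hypothetical and exist only for this illustration. It prints entries in the same U+XXXX->U+XXXX form used in charset_table above. The second entry, for the capital 'Ё' (U+0401), is an assumption about what one might also want to fold; it does not appear in the config above.

```python
def code_point(ch: str) -> str:
    """Return a code point label such as 'U+0451' for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(ch):04X}"


def mapping(src: str, dst: str) -> str:
    """Build a charset_table entry that folds src into dst (hypothetical helper)."""
    return f"{code_point(src)}->{code_point(dst)}"


if __name__ == "__main__":
    # Escapes are used so the Cyrillic letters cannot be confused with Latin look-alikes.
    print(mapping("\u0451", "\u0435"))  # small 'ё' folded into Cyrillic 'е'
    print(mapping("\u0401", "\u0435"))  # capital 'Ё' folded into Cyrillic 'е' (assumed extra entry)
```

The printed entries can be appended to the charset_table line, separated by commas like the existing ones.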
