The Sphinx wordforms feature has a drawback: the stemmer skips "destination"
words. In this post I'll show a quick and dirty way to fix a large wordforms
file.
Let's assume we have a wordforms file with the following contents:
noisy > noise
noisyyy > noise
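In each line, the word to the right of > is the destination form. For a large file it can help to list the distinct destination words first. A minimal sketch, assuming one "source > destination" mapping per line as above (destinations is a made-up helper name, not part of Sphinx):

```shell
#!/bin/sh
# Print the distinct destination words of a wordforms file.
# Assumption: every line looks like "source > destination".
destinations() {
    # Strip everything up to and including the first ">" and following spaces.
    sed 's/^[^>]*> *//' "$1" | sort -u
}

# Demo on a throwaway copy of the two-line file from this post.
wf=$(mktemp)
printf 'noisy > noise\nnoisyyy > noise\n' > "$wf"
destinations "$wf"   # prints: noise
rm -f "$wf"
```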
Any document containing the word noisy will be found by searching for
noisyyy, and vice versa. The bad news is that the destination word noise
is left out of stemming. This means that a search for noise will find only
those documents that contain exactly "noise".
Sphinx 2.1.1-beta introduced the indextool --morph INDEXNAME option,
which applies morphology to the text given on standard input, e.g.:
$ echo 'confidence
> presence' | ~/bin/indextool --morph s3_all
Sphinx 2.1.1-beta (rel21-r3701)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
dumping stemmed results...
confid presenc
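The stems come after a banner, so anything that consumes this output has to skip the banner first. A minimal sketch of such a filter, assuming the banner always ends with the "dumping stemmed results..." line as in the sample above (stems_only is a made-up helper name):

```shell
#!/bin/sh
# Read indextool --morph output on stdin and print one stem per line.
# Assumption: stems start right after the "dumping stemmed results..." line.
stems_only() {
    awk '
        seen { for (i = 1; i <= NF; i++) print $i }   # split stems onto lines
        /^dumping stemmed results\.\.\./ { seen = 1 } # end of the banner
    '
}

# Demo: feed it the sample output shown above.
stems_only <<'EOF'
Sphinx 2.1.1-beta (rel21-r3701)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
dumping stemmed results...
confid presenc
EOF
```

With a real index, the heredoc would be replaced by a pipe from indextool itself.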
Yes, the output could be friendlier (more parsable). At the time of writing
it doesn't even support a --quiet option. Now we are going to replace the
destination words with their stemmed variants. Here is a simple one-shot Perl
script for this.
#!/usr/bin/perl
# File: stem-wordforms.pl
use strict;
use warnings;
use Fcntl; # sysopen flags
use File::Temp 'tempfile'; # POSIX::tmpnam was removed in Perl 5.26
use IO::Seekable; # SEEK_SET

sub usage {
    print "Usage: $0 <filename> [<out_filename>]
  <filename>      Path to SphinxSearch wordforms file
  <out_filename>  Path to output filename
";
}

if (@ARGV < 1) {
    usage();
    exit;
}

my $in_filename = $ARGV[0];
my $out_filename = @ARGV > 1 ? $ARGV[1] : $in_filename . ".out";
# Reserve a temporary file; the indextool output is redirected into it.
my ($tmp_fh, $tmp_filename) = tempfile();
close $tmp_fh;
my $line;
my $n_in_lines = 0;
my $n_out_lines = 0;

sysopen FH_IN, $in_filename, O_RDONLY
    or die("Failed to open `$in_filename`: $!");
sysopen FH_RES, $out_filename, O_CREAT | O_TRUNC | O_WRONLY, 0640
    or die("Failed to open `$out_filename`: $!");

# Open pipe for indextool. We'll write destination wordforms here line by line.
# The `tr` command puts each output word on its own line.
open my $fh_indextool, "|-",
    "~/bin/indextool -c sphinx.conf --morph s3_all | tr ' ' '\\n' > $tmp_filename"
    or die("Failed to start indextool: $!\n");

print ">> Stemming to `$tmp_filename`...";
while (<FH_IN>) {
    ++$n_in_lines;
    # Get destination word
    s/^[^\>]+\> // ;
    # Patch: for some reason Sphinx 2.1.1-beta strips `ё` character even if it
    # is declared in charset_table
    s/ё/e/g;
    $line = $_;
    print $fh_indextool $line;
}
# Closing the pipe waits for indextool and tr to finish writing.
close $fh_indextool;
print " OK\n";

print ">> Joining temporary results with the source wordforms... ";
my $got_results = 0;
my $line2;
sysopen(FH_TMP, $tmp_filename, O_RDONLY)
    or die("Failed to open `$tmp_filename`: $!");
seek FH_IN, 0, SEEK_SET;
while (<FH_TMP>) {
    $line = $_;
    # Skip the banner: stems start after "dumping stemmed results...",
    # whose last word ends up on its own line after `tr`.
    if ($got_results == 0 && m/^results\.\.\./) {
        print "Got results... ";
        $got_results = 1;
        next;
    }
    $got_results or next;
    last unless defined($line2 = <FH_IN>);
    # Keep the source part of the line, up to and including `>`
    $line2 =~ s/[^\>]+$//;
    print FH_RES $line2 , " ", $line;
    ++$n_out_lines;
}

unlink $tmp_filename or print "Failed to unlink $tmp_filename\n";
$n_in_lines != $n_out_lines
    and die("* Failed! The line counts of `$in_filename` and `$out_filename` don't match!\n");
print "OK\nDone\n";
close FH_IN;
close FH_RES;
close FH_TMP;
It's not a very nice script, but hopefully it will help someone. You may want
to adapt it to your configuration: the indextool path, the config file and
the index name are hard-coded. Usage:
$ ./stem-wordforms.pl wordforms_myspell_ru_RU_UTF-8.txt out
>> Stemming to `/tmp/file7AawtG`... OK
>> Joining temporary results with the source wordforms... Got results... OK
Done
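Before pointing Sphinx at the new file, it may be worth checking that only the destination column changed. A rough sketch, assuming both files use the "source > destination" format (same_sources is a made-up helper, and the stems in the demo are illustrative rather than real indextool output):

```shell
#!/bin/sh
# Succeed if two wordforms files have identical source forms, line by line.
same_sources() {
    # Drop everything from the first " >" onwards, keeping the source form.
    before=$(sed 's/ *>.*$//' "$1") || return 1
    after=$(sed 's/ *>.*$//' "$2") || return 1
    [ "$before" = "$after" ]
}

# Demo with throwaway files.
dir=$(mktemp -d)
printf 'noisy > noise\nnoisyyy > noise\n' > "$dir/in.txt"
printf 'noisy > nois\nnoisyyy > nois\n' > "$dir/out.txt"
if same_sources "$dir/in.txt" "$dir/out.txt"; then
    echo "source forms are intact"
else
    echo "source forms differ"
fi
rm -rf "$dir"
```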