26 May 2013

Fixing Sphinx wordforms

The Sphinx wordforms feature has a drawback: the stemmer skips the "destination" words. In this post I'll show a quick and dirty way to fix a large wordforms file.

Let's assume we have a wordforms file with the following contents:

noisy > noise
noisyyy > noise
Any document containing the word noisy will be found by searching for noisyyy, and vice versa. The bad news is that the destination word noise is never stemmed. This means that if we search for noise, we'll find only those documents that contain exactly "noise".

Sphinx 2.1.1-beta introduced the indextool --morph INDEXNAME option, which applies the index's morphology to the text given on standard input, e.g.:

$ echo 'confidence
> presence' | ~/bin/indextool --morph s3_all
Sphinx 2.1.1-beta (rel21-r3701)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
dumping stemmed results...
confid presenc
Yes, the output could be friendlier (that is, easier to parse); at the time of writing, indextool doesn't even support a --quiet option. Now we are going to replace the destination words with their stemmed variants. Here is a simple one-shot Perl script for this.
#!/usr/bin/perl
# File: stem-wordforms.pl
use strict;
use warnings;
use Fcntl;                  # O_* constants for sysopen
use File::Temp qw(tmpnam);  # POSIX::tmpnam was removed in Perl 5.26
use IO::Seekable;           # SEEK_SET for seek

sub usage () {
    print "Usage: $0 <filename> [<out_filename>]
<filename>       Path to the Sphinx wordforms file
<out_filename>   Path to the output file (default: <filename>.out)
";
}

if ($#ARGV < 0) {
    usage;
    exit;
}
my $in_filename  = $ARGV[0];
my $out_filename = $#ARGV > 0 ? $ARGV[1] : $in_filename .".out";
my $tmp_filename = tmpnam();
my $line;
my $n_in_lines = 0;
my $n_out_lines = 0;

sysopen FH_IN, $in_filename, O_RDONLY
    or die("Failed to open `$in_filename`: $!");
sysopen FH_RES, $out_filename, O_CREAT | O_TRUNC | O_WRONLY, 0640
    or die("Failed to open `$out_filename`: $!");
# Open pipe for indextool. We'll write destination wordforms here line by line.
# By means of `tr` command we place a word per line.
open my $fh_indextool, "|-",
"~/bin/indextool -c sphinx.conf --morph s3_all | tr ' ' '\n' > $tmp_filename"
    or die("Failed to start indextool: $!");

print ">> Stemming to `$tmp_filename`...";
while (<FH_IN>) {
    ++$n_in_lines;

    # Get destination word
    s/^[^\>]+\> // ;
    # Patch: for some reason Sphinx 2.1.1-beta strips `ё` character even if it
    # is declared in charset_table
    s/ё/e/g;
    $line = $_;

    print $fh_indextool $line;
}
close $fh_indextool;
print " OK\n";

print ">> Joining temporary results with the source wordforms... ";
my $got_results = 0;
my $line2;
sysopen(FH_TMP, $tmp_filename, O_RDWR, 0640)
    or die("Failed to open `$tmp_filename`: $!");
seek FH_IN, 0, SEEK_SET;
while (<FH_TMP>) {
    $line = $_;

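    # `tr` has split the "dumping stemmed results..." banner line into words,
    # so "results..." sits on a line of its own; the stems start right after it.
    if ($got_results == 0 && m/^results\.\.\./) {
        print "Got results... ";
        $got_results = 1;
        next;
    }
    # Skip the rest of the indextool banner
    $got_results or next;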

    last if !($line2 = <FH_IN>);
    $line2 =~ s/[^\>]+$//;
    print FH_RES $line2 , " ", $line;
    ++$n_out_lines;
}
unlink $tmp_filename or print "Failed to unlink $tmp_filename\n";

$n_in_lines != $n_out_lines and die("* Failed! The numbers of lines in `$in_filename` and `$out_filename` don't match!\n");

print "OK\nDone\n";

close FH_IN;
close FH_RES;
close FH_TMP;
It's not a very nice script, but hopefully it will help someone. You may want to adapt it to your configuration. Usage:
$ ./stem-wordforms.pl wordforms_myspell_ru_RU_UTF-8.txt out
>> Stemming to `/tmp/file7AawtG`... OK
>> Joining temporary results with the source wordforms... Got results... OK
Done
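If you'd rather not juggle a temporary file and `tr`, the two core steps of the script (skipping the indextool banner and swapping in the stems) can be sketched as plain functions. Treat this as a rough sketch only: `extract_stems` and `rewrite_wordforms` are names I made up for illustration, the two wordforms lines are hypothetical, and it assumes single-word destinations and that indextool prints the "dumping stemmed results..." marker exactly as in the session above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the stems that follow the "dumping stemmed results..." marker.
# Everything before the marker (version banner, config notice) is ignored.
sub extract_stems {
    my ($output) = @_;
    my (undef, $tail) = split /^dumping stemmed results\.\.\.\s*$/m, $output, 2;
    return () unless defined $tail;    # marker not found
    return split ' ', $tail;           # one stem per whitespace-separated word
}

# Replace the destination of each "source > destination" line with the
# stem at the same position. Assumes one single-word destination per line.
sub rewrite_wordforms {
    my ($lines, $stems) = @_;
    die "line/stem count mismatch\n" unless @$lines == @$stems;
    my @out;
    for my $i (0 .. $#$lines) {
        my ($source) = $lines->[$i] =~ /^(.*?)\s*>/
            or die "not a wordforms line: $lines->[$i]\n";
        push @out, "$source > $stems->[$i]";
    }
    return @out;
}

# indextool output copied from the session shown earlier in this post.
my $captured = <<'END';
Sphinx 2.1.1-beta (rel21-r3701)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
dumping stemmed results...
confid presenc
END

# Hypothetical wordforms lines whose destinations are the two words above.
my @wordforms = ('confident > confidence', 'present > presence');

my @stems = extract_stems($captured);
print "$_\n" for rewrite_wordforms(\@wordforms, \@stems);
```

Run as is, it should print `confident > confid` and `present > presenc`. Feeding it real data would still mean piping the destination words through indextool yourself and capturing the output.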
