02 November 2015

PostgreSQL function for parsing JSON into Sphinx keywords

This post describes how to write a basic PostgreSQL function for parsing a JSON into Sphinx keywords(space-separated values of the JSON object).

Requirements

  • Perl core
  • Perl JSON module
  • PostgreSQL built with JSON support

Sample Problem

We have workers table:

CREATE TABLE workers(
  id integer,
  props text
);
where props keeps arbitrary worker properties in JSON format.

Let's insert a worker:

INSERT INTO workers VALUES(1, '{"x":"x test", "d":104, "o":{"o1":"o1 name"}, "a":["a1", "a2"]}');

We want to make the worker properties searchable through Sphinx. So we add corresponding PostgreSQL source and an index in our sphinx.conf:

source src_workers: pgsql
{
  sql_query_pre = set work_mem = '100MB'
  sql_query = select id, props from workers
  sql_field_string = props
}

index idx_workers
{
  source = src_testing
  path = /home/ruslan/sphinxsearch/idx_workers
  min_prefix_len = 3
  docinfo = extern
}

With the current sql_query the props field will be cluttered with JSON keys and specific characters. We'll write a PostgreSQL function which will give us a single line of space-separated values:

$ createlang plperlu t # < "t" is a database name
$ psql -d t
create or replace function json2sphkw(text,text) returns text as $$
use JSON qw( decode_json );

my ($in) = @_;
$_SHARED{read_sub} = sub {
  my ($in) = @_[0];

  if (ref($in) eq "HASH") {
    my @res, $read_sub = $_SHARED{read_sub};
    foreach $v (values %$in) {
      push @res, &$read_sub($v);
    }
    return join(' ', @res);
  } elsif (ref($in) eq "ARRAY") {
    my @res, $read_sub = $_SHARED{read_sub};
    foreach $v (values @$in) { push @res, &$read_sub($v); }
    return join(' ', @res);
  }
  return $in;
};

my $h = decode_json $in;
my $read_sub = $_SHARED{read_sub};

return &$read_sub($h);
$$ LANGUAGE plperlu;

Now we can use it:

SELECT json2sphkw(props, ' ') FROM workers WHERE id = 1;
        json2sphkw
--------------------------
 o1 name a1 a2 x test 104
(1 row)
And the sphinx.conf can be changed as follows:
@@ -1,7 +1,7 @@
 source src_workers: pgsql
 {
   sql_query_pre = set work_mem = '100MB'
-  sql_query = select id, props from workers
+  sql_query = select id, json2sphkw(props, ' ') from workers
   sql_field_string = props
 }

References