Software Engineer's Notes: February 2014

21 February 2014

Extracting unescaped part from Nginx location in UTF-8

Nginx matches a location against the unescaped (decoded) URI. Capturing a Unicode part of the URI, however, is somewhat tricky. First of all, we have to prefix the regular expression with (*UTF8), e.g.:

location ~* "(*UTF8)^/img/(.+)\.(jpg|jpeg|gif|png)$" {
  # ...
}

Now we can use the ordinary $1, $2, ..., $N variables for the corresponding regular expression groups. In most cases they work well. For instance, we can pass them to a proxy URI: the captured parts get URL-encoded and everything works fine, except when we use them as file or directory names, e.g.:

proxy_store $my_variable;

The problem is that URL-encoding turns every byte into three characters, so a single UTF-8 character (1 to 4 bytes) can expand to 3..12 characters! A 3-character Chinese sequence like '艾弗吉' becomes the 27-character string '%E8%89%BE%E5%BC%97%E5%90%89'. At 9 characters each, 28 Chinese characters already take 252 of the 255 bytes that Ext4 allows for a file name, leaving no room even for an extension such as .jpg. The following captures the basename as an unescaped UTF-8 string:

location ~* "(*UTF8)^/img/(?<basename>(.+)\.(jpg|jpeg|gif|png))$" {
  # $basename holds the unescaped UTF-8 string
}
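
To see why the escaped form is a problem for file names while the unescaped one is not, the two lengths can be compared with a short Python sketch. It is only an illustration: it assumes that Python's urllib.parse.quote produces the same percent-encoding of UTF-8 bytes as the escaped captures, and the helper name escaped_and_raw_sizes is made up for this example.

```python
from urllib.parse import quote

EXT4_NAME_MAX = 255  # ext4 limits a single file name to 255 bytes


def escaped_and_raw_sizes(name):
    """Return (length when percent-encoded, length in bytes as raw UTF-8)."""
    escaped = quote(name, safe="")
    return len(escaped), len(name.encode("utf-8"))


sample = "艾弗吉"
print(quote(sample, safe=""))         # %E8%89%BE%E5%BC%97%E5%90%89
print(escaped_and_raw_sizes(sample))  # (27, 9)

# 28 Chinese characters plus a ".jpg" extension
long_name = "艾" * 28 + ".jpg"
escaped_len, raw_len = escaped_and_raw_sizes(long_name)
print(escaped_len > EXT4_NAME_MAX)    # True: 28 * 9 + 4 = 256
print(raw_len <= EXT4_NAME_MAX)       # True: 28 * 3 + 4 = 88
```

The same name that overflows the limit when escaped needs only about a third of it as raw UTF-8, which is why the named capture is the safer choice for proxy_store paths.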

Now we can pass $basename to directives that accept file or directory names. Hope this helps someone.