24 July 2010

How to handle filename encodings in Linux

Recently I upgraded a thumbnail import function in a CMS and ran into an annoying problem: filename encodings differ between platforms. Filenames played a key role, because they are compared with stored product codes. So I had to guess the source encoding and convert it to UTF-8. Here are some ways to detect or convert filename encodings.

convmv

The simplest way to convert filename encodings is to use the convmv package. It supports about 124 encodings:
$ convmv --list | wc -l 
124
It can also detect whether a filename is already encoded in UTF-8, and it skips such files when converting to UTF-8. The --nosmart option switches this detection off. For instance, Windows with a Russian locale usually encodes filenames in CP866. The following command converts filename encodings to UTF-8 within folder and its subfolders (without --notest, convmv only prints what it would do):
$ convmv --notest -r -f cp866 -t utf8 folder
Your Perl version has fleas #37757 #49830 
mv "folder/�����.jpg" "folder/Пустыня.jpg"
mv "folder/�����_123.jpg" "folder/Коала_123.jpg"
mv "folder/������-1�.jpg" "folder/Пингвины-1А.jpg"
Ready!
To install convmv on Ubuntu:
$ sudo apt-get install convmv
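
The UTF-8 detection mentioned above can be approximated in the shell when convmv is not at hand. The following is a minimal sketch, not convmv's actual algorithm: the helper name is_utf8 is made up for illustration, and it assumes iconv is installed. A name counts as UTF-8 if iconv can read it as UTF-8 without errors. Pure ASCII names pass as well, since ASCII is a subset of UTF-8.

```shell
#!/bin/bash
# is_utf8 NAME (hypothetical helper)
# Succeeds if NAME is a valid UTF-8 byte sequence, fails otherwise.
is_utf8() {
    # iconv exits with a non-zero status when the input is not valid
    # in the source encoding, so a UTF-8 to UTF-8 pass acts as a validator.
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

It could be used as a guard before converting, e.g. is_utf8 "$f" || echo "needs conversion: $f". The check is a heuristic: a short CP866 name can happen to be valid UTF-8 by accident.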

enca

enca detects and converts the encoding of text files. It can equally be used to detect, for example, the encoding of Russian filenames in a folder:
$ ls -1 folder | enca -L ru
IBM/MS code page 866
  LF line terminators
Here -L tells enca which language to assume, so it guesses among the Russian encodings. Of course, the encoding should be homogeneous within the folder. On my Ubuntu box enca supports the following languages:
$ enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
  bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
      czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
   croatian: CP1250 ISO-8859-2 IBM852 macce CORK
  hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
 lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
     polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
    russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
     slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
    slovene: ISO-8859-2 CP1250 IBM852 macce CORK
  ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
    chinese: GBK BIG5 HZ
    none:
To just convert the file list, use the -x option, e.g.
$ ls -1 folder | enconv -L russian -x UTF-8
Коала_123.jpg
Пингвины-1А.jpg
afolder
cp866pics_with_folder.tar.gz
Пустыня.jpg
A shell script can iterate over the files and rename each one to the appropriate encoding, e.g.
$ cat test_enconv.sh
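#!/bin/bash
# Usage: test_enconv.sh [DIR]
# Rename every regular file in DIR, converting its name to UTF-8.
export DEFAULT_CHARSET='UTF-8'   # target charset for enconv
dir=${1:-.}

cd "$dir" || exit 1
shopt -s nullglob                # an empty directory expands to nothing
files=(*)

if [ ${#files[@]} -eq 0 ] ; then
    echo "No files in dir $dir"
    exit 0
fi

echo "Detected encoding: $(printf '%s\n' "${files[@]}" | enca -L russian | head -n 1)"
for f in "${files[@]}" ; do
    [[ -f $f && $f != *~ ]] || continue   # skip directories and backups
    new=$(printf '%s\n' "$f" | enconv -L russian) || continue
    # Rename only if conversion produced a different, non-empty name
    [ -n "$new" ] && [ "$new" != "$f" ] && mv -- "$f" "$new"
done
exit 0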
So the command:
$ ./test_enconv.sh folder
converts the names of the regular files within folder (it does not descend into subfolders). To install enca and enconv on Ubuntu:
$ sudo apt-get install enca
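
The script above only handles the files directly inside the folder. If subfolders have to be covered as well, one possible approach is to walk the tree deepest first, so that a folder is renamed only after everything inside it. This is a rough sketch under a few assumptions: the function name rename_tree is hypothetical, find and mv are the GNU versions (-mindepth, -print0, mv -n), and the converter is any command that reads a name on stdin and prints the converted name.

```shell
#!/bin/bash
# rename_tree DIR CMD [ARGS...]   (hypothetical helper)
# Rename every entry below DIR, deepest first, piping each base name through CMD.
rename_tree() {
    local dir=$1 path base new
    shift
    # -depth lists a directory's contents before the directory itself,
    # so renaming a parent never invalidates paths still to be visited.
    find "$dir" -mindepth 1 -depth -print0 |
    while IFS= read -r -d '' path ; do
        base=${path##*/}
        new=$(printf '%s' "$base" | "$@") || continue
        # Skip names the converter left empty or unchanged
        [ -n "$new" ] && [ "$new" != "$base" ] || continue
        # -n: never overwrite an existing entry
        mv -n -- "$path" "${path%/*}/$new"
    done
}
```

For example, rename_tree folder enconv -L russian -x UTF-8, or rename_tree folder iconv -f CP866 -t UTF-8. Note that iconv converts blindly, so running the second form twice would mangle names that are already UTF-8.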

iconv

iconv converts text between a large number of encodings. On my box it lists about 1168 names (the list includes aliases):
$ iconv --list | tr ',' "\n" | wc -l 
1168
It is handy and powerful. The bash script above is easily modified to invoke iconv instead of enconv. However, the source encoding has to be detected beforehand, since iconv does not guess it.

$ cat test_iconv.sh
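#!/bin/bash
# Usage: test_iconv.sh DIR FROM_ENCODING
# Rename every regular file in DIR, converting its name to UTF-8.
dir=${1:-.}
from_encoding=$2
to_encoding='UTF-8'

if [ -z "$from_encoding" ] ; then
    echo "Usage: $0 dir from_encoding" >&2
    exit 1
fi

cd "$dir" || exit 1
shopt -s nullglob                # an empty directory expands to nothing
files=(*)

if [ ${#files[@]} -eq 0 ] ; then
    echo "No files in dir $dir"
    exit 0
fi

for f in "${files[@]}" ; do
    [[ -f $f && $f != *~ ]] || continue   # skip directories and backups
    new=$(printf '%s\n' "$f" | iconv --from-code="$from_encoding" \
        --to-code="$to_encoding") || continue
    # Rename only if the name actually changed
    [ "$new" != "$f" ] && mv -- "$f" "$new"
done
exit 0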
The script is invoked as follows:
$ ./test_iconv.sh folder CP866
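
Since iconv needs to be told the source encoding, one way to obtain it is to let enca do the guessing and print the result in a form iconv understands. The sketch below assumes enca is installed and relies on its --iconv-name option. The function name detect_names_encoding is made up for illustration, and the guess is only as good as enca's detection on a list of short names.

```shell
#!/bin/bash
# detect_names_encoding LANGUAGE DIR   (hypothetical helper)
# Print the iconv-style name of the encoding enca guesses for the file names in DIR.
detect_names_encoding() {
    local lang=$1 dir=$2
    # Feed the file list to enca; --iconv-name prints a charset alias accepted by iconv
    ls -1 -- "$dir" | enca -L "$lang" --iconv-name
}
```

A possible combination with the script above: enc=$(detect_names_encoding russian folder) && ./test_iconv.sh folder "$enc". Checking the exit status matters, because enca may fail to recognize the encoding.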

Compressed files

Some archivers can stream extracted data through a pipe, which allows post-processing it before it is stored on the filesystem. For instance, tar has the --to-command option, which extracts files and pipes their contents to the standard input of a command. See http://www.gnu.org/software/tar/manual/tar.html#SEC84. The command can be a Bash script:

$ cat dispatch_arc_file.sh 
#!/bin/bash
# tar sets TAR_REALNAME to the member name; bail out if it is missing
if [ -z "$TAR_REALNAME" ] ; then
    exit 1
fi

filename=$TAR_REALNAME      # Filename within archive
default_encoding='UTF-8'    # Target encoding
filename_encoding=$1        # Source filename encoding

# Ignore files in folders
if [[ $filename == */* ]] ; then
    exit 0
fi

# Convert filename encoding
if [ $# -ne 0 ] ; then
    filename=$(printf '%s\n' "$filename" |
        iconv --from-code="$filename_encoding" --to-code="$default_encoding") || exit 1
fi

# Save file contents, which tar pipes to stdin
cat > "$filename"
exit 0
Then the following should extract files from archive.tar.gz converting filenames from CP866 encoding to UTF-8.
$ tar -xf archive.tar.gz --to-command='./dispatch_arc_file.sh CP866'
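
The dispatch script above skips members that live inside folders. If those should be kept instead, a variant could convert the whole member path and create the parent folders on the fly. This is only a sketch: the function name store_member is hypothetical, it assumes tar has set TAR_REALNAME as described above, and it refuses absolute paths and ".." components so that nothing is written outside the current directory.

```shell
#!/bin/bash
# store_member FROM_ENCODING   (hypothetical helper)
# Write stdin to the path named by $TAR_REALNAME, converted to UTF-8,
# creating parent folders as needed.
store_member() {
    local from=$1 name
    [ -n "$TAR_REALNAME" ] || return 1
    name=$(printf '%s' "$TAR_REALNAME" | iconv -f "$from" -t UTF-8) || return 1
    # Reject empty names, absolute paths and ".." components
    case "/$name/" in
        //*|*/../*) return 1 ;;
    esac
    mkdir -p -- "$(dirname -- "$name")" && cat > "$name"
}
```

Inside a dispatch script the body would then shrink to a single call such as store_member "$1".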
Zip is unfortunately not as flexible as tar. It is frustrating that Zip and Rar are far more popular among Windows users than e.g. tar. I wonder why archivers with such restrictive licenses prevail while there are simple and handy open source tools like 7-zip. Nevertheless, unzip supports pipe streaming with the -p option, but only for bulk data: it does not split the stream into files and passes all the uncompressed content to the program. To quote unzip's help:
unzip -p foo | more => send contents of foo.zip via pipe into program more
Writing a program that parses Zip headers etc. is obviously not a good idea. One option is to first extract the files to a folder and then convert the filenames with one of the methods above, or with a simple script like this:

$ cat conv_filenames.php
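<?php
// Usage: php conv_filenames.php DIR
$path = rtrim($argv[1], '/');
if ($handle = opendir($path)) {
    // Compare with false explicitly: a file named "0" would otherwise end the loop
    while (false !== ($file = readdir($handle))) {
        if ($file === '.' || $file === '..') continue;
        $new = iconv('CP866', 'UTF-8', $file);
        // readdir() returns bare names, so prefix them with the directory
        if ($new !== false && $new !== $file) {
            rename("$path/$file", "$path/$new");
        }
    }
    closedir($handle);
}
?>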
Another option is to use a class like PHP's ZipArchive:

<?php
$zip = new ZipArchive;
if ($zip->open('test.zip') !== TRUE) die('failed');

// Rename every entry from CP866 to UTF-8, then extract
for ($i = 0; $i < $zip->numFiles; ++$i) {
    $zip->renameIndex($i, iconv('CP866', 'UTF-8', $zip->getNameIndex($i)));
}
$zip->extractTo('/my/directory/');

$zip->close();
?>
ZipArchive is available when PHP is compiled with zip support (the --enable-zip option before PHP 7.4, --with-zip since then). The mb_convert_encoding function could be an alternative to iconv in PHP (see http://www.php.net/manual/en/function.mb-convert-encoding.php).