Recently I upgraded a thumbnail import function in a CMS and ran into an annoying problem: filename encodings differ between platforms. Filenames played a key role, as they are compared with the stored product codes. So I was forced to guess the source encoding and convert the filenames to UTF-8.
Here are some ways to convert or detect filename encodings.
convmv
The simplest way to convert filename encodings is to use the convmv package.
It supports about 124 encodings:
$ convmv --list | wc -l
124
It can also detect whether a filename is already encoded in UTF-8 and skips such files when converting to UTF-8.
The --nosmart option switches off this smartness.
For instance, Windows with a Russian locale usually encodes filenames in CP866.
The following command converts filename encodings to UTF-8 within a folder and its subfolders:
$ convmv --notest -r -f cp866 -t utf8 folder
Your Perl version has fleas #37757 #49830
mv "folder/�����.jpg" "folder/Пустыня.jpg"
mv "folder/�����_123.jpg" "folder/Коала_123.jpg"
mv "folder/������-1�.jpg" "folder/Пингвины-1А.jpg"
Ready!
To install convmv on Ubuntu:
$ sudo apt-get install convmv
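To get a feel for the UTF-8 check mentioned above, here is a minimal sketch that does something similar with iconv alone. It is not how convmv implements its detection; the function names is_utf8 and list_non_utf8 are made up for this example, and it assumes an iconv that exits with a non-zero status on undecodable input (glibc's does).

```shell
# is_utf8 NAME: succeed if NAME is a valid UTF-8 byte sequence.
# Assumption: iconv exits non-zero when the input cannot be decoded.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# list_non_utf8 DIR: print the entries of DIR whose names are not valid UTF-8.
list_non_utf8() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue      # empty directory: the glob matched nothing
        name=${path##*/}                # strip the directory part
        is_utf8 "$name" || printf '%s\n' "$name"
    done
}
```

Pure ASCII names count as valid UTF-8, so `list_non_utf8 folder` reports only names that contain bytes from some other encoding.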
enca
enca detects and converts the encoding of text files. It can also be used to detect, for example, the encoding of Russian filenames in a folder:
$ ls -1 folder | enca -L ru
IBM/MS code page 866
LF line terminators
Here -L tells enca which language to assume (Russian) when guessing the encoding. Of course, the encoding should be homogeneous within the folder.
On my Ubuntu box it supports the following languages:
$ enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
croatian: CP1250 ISO-8859-2 IBM852 macce CORK
hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
slovene: ISO-8859-2 CP1250 IBM852 macce CORK
ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
chinese: GBK BIG5 HZ
none:
To convert just the file list, use the -x option, e.g.:
$ ls -1 folder | enconv -L russian -x UTF-8
Коала_123.jpg
Пингвины-1А.jpg
afolder
cp866pics_with_folder.tar.gz
Пустыня.jpg
A shell script could iterate over the files and rename each one to the appropriate encoding, e.g.:
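$ cat test_enconv.sh
#!/bin/bash
# Rename every regular file in a directory, letting enconv guess
# the (Russian) source encoding of each name.
export DEFAULT_CHARSET='UTF-8'   # target encoding used by enconv
dir=${1:-./}
cd "$dir" || exit 1
shopt -s nullglob                # an empty directory yields an empty array
files=(*)
if [ ${#files[@]} -eq 0 ] ; then
    echo "No files in dir $dir"
    exit 0
fi
echo "Detected encoding:" $(ls | enca -L russian)
for f in "${files[@]}" ; do
    [ -f "$f" ] || continue      # regular files only
    new=$(printf '%s\n' "$f" | enconv -L russian)
    # Skip names enconv could not convert or did not change
    if [ -n "$new" ] && [ "$new" != "$f" ] ; then
        mv -- "$f" "$new"
    fi
done
exit 0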
So the command:
$ ./test_enconv.sh folder
converts the names of the regular files directly inside folder; it does not descend into subfolders.
To install enca and enconv on Ubuntu:
$ sudo apt-get install enca
iconv
iconv converts files between about 1168 encodings:
$ iconv --list | tr ',' "\n" | wc -l
1168
It is handy and powerful. The bash script above can easily be modified to invoke iconv instead of enconv. However, the source encoding has to be determined beforehand somehow.
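$ cat test_iconv.sh
#!/bin/bash
# Usage: test_iconv.sh dir from_encoding
# Rename every regular file in dir from from_encoding to UTF-8.
dir=${1:-./}
from_encoding=$2
to_encoding='UTF-8'
cd "$dir" || exit 1
shopt -s nullglob                # an empty directory yields an empty array
files=(*)
if [ ${#files[@]} -eq 0 ] ; then
    echo "No files in dir $dir"
    exit 0
fi
for f in "${files[@]}" ; do
    [ -f "$f" ] || continue      # regular files only
    new=$(printf '%s\n' "$f" | iconv --from-code="$from_encoding" \
                                     --to-code="$to_encoding")
    # Skip names iconv could not convert or did not change
    if [ -n "$new" ] && [ "$new" != "$f" ] ; then
        mv -- "$f" "$new"
    fi
done
exit 0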
The script is called like this:
$ ./test_iconv.sh folder CP866
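The script above handles only the top level of the folder. Below is a hedged sketch of a recursive variant, assuming the target is always UTF-8 and that no filename contains a newline. The function name to_utf8_tree is hypothetical. It walks the tree with find -depth, so the contents of a directory are renamed before the directory itself, and it skips names that are already valid UTF-8, which makes a second run harmless.

```shell
# to_utf8_tree DIR FROM: rename every entry below DIR from encoding FROM
# to UTF-8. Assumes filenames contain no newline characters.
to_utf8_tree() {
    dir=$1
    from=$2
    # -depth lists a directory's contents before the directory itself,
    # so children are renamed while their parent path is still valid.
    find "$dir" -depth | while IFS= read -r path; do
        [ "$path" = "$dir" ] && continue    # leave the starting folder alone
        parent=${path%/*}
        name=${path##*/}
        # Already valid UTF-8 (this includes pure ASCII): nothing to do
        if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        new=$(printf '%s' "$name" | iconv -f "$from" -t UTF-8) || continue
        [ -e "$parent/$new" ] && continue   # never overwrite an existing entry
        mv -- "$path" "$parent/$new"
    done
}
```

It would be called much like the script: `to_utf8_tree folder CP866`.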
Compressed files
Some archivers support pipe streaming, which makes it possible to post-process extracted data before storing it on the filesystem.
For instance, Tar has the --to-command option, which tells it to extract files and pipe their contents to the standard input of a command. See http://www.gnu.org/software/tar/manual/tar.html#SEC84. The command could be a Bash script:
$ cat dispatch_arc_file.sh
#!/bin/bash
# Run by tar --to-command once per archive member;
# the member's contents arrive on standard input.
if [ -z "$TAR_REALNAME" ] ; then
    exit 1
fi
filename=$TAR_REALNAME      # Filename within archive
default_encoding='UTF-8'    # Target encoding
filename_encoding=$1        # Source filename encoding
# Ignore files in folders
if [[ $filename == */* ]] ; then
    exit 0
fi
# Convert filename encoding
if [ $# -ne 0 ] ; then
    filename=$(printf '%s\n' "$filename" | iconv --from-code="$filename_encoding" --to-code="$default_encoding")
fi
# Save file
cat > "$filename"
exit 0
Then the following should extract files from archive.tar.gz, converting filenames from CP866 to UTF-8:
$ tar -xf archive.tar.gz --to-command='./dispatch_arc_file.sh CP866'
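Before extracting, it can be handy to preview what the converted names will look like. Here is a small sketch, assuming GNU tar: by default it may print non-printable bytes in listings as octal escapes, so --quoting-style=literal is used to get the raw names. The function name preview_tar_names is made up for this example.

```shell
# preview_tar_names ARCHIVE FROM: list the member names of ARCHIVE as they
# would read after converting them from encoding FROM to UTF-8.
# Assumption: GNU tar (for --quoting-style); nothing is extracted.
preview_tar_names() {
    tar --quoting-style=literal -tf "$1" | iconv -f "$2" -t UTF-8
}
```

For the archive above this would be `preview_tar_names archive.tar.gz CP866`.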
Zip is unfortunately not as flexible as Tar. It is frustrating that Zip and Rar are far more popular than, e.g., Tar among Windows users.
I wonder why archivers with such restrictive licenses prevail when there are simple and handy open source tools like 7-zip. Nevertheless, Unzip supports pipe streaming with the -p option. But it works only for bulk data, i.e. it doesn't separate the stream into files and passes all the uncompressed content to the program in one piece.
I'd just quote unzip's help:
unzip -p foo | more => send contents of foo.zip via pipe into program more
Writing a program that reads Zip headers etc. is, obviously, not a good idea. One option is to extract the files to a folder first and then convert the filenames with one of the methods above, or with a simple script like this:
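$ cat conv_filenames.php
<?php
// Rename every entry in the given folder from CP866 to UTF-8.
$path = rtrim($argv[1], '/');
if ($handle = opendir($path)) {
    // Compare with false so that a file named "0" does not end the loop
    while (($file = readdir($handle)) !== false) {
        if ($file === '.' || $file === '..') continue;
        $new = iconv('CP866', 'UTF-8', $file);
        if ($new !== false && $new !== $file) {
            // readdir() returns bare names, so prepend the folder
            rename("$path/$file", "$path/$new");
        }
    }
    closedir($handle);
}
?>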
Another option is to use a class like PHP's ZipArchive:
<?php
$zip = new ZipArchive;
if ($zip->open('test.zip') !== TRUE) die('failed');
// Rename each entry from CP866 to UTF-8, then extract
for ($i = 0; $i < $zip->numFiles; ++$i) {
    $zip->renameIndex($i, iconv('CP866', 'UTF-8', $zip->getNameIndex($i)));
}
$zip->extractTo('/my/directory/');
$zip->close();
?>
ZipArchive is available when PHP is compiled with the --with-zip option.
The mb_convert_encoding function could be an alternative to iconv in PHP (see http://www.php.net/manual/en/function.mb-convert-encoding.php).
Excellent post!!
I was facing exactly the same problem but with Japanese encoded filenames.
You just saved me a lot of trouble with such a good post.
Keep it up.
Thanks a lot,
GuruM
Veeeery nice post! Congratulations!