Converting ID3 tags for a large number of files turned out to be somewhat troublesome, especially when file names contain spaces and other "special" characters. I found suitable tools and managed to convert
Windows-1251 MP3 file tags to
UTF-8. A sample bash script is also posted here.
mp3info
This tool displays and modifies ID3 tags and is quite simple to use. I would have stuck with it if it had no problems with long character sequences (it truncates them unpredictably!).
mid3iconv
Its job is exactly what I want: it converts ID3 tag encodings to Unicode. Usage is simplicity itself, e.g.
$ mid3iconv -e "Windows-1251" ~/Music/sample.mp3
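Before touching a whole collection it may be worth previewing what would change. A minimal sketch, assuming the installed mid3iconv accepts the --dry-run (-p) and --debug (-d) options, as the mutagen releases I know of do; the preview_conversion wrapper name is hypothetical and not part of any tool:

```shell
#!/bin/bash
# Hypothetical helper: show what mid3iconv would do without modifying the file.
# $1 = assumed source encoding, $2 = path to the MP3 file
preview_conversion() {
    local encoding=$1 file=$2
    # mid3iconv ships with python-mutagen; bail out early if it is missing
    if ! command -v mid3iconv >/dev/null 2>&1; then
        echo "mid3iconv not found (install python-mutagen)" >&2
        return 127
    fi
    # --dry-run: do not write anything; --debug: print the converted tag values
    mid3iconv --dry-run --debug -e "$encoding" "$file"
}
```

Calling preview_conversion "Windows-1251" ~/Music/sample.mp3 should leave the file untouched, so a wrong guess about the source encoding costs nothing.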
Sample bash script
So the task is to iterate over the files and apply mid3iconv to each one. But one should take care of files that already have Unicode tags. Therefore, the following script tries to detect the encoding by means of the enca tool.
#!/bin/bash
# Requirements:
# enca
# mp3info
# python-mutagen (mid3iconv)
#
# Example: bash mp3id3enc.sh --from-code=Windows-1251 --to-code=UTF-8
#
# Splits the first argument by the delimiter specified by the 2nd parameter
# E.g. my_split_str $string $delim
# If the delimiter is not specified, the func defaults to '='
function my_split_str()
{
if (( $# <=1 ))
then
IFS='='
else
IFS=$2
fi
set -- $1
echo $*
}
function usage()
{
echo "Usage: $0 [options]"
echo "Options:
--src-dir Source dir
--auto-detect-enc Whether to try to auto-detect the source encoding. Default: 0
--from-code Source text encoding. Default: Windows-1251
--to-code Target text encoding. Default: UTF-8
--guess-lang Language to use when guessing source text encoding. Default: ru
"
exit 1
}
# Set option defaults
src_dir='./' # source directory
auto_detect_enc=0 # auto detect source encoding
from_code='Windows-1251'
to_code='UTF-8'
guess_lang='ru'
# Get options
for i in "$@"
do
# Split "--name=value" into name and value; the value may contain spaces
o=("${i%%=*}" "${i#*=}")
case ${o[0]} in
--src-dir)
src_dir=${o[1]}
# Add trailing slash, if not added yet
if [[ ${src_dir#${src_dir%?}} != '/' ]]; then src_dir=$src_dir"/"; fi
;;
--from-code)
from_code=${o[1]}
;;
--to-code)
to_code=${o[1]}
;;
--guess-lang)
guess_lang=${o[1]}
;;
--auto-detect-enc)
auto_detect_enc=1
;;
--help)
usage
exit 1
;;
*)
# unknown option
echo "Unknown option ${o[0]}"
usage
exit 2
;;
esac
done
find "$src_dir" -name "*.mp3" | while IFS= read -r FILENAME
do
echo "$FILENAME..."
# Try to detect encoding
if [[ $auto_detect_enc = 1 ]]; then
e=$(mp3info -p "%t" "$FILENAME" | enca -gL "$guess_lang")
if [[ "$e" = *1251* ]]; then
from_code="Windows-1251"
elif [[ "$e" = *866* ]]; then
from_code="CP866"
elif [[ "$e" = *KOI8-R* ]]; then
from_code="KOI8-R"
else
echo "couldn't detect encoding for $FILENAME"
echo "$FILENAME ($e)" >> not-detected.log
continue
fi
echo "detected encoding: $from_code"
fi
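# Check the title encoding; this runs even when auto-detection is off
t=$(mp3info -p "%t" "$FILENAME" | enca -L "$guess_lang")
# For unrecognized encoding assume Windows-1251
if [[ $t = *Unrecognized* ]]; then
t="Windows-1251"
fi
echo "t='$t'"
# Skip files whose title is not in a legacy Cyrillic encoding (e.g. already Unicode)
if [[ $t != *866* && $t != *1251* && $t != *KOI8-R* ]]; then
echo "$FILENAME skipped"
continue
fi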
mid3iconv -e "$from_code" "$FILENAME"
echo
done
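The detection branch of the script can be pulled out into a small function, which makes it easy to try without any MP3 files at hand. This is only a sketch: map_charset is a hypothetical name, and the strings passed to it in the usage lines are made-up stand-ins for detector output, not quotes from enca.

```shell
#!/bin/bash
# Hypothetical helper: map a charset description to the encoding name
# that is later passed to mid3iconv -e. Returns 1 if nothing matches.
map_charset() {
    case $1 in
        *1251*)   echo "Windows-1251" ;;
        *866*)    echo "CP866" ;;
        *KOI8-R*) echo "KOI8-R" ;;
        *)        return 1 ;;
    esac
}

# Usage with made-up descriptions:
map_charset "some text mentioning 1251"       # prints Windows-1251
map_charset "plain UTF-8" || echo "no match"  # prints no match
```

Keeping the patterns in one place means the list of supported encodings is defined once, rather than in two separate checks.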
Note that a loop like
for FILENAME in $(find $src_dir -name "*.mp3"); do
...
done
fails on file names containing spaces and other special characters, because the unquoted command substitution is split into words on whitespace.
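A line-based read handles spaces, but it still trips over file names that contain newlines. A more defensive variant, sketched here under the assumption that find supports -print0 (GNU and BSD find do) and that the shell is bash, passes NUL-terminated names instead. The list_mp3 function name is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print every *.mp3 path under a directory, one per
# line, wrapped in brackets so stray whitespace in names is visible.
list_mp3() {
    local dir=$1 path
    # -print0 ends each name with a NUL byte; read -d '' splits on NUL,
    # IFS= keeps leading/trailing blanks, -r keeps backslashes literal
    find "$dir" -type f -name '*.mp3' -print0 |
    while IFS= read -r -d '' path; do
        printf '[%s]\n' "$path"
    done
}
```

In the real script the printf line would be replaced by the per-file work, such as the mid3iconv call.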
Wonderful! One thing is unclear, though: why does enca say the output is 7-bit ASCII? Is that the same thing as UTF-8?
> mp3info song.mp3 > text
> enca -L russian text
7bit ASCII characters
> Wonderful! One thing is unclear, though: why does enca say the output is 7-bit ASCII? Is that the same thing as UTF-8?
>> mp3info song.mp3 > text
>> enca -L russian text
>7bit ASCII characters
Apparently song.mp3 contains UTF-8 strings that fall entirely within the ASCII range of UTF-8, i.e. every character is stored as a single octet with the same value as in 7-bit ASCII, so the two cannot be told apart. See http://en.wikipedia.org/wiki/UTF-8
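One way to check whether a given tag value is pure ASCII is to count the bytes with the high bit set. A small sketch using only printf, tr and wc; the count_non_ascii_bytes name is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print how many bytes of the argument lie outside
# the 7-bit ASCII range (octal 000-177).
count_non_ascii_bytes() {
    # LC_ALL=C makes tr operate on raw bytes; the final tr strips the
    # padding some wc implementations add around the number
    printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

count_non_ascii_bytes "Hello"   # prints 0
```

A result of 0 means the text is byte-for-byte identical in ASCII and UTF-8, so there is nothing to convert.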