06 November 2010

How to convert MP3 file tag encodings in Ubuntu

Converting ID3 tags for a large number of files turned out to be somewhat troublesome for me, especially when file names contain spaces and other "special" characters. I found suitable tools and managed to convert Windows-1251 MP3 file tags to UTF-8. I am also posting a sample bash script here.

mp3info

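This tool displays and modifies ID3 tags and is quite simple to use. I would have settled for it if it had no problems with long character sequences (it just cuts them unpredictably!). A likely reason: mp3info works with ID3v1 tags, whose text fields are limited to 30 bytes, so a Cyrillic title stored as UTF-8 is cut after about 15 characters.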

mid3iconv

Its job is exactly what I want: it converts ID3 tag encodings to Unicode. Usage is simplicity itself, e.g.
$ mid3iconv -e "Windows-1251" ~/Music/sample.mp3
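
Before rewriting tags in bulk, it can help to preview the result. The sketch below wraps mid3iconv in a hypothetical helper, preview_and_convert (the name is mine, not part of mutagen). It assumes that the -d (--debug, print updated tags) and -p (--dry-run, do not modify files) options listed in the mid3iconv man page are available in your version, and that mid3iconv returns a non-zero status on failure.

```bash
#!/bin/bash
# Hypothetical helper: preview the conversion of each file, then convert it.
# Usage: preview_and_convert ENCODING FILE...
preview_and_convert() {
    encoding=$1
    shift
    for file in "$@"; do
        # -d prints the updated tags, -p (dry run) leaves the file untouched
        if ! mid3iconv -d -p -e "$encoding" "$file"; then
            echo "preview failed, skipping: $file" >&2
            continue
        fi
        # The real conversion; "$file" is quoted so spaces in names survive
        mid3iconv -e "$encoding" "$file"
    done
}

# Example call (commented out so that nothing is modified by accident):
# preview_and_convert "Windows-1251" ~/Music/*.mp3
```

Each file is passed to mid3iconv separately, so one unreadable file does not stop the rest of the batch.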

Sample bash script

So the task is to iterate over each file and apply mid3iconv. But one should take care with files that already have Unicode tags. Therefore, the following script tries to detect the encoding by means of the enca tool.
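
enca is one way to spot tags that are already Unicode. As a complementary check (a different technique from the one the script uses, and it assumes iconv is installed), a string can be tested for being valid UTF-8 by asking iconv to convert it from UTF-8 to UTF-8. Legacy single-byte Cyrillic text almost never forms valid UTF-8 sequences, so it is rejected. The helper name is_valid_utf8 is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if the argument is a valid UTF-8 byte sequence.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

utf8_title='Привет'                       # UTF-8 text (assumes this file is saved as UTF-8)
cp1251_title=$(printf '\317\360\350')     # the bytes of "При" in Windows-1251

for title in "$utf8_title" "$cp1251_title"; do
    if is_valid_utf8 "$title"; then
        echo "valid UTF-8, leave the tag alone"
    else
        echo "not UTF-8, candidate for mid3iconv"
    fi
done
```

Plain ASCII titles also pass this check, because ASCII is a subset of UTF-8.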

#!/bin/bash
# Converts ID3 tag encodings of all MP3 files found under a directory.
# Requirements:
# enca
# mp3info
# python-mutagen (mid3iconv)
#
# Example: bash mp3id3enc.sh --from-code=Windows-1251 --to-code=UTF-8

# Splits the first argument by the delimiter specified by the 2nd parameter
# E.g. my_split_str $string $delim
# If the delimiter is not specified, the func defaults to '='
function my_split_str()
{
    if (( $# <=1 )) 
    then 
        IFS='=' 
    else 
        IFS=$2 
    fi 
    set -- $1
    echo $*
}

function usage()
{
    echo "Usage: $0 [options]"
    echo "Options:
    --src-dir=DIR           Source dir. Default: ./
    --auto-detect-enc       Whether to try to auto-detect the source encoding. Default: 0
    --from-code=ENC         Source text encoding. Default: Windows-1251
    --to-code=ENC           Target text encoding. Default: UTF-8
    --guess-lang=LANG       Language to use when guessing the source text encoding. Default: ru
    "
    exit 1
}

# Set option defaults
src_dir='./'                # source directory
auto_detect_enc=0           # auto detect source encoding
from_code='Windows-1251'
to_code='UTF-8'             # not used below: mid3iconv always converts to Unicode
guess_lang='ru'

# Get options
for i in "$@"
do
  # Split "--name=value" at the first '='; quoting keeps values with spaces intact
  o=("${i%%=*}" "${i#*=}")
  case ${o[0]} in
    --src-dir)
        src_dir=${o[1]}
        # Add trailing slash, if not added yet
        if [[ $src_dir != */ ]]; then src_dir=$src_dir"/"; fi
      ;;
    --from-code)
      from_code=${o[1]}
      ;;
    --to-code)
      to_code=${o[1]}
      ;;
    --guess-lang)
      guess_lang=${o[1]}
      ;;
    --auto-detect-enc)
      auto_detect_enc=1
      ;;
    --help)
        usage
        exit 1
      ;;
    *)
      # unknown option
      echo "Unknown option ${o[0]}"
      usage
      exit 2
      ;;
  esac
done

find "$src_dir" -type f -name "*.mp3" -print0 | while IFS= read -r -d '' FILENAME
do
    echo "$FILENAME..."

    # Try to detect encoding
    if [[ $auto_detect_enc = 1 ]]; then
        e=$(mp3info -p "%t" "$FILENAME" | enca -gL "$guess_lang")

        if [[ "$e" = *1251* ]]; then
            from_code="Windows-1251"
        elif [[ "$e" = *866* ]]; then
            from_code="CP866"
        elif [[ "$e" = *KOI8-R* ]]; then
            from_code="KOI8-R"
        else
            echo "couldn't detect encoding for $FILENAME"
            echo "$FILENAME ($e)" >> not-detected.log
            continue
        fi

        echo "detected encoding: $from_code"
    fi

    t=$(mp3info -p "%t" "$FILENAME" | enca -L "$guess_lang")

    # For unrecognized encoding assume Windows-1251
    if [[ $t = *Unrecognized* ]]; then
        t="Windows-1251"
    fi
    echo "t='$t'"
    # Skip tags that are not in CP866 or Windows-1251 (e.g. already Unicode)
    if [[ $t != *866* && $t != *1251* ]]; then
        echo "$FILENAME skipped"
        continue
    fi

    mid3iconv -e "$from_code" "$FILENAME"

    echo
done
Note that a loop like

for FILENAME in $(find $src_dir -name "*.mp3"); do
...
done
fails with file names containing spaces and other special characters, because the shell splits the output of find on whitespace.
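
The difference is easy to reproduce in a scratch directory. The sketch below counts the MP3 files that each loop style sees. The helper names count_with_for and count_with_read are hypothetical, and the sketch assumes the temporary directory path itself contains no spaces.

```bash
#!/bin/bash
# Counts loop iterations when the output of find is word-split by the shell.
count_with_for() {
    n=0
    for f in $(find "$1" -name "*.mp3"); do
        n=$((n + 1))
    done
    echo "$n"
}

# Counts loop iterations when each line of find output is read as a whole.
count_with_read() {
    find "$1" -name "*.mp3" | {
        n=0
        while IFS= read -r f; do
            n=$((n + 1))
        done
        echo "$n"
    }
}

demo_dir=$(mktemp -d)
touch "$demo_dir/first song.mp3" "$demo_dir/second.mp3"

echo "for loop saw:   $(count_with_for "$demo_dir") names"
echo "while read saw: $(count_with_read "$demo_dir") names"

rm -r "$demo_dir"
```

With two files, one of which has a space in its name, the for loop reports more than two names while the read loop reports exactly two. Reading line by line is enough for spaces; names that contain newlines need find -print0 together with bash's read -d ''.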

2 comments:

  1. Great! The only thing I don't understand is why enca says the output is 7-bit ASCII. Is that the same thing as UTF-8?

    > mp3info song.mp3 > text
    > enca -L russian text
    7bit ASCII characters

  2. > Great! The only thing I don't understand is why enca says the output is 7-bit ASCII. Is that the same thing as UTF-8?

    >> mp3info song.mp3 > text
    >> enca -L russian text
    >7bit ASCII characters

    Apparently, song.mp3 contains UTF-8 strings that fall within the ASCII range of the UTF-8 encoding, i.e. it contains 8-bit octets. See http://en.wikipedia.org/wiki/UTF-8
