Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got a problem: Windows-1252 and UTF-8 characters are mixed together in the same file. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

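use strict;
use warnings;
use Encode qw( encode decode );

# From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
# Matches one UTF-8 byte sequence. The pattern is loose: it also accepts
# overlong forms and surrogates, which is good enough for this cleanup.
my $utf8_char = qr{
    (?:
        [\x00-\x7f]                  # ASCII
        |
        [\xc0-\xdf][\x80-\xbf]       # 2-byte sequence
        |
        [\xe0-\xef][\x80-\xbf]{2}    # 3-byte sequence
        |
        [\xf0-\xf7][\x80-\xbf]{3}    # 4-byte sequence
    )
}x;

# Input is read as raw bytes (no I/O layers), one line at a time.
while (<>) {
    s{($utf8_char)|(.)}{
        # Keep anything that already looks like UTF-8.
        if ( defined $1 )    { $1 }
        # Treat any other byte as Windows-1252 and re-encode it as UTF-8.
        elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
        else                 { "" }
    }ge;
    print $_;
}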

Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. Instead, I began by examining bytes by hand, and that code was about three times longer.
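To see what the substitution does to an actual string, here is a minimal sketch that wraps the same idea in a function and runs it on a small mixed sample. The helper name fix_mixed and the sample bytes are made up for illustration; they are not part of the original script. One caveat worth keeping in mind: a run of Windows-1252 bytes that happens to look like a valid UTF-8 sequence (for example C3 A9) is kept as UTF-8, since the two cases can't be told apart at the byte level.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Same loose pattern as in the script above: one UTF-8 byte sequence.
my $utf8_seq = qr{
    (?:
        [\x00-\x7f]
        | [\xc0-\xdf][\x80-\xbf]
        | [\xe0-\xef][\x80-\xbf]{2}
        | [\xf0-\xf7][\x80-\xbf]{3}
    )
}x;

# fix_mixed() is a hypothetical helper name used only for this sketch.
# It takes a byte string and returns a byte string.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($utf8_seq)|(.)}{
        if ( defined $1 ) { $1 }    # already looks like UTF-8: keep it
        else {
            my $byte = $2;          # copy out of the match variable
            encode( "UTF-8", decode( "cp1252", $byte ) );
        }
    }gse;
    return $bytes;
}

# "caf" + UTF-8 e-acute (C3 A9), a space, then a lone Windows-1252
# right single quote (92).
my $mixed = "caf\xc3\xa9 \x92";
my $fixed = fix_mixed($mixed);

# Print the result as hex so the change is visible on any terminal.
print join( " ", map { sprintf "%02x", ord } split //, $fixed ), "\n";
# prints: 63 61 66 c3 a9 20 e2 80 99
```

The existing C3 A9 pair passes through untouched, while the stray 92 byte becomes E2 80 99, the UTF-8 encoding of U+2019.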

Anyway, this seems to solve the issues I was having.

4 Comments to Mixed Character Encodings

  1. dom says:

    Picky, picky, but yes.

  2. Mark Fowler says:

    When you say “UTF-8 character” you mean “UTF-8 byte sequence”, right?

  3. dom says:

    Thank you, Aristotle. That’s a lovely solution!