Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got problems ā€” Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

use Encode qw( encode decode );

# From
my $utf8_char = qr{

while () {
        if ( defined $1 )    { $1 }
        elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
        else                 { "" }
    print $_;

Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. I started examining bytes by hand. And the code was about three times longer.

Anyway, this seems to solve the issues I was having.

4 replies on “Mixed Character Encodings”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s