I’ve been given a MySQL dump file at work. It’s got problems — Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.
    use Encode qw( encode decode );

    # From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
    my $utf8_char = qr{
        (?: [\x00-\x7f]
          | [\xc0-\xdf][\x80-\xbf]
          | [\xe0-\xef][\x80-\xbf]{2}
          | [\xf0-\xf7][\x80-\xbf]{3} )
    }x;

    while (<>) {
        # Keep anything that already looks like UTF-8 ($1); treat any other
        # single byte ($2) as Windows-1252 and re-encode it as UTF-8.
        s{($utf8_char)|(.)}{
            if ( defined $1 ) { $1 }
            elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
            else { "" }
        }ge;
        print $_;
    }
Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. I started examining bytes by hand. And the code was about three times longer.
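To see how the regex carves up the input, here is a small sketch that labels each chunk of a byte string as either a well-formed UTF-8 sequence or a stray byte. The helper name classify and the sample bytes are made up for illustration; the pattern is the same one as above.

```perl
use strict;
use warnings;

# Same pattern as in the script above (after Markus Kuhn).
my $utf8_seq = qr{
    (?: [\x00-\x7f]
      | [\xc0-\xdf][\x80-\xbf]
      | [\xe0-\xef][\x80-\xbf]{2}
      | [\xf0-\xf7][\x80-\xbf]{3} )
}x;

# Hypothetical helper: walk a byte string from left to right and report
# each chunk as "utf8:<hex>" (matched the pattern) or "stray:<hex>".
sub classify {
    my ($bytes) = @_;
    my @chunks;
    while ( $bytes =~ m{\G(?:($utf8_seq)|(.))}gs ) {
        push @chunks, defined $1
            ? 'utf8:' . unpack( 'H*', $1 )
            : 'stray:' . unpack( 'H*', $2 );
    }
    return @chunks;
}
```

For example, classify("a\xc3\xa9\x93") reports the ASCII "a" and the two-byte sequence c3 a9 as UTF-8, and the lone 0x93 (a curly quote in Windows-1252) as a stray byte. Note that the pattern only checks the shape of the bytes, so Windows-1252 text that happens to look like a valid sequence is left alone.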
Anyway, this seems to solve the issues I was having.
Picky, picky, but yes.
When you say “UTF-8 character” you mean “UTF-8 byte sequence”, right?
Thank you, Aristotle. That’s a lovely solution!
Not lazy enough. Let Encode do the work: "Repairing broken documents that mix UTF-8 and ISO-8859-1".
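For the curious, here is a minimal sketch of what "let Encode do the work" could look like. It is my reading of the suggestion, not a copy of the linked post: the helper name fix_mixed is hypothetical, and I have assumed cp1252 as the fallback encoding to match the dump described in this post, where the linked title talks about ISO-8859-1. It relies on the documented behaviour of Encode's FB_QUIET check mode, where decode() returns what it managed to process and leaves the unprocessed remainder in the source buffer.

```perl
use strict;
use warnings;
use Encode qw( decode encode FB_QUIET );

# Hypothetical helper: turn a byte string that mixes UTF-8 and cp1252
# into a Perl character string.
sub fix_mixed {
    my ($buf) = @_;
    my $out = '';
    while ( length $buf ) {
        # Decode as much well-formed UTF-8 as possible. With FB_QUIET,
        # decode() stops at the first malformed byte and leaves
        # everything from that byte onward in $buf.
        $out .= decode( 'UTF-8', $buf, FB_QUIET );

        # Assumption: the byte that stopped the decoder is cp1252.
        # Remove it from the buffer and decode it on its own.
        $out .= decode( 'cp1252', substr( $buf, 0, 1, '' ) ) if length $buf;
    }
    return $out;
}
```

Used as a filter, this would be something like: print encode( 'UTF-8', fix_mixed($_) ) while <>; The trade-off compared with the regex is that Encode decides what counts as valid UTF-8, so overlong or otherwise malformed sequences are rejected instead of being passed through.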