Categories
Uncategorized

Java Readability

Re: The Positive Legacy of C++ and Java: “Java got this balance correct – Java might be verbose and lack lots of ‘cool’ features, but it’s really easy to figure out what some random code is doing.” (Via infoq)

I often hear that Java is “easy to read.” Don’t kid yourselves. When everything takes so many lines to implement and there are so few opportunities for abstraction available, reading Java code feels like being lost in the snow — everything starts to look the same.

Don’t get me wrong, you can have readable Java. It’s just not the norm. You have to work really hard at it. This isn’t unique to Java though — you can be incomprehensible in any language. Ultimately code readability is more about you as an author than the language.


Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got problems — Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

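use strict;
use warnings;
use Encode qw( encode decode );

# From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
# Matches one UTF-8 byte sequence: ASCII, or a lead byte plus
# the right number of continuation bytes.
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
        |
        [\xc0-\xdf][\x80-\xbf]
        |
        [\xe0-\xef][\x80-\xbf]{2}
        |
        [\xf0-\xf7][\x80-\xbf]{3}
    )
}x;

# Read the dump from STDIN or from files named on the command line.
while (<>) {
    # Keep anything that already looks like UTF-8; treat any other
    # byte as Windows-1252 and re-encode it as UTF-8.
    s{($utf8_char)|(.)}{
        if ( defined $1 )    { $1 }
        elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
        else                 { "" }
    }ge;
    print $_;
}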

Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. I started examining bytes by hand. And the code was about three times longer.

Anyway, this seems to solve the issues I was having.
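To see what the substitution does to a single string, here is a minimal sketch that wraps it in a helper and runs it on a made-up sample. The fix_mixed name and the sample bytes are assumptions for illustration; they are not part of the script I ran on the dump.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Same pattern as in the script above.
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
        |
        [\xc0-\xdf][\x80-\xbf]
        |
        [\xe0-\xef][\x80-\xbf]{2}
        |
        [\xf0-\xf7][\x80-\xbf]{3}
    )
}x;

# Hypothetical helper: repair one byte string and return the result.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($utf8_char)|(.)}{
        defined $1 ? $1 : encode( "utf8", decode( "cp1252", $2 ) )
    }ge;
    return $bytes;
}

# "café" with a UTF-8 é (c3 a9), then "hi" wrapped in
# Windows-1252 curly quotes (93 and 94).
my $mixed = "caf\xc3\xa9 \x93hi\x94";
my $fixed = fix_mixed($mixed);

# Print as hex so the bytes are visible whatever the terminal encoding is.
print join( " ", unpack( "(H2)*", $fixed ) ), "\n";
# prints: 63 61 66 c3 a9 20 e2 80 9c 68 69 e2 80 9d
```

The é passes through untouched, and the two quote bytes become the three-byte UTF-8 sequences for U+201C and U+201D. Two caveats: Windows-1252 text that happens to form a valid UTF-8 sequence (the bytes c3 a9 read as "Ã©", say) cannot be told apart from real UTF-8 and is left alone, and this short pattern also accepts some sequences a strict validator would reject, such as overlong encodings.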