Tag: perl

 

JavaScript.pm on OSX

Just a quick note… I was looking at RT#48699 when I noticed that MacPorts didn’t have JavaScript.pm in its collection, so I needed to install it by hand. Unfortunately, the latest version (1.12) doesn’t install cleanly.

So I’ve forked it and fixed it (along with a couple of other minor nits).

Claes said he’ll apply the patch at some point. So hopefully when 1.13 comes out, this won’t be necessary.

Of course, really I should get to grips with MacPorts and submit a Portfile.

Mixed Character Encodings

I’ve been given a MySQL dump file at work. It’s got problems — Windows-1252 and UTF-8 characters are mixed in. Bleargh. How can we clean it up to be all UTF-8? Perl to the rescue.

use strict;
use warnings;
use Encode qw( encode decode );
 
# From http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
        |
        [\xc0-\xdf][\x80-\xbf]
        |
        [\xe0-\xef][\x80-\xbf]{2}
        |
        [\xf0-\xf7][\x80-\xbf]{3}
    )
}x;
 
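# Read the dump as bytes, line by line. Valid UTF-8 sequences pass
# through untouched; any other byte is treated as Windows-1252.
while (<>) {
    s{($utf8_char)|(.)}{
        if ( defined $1 )    { $1 }    # already UTF-8: keep as-is
        elsif ( defined $2 ) { encode( "utf8", decode( "cp1252", $2 ) ) }
        else                 { "" }    # never reached: one group always matches
    }ge;
    print $_;
}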

Yes, that’s a regex for matching UTF-8 characters (courtesy of Markus Kuhn). I hadn’t considered using a regex when I first started down this road. I started examining bytes by hand. And the code was about three times longer.

Anyway, this seems to solve the issues I was having.
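The pattern above is deliberately loose: it also accepts overlong forms such as 0xC0 0x80. As a sketch only, here is the same substitution with a stricter pattern built from the Unicode standard's table of well-formed byte sequences, plus a tiny worked input. The names $strict_utf8_char and fix_mixed are made up for illustration, and like the original it can still be fooled by Windows-1252 text that happens to look like valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Well-formed UTF-8 only: no overlong forms, no surrogates, nothing
# above U+10FFFF. (Stricter than the pattern used in the post.)
my $strict_utf8_char = qr{
    (?:
        [\x00-\x7f]
        | [\xc2-\xdf][\x80-\xbf]
        | \xe0[\xa0-\xbf][\x80-\xbf]
        | [\xe1-\xec\xee\xef][\x80-\xbf]{2}
        | \xed[\x80-\x9f][\x80-\xbf]
        | \xf0[\x90-\xbf][\x80-\xbf]{2}
        | [\xf1-\xf3][\x80-\xbf]{3}
        | \xf4[\x80-\x8f][\x80-\xbf]{2}
    )
}x;

# Hypothetical helper: the same substitution as the loop, on one string.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($strict_utf8_char)|(.)}{
        defined $1 ? $1 : encode( "UTF-8", decode( "cp1252", $2 ) )
    }gse;
    return $bytes;
}

# "café" already in UTF-8, then Windows-1252 curly quotes (0x93, 0x94).
my $mixed = "caf\xc3\xa9 \x93hi\x94\n";
my $clean = fix_mixed($mixed);
```

The café bytes come through unchanged, while the two stray quote bytes become the three-byte UTF-8 sequences for U+201C and U+201D.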

mod_perl 1 blows chunks

At $WORK, I’m looking at a web service built on mod_perl 1 / Apache 1. The service takes XML as input and returns XML as output. So far, so good.

Unfortunately, whilst I was testing with curl, I found something odd:

  curl -s -v -T- -H'Content-Type: text/xml;charset=UTF-8' http://localhost/api < input.xml

That -T- says to do a PUT request from stdin. It failed, and my code returned “no input”.

But when I did this, things worked:

  curl -s -v -Tinput.xml -H'Content-Type: text/xml;charset=UTF-8' http://localhost/api

That reads directly from the file. The only difference between the two requests is that the latter includes a Content-Length header whilst the former has Transfer-Encoding: chunked instead.

This is the code that was reading the request.

    my $content;
    if ( $r->header_in( 'Content-Type' ) ) {
        $r->read( $content, $r->header_in( 'Content-Length' ) );
    }
    return $content;

So, if there’s no Content-Length, what should we do? My first stop is always the venerable eagle book. There’s a little footnote next to read():

At the time of this writing, HTTP/1.1 requests which do not have a Content-Length header, such as one that uses chunked encoding, are not properly handled by this API.

Marvellous. Now, I had a look around in the source code and noticed a function called new_read(). Unfortunately, that failed to work. It stopped chunked reads, but failed to work for ordinary ones.

I did see a post on the mod_perl mailing list which reckoned you could loop and read all input. But I was unable to get that to work either.

So I just decided to disallow chunked input. That’s fairly easy to do, and HTTP has a special status code for it: 411 Length Required. It’s not ideal, but unless this project gets upgraded to Apache 2 (unlikely, quite frankly), it seems to be the best option.
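Roughly, the check looks like the sketch below. This is an illustration rather than the production code: read_body is a made-up name, the statuses are plain numbers (a real handler would presumably use the Apache::Constants equivalents), and $r only needs the header_in() and read() methods already used above.

```perl
use strict;
use warnings;

use constant HTTP_OK              => 200;
use constant HTTP_LENGTH_REQUIRED => 411;

# Hypothetical helper: returns ( $status, $content ).
sub read_body {
    my ($r) = @_;

    my $length = $r->header_in('Content-Length');

    # Chunked requests carry no Content-Length, so refuse them outright
    # rather than silently reading nothing.
    return ( HTTP_LENGTH_REQUIRED, undef )
        unless defined $length && $length =~ /^\d+$/;

    my $content = '';
    $r->read( $content, $length ) if $length > 0;
    return ( HTTP_OK, $content );
}
```

The caller can then hand the 411 straight back to the client instead of reporting “no input”.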

for() $DEITY's sake, why?

I used to think I had a reasonable grasp of Perl. Yesterday, I realised I didn’t even understand a basic foreach loop.

  my $val;
  my @values = qw( a b c );
  foreach $val (@values) {
    print $val, "\n";
  }
  print "[end] $val\n";

I reckoned that this should print:

  a
  b
  c
  [end] c

Instead, it prints:

  a
  b
  c
  Use of uninitialized value in concatenation (.) or string at foo.pl line 11.
  [end]

This confused me no end. But it’s actually documented behaviour. From perlsyn:

The foreach loop iterates over a normal list value and sets the variable VAR to be each element of the list in turn. If the variable is preceded with the keyword my, then it is lexically scoped, and is therefore visible only within the loop. Otherwise, the variable is implicitly local to the loop and regains its former value upon exiting the loop. If the variable was previously declared with my, it uses that variable instead of the global one, but it’s still localized to the loop. This implicit localisation occurs only in a foreach loop.

Wow. You really do learn something new every day. I suspect that this is implementation behaviour that was documented after the fact, rather than designed that way.
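If you do want the last value after the loop, the simplest workaround I know of is to copy it into a separate variable that the loop doesn’t localise. A minimal sketch ($last is just a name picked for the example):

```perl
use strict;
use warnings;

my @values = qw( a b c );

my $last;                        # declared outside, never used as the loop variable
foreach my $val (@values) {      # $val is scoped to the loop body
    print $val, "\n";
    $last = $val;                # remember the most recent element
}
print "[end] $last\n";
```

This prints a, b and c, followed by “[end] c”, with no warning.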

London Perl Workshop 2006

Yesterday was the London Perl Workshop, a one-day, two-track conference put together by the London Perl Mongers (in particular, muttley and Greg). I arrived early (mostly due to my lack of faith in train times), so got to put up posters first.

First talk of the day was Jesse Vincent on Jifty, the web application framework originating from within Best Practical. It was a basic introduction—making blog software, of course. Jesse took us through the details quite quickly, but it was immediately clear that very little work needs to be done for a basic CRUD app. It appears to use Mason by default, which I consider a plus point. Overall it felt very railsish, particularly the fact that it has migrations. I love migrations.

After the basic intro, he went over a few advanced features like the continuations support (of which more later) and the developer support. There’s a bunch of stuff in there that I really need to look at nicking for work, like the builtin Mason profiler support and the inline fragment editing, not to mention CSS::Squish and JS::Squish.

On the whole, I’m not sure about Jifty. It looks lovely, and quick to develop in. But that amount of concentrated magic scares me. I need to try out a couple of small applications to get a better feel for it.

I stayed on for the next talk, Mike Astle on wigwam (a deployment tool). I’m interested in doing deployment better, but ultimately, wigwam didn’t seem to offer that much more than what I have at the moment: a way of building packages into a compartmentalised space that can be distributed around different servers. Looking around the audience in the question time, I think I wasn’t alone.

Afterwards, I popped down to see Tom Hukins talk on “Just in time testing”. Sadly, Tom was ill. However, Abigail stepped up and offered to talk about Benchmark.pm instead, highlighting the many ways in which it can be abused. He had culled a number of uses of Benchmark from perlmonks, and demonstrated various flaws, such as not benchmarking the same thing, or trying to benchmark volatile data. The best was when he demonstrated how the compiler had completely optimised away one of the branches. Naturally, executing a statement which has been optimised away is very quick.

My main beef with all of this was simply that the things being benchmarked were phenomenally simple. Really, if you care about map vs grep performance, go write it in C. Otherwise, profile your app long before you start to think about these things. demerphq pointed out that you can end up dying the death of a thousand cuts if you don’t care about some of these little things, however.

Funnily enough, demerphq was speaking next on the changes to the regex engine coming up in Perl 5.10. This was a really deep, informative talk, and I’ll admit to glossing over some parts of it, but there were two things that really stuck out for me:

  • Recursive patterns. This makes it really easy to call back into the regex you’re matching. This is very handy for doing things like matching balanced tags correctly.
  • Named capture groups. Nicked from .NET and Python. This should make large regexes much simpler.
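To give a flavour of those two features, here is a small sketch using the syntax as it shipped in Perl 5.10. The sample strings are invented for the example, not taken from the talk.

```perl
use 5.010;
use strict;
use warnings;

# Recursive pattern: (?1) calls back into capture group 1, so nested
# parentheses are matched as a balanced unit.
my $text = 'f(a, g(b, c), d) tail';
my ($args) = $text =~ m{ ( \( (?: [^()]++ | (?1) )* \) ) }x;

# Named groups: captures are read from %+ by name rather than by position.
my %date;
if ( '2006-12-09' =~ m{ (?<year>\d{4}) - (?<month>\d\d) - (?<day>\d\d) }x ) {
    %date = ( year => $+{year}, month => $+{month}, day => $+{day} );
}

say $args;
say "$date{day}/$date{month}/$date{year}";
```

Here $args ends up holding “(a, g(b, c), d)”, inner parentheses included, and the date fields can be reordered without anyone counting brackets.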

Apart from that, there’s been a whole lot of work to optimize the regex engine, as well as making it properly pluggable. This now leads to the situation of making PCRE truly Perl compatible by embedding it…

After lunch, I listened to Jesse Vincent again, on “Advanced Jifty”. This was basically peeking inside some of the deep magic that’s going on in there. First, Jesse gave an overview of the message bus inside Jifty. The heart of it is IPC::PubSub. Moving on, he peeked inside Template::Declare. This is a bit like markaby. Jesse pointed out a couple of “unusual” implementation details such as local *__ANON__ = $tag; which is an undocumented way of naming auto-generated subroutines so that stack traces make sense. He also presented a quote from Audrey: “we read every bit of perlsub.pod and abused it all”.

Lastly he covered the i18n pragma, which is just filled with scary magic to make ~~'hello world' look up hello world in a message catalog and return a translated version. There’s a great deal of use of overload and overload::constant.

At this point, Jesse started to run out of time, so he rushed through a few other interesting uses of Perl:

  • Using a function called _, which is globally available in all packages.
  • Blessing into a class called 0 in order to return false from ref.
  • Creating an MD5 sum of the call stack. I’m guessing that’s how they implement continuations support.

From one mind-boggling talk to another: Abigail was on next, with “Sudoku by regex”. I won’t begin to pretend to understand what it was all about, except to note that a standard 9×9 Sudoku grid took 1.5 hours to solve in a single regex. Apparently, he’s also been trying out other games in a similar fashion, except that one of them he let run for 2 weeks before pressing ^C.

Alistair McGlinchy talked about “How to make a grumpy network capacity planner happy”. This was a really nice little piece on what his work as a network admin involves, and why developers chew up lots of bandwidth without thinking. He gave a really good overview of HTTP caching / compression, which needs to be more widely known.

Ash Berlin spoke on Angerwhale, which is a blog that doesn’t use a database, but does use Catalyst. Consensus from my bit of the audience: why aren’t you using blosxom? It’s smaller, simpler and works just as well.

Finally, Jos Boumans gave his superb talk on barely legal XXX Perl. It’s a detailed blow by blow account of making Acme::BadExample run, despite all the deviousness contained therein. As with Jesse’s talk, this is scary stuff, but gives a real insight into how to mold Perl to your will.

All in all, a superb day. Interesting people, interesting talks. It was well organised. I’m extremely grateful that it was put on, particularly for free thanks to the sponsors…

Anyway, after the conference, the only natural thing to do was retreat to another pub. I only spent a short while there before departing for the BBC Backstage Xmas bash.

XML::Genx 0.22

I’ve released XML::Genx 0.22. There are no functional changes, just a couple of minor bugfixes in order to ensure that it works on Windows correctly.

For some time now, I’d been trying to get XS modules compiling on Windows correctly under ActiveState Perl, all to no avail. But now, thanks to the wonder of Strawberry Perl, I’ve actually been able to build and test the module all on my own. I am hugely grateful to the authors for putting together Strawberry Perl. It’s a huge boon for developing Perl on Windows (not that I diss ActiveState; they’ve also done a good job, but they’ve gone in different directions).

subatom 0.11

Yet another new version, subatom 0.11, again prompted by Hans F. Nordhaug. The only change this time is to add a feed_title option to the config file, so you can specify the title for the feed as a whole.

Now, I’m going to sit down and attempt to rework all this as a module+script, along with some tests. That I managed to break things in the 0.09 release was very irritating, and the tests should have caught that.

Update: Please grab subatom 0.12 instead when it shows up if you want a working version. Doh.

subatom 0.10

I’ve made another release of subatom. This contains a number of fixes for bugs that I managed to put into the 0.09 release (as well as a couple of minor features). This has really left me with a very nagging need for some tests for this module.

  • Restore the ability to send output to stdout.
  • Make the command line mode work, as well as the config file.
  • Don’t cover up stderr when executing “svn log”.
  • Force subversion to give us back UTF-8, and cope with it.
  • Add support for using --limit if your svn has it.

That item about UTF-8 has annoyed me, because it’s brought me into contact with a hated topic: locales.

Subversion internally works with UTF-8 everywhere. This is a sensible design. But in order to interface with the outside world, it needs to convert that into whatever character encoding you are using. How does it know what character encoding to use? It guesses from the locale.

The locale is meant to be a specification for how to sort characters, how to format the date and so on and so forth. In recent years, it’s also been taken over to specify the character encoding that’s in use. So, to specify that I want to see the world in UTF-8, I need to say:

  export LC_ALL=en_GB.UTF-8

Except that breaks ls(1). Yes, ls(1). For some insane reason, setting a locale changes the way that things work, such that sorting now happens in a case-independent manner. So “README” files no longer appear at the top of a directory listing. I haven’t investigated any further to see what else is broken. I quickly switched back to the “C” locale, which effectively means no locale.

So now, I’m left wondering how to tell the system that I’d like UTF-8, but none of the other inconveniences that locales bring me.
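One possible answer, sketched rather than battle-tested: LC_ALL overrides every other LC_* variable, but the individual categories can be set separately, so LC_CTYPE can ask for UTF-8 while LC_COLLATE stays as “C”. The helper below does that just for a child process such as svn. The name run_with_utf8_ctype is made up, and it assumes the en_GB.UTF-8 locale is installed.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command with UTF-8 character handling but
# plain "C" sorting, and return everything it wrote to stdout.
sub run_with_utf8_ctype {
    my (@cmd) = @_;

    # Work on a copy of the environment so the caller's is untouched.
    local %ENV = %ENV;
    delete $ENV{LC_ALL};                  # would override the settings below
    $ENV{LC_CTYPE}   = 'en_GB.UTF-8';     # assumption: this locale exists
    $ENV{LC_COLLATE} = 'C';               # keep the traditional sort order

    open my $fh, '-|', @cmd or die "Cannot run @cmd: $!";
    local $/;                             # slurp all output at once
    my $out = <$fh>;
    close $fh;
    return $out;
}

# e.g. my $log = run_with_utf8_ctype( 'svn', 'log', $url );
```

The same two variables could equally be exported from a shell startup file in place of LC_ALL.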

subatom 0.09

I’ve released a new version of my tool subatom. If you haven’t seen it, it produces atom feeds for subversion commit messages. It’s pretty handy for monitoring activity in a subversion repository and it doesn’t need access to the server.

There are only two new features in this release:

  • Add in an option to specify link[@rel="self"], which means that the generated feeds can now pass the feedvalidator with flying colours. Many thanks to Hans F. Nordhaug for the patch.
  • I’ve added in a config file. I broke down and did it because I’d ended up with scripts that just called subatom in a variety of ungainly ways. Using a config file makes things slightly more manageable.

However, as with all releases, there are already a couple of problems:

  • Hans found that it doesn’t really cope with character encodings properly. This is particularly shameful for me. So I’ll take a peek at it tonight to ensure that we tell subversion to give us UTF-8 and process that accordingly.
  • Another point brought up by Hans is that it should invoke svn log as svn log --limit. I’d been avoiding that because I’m still on an older version of svn at work, but there’s no reason to not run svn help log and check the output to see if --limit is available.
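The --limit check from the second point could look something like this. It is a sketch, not the code that ended up in subatom: both helper names are invented, and the help text is passed in as a string so the logic can be tried without svn installed.

```perl
use strict;
use warnings;

# Hypothetical helper: does this "svn help log" output mention --limit?
sub help_mentions_limit {
    my ($help_text) = @_;
    return $help_text =~ /--limit\b/ ? 1 : 0;
}

# Hypothetical helper: build the svn log command, adding --limit only
# when the installed svn advertises it.
sub svn_log_command {
    my ( $url, $max, $help_text ) = @_;
    my @cmd = ( 'svn', 'log' );
    push @cmd, '--limit', $max if help_mentions_limit($help_text);
    return ( @cmd, $url );
}

# In real use the help text would come from: my $help = `svn help log`;
```

Older clients simply get the plain svn log invocation, as before.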

gdb ruby $pid

gdb is my “tool of last resort”. When all other online diagnostics have failed, I know enough gdb to pull out a C level stack trace.

  % gdb $SHELL $$
  GNU gdb 6.3.50-20050815 (Apple version gdb-477) (Sun Apr 30 20:06:22 GMT 2006)
  /Users/dom/320: No such file or directory.
  Attaching to program: `/bin/zsh', process 320.
  Reading symbols for shared libraries ........ done
  0x90006108 in syscall ()
  (gdb) where
  #0  0x90006108 in syscall ()
  #1  0x00054558 in signal_suspend ()
  #2  0x0002c9a8 in waitforpid ()
  #3  0x0002cb4c in waitjobs ()
  #4  0x00012584 in execlist ()
  #5  0x00011c24 in execlist ()
  #6  0x00011874 in execode ()
  #7  0x00026890 in loop ()
  #8  0x0002945c in zsh_main ()
  #9  0x00001d14 in start ()

It’s not terribly informative, but in the past, it’s given me just enough of a clue to start looking at the SSL libraries (for example).

Jamis Buck has gone one better: he’s pulled a ruby stacktrace from a running process. Which seems quite magical to me indeed. I also think that you can turn most of what he’s done into a gdb macro. I’ll have to have a look at some examples.

In the past, I’ve resorted to installing a signal handler “just in case” to pull out this sort of information. All of my Perl apps have this in their startup files.

  $SIG{ USR2 } = sub { Carp::cluck("Caught SIGUSR2 in $$") };

Which is all very well, if you know what you need in advance. Which is not usually the case.

Jamis++

Update: Ok, turning it into a gdb macro is dead easy. Save this lot into ~/ruby.gdb:

  # A quick hack to show the environment for a Ruby process.

  define printenv
    set $index = 0
    while environ[$index]
      x/1s environ[$index]
      set $index = $index + 1
    end
  end

  document printenv
    Display the environment for the current process.
  end

  define rb_where
    set $ary = (int)backtrace(-1)
    set $count = *($ary+8)
    set $index = 0
    while $index < $count
      x/1s *((int)rb_ary_entry($ary, $index)+12)
      set $index = $index + 1
    end
  end

  document rb_where
    Show the ruby stacktrace.
  end

To use it, run source ~/ruby.gdb from the gdb session, and then you get two new commands: printenv and rb_where.

Oh yes, I do know how super-trivial all this would be if only I had DTrace. Roll on Leopard.