Tag: book

 

Book review: Solr 1.4 Enterprise Search Server

I was recently offered a review copy of Solr 1.4 Enterprise Search Server (thanks to Swati Iyer). Whilst this is most fortuitous, I only wish I’d had this a month or two ago, when I was working fairly heavily on a Solr-based project at $OLDWORK. Still, I’ll be able to judge whether or not the book would have been useful. 🙂

First, some background. Normally, Solr is documented through its wiki. As wikis go, it’s well maintained and informative. But it suffers both from a lack of narrative structure and from being completely online. The latter point really hit home when I was at ApacheCon 2008, in a Solr training class, and couldn’t get at the documentation. So, a book covering Solr has to be a good idea.

Even though this book covers Solr 1.4, most of it is still applicable to earlier versions (my experience is all with 1.3). This is handy, seeing as Solr 1.4 isn’t released yet (and hence not yet in central). Hopefully, it should be any day now, seeing as the version numbers have been bumped in svn (r824893).

The first nice thing about this book is simply that it’s not a massive tome. At only 317pp, it’s really quite approachable. When you open it, the writing is in a friendly, conversational style.

The book starts with a brief introduction to Solr and Lucene, before moving on to installation. One thing I found unusual was the comparisons to relational database technology. These continue in a few places through the book. Perhaps I’m so used to search that I don’t need this. But given that the focus is on “enterprise,” it’s quite likely that’s the best angle to pull in the target audience. The chapter rounds off with a quick walkthrough of loading and querying data. It’s good to see something practical even at this point.

With that out of the way, the discussion moves to the absolute bedrock of Solr: the schema. Defining what data you have and how you want to index and search it is of crucial importance. Particularly useful is the advice to play with Solr’s analysis tool, in order to understand how the fields you define actually work. Whilst the explanations of what the schema is and how to design a good one are clear, it’s still likely that this is a chapter you’ll be revisiting as you get to know both Solr and your data more.

This chapter also introduces the data set you’ll work with through the book: the MusicBrainz data. This isn’t an obvious choice for testing out a search engine (Gutenberg? Shakespeare?), but it is fun. And where it doesn’t fully exercise Solr, this is pointed out.

Next we move on to how to get your data into Solr. This assumes a level of familiarity with the command line, in order to use curl. As well as the “normal” method of POSTing XML documents into Solr, this also covers uploading CSV files and the DataImportHandler. The latter is a contrib module which I hadn’t seen before. This lets you pull your data into Solr (instead of pushing) from any JDBC data source. The only missing thing is something that I spent a while getting right: importing XML data into Solr. There is a confusion which stems from the fact that you can post XML into Solr, but not arbitrary XML. If you want to put an arbitrary XML document in a Solr field, you have to escape it and nest it inside a Solr document. It’s ugly, but can be made to work.
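
To make that escaping step concrete, here is a minimal sketch in Java. It is not taken from the book. The field names id and raw_xml are hypothetical and would have to exist in your own schema; the only Solr-specific part is the add/doc/field wrapper of the XML update format.

```java
final class SolrXmlWrapper {
    /** Escapes the characters that are significant in XML text content. */
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Nests an arbitrary XML payload, escaped, inside a Solr update message.
     * "id" and "raw_xml" are made-up field names for illustration only.
     */
    static String toSolrAdd(String id, String rawXml) {
        return "<add><doc>"
            + "<field name=\"id\">" + escapeXml(id) + "</field>"
            + "<field name=\"raw_xml\">" + escapeXml(rawXml) + "</field>"
            + "</doc></add>";
    }

    public static void main(String[] args) {
        // The payload is itself XML, so it must be escaped before nesting.
        System.out.println(toSolrAdd("doc-1", "<note lang=\"en\">Fish & chips</note>"));
    }
}
```

The resulting string is what you would POST to Solr in place of the raw document. Solr then stores the payload as plain text in that one field.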

Once you’ve got the data in, what about getting it out again? The chapter on “basic querying” covers the myriad ways you can alter Solr’s output, and the basic query stuff is handled well. In particular, it has nice, clear explanations of Solr’s variant of “data structure as XML” as well as the full query syntax. There is also detail on the solrconfig.xml which I completely managed to miss in six months of staring at it. Oh well.

At this point, the book has the basics covered. You could stop here and get along very well with Solr. But this is also the bit where the interesting parts start to appear:

  • There’s coverage of function queries, which allow you to manipulate the rankings of results in various ways (e.g. ranking newer content higher). I confess that the function queries looked interesting, but I haven’t used them and the descriptions in the book swiftly go past my limited maths knowledge.
  • The dismax handler is introduced, which gives a far simpler query interface to your users. This is something I wish I’d paid closer attention to in my last project.
  • Faceting is covered in detail. This is one of Solr’s hidden gems, providing information about the complete set of results without performing a second query. There’s also a nice demonstration of using faceting to back up a “suggestions” mechanism.
  • Highlighting results data. I could have saved a lot of time by reading this.
  • Spellchecking (“did you mean”). Again, the coverage highlights several pitfalls you need to be aware of.
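
As a rough illustration of the faceting point, here is a sketch that builds the query string for a faceted search. The parameters facet and facet.field are Solr’s own. The field name type and the search text are assumptions made up for the example, and the host and request handler are deliberately left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class FacetQuery {
    /** Builds the query-string portion of a faceted Solr search request. */
    static String build(String userQuery, String facetField) {
        try {
            return "q=" + URLEncoder.encode(userQuery, "UTF-8")
                + "&facet=true"                      // switch faceting on
                + "&facet.field=" + URLEncoder.encode(facetField, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "type" is a hypothetical field; use one defined in your schema.
        System.out.println(build("pink floyd", "type"));
    }
}
```

One request built this way returns both the matching documents and the per-value counts for the chosen field, which is the “no second query” property described above.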

Then comes the best surprise of all. A chapter on deployment. So many books forget this crucial step. So, there is coverage of logging, backups, monitoring and security. It might have been nice to also mention integrating it into the system startup sequence.

The remaining chapters cover client integration (with Java, PHP, JavaScript and Rails) and how to scale Solr. Though I never needed the scaling for my project, the advice given is still useful. For example, do you need to make every field stored? (Doing so can increase disk usage.) The coverage of running Solr on EC2 also looked rather useful.

Perhaps the one thing that I’m not entirely happy with is the index (though I acknowledge a good index is hard to achieve). Some common terms I looked up weren’t present.

Overall, I’m really pleased by this book. Given my own experiences figuring out Solr through the school of hard debugging sessions, I can say that this would have made my life a great deal easier. If you want to use Solr, you’ll save yourself time with this book.

Java Concurrency in Practice

I’ve recently purchased a copy of Java Concurrency in Practice. I was always a little bit scared of threaded code, but we have a small amount at work, so I figured I’d better get my head around it. Wow. The big takeaway from the book was that I was nowhere near scared enough. Concurrency is hard.

But it’s not all bad news. Java 5 introduced java.util.concurrent—a set of abstractions for dealing with concurrency at a higher level. Chapters 5 and 6 are an overview of these facilities. There’s extensive coverage of the new data structures like ConcurrentHashMap as well as a very detailed exploration of what the new Executor and ExecutorService can do for you. That’s as well as looking at things like latches, queues and so on.
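
To give a flavour of those building blocks, here is a small sketch (my own, not an example from the book) that counts words across several worker threads. It assumes Java 8 or later for the lambda and for ConcurrentHashMap.merge. The class and method names are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WordTally {
    /** Counts occurrences of each word using the given number of worker threads. */
    static Map<String, Integer> count(String[] words, int workers) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int w = 0; w < workers; w++) {
                final int offset = w;
                pool.execute(() -> {
                    try {
                        // Each worker takes every workers-th word, starting at its offset.
                        for (int i = offset; i < words.length; i += workers) {
                            counts.merge(words[i], 1, Integer::sum); // atomic per key
                        }
                    } finally {
                        done.countDown(); // signal completion even if the loop throws
                    }
                });
            }
            done.await(); // block until every worker has counted down
        } finally {
            pool.shutdown();
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(new String[] {"a", "b", "a", "c", "a"}, 3));
    }
}
```

The map takes care of safe concurrent updates, the latch tells the caller when all the workers are finished, and the executor owns the threads. None of it needs an explicit lock.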

Chapter 7 was a detailed exploration of the issues surrounding Thread.interrupt() and ensuring that your threads can shut down properly. Java’s threading system is cooperative—you can’t stop a thread, you have to ask it to stop. This is harder to get right than it looks. Incidentally, much of the same discussion is on the web as Java theory and practice: Dealing with InterruptedException (by the same author, Brian Goetz).

These chapters were probably the most immediately useful parts of the book for me. I put it down and rewrote the threaded portions of my application using ConcurrentHashMap, ExecutorService and interrupt() for cancellation. And I saw improvements. Where the threads had previously not shut down correctly 100% of the time, they now worked just fine. In general, I feel a lot happier using the higher level abstractions. Or maybe that should be “less uneasy about having cocked it up”.
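
The shape of that kind of cancellation looks roughly like the sketch below. This is an illustration under my own assumptions (Java 8 or later, invented names, a sleep standing in for real work), not the code from my application or from the book.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancellableWorker {
    /** Starts a looping task, asks it to stop, and reports whether it did. */
    static boolean runAndCancel() throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean stoppedOnRequest = new AtomicBoolean(false);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        pool.execute(() -> {
            started.countDown();
            try {
                // Cooperative cancellation: the task itself checks for the request.
                while (!Thread.currentThread().isInterrupted()) {
                    TimeUnit.MILLISECONDS.sleep(10); // stand-in for real work
                }
            } catch (InterruptedException e) {
                // sleep() cleared the flag when it threw, so set it again.
                Thread.currentThread().interrupt();
            }
            stoppedOnRequest.set(true);
        });

        started.await();     // make sure the task is actually running
        pool.shutdownNow();  // interrupts the worker: a request, not a kill
        boolean terminated = pool.awaitTermination(2, TimeUnit.SECONDS);
        return terminated && stoppedOnRequest.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("stopped cleanly: " + runAndCancel());
    }
}
```

The task can be interrupted either while sleeping or between sleeps, and both paths lead to the same clean exit. Swallowing the InterruptedException without restoring the flag is the classic way to get a thread that refuses to shut down.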

Stepping back a moment, the rest of the book is still very useful. The first part is a grounding in both concurrency in general and concurrency in Java. It’s all illustrated with very good code examples [1]. This really taught me about visibility as a threading issue, which I hadn’t previously been aware of at all (and it pointed me straight at a bug in my own code). For example:

  class Foo extends Thread {
    private boolean done = false;
    public synchronized void setDone(boolean done) {
      this.done = done;
    }
    public void run() {
      while (!done) {
        // …
      }
    }
  }

Spot the problem here? Because read access to done isn’t synchronized, an update from another thread might not be noticed. Oops. (And yes, it’s better to implement Runnable than to extend Thread, and yes, I should be using Thread.interrupt() rather than my own flag.)
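
One possible fix, sketched here with invented names, is to make the flag volatile so that a write from one thread is guaranteed to be visible to later reads in another. Synchronizing the read as well as the write would also work.

```java
final class StoppableTask implements Runnable {
    // volatile: a write by one thread is visible to subsequent reads by others.
    private volatile boolean done = false;

    /** Asks the task to finish; safe to call from any thread. */
    void stop() {
        done = true;
    }

    @Override
    public void run() {
        while (!done) {
            Thread.yield(); // stand-in for real work
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StoppableTask task = new StoppableTask();
        Thread worker = new Thread(task);
        worker.start();
        task.stop();
        worker.join(); // returns once run() has observed the flag
        System.out.println("worker stopped");
    }
}
```

A volatile flag is enough here because the only shared state is a single boolean that one thread writes and another reads. Anything involving a check followed by an update would need locking or an atomic class instead.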

Later chapters continue to give excellent, clear advice. I’ve never done (or needed to do) Swing programming, but there appears to be a good discussion of the issues involved in ensuring that you only do GUI actions within the confines of a single thread.

One chapter which really blew me away was the testing chapter. I had absolutely no idea how to go about testing thread safety. The pages were full of clues, however. I confess I haven’t written any tests yet, but I at least have an idea of where to start now…

The later chapters are “advanced topics”. They seemed like cogent explanations, but as an absolute beginner, they weren’t too relevant just yet. I can see I’ll be coming back to the book soon enough, however.

Overall, I reckon this is one of those “must own” books, if you go anywhere near Java and threads.

[1] Nothing is more off-putting than bad code examples. I mean, you’re meant to be an expert, right? Brian outlined how to avoid errors in code on his blog.

Unread Books

I’ve been buying more books than I should have recently, with the result that a number are piling up behind me. Plus there are a few that I’ve not finished for one reason or another.

Now that I’m going to have to minimize my monthly outgoings, I should revisit these instead of purchasing new fare…

65 years of debugging

I’ve recently been plowing through a lot of old Asimov books I had lying around. One that stands out in particular is an anthology of his robot stories. Imagining what 2005 will be like is so much easier with hindsight!

But one group of stories is particularly enthralling to me: Powell & Donovan. Why? Because they’re debuggers. They’re faced with a weird situation and have to figure out not only what’s gone wrong, but why, so that they can fix it. They know about the three laws, and are exploring their unexpected implications and interactions in the real world. This feels exactly like what I do with computers.

Thankfully, the robots they deal with don’t come with a reset switch, which appears to be the limit of many people’s debugging these days. That would have made for a very short story indeed.

O'Reilly Short Cuts

I’ve now purchased a couple of short cuts from O’Reilly: RJS Templates for Rails and Schematron. At $9.99 a pop, they seem to be pretty good value for money (especially with the current GBP to USD rate). Each one is around 50 pages, which is concise enough to be a good introduction to a particular topic without going overboard.

They have a couple of (minor) faults. The biggest is that the PDFs don’t look that good in Preview.app, forcing me to install Adobe Reader. I haven’t looked into why it’s so bad yet. The other is the lack of chapter breaks. The books feel like one continuous block of text (aka a Terry Pratchett novel).

I did notice one or two typos in the Schematron book, but I’ve reported them and hopefully, the PDF can just be regenerated. Yay ebooks!

As to the topics covered, both RJS and Schematron are subjects which are too small to be covered by a full book (200+ pages) yet deserve more attention than a single web article might give them. They’re ideal for this format.

Overall, I’m pretty pleased with these. I’ll likely be going back for more.

Pro JavaScript Techniques

I’ve just finished Pro JavaScript Techniques by John Resig (author of jQuery). It lives up to the name. If you’re a beginner with JavaScript, then look elsewhere. But if you’ve done a little bit before, then this is the book for taking you to the next level.

From the very start, it pulls no punches. It dives right in to talking about Object Oriented JavaScript, a major source of confusion for people arriving from conventional, class-based OO languages. But even in the first few pages, there is already mention of things like Firebug, JSAN and Unobtrusive JavaScript. It’s not just the language; the environment and development tools are described too. And it’s all bang up to date.

I really appreciate the way that (throughout the book) John will show you how to do something, and then show you a library which does the same thing (but without the bugs you will inevitably introduce). It’s the best of both worlds: giving you the understanding of what’s going on, but also the ability to use that knowledge quickly.

Further on in part 3, the DOM is described in exacting detail. Yet it’s still succinct. After walking through the usual suspects (document.getElementById(), document.getElementsByTagName()), he goes on to talk about using CSS Selectors and XPath (using external libraries). This is a good example of taking you a step further than most of the other documentation I’ve seen.

Whilst all this is going on, a small library of utility functions is building up from the examples. Many of these functions are used in later examples. This is something to be aware of if you open the book at random. But the routines are very likely something you will want to reuse in your own work, so I don’t begrudge this at all.

The chapter on events was a real shocker to me. I thought I had a handle on how events worked in JavaScript, until I read the “Event Phases” section. In particular, I didn’t realise the sequence of events when moving a mouse over nested tags. It’s a small matter, but another example of the accuracy of this book.

Further chapters go on to discuss CSS integration, enhancing forms, and Ajax. There are also practical chapters where you build an image gallery (similar to lightbox), enhance a blog, and build an autocomplete for a text field. These culminate in an Ajax wiki. I have to admit, I haven’t tried the code for that yet. But the design decisions seem clearly explained. I’m not 100% sure on the Ruby code I saw, but that’s a minor matter, as it’s not the book’s primary focus.

Personally, the chapter on enhancing forms was of the most practical value to the work I do. It was neat, self-contained and immediately usable.

The final chapter is an exploration of the future of JavaScript. John explores JavaScript 1.6 and 1.7 (available now in Firefox), Web Applications 1.0 and Comet, and what neat things they will bring.

There are three useful appendices: A DOM reference, an events reference and a brief browser overview. I can easily see myself going back to the first two a great deal.

What negative points are there? Not many. The index seemed a little sparse. In particular, the functions that are developed along the way aren’t in there.

I have no complaints at all about the quality of the writing. It’s all clear and eminently readable. Unsurprisingly, this applies to the code as well as the prose.

Overall, this book gets a big thumbs up from me. I’ll be coming back to it regularly. And I will be nudging work to get a copy as well.

TextMate book

Last week, a copy of TextMate: Power Editing for the Mac landed on my desk. TextMate has been a part of my life for about a year now. I can’t be doing without it (“Edit in TextMate” is worth the price of entry alone). I consider myself a reasonably advanced user.

Boy, was I wrong. From the very beginning of the book, I was getting something new off of each page. Not necessarily earth shattering, but a drip, drip, drip of new knowledge about TextMate and how it works. By the end of the chapters on Editing, I felt hugely more confident about using the abilities of TextMate to the full.

The book is split into three main sections: Editing, Automations and Languages. Editing is, well, editing. Do it hard, do it fast. The Automations section takes you to the next level, and lets you customize TextMate to your own unique workflow. From the simple snippet to the do-anything command, it really shows you how to take control.

What I particularly love about TextMate is that unlike vim or Emacs, it doesn’t have an extension language. Instead, you get to use Ruby, or Perl, or /bin/sh. You could use PHP if you prefer. In many ways it feels far more “Unixy” than either vi or Emacs. In fact, it’s much more like sam.

The final section, Languages, should rarely be needed. TextMate comes with built-in support for plenty of languages (and many more are available). But when you need it, you really need it. James Gray walks through the process of adding a simple language (JSON) very effectively, showing you what’s possible in a succinct manner.

In fact, succinctness is present throughout this book. The text is clear and informative, never taking too long to explain, but not leaving you confused either. And at 200 pages, it’s a nice slim tome. You could get through it in a couple of hours, but don’t worry: you’ll be coming back to it.

If you edit text on the Mac, do yourself a favour and get this book.

LibraryThing

A few days ago, somebody at work showed me LibraryThing. It’s the application I’d always wanted to build myself for managing my books. And now somebody’s done it for me. I’ve now chucked all my computer books in and they’re available at: librarything.com/catalog/happygiraffe. By which you can safely conclude that I have spent too much money on books over the years.

The main downside to LibraryThing that I experienced was incredible slowness at times. Today, however, a message is up saying that it’s moving on to new servers. So hopefully a speedier experience is ahead!

Revenge of LoveLock

It’s Brighton Festival time. On Tuesday, I saw James Lovelock speak about his new book “Revenge of Gaia”. I’m new to the whole Gaia thing, but the notion of treating the earth as a holistic system makes a lot of sense to me.

He talked a little bit about Gaia, but much of the talk seemed to be given over to his dislike of wind farms and his pro-nuclear energy stance. Myself, I’m neither anti-nuke nor pro-nuke. He did question much of the evidence regarding nuclear energy that we had been led to believe, claiming that the safety issue of spent nuclear fuel is overrated, and the cost issues of nuclear are mostly because of onerous health & safety legislation.

He quipped that he’d be more than happy to have spent nuclear fuel in his backyard because it might keep the damned wind farm developers off it…

The other main theme was about global warming. Like most scientists, he cheerfully accepts that it’s happening. But his view of what will happen and when seems to be markedly more doom-and-gloom. But he’s still happy, because he just thinks we should prepare more. Oh, and that Britain will probably emerge relatively unscathed because of its location.

Overall, I certainly wasn’t convinced he was correct, but he did give me food for thought. Thanks to the wonders of this internet thingy, I can also read his critics at the same time.

Oh—I didn’t buy the book. I’m not that convinced…

Greasemonkey Hacks

I’ve just gotten a copy of Greasemonkey Hacks and I’m working my way through it. The book itself is great. It’s a really good introduction to Greasemonkey and what you can do with it. Eventually, I’m hoping to make my bank usable in Firefox.

The only quibble that I have is the code samples. Many of them (and they appear to be the ones written by Mark Pilgrim, the author) are really difficult to read, because the idioms are wrong. For example, nearly every for loop I have seen looks like this:

  for (var i = arTableRows.length - 1; i >= 0; i--) {
    ...
  }

Which is walking through the rows of a table, backwards. Why backwards? I have no idea. I would expect it to look more like this:

  for (var i = 0; i < arTableRows.length; i++) {
    ...
  }

Aside from a marginal efficiency gain (which smacks of micro-optimisation), I don’t see what the benefit is.

And “arTableRows” is another quibble with the code. It’s filled with Hungarian notation. Great if you work at Microsoft, but unreadable to the rest of the world.