Jabbering Giraffe

Go Strings

I’ve been looking at Go recently. It’s a pleasant language, with few surprises. However, I wondered (as always) what the encoding of a string is supposed to be.

So how does Go deal with characters in strings? Given that two of Go’s designers, Rob Pike and Ken Thompson, also invented UTF-8, we can hope it’s been thought about.

There are three types to think about.

[]byte
A slice of bytes.
string
A (possibly empty) sequence of bytes. Strings are immutable.
rune
A single Unicode code point (an alias for int32). Written as a character in single quotes, such as 'é'.
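To make the three types concrete, here is a minimal sketch. The helper name describe is made up for illustration, and é is written as the escape \u00e9 so the bytes don’t depend on how the source file was saved.

```go
package main

import "fmt"

// describe is a hypothetical helper: it returns the byte length of s,
// its raw bytes, and its code points.
func describe(s string) (int, []byte, []rune) {
	return len(s), []byte(s), []rune(s)
}

func main() {
	var r rune = '\u00e9' // rune literal: a single code point
	s := "\u00e9"         // string literal: stored as the bytes c3 a9

	n, b, rs := describe(s)
	fmt.Printf("%T %U\n", r, r) // int32 U+00E9
	fmt.Println(n)              // 2
	fmt.Printf("% x\n", b)      // c3 a9
	fmt.Printf("%U\n", rs)      // [U+00E9]
}
```

One character, one rune, two bytes: the string and the byte slice see the encoded form, while the rune sees the code point.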

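None of these types carries an explicit encoding. Nonetheless, there’s an implicit preference for UTF-8: Go source files are UTF-8, so string literals hold UTF-8 bytes, and converting between string and []rune encodes or decodes UTF-8.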
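One place the UTF-8 preference shows up is in the built-in conversions between strings and rune slices. A rough sketch, with encode and decode as made-up wrapper names around those conversions:

```go
package main

import "fmt"

// encode converts code points to a string; the conversion emits UTF-8.
func encode(rs []rune) string { return string(rs) }

// decode converts a string to code points; the conversion reads the
// bytes as UTF-8.
func decode(s string) []rune { return []rune(s) }

func main() {
	s := encode([]rune{'c', 'a', 'f', 0xE9})
	fmt.Printf("% x\n", s)        // 63 61 66 c3 a9
	fmt.Printf("%U\n", decode(s)) // [U+0063 U+0061 U+0066 U+00E9]

	// A string may hold bytes that are not valid UTF-8. Decoding maps
	// each offending byte to U+FFFD, the replacement character.
	fmt.Printf("%U\n", decode("\xff")) // [U+FFFD]
}
```

Nothing stops a string from holding arbitrary bytes, but the conversions only make sense of them as UTF-8.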

But this doesn’t help the common case of asking how long a string is, because len counts bytes, not characters:

package main

import "fmt"

func main() {
  s := "café"
  fmt.Printf("%q has length %d\n", s, len(s))
}

// "café" has length 5

The unicode/utf8 package can do what’s needed though. This provides functions for, amongst other things, picking runes out of strings.

package main

import (
  "fmt"
  "unicode/utf8"
)

func main() {
  s := "café"
  fmt.Printf("%q has length %d\n", s, utf8.RuneCountInString(s))
}

// "café" has length 4
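Counting is only one of the things unicode/utf8 offers. As a sketch of “picking runes out of strings”, utf8.DecodeRuneInString returns the first rune of a string along with its width in bytes, which is enough to walk a string by hand. The function names firstRune and runes below are my own.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune returns the first rune of s and its width in bytes.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

// runes walks s by decoding one rune at a time and skipping past it.
func runes(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:]
	}
	return out
}

func main() {
	r, size := firstRune("\u00e9t\u00e9")
	fmt.Printf("%q is %d bytes wide\n", r, size) // 'é' is 2 bytes wide
	fmt.Printf("%q\n", runes("caf\u00e9"))       // ['c' 'a' 'f' 'é']
}
```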

This is very Go-like. The default is somewhat low-level, but the types and libraries build on top of it. For example, text/scanner provides a nice way of iterating over runes in a UTF-8 input stream.
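Here’s roughly what iterating over runes with text/scanner looks like. This is a minimal sketch that reads from a string via strings.NewReader; any io.Reader would do, and scanRunes is just a name I picked.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// scanRunes reads src through a scanner.Scanner one rune at a time,
// stopping when Next reports scanner.EOF.
func scanRunes(src string) []rune {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))

	var out []rune
	for r := s.Next(); r != scanner.EOF; r = s.Next() {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", scanRunes("caf\u00e9")) // ['c' 'a' 'f' 'é']
}
```

The caller never touches the underlying bytes; the scanner does the UTF-8 decoding as it reads.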

On a whim, I took a look at the internals of utf8.RuneCountInString(). It’s deceptively simple.

func RuneCountInString(s string) (n int) {
  for _ = range s {
    n++
  }
  return
}

This relies on the spec defining how a string interacts with a for ... range loop: the loop decodes the string as UTF-8 and yields one Unicode code point (rune) per iteration, rather than one byte.
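A quick sketch makes the behaviour visible. The index that range hands back is the byte offset at which each rune starts, so it skips ahead after a multi-byte character. The helper offsets is hypothetical, there only to collect those indices.

```go
package main

import "fmt"

// offsets returns the byte index at which each rune of s starts.
func offsets(s string) []int {
	var out []int
	for i := range s {
		out = append(out, i)
	}
	return out
}

func main() {
	s := "caf\u00e9!"
	for i, r := range s {
		fmt.Printf("%d:%q ", i, r)
	}
	fmt.Println()
	// 0:'c' 1:'a' 2:'f' 3:'é' 5:'!'

	fmt.Println(offsets(s)) // [0 1 2 3 5]
}
```

There is no index 4 in the output, because that byte is the second half of é.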