I’ve been looking at Go recently. It’s a pleasant language, with few surprises. However, I wondered (as always) what the encoding of a string is supposed to be. For example:
- Python 2 has two string types: str and unicode. Python 3 has sensibly renamed these to bytes and str, respectively.
- Perl has a magic bit which gets set to state that the string contains characters as opposed to bytes (it’s called the UTF-8 bit, but it means characters).
So how does Go deal with characters in strings? Given that the authors of Go also invented UTF-8, we can hope it’s been thought about.
There are three types to think about.
[]byte
- A slice of bytes.
string
- A (possibly empty) sequence of bytes. Strings are immutable.
rune
- A single Unicode code point. Rune literals are written as characters in single quotes.
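To make those concrete, a small sketch (the comments note what each value is; the byte values here are arbitrary ASCII, nothing encoding-specific):

```go
package main

import "fmt"

func main() {
	b := []byte{104, 105} // a slice of bytes
	s := string(b)        // an immutable string built from those bytes ("hi")
	r := 'h'              // a rune literal: a single code point
	fmt.Println(b, s, r)  // [104 105] hi 104
}
```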
There’s no explicit encoding in the above. Nonetheless, there’s an implicit preference for UTF-8:
- Source code is assumed to be UTF-8
- The built-in string conversions (from an integer code point to string, and from string to []rune) assume UTF-8; see the sketch below.
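A minimal sketch of both conversions:

```go
package main

import "fmt"

func main() {
	// Converting a code point to string produces its UTF-8 encoding.
	fmt.Printf("%q\n", string(rune(0xe9))) // "é" (two bytes: 0xc3 0xa9)

	// Converting a string to []rune decodes its UTF-8 bytes into code points.
	fmt.Println([]rune("café")) // [99 97 102 233]
}
```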
But this doesn’t help the common case:
```go
package main

import "fmt"

func main() {
	s := "café"
	fmt.Printf("%q has length %d\n", s, len(s))
}
// "café" has length 5
```
The unicode/utf8 package can do what's needed, though. This provides functions for, amongst other things, picking runes out of strings.
```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "café"
	fmt.Printf("%q has length %d\n", s, utf8.RuneCountInString(s))
}
// "café" has length 4
```
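Among those functions is utf8.DecodeRuneInString, which picks a rune off the front of a string. A sketch of walking a string by hand with it:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "café"
	for i := 0; i < len(s); {
		// Decode the rune starting at byte offset i, and its width in bytes.
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("byte offset %d: %c (%d byte(s))\n", i, r, size)
		i += size
	}
}
```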
This is very Go-like. The default is somewhat low-level, but the types and libraries build on top of it. For example, text/scanner provides a nice way of iterating over runes in a UTF-8 input stream.
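A minimal sketch of that, using Scanner.Next to pull one rune at a time until scanner.EOF:

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

func main() {
	var s scanner.Scanner
	s.Init(strings.NewReader("café"))
	for r := s.Next(); r != scanner.EOF; r = s.Next() {
		fmt.Printf("%c ", r) // c a f é
	}
}
```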
On a whim, I took a look at the internals of utf8.RuneCountInString(). It's deceptively simple.
```go
func RuneCountInString(s string) (n int) {
	for _ = range s {
		n++
	}
	return
}
```
This relies on the spec defining how a string interacts with a for loop: ranging over a string is defined as iterating over its Unicode code points (runes), decoding the UTF-8 as it goes.
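Which means rune-by-rune iteration is spelled like this; note that the loop index is the byte offset of each rune, not a rune count:

```go
package main

import "fmt"

func main() {
	for i, r := range "café" {
		fmt.Printf("byte offset %d: %c\n", i, r)
	}
	// byte offset 0: c
	// byte offset 1: a
	// byte offset 2: f
	// byte offset 3: é
}
```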