Encoding, Strings and Runes in Go

Character sets and encoding

ASCII is an encoding in which each character occupies 1 byte (8 bits), although only 7 bits are needed to cover its 128 values (0-127).

Unicode maps each character to a code point, a number conventionally written in hex: for example, $ maps to U+0024. How those numbers are stored in memory depends on the encoding.

UTF-8 is an encoding where code points 0-127 occupy 1 byte (so they are compatible with ASCII) and code points above 127 occupy 2 to 4 bytes.
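As a rough illustration of those byte counts, the sketch below prints the UTF-8 bytes of four code points, one for each possible width. The helper name utf8Bytes is made up for this example; it relies only on Go's built-in conversions between rune, string and []byte.

```go
package main

import "fmt"

// utf8Bytes is a hypothetical helper: it returns the UTF-8 bytes
// that encode the code point r.
func utf8Bytes(r rune) []byte {
	return []byte(string(r))
}

func main() {
	// '$', 'é', '❗' and '😀', written as escapes to keep the source ASCII-only.
	for _, r := range []rune{'$', '\u00e9', '\u2757', '\U0001F600'} {
		fmt.Printf("%U -> % x\n", r, utf8Bytes(r))
	}
	// Output:
	// U+0024 -> 24
	// U+00E9 -> c3 a9
	// U+2757 -> e2 9d 97
	// U+1F600 -> f0 9f 98 80
}
```

Only the first code point falls in the 0-127 range, so only it is stored as a single ASCII-compatible byte.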

Strings in Go

Strings in Go are read-only slices of bytes.

  • byte is an alias for uint8 (0 to 255)
b := "abc"[0]   // yields the byte value 'a' (97)
s := "abc"[0:2] // yields the string "ab"

Go source code is UTF-8, so a string literal such as str := "abc" contains UTF-8 text, unless it uses byte-level escapes, as in str := "\xbd\xb2\x3d\xbc". In other words, strings can contain arbitrary bytes, but when constructed from string literals without such escapes, those bytes are valid UTF-8.
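One way to see the difference is to ask the standard unicode/utf8 package whether a string holds valid UTF-8. This is a small sketch; the describe helper is a hypothetical name introduced only for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it shows the raw bytes of s in hex
// and whether they form valid UTF-8.
func describe(s string) string {
	return fmt.Sprintf("% x valid=%t", s, utf8.ValidString(s))
}

func main() {
	fmt.Println(describe("abc"))              // 61 62 63 valid=true
	fmt.Println(describe("\xbd\xb2\x3d\xbc")) // bd b2 3d bc valid=false
}
```

The second string compiles and behaves like any other string; it just does not decode to meaningful text, because 0xbd cannot start a UTF-8 sequence.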

Runes in Go

Go has a type rune that represents a code point. As discussed above, a UTF-8-encoded code point can occupy from 1 to 4 bytes, so rune is sized to hold any code point:

  • rune is an alias for int32, which is 4 bytes
chr := '❗' // rune type, int32, holds a unicode code point.
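A short sketch can confirm that a rune literal is just an integer. The info helper below is a hypothetical name for this example; utf8.RuneLen is the standard-library function that reports how many bytes a rune needs when encoded as UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// info is a hypothetical helper: it reports the Go type of a rune, its
// decimal value, its code point, and its encoded width in bytes.
func info(r rune) string {
	return fmt.Sprintf("%T %d %U %d", r, r, r, utf8.RuneLen(r))
}

func main() {
	chr := '\u2757'        // the same rune as '❗'
	fmt.Println(info(chr)) // int32 10071 U+2757 3
	fmt.Println(info('a')) // int32 97 U+0061 1
}
```

The rune value always takes 4 bytes in memory, while its UTF-8 encoding inside a string takes only as many bytes as the code point requires.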

A for range loop over a string decodes one UTF-8-encoded rune on each iteration; in other words, it iterates over runes, not bytes:

str := "❗hello"
fmt.Println("Length of the string is:", len(str))
for index, char := range str {
    fmt.Printf("Index: %d\tCharacter: %c\tCode Point: %U\n", index, char, char)
}

Outputs:
Length of the string is: 8 // 8 bytes, but 6 runes
Index: 0 Character: ❗  Code Point: U+2757 // occupies 3 bytes, so the next index is 3
Index: 3 Character: h  Code Point: U+0068
Index: 4 Character: e  Code Point: U+0065
Index: 5 Character: l  Code Point: U+006C
Index: 6 Character: l  Code Point: U+006C
Index: 7 Character: o  Code Point: U+006F
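The gap between 8 bytes and 6 runes can also be checked directly. The sketch below assumes a hypothetical helper named counts; len, utf8.RuneCountInString and utf8.DecodeRuneInString are standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper returning the length of s in bytes
// and the number of runes it decodes to.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	str := "\u2757hello" // the same text as "❗hello"

	nBytes, nRunes := counts(str)
	fmt.Println(nBytes, nRunes) // 8 6

	// Indexing yields a single byte, not a character.
	fmt.Printf("%x\n", str[0]) // e2

	// DecodeRuneInString decodes the first rune and reports its width.
	first, size := utf8.DecodeRuneInString(str)
	fmt.Printf("%U %d\n", first, size) // U+2757 3
}
```

This is why the range loop jumps from index 0 straight to index 3: the index is a byte offset, and the first rune is 3 bytes wide.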

Reverse string exercise:

func reverse(s string) string { // strings are immutable
    runes := []rune(s)          // convert string to runes
    for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
        runes[i], runes[j] = runes[j], runes[i] // swap runes
    }
    return string(runes)        // convert runes back to string
}

Summary:

  • Unicode describes a list of code points
  • UTF-8 is an encoding in which a code point occupies 1-4 bytes
  • Go's string is a read-only slice of bytes (each element is 1 byte)
  • Go's rune is 4 bytes (enough to fit any code point)

References: