Encoding, Strings and Runes in Go
Character sets and encoding
ASCII is an encoding in which each character occupies 1 byte (8 bits), although only 7 bits are actually needed, since the values range from 0 to 127.
Unicode maps each character to a code point, for example $ to U+0024. A code point is a number, conventionally written in hex. How that number is stored in memory depends on the encoding.
UTF-8 is an encoding where code points from 0 to 127 occupy 1 byte (so they are compatible with ASCII) and code points above 127 occupy 2 to 4 bytes.
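To make the variable width concrete, here is a minimal sketch that prints how many bytes UTF-8 needs for a few code points. The helper name encodedLen is made up for this example; it only wraps utf8.RuneLen from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs to encode the code point r.
// Hypothetical helper for illustration; it simply calls utf8.RuneLen.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample code point for each UTF-8 width: 1, 2, 3 and 4 bytes.
	for _, r := range []rune{'$', 'é', '❗', '😀'} {
		// "% x" prints the encoded bytes in hex, separated by spaces.
		fmt.Printf("%c  %U  %d byte(s)  % x\n", r, r, encodedLen(r), []byte(string(r)))
	}
}
```

The ASCII character $ is stored as the single byte 24 (hex), exactly as it would be in ASCII, while the other characters need progressively more bytes.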
Strings in Go
Strings in Go are read-only slices of bytes.
byte is an alias for uint8 (0 to 255).
b := "abc"[0] // yields the byte value 'a' (97)
s := "abc"[0:2] // yields the string "ab" (the end index is exclusive)
Go source code is UTF-8, so string literals contain UTF-8 text, as in str := "abc", unless they contain byte-level escapes, as in str := "\xbd\xb2\x3d\xbc". In other words, strings can contain arbitrary bytes, but when constructed from string literals without byte-level escapes, those bytes are UTF-8.
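As a small sketch of the difference, the program below checks both kinds of literal with utf8.ValidString. The wrapper name isValidText is an assumption made for this example and is not part of any library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidText reports whether s consists entirely of valid UTF-8.
// Hypothetical helper for illustration; it simply calls utf8.ValidString.
func isValidText(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	text := "abc"             // plain literal: the bytes are UTF-8
	raw := "\xbd\xb2\x3d\xbc" // byte-level escapes: arbitrary bytes

	fmt.Println(len(text), isValidText(text)) // 3 true
	fmt.Println(len(raw), isValidText(raw))   // 4 false

	// "% x" prints the raw bytes of the string in hex.
	fmt.Printf("% x\n", raw) // bd b2 3d bc
}
```

Both values have type string; the type itself makes no promise about the encoding of the bytes it holds.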
Runes in Go
Go comes with a type rune, which represents a code point. As discussed above, a code point can occupy 1 to 4 bytes in UTF-8, so rune is an alias for int32, which is 4 bytes and therefore large enough to hold any code point.
chr := '❗' // rune type (int32), holds a Unicode code point
A for range loop over a string decodes one UTF-8-encoded rune on each iteration. In other words, it iterates over runes, not bytes, and the index is the byte offset at which each rune starts:
str := "❗hello"
fmt.Println("Length of the string is:", len(str))
for index, char := range str {
	fmt.Printf("Index: %d\tCharacter: %c\tCode Point: %U\n", index, char, char)
}
Outputs:
Length of the string is: 8 // 8 bytes, but 6 runes
Index: 0 Character: ❗ Code Point: U+2757 // occupies multiple bytes
Index: 3 Character: h Code Point: U+0068
Index: 4 Character: e Code Point: U+0065
Index: 5 Character: l Code Point: U+006C
Index: 6 Character: l Code Point: U+006C
Index: 7 Character: o Code Point: U+006F
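For contrast, here is a minimal sketch that counts the same string both ways and then walks it byte by byte with a plain index loop. The helper name byteAndRuneCount is made up for this example; utf8.RuneCountInString is the standard library call that does the rune counting.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the length of s in bytes and in runes.
// Hypothetical helper for illustration.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	str := "❗hello"

	bytes, runes := byteAndRuneCount(str)
	fmt.Println("bytes:", bytes, "runes:", runes) // bytes: 8 runes: 6

	// Indexing a string yields individual bytes, so this loop sees the
	// three bytes of ❗ separately instead of one decoded rune.
	for i := 0; i < len(str); i++ {
		fmt.Printf("%d:%02x ", i, str[i])
	}
	fmt.Println()
}
```

The first three bytes printed by the index loop belong to ❗, which is why the range loop above jumps from index 0 straight to index 3.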
Reverse string exercise:
func reverse(s string) string { // strings are immutable
	runes := []rune(s) // convert string to runes
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i] // swap runes
	}
	return string(runes) // convert runes back to string
}
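To see why the conversion to []rune matters, the sketch below compares a rune-wise reversal with a naive byte-wise one. The names reverseRunes and reverseBytes are chosen for this example only; reverseRunes uses the same approach as the function above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverseRunes reverses s one code point at a time.
func reverseRunes(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

// reverseBytes reverses s one byte at a time. It is included only to show
// the failure mode: multi-byte characters end up with their bytes reversed.
func reverseBytes(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	s := "❗hello"

	fmt.Println(reverseRunes(s)) // olleh❗

	// The byte-wise result is no longer valid UTF-8.
	fmt.Println(utf8.ValidString(reverseBytes(s))) // false
}
```

Note that reversing by runes is still only correct per code point: a character built from several code points, such as a letter followed by a combining accent, would have its parts reordered.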
Summary:
- Unicode describes a list of code points
- UTF-8 is an encoding in which a code point occupies 1 to 4 bytes
- Go's string is a read-only slice of bytes (each element is 1 byte)
- Go's rune is 4 bytes (to fit any code point)