Go Strings, Runes & Bytes

Links: - 103 Golang Index
- Go - DataTypes
- Go - Basics

Strings¶

Strings have to be enclosed in double quotes.
- If we want to use double quotes inside double quotes we must use \ to escape it
Strings in Go are UTF-8 encoded by default.
We can raw string using backticks. Backslashes and other special characters have no special meaning.
- Useful for defining directories
```
fmt.Println(`hello "there"`) // hello "there"
fmt.Println(`hello \n there`) // hello \n there
```

Strings are immutable

s := "ABC";s[0] = "D" - this will give an error

Each time you use + for concatenation Go returns a new string

s := "ABC"
fmt.Println(s[0]) // 65
fmt.Printf("%c",s[0]) // A
fmt.Println(s[0:2]) // AB
// for multiple characters it prints the string representation

We perform string operations using the strings package.
- Recommended way to compare strings when we don't want to take into account the case => fmt.Println(strings.EqualFold("GO","go"))
- Converting to lower case and then comparing is an expensive operation

Bytes & Runes¶

Strings are implemented quite differently as compared to other programming languages.
Go has two additional integer types called byte and rune that are aliases for uint8 and int32 data types.
In Go, the byte and rune data types are used to distinguish characters from integer values.
Golang doesn't have a char data type. It uses byte and rune to represent character values.
- byte data type represents ASCII characters
- rune data type represents more broader UNICODE characters that are encode in UTF-8 by default.

UTF-8 is a variable length encoding format.

UNICODE characters occupy between 1 and 4 bytes.
ASCII characters occupy 1 byte.

Characters or rune literals are expressed in Go by enclosing them in single quotes, as in 'x' or '\n'.

Rune can be thought of being synonymous to a letter. Although this is true string is a slice of bytes and not runes.

Rune literals such as 'a', 'b', 'c', 'x' or \n' are represented using Unicode Code Points. A code point is a numeric value that represents a rune literal.
The character encoding scheme ASCII which is a Unicode subset, comprises 128 code points.
A string is a sequence of bytes not runes or characters. A string is a slice of bytes and any byte slice can be encoded in a string value.
The Go terminology for code points is runes.
- A rune represent a single unicode character.
- Rune 0x61 in hexadecimal represents the rune literal 'a'.

a := 'a' // this is a rune
fmt.Printf("%d, %T\n", a, a) // 97, int32
str := "¥"
fmt.Printf("%d, %T, %d\n", len(str), str[1], str[1]) // 2, uint8, 165
// we have a length of 2 since UTF-8 is a variable length encoding format and the character occupies 2 bytes
// We get the byte at position 1 and not rune
fmt.Println(len("hello")) // 5

Decoding a string rune by rune¶

By using indexes we get the byte at that position, not rune.

// wrong decoding
str := "¥"
for i := 0; i < len(str); i++ {
    fmt.Printf("%c\n", str[i])
} // Â¥

// using a package
for i:= 0; i < len(str); {
    r, size := utf8.DecodeRuneInString(str[i:])
    fmt.Printf("%c", r)
    i = i + size
}

str := "¥þð"
for i, r := range str {
    fmt.Printf("%d -> %c\n", i, r) 
}
// 0 -> ¥
// 2 -> þ
// 4 -> ð

Understanding `len` of strings¶

len returns the number of bytes in a string
- It will be equal to the number of characters in ASCII since in ASCII each letter is 1 byte.
- But this isn't the case with UNICODE characters which can be of multiple bytes. The rune (letter) count can't be determined using len function
```
str := "golang"
fmt.Println(len(str)) // 6
str = "¥þ"
fmt.Println(len(str)) // 2 rune in the string but the length is 4
// this UNICODE string occupies 4 bytes
```
If we want the rune count and not the number of bytes then we have to use a function from the utf8 package
```
n := utf8.RuneCountInString(str)
fmt.Println(n) // 1
```

Slicing Strings¶

Slicing a string returns bytes and not runes

s := "golang"
fmt.Println(s[1:3]) // ol

In the above example since it was an ASCII string where 1 character = 1 byte we got characters from index 1 to 2.

Slicing non ASCII characters is not that simple. We need a slice of runes and not a slice of bytes.

str := "¥þð"
fmt.Println(str[1:3]) // �� -> unicode representation of the returned bytes

We first convert slice of runes to slice of bytes and then after slicing convert the rune back to string.
- The above method of converting a slice of bytes to a slice of runes can also be used to find the length of the runes.
```
str := "¥þð"
rs := []rune(str)
fmt.Println(len(rs)) // 3
fmt.Println(rs) // [165 254 240]
fmt.Println(string(rs[1:3])) // þð
```

Miscellaneous Examples¶

a := "hello"
b := []byte(a)
for _, value := range b {
    fmt.Printf("%#v, %d, %c\n", value, value, value)
}
// 0x68, 104, h
// 0x65, 101, e
// 0x6c, 108, l
// 0x6c, 108, l
// 0x6f, 111, o

- Converting string to byte and rune slice

s := "hello"
rs := []rune(s)
bs := []byte(s)
fmt.Println(rs) // [104 101 108 108 111]
fmt.Println(bs) // [104 101 108 108 111]

- Converting to string

bs := []byte{102, 97, 65, 34}
rs := []rune{102, 97, 65, 34}
is := []uint8{102, 97, 65, 34} // alias to byte
bis := []int32{102, 97, 65, 34} // alias to rune
fmt.Println(string(bs)) // faA"
fmt.Println(string(rs)) // faA"
fmt.Println(string(is)) // faA"
fmt.Println(string(bis)) // faA"
// is := []int{102,97,65,34} // this would give an error since we cannot convert int to string
// we cannot convert any other data type to string

Last updated: 2022-05-29