optimizing my version of uniq
Problem
While browsing coding challenges, I came across a task on codingchallenges.fyi that caught my attention. The challenge was to implement a uniq tool, which filters out adjacent duplicate lines in a file, similar to the Unix command-line utility. However, I found it a bit odd that the original uniq requires the input to be sorted to be useful. I decided to take this challenge further by building my own version that supports both sorted and unsorted inputs.
Uniq V2
For my implementation, I decided to keep things simple:
- Support for unsorted inputs: The tool should also work with unsorted files, unlike the original uniq.
- No additional features: The focus is purely on performance and memory optimization.
Solution
To ensure my implementation is efficient, I’ll rely heavily on Go’s benchmarking tools. The following command will help me profile the CPU and memory usage of my solution:
go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
Here’s an example of a benchmark result:
BenchmarkUniq-8 14 72927262 ns/op 38636948 B/op 2279 allocs/op
- 1st Column: Benchmark name
- 2nd Column: Number of times the benchmark ran
- 3rd Column: Nanoseconds per iteration (ns/op)
- 4th Column: Bytes allocated per iteration (B/op)
- 5th Column: Number of allocations per iteration (allocs/op)
I also plan to use pprof to dive into memory usage, aiming to reduce it as much as possible, as this challenge leans more on memory efficiency than CPU optimization.
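For example, once a benchmark run has written the profiles, this will print the top allocation sites (one of several ways to drive pprof):
go tool pprof -top memprofile.out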
Test File
For testing, I created a 6.2 MB file containing every word from The Adventures of Sherlock Holmes, each on its own line. Here are some stats:
- Total lines: 1,164,980 (via wc -l uniq/big.txt)
- Unique lines: 81,398 (via sort uniq/big.txt | uniq | wc -l)
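If you want to build a similar file yourself, one option (not necessarily how mine was made) is to squeeze every run of whitespace down to a newline:
tr -s '[:space:]' '\n' < sherlock.txt > uniq/big.txt
where sherlock.txt is the plain text of the book (filename assumed).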
Attempts
Here is all my code if you would like to look at it.
First Attempt - Base Implementation
// Uniq returns content with duplicate lines removed, keeping the
// first occurrence of each line in its original order.
func Uniq(content []byte) string {
	cache := make(map[string]bool)
	var output []string
	for _, l := range strings.Split(string(content), "\n") {
		if _, ok := cache[l]; !ok {
			output = append(output, l)
			cache[l] = true
		}
	}
	return strings.Join(output, "\n")
}
var sink string

func BenchmarkUniq(b *testing.B) {
	var result string
	for i := 0; i < b.N; i++ {
		result = Uniq(bigContent)
	}
	// assigning to a package-level sink keeps the compiler from
	// optimizing the calls away
	sink = result
}
My initial implementation is just a function that takes a byte slice and returns a string.
It naively splits the byte slice on newlines, then records each line in a map[string]bool
to check whether it's unique; if the line hasn't been seen yet, it goes into a string slice. Finally, the string slice is joined with \n
and returned.
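As a quick sanity check (a hypothetical input of my own, not from the original tests), the first occurrence of each line wins, sorted or not:
func main() {
	input := []byte("b\na\nb\nc\na")
	fmt.Println(Uniq(input)) // prints b, a, c on separate lines
}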
➜ benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8 14 72927262 ns/op 38636948 B/op 2279 allocs/op
PASS
ok github.com/griggsca91/gobenchmarkexample/uniq 1.462s
Second Attempt - Operate on the byte slice directly
// UniqV2 walks the byte slice directly instead of using strings.Split.
// Note: it assumes content ends with '\n'; any bytes after the last
// newline are dropped.
func UniqV2(content []byte) string {
	cache := make(map[string]bool)
	var output []string
	var word []byte
	for _, b := range content {
		if b != '\n' {
			word = append(word, b)
			continue
		}
		l := string(word)
		if _, ok := cache[l]; !ok {
			output = append(output, l)
			cache[l] = true
		}
		word = word[0:0]
	}
	return strings.Join(output, "\n")
}
Instead of allocating a whole slice of strings up front to iterate over, this version operates directly on the content byte slice, skipping strings.Split entirely. Bytes accumulate into word, and at every newline the accumulated bytes are converted to a string and checked against the cache; if the line hasn't been seen, it's appended to the output as before. Finally, the accumulation buffer is reused with word = word[0:0],
which resets the length to 0 while keeping the capacity. This reduces the number of allocations during appends.
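To make the word = word[0:0] trick concrete, here's a tiny standalone example (mine, not from the post's repo) showing that re-slicing resets the length but keeps the capacity:
package main

import "fmt"

func main() {
	word := make([]byte, 0, 8)
	word = append(word, 'a', 'b', 'c')
	fmt.Println(len(word), cap(word)) // 3 8
	word = word[0:0] // length back to 0, capacity still 8
	fmt.Println(len(word), cap(word)) // 0 8
	// future appends reuse the same backing array until capacity runs out
}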
➜ benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8 16 70174036 ns/op 38639600 B/op 2308 allocs/op
BenchmarkUniqV2-8 15 73347111 ns/op 19953586 B/op 1070476 allocs/op
PASS
ok github.com/griggsca91/gobenchmarkexample/uniq 3.784
Overall memory usage is down, but the number of allocations went way up.
Under the hood, strings.Split counts how many substrings it will produce before making a single allocation for the returned slice. To find the separators it uses stringslite.Index,
which relies on some algorithms that are over my head. While interesting, I don't know if I want to spend the time to understand them right now. But maybe later.
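Here's a rough sketch of that pre-counting idea (my own illustration using the real strings.Count and strings.Index, not the standard library's actual code):
package main

import (
	"fmt"
	"strings"
)

// splitLines counts the separators first, then allocates the result
// slice exactly once, mimicking what strings.Split does internally.
func splitLines(s string) []string {
	n := strings.Count(s, "\n") + 1
	parts := make([]string, 0, n)
	for j := 0; j < n-1; j++ {
		i := strings.Index(s, "\n")
		parts = append(parts, s[:i])
		s = s[i+1:]
	}
	return append(parts, s) // remainder after the last separator
}

func main() {
	fmt.Printf("%q\n", splitLines("a\nb\na\n")) // ["a" "b" "a" ""]
}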
Another source of allocations is converting each []byte to a string. In Go, strings are immutable, so when you create a string from a []byte, it copies the contents to remove any risk of the string being mutated through the underlying []byte. And since I'm doing this once per line in a loop instead of all at once, the number of allocations goes up accordingly.
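A tiny demonstration of that copy (my own example): mutating the original slice after the conversion doesn't touch the string, because the string got its own copy of the bytes.
package main

import "fmt"

func main() {
	b := []byte("hello")
	s := string(b) // allocates and copies: s gets its own backing array
	b[0] = 'H'     // mutating b cannot touch s
	fmt.Println(s, string(b)) // hello Hello
}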
Third Attempt - Allocate the output slice all at once
// UniqV3 records each unique line's position in the map, then builds
// the output slice in a single allocation at the end.
func UniqV3(content []byte) string {
	cache := make(map[string]int)
	var word []byte
	var position int
	for _, b := range content {
		if b != '\n' {
			word = append(word, b)
			continue
		}
		l := string(word)
		if _, ok := cache[l]; !ok {
			cache[l] = position
			position++
		}
		word = word[0:0]
	}
	output := make([]string, len(cache))
	for key, value := range cache {
		output[value] = key
	}
	return strings.Join(output, "\n")
}
The next thing I attempted: instead of appending to the output immediately, record each unique line's position in the map, allocate a slice in one go at the end, place each string at its appropriate position, and join the strings with that.
➜ benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8 15 73716039 ns/op 38641794 B/op 2310 allocs/op
BenchmarkUniqV2-8 15 73822325 ns/op 19955764 B/op 1070493 allocs/op
BenchmarkUniqV3-8 15 72249708 ns/op 16281427 B/op 1070569 allocs/op
PASS
ok github.com/griggsca91/gobenchmarkexample/uniq 3.716
Saved a bit of memory with this, but nothing significant.
Fourth Attempt - Use unsafe
// UniqV4 slices lines straight out of content and uses unsafe to view
// each line as a string without copying it.
func UniqV4(content []byte) string {
	cache := make(map[string]int)
	var position int
	var startPosition int
	for i, b := range content {
		if b != '\n' {
			continue
		}
		word := content[startPosition:i]
		// reinterpret the slice header as a string header: no copy,
		// but the "string" aliases content's bytes
		l := *(*string)(unsafe.Pointer(&word))
		if _, ok := cache[l]; !ok {
			cache[l] = position
			position++
		}
		startPosition = i + 1
	}
	output := make([]string, len(cache))
	for key, value := range cache {
		output[value] = key
	}
	return strings.Join(output, "\n")
}
I mentioned previously that creating a string from a []byte allocates a new backing array and copies the data. With unsafe.Pointer we can skip that copy and reinterpret the slice header as a string directly. Of course, this is risky: if the underlying bytes ever change, every "immutable" string that aliases them changes too. This isn't a very good solution, but an interesting one to see how much memory we save by doing it.
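As an aside, Go 1.20 added unsafe.String and unsafe.SliceData, which express the same zero-copy conversion without the raw header cast; the line above could instead be written as:
// same zero-copy conversion, same aliasing risk, but no manual
// assumptions about header layout
l := unsafe.String(unsafe.SliceData(word), len(word))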
➜ benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8 15 68814044 ns/op 38637430 B/op 2284 allocs/op
BenchmarkUniqV2-8 15 71712869 ns/op 19952593 B/op 1070474 allocs/op
BenchmarkUniqV3-8 15 70332697 ns/op 16283322 B/op 1070576 allocs/op
BenchmarkUniqV4-8 21 51110863 ns/op 9825371 B/op 2390 allocs/op
PASS
ok github.com/griggsca91/gobenchmarkexample/uniq 4.901s
Fifth Attempt - Use my own hash
func UniqV5(content []byte) string {
	// The average word, according to Google, is 4.7 characters long;
	// assuming all ASCII, 4.7 * 8 = 37.6, which is ~= 40.
	// 25 is faster but uses more memory; 40 uses less memory for a small
	// perf trade-off, which is OK; 100 uses even less memory, but the
	// speed gains start to drop off noticeably. Maybe worth making this
	// an env variable if we care that much.
	cache := make(map[uint32]struct{}, len(content)/40)
	var startPosition int
	str := make([]byte, 0, len(content)/10)
	for i, b := range content {
		if b != '\n' {
			continue
		}
		lineBytes := content[startPosition:i]
		// inlined 32-bit FNV-1a; note a 32-bit hash can collide, which
		// would silently drop a distinct line
		var hash uint32 = 2166136261
		for _, c := range lineBytes {
			hash ^= uint32(c)
			hash *= 16777619
		}
		if _, ok := cache[hash]; !ok {
			cache[hash] = struct{}{}
			str = append(str, lineBytes...)
			str = append(str, '\n')
		}
		startPosition = i + 1
	}
	// zero-copy conversion: str is never mutated after this point
	return unsafe.String(unsafe.SliceData(str), len(str))
}
This was probably the biggest change I made, and it had the most significant effect. And for transparency, I used ChatGPT for some help with this :)
First, I used the FNV hash function to generate my own hash value directly from the []byte instead of converting it to a string and letting the map hash that. I don't need the actual string; I just need to know whether the []byte has been encountered before. This made the map much smaller in memory and the hashing much faster.
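For reference, the standard library ships this same function as hash/fnv; a per-line version using it would look like the sketch below (mine, not what I benchmarked), though to my understanding going through the hash.Hash32 interface adds an allocation and indirect calls that the inlined loop avoids.
// lineHash is the stdlib equivalent of the inlined loop: 32-bit FNV-1a.
func lineHash(lineBytes []byte) uint32 {
	h := fnv.New32a()  // from "hash/fnv"; same constants as the inlined loop
	h.Write(lineBytes) // Write on an fnv hash never returns an error
	return h.Sum32()
}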
Second, I copied the internals of strings.Builder and inlined them myself: whenever a line is seen for the first time, its bytes are appended to str immediately, followed by a \n. Then at the return I use the unsafe.String function, which is actually alright here since I have complete control over the []byte and it's not going to change for the life of the program.
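For contrast, the non-inlined version would look roughly like this (my reconstruction, not the post's code); strings.Builder.String performs the same unsafe conversion internally, which is why inlining it here is sound:
var sb strings.Builder
sb.Grow(len(content) / 10) // same pre-sizing guess as str above
// ... inside the loop, for each first-seen line:
sb.Write(lineBytes)
sb.WriteByte('\n')
// ... and at the end:
return sb.String() // zero-copy view of the Builder's buffer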
Lastly, I made some assumptions to pre-allocate the cache and the returned str slice, which helped reduce the allocations per operation.
➜ benchmarkinguniq git:(main) go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8 16 68985776 ns/op 38638754 B/op 2303 allocs/op
BenchmarkUniqV2-8 16 70510958 ns/op 19953413 B/op 1070472 allocs/op
BenchmarkUniqV3-8 14 71854795 ns/op 16277012 B/op 1070550 allocs/op
BenchmarkUniqV4-8 22 50859449 ns/op 9824981 B/op 2386 allocs/op
BenchmarkUniqV5-8 33 34923020 ns/op 3146931 B/op 11 allocs/op
PASS
ok github.com/griggsca91/gobenchmarkexample/uniq 7.804s
The V5 version is 2x faster than the initial implementation and over 10x more memory efficient on the 6.2 MB test file.
Thoughts
Go is fast and efficient out of the box. This was a fun exercise, and I learned a bit about how strings work and how to use benchmarks to find improvements. But I probably wouldn't suggest doing these optimizations unless you absolutely need them.