Optimizing my version of uniq

Problem

While browsing coding challenges, I came across a task on codingchallenges.fyi that caught my attention. The challenge was to implement a uniq tool, which filters out adjacent duplicate lines in a file, similar to the Unix command-line utility. However, I found it a bit odd that the original uniq requires the input to be sorted to be useful. I decided to take this challenge further by building my own version that supports both sorted and unsorted inputs.

Uniq V2

For my implementation, I decided to keep things simple:

  • Support for sorted and unsorted inputs: The tool should work on sorted files just like the original uniq, but it shouldn't require the input to be sorted.

  • No additional features: The focus is purely on performance and memory optimization.

Solution

To ensure my implementation is efficient, I’ll rely heavily on Go’s benchmarking tools. The following command will help me profile the CPU and memory usage of my solution:

go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out

Here’s an example of a benchmark result:

BenchmarkUniq-8               14          72927262 ns/op        38636948 B/op       2279 allocs/op
  • 1st Column: Test name

  • 2nd Column: Number of times the benchmark ran

  • 3rd Column: Nanoseconds each benchmark iteration took

  • 4th Column: Bytes allocated per iteration

  • 5th Column: Number of memory allocations per iteration

I also plan to use pprof to dive into memory usage, aiming to reduce it as much as possible, as this challenge leans more on memory efficiency than CPU optimization.
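
For reference, poking at the profiles with pprof looks roughly like this (the file names match the go test command above; list Uniq assumes the function name from the first attempt below):

go tool pprof memprofile.out
(pprof) top
(pprof) list Uniq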

Test File

For testing, I created a 6.2 MB file containing every word from The Adventures of Sherlock Holmes on a new line. Here are some stats:

  • Total lines: 1,164,980 via wc -l uniq/big.txt

  • Unique lines: 81,398 via sort uniq/big.txt | uniq | wc -l
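
For anyone who wants to reproduce a similar file, splitting a plain-text copy of the book (assumed here to be named sherlock.txt) onto one word per line is roughly:

tr -s '[:space:]' '\n' < sherlock.txt > uniq/big.txt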

Attempts

Here is all my code if you would like to look at it.

First Attempt - Base Implementation

func Uniq(content []byte) string {
	cache := make(map[string]bool)

	var output []string

	for _, l := range strings.Split(string(content), "\n") {
		if _, ok := cache[l]; !ok {
			output = append(output, l)
			cache[l] = true
		}
	}

	return strings.Join(output, "\n")
}

var sink string

func BenchmarkUniq(t *testing.B) {
	var result string
	for i := 0; i < t.N; i++ {
		result = Uniq(bigContent)
	}
	sink = result
}

My initial implementation was just a function that takes in a byte slice and returns a string.

It naively splits the byte slice on newline characters, then checks each line against a map[string]bool to see if it has been seen before; if it hasn't, it shoves the line into a string slice and records it in the map. Finally, it joins the string slice with \n and returns that output.

➜  benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out

goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8               14          72927262 ns/op        38636948 B/op       2279 allocs/op
PASS
ok      github.com/griggsca91/gobenchmarkexample/uniq   1.462s

Second Attempt - Operate on the byte slice directly

func UniqV2(content []byte) string {
	cache := make(map[string]bool)

	var output []string

	var word []byte
	for _, b := range content {
		if b != '\n' {
			word = append(word, b)
			continue
		}
		l := string(word)
		if _, ok := cache[l]; !ok {
			output = append(output, l)
			cache[l] = true
		}
		word = word[0:0]
	}

	return strings.Join(output, "\n")
}

Instead of allocating a new slice of strings to iterate over, this version operates directly on the content byte slice, so strings.Split is gone entirely. Bytes are accumulated into a word buffer, and on every newline that buffer is cast to a string and checked against the cache, with the same logic as before. Finally, the buffer is reused via word = word[0:0], which resets the length to 0 but keeps the capacity, reducing the number of allocations during appends.
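
As a small aside on that re-slicing trick, here's a standalone illustration (not part of the uniq code) showing that the length resets while the capacity, and therefore the backing array, sticks around:

package main

import "fmt"

// word = word[0:0] style reuse: the slice header's length drops back to zero,
// but the capacity (and backing array) is kept, so later appends don't
// allocate until the buffer outgrows it.
func main() {
	buf := make([]byte, 0, 8)
	buf = append(buf, []byte("hello")...)
	fmt.Println(len(buf), cap(buf)) // 5 8
	buf = buf[:0]
	fmt.Println(len(buf), cap(buf)) // 0 8
	buf = append(buf, 'x') // reuses the same backing array
	fmt.Println(len(buf), cap(buf)) // 1 8
}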

➜  benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out

goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8               16          70174036 ns/op        38639600 B/op       2308 allocs/op
BenchmarkUniqV2-8             15          73347111 ns/op        19953586 B/op    1070476 allocs/op
PASS
ok      github.com/griggsca91/gobenchmarkexample/uniq   3.784s

Overall memory usage is down, but the number of allocations went up.

Under the hood, strings.Split counts how many substrings it's going to produce before doing a single allocation for the returned slice. It uses stringslite.Index, which relies on algorithms that are a bit over my head. While interesting, I don't want to spend the time digging into it right now. Maybe later.
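
Conceptually, the idea is something like this simplified sketch (not the real standard library code, which is more clever about it):

// splitLines is a simplified sketch of how strings.Split achieves a single
// allocation for its result: count the separators first, size the slice once,
// then fill it.
func splitLines(s string) []string {
	out := make([]string, 0, strings.Count(s, "\n")+1)
	for {
		i := strings.Index(s, "\n")
		if i < 0 {
			return append(out, s)
		}
		out = append(out, s[:i])
		s = s[i+1:]
	}
}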

Another source of allocations is that I'm creating a new string from the []byte on every line. In Go, strings are immutable, so converting a []byte to a string copies its contents to remove any risk of the string's underlying data being mutated through the original []byte. And since I'm doing this once per line inside the loop instead of all at once, the allocation count goes up accordingly.
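
A quick standalone demonstration of that copy (nothing uniq-specific here):

package main

import "fmt"

// string(b) copies the bytes: mutating the original slice afterwards does not
// affect the string, because the string got its own backing array.
func main() {
	b := []byte("watson")
	s := string(b) // allocates and copies
	b[0] = 'W'
	fmt.Println(s)         // watson
	fmt.Println(string(b)) // Watson
}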

Third Attempt - Allocate the output slice all at once

func UniqV3(content []byte) string {
	cache := make(map[string]int)

	var word []byte
	var position int
	for _, b := range content {
		if b != '\n' {
			word = append(word, b)
			continue
		}
		l := string(word)
		if _, ok := cache[l]; !ok {
			cache[l] = position
			position++
		}
		word = word[0:0]
	}

	output := make([]string, len(cache))
	for key, value := range cache {
		output[value] = key
	}

	return strings.Join(output, "\n")
}

The next thing I attempted was, instead of appending to the output immediately, to record each unique line's position in the map, allocate the output slice once at the end, drop each string into its appropriate position, and join that.

➜  benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out

goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8               15          73716039 ns/op        38641794 B/op       2310 allocs/op
BenchmarkUniqV2-8             15          73822325 ns/op        19955764 B/op    1070493 allocs/op
BenchmarkUniqV3-8             15          72249708 ns/op        16281427 B/op    1070569 allocs/op
PASS
ok      github.com/griggsca91/gobenchmarkexample/uniq   3.716s

Saved a bit of memory with this, but nothing significant.

Fourth Attempt - Use unsafe

func UniqV4(content []byte) string {
	cache := make(map[string]int)

	var position int
	var startPosition int
	for i, b := range content {
		if b != '\n' {
			continue
		}
		word := content[startPosition:i]
		l := *(*string)(unsafe.Pointer(&word))
		if _, ok := cache[l]; !ok {
			cache[l] = position
			position++
		}
		startPosition = i + 1
	}

	output := make([]string, len(cache))
	for key, value := range cache {
		output[value] = key
	}

	return strings.Join(output, "\n")
}

I mentioned previously that when you create a string from a []byte, it allocates a new backing array for the string's data. If we use unsafe.Pointer, we can skip that copy and reinterpret the slice directly as a string. Of course, this is risky: if the underlying bytes ever change, the string changes with them. It isn't a very good solution, but it's an interesting one for seeing how much memory we save by doing it.
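
For a little context on why that cast works: a slice header begins with the same data pointer and length fields that a string header has, so reinterpreting one as the other lines up. Since Go 1.20 the same thing can be written with unsafe.String, which is what I end up using later in UniqV5. A standalone sketch of both forms:

package main

import (
	"fmt"
	"unsafe"
)

// Two zero-copy ways to view a []byte as a string. Both are only safe if the
// bytes are never modified for the lifetime of the string.
func main() {
	word := []byte("sherlock")
	s1 := *(*string)(unsafe.Pointer(&word))                // the cast used in UniqV4
	s2 := unsafe.String(unsafe.SliceData(word), len(word)) // the Go 1.20+ form used in UniqV5
	fmt.Println(s1, s2)
}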

➜  benchmarkinguniq git:(main) ✗ go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8               15          68814044 ns/op        38637430 B/op       2284 allocs/op
BenchmarkUniqV2-8             15          71712869 ns/op        19952593 B/op    1070474 allocs/op
BenchmarkUniqV3-8             15          70332697 ns/op        16283322 B/op    1070576 allocs/op
BenchmarkUniqV4-8             21          51110863 ns/op         9825371 B/op       2390 allocs/op
PASS
ok      github.com/griggsca91/gobenchmarkexample/uniq   4.901s

Fifth Attempt - Use my own hash

func UniqV5(content []byte) string {
	// average size of a word according to Google is 4.7 characters, and assuming all ASCII, 4.7 * 8 = 37.6 which is ~= 40
	// 25 is faster, but uses more memory
	// 40 uses less memory with a small trade-off in perf, which is ok
	// 100 uses even less memory, but the speed gains start to drop off noticeably; maybe worth making this an env variable if we care that much
	cache := make(map[uint32]struct{}, len(content)/40)

	var startPosition int

	str := make([]byte, 0, len(content)/10)

	for i, b := range content {
		if b != '\n' {
			continue
		}
		lineBytes := content[startPosition:i]
		var hash uint32 = 2166136261
		for _, c := range lineBytes {
			hash ^= uint32(c)
			hash *= 16777619
		}

		if _, ok := cache[hash]; !ok {
			cache[hash] = struct{}{}
			str = append(str, lineBytes...)
			str = append(str, '\n')
		}
		startPosition = i + 1
	}

	return unsafe.String(unsafe.SliceData(str), len(str))
}

This was probably the biggest change I made, and it had the most significant effect. And for transparency, I used ChatGPT for some help with this :)

First, I used the FNV hash function to generate my own hash value directly from the []byte instead of converting it to a string and then hashing that. I don't need the actual string; I just need to know whether the []byte has been encountered before. This made the map much smaller in memory and the hashing much faster.
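
The inlined loop is FNV-1a (32-bit); the standard library's hash/fnv computes the same value, just with a little more call overhead per line. One trade-off worth noting: since only the 32-bit hash is stored, two distinct lines that happen to collide would be treated as duplicates. A small standalone check that the two implementations agree:

package main

import (
	"fmt"
	"hash/fnv"
)

// Confirms the hand-rolled loop in UniqV5 is FNV-1a by comparing it against
// hash/fnv's 32-bit FNV-1a implementation.
func main() {
	line := []byte("elementary")

	var hash uint32 = 2166136261 // FNV-1a offset basis
	for _, c := range line {
		hash ^= uint32(c)
		hash *= 16777619 // FNV-1a 32-bit prime
	}

	h := fnv.New32a()
	h.Write(line)
	fmt.Println(hash == h.Sum32()) // true
}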

Second, I copied the internals of strings.Builder and inlined them myself. Whenever a line is seen for the first time, I immediately append its bytes to the output buffer, followed by a \n. At the return I use unsafe.String, which is actually alright here since I have complete control over the []byte and it never changes for the life of the program.

Lastly, I made some assumptions when pre-allocating the cache and the returned str slice. This helped reduce the allocations per operation.
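
For comparison, the same output handling could lean on strings.Builder instead of a raw []byte; its String() method avoids the final copy with the same kind of unsafe trick internally. A sketch of that variant (it needs the strings package, and I haven't benchmarked it):

// UniqV5Builder is a sketch of UniqV5 that uses strings.Builder for the
// output instead of a hand-managed []byte.
func UniqV5Builder(content []byte) string {
	cache := make(map[uint32]struct{}, len(content)/40)

	var sb strings.Builder
	sb.Grow(len(content) / 10)

	var startPosition int
	for i, b := range content {
		if b != '\n' {
			continue
		}
		lineBytes := content[startPosition:i]
		var hash uint32 = 2166136261
		for _, c := range lineBytes {
			hash ^= uint32(c)
			hash *= 16777619
		}
		if _, ok := cache[hash]; !ok {
			cache[hash] = struct{}{}
			sb.Write(lineBytes)
			sb.WriteByte('\n')
		}
		startPosition = i + 1
	}

	return sb.String()
}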

➜  benchmarkinguniq git:(main) go test -bench=. -benchmem ./uniq -cpuprofile=cpuprofile.out -memprofile=memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/griggsca91/gobenchmarkexample/uniq
cpu: Apple M1 Pro
BenchmarkUniq-8               16          68985776 ns/op        38638754 B/op       2303 allocs/op
BenchmarkUniqV2-8             16          70510958 ns/op        19953413 B/op    1070472 allocs/op
BenchmarkUniqV3-8             14          71854795 ns/op        16277012 B/op    1070550 allocs/op
BenchmarkUniqV4-8             22          50859449 ns/op         9824981 B/op       2386 allocs/op
BenchmarkUniqV5-8             33          34923020 ns/op         3146931 B/op         11 allocs/op
PASS
ok      github.com/griggsca91/gobenchmarkexample/uniq   7.804s

The V5 version is roughly 2x faster than the initial implementation and more than 10x more memory efficient on the 6.2 MB test file.

Thoughts

Go is fast and efficient out of the box. This was a fun exercise and I learned a bit about how strings work and how to use benchmarks to find improvements. But I probably wouldn't suggest doing these optimizations unless you absolutely need them.