perf.slide

Go Performance Tutorial

Josh Bleecher Snyder
Braintree/PayPal
josharian@gmail.com
@offbymany

# introduce myself, contributions to Go, etc.

# TODO: Move code into files, make it executable, write benchmarks
# TODO: say it all out loud, tweak it, run the code
# TODO: figure out how to print out a guide in big font


* Plan

- Introduction and philosophy
- Tools: Benchmarks, profiles
- Habits and techniques: string/[]byte, memory, concurrency
- Advanced tools and techniques
- Other kinds of optimization
- Wrap-up and stump the chump

.link http://10.10.32.101


* Introduction and philosophy

# lots to say, so there will be interludes throughout
# but here are the key points


* Write simple, clear code

- Usually the fastest anyway

.link https://codereview.appspot.com/131840043

- Easy to see optimization opportunities
- Compiler and runtime optimized for normal code
- Take it easy on abstraction (reflection, interfaces)

"All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections." - David Wheeler


* Write good tests and use version control

Enables experimentation.

"If you're not going to get the right answer, I don't see the point. I can make things very fast if they don't have to be correct." - Russ Cox


* Develop good habits

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth

# we'll discuss good habits as we go
# get the basics right: caching, lazy initialization, better algorithms


* Know thy tools, at all levels

- Can you cheat? Does it matter?
- Algorithms
- Language
- Benchmarking and profiling
- Machine and OS: Disk vs network vs memory

"People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird." - Donald Knuth


* The easiest wins around

- Use the most recent release of Go!
- Use the standard library.
- Use more hardware.


* Benchmarking


* Hello, benchmarks

Demo: package fib

# how to write, run, and interpret a benchmark
# adaptive benchtime
# -bench=regexp, -run=NONE, -benchtime
# full power of language at disposal
# careful about which benchmarks you write
# beware microbenchmarks


* Tour of testing.B and 'go test' flags

Demo: word length count

# b.SetBytes
# -benchmem, b.ReportAllocs
# b.Errorf, b.Logf, -v
# danger: these skew benchmarking! demo with ReportAllocs
# and wow, reflect.DeepEqual is expensive!
# b.ResetTimer, b.StopTimer, b.StartTimer
# show Go 1.4 vs Go tip (1.5)


* Comparing benchmarks

- benchcmp

	go get -u golang.org/x/tools/cmd/benchcmp

- benchviz

	go get -u github.com/ajstarks/svgo/benchviz

- benchstat

	go get -u rsc.io/benchstat

# hope for more in 1.6


* Benchmarking concurrent code

Demo: ngram

# -cpu, b.PB, b.RunParallel, b.SetParallelism
# how to set up global state and goroutine-local state
# what it does under the hood
# tip: Use rand.Zipf to simulate real load


* Profiling

* Hello, profiling

- Where have all the cycles gone?
- Support built into the runtime
- `go`tool`pprof`, graphviz
- OS X sadness


# helps you understand you program's performance via instrumentation
# different kinds of profiling: cpu, memory, block profiling, tracing
# cpu works by sampling, with support from the OS; OS X needs a patch
# other kinds of profiling work using instrumentation in the runtime
# cpu is efficient, can run on live production server, memory less so
# different kinds of profiling interfere with each other

# NetBSD also has (had?) broken profiling


* CPU profiling

Demo: fib

# show basic usage: -cpuprofile, -outputdir
# saves binary
# don't run any tests, target individual benchmarks
# -lines -pdf -nodecount=10 -focus=fib
# be careful about hiding things!
# mention:
#  can set CPU profiling rate


* Memory profiling

Demo: ascii

# start with load
# discover syscalls
# remove syscalls

# discover malloc
# add ReportAllocs
# do mem profiling
# need -l
# need -alloc_objects, discuss alternatives
# discuss -memprofilerate

# move on to encode
# discover malloc
# add ReportAllocs
# mem profiling
# need -l? nope.
# need -runtime flag to pprof
# aha! string concatenation. Use a bytes.Buffer. Will discuss more later. common problem.

# syscall: look at syscalls. malloc: look at allocation. mutex/futex/channel-y things? look at blocking. We'll get to that.


* Profiling gotchas

- Don't run multiple profilers at once.
- Don't run tests when profiling.
- If the output doesn't make sense, poke around or ask for help.

# pprof has a crappy UI. Live with it. :(


* Block profiling

Demo: ngram

# easy demo: -blockprofile=
# discuss -blockprofilerate, -memprofilerate
# there is a way to change cpu profile rate in runtime package but too fast can't happen (OS support, expense of walking the stack) and the default is pretty good
# discuss ok/expected blocking: time.Ticker, sync.WaitGroup


* Other kinds of profiling

In package `runtime/pprof`:

- `goroutine`: helpful for finding sources of leaking goroutines
- `threadcreate`: helpful for debugging runaway thread creation (usually syscalls or cgo)

# TODO: Expand?

Basic memory stats available in package `runtime`: `ReadMemStats`

.link https://golang.org/pkg/runtime/#MemStats


* Whole program profiling

Set up first thing in `func`main`.

Use runtime and runtime/pprof packages...but it is a pain.

# Mention gotchas like flushing and closing the files,
# calling runtime.GC before exit, etc.

Dave Cheney made a nice helper package:

	go get -u github.com/pkg/profile

.link https://godoc.org/github.com/pkg/profile


* Monitoring live servers

Cheap enough to do in production!

And easy, using `net/http/pprof`.

	import _ "net/http/pprof"

Use pprof to view CPU:

	go tool pprof -pdf http://localhost:3999/debug/pprof/profile > o.pdf && open o.pdf

Heap:

	go tool pprof http://localhost:3999/debug/pprof/heap

Goroutines:

	go tool pprof http://localhost:3999/debug/pprof/goroutine

See `net/http/pprof` docs.


* Monitoring live servers

Demo: present

.link http://localhost:3999/debug/pprof

	go tool pprof -pdf http://localhost:3999/debug/pprof/goroutine > o.pdf && open o.pdf

Oh goodness!

.link https://github.com/golang/go/issues/11507


* Protecting the net/http/pprof endpoints

`net/http/pprof` registers endpoints with `http.DefaultServeMux`.

So don't use `http.DefaultServeMux`.

	serveMux := http.NewServeMux()
	// use serveMux to serve your regular website

	pprofMux := http.NewServeMux()
	pprofMux.HandleFunc("/debug/pprof/", pprof.Index)
	pprofMux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	pprofMux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	pprofMux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	// use pprofMux to serve the pprof handles

Or use a single non-default `ServeMux` but insert http handler middleware.


* Execution tracing

- New as of Go 1.5! Google for "Go execution tracer" to see the design doc.
- A few rough edges still.
- Incredibly detailed and powerful, with all the good and bad that that brings.


* Execution tracing

Demo: ngram

# go test -bench=. -trace=trace.out -benchtime=50ms
# go tool trace ngram.test trace.out

# before and after
# observe blocked goroutines before, interleaved goroutines after

# play with it and explore. I still am.


# switch to present14!


* Techniques and habits


* string and []byte


* string and []byte

Common source of performance problems.
Easy to learn good habits.
Helps to know what's happening under the hood.


* Under the hood

*string*

- basic type
- interpreted as UTF-8
- _immutable_

*[]byte*

- just another slice type
- no particular interpretation
- _mutable_

.play stringbytes/mutable.go /func set/,/^}/


* Correct conversions are expensive

Above all, the compiler and runtime must be correct.
Speed is a bonus.

In the general case, converting between string and []byte requires an alloc and a copy.

.play stringbytes/convert.go /func Benchmark/,/^}/


* Shrink and grow

*string*

- slicing is very cheap and safe
- concatenation is expensive (alloc + copy x 2)

*[]byte*

- slicing is very cheap but not obviously safe
- append is sometimes expensive (sometimes alloc, always copy x 1, sometimes copy x 2)


* Good habits

- Live in just one world (modulo code clarity and correctness).
- Convert as late as possible.
- Pay attention to concatenation, particularly in loops.

# that's why we have parallel strings and bytes packages
# now specific techniques/habits
# these are not rules. clarity trumps, but these are generally equally clear.
# these might only matter in loops. but get in the habit of writing performant code.


* bytes.Buffer

Use a bytes.Buffer to build strings.

.play stringbytes/buf.go /func Benchmark/,/^$/

# important change for c2go compiler performance!


* APIs

Use dedicated APIs:

- bytes and strings packages
- "io.Writer".Write vs io.WriteString
- "bufio.Scanner".Bytes vs "bufio.Scanner".Text
- "bytes.Buffer".Bytes vs "bytes.Buffer".String

Related: Implement WriteString for your io.Writers:

	func WriteString(s string) (n int, err error)


* Avoid building strings

If the set of choices is small, pick a string rather than building it.
(Or use stringer: golang.org/x/tools/cmd/stringer.)

.play stringbytes/pick.go /BenchmarkConstruct/,/func _/


* Order of operations

Convert last (usually).

For example, slice after converting.

.play stringbytes/slice.go /Repeat/,/func _/

If you're slicing multiple times, there are trade-offs: Multiple small alloc+copy vs one large monolithic chunk of memory.


* Easy on the Sprintf

Use concatenation and strconv instead of fmt.Sprintf for simple things.

.play stringbytes/strconv.go /func Benchmark/,/^$/


* API design

Design your APIs to allow reduced garbage.

- Provide []byte and string variants.
- Use io.Reader and io.Writer instead of buffers.
- BYO buffer.

Good:

	Read(p []byte) (n int, err error)

Bad:

	Read(n int) (p []byte, err error)


* Techniques

- Reuse buffers.
- Take advantage of compiler optimizations.
- Intern strings.

# distinguish habits from techniques:
# habits are things you should usually do;
# techniques are things to use when profiling says you need to optimize.

# compiler optimization overlaps with "delay conversion"


* Reuse buffers

.play stringbytes/reuse.go /pool/,/func _/

# can also reuse with a local free list. do whatever is appropriate.
# note that sync.Pool's efficiency is implementation-dependent.


* Convert last

Pop quiz: How many allocs/op in this benchmark?

.play stringbytes/convertnoescape.go /Benchmark/,/^}/

# answer: it varies!
# Go 1.4: 1
# Go 1.5: 0
# Go 1.5, with a longer byte slice: 1!
# explain escape analysis, compiler optimizations


* Compiler magic

Conversion optimizations in Go 1.5 include:

- map keys
- range expressions
- concatenation
- comparisons

Convert as late as possible to enable them to work.
(Future work may change that.)

More are possible. Those that work well on normal code may eventually be implemented.


* Map keys

The map key optimization is particularly interesting.

.play stringbytes/mapkey.go /Repeat/,/func _/


* Interning strings

.play stringbytes/intern.go /interned/,/func _/


* Caution

Be careful with interning!

- Advanced technique. Use with caution and only when necessary.
- Depends on compiler version.
- *Manual*memory*management!* Ewwwwww.
- Not thread safe. (But see github.com/josharian/intern for a hack.)


* Optimizing memory usage


* What is allocation?

Making a place to put stuff.

- You can't avoid all allocation. That's ok!
- Why it matters: allocation, zeroing/copying, GC, limited resource, impact on caches.
- Number of allocations vs size of allocation.

# Talking about allocs because they have are usually
# a significant contributor to program performance.


* What allocates?

Lots of things, but it varies by compiler.
In practice, there are no strict rules.

Common sources of allocations are:

- Data growth (append, concatenation, map assignment, stacks)
- new, make, and &
- string/[]byte conversion
- Interface conversions
- Closures

Develop good habits, profile, and benchmark.


* Escape analysis

- Heap vs stack
- Subtle, interacts with growable stacks and GC
- Stack pressure vs heap
- `-gcflags=-m`

Mostly, just know that it exists and what it is.

# open to improvements--file issues!


* Good habits

- Avoid unnecessary data growth.
- Avoid unnecessary string/[]byte conversions.
- Design APIs that allow re-use.
- Use values where you can.
- Avoid gratuitous boxing, reflection, and indirection.

# Stream instead of buffering.
# Example of values: 0, 1, 2 instead of *bool.
# Example of when not to use values: large arrays, particularly ranging over them.


* Unnecessary data growth

Buffer:

	buf, err := ioutil.ReadAll(r)
	// check err
	var x T
	err = json.Unmarshal(buf, &x)
	// check err

Stream:

	dec := json.NewDecoder(r)
	var x T
	err := dec.Decode(&x)
	// check err


* Unnecessary/deep recursion

Stacks take memory too. Stack growth is an alloc+copy+process.

(Most data growth is alloc+copy.)


* API design

Use io.Reader and io.Writer.

Also, io.Reader is a fine example itself!

	type Reader interface {
	    Read(p []byte) (n int, err error)
	}

It is hard to anticipate your users' needs. Give them the tools to be efficient if they need it.


* Use values

Value:

	type OptBool uint8

	const (
		Unset = OptBool(iota)
		SetFalse
		SetTrue
	)

Pointer:

	type OptBool *bool


But take care with large values.

	var a [10000]int{}
	for _, i := range a {
	}
	fmt.Println(a)

# not just copy cost, also impact on cache usage, etc.


* Go easy on the abstraction

- Reflect allocates heavily.
- Most interface conversions allocate.
- Creating closures usually allocates.

# abstraction is ok, just not needless abstraction


* Techniques

- Provide initial capacity estimates for data structures.
- Trade off allocation size and number of allocations.
- Reuse objects. Maintain a free list or use sync.Pool.
- Steal ideas from the standard library.


* Initial capacity estimates


Delayed/multiple allocs vs exactly one alloc

.play alloc/initialcap.go /size/,/func _/

# Disagreement over whether this is habit vs technique.


* Pre-allocate backing array

    type Buffer struct {
    	buf       []byte
    	off       int
    	runeBytes [utf8.UTFMax]byte
    	bootstrap [64]byte
    	lastRead  readOp
    }

runeBytes avoids allocation during WriteRune:

	utf8.EncodeRune(b.runeBytes[0:], r)

bootstrap avoids allocation for small buffers:

	b.buf = b.bootstrap[0:]

# trading off alloc size vs number of allocs
# this about impact on cache


* Reuse objects

Local re-use is better:

	buf := make([]byte, 1024)
	for {
		n, err := r.Read(buf)
		// use err, n, buf
		// look out: buf's contents will be overwritten in the next Read call
	}


Sometimes there's no context (type or scope) to allow re-use. Enter sync.Pool:

	for {
		buf := pool.Get().([]byte)
		// use buf
		// optional: clear buf for safety
		for i := range buf {
			buf[i] = 0
		}
		pool.Put(buf)
	}

# TODO: benchmark


* Struct layout

Go guarantees struct field alignment.

	type Efficient struct {
		a interface{}
		b *int
		c []int
		d uint16
		e bool
		f uint8
	}

	type Inefficient struct {
		e bool
		a interface{}
		f uint8
		b *int
		d uint16
		c []int
	}

	var e Efficient
	var i Inefficient
	fmt.Println(unsafe.Sizeof(e), unsafe.Sizeof(i)) // 28 36


* Struct layout

cmd/wasted:

.link https://golang.org/cl/2179

Unlikely to happen automatically:

.link https://golang.org/issue/10014

"No Go compiler should probably ever reorder struct fields. That seems like it is trying to solve a 1970s problem, namely packing structs to use as little space as possible. The 2010s problem is to put related fields near each other to reduce cache misses, and (unlike the 1970s problem) there is no obvious way for the compiler to pick an optimal solution. A compiler that takes that control away from the programmer is going to be that much less useful, and people will find better compilers." - Russ Cox

# Discuss cache impact
# Discuss GC impact of having pointers first
# Mention gc compiler and Nodes
# Mention runtime alloc size classes


* Optimizing concurrent programs


* Optimizing concurrent programs

Concurrency correctness is hard, even with Go.


* Habits

- Use mutexes instead of channels for simple shared state.
- Minimize critical sections.
- Don't leak goroutines.
- Gate access to shared resources, particularly the file system.

# not much else to say about mutexes vs channels
# chann


* Mutexes and channels

Mutexes are good for mutual exclusion, like simple shared state. They are fast and simple in such cases.

Channels are for everything else: Flow control, communication, coordination, select.

# Honest truth: This is actually about readability and code clarity.
# It just happens to also coincide with good performance advice.


* Minimize critical sections

Separate work that requires shared state from work that does not.
Only hold the lock when you really need it. Refactor as needed.

Before:

	func (t *T) Update() {
		t.Lock()
		defer t.Unlock()
		// expensive work that can be done independently
		// update shared state
	}

After

	func (t *T) Update() {
		// expensive work that can be done independently
		t.Lock()
		defer t.Unlock()
		// update shared state
	}


* Don't leak goroutines

When you start a new goroutine, pause to ask when it will/how it will complete.

	func doh(c chan int) {
		go func() {
			for i := range c {
				// use i
			}
		}()
		// who closes c? who calls doh?
	}

Goroutines are so cheap you might not notice leaks quickly.

Profile or manually inspect the result of a SIGQUIT.


* Gate access to shared resources

It's easy to thrash the filesystem, make lots of threads, and create churn in the scheduler. It's also easy to prevent.

	type gate chan bool

	func (g gate) enter() { g <- true }
	func (g gate) leave() { <-g }

	type gatefs struct {
		fs vfs.FileSystem
		gate
	}

	func (fs gatefs) Open(p string) (vfs.ReadSeekCloser, error) {
		fs.enter()
		defer fs.leave()
		// ...
		return gatef{file, fs.gate}, nil
	}

	var fsgate = make(gate, 8)

# not needed for the network
# needed for cgo


* Use buffered I/O

Every read or write to a file corresponds to a system call. These are relatively expensive, particularly in high numbers.

The `bufio` package makes buffered I/O easy. Use it.

	f, err := os.Open("abc.txt")
	// handle err
	r := bufio.NewReader(f)

# not really a concurrency thing, but we were talking about the filesystem
# expensive because of concurrency support--not just a syscall, but also stack switches, scheduler interactions


* Techniques

- sync.RWMutex is only sometimes better than sync.Mutex.
- Use buffered channels.
- Provide backpressure or dropping.
- Partition shared data structures.
- Batch work to amortize cost of lock acquisition.
- Use sync/atomic.
- Cooperate with the scheduler
- Avoid false sharing by padding data structures.


* sync.RWMutex vs sync.Mutex

sync.RWMutex does strictly more work than sync.Mutex and has more complicated semantics.

# Discuss writer starvation

sync.RWMutext can help a lot, but it can also hurt. Profile and/or benchmark.

# don't just automatically use RWMutex


* Use buffered channels

Buffered and unbuffered channels have different semantics and synchronization guarantees.

Buffered channels are much cheaper, if both semantics work for you.

# discuss queueing theory: buffer size only provides a buffer
# buffer sizes come at a memory cost


* Provide backpressure or dropping

Critical for operational stability of distributed services, but also useful for concurrency.


* Partition shared data structures

Before:

	type Counter struct {
		mu sync.Mutex
		m  map[string]int
	}

After:

	const shards = 16

	type Counter struct {
		mu [shards]sync.Mutex
		m  [shards]map[string]int
	}

Can reduce contention.

Adds cost of hashing, increases data structure size, and depends on distribution of data. Measure with real world data. `rand.Zipf` can be helpful for benchmarks.


* Batch work

	// consumer
	var sum int
	for i := range c {
		sum += i
	}

	// producer before
	for !done {
		sum := count(stuff)
		c <- sum
	}

	// producer after
	for !done {
		sum := 0
		for i := 0; i < 16; i++ {
			sum += count(stuff)
		}
		c <- sum
	}

Can dramatically reduce contention, but not always applicable. Can introduce delays due to batching.

# example: goroutine goids per-thread
# example: testing.PB
# example: hand out worklist items a slice at a time


* atomic.Value

	var (
		configmu    sync.Mutex    // protects configvalue
		configvalue *atomic.Value // value of map[string]string
	)

	func config() map[string]string {
		return configvalue.Load().(map[string]string)
	}

	func set(key, val string) {
		configmu.Lock()
		defer configmu.Unlock()
		old := config()
		m := make(map[string]string, len(old)+1)
		for k, v := old {
			m[oldk] = oldv
		}
		m[k] = v
		configvalue.Store(m)
	}

For frequently read but infrequently written data structures. Requires copy-on-write and writer synchronization (or a single writer). Danger of logical races.


* atomic int and pointer operations

	var count uint32

	func inc() {
		atomic.AddUint32(&count, 1)
	}

	func get() uint32 {
		return atomic.LoadUint32(&count)
	}

Cheapest form of concurrency-safety available in Go. Great caution required; very easy to misuse in subtle ways!

Mostly helpful for cheap, scalable counters.

If you use atomic.* with a value anywhere, you must use it everywhere!

Extra special care required when using 64 bit integer sizes on 32 bit platforms due to alignment requirements.


* Cooperative scheduling

Go scheduling is currently cooperative. It mostly just works, except for tight loops with no function calls.

	var x int
	for i := 0; i < 1<<30; i++ {
		x = x ^ i
	}

Solution: Use runtime.Gosched()

	var x int
	for i := 0; i < 1<<30; i++ {
		x = x ^ i
		if i & 0xFFFF == 0xFFFF {
			runtime.Gosched()
		}
	}

# may be done automatically by future compilers


* Avoid false sharing

Usually solvable by rearrangement or padding.

	type T [1024]Padded

	type Padded struct {
		mu sync.Mutex
		x  *X
		_  [128]byte
	}

Diagnose first; the medicine is bitter.


* Advanced tools and techniques


* Advanced tools and techniques

- Compiler flags
- Runtime flags and calls
- Assembly and cgo
- Code generation
- Micro-optimizations


* Compiler flags

Use:

	go build -gcflags=-S pkg

Or:

	go tool compile -S a.go b.go c.go

# call out cmd/compile vs cmd/6g

Important flags:

	-h	help
	-S	print assembly listing
	-m	print optimization decisions such as escape analysis
	-l	turn off inlining, repeat to make inlining more aggressive
	-N	disable optimizations
	-B	disable bounds checking

# walk through these, demo them

Demo

# mainly you should know that these exist
# if you find bad codegen: (a) maybe it is needed for correctness, (b) file a bug, (c) change your code


* GODEBUG and GOGC

Sample GODEBUG use:

	GODEBUG=scheddetail=1,schedtrace=1000 go run x.go

# but only one at a time, probably

Useful GODEBUG variables for performance investigation:

	allocfreetrace=1: print all allocs and frees (it's a lot!)
	gctrace=1, gctrace=2: print GC activity
	schedtrace=X: print scheduler state every X ms
	scheddetail=1: print detailed scheduler state

# allocfreetrace is voluminous but useful
# gctrace I haven't needed
# schedtrace eclipsed by trace viewer except Go 1.4 or if you can't edit the code or...

Sample GOGC use:

	GOGC=off go run x.go

Or:

	runtime.SetGCPercent(-1) // -1 for off, 50 for aggressive GC, 100 for default, 200 for lazy GC


* cgo

Plenty of rope.

- overhead: stack switch and calling convention change
- takes up a thread
- medium chunks of work
- cross-compilation is not trivial

# need external linking, cross-compiler for C, etc.
# useful (performance-wise) for access to optimized C libraries
# but see also gonum, gccgo


* Assembly

Upgrade from rope to gun.

- overhead: function call
- dangerous
- basically undocumented
- small chunks of work (but not individual instructions)
- not subject to Go 1 guarantee
- go vet is helpful

Useful when there is no other way.

# extreme control, lots of opportunity for mistakes, loss of safety, not portable


* Code generation

# maybe not advanced topic

Helpful for:

- Avoiding the need for an abstraction layer (yes yes, generics)
- Unrolling loops or calculations
- Generating pre-calculated tables
- Generating efficient code that is hard to read or maintain

	const _Num_name = "OneTwo"

	var _Num_index = [...]uint8{0, 3, 6}

	func (i Num) String() string {
		if i < 0 || i+1 >= Num(len(_Num_index)) {
			return fmt.Sprintf("Num(%d)", i)
		}
		return _Num_name[_Num_index[i]:_Num_index[i+1]]
	}

* Micro-optimizations

Knuth alert!

- arrays instead of maps for lookups with small integer keys

	var a = [10]string{2: "even prime", 9: "maybe prime"}
	var m = map[int]string{2: "even prime", 9: "maybe prime"}

- slice instead of map for very small quantities of data
- manually unwind instead of using defer
- index into slice instead of pointer in giant data structures

# arrays: lookup is the same
# slices: O(n) < O(1) when n is tiny and 1 is small
# only ok to manually unwind if no one catches panics
# helps reduce GC time

Compiler dependent:

- optimized memclear
- rotate instruction: `i<<13`|`i>>(64-13)`


* Other kinds of optimization

- Binary size
- Build time


* Binary size

# probably doesn't matter, but good to have options for when it does


* go tool nm

Helpful for finding large static data:

	package main

	var a [100000]int

	func main() {
		_ = a[0] // prevent the linker from dropping a
	}

Result:

	$ go build bigarray.go && go tool nm -size -sort=size bigarray | head -n 4
	   d47c0     800000 B main.a
	   86140     152299 R runtime.pclntab
	   4d220     124312 T runtime.etext
	   4d220     124312 R type.*

# go tool compile -w can help find the meaning of an autotmp, or ask on golang-nuts


* Millions of strings

The only way to get millions of strings is to generate them. If you're generating that many, generate them as a single string and slice as needed.

	var nums = [...]string{"0", "1", "2", "3", "4", ..., "99999"}

Binary size: 3471840 bytes

	var nums = "0123...99999"

Binary size: 1558960 bytes

# note that this is what stringer does
# note that first binary used to be 5mb+ in Go 1.4!
# we should fix the toolchain. not as easy as it sounds.


* ldflags

Normal:

	$ go build helloworld.go && stat -f "%N: %z bytes" helloworld
	helloworld: 2344944 bytes

Without DWARF:

	$ go build -ldflags=-w helloworld.go && stat -f "%N: %z bytes" helloworld
	helloworld: 1746928 bytes

But you lose debug information.


* Build time

# worse with Go 1.5, will get better again


* go install

`go`build` builds and discards.
`go`install` build and keeps.

Use `go`install`.

# biggest, easiest, most important fix
# lots of staleness bugs fixed in Go 1.5


* Enable caching of stable code

The unit of compilation is the package.

If you have a large, stable chunk of code (frequently a generated file containing assets), put it in a different package than high churn code. Use internal packages if you're worried about API visibility.


* Split up giant functions

This oughtn't matter. It does.

Giant static data and tables can generate giant functions. (See previous slide.)

One hack: Use multiple init functions.

# this should get better in Go 1.6, I hope.
# please file issues for egregious cases.


* Wrap-up


* Want more?

- Read the standard library
- Blog posts by Dmitry Vyukov and Russ Cox
- Lurk on golang-codereviews@googlegroups.com (or even better, don't just lurk!)
- Ask questions on golang-nuts
- Experiment!


* Experiment where?

- Profile your own code. (But know when to stop.)
- Pick an open source project. Find and fix a significant performance problem. (For all but the largest projects, there's usually at least one.)
- Futz around with the standard library. (But remember that clarity and maintainability trumps speed.)


* Reminders

- Write simple, clear code
- Write tests and use version control
- Cheat (solve an easier problem instead)
- Develop good habits
- Know your tools and use them


* Stump the chump