It’s time to tackle the most important computer topic of them all. What’s better, an array or a linked list? Can we solve this mystery?
Pretend you’re a computer for a sec
Somebody has given you a program to execute. It’s a list of instructions that somebody smart (a programmer) using some smart tools (a programming language and a compiler) translated into something a dummy like you (a computer) can do. Let’s start with a super simple model of how computers work and see if we can answer our question.
Computers have a few places they can keep numbers called registers. They can do some math on the numbers in these registers (like ADD the values in two registers together and put the result in another register). They also have a much larger place to store data called memory. Memory isn’t super close to the computer so they can’t do any operations on data stored there; all they can do is LOAD a piece of data from memory to a register and STORE a piece of data from a register back to memory.
Let’s assume our list is in memory and we’re going to execute a program that adds up all the elements. An array is a contiguous block of numbers in memory. All you need to know to handle an array is the address of the first element and how many elements there are.
An array in memory that starts at address 5 might look like this.
...
5: 1
6: 2
7: 3
8: 4
9: 5
...
A program to add the elements in an array could look like this.
p_array = 5 // address the array starts at
length = 5 // length of the array
i = 0 // index of the element we're on
result = 0 // sum of all the elements
// Loop over the array
loop:
p_element = ADD(p_array, i) // address of element
element = LOAD(p_element) // load the element
result = ADD(result, element) // add element to result
i = ADD(i, 1) // increment index
if i < length GOTO loop // loop
In a linked list the elements don’t need to be in order in memory. Each element also contains a pointer to the next element, which is just the address the next element lives at. So each node of the list takes two slots in memory: one for the number and one for the pointer. A next pointer of 0 marks the end of the list. A linked list in memory that starts at address 5 might look like this.
...
5: 1
6: 9
7: 5
8: 0
9: 2
10: 11
11: 3
12: 950
...
950: 4
951: 7
...
A program to add the elements in a linked list could look like this.
p_element = 5
loop:
element = LOAD(p_element) // load the element
result = ADD(element, result) // add element to result
p_next = ADD(p_element, 1) // address of next pointer
p_element = LOAD(p_next) // load next element address
if p_element != 0 GOTO loop // loop
Can we tell which one is better yet?
By instruction counts alone they are the same. Each loop has 5 instructions per iteration.1 Does that mean they are equivalent?
The array program has 3 ADD instructions and 1 LOAD; the linked list program has 2 ADD instructions and 2 LOAD instructions. So if ADDs are faster than LOADs, it seems like the array version would be faster.
This is true, adding two numbers together is faster than loading a number from memory. It’s actually WAY WAY WAY faster. Memory is really far away in computer terms.
Modern computers have ways to deal with this problem. Since loading from memory is so slow, they don’t want to just wait for data all day, so they cheat in two ways. First, they have caches for data from memory that are closer to the CPU. There are small, fast ones close to the processor that tie into larger, slower ones, chaining out to main memory. When you LOAD an address, the computer first looks in the closest cache; if the data isn’t there it checks the next cache, and so on out to main memory.
If you LOAD address 7 it’s slow because you go all the way out to main memory, but if you LOAD it again later it’s fast because it’s in the cache.2
That seems like it would help, but in our programs we only ever LOAD each address once. This is where the second trick comes in. The caches don’t LOAD elements from memory one at a time, they LOAD them in big blocks called cache lines. If you ask to LOAD address 0, the computer will actually LOAD addresses 0–63 into the cache, so any further LOADs of those addresses will be fast. This means that in our array example we’ll LOAD the whole array into the cache at once and all the LOADs will be fast. The linked list will work similarly, but OH NO, in our example one of the nodes is out all the way at address 950, which won’t be in the cache. That means we’ll have at least one more slow LOAD in that program than in the array version, so clearly arrays are faster than linked lists!
Hold on a sec though, what if all the nodes in the linked list were in the same cache line? Both programs would only be doing fast cache loads and adds so they would be equally fast. Is this the answer? That the data structure doesn’t matter, all that matters is the data locality? Sort of. This is one of the biggest factors in performance. Make sure all the data you’re using is packed together.
What if the lists were very long? In our model, if a cache line is 64 addresses, that’s room for an array with 64 elements but a linked list with only 32 elements, since each node in a linked list also stores the address of the next element. You can fit twice as much data in the array as in the linked list. This can be a huge factor when the lists are large. Another point for the array!
There’s a big piece of this that our model doesn’t factor in yet. In our model, each instruction is executed one at a time. Modern computers don’t actually work this way.3 They try to execute many instructions at the same time. You don’t tell them which ones to do in parallel;4 they figure it out based on the data dependencies of the individual instructions. Say we have two instructions:
a = ADD(b,c)
d = ADD(f,g)
Those two ADDs can be done at the same time because they don’t depend on each other. If instead you have:
a = ADD(b,c)
d = ADD(a,e)
They can’t be done at the same time, because the second ADD needs the result of the first.
Let’s look at our examples. If the instructions had to execute one at a time, a trace of the full loop would look like this in the array case.
00: p=ADD(a,i),
01: e=LOAD(p),
02: r=ADD(r,e),
03: i=ADD(i,1)
04: p=ADD(a,i),
05: e=LOAD(p),
06: r=ADD(r,e),
07: i=ADD(i,1)
08: p=ADD(a,i),
09: e=LOAD(p),
10: r=ADD(r,e),
11: i=ADD(i,1)
12: p=ADD(a,i),
13: e=LOAD(p),
14: r=ADD(r,e),
15: i=ADD(i,1)
16: p=ADD(a,i),
17: e=LOAD(p),
18: r=ADD(r,e),
19: i=ADD(i,1)
And this for the linked list.
00: e=LOAD(p),
01: r=ADD(e,r),
02: n=ADD(p,1),
03: p=LOAD(n)
04: e=LOAD(p),
05: r=ADD(e,r),
06: n=ADD(p,1),
07: p=LOAD(n)
08: e=LOAD(p),
09: r=ADD(e,r),
10: n=ADD(p,1),
11: p=LOAD(n)
12: e=LOAD(p),
13: r=ADD(e,r),
14: n=ADD(p,1),
15: p=LOAD(n)
16: e=LOAD(p),
17: r=ADD(e,r),
18: n=ADD(p,1),
19: p=LOAD(n)
Pretty much the same.
But with the array we can actually do all the steps at the same time if we look across iterations of the loop. We can do the address math and load the next element while we’re still adding up the last one. A full trace would look like this.
00: p1=ADD(a,i0), i1=ADD(i0,1)
01: p2=ADD(a,i1),e1=LOAD(p1), i2=ADD(i1,1)
02: p3=ADD(a,i2),e2=LOAD(p2),r=ADD(r,e1),i3=ADD(i2,1)
03: p4=ADD(a,i3),e3=LOAD(p3),r=ADD(r,e2),i4=ADD(i3,1)
04: p5=ADD(a,i4),e4=LOAD(p4),r=ADD(r,e3),i5=ADD(i4,1)
05: e5=LOAD(p5),r=ADD(r,e4),
06: r=ADD(r,e5),
For the linked list we can’t do quite as well. We can’t load the next element until we’ve loaded the pointer to it, so we can’t get multiple steps ahead like we can with the array. A full trace would look like this.
00: e1=LOAD(p0),
01: r=ADD(e1,r),n1=ADD(p0,1),
02: p1=LOAD(n1)
03: e2=LOAD(p1),
04: r=ADD(e2,r),n2=ADD(p1,1),
05: p2=LOAD(n2)
06: e3=LOAD(p2),
07: r=ADD(e3,r),n3=ADD(p2,1),
08: p3=LOAD(n3)
09: e4=LOAD(p3),
10: r=ADD(e4,r),n4=ADD(p3,1),
11: p4=LOAD(n4)
12: e5=LOAD(p4),
13: r=ADD(e5,r),n5=ADD(p4,1),
14: p5=LOAD(n5)
Seems like arrays are the clear winner!5
I’ve only recently started to understand computers this way. Some of the places I’ve learned from include:
This Mike Acton talk.
This blog by Fabian “ryg” Giesen, especially the post in the footnotes.
Handmade Hero, especially this episode.
I really want to see these things in action. Does anybody know if there are ways to see which instructions are executed each cycle or when loads miss the cache or anything like this on any mainstream computers / operating systems?
1. About the 5: we’re sorta ignoring the magic loop part, which probably isn’t just one instruction, but we can assume that it’s the same amount of work in each of the examples.
2. Unless of course another thread has changed it! Across cores, caches have to synchronize, so unless each core is working on different cache lines of data, things are gonna get slow.
3. This whole section is a really bad version of this better blog post you should read: A whirlwind introduction to dataflow graphs.
4. There is another thing called SIMD where you do tell the computer to do multiple things at the same time, but it’s more like telling it to do the same thing on multiple pieces of data: ADD these 8 sets of numbers all at once and put the results in these 8 places.
5. It’s not really clear; there are many more concerns than we’ve covered here.