Author Topic: How L1 and L2 caches work (Read 3936 times)

ElPolloDiabl · « **on:** September 11, 2014, 09:38:06 PM »

Here is an article on how caches work. Really interesting the difference between 1980 and now.
Link:
http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

biggun · « **Reply #1 on:** September 12, 2014, 11:05:27 AM »

Quote from: ElPolloDiabl;772804

Here is an article on how caches work. Really interesting the difference between 1980 and now.
Link:
http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

A few things are badly explained, or even misleadingly explained, there.
For example, it's not explained why people don't simply grow the size of the 1st level cache.
The concept of cache ways is badly explained, as a CPU does not iteratively search through its cache ways; all the compares are done in parallel. In reality a fully associative cache works well up to a certain size. The two problems that come up then are the huge number of comparators needed (i.e. chip area) and the size of the mux needed.
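To make the parallel compare concrete, here is a minimal sketch of a set-associative lookup. The line size, set count and way count are assumptions picked for illustration, and `CacheSet`, `split_address` and `lookup` are hypothetical names, not anything from the article. The list of comparison results stands in for the per-way comparators, which real hardware evaluates all at once.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 128   # number of sets (assumed)
NUM_WAYS = 4     # ways per set (assumed)


@dataclass
class CacheSet:
    # One tag slot per way; None means the way holds no valid line.
    tags: list = field(default_factory=lambda: [None] * NUM_WAYS)


def split_address(addr):
    """Split an address into (tag, set index); the line offset is dropped."""
    line = addr // LINE_SIZE
    return line // NUM_SETS, line % NUM_SETS


def lookup(sets, addr):
    """Return the way that hits, or None on a miss."""
    tag, index = split_address(addr)
    # Hardware has one comparator per way and evaluates them simultaneously;
    # this list models the NUM_WAYS comparator outputs.
    hits = [stored == tag for stored in sets[index].tags]
    # The mux then selects the data of the single way whose comparator fired.
    return hits.index(True) if any(hits) else None


sets = [CacheSet() for _ in range(NUM_SETS)]
tag, index = split_address(0x12345)
sets[index].tags[2] = tag        # pretend the line was filled into way 2
print(lookup(sets, 0x12345))     # hit in way 2
print(lookup(sets, 0x54321))     # miss
```

A fully associative cache is the special case of a single set, so the number of comparators equals the number of lines, which is where the size problem comes from.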

psxphill · « **Reply #2 on:** September 12, 2014, 11:50:36 AM »

Quote from: biggun;772834

The concept of cache/ways is badly explained as a CPU does not iterative search through its cache ways but all the compares are done in parallel.

I agree they would be done in parallel, but the latency is likely to be higher due to the extra complexity. Cost is the major reason for not doing a fully associative cache; it doesn't offer that much advantage for the complexity it adds.

Quote from: biggun;772834

For example its not explained why people not simply grow the size of the 1st level cache.

Does that need explaining though? I would have thought it was obvious that you can't just add 16 GB of L1 cache.
L1 cache is super fast, much faster than your standard RAM. If you could get really large, fast RAM really cheap, then you wouldn't need any cache at all.

ElPolloDiabl · « **Reply #3 on:** September 12, 2014, 12:08:02 PM »

The article is a bit oversimplified. I like the new rule "add a cache level every 10 years" lol

psxphill · « **Reply #4 on:** September 12, 2014, 02:11:54 PM »

Quote from: ElPolloDiabl;772836

The article is a a bit over simplified, I like the new rule "add a cache level every 10 years" lol

I suspect that won't hold up as well as Moore's law.

Although improvements in memory latency are long overdue.

biggun · « **Reply #5 on:** September 12, 2014, 02:34:21 PM »

Quote from: psxphill;772835

I agree they would be done in parallel, but the latency is likely to be higher due to the extra complexity. Cost is the major reason for not doing fully associate cache, it doesn't offer that much advantage for the complexity it adds.

Does that need explaining though? I would have thought it was obvious that you can't just add 16GB of L1 cache.
L1 cache is super fast, much faster than your standard ram. If you could get really large fast ram for really cheap then you wouldn't need any cache at all.

This is not the point.

The point is:
* If you have a 32 KB 1st level cache,
* why do they add another 128 KB 2nd level cache, instead of simply growing the 1st level cache to 128 KB?

What they say about the 2nd level cache being "cheaper" is bull%&$#?@!.

The true reason is latency. The latency of a cache grows with its size.

A small 1st level cache today, with a size of 16-32 KB, has a latency of 2-4 clocks.

A 2nd level cache with 256 KB might have a latency of 20 clocks.

Today they use a pyramid system to get both:
1) a big total cache,
2) and, at least for a small part of it, a very low latency.

TeamBlackFox · « **Reply #6 on:** September 12, 2014, 07:01:51 PM »

PA-RISC proved that large L1 cache doesn't have to be slow, but it did have heat and power consumption issues, especially the last few models in 2006-'08. From what I know, the PA-RISC architecture had basically two L1 caches, vs L1 and L2. However, compared to Alpha, PA-RISC was a terrible performer in general calculations, but it did kick x86's arse well into the NetBurst era. It would have been interesting to see what the computing industry would be today if Itanium had never been developed.

psxphill · « **Reply #7 on:** September 12, 2014, 10:11:46 PM »

Quote from: biggun;772840

The true reason is latency. The latency of a cache is propotional to its size.

Latency is an issue, however I believe that if they threw a large amount of money at it then they could work round that and increase the L1 cache by a modest amount. They just can't justify the cost (size and heat are effectively costs).

biggun · « **Reply #8 on:** September 13, 2014, 01:35:56 AM »

Quote from: psxphill;772873

Latency is an issue, however I believe that if they threw a large amount of money at it then they could work round that and increase the L1 cache by a modest amount. They just can't justify the cost (size and heat are effectively costs).

Just increasing the size of the L1 is technically a lot easier
than building a complex combination of L1 and L2 which need to communicate to function.

Even money cannot change the laws of physics.
And physics simply dictates that increased size needs more space.
More space means longer wires, and longer wires mean longer latency.
It's very simple physics, which everybody will understand.

Of course latency is relative to the clockrate of the whole CPU.
An L1 of 32 KB might result in a latency of 4 cycles for a 4 GHz CPU.
If you only aim at a CPU clockrate of 500 MHz then you can safely increase the L1 to 1 MB ....
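A rough sketch of why the clockrate matters. It assumes, as a simplification, that a given cache array has a fixed access time in picoseconds regardless of the core clock. The function name is made up for illustration, and the 20-cycle figure is just an example of a bigger, slower array.

```python
def latency_in_cycles(access_time_ps, clock_mhz):
    """Whole clock cycles needed to cover a fixed access time."""
    cycle_ps = 1_000_000 // clock_mhz         # length of one clock cycle
    return -(-access_time_ps // cycle_ps)     # integer ceiling division


# 4 cycles at 4 GHz is 4 * 250 ps = 1000 ps of access time.
small_l1_ps = 4 * 250
# A bigger array costing 20 cycles at 4 GHz is 5000 ps.
big_array_ps = 20 * 250

for clock_mhz in (4000, 500):
    print(clock_mhz, "MHz:",
          latency_in_cycles(small_l1_ps, clock_mhz), "cycles (small),",
          latency_in_cycles(big_array_ps, clock_mhz), "cycles (big)")
```

Under these assumptions the array that costs 20 cycles at 4 GHz costs only 3 cycles at 500 MHz, which is why a slowly clocked core can afford a much bigger L1.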

ElPolloDiabl · « **Reply #9 on:** September 13, 2014, 01:53:33 AM »

Is it actually far enough away to cause latency? Everything is on the chip now.

I've also read that adding heaps of level 2 cache is cheaper than increasing the complexity of the cpu. In manufacturing if any of the level 2 cache comes out defective you can route around it instead of tossing the whole chip.

biggun · « **Reply #10 on:** September 13, 2014, 01:59:18 AM »

Quote from: TeamBlackFox;772855

PA-RISC proved that large L1 cache doesn't have to be slow, but it did have heat and power consumption issues, especially the last few models in 2006-`08. From what I know, the PA-RISC architecture had basically two L1 caches, vs L1 and L2. However, compared to Alpha, PA-RISC was a terrible performer in general calculations, but it did kick x86's arse well into the NetBurst era. It would have been interesting to see what the computing industry would be today if Itanium had never been developed.

To be precise:
The early PA-RISC chips had a tiny 2 KB internal L1 cache, which they did not call L1 but L0,
and a large external cache, which technically was an L2 but which they called L1.
So this large L1 is a naming thing.

The latest PA-RISC chips had a relatively big L1.
But you have to keep in mind that all PA-RISC chips were clocked very slowly compared to other chips.
This means this L1 did have a huge latency, but the CPU did not feel it as much, being clocked relatively slowly.

Latency is always in relation to your clockrate.
If your chip is only clocked at 25% or 30% of the clockrate of other CPUs,
then you can make your L1 cache larger without seeing a penalty.

For our current 68K CPU development it's the same story.
As we base our CPU on FPGAs, our clockrate is limited by design.
But we can have relatively huge L1 caches without penalty, as our clockrate is limited anyway.
Assuming you use a decent-sized FPGA, I can instantiate you
a 68K CPU with 1 MB of L1 cache with no technical problem at all.

biggun · « **Reply #11 on:** September 13, 2014, 02:18:29 AM »

Quote from: ElPolloDiabl;772888

Is it actually far enough away to cause latency
Everything is on the chip now.

Yes, we are talking here about latency inside one single chip.
There is a significant latency in getting from one side of the chip to the other.

Quote from: ElPolloDiabl;772888

I've also read that adding heaps of level 2 cache is cheaper than increasing the complexity of the cpu.

Making a cache bigger does not increase its complexity.
A cache is first of all just an array.

For example: I can design you caches with the same attributes, e.g. the same number of ways, but with different sizes.

You can have a cache with 4 ways and a total size of 1 KB
Or you can have a cache with 4 ways and a total size of 4 KB
Or you can have a cache with 4 ways and a total size of 16 KB
Or you can have a cache with 4 ways and a total size of 64 KB
Or you can have a cache with 4 ways and a total size of 128 KB

All these 5 caches would take the same number of lines of code to describe.
All would have the same design complexity.
But they would all have a different chip size and therefore different latencies.
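To illustrate, here is a small sketch where the five caches come out of one and the same description and only the size parameter changes. The 64-byte line size is an assumption, and `geometry` is a made-up helper name.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)
WAYS = 4        # same number of ways for every size


def geometry(total_bytes, ways=WAYS, line_size=LINE_SIZE):
    """Return (number of sets, index bits) for a cache of the given size."""
    lines = total_bytes // line_size
    sets = lines // ways
    index_bits = sets.bit_length() - 1   # valid because sets is a power of two
    return sets, index_bits


for kb in (1, 4, 16, 64, 128):
    sets, index_bits = geometry(kb * 1024)
    print(f"{kb:>4} KB: {WAYS} ways, {sets:>4} sets, {index_bits} index bits")
```

The logic (4 comparators and a 4-way mux) stays the same for all five; only the array behind it grows, and with it the area and the wire length.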

Quote from: ElPolloDiabl;772888

In manufacturing if any of the level 2 cache comes out defective you can route around it instead of tossing the whole chip.

You can do the same with L1.
The typical way of doing this is to design the cache with more banks or ways than you will actually use in operation, like working with a 4-way cache but physically putting 5 ways on the chip. During manufacturing you self-test the design, and if 1 way is bad, the chip uses the 5th spare way instead.
This is the common way of doing it.
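A toy sketch of the spare-way idea, assuming the self-test simply reports a set of defective physical ways. The names and the remapping scheme are hypothetical; real chips do this with fuses or repair registers, not software.

```python
LOGICAL_WAYS = 4    # ways the cache uses in operation
PHYSICAL_WAYS = 5   # ways physically on the chip (one spare)


def build_way_map(bad_ways):
    """Map each logical way to a working physical way, skipping bad ones."""
    good = [w for w in range(PHYSICAL_WAYS) if w not in bad_ways]
    if len(good) < LOGICAL_WAYS:
        raise ValueError("too many defective ways, chip cannot be repaired")
    return good[:LOGICAL_WAYS]


print(build_way_map(set()))   # no defects: the spare way stays unused
print(build_way_map({1}))     # way 1 is bad: the spare way 4 is brought in
```

With one spare, any single defective way can be tolerated; two defects in the same cache would still mean a bad chip under these assumptions.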

Today the L1, L2 and L3 caches can all be on the very same chip.
As the L1 is very small, it can be placed very close to the CPU's units, giving it a low latency.
The L2, being bigger, will be further away on the chip and will therefore have a longer latency.
The L3, being much bigger again, will have an even longer latency.

To conserve energy you typically also use lower clockrates for the biggest cache level,
e.g. run the L3 at half the CPU speed or so.

A6000 · « **Reply #12 on:** September 13, 2014, 03:32:35 PM »

@Biggun, sorry, I don't understand this,

An L1 cache is fast because it is small, make it bigger and it slows down, so they add an L2 cache, but an L2 cache is slower than an L1.
I do not understand why an L1+L2 is faster than a larger L1 cache.

biggun · « **Reply #13 on:** September 13, 2014, 03:54:17 PM »

Quote from: A6000;772923

@Biggun, sorry, I don't understand this,

An L1 cache is fast because it is small, make it bigger and it slows down, so they add an L2 cache, but an L2 cache is slower than an L1.
I do not understand why an L1+L2 is faster than a larger L1 cache.

OK, let's use a real-world example.

Let's say your CPU runs at 4 GHz.
Let's say your CPU is superscalar and can execute 2 instructions per clock.
So you have a theoretical peak performance of
8000 MIPS, if you work only with registers.
So far all clear?

Now let's say your memory has a latency of 200 cycles.
This means that if you work not with registers but with memory (counting at most one memory access per clock), your performance degrades to
4000/200 = 20 MIPS

So we want to improve this.

The best that we could create is a 32 KB 1st level cache with a latency of 4 cycles.
If our work variables fit in the cache (100% hit rate) we will reach
4000/4 = 1000 MIPS

If our CPU design can work around the latency,
e.g. either using vertical pipelining (060/Phoenix-like)
or with compiler code restructuring and out-of-order support,
then we could reach up to
4000/1 = 4000 MIPS

As you see the advantage of the 1st level cache is very clear.

Now we might have problems needing more than 32 KB of variables.
These will have a lower hit rate and run slower,
as every time we miss the 1st level cache we go out to main memory, needing 200 cycles.

We have two options now.
a) Increase the 1st level cache size, e.g. to 256 KB.
Such a cache would have about 20 cycles of latency.
One thing is clear: out-of-order can with luck work around 2-3 cycles of latency, but not around 20.
This means even with out-of-order our CPU would get a big performance hit.
Realistically we would reach
4000/10 = 400 MIPS
As you see, this big 1st level cache is very bad for performance now.
We have MUCH less performance than we had with the smaller cache.

b) The best solution is implementing 2 levels of cache:
32 KB with low latency
256 KB with 20 cycle latency

Let's say our test program has a 50% hit rate with 32 KB and a 100% hit rate with 256 KB.

Using only a small and fast 1st level cache:
4000 * 0.5 = 2000 MIPS for the cases that hit level 1
4000 * 0.5 / 200 = 10 MIPS for the cases that need to go to main memory
Total speed = 2010 MIPS

Using only a big and slow 1st level cache:
4000 / 10 = 400 MIPS for the cases that hit level 1
Total speed = 400 MIPS

Using a small and fast 1st level cache and a big and slower level 2 cache:
4000 * 0.5 = 2000 MIPS for the cases that hit level 1
4000 * 0.5 / 10 = 200 MIPS for the cases that hit level 2
Total speed = 2200 MIPS
As you see, this combination gives the best performance.

The numbers are in a "realistic" range for today's CPU cores.
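As a cross-check, the same three configurations can also be compared by average access time instead of adding up the MIPS shares. This is a minimal sketch of that different model, not the calculation used above. It assumes one memory access per instruction, an effective L1 latency of 1 cycle (latency hidden), 10 effective cycles for the 256 KB cache and 200 cycles for main memory. The function name is made up. Because misses are weighted by the time they take, the absolute numbers come out lower than the estimates above, and the small L1 on its own looks much worse, but the two-level design still comes out on top.

```python
CLOCK_MHZ = 4000       # 4 GHz core, as in the example above
MEMORY_CYCLES = 200    # main memory latency in cycles


def mips(levels, clock_mhz=CLOCK_MHZ, memory_cycles=MEMORY_CYCLES):
    """Estimate MIPS from the average access time.

    levels: list of (hit_rate, effective_latency_cycles), L1 first, where
    hit_rate is the fraction of all accesses served by that level.
    Accesses served by no cache level go to main memory.
    """
    served = sum(rate for rate, _ in levels)
    avg_cycles = sum(rate * cycles for rate, cycles in levels)
    avg_cycles += (1.0 - served) * memory_cycles
    return clock_mhz / avg_cycles


small_l1 = mips([(0.5, 1)])               # 32 KB L1 only
big_l1 = mips([(1.0, 10)])                # 256 KB L1 only
two_level = mips([(0.5, 1), (0.5, 10)])   # 32 KB L1 plus 256 KB L2

print(f"small L1 only: {small_l1:7.1f} MIPS")
print(f"big L1 only:   {big_l1:7.1f} MIPS")
print(f"L1 + L2:       {two_level:7.1f} MIPS")
```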

Does this answer your question?

biggun · « **Reply #14 on:** September 13, 2014, 03:59:33 PM »

Edit: in the above examples I mentioned 20 cycles of latency for the 2nd level cache,
and in the performance calculations we used 10 cycles.

This is no error; it was done under the realistic assumption that the compiler can sometimes work around the latency a little, that a pipeline design can sometimes help, and that prefetching or out-of-order execution can sometimes soften the penalty.
