@Biggun, sorry, I don't understand this,
An L1 cache is fast because it is small, make it bigger and it slows down, so they add an L2 cache, but an L2 cache is slower than an L1.
I do not understand why an L1+L2 is faster than a larger L1 cache.
OK, let's use a real-world example.
Let's say your CPU runs at 4 GHz.
Let's say your CPU is superscalar and can execute 2 instructions per clock.
So you have a theoretical peak performance of
8000 MIPS - if you work only with registers.
So far all clear?
Now let's say your memory has a latency of 200 cycles.
This means if you work not with registers but with memory, your performance degrades to
4000/200 =
20 MIPS
So we want to improve this.
The best that we could create is a 32 KB 1st level cache with a latency of 4 cycles.
If our work variables fit in the cache (= 100% hit rate), we will reach
4000/4 = 1000 MIPS
If our CPU design can work around the latency -
e.g. either using vertical pipelining (060/Phoenix-like)
or with compiler code restructuring and out-of-order support -
then we could reach up to
4000/1 =
4000 MIPS
As you see, the advantage of the 1st level cache is very clear.
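To replay the arithmetic so far, here is a minimal Python sketch. It assumes, as the numbers in this post do, a 4000 MHz clock and one memory access per instruction; the function name mips is made up for illustration.

```python
CLOCK_MHZ = 4000  # 4 GHz, as in the example above


def mips(cycles_per_access, clock_mhz=CLOCK_MHZ):
    """Million instructions per second if every instruction
    has to wait cycles_per_access cycles for its data."""
    return clock_mhz / cycles_per_access


cases = [
    ("main memory, 200 cycles", 200),
    ("L1 cache, 4 cycles exposed", 4),
    ("L1 cache, latency hidden", 1),
]

for label, cycles in cases:
    print(f"{label}: {mips(cycles):.0f} MIPS")
```

Changing the cycle counts in cases shows how quickly the rate drops once the latency can no longer be hidden.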
Now we might have problems needing more than 32 KB of variables.
These will have a lower hit rate and run slower,
as every time we miss the 1st level cache we go out to main memory, needing 200 cycles.
We have two options now.
a) Increase the 1st level cache size, e.g. to 256 KB.
Such a cache would have about 20 cycles of latency.
One thing is clear: out-of-order can with luck work around 2-3 cycles of latency, but not around 20.
This means even with out-of-order our CPU would get a big performance hit.
Realistically we would reach
4000/10 =
400 MIPS
As you see, this big 1st level cache is very bad for performance now.
We have MUCH less performance than we had with the smaller cache.
b) The best solution is implementing 2 levels of cache:
32 KB with low latency
256 KB with 20 cycle latency
Let's say our test program has a 50% hit rate with 32 KB and a 100% hit rate with 256 KB.
Only using a small and fast 1st level cache:
4000 * 0.5 = 2000 MIPS for the cases that hit level 1
4000 * 0.5 / 200 = 10 MIPS for the cases that need to go to main memory
Total speed = 2010 MIPS
Only using a big and slow 1st level cache:
4000 * 1.0 / 10 = 400 MIPS for the cases that hit level 1
Total speed = 400 MIPS
Using a small and fast 1st level cache and a big and slower level 2 cache:
4000 * 0.5 = 2000 MIPS for the cases that hit level 1
4000 * 0.5 / 10 = 200 MIPS for the cases that need to go to the level 2 cache
Total speed = 2200 MIPS
As you see, this combination does give the best performance.
The numbers are in a "realistic" range for today's CPU cores.
Does this answer your question?
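The totals above add up the rates of the individual cases. A stricter way to combine them is to average the cycles per access first and only then divide the clock by that average. The Python sketch below does this with the same assumed numbers (1 effective cycle for a level 1 hit, 10 effective cycles for the 256 KB cache, 200 cycles for main memory). The names average_cycles, mips and scenarios are made up for illustration. The absolute numbers come out lower and the order of the two single-cache designs depends on the model, but the two-level combination is the fastest either way.

```python
CLOCK_MHZ = 4000  # 4 GHz, as in the example above


def average_cycles(levels):
    """levels is a list of (fraction_of_accesses, cycles) pairs
    whose fractions add up to 1."""
    return sum(fraction * cycles for fraction, cycles in levels)


def mips(levels, clock_mhz=CLOCK_MHZ):
    """Clock rate divided by the average cycles spent per access."""
    return clock_mhz / average_cycles(levels)


# Assumed test program: 50% hit rate in 32 KB, 100% hit rate in 256 KB.
scenarios = {
    "small L1 only": [(0.5, 1), (0.5, 200)],   # misses go to main memory
    "big L1 only": [(1.0, 10)],                # everything hits, but slowly
    "small L1 + L2": [(0.5, 1), (0.5, 10)],    # misses go to the L2 cache
}

for name, levels in scenarios.items():
    print(f"{name}: {mips(levels):.0f} MIPS")
```

The small L1 on its own suffers most here because the 200 cycle misses dominate the average, which is exactly the gap the level 2 cache fills.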