Description
1. Media applications that play audio or video files are part of a class of workloads called “streaming”
workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider an audio streaming
workload that accesses a 512 KiB working set sequentially with the following byte address stream:
0, 2, 4, 6, 8, 10, 12, 14, 16, …
a. Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address
stream above? How is this miss rate sensitive to the size of the cache or the working set? How
would you categorize the misses this workload is experiencing, based on the 3C model (Compulsory,
Conflict, and Capacity)?
b. Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind
of locality is this workload exploiting?
c. “Prefetching” is a technique that leverages predictable address patterns to speculatively bring in
additional cache blocks when a particular cache block is accessed. One example of prefetching is a
stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a
particular cache block is brought in. If the data is found in the prefetch buffer, it is considered a
hit and is moved into the cache, and the next cache block is prefetched. Assume a two-entry stream
buffer and assume that the cache latency is such that a cache block can be loaded before the
computation on the previous cache block is completed. What is the miss rate for the address stream
above?
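A minimal sketch for checking answers to parts (a) and (b) by simulation. It assumes an initially empty cache, and models the stream as byte addresses stepping by 2 through the working set. The names `direct_mapped_miss_rate` and `stream` are illustrative, not part of the problem statement, and the stream buffer of part (c) is not modeled.

```python
def direct_mapped_miss_rate(addresses, cache_bytes, block_bytes):
    """Fraction of accesses that miss in a direct-mapped cache.

    Assumption: the cache starts empty and every address is a byte address.
    """
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks          # one stored tag per cache line
    misses = 0
    total = 0
    for addr in addresses:
        block = addr // block_bytes     # block number in memory
        index = block % num_blocks      # cache line this block maps to
        tag = block // num_blocks       # remaining high-order bits
        if tags[index] != tag:
            misses += 1
            tags[index] = tag           # fill the line on a miss
        total += 1
    return misses / total


def stream(working_set_bytes, stride=2):
    """Sequential byte addresses 0, 2, 4, ... below the working-set size."""
    return range(0, working_set_bytes, stride)
```

For example, `direct_mapped_miss_rate(stream(512 * 1024), 64 * 1024, 32)` evaluates the configuration in part (a); changing the last argument covers the block sizes in part (b).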
2. Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of
1.35 references (both instruction and data) per instruction, help find the optimal block size given the
following miss rates for various block sizes.
Block size (B):   8     16    32    64     128
Miss rate:        4%    3%    2%    1.5%   1%
a. What is the optimal block size for a miss latency of 20 × B cycles?
b. What is the optimal block size for a miss latency of 24 + B cycles?
c. For constant miss latency, what is the optimal block size?
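One way to compare block sizes is to compute an effective CPI for each. The sketch below assumes the usual stall model, effective CPI = base CPI + references per instruction × miss rate × miss latency; the problem does not state this model explicitly. The helper names `effective_cpi` and `best_block_size` are hypothetical.

```python
# Miss rates from the table in Problem 2, keyed by block size B.
MISS_RATES = {8: 0.04, 16: 0.03, 32: 0.02, 64: 0.015, 128: 0.01}


def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_latency_cycles):
    """Base CPI plus average memory stall cycles per instruction (assumed model)."""
    return base_cpi + refs_per_instr * miss_rate * miss_latency_cycles


def best_block_size(latency_fn, miss_rates=MISS_RATES,
                    base_cpi=1.0, refs_per_instr=1.35):
    """Return the block size with the lowest effective CPI.

    latency_fn maps a block size B to a miss latency in cycles.
    """
    return min(
        miss_rates,
        key=lambda b: effective_cpi(base_cpi, refs_per_instr,
                                    miss_rates[b], latency_fn(b)),
    )
```

Each part then corresponds to a different `latency_fn`, for example `best_block_size(lambda b: 24 + b)` for part (b).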
3. Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The
following table shows data for L1 caches attached to each of two processors, P1 and P2.
      L1 Size   L1 Miss Rate   L1 Hit Time
P1    2 KB      8%             0.66 ns
P2    2 KB      6%             0.9 ns
a. Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective
clock rates?
b. What is the Average Memory Access Time for P1 and P2?
c. Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which
processor is faster?
Use the L1 cache capacities and hit times from the table in Problem 3 when solving the following problems.
The L2 miss rate indicated is its local miss rate.
L2 Size   L2 Miss Rate   L2 Hit Time
1 MB      95%            5.62 ns
d. What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2
cache?
e. Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition
of an L2 cache?
f. Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need
in its L1 cache to match P1’s performance? If P2 is faster, what miss rate would P1 need in its L1
cache to match P2’s performance?
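The AMAT questions in Problem 3 can be checked with a short helper. This is a sketch under the standard assumption that AMAT = hit time + local miss rate × miss penalty, applied level by level from the outermost cache inward. The function name `amat` and its argument layout are illustrative only.

```python
def amat(levels, memory_ns):
    """Average memory access time in ns for a cache hierarchy.

    levels: list of (hit_time_ns, local_miss_rate) tuples, ordered L1 first.
    memory_ns: main memory access time in ns.
    Assumption: each level's miss penalty is the access time of the next level.
    """
    penalty = memory_ns
    # Work from the last cache level back toward L1.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty
```

For instance, `amat([(0.66, 0.08)], 70)` models P1 with only an L1 cache, and `amat([(0.66, 0.08), (5.62, 0.95)], 70)` adds the L2 cache from the second table.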
4. Given the following word addresses: 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
a. Show the final cache contents for a three-way set associative cache with two-word blocks and a
total size of 24 words. Use LRU replacement. For each reference identify the index bits, the tag bits,
the block offset bits, and if it is a hit or a miss.
b. Show the final cache contents for a fully associative cache with one-word blocks and a total size of 8
words. Use LRU replacement. For each reference identify the index bits, the tag bits, and if it is a hit
or a miss.
c. What is the miss rate for a fully associative cache with two-word blocks and a total size of 8 words,
using LRU replacement? What is the miss rate using MRU (most recently used) replacement? Finally
what is the best possible miss rate for this cache, given any replacement policy?
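Hand traces for Problem 4 can be cross-checked with a small LRU simulator. The sketch below assumes word addresses, an initially empty cache, and the usual split of the block number into index (low bits) and tag (high bits). The name `simulate_lru` is hypothetical, and MRU or optimal replacement is not modeled.

```python
from collections import OrderedDict


def simulate_lru(word_addresses, num_sets, ways, block_words):
    """Trace a set-associative cache with LRU replacement.

    Returns (trace, sets): trace holds one (addr, tag, index, offset, hit)
    tuple per reference; sets holds the final tags per set, ordered from
    least to most recently used. A fully associative cache is num_sets=1.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    trace = []
    for addr in word_addresses:
        block = addr // block_words
        offset = addr % block_words
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        hit = tag in lines
        if hit:
            lines.move_to_end(tag)         # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used tag
            lines[tag] = True
        trace.append((addr, tag, index, offset, hit))
    return trace, sets
```

Part (a) corresponds to `simulate_lru(addrs, num_sets=4, ways=3, block_words=2)`, since 24 words in two-word blocks gives 12 blocks, or 4 sets of 3 ways.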
5. Consider a processor with the following parameters:
Base CPI, no memory stalls:                                  1.5
Processor speed:                                             2 GHz
Main memory access time:                                     100 ns
L1 cache miss rate per instruction:                          7%
L2 cache, direct-mapped, speed:                              12 cycles
Global miss rate with L2 cache, direct-mapped:               3.5%
L2 cache, eight-way set associative, speed:                  28 cycles
Global miss rate with L2 cache, eight-way set associative:   1.5%
a. Calculate the CPI for the processor in the table using: 1) only a first level cache, 2) a second level
direct-mapped cache, and 3) a second level eight-way set associative cache. How do these numbers
change if main memory access time is doubled? If it is cut in half?
b. It is possible to have an even greater cache hierarchy than two levels. Given the processor above
with a second level, direct-mapped cache, a designer wants to add a third level cache that takes 50
cycles to access and will reduce the global miss rate to 1.3%. Would this provide better
performance? In general, what are the advantages and disadvantages of adding a third level cache?
c. In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external
(located on a different chip) from the main processor and the first level cache. While this allowed for
large second level caches, the latency to access the cache was much higher, and the bandwidth was
typically lower because the second level cache ran at a lower frequency. Assume a 512 KiB off-chip
second level cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global
miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have
to be to match the performance of the second level direct-mapped cache listed above? Of the eight-way set associative cache?
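The CPI calculations in Problem 5 can be sketched as follows. The model is an assumption, not something the problem states: every L1 miss pays the L2 access time, every global miss additionally pays the main memory time, and both miss rates are taken per instruction. All function names are illustrative.

```python
def ns_to_cycles(time_ns, clock_ghz):
    """Convert a latency in ns to clock cycles (1 GHz = 1 cycle per ns)."""
    return time_ns * clock_ghz


def cpi_l1_only(base_cpi, l1_miss_per_instr, memory_cycles):
    """CPI when every L1 miss goes straight to main memory."""
    return base_cpi + l1_miss_per_instr * memory_cycles


def cpi_with_l2(base_cpi, l1_miss_per_instr, l2_hit_cycles,
                global_miss_rate, memory_cycles):
    """CPI with an L2 cache (assumed stall model, see lead-in).

    L1 misses pay the L2 access time; global misses also pay memory time.
    """
    return (base_cpi
            + l1_miss_per_instr * l2_hit_cycles
            + global_miss_rate * memory_cycles)
```

Doubling or halving the main memory access time, as part (a) asks, only changes the `memory_cycles` argument.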