Description
1. Given the following MIPS code snippet (note that instruction #6 could be anything):
loop:
1 addi $t0, $t0, 4
2 lw $v0, 0($t0)
3 sw $v0, 20($t0)
4 lw $s0, 60($t0)
5 bne $s0, $0, loop
6 ## The following instruction could be anything!
a) Detect hazards and insert no-ops to ensure correct operation. Assume no delayed branch, no forwarding
units, registers that cannot be read and written in the same cycle, and no interlocked pipeline stages. Your
answer on the right should take the form of pair(s) of numbers: num@location, indicating that num no-ops
should be placed at location. E.g., if you wanted to place 6 no-ops between lines 2 and 3 (i.e., location=2.5)
and 8 no-ops between lines 5 and 6 (i.e., location=5.5), you would write: "6@2.5, 8@5.5".
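As a way to reason about the spacing (this is a generic helper, not the answer to part a): under the stated assumptions, a register write completes in WB (the fifth stage) and a dependent read happens in ID (the second stage), and since a read cannot share a cycle with the write, a consumer must be at least 4 issue slots after its producer. A minimal sketch of that arithmetic:

```python
# Generic no-op spacing for a RAW hazard in a classic 5-stage pipeline with
# no forwarding, no interlocks, and no same-cycle register read/write.
# Producer fetched in cycle i writes back in cycle i+4; a consumer fetched
# in cycle i+d reads registers in cycle i+d+1, which must come strictly
# after i+4, so d >= 4.

MIN_SEPARATION = 4  # required issue slots between producer and consumer

def noops_needed(distance):
    """No-ops to insert between a producer and a dependent consumer that
    are `distance` instructions apart (adjacent instructions => 1)."""
    return max(0, MIN_SEPARATION - distance)

print(noops_needed(1))  # adjacent dependent pair -> 3 no-ops
print(noops_needed(3))  # two independent instructions in between -> 1 no-op
```

Note this only covers data-hazard spacing; any no-ops needed for the branch itself are a separate consideration.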
b) Now, reorder/rewrite the program to maximize performance. Assume delayed branch and forwarding
units, but no interlocked pipeline stages. For unknown reasons, the first instruction after the loop label
must be the addi. Feel free to insert no-ops where needed. You should be able to do it using 6
instructions per loop (10 pts) or only 5 instructions (hard, 15 pts).
_____________________________ ##Extra instruction before loop (if any)
_____________________________ ##Extra instruction before loop (if any)
loop:
1 addi $t0, $t0, 4
2 ______________________________
3 ______________________________
4 ______________________________
5 ______________________________
6 ______________________________
7 ## The following instruction could be anything!
c) Name three reasons why we use caches in a pipelined processor.
d) Circle whether each change will make each quantity decrease (–), increase (+), or remain the same (=). The
quantities to consider are the number of instructions in the program, the number of cycles for each
instruction (in a pipelined processor), and the time for each cycle (seconds/cycle).

| Change | Instructions/Program | Cycles/Instruction | Seconds/Cycle |
|---|---|---|---|
| Reducing the number of registers in the ISA | – = + | – = + | – = + |
| Adding a branch delay slot | – = + | – = + | – = + |
| Merging the Execute and Mem stages (loads and stores use an additional adder to calculate base+offset) | – = + | – = + | – = + |
| Changing the implementation from a microcoded CISC machine to a RISC pipeline | – = + | – = + | – = + |
2. Consider the following loop.
loop: lw r1, 0(r1)
and r1, r1, r2
lw r1, 0(r1)
lw r1, 0(r1)
beq r1, r0, loop
Assume that perfect branch prediction is used (no stalls due to control hazards), that there are no delay slots,
and that the pipeline has full forwarding support. Also assume that many iterations of this loop are executed
before the loop exits.
a) Show a pipeline execution diagram for the third iteration of this loop, from the cycle in which we fetch
the first instruction of that iteration up to (but not including) the cycle in which we can fetch the first
instruction of the next iteration. Show all instructions that are in the pipeline during these cycles (not
just those from the third iteration).
b) How often (as a percentage of all cycles) do we have a cycle in which all five pipeline stages are doing
useful work?
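Not part of the original problem, but a useful sanity check when drawing the diagram: under the stated assumptions (full forwarding, perfect branch prediction, no delay slots), the only stall left is the load-use hazard, where an instruction immediately following a load that reads the loaded register must bubble one cycle. A minimal sketch of the steady-state cycle count under that model:

```python
# Steady-state cycle count for the loop, assuming forwarding removes every
# hazard except the one-cycle load-use stall between a load and the very
# next instruction that reads its destination register.

# Each instruction: (opcode, destination register, source registers)
loop_body = [
    ("lw",  "r1", ["r1"]),
    ("and", "r1", ["r1", "r2"]),
    ("lw",  "r1", ["r1"]),
    ("lw",  "r1", ["r1"]),
    ("beq", None, ["r1", "r0"]),
]

def count_stalls(instrs):
    """Count one-cycle load-use stalls between adjacent instructions,
    including the wrap-around pair across loop iterations."""
    stalls = 0
    n = len(instrs)
    for i in range(n):
        op, dest, _ = instrs[i]
        next_srcs = instrs[(i + 1) % n][2]
        if op == "lw" and dest in next_srcs:
            stalls += 1
    return stalls

stalls = count_stalls(loop_body)
cycles_per_iter = len(loop_body) + stalls  # issue slots + bubbles
print(stalls, cycles_per_iter)
```

The diagram you draw for part (a) should span exactly `cycles_per_iter` cycles in steady state, which also feeds directly into part (b).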
3. Caches are important for providing a high-performance memory hierarchy to processors. Below is a list of
32-bit memory address references, given as word addresses (i.e., addresses count words, not bytes):
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
a. For each of these references, identify the binary address (in words, not bytes), the tag, and the index
given a direct-mapped cache with 16 one-word blocks. Also indicate whether each reference is a hit or
a miss, assuming the cache is initially empty and the references are accessed in the order listed. The
first address has been done as an example.

| Word Address | Binary Address | Tag | Index | Hit (H) / Miss (M) |
|---|---|---|---|---|
| 3 | 0000 0011 | 0 | 3 | M |
| 180 | | | | |
| 43 | | | | |
| 2 | | | | |
| 191 | | | | |
| 88 | | | | |
| 190 | | | | |
| 14 | | | | |
| 181 | | | | |
| 44 | | | | |
| 186 | | | | |
| 253 | | | | |
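To check your arithmetic for part (a) (a sketch, not the filled-in table): with 16 one-word blocks, the low 4 bits of the word address form the index and the remaining high bits form the tag.

```python
# Tag/index decomposition for a direct-mapped cache with 16 one-word
# blocks, matching the worked example row for word address 3.

NUM_BLOCKS = 16  # so the index is the low 4 address bits

def decompose(word_addr):
    index = word_addr % NUM_BLOCKS    # low 4 bits
    tag = word_addr // NUM_BLOCKS     # remaining high bits
    binary = f"{word_addr >> 4:04b} {word_addr & 15:04b}"
    return binary, tag, index

print(decompose(3))    # ('0000 0011', 0, 3) -- matches the example row
print(decompose(180))  # ('1011 0100', 11, 4)
```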
b. For each of these references, identify the binary address (in words), the tag, and the index given a
direct-mapped cache with two-word blocks and a total size of 8 blocks. Also indicate whether each
reference is a hit or a miss, assuming the cache is initially empty.

| Word Address | Binary Address | Tag | Index | Hit (H) / Miss (M) |
|---|---|---|---|---|
| 3 | 0000 0011 | 0 | 1 | M |
| 180 | | | | |
| 43 | | | | |
| 2 | | | | |
| 191 | | | | |
| 88 | | | | |
| 190 | | | | |
| 14 | | | | |
| 181 | | | | |
| 44 | | | | |
| 186 | | | | |
| 253 | | | | |
c. You are asked to optimize a cache design for the above references. There are three direct-mapped
cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word
blocks, and C3 has 4-word blocks. If the miss stall time is 25 cycles, and C1 has an access time of 2
cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design? (Use a table, similar
to parts a and b, to help you evaluate the hit/miss rates of C1, C2, and C3.)
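Rather than three separate hand-filled tables, the miss tallies for part (c) can be cross-checked with a small direct-mapped simulator parametrized by block size (a sketch; the hit times and 25-cycle miss stall are the values stated in the problem):

```python
# Direct-mapped cache simulator for the three 8-word designs in part (c):
# block sizes of 1, 2, and 4 words give 8, 4, and 2 blocks respectively.

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]

def count_misses(block_words, total_words=8):
    num_blocks = total_words // block_words
    cache = [None] * num_blocks       # stores the resident block address
    misses = 0
    for addr in refs:
        block = addr // block_words   # block address (tag + index combined)
        index = block % num_blocks
        if cache[index] != block:
            misses += 1
            cache[index] = block
    return misses

for name, bw, hit_time in [("C1", 1, 2), ("C2", 2, 3), ("C3", 4, 5)]:
    m = count_misses(bw)
    # Average time per reference = hit time + miss rate * 25-cycle stall
    avg = hit_time + (m / len(refs)) * 25
    print(f"{name}: {m}/{len(refs)} misses, avg {avg:.2f} cycles/reference")
```

Comparing a full block address instead of a separate tag is equivalent here, since tag and index together uniquely name the block.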
d. Below are the parameters of a direct-mapped cache design:
Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle
Generate a series of read requests that has a lower miss rate on a 2 KiB 2-way set-associative cache
than on the cache listed above. Then identify one possible solution that would give the cache listed
above an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of
that solution.
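The core idea behind part (d) is that addresses which alias to the same index in a large direct-mapped cache can coexist in a small set-associative one. A hedged illustration (one possible request series, not the only answer): with 8-byte (2-word) blocks, the 32 KiB direct-mapped cache has 4096 blocks, so byte addresses exactly 32 KiB apart conflict in it, while a 2 KiB 2-way cache can hold both in one set.

```python
from collections import OrderedDict

# Generic set-associative cache with LRU replacement; ways=1 is
# direct-mapped. Addresses are byte addresses; blocks are 8 bytes.

BLOCK_BYTES = 8

def miss_count(refs, total_bytes, ways):
    num_sets = total_bytes // BLOCK_BYTES // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # LRU order per set
    misses = 0
    for addr in refs:
        block = addr // BLOCK_BYTES
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)          # refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)     # evict least recently used way
            s[block] = True
    return misses

# Ping-pong between two addresses exactly 32 KiB apart:
refs = [0, 32 * 1024] * 8
print(miss_count(refs, 32 * 1024, ways=1))  # big direct-mapped: all miss
print(miss_count(refs, 2 * 1024, ways=2))   # small 2-way: 2 compulsory misses
```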
4. For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access
the cache:
| Tag | Index | Offset |
|---|---|---|
| 31-10 | 9-5 | 4-0 |
a. What is the cache block size (in words)?
b. How many entries does the cache have?
c. What is the ratio of the total number of bits required for such a cache implementation to the data
storage bits?
Starting from power on, the following byte-addressed cache references are recorded:
0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180
d. How many blocks are replaced?
e. What is the hit ratio?
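The geometry in parts (a)-(e) follows directly from the bit fields: 5 offset bits give 32-byte (8-word) blocks, and 5 index bits give 32 entries. A sketch that decomposes the byte addresses and replays the trace (a way to check your hand-worked table, assuming a cold cache at power-on):

```python
# Decompose byte addresses per the field layout (tag 31-10, index 9-5,
# offset 4-0) and replay the trace on a cold direct-mapped cache.

OFFSET_BITS, INDEX_BITS = 5, 5
NUM_ENTRIES = 1 << INDEX_BITS       # 32 cache entries
BLOCK_BYTES = 1 << OFFSET_BITS      # 32 bytes = 8 words per block

refs = [0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180]

cache = [None] * NUM_ENTRIES        # tag per entry (None = invalid)
hits = replacements = 0
for addr in refs:
    index = (addr >> OFFSET_BITS) % NUM_ENTRIES
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    if cache[index] == tag:
        hits += 1
    else:
        if cache[index] is not None:
            replacements += 1       # a valid block gets evicted
        cache[index] = tag
print(f"hits: {hits}/{len(refs)}, blocks replaced: {replacements}")
```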