DELFT UNIVERSITY OF TECHNOLOGY Faculty of EEMCS Computer Engineering MSc Programme



# **ET4074** Modern Computer Architectures **ET4165** Embedded Computer Architecture

November 5, 2009

- The duration of the test is 3 hours.
- The test is open book, Hennessy and Patterson, Computer Architecture: A Quantitative Approach. You may use the book and a copy of the slides, but no other material.
- If you are not sure about what is described in a question, clearly state your assumptions.
- The test is worth 60% of your final grade.
- Put your name, student number, and whether you are enrolled in ET4074 or ET4165 on each answer page.
- This exam consists of 6 assignments and the total number of points is 103, so there is a small bonus.
- Show your work, i.e., show how you obtained your answer. Answers without work will not be given credit.
- Good luck!

and the second of the second o

## Assignment 1 - Branch prediction (12 points)

A branch has the following history (T stands for taken, NT for not taken):

- a) 6 points How often would a branch history table with 1-bit entries correctly predict this branch? Assume the initial prediction is taken. Show how you obtained your answer.
- b) 6 points How often would a branch history table with 2-bit entries correctly predict this branch? Assume the initial prediction is "weakly taken" (10). Show how you obtained your answer.

# Assignment 2 - Loop scheduling (24 points)

Consider the following loop and the corresponding compiler-generated code:

The pipeline has the following characteristics:

| Instruction producing result | Instruction consuming result | Delay in clock cycles |  |
|------------------------------|------------------------------|-----------------------|--|
| INT ALU op                   | INT ALU op                   | 0                     |  |
| FP ALU op                    | FP ALU op                    | 2                     |  |
| Load FP                      | FP ALU op                    | 2                     |  |
| FP ALU op                    | Store FP                     | 2                     |  |
| Branch delay slots           |                              | 2 cycles              |  |

- a) 6 points Give one-sentence comments for each line of code.
- b) 6 points Show how the loop will be executed, including any stall cycles, similar to the example in Section 2.2 of the textbook (page 76 and further). Also calculate the number of cycles per iteration.
- c) 6 points Show how the loop can be rescheduled to improve performance and calculate the number of cycles per iteration.
- d) 6 points Show that unrolling the loop twice, eliminating redundant loads and rescheduling is sufficient to remove all stall cycles. Also calculate the number of cycles per iteration.

#### Assignment 3 - Random questions (24 points)

- a) 6 points Suppose we want to achieve a speedup of 25 on a 32-core machine. What fraction of the total execution time can be sequential?
- **b)** 6 points IA64 as well as some other architectures use a technique to avoid branches for small if-then-else statements. What is it called and briefly describes how it works?
- c) 6 points Briefly describe the two architectural approaches superscalar and VLIW, and clearly state the differences between the two approaches.
- d) 6 points Assume a Simultaneous Multi-Threaded processor with a "classical" 5 stage pipeline (IF/ID/EX/MEM/WB). What is the number of concurrent threads that this machine should be able to support in order to need no branch prediction? Motivate your answer.

### Assignment 4 - Cache organization (15 points)

Consider the following drawing of a physical cache:



a) 3 points What is the size (capacity) of this cache?

Assume a physical cache with the following parameters: cache size: 2Mbyte, cache line/block size: 32 bytes, organization: 2-way set-associative, CPU word size: 8 byte (this is the size of the data unit delivered from the cache to the CPU), physical address size: 32 bits (i.e., 4Gbyte of address space can be addressed).

- **b)** 6 points Show how the cache access is done by drawing a diagram similar to the one above, and specifying the correct range bits ranges and number. Describe the functionality of all logic needed to determine of a read access hits the cache.
- c) 3 points For this cache, which bit ranges would change if the cache size was increased to 4Mbyte while all the other parameters stayed the same?

State the new bit ranges (you do not need to make completely new drawing)

d) 3 points How can you easily extend the 2Mbyte cache to a 3MByte cache? Which bit ranges would change in this case?

## Assignment 5 - Virtual memory (12 points)

The virtual memory system can be viewed as a "cache system":

- a) 3 points what part of it corresponds to a cache line?
- b) 3 points what is its associativity?
- c) 3 points what is its write strategy?
- d) 3 points what piece of hardware is used to speed-up the virtual to physical address translation? (At most two sentences per subquestion)

## Assignment 6 - Cache coherence (16 points)

The diagram below describes the states transitions of a MESI cache coherence snooping protocol.



Cache line at access initiating processor

- Read-with-intent to modify
- (1) Cache line fill
- ⊕ Invalidate transaction
- Dirty line copyback



Line in snooping cache

RH - Read Hit

RMS - Read Miss, Shared

RME - Read Miss, Exclusive

WH - Write Hit

WM - Write Miss

SHR - Snoop Hit on Read

SHW - Snoop Hit on Write

(or read-with-intent-to-modify)

Every cache line is in one of the following states:

*Modified*: The cache line is present only in the current cache, and it has been modified from the value in main memory (it is dirty). The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.

Exclusive: The cache line is present only in the current cache, but is clean (it matches the main memory); it matches main memory.

*Shared*: Indicates that this cache line may be stored in other caches of the machine & is clean. *Invalid*: Indicates that this cache line is invalid.

Assume the same address A is in a pair of caches of a multi-processor, and the state of the cache block where A maps is  $X_1$  in the first cache and  $X_2$  in the second. According to the MESI protocol not all  $(X_1, X_2)$  combinations are permitted. Fill in the table below specifying which for each combination of states  $(X_1, X_2)$  if it is permitted or not by the MESI protocol (an " $\mathbf{o}$ " means "permitted", an " $\mathbf{x}$ " means "not permitted").

|   | М | E | S | I |
|---|---|---|---|---|
| М |   |   |   |   |
| E |   |   |   |   |
| S |   |   |   |   |
| I |   |   |   |   |