
ECE Ph.D. Qualifying Exam
Computer Engineering, Question 04 (CE-4): Architecture
August 2015


Please note: this page is still being updated and modified as of April 15, 2017.


I. Processor

In modern out-of-order processor pipelines, there are many "pipeline loops" due to data/information generated in later pipeline stages and used in earlier stages (e.g., register bypass, branch resolution, and rename table update). Modern pipelines have many such loops, each of which is relevant only to a subset of instructions.

Assume a single-issue pipeline for the following questions.

1. What is the increase in the cycles per instruction (CPI), if 80% of instructions incur a loop from the 8th stage to the 8th stage? Also, give a real example of such a loop.

Assuming each stage can be completed in one cycle: if 80% of instructions incur a one-cycle/one-stage loop-back, that stage has to be performed again. Thus, a one-cycle penalty for 80% of instructions results in a 1 * 0.8 = 0.8 increase in CPI.
An example of such a loop can be found in incrementers (i++) or decrementers (i--).
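A minimal Python sketch of this arithmetic (an illustrative helper, not part of the original solution):

    # CPI increase from a pipeline loop = loop penalty (in cycles) * fraction of affected instructions.
    def cpi_increase(penalty_cycles, fraction_affected):
        return penalty_cycles * fraction_affected

    # Question I.1: a one-cycle loop-back (8th stage back to the 8th stage) affecting 80% of instructions.
    print(cpi_increase(1, 0.80))  # 0.8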

2. For a loop from the 15th stage back to the 1st stage, the pipeline employs prediction with accuracy 90% (you should figure out how prediction lessens the loops' impact on the CPI). What is the increase in the CPI if the loop is relevant to 20% of all instructions?

  • For 20% of instructions, there is a pipeline loop from the 15th to the 1st stage.
  • For 90% of such instructions, the pipeline can predict this behavior.
So 18% (0.2 * 0.9) of instructions loop from the 15th stage to the 1st stage in a predicted manner. This means that 2% (0.2 * 0.1) of instructions loop from the 15th stage to the 1st stage without prediction and must pay a penalty. This penalty is where the increase in CPI originates. When a misprediction occurs, the penalty is 15 stages/cycles, which 2% of instructions must pay.
Therefore, 0.02 * 15 cycles/stages = 0.3 increase in CPI. If the correctly predicted loop-back from the 15th to the 1st stage does not have to pay the penalty (I assume this here due to the nature of the correct prediction), then 0.3 is the total increase to the CPI.
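As a quick check, the same arithmetic in a small Python sketch (illustrative only):

    loop_fraction = 0.20      # instructions with the 15th -> 1st stage loop
    mispredict_rate = 0.10    # 1 - prediction accuracy
    penalty_cycles = 15       # stages re-executed on a misprediction
    print(loop_fraction * mispredict_rate * penalty_cycles)  # 0.3 CPI increase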

3. For a loop from the 15th stage back to the 1st stage and another from the 10th stage back to the 5th stage, if 1% of instructions are affected by the former loop and 3% (including the previous 1%) are affected by the latter loop, what is the increase in CPI?

From the question we know that a loop-back from the 15th to the 1st stage equates to a 15 cycle/stage penalty. A loop from the 10th stage back to the 5th stage is then a 6 cycle/stage penalty (as stages 5, 6, 7, 8, 9, and 10 all have to be visited again).
We also know that 1% of instructions pay the 15 stage/cycle penalty and 3% of instructions pay the 6 stage/cycle penalty.
The problem states that this 3% "includes the previous 1%." Given this, to calculate the effect of this loop separately, I assume the 3% is instead 2% (3% - 1%).
Then:
  • 1% pay the 15 stage penalty: 0.01 * 15 = 0.15
  • 2% pay the 6 stage penalty: 0.02 * 6 = 0.12
Therefore, the total increase to the system CPI imposed by these two loop-backs is 0.27 CPI.
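The same arithmetic as a small Python sketch (illustrative only):

    increase_15_to_1 = 0.01 * 15   # 1% of instructions pay the 15-cycle penalty -> 0.15
    increase_10_to_5 = 0.02 * 6    # 2% of instructions pay the 6-cycle penalty -> 0.12
    print(increase_15_to_1 + increase_10_to_5)  # 0.27 total CPI increase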

II. Memory systems

1. Assume 25% of all instructions are loads and stores, L1 miss rate is 20%, L2 miss rate is 20% (out of L1 misses), and L1, L2, main memory latencies are 3, 32, and 300 cycles, respectively, and their bandwidths are, respectively, 1 access (hit or miss) every 1, 8, and 30 cycles. What is the achievable IPC (instructions per cycle), if we assume abundant parallelism due to simultaneous multithreading?

The question says we can assume abundant parallelism due to simultaneous multithreading. For my solution, I take this to mean that while a memory component (L1, L2, or main memory) is servicing an access (including a miss), other instructions can still use that memory unit, up until the unit's bandwidth cap has been reached.
  • Since L1's latency is 3 cycles and it can accept one access every 1 cycle, it can handle 3 (3 / 1) overlapping operations at a time.
  • L2's latency is 32 cycles and it can accept one access every 8 cycles, so it can handle 4 (32 / 8) overlapping operations at a time.
  • Main memory's (MM's) latency is 300 cycles and it can accept one access every 30 cycles, so it can handle 10 (300 / 30) overlapping operations at a time.
We know the average memory access time (AMAT) can be calculated as hit time + (miss rate * miss penalty). We can use this to calculate the achievable IPC as follows:
AMAT (in CPI) = (L1_hitTime + L1_missRate * L1_missPenalty) / L1_nOverlap,
where
L1_missPenalty = (L2_hitTime + L2_missRate * L2_missPenalty) / L2_nOverlap
and
L2_missPenalty = MM_hitTime / MM_nOverlap.
Using the problem's parameters:
L2_missPenalty = 300 / 10 = 30.
L1_missPenalty = (32 + 0.2 * 30) / 4 = 9.5.
AMAT = (3 + 0.2 * 9.5) / 3 = 1.633.
This average memory access time (in CPI) is only applicable to the 25% of instructions that are loads and stores, so CPI = AMAT * 0.25 = 1.633 * 0.25 = 0.408333.
In terms of IPC, IPC = 1.0 / CPI = 1.0 / 0.408333 = 2.449.
Thus, the achievable IPC of this system with abundant multithreading is approximately 2.449.
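The full calculation as a Python sketch, assuming (as above) that each memory level can overlap latency / cycles-per-access operations (illustrative only):

    l1_lat, l2_lat, mm_lat = 3, 32, 300      # latencies in cycles
    l1_bw, l2_bw, mm_bw = 1, 8, 30           # cycles per access (bandwidth)
    l1_miss, l2_miss = 0.20, 0.20            # miss rates
    mem_fraction = 0.25                      # fraction of instructions that are loads/stores

    l1_overlap = l1_lat // l1_bw             # 3 overlapping operations
    l2_overlap = l2_lat // l2_bw             # 4
    mm_overlap = mm_lat // mm_bw             # 10

    l2_miss_penalty = mm_lat / mm_overlap                                 # 30.0
    l1_miss_penalty = (l2_lat + l2_miss * l2_miss_penalty) / l2_overlap   # 9.5
    amat = (l1_lat + l1_miss * l1_miss_penalty) / l1_overlap              # ~1.633

    cpi = amat * mem_fraction                # ~0.408
    ipc = 1.0 / cpi                          # ~2.449
    print(round(amat, 3), round(cpi, 3), round(ipc, 3))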

2. Inverted page tables reduce page table space overhead via hashing. However, how are hashing collisions handled? Explain any policy the hardware/operating system may have to use.

One method of handling hashing collisions is chaining. To implement chaining, linked lists are often used. When storing a new value in the hash table at an index that is already occupied (i.e., there is a collision), the value is appended to the end of that index's linked list. When one of the values is needed later, the linked list is searched to find the correct value to return.
If the hashing algorithm is good and the hash table is reasonably sized, the number of collisions should be minimal and an amortized store/access time of O(1) can be expected. In the worst case, however, everything hashes to the same index and is stored in a single linked list. In this scenario, the access time is that of a linear search, O(n), as the entire linked list may have to be traversed.
In summary, chaining is one policy a hash table can use to handle hashing collisions.
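A minimal Python sketch of chaining (a toy hash table for illustration, not the data layout of any particular OS's inverted page table):

    class ChainedHashTable:
        # Toy hash table that resolves collisions by chaining entries in per-bucket lists.
        def __init__(self, num_buckets=64):
            self.buckets = [[] for _ in range(num_buckets)]

        def _index(self, key):
            return hash(key) % len(self.buckets)

        def insert(self, key, value):
            chain = self.buckets[self._index(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:                  # key already present: update in place
                    chain[i] = (key, value)
                    return
            chain.append((key, value))        # empty bucket or collision: append to the chain

        def lookup(self, key):
            for k, v in self.buckets[self._index(key)]:
                if k == key:                  # walk the chain until the matching key is found
                    return v
            return None                       # no entry found (analogous to a page fault)

    # Two keys (e.g., virtual page numbers) that hash to the same bucket still resolve correctly.
    table = ChainedHashTable(num_buckets=4)
    table.insert(26, "frame 7")
    table.insert(30, "frame 9")                # 26 % 4 == 30 % 4, so these entries collide and are chained
    print(table.lookup(26), table.lookup(30))  # frame 7 frame 9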

III. Multicore

1. Assuming non-atomic writes, add statements to the following code to show a sequential consistency (SC) violation. You may add statements to one or both threads, either before or after the statements shown below.

Assume that X = Y = 0 before the following code segment runs.

Thread 1: write X = 10;
Thread 2: write Y = 20;

Show your new code and explain how SC may be violated.

2. Assume the following multithreaded code, where mypid is each thread's process id. The threads run on n cores with coherent private caches.

repeat {many times}
    for i = 1 to n
        A[mypid] = A[mypid] % mypid;

(a) Explain why this code is likely to perform poorly.
(b) Describe how to address this problem and show your pseudocode.

IV. Fundamentals

A high-performance architecture uses ultra-wide simultaneous multi-threaded (SMT) processors (e.g., 128-way SMT processors) to target "embarrassingly" parallel, highly regular, loop-based codes with heavy branching (e.g., highly-conditional dense matrix applications). Due to the heavy branching, vector compiler scheduling is not effective. Instead, the compiler generates simple, parallel threads from the loop iterations, each of which runs on an SMT context (e.g., 128 iterations run in parallel). While the branching disallows lock-step execution across the threads, the iterations are naturally load-balanced and access mostly sequentially-contiguous data (e.g., thread i accesses the ith array elements). A key challenge is designing a memory system to support such processors, which employ a cache hierarchy (the memory system includes caches and main memory). Assume the code has enough reuse to justify a conventional cache hierarchy.

1. Which memory-system parameters pose a key challenge for data access? Which parameters are much less relevant?

2. Based on the code characteristics listed above, what would be your strategy to address the above challenge? Describe your strategy for both caches and main memory.

3. What difficulty does the lack of lock-stepping cause your strategy? How would you tackle the difficulty?
