## Description

Part I. Computational Complexity and Flop/s Performance

In this part, you will perform small hands-on exercises to get a feel for computational

complexity and flop/s (floating-point operations per second) performance of a computer. Submit

your answers to the following two questions.

I-1. Measuring Computational Complexity

Download the data file, MDtime.out, from the course home page for assignment 1. In the

two-column file, the left column is the number of atoms, N, simulated by the md.c program,

whereas the right column is the corresponding running time, T, of the program in seconds. Make

a log-log plot of T vs. N. Perform linear fit of logT vs. logN, i.e., logT = alogN +b, where a and

b are fitting parameters. Note that the coefficient a signifies the power with which the runtime

scales as a function of problem size N: T µ Na . Submit the plot and your fit of a.

I-2. Theoretical Flop/s Performance

Suppose you have a quadcore processor (a processor equipped with 4 processing cores)

operating at a clock speed of 3.0 GHz (i.e., clock ticks 3´109 times per second), in which each

core can operate 1 multiplication and 1 addition operations per clock cycle (e.g., using a special

fused multiply-add (FMA) circuit). Assume that each multiply or add operation is performed on

vector registers, each holding 4 double-precision operands (Fig. 1).

* What is the theoretical peak

performance of your computer in Gflop/s (gigaflop/s or 109 flops)?

Fig. 1: Vector multiplier (VMUL) loads data from two vector registers, R1 and R3, each holding 4 double-precision

numbers, concurrently performs 4 multiplications, and stores the results on vector register R3.

Part II. Implementing Your Own Global Summation with Message Passing Interface

In this part, you will write your own global summation program (equivalent to

MPI_Allreduce) using MPI_Send and MPI_Recv. Your program should run on P = 2l processors

(l = 0, 1,…). Each process contributes a partial value, and at the end, all the processes will have

the globally summed value of these partial contributions.

Your program will use a communication structure called butterfly, which is structured as a

series of pairwise exchanges (see the figure below where messages are denoted by arrows). This

structure allows a global reduction among P processes to be performed in log2P steps.

* See FLOPS in Wikipedia [https://en.wikipedia.org/wiki/FLOPS].

2

a000 + a001 + a010 + a011 + a100 + a101 + a110 + a111

= ((a000 + a001) + (a010 + a011)) + ((a100 + a101) + (a110 + a111))

At each level l, a process exchanges messages with a partner whose rank differs only at the lth bit position in the binary representation (Fig. 2).

Fig. 2: Butterfly network used in hypercube algorithms.

HYPERCUBE TEMPLATE

We can use the following template to perform a global reduction using any associative

operator OP (such as multiplication or maximum), (a OP b) OP c = a OP (b OP c).

procedure hypercube(myid, input, logP, output)

begin

mydone := input;

for l := 0 to logP-1 do

begin

partner := myid XOR 2l;

send mydone to partner;

receive hisdone from partner;

mydone = mydone OP hisdone

end

output := mydone

end

USE OF BITWISE LOGICAL XOR

Note that

0 XOR 0 = 1 XOR 1 = 0;

0 XOR 1 = 1 XOR 0 = 1.

so that a XOR 1 flips the bit a, i.e.,

a XOR 1 = `a

a XOR 0 = a

where `a is the complement of a (`a = 1|0 for a = 0|1). In particular, myid XOR 2l reverses the

l-th bit of the rank of this process, myid:

abcdefg XOR 0000100 = abcd`efg

Note that the XOR operator is ^ (caret symbol) in the C programming language.

ASSIGNMENT

Complete the following program by implementing the function, global_sum, using

MPI_Send and MPI_Recv functions and the hypercube template given above.

3

Submit the source code as well as the printout from a test run on 4 processors and that on 8

processors.

#include “mpi.h”

#include <stdio.h>

int nprocs; /* Number of processors */

int myid; /* My rank */

double global_sum(double partial) {

/* Implement your own global summation here */

}

int main(int argc, char *argv[]) {

double partial, sum, avg;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &myid);

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

partial = (double) myid;

printf(“Node %d has %le\n”, myid, partial);

sum = global_sum(partial);

if (myid == 0) {

avg = sum/nprocs;

printf(“Global average = %le\n”, avg);

}

MPI_Finalize();

return 0;

}