Multi core programming in C++/OpenMP on Raspberry Pi

1. Introduction: CPU, Core, and Thread

A lot of development in engineering is a race to make machines faster. One way to speed up a program is to run it on a computer with a faster central processing unit (CPU). The problem is that CPU clock speeds cannot easily be pushed much higher; they have hovered around a few GHz for many years. Another way to speed up a program is to use multiple CPUs running at the same time.

Modern CPUs are made with several cores. We will think of the cores as separate CPUs. Strictly speaking they are not, since they share the same power supply and cooling fan, but that is a distinction we will not care about in our introductory course.

It is worth pointing out that some tasks a CPU performs are much faster than others. Arithmetic and logic operations are much faster than communication with main memory (RAM). Here are rough time frames for some typical operations:

  • 0.5 nanoseconds to multiply two integers that are already inside the CPU registers;
  • 100 nanoseconds to bring something from main memory;
  • 16000 nanoseconds to bring something from a solid state drive.

Since reading from RAM is a very common operation, the ability of the CPU to multiply numbers very fast does not improve performance as much as one might hope. Addition and multiplication are performed by the part of the CPU called the arithmetic logic unit (ALU). Most of the time the CPU is waiting for data to travel back and forth between it and RAM. CPU cores therefore have smart scheduling mechanisms that make them very efficient at executing several programs at the same time: while one program is waiting for data to arrive from RAM, another program is using the ALU. These individual programs are called threads.

Advanced CPUs have multiple cores, and each core can support several threads. When we write our programs we have no control over how the scheduling among threads is done. So, as high-level programmers, we only care about the total number of threads that our computer allows us to use.

Our first program determines the number of threads available. The listing of the code is given first and the explanation follows.

// compile with
// c++ countThreads.cpp -o countTh -std=c++11 -fopenmp
// execute with
// ./countTh

#include<iostream>
#include<omp.h>

int main(){
  long numTh=0;
  #pragma omp parallel
  {
    // Every thread writes the same value here, so the final
    // content of numTh is predictable.
    numTh=omp_get_num_threads();
  }
  std::cout << numTh << "\n";
  return 0;
}

In order to write parallel programs we need to do the following: 1) include the header omp.h; 2) use the flag -fopenmp when invoking the compiler.

The function omp_get_num_threads() returns the number of threads in the parallel region that is currently executing; outside of a parallel region it simply returns 1.

That is why the function is placed inside the block of code that runs in parallel: there it reports the default team size, which is typically the number of hardware threads the machine supports.

Important: The parallel code starts with the directive #pragma omp parallel, which must occupy a separate line of the code. Unlike regular C++ code, where line breaks are treated as ordinary whitespace, compiler directives starting with # must occupy their own line. In particular, the opening brace { cannot be on the same line as a directive starting with #.

2. Problematic example: Threads that print

One of the best ways to learn parallel programming is to start by writing bad programs. We will now learn why printing should not be done in parallel.

We will create a function that activates \(5\) threads and requests each thread to print a message. OpenMP assigns a unique id number to each thread; here the assigned ids come from the set \(\{0,1,2,3,4\}\). Each thread prints its own id number: thread 0 will print I am thread 0, thread 1 will print I am thread 1, and so on. The last thread will print I am thread 4.

We are allowed to ask for 5 threads even if the function omp_get_num_threads() told us that 4 (or 8) was the optimal number. We could ask for 100 threads as well; each core would then be assigned more threads than the manufacturer suggests, and the program would probably run slower than if we followed the recommendation.

Here is the code for the program Printing threads.

// compile with
// c++ printingThreads.cpp -o pTh -std=c++11 -fopenmp
// execute with
// ./pTh
#include<iostream>
#include<omp.h>

void printingThreads(){
  #pragma omp parallel num_threads(5)
  {
    long id=omp_get_thread_num();
    std::cout<<"I am thread " << id << "\n";
  }
}

int main(){
  printingThreads();
  return 0;
}

The above code invokes the parallel region with the directive #pragma omp parallel num_threads(5), which requests exactly 5 threads. We also see a new function, omp_get_thread_num(), which returns the id number of the calling thread.

Please compile and execute the code several times. You will get a different, messy output every time. Here is one of the outputs that the code could produce:

I am thread I am thread 0
I am thread 1
I am thread I am thread 2
4
3

To explain this mess, let us start with the conclusion: the command std::cout that prints on the screen should not be run in parallel. There is one and only one screen, and the 5 threads are racing for this single resource. In the output above you can see that one thread got there first, grabbed the connection to the screen, and printed "I am thread", and just as it was about to print its number, another thread grabbed the connection and started printing its own message. Such races for resources must be avoided; their consequences cannot be predicted.

3. Memory race

Shared resources such as the screen, keyboard, and printer should not be used inside the parallel section of the code. This is fairly easy to avoid and rarely ends up being the source of a bug. However, there is another, more dangerous bug that is harder to detect: the memory race. It occurs when several threads try to write to the same location in main memory. If a variable is declared before the parallel block and all threads then try to write to it, the variable can very well end up containing garbage.

Let us look at the following code.

// compile with
// c++ memoryRace.cpp -o mRace -std=c++11 -fopenmp
// execute with
// ./mRace
#include<iostream>
#include<omp.h>

long memoryRace(){
  long result;
  #pragma omp parallel num_threads(5)
  {
    result=0;             // every thread resets the shared variable
    for(long i=0;i < 100000;++i){
      result=result+1;    // every thread writes to the same location
    }
  }
  return result;
}
int main(){
  std::cout << memoryRace() << std::endl;
  return 0;
}

If this code were run on a single core, the variable result inside the function would simply end up containing \(100000\). But when it is run on \(5\) cores, almost every execution of the program produces a different outcome. All threads are trying to write to the same variable result in RAM, and each of them sets that variable to 0 exactly once. It may happen that just as one thread finishes the loop and writes its big sum into result, another thread executes its first command and places 0 in result.

The program can also become slower when threads try to write to the same location: the shared resource often cannot accommodate simultaneous requests and forces the threads to wait in line.

4. Modifying a sequence

After writing several bad programs, we are ready for a good one. Here is a problem that illustrates the increase in speed that parallel programming can offer.

We want to create a function whose arguments consist of a positive prime number \(p\), a positive integer \(n\), and three sequences alpha, beta, and gamma. The sequences alpha and beta contain positive integers. The function should modify the sequence gamma so that each of its terms satisfies \[\mbox{gamma}[i]=\left(\mbox{alpha}[i]^{\mbox{beta}[i]}\right)\mbox{\%} p.\]

We will write both the traditional (non-parallel) function and parallel function. Then we will compare their speeds.

5. Timer

The timer class SimpleTimer that we will use to measure running times is located at simpleTimerCPP (the programs below include it as SimpleTimer.cpp).

We first need a function long powerFunction(long a, long b, long p) that calculates \(\left(a^b\right)\mbox{\%}p\).

Both the non-parallel and the parallel version of the program will use powerFunction. The non-parallel implementation is given below.

//nonParallelFunction.cpp
// compile with
// c++ nonParallelFunction.cpp -o npf -std=c++11 -fopenmp
// execute with
// ./npf
#include<iostream>
#include<omp.h>
#include"SimpleTimer.cpp"
long powerFunction(long a, long b, long p){
  long res=1;
  long i=0;
  while(i < b){
    res*=a;
    res%=p;
    ++i;
  }
  return res;
}
void nonParallelFunction(long p, long n, long * alpha, long * beta, long * gamma){
  long i=0;
  while(i < n){
    gamma[i]=powerFunction(alpha[i],beta[i],p);
    ++i;
  }
}
int main(){
  long * alpha, * beta, * gamma;
  long p=17;
  long N=100000;
  alpha=new long[N]; beta=new long[N]; gamma=new long[N];
  alpha[0]=8; beta[0]=5183; alpha[1]=9; beta[1]=4351;
  alpha[2]=7; beta[2]=4711;
  long i=3;
  while(i < N){
    alpha[i]=alpha[i-3]; beta[i]=beta[i-3]; ++i;
  }
  SimpleTimer tm;
  tm.start();
  nonParallelFunction(p,N,alpha,beta,gamma);
  tm.end();
  std::cout << "The non-parallel function running time is: ";
  std::cout << tm.getTime() << "\n";

  delete[] alpha; delete[] beta; delete[] gamma;
  return 0;
}

Please be patient when you execute the code. It can take as long as 30 seconds on some computers.

The parallel program only changes the function nonParallelFunction to parallelFunction, which has the same arguments. The difference is that we will run the code on \(m\) threads, where \(m\) is the suggested number of threads obtained with the function omp_get_num_threads(). Thread 0 will then calculate the terms gamma[0], gamma[m], gamma[2m], \(\dots\); thread \(1\) will calculate the terms gamma[1], gamma[m+1], gamma[2m+1], etc.

The complete program is given below.

//parallelFunction.cpp
// compile with
// c++ parallelFunction.cpp -o pf -std=c++11 -fopenmp
// execute with
// ./pf
#include<iostream>
#include<omp.h>
#include"SimpleTimer.cpp"
long powerFunction(long a, long b, long p){
  long res=1;
  long i=0;
  while(i < b){
    res*=a;
    res%=p;
    ++i;
  }
  return res;
}
void parallelFunction(long p, long n, long * alpha, long * beta, long * gamma){
  long m;
  #pragma omp parallel
  {
    if(omp_get_thread_num()==0){
      m=omp_get_num_threads();
    }
  }
  #pragma omp parallel num_threads(m)
  {
    long myId=omp_get_thread_num();
    long i=myId;
    while(i < n){
      gamma[i]=powerFunction(alpha[i],beta[i],p);
      i+=m;
    }
  }
}
int main(){
  long * alpha, * beta, * gamma;
  long p=17;
  long N=100000;
  alpha=new long[N]; beta=new long[N]; gamma=new long[N];
  alpha[0]=8; beta[0]=5183; alpha[1]=9; beta[1]=4351;
  alpha[2]=7; beta[2]=4711;
  long i=3;
  while(i < N){
    alpha[i]=alpha[i-3]; beta[i]=beta[i-3]; ++i;
  }
  SimpleTimer tm;
  tm.start();
  parallelFunction(p,N,alpha,beta,gamma);
  tm.end();
  std::cout << "The parallel function running time is: " << tm.getTime() << "\n";

  delete[] alpha; delete[] beta; delete[] gamma;
  return 0;
}

6. Practice

Problem 1.

The following archive

https://imomath.com/bmath/pcpdf/programming/hw6_input.tar.gz

contains the file hw6_input.txt.

Use the function readSequenceFromFile that you created in homework 2 to read the sequence \(x\) of non-negative integers from the file hw6_input.txt. The sequence \(x\) is written to the file according to the convention we created in homework 2.

Each of the elements of the sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) corresponds to one of the citizens in a country. The number \(n\) is read from the file as explained in homework 2. The number \(x[i]\) is the politician for whom the citizen \(i\) voted in the election. As you can see from the file, the politicians are represented by numbers from the set \(\{0,1,2,3,\dots, 9\}\).

Determine which politician won the elections.

Create two functions: a non-parallel version, and a parallel one using openmp.

Remark 1: Be careful when designing the parallel algorithm so that no two threads attempt to write to the same location in RAM.

Remark 2: You may notice that the parallel version is slower than the non-parallel one on dual-core computers. The reason is that the non-parallel version performs very few computations, while parallelization adds a few more instructions to the code. The performance improves only if there are more than two cores.

Problem 2

Use the function readSequenceFromFile that you created in homework 2 to read the sequence \(x\) of non-negative integers from the file hw6_input.txt. (You should have downloaded the file while solving the previous problem.) The sequence \(x\) is written to the file according to the convention we created in homework 2.

Read the numbers \(k\) and \(m\) from the standard input. Each of the elements of the sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) corresponds to one of the citizens in a country. The number \(n\) is read from the file as explained in homework 2. The citizens live in a circle. Hence, the first left neighbor of citizen \(0\) is citizen \(n-1\), and the second neighbor from the left is citizen \(n-2\). The first neighbor from the right of citizen \(0\) is citizen \(1\), and the second from the right is citizen \(2\).

The number \(x[i]\) is the politician for whom the citizen \(i\) is planning to vote in the upcoming election. As you can see from the file, the politicians are represented by numbers from the set \(\{0,1,2,3,\dots, 9\}\).

The citizens talk about the politicians and often change their opinions on how to vote. During each of the next \(m\) days, each citizen looks at the \(k\) nearest neighbors located to the left and the \(k\) nearest neighbors located to the right. Then during the night, each citizen changes his or her favorite politician to the one that the majority of these neighbors preferred the day before (in case of a tie, the politician with the smaller id is chosen). This crazy behavior continues for \(m\) days and nights.

Find out who is going to vote for whom after \(m\) days and \(m\) nights. Create two functions: a non-parallel version, and a parallel one using openmp and compare their performance.

Remark: Be careful when designing the algorithms: when updating the opinion of citizen \(8\), you will look at the neighbor \(7\). Keep in mind that you must use the value \(x[7]\) from the previous day, which can be tricky if you have already overwritten \(x[7]\).