Multi-Core Programming and Threads

1. Introduction: CPU, core, and thread

A lot of development in engineering is a race to build faster machines. One way to make programs run faster is to use computers with faster central processing units (CPUs). The problem is that CPU clock speeds cannot be pushed much beyond a few gigahertz. Another way to increase the speed of a program is to use multiple CPUs running at the same time.

Modern CPUs are made with several cores. We will think of cores as separate CPUs. In reality they are not, because they share the same power supply, cooling, and other resources, but that is a distinction we will not worry about in our introductory course.

It is worth pointing out that some tasks CPUs perform are much faster than others. Arithmetic and logic operations are much faster than communication with the main memory (RAM). Here are approximate times for some typical operations:

  • 0.5 nanoseconds to multiply two integers if they are already inside the CPU registers;
  • 100 nanoseconds to bring something from the main memory;
  • 16000 nanoseconds to bring something from a solid-state drive.

Since reading from RAM is a very common operation in a program, the ability of the CPU to multiply numbers really fast does not improve performance all that much. Addition and multiplication are performed by the part of the CPU called the arithmetic logic unit (ALU). Most of the time the CPU is waiting for data to travel back and forth to RAM. CPU cores often have smart scheduling algorithms that make them very efficient at executing several programs at the same time: while one program is waiting for data to arrive from RAM, another program is using the ALU. These individual program executions are called threads.

Advanced CPUs have multiple cores, and each core can support several threads. When we write our programs we have no control over how the scheduling is done among threads. So, as high-level programmers, we only care about the total number of threads that our computer allows us to use.

Our first program determines the number of threads available. The listing of the code will be put first and the explanation will be placed below.

// compile with
// c++ countThreads.cpp -o countTh -std=c++11 -pthread
// execute with
// ./countTh

#include<iostream>
#include<thread>
int main(){
  std::cout <<std::thread::hardware_concurrency()<< "\n";
  return 0;
}

In order to write parallel programs we need to do the following: 1) include the header <thread>; 2) use the flag -pthread when invoking the compiler.

The function std::thread::hardware_concurrency() returns a suggestion for the number of threads to use in a program. Often, the suggested number of threads is a multiple of the number of cores. If the CPU is quad core, then it has 4 cores, and the suggested number of threads would be a number such as 4, 8, 12, or 16 (depending on whether there is hardware support for multi-threading). Programmers should try to use the suggested number of threads, but they are not required to. Occasionally, for certain programs, it is worth doing simulations and experiments with different numbers of threads: the suggested number does not always result in the fastest program.

2. First example

We will now make a program that uses three threads to execute three different functions at the same time. The first function adds the number 7 to its input value x. The second function adds its two arguments. The third function adds its three arguments.

The implementations of our three functions are given below:

void addSeven(long x, long* result){
  *result=x+7;
}
void addTwoNumbers(long x, long y, long* sum){
  *sum=x+y;
}
void addThreeNumbers(long x, long y, long z, long* sum){
  *sum=x+y+z;
}

We will ask our program to execute the functions concurrently: we start the first function and tell the program not to wait for this execution to finish before proceeding to the next instruction. The first function is called addSeven, and the request to run it without waiting is given with the command

std::thread th1(addSeven,a,&result1);

The command creates an object th1 of type std::thread. The constructor starts a thread that executes addSeven with the arguments a and &result1. The arguments must have the same types as the function expects: the first one, a, must be of type long, and the second one must be a pointer to long.

After we create the object th1, we proceed in an analogous way. We create two objects th2 and th3. Each of th2 and th3 will be created using the constructor that starts a new thread.

After the threads are created, they continue executing their functions. In our example the functions are extremely simple, and th1 may be done before th2 and th3 are even created. In practice, however, different threads can take substantially different amounts of time to finish their execution. The main program will not wait for individual threads unless a specific command is given: join. The command th1.join(); tells the program to wait until the thread th1 completes its execution. We need to issue a join command for each of the threads. The complete code is given below.

// compile with
// c++ threeThreads.cpp -o threeTh -std=c++11 -pthread
// execute with
// ./threeTh

#include<iostream>
#include<thread>
void addSeven(long x, long* result){
  *result=x+7;
}
void addTwoNumbers(long x, long y, long* sum){
  *sum=x+y;
}
void addThreeNumbers(long x, long y, long z, long* sum){
  *sum=x+y+z;
}
int main(){
  long a=10;
  long b=15; long c=20;
  long x=2; long y=5; long z=30;
  long result1,result2,result3;
  std::thread th1(addSeven,a,&result1);
  std::thread th2(addTwoNumbers,b,c,&result2);
  std::thread th3(addThreeNumbers,x,y,z,&result3);
  th1.join();
  th2.join();
  th3.join();
  std::cout<<result1<<" "<<result2<<" "<<result3<<"\n";
  return 0;
}

3. Problematic example: Threads that print

One of the best ways to learn parallel programming is to start by writing bad programs. We will now learn why printing should not be done in parallel.

We will create a program that starts \(5\) threads and asks each thread to call a function that prints one message. The message will include one integer i, and the threads will send different integers to the function. Thread 0 will print I am thread 0; thread 1 will print I am thread 1, and so on. The last thread will print I am thread 4.

We are allowed to ask for 5 threads even if the function std::thread::hardware_concurrency() told us that 4 (or 8, or 12) was the optimal number. We could ask for 100 threads as well: each core would be assigned more threads than the manufacturer suggests, and the program would probably run slower than if we followed the recommendation.

Here is the code for the program Printing threads.

// compile with
// c++ printingThreads.cpp -o printingTh -std=c++11 -pthread
// execute with
// ./printingTh

#include<iostream>
#include<thread>
void printMessage(long i){
  std::cout<<"I am thread "<<i<<"\n";
}
int main(){
  const long numThreads=5;
  std::thread allTh[numThreads];
  for(long i=0;i<numThreads;++i){
    allTh[i]=std::thread(printMessage,i);
  }
  for(long i=0;i<numThreads;++i){
    allTh[i].join();
  }
  return 0;
}

The first for loop in the above code causes the function printMessage to be executed in parallel. Each element of the array allTh is of type std::thread. Once it is assigned the value std::thread(printMessage,i), a new thread is started. The thread calls the function printMessage and passes the parameter i to the function.

Please compile and execute the code several times. You will get a different messy output every time. This is one of the outputs the code could produce:

I am thread I am thread 0
I am thread 1
I am thread I am thread 2
4
3

To explain this mess, let us start with the conclusion: the command std::cout that prints to the screen should not be run in parallel. Your computer has one and only one screen. The 5 threads are in a race for this single resource. In the output above you can see that one thread got there first, grabbed the connection to the screen, and printed "I am thread"; just as it was about to print its number, another thread grabbed the connection to the screen and started printing its message. Such races for resources must be avoided: the consequences cannot be predicted.

4. Memory race

Shared resources such as the screen, keyboard, and printers should not be used inside the parallel section of the code. This is fairly easy to avoid, and it rarely ends up being the source of a bug. However, there is another, more dangerous bug that is harder to detect: the memory race. This happens when several threads try to write to the same location in main memory. If a variable is declared before the parallel block and all threads then try to write to the variable, it can very well end up with a lot of garbage inside it.

Let us look at the following code.

// compile with
// c++ memoryRace.cpp -o mRace -std=c++11 -pthread
// execute with
// ./mRace

#include<iostream>
#include<thread>
void memoryRace(long* aR){
  *aR=0;
  for(long i=0;i < 100000;++i){
    *aR=*aR+1;
  }
}
int main(){
  std::thread thArray[5];
  long result=0;
  for(long i=0;i<5;++i){
    thArray[i]=std::thread(memoryRace,&result);
  }
  for(long i=0;i<5;++i){
    thArray[i].join();
  }
  std::cout << result << std::endl;
  return 0;
}

If the five calls to memoryRace were executed one after another, the variable result would simply end up with the value \(100000\). But when the calls run concurrently on \(5\) threads, almost every execution of the program produces a different outcome. All threads are trying to write to the same variable result in RAM. Each of them sets that variable to 0 exactly once. It may happen that just as one thread finishes its loop and writes the big sum into result, another thread executes its first command and places 0 in result.

The program can also become slower when the threads try to write to the same location. The shared resource often cannot accommodate simultaneous requests, and the threads are forced to wait in line for it.

5. Modifying a sequence

After writing several bad programs, we are ready for a good one. Here is the problem that will illustrate the increase in speed that parallel programming can offer.

We want to create a function whose arguments consist of a positive prime number \(p\), a positive integer \(n\), and three sequences alpha, beta, and gamma. The sequences alpha and beta contain positive integers. The function should modify the sequence gamma so that each of its terms satisfies \[\mbox{gamma}[i]=\left(\mbox{alpha}[i]^{\mbox{beta}[i]}\right) \% p.\]

We will write both the traditional (non-parallel) function and the parallel function. Then we will compare their speeds.

6. Timer

We will compare the performance of the non-parallel and parallel programs. We will need a class SimpleTimer to do the comparison. The timer is located at simpleTimerCPP

7. Non-parallel program

We first need a function long powerFunction(long a, long b, long p) that calculates \(\left(a^b\right) \% p\).

Both non-parallel and parallel version of the program will use powerFunction. The non-parallel implementation is given below.

// compile with
// c++ nonParallel.cpp -o npTest -std=c++11 -pthread
// execute with
// ./npTest
#include<iostream>
#include <thread>
#include"SimpleTimer.cpp"

long powerFunction(long a, long b, long p){
  long res=1;
  long i=0;
  while(i < b){
    res*=a;
    res%=p;
    ++i;
  }
  return res;
}
void nonParallelFunction(long* alpha, long* beta, long* gamma, long p, long n){
  long i=0;
  while(i < n){
    gamma[i]=powerFunction(alpha[i],beta[i],p);
     ++i;
   }
 }
int main(){
   std::cout<<"Starting execution...\n";
   long * alpha, * beta, * gamma;
   long p=17;
   long N=100000;
   alpha=new long[N]; beta=new long[N]; gamma=new long[N];
   alpha[0]=8;beta[0]=5183;alpha[1]=9;beta[1]=4351;
   alpha[2]=7;beta[2]=4711;
   long i=3;
   while(i < N){
     alpha[i]=alpha[i-3];beta[i]=beta[i-3];++i;
   }
   SimpleTimer tm;
   tm.start();
   nonParallelFunction(alpha,beta,gamma,p,N);
   tm.end();
   std::cout<<"The non-parallel function running time is: ";
   std::cout<< tm.getTime() << "\n";
   std::cout<<"First few terms of the sequence are:\n";
   for(long i=0;i<100;++i){
     std::cout<<gamma[i]<<"\t";
     if(i%10==9){std::cout<<"\n";}
   }
   std::cout<<"\n";
   delete[] alpha;delete[] beta;delete[] gamma;
   return 0;
}

Please be patient when you execute the code. It can take as long as 30 seconds on some computers.

8. Parallel program

The parallel program will only change the function nonParallelFunction to parallelFunction, which takes one additional argument: the number of threads \(m\). We will obtain \(m\) by using the function std::thread::hardware_concurrency(). Then thread 0 will calculate the terms gamma[0], gamma[m], gamma[2m], \(\dots\); thread \(1\) will calculate the terms gamma[1], gamma[m+1], gamma[2m+1], and so on.

The complete program is given below.

// compile with
// c++ parallel.cpp -o pTest -std=c++11 -pthread
// execute with
// ./pTest
#include<iostream>
#include <thread>
#include"SimpleTimer.cpp"

long powerFunction(long a, long b, long p){
  long res=1;
  long i=0;
  while(i < b){
    res*=a;
    res%=p;
    ++i;
  }
  return res;
}
void executeOnOneThread(long* alpha, long* beta, long* gamma, long p, long m, long start, long end){
  while(start<end){
    gamma[start]=powerFunction(alpha[start],beta[start],p);
    start+=m;
  }
}
void parallelFunction(long* alpha, long* beta, long* gamma, long p, long n, long m){
  if(m<1){m=1;}
  std::thread* allTh;
  allTh=new std::thread[m];
  for(long i=0;i<m;++i){
    allTh[i]=std::thread(executeOnOneThread,alpha,beta,gamma,p,m,i,n);
  }
  for(long i=0;i<m;++i){
    allTh[i].join();
  }
  delete[] allTh;
}

int main(){
   long m=std::thread::hardware_concurrency();
   std::cout<<"Starting execution...\n";
   long * alpha, * beta, * gamma;
   long p=17;
   long N=100000;
   alpha=new long[N]; beta=new long[N]; gamma=new long[N];
   alpha[0]=8;beta[0]=5183;alpha[1]=9;beta[1]=4351;
   alpha[2]=7;beta[2]=4711;
   long i=3;
   while(i < N){
     alpha[i]=alpha[i-3];beta[i]=beta[i-3];++i;
   }
   SimpleTimer tm;
   tm.start();
   parallelFunction(alpha,beta,gamma,p,N,m);
   tm.end();
   std::cout << "The parallel function running time is: " << tm.getTime() << "\n";
   std::cout<<"First few terms of the sequence are:\n";
   for(long i=0;i<100;++i){
     std::cout<<gamma[i]<<"\t";
     if(i%10==9){std::cout<<"\n";}
   }
   std::cout<<"\n";
   delete[] alpha;delete[] beta;delete[] gamma;
   return 0;
}