Parallel Programming Models: MPI and OpenMP

We will use two widely used parallel programming models. They solve similar problems with different methods, and in this chapter we will introduce both of them with sample code to work with.

Message Passing Interface (MPI)

MPI is a standardized and portable message-passing library designed to distribute the tasks of a job across a cluster of computers. Sharing these tasks usually requires intranet communication over standard Ethernet or Fibre Channel connections, or over specialized hardware such as InfiniBand or Elastic Fabric Adapter.

Let’s start with the terminology needed to understand the basic components of an MPI application.

  • Processes: Each MPI program consists of multiple processes, and each process is responsible for managing its own lifecycle.
  • Communicators: Processes are organized into groups called communicators, which define the scope and context of communication operations.
  • Point-to-Point Communication: MPI provides functions for sending and receiving messages between pairs of processes (e.g., MPI_Send, MPI_Recv); a minimal sketch follows this list.
  • Collective Communication: MPI also supports collective operations involving groups of processes (e.g., MPI_Bcast, MPI_Reduce).
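
Before moving to a full program, here is a minimal point-to-point sketch in which process 0 sends a single integer to process 1. The payload (42) and the message tag (0) are arbitrary choices for illustration, and the program assumes it is launched with at least two processes (for example, mpiexec -np 2).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        value = 42;  /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();

    return EXIT_SUCCESS;
}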

Let’s start with a simple C code example that uses MPI to compute the sum of the integers from 1 to one million. Even though this project is fairly simple and its result easy to guess (and there is no real reason to run it), I believe such examples are a great way to write “Hello, World!” applications in MPI.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

long int
compute_local_sum(int rank, int n)
{
    long int local_sum = 0;
    int start = rank * n + 1;  // Assuming elements start from 1
    int end = start + n;

    printf("Process %d: Start = %d, End = %d\n", rank, start, end);

    for (int i = start; i < end; ++i)
    {
        local_sum += i;        // Compute the local sum for this process
    }

    return local_sum;
}

int
main(int argc, char *argv[])
{
    int rank, nprocs, n;
    long int local_sum, global_sum;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Distribute the data and compute the local sum
    n = 1000000 / nprocs;
    local_sum = compute_local_sum(rank, n);

    printf("Process %d: Local sum = %ld\n", rank, local_sum);

    // Perform parallel sum reduction
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
    {
        printf("Global sum: %ld\n", global_sum);
    }

    MPI_Finalize();

    return EXIT_SUCCESS;
}

In this example, we divide the problem across the number of processes, nprocs, and each process, identified by its rank, executes its portion of the work. Later in the program, MPI_Reduce performs the parallel sum reduction, combining the local_sum of every process into global_sum on rank 0.
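
One detail worth noting: n = 1000000 / nprocs assumes that the total divides evenly by the number of processes; otherwise the last few integers are silently skipped. A minimal sketch of one way to handle this, using a hypothetical variant of compute_local_sum that hands the remainder to the last rank, looks like this:

/*
 * Sketch: a variant of compute_local_sum in which the last rank also
 * sums any leftover elements when the total does not divide evenly.
 */
long int
compute_local_sum_uneven(int rank, int nprocs, int total)
{
    int n = total / nprocs;
    int start = rank * n + 1;
    int end = (rank == nprocs - 1) ? total + 1 : start + n;  /* exclusive bound */

    long int local_sum = 0;

    for (int i = start; i < end; ++i)
    {
        local_sum += i;
    }

    return local_sum;
}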

OpenMP

Open Multi-Processing (OpenMP) is an application programming interface that supports shared-memory parallel programming in a portable and scalable manner. It allows developers to create multi-threaded applications that leverage the computational power of modern multi-core processors.

Let’s look at the core components of OpenMP to understand what it provides.

  • Parallel Regions: OpenMP uses compiler directives (pragmas) to specify parallel regions, where multiple threads execute concurrently; see the sketch after this list.
  • Work-Sharing Constructs: OpenMP provides constructs like parallel for and parallel sections for distributing work among threads.
  • Synchronization: Mechanisms like barriers, critical sections, and locks are available for synchronizing thread execution and protecting shared data; a reduction-clause sketch follows the matrix multiplication example below.
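
The simplest way to see a parallel region in action is a program in which every thread announces itself. The following is only a minimal sketch; omp_get_thread_num and omp_get_num_threads are standard OpenMP runtime functions.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    /* Every thread in the team executes this block concurrently. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        printf("Hello from thread %d of %d\n", tid, nthreads);
    }

    return EXIT_SUCCESS;
}

The more interesting case is a work-sharing loop. The next program fills two matrices with random values and multiplies them in parallel:
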
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void
matrix_multiply(int n, double a[n][n], double b[n][n], double c[n][n])
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            c[i][j] = 0.0;

            for (int k = 0; k < n; k++)
            {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

void
print_matrix(int n, double matrix[n][n])
{
    const int max_rows = 10;
    const int max_cols = 10;

    for (int i = 0; i < n && i < max_rows; i++)
    {
        for (int j = 0; j < n && j < max_cols; j++)
        {
            printf("%.1f\t", matrix[i][j]);
        }
        printf("\n");
    }

    /*
     * Rows and columns beyond max_rows and max_cols are not printed;
     * print an ellipsis to indicate that the output was truncated.
     */
    if (n > max_rows || n > max_cols)
    {
        printf("...\n");
    }
}


int
main(int argc, char *argv[])
{
    int n = 500;

    /*
     * Allocate the matrices on the heap; three 500x500 matrices of
     * doubles (about 6 MB in total) may not fit on the default stack.
     */
    double (*a)[n] = malloc(sizeof(double[n][n]));
    double (*b)[n] = malloc(sizeof(double[n][n]));
    double (*c)[n] = malloc(sizeof(double[n][n]));

    if (a == NULL || b == NULL || c == NULL)
    {
        fprintf(stderr, "Failed to allocate matrices\n");
        return EXIT_FAILURE;
    }

    srand((unsigned int) time(NULL));

    /* rand() is not guaranteed to be thread-safe, so initialize serially. */
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            a[i][j] = rand() % 10;
            b[i][j] = rand() % 10;
        }
    }

    matrix_multiply(n, a, b, c);
    print_matrix(n, c);

    free(a);
    free(b);
    free(c);

    return EXIT_SUCCESS;
}

In this example, the #pragma omp parallel for directive instructs the compiler to distribute the iterations of the outer loop across multiple threads, enabling parallel execution of the matrix multiplication algorithm.
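
The parallel for construct handles the read-only sharing of a and b, but when threads must update a shared variable you need one of the synchronization mechanisms listed earlier. As a minimal sketch, assuming we wanted to add up every entry of the result matrix (matrix_sum is a hypothetical helper, not part of the program above), the reduction clause gives each thread a private partial sum and combines them at the end; an explicit critical section would also be correct but serializes the updates.

/*
 * Sketch: sum all entries of the result matrix. The reduction clause
 * gives each thread its own private copy of total and combines the
 * copies when the loop finishes.
 */
double
matrix_sum(int n, double c[n][n])
{
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            total += c[i][j];
        }
    }

    return total;
}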

Just to give you an idea, on a virtual machine the code above ran in about 0.5 seconds with OpenMP, whereas without OpenMP it took about 0.9 seconds.
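
If you would like to reproduce a similar measurement yourself, one straightforward approach is to wrap the call to matrix_multiply with omp_get_wtime, the OpenMP wall-clock timer. The snippet below is only a sketch meant to slot into the main function above.

    /* Sketch: measure the wall-clock time of the parallel multiplication. */
    double start = omp_get_wtime();

    matrix_multiply(n, a, b, c);

    double elapsed = omp_get_wtime() - start;
    printf("matrix_multiply took %.3f seconds\n", elapsed);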

Building C software

In order to run any of the code above, we need to compile it first. For that, I will introduce CMake, which is already installed in your virtual machine. In the following CMakeLists.txt file, we enable MPI support to compile our “Hello, World!” example.

cmake_minimum_required(VERSION 3.1)
project(Chapter4)

find_package(MPI REQUIRED)

add_executable(chapter_4 main.c)

target_compile_options(chapter_4 PRIVATE ${MPI_C_COMPILE_FLAGS})
target_include_directories(chapter_4 PRIVATE ${MPI_C_INCLUDE_PATH})
target_link_libraries(chapter_4 ${MPI_C_LIBRARIES} ${MPI_C_LINK_FLAGS})
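
As a side note, CMake 3.9 and newer also provide an imported target from FindMPI, so the three lines above could be collapsed into a single link command, mirroring the style of the OpenMP file below:

target_link_libraries(chapter_4 PRIVATE MPI::MPI_C)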

For the OpenMP example, we will need a slightly different CMakeLists.txt, since we do not need the MPI check but rather the OpenMP check and its library linking.

Technically, everything could be done in a single CMakeLists.txt file, but for the sake of demonstration we will keep them separate.

cmake_minimum_required(VERSION 3.1)
project(Chapter4)

find_package(OpenMP REQUIRED)

add_executable(chapter_4 main.c)

if (OpenMP_C_FOUND)
    target_link_libraries(chapter_4 PRIVATE OpenMP::OpenMP_C)
endif()

Both of these projects require a build directory, which for simplicity we will call build. Inside it, you can run the following cmake commands to generate GNU Makefiles and eventually compile the program.

mkdir build
cd build
cmake ..
cmake --build .

If you are curious about the generated Makefiles, pass --verbose to the build step (cmake --build . --verbose) or run make VERBOSE=1 inside the build directory. This will show all the commands that are executed in order to produce a running version of the program.

The compiled program will be named chapter_4. Let’s run the MPI program with two processes:

mpiexec -np 2 ./chapter_4

Now it is time to run the OpenMP program:

./chapter_4
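
By default, OpenMP chooses the number of threads at run time, typically one per available core. You can override this with the standard OMP_NUM_THREADS environment variable, for example:

OMP_NUM_THREADS=4 ./chapter_4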