OpenMP
 
 
 

OpenMP overview

OpenMP is a higher-level threading abstraction designed primarily for data parallelism applied to for loops. It is supported by VS2005 and by the Intel compiler on all platforms. The version of gcc required by Maya on OSX and Linux does not support OpenMP.

OpenMP is implemented via compiler pragmas. Compared with native threading, it presents a very simple interface to the user: thread creation and management, and most data partitioning, are hidden. OpenMP uses thread pools, which are created the first time a parallel region is encountered. For this reason, it is important to ignore the very first threaded loop traversal in an application when profiling OpenMP code.
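The profiling advice above can be sketched as follows. This is a minimal illustration, not a prescribed method: timedPass and profileLoop are hypothetical names, the loop body is a placeholder, and the parallel for pragma it uses is the one introduced in the simple example that follows. Timing uses std::chrono so the code also compiles when OpenMP is disabled.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs one threaded pass over the data and returns
// the elapsed wall-clock time in milliseconds.
double timedPass(std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 0.5f + 1.0f;  // placeholder thread-safe work
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: average time per pass in milliseconds, measured
// after one untimed warm-up pass.
double profileLoop(std::vector<float>& data, int passes) {
    assert(passes > 0);
    timedPass(data);  // warm-up pass: pays for thread pool creation, timing discarded
    double total = 0.0;
    for (int p = 0; p < passes; p++) {
        total += timedPass(data);
    }
    return total / passes;
}
```

Calling profileLoop(data, 10) would report the average of ten passes measured after the thread pool already exists.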

Here is a simple example of a threaded loop:

#pragma omp parallel for
for(int i=0; i<imax; i++) {
    doThreadsafeWork(i);
}

The compiler converts this code into a function that is run in parallel by multiple threads, by default one thread per logical processor. With the default scheduling used by most implementations, the first thread takes the first (imax/numThreads) iterations of the loop, the second thread takes the next equally sized chunk, and so on. This ensures good cache affinity: each processor usually works with adjacent data elements, which minimizes cache misses.
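The block partitioning described above can be illustrated with a small sketch. The function staticChunk is hypothetical and is not part of OpenMP; it assumes that leftover iterations go to the lowest-numbered threads when imax is not a multiple of numThreads, whereas real runtimes may distribute the remainder differently.

```cpp
#include <cassert>
#include <utility>

// Hypothetical illustration: returns the half-open range [begin, end) of
// loop indices that thread `threadId` would process under a simple block
// partitioning of imax iterations across numThreads threads.
// Assumption: the first (imax % numThreads) threads each take one extra
// iteration; actual OpenMP runtimes may handle the remainder differently.
std::pair<int, int> staticChunk(int imax, int numThreads, int threadId) {
    assert(numThreads > 0 && threadId >= 0 && threadId < numThreads);
    const int base = imax / numThreads;   // iterations every thread gets
    const int extra = imax % numThreads;  // leftover iterations to spread out
    const int begin = threadId * base + (threadId < extra ? threadId : extra);
    const int end = begin + base + (threadId < extra ? 1 : 0);
    return {begin, end};
}
```

For example, with imax = 8 and numThreads = 4, thread 1 would process indices 2 and 3, which are adjacent in memory if the loop indexes an array.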

Here is a more complex example:

#pragma omp parallel for schedule(guided) if(imax>1000)
for(int i=0; i<imax; i++) {
    doThreadsafeWork(i);
    #pragma omp critical
    doNonThreadsafeWork(i);
}

This code also breaks the loop into chunks, but the guided scheduling option causes it to use chunks smaller than (imax/numThreads) and to send a new chunk to each thread as it finishes its existing chunk. This provides better load balancing when the workload varies between iterations of the loop, since additional chunks of work are assigned at runtime to any threads that finish early. The critical pragma places a lock on the statement that follows it, so only one thread at a time can execute doNonThreadsafeWork.
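A self-contained variant of this pattern might look like the sketch below. The names threadsafeWork and collectResults are hypothetical stand-ins, and appending to a shared std::vector plays the role of the non-thread-safe work, since push_back must not be called concurrently on the same vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for doThreadsafeWork: a pure computation on its argument.
int threadsafeWork(int i) { return i * 2; }

// Runs the loop and collects every result in one shared vector.
// When the loop is threaded, the order of the results is not deterministic.
std::vector<int> collectResults(int imax) {
    assert(imax >= 0);
    std::vector<int> results;
    #pragma omp parallel for schedule(guided) if(imax > 1000)
    for (int i = 0; i < imax; i++) {
        const int value = threadsafeWork(i);  // may run concurrently on several threads
        // The critical pragma serializes the next statement, so only one
        // thread at a time appends to the shared vector.
        #pragma omp critical
        results.push_back(value);
    }
    return results;
}
```

Because every iteration passes through the critical section, this only pays off when the thread-safe part of the loop body dominates the cost of each iteration.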

The if conditional on the main pragma causes the loop to be run in parallel only if the trip count (number of loop iterations) exceeds the specified value. There is overhead to invoking a parallel region, so it does not make sense to parallelize the evaluation if the trip count is too low. A good rule of thumb is to assume an overhead of about 10,000 clock cycles to start and end a parallel region, so the cutoff trip count should be chosen so that the total work in the loop exceeds this overhead. This is particularly important for an application like Maya, where the same algorithm may be invoked on a single very dense object or on a large number of very simple objects. In the latter case, the startup overhead of thousands of very short threaded evaluations could easily overwhelm any threading benefit, and threaded code that does not have a cutoff might actually run significantly slower. Note that even if the threading is cut off by the conditional and the loop runs serially, the code inside the loop is still extracted by the compiler into a separate function and will take the hit of a function call.
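The rule of thumb can be turned into a rough estimate of the cutoff. The helper below is hypothetical, and both of its inputs are estimates: the overhead figure is the rule-of-thumb value from the text, and the per-iteration cost has to be measured or guessed for the loop in question.

```cpp
#include <cassert>

// Hypothetical helper: smallest trip count whose total work, in clock
// cycles, exceeds the overhead of starting and ending a parallel region.
// Assumption: every iteration costs roughly cyclesPerIteration cycles.
int minTripCount(long long overheadCycles, long long cyclesPerIteration) {
    assert(overheadCycles >= 0 && cyclesPerIteration > 0);
    return static_cast<int>(overheadCycles / cyclesPerIteration) + 1;
}
```

With an overhead of 10,000 cycles and iterations costing about 50 cycles each, minTripCount(10000, 50) gives 201. This is only a lower bound on a sensible cutoff; the value used in the if clause is best confirmed by profiling.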

Pros and cons of OpenMP

The benefits of OpenMP are cross-platform support, simple implementation, and ease of removal. Simply disabling the pragma causes the code to revert to its serial form, which provides a quick way to check that the behavior is the same. A surprisingly large amount of code can be threaded using very simple OpenMP pragmas such as those in the examples above. OpenMP is very useful for quickly prototyping and evaluating possible threading benefits, even if the final implementation will be done using a different threading library.

The downsides are limited algorithm applicability and incompatibility between implementations. The VS2005 and Intel OpenMP libraries can be used together, but they do not recognize each other's implementations. So a threaded loop compiled with VS2005 that calls a function compiled with the Intel compiler will ignore any OpenMP locks defined by the Intel compiler.

OpenMP is also problematic with nested threading, where higher-level threads are spawned that call into code that is itself threaded at a lower level. This causes more threads to be activated than there are cores on the system. This is known as oversubscription, and it leads to poor performance. Oversubscription becomes increasingly problematic as more threading is added to an application, since a developer may not even realize that a function being called in parallel is itself threaded.

OpenMP and Maya

A lot of algorithms in Maya are threaded using OpenMP, including the fluids solver, hair collisions and many deformers. Maya uses the Intel compiler for OpenMP on all platforms.

Vendor-specific OpenMP Issues

There are some performance and correctness issues with some OpenMP implementations. See Vendor-specific OpenMP Issues for details.