The parallel loop templates parallel_for and parallel_reduce take an optional partitioner argument, which specifies a strategy for executing the loop. The following table summarizes the three partitioners and their effect when used in conjunction with blocked_range.
Partitioner | Description | When Used with blocked_range(i,j,g) |
---|---|---|
simple_partitioner | Chunksize bounded by grain size. | g/2 ≤ chunksize ≤ g |
auto_partitioner (default)[4] | Automatic chunk size. | g/2 ≤ chunksize |
affinity_partitioner | Automatic chunk size and cache affinity. | |
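The g/2 ≤ chunksize ≤ g bound for simple_partitioner can be illustrated with a small stand-alone model. The sketch below is not TBB code: it assumes only that a range is split at its midpoint for as long as its size exceeds the grain size g, and it assumes the initial range is larger than g. The names split_sizes and chunk_sizes are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of recursive halving applied to a range [begin, end).
// Assumption: a range is divisible only while its size exceeds the grain size.
// Appends the size of every undivided leaf chunk to `out`.
void split_sizes(std::size_t begin, std::size_t end, std::size_t grain,
                 std::vector<std::size_t>& out) {
    assert(grain > 0);
    const std::size_t size = end - begin;
    if (size <= grain) {               // no longer divisible: one chunk
        out.push_back(size);
        return;
    }
    const std::size_t mid = begin + size / 2;  // split at the midpoint
    split_sizes(begin, mid, grain, out);
    split_sizes(mid, end, grain, out);
}

// Returns the chunk sizes produced for the model range (i, j, g).
std::vector<std::size_t> chunk_sizes(std::size_t i, std::size_t j,
                                     std::size_t g) {
    std::vector<std::size_t> sizes;
    split_sizes(i, j, g, sizes);
    return sizes;
}
```

Under this model, chunk_sizes(0, 1000, 100) yields 16 chunks of 62 or 63 iterations each: a range of 125 is still larger than g, so it is halved once more, and every resulting chunk lies between g/2 and g.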
An auto_partitioner is used when no partitioner is specified. In general, the auto_partitioner or affinity_partitioner should be used, because these tailor the number of chunks based on available execution resources. However, simple_partitioner can be useful in the following situations:
- The subrange size for operator() must not exceed a limit. That might be advantageous, for example, if your operator() needs a temporary array proportional to the size of the range. With a limited subrange size, you can use an automatic variable for the array instead of having to use dynamic memory allocation.
- A large subrange might use cache inefficiently. For example, suppose the processing of a subrange involves repeated sweeps over the same memory locations. Keeping the subrange below a limit might enable the repeatedly referenced memory locations to fit in cache. See the use of parallel_reduce in examples/parallel_reduce/primes/primes.cpp for an example of this scenario.
- You want to tune to a specific machine.
[4] Prior to Intel® Threading Building Blocks (Intel® TBB) 2.2, the default was simple_partitioner. Compile with TBB_DEPRECATED=1 to get the old default.