Abstract. The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations.

Hence, each piece is decomposed into a set of MPDATA blocks. The blocks are processed one after another, and each block is processed in parallel by the corresponding work team.

In the basic, unoptimized implementation of the MPDATA algorithm (Algorithm 1), every stage reads its required set of matrices from main memory and writes its results back to main memory after computation. Although the block decomposition of MPDATA reduces memory traffic, it still does not guarantee satisfactory utilization of the target platforms.

The starting point of the proposed block decomposition is the loop tiling technique applied to the original version of the MPDATA code. This rule requires flexible data management across all the stages, as well as an adequate mapping of partial results onto the cache space.
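
The idea of tiling a multi-stage stencil can be illustrated with a minimal sketch. It assumes a one-dimensional, two-stage, three-point stencil with periodic boundaries as a stand-in for the actual MPDATA stages; the names `combine`, `run_unblocked`, and `run_blocked` are hypothetical and do not come from the MPDATA code. The unblocked version sweeps the whole array once per stage, whereas the blocked version keeps the stage-1 results of each block, plus a one-cell halo, in a small local buffer before applying stage 2.

```python
def combine(left, mid, right):
    # Illustrative three-point stencil; not an actual MPDATA stage.
    return left + 2 * mid + right


def run_unblocked(a):
    """Two full sweeps: stage-1 results are stored in a full-size array."""
    n = len(a)
    b = [combine(a[(i - 1) % n], a[i], a[(i + 1) % n]) for i in range(n)]
    c = [combine(b[(i - 1) % n], b[i], b[(i + 1) % n]) for i in range(n)]
    return c


def run_blocked(a, block_size):
    """Same result, computed block by block with a small local buffer."""
    n = len(a)
    c = [0] * n
    for lo in range(0, n, block_size):
        hi = min(lo + block_size, n)
        # Stage 1 on the block extended by one halo cell on each side;
        # local_b[k] holds the stage-1 value at global index lo - 1 + k.
        local_b = [
            combine(a[(i - 1) % n], a[i % n], a[(i + 1) % n])
            for i in range(lo - 1, hi + 1)
        ]
        # Stage 2 reads only the local buffer, never a full-size array.
        for i in range(lo, hi):
            k = i - lo + 1
            c[i] = combine(local_b[k - 1], local_b[k], local_b[k + 1])
    return c
```

In this sketch the halo cells are recomputed by neighbouring blocks, so blocking trades some redundant arithmetic for a working set that can fit in cache. The right block size depends on the platform and is left as a parameter.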


In particular, the performance evaluation of sparse-matrix multiplication kernels on the Intel Xeon Phi was presented in [4].

MPDATA belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations [5, 30].

The best configurations, including the number of teams, the sizes of pieces, the block size, and the distribution of computation within teams, are chosen empirically and individually for each platform. The impact of block size on the overall performance is illustrated in Figure 6c. It also allows us to improve the cache reuse and operational intensity ratio. This mechanism is based on using the OpenMP atomic directive.
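
An empirical search of this kind can be sketched as an exhaustive sweep over candidate configurations. The sketch below is only an illustration under stated assumptions: the function names `time_kernel` and `search_configurations`, the parameter names, and the candidate values are hypothetical and are not taken from the paper.

```python
import itertools
import time


def time_kernel(kernel, config, repeats=3):
    """Best-of-`repeats` wall-clock time of kernel(**config), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(**config)
        best = min(best, time.perf_counter() - start)
    return best


def search_configurations(measure, space):
    """Evaluate every combination in `space` and return the cheapest one.

    `space` maps a parameter name to its list of candidate values;
    `measure` maps a configuration dict to a cost (lower is better).
    """
    names = sorted(space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

On a real platform, `measure` would wrap the actual kernel, for example `lambda cfg: time_kernel(run_mpdata, cfg)` with a hypothetical `run_mpdata` entry point, and the search would be repeated separately for each target machine.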


The second one provides a better load balance across the available resources assigned to a team, but it requires more intracache communications. This is particularly noticeable for the first stages, which depend strongly on the input data.

