

Bringing AI Everywhere in HPC

# MPI@Intel

Maria J. Garzaran


Supercomputing 2023

### Intel<sup>®</sup> MPI 2021.11 Update

- What's new:
  - MPI-3 RMA GPU (host and device initiated)
    - Device initiated mode: I\_MPI\_OFFLOAD\_ONESIDED\_DEVICE\_INITIATED
  - MPI 4.0 features: MPI Sessions support
  - Large count/native ILP64 is available since 2021.10
  - OFI/cxi support and optimizations (Technical preview)
- Enhancements:
  - GPU buffer aware pt2pt and collective operations optimization
  - MPI\_Alltoall, MPI\_Allreduce CPU/GPU platforms scalability optimizations
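The large count item matters because classic MPI bindings take the element count as a C `int`, while the MPI 4.0 large-count bindings (the `_c` variants such as `MPI_Send_c`) take an `MPI_Count`. The following is a minimal sketch in plain C++ of the limit involved. It makes no MPI calls, and the helper names `fits_in_int_count` and `chunks_needed` are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// True when an element count can be passed through a classic `int count`
// argument without overflow. (Hypothetical helper, not part of any MPI API.)
bool fits_in_int_count(std::int64_t count) {
    return count >= 0 && count <= INT_MAX;
}

// Number of calls needed if a transfer had to be split into pieces of at
// most INT_MAX elements each, i.e. the workaround that large-count
// bindings make unnecessary. (Hypothetical helper.)
std::int64_t chunks_needed(std::int64_t count) {
    assert(count >= 0);
    return (count + INT_MAX - 1) / INT_MAX;
}
```

For example, a buffer of 3,000,000,000 elements does not fit in an `int` count and would need two chunked calls with the classic bindings, whereas a large-count binding can describe it in a single call.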



#### Intel® MPI 2021.11 MPI-3 RMA GPU (SYCL example)

Code listing (host initiated and device initiated\* variants side by side; fragments):

<pre>
q.submit([&amp;](auto &amp; h) {
    h.parallel_for(sycl::range(my_subarray.x_size), [=] (auto index)
...
        my_subarray.x_size, MPI_DOUBLE,
        my_subarray.rank + 1, 1,
        my_subarray.x_size, MPI_DOUBLE, cwin);
/* Recalculate internal points in parallel with communications */
for (int column = my_x_lb; column &lt; my_x_ub; column ++) {
...
</pre>

- MPI-3 RMA GPU subset of supported functions:
  - MPI\_Put
  - MPI\_Get
  - MPI\_Win\_lock / MPI\_Win\_lock\_all\*\*
  - MPI\_Win\_unlock / MPI\_Win\_unlock\_all\*\*
  - MPI\_Win\_flush / MPI\_Win\_flush\_all\*\*
  - MPI\_Win\_fence\*\*
- Supported for scale up and scale out
- Supported for SYCL and OpenMP (C/C++ only\*\*\*)


- \* Host initiated is available out of the box (I\_MPI\_OFFLOAD path). Device initiated requires I\_MPI\_OFFLOAD\_ONESIDED\_DEVICE\_INITIATED=1.
- \*\* Synchronization primitives require kernel level serialization
- \*\*\* Fortran support is work in progress


**Examples:** https://github.com/oneapi-src/oneAPI-samples/tree/development/Libraries/MPI/jacobian\_solver

### Intel<sup>®</sup> MPI 2021.11 GPU buffer path latency optimizations

IMB-MPI1-GPU **allreduce**. 4 nodes, 32 ranks total: 2 x Intel<sup>®</sup> Xeon<sup>®</sup> Platinum 8480+ Processor + 4 x Intel<sup>®</sup> Data Center GPU Max 1550. Latency ratio. Higher is better



IMB-MPI1-GPU **pingpong**. 1 node, 2 ranks total: 2 x Intel<sup>®</sup> Xeon<sup>®</sup> Platinum 8480+ Processor + 2 x Intel<sup>®</sup> Data Center GPU Max 1550. Latency ratio. Higher is better



Plot legend: Intel MPI 2021.11 GPU buffer, Intel MPI 2021.10 GPU buffer

- New latency and BW optimizations for GPU aware pt2pt and collective operations: allreduce/alltoall
- New highly efficient support for alltoall with XeLink and GPU RDMA\* support (I\_MPI\_OFFLOAD\_RDMA\*)

\* I\_MPI\_OFFLOAD\_RDMA is fully supported with IEFS + OFI/psm3 (available for ConnectX-6+ interconnects family as well): https://www.intel.com/content/www/us/en/support/articles/000088090/ethernet-products/intel-ethernet-software.html

### MPICH for Aurora Update - I

- Improvements for Intel<sup>®</sup> Data Center GPU Max Series
  - Optimizations for intra-node communication
    - Uses oneAPI Level Zero Inter Process Communication (IPC) primitives
    - For latency and bandwidth, and for a variety of message sizes
  - Optimized collective algorithms for GPU buffers
    - Allreduce, Broadcast, Alltoall, and Allgather
    - They benefit from XeLinks for large messages
  - Support for GPU RDMA (point to point and RMA)
  - Optimizations for RMA for GPU buffers
  - Optimizations leverage the underlying hardware features
  - Support for using a tile as a device or two tiles (in a single GPU) as a device

### MPICH for Aurora Update - II

- Contributions to Collectives
  - Topology (dragonfly) aware collectives for Broadcast, Allreduce, and Reduce
  - Hardware-offload Collectives for Broadcast, Allreduce, and Barrier
    - Based on triggered operations
    - Switch-based
  - Collectives with GPU buffers (in previous slide)
  - **Past Contributions** 
    - High-radix algorithms for large-scale systems
    - Multi-leader algorithms to leverage multiple NICs
    - Non-blocking algorithms
    - Intra-node collectives that leverage shared memory in the node
- Validation and bug fixes
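To illustrate why high-radix algorithms help at large scale, here is a minimal sketch of a generic k-nomial tree, the textbook generalization of the binomial tree used for broadcast and reduce. It is not taken from the MPICH sources, and the function names `knomial_rounds` and `knomial_parent` are hypothetical. Ranks are relative to the root (the root is rank 0).

```cpp
#include <cassert>
#include <cstdint>

// Number of communication rounds for a k-nomial tree over `nranks` ranks:
// the smallest r such that radix^r >= nranks.
int knomial_rounds(std::int64_t nranks, std::int64_t radix) {
    assert(nranks >= 1 && radix >= 2);
    int rounds = 0;
    std::int64_t reach = 1;  // ranks reachable after `rounds` rounds
    while (reach < nranks) {
        reach *= radix;
        ++rounds;
    }
    return rounds;
}

// Parent of a (root-relative) rank: clear the least-significant nonzero
// digit of the rank written in base `radix`. Returns -1 for the root.
std::int64_t knomial_parent(std::int64_t rank, std::int64_t radix) {
    assert(rank >= 0 && radix >= 2);
    if (rank == 0) return -1;
    std::int64_t weight = 1;  // place value of the digit being inspected
    while ((rank / weight) % radix == 0) weight *= radix;
    const std::int64_t digit = (rank / weight) % radix;
    return rank - digit * weight;
}
```

With 1024 ranks, radix 2 needs 10 rounds, radix 4 needs 5, and radix 32 needs 2. The trade-off is that each node sends up to `radix - 1` messages per round, so a higher radix pays off when the network can overlap those sends.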

#### Intel<sup>®</sup> SHMEM

- Device-initiated OpenSHMEM operations on Intel GPUs with SYCL
- Supports OpenSHMEM 1.5 style Remote Memory Access (RMA), Atomics, Collectives, Synchronization and Ordering operations
- Includes work-group and sub-group extensions for co-operative thread execution
- GPU RDMA enabled Sandia OpenSHMEM as host back-end through CPU proxy
- Fast synchronization algorithm between CPU and GPU threads
- >= 90% scale-up performance efficiency with a 2D stencil kernel
- First open-source release is expected by next week: https://github.com/oneapi-src/ishmem
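The slides do not show the 2D stencil kernel used for the efficiency figure. As a point of reference only, the sketch below is a generic host-side 5-point Jacobi sweep in plain C++. It is not Intel SHMEM code, and the function name `stencil_sweep` and the row-major layout are assumptions made for illustration. In a distributed version, each processing element would own a sub-grid and exchange its boundary rows with neighbors between sweeps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep of a 5-point stencil over a row-major grid with `nx`
// columns and `ny` rows. Boundary cells are copied unchanged; each interior
// cell becomes the average of its four neighbors in the input grid.
std::vector<double> stencil_sweep(const std::vector<double>& in,
                                  std::size_t nx, std::size_t ny) {
    assert(in.size() == nx * ny);
    std::vector<double> out(in);
    for (std::size_t i = 1; i + 1 < ny; ++i) {
        for (std::size_t j = 1; j + 1 < nx; ++j) {
            out[i * nx + j] = 0.25 * (in[(i - 1) * nx + j] +
                                      in[(i + 1) * nx + j] +
                                      in[i * nx + j - 1] +
                                      in[i * nx + j + 1]);
        }
    }
    return out;
}
```

Each interior update reads only nearest neighbors, which is why the pattern maps naturally onto one-sided puts of halo rows followed by a synchronization step.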

#### Legal Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.


