On Collective Communication and Notified Read in the Global Address Space Programming Interface (GASPI)

2017-03-28 | thesis; doctoral thesis. A publication of Göttingen

Jump to: Cite & Linked | Documents & Media | Details | Version history

Cite this publication

​On Collective Communication and Notified Read in the Global Address Space Programming Interface (GASPI)​
End, V. ​ (2017)

Documents & Media


GRO License GRO License


End, Vanessa 
In high performance computing (HPC) applications, scientific or engineering problems are solved in a highly parallel and often necessarily distributed manner. The distribution of work leads to the distribution of data and thus also to communication between the participants of the computation. The application programmer has many different communication libraries and application programming interfaces (APIs) to choose from, one of the most recent libraries being the Global Address Space Programming Interface (GASPI). This library takes advantage of the hardware and especially interconnect developments of the past decade, enabling true remote direct memory access (RDMA) between nodes of a cluster. The one-sided, asynchronous semantic of GASPI routines opens multiple research questions with respect to the implementation of collective communication routines, i.e., routines, where a group of processors is involved in the communication. The GASPI specification itself only offers two of these collective operations: the allreduce, computing a global result from the data of all participants, and the barrier, constituting a synchronization point for all members of the group. For these collective routines, appropriate underlying algorithms have to be chosen. In the scope of the one-sided, asynchronous and split-phase semantic of GASPI collective routines, algorithms used in other wide-spread communication libraries like the Message-Passing Interface (MPI) may not be fitting any more. In this thesis, existing algorithms have been reevaluated for their usability in GASPI collective routines in the context of a newly designed library GASPI_COLL, amending the existing GASPI implementation GPI2 with additional algorithms for the allreduce and with further collective routines: reduce and broadcast. For the split-phase allreduce, algorithms with a butterfly-like communication scheme have been extensively tested and found to be very suited due to their low number of communication rounds and involvement of all participants in each communication round. This ensures few repeated calls to the allreduce routine and also very small idling times for all nodes. One of the most wide-spread algorithms for barrier operations, the dissemination algorithm, has been adapted to be usable for the allreduce operation as well. The adapted n-way algorithm shows very good results compared to the native implementation of the GPI2 allreduce and different MPI implementations. To make the one-sided communication semantic of GASPI manageable for the application programmer, the GASPI specification introduces weak-synchronization primitives, notifying the destination side of the arrival of data. This notification mechanism prevents the necessity of global synchronization points or the waiting on multiple communication requests. This notification mechanism has previously only been available for write-based operations but has been extended to the read routine in the scope of this thesis, introducing gaspi_read_notify. With this new routine, the thesis establishes the basis of a completely one-sided, asynchronous graph exploration, implemented with the notified read operation. This enables a broader audience to use data analytical methods on big data. Big data poses a real challenge for graph analytical methods, because the data needs to be distributed on multiple nodes, introducing high communication overhead if two sides are involved in the communication. This issue is eliminated through gaspi_read_notify. Last but not least, the potential usage of gaspi_read_notify for a distributed matrix transpose was investigated. Not only is a matrix transpose a wide-spread communication scheme in HPC applications, it can also be considered as a special case of an alltoall communication. The split-phase, one-sided paradigm of GASPI collective routines, has inspired the idea of a partially evaluable alltoallv and as a first step towards this routine, the applicability of gaspi_read_notify for the implementation of the alltoall can be deduced from the matrix transpose. On the available systems, this kind of implementation can not be encouraged though. Yet, the experiments in this thesis have also shown the high dependence of communication routines and algorithms on the underlying hardware. Thus, extensive tests on different system architectures will have to be done in the future.
Issue Date
Gesellschaft für wissenschaftliche Datenverarbeitung