Enhancing MPI with modern networking mechanisms in cluster interconnects
Advances in CPU and networking technologies make it appealing to aggregate commodity
compute nodes into ultra-scale clusters. However, the achievable performance depends heavily
on how tightly the cluster components are integrated. The ever-increasing size of
clusters and of the applications running on them dramatically changes the requirements,
which include at least scalable resource management, fault-tolerant process control, scalable
collective communication, and high-performance, scalable parallel I/O.
The Message Passing Interface (MPI) is the de facto standard for developing parallel
applications. Many research efforts are actively studying how to extract the best performance
from the underlying systems and deliver it to end applications. In this dissertation,
we exploit various modern networking mechanisms of contemporary interconnects and
integrate them into MPI implementations to enhance their performance and scalability. In
particular, we have leveraged the novel features available in InfiniBand, Quadrics, and
Myrinet to provide scalable startup, adaptive connection management, scalable collective
operations, and high-performance parallel I/O. We have also designed a parallel Checkpoint/Restart
framework to provide transparent fault tolerance to parallel applications.
Through this dissertation, we have demonstrated that modern networking mechanisms
can be integrated into communication and I/O subsystems to enhance the scalability, performance,
and reliability of MPI implementations. Some of the research results have been
incorporated into production MPI software releases such as MVAPICH/MVAPICH2 and
LA-MPI. This dissertation shows where and how to enhance the
design of parallel communication subsystems to meet the current and upcoming requirements
of large-scale clusters and of high-end computing environments in general.
School: The Ohio State University
School Location: USA - Ohio
Source Type: Master's Thesis
Keywords: parallel processing electronic computers programming computer science algorithms networks infiniband standard
Date of Publication: