High performance and network fault tolerant MPI with multi-pathing over InfiniBand
In the last decade or so, the high performance computing community has observed a paradigm shift in how processing elements are interconnected. Combining commercial off-the-shelf components to build supercomputers has provided users with an excellent price-to-performance ratio. At the same time, scientific applications ranging from molecular dynamics to ocean modeling are being designed with the Message Passing Interface (MPI) as the de facto programming model. The insatiable computational requirements of these applications have continuously pushed the scale of these clusters. The increasing scale has aggravated the occurrence of hot-spots in the network and reduced the mean time between failures of different network components. To provide the best performance to scientific applications, it is imperative that MPI libraries be capable of avoiding network hot-spots and be resilient to faults in the network. At the same time, InfiniBand has emerged as a popular interconnect, providing a plethora of modern features with an open standard and high performance. In this dissertation, we focus on designing a communication and network fault tolerance layer for InfiniBand that leverages the presence of multiple paths in the network for hot-spot avoidance and network fault tolerance. Much of the work in this dissertation has been integrated into an open source effort, MVAPICH, a popular implementation of MPI over InfiniBand that is used by a large number of supercomputers worldwide.
School: The Ohio State University
School Location: USA - Ohio
Source Type: Master's Thesis
Keywords: infiniband mpi network fault tolerance hot spot
Date of Publication: 01/01/2007