PPoPP 2008 START Conference Manager    

Experience on Optimizing Irregular Computation in a Manycore Architecture--IBM Cyclops64 (poster presentation)

Guangming Tan, Andrew Russo and Guang Gao

The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2008)
Salt Lake City, Utah, February 20-23, 2008


This paper represents the mapping of a SSCA2 benchmark - the betweenness centrality algorithm} for scale free sparse graph on an emerging class of manycore architectures -- pioneered IBM Cyclops64 architectures. Owing to little locality, A distinct feature of such a benchmark is the challenges of handling highly irregular memory accesses due to the traversal of dynamic data structures, while the operations performed on such data structures may contain ample opportunities of fine-grain parallelism. One design objective of the SSCA2 algorithm is to capture such features for a large class of irregular computation problems. Main contributions of this paper include. First, the locality challenges are resolved by percolation - where a pointer data structure in off-chip memory is first fetched into on-chip memory and transformed into a linear array during this process. While some helper threads are performing such percolation other threads are kept busy for operating on data already percolated earlier onto the on-chip memory. Implementation of our percolation take advantages of the distinct features of modern manycore chip architectures: there are likely to have ample thread unit resource that can be exploited as {\em helper} threads. Second, we have applied lock optimization methods to reduce synchronization overhead - both in terms of reducing the number of locks used, as well as reduced the cost of each individual lock operations by taking advantage of an architecture feature called synchronization state buffer. Finally, we have implemented our methods on the IBM Cyclops64 platforms. The experimental results on SSCA2 have demonstrated the advantage of our methods: 1). achieved considerable speedups on large number of cores. 2). obtain running time reduction by 2-3 times on small number of cores. Our work shows that creating locality just in time by percolation is efficient for irregular computation in a manycore with explicit memory hierarchy.

START Conference Manager (V2.54.5)