Sponsor:
Advanced Scientific Computing Research (ASCR), U.S. Department of Energy Office of Science
Project Team Members:
Northwestern University
North Carolina State University
Lawrence Berkeley National Laboratory
Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems
Introduction
Recent DOE workshops on exascale computing have articulated emerging trends in data, hardware, and energy that call for next-generation algorithms and software libraries for data analysis and mining. They recognized that the widening gap between the opportunities created by these trends and current data analytics capabilities will soon become a major bottleneck on the path to exascale. The first trend concerns data: scientific data are growing not only in size but also in complexity, and the demand for more sophisticated analyses is growing with them. The execution of many data analysis algorithms is dominated by a small number of kernels. Our strategy is therefore to provide a generic, highly optimized set of core (kernel) analytics functions from which a broad range of high-performance analytical pipelines can be assembled. Our vision is to develop a comprehensive library of such exascale data analysis and mining kernels. In the long term, this approach will bring the development of analytics algorithms to the next level; its impact may be akin to that of the ScaLAPACK linear algebra library in scientific computing. The second trend concerns hardware: the architectures of emerging HPC systems are becoming inherently heterogeneous. Our specific goal is therefore to design algorithms for data analysis kernels that are accelerated on hybrid multi-node, multi-core HPC architectures comprising a mix of GPUs, FPGAs, and SSDs, and to develop scalable implementations of them. The third trend concerns energy: performance-energy tradeoffs are becoming an essential part of the equation. This drives the proposed advances in our performance-energy tradeoff analysis framework, which would allow our data analysis kernels, algorithms, and software to be parameterized so that users can choose the right power-performance optimizations.
The end product of this proposal is a library of functions and software that accelerates data analytics, mining, and knowledge discovery for large-scale scientific applications, thereby increasing the productivity of both scientists and systems. The developed software will be released as open source for the benefit of the community at large. Moreover, the students and postdoctoral associates trained on this project will be well prepared for positions at DOE labs.
To achieve our overarching goals, the specific objectives of the proposal are as follows:
- Design and develop data mining kernels and algorithms for acceleration on hybrid architectures which include many-core systems, GPUs, and other accelerators.
- Design and develop approximate scalable algorithms for data mining and analysis kernels enabling faster exploration, more efficient resource usage, reduced memory footprint, and more power efficient computations.
- Design and develop scalable, out-of-core algorithms and software for analytics that exploit SSDs to enable exploration of massive amounts of data as well as large-scale in-situ analytics and mining on nodes.
- Design and develop index-based data analysis and mining kernels and algorithms for performance and power optimizations including (a) selective data mining kernels with FastBit and (b) index-based perturbation analysis kernels for noisy and uncertain data.
- Design and develop alternative and parameterized kernels and algorithms that facilitate trade-offs in performance, resource usage, and energy efficiency. For this purpose, we will build upon our Energy-Resource-Efficiency (ERE) Framework.
- Demonstrate the results of our project by enabling analytics at scale for selected applications (some of them described in the next section) on large-scale HPC systems.
- Provide the results, algorithms, and software libraries in the public domain.
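To illustrate the idea behind the index-based, selective kernels named in the objectives above, the following is a minimal sketch of a binned bitmap index used to restrict an analysis kernel to the rows that match a range condition. It is not the FastBit API and not the project's implementation: the function names (`build_bitmap_index`, `select_rows`, `selective_mean`), the sample data, and the bin edges are all hypothetical, and Python integers stand in for compressed bitmaps.

```python
from bisect import bisect_right


def build_bitmap_index(values, edges):
    """Build one bitmap per bin; bit r of bitmaps[i] is set when
    edges[i] <= values[r] < edges[i + 1]. Values outside the edges are skipped."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        b = bisect_right(edges, v) - 1
        if 0 <= b < len(bitmaps):
            bitmaps[b] |= 1 << row
    return bitmaps


def select_rows(bitmaps, first_bin, last_bin):
    """Answer a bin-aligned range query by OR-ing the bitmaps of
    bins first_bin..last_bin (inclusive) and listing the set bits."""
    mask = 0
    for bm in bitmaps[first_bin:last_bin + 1]:
        mask |= bm
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]


def selective_mean(values, rows):
    """Toy analysis kernel: mean over the selected rows only."""
    return sum(values[r] for r in rows) / len(rows) if rows else None


# Hypothetical data: one value per record, binned in steps of 2.0.
energy = [0.5, 3.2, 7.9, 1.1, 9.4, 6.0]
edges = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

index = build_bitmap_index(energy, edges)
rows = select_rows(index, 3, 4)          # bins covering 6.0 <= v < 10.0
hot_mean = selective_mean(energy, rows)  # kernel touches only matching rows
```

The point of the sketch is that the query is answered from the index alone, so the kernel reads only the selected records rather than scanning the full dataset; a production index would use compressed bitmaps and handle queries that do not align with bin boundaries.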
Publications
- Prabhat Kumar, Berkin Ozisikyilmaz, Wei-keng Liao, Gokhan Memik, and Alok Choudhary. High Performance Data Mining Using R on Heterogeneous Platforms. In Workshop on Multithreaded Architectures and Applications, in conjunction with the International Parallel and Distributed Processing Symposium, May 2011.
- Jerry Chou, Kesheng Wu, Oliver Rubel, Mark Howison, Ji Qiang, Prabhat, Brian Austin, E. Wes Bethel, Rob D. Ryne, and Arie Shoshani. Parallel Index and Query for Large Scale Data Analysis. In Proceedings of SuperComputing Conference 2011.
- Jerry Chou, Kesheng Wu, and Prabhat. FastQuery: A General Indexing and Querying System for Scientific Data. In SSDBM, pp. 573-574, 2011.
- Sriram Lakshminarasimhan, Neil Shah, Stephane Ethier, Scott Klasky, Rob Latham, Rob Ross, and Nagiza F. Samatova. Compressing the Incompressible with ISABELA: In-situ Reduction of Spatio-temporal Data. In Euro-Par 2011, Volume 6852, pp. 366-379.
- Sriram Lakshminarasimhan, John Jenkins, Isha Arkatkar, Zhenhuan Gong, Hemanth Kolla, Seung-Hoe Ku, Stephane Ethier, Jackie Chen, C. S. Chang, Scott Klasky, Robert Latham, Robert Ross, and Nagiza F. Samatova. ISABELA-QA: Query-driven Analytics with ISABELA-compressed Extreme-scale Scientific Data. In SC '11.