Project Team Members:
Northwestern University
Parallel Data Clustering Algorithms
Description:
Clustering is a data mining technique that groups data into meaningful subclasses, known as clusters, such that it minimizes the intra-differences and maximizes inter-differences of these subclasses. Well-known algorithms include K-means, K-medoids, BIRCH, DBSCAN, OPTICS, STING, and WaveCluster. These algorithms have been used in various scientific areas such as satellite image segmentation, noise filtering and outlier detection, unsupervised document clustering, and clustering of bioinformatics data. Existing data clustering algorithms have been roughly categorized into four classes: partitioning-based, hierarchy-based, grid-based, and density-based. We provide parallel implementations for three clustering algorithms, OPTICS, DBSCAN, and single-linkage hierarchical agglomerative clustering.
OPTICS
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order.
We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve a high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
DBSCAN
DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a density based clustering algorithm. The key idea of the DBSCAN algorithm is that for each data point in a cluster, the neighborhood within a given radius has to contain at least a minimum number of points, i.e. the density of the neighborhood has to exceed some threshold.
Even though DBSCAN is a well-known and popular for its capability of discovering arbitrary shaped clusters and eliminating noise data, parallelization of DBSCAN is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.
PINK
Hierarchical clustering, meanwhile, is the problem of discovering the large-scale cluster structure of a dataset by forming a dendrogram that captures a full range of clustering behavior in the dataset, from the most general cluster that encompasses the entire dataset, to the most stringent clusters that only include a single data point each. Hierarchical clustering offers several advantages over other clustering algorithms in that the number of clusters does not need to be specified in advance and the structure of the resulting dendrogram can offer insight into the larger structure of the data, e.g., establishing a phylogenetic tree among a set of species.
Hierarchical clustering is a challenging problem to parallelize, as each level in the resulting dendrogram relies on all of the earlier levels to make sense. Moreover, many existing parallel algorithms for hierarchical clustering store the distance matrix for a dataset explicitly, limiting the size of the problem that can be tackled due to memory constraints. We present two solutions for parallel single linkage hierarchical clustering, PINK (parallel single linkage) and SHRINK (shared memory single linkage). As neither one explicitly stores a distance matrix, they can be applied to much larger problem sizes. The overall strategy for the two algorithms is to divide a large hierarchical clustering problem instance into a set of smaller sub-problems, calculate the hierarchical clustering dendrogram for each of these sub-problems, and reconstruct the solution for the original dataset by combining the solutions to the sub-problems. While SHRINK utilizes an overlapping problem decomposition and relies on OpenMP, PINK uses MPI and has a disjoint problem decomposition, resulting in higher performance. We evaluated PINK empirically on real and synthetic datasets with up to 5.2 million data points and found that it achieves an estimated speedup of up to 6596 on 6050 processes. SHRINK, meanwhile, provides a speedup of 18-20 on 36 cores on both real and synthetic datasets of up to 250,000 points.
Publications:
- Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary, "Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'13), pp. 49:1-49:12, 2013. PDF
- Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary, "A New Scalable Parallel DBSCAN Algorithm Using the Disjoint Set Data Structure", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'12), pp.62:1-62:11, 2012.PDF
- William Hendrix, Md. Mostofa Ali Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. "Parallel Hierarchical Clustering on Shared Memory Platforms." 19th International Conference on High Performance Computing (HiPC). 2012. PDF
- William Hendrix, Diana Palsetia, Md. Mostofa Ali Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. "A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures." IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV 2013). 2013. PDF
Software Download:
- Parallel OPTICS (OpenMP and MPI) - OPTICS-v1.0.0
- Parallel DBSCAN (OpenMP and MPI) - DBSCAN-v1.0.0
- Parallel hierarchical clustering (PINK and SHRINK) - PINK-v1.0.0
- Sample dataset - clus50k.bin
Acknowledgements:
This work is supported in part by NSF award numbers CCF- 0621443, OCI-0724599, CCF-0833131, CNS-0830927, IIS- 0905205, OCI-0956311, CCF-0938000, CCF-1043085, CCF- 1029166, and OCI-1144061, and in part by DOE grants DE-FG02-08ER25848, DE-SC0001283, DE-SC0005309, DESC0005340, and DE-SC0007456. This research used Hopper Cray XE6 computer of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.Contact:
Most files in the package are self explanatory, although with limited comments. In case you have questions or if you would like to give suggestions or contribute software, please contact us:Mostofa Patwary (OPTICS)
Mostofa Patwary (DBSCAN)
William Hendrix (PINK)