Scalable algorithms for correlation clustering on large graphs

Date
2023
DOI
Authors
Cordner, Nathan
Version
OA Version
Citation
Abstract
Correlation clustering (CC) is a widely-used clustering paradigm, where objects are represented as graph nodes and clustering is performed based on relationships between objects (positive or negative edges between pairs of nodes). The CC objective is to obtain a graph clustering that minimizes the number of incorrectly assigned edges (negative edges within clusters, and positive edges between clusters). Many of the state-of-the-art algorithms for solving correlation clustering rely on subroutines that cause significant memory and run time bottlenecks when applied to larger graphs. Several algorithms with the best theoretical guarantees for clustering quality need to first solve a relatively large linear program; others perform brute-force searches over sizeable sets, or store large amounts of unnecessary information. Because of these issues, algorithms that run quicker (e.g. in linear time) but have lower quality approximation guarantees have still remained popular. In this thesis we examine three such popular linear time CC algorithms: Pivot, Vote, and LocalSearch. For the general CC problem we show that these algorithms perform well against slower state-of-the-art algorithms; we also develop a lightweight InnerLocalSearch method that runs much faster and delivers nearly the same quality of results as the full LocalSearch. We adapt Pivot, Vote, and LocalSearch for two constrained CC variants (limited cluster sizes, and limited total number of clusters), and show their practicality when compared against slower algorithms with better approximation guarantees. Finally, we give two practical run time improvements for applying CC algorithms to the related consensus clustering problem.
Description
License
Attribution 4.0 International