Skip to main content
Article
Fast Streaming k-Means Clustering with Coreset Caching
IEEE Transactions on Knowledge and Data Engineering
  • Yu Zhang, Iowa State University
  • Kanat Tangwongsan, Mahidol University International College
  • Srikanta Tirthapura, Iowa State University
Document Type
Article
Publication Version
Accepted Manuscript
Publication Date
8-24-2020
DOI
10.1109/TKDE.2020.3018744
Abstract

We present new algorithms for k-means clustering on a data stream with a focus on providing fast responses to clustering queries. Compared to the state-of-the-art, our algorithms provide substantial improvements in the query time for cluster centers while retaining the desirable properties of provably small approximation error and low space usage. Our proposed clustering algorithms systematically reuse the "coresets" (summaries of data) computed for recent queries in answering the current clustering query, a novel technique which we refer to as coreset caching. We also present an algorithm called OnlineCC that integrates the coreset caching idea with a simple sequential streaming k-means algorithm. In practice, OnlineCC algorithm can provide constant query time. We present both theoretical analysis and detailed experiments demonstrating the correctness, accuracy, and efficiency of all our proposed clustering algorithms.

Comments

This is a manuscript of an article published as Zhang, Yu, Kanat Tangwongsan, and Srikanta Tirthapura. "Fast Streaming k-Means Clustering with Coreset Caching." IEEE Transactions on Knowledge and Data Engineering (2020). DOI: 10.1109/TKDE.2020.3018744. Posted with permission.

Rights
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Copyright Owner
IEEE
Language
en
File Format
application/pdf
Citation Information
Yu Zhang, Kanat Tangwongsan and Srikanta Tirthapura. "Fast Streaming k-Means Clustering with Coreset Caching" IEEE Transactions on Knowledge and Data Engineering (2020)
Available at: http://works.bepress.com/srikanta-tirthapura/54/