The "breathing k-means" algorithm with datasets and example notebooks

Last update: Nov 17, 2022

Overview

The Breathing K-Means Algorithm (with examples)

The Breathing K-Means is an approximation algorithm for the k-means problem that (on average) is better (higher solution quality) and faster (lower CPU time usage) than k-means++.

Techreport: https://arxiv.org/abs/2006.15666 (submitted for publication)

Typical results for the "Birch" data set (100000 points drawn from a mixture of 100 circular Gaussians). k=100

Can you spot the mistakes? :-)

Installation from pypi

pip install bkmeans

Local installation to run the examples

Clone the repository

git clone https://github.com/gittar/breathing-k-means

Enter the top directory.

cd breathing-k-means

Create the conda environment 'bkm' (or any other name) via

conda env create -n bkm -f environment.yml

Activate the created environment via

conda activate bkm

To run a jupyter notebook with examples, type, e.g.:

jupyter lab notebooks/2D.ipynb

Content

The top level folder contains the following subfolders

data/ - data sets used in the notebooks
notebooks/ - jupyter notebooks with all examples from the technical report
- 2D.ipynb 2D problems executed with helper functions for brevity
- 2D_detail.ipynb 2D problems executed with raw API
- 10+D.ipynb high-dimensional problems based on the data sets from the original k-means++ publication
src/
- bkmeans.py - reference implementation of breathing k-means
misc/
- aux.py - auxiliary functions
- dataset.py - general class to administer and plot data sets
- runfunctions.py - wrapper functions used in the notebook

API

The included class BKMeans is subclassed from scikit-learn's KMeans class and has, therefore, the same API. It can be used as a plug-in replacement for scikit-learn's KMeans.

There is one new parameters which can be ignored (left at default) for normal usage:

m (breathing depth), default: 5

The parameter m can also be used, however, to generate faster ( 1 < m < 5) or better (m>5) solutions. For details see the technical report.

Example 1: running on simple random data set

Code:

import numpy as np
from bkmeans import BKMeans

# generate random data set
X=np.random.rand(1000,2)

# create BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)

Output:

1.1775040547902602

Example 2: comparison with k-means++ (multiple runs)

Code:

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

# random 2D data set
X=np.random.rand(1000,2)

# number of centroids
k=100

for i in range(5):
    # kmeans++
    km = KMeans(n_clusters=k)
    km.fit(X)

    # breathing k-means
    bkm = BKMeans(n_clusters=k)
    bkm.fit(X)

    # relative SSE improvement of bkm over km++
    imp = 1 - bkm.inertia_/km.inertia_
    print(f"SSE improvement over k-means++: {imp:.2%}")

Output:

SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%

Acknowledgements

Kudos go the scikit-learn team for their excellent sklearn.cluster.KMeans class, also to the developers and maintainers of the other packages used: numpy, scipy, matplotlib, jupyterlab

The "breathing k-means" algorithm with datasets and example notebooks

Related tags

Overview

The Breathing K-Means Algorithm (with examples)

Installation from pypi

Local installation to run the examples

Content

API

Example 1: running on simple random data set

Example 2: comparison with k-means++ (multiple runs)

Acknowledgements

Owner

Bernd Fritzke

Instance-based label smoothing for improving deep neural networks generalization and calibration

Serve TensorFlow ML models with TF-Serving and then create a Streamlit UI to use them

Mixed Neural Likelihood Estimation for models of decision-making

Framework for evaluating ANNS algorithms on billion scale datasets.

Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr\"om Method (NeurIPS 2021)

PECOS - Prediction for Enormous and Correlated Spaces

An implementation for Neural Architecture Search with Random Labels (CVPR 2021 poster) on Pytorch.

A CNN implementation using only numpy. Supports multidimensional images, stride, etc.

Character Controllers using Motion VAEs

Probabilistic Gradient Boosting Machines

Learning based AI for playing multi-round Koi-Koi hanafuda card games. Have fun.

The easiest tool for extracting radiomics features and training ML models on them.

RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving

Face detection using deep learning.

Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

This repository contains source code for the Situated Interactive Language Grounding (SILG) benchmark

Raster Vision is an open source Python framework for building computer vision models on satellite, aerial, and other large imagery sets

PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners

Code for the paper BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks

A Model for Natural Language Attack on Text Classification and Inference