Available in: K-Means
This option specifies the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters.
To satisfy the custom minimal cluster size, the calculation of clusters is converted to the Minimal Cost Flow problem. Instead of using the Lloyd iteration algorithm, a graph is constructed based on the distances and constraints. The goal is to go iteratively through the input edges and create an optimal spanning tree that satisfies the constraints.
Minimum-cost flow problems can be efficiently solved in polynomial time (or in the worst case, in exponential time). The performance of this implementation of the Constrained K-means algorithm is slow due to many repeatable calculations that cannot be parallelized and more optimized at the H2O backend. For large datasets, the calculation can last hours. For example, a dataset with 100,000 rows and five features can run for several hours.
Expected time with various sized data (OS debian 10.0 (x86-64), processor Intel© Core™ i7-7700HQ CPU @ 2.80GHz × 4, RAM 23.1 GiB):
10 000 rows, 5 features ~ 0h 9m 21s
20 000 rows, 5 features ~ 0h 39m 27s
30 000 rows, 5 features ~ 1h 26m 43s
40 000 rows, 5 features ~ 2h 13m 31s
50 000 rows, 5 features ~ 4h 4m 18s
The sum of constraints is smaller the time is faster - it uses MCF calculation until all constraints are satisfied then use standard K-means.