HDBSCAN is a clustering algorithm. It has 5 steps:

Build the Minimum spanning tree using the $d_{m re a c h - k} (a, b)$ Distance. This can be done efficiently using Prims algorithm or Dual Tree Boruvka.
Dropping the edges according to their Distances allows us to build a cluster hierarchy
Remove cluster from Tree below a minimum cluster size $m$ .
Calculate the cluster stability for all clusters in the Tree.
Climb the Tree from the node clusters to the root:
1. If a cluster has higher stability than the sum of its direct children, delete all of the children.
2. Otherwise set it’s stability to be the sum of it’s children
Return the list of leaf clusters that survived as our clustering 🎉

Note: If we selected a maximum cluster distance after step 2 and selected all clusters below it, this would be the implementation of DBSCAN.

$cor e_{k} (x)$ is the core distance to the $k$ nearest neighbour from point $x$ . $x$ does not need to be in our data.
$d_{m re a c h - k} (a, b) := ma x {cor e_{k} (a), cor e_{k} (b), d (a, b)}$ is the mutual reachability distance.
$m$ : The minimum cluster size, used for pruning the cluster tree.
$\sum_{p \in C} (λ_{p} - λ_{bi r t h})$ is the stability of a cluster of points $C$ .
$λ_{p}$ is the inverse of the Distance when a point “fell of” a cluster
$λ_{bi r t h}$ is the inverse Distance of a cluster

The description of $λ$ as the inverse of a Distance needs to be flushed out: it is unclear which distance is being used. Help me ❓

Tip

We can think of the steps above as

Transforming the space according to the density/sparsity.

Building the minimum Spanning tree of the Distance weighted graph.

Constructing a cluster hierarchy of connected components.

Condensing the cluster hierarchy based on minimum cluster size.

Extracting the stable clusters from the condensed Tree.

External resources

readthedocs from Python’s hdbscan package

🌊 CollapsedWave

Explorer

HDBSCAN

External resources

Backlinks

Graph View