HDBSCAN is a clustering algorithm. It has 5 steps:
- Build the Minimum spanning tree using the Distance. This can be done efficiently using Prims algorithm or Dual Tree Boruvka.
- Dropping the edges according to their Distances allows us to build a cluster hierarchy
- Remove cluster from Tree below a minimum cluster size .
- Calculate the cluster stability for all clusters in the Tree.
- Climb the Tree from the node clusters to the root:
- If a cluster has higher stability than the sum of its direct children, delete all of the children.
- Otherwise set it’s stability to be the sum of it’s children
- Return the list of leaf clusters that survived as our clustering 🎉
Note: If we selected a maximum cluster distance after step 2 and selected all clusters below it, this would be the implementation of DBSCAN.
- is the core distance to the nearest neighbour from point . does not need to be in our data.
- is the mutual reachability distance.
- : The minimum cluster size, used for pruning the cluster tree.
- is the stability of a cluster of points .
- is the inverse of the Distance when a point “fell of” a cluster
- is the inverse Distance of a cluster
The description of as the inverse of a Distance needs to be flushed out: it is unclear which distance is being used. Help me ❓
Tip
We can think of the steps above as
- Transforming the space according to the density/sparsity.
- Building the minimum Spanning tree of the Distance weighted graph.
- Constructing a cluster hierarchy of connected components.
- Condensing the cluster hierarchy based on minimum cluster size.
- Extracting the stable clusters from the condensed Tree.