HDBSCAN is a clustering algorithm. It has 5 steps:

  1. Build the Minimum spanning tree using the Distance. This can be done efficiently using Prims algorithm or Dual Tree Boruvka.
  2. Dropping the edges according to their Distances allows us to build a cluster hierarchy
  3. Remove cluster from Tree below a minimum cluster size .
  4. Calculate the cluster stability for all clusters in the Tree.
  5. Climb the Tree from the node clusters to the root:
    1. If a cluster has higher stability than the sum of its direct children, delete all of the children.
    2. Otherwise set it’s stability to be the sum of it’s children
  6. Return the list of leaf clusters that survived as our clustering 🎉

Note: If we selected a maximum cluster distance after step 2 and selected all clusters below it, this would be the implementation of DBSCAN.

  • is the core distance to the nearest neighbour from point . does not need to be in our data.
  • is the mutual reachability distance.
  • : The minimum cluster size, used for pruning the cluster tree.
  • is the stability of a cluster of points .
  • is the inverse of the Distance when a point “fell of” a cluster
  • is the inverse Distance of a cluster

The description of as the inverse of a Distance needs to be flushed out: it is unclear which distance is being used. Help me ❓

Tip

We can think of the steps above as

  1. Transforming the space according to the density/sparsity.
  2. Building the minimum Spanning tree of the Distance weighted graph.
  3. Constructing a cluster hierarchy of connected components.
  4. Condensing the cluster hierarchy based on minimum cluster size.
  5. Extracting the stable clusters from the condensed Tree.

External resources