
The basics of clustering

The ABCs of cluster analysis

· Data mining

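Three things are worth keeping in mind about the groups that clustering produces: 1. They are pragmatic groupings, a kind of functionalism, rather than real joints in nature. 2. They allow limited interchangeability: for the purpose at hand, members of the same cluster can be treated as roughly equivalent. 3. They vary by purpose, data, and algorithm.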

The algorithm defines what it means to be similar and how to measure distance. Many choices are available. For instance:

1) Distance between points. These are connectivity models, which group points by how far apart they are from one another.

These connectivity models have a few limitations. First, they can only find convex clusters, i.e., they can handle shapes like apples but not shapes like bananas. Second, distance models are difficult to apply to gigabytes of data: comparing every pair of points is too slow.
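To make the idea of grouping by point-to-point distance concrete, here is a minimal sketch of agglomerative clustering in plain Python. It assumes Euclidean distance and complete linkage (the distance between two clusters is taken as the largest distance between any pair of their points); neither choice comes from the text above, and the names `complete_linkage` and `points` are made up for illustration.

```python
import math
from itertools import combinations

def complete_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Assumption: cluster-to-cluster distance is the LARGEST point-to-point
    Euclidean distance (complete linkage). Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Every pair of clusters is compared on every pass, which is why
        # this family of models becomes slow on large data sets.
        for a, b in combinations(range(len(clusters)), 2):
            d = max(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: a < b, so index a is unaffected
    return clusters

# Toy data: three points near the origin, two points far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = complete_linkage(points, 2)
print(clusters)
```

On this toy data the three points near the origin end up in one cluster and the two distant points in the other. The nested loops over all pairs illustrate the speed problem: the work grows quickly as points are added.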

2) Distance from a centroid. The central point is defined by a mean vector, and the most common version of this is called k-means. You simply specify how many centroids you want, and the algorithm works out how far each point is from each centroid and assigns it to the nearest one. However, these models can also only find convex clusters (no bananas!), and you need to choose the value of k yourself.
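The assign-then-update loop behind k-means can be sketched in a few lines of plain Python. This is an illustrative version only: the starting centroids are passed in by hand so the result is repeatable (real implementations usually pick them randomly or with a smarter seeding rule), and the names `kmeans`, `points`, and `start` are hypothetical.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """A basic k-means loop; k is the number of starting centroids supplied."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean vector of its points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
start = [(1.0, 1.0), (9.0, 8.0)]  # assumption: hand-picked initial centroids, k = 2
labels, centroids = kmeans(points, start)
print(labels, centroids)
```

Because every point is simply handed to the nearest mean vector, the boundary between two clusters is always a straight line (or a flat plane in more dimensions), which is why this family cannot follow a banana shape.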

3) Density of data. These models try to draw a border based on density, so nonconvex shapes can be recognized. They also deal with outliers easily, because these unusual cases sit in low-density regions and are left outside the borders.
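As a rough illustration of the density idea, the sketch below follows the general scheme of DBSCAN, one well-known density-based algorithm (the text above does not name a specific one). The parameters `eps` (neighborhood radius) and `min_pts` (how many points make a region "dense") and all the names used are assumptions chosen for this toy example.

```python
import math

NOISE = -1  # label for points that sit in low-density regions

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE for outliers."""
    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough to start a cluster
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a dense point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep growing
        cluster_id += 1
    return labels

# An L-shaped (nonconvex) chain, a small separate group, and one outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3),
          (0, 3), (0, 4), (1, 4),
          (10, 10)]
labels = dbscan(points, eps=1.1, min_pts=2)
print(labels)
```

Here the L-shaped chain is kept together as one cluster even though it is not convex, the small group forms a second cluster, and the isolated point is labeled as noise rather than being forced into either cluster.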

4) Distribution models. These models try to wrap a statistical shape, such as an ellipse, around the data, so that each cluster is modeled as a statistical distribution. By far the most common choice is a multivariate normal distribution. But these models can overfit if you give the computer free rein: it will just keep adding complexity until you end up with a model that really only applies to the exact data you currently have. In addition, they can only find convex shapes.
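To show what modeling clusters as distributions can look like, here is a small sketch that fits a mixture of normal distributions with the expectation-maximization (EM) procedure. It is deliberately simplified: the data are one-dimensional rather than multivariate, the number of components is fixed by the caller, and the starting means are chosen by hand. The names `fit_mixture`, `normal_pdf`, and `min_var` are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, means, n_iter=50, min_var=1e-6):
    """Fit a 1-D mixture of normals with EM; one component per starting mean."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: how strongly does each component claim each point?
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Assumption: a variance floor, so a component cannot shrink
            # onto a single data point.
            variances[j] = max(var, min_var)
    return weights, means, variances, resp

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = fit_mixture(data, means=[0.0, 6.0])
# Hard cluster labels: the component with the highest responsibility.
labels = [max(range(len(r)), key=lambda j: r[j]) for r in resp]
print(means, labels)
```

Fixing the number of components and putting a floor on the variance are two simple ways of not giving the computer free rein; without limits like these, a mixture can keep tightening around individual points and describe only the data at hand.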

In conclusion, be mindful of the limitations of each model, and choose the one that fits both your data and your purpose or practical application.