# K-means performance test on D-thinker

The k-means clustering algorithm chooses K points as seeds, then computes the distance from each of the n data objects in the data set to each of the K seed points in order to divide the n data objects into K clusters. Data objects in the same cluster should meet the following conditions:

Similarity between data objects is high within a cluster and low across clusters. That is, a data object's distance to the seed point of its own cluster is shorter than its distance to any other seed point.

Cluster similarity is measured against a 'center object', which is obtained by computing the mean of the data objects in each cluster.
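The two conditions can be written compactly as follows. The notation is an assumption introduced here for illustration: $x_1, \dots, x_n$ are the data objects, $S_1, \dots, S_K$ are the clusters, $c_1, \dots, c_K$ are their seed points, and $d$ is whichever distance formula is chosen (this document does not fix one).

$$
x_i \in S_j \iff d(x_i, c_j) \le d(x_i, c_l) \quad \text{for all } l = 1, \dots, K
$$

$$
c_j = \frac{1}{|S_j|} \sum_{x \in S_j} x
$$

The first line is the nearest-seed condition; the second defines the 'center object' of cluster $S_j$ as the mean of its members.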

The basic idea of k-means: choose K points as seed points, then group each data object with the seed point closest to it to form clusters. The cluster centers are then updated iteratively until the clustering result stops improving.

Assume the sample set is to be divided into c clusters. The algorithm proceeds as follows:

(1) Choose c seed points.

(2) In the k-th iteration, for each sample, compute its distance to every seed point and assign the sample to the cluster whose seed is closest to it.

(3) Use the mean of each cluster to update that cluster's seed point.

(4) If, after executing (2) and (3), the seed points of all c clusters stay the same, the iteration ends; otherwise, continue to iterate.
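The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation measured in this test: the function name `kmeans`, random seed selection, Euclidean distance, and the handling of empty clusters are all choices made here for the example.

```python
import math
import random


def kmeans(points, c, max_iter=20, seed=0):
    """Cluster `points` (a list of equal-length tuples) into `c` clusters.

    Returns the final seed points and the labels from the last
    assignment step.
    """
    rng = random.Random(seed)
    # Step (1): choose c seed points; here, c distinct samples picked
    # at random (an assumption, other strategies are possible).
    seeds = rng.sample(points, c)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step (2): assign each sample to the cluster with the closest
        # seed, using Euclidean distance (an assumption).
        labels = [
            min(range(c), key=lambda j: math.dist(p, seeds[j]))
            for p in points
        ]
        # Step (3): move each seed to the mean of its cluster.
        new_seeds = []
        for j in range(c):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_seeds.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous seed (an assumption).
                new_seeds.append(seeds[j])
        # Step (4): stop when no seed point moved.
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, labels
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)` returns two seed points and one label per input point. `math.dist` requires Python 3.8 or later.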

The advantage of this algorithm is that it is fast and simple. The key is the choice of the initial seeds and of the distance formula.

The following experiments use an iteration count of 20. tiny means 10 points and 2 seeds; small means 512*1024 points and 20 seeds; large means 204800000 points and 1000 seeds.
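To give a feel for how the three modes scale, the sketch below counts point-to-seed distance computations. It assumes a straightforward assignment step that compares every point with every seed, and that all 20 iterations run without stopping early; the names `MODES` and `distance_evaluations` are hypothetical and not part of the test program.

```python
ITERATIONS = 20

# (points count, seed count) per mode, as listed above.
MODES = {
    "tiny": (10, 2),
    "small": (512 * 1024, 20),
    "large": (204_800_000, 1000),
}


def distance_evaluations(points, seeds, iterations=ITERATIONS):
    """Point-to-seed distance computations, assuming every iteration runs."""
    return points * seeds * iterations


for mode, (points, seeds) in MODES.items():
    print(f"{mode}: {distance_evaluations(points, seeds):,}")
```

Under these assumptions tiny needs 400 distance computations, small about 2.1e8, and large about 4.1e12.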

tiny mode running time comparison:

small mode running time comparison:

large mode running time comparison: