Define the distortion function as:
$$ J(c,u)=\sum_{i=1}^{m}\left\| x^{(i)}-u_{c^{(i)}} \right\|^{2} $$
J is the sum of the squared distances between each vector and its corresponding centroid. The aim of the K-means algorithm is to minimize J, and there are two ways to decrease it (sketched in the code below):
- Fix the centroids and reassign each vector to its nearest centroid.
- Fix the assignments and move each centroid to the mean of its cluster.
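A minimal sketch of these two alternating steps, assuming `X` is an `(m, d)` array of vectors and `centroids` a `(k, d)` array (both names are hypothetical, not from the assignment code):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: assign vectors, then recompute the means."""
    # Step 1: fix the centroids, assign each vector to its nearest one.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: fix the assignments, move each centroid to its cluster mean
    # (assumes no cluster becomes empty).
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids
```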
As J approaches its minimum, the centroids move less and less between iterations. Thus, we can set a tolerance variable to judge the similarity of two consecutive sets of centroids: when the distance between the centroids of successive iterations is less than or equal to the tolerance, the clustering has converged. Following the sklearn library's default, we set the tolerance to 0.0001.
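A sketch of this stopping rule, reusing the `kmeans_step` above (note that sklearn internally scales its `tol` by the data variance, so this is only an approximation of its behaviour, not sklearn's code):

```python
tolerance = 0.0001          # matches sklearn's default tol
centroids = initial_centroids  # hypothetical initial (k, d) array

for _ in range(300):        # iteration cap as a safety net
    labels, new_centroids = kmeans_step(X, centroids)
    # Converged when the centroids barely move between iterations.
    shift = np.linalg.norm(new_centroids - centroids)
    centroids = new_centroids
    if shift <= tolerance:
        break
```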
Although J is not convex, so we may converge only to a locally optimal solution, the fact that each step decreases J shows that the ordinary K-means algorithm converges to some minimum.
However, in Assignment 2 we need to modify the original K-means algorithm so that each centroid is the instance closest to the cluster mean. After several experiments, I believe the tolerance method cannot be used with instance centroids.
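A sketch of that modification, again using the hypothetical names above: after computing each cluster mean, snap the centroid to the actual instance nearest that mean.

```python
def instance_centroids(X, labels, k):
    """Replace each cluster mean with the cluster member closest to it."""
    centroids = []
    for c in range(k):
        members = X[labels == c]
        mean = members.mean(axis=0)
        # Pick the real instance nearest the mean as the new centroid.
        nearest = members[np.linalg.norm(members - mean, axis=1).argmin()]
        centroids.append(nearest)
    return np.array(centroids)
```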
For each vector, I compute its distance to every other vector, then sort the distances and print the five smallest:
```python
import numpy as np

# Pairwise Euclidean distances between all 408 review vectors.
distance_matrix = np.zeros((408, 408))
for i in range(len(reviews)):
    for j in range(408):
        distance_matrix[i][j] = np.linalg.norm(reviews[i] - reviews[j])
        # Report pairs of distinct vectors with zero distance (duplicates).
        if distance_matrix[i][j] == 0 and i != j:
            print(i, j)

# Sort each row and print the five smallest non-self distances
# (index 0 is the zero distance from a vector to itself).
for i in range(408):
    distance_matrix[i].sort()
    print(i, distance_matrix[i][1:6])
```
The output is the pairs of vectors with zero distance, together with each row of the sorted 408×408 distance matrix; part of it looks like:
```
1 | 51 53 |
```
Because the centroids are now restricted to the data vectors themselves, and we choose the vector nearest to the cluster mean as the centroid, neither of the two ways of minimizing J described above works as before.
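One convergence check that does still work under this modification, offered here as my own suggestion rather than part of the assignment, is to stop when the chosen instances no longer change, since the centroids now come from a finite set:

```python
centroids = X[:k]  # hypothetical initial choice of k instances
prev_centroids = None

for _ in range(300):
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    centroids = instance_centroids(X, labels, k)
    # Centroids are drawn from a finite set of instances, so exact
    # equality (rather than a distance tolerance) detects convergence.
    if prev_centroids is not None and np.array_equal(centroids, prev_centroids):
        break
    prev_centroids = centroids
```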