# 1. INTRODUCTION

Rapid development in information technology over the last few decades has led to a growing volume of digital data, which is widely used in fields such as medical science, the military, banking, manufacturing, and logistics [1-6]. Consequently, automatic processing of large amounts of information is essential. Thus far, many information-processing algorithms have been proposed and have brought considerable benefits [7-8]. Among these algorithms, clustering methods are among the most popular techniques and are widely used in data mining, object classification, pattern recognition, machine learning, image analysis, information retrieval, and bioinformatics [9-15]. There are two types of clustering approaches: supervised and unsupervised [16]. For supervised clustering, the category of each sample must be known in advance, whereas this is not required for unsupervised clustering.

However, to the best of our knowledge, the number of cluster members is arbitrary in all existing clustering methods, i.e., it depends on the clustering criterion, such as statistical inference, similarity measurement, or density [17-18]. There is no adjustment of the number of cluster elements. Indeed, such adjustment is unnecessary in most cases. Nevertheless, in some fields such as logistics management, it is necessary to adjust cluster members so that the number of cluster elements satisfies a predefined requirement. For example, when a logistics company plans its logistics centers and requires each center to be responsible for a fixed number of cities, a conventional clustering algorithm cannot achieve this purpose. A clustering method with a fixed number of cluster members, on the other hand, is suitable for this type of requirement.

In this paper, we propose a k-means based clustering method with a fixed number of cluster members. First, the data samples or patterns extracted from objects are grouped into k categories using the k-means clustering algorithm [17-18]. Then, the number of cluster members is adjusted to the required number using a greedy strategy [19]. The adjustment procedure is the key step of the proposed clustering method and is explained in detail in Section 3. In addition, we improve on the work presented in [27]. The main difference between the proposed method and that in [27] is the order in which the original clusters obtained with the k-means algorithm are adjusted. In [27], the cluster with the fewest members is selected for adjustment first at each iteration. In this paper, however, at each iteration we adjust a cluster whose center point lies on the convex hull that contains all of the unadjusted cluster center points. When there are only two clusters, the two methods lead to the same results. However, the proposed method is superior to that in [27] when there are more than two clusters. We demonstrate that our method achieves good results when the number of cluster members is predefined and performs better than the method in [27].

This paper is organized as follows. In Section 2, we describe the conventional k-means clustering algorithm. In Section 3, we describe the procedure for the proposed clustering method with a fixed number of cluster elements. In Section 4, we present the experimental results. The conclusion is presented in Section 5.

# 2. K-MEANS CLUSTERING ALGORITHM

K-means [17-19] is a robust clustering method that has been successfully used in image segmentation, object classification, pattern recognition, data mining, and logistics management [20-24]. Hugo Steinhaus first presented the idea of k-means in 1957 and the standard k-means algorithm was proposed by Stuart Lloyd in 1957. However, the term “k-means” was first used by James MacQueen in 1967 [25-27]. It is one of the simplest yet robust deterministic clustering algorithms. It aims to partition N observations or data samples into k user-defined clusters, where each observation belongs to the cluster with the nearest mean, and this mean is regarded as the prototype of the corresponding cluster [27-28]. The flowchart of the well-known k-means approach developed by Stuart Lloyd is presented in Fig. 1.

**Fig. 1.** Flowchart of the conventional k-means algorithm.

The flowchart in Fig. 1 can be explained in detail as follows. First, the number of clusters k has to be defined. Then, k points or observations are randomly chosen as the centroid points of the k clusters. Each centroid point is viewed as the prototype of one cluster. The final clustering results depend on the choice of the initial k centroid points. In general, it is better to select k centroid points that are far away from each other [29]. Next, every sample point is classified into the nearest cluster, i.e., each point is compared with the k centroid points and assigned to the cluster whose centroid point is at the minimum distance. The distance can be measured using the Minkowski distance, Euclidean distance, or Cityblock distance [18]. When all observations have been assigned to their corresponding clusters, one round of clustering is completed. The next step is to update the centroid point of each cluster. Each centroid point is recalculated from the members assigned to that cluster in the previous clustering round. Each point can then be reassigned to a new centroid point according to the shortest distance criterion. This iteration continues until the centroid points converge.

In fact, the above-mentioned k-means algorithm attempts to minimize the within-cluster sum of squares. In other words, the k-means technique seeks to minimize the following term [27-29]:

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} d^2(x_j, \mu_i) \tag{1}$$

where k is the cluster number, Si is the set of members of the ith cluster, μi is the mean value of the observations in the ith cluster and denotes the prototype of the ith cluster, and d2(xj, μi) is the squared distance between the sample xj and the centroid point μi. Because k-means algorithms are highly dependent on the initial cluster centroids, the minimization of Eq. (1) generally converges only to a local minimum. To achieve a better minimum, the k-means algorithm can be run repeatedly and the result with the smallest within-cluster sum of squares can be chosen. Moreover, the third and fourth steps in Fig. 1 can be expressed as Eq. (2) and Eq. (3), respectively [27-28]:

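$$S_i^{(t)} = \left\{ x_p : d^2\left(x_p, \mu_i^{(t)}\right) \le d^2\left(x_p, \mu_j^{(t)}\right) \ \forall j,\ 1 \le j \le k \right\} \tag{2}$$

$$\mu_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j \tag{3}$$

where (t) denotes the tth iteration and $|S_i^{(t)}|$ represents the number of members in the ith set. Here, each point xp is allotted to exactly one cluster Si, even if xp is equidistant from more than one of the k centroid points.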
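The assignment and update steps of Eq. (2) and Eq. (3), together with the repeated-run strategy for Eq. (1), can be sketched in Python with NumPy. This is a minimal illustration rather than the authors' implementation: the function names `assign`, `wcss`, `kmeans`, and `best_of_restarts`, the tie-breaking by lowest cluster index, and the handling of empty clusters are all assumptions of the sketch.

```python
import numpy as np


def assign(X, centroids):
    """Assignment step, Eq. (2): index of the nearest centroid per sample.
    argmin breaks ties by lowest index, so each point gets exactly one label."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def wcss(X, labels, centroids):
    """Within-cluster sum of squares, the term in Eq. (1)."""
    return float(np.sum((X - centroids[labels]) ** 2))


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style k-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Update step, Eq. (3): mean of the members of each cluster.
        # An empty cluster keeps its previous centroid (sketch assumption).
        new_centroids = centroids.copy()
        for i in range(k):
            if np.any(labels == i):
                new_centroids[i] = X[labels == i].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids have converged
        centroids = new_centroids
    return assign(X, centroids), centroids


def best_of_restarts(X, k, n_restarts=10):
    """Run k-means from several seeds and keep the lowest Eq. (1) value."""
    runs = [kmeans(X, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: wcss(X, run[0], run[1]))
```

Calling `best_of_restarts(X, k)` on an `(N, d)` float array returns the labels and centroids of the run with the smallest within-cluster sum of squares among the restarts, which is still only a local minimum in general.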

The computational time complexity of a general k-means algorithm is linearly proportional to the number of samples, the number of clusters, the number of observation dimensions, and the number of iterations resulting in convergence. One of the drawbacks of the traditional k-means algorithm is that clustering results can be affected by noise or outlier observations. However, this problem can be mitigated or avoided by employing outlier analysis methods such as Random Sample Consensus (RANSAC) [30-31] on the sample data before implementing the k-means algorithm. Thus far, many algorithms based on the k-means approach have been proposed and widely used in the area of image processing, computer vision and data mining [20-23].

# 3. PROCEDURES OF THE PROPOSED CLUSTERING METHOD

Based on the description of the k-means algorithm in Section 2, note that the number of members in each cluster is not fixed but arbitrary after k-means has been applied. However, in some cases, a fixed number of elements in each category is expected. For example, when k-means is used to group N observations into two clusters and each cluster is required to contain N/2 members, the conventional k-means fails because it cannot adjust the cluster elements according to the user's requirement. In this section, we present a k-means based clustering method that adjusts the number of group elements to satisfy a predefined number. The procedure for the proposed clustering method is described as follows:

step 1: Classify N observations into k clusters using the conventional k-means algorithm. Denote the resulting k centroid points as μ1,μ2,...,μk.

step 2: Calculate the center point of the set μ1,μ2,...,μk. The computed center point is expressed as Cμ. Here, μ1,μ2,...,μk are the prototypes of the k clusters. Cμ and μ1,μ2,...,μk are used to obtain the distance di=d(Cμ , μi) between Cμ and each μi (0< i ≤ k).

step 3: Choose the cluster j, from the k clusters, whose centroid has the maximum distance to Cμ, i.e., dj=max(di=d(Cμ , μi)) where 0< i ≤ k. Then, adjust the elements of the jth cluster. Assume that the required numbers of elements for the k clusters are a1, a2,⋯,ak, where a1+a2+⋯ +ak=N. The numbers of elements in the k clusters obtained from the k-means algorithm in step 1 are assumed to be n1, n2 ,⋯, nk. Once the cluster to be adjusted is determined, its current number of elements is denoted by nj (here, the jth cluster is selected). However, the desired number of elements for this cluster is not yet known and has to be chosen from the set (a1, a2,⋯,ak). The greedy strategy is applied to select it: the value in (a1, a2,⋯,ak) that is closest to nj is selected as the desired number of members for the jth cluster, because this value requires the fewest elements to be moved. Assume that the desired number obtained for the jth cluster is aj.

step 4: The current number of elements in the jth cluster is nj and the desired number of elements is aj. In this step, we have to determine whether this cluster needs to recruit new members or discard redundant elements. This decision is made by comparing nj with aj. If nj-aj>0, the jth cluster needs to discard nj-aj elements. If nj-aj<0, aj-nj new members should be recruited into the jth cluster. When nj-aj=0, no adjustment of the cluster elements is needed. The recruitment and discard processes are designed as follows. Both processes follow the greedy strategy because they attempt to keep the within-cluster sum of squares as small as possible while adjusting the cluster elements.

Recruitment process: Recruit aj-nj members from the N-nj observation samples (excluding the nj points that are already in the jth cluster) with the shortest distance criterion. In this paper, the Euclidean distance is employed although other distance measurement methods such as the Minkowski and Cityblock distance can also be adopted. Thus, aj-nj points that are nearest to μj are recruited into the jth cluster from other clusters to satisfy the desired number of cluster members. Here, μj is the centroid point of the jth cluster.

Discard process: The nj-aj elements in the jth cluster are discarded with the same shortest distance criterion. However, it is the nj-aj elements that are nearest to the centroid points of other clusters that are discarded. This may seem counterintuitive, because one may assume that it is better to discard the nj-aj elements that are farthest from μj, the centroid point of the current cluster. Nevertheless, doing so results in incorrect adjustments and considerably increases the within-cluster sum of squares. This is demonstrated in Fig. 2. When one point needs to be discarded from cluster 2, discarding the point that is nearest to the centroid point of the other cluster (point p in Fig. 2(a), nearest to cluster 1) is better than discarding the point that is farthest from the centroid point of the current cluster (point q, farthest from the centroid of cluster 2). Otherwise, the result affects the other cluster, as shown in Fig. 2(c). The discarded point q in Fig. 2(c) is recruited by cluster 1, which results in a poor element distribution because the members of cluster 1 become considerably dispersed and the path to point q passes through the elements of cluster 2. On the other hand, discarding point p from cluster 2, as shown in Fig. 2(b), makes the adjustment reasonable.

**Fig. 2.** Illustration of the discard process. (a) Clustering result with k-means. (b) Adjustment result by discarding point p. (c) Adjustment result by discarding point q.

step 5: Set k=k-1, N=N-aj, remove aj from the set {a1, a2,⋯,ak}, and determine the updated set denoted as {a1, a2,⋯,ak-1}. If k≠1, go to step 1 and continue the adjustment process. Otherwise, terminate the iteration and update all centroid points for the new cluster members with the criterion of minimizing the following term [27-28]:

$$\sum_{i=1}^{k} \sum_{x_j \in S_i} d^2(x_j, \mu_i)$$

where xj is the element in a new set Si of the ith cluster and μi is the updated centroid point of the ith cluster.

The procedure for the proposed clustering algorithm indicates that this method can be implemented recursively, which makes the implementation convenient and efficient. The pseudo code for the k-means based clustering method with fixed cluster members is presented in Fig. 3. In Fig. 3, N=N-aj denotes the number of remaining unadjusted data samples. That is, the recursion runs k-means only on the data that remain after the points in the already adjusted clusters have been taken out.
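Steps 1 to 5 can also be sketched in Python with NumPy. The sketch below is a hedged illustration under several assumptions, not a transcription of Fig. 3: it uses a loop instead of recursion, a compact k-means with a fixed seed, squared Euclidean distance, and the hypothetical names `sq_dist`, `kmeans`, and `fixed_size_clusters`.

```python
import numpy as np


def sq_dist(X, C):
    """Squared Euclidean distances between rows of X and rows of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def kmeans(X, k, n_iter=100, seed=0):
    """Compact Lloyd-style k-means (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = sq_dist(X, centroids).argmin(axis=1)
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return sq_dist(X, centroids).argmin(axis=1), centroids


def fixed_size_clusters(X, sizes, seed=0):
    """Return (clusters, centroids); clusters is a list of index arrays whose
    lengths match `sizes` up to ordering (clusters appear in the order fixed)."""
    X = np.asarray(X, dtype=float)
    sizes = list(sizes)
    assert sum(sizes) == len(X), "required sizes must sum to N"
    remaining = np.arange(len(X))          # indices of unadjusted samples
    clusters = []
    while len(sizes) > 1:
        pts = X[remaining]
        # Step 1: conventional k-means on the unadjusted data.
        labels, centroids = kmeans(pts, len(sizes), seed=seed)
        # Steps 2-3: choose the cluster whose centroid is farthest from C_mu.
        c_mu = centroids.mean(axis=0)
        j = int(((centroids - c_mu) ** 2).sum(axis=1).argmax())
        n_j = int((labels == j).sum())
        # Greedy choice: the required size closest to the current size n_j.
        a_j = min(sizes, key=lambda a: abs(a - n_j))
        d = sq_dist(pts, centroids)
        members = np.flatnonzero(labels == j)
        if n_j > a_j:
            # Discard process: drop the members nearest to another centroid.
            other = np.delete(d[members], j, axis=1).min(axis=1)
            members = members[np.argsort(other)[n_j - a_j:]]
        elif n_j < a_j:
            # Recruitment process: add the non-members nearest to mu_j.
            outside = np.flatnonzero(labels != j)
            nearest = outside[np.argsort(d[outside, j])[:a_j - n_j]]
            members = np.concatenate([members, nearest])
        clusters.append(remaining[members])
        # Step 5: remove the adjusted cluster and its size, then repeat.
        remaining = np.delete(remaining, members)
        sizes.remove(a_j)
    clusters.append(remaining)             # the last cluster takes the rest
    # Final update of all centroid points from the new cluster members.
    centroids = np.array([X[idx].mean(axis=0) for idx in clusters])
    return clusters, centroids
```

For example, `fixed_size_clusters(X, [30, 30])` on 60 samples returns two index arrays of 30 samples each. Because the sketch reruns k-means on the remaining data in every pass, its output depends on the seed and on the k-means initialization.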

**Fig. 3.** Pseudo code for the proposed k-means based clustering method.

# 4. EXPERIMENTAL RESULTS

In this section, a computer simulation is conducted to demonstrate the feasibility of the k-means based clustering method with a fixed number of cluster members. We also compare our results with those of the method in [27], which leads to arbitrary results in some cases. The main difference between the proposed k-means based method and the one in [27] is the manner in which the cluster to be adjusted first is selected in each iteration. For clustering with only two clusters, the two approaches achieve the same clustering results. However, when the number of clusters is more than two, the method in [27] produces problems in some cases. Consequently, the simulation in this paper is executed with two configurations: one with two clusters and the other with three clusters. Fig. 4 shows the clustering results with two clusters. The total number of data samples is 60. The samples are randomly generated from two groups that follow Gaussian distributions, and the conventional k-means algorithm yields clusters with 20 and 40 members. Fig. 4(c) and 4(d) show the clustering results when the two clusters are adjusted to contain 50% and 50% of the total number of observations with the method in [27] and the proposed method, respectively. Fig. 4(e) and 4(f) show the adjustment results when the two clusters are adjusted to contain 20% and 80% with the method in [27] and the proposed method, respectively. For the adjustment of cluster members with two clusters, the method in [27] and the proposed approach achieve the same acceptable performance. To compare the results numerically, the within-cluster sums of squares [see Eq. (1)] of the two methods (the method in [27] and the method in this paper) with two clusters are given in Table 1. Table 1 shows that the within-cluster sums of squares of the two methods are the same in the case of two clusters.

**Fig. 4.** Clustering results with two clusters. (a) Data samples. (b) Clustering result using the conventional k-means algorithm. (c) and (d) Clustering results using the method in [27] and the proposed approach, respectively, when the number of cluster elements is assigned to be 50% and 50% of the total samples. (e) and (f) Clustering results using the method in [27] and the proposed approach, respectively, when the number of cluster elements is assigned to be 20% and 80% of the total samples.

**Table 1.** Within-cluster sum of squares for the method in [27] and the proposed method with two clusters

Fig. 5 presents the clustering results with three clusters. A total of 200 samples are randomly generated and classified with conventional k-means, the clustering approach in [27], and the proposed clustering algorithm. These sample data are generated from three groups that follow Gaussian distributions. The numbers of elements in the three clusters obtained by the conventional k-means algorithm are 86, 48, and 66, as shown in Fig. 5(b). Fig. 5(c) and 5(d) show the clustering results with the method in [27] and the proposed method, respectively, when the numbers of cluster members are adjusted to 33%, 33%, and 34% of the total data samples. Fig. 5(e) and 5(f) show the clustering results with the method in [27] and the proposed method, respectively, when the numbers of cluster elements are adjusted to 20%, 40%, and 40% of the total data samples. The members of cluster 3 in Figs. 5(c) and 5(e) are clearly poorly distributed when the adjustment starts from cluster 1. The proposed method, on the other hand, compensates for this deficiency of [27]. The experimental results indicate that the proposed algorithm achieves better adjustment performance than that in [27] when multiple clusters are considered. In other words, the choice of the first cluster to adjust is very important and affects the final clustering results. Note that the method in [27] can also adjust multiple cluster members well when the data samples are not long and narrow, as shown in the experimental part of [27]. The within-cluster sums of squares [see Eq. (1)] of the two methods (the method in [27] and the method in this paper) with multiple clusters are provided in Table 2. Table 2 shows that the proposed method achieves a much smaller within-cluster sum of squares than the method in [27] in the case of multiple clusters.
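The required cluster sizes in these experiments are stated as percentages of the total number of samples, whereas the adjustment procedure works with integer counts a1, a2,⋯,ak that sum to N. One possible conversion is sketched below. The helper name `sizes_from_percent` and the largest-remainder rounding rule are assumptions of this sketch, since the paper does not state how non-integer shares would be rounded.

```python
def sizes_from_percent(n, percents):
    """Convert integer percentages (summing to 100) into integer cluster
    sizes that sum to n, using largest-remainder rounding (an assumption)."""
    assert sum(percents) == 100, "percentages must sum to 100"
    base = [n * p // 100 for p in percents]       # floor of each share
    remainder = [n * p % 100 for p in percents]   # fractional part, in 1/100
    leftover = n - sum(base)                      # samples still unassigned
    # Give one extra sample to the shares with the largest remainders.
    order = sorted(range(len(percents)), key=lambda i: -remainder[i])
    for i in order[:leftover]:
        base[i] += 1
    return base


print(sizes_from_percent(60, [50, 50]))        # [30, 30]
print(sizes_from_percent(60, [20, 80]))        # [12, 48]
print(sizes_from_percent(200, [33, 33, 34]))   # [66, 66, 68]
print(sizes_from_percent(200, [20, 40, 40]))   # [40, 80, 80]
```

For the sample counts used here (60 and 200), every share is already an integer, so the rounding rule only matters for other values of N.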

**Fig. 5.** Clustering results with three clusters. (a) Data samples. (b) Clustering result using the conventional k-means algorithm. (c) and (d) Clustering results using the method in [27] and the proposed approach, respectively, when the number of cluster elements is assigned to be 33%, 33%, and 34% of the total samples. (e) and (f) Clustering results using the method in [27] and the proposed approach, respectively, when the number of cluster elements is assigned to be 20%, 40%, and 40% of the total samples.

**Table 2.** Within-cluster sum of squares for the method in [27] and the proposed method with three clusters

In this simulation, each point can be viewed as a city, and the centroid point of each cluster can be considered a logistics center. Sometimes, a logistics company needs each logistics center to handle business with a fixed number of cities. The number of cities per center is arbitrary when a simple clustering algorithm such as k-means is applied to all of the cities, so the cluster members need to be adjusted to satisfy the predefined number of cities managed by each logistics center. The proposed method can be applied to choose the cities (cluster members) for each cluster. In addition, the proposed method can be helpful in image segmentation, object classification, and pattern recognition when some prior knowledge, such as the occupation rate of each class of object, is available.

# 5. CONCLUSIONS

In this paper, a k-means based clustering method with a fixed number of cluster members was proposed and demonstrated. Experimental results show that the proposed algorithm works well when the number of cluster elements has to satisfy predefined values. Further, simulation results show that, for a fixed number of cluster members, the proposed clustering method is superior to the conventional k-means and to the previously proposed clustering algorithm. The algorithm is suitable for data sets with two or more clusters. The proposed method is based on the greedy strategy and can be implemented recursively, which makes the algorithm convenient and efficient. Moreover, the proposed method can be applied to other clustering methods and is not limited to the k-means algorithm. It would be useful in image segmentation, object classification, pattern recognition, and logistics management. Finally, outlier analysis approaches can be applied to the data samples before performing the proposed clustering algorithm to make the algorithm robust against noise.