eva = evalclusters(x,clust,criterion) creates
a clustering evaluation object containing data used to evaluate the
optimal number of data clusters.

eva = evalclusters(x,clust,criterion,Name,Value) creates
a clustering evaluation object using additional options specified
by one or more name-value pair arguments.

Use an input matrix of proposed clustering
solutions to evaluate the optimal number of clusters.

Load the sample data.

load fisheriris;

The data contains length and width measurements from the sepals
and petals of three species of iris flowers.

Use kmeans to create an input matrix
of proposed clustering solutions for the flower measurements in meas,
using 1, 2, 3, 4, 5, and 6 clusters.

clust = zeros(size(meas,1),6);   % one column per proposed solution
for i = 1:6
    clust(:,i) = kmeans(meas,i,'EmptyAction','singleton',...
        'Replicates',5);
end

Each row of clust corresponds to one observation
(one flower) in meas. Each of the six columns corresponds to a clustering
solution containing 1 to 6 clusters.

Evaluate the optimal number of clusters using the Calinski-Harabasz
criterion.
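The evaluation step described above could look like the following minimal sketch. It repeats the setup so that it runs on its own, and it assumes the Statistics and Machine Learning Toolbox is available. Because kmeans uses random starting points, the criterion values can vary between runs, so no output is shown.

```matlab
% Rebuild the matrix of proposed solutions (1 to 6 clusters).
load fisheriris;
clust = zeros(size(meas,1),6);
for i = 1:6
    clust(:,i) = kmeans(meas,i,'EmptyAction','singleton','Replicates',5);
end

% Evaluate the proposed solutions with the Calinski-Harabasz criterion.
eva = evalclusters(meas,clust,'CalinskiHarabasz');

% Inspect the evaluation object.
disp(eva.InspectedK)       % numbers of clusters that were evaluated
disp(eva.CriterionValues)  % one criterion value per proposed solution
disp(eva.OptimalK)         % number of clusters favored by the criterion
```

The criterion value for the one-cluster solution is expected to be undefined (NaN), since the Calinski-Harabasz index needs at least two clusters.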

Input data, specified as an N-by-P matrix. N is
the number of observations, and P is the number
of variables.

Data Types: single | double

clust — Clustering algorithm: 'kmeans' | 'linkage' | 'gmdistribution' | matrix of clustering solutions | function handle

Clustering algorithm, specified as one of the following.

'kmeans'

Cluster the data in x using the kmeans clustering algorithm, with 'EmptyAction' set
to 'singleton' and 'Replicates' set
to 5.

'linkage'

Cluster the data in x using the clusterdata agglomerative clustering algorithm,
with 'Linkage' set to 'ward'.

'gmdistribution'

Cluster the data in x using the gmdistribution Gaussian mixture distribution
algorithm, with 'SharedCov' set to true and 'Replicates' set
to 5.

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin',
or 'silhouette', you can specify a clustering algorithm
as a function handle (@).
The function must be of the form C = clustfun(DATA,K),
where DATA is the data to be clustered, and K is
the number of clusters. The output of clustfun must
be one of the following:

A vector of integers representing the cluster index
for each observation in DATA. There must be K unique
values in this vector.

A numeric n-by-K matrix
of scores for n observations and K classes.
In this case, the cluster index for each observation is determined
by taking the largest score value in each row.
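As an illustration of the function-handle form, the sketch below wraps kmeans so that it returns a vector of cluster indices. The name clustfun is only a placeholder; any function with the signature C = clustfun(DATA,K) should work the same way.

```matlab
load fisheriris;

% Placeholder wrapper: returns one cluster index per row of DATA.
clustfun = @(DATA,K) kmeans(DATA,K,'EmptyAction','singleton','Replicates',5);

% KList is required because clust is a function handle.
eva = evalclusters(meas,clustfun,'CalinskiHarabasz','KList',1:6);
disp(eva.OptimalK)
```

Wrapping the call in an anonymous function is a convenient way to fix clustering options that evalclusters would not otherwise pass through.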

If Criterion is 'CalinskiHarabasz', 'DaviesBouldin',
or 'silhouette', you can also specify clust as
an n-by-K matrix containing the
proposed clustering solutions. n is the number
of observations in the sample data, and K is the
number of proposed clustering solutions. Column j contains
the cluster indices for each of the n points in
the jth clustering solution.

Clustering evaluation criterion, specified as one of the following.

'CalinskiHarabasz'

Create a CalinskiHarabaszEvaluation clustering
evaluation object containing Calinski-Harabasz index values.

'DaviesBouldin'

Create a DaviesBouldinEvaluation cluster
evaluation object containing Davies-Bouldin index values.

'gap'

Create a GapEvaluation cluster evaluation
object containing gap criterion values.

'silhouette'

Create a SilhouetteEvaluation cluster evaluation
object containing silhouette values.
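To see how the criterion argument determines the type of object returned, a short loop such as the following can be used. This is a sketch only: the data set and the KList range are illustrative choices, and the gap criterion can take noticeably longer than the others because it generates reference data sets.

```matlab
load fisheriris;

criteria = {'CalinskiHarabasz','DaviesBouldin','gap','silhouette'};
for c = 1:numel(criteria)
    % Same data and algorithm each time; only the criterion changes.
    eva = evalclusters(meas,'kmeans',criteria{c},'KList',1:4);
    fprintf('%-18s -> %s\n',criteria{c},class(eva));
end
```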

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument
name and Value is the corresponding
value. Name must appear
inside single quotes (' ').
You can specify several name and value pair
arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'KList',[1:5],'Distance','cityblock' specifies
to test 1, 2, 3, 4, and 5 clusters using the sum of absolute differences
distance measure.

List of number of clusters to evaluate, specified as the comma-separated
pair consisting of 'KList' and a vector of positive
integer values. You must specify KList when clust is
a clustering algorithm name string or a function handle. When criterion is 'gap', clust must
be a string or a function handle, and you must specify KList.
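For instance, the name-value pairs from the example above can be passed together with a built-in algorithm name. The choice of the fisheriris data and the silhouette criterion here is illustrative.

```matlab
load fisheriris;

% Test 1 to 5 clusters, using the sum of absolute differences as the distance.
eva = evalclusters(meas,'kmeans','silhouette', ...
    'KList',1:5,'Distance','cityblock');

disp(eva.InspectedK)   % the values passed in KList
disp(eva.OptimalK)     % number of clusters favored by the criterion
```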

Distance metric used for computing the criterion values, specified
as the comma-separated pair consisting of 'Distance' and
one of the following.

'sqEuclidean'

Squared Euclidean distance

'Euclidean'

Euclidean distance

'cityblock'

Sum of absolute differences

'cosine'

One minus the cosine of the included angle between points (treated
as vectors)

'correlation'

One minus the sample correlation between points (treated as
sequences of values)

'Hamming'

Percentage of coordinates that differ

'Jaccard'

Percentage of nonzero coordinates that differ

For detailed information about each distance metric, see pdist.

You can also specify a function for the distance metric as a
function handle (@).
The distance function must be of the form d2 =
distfun(XI,XJ), where XI is a 1-by-n vector
corresponding to a single row of the input matrix X,
and XJ is an m2-by-n matrix
corresponding to multiple rows of X. distfun must
return an m2-by-1 vector
of distances d2, whose kth element
is the distance between XI and XJ(k,:).

If Criterion is 'silhouette',
you can also specify Distance as the output vector
created by the function pdist.
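The two alternatives above can be sketched as follows, using a matrix of proposed solutions and the silhouette criterion. The handle name distfun and the choice of a city block distance are illustrative assumptions. Both calls describe the same metric, so they would be expected to give matching criterion values.

```matlab
load fisheriris;

% Proposed solutions with 2 to 5 clusters, clustered with city block distance.
clust = zeros(size(meas,1),4);
for k = 2:5
    clust(:,k-1) = kmeans(meas,k,'Distance','cityblock','Replicates',5);
end

% Option 1: custom distance function. XI is 1-by-n, XJ is m2-by-n,
% and the result is an m2-by-1 vector of city block distances.
distfun = @(XI,XJ) sum(abs(bsxfun(@minus,XJ,XI)),2);
evaFun = evalclusters(meas,clust,'silhouette','Distance',distfun);

% Option 2: precomputed pairwise distances in the vector form returned by pdist.
D = pdist(meas,'cityblock');
evaVec = evalclusters(meas,clust,'silhouette','Distance',D);
```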

When Clust is a string representing a built-in
clustering algorithm, evalclusters uses the distance
metric specified for Distance to cluster the data,
except for the following:

If Clust is 'linkage',
and Distance is either 'sqEuclidean' or 'Euclidean',
then the clustering algorithm uses Euclidean distance and Ward linkage.

If Clust is 'linkage' and Distance is
any other metric, then the clustering algorithm uses the specified
distance metric and average linkage.

In all other cases, the distance metric specified for Distance must
match the distance metric used in the clustering algorithm to obtain
meaningful results.
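The 'linkage' exceptions above can be illustrated with two calls that differ only in the Distance value. This is a sketch with an illustrative data set and KList range; the comments restate the behavior described in the text rather than anything verified here.

```matlab
load fisheriris;

% 'Euclidean' with 'linkage': Euclidean distance and Ward linkage are used.
evaWard = evalclusters(meas,'linkage','silhouette', ...
    'KList',2:6,'Distance','Euclidean');

% Any other metric with 'linkage': that metric and average linkage are used.
evaAvg = evalclusters(meas,'linkage','silhouette', ...
    'KList',2:6,'Distance','cityblock');

disp([evaWard.OptimalK evaAvg.OptimalK])
```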

Prior probabilities for each cluster, specified as the comma-separated
pair consisting of 'ClusterPriors' and one of the
following.

'empirical'

Compute the overall silhouette value for the clustering solution
by averaging the silhouette values for all points. Each cluster contributes
to the overall silhouette value proportionally to its size.

'equal'

Compute the overall silhouette value for the clustering solution
by averaging the silhouette values for all points within each cluster,
and then averaging those values across all clusters. Each cluster
contributes equally to the overall silhouette value, regardless of
its size.
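The difference between the two options can be reproduced by hand from per-point silhouette values. The sketch below assumes a three-cluster kmeans solution purely for illustration, and its variable names are placeholders.

```matlab
load fisheriris;

idx = kmeans(meas,3,'Replicates',5);   % illustrative three-cluster solution
s = silhouette(meas,idx);              % one silhouette value per point

% 'empirical': average over all points, so larger clusters weigh more.
empiricalValue = mean(s);

% 'equal': average within each cluster first, then across clusters.
perCluster = accumarray(idx,s,[],@mean);
equalValue = mean(perCluster);

% The same choice passed to evalclusters.
eva = evalclusters(meas,'kmeans','silhouette', ...
    'KList',2:6,'ClusterPriors','equal');
```

The two values coincide when all clusters have the same size and diverge as the cluster sizes become more unbalanced.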

Number of reference data sets generated from the reference distribution ReferenceDistribution,
specified as the comma-separated pair consisting of 'B' and
a positive integer value.
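A minimal sketch of setting 'B' with the gap criterion follows. The value 50 is an arbitrary illustration, not a recommendation. More reference data sets generally give a more stable estimate at the cost of longer run time.

```matlab
load fisheriris;
rng('default');   % fix the seed so the reference data sets are reproducible

% Generate 50 reference data sets for each proposed number of clusters.
eva = evalclusters(meas,'kmeans','gap','KList',1:6,'B',50);
disp(eva.B)
```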

Method for selecting the optimal number of clusters, specified
as the comma-separated pair consisting of 'SearchMethod' and
one of the following.

'globalMaxSE'

Evaluate each proposed number of clusters in KList and
select the smallest number of clusters satisfying

Gap(K) ≥ GAPMAX − SE(GAPMAX),

where K is the number
of clusters, Gap(K) is the gap value for the clustering
solution with K clusters, GAPMAX is
the largest gap value, and SE(GAPMAX) is the standard
error corresponding to the largest gap value.

'firstMaxSE'

Evaluate each proposed number of clusters in KList and
select the smallest number of clusters satisfying

Gap(K) ≥ Gap(K + 1) − SE(K + 1),

where K is the number
of clusters, Gap(K) is the gap value for the clustering
solution with K clusters, and SE(K +
1) is the standard error of the clustering solution with K +
1 clusters.
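The two search methods can be compared on a small set of numbers. The sketch below assumes the selection conditions are Gap(K) ≥ GAPMAX − SE(GAPMAX) for 'globalMaxSE' and Gap(K) ≥ Gap(K + 1) − SE(K + 1) for 'firstMaxSE'. The gap values and standard errors are made up for illustration, and the sketch does not reproduce how evalclusters handles the case where no K satisfies the condition.

```matlab
% Hypothetical gap values and standard errors for KList = 1:6.
KList   = 1:6;
gapVals = [0.10 0.40 0.41 0.60 0.62 0.61];
se      = [0.03 0.03 0.04 0.05 0.03 0.04];

% 'globalMaxSE': smallest K whose gap is within one standard error
% of the largest gap value.
[gapMax,iMax] = max(gapVals);
kGlobal = KList(find(gapVals >= gapMax - se(iMax),1,'first'));

% 'firstMaxSE': smallest K whose gap is within one standard error
% of the gap for K + 1.
iFirst = find(gapVals(1:end-1) >= gapVals(2:end) - se(2:end),1,'first');
kFirst = KList(iFirst);

fprintf('globalMaxSE: %d, firstMaxSE: %d\n',kGlobal,kFirst);
% prints "globalMaxSE: 4, firstMaxSE: 2"
```

With these numbers, 'firstMaxSE' stops at the first local plateau (K = 2), while 'globalMaxSE' waits for a solution close to the overall maximum (K = 4).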