Statistics Toolbox 

Classical multidimensional scaling, also known as Principal Coordinates Analysis, takes a matrix of interpoint distances, and creates a configuration of points. Ideally, those points can be constructed in two or three dimensions, and the Euclidean distances between them approximately reproduce the original distance matrix. Thus, a scatter plot of the those points provides a visual representation of the original distances.
The two examples in this demo illustrate applications of multidimensional scaling to dissimilarity measures other than spatial distance, and show how to construct a configuration of points to visualize those dissimilarities.
This demo describes "classical" multidimensional scaling, using the cmdscale function in the Statistics Toolbox™. The mdscale function performs "nonclassical" MDS, which is sometimes more flexible than the classical method. Nonclassical MDS is described in the NonClassical Multidimensional Scaling demo.
Suppose you have measured the genetic "distance", or dissimilarity, between a number of local subpopulations of a single species of animal. You also know their geographic locations, and would like to know how closely their genetic and spatial distances correspond. If they do, that is evidence that interbreeding between the subpopulations is affected by their geographic locations.
Below are the spatial locations of the subpopulations, and the uppertriangle of the matrix of genetic distances, in the same vector format produced by pdist.
X = [39.1 18.7; 40.7 21.2; 41.5 21.5; 39.2 21.8; 38.7 20.6; 41.7 20.1; 40.1 22.1; 39.2 21.6]; D = [4.69 6.79 3.50 3.11 4.46 5.57 3.00 ... 2.10 2.27 2.65 2.36 1.99 1.74 ... 3.78 4.53 2.83 2.44 3.79 ... 1.98 4.35 2.07 0.53 ... 3.80 3.31 1.47 ... 4.35 3.82 ... 2.57];
Although this vector format for D is spaceefficient, it's often easier to see the distance relationships if you reformat the distances to a square matrix.
squareform(D)
ans = Columns 1 through 7 0 4.6900 6.7900 3.5000 3.1100 4.4600 5.5700 4.6900 0 2.1000 2.2700 2.6500 2.3600 1.9900 6.7900 2.1000 0 3.7800 4.5300 2.8300 2.4400 3.5000 2.2700 3.7800 0 1.9800 4.3500 2.0700 3.1100 2.6500 4.5300 1.9800 0 3.8000 3.3100 4.4600 2.3600 2.8300 4.3500 3.8000 0 4.3500 5.5700 1.9900 2.4400 2.0700 3.3100 4.3500 0 3.0000 1.7400 3.7900 0.5300 1.4700 3.8200 2.5700 Column 8 3.0000 1.7400 3.7900 0.5300 1.4700 3.8200 2.5700 0
cmdscale recognizes either of the two formats.
[Y,eigvals] = cmdscale(D);
cmdscale's first output, Y, is a matrix of points created to have interpoint distances that reproduce the distances in D. With eight species, the points (rows of Y) could have as many as eight dimensions (columns of Y). Visualization of the genetic distances depends on using points in only two or three dimensions. Fortunately, cmdscale's second output, eigvals, is a set of sorted eigenvalues whose relative magnitudes indicate how many dimensions you can safely use. If only the first two or three eigenvalues are large, then only those coordinates of the points in Y are needed to accurately reproduce D. If more than three eigenvalues are large, then it is not possible to find a good lowdimensional configuration of points, and it will not be easy to visualize the distances.
[eigvals eigvals/max(abs(eigvals))]
ans = 29.0371 1.0000 13.5746 0.4675 2.0987 0.0723 0.7418 0.0255 0.3403 0.0117 0.0000 0.0000 0.4542 0.0156 3.1755 0.1094
Notice that there are only two large positive eigenvalues, so the configuration of points created by cmdscale can be plotted in two dimensions. The two negative eigenvalues indicate that the genetic distances are not Euclidean, that is, no configuration of points can reproduce D exactly. Fortunately, the negative eigenvalues are small relative to the largest positive ones, and the reduction to the first two columns of Y should be fairly accurate. You can check this by looking at the error in the distances between the twodimensional configuration and the original distances.
maxrelerr = max(abs(D  pdist(Y(:,1:2)))) / max(D)
maxrelerr = 0.1335
Now you can compare the "genetic locations" created by cmdscale to the actual geographic locations. Because the configuration returned by cmdscale is unique only up to translation, rotation, and reflection, the genetic locations probably won't match the geographic locations. They will also have the wrong scale. But you can use the procrustes command to match up the two sets of points best in the least squares sense.
[D,Z] = procrustes(X,Y(:,1:2)); plot(X(:,1),X(:,2),'bo',Z(:,1),Z(:,2),'rd'); labels = num2str((1:8)'); text(X(:,1)+.05,X(:,2),labels,'Color','b'); text(Z(:,1)+.05,Z(:,2),labels,'Color','r'); xlabel('Distance East of Reference Point (Km)'); ylabel('Distance North of Reference Point (Km)'); legend({'Spatial Locations','Constructed Genetic Locations'},'Location','SE ');
This plot shows the best match of the reconstructed points in the same coordinates as the actual spatial locations. Apparently, the genetic distances do have a close link to the spatial distances between the subpopulations.
Suppose you have computed the following correlation matrix for a set of 10 variables. It's obvious that these variables are all positively correlated, and that there are some very strong pairwise correlations. But with this many variables, it's not easy to get a good feel for the relationships among all 10.
Rho = ...
[1 0.3906 0.3746 0.3318 0.4141 0.4279 0.4216 0.4703 0.4362 0.206
6;
0.3906 1 0.3200 0.3629 0.2211 0.9520 0.9811 0.9052 0.4567 0
;
0.3746 0.3200 1 0.8993 0.7999 0.3589 0.3460 0.3333 0.8639 0.652
7;
0.3318 0.3629 0.8993 1 0.7125 0.3959 0.3663 0.3394 0.8719 0.572
6;
0.4141 0.2211 0.7999 0.7125 1 0.2374 0.2079 0.2335 0.7050 0.746
9;
0.4279 0.9520 0.3589 0.3959 0.2374 1 0.9657 0.9363 0.4791 0.025
4;
0.4216 0.9811 0.3460 0.3663 0.2079 0.9657 1 0.9123 0.4554 0.001
1;
0.4703 0.9052 0.3333 0.3394 0.2335 0.9363 0.9123 1 0.4418 0.009
9;
0.4362 0.4567 0.8639 0.8719 0.7050 0.4791 0.4554 0.4418 1 0.527
2;
0.2066 0 0.6527 0.5726 0.7469 0.0254 0.0011 0.0099 0.5272 1
];
Multidimensional scaling is often thought of as a way to (re)construct points using only pairwise distances. But it can also be used with dissimilarity measures that are more general than distance, to spatially visualize things that are not "points in space" in the usual sense. The variables described by Rho are an example, and you can use cmdscale to plot a visual representation of their interdependencies.
Correlation actually measures similarity, but it is easy to transform it to a measure of dissimilarity. Because all the correlations here are positive, you can simply use
D = 1  Rho;
although other choices might also make sense. If Rho contained negative correlations, you would have to decide whether, for example, a correlation of 1 indicated more or less of a dissimilarity than a correlation of 0, and choose a transformation accordingly.
It's important to decide whether visualization of the information in the correlation matrix is even possible, that is, whether the number of dimensions can be reduced from ten down to two or three. The eigenvalues returned by cmdscale give you a way to decide. In this case, a scree plot of those eigenvalues indicates that two dimensions are enough to represent the variables. (Notice that some of the eigenvalues in the plot below are negative, but small relative to the first two.)
[Y,eigvals] = cmdscale(D); plot(1:length(eigvals),eigvals,'bo'); if feature('HGUsingMATLABClasses') cl = specgraphhelper('createConstantLineUsingMATLABClasses','LineStyle',... ':','Color',[.7 .7 .7],'Parent',gca); cl.Value = 0; else graph2d.constantline(0,'LineStyle',':','Color',[.7 .7 .7]); end axis([1,length(eigvals),min(eigvals),max(eigvals)*1.1]); xlabel('Eigenvalue number'); ylabel('Eigenvalue');
In a more independent set of variables, more dimensions might be needed. If more than three variables are needed, the visualization isn't all that useful.
A 2D plot of the configuration returned by cmdscale indicates that there are two subsets of variables that are most closely correlated among themselves, plus a single variable that is more or less on its own. One of the clusters is tight, while the other is relatively loose.
labels = {' 1',' 2',' 3',' 4',' 5',' 6',' 7',' 8',' 9',' 10'}; plot(Y(:,1),Y(:,2),'bx'); axis(max(max(abs(Y))) * [1.1,1.1,1.1,1.1]); axis('square'); text(Y(:,1),Y(:,2),labels,'HorizontalAlignment','left'); if feature('HGUsingMATLABClasses') hx = specgraphhelper('createConstantLineUsingMATLABClasses',... 'LineStyle','','Color',[.7 .7 .7],'Parent',gca); hx.Value = 0; else hx = graph2d.constantline(0,'LineStyle','','Color',[.7 .7 .7]); end changedependvar(hx,'x'); if feature('HGUsingMATLABClasses') hy = specgraphhelper('createConstantLineUsingMATLABClasses',... 'LineStyle','','Color',[.7 .7 .7],'Parent',gca); hy.Value = 0; else hy = graph2d.constantline(0,'LineStyle','','Color',[.7 .7 .7]); end changedependvar(hy,'y');
On the other hand, the results from cmdscale for the following correlation matrix indicates a much different structure: there are no real groups among the variables. Rather, there is a kind of "circular" dependency, where each variable has a pair of "closest neighbors" but is less well correlated with the remaining variables.
Rho = ...
[1 0.7946 0.1760 0.2560 0.7818 0.4496 0.2732 0.3995 0.5305 0.282
7;
0.7946 1 0.1626 0.4227 0.5674 0.6183 0.4004 0.2283 0.3495 0.277
7;
0.1760 0.1626 1 0.2644 0.1864 0.1859 0.4330 0.4656 0.3947 0.805
7;
0.2560 0.4227 0.2644 1 0.1017 0.7426 0.8340 0 0.0499 0.485
3;
0.7818 0.5674 0.1864 0.1017 1 0.2733 0.1484 0.4890 0.6138 0.202
5;
0.4496 0.6183 0.1859 0.7426 0.2733 1 0.6303 0.0648 0.1035 0.324
2;
0.2732 0.4004 0.4330 0.8340 0.1484 0.6303 1 0.1444 0.1357 0.629
1;
0.3995 0.2283 0.4656 0 0.4890 0.0648 0.1444 1 0.8599 0.394
8;
0.5305 0.3495 0.3947 0.0499 0.6138 0.1035 0.1357 0.8599 1 0.310
0;
0.2827 0.2777 0.8057 0.4853 0.2025 0.3242 0.6291 0.3948 0.3100 1
];
[Y,eigvals] = cmdscale(1Rho);
[eigvals eigvals./max(abs(eigvals))]
ans = 1.1416 1.0000 0.7742 0.6782 0.0335 0.0294 0.0280 0.0245 0.0239 0.0210 0.0075 0.0066 0.0046 0.0040 0.0000 0.0000 0.0151 0.0132 0.0472 0.0413
plot(Y(:,1),Y(:,2),'bx'); axis(max(max(abs(Y))) * [1.1,1.1,1.1,1.1]); axis('square'); text(Y(:,1),Y(:,2),labels,'HorizontalAlignment','left'); if feature('HGUsingMATLABClasses') hx = specgraphhelper('createConstantLineUsingMATLABClasses',... 'LineStyle','','Color',[.7 .7 .7],'Parent',gca); hx.Value = 0; else hx = graph2d.constantline(0, 'LineStyle','','Color',[.7 .7 .7]); end changedependvar(hx,'x'); if feature('HGUsingMATLABClasses') hy = specgraphhelper('createConstantLineUsingMATLABClasses',... 'LineStyle','','Color',[.7 .7 .7],'Parent',gca); hy.Value = 0; else hy = graph2d.constantline(0,'LineStyle','','Color',[.7 .7 .7]); end changedependvar(hy,'y');
Multidimensional scaling is most often used to visualize data when only their distances or dissimilarities are available. However, when the original data are available, multidimensional scaling can also be used as a dimension reduction method, by reducing the data to a distance matrix, creating a new configuration of points using cmdscale, and retaining only the first few dimensions of those points. This application of multidimensional scaling is much like Principal Components Analysis, and in fact, when you call cmdscale using the Euclidean distances between the points, the results are identical to PCA, up to a change in sign.
n = 10; m = 5;
X = randn(n,m);
D = pdist(X,'Euclidean');
[Y,eigvals] = cmdscale(D);
[PC,Score,latent] = princomp(X);
Y
Y = 1.4505 1.6602 0.8106 0.5834 0.5952 2.6140 1.0513 1.1962 0.7221 0.2299 2.2399 1.6699 0.7881 0.6659 0.0398 0.4956 0.2265 1.2682 0.5123 0.5702 0.1004 2.3659 1.2672 0.4837 0.2888 2.5996 1.0635 0.8532 0.1392 0.1216 1.5565 0.4215 0.0931 0.2863 0.0299 0.4656 0.6250 0.7608 0.3233 0.2786 2.3961 2.6933 0.2020 0.2572 0.4374 2.7660 0.3529 0.5474 0.4560 0.7044
Score
Score = 1.4505 1.6602 0.8106 0.5834 0.5952 2.6140 1.0513 1.1962 0.7221 0.2299 2.2399 1.6699 0.7881 0.6659 0.0398 0.4956 0.2265 1.2682 0.5123 0.5702 0.1004 2.3659 1.2672 0.4837 0.2888 2.5996 1.0635 0.8532 0.1392 0.1216 1.5565 0.4215 0.0931 0.2863 0.0299 0.4656 0.6250 0.7608 0.3233 0.2786 2.3961 2.6933 0.2020 0.2572 0.4374 2.7660 0.3529 0.5474 0.4560 0.7044
Even the nonzero eigenvalues are identical up to a scale factor.
[eigvals(1:m) (n1)*latent]
ans = 36.9993 36.9993 21.3766 21.3766 7.5792 7.5792 2.2815 2.2815 1.5981 1.5981