Top Document: comp.ai.neuralnets FAQ, Part 1 of 7: Introduction Previous Document: How many kinds of NNs exist? Next Document: How are layers counted? See reader questions & answers on this topic!  Help others by sharing your knowledge (And what is kmeans?) ====================== Teuvo Kohonen is one of the most famous and prolific researchers in neurocomputing, and he has invented a variety of networks. But many people refer to "Kohonen networks" without specifying which kind of Kohonen network, and this lack of precision can lead to confusion. The phrase "Kohonen network" most often refers to one of the following three types of networks: o VQ: Vector Quantizationcompetitive networks that can be viewed as unsupervised density estimators or autoassociators (Kohonen, 1995/1997; HechtNielsen 1990), closely related to kmeans cluster analysis (MacQueen, 1967; Anderberg, 1973). Each competitive unit corresponds to a cluster, the center of which is called a "codebook vector". Kohonen's learning law is an online algorithm that finds the codebook vector closest to each training case and moves the "winning" codebook vector closer to the training case. The codebook vector is moved a certain proportion of the distance between it and the training case, the proportion being specified by the learning rate, that is: new_codebook = old_codebook * (1learning_rate) + data * learning_rate Numerous similar algorithms have been developed in the neural net and machine learning literature; see HechtNielsen (1990) for a brief historical overview, and Kosko (1992) for a more technical overview of competitive learning. MacQueen's online kmeans algorithm is essentially the same as Kohonen's learning law except that the learning rate is the reciprocal of the number of cases that have been assigned to the winnning cluster. Suppose that when processing a given training case, N cases have been previously assigned to the winning codebook vector. Then the codebook vector is updated as: new_codebook = old_codebook * N/(N+1) + data * 1/(N+1) This reduction of the learning rate makes each codebook vector the mean of all cases assigned to its cluster and guarantees convergence of the algorithm to an optimum value of the error function (the sum of squared Euclidean distances between cases and codebook vectors) as the number of training cases goes to infinity. Kohonen's learning law with a fixed learning rate does not converge. As is well known from stochastic approximation theory, convergence requires the sum of the infinite sequence of learning rates to be infinite, while the sum of squared learning rates must be finite (Kohonen, 1995, p. 34). These requirements are satisfied by MacQueen's kmeans algorithm. Kohonen VQ is often used for offline learning, in which case the training data are stored and Kohonen's learning law is applied to each case in turn, cycling over the data set many times (incremental training). Convergence to a local optimum can be obtained as the training time goes to infinity if the learning rate is reduced in a suitable manner as described above. However, there are offline kmeans algorithms, both batch and incremental, that converge in a finite number of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979). The batch algorithms such as Forgy's (1965; Anderberg, 1973) have the advantage for large data sets, since the incremental methods require you either to store the cluster membership of each case or to do two nearestcluster computations as each case is processed. Forgy's algorithm is a simple alternating leastsquares algorithm consisting of the following steps: 1. Initialize the codebook vectors. 2. Repeat the following two steps until convergence: A. Read the data, assigning each case to the nearest (using Euclidean distance) codebook vector. B. Replace each codebook vector with the mean of the cases that were assigned to it. Fastest training is usually obtained if MacQueen's online algorithm is used for the first pass and offline kmeans algorithms are applied on subsequent passes (Bottou and Bengio, 1995). However, these training methods do not necessarily converge to a global optimum of the error function. The chance of finding a global optimum can be improved by using rational initialization (SAS Institute, 1989, pp. 824825), multiple random initializations, or various timeconsuming training methods intended for global optimization (Ismail and Kamel, 1989; Zeger, Vaisy, and Gersho, 1992). VQ has been a popular topic in the signal processing literature, which has been largely separate from the literature on Kohonen networks and from the cluster analysis literature in statistics and taxonomy. In signal processing, online methods such as Kohonen's and MacQueen's are called "adaptive vector quantization" (AVQ), while offline kmeans methods go by the names of "Lloyd" or "Lloyd I" (Lloyd, 1982) and "LBG" (Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho and Gray (1992) that summarizes these algorithms as information compression methods. Kohonen's work emphasized VQ as density estimation and hence the desirability of equiprobable clusters (Kohonen 1984; HechtNielsen 1990). However, Kohonen's learning law does not produce equiprobable clustersthat is, the proportions of training cases assigned to each cluster are not usually equal. If there are I inputs and the number of clusters is large, the density of the codebook vectors approximates the I/(I+2) power of the density of the training data (Kohonen, 1995, p. 35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately equiprobable only if the data density is uniform or the number of inputs is large. The most popular method for obtaining equiprobability is Desieno's (1988) algorithm which adds a "conscience" value to each distance prior to the competition. The conscience value for each cluster is adjusted during training so that clusters that win more often have larger conscience values and are thus handicapped to even out the probabilities of winning in later iterations. Kohonen's learning law is an approximation to the kmeans model, which is an approximation to normal mixture estimation by maximum likelihood assuming that the mixture components (clusters) all have spherical covariance matrices and equal sampling probabilities. Hence if the population contains clusters that are not equiprobable, kmeans will tend to produce sample clusters that are more nearly equiprobable than the population clusters. Corrections for this bias can be obtained by maximizing the likelihood without the assumption of equal sampling probabilities Symons (1981). Such corrections are similar to conscience but have the opposite effect. In cluster analysis, the purpose is not to compress information but to recover the true cluster memberships. Kmeans differs from mixture models in that, for kmeans, the cluster membership for each case is considered a separate parameter to be estimated, while mixture models estimate a posterior probability for each case based on the means, covariances, and sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and Lewis (1994) found that kmeans algorithms recovered cluster membership more accurately than Kohonen VQ. o SOM: SelfOrganizing Mapcompetitive networks that provide a "topological" mapping from the input space to the clusters (Kohonen, 1995). The SOM was inspired by the way in which various human sensory impressions are neurologically mapped into the brain such that spatial or other relations among stimuli correspond to spatial relations among the neurons. In a SOM, the neurons (clusters) are organized into a gridusually twodimensional, but sometimes onedimensional or (rarely) three or moredimensional. The grid exists in a space that is separate from the input space; any number of inputs may be used as long as the number of inputs is greater than the dimensionality of the grid space. A SOM tries to find clusters such that any two clusters that are close to each other in the grid space have codebook vectors close to each other in the input space. But the converse does not hold: codebook vectors that are close to each other in the input space do not necessarily correspond to clusters that are close to each other in the grid. Another way to look at this is that a SOM tries to embed the grid in the input space such every training case is close to some codebook vector, but the grid is bent or stretched as little as possible. Yet another way to look at it is that a SOM is a (discretely) smooth mapping between regions in the input space and points in the grid space. The best way to undestand this is to look at the pictures in Kohonen (1995) or various other NN textbooks. The Kohonen algorithm for SOMs is very similar to the Kohonen algorithm for AVQ. Let the codebook vectors be indexed by a subscript j, and let the index of the codebook vector nearest to the current training case be n. The Kohonen SOM algorithm requires a kernel function K(j,n), where K(j,j)=1 and K(j,n) is usually a nonincreasing function of the distance (in any metric) between codebook vectors j and n in the grid space (not the input space). Usually, K(j,n) is zero for codebook vectors that are far apart in the grid space. As each training case is processed, all the codebook vectors are updated as: new_codebook = old_codebook * [1  K(j,n) * learning_rate] j j + data * K(j,n) * learning_rate The kernel function does not necessarily remain constant during training. The neighborhood of a given codebook vector is the set of codebook vectors for which K(j,n)>0. To avoid poor results (akin to local minima), it is usually advisable to start with a large neighborhood, and let the neighborhood gradually shrink during training. If K(j,n)=0 for j not equal to n, then the SOM update formula reduces to the formula for Kohonen vector quantization. In other words, if the neighborhood size (for example, the radius of the support of the kernel function K(j,n)) is zero, the SOM algorithm degenerates into simple VQ. Hence it is important not to let the neighborhood size shrink all the way to zero during training. Indeed, the choice of the final neighborhood size is the most important tuning parameter for SOM training, as we will see shortly. A SOM works by smoothing the codebook vectors in a manner similar to kernel estimation methods, but the smoothing is done in neighborhoods in the grid space rather than in the input space (Mulier and Cherkassky 1995). This is easier to see in a batch algorithm for SOMs, which is similar to Forgy's algorithm for batch kmeans, but incorporates an extra smoothing process: 1. Initialize the codebook vectors. 2. Repeat the following two steps until convergence or boredom: A. Read the data, assigning each case to the nearest (using Euclidean distance) codebook vector. While you are doing this, keep track of the mean and the number of cases for each cluster. B. Do a nonparametric regression using K(j,n) as a kernel function, with the grid points as inputs, the cluster means as target values, and the number of cases in each cluster as an case weight. Replace each codebook vector with the output of the nonparametric regression function evaluated at its grid point. If the nonparametric regression method is NadarayaWatson kernel regression (see What is GRNN?), the batch SOM algorithm produces essentially the same results as Kohonen's algorithm, barring local minima. The main difference is that the batch algorithm often converges. Mulier and Cherkassky (1995) note that other nonparametric regression methods can be used to provide superior SOM algorithms. In particular, locallinear smoothing eliminates the notorious "border effect", whereby the codebook vectors near the border of the grid are compressed in the input space. The border effect is especially problematic when you try to use a high degree of smoothing in a Kohonen SOM, since all the codebook vectors will collapse into the center of the input space. The SOM border effect has the same mathematical cause as the "boundary effect" in kernel regression, which causes the estimated regression function to flatten out near the edges of the regression input space. There are various cures for the edge effect in nonparametric regression, of which locallinear smoothing is the simplest (Wand and Jones, 1995). Hence, locallinear smoothing also cures the border effect in SOMs. Furthermore, locallinear smoothing is much more general and reliable than the heuristic weighting rule proposed by Kohonen (1995, p. 129). Since nonparametric regression is used in the batch SOM algorithm, various properties of nonparametric regression extend to SOMs. In particular, it is well known that the shape of the kernel function is not a crucial matter in nonparametric regression, hence it is not crucial in SOMs. On the other hand, the amount of smoothing used for nonparametric regression is crucial, hence the choice of the final neighborhood size in a SOM is crucial. Unfortunately, I am not aware of any systematic studies of methods for choosing the final neighborhood size. The batch SOM algorithm is very similar to the principal curve and surface algorithm proposed by Hastie and Stuetzle (1989), as pointed out by Ritter, Martinetz, and Schulten (1992) and Mulier and Cherkassky (1995). A principal curve is a nonlinear generalization of a principal component. Given the probability distribution of a population, a principal curve is defined by the following selfconsistency condition: 1. If you choose any point on a principal curve, 2. then find the set of all the points in the input space that are closer to the chosen point than any other point on the curve, 3. and compute the expected value (mean) of that set of points with respect to the probability distribution, then 4. you end up with the same point on the curve that you chose originally. See http://www.iro.umontreal.ca/~kegl/research/pcurves/ for more information about principal curves and surfaces. In a multivariate normal distribution, the principal curves are the same as the principal components. A principal surface is the obvious generalization from a curve to a surface. In a multivariate normal distribution, the principal surfaces are the subspaces spanned by any two principal components. A onedimensional locallinear batch SOM can be used to estimate a principal curve by letting the number of codebook vectors approach infinity while choosing a kernel function of appropriate smoothness. A twodimensional locallinear batch SOM can be used to estimate a principal surface by letting the number of both rows and columns in the grid approach infinity. This connection between SOMs and principal curves and surfaces shows that the choice of the number of codebook vectors is not crucial, provided the number is fairly large. If the final neighborhood size in a locallinear batch SOM is large, the SOM approximates a subspace spanned by principal componentsusually the first principal component if the SOM is onedimensional, the first two principal components if the SOM is twodimensional, and so on. This result does not depend on the data having a multivariate normal distribution. Principal component analysis is a method of data compression, not a statistical model. However, there is a related method called "common factor analysis" that is often confused with principal component analysis but is indeed a statistical model. Common factor analysis posits that the relations among the observed variables can be explained by a smaller number of unobserved, "latent" variables. Tibshirani (1992) has proposed a latentvariable variant of principal curves, and latentvariable modifications of SOMs have been introduced by Utsugi (1996, 1997) and Bishop, Svensén, and Williams (1997). The choice of the number of codebook vectors is usually not critical as long as the number is fairly large. But results can be sensitive to the shape of the grid, e.g., square or an elongated rectangle. And the dimensionality of the grid is a crucial choice. It is difficult to guess the appropriate shape and dimensionality before analyzing the data. Determining the shape and dimensionality by trial and error can be quite tedious. Hence, a variety of methods have been tried for growing SOMs and related kinds of NNs during training. For more information on growing SOMs, see Bernd Fritzke's home page at http://pikas.inf.tudresden.de/~fritzke/ Using a 1by2 SOM is pointless. There is no "topological structure" in a 1by2 grid. A 1by2 SOM is essentially the same as VQ with two clusters, except that the SOM clusters will be closer together than the VQ clusters if the final neighborhood size for the SOM is large. In a Kohonen SOM, as in VQ, it is necessary to reduce the learning rate during training to obtain convergence. Greg Heath has commented in this regard: I favor separate learning rates for each winning SOM node (or kmeans cluster) in the form 1/(N_0i + N_i + 1), where N_i is the count of vectors that have caused node i to be a winner and N_0i is an initializing count that indicates the confidence in the initial weight vector assignment. The winning node expression is based on stochastic estimation convergence constraints and pseudoBayesian estimation of mean vectors. Kohonen derived a heuristic recursion relation for the "optimal" rate. To my surprise, when I solved the recursion relation I obtained the same above expression that I've been using for years. In addition, I have had success using the similar form (1/n)/(N_0j + N_j + (1/n)) for the n nodes in the shrinking updatingneighborhood. Before the final "winnersonly" stage when neighbors are no longer updated, the number of updating neighbors eventually shrinks to n = 6 or 8 for hexagonal or rectangular neighborhoods, respectively. Kohonen's neighborupdate formula is more precise replacing my constant fraction (1/n) with a nodepair specific h_ij (h_ij < 1). However, as long as the initial neighborhood is sufficiently large, the shrinking rate is sufficiently slow, and the final winneronly stage is sufficiently long, the results should be relatively insensitive to exact form of h_ij. Another advantage of batch SOMs is that there is no learning rate, so these complications evaporate. Kohonen (1995, p. VII) says that SOMs are not intended for pattern recognition but for clustering, visualization, and abstraction. Kohonen has used a "supervised SOM" (1995, pp. 160161) that is similar to counterpropagation (HechtNielsen 1990), but he seems to prefer LVQ (see below) for supervised classification. Many people continue to use SOMs for classification tasks, sometimes with surprisingly (I am tempted to say "inexplicably") good results (Cho, 1997). o LVQ: Learning Vector Quantizationcompetitive networks for supervised classification (Kohonen, 1988, 1995; Ripley, 1996). Each codebook vector is assigned to one of the target classes. Each class may have one or more codebook vectors. A case is classified by finding the nearest codebook vector and assigning the case to the class corresponding to the codebook vector. Hence LVQ is a kind of nearestneighbor rule. Ordinary VQ methods, such as Kohonen's VQ or kmeans, can easily be used for supervised classification. Simply count the number of training cases from each class assigned to each cluster, and divide by the total number of cases in the cluster to get the posterior probability. For a given case, output the class with the greatest posterior probabilityi.e. the class that forms a majority in the nearest cluster. Such methods can provide universally consistent classifiers (Devroye et al., 1996) even when the codebook vectors are obtained by unsupervised algorithms. LVQ tries to improve on this approach by adapting the codebook vectors in a supervised way. There are several variants of LVQcalled LVQ1, OLVQ1, LVQ2, and LVQ3based on heuristics. However, a smoothed version of LVQ can be trained as a feedforward network using a NRBFEQ architecture (see "How do MLPs compare with RBFs?") and optimizing any of the usual error functions; as the width of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ network. There are several other kinds of Kohonen networks described in Kohonen (1995), including: o DECDynamically Expanding Context o LSMLearning Subspace Method o ASSOMAdaptive Subspace SOM o FASSOMFeedbackcontrolled Adaptive Subspace SOM o Supervised SOM o LVQSOM More information on the error functions (or absence thereof) used by Kohonen VQ and SOM is provided under "What does unsupervised learning learn?" For more online information on Kohonen networks and other varieties of SOMs, see: o The web page of The Neural Networks Research Centre, Helsinki University of Technology, at http://www.cis.hut.fi/research/ o The SOM of articles from comp.ai.neuralnets at http://websom.hut.fi/websom/comp.ai.neuralnetsnew/html/root.html o Akio Utsugi's web page on Bayesian SOMs at the National Institute of Bioscience and HumanTechnology, Agency of Industrial Science and Technology, M.I.T.I., 11, Higashi, Tsukuba, Ibaraki, 305 Japan, at http://www.aist.go.jp/NIBH/~b0616/Lab/indexe.html o The GTM (generative topographic mapping) home page at the Neural Computing Research Group, Aston University, Birmingham, UK, at http://www.ncrg.aston.ac.uk/GTM/ o Nenet SOM software at http://www.mbnet.fi/~phodju/nenet/nenet.html o Bernd Fritzke's home page at http://pikas.inf.tudresden.de/~fritzke/ has information on growing SOMs and other related types of NNs References: Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc. Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with kmeans clustering", Psychometrika, 59, 509525. Bishop, C.M., Svensén, M., and Williams, C.K.I (1997), "GTM: A principled alternative to the selforganizing map," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 354360. Also see http://www.ncrg.aston.ac.uk/GTM/ Bottou, L., and Bengio, Y. (1995), "Convergence properties of the KMeans algorithms," in Tesauro, G., Touretzky, D., and Leen, T., (eds.) Advances in Neural Information Processing Systems 7, Cambridge, MA: The MIT Press, pp. 585592. Cho, S.B. (1997), "Selforganizing map with dynamical nodesplitting: Application to handwritten digit recognition," Neural Computation, 9, 13451355. Desieno, D. (1988), "Adding a conscience to competitive learning," Proc. Int. Conf. on Neural Networks, I, 117124, IEEE Press. Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer, Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency versus interpretability," Biometric Society Meetings, Riverside, CA. Abstract in Biomatrics, 21, 768. Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers. Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley. Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A kmeans clustering algorithm," Applied Statistics, 28100108. Hastie, T., and Stuetzle, W. (1989), "Principal curves," Journal of the American Statistical Association, 84, 502516. HechtNielsen, R. (1990), Neurocomputing, Reading, MA: AddisonWesley. Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, 22, 7589. Kohonen, T (1984), SelfOrganization and Associative Memory, Berlin: SpringerVerlag. Kohonen, T (1988), "Learning Vector Quantization," Neural Networks, 1 (suppl 1), 303. Kohonen, T. (1995/1997), SelfOrganizing Maps, Berlin: SpringerVerlag. First edition was 1995, second edition 1997. See http://www.cis.hut.fi/nnrc/new_book.html for information on the second edition. Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs, N.J.: PrenticeHall. Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector quantizer design," IEEE Transactions on Communications, 28, 8495. Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28, 129137. MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations,"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281297. Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on Information Theory, 6, 712. Mulier, F. and Cherkassky, V. (1995), "SelfOrganization as an iterative kernel smoothing process," Neural Computation, 7, 11651177. Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation and SelfOrganizing Maps: An Introduction, Reading, MA: AddisonWesley. SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition, Cary, NC: SAS Institute. Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 3543. Tibshirani, R. (1992), "Principal curves revisited," Statistics and Computing, 2, 183190. Utsugi, A. (1996), "Topology selection for selforganizing maps," Network: Computation in Neural Systems, 7, 727740, available online at http://www.aist.go.jp/NIBH/~b0616/Lab/indexe.html Utsugi, A. (1997), "Hyperparameter selection for selforganizing maps," Neural Computation, 9, 623635, available online at http://www.aist.go.jp/NIBH/~b0616/Lab/indexe.html Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman & Hall. Zador, P.L. (1982), "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Transactions on Information Theory, 28, 139149. Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector quantizer design by stochastic relaxation," IEEE Transactions on Signal Procesing, 40, 310322. User Contributions:Top Document: comp.ai.neuralnets FAQ, Part 1 of 7: Introduction Previous Document: How many kinds of NNs exist? Next Document: How are layers counted? Part1  Part2  Part3  Part4  Part5  Part6  Part7  Single Page [ Usenet FAQs  Web FAQs  Documents  RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM

Comment about this article, ask questions, or add new information about this topic: