Top Document: FAQ, Part 1 of 7: Introduction
Section - How many kinds of Kohonen networks exist?
Previous Document: How many kinds of NNs exist?
Next Document: How are layers counted?
(And what is k-means?)

Teuvo Kohonen is one of the most famous and prolific researchers in
neurocomputing, and he has invented a variety of networks. But many people
refer to "Kohonen networks" without specifying which kind of Kohonen
network, and this lack of precision can lead to confusion. The phrase
"Kohonen network" most often refers to one of the following three types
of networks:

 o VQ: Vector Quantization--competitive networks that can be viewed as
   unsupervised density estimators or autoassociators (Kohonen, 1995/1997;
   Hecht-Nielsen 1990), closely related to k-means cluster analysis
   (MacQueen, 1967; Anderberg, 1973). Each competitive unit corresponds to a
   cluster, the center of which is called a "codebook vector". Kohonen's
   learning law is an on-line algorithm that finds the codebook vector
   closest to each training case and moves the "winning" codebook vector
   closer to the training case. The codebook vector is moved a certain
   proportion of the distance between it and the training case, the
   proportion being specified by the learning rate, that is: 

      new_codebook = old_codebook * (1-learning_rate)

                   + data * learning_rate

   Numerous similar algorithms have been developed in the neural net and
   machine learning literature; see Hecht-Nielsen (1990) for a brief
   historical overview, and Kosko (1992) for a more technical overview of
   competitive learning. 
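
   As a concrete illustration, here is a minimal sketch of the on-line
   update in Python with NumPy; the function name and the constants are
   mine, not Kohonen's:

```python
import numpy as np

def kohonen_vq_step(codebooks, case, learning_rate=0.1):
    """One on-line Kohonen VQ update: find the codebook vector nearest
    the training case and move it a fraction of the way toward it."""
    distances = np.linalg.norm(codebooks - case, axis=1)
    winner = np.argmin(distances)
    # new_codebook = old_codebook * (1 - learning_rate) + data * learning_rate
    codebooks[winner] = (codebooks[winner] * (1 - learning_rate)
                         + case * learning_rate)
    return winner

# Two codebook vectors in the plane; a training case near the first one.
codebooks = np.array([[0.0, 0.0], [10.0, 10.0]])
winner = kohonen_vq_step(codebooks, np.array([1.0, 0.0]), learning_rate=0.5)
# winner is 0, and codebooks[0] has moved halfway toward [1, 0].
```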

   MacQueen's on-line k-means algorithm is essentially the same as Kohonen's
   learning law except that the learning rate is the reciprocal of the
number of cases that have been assigned to the winning cluster. Suppose
   that when processing a given training case, N cases have been previously
   assigned to the winning codebook vector. Then the codebook vector is
   updated as: 

      new_codebook = old_codebook * N/(N+1)

                   + data * 1/(N+1)

   This reduction of the learning rate makes each codebook vector the mean
   of all cases assigned to its cluster and guarantees convergence of the
   algorithm to an optimum value of the error function (the sum of squared
   Euclidean distances between cases and codebook vectors) as the number of
   training cases goes to infinity. Kohonen's learning law with a fixed
   learning rate does not converge. As is well known from stochastic
   approximation theory, convergence requires the sum of the infinite
   sequence of learning rates to be infinite, while the sum of squared
   learning rates must be finite (Kohonen, 1995, p. 34). These requirements
   are satisfied by MacQueen's k-means algorithm. 
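
   The only difference from the fixed-rate update is the 1/(N+1)
   learning rate. A rough one-pass sketch (my own code, not MacQueen's
   notation):

```python
import numpy as np

def macqueen_pass(data, codebooks):
    """One pass of MacQueen's on-line k-means: the learning rate for a
    winning cluster is 1/(N+1), where N counts its prior assignments,
    so each codebook vector becomes the mean of its assigned cases."""
    codebooks = codebooks.astype(float).copy()
    counts = np.zeros(len(codebooks), dtype=int)  # N for each cluster
    for case in data:
        winner = np.argmin(np.linalg.norm(codebooks - case, axis=1))
        n = counts[winner]
        # new_codebook = old_codebook * N/(N+1) + data * 1/(N+1)
        codebooks[winner] = codebooks[winner] * n / (n + 1) + case / (n + 1)
        counts[winner] += 1
    return codebooks, counts

data = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]])
codebooks, counts = macqueen_pass(data, np.array([[0.0, 0.0], [10.0, 10.0]]))
# Each codebook vector ends up as the mean of the cases assigned to it.
```

   Note that the first case assigned to a cluster (N=0) replaces its
   codebook vector entirely, exactly as the formula dictates.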

   Kohonen VQ is often used for off-line learning, in which case the
   training data are stored and Kohonen's learning law is applied to each
   case in turn, cycling over the data set many times (incremental
   training). Convergence to a local optimum can be obtained as the training
   time goes to infinity if the learning rate is reduced in a suitable
   manner as described above. However, there are off-line k-means
   algorithms, both batch and incremental, that converge in a finite number
   of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979).
   The batch algorithms such as Forgy's (1965; Anderberg, 1973) have the
   advantage for large data sets, since the incremental methods require you
   either to store the cluster membership of each case or to do two
   nearest-cluster computations as each case is processed. Forgy's algorithm
   is a simple alternating least-squares algorithm consisting of the
   following steps:

   1. Initialize the codebook vectors. 
   2. Repeat the following two steps until convergence:
      A. Read the data, assigning each case to the nearest (using Euclidean
      distance) codebook vector.
      B. Replace each codebook vector with the mean of the cases that were
      assigned to it. 
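
   The two steps above can be sketched as follows (a bare-bones
   illustration; real implementations handle empty clusters and ties
   more carefully):

```python
import numpy as np

def forgy_kmeans(data, codebooks, max_iter=100):
    """Forgy's batch k-means: alternate the assignment step (A) and the
    mean step (B) until the assignments stop changing."""
    codebooks = codebooks.astype(float).copy()
    assignments = None
    for _ in range(max_iter):
        # A. Assign each case to the nearest codebook vector.
        dists = np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # converged in a finite number of iterations
        assignments = new_assignments
        # B. Replace each codebook vector with the mean of its cases.
        for j in range(len(codebooks)):
            members = data[assignments == j]
            if len(members) > 0:
                codebooks[j] = members.mean(axis=0)
    return codebooks, assignments

data = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]])
codebooks, assignments = forgy_kmeans(data, np.array([[0.0, 0.0], [10.0, 10.0]]))
```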

   Fastest training is usually obtained if MacQueen's on-line algorithm is
   used for the first pass and off-line k-means algorithms are applied on
   subsequent passes (Bottou and Bengio, 1995). However, these training
   methods do not necessarily converge to a global optimum of the error
   function. The chance of finding a global optimum can be improved by using
   rational initialization (SAS Institute, 1989, pp. 824-825), multiple
   random initializations, or various time-consuming training methods
intended for global optimization (Ismail and Kamel, 1989; Zeger, Vaisey,
   and Gersho, 1992). 

   VQ has been a popular topic in the signal processing literature, which
   has been largely separate from the literature on Kohonen networks and
   from the cluster analysis literature in statistics and taxonomy. In
   signal processing, on-line methods such as Kohonen's and MacQueen's are
   called "adaptive vector quantization" (AVQ), while off-line k-means
   methods go by the names of "Lloyd" or "Lloyd I" (Lloyd, 1982) and "LBG"
   (Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho
   and Gray (1992) that summarizes these algorithms as information
   compression methods. 

   Kohonen's work emphasized VQ as density estimation and hence the
   desirability of equiprobable clusters (Kohonen 1984; Hecht-Nielsen 1990).
   However, Kohonen's learning law does not produce equiprobable
   clusters--that is, the proportions of training cases assigned to each
   cluster are not usually equal. If there are I inputs and the number of
   clusters is large, the density of the codebook vectors approximates the 
   I/(I+2) power of the density of the training data (Kohonen, 1995, p.
   35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately
   equiprobable only if the data density is uniform or the number of inputs
   is large. The most popular method for obtaining equiprobability is
Desieno's (1988) algorithm, which adds a "conscience" value to each
   distance prior to the competition. The conscience value for each cluster
   is adjusted during training so that clusters that win more often have
   larger conscience values and are thus handicapped to even out the
   probabilities of winning in later iterations. 
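
   A sketch of one such conscience-biased competition follows. It is
   patterned only loosely after Desieno's scheme; the Euclidean metric,
   the bias constant, and the running frequency estimate are my own
   illustrative choices:

```python
import numpy as np

def conscience_step(codebooks, win_freq, case, learning_rate=0.1,
                    bias_gain=10.0, freq_rate=0.0001):
    """One competitive step with a conscience: a bias proportional to
    (win frequency - 1/k) is added to each unit's distance before the
    competition, handicapping units that win too often."""
    k = len(codebooks)
    dist = np.linalg.norm(codebooks - case, axis=1)
    bias = bias_gain * (win_freq - 1.0 / k)
    winner = np.argmin(dist + bias)  # conscience applied before competing
    # Standard Kohonen update for the (biased) winner.
    codebooks[winner] += learning_rate * (case - codebooks[winner])
    # Running estimate of each unit's winning frequency.
    win_freq *= (1 - freq_rate)
    win_freq[winner] += freq_rate
    return winner

codebooks = np.array([[0.0, 0.0], [10.0, 10.0]])
win_freq = np.array([0.9, 0.1])  # unit 0 has been winning far too often
winner = conscience_step(codebooks, win_freq, np.array([4.0, 4.0]))
# Unit 1 wins despite being farther away, because unit 0 is handicapped.
```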

   Kohonen's learning law is an approximation to the k-means model, which is
   an approximation to normal mixture estimation by maximum likelihood
   assuming that the mixture components (clusters) all have spherical
   covariance matrices and equal sampling probabilities. Hence if the
   population contains clusters that are not equiprobable, k-means will tend
   to produce sample clusters that are more nearly equiprobable than the
   population clusters. Corrections for this bias can be obtained by
maximizing the likelihood without the assumption of equal sampling
probabilities (Symons, 1981). Such corrections are similar to conscience
   but have the opposite effect. 

   In cluster analysis, the purpose is not to compress information but to
   recover the true cluster memberships. K-means differs from mixture models
   in that, for k-means, the cluster membership for each case is considered
   a separate parameter to be estimated, while mixture models estimate a
   posterior probability for each case based on the means, covariances, and
   sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and
   Lewis (1994) found that k-means algorithms recovered cluster membership
   more accurately than Kohonen VQ. 

 o SOM: Self-Organizing Map--competitive networks that provide a
   "topological" mapping from the input space to the clusters (Kohonen,
   1995). The SOM was inspired by the way in which various human sensory
   impressions are neurologically mapped into the brain such that spatial or
   other relations among stimuli correspond to spatial relations among the
   neurons. In a SOM, the neurons (clusters) are organized into a
   grid--usually two-dimensional, but sometimes one-dimensional or (rarely)
   three- or more-dimensional. The grid exists in a space that is separate
   from the input space; any number of inputs may be used as long as the
   number of inputs is greater than the dimensionality of the grid space. A
   SOM tries to find clusters such that any two clusters that are close to
   each other in the grid space have codebook vectors close to each other in
   the input space. But the converse does not hold: codebook vectors that
   are close to each other in the input space do not necessarily correspond
   to clusters that are close to each other in the grid. Another way to look
at this is that a SOM tries to embed the grid in the input space such that
every training case is close to some codebook vector, but the grid is
bent or stretched as little as possible. Yet another way to look at it is
that a SOM is a (discretely) smooth mapping between regions in the input
space and points in the grid space. The best way to understand this is to
look at the pictures in Kohonen (1995) or various other NN textbooks. 

   The Kohonen algorithm for SOMs is very similar to the Kohonen algorithm
   for AVQ. Let the codebook vectors be indexed by a subscript j, and let
   the index of the codebook vector nearest to the current training case be 
   n. The Kohonen SOM algorithm requires a kernel function K(j,n), where
   K(j,j)=1 and K(j,n) is usually a non-increasing function of the
   distance (in any metric) between codebook vectors j and n in the grid
   space (not the input space). Usually, K(j,n) is zero for codebook
   vectors that are far apart in the grid space. As each training case is
   processed, all the codebook vectors are updated as: 

      new_codebook_j = old_codebook_j * [1 - K(j,n) * learning_rate]

                     + data * K(j,n) * learning_rate

   The kernel function does not necessarily remain constant during training.
   The neighborhood of a given codebook vector is the set of codebook
   vectors for which K(j,n)>0. To avoid poor results (akin to local
   minima), it is usually advisable to start with a large neighborhood, and
   let the neighborhood gradually shrink during training. If K(j,n)=0
   for j not equal to n, then the SOM update formula reduces to the formula
   for Kohonen vector quantization. In other words, if the neighborhood size
   (for example, the radius of the support of the kernel function K(j,n))
   is zero, the SOM algorithm degenerates into simple VQ. Hence it is
   important not to let the neighborhood size shrink all the way to zero
   during training. Indeed, the choice of the final neighborhood size is the
   most important tuning parameter for SOM training, as we will see shortly.
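
   In code, one such update might look like this (a Gaussian kernel is
   assumed for K(j,n), and the function name and constants are
   illustrative, not Kohonen's):

```python
import numpy as np

def som_step(codebooks, grid, case, learning_rate=0.1, radius=1.0):
    """One on-line SOM update: every codebook vector moves toward the
    training case, weighted by a kernel of its grid-space distance to
    the winning unit (Gaussian here, so K(n,n) = 1)."""
    winner = np.argmin(np.linalg.norm(codebooks - case, axis=1))
    # Distances measured in the grid space, not the input space.
    grid_dist = np.linalg.norm(grid - grid[winner], axis=1)
    K = np.exp(-(grid_dist / radius) ** 2)
    # new_codebook_j = old_codebook_j * [1 - K(j,n)*lr] + data * K(j,n)*lr
    step = (K * learning_rate)[:, None]
    codebooks[:] = codebooks * (1 - step) + case * step
    return winner

# Three units on a 1-D grid: the winner moves most, its neighbors less.
grid = np.array([[0.0], [1.0], [2.0]])
codebooks = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
winner = som_step(codebooks, grid, np.array([0.0, 0.0]),
                  learning_rate=0.5, radius=0.5)
```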

   A SOM works by smoothing the codebook vectors in a manner similar to
   kernel estimation methods, but the smoothing is done in neighborhoods in
   the grid space rather than in the input space (Mulier and Cherkassky
   1995). This is easier to see in a batch algorithm for SOMs, which is
   similar to Forgy's algorithm for batch k-means, but incorporates an extra
   smoothing process: 

   1. Initialize the codebook vectors. 
   2. Repeat the following two steps until convergence or boredom:
      A. Read the data, assigning each case to the nearest (using Euclidean
      distance) codebook vector. While you are doing this, keep track of the
      mean and the number of cases for each cluster. 
      B. Do a nonparametric regression using K(j,n) as a kernel function,
      with the grid points as inputs, the cluster means as target values,
and the number of cases in each cluster as a case weight. Replace
      each codebook vector with the output of the nonparametric regression
      function evaluated at its grid point. 
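
   A compact sketch of this batch algorithm, using a Gaussian kernel
   for the Nadaraya-Watson smoothing step (my own illustrative code,
   with a fixed iteration count standing in for a convergence test):

```python
import numpy as np

def batch_som(data, codebooks, grid, radius=1.0, n_iter=10):
    """Batch SOM: assign cases to nearest codebook vectors, then replace
    each codebook vector by a Nadaraya-Watson kernel regression (over
    the grid) of the cluster means, weighted by cluster counts."""
    codebooks = codebooks.astype(float).copy()
    for _ in range(n_iter):
        # A. Assignment step: cluster means and case counts.
        dists = np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        counts = np.bincount(assign, minlength=len(codebooks)).astype(float)
        means = np.zeros_like(codebooks)
        for j in range(len(codebooks)):
            if counts[j] > 0:
                means[j] = data[assign == j].mean(axis=0)
        # B. Smoothing step: kernel regression in the grid space.
        gd = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=2)
        K = np.exp(-(gd / radius) ** 2)  # kernel K(j,n) on the grid
        W = K * counts[None, :]          # case counts as regression weights
        codebooks = (W @ means) / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    return codebooks

# Four cases on a line, smoothed onto a 1-D grid of three units.
data = np.array([[0.0], [1.0], [9.0], [10.0]])
grid = np.array([[0.0], [1.0], [2.0]])
codebooks = batch_som(data, np.array([[0.0], [5.0], [10.0]]), grid, radius=0.5)
```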

   If the nonparametric regression method is Nadaraya-Watson kernel
   regression (see What is GRNN?), the batch SOM algorithm produces
   essentially the same results as Kohonen's algorithm, barring local
   minima. The main difference is that the batch algorithm often converges.
   Mulier and Cherkassky (1995) note that other nonparametric regression
   methods can be used to provide superior SOM algorithms. In particular,
   local-linear smoothing eliminates the notorious "border effect", whereby
   the codebook vectors near the border of the grid are compressed in the
   input space. The border effect is especially problematic when you try to
   use a high degree of smoothing in a Kohonen SOM, since all the codebook
   vectors will collapse into the center of the input space. The SOM border
   effect has the same mathematical cause as the "boundary effect" in kernel
   regression, which causes the estimated regression function to flatten out
   near the edges of the regression input space. There are various cures for
   the edge effect in nonparametric regression, of which local-linear
   smoothing is the simplest (Wand and Jones, 1995). Hence, local-linear
   smoothing also cures the border effect in SOMs. Furthermore, local-linear
   smoothing is much more general and reliable than the heuristic weighting
   rule proposed by Kohonen (1995, p. 129). 

   Since nonparametric regression is used in the batch SOM algorithm,
   various properties of nonparametric regression extend to SOMs. In
   particular, it is well known that the shape of the kernel function is not
   a crucial matter in nonparametric regression, hence it is not crucial in
   SOMs. On the other hand, the amount of smoothing used for nonparametric
   regression is crucial, hence the choice of the final neighborhood size in
   a SOM is crucial. Unfortunately, I am not aware of any systematic studies
   of methods for choosing the final neighborhood size. 

   The batch SOM algorithm is very similar to the principal curve and
   surface algorithm proposed by Hastie and Stuetzle (1989), as pointed out
   by Ritter, Martinetz, and Schulten (1992) and Mulier and Cherkassky
   (1995). A principal curve is a nonlinear generalization of a principal
   component. Given the probability distribution of a population, a
   principal curve is defined by the following self-consistency condition: 
   1. If you choose any point on a principal curve, 
   2. then find the set of all the points in the input space that are closer
      to the chosen point than any other point on the curve, 
   3. and compute the expected value (mean) of that set of points with
      respect to the probability distribution, then 
   4. you end up with the same point on the curve that you chose originally.
   See Hastie and Stuetzle (1989) for more information about principal
   curves and surfaces. 

   In a multivariate normal distribution, the principal curves are the same
   as the principal components. A principal surface is the obvious
   generalization from a curve to a surface. In a multivariate normal
   distribution, the principal surfaces are the subspaces spanned by any two
   principal components.

   A one-dimensional local-linear batch SOM can be used to estimate a
   principal curve by letting the number of codebook vectors approach
   infinity while choosing a kernel function of appropriate smoothness. A
   two-dimensional local-linear batch SOM can be used to estimate a
   principal surface by letting the number of both rows and columns in the
   grid approach infinity. This connection between SOMs and principal curves
   and surfaces shows that the choice of the number of codebook vectors is
   not crucial, provided the number is fairly large. 

   If the final neighborhood size in a local-linear batch SOM is large, the
   SOM approximates a subspace spanned by principal components--usually the
   first principal component if the SOM is one-dimensional, the first two
   principal components if the SOM is two-dimensional, and so on. This
   result does not depend on the data having a multivariate normal
   distribution.

   Principal component analysis is a method of data compression, not a
   statistical model. However, there is a related method called "common
   factor analysis" that is often confused with principal component analysis
   but is indeed a statistical model. Common factor analysis posits that the
   relations among the observed variables can be explained by a smaller
   number of unobserved, "latent" variables. Tibshirani (1992) has proposed
   a latent-variable variant of principal curves, and latent-variable
   modifications of SOMs have been introduced by Utsugi (1996, 1997) and
   Bishop, Svensén, and Williams (1997). 

   The choice of the number of codebook vectors is usually not critical as
   long as the number is fairly large. But results can be sensitive to the
   shape of the grid, e.g., square or an elongated rectangle. And the
   dimensionality of the grid is a crucial choice. It is difficult to guess
   the appropriate shape and dimensionality before analyzing the data.
   Determining the shape and dimensionality by trial and error can be quite
   tedious. Hence, a variety of methods have been tried for growing SOMs and
   related kinds of NNs during training. For more information on growing
   SOMs, see Bernd Fritzke's home page.

   Using a 1-by-2 SOM is pointless. There is no "topological structure" in a
   1-by-2 grid. A 1-by-2 SOM is essentially the same as VQ with two
   clusters, except that the SOM clusters will be closer together than the
   VQ clusters if the final neighborhood size for the SOM is large. 

   In a Kohonen SOM, as in VQ, it is necessary to reduce the learning rate
   during training to obtain convergence. Greg Heath has commented in this
   regard: 

   I favor separate learning rates for each winning SOM node (or k-means
   cluster) in the form 1/(N_0i + N_i + 1), where N_i is the
   count of vectors that have caused node i to be a winner and N_0i
   is an initializing count that indicates the confidence in the initial
   weight vector assignment. The winning node expression is based on
   stochastic estimation convergence constraints and pseudo-Bayesian
   estimation of mean vectors. Kohonen derived a heuristic recursion
   relation for the "optimal" rate. To my surprise, when I solved the
   recursion relation I obtained the same above expression that I've
   been using for years. 

   In addition, I have had success using the similar form 
   (1/n)/(N_0j + N_j + (1/n)) for the n nodes in the
   shrinking updating-neighborhood. Before the final "winners-only"
   stage when neighbors are no longer updated, the number of updating
   neighbors eventually shrinks to n = 6 or 8 for hexagonal or
   rectangular neighborhoods, respectively. 

   Kohonen's neighbor-update formula is more precise, replacing my
   constant fraction (1/n) with a node-pair specific h_ij (h_ij
   < 1). However, as long as the initial neighborhood is sufficiently
   large, the shrinking rate is sufficiently slow, and the final
   winner-only stage is sufficiently long, the results should be
   relatively insensitive to the exact form of h_ij. 

Another advantage of batch SOMs is that there is no learning rate, so these
complications evaporate. 

Kohonen (1995, p. VII) says that SOMs are not intended for pattern
recognition but for clustering, visualization, and abstraction. Kohonen has
used a "supervised SOM" (1995, pp. 160-161) that is similar to
counterpropagation (Hecht-Nielsen 1990), but he seems to prefer LVQ (see
below) for supervised classification. Many people continue to use SOMs for
classification tasks, sometimes with surprisingly (I am tempted to say
"inexplicably") good results (Cho, 1997). 

o LVQ: Learning Vector Quantization--competitive networks for supervised
classification (Kohonen, 1988, 1995; Ripley, 1996). Each codebook vector is
assigned to one of the target classes. Each class may have one or more
codebook vectors. A case is classified by finding the nearest codebook
vector and assigning the case to the class corresponding to the codebook
vector. Hence LVQ is a kind of nearest-neighbor rule. 

Ordinary VQ methods, such as Kohonen's VQ or k-means, can easily be used for
supervised classification. Simply count the number of training cases from
each class assigned to each cluster, and divide by the total number of cases
in the cluster to get the posterior probability. For a given case, output
the class with the greatest posterior probability--i.e. the class that forms
a majority in the nearest cluster. Such methods can provide universally
consistent classifiers (Devroye et al., 1996) even when the codebook vectors
are obtained by unsupervised algorithms. LVQ tries to improve on this
approach by adapting the codebook vectors in a supervised way. There are
several variants of LVQ--called LVQ1, OLVQ1, LVQ2, and LVQ3--based on
heuristics. However, a smoothed version of LVQ can be trained as a
feedforward network using a NRBFEQ architecture (see "How do MLPs compare
with RBFs?") and optimizing any of the usual error functions; as the width
of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ
network.
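
The count-and-divide recipe above can be sketched as follows (illustrative
code; the codebook vectors could come from any unsupervised VQ method):

```python
import numpy as np

def vq_classifier(data, labels, codebooks, n_classes):
    """Wrap unsupervised codebook vectors as a classifier: count the
    training cases of each class in each cluster, then classify a new
    case by the majority class of its nearest cluster."""
    assign = np.argmin(
        np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2), axis=1)
    # counts[j, c] = number of class-c training cases in cluster j.
    counts = np.zeros((len(codebooks), n_classes))
    for j, c in zip(assign, labels):
        counts[j, c] += 1
    # Posterior probability of each class within each cluster.
    posterior = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    majority = posterior.argmax(axis=1)

    def classify(case):
        nearest = np.argmin(np.linalg.norm(codebooks - case, axis=1))
        return majority[nearest]

    return classify

# Two clusters found (say) by unsupervised VQ, each dominated by one class.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [9.0, 10.0]])
y = [0, 0, 1, 1]
classify = vq_classifier(X, y, np.array([[0.5, 0.0], [9.5, 10.0]]), n_classes=2)
```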

There are several other kinds of Kohonen networks described in Kohonen
(1995), including: 

 o DEC--Dynamically Expanding Context 
 o LSM--Learning Subspace Method 
 o ASSOM--Adaptive Subspace SOM 
 o FASSOM--Feedback-controlled Adaptive Subspace SOM 
 o Supervised SOM 

More information on the error functions (or absence thereof) used by Kohonen
VQ and SOM is provided under "What does unsupervised learning learn?" 

For more on-line information on Kohonen networks and other varieties of
SOMs, see: 

 o The web page of The Neural Networks Research Centre, Helsinki University
   of Technology 
 o Akio Utsugi's web page on Bayesian SOMs at the National Institute of
   Bioscience and Human-Technology, Agency of Industrial Science and
   Technology, M.I.T.I., 1-1, Higashi, Tsukuba, Ibaraki, 305 Japan 
 o The GTM (generative topographic mapping) home page at the Neural
   Computing Research Group, Aston University, Birmingham, UK 
 o Nenet SOM software 
 o Bernd Fritzke's home page, with information on growing SOMs and other
   related types of NNs 


   Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:
   Academic Press, Inc. 

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
   study of the classification capabilities of neural networks using
   unsupervised learning: A comparison with k-means clustering",
   Psychometrika, 59, 509-525. 

   Bishop, C.M., Svensén, M., and Williams, C.K.I (1997), "GTM: A principled
   alternative to the self-organizing map," in Mozer, M.C., Jordan, M.I.,
   and Petsche, T., (eds.) Advances in Neural Information Processing
   Systems 9, Cambridge, MA: The MIT Press, pp. 354-360. Also see 

   Bottou, L., and Bengio, Y. (1995), "Convergence properties of the K-Means
   algorithms," in Tesauro, G., Touretzky, D., and Leen, T., (eds.) 
   Advances in Neural Information Processing Systems 7, Cambridge, MA: The
   MIT Press, pp. 585-592. 

   Cho, S.-B. (1997), "Self-organizing map with dynamical node-splitting:
   Application to handwritten digit recognition," Neural Computation, 9. 

   Desieno, D. (1988), "Adding a conscience to competitive learning," Proc.
   Int. Conf. on Neural Networks, I, 117-124, IEEE Press. 

   Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of
   Pattern Recognition, NY: Springer, 

   Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency
   versus interpretability," Biometric Society Meetings, Riverside, CA.
   Abstract in Biometrics, 21, 768. 

   Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
   Compression, Boston: Kluwer Academic Publishers. 

   Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley. 

   Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means
   clustering algorithm," Applied Statistics, 28, 100-108. 

   Hastie, T., and Stuetzle, W. (1989), "Principal curves," Journal of the
   American Statistical Association, 84, 502-516. 

   Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

   Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering
   utilizing hybrid search strategies," Pattern Recognition, 22, 75-89. 

   Kohonen, T. (1984), Self-Organization and Associative Memory, Berlin:
   Springer-Verlag. 

   Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1
   (suppl 1), 303. 

   Kohonen, T. (1995/1997), Self-Organizing Maps, Berlin: Springer-Verlag.
   First edition 1995; second edition 1997. 

   Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
   N.J.: Prentice-Hall. 

   Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector
   quantizer design," IEEE Transactions on Communications, 28, 84-95. 

   Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions
   on Information Theory, 28, 129-137. 

   MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
   Multivariate Observations,"Proceedings of the Fifth Berkeley Symposium on
   Mathematical Statistics and Probability, 1, 281-297. 

   Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on
   Information Theory, 6, 7-12. 

   Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an iterative
   kernel smoothing process," Neural Computation, 7, 1165-1177. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 

   Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation
   and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley. 

   SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition,
   Cary, NC: SAS Institute. 

   Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal
   Mixtures," Biometrics, 37, 35-43. 

   Tibshirani, R. (1992), "Principal curves revisited," Statistics and
   Computing, 2, 183-190. 

   Utsugi, A. (1996), "Topology selection for self-organizing maps,"
   Network: Computation in Neural Systems, 7, 727-740, available on-line at 

   Utsugi, A. (1997), "Hyperparameter selection for self-organizing maps,"
   Neural Computation, 9, 623-635, available on-line at 

   Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman &
   Hall. 

   Zador, P.L. (1982), "Asymptotic quantization error of continuous signals
   and the quantization dimension," IEEE Transactions on Information Theory,
   28, 139-149. 

   Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector
   quantizer design by stochastic relaxation," IEEE Transactions on Signal
   Processing, 40, 310-322. 

Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM