## Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z # comp.ai.neural-nets FAQ, Part 1 of 7: IntroductionSection - How many kinds of Kohonen networks exist?

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Zip codes ]

Top Document: comp.ai.neural-nets FAQ, Part 1 of 7: Introduction
Previous Document: How many kinds of NNs exist?
Next Document: How are layers counted?
See reader questions & answers on this topic! - Help others by sharing your knowledge
```(And what is k-means?)
======================

Teuvo Kohonen is one of the most famous and prolific researchers in
neurocomputing, and he has invented a variety of networks. But many people
refer to "Kohonen networks" without specifying which kind of Kohonen
network, and this lack of precision can lead to confusion. The phrase
"Kohonen network" most often refers to one of the following three types of
networks:

o VQ: Vector Quantization--competitive networks that can be viewed as
unsupervised density estimators or autoassociators (Kohonen, 1995/1997;
Hecht-Nielsen 1990), closely related to k-means cluster analysis
(MacQueen, 1967; Anderberg, 1973). Each competitive unit corresponds to a
cluster, the center of which is called a "codebook vector". Kohonen's
learning law is an on-line algorithm that finds the codebook vector
closest to each training case and moves the "winning" codebook vector
closer to the training case. The codebook vector is moved a certain
proportion of the distance between it and the training case, the
proportion being specified by the learning rate, that is:

new_codebook = old_codebook * (1-learning_rate)

+ data * learning_rate

Numerous similar algorithms have been developed in the neural net and
machine learning literature; see Hecht-Nielsen (1990) for a brief
historical overview, and Kosko (1992) for a more technical overview of
competitive learning.

MacQueen's on-line k-means algorithm is essentially the same as Kohonen's
learning law except that the learning rate is the reciprocal of the
number of cases that have been assigned to the winnning cluster. Suppose
that when processing a given training case, N cases have been previously
assigned to the winning codebook vector. Then the codebook vector is
updated as:

new_codebook = old_codebook * N/(N+1)

+ data * 1/(N+1)

This reduction of the learning rate makes each codebook vector the mean
of all cases assigned to its cluster and guarantees convergence of the
algorithm to an optimum value of the error function (the sum of squared
Euclidean distances between cases and codebook vectors) as the number of
training cases goes to infinity. Kohonen's learning law with a fixed
learning rate does not converge. As is well known from stochastic
approximation theory, convergence requires the sum of the infinite
sequence of learning rates to be infinite, while the sum of squared
learning rates must be finite (Kohonen, 1995, p. 34). These requirements
are satisfied by MacQueen's k-means algorithm.

Kohonen VQ is often used for off-line learning, in which case the
training data are stored and Kohonen's learning law is applied to each
case in turn, cycling over the data set many times (incremental
training). Convergence to a local optimum can be obtained as the training
time goes to infinity if the learning rate is reduced in a suitable
manner as described above. However, there are off-line k-means
algorithms, both batch and incremental, that converge in a finite number
of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979).
The batch algorithms such as Forgy's (1965; Anderberg, 1973) have the
advantage for large data sets, since the incremental methods require you
either to store the cluster membership of each case or to do two
nearest-cluster computations as each case is processed. Forgy's algorithm
is a simple alternating least-squares algorithm consisting of the
following steps:

1. Initialize the codebook vectors.
2. Repeat the following two steps until convergence:
A. Read the data, assigning each case to the nearest (using Euclidean
distance) codebook vector.
B. Replace each codebook vector with the mean of the cases that were
assigned to it.

Fastest training is usually obtained if MacQueen's on-line algorithm is
used for the first pass and off-line k-means algorithms are applied on
subsequent passes (Bottou and Bengio, 1995). However, these training
methods do not necessarily converge to a global optimum of the error
function. The chance of finding a global optimum can be improved by using
rational initialization (SAS Institute, 1989, pp. 824-825), multiple
random initializations, or various time-consuming training methods
intended for global optimization (Ismail and Kamel, 1989; Zeger, Vaisy,
and Gersho, 1992).

VQ has been a popular topic in the signal processing literature, which
has been largely separate from the literature on Kohonen networks and
from the cluster analysis literature in statistics and taxonomy. In
signal processing, on-line methods such as Kohonen's and MacQueen's are
called "adaptive vector quantization" (AVQ), while off-line k-means
methods go by the names of "Lloyd" or "Lloyd I" (Lloyd, 1982) and "LBG"
(Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho
and Gray (1992) that summarizes these algorithms as information
compression methods.

Kohonen's work emphasized VQ as density estimation and hence the
desirability of equiprobable clusters (Kohonen 1984; Hecht-Nielsen 1990).
However, Kohonen's learning law does not produce equiprobable
clusters--that is, the proportions of training cases assigned to each
cluster are not usually equal. If there are I inputs and the number of
clusters is large, the density of the codebook vectors approximates the
I/(I+2) power of the density of the training data (Kohonen, 1995, p.
35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately
equiprobable only if the data density is uniform or the number of inputs
is large. The most popular method for obtaining equiprobability is
Desieno's (1988) algorithm which adds a "conscience" value to each
distance prior to the competition. The conscience value for each cluster
is adjusted during training so that clusters that win more often have
larger conscience values and are thus handicapped to even out the
probabilities of winning in later iterations.

Kohonen's learning law is an approximation to the k-means model, which is
an approximation to normal mixture estimation by maximum likelihood
assuming that the mixture components (clusters) all have spherical
covariance matrices and equal sampling probabilities. Hence if the
population contains clusters that are not equiprobable, k-means will tend
to produce sample clusters that are more nearly equiprobable than the
population clusters. Corrections for this bias can be obtained by
maximizing the likelihood without the assumption of equal sampling
probabilities Symons (1981). Such corrections are similar to conscience
but have the opposite effect.

In cluster analysis, the purpose is not to compress information but to
recover the true cluster memberships. K-means differs from mixture models
in that, for k-means, the cluster membership for each case is considered
a separate parameter to be estimated, while mixture models estimate a
posterior probability for each case based on the means, covariances, and
sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and
Lewis (1994) found that k-means algorithms recovered cluster membership
more accurately than Kohonen VQ.

o SOM: Self-Organizing Map--competitive networks that provide a
"topological" mapping from the input space to the clusters (Kohonen,
1995). The SOM was inspired by the way in which various human sensory
impressions are neurologically mapped into the brain such that spatial or
other relations among stimuli correspond to spatial relations among the
neurons. In a SOM, the neurons (clusters) are organized into a
grid--usually two-dimensional, but sometimes one-dimensional or (rarely)
three- or more-dimensional. The grid exists in a space that is separate
from the input space; any number of inputs may be used as long as the
number of inputs is greater than the dimensionality of the grid space. A
SOM tries to find clusters such that any two clusters that are close to
each other in the grid space have codebook vectors close to each other in
the input space. But the converse does not hold: codebook vectors that
are close to each other in the input space do not necessarily correspond
to clusters that are close to each other in the grid. Another way to look
at this is that a SOM tries to embed the grid in the input space such
every training case is close to some codebook vector, but the grid is
bent or stretched as little as possible. Yet another way to look at it is
that a SOM is a (discretely) smooth mapping between regions in the input
space and points in the grid space. The best way to undestand this is to
look at the pictures in Kohonen (1995) or various other NN textbooks.

The Kohonen algorithm for SOMs is very similar to the Kohonen algorithm
for AVQ. Let the codebook vectors be indexed by a subscript j, and let
the index of the codebook vector nearest to the current training case be
n. The Kohonen SOM algorithm requires a kernel function K(j,n), where
K(j,j)=1 and K(j,n) is usually a non-increasing function of the
distance (in any metric) between codebook vectors j and n in the grid
space (not the input space). Usually, K(j,n) is zero for codebook
vectors that are far apart in the grid space. As each training case is
processed, all the codebook vectors are updated as:

new_codebook  = old_codebook  * [1 - K(j,n) * learning_rate]
j               j

+ data * K(j,n) * learning_rate

The kernel function does not necessarily remain constant during training.
The neighborhood of a given codebook vector is the set of codebook
vectors for which K(j,n)>0. To avoid poor results (akin to local
minima), it is usually advisable to start with a large neighborhood, and
let the neighborhood gradually shrink during training. If K(j,n)=0
for j not equal to n, then the SOM update formula reduces to the formula
for Kohonen vector quantization. In other words, if the neighborhood size
(for example, the radius of the support of the kernel function K(j,n))
is zero, the SOM algorithm degenerates into simple VQ. Hence it is
important not to let the neighborhood size shrink all the way to zero
during training. Indeed, the choice of the final neighborhood size is the
most important tuning parameter for SOM training, as we will see shortly.

A SOM works by smoothing the codebook vectors in a manner similar to
kernel estimation methods, but the smoothing is done in neighborhoods in
the grid space rather than in the input space (Mulier and Cherkassky
1995). This is easier to see in a batch algorithm for SOMs, which is
similar to Forgy's algorithm for batch k-means, but incorporates an extra
smoothing process:

1. Initialize the codebook vectors.
2. Repeat the following two steps until convergence or boredom:
A. Read the data, assigning each case to the nearest (using Euclidean
distance) codebook vector. While you are doing this, keep track of the
mean and the number of cases for each cluster.
B. Do a nonparametric regression using K(j,n) as a kernel function,
with the grid points as inputs, the cluster means as target values,
and the number of cases in each cluster as an case weight. Replace
each codebook vector with the output of the nonparametric regression
function evaluated at its grid point.

If the nonparametric regression method is Nadaraya-Watson kernel
regression (see What is GRNN?), the batch SOM algorithm produces
essentially the same results as Kohonen's algorithm, barring local
minima. The main difference is that the batch algorithm often converges.
Mulier and Cherkassky (1995) note that other nonparametric regression
methods can be used to provide superior SOM algorithms. In particular,
local-linear smoothing eliminates the notorious "border effect", whereby
the codebook vectors near the border of the grid are compressed in the
input space. The border effect is especially problematic when you try to
use a high degree of smoothing in a Kohonen SOM, since all the codebook
vectors will collapse into the center of the input space. The SOM border
effect has the same mathematical cause as the "boundary effect" in kernel
regression, which causes the estimated regression function to flatten out
near the edges of the regression input space. There are various cures for
the edge effect in nonparametric regression, of which local-linear
smoothing is the simplest (Wand and Jones, 1995). Hence, local-linear
smoothing also cures the border effect in SOMs. Furthermore, local-linear
smoothing is much more general and reliable than the heuristic weighting
rule proposed by Kohonen (1995, p. 129).

Since nonparametric regression is used in the batch SOM algorithm,
various properties of nonparametric regression extend to SOMs. In
particular, it is well known that the shape of the kernel function is not
a crucial matter in nonparametric regression, hence it is not crucial in
SOMs. On the other hand, the amount of smoothing used for nonparametric
regression is crucial, hence the choice of the final neighborhood size in
a SOM is crucial. Unfortunately, I am not aware of any systematic studies
of methods for choosing the final neighborhood size.

The batch SOM algorithm is very similar to the principal curve and
surface algorithm proposed by Hastie and Stuetzle (1989), as pointed out
by Ritter, Martinetz, and Schulten (1992) and Mulier and Cherkassky
(1995). A principal curve is a nonlinear generalization of a principal
component. Given the probability distribution of a population, a
principal curve is defined by the following self-consistency condition:
1. If you choose any point on a principal curve,
2. then find the set of all the points in the input space that are closer
to the chosen point than any other point on the curve,
3. and compute the expected value (mean) of that set of points with
respect to the probability distribution, then
4. you end up with the same point on the curve that you chose originally.
See http://www.iro.umontreal.ca/~kegl/research/pcurves/ for more
information about principal curves and surfaces.

In a multivariate normal distribution, the principal curves are the same
as the principal components. A principal surface is the obvious
generalization from a curve to a surface. In a multivariate normal
distribution, the principal surfaces are the subspaces spanned by any two
principal components.

A one-dimensional local-linear batch SOM can be used to estimate a
principal curve by letting the number of codebook vectors approach
infinity while choosing a kernel function of appropriate smoothness. A
two-dimensional local-linear batch SOM can be used to estimate a
principal surface by letting the number of both rows and columns in the
grid approach infinity. This connection between SOMs and principal curves
and surfaces shows that the choice of the number of codebook vectors is
not crucial, provided the number is fairly large.

If the final neighborhood size in a local-linear batch SOM is large, the
SOM approximates a subspace spanned by principal components--usually the
first principal component if the SOM is one-dimensional, the first two
principal components if the SOM is two-dimensional, and so on. This
result does not depend on the data having a multivariate normal
distribution.

Principal component analysis is a method of data compression, not a
statistical model. However, there is a related method called "common
factor analysis" that is often confused with principal component analysis
but is indeed a statistical model. Common factor analysis posits that the
relations among the observed variables can be explained by a smaller
number of unobserved, "latent" variables. Tibshirani (1992) has proposed
a latent-variable variant of principal curves, and latent-variable
modifications of SOMs have been introduced by Utsugi (1996, 1997) and
Bishop, Svensén, and Williams (1997).

The choice of the number of codebook vectors is usually not critical as
long as the number is fairly large. But results can be sensitive to the
shape of the grid, e.g., square or an elongated rectangle. And the
dimensionality of the grid is a crucial choice. It is difficult to guess
the appropriate shape and dimensionality before analyzing the data.
Determining the shape and dimensionality by trial and error can be quite
tedious. Hence, a variety of methods have been tried for growing SOMs and
related kinds of NNs during training. For more information on growing
SOMs, see Bernd Fritzke's home page at
http://pikas.inf.tu-dresden.de/~fritzke/

Using a 1-by-2 SOM is pointless. There is no "topological structure" in a
1-by-2 grid. A 1-by-2 SOM is essentially the same as VQ with two
clusters, except that the SOM clusters will be closer together than the
VQ clusters if the final neighborhood size for the SOM is large.

In a Kohonen SOM, as in VQ, it is necessary to reduce the learning rate
during training to obtain convergence. Greg Heath has commented in this
regard:

I favor separate learning rates for each winning SOM node (or k-means
cluster) in the form 1/(N_0i + N_i + 1), where N_i is the
count of vectors that have caused node i to be a winner and N_0i
is an initializing count that indicates the confidence in the initial
weight vector assignment. The winning node expression is based on
stochastic estimation convergence constraints and pseudo-Bayesian
estimation of mean vectors. Kohonen derived a heuristic recursion
relation for the "optimal" rate. To my surprise, when I solved the
recursion relation I obtained the same above expression that I've
been using for years.

In addition, I have had success using the similar form
(1/n)/(N_0j + N_j + (1/n)) for the n nodes in the
shrinking updating-neighborhood. Before the final "winners-only"
stage when neighbors are no longer updated, the number of updating
neighbors eventually shrinks to n = 6 or 8 for hexagonal or
rectangular neighborhoods, respectively.

Kohonen's neighbor-update formula is more precise replacing my
constant fraction (1/n) with a node-pair specific h_ij (h_ij
< 1). However, as long as the initial neighborhood is sufficiently
large, the shrinking rate is sufficiently slow, and the final
winner-only stage is sufficiently long, the results should be
relatively insensitive to exact form of h_ij.

Another advantage of batch SOMs is that there is no learning rate, so these
complications evaporate.

Kohonen (1995, p. VII) says that SOMs are not intended for pattern
recognition but for clustering, visualization, and abstraction. Kohonen has
used a "supervised SOM" (1995, pp. 160-161) that is similar to
counterpropagation (Hecht-Nielsen 1990), but he seems to prefer LVQ (see
below) for supervised classification. Many people continue to use SOMs for
classification tasks, sometimes with surprisingly (I am tempted to say
"inexplicably") good results (Cho, 1997).

o LVQ: Learning Vector Quantization--competitive networks for supervised
classification (Kohonen, 1988, 1995; Ripley, 1996). Each codebook vector is
assigned to one of the target classes. Each class may have one or more
codebook vectors. A case is classified by finding the nearest codebook
vector and assigning the case to the class corresponding to the codebook
vector. Hence LVQ is a kind of nearest-neighbor rule.

Ordinary VQ methods, such as Kohonen's VQ or k-means, can easily be used for
supervised classification. Simply count the number of training cases from
each class assigned to each cluster, and divide by the total number of cases
in the cluster to get the posterior probability. For a given case, output
the class with the greatest posterior probability--i.e. the class that forms
a majority in the nearest cluster. Such methods can provide universally
consistent classifiers (Devroye et al., 1996) even when the codebook vectors
are obtained by unsupervised algorithms. LVQ tries to improve on this
approach by adapting the codebook vectors in a supervised way. There are
several variants of LVQ--called LVQ1, OLVQ1, LVQ2, and LVQ3--based on
heuristics. However, a smoothed version of LVQ can be trained as a
feedforward network using a NRBFEQ architecture (see "How do MLPs compare
with RBFs?") and optimizing any of the usual error functions; as the width
of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ
network.

There are several other kinds of Kohonen networks described in Kohonen
(1995), including:

o DEC--Dynamically Expanding Context
o LSM--Learning Subspace Method
o ASSOM--Adaptive Subspace SOM
o FASSOM--Feedback-controlled Adaptive Subspace SOM
o Supervised SOM
o LVQ-SOM

More information on the error functions (or absence thereof) used by Kohonen
VQ and SOM is provided under "What does unsupervised learning learn?"

For more on-line information on Kohonen networks and other varieties of
SOMs, see:

o The web page of The Neural Networks Research Centre, Helsinki University
of Technology, at http://www.cis.hut.fi/research/
o The SOM of articles from comp.ai.neural-nets at
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html
o Akio Utsugi's web page on Bayesian SOMs at the National Institute of
Bioscience and Human-Technology, Agency of Industrial Science and
Technology, M.I.T.I., 1-1, Higashi, Tsukuba, Ibaraki, 305 Japan, at
http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html
o The GTM (generative topographic mapping) home page at the Neural
Computing Research Group, Aston University, Birmingham, UK, at
http://www.ncrg.aston.ac.uk/GTM/
o Nenet SOM software at http://www.mbnet.fi/~phodju/nenet/nenet.html
o Bernd Fritzke's home page at http://pikas.inf.tu-dresden.de/~fritzke/ has
information on growing SOMs and other related types of NNs

References:

Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:

Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
study of the classification capabilities of neural networks using
unsupervised learning: A comparison with k-means clustering",
Psychometrika, 59, 509-525.

Bishop, C.M., Svensén, M., and Williams, C.K.I (1997), "GTM: A principled
alternative to the self-organizing map," in Mozer, M.C., Jordan, M.I.,
and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambridge, MA: The MIT Press, pp. 354-360. Also see
http://www.ncrg.aston.ac.uk/GTM/

Bottou, L., and Bengio, Y. (1995), "Convergence properties of the K-Means
algorithms," in Tesauro, G., Touretzky, D., and Leen, T., (eds.)
Advances in Neural Information Processing Systems 7, Cambridge, MA: The
MIT Press, pp. 585-592.

Cho, S.-B. (1997), "Self-organizing map with dynamical node-splitting:
Application to handwritten digit recognition," Neural Computation, 9,
1345-1355.

Desieno, D. (1988), "Adding a conscience to competitive learning," Proc.
Int. Conf. on Neural Networks, I, 117-124, IEEE Press.

Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of
Pattern Recognition, NY: Springer,

Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency
versus interpretability," Biometric Society Meetings, Riverside, CA.
Abstract in Biomatrics, 21, 768.

Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
Compression, Boston: Kluwer Academic Publishers.

Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.

Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means
clustering algorithm," Applied Statistics, 28-100-108.

Hastie, T., and Stuetzle, W. (1989), "Principal curves," Journal of the
American Statistical Association, 84, 502-516.

Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.

Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering
utilizing hybrid search strategies," Pattern Recognition, 22, 75-89.

Kohonen, T (1984), Self-Organization and Associative Memory, Berlin:
Springer-Verlag.

Kohonen, T (1988), "Learning Vector Quantization," Neural Networks, 1
(suppl 1), 303.

Kohonen, T. (1995/1997), Self-Organizing Maps, Berlin: Springer-Verlag.
First edition was 1995, second edition 1997. See
http://www.cis.hut.fi/nnrc/new_book.html for information on the second
edition.

Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
N.J.: Prentice-Hall.

Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector
quantizer design," IEEE Transactions on Communications, 28, 84-95.

Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions
on Information Theory, 28, 129-137.

MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
Multivariate Observations,"Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, 1, 281-297.

Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on
Information Theory, 6, 7-12.

Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an iterative
kernel smoothing process," Neural Computation, 7, 1165-1177.

Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
Cambridge: Cambridge University Press.

Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation
and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley.

SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition,
Cary, NC: SAS Institute.

Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal
Mixtures," Biometrics, 37, 35-43.

Tibshirani, R. (1992), "Principal curves revisited," Statistics and
Computing, 2, 183-190.

Utsugi, A. (1996), "Topology selection for self-organizing maps,"
Network: Computation in Neural Systems, 7, 727-740, available on-line at
http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

Utsugi, A. (1997), "Hyperparameter selection for self-organizing maps,"
Neural Computation, 9, 623-635, available on-line at
http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman &
Hall.

Zador, P.L. (1982), "Asymptotic quantization error of continuous signals
and the quantization dimension," IEEE Transactions on Information Theory,
28, 139-149.

Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector
quantizer design by stochastic relaxation," IEEE Transactions on Signal
Procesing, 40, 310-322.

```

## User Contributions: Top Document: comp.ai.neural-nets FAQ, Part 1 of 7: Introduction
Previous Document: How many kinds of NNs exist?
Next Document: How are layers counted?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM