Top Document: comp.ai.neuralnets FAQ, Part 7 of 7: Hardware
Previous Document: What to do with missing/incomplete data?
Next Document: How to learn an inverse of a function?

In most of this FAQ, it is assumed that the training cases are statistically independent. That is, the training cases consist of pairs of input and target vectors, (X_i,Y_i), i=1,...,N, such that the conditional distribution of Y_i given all the other training data, (X_j, j=1,...,N, and Y_j, j=1,...,i-1,i+1,...,N), is equal to the conditional distribution of Y_i given X_i regardless of the values in the other training cases. Independence of cases is often achieved by random sampling.

The most common violation of the independence assumption occurs when cases are observed in a certain order relating to time or space. That is, case (X_i,Y_i) corresponds to time T_i, with T_1 < T_2 < ... < T_N. It is assumed that the current target Y_i may depend not only on X_i but also on (X_j,Y_j) for times j < i in the recent past. If the T_i are equally spaced, the simplest way to deal with this dependence is to include additional inputs (called lagged variables, shift registers, or a tapped delay line) in the network. Thus, for target Y_i, the inputs may include X_i, Y_{i-1}, X_{i-1}, Y_{i-2}, X_{i-2}, etc. (In some situations, X_i would not be known at the time you are trying to forecast Y_i and would therefore be excluded from the inputs.) Then you can train an ordinary feedforward network with these targets and lagged variables.

The use of lagged variables has been extensively studied in the statistical and econometric literature (Judge, Griffiths, Hill, Lütkepohl and Lee, 1985). A network in which the only inputs are lagged target values is called an "autoregressive model." The input space that includes all of the lagged variables is called the "embedding space."

If the T_i are not equally spaced, everything gets much more complicated.
One approach is to use a smoothing technique to interpolate points at equally spaced intervals, and then use the interpolated values for training instead of the original data.

Use of lagged variables increases the number of decisions that must be made during training, since you must consider which lags to include in the network, as well as which input variables, how many hidden units, etc. Neural network researchers have therefore attempted to use partially recurrent networks instead of feedforward networks with lags (Weigend and Gershenfeld, 1994). Recurrent networks store information about past values in the network itself. There are many different kinds of recurrent architectures (Hertz, Krogh, and Palmer 1991; Mozer, 1994; Horne and Giles, 1995; Kremer, 199?). For example, in time-delay neural networks (Lang, Waibel, and Hinton 1990), the outputs for predicting target Y_{i-1} are used as inputs when processing target Y_i. Jordan networks (Jordan, 1986) are similar to time-delay neural networks except that the feedback is an exponential smooth of the sequence of output values. In Elman networks (Elman, 1990), the hidden unit activations that occur when processing target Y_{i-1} are used as inputs when processing target Y_i.

However, there are some problems that cannot be dealt with via recurrent networks alone. For example, many time series exhibit trend, meaning that the target values tend to go up over time or tend to go down over time. Stock prices and many other financial variables, for instance, usually go up. If today's price is higher than all previous prices, and you try to forecast tomorrow's price using today's price as a lagged input, you are extrapolating, and extrapolating is unreliable.

The simplest methods for handling trend are:

 o First fit a linear regression predicting the target values from the time, Y_i = a + b T_i + noise, where a and b are regression weights. Compute residuals R_i = Y_i - (a + b T_i).
   Then train the network using R_i for the target and lagged values. This method is rather crude but may work for deterministic linear trends. Of course, for nonlinear trends, you would need to fit a nonlinear regression.

 o Instead of using Y_i as a target, use D_i = Y_i - Y_{i-1} for the target and lagged values. This is called differencing and is the standard statistical method for handling nondeterministic (stochastic) trends. Sometimes it is necessary to compute differences of differences.

For an elementary discussion of trend and various other practical problems in forecasting time series with NNs, such as seasonality, see Masters (1993). For a more advanced discussion of NN forecasting of economic series, see Moody (1998).

There are several different ways to compute forecasts. For simplicity, let's assume you have a simple time series, Y_1, ..., Y_99, you want to forecast future values Y_f for f > 99, and you decide to use three lagged values as inputs. The possibilities include:

Single-step, one-step-ahead, or open-loop forecasting:
   Train a network with target Y_i and inputs Y_{i-1}, Y_{i-2}, and Y_{i-3}. Let the scalar function computed by the network be designated as Net(.,.,.), taking the three input values as arguments and returning the output (predicted) value. Then:

      forecast Y_100 as Net(Y_99,Y_98,Y_97)
      forecast Y_101 as Net(Y_100,Y_99,Y_98)
      forecast Y_102 as Net(Y_101,Y_100,Y_99)
      forecast Y_103 as Net(Y_102,Y_101,Y_100)
      forecast Y_104 as Net(Y_103,Y_102,Y_101)

   and so on.

Multistep or closed-loop forecasting:
   Train the network as above, but:

      forecast Y_100 as P_100 = Net(Y_99,Y_98,Y_97)
      forecast Y_101 as P_101 = Net(P_100,Y_99,Y_98)
      forecast Y_102 as P_102 = Net(P_101,P_100,Y_99)
      forecast Y_103 as P_103 = Net(P_102,P_101,P_100)
      forecast Y_104 as P_104 = Net(P_103,P_102,P_101)

   and so on.

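The lagged-input setup and the two forecasting schemes just described can be illustrated with a minimal Python sketch. Nothing in it comes from the FAQ itself: the function names (make_lagged, fit_linear_net, open_loop, closed_loop) are hypothetical, NumPy is assumed to be available, the synthetic series is invented for the example, and an ordinary least-squares linear model stands in for the trained network Net(.,.,.) purely to keep the example short. Any fitted model with the same call signature could be substituted.

```python
import numpy as np


def make_lagged(y, n_lags=3):
    """Build inputs and targets; the row for target y[i] holds y[i-1], ..., y[i-n_lags]."""
    y = np.asarray(y, dtype=float)
    cols = [y[n_lags - k:len(y) - k] for k in range(1, n_lags + 1)]
    return np.column_stack(cols), y[n_lags:]


def fit_linear_net(inputs, targets):
    """Least-squares linear model with intercept; a stand-in for a trained network."""
    design = np.column_stack([np.ones(len(inputs)), inputs])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)

    def net(*lags):
        # Arguments are ordered most recent first, like Net(Y_99, Y_98, Y_97).
        return float(coef[0] + np.dot(coef[1:], lags))

    return net


def open_loop(net, y, start, n_lags=3):
    """Single-step forecasts of y[start:], each computed from actual past values."""
    return [net(*(y[t - k] for k in range(1, n_lags + 1)))
            for t in range(start, len(y))]


def closed_loop(net, history, steps, n_lags=3):
    """Multistep forecasts: each prediction is fed back in as a lagged input."""
    window = [float(v) for v in history[-n_lags:]]  # oldest value first
    preds = []
    for _ in range(steps):
        p = net(*reversed(window))   # most recent value first
        preds.append(p)
        window = window[1:] + [p]    # drop the oldest value, append the prediction
    return preds


# Invented example data: a noisy sine wave observed at times 1..104.
rng = np.random.default_rng(0)
t = np.arange(1, 105)
series = np.sin(0.3 * t) + 0.05 * rng.standard_normal(t.size)
train = series[:99]                              # Y_1, ..., Y_99

net = fit_linear_net(*make_lagged(train, n_lags=3))
single_step = open_loop(net, series, start=99)   # needs actual Y_100, Y_101, ... as inputs
multi_step = closed_loop(net, train, steps=5)    # uses only Y_1, ..., Y_99
print(single_step)
print(multi_step)
```

Both lists hold forecasts of Y_100 through Y_104. Their first entries agree, because both are Net(Y_99,Y_98,Y_97); from the second entry on, the closed-loop forecasts depend on earlier predictions rather than on observed values.
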
N-step-ahead forecasting:
   For, say, N=3, train the network as above, but:

      compute P_100 = Net(Y_99,Y_98,Y_97)
      compute P_101 = Net(P_100,Y_99,Y_98)
      forecast Y_102 as P_102 = Net(P_101,P_100,Y_99)
      forecast Y_103 as P_103 = Net(P_102,P_101,Y_100)
      forecast Y_104 as P_104 = Net(P_103,P_102,Y_101)

   and so on.

Direct simultaneous long-term forecasting:
   Train a network with multiple targets Y_i, Y_{i+1}, and Y_{i+2} and inputs Y_{i-1}, Y_{i-2}, and Y_{i-3}. Let the vector function computed by the network be designated as Net3(.,.,.), taking the three input values as arguments and returning the output (predicted) vector. Then:

      forecast (Y_100,Y_101,Y_102) as Net3(Y_99,Y_98,Y_97)

Which method you choose for computing forecasts will obviously depend in part on the requirements of your application. If you have yearly sales figures through 1999 and you need to forecast sales in 2003, you clearly can't use single-step forecasting. If you need to compute forecasts at a thousand different future times, using direct simultaneous long-term forecasting would require an extremely large network.

If a time series is a random walk, a well-trained network will predict Y_i by simply outputting Y_{i-1}. If you make a plot showing both the target values and the outputs, the two curves will almost coincide, except for being offset by one time step. People often mistakenly interpret such a plot to indicate good forecasting accuracy, whereas in fact the network is virtually useless. In such situations, it is more enlightening to plot multistep forecasts or N-step-ahead forecasts.

For general information on time-series forecasting, see the following URLs:

 o Forecasting FAQs: http://forecasting.cwru.edu/faqs.html
 o Forecasting Principles: http://hops.wharton.upenn.edu/forecast/
 o Investment forecasts for stocks and mutual funds: http://www.coe.uncc.edu/~hphillip/

References:

   Elman, J.L. (1990), "Finding structure in time," Cognitive Science, 14, 179-211.

   Hertz, J., Krogh, A., and Palmer, R. (1991).
   Introduction to the Theory of Neural Computation. Addison-Wesley: Redwood City, California.

   Horne, B. G. and Giles, C. L. (1995), "An experimental comparison of recurrent neural networks," In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pp. 697-704. The MIT Press.

   Jordan, M. I. (1986), "Attractor dynamics and parallelism in a connectionist sequential machine," In Proceedings of the Eighth Annual conference of the Cognitive Science Society, pages 531-546. Lawrence Erlbaum.

   Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., and Lee, T.-C. (1985), The Theory and Practice of Econometrics, NY: John Wiley & Sons.

   Kremer, S.C. (199?), "Spatio-temporal Connectionist Networks: A Taxonomy and Review," http://hebb.cis.uoguelph.ca/~skremer/Teaching/27642/dynamic2/review.html.

   Lang, K. J., Waibel, A. H., and Hinton, G. (1990), "A time-delay neural network architecture for isolated word recognition," Neural Networks, 3, 23-44.

   Masters, T. (1993). Practical Neural Network Recipes in C++, San Diego: Academic Press.

   Moody, J. (1998), "Forecasting the economy with neural nets: A survey of challenges and solutions," in Orr, G.B., and Mueller, K.-R., eds., Neural Networks: Tricks of the Trade, Berlin: Springer.

   Mozer, M.C. (1994), "Neural net architectures for temporal sequence processing," in Weigend, A.S. and Gershenfeld, N.A., eds. (1994) Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley, 243-264, http://www.cs.colorado.edu/~mozer/papers/timeseries.html.

   Weigend, A.S. and Gershenfeld, N.A., eds. (1994) Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
