Publication Date:
2006
abstract:
In a wide variety of real-world situations a functional dependence y = g(x) has to be estimated
from a set of observations (x^L, y^L) = {(x_l, y_l), l = 0, …, L − 1} concerning a phenomenon of
interest. This is the case when the behavior of a continuous signal has to be forecast starting
from its previous history, or when the value of an unmeasurable quantity has to be inferred
from the measurements of other related variables.
If an insufficient amount of a priori information about the form of the functional dependence g
is available, its estimation involves two different actions:
1. first, a sufficiently large class Γ of functions must be properly selected (model selection);
2. then, the best element g ∈ Γ must be retrieved by adopting a suitable optimization
algorithm (training phase).
The model selection task is usually performed by adopting a very general paradigm whose
complexity can be controlled by acting on a small number of constant values. For example, the
usual polynomial series expansion can approximate arbitrarily well every measurable function,
i.e., polynomials are universal approximators. However, by including in Γ only the functions
whose polynomial series expansion contains no terms with exponent greater than a
prescribed maximum k, we can control the richness of the class Γ. In particular, if k = 1 only
linear functions are included in Γ; if k = 2 the expansion can realize only linear and quadratic
functions; and so on.
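As a minimal sketch of this idea (an illustration, not taken from the chapter: the target g(x) = sin(x), the sample size, and all variable names are assumptions), the snippet below fits nested polynomial classes of increasing degree k and shows that the training error can only decrease as the class Γ is enlarged.

```python
import numpy as np

# Hypothetical illustration: the degree bound k controls the richness of
# the model class Gamma.  Observations come from an (unknown) target
# g(x) = sin(x), sampled at 20 points on [0, pi].
x = np.linspace(0.0, np.pi, 20)
y = np.sin(x)

# k = 1 restricts Gamma to linear functions; k = 2 adds quadratics, etc.
for k in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=k)                 # least-squares fit within Gamma_k
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {k}: training error {residual:.6f}")
```

Because the classes are nested (Γ_1 ⊂ Γ_2 ⊂ Γ_5), the least-squares training error is non-increasing in k; this is exactly the sense in which k controls the richness of Γ.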
Other general paradigms have been extensively used for model selection: neural networks have
been shown to possess the universal approximation property [1, 2, 3, 4] and have been
successfully applied in many different fields. In this case, the complexity of the class Γ can be
controlled by acting on the architecture of the network (the number of layers and the number
of neurons in the feedforward structure).
Once the class Γ has been chosen, the optimization algorithm to be employed in the training
phase is selected accordingly. For example, the back-propagation technique (and its
modifications) is often adopted to retrieve the function in Γ that best fits the collection
(x^L, y^L) of observations at our disposal, usually called the training set.
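The training phase can be sketched with gradient descent, the idea underlying back-propagation. As an assumption for illustration (not the chapter's method), Γ here is the class of linear functions g(x) = w·x + b, fitted to noisy observations by minimizing the mean squared error on the training set; backprop applies the same gradient idea layer by layer in a network.

```python
import numpy as np

# Sketch of a training phase: gradient descent on the mean squared error,
# retrieving the element of Gamma that best fits the training set.
# Gamma = linear functions g(x) = w*x + b (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x - 0.5 + rng.normal(scale=0.05, size=50)   # noisy observations

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y                  # residuals on the training set
    w -= lr * 2.0 * np.mean(err * x)     # gradient of the MSE w.r.t. w
    b -= lr * 2.0 * np.mean(err)         # gradient of the MSE w.r.t. b
print(f"estimated w = {w:.2f}, b = {b:.2f}")
```

With this learning rate the iteration converges to (approximately) the least-squares solution, i.e., the best element of Γ for this training set.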
However, the basic goal to be pursued is to obtain a function g that generalizes well, i.e., that
behaves correctly also at points of the domain not included in the training set. How can it be
guaranteed that the element of Γ that best fits our observations also generalizes well? This is a
fundamental question in the context of learning theory.
Since in many practical situations the input vectors x_l cannot be freely chosen and the
training set (x^L, y^L) can be corrupted by noise, most results in learning theory are based on a
statistical framework, which arose in the pattern recognition community [5, 6, 7, 8] and has
been naturally extended to other inductive problems, like regression estimation [9, 10, 11] and
probability density reconstruction [10]. In this framework, called Statistical Learning (SL), the
input vectors x_l, for l = 0, …, L − 1, are viewed as realizations of a random variable,
generated according to an unknown (but fixed) probability density p.
On the other hand, there are several cases where the position of the points x_l in the input
space can be suitably selected for the problem at hand. If a deterministic algorithm is
employed to choose the input vectors x_l, SL is no longer the most appropriate approach. In this
case a new framework, called Deterministic Learning (DL), captures the peculiarities of the
situation at hand, thus providing precise conditions on the generalization ability of the
function g ∈ Γ that best fits the observations of the training set (x^L, y^L).
This chapter pres
Iris type:
02.01 Contribution in a volume (Chapter or Essay)
List of contributors:
Cervellera, Cristiano; Muselli, Marco
Published in: