Object-oriented implementation of the backpropagation algorithm.
Ismail, Ismail A.
The Error-Backpropagation (or simply, backpropagation) algorithm is
the most important algorithm for the supervised training of multilayer
feed-forward Artificial Neural Networks (ANN). It derives its name from
the fact that error signals are propagated backward through the network
on a layer-by-layer basis. In this article we use an object-oriented
approach for implementing the backpropagation algorithm.
**********
The backpropagation algorithm is based on the selection of a
suitable error function or cost function, whose values are determined by
the actual and desired outputs of the network, and which is also
dependent on the network parameters such as the weights and the
thresholds (Jacek, 1995). The basic idea is that the cost function defines a
surface over the weight space, and therefore an iterative minimization
process such as the gradient descent method can be used to minimize it.
The method of gradient descent is based on the fact that the gradient of a
function always points in the direction of its maximum increase; moving in
the direction of the negative gradient therefore produces a maximal
"downhill" movement that eventually reaches a minimum of the function
surface over its parameter space. This is a rigorous and well-established
technique for the minimization of functions and has probably been the main
factor behind the success of backpropagation.
A typical multi-layer feed-forward ANN is shown in Figure 1. This
type of network is also known as a Multilayer Perceptron (MLP). The
units (or nodes) of the network are nonlinear threshold units described
by equations (1) and (2) and their activation function is given by
equation (3):
u = \sum_{i=0}^{N} w_i x_i    (1)

y = f\left( \sum_{i=0}^{N} w_i x_i \right)    (2)

f(u) = \frac{1}{1 + \exp(-\alpha u)}    (3)

where x_1, x_2, \dots, x_N are the input signals, w_1, w_2, \dots, w_N are
the synaptic weights, u is the activation potential of the neuron, and
\alpha is the slope parameter of the sigmoid in (3).
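To make equations (1)-(3) concrete in the language used later in this
article, the following minimal C++ sketch computes the activation potential
and the sigmoid output of a single unit. The function names sigmoid and
unit_output and the default slope value are illustrative and are not taken
from the system's actual source code.

#include <cmath>
#include <cstddef>
#include <vector>

// Sigmoid activation of equation (3); alpha is the slope parameter.
float sigmoid(float u, float alpha = 1.0f) {
    return 1.0f / (1.0f + std::exp(-alpha * u));
}

// Output of one unit, equations (1) and (2):
// u = sum_i w_i * x_i,  y = f(u).
float unit_output(const std::vector<float>& w,
                  const std::vector<float>& x,
                  float alpha = 1.0f) {
    float u = 0.0f;
    for (std::size_t i = 0; i < w.size() && i < x.size(); ++i)
        u += w[i] * x[i];
    return sigmoid(u, alpha);
}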
The units are arranged in layers and each unit in a layer has all
its inputs connected to the units of a preceding layer (or to the inputs
from the external world in the case of the units in the first layer),
but it does not have any connections to units of the same layer to which
it belongs. The layers are arrayed one succeeding the other so that
there is an input layer, multiple intermediate layers and finally an
output layer. Intermediate layers, that is, those with no direct inputs
from or outputs to the external world, are called hidden layers. Figure 1
shows an MLP with only one hidden layer. Backpropagation neural networks are
usually fully connected. This means that each unit is connected to every
output from the preceding layer (or to every input from the external
world if the unit is in the first layer). Generally, the input layer is
considered as just a distributor of the signals from the external world
and is not therefore counted as a layer (Lefteri & Robert, 1997).
The backpropagation training consists of two passes of computation:
a forward pass and a backward pass (Jacek, 1995). In the forward pass an
input pattern vector is applied to the sensory nodes of the network, that
is, to the units in the input layer. The signals from the input layer
propagate to the units in the first hidden layer, and each unit produces an
output according to equation (2). The outputs of these units are
propagated to units in subsequent layers and this process continues
until, finally, the signals reach the output layer where the actual
response of the network to the input vector is obtained. During the
forward pass the synaptic weights of the network are fixed. During the
backward pass, on the other hand, the synaptic weights are all adjusted
in accordance with an error signal which is propagated backward through
the network against the direction of synaptic connections.
THE MATHEMATICAL ANALYSIS OF THE BACKPROPAGATION ALGORITHM
In the forward pass of computation, given an input pattern vector y^{(p)},
each hidden node j receives a net input:

x_j^{(p)} = \sum_k w_{jk} y_k^{(p)}    (4)

where w_{jk} represents the weight between hidden node j and input node k.
Node j thus produces the output:

y_j^{(p)} = f(x_j^{(p)}) = f\left( \sum_k w_{jk} y_k^{(p)} \right)    (5)

Each output node i thus receives:

x_i^{(p)} = \sum_j w_{ij} y_j^{(p)} = \sum_j w_{ij} f\left( \sum_k w_{jk} y_k^{(p)} \right)    (6)

where w_{ij} represents the weight between output node i and hidden node j.
Hence it produces the final output:

y_i^{(p)} = f(x_i^{(p)}) = f\left( \sum_j w_{ij} y_j^{(p)} \right) = f\left( \sum_j w_{ij} f\left( \sum_k w_{jk} y_k^{(p)} \right) \right)    (7)
The backpropagation algorithm can be implemented in two different
modes: on-line mode and batch mode. In the on-line mode the error
function is calculated after the presentation of each input pattern and
the error signal is propagated back through the network modifying the
weights before the presentation of the next pattern. This error function
is usually the Mean Square Error (MSE) of the difference between the
desired and the actual responses of the network over all the output
units. Then the new weights remain fixed, a new pattern is presented to the
network, and this process continues until all the patterns have been
presented. The presentation of all the patterns is usually called one epoch
or one iteration. In practice many epochs are needed before the error
becomes acceptably small.
In the batch mode the error signal is calculated for each input pattern,
but the weights are not modified every time an input pattern is presented.
Instead, the error function is calculated as the sum of the individual MSE
errors for each pattern, and the weights are modified accordingly (in a
single step for all the patterns) before the next iteration. Thus, in the
batch mode, the error or cost function, calculated as the MSE over all
output units i and over all patterns p, is given by:
E = \frac{1}{2} \sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} )^2    (8)
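A direct C++ transcription of this cost function might look as follows; the
container layout (one vector of outputs per pattern) and the function name
cost_function are illustrative assumptions rather than the system's own code.

#include <cstddef>
#include <vector>

// Cost function of equation (8): half the squared error summed over all
// output units i and all patterns p. d[p][i] is the desired output and
// y[p][i] the actual output of unit i for pattern p.
float cost_function(const std::vector<std::vector<float>>& d,
                    const std::vector<std::vector<float>>& y) {
    float E = 0.0f;
    for (std::size_t p = 0; p < d.size(); ++p)
        for (std::size_t i = 0; i < d[p].size(); ++i) {
            float e = d[p][i] - y[p][i];
            E += 0.5f * e * e;
        }
    return E;
}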
Clearly E is a differentiable function of all the weights and
therefore we can apply the method of gradient descent. For the
hidden-to-output connections the gradient descent rule gives:
\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}    (9)

where \eta is a constant that determines the rate of learning; it is called
the learning rate of the backpropagation algorithm. Using the chain rule we
have:
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i^{(p)}} \cdot \frac{\partial y_i^{(p)}}{\partial w_{ij}} , \qquad \frac{\partial y_i^{(p)}}{\partial w_{ij}} = \frac{\partial y_i^{(p)}}{\partial x_i^{(p)}} \cdot \frac{\partial x_i^{(p)}}{\partial w_{ij}}    (10)

Giving:

\frac{\partial E}{\partial w_{ij}} = -\sum_p ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)}) y_j^{(p)}    (11)

Thus:

\Delta w_{ij} = \eta \sum_p ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)}) y_j^{(p)} = \eta \sum_p \delta_i^{(p)} y_j^{(p)}    (12)

where:

\delta_i^{(p)} = ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)})    (13)
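A minimal C++ sketch of equations (12) and (13) in their on-line form (a
single pattern, so the summation over p disappears) is given below; the
weight layout w[i][j] and all parameter names are illustrative assumptions,
not the system's actual data structures.

#include <cstddef>
#include <vector>

// Output-layer learning step for one pattern (equations 12 and 13):
// delta_i = (d_i - y_i) * f'(x_i), then w_ij += eta * delta_i * y_j.
// fprime[i] holds f'(x_i); y_hidden[j] holds the hidden outputs y_j.
void update_output_weights(std::vector<std::vector<float>>& w,   // w[i][j]
                           const std::vector<float>& d,          // desired d_i
                           const std::vector<float>& y_out,      // actual y_i
                           const std::vector<float>& fprime,     // f'(x_i)
                           const std::vector<float>& y_hidden,   // y_j
                           std::vector<float>& delta_out,        // delta_i, reused by eq. (20)
                           float eta) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        delta_out[i] = (d[i] - y_out[i]) * fprime[i];             // equation (13)
        for (std::size_t j = 0; j < w[i].size(); ++j)
            w[i][j] += eta * delta_out[i] * y_hidden[j];          // equation (12)
    }
}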
For the input-to-hidden connections the gradient descent rule
gives:
\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}    (14)
Using the chain rule we obtain:
\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_j^{(p)}} \cdot \frac{\partial y_j^{(p)}}{\partial w_{jk}} , \qquad \frac{\partial y_j^{(p)}}{\partial w_{jk}} = \frac{\partial y_j^{(p)}}{\partial x_j^{(p)}} \cdot \frac{\partial x_j^{(p)}}{\partial w_{jk}}    (15)

Giving:

\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_j^{(p)}} f'(x_j^{(p)}) y_k^{(p)}    (16)

Now for \partial E / \partial y_j^{(p)} we have:

\frac{\partial E}{\partial y_j^{(p)}} = -\sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} ) \frac{\partial f(x_i^{(p)})}{\partial y_j^{(p)}}
= -\sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} ) \frac{\partial f(x_i^{(p)})}{\partial x_i^{(p)}} \cdot \frac{\partial x_i^{(p)}}{\partial y_j^{(p)}}
= -\sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)}) w_{ij}    (17)

Thus:

\frac{\partial E}{\partial w_{jk}} = -\sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)}) w_{ij} f'(x_j^{(p)}) y_k^{(p)}    (18)

which gives:

\Delta w_{jk} = \eta \sum_p \sum_i ( d_i^{(p)} - y_i^{(p)} ) f'(x_i^{(p)}) w_{ij} f'(x_j^{(p)}) y_k^{(p)} = \eta \sum_p \delta_j^{(p)} y_k^{(p)}    (19)

with:

\delta_j^{(p)} = f'(x_j^{(p)}) \sum_i \delta_i^{(p)} w_{ij}    (20)
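A matching sketch of equations (19) and (20), again for a single pattern,
is shown below; delta_out is the vector computed by equation (13) in the
previous sketch, and the remaining names and weight layouts are illustrative
assumptions.

#include <cstddef>
#include <vector>

// Hidden-layer learning step for one pattern (equations 19 and 20):
// delta_j = f'(x_j) * sum_i delta_i * w_ij, then w_jk += eta * delta_j * y_k.
void update_hidden_weights(std::vector<std::vector<float>>& w_jk,        // hidden weights w[j][k]
                           const std::vector<std::vector<float>>& w_ij,  // output weights w[i][j]
                           const std::vector<float>& delta_out,          // delta_i from equation (13)
                           const std::vector<float>& fprime_hidden,      // f'(x_j)
                           const std::vector<float>& y_in,               // inputs y_k
                           float eta) {
    for (std::size_t j = 0; j < w_jk.size(); ++j) {
        float back = 0.0f;
        for (std::size_t i = 0; i < w_ij.size(); ++i)
            back += delta_out[i] * w_ij[i][j];               // sum_i delta_i * w_ij
        float delta_j = fprime_hidden[j] * back;             // equation (20)
        for (std::size_t k = 0; k < w_jk[j].size(); ++k)
            w_jk[j][k] += eta * delta_j * y_in[k];           // equation (19)
    }
}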
From equations (11) and (18) we can see that if the activation function
were not differentiable then we would be unable to implement the
gradient-descent rule, as it would be impossible to calculate the partial
derivatives of E with respect to the weights. It is for this reason that
differentiability of the activation function is so important in
backpropagation learning. Note also that the derivative of the sigmoid
function is very easy to compute, since:

f'(u) = f(u) [1 - f(u)]    (21)

This means that it is not necessary to compute f'(u) separately once we
have found f(u). This is one of the advantages of the sigmoid function, as
computation time can be reduced significantly.
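This shortcut can be expressed in a single line of C++, assuming the slope
parameter \alpha = 1 as in equation (21); the function name is illustrative.

// Given y = f(u) already computed in the forward pass, the derivative of
// the sigmoid (equation 21) is obtained without re-evaluating exp().
float sigmoid_derivative_from_output(float y) {
    return y * (1.0f - y);   // f'(u) = f(u) * [1 - f(u)]
}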
Although we have written the update rules for the batch mode of training,
it is clear that the weight updates in the on-line case would again be
given by equations (12) and (19), without, of course, the summation over
the training patterns. Note that equation (19) has the same form as
equation (12) but with a different definition of the \delta's. In general,
with an arbitrary number of layers, the backpropagation update rule always
has the form:

\Delta w_{lm} = \eta \delta_l y_m    (23)

where \Delta w_{lm} is the weight correction, \eta is the learning rate,
\delta_l is the local gradient at node l, and y_m is the input signal to
node l (that is, the output of node m). This general rule for the
adaptation of the weights is also known as the generalized delta rule.
IMPLEMENTATION OF THE BACKPROPAGATION ALGORITHM
We have used C++ as the object-oriented programming language for
implementing the backpropagation algorithm. Object-oriented programming
(OOP) is a programming paradigm in which a program is made up of a collection of
independent "objects" which perform the actions of a program
by interacting with each other (Lafore, 1999). An object is typically a
collection of related data and routines. Our implementation makes
extensive use of two important principles of OOP: encapsulation and
abstraction (Mohamed & Laitinen, 1998).
Encapsulation describes the notion that all of the inner workings of an
object are hidden from other objects. OOP allows for information hiding,
which means that the inner workings of an object can literally be protected
from other objects. An object can be thought of as a "black box," meaning
that other objects do not know, and do not need to know, how the object
performs its duty. By providing an "interface" to the outside world, an
object describes how it is to be communicated with. This separation of the
inner workings from the interface also allows for abstraction.
Abstraction is a programming technique where the essential behavior
of an object is separated from its specific purpose. This allows for
efficiency when dealing with similar objects because common behavior can
be shared through "inheritance." Another benefit of OOP is
that, once created, objects are easy to reuse for other applications.
Our implementation of the back-propagation algorithm has the following
design objectives:
* allow the user to specify the number and size of all layers;
* allow the use of one or more hidden layers;
* be able to save and restore the state of the network;
* display key information at the end of the simulation; and
* draw the network as specified by the user.
C++ Classes and Class Hierarchy
This section describes, in brief, the C++ classes of our system
that implement the backpropagation algorithm. In this system we use a
class hierarchy with the inheritance feature. Also, we use polymorphism with dynamic binding and function overloading with static binding. First
let us look at the class hierarchy used for this program (Figure 2 shows
the major classes using UML notation) (Grady, James, & Ivar, 1999).
An abstract class is a class that is never meant to be instantiated as an
object, but serves as a base class from which others can inherit
functionality and interface definitions. The CBaseLayer class is such a
class. One of its functions, calc_out(), is declared as a pure virtual
function by setting it equal to zero (see Appendix A), which indicates that
this class is an abstract base class. From the CBaseLayer class descend two
branches. One is the CInputLayer class, and the other is the COutputLayer
class. The middle layer class CMiddleLayer is very much like the output
layer in function and so inherits from the COutputLayer class.
The CNNtwork class is used to set up communication channels between layers
and to feed and remove data from the network. The CNNtwork class performs
the interconnection of layers by setting a given layer's input-array
pointer to the previous layer's output array. Another connection that the
CNNtwork class is responsible for is setting the pointer of an output_error
array to the back_error array of the next layer (remember, errors flow in
reverse, and the back_error array is the output error of the layer
reflected at its inputs). The CNNtwork class stores an array of pointers to
layers and an array of layer sizes for all the layers defined. These layer
objects and arrays are dynamically allocated on the heap with the new and
delete operators in C++.
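The fragment below is a minimal sketch of this interconnection idea, using
a simplified stand-in structure rather than the actual Appendix A classes;
the names Layer and connect_layers, and the direct member access, are
illustrative assumptions.

#include <cstddef>

// Simplified stand-in showing only the pointers relevant to the wiring.
struct Layer {
    float* inputs;         // points into the previous layer's outputs
    float* outputs;        // output array owned by this layer
    float* output_errors;  // points into the next layer's back_errors
    float* back_errors;    // back-propagated errors owned by this layer
};

// Chain the layers: each layer reads its inputs directly from the previous
// layer's output array (forward path), and each layer's output errors are
// the next layer's back-propagated errors (backward path), so no copying
// is needed in either direction.
void connect_layers(Layer* layers[], std::size_t num_layers) {
    for (std::size_t i = 1; i < num_layers; ++i) {
        layers[i]->inputs = layers[i - 1]->outputs;
        layers[i - 1]->output_errors = layers[i]->back_errors;
    }
}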
Details in Implementation of the Backpropagation Algorithm
Function overloading can be seen in the definition of the calc_error()
function in the CBaseLayer class hierarchy (see Appendix A). It is used in
the CMiddleLayer class with no parameters, while it is used in the
COutputLayer class (from which CMiddleLayer inherits) with one parameter.
Using the same function name is not a problem, and this is referred to as
overloading. Besides function overloading, you may also have operator
overloading, which is using an operator that performs some familiar
function, like + for addition, for another function, say, vector addition.
When a function with the same name and parameters is declared with the
keyword virtual in a base class and redefined in a derived class, you have
the potential for dynamic binding, which means that the version of the
function to execute is determined at run time and not at compile time.
Compile-time binding is referred to as static binding. If you put a number
of C++ objects in an array of pointers to the base class, and then go
through a loop that indexes each pointer and calls a virtual function
through it, then you will be using dynamic binding. This is exactly the
case with the function calc_out(), which is declared with the virtual
keyword in the CBaseLayer base class. Each descendant of CBaseLayer can
provide a version of calc_out() that differs in functionality from the base
class, and the correct function will be selected at run time based on the
object's identity. In this case calc_out(), which is a function to
calculate the outputs for each layer, is different for the input layer
than for the other two types of layers.
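The following self-contained sketch illustrates the dynamic-binding
mechanism just described; the *Demo classes are simplified stand-ins for
the Appendix A hierarchy, not the system's actual code.

#include <cstdio>

// A pure virtual calc_out(), as in CBaseLayer, forces each derived layer
// to supply its own version.
class CBaseLayerDemo {
public:
    virtual void calc_out() = 0;
    virtual ~CBaseLayerDemo() {}
};

class CInputLayerDemo : public CBaseLayerDemo {
public:
    void calc_out() override { std::puts("input layer: pass the stored inputs through"); }
};

class COutputLayerDemo : public CBaseLayerDemo {
public:
    void calc_out() override { std::puts("output layer: weighted sums followed by the sigmoid"); }
};

int main() {
    CInputLayerDemo in;
    COutputLayerDemo out;
    CBaseLayerDemo* layers[] = { &in, &out };
    // The version of calc_out() executed is chosen at run time, based on
    // the actual type of the object each base-class pointer refers to.
    for (CBaseLayerDemo* layer : layers)
        layer->calc_out();
    return 0;
}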
Note the following definitions in the CBaseLayer base class:

int num_inputs;
int num_outputs;
float* outputs;   // pointer to array of outputs
float* inputs;    // pointer to array of inputs, which
                  // are outputs of some other layer
friend class CNNtwork;
There are also two pointers to arrays of floats in this class. They
are the pointers to the outputs in a given layer and the inputs to a
given layer. To get a better idea of what a layer encompasses, Figure 3
shows a small feed-forward backpropagation network, with a dotted line
marking the three layers of the network.
A layer contains neurons and weights. The layer is responsible for
calculating its output (calc_out()), stored in the float* outputs array,
and its errors (calc_error()) for each of its respective neurons. The
errors are stored in another array, float* output_errors, defined in the
COutputLayer class. Note that the CInputLayer class does
not have any weights associated with it and therefore is a special case.
It does not need to provide any data members or function members related
to errors or backpropagation. The only purpose of the input layer is to
store data to be forward propagated to the next layer.
With the output layer, there are a few more arrays present. First, for
storing backpropagation errors, there is an array called float*
back_errors. There is a weights array called float* weights, and finally,
for storing the expected values that initiate the error calculation
process, there is an array called float* expected_values. Note that the
middle layer needs almost all of these arrays and inherits them by being a
derived class of the COutputLayer class.
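For orientation, one plausible and deliberately simplified realization of
what calc_out() and calc_error() do for an output-style layer is sketched
below, written against free-standing arrays rather than the actual class
members. It follows equations (2), (13), and (21); the function names and
the flat weight layout are assumptions, not the system's implementation.

#include <cmath>
#include <cstddef>

// Forward pass for one output-style layer (equation 2 with a sigmoid):
// outputs[i] = f( sum_j weights[i*num_inputs + j] * inputs[j] ).
void layer_calc_out(const float* inputs, std::size_t num_inputs,
                    const float* weights, float* outputs,
                    std::size_t num_outputs) {
    for (std::size_t i = 0; i < num_outputs; ++i) {
        float u = 0.0f;
        for (std::size_t j = 0; j < num_inputs; ++j)
            u += weights[i * num_inputs + j] * inputs[j];
        outputs[i] = 1.0f / (1.0f + std::exp(-u));
    }
}

// Error at the output layer (equations 13 and 21):
// output_errors[i] = (expected - actual) * f'(x), with f'(x) = y(1 - y).
// total_error accumulates the squared error used for reporting.
void layer_calc_error(const float* expected_values, const float* outputs,
                      float* output_errors, std::size_t num_outputs,
                      float& total_error) {
    for (std::size_t i = 0; i < num_outputs; ++i) {
        float e = expected_values[i] - outputs[i];
        output_errors[i] = e * outputs[i] * (1.0f - outputs[i]);
        total_error += 0.5f * e * e;
    }
}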
SYSTEM MODES
There are two modes of operation in our system: Training Mode and
Testing Mode.
Training Mode:
In this mode the user provides the system with the following
information:
* The training file in the current directory is called
training.txt. This file contains exemplar pairs or patterns. Each
pattern has a set of inputs followed by a set of outputs. Each value is
separated by one or more spaces. Figure 4 shows an example of a
training.txt file.
* the values for the error tolerance and the learning rate;
* the maximum number of cycles, or passes through the training
data;
* the number of layers (the training mode layout is shown in Figure
5); and
* the size for each layer, from the input to the output.
After this information has been prepared, the system begins training and
reports the current cycle number and the error for each cycle. The user may
watch the error to see that it is, on the whole, decreasing with time; if
it is not, the system gives a chance to restart the training, since this
starts with a brand new set of random weights and may lead to another,
possibly better, solution. Once the training is done the system gives
information about the number of cycles and patterns used and the total
error. The system can draw the network under training in two cases, without
training data and with training data, as shown in Figures 6 and 7
respectively.
Another file that is used in training is the weights file. Once the
system reaches the error tolerance that is specified by the user, or the
maximum number of iterations, the system saves the state of the network by
saving all of its weights in a file called weight.txt. This file can then
be used subsequently in another run of the system in testing mode.
In addition, the output generated by the network is provided in an
output file called output.txt.
Testing Mode
In this mode, the user provides test data to the system in a file
called test.txt. This file contains only input patterns. When this file
is applied to an already trained network, an output.txt file is
generated, which contains the output from the network for all of the
input patterns. The network goes through one cycle of operation in this
mode, covering all the patterns in the test data file. To start up the
network, the weight file, weight.txt is read to initialize the state of
the network. The user must provide the same network size parameters used
to train the network.
CASE STUDY: CHARACTER RECOGNITION
The problem presented in this section deals with recognizing or
categorizing alphabetic characters. We will input various alphabetic
characters to our system and train the network to recognize these as
separate categories.
Representing Characters
A 5x7 grid of pixels represents each character. To represent the character
A, for example, we use the pattern shown in Figure 8. Here the blackened
boxes represent the value 1, while blank boxes represent 0. We can
represent all characters this way, with a binary map of 35 pixel values.
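A row of pixel values of this kind can be unpacked into the 35
floating-point inputs presented to the network with a small helper such as
the following; the helper is illustrative and is not part of the system
itself.

#include <string>
#include <vector>

// Turn a 5x7 pixel map, written as a string of '0' and '1' characters
// (35 pixels, row by row, as in Table 1), into the input values fed to
// the network.
std::vector<float> pixels_to_inputs(const std::string& bitmap) {
    std::vector<float> inputs;
    inputs.reserve(bitmap.size());
    for (char c : bitmap)
        inputs.push_back(c == '1' ? 1.0f : 0.0f);
    return inputs;
}

For the character A of Figure 8, the argument would be the first 35
characters of the corresponding row of Table 1.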
We wish to distinguish alphabetic characters by assigning them to
different bins. We apply the inputs and train the network with the
anticipated responses. The input data that we used for distinguishing five
different characters, A, X, H, B, and I, are shown in Table 1. Each row
gives the 5x7-dot (35-pixel) representation of a character followed by the
3-bit output code assigned to it below. Now we need to name each of the
output categories. We can assign a simple 3-bit representation as follows
(a small decoding sketch appears after the list):
A 000
X 010
H 100
B 101
I 111
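The outputs produced by a trained network are real values between 0 and 1
rather than exact bits, so a small decoding step is useful when
interpreting results such as those reported later in this section. The
sketch below rounds each output at 0.5 and looks up the letter; the
threshold and the function name are illustrative assumptions.

#include <string>

// Map the three network outputs back to a letter by rounding each output
// to 0 or 1 and comparing against the codes listed above.
char decode_letter(float o0, float o1, float o2) {
    std::string bits;
    bits += (o0 >= 0.5f) ? '1' : '0';
    bits += (o1 >= 0.5f) ? '1' : '0';
    bits += (o2 >= 0.5f) ? '1' : '0';
    if (bits == "000") return 'A';
    if (bits == "010") return 'X';
    if (bits == "100") return 'H';
    if (bits == "101") return 'B';
    if (bits == "111") return 'I';
    return '?';   // code not assigned to any trained letter
}

For example, the output vector reported below (0.979606 0.985861 0.989526)
rounds to 111 and therefore decodes to the letter I, in agreement with the
expected output.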
Let's train the network to recognize these characters. The training.txt
file looks as in Figure 9.
Now we can start training the network with these parameters (learning rate
0.1, tolerance 0.001, and repetition count 1000) and with three layers of
sizes 35 (input), 5 (middle), and 3 (output). After a total of 3322 cycles
we get a final error of 0.00164419; we note that the specified tolerance is
nearly met. The match for the last pattern is:
For input vector:
0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000
1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000
0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000
Output vector is: 0.979606 0.985861 0.989526
Expected output vector is: 1.00000 1.00000 1.00000
To see the outputs of all the patterns, we need to copy the training.txt
file to the test.txt file without the expected output fields and rerun the
system in test mode. Running the system in test mode produces the
output.txt file shown in Figure 10. We note that the training patterns are
learned very well. With a larger (less strict) tolerance, the learning
could have been completed in fewer cycles.
CONCLUSION
In this article we have used an object-oriented approach for implementing
one of the most powerful neural network algorithms, the backpropagation
algorithm. The algorithm uses the so-called generalized delta rule and
trains the network with exemplar pairs of patterns. The main motivation for
using object-oriented features such as encapsulation and abstraction is
that we can have well-packaged, reusable, extensible, and reliable programs
and program segments.
APPENDIX A
The header file for the CBaseLayer class hierarchy and the
network class CNNtwork:
#include <cstdio>   // for FILE*

#define MAX_LAYERS 5

class CNNtwork;

class CBaseLayer
{
protected:
    int num_inputs;
    int num_outputs;
    float* outputs;   // pointer to array of outputs
    float* inputs;    // pointer to array of inputs, which
                      // are outputs of some other layer
    friend class CNNtwork;
public:
    virtual void calc_out() = 0;
};
class CInputLayer : public CBaseLayer
{
public:
    CInputLayer(int, int);
    ~CInputLayer();
    virtual void calc_out();
};
class CMiddleLayer;

class COutputLayer : public CBaseLayer
{
protected:
    float* weights;
    float* output_errors;    // array of errors at output
    float* back_errors;      // array of errors back-propagated to inputs
    float* expected_values;  // expected (desired) outputs that start the error calculation
    friend class CNNtwork;
public:
    COutputLayer(int, int);
    ~COutputLayer();
    virtual void calc_out();
    void calc_error(float&);
    void randomize_weights();
    void update_weights(const float);
    void list_weights();
    void write_weights(int, FILE*);
    void read_weights(int, FILE*);
    void list_errors();
    void list_outputs();
};
class CMiddleLayer : public COutputLayer
{
public:
    CMiddleLayer(int, int);
    ~CMiddleLayer();
    void calc_error();
};
class CNNtwork
{
private:
    CBaseLayer* ptr[MAX_LAYERS];
    int num_of_layers;
    int layer_size[MAX_LAYERS];
    float* buffer;
    unsigned training;
public:
    CNNtwork();
    ~CNNtwork();
    void set_training_value(const unsigned&);
    unsigned get_training_value();
    void get_layer_info();
    void set_up_network();
    void randomize_weights();
    void update_weights(const float*);
    void write_weights(FILE*);
    void read_weights(FILE*);
    void list_weights();
    void write_outputs(FILE*);
    void list_outputs();
    void list_errors();
    void forward_prop();
    void backward_prop(float&);
    int fill_IObuffer(FILE*);
    void set_up_pattern(int);
};
Table 1
Character Representations (35 pixel values followed by the 3-bit output code)
Letter Representation
A 00100010101000110001111111000110001000
X 10001010100010000100001000101010001010
H 10001100011000111111100011000111111100
B 11111100011000111111100011000111111101
I 00100001000010000100001000010000100111
Acknowledgements
I wish to express my sincere thanks, appreciation, and gratitude to
Professor Samia Siad Al-Azab and Professor Farouk Aeob for their
invaluable suggestions on this work.
References
Jacek, Z.M. (1995). Introduction to artificial neural systems.
Boston: PWS Publishing.
Lefteri, T.H., & Robert, U.E. (1997). Fuzzy and neural
approaches in engineering. New York: Wiley.
Lafore, R. (1999). Object-oriented programming in C++.
Indianapolis, IN: Sams.
Mohamed, R., & Laitinen, M. (1998). Transition to OO software
development. New York: Wiley.
Grady, B., James, R., & Ivar, J. (1999). The unified modeling
language user guide. Addison-Wesley.