A modified Fastmap algorithm.
Radulescu, Florin ; Boicea, Alexandru
1. INTRODUCTION
Local searches involve finding a resource in a specified
geographical area. This kind of searches account already more than 10
percent of all Internet queries and are expected to grow (Helft, 2006).
These approaches are almost all based on real maps. In (Radulescu,
2007) we described a simple alternative method for local searches. This
method uses a graph as a simplified map for a city or town and a simple
distance estimation method from a specified point to a resource for
ordering the search result from 'near' to 'far'
without using real maps and visual interfaces.
A real map is actually like a graph and there are two ways to code
a map in a graph:
1) Street segments as graph edges
2) Streets as nodes: every street is a node and every intersection
between any two streets is an edge.
The second modeling approach was selected in this paper for the
simplified map of a city because it leads to a much simple
representation.
In such a graph the distance between any two nodes may be defined
as the length of the minimal path between them. For equal-weighted edges
(the length of each edge is 1), the meaning is "how many streets
away is one street from another".
Storing the graph or all distances between any two streets is
either slow or memory consuming. The idea investigated in this paper was
to create a Euclidian space knowing only the distances between any two
points. This operation is called multidimensional scaling (Torgerson,
1952). In that case every street may be represented by few coordinates
and so the distance between any two streets can be easily computed.
There are many algorithms for multidimensional scaling: FastMap
(Faloutsos & Lin, 1995; Yang et al., 2006), MetricMap and Landmark
MDS (LMDS). These algorithms approximate classical MDS using a subset of
the data and fitting the remainder to the solution.
2. FASTMAP
2.1 Classical Fastmap
For our approach FastMap was selected to be used for
multidimensional scaling of the simplified map. It is a recursive algorithm and at each step the following operations are done:
1) Two distant points (a, b)--nodes in the graph in our case are
selected as an axis.
2) For every point c compute the coordinate x for this axis using
the generalized Pythagoras theorem (also known as the law of cosine):
x = ([D.sup.2] (a, c) + [D.sup.2] (a, b) - [D.sup.2] (b, c))/ (2*D
(a, b)) (1)
3) Use for further axis not the original distances D between points
but the remainder D' after subtracting the distance with respect
with the coordinates already computed.
In Fig. 1, (a, b) are the two points selected as axis, (x, y) are
the coordinates of c and d for this axis (the first axis point a being
the origin) and D and D' are the distances between c and d before
and after computing the current axis coordinates:
[D'.sup.2] = [D.sup.2] - [(x - y).sup.2] (2)
This process stops after computing the desired number of
coordinates for every point or no more axes can be found.
2.2 Problems with real graphs
For real graphs the problem is that the distance matrix is not a
Euclidian one, so when we use (2) for computing D' we obtain for
some pairs of nodes a negative value for square D'. In that case
the only way to continue is to assume D' is 0.
But this assumption leads to propagated errors in computing real
good coordinates for our purpose. The algorithm was run with this
assumption on a 2000 nodes matrix and the optimal number of axes was 18
(too much) and also the average differences between real distances and
computed distances (computed from the resulting coordinates) was high:
0.78 after 18 steps, as shown in Fig. 2.
2.3 Modified Fastmap
For solving this problem we tried different types of adjustments,
modifying the original FastMap algorithm. One of them had very good
results.
The modified algorithm is the following. The third step was
introduced in order to recalculate the distance matrix and have integer distances between nodes.
Step 1: Search for a pair of points (a, b) with maximum distance
between them. These points are selected as an axis.
Step 2: For every point c compute the coordinate x for this axis
using the generalized Pythagoras theorem (1).
Step 3: For every pair of points c and d (Fig. 3.) recalculate the
distance between c and d using:
[D.sup.2](c, d) = INT ([D.sup.2](c, e) + [D.sup.2](e, d)) (3)
Step 4: Use for further axis use D' as in the original
algorithm, using (2), but D' is now computed starting from the
modified distance matrix.
D(c, e) can be obtained directly from the coordinates of c and d
for the current axis.
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
D(e, d) = |D(d, d') - D(c, c')| and the two distances
from the right side can be obtained using (1) for the two square
triangles (a, c, c') and (b, d, d') where all two other edges
are known.
3. EXPERIMENTAL RESULTS
This proposed algorithm was used on 12 distance matrices containing
random generated data:
* 4 matrices for a graph with 1000 nodes and 4 edges for every
node. In Table 1 the files are labeled F1-F4.
* 4 matrices for a graph with 2000 nodes and 4 edges for every
node. In Table 1 the files are labeled F5-F8.
* 4 matrices for a graph with 2000 nodes and 3 edges for every node
in Table 1 the files are labeled F9-F12.
Table 1 contains the results of running the algorithm for all 12
matrices and for only two steps (first two coordinates).
The table columns are the following: File: The file with the
distance matrix. AvgErr: The average of Abs(Dreal(x, y) - Dcomputed(x,
y)) over all pairs of nodes (x, y). Dcomputed(x, y) is the distance
between nodes x and y resulting from their coordinates using the
Euclidian distance (1). Diff 0: The number of matrix positions where |
Dreal(x, y) - Dcomputed(x, y) | = 0. Diff 1: The number of matrix
positions where | Dreal(x, y) - Dcomputed(x, y) | = 1. Diff >1: The
number of matrix positions where | Dreal(x, y) - Dcomputed(x, y) | >
1. %0-1: The percentage of 0 and 1 difference values from the total:
(Diff0 + Diff1) / (Diff) + Diff1 + Diff>1).
The results are clear: for over 95% of the graph node pairs, if the
described algorithm is used for finding the first two coordinates and
then this coordinates are used for computing the distance between the
two nodes, the distance is the same or almost the same (0 or 1) with the
original (real) distance. These first two coordinates can be
pre-computed and stored. More than that, for some graphs this percentage
is over 98-99%.
[FIGURE 4 OMITTED]
For a comparison between the original and modified algorithm, Fig.
4 presents the graphic for both (update for Fig. 2, for the same
dataset).
The modified algorithm stops after only 6 steps but only first two
coordinates are enough (and optimal). The original algorithm has a
minimum after 18 steps and this is minimum is greater.
4. CONCLUSIONS
The results show that using this modified Fastmap algorithm we can
associate with every node in a graph two coordinates that leads to a
good approximation of the real distance between them. For many
applications (including search engines that order the results form
'near' to 'far') it is then not necessary to store
an entire graph distance matrix but only these few coordinates because
the time for calculating any distance using Euclidian distance formula
is reasonable short.
5. REFERENCES
Helft, M. (2006), The Retooling of a Search Engine, New York Times,
December 4th 2006.
Radulescu, F. (2007). "Locality Sensitive Search: A Fast
Approach", Proceedings of the CSCS16, Bucharest, May 2007, vol. 2,
pp. 192-195.
Torgerson, W.S. (1952). Multidimensional Scaling: Theory and
Method, Psychometrika, vol 17, 1952, pp. 401-419.
Christos F. & Lin K.I. (1995). FastMap: A Fast Algorithm for
Indexing, Data-Mining and Visualization of Traditional and Multimedia
Datasets, Proceedings of the 1995 ACM SIGMOD International Conference on
Management of Data, San Jose, California, May, 1995.
Yang, T., Liu, J., McMillan, L., Wang, W. (2006). A Fast
Approximation to Multidimensional Scaling, Proceedings of the ECCV Workshop on Computation Intensive Methods for Computer Vision, Graz,
Austria, May 2006.
Tab 1. Experimental results
File AvgErr Diff 0 Diff 1
F1 0.17 419.680 77.024
F2 0.16 423.536 74.221
F3 0.17 416.155 82.031
F4 0.16 423.264 74.831
F5 0.20 1.604.398 384.910
F6 0.34 1.431.513 472.224
F7 0.18 1.645.942 344.927
F8 0.37 1.387.147 493.416
F9 0.29 1.440.510 538.753
F10 0.28 1.495.317 457.619
F11 0.28 1.456.455 527.476
F12 0.30 1.419.806 559.414
File Diff > 1 % 0-1
F1 3.796 99.24
F2 2.743 99.45
F3 2.314 99.54
F4 2.405 99.52
F5 11.692 99.42
F6 97.263 95.14
F7 10.131 99.49
F8 120.437 93.98
F9 21.737 98.91
F10 48.064 97.60
F11 17.069 99.15
F12 21.780 98.91