A unifying semantic distance model for determining the similarity of attribute values
Abstract
The relative difference between two data values is of
interest in a number of application domains including
temporal and spatial applications, schema versioning,
data warehousing (particularly data preparation), internet
searching, validation and error correction, and
data mining. Moreover, consistency across systems in
determining such distances and the robustness of such
calculations are essential in some domains and useful in
many. Despite this, there is no generally adopted approach
to determining such distances and no accommodation
of distance within SQL or any commercially
available DBMS.
For non-numeric data values, calculating the difference
between values often requires application-specific
support, but even for numeric values the practical
distance between two values may not simply be
their numeric difference or Euclidean distance.
In this paper, a model of semantic distance is
developed in which a graph-based approach is used
to quantify the distance between two data values.
The approach supports two notions of distance: a simple
traversal distance and a distance over weighted arcs.
Transition costs, as an additional expense of passing
through a node, are also accommodated. Furthermore,
multiple distance measures can be incorporated
and a method of ‘localisation’ is discussed which allows
relevant information to take precedence over less
relevant information. Some results from our investigations,
including our SQL-based implementation, are
presented.
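
As an illustration only, the graph-based distance described above can be sketched as a cheapest-path search in which each arc carries a weight and each intermediate node may carry a transition cost. This is a minimal sketch under our own assumptions, not the paper's SQL-based implementation: the function name `semantic_distance`, the adjacency-list layout, and the rule that a transition cost is charged only when a path passes through a node (not at its start or end) are all hypothetical choices. It also assumes that all weights and costs are non-negative.

```python
import heapq


def semantic_distance(arcs, source, target, transition_cost=None):
    """Return the cheapest path cost from source to target, or None.

    arcs: dict mapping a node to a list of (neighbour, weight) pairs.
    transition_cost: optional dict mapping a node to the extra cost of
        passing through it (assumed not charged at the source or target).
    Setting every weight to 1 and omitting transition_cost gives a
    simple traversal distance (number of arcs traversed).
    """
    transition_cost = transition_cost or {}
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        # Leaving an intermediate node incurs that node's transition cost.
        through = transition_cost.get(node, 0) if node != source else 0
        for neighbour, weight in arcs.get(node, []):
            new_cost = cost + through + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return None  # target is not reachable from source
```

For example, with hypothetical arcs A to B (weight 1), B to C (weight 1) and A to C (weight 3), the distance from A to C is 2 via B. If passing through B carries a transition cost of 2, the route via B costs 4, so the direct arc is preferred and the distance becomes 3.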