20 November 2002
<message>
<driver name = ‘Smith’ age = ‘42’ >
<car make = ‘Ford’ reg = ‘KFL942’ />
<car make = ‘VW’ reg = ‘PEZ288’ />
</driver>
<driver name = ‘Jones’ age = ‘27’ >
<car make = ‘Fiat’ reg = ‘BCC100’ />
</driver>
</message>
and:
<message>
<drivers>
<driver name = ‘Smith’ age = ‘42’ />
<driver name = ‘Jones’ age = ‘27’ />
</drivers>
<cars>
<car make = ‘Ford’ reg = ‘KFL942’ driver = ‘Smith’ />
<car make = ‘VW’ reg = ‘PEZ288’ driver = ‘Smith’ />
<car make = ‘Fiat’ reg = ‘BCC100’ driver = ‘Jones’ />
</cars>
</message>
These two messages convey exactly the same information, about a group of drivers and their cars.
Each driver has a name and an age, and can drive several cars. Each car has a make and a registration
number, and is driven by only one driver.
In the first message, the fact that a driver drives a car is denoted by nesting of the <car> element
inside the appropriate <driver> element. In the second message, the same fact is denoted not by
nesting of elements, but by each <car> element having a ‘driver’ attribute, which matches the ‘name’
attribute in some <driver> element.
Because the two messages convey the same information, it should be possible to translate from one to
the other.
It is our experience that field-to-field mapping tools cannot do this translation accurately in both
directions. Typically, when going from the ‘flat’ to the nested form, they lack the capability to group
just the correct inner elements (and no others) in any outer element. Furthermore, for mapping tools
which generate code, it seems to be very hard to ‘patch up’ the generated code to do the right job.
Mapping tools therefore fail to do a basic and important translation task.
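To make concrete what the translation has to do, here is a minimal hand-coded sketch in Python, a language chosen here purely for illustration. It is not how any mapping tool works. It uses the standard xml.etree.ElementTree module and retypes the second message with straight quotes so that it parses. The function names flat_to_nested and nested_to_flat are illustrative assumptions. The essential step is the grouping test: a <car> is placed inside a <driver> only when its ‘driver’ attribute equals that driver's ‘name’ attribute.

```python
import xml.etree.ElementTree as ET

# The 'flat' message from the text, retyped with straight quotes.
FLAT = """
<message>
  <drivers>
    <driver name="Smith" age="42" />
    <driver name="Jones" age="27" />
  </drivers>
  <cars>
    <car make="Ford" reg="KFL942" driver="Smith" />
    <car make="VW" reg="PEZ288" driver="Smith" />
    <car make="Fiat" reg="BCC100" driver="Jones" />
  </cars>
</message>
"""


def flat_to_nested(flat_root):
    """Re-express 'driver drives car' as nesting (hypothetical helper)."""
    nested = ET.Element("message")
    for driver in flat_root.findall("./drivers/driver"):
        out_driver = ET.SubElement(nested, "driver", dict(driver.attrib))
        for car in flat_root.findall("./cars/car"):
            # Group only the cars whose shared value matches this driver.
            if car.get("driver") == driver.get("name"):
                ET.SubElement(out_driver, "car",
                              {"make": car.get("make"), "reg": car.get("reg")})
    return nested


def nested_to_flat(nested_root):
    """Re-express 'driver drives car' as a shared value (hypothetical helper)."""
    flat = ET.Element("message")
    drivers = ET.SubElement(flat, "drivers")
    cars = ET.SubElement(flat, "cars")
    for driver in nested_root.findall("./driver"):
        ET.SubElement(drivers, "driver", dict(driver.attrib))
        for car in driver.findall("./car"):
            attrs = dict(car.attrib)
            # The nesting is replaced by a 'driver' attribute on the car.
            attrs["driver"] = driver.get("name")
            ET.SubElement(cars, "car", attrs)
    return flat


nested = flat_to_nested(ET.fromstring(FLAT))
round_trip = nested_to_flat(nested)
print(ET.tostring(nested, encoding="unicode"))
```

The sketch assumes that driver names are unique, as they are in the example. With duplicate names the shared-value form would be ambiguous and no translation could recover the association.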
Denoting an association such as ‘driver drives car’ by nesting of elements is very common in XML
(XML is good for nesting), but it is far from universal. Many associations are denoted in XML
languages by some kind of shared value, so this translation problem is a very common one. In any
<products>
<product name = ‘widget’ mfr = ‘Acme Inc’ mfState = ‘NY’ />
<product name = ‘trunnion’ mfr = ‘Acme Inc’ mfState = ‘NY’ />
<product name = ‘plunger’ mfr = ‘Acme Inc’ mfState = ‘NY’ />
<product name = ‘valve’ mfr = ‘Perfecto’ mfState = ‘CA’ />
</products>
and:
<message>
<products>
<product name = ‘widget’ mfCode = ‘ac’ />
<product name = ‘trunnion’ mfCode = ‘ac’ />
<product name = ‘plunger’ mfCode = ‘ac’ />
<product name = ‘valve’ mfCode = ‘pf’ />
</products>
<manufacturers>
<mfr name = ‘Acme Inc’ code = ‘ac’ mfState = ‘NY’ />
<mfr name = ‘Perfecto’ code = ‘pf’ mfState = ‘CA’ />
</manufacturers>
</message>
The first fragment is de-normalised, in that information about two different kinds of entity (products
and manufacturers) is held in the same <product> element. Because the same manufacturer may make
many products, some information about the manufacturer (for instance, the state in which it is located)
is duplicated across the elements for many different products.
The second fragment is normalised, so that information about manufacturers is stored separately from
product information, and is not duplicated. ‘Manufacturer state’ is stored only once.
De-normalisation occurs very widely in databases and XML message formats; and normalised forms
are equally common. So it is a very common requirement to translate between normalised and de-
normalised forms.
Field-to-field mapping techniques have great difficulty in doing this, if they can do it at all. They do not
have any natural way to express the re-groupings of data which are required; so they either cannot
make the required translation at all, or at best they do it only by using tortuous procedural constructs,
which would be much better done in a high-level language.
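As an illustration of the re-grouping involved, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The fragments are retyped with straight quotes so that they parse. The function names denormalise and normalise are assumptions made for this example, as is the codes table passed to normalise: the de-normalised form carries no manufacturer code, so the caller has to supply one.

```python
import xml.etree.ElementTree as ET

# The normalised fragment from the text, retyped with straight quotes.
NORMALISED = """
<message>
  <products>
    <product name="widget" mfCode="ac" />
    <product name="trunnion" mfCode="ac" />
    <product name="plunger" mfCode="ac" />
    <product name="valve" mfCode="pf" />
  </products>
  <manufacturers>
    <mfr name="Acme Inc" code="ac" mfState="NY" />
    <mfr name="Perfecto" code="pf" mfState="CA" />
  </manufacturers>
</message>
"""


def denormalise(message_root):
    """Copy manufacturer details into every product (hypothetical helper)."""
    mfrs = {m.get("code"): m for m in message_root.findall("./manufacturers/mfr")}
    products = ET.Element("products")
    for product in message_root.findall("./products/product"):
        # A KeyError here means a product refers to an unknown manufacturer.
        mfr = mfrs[product.get("mfCode")]
        ET.SubElement(products, "product", {
            "name": product.get("name"),
            "mfr": mfr.get("name"),
            "mfState": mfr.get("mfState"),
        })
    return products


def normalise(products_root, codes):
    """Store each manufacturer once; 'codes' maps manufacturer name to code."""
    message = ET.Element("message")
    products = ET.SubElement(message, "products")
    manufacturers = ET.SubElement(message, "manufacturers")
    seen = set()
    for product in products_root.findall("./product"):
        code = codes[product.get("mfr")]
        ET.SubElement(products, "product",
                      {"name": product.get("name"), "mfCode": code})
        if code not in seen:  # emit each manufacturer only once
            seen.add(code)
            ET.SubElement(manufacturers, "mfr", {
                "name": product.get("mfr"),
                "code": code,
                "mfState": product.get("mfState"),
            })
    return message


flat_products = denormalise(ET.fromstring(NORMALISED))
renormalised = normalise(flat_products, {"Acme Inc": "ac", "Perfecto": "pf"})
print(ET.tostring(flat_products, encoding="unicode"))
```

Note that normalise silently keeps the first state it sees for each manufacturer. A production translation would also need to decide what to do when the duplicated values disagree.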
The difficulties of field-to-field mapping in nested and de-normalised data translations have a common
origin. Both of these problems are about associations between objects of different kinds – associations
such as ‘driver drives car’ or ‘manufacturer makes product’. (Associations are the ‘relations’ of
Entity-Relation Diagrams, and are fundamental to all data models.)
Field-to-field mappings allow you to talk about properties of things – to say ‘these two fields both
represent the same property’ – but they do not allow you to say how associations are represented.
Using a field-to-field mapping tool, there is no way to say ‘this is how the data source represents this
association’. So it is not surprising that field-to-field mapping tools are so bad at translating
associations. They can only do so when both data sources represent the association in the same way –
<people>
<person name = ‘Smith’ age = ‘30’ />
<person name = ‘Jones’ age = ‘25’ />
</people>
and:
<people>
<person>
<prop pName = ‘name’>Smith</prop>
<prop pName = ‘age’>30</prop>
</person>
<person>
<prop pName = ‘name’>Jones</prop>
<prop pName = ‘age’>25</prop>
</person>
</people>
The first XML language can be extended, but only by extending its schema. If you want to represent
some other property of a person (such as their gender) you will have to extend the definition of the
XML language in its schema, to add a new attribute ‘gender’.
The second fragment of XML is designed to be extensible without having to extend its schema. You
can record genders (or any other property of people) by adding an element <prop pName =
‘gender’>male</prop>, which does not require any extension of the schema. This is a more
uncommitted XML language.
Translation between committed and uncommitted languages presents a further challenge for
field-to-field mapping translators. Since, in the second language, the content of the element ‘prop’ can
represent essentially anything, depending on its ‘pName’ attribute, there is no simple way to map it
onto an element or attribute of a committed language. Some mapping products have introduced a
conditional mapping construct to address this problem, but others cannot do it.
This ‘uncommitted’ style of XML language design is used quite commonly (for instance, it is used in
OAGIS XML messages), and it is used to represent entity classes and associations, as well as
properties. It is essential to be able to translate freely between committed and uncommitted structures.
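For the two ‘people’ fragments shown earlier, the translation can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is a hedged illustration only: the function names to_uncommitted and to_committed are assumptions, the input is retyped with straight quotes, and the sketch assumes every property of a person maps to a plain attribute in the committed language.

```python
import xml.etree.ElementTree as ET

# The committed fragment from the text, retyped with straight quotes.
COMMITTED = """
<people>
  <person name="Smith" age="30" />
  <person name="Jones" age="25" />
</people>
"""


def to_uncommitted(people_root):
    """Turn each attribute into a generic <prop pName='...'> element."""
    out = ET.Element("people")
    for person in people_root.findall("./person"):
        out_person = ET.SubElement(out, "person")
        for prop_name, value in person.attrib.items():
            prop = ET.SubElement(out_person, "prop", {"pName": prop_name})
            prop.text = value
    return out


def to_committed(people_root):
    """Turn each <prop> back into an attribute named by its pName."""
    out = ET.Element("people")
    for person in people_root.findall("./person"):
        # The meaning of each <prop> depends on the *value* of pName,
        # which is what a plain field-to-field mapping cannot express.
        attrs = {prop.get("pName"): (prop.text or "")
                 for prop in person.findall("./prop")}
        ET.SubElement(out, "person", attrs)
    return out


uncommitted = to_uncommitted(ET.fromstring(COMMITTED))
committed_again = to_committed(uncommitted)
print(ET.tostring(uncommitted, encoding="unicode"))
```

Going from uncommitted to committed, a real translation would also have to check each pName against the target schema, since the committed language accepts only the attributes it defines.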
3.4 The Data Grouping Problem
This is illustrated by the following two fragments of XML:
<college>
<student name = ‘Carter’ age = ‘20’ year = ‘2’ />
<student name = ‘LeBrun’ age = ‘19’ year = ‘1’ />
<student name = ‘Schmidt’ age = ‘18’ year = ‘1’ />
</college>
and:
<college>
<students year = ‘1’ >
<student name = ‘Schmidt’ age = ‘18’ />
<student name = ‘LeBrun’ age = ‘19’ />
</students>
<students year = ‘2’ >
<student name = ‘Carter’ age = ‘20’ />
</students>
</college>
In both examples, every student is in a year 1..4. In the second example only, the students in any year
have been grouped together. Such grouping is very common, for instance in XML intended for
transformation to HTML and display as a report.
Translating between the grouped and ungrouped forms requires a structural transformation based on
meaning. It has been our experience that field-to-field mapping products generally cannot make the
required transformation.
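The structural transformation itself is short when written by hand. The following minimal sketch, in Python with the standard xml.etree.ElementTree module, groups the students of the first (ungrouped) fragment by their ‘year’ attribute and then ungroups them again. The function names group_by_year and ungroup are illustrative assumptions, and the fragment is retyped with straight quotes so that it parses.

```python
import xml.etree.ElementTree as ET

# The ungrouped fragment from the text, retyped with straight quotes.
UNGROUPED = """
<college>
  <student name="Carter" age="20" year="2" />
  <student name="LeBrun" age="19" year="1" />
  <student name="Schmidt" age="18" year="1" />
</college>
"""


def group_by_year(college_root):
    """Collect students under one <students year='...'> element per year."""
    by_year = {}
    for student in college_root.findall("./student"):
        by_year.setdefault(student.get("year"), []).append(student)
    out = ET.Element("college")
    for year in sorted(by_year, key=int):
        group = ET.SubElement(out, "students", {"year": year})
        for student in by_year[year]:
            # 'year' now lives on the group, so drop it from the student.
            attrs = {k: v for k, v in student.attrib.items() if k != "year"}
            ET.SubElement(group, "student", attrs)
    return out


def ungroup(college_root):
    """Push each group's year back down onto its students."""
    out = ET.Element("college")
    for group in college_root.findall("./students"):
        for student in group.findall("./student"):
            ET.SubElement(out, "student",
                          {**student.attrib, "year": group.get("year")})
    return out


grouped = group_by_year(ET.fromstring(UNGROUPED))
flat_again = ungroup(grouped)
print(ET.tostring(grouped, encoding="unicode"))
```

Within each year the sketch keeps students in their original document order. The order of students in the ungrouped result follows the groups, so a round trip preserves the information but not necessarily the original sequence.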
3.5 Bringing It Together
These are not rare or isolated problems. They occur all the time in practical data translations, for XML,
databases and other APIs. One of these problems is not likely to occur as above, in isolation – it will
occur together with some of the others, and with others we have not mentioned here (such as merging
and splitting of fields; superclasses and subclasses; and duplicated representation of objects).
As we have seen, each problem on its own presents serious difficulties for the field-to-field mapping
approach. These difficulties arise first, because the approach does not recognise the importance of
associations in data models, and second, because it is ill-equipped to make complex structural
transformations of data.
Vendors of mapping products have partially recognised this difficulty. As well as the basic mapping
functionality, some products offer procedural constructs (such as ‘IF’ constructs and iterators) which
can be built into the mappings to produce more complex behaviour. Sometimes this behaviour can be
made (with difficulty) to give the required structural transformations, and solve one of the four
translation problems. More often it cannot. Even if it can, there are serious difficulties:
1. The mixture of procedural and non-procedural mapping constructs is very hard to understand and
debug.
2. If a problem can only be overcome by introducing a complex procedural construct, then the
claimed benefit (of avoiding procedural hand-coding) has disappeared. You would do better to do
the procedural coding in a high-level language which is designed for the purpose.
3. If procedural constructs are used to solve two or more overlapping problems, their interactions
with the automatic functionality and with each other are extremely complex. You are not likely to
produce a solution which anybody can understand or maintain.
If the mapping tool works by code generation, then the alternative (of hand-tweaking the generated
code) is even worse to contemplate.
The result is that field-to-field mapping tools are not capable of tackling industrial-strength data
translation challenges, where the problems described above are widespread.