EIGHT CHALLENGES IN DATA INTEGRATION

Author: Robert P Worden

20 November 2002

© 2002 Charteris plc


CONTENTS
1. LIMITATIONS OF DATA TRANSLATION TOOLS
2. PROBLEMS OF MANAGEMENT AND COST
2.1 The N-Squared Problem
2.2 The N-fold Maintenance Problem
2.3 The Double Knowledge Problem
3. PROBLEMS OF INTRINSIC TRANSLATION CAPABILITY
3.1 The Nesting Problem
3.2 The De-Normalisation Problem
3.3 The Uncommitted Language Problem
3.4 The Data Grouping Problem
3.5 Bringing It Together
4. A WORKING SOLUTION TO THE EIGHT KEY PROBLEMS

20 November 2002 Eight Challenges in Data Integration Page 2 of 12


Draft 01
1. LIMITATIONS OF DATA TRANSLATION TOOLS
The problem of application integration has been around for a long time. It is a very expensive
problem, and most would say it has not been satisfactorily solved. Application integration is a large and
ever-increasing fraction of IT budgets – and is at the root of many project failures.
For more than ten years, there have been specialist toolsets available for Enterprise Application
Integration (EAI). These tools offer a range of facilities for business process orchestration,
transaction management, security, package ‘adaptors’ and so on. They all have facilities to translate data
between different applications and databases.
As XML has become the preferred choice of inter-application ‘glue’ – both within and across
organisations, with or without the label of ‘web services’ – EAI tools have been extended to handle
XML. There has also emerged a set of specialist XML translation tools.
Most of these translation tools work in the same way. In a design phase, you define equivalences
between data items in two different data sources. For this, you use a process of field-to-field mapping.
Then at run time, the mappings are used to make the data translations automatically.
There are many of these field-to-field data mapping tools on the market. For instance:
♦ WebSphere Data Interchange from IBM
♦ BizTalk Mapper from Microsoft
♦ Tibco Message Broker
♦ Mercator Integration Broker
♦ E-Biz Integrator from Sybase
♦ GoXML from XML Global
♦ Embarcadero D/T Designer
♦ Data Mirror Transformation Server
Field-to-field mapping works as follows: The design tool automatically captures the structure of some
data source (for a database, its relational schema; for an XML source, the nesting of elements from its
XML schema or DTD). This structure is displayed as a tree diagram. The nodes and leaves of the tree
are ‘fields’ (e.g. columns in a database, or XML attributes) which hold the smallest items of data.
You display the tree structures of two different data sources side by side. You drag-and-drop to tell the
tool: ‘this field in source A is equivalent to that field in source B’ – drawing a line across, to denote a
mapping between the two fields. Maybe you put a box on the line to define some data translation (e.g.
between different representations of dates) – using a palette of pre-defined translation functions, or
adding custom functions.
Once you have defined the mappings, run-time data translation is done automatically by the tool, or by
code generated from the tool. Therefore you save the cost of hand-coding data translations.
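As a rough illustration of what such a tool captures, a set of field-to-field mappings can be pictured as a table of (source field, target field, optional conversion) entries which is applied record by record at run time. The sketch below is only that picture in Python; the field names, the date formats and the function names are invented for illustration and are not taken from any of the products listed above.

```python
from datetime import datetime


def uk_to_iso_date(value: str) -> str:
    """Convert a DD/MM/YYYY date string to ISO YYYY-MM-DD."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")


# Each entry: (field in source A, field in source B, optional conversion).
# All field names here are hypothetical.
MAPPINGS = [
    ("cust_name", "customerName", None),
    ("dob", "dateOfBirth", uk_to_iso_date),
]


def translate(record: dict) -> dict:
    """Apply every field-to-field mapping to one flat source record."""
    out = {}
    for src, dst, convert in MAPPINGS:
        if src in record:
            value = record[src]
            out[dst] = convert(value) if convert else value
    return out


source_record = {"cust_name": "Smith", "dob": "20/11/1960"}
translated = translate(source_record)
print(translated)
```

Note that the table says which fields correspond, and nothing else. It has no place to record how records relate to one another.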
Data mapping would appear to be a big cost-saver. Based on that promise, over the years many people
have bought these mapping-style translation tools. However, these tools have not yet had a big impact
on the practice of data integration; their use has remained localised. We know of no major IT user
which regularly uses a single data integration tool for all its application integration needs. The
dominant method of data integration is still hand-coding.
Why is this? This note describes how field-to-field mapping fails to tackle eight of the most important
challenges of data integration. If these problems are not tackled properly, they come back to bite you,
and force you back to hand-coding. That is possibly why the data mapping products have not achieved
widespread use – in spite of having been around for many years.
Three key challenges of data integration are problems of management and cost:

1. The N-Squared Problem: If you have N different systems or XML languages to translate
between, you might have to define as many as N(N-1) sets of mappings to do all the required
translations. When N is large – say 20 different systems – even a small fraction of this number
makes a very large number of mappings to define, which is very costly.
2. The N-fold Maintenance Problem: Any system or XML language will evolve through
successive versions – with typically a new version every few months. You need mappings from any
one system to many of the other (N-1) systems. Whenever one system or XML language changes
to a new version, all its mappings to any other systems may need to be re-done. This is a big
ongoing cost.
3. The Double Knowledge Problem: To build accurate mappings between two complex data
sources, you need to have deep knowledge of both data sources at the same time. This
combination of knowledge is very rare, and very hard to find in one person. Mapping errors arise
from lack of knowledge of one of the two systems.
A further five are problems of basic translation capability:
4. The Nesting Problem: This is an XML-specific problem, and is now very prevalent. Some XML
languages are deeply nested, representing associations between objects (or relations, in database
terminology) by their nesting. Other languages are more shallow, representing the same
associations by shared values of fields. Using field-to-field mapping, you cannot make accurate
translations between a nested and a shallow XML language.
5. The De-Normalisation Problem: When a relational database is de-normalised, it represents
several related objects by one row of a table. When an XML language is de-normalised, it
represents several related objects in one XML element. Both of these are very common. Both lead
to duplication of data. Accurate translation between normalised and de-normalised forms – or
even between different kinds of de-normalisation – often cannot be done with field-to-field
mapping.
6. The Uncommitted Language Problem: XML languages and relational databases are often
designed to be open-ended, to convey new types of information without any change of schema.
This is done by making individual messages or records hold their own metadata. Translating from
an open-ended language to a closed language, or vice versa, requires extensions to the basic field-
to-field mapping approach.
7. The Data Grouping Problem: In some XML message formats, data items are grouped according
to the values of some properties. Translating from a grouped format to an un-grouped format is
hard or impossible for simple mapping-based translators.
8. Bringing it all together: Problems like those above do not occur in isolation. In any real
integration project, you will encounter most of them, in combination. Even if field-to-field
mapping can, with some effort, tackle one of these problems on its own, how does it cope when
they all occur together?
You can check out these problems of translation capability for yourself. Using the data mapping tool
of your choice, you can try out the specific examples which follow, to see how many of them it can
translate. We would be interested to hear your results.
The problems (1) – (8) above are not rare or esoteric. In real systems, they occur all the time, and in
any large integration project you will probably run into all of them. With field-to-field mapping
products, this leaves you in a very awkward position – with an automated tool which does half the job
you want it to. You then have to try to understand how the automated tool works (for instance, to
understand the code it generates) in order to patch it up to do the whole job. Patching generated code
is a configuration nightmare, and often cannot be done.
That, we believe, is why field-to-field mapping tools have not been a great success in their ten-year
history.

In recent months, web services have been heralded as the solution to application integration, one
that will dramatically simplify the task of connecting disparate systems. It is
important to realise that the standards which underlie web services (SOAP, WSDL, XML Schema,
and so on) do almost nothing to solve the problems of data incompatibility between systems. To
integrate two complex systems via web services, you still need to solve the same problems of data
integration that you had to solve before. And the most common solution on offer is still field-to-field
mapping.
The reader may have guessed that we would not be listing these eight key problems of data integration
if we did not think there was a solution. In the last section of this paper, we briefly describe the
model-centred approach of the Charteris Integration Toolkit, and how it offers practical, working
solutions to all of the key challenges (1) – (8). On request, Charteris will supply an evaluation pack
so that you can test these solutions for yourself.

2. PROBLEMS OF MANAGEMENT AND COST
2.1 The N-Squared Problem
Even medium-sized companies typically have hundreds of different IT systems, which have grown up
over many years, and have pressing needs to integrate many of them. They also increasingly need to
integrate with their business partners’ systems. If you have N systems to integrate, the maximum
number of interfaces required (and thus, the maximum number of sets of mappings you will need to
make) is N(N-1) – approximately N squared. Even with N = 60, this would be a prohibitive number of
mappings, and would swallow the IT budget many times over.
This upper limit of N(N-1) never occurs in practice. Experience suggests that some number between
10*N and 20*N is more realistic, for medium-large companies. When it is first built, a new system will
typically interface to anywhere between 5 and 15 existing systems. As time goes by, and other newer
systems in turn are interfaced to it, its interface count grows, leading to a range of 10-20 interfaces per
system in large companies.
With, for instance, 60 systems and 15 interfaces per system, this leads to 900 separate interfaces –
again, a prohibitive number of interfaces to build, if individual system-to-system mappings must be
made for each interface. Typically, because of these costs, the interfaces are just not built, and
companies live with a fragmented IT architecture – with the heavy business costs that entails.
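The figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are ours, and the inputs (60 systems, 15 interfaces per system) are the ones used in the text.

```python
def max_interfaces(n: int) -> int:
    """Upper bound: one set of mappings per ordered pair of systems."""
    return n * (n - 1)


def typical_interfaces(n: int, per_system: int) -> int:
    """Rule-of-thumb count, using a fixed number of interfaces per system."""
    return n * per_system


n = 60
print(max_interfaces(n))          # 3540
print(typical_interfaces(n, 15))  # 900
```

Even the rule-of-thumb figure grows with every system added, and each of those interfaces needs its own set of system-to-system mappings.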
Some products have tackled the N-squared problem by having a single central translation hub, which
translates in two steps via a central representation. Integration hubs have their uses, but relying on
a hub as the sole route for translation has not, in practice, proved popular. The performance cost of
two-step translation is more than a factor of two, and the hub is a bottleneck and a single point of
failure. Most important, the political implications of a single hub are unpopular in many organisations. Many
parts of the business resent the existence of a hub controlled by someone else, and simply bypass it.
2.2 The N-fold Maintenance Problem
If data integration were just a one-off problem, then given a strong business case, a heroic one-off
effort might be made to solve it. But it is not. Technology changes; business requirements change.
Systems change; and ‘standard’ e-commerce languages change.
Typical IT systems and XML languages undergo ceaseless evolution, with major new versions being
released at intervals of a few months. Every time one of these systems or languages changes, all of its
interfaces to other systems may also need to change. If these are defined by system-to-system
mappings, that makes a lot of re-mapping work. On the rule of thumb above (10-20 interfaces per
system), this means revising and updating 10-20 sets of mappings per system, every few months – a
massive maintenance workload.
The result is that necessary system changes are often just not made, because of the prohibitive
interface maintenance effort. Systems fall behind business requirements, and legacy systems are locked
in place by their interfaces – again, with heavy costs to the business.
2.3 The Double Knowledge Problem
It is not always evident, from the simple examples seen in marketing material, how complex it is to
map just one data source onto another. Typical relational databases have hundreds or thousands of
tables – and therefore have many thousands of fields or columns. Widely-used XML languages may
have hundreds or thousands of distinct elements. Therefore just one set of mappings between two data
sources may involve hundreds or thousands of field-to-field mappings.
To make these mappings accurately between any pair of data sources, you need to understand both
sources at a deep level. You need to understand their physical structure in detail – because these are
physical mappings. More important, you need to understand the detailed semantics of both data
sources – because it is wrong to map two fields onto one another if they have different meanings.
In a large organisation, it is usually possible to find a person who has this deep knowledge of the
physical structure and semantics for just one database, API or XML language. It is practically
impossible to find anyone who has such deep knowledge of two sources at the same time. So it is not
feasible to ask any one person to make the mappings for any pair of data sources – you need to find
two scarce, knowledgeable people, and task them to make the mappings together.
We have found this is rarely possible, in practical management terms. A person who is really
knowledgeable about one system or database is a valuable and scarce resource, and cannot be spared to
make repeated mappings of that system to other systems. More typically, the mappings are made by
people who have knowledge of only one side of the mapping, and guess the other side from available
documentation. This leads inevitably to expensive errors.

3. PROBLEMS OF INTRINSIC TRANSLATION CAPABILITY
We will illustrate the key data translation problems by examples in XML – although for many of them,
the same problem can occur in any other message format, or between relational databases. For each
problem, we will show two samples of XML which convey the same information. The requirement is
to translate accurately between the two, in either direction.
3.1 The Nesting Problem
Consider the following two fragments of XML:

<message>
  <driver name='Smith' age='42'>
    <car make='Ford' reg='KFL942'/>
    <car make='VW' reg='PEZ288'/>
  </driver>
  <driver name='Jones' age='27'>
    <car make='Fiat' reg='BCC100'/>
  </driver>
</message>

and:

<message>
  <drivers>
    <driver name='Smith' age='42'/>
    <driver name='Jones' age='27'/>
  </drivers>
  <cars>
    <car make='Ford' reg='KFL942' driver='Smith'/>
    <car make='VW' reg='PEZ288' driver='Smith'/>
    <car make='Fiat' reg='BCC100' driver='Jones'/>
  </cars>
</message>

These two messages convey exactly the same information, about a group of drivers and their cars.
Each driver has a name and an age, and can drive several cars. Each car has a make and a registration
number, and is driven by only one driver.
In the first message, the fact that a driver drives a car is denoted by nesting of the <car> element
inside the appropriate <driver> element. In the second message, the same fact is denoted not by
nesting of elements, but by each <car> element having a ‘driver’ attribute, which matches the ‘name’
attribute in some <driver> element.
Because the two messages convey the same information, it should be possible to translate from one to
the other.
It is our experience that field-to-field mapping tools cannot do this translation accurately in both
directions. Typically, when going from the ‘flat’ to the nested form, they lack the capability to group
just the correct inner elements (and no others) in any outer element. Furthermore, for mapping tools
which generate code, it seems to be very hard to ‘patch up’ the generated code to do the right job.
Mapping tools therefore fail to do a basic and important translation task.
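To make the required translation concrete, here is a minimal hand-coded sketch in Python, using the standard xml.etree.ElementTree module. It is not how any mapping tool works; it simply shows what a correct translation has to do in each direction. It assumes, as in the example, that a driver's name identifies the driver uniquely. The function names are ours.

```python
import xml.etree.ElementTree as ET

NESTED = """
<message>
  <driver name="Smith" age="42">
    <car make="Ford" reg="KFL942"/>
    <car make="VW" reg="PEZ288"/>
  </driver>
  <driver name="Jones" age="27">
    <car make="Fiat" reg="BCC100"/>
  </driver>
</message>
"""


def nested_to_flat(message: ET.Element) -> ET.Element:
    """Replace nesting by a 'driver' attribute on each car."""
    out = ET.Element("message")
    drivers = ET.SubElement(out, "drivers")
    cars = ET.SubElement(out, "cars")
    for driver in message.findall("driver"):
        ET.SubElement(drivers, "driver", dict(driver.attrib))
        for car in driver.findall("car"):
            ET.SubElement(cars, "car", dict(car.attrib, driver=driver.get("name")))
    return out


def flat_to_nested(message: ET.Element) -> ET.Element:
    """Put each car inside the driver whose name matches its 'driver' attribute."""
    out = ET.Element("message")
    by_name = {}
    for driver in message.findall("drivers/driver"):
        by_name[driver.get("name")] = ET.SubElement(out, "driver", dict(driver.attrib))
    for car in message.findall("cars/car"):
        attrs = {k: v for k, v in car.attrib.items() if k != "driver"}
        ET.SubElement(by_name[car.get("driver")], "car", attrs)
    return out


flat = nested_to_flat(ET.fromstring(NESTED))
round_trip = flat_to_nested(flat)
print(ET.tostring(flat, encoding="unicode"))
```

The essential step in the flat-to-nested direction is the lookup by driver name: each car must be placed in the one driver it belongs to, and in no other.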
Denoting an association such as ‘driver drives car’ by nesting of elements is very common in XML
(XML is good for nesting). But it is far from universal. Many associations are denoted in XML
languages by some kind of shared value. So this translation problem is a very common one. In any
large XML integration or interoperability project, you will probably have to translate between a nested
and a non-nested representation of an association. Most mapping tools cannot do it.
3.2 The De-Normalisation Problem
This is illustrated by two fragments of XML:

<products>
  <product name='widget' mfr='Acme Inc' mfState='NY'/>
  <product name='trunnion' mfr='Acme Inc' mfState='NY'/>
  <product name='plunger' mfr='Acme Inc' mfState='NY'/>
  <product name='valve' mfr='Perfecto' mfState='CA'/>
</products>

and:

<message>
  <products>
    <product name='widget' mfCode='ac'/>
    <product name='trunnion' mfCode='ac'/>
    <product name='plunger' mfCode='ac'/>
    <product name='valve' mfCode='pf'/>
  </products>
  <manufacturers>
    <mfr name='Acme Inc' code='ac' mfState='NY'/>
    <mfr name='Perfecto' code='pf' mfState='CA'/>
  </manufacturers>
</message>

The first fragment is de-normalised, in that information about two different kinds of entity (products
and manufacturers) is held in the same <product> element. Because the same manufacturer may make
many products, some information about the manufacturer (for instance, the state in which it is located)
is duplicated across the elements for many different products.
The second fragment is normalised, so that information about manufacturers is stored separately from
product information, and is not duplicated. ‘Manufacturer state’ is stored only once.
De-normalisation occurs very widely in databases and XML message formats; and normalised forms
are equally common. So it is a very common requirement to translate between normalised and de-
normalised forms.
Field-to-field mapping techniques have great difficulty in doing this, if they can do it at all. They do not
have any natural way to express the re-groupings of data which are required; so they either cannot
make the required translation at all, or at best they do it only by using tortuous procedural constructs,
which would be much better done in a high-level language.
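For comparison, the re-grouping can be written by hand in a high-level language in a few lines. The Python sketch below (standard library only) joins products to manufacturers in one direction and splits them apart in the other. One assumption is made explicit: the de-normalised form carries no manufacturer code, so the reverse direction is given a name-to-code table as a parameter. That table and the function names are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

NORMALISED = """
<message>
  <products>
    <product name="widget" mfCode="ac"/>
    <product name="trunnion" mfCode="ac"/>
    <product name="plunger" mfCode="ac"/>
    <product name="valve" mfCode="pf"/>
  </products>
  <manufacturers>
    <mfr name="Acme Inc" code="ac" mfState="NY"/>
    <mfr name="Perfecto" code="pf" mfState="CA"/>
  </manufacturers>
</message>
"""


def denormalise(message: ET.Element) -> ET.Element:
    """Copy each manufacturer's details onto every product it makes."""
    mfrs = {m.get("code"): m for m in message.findall("manufacturers/mfr")}
    out = ET.Element("products")
    for product in message.findall("products/product"):
        mfr = mfrs[product.get("mfCode")]
        ET.SubElement(out, "product", {
            "name": product.get("name"),
            "mfr": mfr.get("name"),
            "mfState": mfr.get("mfState"),
        })
    return out


def normalise(products: ET.Element, codes: dict) -> ET.Element:
    """Store each manufacturer once; 'codes' maps manufacturer name to code."""
    out = ET.Element("message")
    prods = ET.SubElement(out, "products")
    mfrs = ET.SubElement(out, "manufacturers")
    seen = set()
    for product in products.findall("product"):
        mfr_name = product.get("mfr")
        code = codes[mfr_name]
        ET.SubElement(prods, "product", {"name": product.get("name"), "mfCode": code})
        if mfr_name not in seen:  # emit each manufacturer only once
            seen.add(mfr_name)
            ET.SubElement(mfrs, "mfr", {
                "name": mfr_name, "code": code, "mfState": product.get("mfState"),
            })
    return out


flat = denormalise(ET.fromstring(NORMALISED))
back = normalise(flat, {"Acme Inc": "ac", "Perfecto": "pf"})
print(ET.tostring(back, encoding="unicode"))
```

In both directions the work lies in following the ‘manufacturer makes product’ association, not in copying individual fields.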
The difficulties of field-to-field mapping in nested and de-normalised data translations have a common
origin. Both of these problems are about associations between objects of different kinds: associations
such as ‘person owns car’ or ‘manufacturer makes product’. (Associations are the ‘relationships’ of
Entity-Relationship diagrams, and are fundamental to all data models.)
Field-to-field mappings allow you to talk about properties of things – to say ‘these two fields both
represent the same property’ – but they do not allow you to say how associations are represented.
Using a field-to-field mapping tool, there is no way to say ‘this is how the data source represents this
association’. So it is not surprising that field-to-field mapping tools are so bad at translating
associations. They can only do so when both data sources represent the association in the same way –
for instance, by shared values of properties. When one end of a translation represents an association in
some different way – by nesting or de-normalisation – field-to-field mapping breaks down.
Associations are the backbone of data, which hold it together. Without associations, a database or an
XML message would be just a collection of disconnected facts, and would be of little use. The inability
of field-to-field mapping tools to translate association information is a very serious defect.
3.3 The Uncommitted Language Problem
Consider these two fragments of XML:

<people>
  <person name='Smith' age='30'/>
  <person name='Jones' age='25'/>
</people>

and:

<people>
  <person>
    <prop pName='name'>Smith</prop>
    <prop pName='age'>30</prop>
  </person>
  <person>
    <prop pName='name'>Jones</prop>
    <prop pName='age'>25</prop>
  </person>
</people>

The first XML language can be extended, but only by extending its schema. If you want to represent
some other property of a person (such as their gender) you will have to extend the definition of the
XML language in its schema, to add a new attribute ‘gender’.
The second fragment of XML is designed to be extensible without having to extend its schema. You
can record genders (or any other property of people) by adding an element
<prop pName='gender'>male</prop>, which does not require any extension of the schema. This is a more
uncommitted XML language.
Translation between committed and uncommitted languages presents a further challenge for
field-to-field mapping translators. Since, in the second language, the content of the element ‘prop’ can
represent essentially anything, depending on its ‘pName’ attribute, there is no simple way to map it
onto an element or attribute of a committed language. Some mapping products have introduced a
conditional mapping construct to address this problem, but others cannot do it.
This ‘uncommitted’ style of XML language design is used quite commonly (for instance, it is used in
OAGIS XML messages), and it is used to determine entity classes and associations, as well as
properties. It is essential to be able to translate freely between committed and uncommitted structures.
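The translation between the two ‘people’ fragments above can be sketched by hand as follows, in Python with the standard library. The point to notice is that the uncommitted-to-committed direction must decide what to do with each <prop> by inspecting its ‘pName’ value at run time. In this sketch, properties not listed in the committed schema are simply dropped; that is our simplifying assumption, as are the function names and the ‘known’ parameter.

```python
import xml.etree.ElementTree as ET

COMMITTED = """
<people>
  <person name="Smith" age="30"/>
  <person name="Jones" age="25"/>
</people>
"""


def to_uncommitted(people: ET.Element) -> ET.Element:
    """Turn every attribute of a person into a generic <prop> element."""
    out = ET.Element("people")
    for person in people.findall("person"):
        new_person = ET.SubElement(out, "person")
        for key, value in person.attrib.items():
            prop = ET.SubElement(new_person, "prop", {"pName": key})
            prop.text = value
    return out


def to_committed(people: ET.Element, known=("name", "age")) -> ET.Element:
    """Turn <prop> elements back into attributes, keeping only known properties."""
    out = ET.Element("people")
    for person in people.findall("person"):
        attrs = {
            prop.get("pName"): (prop.text or "")
            for prop in person.findall("prop")
            if prop.get("pName") in known
        }
        ET.SubElement(out, "person", attrs)
    return out


generic = to_uncommitted(ET.fromstring(COMMITTED))
specific = to_committed(generic)
print(ET.tostring(generic, encoding="unicode"))
```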
3.4 The Data Grouping Problem
This is illustrated by the following two fragments of XML:

<college>
  <student name='Carter' age='20' year='2'/>
  <student name='LeBrun' age='19' year='1'/>
  <student name='Schmidt' age='18' year='1'/>
</college>

and:

<college>
  <students year='1'>
    <student name='LeBrun' age='19'/>
    <student name='Schmidt' age='18'/>
  </students>
  <students year='2'>
    <student name='Carter' age='20'/>
  </students>
</college>

In both examples, every student is in one of years 1 to 4. In the second example only, the students in
each year have been grouped together. Such grouping is very common, for instance in XML intended for
transformation to HTML and display as a report.
Translating between the grouped and ungrouped forms requires a structural transformation based on
meaning. It has been our experience that field-to-field mapping products generally cannot make the
required transformation.
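As a hand-coded reference point, the grouping and un-grouping can be sketched in Python with the standard library. The input is the ungrouped fragment shown first above; the function names are ours, and the ordering of groups by year number is our choice.

```python
import xml.etree.ElementTree as ET

UNGROUPED = """
<college>
  <student name="Carter" age="20" year="2"/>
  <student name="LeBrun" age="19" year="1"/>
  <student name="Schmidt" age="18" year="1"/>
</college>
"""


def group_by_year(college: ET.Element) -> ET.Element:
    """Collect students into one <students> element per year."""
    out = ET.Element("college")
    groups = {}
    # sorted() is stable, so students keep their original order within a year
    for student in sorted(college.findall("student"), key=lambda s: int(s.get("year"))):
        year = student.get("year")
        if year not in groups:
            groups[year] = ET.SubElement(out, "students", {"year": year})
        attrs = {k: v for k, v in student.attrib.items() if k != "year"}
        ET.SubElement(groups[year], "student", attrs)
    return out


def ungroup(college: ET.Element) -> ET.Element:
    """Push each group's year back down onto its students."""
    out = ET.Element("college")
    for group in college.findall("students"):
        for student in group.findall("student"):
            ET.SubElement(out, "student", dict(student.attrib, year=group.get("year")))
    return out


grouped = group_by_year(ET.fromstring(UNGROUPED))
flat_again = ungroup(grouped)
print(ET.tostring(grouped, encoding="unicode"))
```

The ‘year’ value moves between the student and the enclosing group, so the translation changes where a property lives as well as how elements are arranged.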
3.5 Bringing It Together
These are not rare or isolated problems. They occur all the time in practical data translations, for XML,
databases and other APIs. One of these problems is not likely to occur as above, in isolation – it will
occur together with some of the others, and with others we have not mentioned here (such as merging
and splitting of fields; superclasses and subclasses; and duplicated representation of objects).
As we have seen, each problem on its own presents serious difficulties for the field-to-field mapping
approach. These difficulties arise first, because the approach does not recognise the importance of
associations in data models, and second, because it is ill-equipped to make complex structural
transformations of data.
Vendors of mapping products have partially recognised this difficulty. As well as the basic mapping
functionality, some products offer procedural constructs (such as ‘IF’ constructs and iterators) which
can be built into the mappings to produce more complex behaviour. Sometimes this behaviour can be
made (with difficulty) to give the required structural transformations, and solve one of the four
translation problems. More often it cannot. Even if it can, there are serious difficulties:
1. The mixture of procedural and non-procedural mapping constructs is very hard to understand and
debug.
2. If a problem can only be overcome by introducing a complex procedural construct, then the
claimed benefit (of avoiding procedural hand-coding) has disappeared. You would do better to do
the procedural coding in a high-level language which is designed for the purpose.
3. If procedural constructs are used to solve two or more overlapping problems, their interactions
with the automatic functionality and with each other are extremely complex. You are not likely to
produce a solution which anybody can understand or maintain.
If the mapping tool works by code generation, then the alternative (of hand-tweaking the generated
code) is even worse to contemplate.
The result is that field-to-field mapping tools are not capable of tackling industrial-strength data
translation challenges, where the problems described above are widespread.

4. A WORKING SOLUTION TO THE EIGHT KEY PROBLEMS
The Charteris Integration Toolkit is described in more detail elsewhere. Here we give only the briefest
description, to show how it meets the eight challenges above.
The Charteris XML Integration toolkit does not work by field-to-field mapping. Before making any
mappings, you first define a business object model of the domain. This can be done directly in one
of the tools in the toolkit, or in UML using a CASE tool such as Rational Rose. This object model is
technology-independent. Creating it requires business knowledge, not technical knowledge.
Then, you do not map the different XML languages or databases onto each other; you map them each
onto the business object model. The mapping tool captures XML schemas and relational schemas
automatically, and then provides graphical facilities to make and review their mappings onto the object
model. A set of mappings defines how a database or XML language conveys the information in the
object model.
These mappings are exported in an XML format and are used by other tools in the Charteris toolset.
Given the mappings for any two languages, a tool in the toolset can translate messages directly from
one language to the other. This avoids the inefficiencies of a two-step translation, and does not require
any translation hub.
This approach solves the eight problems of data integration, as follows:
1. The N-Squared Problem: Because each language or database is mapped only onto one business
object model, and not onto any other languages, the cost of making all the mappings grows only
in proportion to N, not N².
2. The N-Fold Maintenance Problem: When any language or database changes to a new version,
you only need to change one set of mappings onto the business model. You do not need to change
any mappings onto other systems.
3. The Double Knowledge Problem: To make the mappings for one language or system, you need
to understand that language or system in depth, and you need to understand the business object
model. The latter is business knowledge, which you need in any case to understand the semantics
of the language or system. Deep knowledge of two systems is not required.
4. The Nesting Problem: The Charteris toolkit translates accurately and automatically between
nested and ‘flat’ XML languages. This is because the mappings onto the object model define
properly how each language represents the associations in the object model. Therefore association
information is translated accurately.
5. The De-Normalisation Problem: The Charteris toolkit translates accurately and automatically
between different de-normalised forms, both from databases and in XML. Again, this is because
the mappings onto the object model define properly how each language represents the associations
in the object model, which have been used to de-normalise the data.
6. The Uncommitted Language Problem: Mapping constructs in the toolkit describe
uncommitted languages in a simple and natural manner, allowing accurate automatic translation.
7. The Data Grouping Problem: Translations are accurately made between grouped and un-grouped
forms of the same data.
8. Bringing it all together: The automatic translator is designed to handle all these problems
simultaneously, in one translation. Because you do not have to do ‘specials’ to solve these
problems, you do not have to worry about how the specials interact; and the translator gets it right.
You do not have to take our word for it that the Charteris toolkit really does solve these problems of
data integration. On request to robert.worden@charteris.com, we will send you an evaluation pack,
with demonstration solutions to the five problems of intrinsic translation capability, and the means for
you to construct your own more complex tests.
