Describing Data Patterns
A general deconstruction of metadata standards
Dissertation
in support of the degree of
Doctor philosophiae
(Dr. phil.)
by Jakob Voß
submitted on January 7th, 2013
defended on May 31st, 2013
at the Faculty of Philosophy I
Humboldt-University Berlin
Berlin School of Library and Information Science
Reviewers:
Prof. Dr. Stefan Gradmann
Prof. Dr. Felix Sasaki
Prof. William Honig, PhD
This document is licensed under the terms of the Creative Commons Attribution-ShareAlike license (CC-BY-SA). Feel free to reuse any parts of it as long as attribution is given to Jakob Voß and the result is licensed under CC-BY-SA as well.
The full source code of this document, its variants and corrections are available
at https://github.com/jakobib/phdthesis2013. Selected parts and additional
content are made available at http://aboutdata.org
A digital copy of this thesis (with the same pagination but larger margins to fit A4 paper format) is archived at http://edoc.hu-berlin.de/.
A printed version is published through CreateSpace and available via Amazon and selected distributors.
ISBN-13: 978-1-4909-3186-9
ISBN-10: 1-4909-3186-4
Cover: the Arecibo message, sent into empty space in 1974 (image CC-BY-SA Arne
Nordmann, http://commons.wikimedia.org/wiki/File:Arecibo_message.svg)
Abstract
Many methods, technologies, standards, and languages exist to structure and describe data. The aim of this thesis is to find common features in these methods in order to determine how data is actually structured and described. Existing studies are limited to notions of data as recorded observations and facts, or they require given structures to build on, such as the concept of a record or the concept of a schema. These presumed concepts have been deconstructed in this thesis from a semiotic point of view. This was done by analysing data as signs, communicated in the form of digital documents. The study was conducted using a phenomenological research method. Conceptual properties of data structuring and description were first collected and experienced critically. Examples of such properties include encodings, identifiers, formats, schemas, and models. The analysis resulted in six prototypes that categorize data methods by their primary purpose. The study further revealed five basic paradigms that deeply shape how data is structured and described in practice. The third result consists of a pattern language of data structuring. The patterns show problems and solutions which occur over and over again in data, independent of particular technologies. Twenty general patterns were identified and described, each with its benefits, consequences, pitfalls, and relations to other patterns. The results can help to better understand data and its actual forms, both for the consumption and creation of data. Particular domains of application include data archaeology and data literacy.
Zusammenfassung
This thesis addresses the question of how data is fundamentally structured and described. In contrast to existing treatments of data in the sense of stored observations or facts, data is here understood semiotically as signs. These signs are communicated in the form of digital documents and are structured and described by means of numerous standards, formats, languages, encodings, schemas, techniques, etc. This multiplicity of methods is analysed for the first time in its entirety, with the help of the phenomenological research method. The aim is to reach the general essence of data structuring and description through a close experience and description of the methods by which data is structured and described. The results of this thesis consist of three parts. First, six prototypes emerge that categorize the described methods by their primary purpose. Second, there are five paradigms that fundamentally shape the understanding and application of methods for structuring and describing data. Third, this thesis presents a pattern language of data structuring. Twenty patterns document typical problems and solutions that recur in the structuring and description of data, independent of concrete techniques. The results of this thesis can contribute to a better understanding of data (that is, digital documents and their metadata) in all their forms. Particular fields of application include data archaeology and data literacy.
Contents

Abstract
Zusammenfassung
List of Figures
List of Tables

1. Introduction
   1.1. Motivation
   1.2. Background
   1.3. Method and scope
   1.4. Related work

2. Foundations
   2.1. Mathematics
        2.1.1. Logic
        2.1.2. Set theory
        2.1.3. Tuples and relations
        2.1.4. Graph theory
   2.2. Computer science
        2.2.1. Formal languages and computation
        2.2.2. Data types
        2.2.3. Data modeling
   2.3. Library and information science
        2.3.1. Background of the discipline
        2.3.2. Documents
        2.3.3. Metadata
   2.4. Philosophy
        2.4.1. Philosophy of Information
        2.4.2. Philosophy of Data
        2.4.3. Philosophy of Technology and Design
   2.5. Semiotics
   2.6. Patterns and pattern languages

3. Methods of data structuring
   3.2. Identifiers
        3.2.1. Basic principles
        3.2.2. Namespaces and qualifiers
        3.2.3. Identifier Systems
        3.2.4. Descriptive identifiers
        3.2.5. Ordered identifiers
        3.2.6. Hash codes
   3.3. File systems
        3.3.1. Origins and evolution
        3.3.2. Components and properties
        3.3.3. Wrapping file systems
   3.4. Databases
        3.4.1. Record databases
        3.4.2. Hierarchical databases
        3.4.3. Network databases
        3.4.4. Relational databases
        3.4.5. Object databases
        3.4.6. NoSQL databases
   3.5. Data structuring languages
        3.5.1. Data binding languages
        3.5.2. INI, CSV, and S-Expressions
        3.5.3. JSON
        3.5.4. YAML
        3.5.5. XML
        3.5.6. RDF
   3.6. Markup languages
        3.6.1. General markup types
        3.6.2. Text markup languages
   3.7. Schema languages
        3.7.1. Regular Expressions and Backus-Naur Form
        3.7.2. XML schema languages
        3.7.3. RDF schema languages
        3.7.4. SQL schemas
        3.7.5. Other schema languages
   3.8. Conceptual modeling languages
        3.8.1. Entity-relationship model
        3.8.2. Object Role Modeling
        3.8.3. Unified Modeling Language
        3.8.4. Domain specific modeling and metamodeling
   3.9. Conceptual diagrams
        3.9.1. Diagram types
        3.9.2. Diagram properties
   3.10. Query languages and APIs

4. Findings
   4.1. Categorization of methods
   4.2. Paradigms in data structuring
        4.2.1. Documents and objects
        4.2.2. Standards and rules
        4.2.3. Collections, types, and sameness
        4.2.4. Entities and connections
        4.2.5. Levels of abstraction

5. Patterns in data structuring

6. Conclusions
   6.1. Summary and results
   6.2. Applications
        6.2.1. Data archaeology
        6.2.2. Data literacy
   6.3. Further research
   6.4. Final reflection

Bibliography

Appendices
   A. Honig's analysis model of data structures
   B. Conceptual diagrams as digital documents
   C. A pattern graph
   D. Deconstruction of a MARC record
   Danksagung

List of Figures

2.1. Laws of Boolean algebra and resulting truth tables
2.2. Terms and types of binary relations
2.3. Some digraph types exemplified
2.4. Tree types
2.5. Relations (proper containments) between classes of formal languages
2.6. Summarized view of the data modeling process
2.7. Basic elements of ORM2 notation
2.8. Shannon's model of a communication system extended by nesting
2.9. Dyadic and triadic models of a sign
2.10. Methods of reasoning, as illustrated by Eco (1984)
3.19. Ambiguities and structural element differences in markup syntax
3.20. Text markup model of CSL
3.21. Schema languages allow to express schemas for multiple applications
3.22. Backus-Naur Form and syntax diagram elements
3.23. Lexical space and value space of XSD datatypes with boolean as example
3.24. Euler diagram, Venn diagram, and Peirce's extension to Venn
3.25. Spider diagram and constraint diagram
3.26. Mindmap of spatial relationship types in conceptual diagrams
3.27. Interaction with an information system
4.1. Redundancy and relevance in digital documents
4.2. Existing encoding mappings between several data structuring languages
5.1. Connections between patterns (combining patterns in bold)
5.2. Connections between patterns (relationing patterns in bold)

List of Tables

3.1. Various character encodings of the capital letter A
3.2. Symbol ranges in Unicode
Chapter 1
Introduction
Beginning thinkers in this area often suppose that what will be offered to the screen
reader will be merely individual stored documents, available on line quickly, but based
somehow on conventional documents ensiling in conventional computer files.
Our point of view is different.
Ted Nelson (1981): Literary Machines, page 1/9
1.1. Motivation
Bibliographic data conceals a sneaky complexity. At first glance it is all quite familiar and simple: there is an author, a title, and a date of publication. It is no coincidence that computer science publications frequently exemplify data by bibliographic data and metadata by library catalogs. On closer inspection everything falls apart: how about multiple authors, editors, publishers, and translators? What if authors are unknown or known under different names? What about subtitles and abbreviations? Which date does one specify, in which detail, and when does it change? Library science has elaborated detailed cataloging rules to answer these questions. But bibliographic data is not created, stored, modified, and used solely by skilled librarians; even they do not commit to a single schema. Moreover, the subject of cataloging is changing. Although Ted Nelson's vision of a purely digital ecosystem of interconnected documents has not become reality yet, more and more publications appear in digital form. Traditional concepts such as document, page, edition, and copy blur or change their meaning; the continuing popularity of print-oriented techniques like the portable document format (PDF) is only a sign of reaction to this process.
In library and information science the description of physical documents is still relevant, but it has largely been solved as a research topic. The topic of this thesis is the description of digital documents, which are given as data. While there is an increasing research interest in (digital) documents (see Buckland (1998b), Pédauque (2006), and Skare, Vårheim, and Lund (2007) for approaches), the nature of these documents as data has received less attention. Digital documents and all digital content share two crucial and basic properties that have long been neglected in library science: bits can freely be copied and rearranged. Despite their nature as artifacts of communication, physical documents are still considered as stable, distinguishable, and atomic entities. But given a digital document, one can create any number of identical copies, indistinguishable from the original. With the same ease one can create modifications that may or may not constitute documents in their own right. These properties of digital documents complicate their bibliographic description by metadata. Despite the success of full text search for simple retrieval, the importance of metadata for digital content is even higher than for physical objects: if metadata is used to differentiate digital copies from each other, it does not only describe but also constitutes the document. Digital documents are also much more likely to be processed, transformed, and aggregated than physical documents. In doing so, metadata is needed to state when two different strings of bits represent the same document. Lacking a physical structure of pages, we also require metadata to structure documents into parts.
As documents become digital, so does metadata. To understand the nature of bibliographic data, we must first understand the nature of data. If it is something given (as indicated by the Latin origin datum), where does the act of giving originate? If data describes other data, what properties does it refer to? We will approach these questions by analysis of existing methods of structuring and describing data. The results apply to both data and metadata. Later it will be shown what constitutes the relationship between data and metadata and how the results apply to metadata about digital documents in particular. A clear distinction between data and metadata is difficult for several reasons. Sometimes it is not even clear whether metadata is added as description about a piece of data or whether it is part of it. Can citations be considered as metadata only if extracted from a document? Are they metadata about the citing document, the cited document, or both? We may answer this question by metadata about metadata, but does this lead to an infinite chain of descriptions? Furthermore, not only metadata but also most data is about something. Its referent may be neither a document nor digital, but it is not directly accessible: in digital environments the association between data and its referent is always constructed by a document, because non-documents, such as people, places, and ideas, are not directly accessible in the digital realm. The affinity between metadata and data has an effect on every new system and method to structure data. For instance, it has been mentioned since the beginning of the Semantic Web (Tim Berners-Lee 1997; R. Guha and Bray 1997; Ora Lassila and Swick 1999), although it took some time to shift the focus from information about documents (information resources) to information about any objects (non-information resources). Meanwhile, concrete forms of data are undervalued in favor of a common data language such as the Resource Description Framework (RDF). However, a look at existing data shows that there is not one common data language but a multiplicity of formats, languages, systems, and structures. Data and metadata are structured in many forms, e.g. file systems, databases, markup, formats, encodings, schemas, and queries. It is unlikely that this plurality will be replaced by one type of data only.
This thesis will look into this variety without proposing one method of data structuring as superior to the others. Instead I want to find common patterns: frequent strategies that occur over and over again in data and metadata. Just like linguistic analysis of natural language reveals insights into (social) reality, a deeper understanding of structures in data and metadata can give insight into structures of the world that are reflected in data. Grammar, dialects, rhetoric figures, and other patterns that shape natural language are deeply studied in linguistics. A similar approach to data, which could be called data linguistics, is still to be founded. A deeper understanding of data patterns is crucial especially for libraries and archival institutions. Future librarians and archivists will likely be confronted with more and more digital documents that have been structured and described by outdated methods. Knowledge of common patterns in data can help when digital preservation has failed, by application of what could be called data archaeology. It could be argued that there is no need for data patterns, because concrete data structures and models already implement and define data much more precisely. Yet existing approaches are not enough, because they each focus on one specific formalization method. This practical limitation blocks the view of more general data patterns, independent of a particular encoding, and it conceals blind spots and weaknesses of a chosen formalism. Even a perfect theoretical system of data, metadata, and digital documents may not suffice. In practice data is often far less organized than it was meant to be. Standards are misinterpreted or ignored. Documentation is sketchy. Markup and formats are unknown or broken. Eventually documents turn out to be inherently as fuzzy as the reality that they deal with. Digital libraries, at least, cannot just reject data if it lacks appropriate descriptions, so they must recognize its complexity and uncertainty. To understand and reveal concealed structures in data, we must not only know the techniques that have been applied to it, but also the patterns that underlie and motivated the application of specific technologies. This thesis will hopefully provide at least some basic guidance for this challenge.
1.2. Background
We do not, it seems, have a very clear and commonly
agreed upon set of notions about data.
George Mealy (1967): Another look at data
The main topic of this thesis is the structure and description of data in digital documents. The concept of data is relevant to many disciplines with various meanings: a summary of different philosophies of data is given by Ballsun-Stanton (2010, 2012), with data as the product of objective, reproducible measurements (data as hard numbers), data as the product of any recorded observations (data as observations), and data as processable encodings of information and knowledge (data as bits). This research commits to the third understanding of data, which is found both in computer science and in library and information science. Without committing to a specific definition of information and knowledge, we assume that data is given as processable encoding of something, at least as a sequence of bits. This definition is compatible with notions of data in some disciplines that provide tools and related works for data research. Chapter 2 gives an overview of basic concepts and foundations from mathematics (section 2.1), computer science (section 2.2), library and information science (section 2.3), philosophy (section 2.4), semiotics (section 2.5), and pattern theory (2.6). This thesis can best be located between library and information science on the one hand and computer science on the other. Neither discipline deals with data as its primary topic; both prefer the term information, to which data is related as a secondary form. The scope of library and information science includes the description of documents, with data as one aspect of digital documents. Computer science does not deal with data as such either, but with computation and the implementation of automatic processes. This involves data, but data has never been made the central object of research, as suggested by Naur (1966). Over the past years there has been an increasing interest in data, motivated by the growing amount of Open Data and tools for data analysis. This has brought up ideas of data science and data journalism (Bradshaw and Rohumaa 2011). However, both commit to the philosophies of data as hard numbers or data as observations, as they deal with aggregating, filtering, and visualizing large sets of data, based on statistical methods of data analysis. Such analyses require a basic understanding of data as prerequisite, but they do not make it their primary object of investigation. The main concern of data science is big data, that is, when the size of the data itself becomes part of the problem (Loukides 2010). In contrast, the problem of this research is the inherent complexity of data, which exists independent of its size. It is my aim to show how data is actually structured and described, independent of the particular applications for which, and the particular technologies with which, data is processed.
For the most part, the research question is focused on digital documents as instances of data. The document is a core concept of library and information science (see section 2.3). Nevertheless, there exists no commonly agreed upon definition, even within the discipline. As described by Buckland (1997, 1998a), the nature of
(digital) documents can better be defined in terms of function rather than format: whatever functions as a document can be a document. The document must only be usable as recorded evidence in support of a fact (Briet 1951).1 A digital document, in short, can be any data object that eventually exists as a sequence of bits. Such data objects are often referred to as information. However, Ted Nelson (2010, p. 300) is right as he writes in response to a misleading summary of his hypertext system Xanadu by Tim Berners-Lee: not all the world's information, but all the world's documents. The concept of information is arguable, documents much less so.2 An important distinction between digital documents, which are subject to bibliographic description, and general data objects or information, which are subject to general data management in business, is the stability of documents. While business databases are designed to cover the current state, bibliographic data is designed to cover what has been published or recorded. Description and interpretation of a digital document may change, be extended, reduced, or turn out to be wrong. Still there is the assumption of facts, which a document is evidence for, even if both the facts and the documents can be expressed in several ways. Business data, in contrast, describes facts that change in time: products are created and sold, people are hired and fired, etc. Most information systems cover both types of data: static data, which is not changed, and dynamic data, which may change. For instance, a library system holds descriptions of publications: these publications are documents which are not changed after they have been published. At the same time the system holds descriptions of dynamic holdings, which are bought, lent, and sorted out. Dynamic data can be transformed into static data by just freezing it; this process of archiving or preservation is one major task of library institutions. This frozen data is what constitutes a digital document. This finding, however, does not answer which data constitutes a document and how one determines its relevant parts. The definition of a document is either passed on to the eye of the beholder or to the level of metadata about documents. The former cannot be automated, and the latter forms a digital document of its own. As there is no obvious distinction between data and metadata, metadata only shifts the problem of document identity to another level. Nevertheless, metadata provides useful methods to tackle the nature of digital documents in particular and data in general. To further identify digital documents, we need to reveal structures in data and metadata, which will be described by data patterns.
1 Translation from French by Buckland (1997). In practice a fact that is supported by a document can be
futuristic project, Xanadu, in which all the world's information could be published in hypertext.
pattern theory and Alexander led to no results. Only Palmer (2009) follows a related approach with a phenomenological analysis of meta-systems and systems engineering, among them some patterns and schemas.
conceptions. Explicit and implicit structures are intertwined on multiple levels, and both structure and describe data.
Data mining and machine learning, on the other hand, can only recognize known structures at one level of description, but they cannot automatically detect and interpret unexpected kinds of structures. In a nutshell, one cannot find out whether and how different things relate to each other by treating them all as equal. Quantitative methods only show patterns within the constraints of a fixed format. For instance, data mining can find co-occurrences in data fields from a large number of records, but first one must identify what constitutes a field and a record. These entities are examples of the general data patterns which we are looking for.
Both the scope of this thesis and its method, applied to the revealing of patterns in data description independent of particular technologies, are unique. Nevertheless there are several related works to build on. These works either deal with particular technologies and parts of the problem, or they tackle the problem of data description from different points of view and with different methods. The following section gives a brief overview of related work with emphasis on concrete publications and authors. The overview begins historically with early and foundational works, followed by analyses of patterns in particular domains, including metamodels and taxonomies. Some additional works are relevant because they share theoretical or methodological fundamentals with this thesis. In order to better find less visible aspects of data structuring I also paid particular attention to researchers who complained about the established treatment of data and digital documents (Kent, Nelson, Naur, etc.). General approaches from specific disciplines and methods (mathematics, computer science, library and information science, semiotics, philosophy, and the study of patterns and pattern languages) will be dealt with in chapter 2.
Some data patterns before the advent of computer systems may be found in forms and questionnaires (page 137). Most aspects of the history of forms, however, still have to be written, so this topic is skipped here. The use of bits as the most fundamental unit of data also dates back to the pre-electronic age.4 The first foundational analyses were created from the late 1950s to the 1960s, around the same time when distinctions between data and information were introduced (Gray 2003) and when computer science emerged as an independent discipline. These early works remain important also because they are less bound to particular technological trends and paradigms that emerged later. For instance, the abstract formulation of data processing problems by Young and H. K. Kent (1958) contains a clear separation between information sets (sets of possible information items belonging to the same class, from which data is drawn) and documents (collections of related information items). The concept of a hierarchy of models in models of data (Suppes 1962) can also help in understanding general problems of data structuring, although early notions of data tend to conform to the idea of data as recorded observations (see page 42). Later the interest of researchers shifted from data to concepts like information (or even knowledge). Against this, Naur (1966, 1968) suggested the term datalogy for the science of the nature and use of data. As described by Sveinsdóttir and Frøkjær (1988), Naur also criticized the focus of computer science curricula on formalization, disregarding social aspects, the psychology of programming, and applications. My application of pattern analysis instead of mathematical formalisms follows Naur's understanding of datalogy. Another foundational discussion on data is given by Mealy (1967) and a response by Chapin (1968).
Several works deal with intellectual analysis of data patterns or similar concepts in particular domains: Armstrong (2006) identified common patterns in Object Orientation by literature analysis and named them quarks. The most common quarks, asserted as characterizing Object Orientation in more than every second analyzed article, are inheritance (71 of 88 articles), objects (69), classes (62), encapsulation (55), methods (50), message passing (49), polymorphism (47), and abstraction (45). Patterns in hierarchical documents, with a focus on XML based languages, have been analyzed by Dattolo et al. (2007). Their basic patterns for segmentation and extraction of structural document elements, first identified by Vitali, Di Iorio, and Gubellini (2005), are: markers (with meaning depending on position), atoms (such as unstructured plain text), block and inline elements, records (sets of unordered, optional and non-repeatable elements), containers (sets of unordered, optional and repeatable elements), and tables (sequences of homogeneous elements). More collections of patterns can be found in (conceptual) data modeling literature, although their primary focus is on problems in business enterprises. The most general compilations have been collected by Hay (1995) and by L. Silverston (2001). Among other business tasks, the former contains a brief chapter on simple document modeling, and the latter provides template data models for common entities such as people,
4 The binary number system is attributed to Leibniz (1703) and Boole (1854). Earlier examples of non-numerical uses of binary systems are the African Ifa divination and the Chinese I Ching hexagrams.
database management systems, resulting in a description model with core properties, such as homogeneity, atomicity, repeatability etc., similar to the patterns identified in this thesis. Honig's work, apparently, has neither been taken up nor updated so far. Both ISO 11404 and Honig's model are compared with the final data pattern language in section 5.6.
Chapter 2
Foundations
The phenomena to be analyzed in this thesis include all methods to structure and to describe digital data. To experience these phenomena we must first broaden our view to see where they can be found. For this reason, this chapter will first introduce the disciplines that deal with data and the description of digital documents. This introduction also includes definitions of some basic concepts and notations. These foundations are both used during collection and analysis in the following chapters, and they can be instances of phenomena in their own right. For instance, mathematical set theory is used to define other methods of data structuring, but set theory alone can also be used for data structuring. We cannot fully avoid this circularity, as every description must be formulated in some other description; basically this is the core problem of data description. Nevertheless we can show how different disciplines approach and tackle the problem. The largest part of this chapter introduces mathematics (section 2.1) and computer science (section 2.2), because these form the traditional foundations of data: mathematics has proved to be an effective tool to exactly describe structures of any kind, and computer science provides the most examples, as most problems of practical data processing belong to its domain. The approach of library and information science (section 2.3) is different: it is basically concerned with the organization and description of documents. As more and more documents become digital, the discipline increasingly has to deal with data. The impact of philosophy is more subtle: as outlined in section 2.4, there is not much explicit philosophy of data, but philosophical issues permeate all other disciplines, and philosophy helps to reveal blind spots of other points of view. Semiotics (section 2.5) is relevant to this thesis because it deals with signs and language, of which all meaningful data is an instance. Section 2.6 finally introduces the fundamental concepts of patterns and pattern languages. Both can be combined into pattern theory, which is more a practice or an art than a scientific discipline.
2.1. Mathematics
As far as the laws of mathematics refer to reality, they are not certain;
and as far as they are certain, they do not refer to reality.
Albert Einstein
2.1.1. Logic

Mathematical logic has its origin in philosophy, which also studied the principles of valid reasoning. In particular the logic of Aristotle was influential until the mid-nineteenth century, when a mathematical analysis of logic was introduced by Boole (1847, 1854). Mathematical logic replaced natural language with formal symbols to express truth values and logical statements. Typical notations include:

⊤ or 1 for true and ⊥ or 0 for false
¬ for negation
∧ and ∨ for logical conjunction and logical disjunction
→ and ↔ for logical implication and logical equivalence
symbols such as a, b, x, y for variables and individual constants, independent of the ontological status of their referents

Alternative visual notations of logic systems, as introduced by Euler (1768) and Venn (1880), are dealt with in section 3.9. The basic rules of how to combine and interpret statements from these formal symbols can be defined by Boolean algebra. Figure 2.1 contains laws of Boolean algebra and the resulting truth tables for the basic operations ¬, ∧, ∨, →, and ↔. Laws 5 to 8 could also be derived from laws 1 to 4, and in total there are 16 binary Boolean operations. The laws of Boolean algebra can be used
1 As shown by Gödel (1931), mathematics can even prove that in a system of axioms that is complex enough there are consistent statements which cannot be proven or disproven without additional axioms.
    1. commutativity      x ∧ y ≡ y ∧ x                        x ∨ y ≡ y ∨ x
    2. distributivity     x ∧ (y ∨ z) ≡ (x ∧ y) ∨ (x ∧ z)      x ∨ (y ∧ z) ≡ (x ∨ y) ∧ (x ∨ z)
    3. annihilation       x ∧ 0 ≡ 0                            x ∨ 1 ≡ 1
    4. excluded middle    x ∧ ¬x ≡ 0                           x ∨ ¬x ≡ 1
    5. idempotence        x ∧ x ≡ x                            x ∨ x ≡ x
    6. associativity      x ∧ (y ∧ z) ≡ (x ∧ y) ∧ z            x ∨ (y ∨ z) ≡ (x ∨ y) ∨ z
    7. absorption         x ∧ (x ∨ y) ≡ x                      x ∨ (x ∧ y) ≡ x
    8. logical identity   x ∧ 1 ≡ x                            x ∨ 0 ≡ x

    x y | x ∧ y  x ∨ y  x → y  x ↔ y        x | ¬x
    0 0 |   0      0      1      1          0 |  1
    0 1 |   0      1      1      0          1 |  0
    1 0 |   0      1      0      0
    1 1 |   1      1      1      1

Figure 2.1.: Laws of Boolean algebra and resulting truth tables
for inference, that is, to derive new logical statements from given logical statements by deductive reasoning.2 For instance one can prove that ¬(x ∧ y) ≡ ¬x ∨ ¬y and ¬(x ∨ y) ≡ ¬x ∧ ¬y, which is known as De Morgan's law. The logic of Boolean algebra is equivalent to propositional logic and to the algebra of sets (see section 2.1.2), among other descriptions. In particular, digital switching circuits can be described by Boolean algebra (C. Shannon 1938), which is the basis of all digital computer systems.
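Because the variables only range over two truth values, laws like De Morgan's can be verified by exhaustive enumeration. A minimal sketch in Python, using bool with and, or, and not as a model of the Boolean algebra:

    from itertools import product

    # check De Morgan's laws for all combinations of truth values
    for x, y in product([False, True], repeat=2):
        assert (not (x and y)) == ((not x) or (not y))  # ¬(x ∧ y) ≡ ¬x ∨ ¬y
        assert (not (x or y)) == ((not x) and (not y))  # ¬(x ∨ y) ≡ ¬x ∧ ¬y
    print("De Morgan's laws hold for all truth values")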
Further formalization of logical statements is possible with predicate logic, which extends propositional logic with predicates and quantification. A logical predicate is an individual symbol that refers to a general statement with zero or more empty spaces. Predicates are typically written in functional syntax, for instance f(·, ·) denotes the binary predicate f and g(a) denotes the unary predicate g where the space is filled by the variable a. The number of spaces is called the predicate's arity. Predicates are logical statements insofar as they have a truth value for each combination of individual variables. For instance g(a) with predicate g and variable a is either true or false. The ontological status of predicates, however, is irrelevant to predicate logic: for instance predicates could refer to properties (e.g. g(a) for "a is blue"), concept types (g(a) for "a is a book"), relations (f(a, b) for "a is friend of b"), or
2 See also figure 2.10 for methods of reasoning.
attributes (f(a, b) for "a has size b"). Normal predicate logic (also known as first-order predicate logic) has two fundamental kinds of quantification: universal quantification (∀) to state that a logical statement is true for all possible values of a variable, and existential quantification (∃) to state that there is at least one value of a variable that makes a logical statement true. If predicates and/or quantifiers can be replaced by variables, predicate logic is extended to higher-order logic. Higher-order logic allows more complex statements about statements, but it is hard to verify statements even in second-order logic. A common and less complicated extension of predicate logic is the introduction of an identity predicate or identity relation, also referred to as equality. With the equality relation = one can define a uniqueness quantifier to denote that exactly one object exists. We write ∃!x : φ(x) to denote that there is exactly one x for which φ(x) is true. The equality relation can also be used to state that a statement is true for a specific number of distinct values.
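Over a finite universe, quantifiers can be evaluated directly. A small illustrative sketch in Python, where all corresponds to ∀, any to ∃, and counting matches implements ∃! (the universe and the predicate are arbitrary choices for this example):

    universe = range(1, 10)
    phi = lambda x: x % 3 == 0                        # φ(x): "x is divisible by 3"

    forall = all(phi(x) for x in universe)            # ∀x : φ(x) is False
    exists = any(phi(x) for x in universe)            # ∃x : φ(x) is True
    unique = sum(1 for x in universe if phi(x)) == 1  # ∃!x : φ(x) is False (3, 6, 9)
    print(forall, exists, unique)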
Further extensions and modifications of propositional logic and predicate logic are possible by adding to and by modifying the basic laws of Boolean algebra. For instance one can argue against the law of the excluded middle (law 4 in figure 2.1) and introduce a third truth value, in addition to true and false, to denote unknown or irrelevant (ternary logic). Other so-called non-classical logic extensions include:

an interval of possible truth values (fuzzy logic)
additional quantifiers to express modality of statements (modal logic)
elimination of the law of excluded middle and double negation, so statements have a truth value only if it can explicitly be inferred (intuitionistic logic)
introduction of default values and exceptions (default logic)
support of inconsistent statements (paraconsistent logic)
One example of modal logic relevant to data description is deontic logic, because this logic is concerned with obligation, permission, and norms (McNamara 2010). Deontic logic, however, includes several outstanding philosophical problems. The basic problem, known as Jørgensen's dilemma, is based on the relation of deontic values to logical truth values (Jørgensen 1937): on the one hand norms cannot be true or false but only fulfilled or violated, but on the other hand some norms seem to follow logically from others. If one tries to formalize these implications, one may get unexpected results such as the Good Samaritan Paradox: given that a person is robbed, and given that one should guard a person that is robbed, it follows that a person should be robbed, because without robbery one cannot guard anyone.
The choice of a specific logic system and how it suits the domain to be described is a philosophical question (see section 2.4). We mostly assume classical logic, but on a closer look methods to structure and describe data include non-classical elements, such as the introduction of NULL values (ternary logic) and default values (default logic), combined with annotations (higher-order logic). On the other hand, data description should be easy to compute, so even first-order predicate logic can be too complex. Description logics restrict predicate logic to concepts (unary predicates) and roles (binary predicates) in order to retain computable complexity:

          description logic          notation      predicate logic
          concepts                   A, B, C ...   unary predicates A(x), B(x), C(x) ...
          roles                      R, Q ...      binary predicates R(x, y), Q(x, y) ...
    TBox  top concept                ⊤             ∀x : ⊤(x) (predicate holds for all x)
          bottom concept             ⊥             ∀x : ¬⊥(x) (predicate holds for no x)
          concept complement         ¬C            ¬C(x)
          concept intersection       A ⊓ B         A(x) ∧ B(x)
          concept union              A ⊔ B         A(x) ∨ B(x)
          concept hierarchy          A ⊑ B         A(x) → B(x)
          universal restriction      ∀R.C          ∀y : R(x, y) → C(y)
          existential restriction    ∃R.C          ∃y : R(x, y) ∧ C(y)
    ABox  concept assertion          a : C         C(a)
          role assertion             (a, b) : R    R(a, b)
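For illustration, an invented terminological axiom stating that every book has at least one author that is a person, together with its predicate logic translation (the concept and role names are made up for this example):

    TBox axiom:   Book ⊑ ∃author.Person
    translation:  ∀x (Book(x) → ∃y (author(x, y) ∧ Person(y)))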
2.1.2. Set theory

A set is a collection of distinct objects, which are called its elements or members. We write a ∈ A to denote that a is a member of the set A, or contained in the set A, and a ∉ A to denote that a is not a member of A. A set that contains no elements is called the empty set and denoted by {} or ∅. The number of members in a set A is called its cardinality and denoted by |A|. The cardinality of the set of natural numbers N is denoted |N| = ℵ0. Sets with cardinality ℵ0 are called countably infinite, in contrast to countable finite sets with a finite number of elements. The cardinality of the continuum (the set of real numbers) is denoted |R| = c, which is strictly greater than ℵ0. If every member of a set A is also a member of a set B, we call A a subset of B and write A ⊆ B. Reciprocally, if every member of B is also a member of A, then B is a superset of A, or includes A, and we write B ⊇ A. To denote that A is not a subset of B we write A ⊈ B and B ⊉ A. If for two sets A and B both A ⊆ B and B ⊆ A, then the sets contain the same members and are called equal, written as A = B. If A and B are not equal we write A ≠ B. If A is a subset of B, but not equal to B, then A is also called a proper subset of B and we write A ⊊ B and B ⊋ A. A partition of a set A is a set of subsets of A such that every member of A is in exactly one of the subsets and none of the subsets is the empty set. Furthermore we define the following operations on sets:

A ∪ B is the union of two sets A and B, which is the set of all objects that are members of one or both of the two sets.
A ∩ B is the intersection of two sets A and B, which is the set of all objects that are members of both sets.
B \ A is the complement of one set A relative to another set B, which is the set of all objects that are not members of A but members of B.
P(A) is the power set of a set A, which is the set of all of its subsets. The set of all subsets with a given cardinality n is written as Pn(A).
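These operations map directly to Python's built-in set type. A brief sketch (the member values are arbitrary):

    A = {"cross", "triangle", "heart"}
    B = {"triangle", "square"}

    print(A | B)   # union A ∪ B
    print(A & B)   # intersection A ∩ B
    print(B - A)   # complement of A relative to B, i.e. B \ A

    from itertools import combinations
    # power set P(A) as a set of frozensets (members of a set must be hashable)
    P = {frozenset(c) for n in range(len(A) + 1) for c in combinations(A, n)}
    print(len(P))  # 2^|A| = 8 subsets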
To define particular sets, there are two methods. First, one can provide a property φ that all members of the set must satisfy. In set-builder notation we write { x | φ(x) } to denote the set of all elements that satisfy the property φ, which is also called the set's membership function. For instance the set of all prime numbers could be written as { x | x is a prime }. With membership functions we can give more formal definitions of operations on sets, for instance A ∪ B = { x | x ∈ A ∨ x ∈ B }. Properties and sets can be used interchangeably: each property φ defines a set, and each set A defines a property of being a member of set A. Using properties to define sets, however, requires to somehow refer to a collection of all possible members, which may or may not satisfy a property. This collection is called the universal set U, or a universe if it is limited to some specific objects. The second method to specify a set is to write down a list of its elements in curly brackets. For instance {✚, △, ♥} can denote the set of a cross symbol, a triangle symbol, and a heart symbol. Note that neither the order of listed members nor any repetition of the same member is relevant; the same set could also be written as {△, ♥, ✚, △}. The identification of same elements
is out of the scope of set theory (unless elements are restricted to sets themselves). For instance the sets {+, N, <3} and {✚, △, ♥} both contain a cross symbol, a triangle symbol, and a heart symbol. Whether both sets are equal depends on whether visual differences between the symbols in both sets matter or not. Using properties such as { x | x looks like a heart symbol } does not solve the symbol grounding problem either: arbitrary properties based on a general universal set lead to paradoxes such as the impossible set R = { X | X ∉ X } (Russell's paradox). This set is defined to contain all sets that do not contain themselves, so R must contain itself if it does not contain itself, which is a contradiction.
To avoid such paradoxes of naive set theory there are several strategies. Zermelo-Fraenkel set theory with the axiom of choice assigns each set a rank, that is the smallest ordinal number greater than the ranks of its members. There is no universal set but only the set Vα of all sets with rank α. The full cumulative hierarchy, starting with V0 = ∅, is called the von Neumann universe. Another strategy adds classes or categories as distinct objects to sets. A class is defined in the same way as a set, but it can only have sets as members. For every property φ one can define the class of all sets with property φ. Every set is also a class, but some proper classes, such as the class of all sets, and the paradoxical class R, cannot be described as sets but only as classes. Classes and categories as abstractions of sets and other mathematical objects are also studied in category theory.
2.1.3. Tuples and relations

[Figure 2.2.: Terms and types of binary relations: a binary relation r between a domain A and a codomain B, with domain of definition, range, and the composition g ∘ f of relations; types of binary relations include left-total, right-total (surjective, onto), correspondence (left- and right-total), injective (left-unique), functional (right-unique), partial function, (total) function, surjective function, bijective, and one-to-one.]

2.1.4. Graph theory
A graph is a pair ⟨V, E⟩ where V is a finite set of nodes (or vertices) and E is a finite set of edges. Unless otherwise indicated the graph is a simple graph, which means edges are 2-sets without orientation and cannot connect a node with itself: E ⊆ P2(V). Two nodes u, v are adjacent if {u, v} ∈ E. The degree of a node v is the number of adjacent nodes |{ u | {u, v} ∈ E }|. A path is a sequence of two or more nodes v1, ..., vn such that vi and vi+1 are adjacent for 1 ≤ i < n. Unless otherwise noted a path uses every edge at most once. A cycle is a path with v1 = vn. If there is a path between two nodes u and v, they are connected and their distance is the length of their shortest connecting path. A graph is said to be connected if every pair of nodes in the graph is connected. Unless otherwise noted a graph is assumed to be connected. A planar graph can be drawn on a plane without intersecting edges. A graph is bipartite if its nodes can be partitioned into two sets such that no nodes of the same set are adjacent. Unless mentioned otherwise we will use the term bipartite graph more specifically for a fixed bipartite graph ⟨V1, V2, E⟩ with a specific node partition, which is isomorphic to a binary relation (see example 1).
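These definitions can be traced with a small Python sketch, modeling edges as 2-element frozensets; the example graph, a cycle of four nodes, is chosen arbitrarily:

    V = {1, 2, 3, 4}
    E = {frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4}), frozenset({4, 1})}

    def adjacent(u, v):
        return frozenset({u, v}) in E

    def degree(v):
        return sum(1 for e in E if v in e)

    def is_path(nodes):
        # consecutive nodes must be adjacent and no edge may be used twice
        edges = [frozenset({nodes[i], nodes[i + 1]}) for i in range(len(nodes) - 1)]
        return all(e in E for e in edges) and len(edges) == len(set(edges))

    print(degree(1))              # 2
    print(is_path([1, 2, 3, 4]))  # True
    print(is_path([1, 3]))        # False: 1 and 3 are not adjacent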
A directed graph or digraph is a graph whose edges (also called arcs) are ordered pairs of nodes.
[Figure 2.3.: Some digraph types exemplified, including a cycle graph and a grid graph.]

[Example 1: a hypergraph with hyperedges ℰ = {{a1}, {a2, a3, a4}, {a3, a5}} over the nodes A = {a1, ..., a6} can equivalently be expressed as a bipartite graph with nodes V = A ∪ B for B = {b1, b2, b3} and edges E = {⟨a1, b1⟩, ⟨a2, b2⟩, ⟨a3, b2⟩, ⟨a4, b2⟩, ⟨a3, b3⟩, ⟨a5, b3⟩}, and as the surjective relation E.]
[Figure 2.4.: Tree types: a) undirected tree, b) ordered tree, c) polytree, d) multitree]

An ordered tree is a tree in which the child nodes of each node are ordered. In figure 2.4 b) ordering is indicated by dashed links pointing from one sibling to the next. There are two special kinds of DAGs that are also called trees: a polytree (c) is a directed acyclic graph containing no undirected cycles, and a multitree (d) is a directed acyclic graph without directed diamonds (Furnas and Zacks 1994). In a multitree the descendants of any node form a tree and the ancestors of any node form an inverted tree, but there may be undirected diamonds. Every polytree is also a multitree and every directed tree is also a polytree.
2.2. Computer science
The following sections introduce three topics from computer science that are relevant to this thesis: formal languages and computation are fundamental computer science concepts to reason about sets of data (section 2.2.1). Data types are important to manage data structures in programming languages and in databases (section 2.2.2). Finally, data modeling tries to bridge the gap between some reality and its description in the form of data in some information system (section 2.2.3). First, the discipline should be put in context by a short overview.
In general, computer science deals with the theoretical and practical automatic processing of data or information. The first scientific computing organization was founded in 1947 with the Association for Computing Machinery (ACM). Computer science as an independent academic field was established by the 1960s. The history of the discipline, which can be located somewhere between applied mathematics and engineering, is directly connected to the development and application of computer systems. The first computers as programmable, general purpose machines were created in the 1940s for military calculations.4 Meanwhile, computers are used in almost any aspect of daily life, with an impact comparable to the industrial revolution or the invention of the printing press. The economic weight of the so-called information industry is another factor that must be borne in mind when thinking about promises and motivations of (applied) computer science.
From the beginning of computer science, there has been a tendency to describe computers not only as tools for automatic data processing, but to attribute them with human terms like intelligent, thinking, knowledge, brain, and semantic.5 The comparison of automatic systems with brain power is also drawn if computing is viewed as a natural phenomenon, with mental activity as an instance of information processing, similar to computer systems (Denning 2007).
The traditional, rationalistic paradigm of computer science is contrasted by constructivist views that stress the relativity of automatic systems as social artifacts. For Turing Award winner6 Peter Naur, a pioneer in software engineering who suggested
4 The first computers include: Zuse Z3 (1941) and Z4 (1945), which were created for calculations in military aviation, as well as other early German calculating machines from this time (Lange 2006, p. 202ff.); Colossus (1944), which helped British codebreakers decipher encrypted messages; Harvard Mark I/ASCC (1944) and its successors, which were used by the US Navy; ENIAC (1946) and EDVAC (1949), which were used by the US Army. Only the IBM SSEC (1948) also served non-military purposes, until in 1949 three computers were completed at research institutions, partly with commercial support: EDSAC in Cambridge, Manchester Mark 1 in Manchester, and CSIRAC in Sydney. More about early computers can be found in the collection by Rojas and Hashagen (2000).
5 See for instance Berkeley (1949) and the whole terminology of artificial intelligence. Even the term language is misleading, because programming languages and other formal languages, unlike human languages, are based on precise formalization but not on speech and communication (Naur 1992).
6 The Turing Award, annually awarded by the ACM, is the highest distinction in computer science.
the term datalogy in favour of computer science, programming is not comparable to industrial production but an act of theory building (Naur 1985), and the core of programming is the programmer's developing a certain kind of understanding of the matters of concern (Naur 2007).7 Despite the predominant practice in computer science, computing artifacts are neither an objective description of reality (W. Kent 1978) nor an optimal solution of a given problem. Instead we construct the problem as well as the solution (Floyd 1996), and must therefore take responsibility for the reality thus constructed (Weizenbaum 1976). Having said this, computer science provides powerful theories and tools to describe and process data.
start symbol:         S
terminal symbols:     I, V, X, L, C, D, M
non-terminal symbols: T for thousands, H for hundreds, E for hundreds from ε to CCC,
                      Z for tens, Y for tens from ε to XXX, U for units, O for units
                      from ε to III

production rules:
S → T H Z U
T → ε | M | MM | MMM | MMMM
H → E | CM | CD | D E
E → ε | C | CC | CCC
Z → Y | XC | XL | L Y
Y → ε | X | XX | XXX
U → O | IX | IV | V O
O → ε | I | II | III
The language could also be expressed by the following regular expression:
^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
Example 2: Formal grammar of roman numerals up to 4999
paxdiablo at http://stackoverflow.com/questions/267399/#267405.
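To illustrate (a minimal sketch in Python, not part of the thesis), the regular expression can be used to validate candidate words; the anchors require a full match:

import re

# regular expression for roman numerals up to 4999 (example 2)
ROMAN = re.compile(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

for word in ['MMXIII', 'XLII', 'IIII', 'IC']:
    print(word, bool(ROMAN.match(word)))
# MMXIII and XLII are valid; IIII and IC cannot be produced by the grammar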
unrestricted sequences of symbols.
Type-1: context-sensitive languages (CSL) have production rules of the form
αXβ → αγβ where X is a non-terminal symbol; α, β, γ are sequences of symbols;
and only α and β can be empty. In addition, a rule of the form X → ε is allowed
if X does not occur at the right side of any rule. The sequences α and β specify
the context in which X is replaced by another sequence.
Type-2: context-free languages (CFL) have only production rules with one single
non-terminal symbol at the left side (X → γ). The context of X is not taken into
consideration when it is replaced.
Type-3: regular languages (REG) limit production rules to rules of the form
X → A, X → ε, and either only X → AY (right regular) or X → YA (left regular),
where X and Y are non-terminal symbols and A is a terminal symbol.
Each type is a proper subset of the former, and each corresponds to a class of
computational power, with type-0 being the class of Turing-complete languages.
Most programming languages are Turing-complete; that means the set of valid
programs can only be defined by a grammar in RE. In practice, many computable
problems are also decidable, so they could be expressed in a less powerful language
type. The corresponding class of recursive languages (R), which can always be decided,
lies between RE and CSL. However, such sub-Turing programming languages are not
widely used, apart from theorem provers and tools for formal software verification.
To make use of formal languages, there are two general problems: first, the problem
of specifying a formal grammar, and second, the problem of determining whether and how
a word can be produced by a given grammar (parsing). The specification problem
is related to the application of schema languages for writing down grammars (section 3.7). The most difficult part is to formalize a possibly infinite set of words from a
finite set of examples and assumptions. Unfortunately there can be many grammars
that define the same language, and even for CFL it is not computable whether an
arbitrary grammar is equivalent to or is a subset of another grammar. The second
problem is computable at least for recursive languages, but the computation may be
very complex (time-intensive),10 and a word may be producible by multiple paths
of rule application. A specific list of rule applications and intermediate symbol sequences that transform the starting symbol into some word is also called the word's
parse tree or syntax tree. If a grammar covers words with multiple parse trees, the
grammar is called ambiguous.
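To give a standard illustration (not from the thesis): a context-free grammar with the two rules E → E + E and E → a is ambiguous, because the word a + a + a can be derived with two different parse trees:

E → E + E → (E + E) + E → (a + a) + a
E → E + E → E + (E + E) → a + (a + a)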
Ambiguous grammars, complexity of parsing, and other limitations of formal
language types have motivated the creation of additional language classes and
computation models between REG, CFL, and CSL. Some relevant classes are:
10 The task of determining whether a given word belongs to a given context-sensitive language may take
more than polynomial time (P): that means there is no upper bound k so that it takes less than n^k steps
to check any sufficiently large word of n symbols length. For many problems P is a limit of practical
computability.
[Figure omitted: hierarchy of formal language classes between REG and RE, including DCFL, CFL, DLING, LING, IND (=NP), TAG, GCSL, CG, CSL, and BG]
either by direct replacement or by first replacing a with c and then c with b. Objects
with outdegree zero act as terminal symbols (in this example b) and there may exist
paths that can be followed infinitely (in this example a → c → a → c → . . . ).
To illustrate the general concept, example 3 shows a rewriting system in which
objects are built of geometric figures (visual symbols). The system consists of two
rules that can be used to replace a simple triangle by a structure of two or three
triangles. Starting with a simple triangle (start) one can create an infinite number of
different figures by applying rule 1 or rule 2 to selected parts of the figure. However,
the set of figures that can be reached from a given start symbol is strictly defined.
For instance the figure in the bottom right of example 3 is not a valid figure in the
given rewriting system.
Rewriting systems are of special interest if the objects to be rewritten have some
internal structure. Common types of such systems are: formal grammars as string
rewriting systems, which operate on sequences of symbols; term or tree rewriting
systems, with applications for instance in symbolic computation, automatic theorem
proving, and code optimization; and graph rewriting systems.
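As an illustration (a toy sketch in Python, not from the thesis), a string rewriting system with the single rule ab → ba repeatedly moves every b in front of every a until a normal form is reached:

RULES = [('ab', 'ba')]  # one string rewriting rule: replace 'ab' by 'ba'

def rewrite_once(word):
    # apply the first matching rule at the leftmost position
    for left, right in RULES:
        pos = word.find(left)
        if pos >= 0:
            return word[:pos] + right + word[pos + len(left):]
    return None  # no rule applies: the word is in normal form

word = 'aabb'
while word is not None:
    print(word)  # aabb, abab, baab, baba, bbaa
    word = rewrite_once(word)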
[Figure omitted: derivations of triangle figures by applying rule 1 and rule 2 to a start triangle, and one invalid figure]
Example 3: Rewriting system on geometric figures
In summary, formal languages and rewriting systems describe structures of sequences, trees, or graphs, and how these structures can be parsed. This requires the
identification of symbols or objects of the same kind: for instance the size of two triangles in example 3 does not matter, but their orientation does. Another often overlooked
property of formal languages and rewriting systems is that they only deal with
syntax, although they are often used to describe some semantics. If one compares
example 2 and example 3, only the first has some known meaning. An intelligent
reader can give meaning to some of the non-terminal symbols, but other symbols
may be irrelevant artifacts, only introduced because of formal grammar limitations.
To give another example, the production rule C → I B T C E C F with terminals I, T, E,
F and non-terminals B and C has no special meaning beyond being a production rule. We
can change the names of symbols to get the following rule in Backus-Naur-Form (see
section 3.7.1 and table 3.16 for details about this syntax):
<command> ::= "if" <boolean> "then" <command> "else" <command> "fi"
Until the 1960s, programming languages only provided a limited set of predefined
data types, and programmers had to choose data representations close to the internal
structure of memory. The influential programming language ALGOL 60 (Naur
1963), with fixed data types for integers, decimal numbers, boolean values, and
arrays, introduced lexical scoping, so parts of a program could have their own
private variables. Grouping data was further improved by records and other types
in Pascal's precursor ALGOL W (Wirth and Hoare 1966), by object-orientation in
SIMULA (Dahl and Nygaard 1966), and by abstract data types (Liskov and Zilles
1974). The basic idea of abstract data types is to define data objects based on how
they can be used and modified by operations specific to each data type. Operations
hide the representation of an object's state and only show the what instead of the how
of computing with data objects.
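A minimal sketch of an abstract data type in Python (illustrative only, not from the cited literature): a counter whose state is accessible only through its operations, so the stored representation stays hidden:

class Counter:
    def __init__(self):
        self._value = 0      # internal representation, hidden by convention

    def increment(self):     # operation that modifies the object
        self._value += 1

    def read(self):          # operation that shows the what, not the how
        return self._value

c = Counter()
c.increment()
c.increment()
print(c.read())  # 2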
Programming languages and some other methods of data structuring such as
schema languages (section 3.7) provide a basic set of primitive data types and mechanisms to create new types based on existing types. Typical primitive data types
include the boolean type, character and character string types, and numeric types
(see section 3.1.2), but domain-specific types such as date and time may also be
primitive, depending on the language. A special primitive type in some languages
is the unit type, which can hold no information because all unit type variables have
the same value. General mechanisms to derive new data types include referencing
(pointer types), aggregation, choice, and subtyping, which will be described below.
Both the set of primitive data types and the mechanisms to define new types differ
among programming languages, so one cannot simply exchange typed data from
one language to another. Like most programming language features, data types are
only a tool, not a requirement. Some languages have no data types at all, or types
are not explicitly defined but only used by convention.
The main purpose of explicit typing in programming languages is to check that
data objects (variables) are used in a predictable way, so programmers can work with
abstract data values instead of stored representations of values. Type checking can
either be performed before execution of the program (static typing) or on execution
(dynamic typing). Some languages also allow a type to be inferred from an object's
properties (duck typing). Systems of primitive types together with possible rules
of type derivation are studied as type systems in programming language theory.
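For illustration (a sketch with names chosen for this purpose), duck typing in Python accepts any object whose properties fit the expected usage, without any declared type:

class Duck:
    def quack(self):
        return 'quack'

class Robot:
    def quack(self):
        return 'beep'

def make_it_quack(thing):
    # no declared type: any object that provides quack() is accepted
    return thing.quack()

print(make_it_quack(Duck()), make_it_quack(Robot()))  # quack beep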
An important theoretical method for describing type systems are algebraic data
types, which were introduced in functional programming languages, with Hope
as the first (Burstall, MacQueen, and Sannella 1980) and Haskell as the currently most
popular instance. The theory of algebraic data types is based on mathematical type
theory and category theory, which advanced in the 1990s (Pierce 2002). Extensible
programming languages were another approach to better separate data structures
from implementation and to facilitate the definition of new data types (Balzer 1967;
Schuman and Jorrand 1967; Solntseff and Yezerski 1974). Incidentally the term
metadata was coined by Bagley (1968) in this context. Extensible programming
languages received little adoption because they made code sharing difficult, but they
now experience a revival as host systems for domain-specific languages.
To find common patterns in data types, it is wise not to look only at theoretical
type theory but also at the actual diversity of data types and existing approaches to
create mappings between type systems from different programming languages. As
described by Meek (1994a,b), a working group on language independent data types
created ISO 11404, which was last revised in 2007 (ISO 11404:2007). ISO 11404
influenced some type systems of data binding languages (section 3.5.1) and schema
languages (section 3.7), such as the XML Schema datatypes (Biron and Malhotra
2004) described in section 3.7 IV. The standard defines a data type as a set of
[Table omitted: primitive datatypes of ISO 11404, including Boolean, State, Enumerated, Character, Ordinal, Date-and-Time, Integer, Scaled, Real, Complex, and Void]
from integer and string could either store an integer or a string value. The
particular selection of type is also called the variable's tag.
aggregation types hold multiple values as members. An aggregate type can be homogeneous, if all members must belong to a single datatype, or heterogeneous.
Typical aggregation types include sets, sequences, bags, and tables, which can
each be defined as homogeneous or heterogeneous. Record types correspond
to tuples, and associative arrays (also called maps or dictionaries) correspond
to finite functions. Maps store a set of key-value pairs with unique, usually
homogeneous keys and homogeneous or heterogeneous values.
subtypes can be defined by limiting or extending other types. Examples include
simple (untagged) unions of types, upper/lower bounds on value spaces, and
selections/exclusions of specific values. Besides these simple set operations one
can define more complex constraints and also extend choice types and aggregation types. An important kind of subtyping in object-oriented languages is
based on heterogeneous record types, which are then called classes. In most type
systems a derived subtype keeps most properties of its base types (inheritance)
and in most contexts the subtype can be used as if it were the base type. If the
subtype was derived from multiple base types, this feature is called polymorphism. Polymorphism is a powerful feature because it allows one thing to have
different kinds at the same time, but in practice it is difficult to determine which
specific base type to refer to in which context. For this reason most type systems
restrict subtyping at least to some degree.
Type constructors can be combined, even recursively. For instance a tree type
with labeled leaf nodes can be defined as a homogeneous aggregation type with
a choice type of either an element (a leaf) or a sequence of trees (child nodes).
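Such a recursive combination of aggregation and choice types can be sketched in Python (names chosen for illustration):

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    label: str               # a leaf carries an element (its label)

@dataclass
class Node:
    children: List['Tree']   # an inner node aggregates a sequence of subtrees

Tree = Union[Leaf, Node]     # choice type: either a labeled leaf or child nodes

example = Node([Leaf('a'), Node([Leaf('b'), Leaf('c')])])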
ISO 11404 further defines provisions to mark (parts of) types as mandatory, optional,
recommended etc. Optional parts can also be viewed as choice types with a base
type as one option (element given) and the unit type as the other (element not given).
Derivation methods similar to those described above exist, with differing main focus,
in programming languages and other type systems.
[Figure omitted: data modeling proceeds from the reality realm (real world, perception) through the conceptual realm (mental models, universe of discourse, limited model, symbolic abstraction, conceptual model) to the data realm (logical schema, external schema, implementation, physical schema)]
Figure 2.6.: Summarized view of the data modeling process
from one realm of interest to the next, possibly with sub-steps (Simsion 2007, ch.
3.1); iii) several levels of description for different stages and applications (W. Kent
1978, ch. 2.2.2). Figure 2.6 shows a synthesis of data modeling process frameworks
from across the data modeling literature. It is mainly based on Simsion (2007,
fig. 3-1), who gives an in-depth review of the literature, and on Jr. (1975, fig. 2).11 A
common model of reality that exists in our minds, shared between individuals via
any language, is called the universe of discourse. We can only express a limited model
and try to formally capture it as a conceptual schema in a conceptual model. Conceptual
models are also called domain models or semantic data models (Hull and King
11 Jr. (1975, fig. 2) offers a better view of Steel, Jr. (1975, fig. VIII 5.1). Most later works combine the realm of
reality and the conceptual realm into the conceptual model and concentrate on the data realm.
1987; Peckham and Maryanski 1988) and come with a graphical notation for better
understandability. Most conceptual modeling techniques are based on or influenced
by the Entity-Relationship Model (ERM) (C. Chen, Song, and Zhu 2007). This thesis
uses the Object Role Modeling (ORM) notation as laid out below and in section 3.8.2.
The terms model and schema are often used synonymously, with a connotation
of expression for schemas and of meaning for models. A conceptual model can be
expressed as a logical schema in a data description language. It is also called an external
schema if it only covers parts of a conceptual model (as views on the full model)
or if it is not primarily meant for storing data. Both logical and external schemas
must be implemented in a physical schema to actually hold data. If data is stored in a
database, a database management system (DBMS) typically implements the physical
level so users can work on the logical level. External models can also be realized as
data formats and formal ontologies. Examples of languages to express logical and
external schemas are SQL, XSD, and RDFS.
It is important to recognize that each step includes a feedback loop to the prior
level of description: constraints of physical schemas influence logical schemas, logical schemas affect conceptual models, and reality is perceived and changed to better
fit existing mental models, as language affects the way we think (Whorf 1956). Modelers and architects of information systems often ignore these feedbacks, although they
can even cascade through multiple levels. If something cannot be expressed within
the artificial boundaries of a system, we often mistakenly assume that it does not
exist. In practice data is often created and shaped without a clean, explicit data
modeling process. Instead of reflecting mental models, data modeling then starts
with a conceptual model or even directly with a logical schema or implementation
(Simsion 2007). One can therefore simplify the data modeling process to four levels:
mind (reality and mental models), model (conceptual model), schema (logical and
external schemas), and implementation (physical schemas) as shown in figure 6.1 at
page 221.
Data modeling is only one part of the design process of information systems. It
may be part of software engineering if the goal is to create an application, or part of
information engineering if the goal is not a single application but an integrated set of
tasks and techniques for business communication within an enterprise.12 As argued
by Brooks (1987) for software engineering, there is no silver bullet: no single
technology or management technique can provide an increase in performance orders of
magnitude higher compared to other good techniques. Instead programming is a
creative design process, and great differences can only be made by great designers.
This should also apply to data modeling.
12 The term information engineering was popularized by James Martin and Clive Finkelstein in the 1980s.
It also denotes the ERM notation variant that they described in (Martin 1990).
[Figure omitted: elements of ORM notation, showing an entity type (Work), a value type (Title), a predicate with role names and a uniqueness bar, and the semiotic pairing of signified and signifier]
of the works of Otlet, Ostwald and other researchers fell into oblivion. Two aspects
of Otlet's work have stayed characteristic for documentation and later information
science: the proactive use of technology, and the central role of document concepts.
While libraries regularly hesitate to use new technologies, information science at
best is actively involved in the development and propagation of technology to ease and
automate the organization of information. Otlet systematically used reproducible index cards (predating digital records), pioneered the use of microfilm, and envisioned
networked multi-media machines decades before their electronic realization (Otlet
1934, 1990). In the 1950s and 1960s information retrieval evolved as a major branch
of information science, with important contributions from Calvin Mooers, Eugene
Garfield, and Gerard Salton, among others. The development was driven by the exponential growth of publications (Solla Price 1963) and motivated by computerized
automatization, which promises to speed up and improve the process of finding relevant information. However, automatic systems tend to ignore human factors, such
as motivation and ethics.15 A closer look at computer science paradigms, especially
artificial intelligence and its recurrences, also reveals surprisingly positivist images
of knowledge. Meanwhile information science has led to more specialized but interdisciplinary sub-disciplines like information systems, information architecture, and
information ethics. The core concept of information (see Capurro and Hjørland
(2003) and Ma (2012) for analysis) still has impact on the transformation of library
and information science. For instance in 2006 the newest faculty at the University of
California, Berkeley, originating in library and information science, was simply renamed the
School of Information.16
2.3.2. Documents
Despite the trend towards information, the roots of library and information science are
located in the organization of collected documents. Since the 1990s one can identify
a renaissance of the document approach with independent schools of thought. The
Copenhagen school of document theory can be found in contributions by Lund
(2009), Hjørland (2007), Ørom (2007) and other papers collected by Skare, Vårheim,
and Lund (2007).17 The French school of thought is most visible in the publications
of Roger T. Pédauque (2003; 2006; 2007; 2011), a group of scholars publishing
under a common pseudonym. English introductions to their discourse have been
given by Truex and F. Rowe (2007) and Gradmann and Meister (2008). Despite the
importance of documents as core concept of library and information science, there is
no commonly agreed upon definition. While libraries tend to define documents
based on physical entities (the most prominent instance of a document is a book),
15 This can be exemplified by Mooers' law, in which Calvin Mooers (1960) observes that "many people may
not want information, and they will avoid using a system precisely because it gives them information
[. . . ] not having and not using information can often lead to less trouble and pain than having and
using it." For other neglected aspects see the works of Joseph Weizenbaum (e.g. 1976).
16 See http://www.ischool.berkeley.edu/about/.
17 See also http://thedocumentacademy.org.
information science tends to abstract documents from their form. This focus
results from research on aspects of preservation and access to original documents
on the one hand, compared to research on aspects of document descriptions and
connections on the other. With the shift to digital documents it is more difficult
to use form as defining criterion, because traditional concepts such as page and
edition lose meaning. For this reason Buckland (1997, 1998b) argues to define
documents in terms of function rather than form. This idea had already been brought
up by Briet (1951) before the advent of digital documents. Eventually any entity,
that is any sequence of bits in the digital world, can act as a document. To be a
document it must be "conservé ou enregistré, aux fins de représenter, de reconstituer
ou de prouver un phénomène ou physique ou intellectuel" (Briet 1951, p. 7).18 This
implies two important properties of documents: first, documents must be recorded,
and second, they must refer to something. The document's referent is also called its
content.19 As described by Yeo (2010), the content is not necessarily fixed and known,
but highly depends on context. The property of being recorded distinguishes
digital documents in library and information science from more general data objects,
for instance databases: digital documents do not change. Even dynamic documents
are fixed as soon as you package them in some form suitable for storage. A. H.
Renear and Wickett (2009) have carried this argument to the extreme: either a
change constitutes a new document or it is not relevant enough to be recorded. A
document is created to persist as a fixed snapshot, while other data objects can also
be created to capture the current state of a dynamic system. This distinction can
be exemplified by a library information system that manages both stable digital
documents and dynamic data about users and access to documents.
The traditional role of a library is the collection of separated documents. With
the increase in aggregated documents which combine independent smaller parts, such
as encyclopaedias and journals that hold single articles, library institutions need
to divide documents into separate conceptual units. For this purpose Paul Otlet
introduced the monographic principle in 1918 and applied this new document concept
in the Universal Bibliographic Repertory (Otlet 1990; Rayward 1994):20 the idea was
"to detach what the book amalgamates, to reduce all that is complex to its elements
and to devote a page to each" (Otlet 1918, cited and translated by Rayward). The
monographic principle requires methods to extract individual pieces of information
from documents, which can then be used to create new documents. With hypertext
and new document types like blog articles and tweets, the creation of monographic
documents experiences a revival. In the Semantic Web community the idea is being
reinvented as nano-publications (Groth, Gibson, and Velterop 2010; Mons and
Velterop 2009).
18 Translated by Buckland (1997) as "preserved or recorded, intended to represent, to reconstruct, or to
prove a physical or intellectual phenomenon".
Die Brücke, founded in 1911 by Wilhelm Ostwald.
2.3.3. Metadata
More than in documents as such, library and information science is interested in their
description, that is, bibliography (from the Greek βιβλιογραφία for the [de]scription of
books). Applied to digital documents, all bibliographic data is metadata. This term
became popular in library and information science during the 1990s. Meanwhile,
metadata subsumes any information about digital and non-digital content, including
traditional library catalogs.21 Before this, the term metadata had been introduced
in computing by Bagley (1968),22 but it was only used casually for management
data in databases and programming languages. With the rise of the Web, its meaning
shifted from data about data sets and computer files to data about online resources.
Finally, metadata became popular with the creation and promotion of the Dublin
Core Metadata Element Set (DCMES).
Similar to documents, metadata can best be defined based on its function. Coyle
(2010) describes metadata as something "constructed, constructive and actionable".
As a result, there is no strict distinction between data and metadata, but the use of
data as metadata depends on context: a digital record can be a plain data object,
a digital document, and a piece of metadata at the same time, even with different content
in different usage scenarios. The relevance of usage for metadata distinguishes metadata from
traditional cataloguing, as described by Gradmann (1998): traditional bibliographic
records were created mainly to describe a document with a very limited context of
usage. Gradmann argues that metadata are intended to be part of a usage context
different from that of cataloguing records, and that they are technically linked to
this context to a very high degree. The important role of the technical infrastructure
in which metadata is used requires an analysis of this infrastructure, as carried out in
chapter 3 of this thesis.
A major application of metadata and a growing branch of library and information
science is the (long-term) preservation of digital documents. Long-term preservation
provides two general strategies to cope with the rapid change and decay of technologies: either you need to emulate the environment of digital objects or you must
regularly migrate them to other environments and formats. Both strategies require
good descriptions of the data and environment to be archived. As time passes, the
descriptions themselves become a subject of preservation. By this, digital documents
may get buried in nested layers of metadata, or they may become migrations of migrations, shadows of the original documents. Knowledge of general patterns in data
21 Shelley and Johnson (1995), according to Caplan (2003), state NASA's Directory Interchange Format
(DIF) as the first standard that defines metadata in its modern sense (NASA 1988, F-10). The definition
in this standard includes any information describing a data set, including user guides and schemas.
One of the first specifications named metadata was the Content Standard for Digital Spatial Metadata
(CSDSM) (FGDC 1994).
22 Solntseff and Yezerski (1974) refer to "the notion of metadata introduced by Bagley", citing Bagley
(1968), but I could not get a copy of his report because of copyright restrictions. Philip Bagley is
among the forgotten forefathers of library and information science, also because he created the very
first analysis of possible uses of an existing computer, the Whirlwind I, for document retrieval (Bagley
1951).
and metadata could help to reveal data by data archaeology even when long-term
preservation has failed (see section 6.2.1 in the outlook). In applications other than
preservation, metadata is difficult to work with because it is aggregated from heterogeneous sources with different structures than expected (Tennant 2004; Thomale
2010). Nevertheless, existing metadata research provides some useful guidelines
and tools to achieve interoperability even among applications with different usage
contexts: metadata registries collect and describe standards, metadata crosswalks
provide mappings, and metadata application profiles allow for customization without losing a general consensus on how data should be structured. The vast diversity
of metadata standards and formats, which are defined and evaluated in library and
information science, shows both the need for metadata and its complexity. A broad
overview of the large number of metadata formats and specifications in the cultural
heritage sector is provided by Riley (2010).
2.4. Philosophy
Philosophy is the principal field of research from which all other scientific disciplines
originate. It typically addresses questions that are ignored by other fields,
including itself and its own methods. Philosophy critically examines beliefs and
finds problems that are difficult to answer within the limited scope of one domain.
Such examinations are not always welcome, or are regarded as somehow interesting
but essentially irrelevant. Philosophy often deals with blind spots like hidden
assumptions and foundations. Such blind spots also exist in library and information
science and in computer science in regard to the description of digital documents.
This section summarizes some existing philosophical approaches and questions
about documents and data that may uncover these hidden assumptions.
reference from Floridi to Capurro, one rejection of Floridi's position by Capurro (2008), an overview
by Compton (2006), and a comparison of both in Persian (Khandan 2009).
24 Floridi (2005) adds that (semantic) information must also be veridical, but we can ignore this aspect.
as "uninterpreted differences (symbols or signals)" (Floridi 2009). The terms data
and symbols are even used synonymously when Floridi refers to the Symbol Grounding
Problem (SGP) as data grounding problem. The data/symbol grounding problem
is one of the open problems in the philosophy of information (Harnad 2007; Taddeo
and Floridi 2005).25 In short, SGP asks how symbols acquire their meaning. The
question was raised by Stevan Harnad, inspired by Searle's Chinese Room Argument.
The latter shows that manipulation of formal symbols (as defined in terms of formal
languages) is not by itself constitutive of, nor sufficient for, thinking. So how can
symbols inside an autonomous system be connected to their referents in the external
world, without the mediation of an external interpreter? I disagree with the view
that data constituting information can be meaningful independent of an informee
(Floridi 2010, p. 22). SGP cannot be solved, because symbols always require some
mediation. Data are not uninterpreted differences (symbols or signals); rather, one
already requires interpretation to draw distinctions, which are needed to constitute
symbols. Apparently we need a deeper philosophical look at data, not simply derived
from concepts like information and meaning.
philosophy of data               data as
data as hard numbers             objective observation
data as recorded observations    subjective observation
data as bits                     creation instead of observation

Table 2.4.: Philosophies of data
metadata, which itself does not count as data. Data as hard numbers is mostly used
in scientific contexts, where statistical data analysis is the main method to derive data
from other data. The philosophy of data as recorded observations understands data
from an engineering perspective, as the recorded product of observations. Everything
produces data, and our knowledge is needed to select relevant instances. Data can
be turned into information and knowledge by contextualization against other data,
information, and knowledge, in a hierarchical process or in feedback cycles. A
summary of these different philosophies is given in table 2.4. In the following we
will ignore the positivist view of data as hard numbers: most digital documents
do not simply originate from subjective or objective observations, but from designed
creations.
In the philosophy of information the concept of data is only covered briefly, to define
information. A serious treatment of the term metadata in particular does not exist.
Floridi (2005, 2010) refers to the diaphoric definition of data, and gives a simple classification with five, non mutually exclusive types of data (primary data, secondary
data, metadata, operational data, and derivative data). In his definition, data is
x being distinct from y, where x and y are two uninterpreted variables and the relation
of being distinct, as well as the domain, are left open to further interpretation.
Floridi (2010, p. 23)
[Table omitted: Floridi's five types of data: primary data, secondary data, operational data, derivative data, and metadata]

type        distinction among
de re       existing things
de signo    perceived signals
de dicto    interpreted symbols
27 An exception is the discussion of virtual reality, which was popular for some years.
2.5. Semiotics
Modern semiotics and linguistics were established in the early 20th century, mainly
influenced by Charles Sanders Peirce and by Ferdinand de Saussure (whose tradition
is also named semiology). Influences of both researchers still exist as independent scholarly
traditions. For introductions to semiotics see Eco (1976, 1977, 1984), Chandler
(2007), and Trabant (1996). The discipline is related to linguistics, literary studies,
philosophy, cognitive science, and communication science, among other fields. Some
authors have investigated connections between semiotics and library and information
science (Brier 2006, 2008; Huang 2006; Mai 2001; Raber and Budd 2003; Warner
1990). The following section briefly introduces the sign as central concept, justifies
its application to data and digital documents, and highlights some semiotic results
that will assist the phenomenological description of data structuring and description.
I. Digital documents as signs
The sign as core concept of semiotics is most relevant to this thesis. The classical
definition of a sign is often referred to as aliquid stat pro aliquo (something
stands for something), but this definition has always been just one component of the
function of a sign, even in medieval semiotics (Meier-Oeser 2011). More precisely,
according to Peirce (1931a, paragraph 2.228) a sign is "something which stands to
somebody for something in some respect or capacity". The triadic form of Peirce's
model of signs is further described below. For Saussure (1916) a sign consists of a
form (signifier) and a mental concept (signified), which cannot be separated (see
figure 2.9). As explained by Taverniers (2008), this dyadic model has been refined by
Hjelmslev (1953) as a relation between expression and content, both having substance,
form, and purport among other dimensions. In the European tradition the focus of
semiotics/semiology later shifted from signs to signification, with researchers such
as Roland Barthes (1967), Algirdas Julien Greimas (1966), and Umberto Eco (1984).
Despite some semiotic focus on linguistic signs (especially words), signs can also
be images, sounds, gestures, and other objects, as long as they are interpreted. Most
related to the structuring and description of data are approaches to analyse
signs in the form of diagrams (Bertin 2011), human-computer interaction (Souza 2012),
and information (Brier 2008; Huang 2006; Raber and Budd 2003), but there is no
semiotics of data so far. This thesis includes at least a preliminary semiology of
data. From a semiotic point of view, a digital document is a sign as soon as it is
created or perceived as a document. In particular it is impossible to create a document
that does not act as some sign, just as it is impossible not to communicate (Watzlawick,
Helmick-Beavin, and Jackson 1967). Even an empty file or a random sequence of
binary data can communicate something, for instance the fact that something went
wrong.
II. Signs are more than signals
The semiotic view of data as signs reveals some important aspects that are less
visible from other disciplines. First of all, data (at least if it is given as some
digital document) is more than a simple signal in the mathematical theory of
communication. In this model (C. E. Shannon 1948) information from a source
is encoded as a signal by a sender. The signal is then transmitted to a receiver and
decoded to a destination (see figure 2.8 above and figure A2 with an illustration of
the theory of communication adapted for diagrams). The information can fully be
reconstructed unless the signal is altered during transmission by noise. In practical
data processing systems this process is nested in chains of encodings and decodings
(figure 2.8). This model is limited for at least two reasons. First, the mathematical
theory of communication explicitly excludes all aspects of meaning, and it only deals
with limited sets of predefined signals:
The fundamental problem of communication is that of reproducing at one point either
exactly or approximately a message selected at another point. Frequently the messages
have meaning; that is they refer to or are correlated according to some system with certain
physical or conceptual entities. These semantic aspects of communication are irrelevant to
the engineering problem. The significant aspect is that the actual message is one selected
from a set of possible messages.
C. E. Shannon (1948), emphasis not included in the original
Second, transmitter and receiver are part of the sign. Not only can signals be
changed or interrupted by noise between each encoding and decoding; also the choice
and application of encoding and decoding operations carries a risk of errors. If
data is seen as a sign, which involves more than simple encoding and decoding, we can
describe nesting by a process of unlimited semiosis (Eco 1979). In short, a semiotic
sign is not limited to syntax, and neither should digital documents be.
[Figure 2.8 omitted: Shannon's model of a communication system, in which a source message is encoded by a transmitter, altered by noise, and decoded by a receiver for a destination; in practice this is nested in a chain of encodings (E1, E2, . . . ) and decodings (. . . , D2, D1)]
[Figure omitted: the semiotic triangle of symbol/representamen, thought/interpretant, and referent/object, next to Saussure's dyad of signifier and signified]
interpretant, the thing for which it stands, its object, in a revised version of a paper from 1867
(Peirce 1931b, paragraph 564). I have not found the original publication of this revision yet.
[Figure omitted: Peirce's three forms of inference (deduction, induction, and abduction), each combining rule, case, and result in a different order]
can be embedded, but syntagm is not limited to syntax and grammar. Similar elements that can be embedded in the same places are connected by paradigmatic relations.
Examples of data elements connected by a paradigmatic relation are arrays and
lists, and the different RDF node types which can all be used as object in an RDF
triple (see table 3.11). The final collection of data patterns (chapter 5) also includes
paradigmatic links between patterns (as alternative patterns) and syntagmatic links
between patterns (as implied, specialized, and related patterns).
V. Further insights
More insights from semiotics and linguistics may be possible if we take into account
the acts of communication in which signs are used. The usual classification of communication studies includes aspects of syntax (relationships among signs, without
reference to their meaning), semantics (relationships between signs and meanings),
and pragmatics (relationships between signs and their use). Detailed models of
communication are given for instance by Jakobson (1963), in the theory of speech
acts (Austin 1962; Searle 1969), and in discourse analysis (Foucault 1969). For the
following analysis, however, details of communication are ignored because our focus
is not the situation in which data is used but the way it is structured and described.
Neither does this thesis include the diachronic nature of data, that is the temporal
context in which it changes. The difference between synchrony and diachrony
has also been introduced by de Saussure; an introduction to this opposition with
application to information as sign is given by Raber and Budd (2003).
In summary, semiotics provides fruitful insights into the nature of data, even with
the limitation to immutable properties. Taking into account the full nature of signs,
there are many issues left for further research in data semiotics.
The novel approach of this thesis is to use patterns for data description, independent
from particular structuring methods and technologies. This section will first introduce the notion of patterns, then summarize existing works that deal with patterns
in data structuring, and finally give an example.
Patterns as systematic tools for describing good design practice were first introduced by Christopher Alexander, Sarah Ishikawa, and Murray Silverstein (1977).
They identified 253 existing architectural patterns from entire regions and cities to
buildings, rooms, and furniture. In Alexander's original definition (1977, p. x) each
pattern "describes a problem which occurs over and over again in our environment,
and then describes the core of the solution to that problem, in such a way that you
can use this solution a million times over, without ever doing it the same way twice."
Patterns can be found by observing current practice and then looking for commonalities in solutions to a problem. In contrast to simple rules or best practice guidelines,
however, a pattern does not solve a problem by providing a particular solution,
but by showing benefits and consequences. Each pattern provides a solution and
each solution has some tradeoffs. The pattern description guides designers in their
decisions of particular solutions for particular applications. Each pattern is given a
name, which can be used to refer to one pattern from another. The full potential of
patterns unfolds if a set of patterns is collected and combined in a pattern language.
In Alexander's words, a pattern language is a network of patterns that call upon one
another. Patterns help us remember insights and knowledge about design and can
be used in combination to create solutions. A pattern language for writing patterns
was presented by Meszaros and Doble (1997).
The pattern language approach with its application in architecture has been
adopted in other fields of engineering, especially in software engineering (Beck and
Cunningham 1987). Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides
(the so-called gang of four) published an influential book on design patterns in object
oriented programming (Gamma et al. 1994). In 1995 Ward Cunningham created the
Portland Pattern Repository (Cunningham 1995), accompanied by WikiWikiWeb,
which was the world's first wiki.30
Although these design patterns are used for the creation of computer programs,
they do not reflect problems and solutions of data structuring as analyzed in this
thesis. Design patterns refer to dynamic processes, while digital documents are
static. General patterns in description and structuring of data must also be separated
from pattern recognition, as practiced in data mining and other statistical methods of
machine learning. These quantitative methods can only recognize structures within
the boundaries of a fixed method of data description (for instance statistical patterns
30 The Portland Pattern Repository and WikiWikiWeb are still active at http://c2.com/.
in lists of numbers, without questioning the nature of numbers and lists). A general
limitation of existing approaches is the focus on one specific formalization method.
This practical limitation blocks the view of more general data patterns, independent
from a particular encoding, and it conceals blind spots and weaknesses of a chosen
formalism. Some works about patterns in particular data description languages have
been mentioned in section 1.4.
The third part is attached as an additional element to the first (dependence pattern), and
it may unambiguously refer into a registry of allowed qualification values (identifier
pattern). The second part acts as connection or delimiter (separator pattern). Even
its two bytes ("," and " ") can have structure: the whitespace character is often used as filling
without significance (garbage pattern).
In summary one can identify several typical structuring methods in the twelve
bytes given above. The interpretation, however, does not need to be right: depending on context the sequence could mean virtually anything, but patterns help to
reveal the interpretations that were most likely intended when the data was created.
Chapter 3
All data must be written down in some form. At least a standardized set of base
symbols (characters) is needed, together with a set of conventions how to meaningfully combine these symbols. We call these two sets a writing system or notation.
The connection of data and writing systems is not an invention of the digital age:
Cuneiform script on clay tablets, the earliest known records of a writing system from
the 4th millennium BC, was first used exclusively for accounting and record keeping,
thus for capturing data. The simplest writing system can only write down sequences
of binary data. It consists of two distinct symbols (usually called zero and one),
and the convention of concatenating these symbols to sequences. More complex
writing systems involve more characters. A character encoding maps the characters and
combination rules of a writing system to symbols that can more easily be stored and
communicated. Examples of character encodings include Morse code, Braille, the
American Standard Code for Information Interchange (ASCII), and Unicode. Digital
character encodings map characters to sequences of bits. Before Unicode became
the dominant character encoding standard (starting in the early 1990s), there were
many alternative encodings for different sets of characters. Table 3.1 lists some
hexadecimal    binary
5D             1011101
67             01100111
81             10000001
8F             10001111
86             10000110
C5             11000101
EA 41          11101010 01000001
3.1.1. Unicode
Unicode aims at unifying all binary character encodings by covering all characters
for all writing systems of the world, modern and ancient (The Unicode Consortium
2011). The standard includes graphemes and grapheme-like units like punctuations,
technical symbols, and many other characters used in writing text. The Unicode
standard defines a grapheme as a minimally distinctive unit of writing in the context
of a particular writing system. or what a user thinks of as a character. For instance
codepoint    = [#x00-#x10FFFF]
surrogate    = [#xD800-#xDFFF]
ustring      = ( codepoint - surrogate )*
noncharacter = [#xFDD0-#xFDEF] | #xFFFE |#xFFFF |#x1FFFE|#x1FFFF|
#x2FFFE|#x2FFFF|#x3FFFE|#x3FFFF|#x4FFFE|#x4FFFF|#x5FFFE|#x5FFFF|
#x6FFFE|#x6FFFF|#x7FFFE|#x7FFFF|#x8FFFE|#x8FFFF|#x9FFFE|#x9FFFF|
#xAFFFE|#xAFFFF|#xBFFFE|#xBFFFF|#xCFFFE|#xCFFFF|#xDFFFE|#xDFFFF|
#xEFFFE|#xEFFFF|#xFFFFE|#xFFFFF|#x10FFFE|#x10FFFF
Table 3.2.: Symbol ranges in Unicode
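As an aside, these ranges can be restated in a few lines of Python (is_ustring is a hypothetical helper written here only to mirror table 3.2):

def is_ustring(codepoints):
    # a Unicode string may use any codepoint except the surrogate range
    return all(0x00 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
               for cp in codepoints)

print(is_ustring([0x41, 0x030A]))  # True: A followed by combining ring above
print(is_ustring([0xD800]))        # False: a lone surrogate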
LE and BE indicate the byte order little-endian (default) and big-endian. Different
combinations of UTF-8, UTF-16, UTF-32, or UTF-EBCDIC2 with BE or LE define
alternative transformation formats. They can easily be mapped to each other as
isomorphic encodings of UCS. A full breakdown of the encoding of the composed
character is provided with example 34 in section 4.2.5.
Unicode encoding scheme            hexadecimal    binary
UTF-16, BE: Å (U+212B)             21 2B          00100001 00101011
UTF-16, LE: Å (U+212B)             2B 21          00101011 00100001
UTF-16, BE: Å (U+00C5)             00 C5          00000000 11000101
UTF-8: Å (U+00C5)                  C3 85          11000011 10000101
UTF-16, BE: A + combining ring     00 41 03 0A    00000000 01000001 00000011 00001010
UTF-8: A + combining ring          41 CC 8A       01000001 11001100 10001010

Table 3.3.: Various Unicode encoding schemes encoding the capital letter Å
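The alternative representations can be reproduced, for instance, in Python (a sketch; the variable names are chosen for illustration):

import unicodedata

precomposed = '\u00C5'   # Å as a single codepoint
decomposed  = 'A\u030A'  # A followed by combining ring above

print(precomposed.encode('utf-8').hex())      # c385
print(decomposed.encode('utf-8').hex())       # 41cc8a
print(precomposed.encode('utf-16-be').hex())  # 00c5

# NFC normalization maps the decomposed form to the precomposed form
assert unicodedata.normalize('NFC', decomposed) == precomposed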
Unicode. The stability guarantee on normalization only applies to assigned characters in UCS.
4 For instance one method of writing the Māori language in the early 20th century contained a special
wh ligature as distinct character. Although there are printed books that use this letter, it was rejected
for inclusion in Unicode. See http://unicode.org/faq/ligature_digraph.html for the objections
of the Unicode Consortium to ligatures and digraphs. Icons and pictograms are reluctantly included as
well, but Unicode version 6.0 introduced several hundred Emoji icons for animals, clothes, vehicles,
food and other random artifacts. The history of the capital sharp s, rejected in 2004 but
included as U+1E9E in Unicode version 5.1.0 after a second proposal, reveals some interesting insight
into the process of standardization.
[Table omitted: graphemes and their symbol variants, for instance the letter Å in several glyphs, Braille pattern P16 (Unicode U+2821), Braille pattern P34567 (Unicode U+287C), and the digraph Aa]
3.2. Identifiers
What's in a name? That which we call a rose by any other name would smell as sweet.
William Shakespeare: Romeo and Juliet
While character and number encodings are used as the basis of data structuring, identifiers virtually pervade systems to structure and describe data on all levels. This
section first introduces basic identifier principles (section 3.2.1), followed by properties of namespaces and qualifiers (section 3.2.2), which identify the context of
association between identifier and referred thing. Identifier systems (section 3.2.3)
provide an infrastructure in which identifiers are assigned, managed, and used.
Part 3.2.4 on descriptive identifiers and section 3.2.5 on ordered identifiers highlight
two important but often overlooked properties of identifiers on a more theoretical
level. Finally, hash codes as a special kind of distributed identifier system are explained in section 3.2.6. A summarized overview of designated identifier properties
is given in table 3.6.
assignment to it.
and known as lookup tables. In some cases one can even swap symbol and referent,
because both are unique on their side of the table (see table 3.5 for an example).
Unambiguity (each identifier must refer to only one object) and uniqueness (for
each object there should be only one identifier) are often combined as uniqueness, the
most important requirement for an identifier. Other properties frequently cited
as important qualities are persistence (identifiers should not change over time),
scope (the context of an identifier should be broad or even global), readability
(identifiers should be easy to remember or contain information), and actionability
(given an identifier one should be able to do something with it, for instance access
the identified object). A summary of these and more designated identifier properties
is given in table 3.6. To ensure the required properties, an identifier system is needed
as a controlled mechanism or convention for creating, managing, and using identifiers
(see part 3.2.3). When we analyze general data, identifier properties can also indicate
whether some piece of data is actually used as identifier or not: being an identifier is
nothing inherently inscribed in data, but it is an example of a popular data pattern
that is used all over systems to structure and describe data.
year       1100    011@    no
title      4000    021A    no
edition    4020    032@    no
DDC        5010    045F    yes
local       qualifier   syntax   description
Frankfurt   Main        L/Q      city name
Dublin      Ohio        L, Q     city name
OH          US          Q-L      ISO 3166-2 area code
set         std         Q::L     C++ identifier
type        rdf         Q:L      URI reference in RDF
1000/182    10          Q.L      DOI as specific Handle
US          sgn         Q-L      IANA language & subtag

Example 9: Qualified identifiers with local part (L) and qualifier (Q)
For some prefixed types of namespaces, the qualifier is not fixed but can be replaced by another prefix for the same namespace. For instance http://dx.doi.org/
and other known resolver addresses are actually used as prefixes for the namespace of
digital object identifiers (DOI), but one could also use the DOI as Uniform Resource
Identifier (URI) with prefix info:doi/. Another example is a prefixed element name
in the Extensible Markup Language (XML, see section 3.5.5): some XML applications
ignore the prefix and respect a locally defined namespace, some ignore the namespace,
and some need both to identify an element (see example 14 at page 103). Finally,
there are systems in which a namespace is just an abbreviation that can be defined
and expanded as needed (see the @prefix statement from Notation3 as described in
this surname, and the local part Karl signifies a given personal name, so in this example all parts
encode some meaning if one takes person names as meaningful.
[Figure 3.1 omitted: dependencies between identifier systems such as UCS, US-ASCII, URI, URL, URN, URN-ISBN, IP, DNS, EAN, ISBN, ISBN-10, and ISBN-13, connected by the relations depends on, subset of, exclusion, and (partial) mapping]
(see section 2.2.1) as identifier syntax. Every identifier symbol must
conform to this syntax, so its grammar rules help to discover and use identifiers:
with a well-defined syntax one does not need to resolve each string to check whether
it is an identifier, but one can validate possible identifiers based on their shape.
Often only parts of the grammar are defined in a schema language (see section 3.7.1),
and there may be additional informal agreements to exclude some sets of characters
and sequences. For instance whitespace characters are rarely used in identifiers: even if
allowed, at least multiple consecutive whitespace characters are not encountered in
practice. If identifier systems depend on or reuse each other, for instance as namespaces
and qualifiers (see the dotted relations in figure 3.1), there can be difficulties in
embedding one identifier within the syntax of another. If such an embedding may lead
to disallowed character sequences or to syntax ambiguities, the host identifier system
usually defines a method to escape the embedded identifier. A typical escaping
mechanism is percent-encoding, in which a character code of one byte
is replaced by the percent character % followed by the two hexadecimal digits
representing that byte's numeric value (T. Berners-Lee, R. Fielding, and Masinter
2005, section 2.1). However, the question when and which parts of an identifier
must be encoded is a frequent source of confusion. Quite often an identifier reuses
another identifier that already includes an embedding, so one ends up with a complex
hierarchy of dependencies and nested encodings.
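A sketch of percent-encoding, and of the confusion caused by nested encodings, in Python:

from urllib.parse import quote, unquote

# each reserved byte is replaced by '%' plus two hexadecimal digits
print(quote('Müller & Co', safe=''))  # M%C3%BCller%20%26%20Co

# nested encodings: embedding an already encoded identifier into another
# identifier encodes the '%' itself as '%25'
inner = quote('a/b', safe='')   # a%2Fb
outer = quote(inner, safe='')   # a%252Fb
assert unquote(unquote(outer)) == 'a/b'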
If readability is no primary requirement, digital identifiers can be plain numbers
or sequences of bits (see part 3.2.5). An example are IP addresses, which can be
mapped to more readable DNS names. But even very simple identifier syntaxes
such as number ranges can cause problems: in 2011 the IP system version 4 ran out
of identifiers because it was limited to 2³² distinct numbers. The update to IP version
6 takes several years, because version 4 is used in many other specifications and
implementations that must also be updated accordingly. A similar but less extensive
update of an identifier system was the extension of International Standard Book
Numbers from ten digits (nine plus one check digit) to thirteen (nine plus bookland
namespace and check digit). The assignment of ISBN identifiers is delegated to
national agencies, who then delegate it to publishers. Therefore the system is not
always used as intended: in theory an ISBN can never be reused and every edition
of a title must have a new ISBN. In practice new editions often reuse the ISBN
of the previous edition, and some publishers even assign existing ISBNs to totally
different books.7 The update from ISBN-10 to ISBN-13 was based on an already
existing encoding of ISBN-10 in the European Article Number (EAN). Nevertheless,
it required a lot of marketing and modifications in library systems (Halm 2005).
Among other things, difficulties resulted from different syntaxes to express equivalent
ISBNs (example 10). In fact all syntaxes (except ISBN-13 with bookland namespace
979) can be mapped to each other, so it is an arbitrary decision which form to take
as the real ISBN. Similar problems of synonymy are also present in other identifier
systems.
ISBN-10 with hyphen     1-4909-3186-4
ISBN-10 with space      1 4909 3186 4
plain ISBN-10           1490931864
EAN                     9781490931869
EAN barcode aligned     9 78149 093186 9
ISBN-13 with hyphen     978-1-4909-3186-9
plain ISBN-13           9781490931869
URN-ISBN                URN:ISBN:1-4909-3186-4

Example 10: Equivalent syntaxes of the same ISBN
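The mapping from ISBN-10 to ISBN-13 can be sketched in Python (an illustration assuming the bookland prefix 978; the function name is hypothetical):

def isbn10_to_isbn13(isbn10):
    digits = '978' + isbn10.replace('-', '').replace(' ', '')[:9]
    # EAN check digit: weights 1 and 3 alternate over the first twelve digits
    check = (10 - sum(int(d) * (1 if i % 2 == 0 else 3)
                      for i, d in enumerate(digits)) % 10) % 10
    return digits + str(check)

print(isbn10_to_isbn13('1-4909-3186-4'))  # 9781490931869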
Today the most used identifier systems, apart from character encodings, are the
Uniform Resource Locator (URL) (referred to as web address in common speech),
the Uniform Resource Identifier (URI), and the Uniform Resource Name (URN). These systems
are often confused because they all evolved together with the World Wide Web
(WWW) and the Hypertext Transfer Protocol (HTTP). The WWW was introduced by
Tim Berners-Lee in 1990. His first design includes considerations on document
naming as "probably the most crucial aspect of design and standardization in an
open hypertext system". Tim Berners-Lee (1991) discusses addressing, naming, and
uniqueness as three different properties and introduces URL to cover the addressing
part of a global naming system. HTTP as defined by Tim Berners-Lee (1992)
allowed for the use of different types of Universal Resource Identifiers, but only listed
URL and other addressing schemes (FTP, gopher, etc.) as examples.8 In 1992
7 For instance ISBN 3-453-52013-0 was used for two unrelated books by Heyne-Verlag in 1974 and 2004
and ISBN 3-257-21097-3 was assigned to every single work of B. Traven published by Diogenes-Verlag.
8 An exception was the Content-ID (cid) scheme which did not include a server as physical address.
cid was later specified as URL scheme (sic!) with RFC 2111 (1997) and RFC 2392 (1998) but it
[Table: properties commonly demanded of identifier systems: unambiguous, unique, global, persistent, readable, structured, uniform, performant, descriptive, actionable, distributed, ordered]
a conversion algorithm that maps URI to IRI. If URI were a subset of IRI, no such
mapping would be needed. Both mappings use UTF-8 as intermediate format and
percent-encoding of special characters.
The ongoing problems of URI and related identifier systems have multiple reasons.
For instance the assumption of a universal set of objects leads to paradoxes
because real world objects and identifiers have no rank or category in terms of set
theory. With the data: URI scheme one can even convert any piece of data into an
identifier (Masinter 1998).
Above all, many goals of an identifier system cannot be achieved on a purely technical
level. The domain name system (DNS) gives examples of how politics and social power
structures shape identifier systems (Rood 2000). Identifier systems can also be
implemented and interpreted differently by different users. As noted by Tim Berners-Lee
(1998a), people change identifiers, on purpose or by accident. Like all social
constructs, identifier systems can also become outdated: for instance the URI scheme
info: was launched in 2003 but closed in 2010 in favor of URL (OCLC 2010;
Sompel et al. 2006). Last but not least, identifier systems often try to solve problems
that cannot be solved together: Wilcox-O'Hearn (2001) showed that an identifier
system cannot provide securely unique, memorable (readable), and decentralized
(distributed) identifiers at the same time; only two of these properties can be
combined: local identifier systems can generate readable and distributed identifiers
but they are not globally unique, centralized systems such as DNS can provide globally
unique and readable identifiers, and cryptographic hashes are distributed and unique
but not readable. Partial solutions to Zooko's triangle (Stiegler 2005) involve
multiple layers of identifier systems, which is another example of the importance of
studying relationships in combined identifier systems such as depicted in figure 3.1.
complex of multiple buildings. A third example is taken from Eriksson and Agerfalk
(2010): the Swedish person identification number consists of ten digits where digits
one to six represent a person's day of birth (YYMMDD) and the tenth position is a
check number that can be calculated from digits one to nine. This implies that
one cannot derive the full date of birth, because the century is not included (2005
and 1905 both become 05). To distinguish people born on the same day (without
century), digits seven to nine of the identifier contain a unique sequence number that
is only partially descriptive. The ninth position is odd if the number identifies a man
and even if it identifies a woman. To be more precise, the ninth position can only
describe the assumed sex of a person at the time when the identifier was created,
because for some people the sex may have changed during their life. Furthermore
some attributes may be unknown: when more and more immigrants with unknown
birthdate got a Swedish person identification number, the first of January or the
first of July was recorded instead and the identifier system ran out of numbers
gender as given while others as purely artificial (Butler 1990). However most identifier conflicts originate
from different interpretations of what kind of object the referent actually is.
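The check number mentioned above can be computed with the Luhn algorithm. A minimal sketch, assuming the commonly documented variant for Swedish person numbers (the sample number is a well-known test value, not a real person):

    def check_digit(first_nine: str) -> int:
        # Luhn variant: multiply digits alternately by 2 and 1,
        # add the digit sums, and take the complement modulo 10
        total = 0
        for i, c in enumerate(first_nine):
            product = int(c) * (2 if i % 2 == 0 else 1)
            total += product // 10 + product % 10
        return (10 - total % 10) % 10

    assert check_digit("811218987") == 6   # 811218-9876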
[Figure 3.2: a (cryptographic) hash function maps a referent to a fixed-width identifier; the reverse direction is impractical (one-way)]
Cryptographic hash functions treat the whole input as significant part: any change
of an input value must result in a different hash code, so attackers cannot modify
messages without modifying the message digest. In addition the function should
fulfill the following properties: First, the hash code should not reveal more information about the input than its own expression.15 Second, the hash function must be a
one-way function: given a hash code it must be very hard to find a message that is
14 Some hash algorithms allow the hash of a composite object to be computed from the hashes of its
constituent parts. For instance the Rabin fingerprint f of a concatenation A.B of two strings A and B
can be computed via the equality f(A.B) = f(f(A).B) (Broder 1993).
15 In a broader sense, all hash functions are descriptive because their hash codes are defined based on the
full digital content as property of the identified object. In a narrower sense cryptographic hash keys
are not descriptive because they only describe the hash code as property of the object's content.
mapped to this digest. This property is also needed to prevent creating a message as
collision of another given message. Third, an even more strict requirement used to
evaluate the strength of cryptographic hash functions is the lack of a method to find
any collisions. Given that the number of possible input values is much larger than
the number of possible hash codes, there always exist collisions, but it is very difficult to find them. For instance the cryptographic hash function SHA1 (D. Eastlake
and Jones 2001) has codes of 160 bit length, so there are 2^160 different SHA1 hash
codes. According to the rules of probability, the expected number of hashes that can be
generated before an accidental collision (birthday paradox) is 2^80. The sun will
expand in around 5 billion years (less than 2^58 seconds from now), making life on
earth impossible. Until then one can generate 2^22 (4 million) hashes per second and
collisions are still unlikely. With systematic cryptographic attacks the number can be
smaller but it is still much larger than other sources of error.
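A short Python illustration with the standard hashlib module shows the fixed width and the avalanche effect (any change of the input yields an unrelated digest); the input strings are arbitrary examples:

    import hashlib

    a = hashlib.sha1(b"The C programming language").hexdigest()
    b = hashlib.sha1(b"The D programming language").hexdigest()

    assert len(a) == len(b) == 40   # 40 hex digits = 160 bits, fixed width
    assert a != b                   # a one-character change flips the whole digest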
are probably various possible file structures that will be useful in aiding creative
thought. Regularly he complained about the "tyranny of the file" and hierarchical
directory structures (Nelson 1986).18 His article was referenced in the specification
of the Multics file system (Daley and Neumann 1965) but never really picked up among
file system developers, so today POSIX remains the standard way of thinking about
files.
File systems normally do not restrict the content of files; even files of zero byte
length are allowed. The maximum file size, the maximum number of files, and the
maximum sum of file sizes may be limited in a given context. To uniquely identify
files, each file should have a file name. This simple assumption is complicated by
operating systems and file systems which impose different restrictions on file names.
In its most general form a file name is a non-empty sequence of bytes. All systems
exclude at least the NUL byte 0x00. File systems also limit the length of a file name to
a maximum number of bytes and/or Unicode characters. Modern file systems (NTFS,
HFS+, ZFS etc.) only accept valid Unicode characters, so file names are (possibly
normalized) Unicode strings. Nevertheless there is a strong tradition in the Unix
community to view file names as sequences of raw bytes. Some file systems are case
sensitive (one can have two distinct file names A and a), some are case insensitive,
and some are case insensitive but case preserving (A and a refer to the same file,
whose name can be written either way). Moreover each system disallows some
special characters or bytes, for instance the directory separator / or \. Other special
characters include quotes (" and '), brackets (<, >, [, ]), dot (.), colon (:), vertical
bar (|), asterisk (*), and question mark (?).
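Portable applications therefore often sanitize file names before use. A minimal sketch, assuming a deliberately conservative rule set that covers the characters named above (actual restrictions differ per system):

    import re

    def sanitize_filename(name: str, max_len: int = 255) -> str:
        # replace NUL, separators, and commonly disallowed special characters
        cleaned = re.sub(r'[\x00/\\<>\[\]:"|*?]', "_", name)
        return cleaned[:max_len] or "_"

    assert sanitize_filename('report: "final"?.txt') == "report_ _final__.txt"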
II. Extensions and types
In the first CTSS system, files were composed of two parts, each with up to six
characters. The first part was used as descriptive name and the second indicated
the file's type. Many operating systems followed this convention and supported
or required file extensions. However the extension may not reflect a well-defined
type, or a file may not have an extension at all. Therefore many programs treat the
extension as one of multiple indicators to determine file type. Depending on the
context, file(name) extensions are part of the file name or additional metadata of
the file.
III. Versions
Versioning files is not common in today's file systems although it was already supported in early time sharing systems such as ITS (D. E. Eastlake 1972) and
TENEX (Bobrow et al. 1972). The ITS system simply treated numeric file extensions
as version numbers. The user could select to read a file with the highest version
number of a given name or to increment the highest version number and create a
new version for writing. In TENEX and its successors (TOPS-20 and OpenVMS) the
version number is an additional part of the file name. Today versioning is rarely implemented in the file system but on top of it in applications, for instance in revision
control systems (Subversion, git, mercurial, etc.), or in storage services that can be
accessed like a file system such as Amazon S3 (Amazon 2010). Other file system features like cloning or snapshots as provided in ZFS and NTFS can emulate versioning
to some extent. Just like file extensions, version numbers can be considered as part
of the file name or as a special kind of additional metadata. However version numbers
allow multiple versions to share the same file name, and they define a strict order
between all versions of a file (or a partial order in revision control systems that
allow branches and merges). The version number itself does not have to be a simple
integer but can also be a timestamp.
IV. Directories and hierarchies
Directories are a common method to group files. In CTSS each user had a private
directory and there were common directories for sharing. A directory acts like a
namespace for file names (see section 3.2.2): different files in different directories
can share the same name. From a conceptual point of view there is little difference
between simple directories and other file system prefixes such as volume or disk:
both simply hold a set of file names. In a broader context each directory must
have a unique name just like a file. Hierarchical file systems as introduced with
Multics typically apply the same rules to file names and directory names, while flat
file systems such as Apple DOS and Amazon S3 have different rules for directory
names and file names. In addition the maximum number of files per directory can
be limited.
In the mid-1960s hierarchies were introduced in the Multics project that led to the
development of Unix after Bell Labs pulled out of Multics in 1969 (Ritchie 1979).
In Multics a directory is a special kind of file and the file structure is a tree of files,
some of which are directories (Daley and Neumann 1965). The user is considered
to be operating in one directory at any one time, called his working directory. File
names must only be unique with respect to the directory in which they occur. For
every file and directory one can define a unique name by prepending the chain of
directory names required to reach the file from the root. This chain is called absolute
path (tree name in Multics). One can also construct a relative path starting from a
given working directory. Paths require a special character as directory separator
that must not occur in file names. It should be noted that POSIX does not require
a file system to form a tree. The first Unix file system had the shape of a general
directed graph (Ritchie 1979). However the requirement to have unique paths was
more important. For this reason, hierarchies other than trees, such as directed (acyclic)
graphs or polytrees, are forbidden by the file system or operating system. Other
restrictions can be put on the maximum depth of directory nesting and the maximum
length of a path.
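The distinction between absolute and relative paths can be tried with Python's pathlib; the directory names are hypothetical:

    from pathlib import PurePosixPath

    # absolute path: the chain of directory names from the root
    absolute = PurePosixPath("/home/alice/thesis/data.csv")
    assert absolute.is_absolute()

    # relative path: the same file seen from a working directory
    relative = absolute.relative_to("/home/alice")
    assert str(relative) == "thesis/data.csv"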
V. Links
In simple file systems each file is identified by its name. Multics and Unix changed
this one-to-one relationship by separating files and directory entries (hard links) and
by adding pointers to file names (symbolic links). Both kinds of links are included
75
Berkeley Fast File System implemented them in the BSD-branch of Unix in 1983.
introduced in the Macintosh File System (MFS) in 1984. Unlike other attributes
the fork can be a byte sequence of arbitrary size just like the file content. Forks are
available in Apple's successor file systems HFS/HFS+ and as Alternate Data Streams
in Microsoft's NTFS (1993).
From a conceptual point of view the file content can be treated as one attribute
or fork of the file among others. In practice extended attributes are rarely used
beyond system applications because of limited support in user interfaces. The BeOS
file system supported typed attributes (string, time, double, float, int, boolean, raw,
and image) and indexing, searching, and sorting by any attribute field similar to a
database (Giampaolo 1999). A query for all files with specific attribute properties
could also be used as a virtual directory similar to a view in a database.
3.4. Databases
Historically, data base systems evolved as generalized access methods [...] As a result,
most data base systems emphasize the question of how data may be stored or accessed,
but they ignore the question of what the data means to the people who use it.
John F. Sowa 1976: Conceptual Graphs for a Data Base Interface
A database is a managed collection of data with some common structure. The general
form of a database is shaped by its database model, which is implied by the particular
implementation of a database management system (DBMS). Overviews of database
models can be found in Silberschatz, Korth, and Sudarshan (2010); Elmasri and
S. Navathe (2010); S. B. Navathe (1992); and Kerschberg, Klug, and Tsichritzis (1976).
Although you can identify some general model types, the exact definitions of specific
database models differ. As pointed out by W. Kent (1978, chapter 9), comparisons
should not confuse data models and concrete implementations. Database models
rarely occur as such, but only as abstractions of DBMS implementations. The model
can be derived from an explicit specification, from the DBMS schema language (see
section 3.7.4), and from its query API (see section 3.10).
The development of database systems and models is not a logical chain of improvements, but driven by trends and products. In the 1960s the difference between file
systems, as described in the previous section, and database systems was still small:
for instance IBM marketed a series of mainframe DBMSes as Formatted File Systems.
Simple databases were (and still are) built of plain records and tables without any
elaborated model (section 3.4.1). The hierarchical model (section 3.4.2) and the
network model (section 3.4.3) were designed close to properties of the underlying
storage media. A better separation between logical level and physical level was
introduced with the relational model (section 3.4.4). Since the 1970s (in database
research) and the 1980s (in commercial products), it has become the preferred
database model, at least in its interpretation by SQL. In the 1980s object oriented
databases (section 3.4.5) and graph databases appeared. Object orientation had
impact on relational databases, which partly evolved to object-relational databases.
Graph databases have experienced a revival since the late 2000s in connection with the
NoSQL movement (section 3.4.6).
So-called semantic database models (Hull and King 1987; Peckham and Maryanski 1988) will be described as conceptual data models in section 3.8. Other more
specialized database models not covered here are spatial, temporal, and spatiotemporal databases, and multidimensional databases. The former add better support
of space, time, and versioning (C. X. Chen 2001). The latter are used to efficiently
summarize and analyze large amounts of data (Vassiliadis and Sellis 1999).
[Figure 3.3: a self-describing record: field names (dc:creator, dc:title, dc:publisher, dc:date) and field values, separated by = and line breaks (0A)]
field values, it must be escaped. The second method is favoured in database theory,
because it directly maps to the mathematical concept of a tuple and you can list
multiple records in a table (figure 11). However single records are not self-describing
and do not allow exceptions or repeated fields. In practice, they lead to the invention
of special NULL-values to denote not applicable or unknown field values and to
ad-hoc formatted lists of values, packed in one field.26 The third method is more
flexible, but it also needs some separators between field names and field values. An
example of a self-describing27 record is shown in figure 3.3. It contains a Dublin
Core (DCMES) record, encoded with the special characters = (U+003D) and line break
(U+000A). Another example with numbers and characters as field names is given in
appendix D.
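Such a record is trivial to parse as long as the separators do not occur in field values. A Python sketch with field values modeled on figure 3.3 (the exact values are assumptions):

    record_text = (
        "dc:creator=Kernighan and Ritchie\n"
        "dc:title=The C programming language\n"
        "dc:publisher=Prentice-Hall, Englewood Cliffs, NJ\n"
        "dc:date=1978\n"
    )

    # split each line at the first '=' into field name and field value
    record = dict(line.split("=", 1) for line in record_text.splitlines())
    assert record["dc:date"] == "1978"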
The inclusion of field names in records is also known as markup, which is described
in more detail in section 3.6. Markup allows you to freely omit, repeat, and order
fields. Such records do not even need a record schema but can consist of a simple
list of field names and field values, such as in INI files (see section 3.5.2). These
schema-free records have been avoided in most databases over the last decades, but
they are getting popular again with NoSQL databases. If fields are unordered
and they can only occur once per record, the record model corresponds to the
concepts of associative arrays, maps, hashtables, or dictionaries from type theory
and programming languages. If stored or sent as a sequence of bytes, however, the
record does not show whether fields are ordered and which fields have been omitted.
Given the record in figure 3.3, you need background knowledge about the record type
Dublin Core Metadata Element Set (DCMES) to know that the order dc:creator,
dc:title, dc:publisher, dc:date is not relevant, and that there are eleven other
possible fields defined in DCMES. Other interpretations of DCMES allow repeating
the dc:creator field to express lists of creators, which requires field order to be
preserved. The limitation of fields of a record to non-repeatable, atomic values,
also known as first normal form, is often assumed implicitly. But as described
by Fotache (2006), the notion of atomicity depends on context. For instance in
figure 3.3 you could split the dc:publisher field value into the publisher's name
(Prentice-Hall), place (Englewood Cliffs), and state (NJ). dc:title could be split
in words and characters, and dc:date in century, decade, and year of the decade. The
inclusion of field names in records increases flexibility, but it still selects a specific
set of fields that may be quite different in another context.
Some record models, for instance the MultiValue/PICK database and MARC
records, allow fields to be repeated or split up into subfields. If you allow arbitrary
nesting of records in fields, subfields, sub-subfields and so on, you end up with a
hierarchical database. If you restrict the number of levels to a fixed value n you can
represent an ordered, flat file database structure with n special separator elements
that must not occur in (sub)field values. ASCII defines four such control characters
26 See the list of names in the creator field in figure 3.3.
27 The term self-describing is used with similar carelessness as the term semantic. In most cases, the
only description of self-describing data is a simple mapping of data elements to their field names.
that were used in many binary flat file formats (example 3.7):

    ASCII name              Unicode name            MARC name
    File Separator (FS)     INF. SEPARATOR FOUR     -
    Group Separator (GS)    INF. SEPARATOR THREE    Record Separator
    Record Separator (RS)   INF. SEPARATOR TWO      Field Separator
    Unit Separator (US)     INF. SEPARATOR ONE      Subfield Delimiter

[Figure 3.4: model of a flat file database: a File contains Records, a Record contains Fields with Values; records and fields may be indexed (*)]
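A short Python sketch shows how these control characters serialize an ordered flat file structure (the record content echoes figure 11):

    US, RS = "\x1f", "\x1e"   # Unit Separator and Record Separator

    records = [
        ["Kernighan and Ritchie", "The C programming language", "1978"],
        ["Bjarne Stroustrup", "The C++ programming language", "1985"],
    ]

    # fields joined by US, records joined by RS; the separators
    # must therefore not occur in any (sub)field value
    data = RS.join(US.join(fields) for fields in records)
    assert [r.split(US) for r in data.split(RS)] == records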
Records are used as building blocks in many data structuring methods. The
database model of a plain record store is also known as flat file database model.
Records in a flat file database are typically stored and accessed sequentially, which
enforces an order on all records and fields, no matter if this order is relevant or not.
Figure 3.4 shows the model of a flat file database with atomic fields: Each file may
have zero or more records, which each must belong to exactly one file. Each record
may contain zero or more fields, which each must belong to exactly one record, and
have exactly one field value. To refer to a particular record or field, it must be indexed.
A record index is also called record identifier. If records and/or fields are ordered,
their position can be used as one index. Field names are the usual index for fields,
but only for non-repeatable fields. Selected field values, or combinations of multiple
field values, can be used as record identifier; but only if every record happens to
contain the selected fields, and if their values uniquely identify the records. Such
additional constraints are not part of the record model, but fundamental in the
relational model.
A table of records is the usual method to manage files of records if all records share
the same fixed set of fields (figure 11). The first row contains the record schema
by listing its field names, and each following row contains one record. The table
header can be omitted if fields are indexed by their position. Single records are also
indexed by their position, so record identifiers are not part of the record either.
The lack of a clear definition of record identifiers is one drawback of the record
model. Without record identifiers as link target you cannot express relations that
span multiple records. On the other hand there are implicit relationships between
the fields of one record, but some relationships within a record cannot be described
(W. Kent 1978, ch. 8.4f). It even depends on context whether a record represents
an entity or a relationship. In summary, records lack a clear method to express
relationships while struggling with a rich variety of representational alternatives
(W. Kent 1988). Nevertheless the record concept is useful in grouping data elements,
Figure 11: a table of records and its serialization in CSV:

    author                  title                           year
    Kernighan and Ritchie   The C programming language      1978
    Bjarne Stroustrup       The C++ programming language    1985

    author,title,year
    Kernighan and Ritchie,The C programming language,1978
    Bjarne Stroustrup,The C++ programming language,1985

(the record schema is given in the first row; line breaks (0A) act as record separators)
and can be found on various levels; one must only take care which variant of the
record model one deals with in a particular application.
[Figure 3.5: a hierarchical database with two hierarchies (library with catalog, topic, and publication; vendor with acquisition) and virtual hierarchies (tlink, tplink). Limitations: (1) 1:1 constraint not expressible; (2) n-ary relationship requires virtual hierarchy; (3) recursive relationship requires virtual hierarchy (some hierarchical models allow recursion); (4) m:n relationship requires virtual hierarchy; (5) n:1 relationship requires virtual hierarchy]
publication must be expressed as a record (2). This also requires a virtual hierarchy
(called logical in contrast to physical in IMS) between vendor and acquisition
and between publication and acquisition. The n:1 relationship between acquisition
and publication also requires a virtual hierarchy (5). The n:m relationship between
publications and topics (4) and the recursive 1:n relationship between topics and
subtopics (3) are also modeled by virtual hierarchies, but the monohierarchy cannot
be expressed.
Virtual hierarchy pointers can be used to circumvent some of the hierarchic
model's limitations, similar to symbolic links in file systems (section V). Yet in
practice they are less efficient and their integrity is not ensured by the DBMS. Other
hierarchic DBMS, for instance native XML databases, share similar limitations.
between a data description language (DDL) to define the database schema and a data
manipulation language (DML) to query and modify the content of a database. The
DBTG model can be described as a partly ordered graph with typed records as nodes
and typed links as edges. Records can have attributes as described in section 3.4.1.
Similar to the hierarchical model, links are limited to 1:n relationships. In the DBTG
model a relationship is called set, with the relationship type as set type, one record as
owner and zero or more records as members of the set. The DDL statement "insertion
is automatic; retention is mandatory" can mark a set type as required for the
member record. Other features of the DBTG model include unique key attributes,
ordering sets by a selected record attribute, and singular sets.
In contrast to the hierarchical model there is no separation between hierarchical
links and virtual pointers. A record can be a member of multiple owners, as long
as each membership takes place in a different set type. Other limitations of the
hierarchical model are also present in the network model: there is no separation
between 1:n and 1:1 relationships ((1) in fig. 3.5 on the right) and relationships with
more than two members must be modeled by records (2). Such additional junction
records are also used to model recursive relationships (3) and n:m relationships (4).
Later extensions of the network database model evolved to the role data model. It
was planned as a generalization of the network and the relational model, similar to
object databases, but never really got adopted (Bachman and Daya 1977; Steimann
2007).
30 In short, the relational model had a lot of impact, especially spoiled and misinterpreted by SQL, which
also had a lot of impact, especially spoiled and misinterpreted by its differing implementations.
the third normal form (3NF) (Codd 1971), the Boyce-Codd normal form (BCNF)
(Codd 1974), the fourth normal form (4NF) (Fagin 1977), and more forms. The
general objective of normalization is to map relations and dependencies of information
to database relations, without introducing possible redundancy and inconsistencies.
Single facts should only be stored once, and facts that can be derived from other
facts should not be stored at all. As a result, no queries are favoured at the expense
of others. Any complex query can be built by joining multiple tables that share
fields of the same domain. To speed up specific queries, indexes and views that act
as caches of query results can be created. In practice, some parts of the database
are not normalized on purpose for performance reasons; in this case, the database
user must take care of database integrity.31
Leaving performance issues aside, normalization is also useful independent from
relational databases. 1NF deals with uniformity and atomicity: all rows of a record
type must contain the same number of fields, and field values must not be decomposable into smaller data items (subfields). First normal form is often assumed
implicitly, but silently ignored at the same time. For instance a table of publications
and authors must contain exactly one author for each single publication to fulfill
1NF. To express publications without author, we must either introduce a new table
of all publications, or a virtual anonymous author as NULL value. Lists of authors
(like Kernighan and Ritchie in figure 11) can be allowed by either adding another
table with author-list and list-member, or by allowing lists as atomic data types. The
latter solution has been favoured by advocates of object-relational models such as
Darwen and Date (1995), as the notion of atomicity depends on context (Fotache
2006). However, the ad-hoc introduction of NULL values and lists without any strict
definition of their meaning aligns neither with 1NF nor with any other database
formalism.32
Second and third normal forms are grounded in the concepts of database keys and
functional dependencies. Both can also be applied to other types of databases. A
key is a set of one or more fields (or table columns) which in their relation (or
table) uniquely identify a record. If relations are sets, every table has an implicit
key, built of all its columns. Generally, keys should be short, to be used as
references (also known as foreign keys) in other tables. Under 2NF and 3NF, all
non-key fields must functionally depend on the combination of all key fields. That
means the mathematical binary relation between all key fields and any non-key
field must be a total function (totality is enforced by exclusion of NULL values with
1NF). In example 12, publications are listed in table a with country, author, year,
title, publisher, and place. If we choose the set {author, year} as database key (red
uniqueness overline in 12), each combination of author and year must uniquely
identify a publication's country, title, publisher, and place. Obviously, an author
can create multiple titles per year, so we must modify the key. For instance, each
31 A popular example of denormalized databases are data cubes in Online Analytical Processing (OLAP).
32 In practice, NULL values often cover a fuzzy bunch of meanings, and lists are encoded by introduction
of separators (for instance the string " and "), that are not part of any database definition known to the
DBMS.
[Example 12 (fragments): further tables pairing author with country, publisher with place, and author with award]
4NF and additional normal forms attempt to minimize the number of fields
involved in a key. Let us assume we want to store information about the popularity
of authors. Table e in example 12 lists languages that authors have been published in,
and awards that authors have received. Under fourth normal form, the implicit key
{author, language, award} should be split into two independent keys (tables f), because
languages and awards are independent from each other. If the awards and languages
are properties of publications, a normalized database should not contain tables e
and f at all, because their content can be derived by relational algebra as a view from
other tables.
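The lossless decomposition behind 4NF can be demonstrated with plain sets in Python; the sample data is hypothetical:

    # table e: implicit key {author, language, award}
    e = {
        ("Ritchie", "English", "Turing Award"),
        ("Ritchie", "German",  "Turing Award"),
        ("Ritchie", "English", "National Medal of Technology"),
        ("Ritchie", "German",  "National Medal of Technology"),
    }

    # tables f: two independent keys, languages and awards split apart
    f_language = {(a, l) for (a, l, w) in e}
    f_award    = {(a, w) for (a, l, w) in e}

    # joining f back together loses nothing, exactly because
    # languages and awards are independent from each other
    rejoined = {(a, l, w) for (a, l) in f_language
                          for (a2, w) in f_award if a == a2}
    assert rejoined == e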
Despite the theoretical importance of database normalization, it has failed to
become an ultimate aid for database designers. At most, they use normalization
for validation after creation of databases based on intuition. Fotache (2006) gives
some reasons for this gap: popular textbooks on database design often describe and
exemplify normalization poorly or even incorrectly. Most literature on normalization
focuses on rigor and sober mathematics and uses artificial examples, instead of real
world applications. There is a lack of graphical diagramming tools for functional
dependencies, and other graphical modeling languages such as ERM (section 3.8.1)
are based on different philosophies. Another problem, also raised by W. Kent (1983b), is
the dependency on uniquely identified entities. If fields contain different values for
the same object, for instance different spellings of names, speaking about keys and
dependencies is futile. However normalization only reveals the difference between
rigorous data and fuzzy reality, which would also exist without it.
33 Note that this only extends the number of publications per author and year from one to 26.
34 For some reason, nationalists and library classifications try to uniquely group authors under countries.
principle. It states that all provable properties of instances of T must also be true for instances of S.
Thus any algorithm designed for instances of T will behave exactly the same if used with S instances.
[Table: data binding languages, their origin, and their data description languages]

    language           first defined       DDL
    ASN.1              1984 by ISO         Encoding Control Notation (ECN)
    Thrift             2007 by Facebook    Interface Description Language (IDL)
    Protocol Buffers   2008 by Google      .proto files
    Etch               2008 by Cisco       IDL
    MGraph             2008 by Microsoft   MSchema/MGrammar
    BSON               2010 by MongoDB     -
[Table: data types in Protocol Buffers]

    Type(s)           in XML   in Java      Content
    int32, sint32     int32    int          signed 32-bit integer
    uint32, fixed32   uint32   int          unsigned 32-bit integer
    int64, sint64     int64    long         signed 64-bit integer
    uint64, fixed64   uint64   long         unsigned 64-bit integer
    float             float    float        32 bit floating point (IEEE 754)
    double            double   double       64 bit floating point (IEEE 754)
    bool              bool     boolean      true or false
    string            string   String       Unicode or 7-bit ASCII string
    bytes             string   ByteString   sequence of bytes
    enum              enum     enum         choice from a set of given values
    message           class    class        multimap with unique, unsorted keys and
                                            repeatable, sorted fields (possibly
                                            constrained by a schema)
in Rivest (1997) for S-EXP. Each language uses a tiny set of data types with strings
or byte sequences as the only atomic type. The syntaxes of INI, CSV, and S-EXP are
mainly defined as context-free languages in Backus-Naur Form with some additional
constraints.
We will now show underlying models for each of these languages. INI is primarily
used for configuration files. In its most basic form, it is just a key-value structure
with field names (Field) and values (Value). Some INI files may have a second level
(Section). Section names should be unique per file and field names should be unique
per section, but both constraints depend on the particular variant of INI. A general
model is shown in figure 3.6. In summary, INI files are a special instance of the
record database model as described in section 3.4.1 (see the flat file database model
in figure 3.4).
CSV is popular for exchanging simple lists of database records. An example is given
in figure 11. CSV is based on a tabular model (figure 3.7) where data is stored in cells
(Cell) that form a grid of rows (Row) and columns (Column). In general, all rows must
have the same set of columns, or they are automatically unified by adding missing
cells with a default value.
S-EXP originates in the Lisp programming language and it is also used in some
data exchange protocols. The model of S-EXP is a rooted, ordered tree with strings
or empty lists as leaves (figure 3.8). There is not one standard but several dialects. A
canonical subset of S-EXP with binary form has been proposed by Rivest (1997).
The Value of a Field, Cell, or String in INI, CSV, or S-EXP respectively can hold
arbitrary byte sequences or character strings, depending on the specific language
variant. Some byte or character sequences may be disallowed, especially for names
of sections, fields, and columns in INI and in CSV.
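Python's standard library reflects two of these models directly; a small sketch with hypothetical content:

    import configparser, csv, io

    # INI: sections with unique field names, a flat key-value record model
    ini = configparser.ConfigParser()
    ini.read_string("[server]\nhost = example.org\nport = 8080\n")
    assert ini["server"]["port"] == "8080"   # all values are plain strings

    # CSV: a grid of rows and columns, the first row often holds column names
    text = "author,title,year\nBjarne Stroustrup,The C++ programming language,1985\n"
    rows = list(csv.reader(io.StringIO(text)))
    assert rows[0] == ["author", "title", "year"]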
[Figure 3.6: model of INI with variants: optional Sections (with names) contain Fields with Names and Values]
[Figure 3.7: tabular model of CSV: numbered Rows and named Columns intersect in Cells that hold Values]
3.5.3. JSON
JavaScript Object Notation (JSON) is based on notations of the JavaScript programming language. First specified by Crockford (2002) and later standardized as RFC
(Crockford 2006), it soon became a widespread language to exchange structured
data between web applications, serving as an alternative to XML. JSON was first
published in form of a railroad diagram (see section 3.7.1) and later expressed in a
variant of Backus-Naur Form. Figure 3.9 shows a full BNF grammar of JSON. In a
nutshell, JSON is based on a data model with four atomic value types (String, Number,
Boolean, and Null) and two composite types (Array and Object). Strings can hold any
Unicode codepoint, but most applications will limit codepoints to allowed Unicode
characters. Numbers include integer values and floating point values without limit
in length and precision.47 An Array holds a (possibly empty) list of values, and an
Object holds a (possibly empty) map from strings as keys to data elements as member
values.
The definition of JSON syntax as context-free language imposes the mathematical
structure of a partly-ordered tree on models of JSON. In such a model, nodes are
values, but atomic types must be leaf nodes and the root node must be a composite.
47 Special numbers like -0, NaN, and Inf are not allowed.
[Figure 3.8: model of S-EXP: a rooted, ordered tree of Lists with Elements; Strings or empty Lists as leaves]
Figure 3.9: BNF grammar of JSON

    Composite = s* ( Object | Array ) s*
    Object    = "{" ( Member ( "," Member )* )? "}"
    Array     = "[" ( Value ( s* "," s* Value )* )? "]"
    Member    = Key ":" s* Value s*
    Key       = s* String s*
    Value     = Composite | String | Number | Boolean | Null
    String    = '"' ( char - ( '"' | '\' ) | charref )* '"'
    charref   = '\' ( ["\/bfnrt] | "u" [0-9A-F][0-9A-F][0-9A-F][0-9A-F] )
    Number    = "-"? ( "0" | [1-9] [0-9]* ) ( "." [0-9]+ )?
                ( ( "e" | "E" ) ( "+" | "-" )? [0-9]+ )?
    Boolean   = "true" | "false"
    Null      = "null"
    s         = ( #x20 | #x9 | #xA | #xD )
Similar structures to JSON are found in many programming languages, for instance
JavaScript and Perl, but they may contain pointers that go beyond the tree structure. In
addition, virtually all implementations add a uniqueness constraint on object keys,48
limit the maximum size of text, numbers, and nesting level, and restrict String to the
Unicode character set.49 With the rise of NoSQL (see 3.4.6) JSON is also used more
and more to store data in databases. Most JSON databases put additional restrictions
on special object keys ("", id, _id, $ref, ...) that are used for uniquely identifying
and linking JSON documents or parts of them. Other extensions such as Binary JSON
(BSON) restrict atomic types and/or add data types that are not part of the JSON
specification.50 There are some proposals for schema languages for JSON (JSON
Schema,51 Kwalify,52 JSONR53 ...), and for query languages to select a subset of a
given JSON document (JPath,54 JSONPath55 ...), but none of them is widely accepted.
Manipulation of JSON data is usually done directly in programming languages or
via custom database APIs.
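Two of these interoperability issues can be seen with Python's standard json module (behavior for duplicate keys and large numbers is implementation-specific, so other parsers may differ):

    import json

    # duplicate object keys are allowed by the grammar, but most
    # implementations silently keep only one value (here: the last)
    assert json.loads('{"a": 1, "a": 2}') == {"a": 2}

    # the grammar puts no limit on number length or precision;
    # Python keeps the integer exact, a JavaScript parser would round it
    assert json.loads("9007199254740993") == 9007199254740993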
The clear and simple definition of JSON has made it a popular data structuring
language not only for web applications but also for ad-hoc tasks in structuring,
storing, and exchanging data. Problems may result from differences in compatibility
of atomic types (especially keys and numbers) and from data that does not fit into
48 Repeated object keys (like {"a":1,"a":2}) are allowed in theory.
49 Unicode codepoints outside of UCS are allowed but not supported by all implementations.
50 BSON extends some parts of JSON but it does not support numbers of arbitrary length.
51 see http://json-schema.org
52 see http://www.kuwata-lab.com/kwalify/
53 see http://web.archive.org/web/20070824050006/http://laurentszyster.be/jsonr/
54 see http://bluelinecity.com/software/jpath/ and http://bitcheese.net/wiki/code/hjpath
3.5.4. YAML
YAML (YAML Ain't Markup Language) was developed as a human-readable alternative to XML and first published by Clark Evans in 2001 (Ben-Kiki, Evans, and
Ingerson 2009). Unlike most other DSLs it can natively express hierarchical and
non-hierarchical structures. In contrast to most other data serialization languages,
the YAML specification defines in one document: a syntax, a conceptual model, and
an abstract serialization to map between syntax and model.
YAML syntax is very flexible: it allows multiple alternatives to express complex
structures in a simple, human readable way as a stream of Unicode characters. Some
examples of language constructs are given in figure 3.10. Apart from repeatable
object keys and Unicode surrogate codepoints, which are not allowed in YAML, the
syntax is a superset of JSON syntax. Other similarities exist with the semistructured
data expression syntax (ssd) used by Abiteboul, Buneman, and Suciu (2000). The
abstract serialization of YAML is called its serialization tree. The serialization tree can
be traversed as a sequence of parsing/serializing events, similar to the event-driven
Simple API for XML (SAX) (see section 3.5.5). The conceptual model of YAML is
called its representation graph. It is defined by the specification as a rooted, connected,
directed graph of tagged nodes. In effect this is a special multi-property graph
with possible loops and three disjoint kinds of nodes. Figure 3.10 gives a partial
model of the representation graph in ORM2 notation:56 Sequence nodes impose an
order on outgoing edges, and Mapping nodes have their outgoing edges indexed by
node values, as described below. Nodes of outdegree zero can also be of the Scalar
kind, which each holds a Unicode string as value. Mapping keys can be arbitrary
nodes, which makes the structure rather complex, but in practice most YAML
instances represent simple hierarchies. Each node in a YAML representation graph
has exactly one Tag as node type.
Tags can be identified either by a URI (GlobalTag) or by a simple string (LocalTag).
A YAML schema is a set of tags. Each tag is defined by a URI, an expected node kind
(scalar, sequence, or mapping) and a mechanism for converting the values of its nodes to a
canonical form.57 Furthermore, a tag may provide additional information such as
the set of allowed values for validation, and the schema may provide a mechanism
for automatically resolving values to tags. For instance a schema could automatically
tag the string true as boolean value instead of a literal string. Normalization of
node values to their canonical form is important for node comparison. Keys of
a mapping node must not only be different but unequal. Two nodes are equal if
they have the same tag and the same canonical content. Equality of sequences and
mappings is defined recursively.58 The YAML specification lists some possible types
56 Roots, scalar values, (local) tag names and URIs are not included.
57 See http://yaml.org/type/ for a registry of known tags.
58 Note that recursive equality checks may require determining whether the subgraphs used as keys are
[Figure 3.10: partial model of the YAML representation graph in ORM2 notation: a Node is a Sequence, Mapping, or Scalar (with a value); each node kind has a corresponding Tag kind (SequenceTag, MappingTag, ScalarTag); mapping edges are indexed by a key. Example of a YAML anchor and alias:]

    &x foo
    [ *x, bar ]
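How anchors, aliases, and automatic tag resolution behave can be tried with the PyYAML library (assumed to be installed):

    import yaml  # PyYAML

    # the alias *x refers back to the node anchored as &x
    data = yaml.safe_load("x: &x foo\nlist: [ *x, bar ]")
    assert data == {"x": "foo", "list": ["foo", "bar"]}

    # tag resolution: the plain scalar true is tagged as boolean,
    # the quoted scalar stays a literal string
    assert yaml.safe_load("a: true\nb: 'true'") == {"a": True, "b": "true"}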
3.5.5. XML
XML has succeeded beyond the wildest expectations as a convenient format for encoding
information in an open and easily computable fashion. But it is just a format, and the
difficult work of analysis and modeling information has not and will never go away.
Wilde and Glushko (2008)
The Extensible Markup Language (XML) was designed between 1996 and 1998 as
[Figure 3.11: BNF grammar of XML with rules for document, misc, s (whitespace), element, starttag, endtag, emptytag, content, text, reference, charref, entityref, value, cdata, comment, pi, and attribute]
Most entities are replaced by their content when an XML document is read by
an XML processor (a piece of software that parses the syntax of an XML document
and provides access to its content and structure). However, some named entities
can remain as unparsed artifacts because they are external or because the DTD is
not taken into account by the processor. In practice the Simple API for XML (SAX)
(Megginson 2004) is a common abstraction in XML processors, especially for the
Java programming language. SAX is not a formal specification but it originates in an
implementation of an XML parser that was first discussed in early 1998. The API
of SAX provides a stream of parsing events that can be used to construct an XML
document, if the stream of events follows the well-formedness constraints of XML
(every XML document can be mapped to a stream of SAX events but not vice versa).
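Python ships a SAX implementation in its standard library; a minimal sketch of the event stream for a hypothetical document:

    import xml.sax

    class EventPrinter(xml.sax.ContentHandler):
        # each parsing event is reported through one callback
        def startElement(self, name, attrs):
            print("start", name, dict(attrs))
        def characters(self, content):
            print("text", repr(content))
        def endElement(self, name):
            print("end", name)

    xml.sax.parseString(b"<a><e f='g'>h</e></a>", EventPrinter())
    # prints: start a {} / start e {'f': 'g'} / text 'h' / end e / end a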
XML 1.0 defines two types of XML processors: validating and non-validating
processors. Non-validating processors must only check whether a document is well-formed, but they do not need to process all aspects of a DTD.61 Validating parsers
must analyze the entire DTD, including other documents referenced from the DTD,
and they must check whether the document matches the additional rules from its
schema (see section 3.7.2). A processor may even change the content of an XML
61 Some simple XML processors just ignore the DTD although this is against the specification. Removal
of the DTD is one of the most common requests in discussions about a future XML 2.0, as most XML
documents have no DTD, and validation is mostly done using other schema languages.
[Figure 3.12: parsing XML syntax, e.g. <a b:c="d"><e f=g>h</e>&i;<?j k?></a>, results either in no model (not well-formed: syntax error) or in a document model (structure) of some model type (DOM, Infoset, Canonical ...), which may further be modified by validation or found invalid by validation]
Parsing XML can best be understood as a process that converts XML syntax, given
as a sequence of characters, to another data structure (figure 3.12). In general the act
of parsing an XML document is not reversible, because some aspects of XML syntax
are considered as irrelevant (figure 3.13). The resulting data structure is a model
not only of the parsed document, but of all other logically equivalent documents
that result in the same model. Parsing XML can result in different structures. If
the original data was not well-formed, there is no model, and the document is no
XML by definition.62 The specific type of model defines which parts of syntax are
translated to which parts of a model and which parts are omitted as irrelevant to
the given model (figure 3.13). A processor may also modify the document to some
degree or it may mark the document as invalid.
The most prominent models of XML are the Document Object Model (DOM) and
XML Infoset. DOM evolved in parallel with XML in the late 1990s. It was created to
harmonize existing JavaScript interfaces that had been created by Web browser
makers for manipulating HTML documents. The part of DOM that deals with XML
documents is DOM Core. Actually there are three variants: Level 1 is based on the
tree structure of XML 1.0, Level 2 expresses the structure of XML with Namespaces,
and Level 3 expresses a model compatible with XML Infoset (Cowan and Tobin
2004). Another model of XML is shared by XPath 1.0 and Canonical XML (Boyer
and Marcy 2008); XPath 2.0 and XQuery define yet another model (Berglund et al.
2010). A given model may also be expressed in languages other than XML syntax.
For instance Fast Infoset (International Organization for Standardization 2005) is a
binary representation of Infoset based on ASN.1, and Tobin (2001) defines an RDF
Schema to serialize XML document models as RDF instances.
Despite all minor differences, all document and processing models of XML share
a basic structure that can be described as an ordered tree with nodes of different types.
62 In practice you sometimes have to deal with not-well-formed documents that were intended to be XML.
You can call these documents broken XML if there is a chance to recover well-formedness.
[Figure 3.13: aspects of XML syntax that may be irrelevant to a given document model: namespace prefixes, CDATA sections, the standalone flag, specified schemas, comments, whitespaces, processing instructions]
Basically, there are element nodes with exactly one element as root, attribute nodes,
and text nodes. Other node types (processing instructions, comments, external
entity references etc.) are much less used to hold relevant information, and they
depend more on the particular document model.
Each element node has a (possibly empty) set of unordered attribute nodes with
unique attribute names, and an ordered (possibly empty) list of text and/or element
nodes as child nodes. Attribute nodes cannot hold nested structures but only one
text node each, and text nodes are Unicode strings with some code points excluded.
Each attribute and each element node has a name. The exact definition of a name
from figure 3.11 depends on the specific XML model: in XML 1.0 a name is just a
Unicode string that does not contain some disallowed characters. The dependence on
a particular version of Unicode was lifted with the fifth edition (Bray, J. P. Paoli,
Sperberg-McQueen, et al. 2008). The most important (and often confusing) extension
to XML 1.0 is XML Namespaces (Bray, Hollander, et al. 2009): namespaces allow
names of elements and attributes to be qualified by a URI. This way names can
be grouped together in vocabularies and elements from different vocabularies can
be mixed in one document. In the model of XML with namespaces (and in other
techniques that build upon namespaces, such as DOM Level 2 and 3, Infoset etc.)
a name is a triple consisting of the namespace URI, a local name, and a namespace
prefix. In XML syntax namespaces are declared by special attributes that start with
xmlns (in example 15 the namespace is declared at the root element so it applies to
the whole document). Example 14 shows three XML elements that make use of a
namespace declaration. In most cases only the namespace URI and the local name
matter, so the first two examples should be treated as equivalent. The prefix is also
included in most models, and some applications rely on it.63 The third example
63 See http://www.w3.org/TR/xml-c14n#NoNSPrefixRewriting for details.
To allow more complex graph structures, there are several techniques to extend
the basic tree model of XML with links: attributes can be defined to only hold unique
ID values or references to other identifiers (IDREF in DTD or keyref constraints in
XML Schema). XLink (DeRose, Maler, et al. 2010) and XPointer (Grosso et al. 2003)
describe other extensions to XML to create links to portions of XML documents.
However, like other extensions to XML 1.0, this adds another layer of complexity
and another model that first must be agreed on to achieve interoperability. To reduce
complexity within the family of XML specifications, simplified subsets have been
proposed by Bray (2002), Clark (2010) and others, but none of them has been widely
adopted yet. Nevertheless, XML is successfully being used to encode and exchange
data on the Web and in other areas, from markup languages such as TEI to structured
metadata formats such as METS, MODS, and EAD. Furthermore several serialization
forms of other formats in XML exist, for instance RDF/XML for RDF and MARCXML
for MARC. As described by Wilde and Glushko (2008), many problems with XML
arose from overbroad claims for XML, which in the end is just a format. It is still
best suited for marked-up textual data and other records that can be modeled well as
an ordered tree, but less for data with arbitrary order and links.
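That only the namespace URI and the local name matter can be seen in Python's ElementTree, which discards prefixes when parsing (the namespace URI is a made-up example):

    import xml.etree.ElementTree as ET

    a = ET.fromstring('<d:item xmlns:d="http://example.org/ns">text</d:item>')
    b = ET.fromstring('<item xmlns="http://example.org/ns">text</item>')

    # both names are reduced to the pair of namespace URI and local name
    assert a.tag == b.tag == "{http://example.org/ns}item"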
3.5.6. RDF
The Resource Description Framework (RDF) dates back to the Meta Content Framework
(MCF) which Ramanathan Guha had created in the 1990s at Apple (Andreessen 1999;
R. V. Guha 1996). R. Guha and Bray (1997) submitted the idea of MCF to the W3C,
where it evolved to a general graph-based metadata framework, first released as RDF
specified by Ora Lassila and Swick (1999). Merged with ideas of Tim Berners-Lee
(1997), the focus had widened to metadata about any objects for creating a Semantic
Web that can express knowledge about the world itself (T. Berners-Lee, J. Hendler,
and O. Lassila 2001). The ambitious aim of RDF can be traced back further to the
artificial intelligence project Cyc, which Guha was co-leader of from 1987 to 1994, and to
the original proposal of the WWW (Tim Berners-Lee 1989). We will first describe
the basic components of RDF, show how its structure can be described and extended,
and list several serialization forms. Afterwards we will discuss how semantics is
brought to RDF via an algebra over RDF graphs.
64 Some vocabularies may specify additional identifiers for XML elements, for example in XML Schema
each element has a URI that happens to be constructable by appending the local name to the namespace
URI. However there is no general rule to do so in other vocabularies.
I. RDF components
In its most basic form, that is without additional techniques such as RDFS, OWL,
SKOS, etc., RDF is just a graph-based data structuring language. The RDF data
model is defined as abstract syntax by Klyne and Carroll (2004) as follows: an RDF
graph is a set of RDF triples (also known as statements), each consisting of a subject,
a predicate (also known as property), and an object. The nodes of an RDF graph are
its subjects and objects, and the edges are labeled by its predicates. Because RDF
graphs are defined on mathematical sets, each particular combination of subject,
predicate, and object is only counted once in a graph. In summary, an RDF graph
can be described as a multigraph with labeled edges, possible loops (triples where
subject and object coincide), and partly labeled nodes. This multigraph may contain
two kinds of graph labels: First, a URI reference (or URIref) is an absolute, percent-encoded URI or IRI. Two URIrefs are equal if and only if they compare as equal as
(Klyne and Carroll 2004, p. 6.4). This makes http://example.com/%41 and http://example.com/A
two distinct URIrefs, although in most applications, after normalization of percent-encoding, the
former results in the latter. The ambiguity cannot be solved in general, because there is no general
canonicalization algorithm for all types of IRIs. It is possible but strongly discouraged to have two
different URIrefs that percent-encode the same IRI.
66 By this, RDF can be seen as one of many attempts to create a perfect language, in which same words
always refer to same objects. The history of other attempts and their failures have been illustrated
vividly by Eco (1995).
67 To give an example, the XML Schema specification (Biron and Malhotra 2004) defines the datatype
xs:boolean with four allowed lexical values (true, 1, false, 0) that map to the canonical
literal values true and false.
68 It is assumed that the underlying graph isomorphism problem (GI) is strictly harder than polynomial
time (P), and strictly easier than non-deterministic polynomial time (NP) (Kobler 2006).
[Figure 3.15: an RDF graph: the node info:oclcnum:318301686 has dc:title and dc:date properties and two dc:creator edges to nodes with foaf:name "Kernighan" and "Ritchie". Abbreviations:
dc:title = http://purl.org/dc/elements/1.1/title
dc:creator = http://purl.org/dc/elements/1.1/creator
foaf:name = http://xmlns.com/foaf/0.1/name]
Table 3.11: Definitions of the RDF data model and its extensions

    RDF extension   subject     predicate   object      datatype   lang.   graph
    standard        U ∪ B       U           U ∪ B ∪ L   U          T
    symmetric       U ∪ B ∪ L   U           U ∪ B ∪ L
    generalized     U ∪ B ∪ L   U ∪ B ∪ L   U ∪ B ∪ L
    full blanks     U ∪ B       U ∪ B       U ∪ B
    named graphs                                                           U
    language URIs                                                  U [B]

U: URIref, B: blank node, L: Literal, S: Unicode string, T: language tag.
L = S ∪ (S × U) ∪ (S × T) and U, B, L, S, T are pairwise disjoint.
The set of blank nodes B is partitioned into disjoint sets for different RDF graphs.
[Figure 3.16: two non-equivalent instances of the graph from figure 3.15 in Turtle syntax]

    @base <http://viaf.org/viaf/>.
    <info:oclcnum:318301686> dc:title
        "The C programming language"@en;
      dc:date "1978"^^xs:gYear;
      dc:creator <108136058>, <616522>.
    <108136058> foaf:name "Kernighan".
    <616522> foaf:name "Ritchie".

(blank nodes replaced by URIrefs)

    <info:oclcnum:318301686> dc:title
        "The C programming language"@en;
      dc:date "1978"^^xs:gYear;
      dc:creator _:b1.
    _:b1 foaf:name "Kernighan", "Ritchie".

(two blank nodes merged to one)
data model, but from an algebra that can be defined on RDF graphs.73 The algebra
allows you to freely merge and intersect RDF data based on simple set algebra. This
is not possible in most other data structuring languages, for instance tree-based
languages, which must have exactly one root element.74 As blank nodes are always
disjoint for different graphs, the simple set intersection of RDF graphs cannot contain
blank nodes. The same applies to two RDF graphs A and B with A ⊆ B. The RDF
specification uses the word equivalent in a different way, so we better call A and B
set-equivalent or identical if A = B.
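Because an RDF graph is just a set of triples, the merge and intersection mentioned above reduce to plain set operations. A Python sketch with triples as tuples (the data echoes figure 3.15; prefixes are left unexpanded for brevity):

    g1 = {
        ("info:oclcnum:318301686", "dc:title", "The C programming language"),
        ("info:oclcnum:318301686", "dc:date", "1978"),
    }
    g2 = {
        ("info:oclcnum:318301686", "dc:date", "1978"),
        ("info:oclcnum:318301686", "dc:creator", "viaf:108136058"),
    }

    merged = g1 | g2   # union: duplicate triples are counted only once
    common = g1 & g2   # intersection
    assert len(merged) == 3
    assert common == {("info:oclcnum:318301686", "dc:date", "1978")}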
The RDF specification defines another kind of equivalence and two additional
relationships between RDF graphs: graph equivalence, graph instance, and graph
entailment. Basically, two RDF graphs A and B are equivalent, written as A ≅ B, if
there is a bijection M that maps literals to equivalent literals, URIrefs to equivalent
URIrefs, and blank nodes to blank nodes. In addition, the triple (s, p, o) is in A if and only
if (M(s), M(p), M(o)) is in B (Klyne and Carroll 2004, sec. 6.3). It is recommended,
but not required, to apply Unicode Normalization before comparing graphs for
equivalence.
An RDF graph H is called an instance of an RDF graph G if H can be obtained from
G by replacing zero or more of its blank nodes by literals, URIrefs, or other blank
nodes.75 H is a proper instance of G if at least one of its blank nodes has been
replaced by a non-blank node, or if at least two blank nodes have been mapped to
one. Two graphs A and B are equivalent if and only if both are instances of each
other but neither is a proper instance. Figure 3.16 shows two possible non-equivalent
instances of the graph from figure 3.15.
73 Some may argue that the algebra is an inherent part of RDF. But this would neglect all RDF applications [...]
74 [...] based data formats, but as described in section 3.5.2, these formats lack a precise definition.
75 The term instance is also used in RDFS and OWL for the rdf:type URIref.
entailment regimes. Finding lean graphs for a given regime is an area of ongoing research (Pichler
et al. 2010).
78 In Notation3: { ?s ?p ?o } => { ?s ?p [ ] . [ ] ?p ?o }.
79 In Notation3: { ?s ?p ?o } => { ?s rdf:type rdf:Property }.
[Diagram relating the concepts Triple, Node, Blank, URI, Literal, URIref, String, and LanguageTag]
Figure 3.17.: Model of generalized RDF
elements like <i> and <tt>. With HTML5 the standard has shifted more towards descriptive markup, with
presentational capabilities separated into CSS (Hickson and Hyatt 2009).
Extensible Stylesheet Language (XSL) consists of i) a query language for addressing parts of an XML
document (XPath), ii) a programming language for transforming XML documents (XSLT) that uses
[Figure 3.18: storage formats, edit formats, and output formats, connected by load/save, edit, and export operations, with ODF, RTF, XSL-FO, and CSS as examples]
Cascading Style Sheets (CSS) is a stylesheet language to describe the presentation of elements in document markup languages (Bos et al. 2009). Introduced
first by Håkon Wium Lie in 1994 for HTML, it has since been adopted for several
other document types. In contrast to the other languages listed above, CSS does
not mark up text but describes the layout properties of elements in marked up
documents.
Each language implies or defines a document model with entities such as pages,
paragraphs, tables, lists, lines, characters, etc. Character encodings as described in
section 3.1 provide the foundation of such models. Markup languages and document
models are shaped by the focus of their application. We can distinguish i) storage formats,
which are mainly used to store and exchange documents, ii) output formats, which
procedurally or descriptively trigger a display device, and iii) edit formats, which are
used to create, analyze, and modify documents (figure 3.18). Edit formats require
the markup language to be expressed in a specific syntax; basically most markup
languages are foremost defined by a markup syntax, and markup is often used as a
synonym for a markup language's syntax. Syntax, however, should not be confused
with the document model. Table 3.13 lists four common concepts and their expression
in the syntax of different markup languages. As shown in table 3.14, the same document
(here it is just a title) can be expressed differently in different forms. Following the
radical position of A. H. Renear and Wickett (2009), either the forms do not represent
the same document or the document does not change if we transform one form to
the other. But what is the document, if we only have markup language in form of
syntax?
XPath, iii) an area model that defines layout properties, and iv) an XML syntax to specify documents
based on the area model (XSL-FO). XSL builds on concepts of CSS and the Document Style Semantics and
Specification Language (DSSSL) but not as its subset or superset.
language          bold face                        italic face                     monospace                       sub-/superscript
MediaWiki         '''text'''                       ''text''                        <tt>, <code>                    <sub>, <sup>
Markdown          **text**, __text__               *text*, _text_                  `text`                          <sub>, <sup>
Textile           *text*                           _text_                          @text@                          ~text~, ^text^
reStructuredText  **text**                         *text*                          ``text``                        :sub:`text`, :sup:`text`
RTF               {\b text}                        {\i text}                       {\fmodern text}                 {\sub text}, {\sup text}
TEX               \textbf{text}                    \textit{text}, \emph{text}      \texttt{text}, \verb#text#      ^{text}, _{text}
HTML              <b>                              <i>                             <tt>, <code>                    <sub>, <sup>
TEI               <hi rend="bold">                 <hi rend="italics">             <hi rend="typewriter">          <hi rend="subscript">, <hi rend="superscript">
DocBook           <emphasis role=strong>           <emphasis>, <firstterm>         <code>, <varname>               <subscript>, <superscript>
POD               B<text>                          I<text>                         C<text>
GNU troff (man)   .fam B text .fam, \fBtext \fP    .fam I text .fam, \fItext \fP   .fam C text .fam, \fCtext \fP
Table 3.13.: Four common concepts and their expression in the syntax of different markup languages
Radiative β-decay in <sup>141</sup>Ce
Radiative \textbeta-decay in ^{141}Ce
Radiative β-decay in [^141^]Ce
Radiative B-decay in ^1^4^1Ce
Radiative B-decay in 141 Ce
Radiative beta-decay in 141Ce
Table 3.14.: The same document (a title) expressed in different forms
defined as codepoint U+00B9 and U+2074 and the upright beta is defined as codepoint U+03B2.
[Table: pairs of nested and concatenated markup with their renderings:
<b><i>A</i></b>            and  <i><b>A</b></i>
<i><i>A</i></i>            and  <i>A</i>
<i>A</i><i>B</i>           and  <i>AB</i>
<sup>a</sup><sup>b</sup>   and  <sup>ab</sup>
a<sub>b<sup>c</sup></sub>  and  a<sup>b<sub>c</sub></sup>
a<sup><sup>b</sup></sup>   and  a<sup>b</sup>]
Defining a markup language by its syntax introduces some problems with escaping,
concatenating, nesting, and alternatives. First, characters acting as syntax elements
cannot directly be used in text but must be escaped or forbidden. For instance in
HTML the less-than sign < (U+3C) must be escaped as &lt;, &#60;, or &#x3C;.
Second, syntax elements often cannot arbitrarily be nested and/or concatenated. For
instance in Textile **__B__** encodes a bold, italic letter B, but if it is surrounded
by other letters, it must be put in square brackets as A[**__B__**]C. Alternatives
occur especially if there is no isomorphism between syntax and document model.
As each markup element has a special meaning, one can rarely derive general
rules for all parts of the syntax.
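Most programming environments therefore provide escaping functions. A minimal sketch in Python (the example string is hypothetical):

    import html

    # Syntax characters must be escaped before they can occur as text:
    print(html.escape("x < y"))   # prints: x &lt; y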
In most markup languages it is not even obvious which parts of a markup syntax
are nested (tree model) and which are concatenated (sequence/event model). The
interpretation of an element can depend on its position in a tree of elements, on its
position in a sequence of elements, and/or on its position relative to other elements
(see table above; elements are shown underlined for better readability). In practice all
markup language syntaxes have some tree-based parts and some event-based parts.
A widespread example of markup for metadata exists in citation styles and forms
of heading, such as ISBD. Character strings that result from library cataloging rules
have mainly been used for printing catalog cards and to support other forms of retrieval.
Although often described as records, they can better be viewed as a special kind of
document. As shown by Thomale (2010) for MARC, the origin as markup has also
syntax        model   document instance
MediaWiki     events  ''' bold '' bold-italic ''' italic ''
invalid HTML  events  <b> bold <i> bold-italic </b> italic </i>
HTML          tree    <b> bold <i> bold-italic </i></b><i>italic </i>
83 The treatment of existing markup in cataloging, which would result in markup in markup, has not been
analyzed deeply yet. Most cataloging rules do not preserve font styles, emphasis, or even non-Latin
characters but make use of transcriptions and comments.
84 http://johnmacfarlane.net/pandoc/
[Figure (fragment): a document model with concepts such as Position, GraphemeCluster, Character, QuotationLevel, Level, Protect, and Shift, covering properties like bold, italic, small caps, and sub-/superscript]
Schema languages, data definition languages (DDL), or data description languages are
primarily used to further restrict existing data structuring systems, such as database
models, data structuring languages, or markup languages. The process of declaring
a schema is often called data definition and the result is a data format or (logical) data
schema. In data modeling, these schemas are located at the data realm (figure 2.6 at
page 33), although some schemas (e.g. RDF schemas) also span to the conceptual realm
if used as conceptual modeling languages (see section 3.8).
The purpose of a schema can be both prescriptive specification of documents
to be created and validation of existing documents. The expression of schemas
in a dedicated schema language better allows for sharing and analysis of schemas,
independent of particular applications. Different schemas expressed in the same
schema language can be used by a validator or parser (figure 3.21). Validators ensure
common data structures based on shared schemas. Without such a schema, data
from one application may be rejected or lead to unexpected results in another.
The validator acts as an interpreter that processes a schema in its schema language as
program to transform documents as input to document analysis as output.
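This interpreter role can be sketched in Python, with regular expressions (part 3.7.1) as a simple schema language (the pattern and documents are hypothetical examples):

    import re

    # The schema language is that of regular expressions; a schema is
    # a pattern; a document is a character string to be validated.
    def validate(schema, document):
        return re.fullmatch(schema, document) is not None

    year_schema = r"[0-9]{4}"
    print(validate(year_schema, "1978"))   # True: document conforms
    print(validate(year_schema, "78"))     # False: document is rejected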
Each schema language has an inherent model of the data that schemas further
restrict. In the case of regular expressions and Backus-Naur Form (part 3.7.1)
this model is a simple sequence of characters. XML schema languages restrict
XML documents (part IV), RDF schema languages restrict RDF graphs (part 3.7.3),
and SQL schemas restrict databases. Other methods used as schema languages
(programming languages, forms, and the data format description language) are
summarized briefly at the end of this section (part 3.7.5).
[Figure: a Document and a Schema are processed by a Validator to produce a Document analysis; the schema controls the validator and may be related to other schemas by extension/restriction]
Figure 3.21.: Schema languages allow to express schemas for multiple applications
in CFL and a language in REG is also in CFL, so one can use regular expressions as subtrahend in BNF.
86 Repeating groups and boolean extensions are more difficult to picture.
Non-terminal symbols:
  s = ...    defines a rule for symbol s
  a b        sequence of symbols
  a | b      alternative symbols
  a?         optional symbol
  a+         repeatable symbol
  a*         optional repeatable symbol
  a:n        repeat n times
  a:n-m      repeat n to m times
  a:n-       repeat at least n times

Terminal symbols:
  "..."      sequence of characters
  #xX        character with Unicode codepoint X
  [...]      regular expression character class
  a - b      difference (a but not b)
  a & b      conjunction (both a and b)

[Diagram: syntax diagram equivalents for BNF rule names (name = ...), terminal symbols ("abc"), non-terminal symbols (<name>), sequences (a, b and a b), alternatives (a | b), options ((...)? and [...]), and repetitions ({...}, (...)*, *(...), (...)+, ...:n)]
Figure 3.22.: Backus-Naur Form and syntax diagram elements
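As an illustration of this notation, a small hypothetical grammar (not one of the thesis examples):

    title  = word (" " word)*
    word   = letter+
    letter = [a-zA-Z]

Here a title is defined as one or more words separated by spaces, using a rule definition, a sequence, repetition, a terminal string, and a character class from the elements above.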
entity references and CDATA sections have been replaced by equivalent content
(see figure 3.12 and section 3.5.5). DTD syntax for attribute declaration differs
from element declaration, as attributes have no order and cannot be repeated at one
element. Attributes can be marked as optional (keyword #IMPLIED) or mandatory
(keyword #REQUIRED) and there are some limited capabilities to restrict possible
attribute values, for instance to one value from a predefined list. Character content
of XML elements cannot be restricted. One can specify default values for attributes
and disallow changing attribute values (keyword #FIXED). Simple integrity conditions
are possible by declaring some attribute values as unique identifiers (keyword
ID) and some attribute values as pointers to these identifiers (keywords IDREF and
IDREFS). A proposed ISO standard to further extend DTD has received little attention
(International Organization for Standardization 2008c).
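A minimal sketch of such declarations (hypothetical element and attribute names, not from the thesis examples):

    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book
      id     ID      #REQUIRED
      series IDREF   #IMPLIED
      lang   (en|de) "en">

The id attribute is a mandatory unique identifier, series optionally points to another identifier, and lang is restricted to a predefined list of values with a default.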
II. RELAX NG
The REgular LAnguage for XML Next Generation (RELAX NG) was developed as
merger of Tree Regular Expressions (TREX) and Regular Language description for
XML (RELAX), both experimental XML schema languages, created in 2000/2001
(Vlist 2003). RELAX NG was standardized at OASIS and published at ISO in 2003 and
2008 (International Organization for Standardization 2008b). A RELAX NG schema
can be written in an XML syntax and in a more readable compact syntax, which
is similar to Backus-Naur Form. Both forms can be translated to its counterparts
without loss of information. The grammar from example 17 could be written in
RELAX NG Compact as shown in example 18 (the datatype xs:gYear is added to
the year element):
element bib {
    ( element editor { person }+ | element author { person }+ ),
    element year { xs:gYear }?,
    title,
    keyword*
}
person =
    attribute given { text }?,
    attribute surname { text }
Example 18: Grammar rules in RELAX NG Compact syntax
RELAX NG unifies the syntax for element and attribute declaration and it provides
functionality that goes beyond DTD. In particular, it better supports grouping,
combining, and annotating grammar rules. It further adds context-sensitive content models, namespaces, unordered content, and datatypes for character content.
Datatypes are not defined in RELAX NG but only referenced from other specifications.
The most common datatypes are those from XML Schema, for instance xs:integer
for character sequences that represent integer values. Identifier attributes (ID, IDREF,
IDREFS) and default attribute values are not supported, but an official extension exists
to add these features (Clark and Murata 2001). Additional rules can be embedded in
a RELAX NG schema by using other schema languages, especially Schematron.
III. Schematron
Schematron is a rule-based XML schema language, expressed in XML (Vlist 2007). It
was first proposed in 1999 by Rick Jelliffe and later standardized as Schematron 1.3
(2000), Schematron 1.5 (2001) and Schematron 1.6 (2002). Schematron was published as ISO 19757-3 (2006) and an extended version was published in 2011.
The publication at ISO may be one reason why Schematron is less known and less
used than other schema languages described in this section. In contrast to grammar-based languages, a Schematron schema does not specify the whole tree structure
of an XML document, but it defines a set of additional constraints. Schematron
constraints are expressed by two XPath expressions: the context defines which part
of a document to check and the test specifies a condition that must be true for each
context element. A condition can trigger a message if it is met (with report) or if it
is not met (with assert). One or more conditions with the same context are grouped
in a rule and one or more rules are grouped in a pattern, which may further be
associated with phases. Schematron also supports variables and abstract patterns as
pattern templates, as shown in example 19: this Schematron schema can be used
to restrict author and editor names from the previous schemas (examples 17 and
18) by checking that given names are not empty strings and surnames do not end
with a dot, unless there is a given name. Such interdependencies are difficult or
impossible to express in other schema languages.
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>Person name checks</title>
  <pattern abstract="true" id="person">
    <rule context="$p[@given]">
      <report test="0 &lt; string-length(normalize-space(@given))"
        >person with valid given name</report>
    </rule>
    <rule context="$p[@surname and not(@given)]">
      <let name="s" value="normalize-space(@surname)"/>
      <let name="l" value="string-length($s)"/>
      <assert test="0 &lt; $l and substring($s,$l) != '.'"
        >Surname without given name must not be abbreviated</assert>
    </rule>
  </pattern>
  <pattern is-a="person">
    <param name="p" value="author"/>
  </pattern>
  <pattern is-a="person">
    <param name="p" value="editor"/>
  </pattern>
</schema>
Example 19: Simple Schematron schema, restricting person names
Schema for Object-Oriented XML (SOX), both proposed as W3C notes in 1999; Document Content
Description (DCD), a W3C submission from 1998; and XML Data Reduced (XDR), proposed by
Microsoft in 1998-2001.
88 You could use a compact syntax for XML, but proposals such as Wilde (2003) have not received much
adoption.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="bib">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:element name="author" type="person"
            maxOccurs="unbounded"/>
          <xs:element name="editor" type="person"
            maxOccurs="unbounded"/>
        </xs:choice>
        <xs:element name="year" type="xs:gYear" minOccurs="0"/>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="keyword" type="xs:string"
          minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="person">
    <xs:attribute name="given" type="myString"/>
    <xs:attribute name="surname" type="myString" use="required">
      <xs:alternative test="not(../@given)">
        <xs:simpleType name="noAbbrev">
          <xs:restriction base="myString">
            <xs:pattern value=".*[^\.]"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:alternative>
    </xs:attribute>
  </xs:complexType>
  <xs:simpleType name="myString">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:whiteSpace value="collapse"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
Example 20: XML Schema, including features of XSD 1.1
[Figure: the lexical values "true" and "1" map to the value true, "false" and "0" to the value false; "true" and "false" are the canonical representations; the two values are not identical, not equal, and not ordered]
Figure 3.23.: Lexical space and value space of XSD datatypes with boolean as example
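The mappings of figure 3.23 can be sketched in Python (an assumed representation for illustration, not part of the XSD specification):

    # Lexical space: the keys; value space: {True, False}.
    lexical_mapping = {"true": True, "1": True,      # surjective
                       "false": False, "0": False}
    canonical_mapping = {True: "true", False: "false"}

    # Each value maps back to itself via its canonical representation.
    assert lexical_mapping[canonical_mapping[True]] is True
    # Two lexical values can denote one and the same (equal) value.
    assert lexical_mapping["1"] == lexical_mapping["true"]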
A datatype consists of a value space, which is a set of abstract values; a lexical space,
which is a set of Unicode strings used to denote the values; a surjective lexical mapping
function that maps from the lexical space to the value space; and an equality relation
for the value space (figure 3.23). Furthermore datatypes may have an order relation
for the value space and a canonical mapping that maps each value from the value space to
its preferred representation in the lexical space. In addition, there is an identity relation that in most cases is the same as the equality relation.89 There are 50 built-in
datatypes, each referenceable by a URI. Grouped by purpose one can identify:
a type for Unicode strings, limited to the characters allowed in XML (xs:string),
and two types for strings with normalized whitespace (xs:normalizedString
and xs:token),
various numeric types (see table 3.17),
a boolean type (xs:boolean, as shown in figure 3.23),
twelve different types for dates, times, and their parts or combinations,
two binary types for encoded byte sequences (xs:base64Binary, xs:hexBinary),
a type for URIs (xs:anyURI) and for tuples of namespace URI and local name to
represent XML element names with namespaces (see example 14),
a type defined equivalent to xs:QName but for XML Notations (xs:NOTATION),
six special types derived from xs:token to represent different kinds of identifiers
(xs:language, xs:Name, xs:NCName, xs:ID, xs:IDREF, xs:NMToken),
89 The distinction between equality and identity was introduced in XSD 1.1. For instance the lexical
values -0 and 0 were both mapped to value zero for the datatype xs:float, but now they are mapped
to non-identical but equal values, to better mirror floating point number encoding (see page 58).
Another example from xs:float is the value of NaN (not a number), which is identical but not equal
to itself.
datatype                        value space
xs:anyType                      all types (simple and complex)
 xs:anySimpleType               all simple types
  xs:anyAtomicType              all simple types that are not lists or unions
   xs:decimal                   { x/10^y | x ∈ ℤ, y ∈ ℤ≥0 } ⊂ ℝ
    xs:integer                  ℤ = Int
     xs:long                    Int64
      xs:int                    Int32
       xs:short                 Int16
        xs:byte                 Int8
     xs:nonNegativeInteger      ℤ≥0 = UInt
      xs:positiveInteger        ℤ>0
      xs:unsignedLong           UInt64
       xs:unsignedInt           UInt32
        xs:unsignedShort        UInt16
         xs:unsignedByte        UInt8
     xs:nonPositiveInteger      ℤ≤0
      xs:negativeInteger        ℤ<0
   xs:float                     32 bit IEEE 754 floating point (except sNaN)
   xs:double                    64 bit IEEE 754 floating point (except sNaN)
Table 3.17.: XSD datatypes with their value spaces (a subset of the derivation tree)
derived list types for three token types (xs:ENTITIES, xs:IDREFS, xs:NMTOKENS),
the special types xs:anyType, xs:anySimpleType, and xs:anyAtomicType as
base types of all other predefined types.
The recommendation groups its datatypes into primitive types and derived types.
Each derived type must have exactly one base type, with xs:anyType as base type
of all other simple types. A subset of the derivation tree is shown in table 3.17. A
derived type can specify restrictions in predefined constraining facets that serve
to normalize or constrain its lexical space and/or its value space. Possible facets
include the length of the lexical representation, a regular expression pattern that all
lexical values must match, and lower/upper bounds on the value space for ordered
values. Besides derivation one can define new types as lists or as unions of existing
types. It should be noted that the lexical mapping of a union type may not be a
function, because the same lexical value can represent multiple values for different
types (for instance the string "2" and the number 2).
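A minimal sketch of such a derived type with constraining facets (the type name and pattern are hypothetical, not from the thesis examples):

    <xs:simpleType name="isbn10">
      <xs:restriction base="xs:string">
        <xs:length value="10"/>
        <xs:pattern value="[0-9]{9}[0-9X]"/>
      </xs:restriction>
    </xs:simpleType>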
The XSD type system includes some more caveats: for instance primitive datatypes
are disjoint (the number 2 as xs:float is incomparable to the number 2 as xs:double
and as xs:decimal). xs:float and xs:double have no common base type although
they both decode subsets of ℝ, and there are unrelated types with the same value
space (xs:base64Binary and xs:hexBinary, which both map to the set of finite-length byte sequences).
terministic grammars for formal languages (section 2.2.1) to tree languages. A detailed analysis of
a Friend (FOAF) ontology and the DBPedia ontology. The example already shows
that in practice one rarely deals with a schema language as a whole but with specific
features of schema languages.
Although a general schema language for RDF graphs could be created as a rewriting system on graph patterns (see page 27), existing schema languages do not
allow arbitrary graph patterns, but adhere to an object-oriented system with classes,
properties, and individuals. In short, RDF schema languages provide resources
to document parts of an ontology (rdfs:isDefinedBy, rdfs:label, rdfs:comment,
vs:term_status. . . ), to define (specific kinds of) classes and properties (rdf:type,
rdfs:Class, rdf:Property, owl:AnnotationProperty, owl:FunctionalProperty. . . ),
and to define rules and relations between these entities (rdfs:subClassOf, owl:disjointWith,
rdfs:domain, rdfs:range. . . ).
The most basic schema language is the basic RDF vocabulary with the property
rdf:type, the resource rdf:Property, the datatype rdf:XMLLiteral, and the rules
of rdf-entailment (P. Hayes and McBride 2004, section 3). This minimal ontology
language is extended by Brickley and R. V. Guha (2004) to the RDF Vocabulary
Description Language, which is called RDF Schema (RDFS). RDFS is further extended
by the OWL family of ontology languages. OWL was first published as working
draft by Dean et al. (2002) and extended to OWL2 (Schneider 2009). OWL has since
superseded other general ontology languages such as DAML+OIL. In addition, there
are rule languages, such as the Rule Interchange Format (RIF) (Kifer and Boley 2010)
and SPARQL Inferencing Notation (SPIN) (Knublauch, J. A. Hendler, and Idehen 2011).
The usability of ontology languages can partly be improved by specific syntaxes,
such as the Manchester Syntax for OWL (Horridge and Patel-Schneider 2009) and by
using specialized ontology editors. One can also use other logic languages and even
natural language. In fact many ontologies contain additional rules and descriptions
that are only provided in natural language, because using a formal rule language
would be too complex or too strict. For instance the restriction of foaf:birthday
to strings of the form mm-dd in example 21 is only given as a deontic comment,
although it could also be expressed as a formal constraint. Some rules can also be
expressed in Notation3: its syntax supports statements in first-order predicate logic
by using formulae with quantification and logical implication. The processing of
these rules, however, is much less supported by applications than standard RDFS
and OWL. To reduce complexity and to facilitate implementation and computation,
OWL is split into three different sublanguages (Dean et al. 2002) or three different
profiles (OWL2). With limited expressibility there are the following dialects:
OWL-Full contains all features of the original Web Ontology Language. It is
now replaced by OWL2.
OWL-Lite was intended as a subset of OWL for easy implementation. As it turned
out to be more complex than intended, OWL-Lite is now deprecated.
OWL DL and OWL2 DL are similar subsets of OWL and OWL2 that can be
mapped to an extension of description logic with useful computational prop-
# RDF vocabularies
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix vs: <http://www.w3.org/2003/06/sw-vocab-status/ns#> .
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .
# RDF data that uses the RDF resources foaf:Person, foaf:name,
# foaf:birthday, dbp:dateOfBirth, and xs:date from ontologies.
<http://viaf.org/viaf/616522> rdf:type foaf:Person ;
foaf:name "Dennis M. Ritchie" ;
foaf:birthday "09-09" ;
dbp:dateOfBirth "1941-09-09"^^xs:date .
# Parts of RDF ontologies, described by RDF schema languages:
## FOAF ontology
<http://xmlns.com/foaf/0.1/> rdf:type owl:Ontology ;
dc:title "Friend of a Friend (FOAF) vocabulary" .
foaf:Person rdf:type rdfs:Class ;
rdfs:isDefinedBy <http://xmlns.com/foaf/0.1/> ;
rdfs:label "Person" ;
rdfs:subClassOf foaf:Person ;
owl:disjointWith foaf:Document, foaf:Organization, foaf:Project ;
vs:term_status "stable" .
foaf:birthday rdf:type owl:AnnotationProperty, owl:FunctionalProperty ;
    rdfs:comment """The birthday of this Agent, represented in
mm-dd string form, eg. '12-31'.""" ;
    rdfs:domain foaf:Agent ;
    rdfs:range rdfs:Literal .
## DBPedia ontology
dbp:dateOfBirth rdf:type rdf:Property ;
rdfs:label "DATE OF BIRTH"@en .
Example 21: RDF data and its ontology, described by RDF schema languages
type    content
//nil   the lack of a value (undef, null, nil, etc.)
//def   any defined value; anything that is not nil
//bool  a value that is either true or false
//num   a number; may be parameterized by a range (fixed or inclusive or exclusive minimum and/or maximum value)
//int   an integer; may be parameterized by a range
//str   a string, even the empty string
//one   any of bool, num, int, and str
//arr   a list of values all of one type; may be parameterized by a range as length
//seq   a list of values of different, given types
//rec   a record of named entries, each with its own schema and a required/optional flag
//map   a map of names to values, with all the values of one type
//any   either anything at all, or any of a given list of types (union)
//all   a combination of types (intersection)
Example 22: Predefined data types and elements of the Rx schema language
92 The expression "the map is not the territory" is attributed to Korzybski (1933) and applied to data
[Diagram: entity types work (attribute Title), Person (attribute Name), and Place (attribute Name); an N:M relationship wrote with role creator between work and Person, and a relationship died between Person and Place with cardinality 1]
Example 23: ERM diagram in Chen's notation plus attributes
Since the introduction of ERM, a large number of variants have evolved (Patig 2006).
These variants often add new features and come with different graphical notations.
The diversity of ERM variants and notations requires one to check carefully which
conventions and semantics apply in a given application. As shown by Hitchman
(1995), many additional constructs like subtypes, n-ary relationships, and naming
both ends of a relationship are rarely used and not well understood. In its original
form, ERM is rarely used in data modeling practice (Simsion 2007, pp. 49, 345).
Widely used variants and notations include the information engineering notation (IE),
also known as crow's foot notation (Finkelstein 1989; Martin and MacClure 1985),
and the Barker Notation (Barker 1990). Both add some additional constructs to ERM
but also limit relationship types to binary associations only (T. Halpin and Morgan
2008, pp. 318ff.). The full variety of ERM dialects between 1975 and 2003 has been
analyzed in a study on ERM evolution: Patig (2006, p. 72ff) identified 33 conceptual
constructs in addition to labeled entity types, relationship types, attributes and the
[Diagram: the relationships of example 23 transformed to entities; an Authorship entity (1:N) created by Person, a Death entity connecting Person, Place (at), and Year (in), and names modeled as entity types Title (titled), PersonName (named, labeled), and PlaceName]
Example 24: ERM diagram with relationships transformed to entities
Another limitation of most ERM variants is less obvious: all entity types are
assumed to be disjoint, unless they are connected by inheritance or by identity
relationships. For instance in examples 23 and 24, an entity is either an instance of Book,
Person, or Place. This implicit rule can cause problems with entity types that are less
easy to separate, such as Title, PersonName, and PlaceName in example 24: here the
disjointness constraint becomes an artifact of the model that is not present in the
universe of discourse. Normally a book, a place, and a person can have the same
name without being the same object. The disjointness assumption of ERM can also
be found in other conceptual modeling languages and in most schema languages.93
93 An exception are RDF schema languages (section 3.7.3), because of the Open World Assumption: An
RDF entity (resource in RDF terminology) can be of multiple entity types (classes), unless an explicit
disjointness constraint is enforced by a specific ontology.
[Example 25 (fragment): ORM model with entity types Person (.name), Place (.name), and Year, and the ternary fact type ". . . died . . . in . . ."]
Among the special ORM features not included in example 25, there are ring
constraints, frequency and value constraints on entity types and roles, deontic rules,
and objectification. Ring constraints can be applied to pairs of roles that may be
populated by the same entities. The simplest case is a binary relationship with both
roles played by the same entity type, but rings can also occur on longer predicates and
indirectly because of subtypes. There are 10 ring constraints (reflexive, irreflexive,
purely reflexive, symmetric, asymmetric, antisymmetric, transitive, intransitive,
strongly intransitive, acyclic) with 26 legal combinations. Example 26 a) shows a
model in which a Person can be child of a Person. The predicate is constrained as acyclic
(left), because no one can be their own child or ancestor. A frequency constraint (≤ 2)
is added to the parent role to state that a person is child of at most two people. An
additional strongly intransitive ring constraint is given as deontic (right, in blue).
This means that if one person is child of another, there should be no other chain of
child-of relationships between the two. By this deontic rule, incest is forbidden, but
94 See http://www.rulespeak.com/ for an example. In brief, SBVR was influenced by fact-oriented
still possible, while circular ancestorship or more than two parents of one person are
impossible.
A special feature of ORM that is rarely found in other conceptual modeling languages is objectification or nesting. Objectification allows instances of relationships
to be treated as entities in their own right (T. Halpin and Morgan 2008, ch. 10.5.). In
contrast to the transformations of relationships to entities in example 24, an objectified
relationship may still be used as a relationship. In example 26 b) the relationship
wrote is objectified as Writing. In natural language objectification is related to the
activity of nominalization. For instance the statement "Shelley wrote Frankenstein"
may be nominalized as "Shelley's writing of Frankenstein". The interpretation of
objectified facts bears some difficulties: relationships on the one hand represent
possible propositions, which can either be true or false: Shelley either has written
Frankenstein or not. Entities on the other hand are states of affairs. It makes no
sense to say that Shelley's writing of Frankenstein is true or false, but one can make
statements about this event: for instance it started in summer 1816 in Geneva, its
author was not known when the novel was first published anonymously in 1818, and it is
described in other books. For this reason a relationship and its objectification should
be seen as distinct objects connected by a 1:1 relationship. If one further analyzes
nominalization, different ontological types of objectification may be distinguished
(Moltmann 2007), for instance to differentiate statements like "I know that Shelley
wrote Frankenstein" and "I know the particular circumstances of Shelley's writing of
Frankenstein". General problems of mapping between relationships and entities will
be dealt with in section 4.2.4.
[Example 26: a) a Person can be child of a Person, with an acyclic ring constraint and a frequency constraint ≤ 2 on the parent role; b) the relationship wrote between Person and Book objectified as Writing!, which took place in a Year and is described in a Book]
(2006) as reasons why CORBA failed. UML in contrast is quite popular, but its broad coverage makes
it difficult to know what exactly particular applications refer to with UML.
UML for describing logical and physical data models, which can directly be transformed to software. Conceptual models, in contrast, must rather match a specific
domain, independent of its technical implementation.
Example 27 shows a UML class diagram that depicts a conceptual model similar
to the ERM and ORM models above. The ternary relationship between Person, Year,
and Place has been replaced by a DeathEvent entity. The model makes use of attributes
and data types (string and year). N-ary relationships and additional entities connected
to binary relationships are also supported by UML but not shown in example 27.
[Example 27: UML class diagram with classes Person (Name: string), Book (Title: string), and Place (Name: string); associations wrote (Person 1..*, Book 0..*) and died (Person 1, Place 0..1)]
[Table: graphical notation variants in IE, Barker, UML, and ORM for multiplicities (zero or more / 0..*, one or more / 1..*, zero or one / 0..1, one / 1, n..m), roles ([role], role), attributes (Name: Type), primary keys (# Name, Name <<PK>>, (Name)), inheritance, and or/xor constraints ({or}, {xor})]
Table 3.18.: Some conceptual modeling notation variants
in chapter 5.
[Diagrams omitted]
Figure 3.24.: Euler diagram, Venn diagram, and Peirce's extension to Venn
mind maps.
2008) and Dau (2009a). A specification, including the Conceptual Graph Interchange
Format (CGIF) was published as part of ISO/IEC 24707 (2007). An example of a
conceptual graph from Sowa (2008) is shown in example 28 with its notation also
translated to extended CGIF. The graph represents the statement "if a cat is on a
mat, then it is a happy pet". This interpretation, however, requires background
knowledge in form of a mental model of the real world. According to Sowa (2008)
the Attr relation indicates that the cat, also called a pet, has an attribute, which is
an instance of happiness. But what does having an attribute, which is an instance of
happiness, mean? On the strict logical level of conceptual graphs, there is no relation
between this formal statement and the idea of being happy, and no knowledge
about whether being happy has a different ontological status than being a pet.97
With their grounding in sets and relationships, diagrammatic logic systems share
strengths and weaknesses of formal logic (section 2.1.1): they can be very precise but
they poorly cover non-traditional logics that better fit descriptions of reality.
A comparison of several diagrammatic logic systems for use in artificial intelligence
is given by Sowa (1992b). Diagrammatic logic systems are very similar to conceptual
modeling notations based on entities and relationships.98 Both types of diagrams
can be formalized as multi-bipartite graphs or directed multi-hypergraphs from a
graph-theoretic view.
schemas for database systems. In later publications he applied them to a wider range of topics from
artificial intelligence and cognitive science.
[Figure (fragment): diagram elements such as arrows, lines, connections, spatial relationships, hyperedges, inclusion, geometric relations, boxes, intersection, and spatial concatenation]
[Figure 3.27 (fragment): a request in a defined format is sent to an information system, which may return a response in a defined format]
To limit the analysis to parts relevant to this thesis, one first needs to look at
the general interaction with an information system via APIs and query languages
(figure 3.27): an information system is accessed by sending a request which may
result in a response. For instance a digital library is an information system that
can be accessed by requests to add, modify, delete, and select stored documents.
Both request and response are digital documents in a defined format, specific to
the information system. To exclude dynamic properties of information systems,
we limit the analysis to requests that do not modify the visible behaviour of the
system. In particular, all these requests must be stateless and cacheable in terms of
the Representational State Transfer (REST) model (R. T. Fielding 2000): each request
"[. . . ] must contain all of the information necessary to understand the request" and
requests that are equivalent "[. . . ] result in a response identical to that in the cache".
Requests of this type are mostly known as information retrieval requests. Examples
of query languages, formats, protocols, and APIs for document retrieval include
Z39.50, SRU, CQL, and OpenSearch. Beyond information retrieval queries there
are other kinds of queries. A classification of query types for information
systems has been undertaken by Reiner (1988, p. 33) by distinguishing queries that
ask for one of:
documents (which),
facts (where, when, who, what, . . . ),
decisions (yes or no),
explanations (how, why).
For all of these kinds of queries, a universal query language called the Intermediary
Query Language (IQL) has been built based on predicate logic (Reiner 1988,
1991). With a clearly defined syntax and well-founded semantics this language can
be expressed in semantically equivalent formal languages.100 A query can be asked
100 Reiner (1988, 1991) in her thesis implements the Untrained User Query Language (UUQL) and the
Chapter 4
Findings
In the previous chapter, observations were mainly grouped by aspects of practical
similarities. File systems, for instance, may differ in their architecture, but all serve
the same purpose, despite technical differences. The same applies to databases,
schema languages, diagram types, and other methods of data structuring. This
chapter analyzes and jointly groups all methods into independent strategies. In
section 4.1 it is found that general prototype categorization better describes what
methods of data structuring actually do with data. Section 4.2 then determines
typical topics that can be observed as paradigms consistently among all methods.
The outcome of this chapter consists of two categorizations, one based on prototypes and one based on paradigms. The categorizations further help to detect
fundamental problems and issues of data structuring and to get candidates and
directions for patterns, which will then be elaborated in the next chapter.
category                          main purpose     examples
encodings                         express data     Unicode, Base64
storage systems                   store data       NTFS, RDBMS
identifier and query languages    refer to data    URI, XPath
structuring and markup languages  structure data   XML, CSV, RDF
schema languages                  constrain data   BNF, XSD
conceptual models                 describe data    Mind Maps, ERM
views will be illustrated with examples (II) before uniting them as two sides of the
coin of data as sign (III).
I. Two points of view
The document view primarily tries to describe the artifact in more or less detail. For
instance a document can be described by its format, size, and divisions. One can
model the document, for instance as ordered hierarchy, and express it, for instance
in a markup language such as TEI. Even alternative descriptions are possible, for
instance concurrent hierarchies (Pondorf and Witt 2010; A. Renear, Mylonas, and
D. Durand 1996), as long as all descriptions are discoveries of the same concrete
document. The object point of view, in contrast, is less interested in the specific
details of form: it rather tries to create a broad picture of the document content. An
example of the object view is a description of data as set of connected entities and
properties. The document approach is mainly found in library and information science
where cataloging is applied to document artifacts that (are assumed to) already
exist. The second approach is mainly found in computer science and in software
engineering where digital artifacts are created to solve tasks of computation. Both
views, however, always exist together. Neither documents nor objects are better
descriptions per se, but both are valid, and both can be found on a large scale and on
a small scale. The important question to reveal this paradigm is not whether data
is better described as documents or as objects, but where a line between the two is
drawn in a particular (application of) data technology.
II. Examples
A visible instance of this separation is the distinction between data values and data
objects, for instance in databases and in modeling languages. Example 29 shows a
simple ORM data model and a corresponding SQL schema with years, events, places,
and names. Years and names are defined as values types in the model, so they are
given directly in concrete model instances. Places and events, in contrast, are abstract
entity types which are objects without explicit form. In the SQL schema values are
expressed by fields with data types, and objects are expressed by tables. Still, objects
cannot exist alone, but they need data fields that act as object place-holders, such as
the Id fields in example 29. The difficult task is to find out which parts of data are
plain documents, and which parts are arbitrary object identifiers.
The line between documents and objects is often less clear than the distinction
between value types or field values, and entity types or object identifier in example 29.
As described in section 3.2.4, identifiers can also hold information about the objects
they refer to in this case data objects are values. In the same way, most document
values can be interpreted as descriptive identifiers for some objects: for instance the
YearAD field in example 29 may not only hold a year number but refer to another
table that describes Years objects. Both variants can better be shown in RDF which
has a clear separation between resources as objects on the one side and literals on the
[Example 29 (fragment): ORM model with entity types Event and Place and value types Year and Name]
other.1 In RDF years are normally expressed as literals with datatype xs:integer
or xs:gYear. But in some data sets years are objects, identified by URI references,
such as <http://dbpedia.org/resource/2010> for the year 2010. The choice is
rather arbitrary from a conceptual perspective, but RDF technologies provide no
mechanism to switch between document form and object form. A possible mapping
in extended RDF would be the Turtle statement
"2010"^^xs:gYear owl:sameAs <http://dbpedia.org/resource/2010> .
but no common RDF software can make sense of this.2
Switching between data as document and data as object is also possible for non-descriptive identifiers. As shown in example 10, an ISBN can be expressed in several
variants (ISBN-10, ISBN-13, with/without hyphen or space, etc.). While a general
ISBN is an identifier that refers to an abstract publication object, each variant is a
distinct document. Another example is number encodings (section 3.1.2), which
treat numbers as abstract objects while they are used as concrete values in other
contexts. Number encodings are just one instance of datatypes (section 2.2.2), which
are used to tag data pieces as values. One can also find document values combined
with the entities and relationships paradigm where objects are seen as primary
objects and values are attached to objects as secondary properties or attributes
(paradigm 4.2.4). As discussed in paradigm 4.2.5, levels of abstraction can act as
borders between the two forms of a piece of data.
1 A parallel document/object dichotomy in RDF exists with the separation between information resources
and non-information resources. There is some awareness of the dichotomy between documents and objects,
but crossing the line in practice seems to be no serious option.
III. Data as sign
Actually both approaches are two sides of a coin: the document may contain an object
and the object may be expressed in a document. In the document view the content of
a digital document is taken literally and in the object view it is taken figuratively. A
semiotic view helps to better understand the nature of this dichotomy: given a piece
of data as sign, the document view corresponds to its nature as signifier and the
object view corresponds to its nature as signified. The connection between document
and object is an arbitrary result of social convention, so there is not only one digital
object in a digital document.
The social grounding of data as signs becomes visible if one looks at the primary
purpose of the documents and objects paradigm: both approaches provide analysis
models of digital artifacts. In software engineering there are two notions of analysis
models, which are often confused in practice (Genova, Valiente, and Nubiola 2005):
one models an existing system as selection of the real world (descriptive analysis
model) and the other specifies a software system (prescriptive synthesis model).
Analysis is done by reverse engineering; it is an act of discovering structures. In
simple cases one just looks at given data to find out how it is structured. The
object approach, on the other hand, tries to create a clever structure that the digital
artifact can be put inside. Again there are simple cases in which there seems to
be only one obvious schema. Nevertheless analysis (document to object, signifier
to signified) and synthesis (object to document, signified to signifier) are based on
experience, intuition, and ad-hoc decisions, as is usual in the application of signs.
Strength: documents and objects are useful methods to describe a digital artifact
as a whole, either analyzed as concrete, given value, or synthesized as abstract,
created reference.
Weakness: it is often not clear whether a particular piece of data is actually used
as value or as object. Once the distinction is fixed in a data description language,
it is mentally difficult to switch the point of view.
Patterns: The patterns most likely found together with this paradigm include the
label pattern and the atomicity pattern.
only operate by virtue of the informal norms needed to interpret them, while technical norms can
play no role in an organization unless embedded within a system of formal norms.
A schema language (e.g. XSD, section 3.7.2) is specified by a standard with:
rules how to express schemas in the schema language (e.g. the syntax of XSD):
communication norms and formal norms;
rules how to specify other document formats via schemas (e.g. the meaning of
XSD elements): control norms and technical norms;
indications how to make use of schemas in practice (e.g. how to apply and
combine XSD schemas): informal norms, that may be substantive, communication
or control.
Example 30: Specification of a schema language as standard
of its symbols but only how to combine them into valid words (see section 2.2.1). At
the other end of the spectrum of specifications there are general business rules. A
business rule is "a statement that defines or constrains some aspect of the business
[. . . ] to assert business structure, or to control or influence the behavior of the
business" (Business Rules Group 2011). Like other standards, business rules should
provide an enforcement regime that states what the consequences would be if the rule were
broken (see conformance below), but rules can also exist as less formal agreements.
Examples of business rules in bibliographic data are cataloging rules and application
profiles.
Most specifications make use of both formal language and natural language.
Without some kind of formalization, natural language is fuzzy, and without further
explanation formal languages and notations are precise but meaningless. One
strategy to bridge the gap between both is making parts of natural language more
precise again by standardization. Examples include the verbalization of ORM and the
definition of specific words for mandatory and deontic requirements in RFC 2119
(Bradner 1997), summarized in example 31. Such formalizations create a layering of
standards where substantive standards are affected by communication and control
standards, which at the top are affected by informal standards.
Independent of the problem of how to express rules, a standard should neither
be too strict nor too lax for its use case. Examples of common artifacts when data
standards collide with real life applications include ad-hoc NULL values, such as
"n/a", in response to mandatory constraints and ad-hoc subfield separators,
such as "," or "/", in response to non-repeatable fields. The balance between strict
and lax rules in a standard is influenced by many factors. For instance the choice
between prescriptive and descriptive rules can result in more or less degrees of
freedom in markup types (section 3.6.1). As shown in figure 4.1, only part of the
intended meaning of a digital document is explicitly encoded; other parts depend
on context. In addition, the document contains redundant parts. Standards should
clearly show which degrees of freedom contribute to the communication of meaning
and which parts are irrelevant or predictable. A common example of irrelevant parts
MUST (or REQUIRED or SHALL) means that the definition is an absolute requirement.
MUST NOT (or SHALL NOT) means that the definition is an absolute prohibition.
SHOULD (or RECOMMENDED) means that the full implications of not following
the definition must be understood and carefully weighed because it is strongly
recommended.
SHOULD NOT (or NOT RECOMMENDED) means that the full implications of
implementing a defined item must be understood and carefully weighed because
it is strongly discouraged.
MAY (or OPTIONAL) means that a feature is truly optional. Systems that do not
implement it MUST be prepared to interoperate with systems that implement the
feature and vice versa.
Example 31: Summary of precise words defined in RFC 2119
[Figure 4.1 (fragment): only part of the meaning of a document is explicitly encoded; other parts are context-dependent, and the document also contains redundancy]
Across all technologies one finds a demand to create generalized, abstract, or
language-independent specifications which can be applied to different usage scenarios.
Concrete approaches include mathematical notation (section 2.1), abstract data types
(section 2.2.2), data binding languages (section 3.5.1), generalized markup languages
(section 3.6), and modeling languages (section 3.8). In one of the rare works on
general language-independence in data, Meek (1995) lists some lessons learned from
language-independent standardization, some of which apply to standardization
in general and some of which to language-independence in particular.5 Despite
the usefulness of language-independent standards, these standards tend to get
ignored. For instance ISO 11404 (2007) is referenced by XML Schema datatypes
(section IV), but non-XML languages prefer to refer to the latter instead of ISO 11404.
Furthermore each language-independent standard, while abstracting from other
languages, defines its own language, adding just another layer of abstraction.
III. Conformance of data standards
Given a standard with its specification still there is no guarantee that data will be
structured the way it was intended. Standards in practice are interpreted, ignored,
and misused in many ways. Unlike propositions, standards can not be true or false,
but only valid or invalid, compared to some practice. The relation between practice,
actually given as digital documents for the domain of this thesis, and standards
is mostly expressed the other way round: we say that some data conforms to a
standard if the standard contains a valid data description.6 The importance to get
the conformity rules right is stressed as critical to every standard by Meek (1995).
In particular all requirements must be testable, and implementation-dependent
features or extensions should be avoided. Conformance tests (or validation tests)
which must exactly match the specifications, are found in three forms:
A validator checks whether a particular documents conforms to a selected standard. Validators test the creation of data but they can also be used as membership function to fully define a standard in terms of set theory. In contrast to
general implementations, a validator must be strict even on minor errors. For
instance a web browser will accept broken HTML code, but a validator such
as the W3C Markup Validation service7 will show all detectable violations of
the HTML standard. Most data is created neither with specifications nor with
exact validators but with implementations. These implementations may actually
5 The general rules are "don't be too ambitious", "don't let perfection be the enemy of the good enough",
"define your target audience", "take related standards into account" and "get conformity rules right".
The specific rules are "make yourself language-independent, and recruit others like you", "identify
what kind of feature or facility you are trying to define", "get the level of abstraction right" (see
section 2.2.2 and 4.2.5), "avoid any representational aspects or assumptions", "promote your standard
continually", "decide early on what to do about bindings", and again "get conformity rules right".
6 One must also take care not to confuse statements of conformance and statements of usefulness:
practice and standards can be valid but lunatic, when compared to some goal with common sense.
7 Available at http://validator.w3.org/.
Collections, types, and sameness share a principle of grouping, which can be detected in all methods of data structuring. This section will first give examples from
chapter 3 and then analyze each of the three paradigm expressions, and how they all
depend on questions of identity.
I. Examples
Character encodings classify characters by properties like letter case, character type,
and writing system. Different kinds of equivalence and normalization are used to
find out when two character sequences are the same (section 3.1.1). Identifier systems
(section 3.2.3) group objects by giving them the same identifiers or by partitioning them
in namespaces. File systems (section 3.3) were specifically developed to organize
collections of data. Above single files, collections are found in directories, file types
or other properties of files. The same applies to databases, which are collections
of records (section 3.4). Records may further be divided into record types, and
record fields can be typed to only hold specific groups of values. Data structuring
languages (section 3.5) are essentially built of basic data types and collection types
such as records, lists, and tables. Usually these data types are disjoint. Types can
also be non-exclusive, for instance RDF's rdf:type property. Schema languages
(section 3.7) and type systems of programming languages can be used to define
new types by refining existing ones. Schema rules and constraints can also allow
checking whether an object belongs to a specific type. The support of specific collection
types in conceptual modeling languages is rather poor (T. Halpin and Morgan 2008,
ch. 10.4), but they define collections just by appointing them. For instance one can
define an entity type "publication" and virtually create a collection of things that
are publications. This way, however, it is also possible to create virtual collections
like "Veeblefetzer" and "Potrzebie" without indicating which things actually belong to
these collections.
II. Three appearances of grouping
The grouping paradigm of this section can be detected in three appearances. Collections are the most visible appearance of grouping. Independent from the internal
structure of a collection (ordered sequence, unordered set, structured graph. . . )
there is the idea of a set of things grouped together. To define this set, one can
either list its members one by one, or one can provide a membership function and
a universal set to choose from, as described in section 2.1.2. Unfortunately this
implies all problems of set theory, such as the identification of same elements and
non-paradoxical universal sets. As each set defines a property, each collection can
also be seen as a type and vice versa.
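Both ways of defining a collection can be sketched in Python (the values are hypothetical):

    # A collection defined by enumeration of its members...
    publications = {"book", "article", "thesis"}

    # ...and the same collection defined by a membership function
    # over some universal set of things.
    def is_publication(thing):
        return thing in publications

    print(is_publication("article"))   # True
    print(is_publication("duck"))      # False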
a duck and quacks like a duck, I call it a duck, attributed to James Whitcomb Riley. Duck typing
in programming does not necessarily include inference of a predefined type like duck, but it only
ensures the availability of a given set of characteristics.
paradigm     membership function           relationship
collection   part-of                       meronomy
type         is-a, instance-of, kind-of    hyponomy
sameness     is, stands-for                identity, metonymy, synecdoche
10 Synecdoche, or more generally metonymy, is a figure of speech in which a term is used for instance for
a larger whole (pars pro toto), or for the general type it refers to. For instance a title can refer to a
document, a work, or its physical copy, although it is a labeling property.
people who have been employees, or are eligible to become, or have applied to be, or have pretended
to be, or have refused to be, and so on, together with various combinations of these sets.
12 This also includes broken links where no entity can be found for an identifier.
for data description. Once committed to this paradigm, you see graphs everywhere.
This fallacy is more obvious if focused on specialized forms of graphs: for instance
one tends to see trees everywhere, given tools and technologies such as hierarchical
databases, file systems, and XML; given object-oriented tools with inheritance,
directed acyclic graphs seem to fit very well. Even if one broadens one's view to general
hypergraphs with connections that can span more than two entities, there is the
dichotomy between nodes/entities and edges/connections as two types of objects.13
The dichotomy is not wrong per se, but it comes with two major problems: the choice
which piece of data to express as entity and which as connection is rather arbitrary,
and it is difficult and ambiguous to map between entities and connections if needed.
II. Two problems illustrated
To illustrate the first problem, let us assume you want to store data about people
and the year they were born. Example 32 gives several encoding forms in JSON
(the principle could also be shown with other data structuring languages). In JSON
connections are present as key-value pairs of objects. In the first form (line 1), there
is a direct connection of birth between name and year entities. The second form (line
2) moves the birth connection into an entity, and connects this entity to the person.
As shown in lines 3 and 4 one can follow this procedure further and use entities for
the connection between birth and year and for the connection between year and year
value. The choice between entity and connection here depends on which granularity
you prefer. Line 5 shows yet another encoding that groups name and birth in a
common entity, so there is no explicit connection between the two.
    1  { "Hannah": 1906, ... }
    2  { "Hannah": { "birth": 1906 }, ... }
    3  { "Hannah": { "birth": { "year": 1906 } }, ... }
    4  { "Hannah": { "birth": { "year": { "AD": 1906 } } } }
    5  [ { "name": "Hannah", "birth": 1906 }, ... ]

Example 32. Several JSON encodings of a person and her year of birth
(figure a) consists of two entity types, Person and Document, that are connected by the
binary n:m relationship author of. The implicit uniqueness constraint that spans all
relationships (every fact can only be given once) is drawn explicitly. For simplicity,
some documents may exist without author and some people may exist without
having authored a document. The relationship can be objectified as entity Authorship.
Can you relate this new entity to Person and Document to fully replace the original
author of relationship? First, an Authorship can only exist together with at least one
Person and at least one Document, so each connection has a mandatory role constraint
(figures b to e). The uniqueness constraint, however, can be transformed in several
ways. The most obvious solution is to create two 1:n relationships, so both a person
and a document can have multiple authorships. Each authorship belongs to exactly
one person and one document (figure b). This still allows multiple authorships with
the same person and the same document. An external uniqueness constraint can
solve this problem (figure c), but it is often forgotten in practice. Another solution is to
use one 1:1 relationship between Authorship and Document and one n:m relationship
between Authorship and Person, so each document has at most one authorship, but an
authorship can consist of a group of people (figure d). Similarly, one could interpret
authorship as the lifework of a person, so every Person has at most one Authorship that
consists of a set of Document instances (figure e). There are even more possibilities
if one makes Authorship an independent entity: one could move both mandatory
role constraints to the connection between Authorship and Document to say that every
document must be authored, but its authorship may have no person. The example
shows that a simple relationship can be transformed to an entity, but multiple models
and interpretations exist. The same problem arises on the logical and physical level
of data description, as shown by W. Kent (1988), who also summarized the motivation
for this paradigm as follows:

    [It is] difficult to partition a subject like information into neat categories like categories,
    entities, and relationships. Nevertheless, in both cases, it's much harder to deal with
    the subject if we don't attempt some such partitioning. (W. Kent 1978, p. 15)
[Figure: ORM diagrams a) to e) showing Person, Document, the author of relationship, and its objectification as the entity Authorship]
[Figure: layers of character encoding, e.g. the letter A with combining ring (U+0041, U+030A) given in hexadecimal (41 CC 8A) and binary (01000001 11001100 10001010)]
To give another example, one could create a general tree-store that abstracts
and integrates the hierarchical content of XML files and the hierarchical directory
structures of file systems, as proposed by Wilde (2006) and by Holupirek, Grün, and
Scholl (2007). On a higher level one could then point to a data element by XPath-like
expressions without having to deal with the details of either file systems or XML.
The multitude and ubiquity of layers in data formats is often invisible on purpose:
full awareness of each level at the same time would mostly result in confusion.
[Figure: a graph of encoding mappings ("defined on" / "subset of") between RDF, zzStructure, XML, Unicode, JSON, and YAML, including: RDF/XML (Dave Beckett 2004), RDF Schema for XML Infoset (Tobin 2001), zzStructure in RDF (Gutteridge 2010), RDF/JSON (K. Alexander 2008), JSON-LD (Sporny, Kellogg, and Lanthaler 2012), JSONx (Muschett, Salz, and Schenker 2011), and YAML in XML (Ben-Kiki, Evans, and Ingerson 2006)]
Figure 4.2.: Existing encoding mappings between several data structuring languages
In fact, the main purpose of abstraction layers is to hide complexity and irrelevant
details. Such abstractions not only hide technical aspects of structuring, but they
also subsume concepts of description: a data element in the tree-store example can
be a file or an XML element on a lower level, but these concepts are irrelevant on
a higher level. Another purpose of abstraction is the translation between different
data languages. Depending on the application, abstraction as paradigm also occurs
as encoding, wrapping (see section 3.3.3), granularity (C. M. Keet 2008b, 2011), or
mapping.
Given a set of precise mapping rules, any formal language can be encoded in, or
mapped to, any other language. To give some examples, figure 4.2 shows existing
encodings between JSON, RDF, XML, and other data structuring languages. As the
mapping graph in figure 4.2 contains cycles, one could endlessly encode data in
layers (RDF in RDF/XML in JSONx in XML Infoset in RDF ...) without essentially
adding value: the existence of layers alone does not guarantee that each layer
actually hides complexity and details. In fact, existing data can be compared with
stratigraphic deposits in archaeology or geology (see section 6.2.1 for an extension
of this comparison). An example is MARCXML, an encoding of MARC in XML that
keeps irrelevant punctuation and other artifacts from ISBD in MARC. In addition to
full encodings that map every relevant aspect of one language to another, there are
abstractions which only cover a subset of the original language. By this, a mapping
can also be used as specification (see section 4.2.2).
Although each level of abstraction should fully hide the details of implementation
on the levels below, one sometimes needs to take into account several levels, lacking a
clean separation between them. An example is given by Thomale (2010) for
MARC. The general reason why abstraction cannot fully hide levels is the dependency
    level                       domain
    1) abstract                 conceptual value space
    2) computational            representable values and processes
       a) linguistic/syntax     how values are expressed
       b) operational/semantic  processes which values express
    3) representational         value representation
words syntax and semantics. The statement, however, gets an additional meaning if one considers the
stacking of multiple languages in layers of abstraction. The term language in Meek's paper mainly
refers to programming languages, but his results can also be applied to descriptive data languages.
conceptual data models (C. M. Keet 2007) and it can help to map multiple models
that partly overlap on different levels of description.
Independent of the kind of abstraction, each abstraction helps to focus on particular
properties, but it has a price: as expressed by Yang (2009), every abstraction layer
not only adds a little overhead to the CPU, but also to the poor human who
has to read that code. Integration and readability can be improved by redundancy
and by precise standards (see section 4.2.2). As standards can be fuzzy, violated,
and misinterpreted, there can be confusion about which level of description
is actually used. For instance, an RDF document can use predicates from the OWL
ontology, but this does not guarantee that full enforcement of OWL entailment
with all of its aspects was actually intended. No abstraction can fully remove the
burden of actually reading and interpreting digital documents.
Strength: levels help to separate relevant and irrelevant parts.
Weakness: levels often cannot be separated clearly, and too many levels may confuse
more than they simplify.
Patterns: the patterns most likely found together with this paradigm include the
encoding pattern and the embedding pattern.
Chapter 5
5.1. Organization
In short, a pattern is a named description of "a problem which occurs over and
over again in our environment" (C. Alexander, Ishikawa, and Silverstein 1977).
The pattern points to ways of solving the problem, independent from particular
solutions. Each pattern from this pattern language of data structuring highlights a
specific problem that occurs when data is actually organized. As mentioned before,
problems of data structuring should not be confused with particular data structures
as technical solutions. A pattern refers to particular methods of data structuring (see
chapter 3), but it does not adhere to concrete implementations. Instead, each pattern
shows general strategies of solutions with their benefits, consequences, and pitfalls.
Each of the patterns is structured similarly to the design patterns presented by
Gamma et al. (1994), Cunningham (1995), and other pattern languages.1 Each
pattern consists of four essential elements that imply a set of uniform sections:
The name is a short label for referencing and describing the pattern. A good
name should be easy to recognize and communicate the pattern. Additional
1 An informal collection of pattern templates can be found in WikiWikiWeb at http://c2.com/cgi/wiki?PatternForms. See also Meszaros and Doble (1997) for patterns in pattern languages.
elements. The patterns label and atomicity take an element as such without further
inspection. The patterns size, optionality, and prohibition also include the idea of
content which a data element is built of and which can be shaped in a specific way
for each of the patterns.
Examples
- In file systems the file is atomic: its content is one arbitrary piece of data.
- In conceptual modeling the entity is atomic.
- Most data description languages have the notion of basic data types.
- An API encapsulates the internals of a data element.
- First normal form (1NF) in relational databases.
- The only totally atomic data element is the bit.
Counter examples
- A character string delimited by double quotes is not fully atomic. The string
  must either disallow quotes as content or allow escape sequences (prohibition)
  that force interpretation of the string's internal structure, as the sketch below illustrates.
Difficulties
- Internals of data elements are rarely hidden in total. As soon as details of an
  element, such as its member elements (see container), can be inspected, the
  element is not fully atomic anymore.
- One cannot refer to parts of an atomic element.
- Although a non-descriptive identifier should be atomic, it is common practice
  to inspect its structure. For instance, the actual character string of a URI
  reference has no meaning in the RDF model, but it is common to group and
  interpret these strings, for instance based on namespaces.
- Atomicity is broken if levels of abstraction are not fully separated.
- One should be able to replace the content of an atomic element with random
  data, for instance XXXXX. In practice the content is often limited by
  prohibition, so the element is not fully atomic.
Related patterns
- A container is an alternative strategy to wrap data. Its internal structure is
  typically visible.
- To achieve atomicity, and as an alternative to atomicity, encoding can be used.
- If the hidden content of an atomic element does not matter anyway, the
  element can also be garbage.
- Atomic elements may still have properties which can be connected to the
  atomic elements via dependence.
Implied patterns
- It must be known where an atomic data element starts and where it ends without
  having to look into its content, so atomic data elements have a known size.
Implementations
- Explicitly list all disallowed elements (creating a container).
- Refer to a schema that tells which elements to avoid.
- Use an encoding that does not include prohibited elements.
Examples
- File systems disallow specific characters in file names, such as quotes,
  brackets, dot, colon, bar, asterisk, and question mark (see the sketch after this list).
- A null-terminated character string must not contain null bytes.
- Unicode and languages built on top of it, such as XML and RDF, disallow specific
  character code points.
- A separator element cannot occur as normal content.
- With the Closed World Assumption everything is disallowed unless defined
  as allowed. With the Open World Assumption one needs to explicitly state
  disallowed elements.
- Formal grammars extended by a difference operator, or negation in Boolean
  grammars, make it possible to express arbitrary forbidden elements in a schema.
- In mandatory fields (optionality) empty elements are prohibited.
- Specific graph types disallow some kinds of edges, such as loops and cycles.
Difficulties
- Prohibitions as exceptions from a rule are easy to grasp for human beings,
  but they are more difficult to detect and compute algorithmically. Boolean
  grammars, which support formal expression of exceptions via a negation
  operator, are still more a research topic than a practical tool for data
  description.
- Exceptions can have their own exceptions (the world is complex).
- Some prohibitions are not stated explicitly but implied by external constraints
  (derivation). For instance, numbers in JSON can have arbitrary precision,
  but in practice they are limited to standard floating point and integer
  representations.
Related patterns
- If the prohibition depends on the existence of another element, it is rather an
  instance of the flag pattern.
- optionality and mandatory constraints can be used in a schema to express
  whether an element must be present.
Implied patterns
- Every prohibition is either part of a schema or it constitutes a virtual schema
  consisting of this single prohibition.
[Figure: connections between the patterns separator, dependence, graph, sequence, atomicity, container, size, embedding, and etcetera]
Difficulties
- For most graphs there is no simple normalization. The graph canonization
  or isomorphism problem is computationally hard because elements in a
  graph have no natural order (see the sketch after this list). This contrasts
  with sequence as a basic method to express data.
- Most practical graphs are more than simple structures built of nodes and
  edges. Specialized types and properties of graphs exist, such as directed
  graphs, multigraphs, hypergraphs, labeled graphs, etc. For instance, diagrams
  are likely to evolve into generalized hypergraphs with edges that connect
  more than two nodes and even other edges. Additional levels of encoding
  may be necessary to get to the common form of a graph with simple nodes and
  edges.
Related patterns
- Specific graph types such as trees, grids, and lists often indicate alternative
  patterns such as hierarchies (embedding) and order (sequence). Bijective and
  injective graphs may better express encoding, normalization, or dependence.
Implied patterns
- Edges in a graph are secondary elements to nodes (dependence).
- The set of all nodes and/or edges can be used as a container.
Difficulties
- A clean hierarchy is a sign of oversimplification. In practice one has to deal
  with cross-connections and with parallel and overlapping hierarchies (e.g.
  "( { ) }", see the sketch after this list).
- Once a template has been filled with values, it becomes invisible. One must
  know the embedding rules to rediscover embedded elements, otherwise
  embedding frame and content easily get mixed up.
- Embeddings are part of other embeddings, forming long chains of levels.
  Such a chain should contain no cycles, but self-referential embeddings may
  exist both in the conceptual realm and in the data realm (for instance a
  document that refers to itself or a zip file that contains a copy of itself).
Related patterns
- A hierarchical structure could also be a constrained graph instead.
- Hierarchic nesting is also found in encoding. While encodings stress the
  relations between signifier and signified, the purpose of an embedding is
  more to give context. The relation between encodings and embeddings is
  similar to the semiotic relation between langue and parole.
- Embedded elements may be mandatory or optional (optionality), they may
  be constrained by prohibition, and they may be abbreviated (etcetera). If an
  embedding is primarily used to express such constraints, it is likely a schema.
- Embedded elements may be secondary to the frame they are embedded in
  (dependence).
Specialized patterns
- A container embeds multiple member elements.
[Figure: connections between the patterns optionality, garbage, normalization, schema, encoding, flag, identifier, derivation, prohibition, and void]
Motivation
Collections may be too large to be expressed, or parts of a collection may be
unimportant or already implied by context. This pattern allows for abbreviating
a collection and for showing that it contains more than is explicitly expressed.
Implementations
First, one needs to decide which parts to omit:
- Only express the first or the most important parts.
- Only express mandatory parts and omit the rest (optionality).
- Give a random sample of elements (garbage).
Second, the etcetera indicator can be expressed in several ways:
- Use a special element as etcetera indicator. This element must be a prohibition
  so that it is not confused with normal parts.
- Use obviously wrong elements as placeholders (garbage).
- Define a fixed cut, for instance a maximum length.
Examples
- "..." or "et al." to indicate an abbreviated sequence.
- "e.g." to indicate that the included elements are examples from a larger set.
- Omission of parts in the middle of an element with [...].
- Library cataloging rules exist to only include three authors in a record, so
  the list of authors is always abbreviated if there are more than three authors
  (see the sketch after this list).
Counter examples
- Omission of details can also be an example of generalization and abstraction (see
  encoding) instead of abbreviation.
Difficulties
- The type and number of omitted parts and the reason for abbreviating are often
  unclear.
- An etcetera indicator and normal parts of a collection must not be confused
  (for instance strings that actually end with "...").
- Indicators could also be used to tell that a collection may be extended or that
  an element can be repeated (see the container and schema patterns).
Related patterns
- With a fixed cut this pattern uses the void pattern instead of an explicit
  etcetera indicator. The void pattern is also similar because it indicates elements;
  the etcetera pattern, in contrast, indicates the existence of more elements.
- Even garbage indicates something: at least the fact that the data element is
  missing, inapplicable, or should be ignored for some other reason. The specific
  reason, however, is rarely indicated by garbage elements. Proposals to
  differentiate kinds of null values contradict the original idea that garbage
  elements can be ignored.
- Garbage elements can be introduced against the original purpose of a schema,
  to allow optionality where no support of optionality was intended. For
  instance, obviously wrong names and email addresses are found if these
  fields are mandatory.
Related patterns
- While garbage is explicit data that has no content, instances of the void
  pattern have content without explicit data.
- Data that cannot be interpreted as referring to other data may also be a label
  instead of garbage. Eventually all labels are meaningless to a computer.
- Irrelevant data has no internal structure, so atomicity is often implied.
  Atomicity is also an alternative pattern if it turns out that the data is not fully
  irrelevant.
- Garbage is often used as a separator, which does not need to have a value of its
  own.
Implied patterns
- A schema should define the context in which a data element is garbage or not.
5.6. Evaluation
Despite their popularity, there is little research on the evaluation of pattern languages
and the pattern language paradigm itself (Dearden and Finlay 2006; Petter, Khazanchi,
and Murphy 2010). A pattern language is meant to be a tool of communication
to describe typical solutions to common problems. Evaluation can therefore concentrate
on two questions: first, whether and how well the language describes typical solutions to
common problems, and second, whether and how well the language communicates
these problems and solutions. The second question can best be answered with user
studies, which would go beyond the scope of this thesis. The first question can be
answered by direct examination of the patterns and their problems and solutions.
To check whether the pattern language developed in this thesis reflects experience
in data description, the pattern language is compared to similar collections and
models. If similar approaches have led to similar solutions, this is a strong indicator
that the pattern language actually describes common problems and solutions in
data structuring and description. Most existing pattern languages, such as Gamma
et al. (1994), are not comparable because they do not refer to static digital documents
but to dynamic behavior of information systems. A literature review led to three
models of data that are similar enough to allow for comparison. These models will
briefly be compared with the pattern language in the following. Evaluation in greater
depth will require more feedback. Evaluation of patterns, as suggested by Petter,
Khazanchi, and Murphy (2010), is a conscious continuous improvement activity,
which should be applied to the entire life-cycle of a pattern language. The current
language should therefore be taken as a valid starting point for further improvement.
    dimension/axis                                  patterns
    aggregates                                      container
    homogeneous elements                            flag, schema, label
    basic item elements                             atomicity
    ordered elements                                sequence
    number of elements (fixed, limited, unbounded)  size, schema
    element identified by number                    sequence
    element identified by name                      identifier
    element identified by pointer                   identifier
    files                                           container
    file selection                                  identifier
    unique entries                                  normalization
    sequential file                                 sequence
    kinds of entries                                flag
    associations                                    graph
    cardinality (1-1, 1-n, n-m)                     schema, size
    kinds of ends                                   flag, derivation
    loops allowed                                   graph, schema
    complete associations                           optionality
    exclusive associations                          flag
    concept          definition                                                  patterns
    Abstraction      Creating classes to simplify aspects of reality using      normalization
                     distinctions inherent to the problem.
    Class            A description of the organization and actions shared by    label, schema
                     one or more similar objects.
    Encapsulation    Designing classes and objects to restrict access to the    atomicity, encoding
                     data and behavior by defining a limited set of messages
                     that an object of that class can receive.
    Inheritance      The data and behavior of one class is included in or       derivation
                     used as the basis for another class.
    Object           An individual, identifiable item, either real or           identifier
                     abstract, which contains data about itself and
                     descriptions of its manipulations of the data.
    Message Passing  The process by which an object sends data to another
                     object or asks the other object to invoke a method.
    Method           A way to access, set, or manipulate an object's
                     information.
    Polymorphism     Different classes may respond to the same message and
                     each implement it appropriately.
    notions                       patterns
    value space/representation    encoding

    properties                    patterns
    equality                      normalization
    order                         sequence
    upper/lower bound             schema
    cardinality                   size, normalization
    numeric types                 sequence

    primitive data types          patterns
    Boolean                       flag
    State                         flag, encoding
    Enumerated                    sequence
    Character                     encoding
    Ordinal                       sequence
    Date-and-Time                 sequence, embedding
    Integer                       sequence
    Scaled (fixed point)          size, embedding
    Real                          size, embedding
    Complex                       size, embedding
    Void                          void

    derivation methods            patterns
    pointer types                 identifier, dependence
    choice types                  flag
    aggregation types             container, embedding
    subtypes                      derivation
Chapter 6
Conclusions
6.1. Summary and results
Many methods, technologies, standards, and languages exist to structure and describe
data. The aim of this thesis is to find common features in these methods to
determine how data is actually structured and described. The study is motivated by
a growing number of purely digital documents and metadata, both of which eventually
exist as sequences of bits. In contrast to existing approaches that commit to notions
of data as recorded observations and facts, this thesis analyzes data as signs, communicated
in the form of digital documents. The document approach is rooted in library
and information science as documentation science. In this discipline, digital documents
and metadata are primarily given as stable artifacts rather than as processable
information, as in computer science. The notion of data as documents, as applied
in this thesis, excludes statistical methods of data analysis in favour of intellectual
data analysis. The study assumes that all data is implicitly and explicitly shaped by
a process of data modeling, which is always grounded in the mind of a human being
(see figure 6.1 and its unpackaged version in section 2.2.3, figure 2.6). The study
also denies a clear distinction between data and metadata because metadata is both
a digital document and used to structure and describe digital documents: one's data
is the other's metadata and one's metadata is the other's document. Such relations
in data, however, are not purely arbitrary but based on conventions that have been
analyzed in this thesis.
The plethora of existing ways to structure and describe data was analyzed by a
phenomenological research method, which is based on three steps: first, conceptual
properties of data structuring and description were collected and experienced critically by phenomenological intuiting. As realized in chapter 2 and chapter 3, data is
[Figure 6.1: the chain of data modeling from mind via model, schema, and implementation to bits (01101...)]
structured and described in different disciplines (mathematics, computer science,
library and information science, philosophy, and semiotics) and by different practices. Examples of these practices include encodings, identifiers, markup, formats,
schemas, and models. The most common methods to structure and describe data
include data structuring languages (section 3.5) and schema languages (section 3.7).
After this empirical part, the methods found were grouped using phenomenological
analysis without adhering to known concepts and categories. The result of this
second step was presented in chapter 4: the analysis resulted in six prototypes that
categorize data methods by their primary purpose (section 4.1). These prototypes can
be used to better grasp the actual nature of a method, independent of its originally
intended purpose:
1. encodings (most of section 3.1)
2. storage systems (most of sections 3.3 and 3.4)
3. identifier and query languages (most of sections 3.2 and 3.10)
4. structuring and markup languages (most of sections 3.5 and 3.6)
5. schema languages (most of section 3.7)
6. conceptual models (most of sections 3.8 and 3.9)
The study further revealed five basic paradigms, described in section 4.2, each
with its benefits and drawbacks. The paradigms provide general ways of viewing
and dealing with data, and they deeply shape the way that people deal with data
structuring and description:
1. documents and objects
2. standards and rules
3. collections, types, and sameness
4. entities and connections
5. levels of abstraction
The third step, that is phenomenological describing, resulted in a language of
twenty fundamental patterns in data structuring and description (chapter 5). The
patterns show problems and solutions which occur over and over again in data,
independent from particular technologies. This application of the pattern language
approach is novel: existing design patterns in software engineering refer to dynamic
systems instead of static digital documents, and they mostly refer to one
particular method of data description. The pattern language given in this work
consists of twenty patterns, each described with its name, problems, solutions, and
consequences. Each pattern shows a general strategy in data structuring and description
with its benefits, consequences, and pitfalls, and relates this strategy to other
patterns. An overview of the pattern language is given below with a classification
of the patterns (table 6.1) and with a graph of pattern connections (appendix C). In
section 5.6 the pattern language is compared with related works for evaluation.
This thesis collected and analyzed a wide range of traditions (chapter 2), methods
(chapter 3), prototypes and paradigms (chapter 4), and patterns (chapter 5) of data
structuring and description. The results can help data modelers and programmers
to find a trade-off when selecting methods of data structuring and description for
their particular applications. Patterns can also help to identify solutions that have
implicitly been implemented in data. Last but not least, the results of this thesis
facilitate a better understanding of data. Applications of the results and options for
further research will be summarized in the following sections (6.2 and 6.3) before
concluding with a final reflection (section 6.4).
1. basic patterns (page 182ff.)
a) pure data elements
i. label
ii. atomicity
b) data elements with content
i. size
ii. optionality
iii. prohibition
2. combining patterns (page 190ff.)
a) combine multiple elements on the same level
i. sequence
ii. graph
b) combine elements by subsumption
i. container
ii. dependence
iii. embedding
3. relationing patterns (page 198ff.)
a) primary
i. identifier
ii. derivation
b) secondary
i. encoding
ii. flag
c) tertiary
i. normalization
ii. schema
4. continuing patterns (page 209ff.)
i. separator
ii. etcetera
iii. garbage
iv. void
Table 6.1.: Full classification of patterns in data structuring
6.2. Applications
The results of this thesis can be applied virtually everywhere data is intellectually
used and created, including the design of automatic methods of data processing.
In particular, the identified categories, paradigms, and patterns can help to better
understand existing data and to improve (the creation of) data models. Ideally, the
results will foster a general understanding of methods to describe and structure data,
independent from specific technologies and trends, such as programming languages,
software architectures, and storage systems. Two specific emerging domains of
application are described below: data archaeology (section 6.2.1) and data
literacy (section 6.2.2).
important to locate data archaeology in the (digital) humanities,1 as meaningful data
is always a product of human action. It can therefore only be studied involving the
cultural context of its creation and usage.
As Steve Hoberman points out in the third edition of W. Kent and Hoberman
(2012, p. 63), data archaeology is also an act of reverse-engineering: "Just as an
archaeologist must try to find out what this piece of clay that was buried under the
sand for thousands of years was used for, so must we try to figure out what these
[data] fields were used for when no or little documentation or knowledgeable people
resources exist." The data categories, paradigms, and patterns identified in this
thesis can help to detect the intended shape and purpose of such buried data elements.
1 See Svensson (2010) for a discussion of the scope and definition of digital humanities.
2 Qin and D'Ignazio (2010) refer to scientific or science data literacy as the ability of collecting,
processing, managing, evaluating, and using data for scientific inquiry, but they provide neither a
separation from general data literacy nor a definition of data.
3 An idea not followed in this thesis was to depict each pattern by an icon for better recognition.
Last but not least, the semiotic background of data could be elaborated in more
detail. At best, this thesis provides a semiology of data similar to the semiology
of graphics by Bertin (1967, 2011). Expanding the notion of data as sign to data as
language, this thesis might also be placed in a new discipline called data linguistics.
Several linguistic subfields exist, each concerned with particular aspects of human
language. For instance, anthropological linguistics and sociolinguistics study the
relation between language and society, and historical linguistics studies the history
and evolution of languages. Although digital documents are used for communication,
there is no branch of linguistics dedicated to the study of data as language.
for a comparison. However, Nelson and Berners-Lee do not talk about totally different things, as
one can show with the paradigm of documents and objects (section 4.2.1).
p. 351): this thesis does not present a new and better method to structure and
describe data. The contribution, however, is more important than yet another data
language. The prototypes, paradigms, and patterns provide "another look at data"
(Mealy 1967) by revealing unstated assumptions that deeply shape how data is and
will be structured and described in practice.
Bibliography
Abiteboul, Serge, Peter Buneman, and Dan Suciu (2000). Data on the web: from relations to semistructured data and XML. San Francisco: Morgan Kaufmann.
Alexander, Christopher, Sara Ishikawa, and Murray Silverstein (1977). A Pattern Language: Towns, Buildings, Construction. Oxford: Oxford University Press.
Alexander, Keith (2008). RDF/JSON: A Specification for serialising RDF in JSON. In: Scripting for the Semantic Web. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-368/paper16.pdf.
Alur, Rajeev and P. Madhusudan (2009). Adding nesting structure to words. In: Journal of the ACM 56.3, 16:1-16:43.
Amazon (2010). Amazon S3 Developer Guide. Tech. rep. Amazon. http://aws.amazon.com/documentation/s3/.
Ames, Alexander et al. (2005). Richer File System Metadata Using Links and Attributes. In: Proceedings of the 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). IEEE Computer Society, pp. 49-60.
Andreessen, Marc (1999). Innovators of the Net: Ramanathan V. Guha and RDF. (Accessible via Internet Archive). http://home.netscape.com/columns/techvision/innovators_rg.html.
Angles, Renzo and Claudio Gutierrez (Feb. 2008). Survey of graph database models. In: ACM Computing Surveys (CSUR) 40.1, pp. 1-39.
Armstrong, Deborah J. (Feb. 2006). The Quarks of Object-Oriented Development. In: Communications of the ACM 49.2, pp. 123-128.
CODASYL Database Task Group Report (1971). Tech. rep.
Atkinson, M. et al. (Dec. 1989). The Object-Oriented Database System Manifesto. In: Proceedings of the First International Conference on Deductive and Object-Oriented Databases, pp. 40-57.
Austin, John L. (1962). How to do things with words. Cambridge: Harvard University Press.
Avram, H.D. (1975). MARC; its history and implications. Library of Congress.
Ayers, Danny and Max Völkel (Mar. 2008). Cool URIs for the Semantic Web. Interest Group Note 20080331. W3C. http://www.w3.org/TR/2008/NOTE-cooluris-20080331/.
Baader, Franz et al. (2010). The Description Logic Handbook: Theory, Implementation, and Applications. 2nd. Cambridge: Cambridge University Press.
Bachman, Charles W. (1969). Data Structure Diagrams. In: DATA BASE 1.2, pp. 4-10.
Bachman, Charles W. and Manilal Daya (1977). The Role Concept in Data Models. In: Proceedings of the 3rd International Conference on Very Large Data Bases (VLDB). IEEE Computer Society, pp. 464-476.
Bagley, Philip Rutherford (1951). Electronic digital machines for high-speed information searching. M.S. Thesis. MIT. http://hdl.handle.net/1721.1/12185.
(Nov. 1968). Extension of programming language concepts. Tech. rep. 0518086. University City Science Center, p. 222.
Ballsun-Stanton, Brian (2010). Asking about Data: Experimental Philosophy of Information Technology. In: 5th International Conference on Computer Sciences and Convergence Information Technology, pp. 119-124.
(2012). Asking About Data: Exploring Different Realities of Data via the Social Data Flow Network Methodology. PhD thesis. University of New South Wales.
Balzer, R. M. (1967). Dataless programming. In: AFIPS. ACM, pp. 535-544.
Barker, Richard (1990). CASE Method Entity Relationship Modeling. Addison-Wesley.
Barthes, Roland (1967). Elements of Semiology. London: Cape.
Beardsmore, Anthony (2007). Schema description for arbitrary data formats with the Data Format Description Language. In: IESA. Ed. by Ricardo Jardim-Goncalves et al. Springer, pp. 829-840.
Becker, Peter (2007). Le charme discret du formulaire. De la communication entre administration et citoyen dans l'après-guerre. In: Politiques et usages de la langue
Berners-Lee, T., R. Fielding, and L. Masinter (2005). Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. IETF.
Berners-Lee, T., J. Hendler, and O. Lassila (May 2001). The Semantic Web. In: Scientific American 284.5.
Berners-Lee, Tim (1989). Information Management: A Proposal. Tech. rep. http://www.w3c.org/History/1989/proposal.html.
(1991). Document Naming. http://www.w3.org/DesignIssues/Naming.html.
(1992). Basic HTTP as defined in 1992. http://www.w3.org/Protocols/HTTP/HTTP2.html.
(June 1994). Universal Resource Identifiers in WWW. RFC 1630. IETF.
(Jan. 1997). Metadata Architecture. http://www.w3.org/DesignIssues/Metadata.html.
(1998a). Cool URIs don't change. http://www.w3.org/Provider/Style/URI.html.
(1998b). Principles of Design. http://www.w3.org/DesignIssues/Principles.html.
Berners-Lee, Tim and Dan Connolly (Jan. 14, 2008). Notation3 (N3): A readable RDF syntax. Tech. rep. W3C. http://www.w3.org/TeamSubmission/n3/.
Berners-Lee, Tim and Mark Fischetti (1999). Weaving the web: The original design and ultimate destiny of the world wide web by its inventor. San Francisco: Harper.
Berners-Lee, Tim, L. Masinter, and M. McCahill (Dec. 1994). Uniform Resource Locators (URL). RFC 1738. IETF.
Bertin, Jacques (1967). Sémiologie graphique : les diagrammes, les réseaux, les cartes. Paris: Mouton.
(2011). Semiology of Graphics: Diagrams, Networks, Maps. Redlands: Esri Press.
Biron, Paul V. and Ashok Malhotra (Oct. 28, 2004). XML Schema Part 2: Datatypes Second Edition. Tech. rep. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/.
Bizer, Chris and Richard Cyganiak (July 30, 2007). The TriG Syntax. Tech. rep. FU Berlin. http://www.wiwiss.fu-berlin.de/suhl/bizer/TriG/Spec/TriG-20070730/.
Blackman, Kenneth R. (1998). IMS Celebrates Thirty Years as an IBM Product. In: IBM Systems Journal 37.4, pp. 596-603.
Bobrow, Daniel G. et al. (1972). TENEX, a Paged Time Sharing System for the PDP-10. In: Communications of the ACM 15.3, pp. 135-143.
Boole, George (1847). The mathematical analysis of logic: being an essay towards a calculus of deductive reasoning. Cambridge: Macmillan, Barclay, & Macmillan.
(1854). An Investigation of the Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities. London: Walton and Maberly.
Borges, Jorge Luis (1952). El idioma analítico de John Wilkins. In: Otras inquisiciones (1937-1952). Buenos Aires: Sur, pp. 139-144.
Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) (Sept. 8, 2009). Tech. rep. W3C. http://www.w3.org/TR/CSS21/.
Bourbaki, Nicolas (1970). Éléments de mathématique. Théorie des ensembles. Paris: Hermann.
Bowker, Geoffrey C. and Susan Leigh Star (1999). Sorting Things Out: Classification and Its Consequences. The MIT Press.
Boyd, Michael and Peter McBrien (2005). Comparing and Transforming Between Data Models Via an Intermediate Hypergraph Data Model. In: LNCS 4. Ed. by Stefano Spaccapietra, pp. 69-109.
Boyer, John M. and Glenn Marcy (May 2, 2008). Canonical XML 1.1. Tech. rep. http://www.w3.org/TR/2008/REC-xml-c14n11-20080502.
Bradner, Scott (1997). Key words for use in RFCs to Indicate Requirement Levels. RFC 2119. IETF.
Bradshaw, Paul and Liisa Rohumaa (2011). The Online Journalism Handbook: Skills to Survive and Thrive in the Digital Age. Longman.
Bray, Tim (Feb. 10, 2002). Extensible Markup Language - SW (XML-SW). Tech. rep. http://www.textuality.com/xml/xmlSW.html.
Bray, Tim, Dave Hollander, et al. (Aug. 2009). Namespaces in XML 1.0. W3C Recommendation. W3C. http://www.w3.org/TR/2009/REC-xml-names-20091208/.
Bray, Tim, Jean Paoli, and C.M. Sperberg-McQueen (Feb. 1998). Extensible Markup Language (XML) 1.0. Tech. rep. W3C. http://www.w3.org/TR/1998/REC-xml-19980210.
Bray, Tim, Jean Paoli, C.M. Sperberg-McQueen, et al. (Nov. 2008). Extensible Markup Language (XML) 1.0. Tech. rep. W3C. http://www.w3.org/TR/2008/REC-xml-20081126/.
Bray, Tim, Jean Paoli, et al. (Apr. 2004). Extensible Markup Language (XML) 1.1. Tech. rep. http://www.w3.org/TR/2004/REC-xml11-20040204/.
Brickley, Dan and Ramanathan V. Guha (Feb. 10, 2004). RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.
Brier, Søren (2006). The foundation of LIS in information science and semiotics. In: Libreas 2.1. http://www.ib.hu-berlin.de/~libreas/libreas_neu/ausgabe4/001bri.htm.
(2008). Cybersemiotics. Why information is not enough! University of Toronto Press.
Carroll, Jeremy J. and Patrick Stickler (2004). RDF Triples in XML. In: Proceedings of the Extreme Markup Languages 2004 Conference. http://conferences.idealliance.org/extreme/html/2004/Stickler01/EML2004Stickler01.html.
Cayley, Arthur (1857). On the theory of the analytical forms called trees. In: Philosophical Magazine 13, pp. 172-176.
Chamberlin, D. D. and R. F. Boyce (May 1974). SEQUEL: A Structured English Query Language. In: ACM SIGMOD, pp. 249-264.
Chandler, Daniel (2007). Semiotics: The Basics. Taylor & Francis.
Chang, Fay et al. (2006). Bigtable: A distributed storage system for structured data. In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06). http://research.google.com/archive/bigtable.html.
Chapin, Ned (1968). A deeper look at data. In: Proceedings of the 23rd ACM national conference. New York: ACM, pp. 631-638.
Chen, Chaomei, Il-Yeol Song, and Weizhong Zhu (2007). Trends in Conceptual Modeling: Citation Analysis of the ER Conference Papers (1975-2005). In: Proceedings of the 11th ISSI. CSIC, pp. 189-200.
Chen, Cindy Xinmin (2001). Data Models and Query Languages of Spatio-Temporal Information. PhD thesis. University of California. http://wis.cs.ucla.edu/wis/theses/thesis_cchen.ps.
Chen, Haitao and Husheng Liao (July 2010). A survey to conceptual modeling for XML. In: ICCSIT 2010. Vol. 8, pp. 473-477.
Chen, Peter P. (1976). The Entity-Relationship Model - Toward a Unified View of Data. In: ACM Transactions on Database Systems 1.1, pp. 9-36.
Chen, Peter P., Bernhard Thalheim, and Leah Y. Wong (1999). Future Directions of Conceptual Modeling. In: Conceptual Modeling. Ed. by G. Goos et al. Vol. 1565. LNCS. Springer, pp. 287-301.
Church, Alonzo (1936). A Note on the Entscheidungsproblem. In: Journal of Symbolic Logic 1.1, pp. 40-41.
XSL Transformations (XSLT) Version 1.0 (Nov. 1999). Tech. rep. http://www.w3.org/TR/xslt.
Clark, James (2002). RELAX NG and W3C XML Schema. http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html.
(Dec. 13, 2010). MicroXML. http://blog.jclark.com/2010/12/microxml.html.
Clark, James and Makoto Murata (Dec. 3, 2001). RELAX NG DTD Compatibility. http://relaxng.org/compatibility.html.
Codd, Edgar F. (June 1970). A Relational Model of Data for Large Shared Data Banks. In: CACM 13.6, pp. 377-387.
(Aug. 1971). Further Normalization of the Data Base Relational Model. Tech. rep. RJ909. IBM.
(Apr. 1974). Recent Investigations into Relational Data Base Systems. Tech. rep. RJ1385. IBM.
Darwen, Hugh and C. J. Date (1995). The Third Manifesto. In: SIGMOD Record 24.1, pp. 39-49.
Data Description Language Committee (1978). CODASYL: Reports of the Data Description Language Committee. In: Information Systems 3.4, pp. 247-320.
International Oceanographic Data and Information Exchange, ed. (2007). Global Oceanographic Data Archaeology and Rescue (GODAR). http://www.iode.org/index.php?Itemid=57&id=18&option=com_content&task=view.
Date, C. J. and Hugh Darwen (1997). A Guide to the SQL Standard. 4th edition. Addison-Wesley.
(2006). Databases, Types, and The Relational Model: The Third Manifesto. 3rd. Addison-Wesley.
Dattolo, Antonina, Angelo Di Iorio, et al. (2007). Structural Patterns for Descriptive Documents. In: ICWE. Ed. by Luciano Baresi, Piero Fraternali, and Geert-Jan Houben. Vol. 4607. LNCS. Springer, pp. 421-426.
Dattolo, Antonina and Flaminia L. Luccio (2009). A State of Art Survey on zz-structures. In: Proceedings of the 1st Workshop on New Forms of Xanalogical Storage and Function. CEUR 508, pp. 1-6.
Dau, Frithjof (2009a). Formal, Diagrammatic Logic with Conceptual Graphs. In: Conceptual Structures in Practice. Ed. by Pascal Hitzler and Henrik Schärfe. Chapman and Hall/CRC, pp. 17-44.
(2009b). The Advent of Formal Diagrammatic Reasoning Systems. In: ICFCA. Ed. by Sébastien Ferré and Sebastian Rudolph. Vol. 5548. LNCS. Springer, pp. 38-56.
Davis, Mark (Oct. 8, 2010). Unicode Text Segmentation. Tech. rep. 29.
(2002b). What is the Philosophy of Information? In: Metaphilosophy 33.1/2, pp. 117-138.
(2005). Is Information Meaningful Data? In: Philosophy and Phenomenological Research 70.2, pp. 351-370.
(2009). Trends in the Philosophy of Information. In: Handbook of Philosophy of Information. Ed. by Pieter Adriaans and Johan van Benthem. Elsevier, pp. 113-132.
(2010). Information: a very short introduction. Oxford University Press.
Floyd, Christiane (1996). Choices about choices. In: Systems Research 13.3, pp. 261-270.
Fotache, Marin (2006). Why Normalization Failed to Become the Ultimate Guide for Database Designers? Tech. rep. 9.
Foucault, Michel (1969). L'archéologie du savoir. Paris: Gallimard.
Franssen, Maarten, Gert-Jan Lokhorst, and Ibo van de Poel (2009). Philosophy of Technology. In: Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Stanford University. http://plato.stanford.edu/entries/technology/.
Frederiks, Paul J. M., Arthur H. M. ter Hofstede, and E. Lippe (1997). A unifying framework for conceptual data modelling concepts. In: Information and Software Technology 39.1, pp. 15-25.
Free Software Foundation (Mar. 2009). GNU tar: an archiver tool. Tech. rep. version 1.23. Free Software Foundation. http://www.gnu.org/software/tar/manual/.
Friedl, Jeffrey E. F. (2006). Mastering Regular Expressions. 3rd. O'Reilly.
Friendly, Michael (2009). Milestones in the history of thematic cartography, statistical graphics, and data visualization. http://datavis.ca/milestones/.
Fuchs, Norbert E., Uta Schwertel, and Sunna Torge (1999). Controlled Natural Language Can Replace First-Order Logic. In: ASE, pp. 295-298.
Furnas, George W. and Jeff Zacks (1994). Multitrees: Enriching and Reusing Hierarchical Structure. In: Proceedings of the CHI. New York: ACM, pp. 330-336.
Gamma, Erich et al. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. 1st ed. Addison-Wesley.
Übersicht der PICA3-Kategorien (Aug. 26, 2010). Tech. rep. VZG. http://www.gbv.de/bibliotheken/verbundbibliotheken/02Verbund/01Erschliessung/02Richtlinien/01KatRicht/pica3.pdf.
Génova, G., M.C. Valiente, and J. Nubiola (2005). A Semiotic Approach to UML Models. In: Proceedings of the Workshop on Philosophical Foundations of Information Systems Engineering. Vol. 13, pp. 547-557.
Giampaolo, Dominic (1999). Practical File System Design with the Be File System. Morgan Kaufmann Publishers.
Gil, Joseph, John Howse, and Stuart Kent (1999a). Constraint Diagrams: A Step Beyond UML. In: Proceedings of TOOLS 1999. Ed. by Donald Firesmith et al. IEEE Computer Society, pp. 453-463.
(1999b). Formalising Spider Diagrams. In: Proceedings of IEEE Symposium on Visual Languages. IEEE, pp. 209-212.
(2001). Towards a Formalization of Constraint Diagrams. In: HCC. IEEE Computer Society, pp. 72-79.
Glasersfeld, Ernst von (1990). An exposition of constructivism: Why some like it radical. In: Journal for Research in Mathematics Education 4, pp. 19-29.
Groth, Paul, Andrew Gibson, and Jan Velterop (2010). The anatomy of a nanopublication. In: Information Services and Use 30.1, pp. 51-56.
Guha, Ramanathan V. (1996). Towards a Theory of Meta Content. Apple Technical Report 169. http://downlode.org/Etext/MCF/towards_a_theory_of_metacontent.html.
Guha, R.V. and Tim Bray (June 1997). Meta Content Framework Using XML. W3C Submission. http://www.w3.org/TR/NOTE-MCF-XML-970624/.
Gutteridge, Christopher (2010). zzStructure Ontology. Tech. rep. Temple ov thee Lemur. http://data.totl.net/zz/.
Haendel, Melissa A., Nicole A. Vasilevsky, and Jacqueline A. Wirz (2012). Dealing with Data: A Case Study on Information and Data Management Literacy. In: PLoS Biology 10.5. http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001339.
Halm, Johan van (2005). Questions and issues about the conversion to 13-digit ISBN and ISSN revision in library systems. In: Information Services & Use 25.2, pp. 115-118.
Halpin, Terry (2004). Business Rule Verbalization. In: ISTA. Ed. by Anatoly E. Doroshenko et al. Vol. 48. LNI. Gesellschaft für Informatik, pp. 39-52.
Halpin, Terry and Tony Morgan (2008). Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. 2nd. Morgan Kaufmann.
Hammer, Eric (1994). Reasoning with Sentences and Diagrams. In: Notre Dame Journal of Formal Logic 35.1, pp. 73-87.
Hirschheim, Rudy, Heinz K. Klein, and Kalle Lyytinen (Oct. 1995). Information Systems Development and Data Modeling. Conceptual and Philosophical Foundations. Tracts in Theoretical Computer Science. Cambridge: Cambridge University Press.
Hitchman, S. (1995). Practitioner perceptions on the use of some semantic concepts in the entity-relationship model. In: European Journal of Information Systems 4, pp. 31-40.
Hjelmslev, Louis (1953). Prolegomena to a Theory of Language. Baltimore: Wawerly Press.
Hjørland, Birger (Oct. 2007). Arguments for the bibliographical paradigm. Some thoughts inspired by the new English edition of the UDC. In: Information Research 12.4. http://informationr.net/ir/12-4/colis/colis06.html.
Holmevik, Jan Rune (Dec. 1994). Compiling Simula: A historical study of technological genesis. In: IEEE Annals in the History of Computing 16.4, pp. 25-37.
Holupirek, Alexander, Christian Grün, and Marc H. Scholl (2007). Melting Pot XML: Bringing File Systems and Databases One Step Closer. In: Datenbanksysteme in Business, Technologie und Web (BTW 2007), 12. Fachtagung des GI-Fachbereichs Datenbanken und Informationssysteme (DBIS). Ed. by Alfons Kemper et al. Vol. 103. LNI. GI, pp. 309-323.
Honig, William Leonard (1975). A model of data structures commonly used in programming languages and data base management systems. PhD thesis. Northwestern University Evanston, p. 423.
Honig, William Leonard and C. Robert Carlson (1978). Toward an Understanding of (Actual) Data Structures. In: The Computer Journal 21.2, pp. 98-104.
Hopcroft, John E. and Jeff D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. 1st. Addison-Wesley.
Horridge, Matthew and Peter F. Patel-Schneider (2009). OWL 2 Web Ontology Language: Manchester Syntax. Tech. rep. W3C. http://www.w3.org/TR/owl2-manchester-syntax/.
Horrocks, Ian, Oliver Kutz, and Ulrike Sattler (2006). The Even More Irresistible SROIQ. In: Proc. of the 10th Int. Conf. on Principles of Knowledge Representation and Reasoning. Ed. by Patrick Doherty, John Mylopoulos, and Christopher A. Welty. AAAI Press, pp. 57-67.
Howse, John (2008). Diagrammatic Reasoning Systems. In: Proceedings of ICCS 2008. Ed. by Peter W. Eklund and Ollivier Haemmerlé. Vol. 5113. LNCS. Springer, pp. 1-20.
Huang, Sheng-Cheng (2006). A Semiotic View of Information: Semiotics as a Foundation of LIS Research in Information Behavior. In: Proceedings of the 69th Annual Meeting of ASIST. Ed. by Richard B. Hill. http://hdl.handle.net/10760/8796.
Hull, R. and R. King (Sept. 1987). Semantic Database Modeling: Survey, Applications, and Research Issues. In: ACM Computing Surveys 19.3, pp. 201-260.
Husserl, Edmund (1931). Ideas: General Introduction to Pure Phenomenology. Allen & Unwin.
(1986). Die phänomenologische Methode: Ausgewählte Texte I. Ed. by Klaus Held. Reclam.
IFLA Study Group on the Functional Requirements for Bibliographic Records, ed. (1998). Functional Requirements for Bibliographic Records. K.G. Saur.
Institute of Electrical and Electronics Engineers (1988). Portable Operating System Interface for Computer Environments. Tech. rep. 1003.1-1988. New York.
(2008). IEEE Standard for Binary Floating-Point Arithmetic. Tech. rep. 754-2008. IEEE.
International Organization for Standardization, ed. (1987). ISO/TR 9007:1987 Information processing systems - Concepts and terminology for the conceptual schema and the information base.
ed. (1996). ISO/IEC 14977:1996 Information Technology - Syntactic Metalanguage - Extended BNF.
ed. (2000). ISO/IEC 13250:2000 Topic Maps: Information Technology - Document Description and Markup Language.
ed. (2005). ISO/IEC 24824-1 Generic Applications of ASN.1: Fast Infoset.
ed. (2006). ISO/IEC 19757-3:2007 Document Schema Definition Languages (DSDL) - Part 3: Rule-based validation - Schematron.
ed. (2007a). ISO/IEC 11404:2007 Information technology - General-Purpose Datatypes.
ed. (2007b). ISO/IEC 24707:2007 Information technology - Common Logic (CL).
ed. (2008a). ISO/IEC 19757:2008 Document Schema Definition Languages (DSDL).
ed. (2008b). ISO/IEC 19757-2:2008 Document Schema Definition Languages (DSDL) - Part 2: Regular-grammar-based validation - RELAX NG.
ed. (2008c). ISO/IEC 19757-9:2008 Namespace and datatype declaration in Document Type Definitions (DTDs).
Jakobson, Roman (1963). Essais de linguistique générale. Paris: Minuit.
Jarrar, Mustafa, Maria Keet, and Paolo Dongilli (Feb. 2006). Multilingual verbalization of ORM conceptual models and axiomatized ontologies. Tech. rep. STARLab Technical Report. Vrije Universiteit Brussel. http://www.meteck.org/files/ORMmultiverb_JKD.pdf.
Jay, Barry (1995). A Semantics for Shape. In: Science of Computer Programming 25.2-3, pp. 251-283.
(2009). Pattern Calculus: Computing with Functions and Structures. Springer.
Jech, Thomas (2003). Set Theory: The Third Millennium Edition. Springer.
Jørgensen, Jørgen (1937). Imperatives and Logic. In: Erkenntnis 7, pp. 288-296.
Steel Jr., Thomas B. (1975). Data Base Standardization - A Status Report. In: IBM Symposium: Data Base Systems. Ed. by Helmut F. Hasselmeier and Wilhelm G. Spruth. Vol. 39. LNCS. Springer, pp. 362-386.
Keet, C. Maria (2007). Enhancing Comprehension of Ontologies and Conceptual Models Through Abstractions. In: AI*IA. Ed. by Roberto Basili and Maria Teresa Pazienza. Vol. 4733. LNCS. Springer, pp. 813-821.
(2008a). A formal comparison of conceptual data modeling languages. In: CEUR-WS, pp. 25-37.
(2008b). A formal theory of granularity. PhD thesis. KRDB Research Centre, Faculty of Computer Science, Free University of Bozen-Bolzano.
(2008c). Unifying industry-grade class-based conceptual data modeling languages with CMcom. In: 21st International Workshop on Description Logics. Vol. 353. CEUR-WS.
(2011). The granular perspective as semantically enriched granulation hierarchy. In: IJGCRSIS 2.1, pp. 51-70.
Kelly, Steven and Juha-Pekka Tolvanen (2008). Domain-Specific Modeling: Enabling Full Code Generation. Wiley.
Kent, William (1978). Data and Reality. Basic assumptions in data processing reconsidered. North-Holland.
(1979). Limitations of Record-Based Information Models. In: Transactions on Database Systems (TODS) 4.1, pp. 107-131.
(1983a). A Taxonomy for Entity-Relationship Models. http://www.bkent.net/Doc/ertax.htm.
(Feb. 1983b). A Simple Guide to Five Normal Forms in Relational Database Theory. In: Communications of the ACM 26.2, pp. 120-125.
(1984). Fact-based data analysis and design. In: Journal of Systems and Software 4.2-3, pp. 99-121.
(1988). The Many Forms of a Single Fact. In: Proceedings of the IEEE COMPCON. http://www.bkent.net/Doc/manyform.htm.
(June 1991). A Rigorous Model of Object Reference, Identity, and Existence. In: Journal of Object-Oriented Programming 4.3, pp. 28-36.
(2003). The unsolvable identity problem. In: Extreme Markup Languages. http://www.mulberrytech.com/Extreme/Proceedings/html/2003/Kent01/EML2003Kent01.html.
Kent, William and Steve Hoberman (2012). Data and Reality. A timeless perspective on perceiving and managing information in our imprecise world. 3rd. Westfield: Technics Publications.
RIF Overview (June 22, 2010). Working Group Note. http://www.w3.org/TR/2010/NOTE-rif-overview-20100622/.
Kimball, Ralph (May 1998). Surrogate keys. In: DBMS 11.5. http://web.archive.org/web/19990219111430/http://www.dbmsmag.com/9805d05.html.
Resource Description Framework (RDF): Concepts and Abstract Syntax (Feb. 10, 2004). Tech. rep. W3C. http://www.w3.org/TR/rdf-concepts/.
Knublauch, Holger, James A. Hendler, and Kingsley Idehen (2011). SPIN - Overview and Motivation. Tech. rep. http://www.w3.org/Submission/2011/SUBM-spin-overview-20110222/.
Knuth, Donald E. (June 1957). Potrzebie System of Weights and Measures. In: MAD magazine 33, p. 36.
(1964). backus normal form vs. Backus Naur form. In: Communications of the ACM 7.12, pp. 735-736.
(1984). The TeXbook: a complete user's guide to computer typesetting with TeX. Addison-Wesley.
Köbler, Johannes (2006). On Graph Isomorphism for Restricted Graph Classes. In: CiE. Ed. by Arnold Beckmann et al. Vol. 3988. LNCS. Springer, pp. 241-256.
Koenig, Andrew (1998). Programming abstractly. Lecture slides. http://www.cs.princeton.edu/courses/archive/spr98/cs333/lectures/19/.
Korzybski, Alfred (1933). A Non-Aristotelian System and its Necessity for Rigour in Mathematics and Physics. In: Science and sanity: an introduction to non-Aristotelian systems and general semantics. Lancaster.
Kronick, David A. (1962). A History of Scientific and Technical Periodicals: The Origins and Development of the Scientific and Technological Press, 1665-1790. New York: Scarecrow Press.
Kuhn, Thomas S. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
(1974). Second Thoughts on Paradigms. In: The Structure of Scientific Theories. Ed. by Frederick Suppe. University of Illinois Press, pp. 459-482.
Lakoff, George (1987). Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. Chicago: University of Chicago Press.
Lamport, Leslie (1994). LaTeX: a document preparation system: user's guide and reference manual. 2nd. Addison-Wesley.
Liétard, Ludovic (2008). A new definition for linguistic summaries of data. In: Proceedings of Fuzzy Systems, pp. 506–511.
Liskov, Barbara (Mar. 1987). Data Abstraction and Hierarchy. In: ACM SIGPLAN Notices 23.5, pp. 17–34.
Liskov, Barbara and Stephen Zilles (Apr. 1974). Programming with Abstract Data Types. In: ACM SIGPLAN Notices 9.4, pp. 50–59.
Loukides, Mike (June 2, 2010). What is data science? In: O'Reilly Radar. http://radar.oreilly.com/2010/06/what-is-data-science.html.
Lund, Niels Windfeld (2009). Document theory. In: ARIST 43.1, pp. 1–55.
Lynch, Clifford (Oct. 1997). Identifiers and Their Role in Networked Information Applications. In: ARL: A Bimonthly Newsletter of Research Library Issues and Actions 194. http://www.arl.org/bm~doc/identifier.pdf.
Mai, Jens-Erik (2001). Semiotics and indexing: An analysis of the subject indexing process. In: Journal of Documentation 57.5, pp. 591–622.
Ma, Lai (Apr. 2012). Meanings of information: The assumptions and research consequences of three foundational LIS theories. In: JASIST 63.4, pp. 716–723.
Manola, Frank and Eric Miller (Feb. 2004). RDF Primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/.
Marshall, Catherine C. and Frank M. Shipman III (1995). Spatial Hypertext: Designing for Change. In: Communications of the ACM 38.8, pp. 88–97.
Martinez-Gil, Jorge, Enrique Alba, and José F. Aldana-Montes (2010). Statistical Study about Existing OWL Ontologies from a Significant Sample as Previous Step for their Alignment. In: International Conference on Complex, Intelligent and Software Intensive Systems. Los Alamitos, CA: IEEE, pp. 980–985.
Martin, James (1990). Information engineering. Englewood Cliffs: Prentice Hall.
Martin, James and Carma McClure (1985). Diagramming techniques for Analysts and Programmers. Prentice-Hall.
Masinter, L. (Aug. 1998). The data URL scheme. RFC 2397.
McCallum, Sally H. (2002). MARC: Keystone for Library Automation. In: IEEE Annals of the History of Computing 24.2, pp. 34–49.
(2009). Machine Readable Cataloging (MARC): 1975-2007. In: Encyclopedia of Library and Information Sciences. Ed. by Marcia J. Bates and Mary Niles Maack. 3rd. Taylor & Francis, pp. 3530–3539.
McGee, W. C. (Sept. 1981). Data Base Technology. In: IBM Journal of Research and Development 25.5, pp. 505–519.
McGuffin, Michael J. and m. c. schraefel (2004). A comparison of hyperstructures: zzstructures, mSpaces, and polyarchies. In: Proceedings of the fifteenth ACM conference on Hypertext and hypermedia. New York: ACM, pp. 153–162.
McNamara, Paul (Sept. 2010). Deontic logic. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Fall 2010. Stanford University. http://plato.stanford.edu/archives/fall2010/entries/logic-deontic/.
McNaughton, Robert (1999). An Insertion into the Chomsky Hierarchy? In: Jewels are Forever, Contributions on Theoretical Computer Science in Honor of Arto Salomaa, pp. 204–212.
URIs, URLs, and URNs: Clarifications and Recommendations 1.0 (Sept. 2001). Tech. rep. http://www.w3.org/TR/uri-clarification/.
Mealy, George H. (1967). Another look at data. In: Proceedings of the 1967 Fall Joint Computer Conference. ACM, pp. 525–534.
Meek, Brian L. (Sept. 1994a). A taxonomy of datatypes. In: SIGPLAN Notices 29.9, pp. 159–167.
(Apr. 1994b). Programming languages: towards greater commonality. In: SIGPLAN Notices 29.4, pp. 49–57.
(1995). The seven golden rules for producing language-independent standards. In: Software Engineering Standards Symposium, pp. 250–256.
(June 1996). Too soon, too late, too narrow, too wide, too shallow, too deep. In: StandardView 4.2, pp. 114–118.
Megginson, David (Apr. 2004). Simple API for XML (SAX 2.0). http://www.saxproject.org/.
Meier-Oeser, Stephan (June 2011). Medieval Semiotics. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer 2011. http://plato.stanford.edu/archives/sum2011/entries/semiotics-medieval.
Paradigm (2011). In: Merriam-Webster Online Dictionary. Ed. by Merriam-Webster. Accessed 2011-09-20. http://www.merriam-webster.com/dictionary/paradigm.
Meszaros, Gerard and Jim Doble (1997). A Pattern Language for Pattern Writing. In: Pattern Languages of Program Design 3, pp. 529–574.
Mohan, Sriram and Arijit Sengupta (2009). Conceptual Modeling for XML: A Myth or a Reality. In: Database Technologies: Concepts, Methodologies, Tools, and Applications. Ed. by John Erickson. IGI Global, pp. 527–549.
Moltmann, Friederike (2007). Events, tropes, and truthmaking. In: Philosophical Studies 134.3, pp. 363–403.
Monniaux, David (2008). The pitfalls of verifying floating-point computations. In: ACM Transactions on Programming Languages and Systems 30.3, 12:1–12:41.
Mons, Barend and Jan Velterop (2009). Nano-Publication in the e-science era. In: International Semantic Web Conference.
Moody, Daniel Laurence (2009). The physics of notations: towards a scientific basis for constructing visual notations in software engineering. In: IEEE Transactions on Software Engineering 35.6, pp. 756–779.
Mooers, Calvin (July 1960). Mooers' Law: or, why some retrieval systems are used and others are not. In: American Documentation 11.3, p. 204.
Murata, Makoto et al. (Nov. 2005). Taxonomy of XML Schema Languages Using Formal Language Theory. In: ACM Transactions on Internet Technology 5.4, pp. 660–704.
Muschett, Brien, Rich Salz, and Michael Schenker (Mar. 2011). JSONx, an XML Encoding for JSON. draft. IETF. http://tools.ietf.org/html/draft-rsalz-jsonx-00.
Directory interchange format manual, version 1.0 (1988). Tech. rep. NASA.
Revised Report on the Algorithmic Language Algol 60 (Jan. 1963). In: Communications of the ACM 6.1. Ed. by Peter Naur, pp. 1–23.
Naur, Peter (July 1966). The science of datalogy. In: Communications of the ACM 9.7, p. 485.
(1968). Datalogy, the science of data and data processes, and its place in education. In: Proceedings of the IFIP Congress 68. Ed. by A. J. H. Morrell. North-Holland, pp. 1383–1387.
(1985). Programming as Theory building. In: Microprocessing and Microprogramming 15.5, pp. 253–261.
(1992). Programming Languages Are Not Languages – Why Programming Language Is a Misleading Designation. In: Computing: A Human Activity. ACM Press, pp. XXXX.
(2007). Computing versus human thinking. In: Communications of the ACM 50.1, pp. 85–94.
Navathe, Shamkant B. (1992). Evolution of Data Modeling for Databases. In: Communications of the ACM 35.9, pp. 112–123.
Nečaský, Martin (2006). Conceptual Modeling for XML: A Survey. In: DATESO. Ed. by Václav Snášel, Karel Richta, and Jaroslav Pokorný. Vol. 176. CEUR Workshop Proceedings. CEUR-WS.org. http://www.ceur-ws.org/Vol-176/paper7.pdf.
(2008). Conceptual Modeling for XML. PhD thesis.
Nelson, Ted (1965a). Complex information processing: a file structure for the complex, the changing and the indeterminate. In: Proceedings of the 20th ACM National Conference. Cleveland: ACM, pp. 84–100.
(1965b). The Hypertext. In: Proceedings of the International Federation of Documentation (FID) Congress. 31. Abstract, reprinted in Nelson's Possiplex, p. 154. Washington, DC.
(1981). Literary Machines. 3rd. Sausalito, California: Mindful Press.
(Dec. 1986). The Tyranny of the File. In: Datamation ?15, ?
(Dec. 1999). Xanalogical Structure, Needed Now More Than Ever: Parallel Documents, Deep Links to Content, Deep Versioning, and Deep Re-Use. In: ACM Computing Surveys 31.4.
(2004). A Cosmology for a Different Computer Universe: Data Model, Mechanisms, Virtual Machine and Visualization Infrastructure. In: Journal of Digital Information 5.1. http://journals.tdl.org/jodi/article/viewArticle/131.
(2010). Possiplex: Movies, Intellect, and Creative Control. My Computer Life and the Fight for Civilization. 1st. Mindful Press.
(May 22, 2012). Computers for Cynics 0 – The Myth of Technology. http://www.youtube.com/watch?v=KdnGPQaICjk.
Neward, Ted (June 26, 2006). The Vietnam of Computer Science. http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx.
Nijssen, Gerardus M. and Terry Halpin (1989). Conceptual Schema and Relational Database Design: A Fact Oriented Approach. Prentice-Hall.
Novak, Joseph D. and Alberto J. Cañas (2006). The Theory Underlying Concept Maps and How to Construct Them. Tech. rep. 2006-01. Technical Report IHMC CmapTools. http://cmap.ihmc.us/Publications/ResearchPapers/TheoryCmaps/TheoryUnderlyingConceptMaps.htm.
Open Document Format for Office Applications v1.2 (Jan. 2012). Tech. rep.
Object Management Group (May 2009). Ontology Definition Metamodel (ODM) Version 1.0. Tech. rep. formal/2009-05-01. Object Management Group. http://www.omg.org/spec/ODM/1.0.
OCLC, ed. (2010). info URI Registry. http://info-uri.info/.
Oei, J. L. Han et al. (1992). The Meta Model Hierarchy: A Framework for Information Systems Concepts and Techniques. Tech. rep. 92-17, pp. 151–159.
Ogbuji, Uche (Jan. 12, 2002). XML class warfare. In: ADT Magazine. http://adtmag.com/articles/2002/12/01/xml-class-warfare.aspx.
Ogden, Charles Kay and I. A. Richards (1923). The meaning of meaning. London: Trübner & Co.
Okhotin, Alexander (2010). Fast parsing for Boolean grammars: a generalization of Valiant's algorithm. In: International Conference on Developments in Language Theory. Vol. 6224. LNCS, pp. 340–351.
Oliver, Ian and Vesa Luukala (2006). On UML's Composite Structure Diagram. In: Fifth Workshop on System Analysis and Modelling.
Olivier, Martin S. (2009). On metadata context in Database Forensics. In: Digital Investigation 5.3-4, pp. 115–123.
Documents Associated With UML Version 2.4.1 (Aug. 2011). Tech. rep. http://www.omg.org/spec/UML/2.4.1/.
Ørom, Anders (2007). The concept of information versus the concept of document. In: Document (re)turn. Contributions from a research field in transition. Ed. by Roswitha Skare, Niels Windfeld Lund, and Andreas Vårheim. Frankfurt: Peter Lang, pp. 53–72.
Otlet, Paul (1918). Transformations opérées dans l'appareil bibliographique des sciences. In: Revue Scientifique 58, pp. 236–241.
(1934). Traité de documentation. Le livre sur le livre. Théorie et pratique. IIB Publication 197. Brussels: Editiones Mundaneum.
(1990). International organisation and dissemination of knowledge: selected essays of Paul Otlet. Ed. by W. Boyd Rayward. FID publications. Elsevier.
Palmer, Kent (2009). Emergent design: explorations in systems phenomenology in relation to ontology, hermeneutics and the meta-dialectics of design. PhD thesis. University of South Australia.
Pourabdollah, Amir (2009). Theory and practice of the ternary relations model of information management. PhD thesis. University of Nottingham. http://etheses.nottingham.ac.uk/708/.
Powell, Alan, Michael Beckerle, and Stephen Hanson (Jan. 2011). Data Format Description Language (DFDL). Tech. rep. Open Grid Forum. http://www.ogf.org/dfdl/.
Prud'hommeaux, Eric and Andy Seaborne (Jan. 15, 2008). SPARQL Query Language for RDF. Tech. rep. http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
Qin, Jian and John D'Ignazio (June 2010). Lessons learned from a two-year experience in science data literacy education. In:
Quin, Liam (1996). Suggestive Markup: Explicit Relationships in Descriptive and Prescriptive DTDs. In: Proceedings of the SGML '96 conference. Boston. http://www.holoweb.net/~liam/papers/1996-sgml96-SuggestiveMarkup/.
Raber, Douglas and John M. Budd (2003). Information as sign: semiotics and information science. In: Journal of Documentation 59.5, pp. 507–522.
Ranganathan, Shiyali Ramamrita (1931). The five laws of library science. Madras: Madras Library Association.
Rayward, W. Boyd (May 1994). Visions of Xanadu: Paul Otlet (1868-1944) and Hypertext. In: Journal of the American Society for Information Science 45.4, pp. 235–250.
(Apr. 1997). The Origins of Information Science and the Work of the International Institute of Bibliography/International Federation for Documentation and Information (FID). In: Journal of the American Society for Information Science 48, pp. 289–300.
Reimer, Jeremy (Mar. 16, 2008). From BFS to ZFS: past, present, and future of file systems. In: Ars Technica. http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars.
Reiner, Ulrike (1988). Semantik von Anfragesprachen für Dokumenten-, Fakten- und Erklärungssuchsysteme. PhD thesis. TU Berlin.
(1991). Anfragesprachen für Informationssysteme. Frankfurt: DGD.
Renear, Allen H. (2000). The descriptive/procedural distinction is flawed. In: Markup Languages: Theory and Practice 4.2, pp. 411–420.
Renear, Allen H. and David Dubin (2003). Towards identity conditions for digital documents. In: Proceedings of the International DCMI Metadata Conference and Workshop. Seattle: DCMI, pp. 181–189.
Renear, Allen H. and Karen M. Wickett (2009). Documents Cannot Be Edited. In: Proceedings of Balisage: The Markup Conference 2009. Vol. 3. Balisage Series on Markup Technologies. Montreal. http://www.balisage.net/Proceedings/vol3/html/Renear01/BalisageVol3-Renear01.html.
Renear, Allen, Elli Mylonas, and David Durand (1996). Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. In: Research in
Humanities Computing. Ed. by Nancy Ide and Susan Hockey. Vol. 4. Clarendon Press, pp. 263–280.
Repici, Dominic John (2010). The Comma Separated Value (CSV) File Format. http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm.
Riley, Jenn (2010). Seeing Standards: A Visualization of the Metadata Universe. Tech. rep. http://www.dlib.indiana.edu/~jenlrile/metadatamap/.
Ritchie, Dennis M. (1979). The Evolution of the Unix Time-sharing System. In: Language Design and Programming Methodology. Vol. 79. LNCS. Springer, pp. 25–35.
Rivest, R. (Nov. 1997). S-Expressions. Tech. rep. Internet Draft. Network Working Group. http://people.csail.mit.edu/rivest/Sexp.txt.
Rodriguez, Marko A. and Peter Neubauer (Aug. 2010). Constructions from Dots and Lines. In: Bulletin of the American Society for Information Science and Technology 36.6, pp. 35–41.
Rojas, Raúl and Ulf Hashagen, eds. (2000). The First Computers: History and Architectures. MIT Press.
Rood, Hendrik (Aug. 2000). What's in a name, what's in a number: some characteristics of identifiers on electronic networks. In: Telecommunications Policy 24.6-7, pp. 533–552.
Rosch, Eleanor (1983). Prototype classification and logical classification: The two systems. In: New Trends in Conceptual Representation: Challenges to Piaget's Theory? Ed. by Ellin Kofsky Scholnick. Erlbaum, pp. 73–86.
Saltzer, J. H. (1965). CTSS Technical Notes. Tech. rep. MIT-LCS-TR-016. MIT. http://publications.csail.mit.edu/lcs/specpub.php?id=584.
Saussure, Ferdinand D. (1916). Cours de linguistique générale. Paris: Payot.
Schield, Milo (2004). Information literacy, statistical literacy, and data literacy. In: IASSIST Quarterly 28.Summer/Fall, pp. 6–11.
Schneider, Michael (Oct. 27, 2009). OWL 2 Web Ontology Language: RDF-Based Semantics. Tech. rep. http://www.w3.org/TR/2009/REC-owl2-rdf-based-semantics-20091027/.
Schrettinger, Martin (1808). Versuch eines vollständigen Lehrbuches der Bibliothek-Wissenschaft. Band 1. München.
(1829). Versuch eines vollständigen Lehrbuches der Bibliothek-Wissenschaft, 2. Band. München: Lindenau.
Schuman, Stephen A. and Philippe Jorrand (1967). Definition Mechanisms in Extensible Programming Languages. In: Proceedings of AFIPS. ACM, pp. 9–20.
Searle, John (1969). Speech Acts. Cambridge: Cambridge University Press.
Seibel, Peter (2009). Coders at Work: Reflections on the Craft of Programming. New York: Apress.
Sengupta, Arijit and Erik Wilde (Feb. 2006). The Case for Conceptual Modeling for XML.
Shafranovich, Y. (Oct. 2005). Common Format and MIME Type for Comma-Separated Values (CSV) Files. RFC 4180. IETF.
Shannon, Claude (1938). A Symbolic Analysis of Relay and Switching Circuits. In: Transactions of the American Institute of Electrical Engineers 57, pp. 713–723.
Shannon, Claude Elwood (1948). A mathematical theory of communication. In: Bell Systems Technical Journal 27, pp. 379–423, 623–656.
Shelley, E. P. and B. D. Johnson (1995). Metadata: Concepts and Models. In: Proceedings of the Third National Conference on the Management of Geoscience Information and Data. Adelaide, South Australia: Australian Mineral Foundation.
Sheth, Amit, Cartic Ramakrishnan, and Christopher Thomas (2005). Semantics for the Semantic Web: The Implicit, the Formal and the Powerful. In: International Journal on Semantic Web & Information Systems 1.1, pp. 1–18.
Shin, Sun-Joo (1995). The logical status of diagrams. Cambridge: Cambridge University Press.
Signes, Ricardo and John Cappiello (2008). Rx: Simple, Extensible Schemata. Tech. rep. http://rx.codesimply.com/.
Silberschatz, Abraham, Henry F. Korth, and S. Sudarshan (2010). Database system concepts. 6th ed. New York: McGraw-Hill.
Silverston, L. (2001). The Data Model Resource Book – A Library of Universal Data Models for All Enterprises. Vol. 1. John Wiley & Sons.
Silverston, Len and Paul Agnew (2009). The Data Model Resource Book, Vol. 3: Universal Patterns for Data Modeling. Wiley.
Simsion, Graeme (2007). Data Modeling Theory and Practice. Technics Publications.
Skare, Roswitha, Andreas Vårheim, and Niels Windfeld Lund (2007). A Document (Re)turn. Contributions from a Research Field in Transition. Frankfurt: Peter Lang.
Smith, David Woodruff (June 2009). Phenomenology. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer 2009. Stanford University. http://plato.stanford.edu/archives/sum2009/entries/phenomenology/.
Solla Price, Derek J. de (1963). Little science, big science. New York: Columbia University Press.
Solntseff, N. and A. Yezerski (1974). A survey of extensible programming languages. In: Annual Review in Automatic Programming. Vol. 7. Elsevier, pp. 267–307.
Sompel, H. Van de et al. (Apr. 2006). The info URI Scheme for Information Assets with Identifiers in Public Namespaces. RFC 4452.
Souza, Clarisse Sieckenius de (2012). Semiotics. In: Encyclopedia of Human-Computer Interaction. Ed. by Mads Soegaard and Rikke Friis Dam. Aarhus: The Interaction-Design.org Foundation. http://www.interaction-design.org/encyclopedia/semiotics_and_human-computer_interaction.html.
Sowa, John F. (July 1976). Conceptual Graphs for a Data Base Interface. In: IBM Journal of Research and Development 20.4, pp. 336–357.
(1992a). Conceptual graphs summary. In: Conceptual structures: current research and practice. Ed. by P. Eklund et al. Ellis Horwood, pp. 3–51.
Stührenberg, Maik and Christian Wurm (2010). Refining the Taxonomy of XML Schema Languages. A new Approach for Categorizing XML Schema Languages in Terms of Processing Complexity. In: Proceedings of Balisage Markup Conference 2010. Vol. 5. Balisage Series on Markup Technologies.
Suppes, Patrick (1962). Models of Data. In: Logic, Methodology and Philosophy of Science: Proceedings of the 1960 International Congress. Ed. by E. Nagel, P. Suppes, and A. Tarski. Stanford: Stanford University Press, pp. 252–261.
Sutton, Valerie (2002). Lessons In SignWriting. La Jolla: Center for Sutton Movement Writing.
Sveinsdóttir, Edda and Erik Frøkjær (1988). Datalogy – The Copenhagen Tradition of Computer Science. In: BIT 28.3, pp. 450–472.
Svensson, Patrik (2010). The Landscape of Digital Humanities. In: Digital Humanities Quarterly 4.1. http://www.digitalhumanities.org/dhq/vol/4/1/000080/000080.html.
Sylvester, John Joseph (1878). Chemistry and Algebra. In: Nature 17, p. 284.
Taddeo, Mariarosaria and Luciano Floridi (2005). The Symbol Grounding Problem: a Critical Review of Fifteen Years of Research. In: Journal of Experimental and Theoretical Artificial Intelligence 17.4, pp. 419–445.
(2007). A Praxical Solution of the Symbol Grounding Problem. In: Minds and Machines 17.4, pp. 369–389.
Tanenbaum, Andrew S. (2008). Modern Operating Systems. 3rd. Pearson.
Taverniers, Miriam (Sept. 2008). Hjelmslev's semiotic model of language: An exegesis. In: Semiotica 171, pp. 367–394.
Teichroew, Daniel and Ernest A. Hershey (1977). PSL/PSA: A Computer Aided Technique for Structured Documentation and Analysis of Information Processing Systems. In: IEEE Transactions on Software Engineering 3.1, pp. 41–48.
Tennant, Roy (Oct. 15, 2002). MARC Must Die. In: Library Journal. http://www.libraryjournal.com/article/CA250046.html.
(2004). Digital Libraries: Metadata's Bitter Harvest. In: Library Journal 12. http://www.libraryjournal.com/article/CA434443.html.
The Unicode Consortium (2011). The Unicode Standard. Tech. rep. Version 6.0.0. Mountain View, CA: Unicode Consortium. http://www.unicode.org/versions/Unicode6.0.0/.
Thomale, Jason (Sept. 21, 2010). Interpreting MARC: Where's the Bibliographic Data? In: Code4Lib Journal 11. http://journal.code4lib.org/articles/3832.
Thompson, Henry S. et al., eds. (Oct. 2004). XML Schema Part 1: Structures Second Edition. W3C Recommendation. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028.
Tobin, Richard (Apr. 6, 2001). An RDF Schema for the XML Information Set. Tech. rep. http://www.w3.org/TR/2001/NOTE-xml-infoset-rdfs-20010406.
Trabant, Jürgen (1996). Elemente der Semiotik. Tübingen: UTB.
Truex, Duane and Frantz Rowe (2007). Issues at the IS Core: How French Scholars Inform the Discourse. In: ICIS 2007 Proceedings. 134. http://aisel.aisnet.org/icis2007/134.
Tufte, Edward R. (2001). The Visual Display of Quantitative Information. 2nd. Cheshire, CT: Graphics Press.
Warner, Julian (1990). Semiotics, information science, documents and computers. In: Journal of Documentation 46.1, pp. 16–32.
Watzlawick, Paul, Janet Helmick-Beavin, and Don D. Jackson (1967). Pragmatics of Human Communication – a study of interactional patterns, pathologies, and paradoxes. New York: Norton.
Weizenbaum, Joseph (1976). Computer Power and Human Reason: From Judgment to Calculation. New York: W. H. Freeman & Co.
Wells, Herbert G. (1938). World Brain. 1st. New York: Doubleday, Doran & Co.
Whistler, Ken and Asmus Freytag (2009). The Unicode Character Property Model. Tech. rep. Unicode Standard Annex 23. Unicode Consortium. http://unicode.org/reports/tr23/.
Whorf, Benjamin L. (1956). Language, thought and reality: Selected writings of Benjamin Lee Whorf. Ed. by J. B. Carroll. Cambridge, MA: MIT Press.
Widder, Oliver (Apr. 17, 2010). Meta. In: Geek and Poke. http://geekandpoke.typepad.com/geekandpoke/2010/04/meta.html.
Wieringa, Roel and Wiebren de Jonge (1991). The identification of objects and roles – Object identifiers revisited. Tech. rep. IR-267. Faculty of Mathematics and Computer Science, Vrije Universiteit.
Wikipedia (2010). INI file. http://en.wikipedia.org/wiki/INI_file.
Wilcox-O'Hearn, Zooko (2001). Names: Distributed, Secure, Human-Readable: Choose Two. http://zooko.com/distnames.html.
Wilde, Erik (Aug. 27, 2003). A Compact Syntax for W3C XML Schema. In: XML.com. http://www.xml.com/pub/a/2003/08/27/xscs.html.
(May 2006). Merging Trees: File System and Content Integration. In: 15th International World Wide Web Conference. Ed. by Les Carr et al. Edinburgh: ACM Press, pp. 955–956.
Wilde, Erik and Robert J. Glushko (2008). XML Fever. In: Communications of the ACM 51.7, pp. 40–46.
Winn, William (1990). Encoding and Retrieval of Information in Maps and Diagrams. In: IEEE Transactions on Professional Communication 33.3, pp. 103–107.
Wirth, Niklaus (1976). Algorithms + Data Structures = Programs. Prentice-Hall.
(1977). What can we do about the unnecessary diversity of notation for syntactic definitions? In: Communications of the ACM 20.11, pp. 822–823.
Wirth, Niklaus and C. A. R. Hoare (June 1966). A Contribution to the Development of ALGOL. In: Communications of the ACM 9.6, pp. 413–432.
Wyssusek, Boris (2007). A philosophical re-appraisal of Peter Naur's notion of programming as theory building. In: Proceedings of the 15th European Conference on Information Systems (ECIS). St. Gallen, pp. 1505–1514.
Yager, Ronald and Teresa C. Rubinson (1981). Linguistic summaries of data bases. In: Proceedings of the Decision and Control conference, pp. 1094–1097.
Yang, Bill (Oct. 2009). The problem of too many layers of indirection (abstraction). In: Bill Yang's Weblog. http://billgyang.blogspot.de/2009/10/dark-side-of-abstraction.html.
Yeo, Geoffrey (2010). Nothing is the same as something else: significant properties and notions of identity and originality. In: Archival Science 10.2, pp. 85–116.
Young, John W. and Henry K. Kent (1958). An abstract formulation of data processing problems. In: ACM '58: Preprints of papers presented at the 13th national meeting of the Association for Computing Machinery. ACM, pp. 1–4.
Zelle, Rintze M. (Sept. 3, 2012). Citation Style Language 1.0.1. Tech. rep. http://citationstyles.org/downloads/specification-csl101-20120903.html.
Zeumer, Karl, ed. (2001). Formulae Merowingici et Karolini aevi. Hannover: Hahn.

All URLs in this thesis have been checked and updated on January 5th with a script documented at http://tex.stackexchange.com/a/89521/1044. The full bibliography is available at http://www.bibsonomy.org/user/voj.
Appendices
A. Honig's analysis model of data structures

In his dissertation Honig (1975) developed an analysis model of data structures based on a review of 21 programming languages and data base management systems. A partial summary of the model is given by Honig and C. R. Carlson (1978). In Honig's analysis model, data structures are divided into three classes (aggregates, associations, and files) and each class is modeled with a set of questions. Each question delineates one significant characteristic of the data structure and can be viewed as one axis of an n-dimensional universe of data structures. This appendix includes a copy of these questions for better comparison, as applied in section 5.6.1.
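To illustrate how such a question catalogue locates a data structure in an n-dimensional universe, the following minimal Python sketch encodes two structures as points along a few axes. The three class names follow Honig's model, but the question names (ordered?, homogeneous?, fixed size?) are invented for illustration only and are not Honig's actual questions.

from dataclasses import dataclass, field

# Hypothetical illustration of Honig's analysis model: each data
# structure belongs to one of three classes and is located in an
# n-dimensional space by its answers to a set of questions.
# The question names used below are invented stand-ins.

CLASSES = {"aggregate", "association", "file"}

@dataclass
class DataStructure:
    name: str
    cls: str  # one of CLASSES
    answers: dict = field(default_factory=dict)  # question -> answer

    def __post_init__(self):
        if self.cls not in CLASSES:
            raise ValueError(f"unknown class: {self.cls}")

# Two structures compared along the same (invented) axes:
array = DataStructure("array", "aggregate",
                      {"ordered?": True, "homogeneous?": True,
                       "fixed size?": True})
record = DataStructure("record", "aggregate",
                       {"ordered?": True, "homogeneous?": False,
                        "fixed size?": True})

def differing_axes(a: DataStructure, b: DataStructure) -> set:
    """Return the questions on which two structures differ."""
    shared = a.answers.keys() & b.answers.keys()
    return {q for q in shared if a.answers[q] != b.answers[q]}

print(differing_axes(array, record))  # {'homogeneous?'}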
[Figure: Visual notation as information transmission, after Shannon's communication model. A diagram creator (source) encodes an intended message into a diagram (signal) using a code (the visual notation); the diagram is transmitted over a channel (the medium), subject to noise; a diagram user (destination) decodes it into a received message.]
C. A pattern graph

Figure A4 contains a graph that was automatically created from the connections between patterns in chapter 5. Bold arrows indicate connections to implied patterns or to patterns which occur in the context of another pattern: for instance, the context of a separator pattern is a sequence, and sequences imply an embedding. The relationship, however, is not a formal implication in terms of logic. Subsets of this graph are shown in figure 5.1, with a focus on combining patterns, and in figure 5.2, with a focus on relational patterns. The full graph in figure A4 further contains dashed arrows that indicate which patterns can be found in implementations of another pattern. One could also draw connections between related patterns, but these links are too dense to be of use in a static graph with all patterns. A hypertext version will be provided at http://aboutdata.org for easier browsing of the pattern language. An alternative overview of the pattern language is given in form of a classification in table 6.1 at page 224.
[Figure A4: The full pattern graph. Its nodes are the twenty patterns: derivation, flag, dependence, separator, void, garbage, embedding, graph, prohibition, identifier, encoding, schema, normalization, optionality, sequence, container, etcetera, label, size, and atomicity.]
[Figure A5: MARC record and flat file database model with subfields. The sample record describes Kernighan and Ritchie's The C Programming Language: field names (tags) such as 245, 260, and 700 are followed by indicators (14, ##, 1#); each subfield is introduced by the delimiter 1F and a subfield code (a, b, c, ...) acting as (repeatable) subfield index, followed by a subfield value such as "Kernighan, Brian W.", "The C programming language.", or "Prentice-Hall,". Fields are terminated by 1E and the record by 1D.]
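The delimiter structure shown in figure A5 can be parsed mechanically. The following Python sketch is a deliberately simplified illustration, not a MARC implementation: it skips the leader and directory of a real ISO 2709 record and interprets only the delimiters (1F before a subfield code, 1E terminating a field, 1D terminating the record).

# Simplified sketch of the delimiter structure shown in figure A5:
#   0x1F (unit separator)   introduces a subfield (code + value)
#   0x1E (record separator) terminates a field
#   0x1D (group separator)  terminates the record
US, FT, RT = "\x1f", "\x1e", "\x1d"

raw = ("245 14" + US + "aThe C programming language." + FT +
       "260 ##" + US + "aEnglewood Cliffs, NJ :" +
                  US + "bPrentice-Hall," + US + "c1978." + FT + RT)

def parse_fields(data: str):
    """Yield (tag+indicators, {code: value}) for each field."""
    for f in data.rstrip(RT).split(FT):
        if not f:
            continue
        head, *subs = f.split(US)
        yield head.strip(), {s[0]: s[1:] for s in subs if s}

for head, subfields in parse_fields(raw):
    print(head, subfields)
# 245 14 {'a': 'The C programming language.'}
# 260 ## {'a': 'Englewood Cliffs, NJ :', 'b': 'Prentice-Hall,', 'c': '1978.'}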
The governing paradigm of MARC is the paradigm of standards and rules (section 4.2.2), so this paradigm can reveal most defects of the format. It is worth remarking that MARC is neither specified by a formal language nor does it come with a schema language to express subsets and applications of MARC (MARCXML, an encoding of MARC in XML, only defines a schema for the basic record structure but not for particular data elements). Furthermore there is no official validator to check whether records conform to (a specific dialect of) MARC.

5 Sure, today's formats will be criticized in 40 years for not being suitable then.
6 Even the basic structure cannot be taken for granted: in 2004 German and Austrian libraries decided to adopt MARC, but they introduced an invalid subfield code (A), making some of their records broken MARC. Such violating interpretations also occur at schema and conceptual levels.

[Figure A6: Flat file record model of MARC — a File contains Records, a Record contains Fields, a Field contains Subfields, and a Subfield holds a Value; each one-to-many containment (marked *) is indexed and/or ordered.]
The lack of formal specifications and automatic tools for validation increases the importance of intellectual analysis of MARC records. Many actual data patterns do not simply follow the basic record structure of MARC. In particular, Thomale (2010) found that the underlying structure is based on linguistics rather than on a design for machine-readability, so MARC is better treated like textual markup. The interpretation of records as markup, which is normally based on element order (sequence), contrasts with the requirement to select data elements based on the field-subfield structure (identifier pattern). For instance one could combine tag, indicator, and subfield code into a normalized pointer, such as 245 14 a for the title in figure A5. Within MARC fields, Coyle (2011) identified three pattern structures: first, subfields can independently and directly describe a resource (so they can be used as part of a pointer). Second, subfields can qualify or modify other subfields (dependence pattern or flag pattern), and third, multiple subfields can together form a resource description (sequence, container, or embedding pattern).
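Such a normalized pointer can be made concrete. The following sketch resolves a tag-indicator-subfield address against a parsed record, here given directly as the (tag+indicators, {code: value}) pairs produced by the delimiter sketch above; the pointer syntax is an illustration of the identifier pattern, not an official MARC addressing scheme.

# Sketch: resolve a normalized pointer (tag, indicators, subfield code)
# against a parsed record. The record structure matches the output of
# the delimiter sketch above.

record = [
    ("245 14", {"a": "The C programming language."}),
    ("260 ##", {"a": "Englewood Cliffs, NJ :",
                "b": "Prentice-Hall,", "c": "1978."}),
]

def resolve(record, tag: str, indicators: str, code: str) -> list:
    """Return all subfield values addressed by tag + indicators + code."""
    return [subs[code]
            for head, subs in record
            if head == f"{tag} {indicators}" and code in subs]

print(resolve(record, "245", "14", "a"))  # ['The C programming language.']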
An in-depth analysis of MARC in particular is out of the scope of this work, so this appendix ends with some additional pattern instances from the sample record:

- The fields 100 and 700 form a sequence of authors.
- Author names are structured by an embedding with a comma as separator (surname, given name). Second given names are further abbreviated (etcetera).
- Several instances of punctuation are irrelevant (garbage pattern).
- "NJ" in "Englewood Cliffs, NJ" is an identifier that refers to New Jersey.
- Core elements (Brian, Prentice-Hall, ...) are instances of the label pattern.
Acknowledgements

For the professional and practical support in the completion of this thesis