Data Architecture: A Primer for the Data Scientist
By W.H. Inmon, Daniel Linstedt and Mary Levins
About this ebook
Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things.
Data Architecture: A Primer for the Data Scientist, Second Edition addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together.
- New case studies include expanded coverage of textual management and analytics
- New chapters on visualization and big data
- Discussion of new visualizations of the end-state architecture
W.H. Inmon
Best known as the “Father of Data Warehousing,” Bill Inmon has become the most prolific and well-known author worldwide in the big data analysis, data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the “Ten IT People Who Mattered in the Last 40 Years” of the computer profession. Having 35 years of experience in database technology and data warehouse design, he is known globally for his seminars on developing data warehouses and information architectures. Bill has been a keynote speaker in demand for numerous computing associations, industry conferences and trade shows. Bill Inmon also has an extensive entrepreneurial background: He founded Pine Cone Systems, later named Ambeo, in 1995, and founded, and took public, Prism Solutions in 1991. Bill consults with a large number of Fortune 1000 clients and leading IT executives on Data Warehousing, Business Intelligence, and Database Management, offering data warehouse design and database management services, as well as producing methodologies and technologies that advance the enterprise architectures of large and small organizations worldwide. He has worked for American Management Systems and Coopers & Lybrand. Bill received his Bachelor of Science degree in Mathematics from Yale University, and his Master of Science degree in Computer Science from New Mexico State University.
Chapter 1.2
The Data Infrastructure
Abstract
Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is between structured data and unstructured data. As a rule, there are far more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive unstructured big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the great divide.
The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has much greater business value than repetitive big data.
Keywords
Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data
If there is any secret to data management and data architecture, it is understanding data in terms of its infrastructure. Stated differently, understanding the larger architecture under which data are managed and operate is almost impossible without understanding the underlying infrastructure that surrounds data. Therefore, we shall spend some time understanding infrastructure.
Two Types of Repetitive Data
A good starting point for understanding infrastructure is the observation that there are two types of repetitive data found in corporate data. Repetitive data are found on the structured side of corporate data, and repetitive data are also found on the unstructured big data side. Although the two types sound the same, there are significant differences between them. Structured repetitive data normally include transactions: sales transactions, stocking of SKU transactions, inventory replenishment transactions, payment transactions, and so forth. Many of these transactions find their way into the repetitive structured world.
The other kind of repetitive data is the repetitive data found in the unstructured big data world. In the unstructured big data world, we might have metering data, analog data, manufacturing data, clickstream data, and so forth.
The question, then, is whether these types of repetitive data are the same. They certainly are both repetitive, but they are not the same. What, then, is the difference between them? Fig. 1.2.1 shows (symbolically) these two types of repetitive data.
Fig. 1.2.1 Two types of repetitive data.
Repetitive Structured Data
In order to understand the differences between these two types of repetitive data, it is necessary to understand each type of data individually. Let's start with repetitive structured data. Fig. 1.2.2 shows that repetitive structured data are broken into records and blocks.
Fig. 1.2.2 Repetitive data broken into blocks.
The most basic unit of information in the repetitive structured environment is a block of data. Inside each block of data are records of data.
Fig. 1.2.3 shows a simple record of data.
Fig. 1.2.3 Records inside a block.
Each record of data is (normally!) representative of a transaction. For example, there are records of data representing the sale of a product. Each record is representative of a single sale.
Inside each record are keys, attributes, and indexes. Fig. 1.2.4 shows the anatomy of a record.
Fig. 1.2.4 Attributes, keys, and indexes.
If a record is representative of a sale, the attributes might be information about the date of the sale, the item sold, the cost of the item, any tax on the item, who bought the item, and so forth. The key of the record is one or more attributes that uniquely define the record. The key for a sale might be the date of sale, item sold, and location of the sale.
The indexes attached to the record are built on those attributes through which quick access to the record is needed.
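As a rough illustration of this anatomy, the following Python sketch models a sale record with a composite key and one secondary index. The `SaleRecord` class, its field names, the sample values, and the dictionary-based index are all assumptions made for the example; they are not taken from the text or from any particular DBMS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleRecord:
    # Attributes of one sale (hypothetical field names).
    sale_date: str
    item: str
    location: str
    cost: float
    tax: float
    buyer: str

    def key(self):
        # The key: the attributes that together uniquely identify the record.
        return (self.sale_date, self.item, self.location)


records = [
    SaleRecord("2024-01-05", "widget", "Denver", 9.99, 0.80, "Ann"),
    SaleRecord("2024-01-05", "gadget", "Denver", 24.50, 1.96, "Bob"),
    SaleRecord("2024-01-06", "widget", "Austin", 9.99, 0.82, "Ann"),
]

# Primary access path: key -> record.
by_key = {r.key(): r for r in records}

# Secondary index on an attribute that needs quick access (here, the buyer).
by_buyer = defaultdict(list)
for r in records:
    by_buyer[r.buyer].append(r.key())
```

Looking up `by_buyer["Ann"]` returns the keys of Ann's sales without examining every record, which is the role an index plays for the attributes it covers.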
The infrastructure that is attached to structured repetitive data managed under a DBMS is seen in Fig. 1.2.5.
Fig. 1.2.5 A standard DBMS.
Repetitive Big Data
The other type of repetitive data is repetitive data found in big data. Fig. 1.2.6 depicts the repetitive data found in big data.
Fig. 1.2.6 Repetitive big data.
At first glance, there are just a lot of repetitive records seen in Fig. 1.2.6. But upon closer examination, it is seen that all of those repetitive big data records are packed away into a string of data and that string of data is stored inside a block of data, as seen in Fig. 1.2.7.
Fig. 1.2.7 A block of data.
The structured infrastructure seen in Fig. 1.2.5 is typical of an infrastructure managed under one of several DBMSs, such as Oracle, SQL Server, and DB2.
The infrastructure for big data is quite different from the infrastructure found in a standard DBMS. In the infrastructure for big data, there is a block, and in the block are found many repetitive records. Each record is simply concatenated to the next. Fig. 1.2.8 is representative of the records that might be found in big data.
Fig. 1.2.8 Records inside the block.
In Fig. 1.2.8, it is seen that there is merely a long string of data, with records stacked one against the other. The system only sees the block and the long string of data. In order to find a record, the system needs to parse the string, as seen in Fig. 1.2.9.
Fig. 1.2.9 Parsing records inside the block.
Suppose the system wants to find a given record, say record B. The system needs to sequentially read the string of data until it recognizes that there is a record. Then, the system needs to go into the record and determine whether it is record B.
This is how a search is conducted in the most primitive state in big data.
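The sequential search just described can be sketched in Python. The record layout used here (a `|` delimiter between records and an `id=payload` body) is purely an assumption for illustration; the text does not say how records are delimited, and `find_record` is a hypothetical helper.

```python
def find_record(block: str, wanted_id: str, delimiter: str = "|"):
    """Scan a block of concatenated records from the start until wanted_id is found.

    Returns (record, records_examined), or (None, records_examined) if absent.
    """
    examined = 0
    start = 0
    while start < len(block):
        # Read forward until the end of the current record is recognized.
        end = block.find(delimiter, start)
        if end == -1:
            end = len(block)
        record = block[start:end]
        examined += 1
        # Go into the record and check whether it is the one wanted.
        record_id, _, _payload = record.partition("=")
        if record_id == wanted_id:
            return record, examined
        start = end + 1
    return None, examined


# Three records stacked one against the other (made-up contents).
block = "A=meter:101|B=meter:205|C=meter:318"
```

Calling `find_record(block, "B")` has to pass over record A before it reaches record B, and a search for a record that is not present reads the entire string.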
It doesn’t take much imagination to see that many machine cycles are consumed looking for data in big data. To this end, the big data environment employs a means of processing referred to as the Roman census approach. More will be said about the Roman census approach in the chapter on big data.
The Two Infrastructures
The two different infrastructures are contrasted in Fig. 1.2.10.
Fig. 1.2.10 Two different infrastructures.
Without much effort, it is seen that the infrastructures surrounding big data and structured data are quite different. The infrastructure surrounding big data is quite simple and streamlined. The infrastructure surrounding structured DBMS data is elaborate and anything but streamlined.
There is, then, no question that there are significant differences between the infrastructure of repetitive structured data and that of repetitive big data.
What's Being Optimized?
When looking at the two infrastructures, it is natural to ask what each one is optimizing. In the case of big data, the infrastructure is optimized for the ability of the system to manage almost unlimited amounts of data. Fig. 1.2.11 shows that with the infrastructure of big data, adding new data is very easy and streamlined.
Fig. 1.2.11 Optimal for storing massive amounts of data.
But the infrastructure behind a structured DBMS is optimized for something quite different from managing huge amounts of data. In the structured DBMS environment, the optimization is for the ability to find any one given unit of data quickly and efficiently.
Fig. 1.2.12 shows the optimization of the infrastructure of a standard structured DBMS.
Fig. 1.2.12 Optimal for direct online access of data.
Comparing the Two Infrastructures
Another way to think of the different infrastructures is in terms of the amount of data and overhead required to find a given unit of data. In order to find a given unit of data, the big data environment has to search through a whole host of data. Many input/output operations (I/Os) must be done to find a given item. To find that same item in a structured DBMS environment, only a few I/Os need to be done. So if you want to optimize for speed of access to data, the standard structured DBMS is the way to go.
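The difference in I/O counts can be made concrete with a small simulation. This is only a sketch under stated assumptions: each block read counts as one I/O, the block and record counts are invented, and the index is an in-memory dictionary whose own reads are ignored (in a real DBMS, traversing the index typically adds a few I/Os of its own). The functions `scan_reads` and `indexed_reads` are hypothetical names.

```python
def scan_reads(blocks, wanted):
    """Count block reads needed to find `wanted` by reading blocks in order."""
    reads = 0
    for block in blocks:
        reads += 1  # one simulated I/O per block read
        if wanted in block:
            return reads
    return reads


def indexed_reads(index, wanted):
    """Count block reads when an index already maps each record to its block."""
    if wanted not in index:
        return 0  # the index alone shows that the record is absent
    return 1  # read only the block the index points to


# 1,000 blocks of 100 record ids each (hypothetical sizes).
blocks = [set(range(b * 100, (b + 1) * 100)) for b in range(1000)]
index = {rec: b for b, block in enumerate(blocks) for rec in block}

worst_case_scan = scan_reads(blocks, 99_950)    # record lives in the last block
with_index = indexed_reads(index, 99_950)
```

Under these assumptions, the scan touches every block before it finds a record in the last one, while the indexed lookup reads a single block.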
On the other hand, in order to achieve that speed of access, the standard structured DBMS requires an elaborate infrastructure for data. That infrastructure must be both built and maintained over time, as data change. A considerable amount of system resources is required for building and maintaining it. But when it comes to big data, the infrastructure that must be built and maintained is practically nil. What little there is can be built and maintained very easily.
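The maintenance cost on the write side can be sketched in the same spirit. The two classes below are hypothetical toy models, not descriptions of any real product: `AppendOnlyBlock` just concatenates new records, while `IndexedStore` updates one dictionary-based index per indexed field on every insert and counts those updates.

```python
class AppendOnlyBlock:
    """Big-data style storage: new records are simply concatenated."""

    def __init__(self):
        self.data = []
        self.structure_updates = 0  # never incremented: nothing to maintain

    def add(self, record: dict):
        self.data.append(record)


class IndexedStore:
    """DBMS-style storage: every insert also updates each index."""

    def __init__(self, indexed_fields):
        self.rows = []
        self.indexes = {field: {} for field in indexed_fields}
        self.structure_updates = 0

    def add(self, record: dict):
        position = len(self.rows)
        self.rows.append(record)
        for field, index in self.indexes.items():
            # Keep each index in step with the newly stored row.
            index.setdefault(record[field], []).append(position)
            self.structure_updates += 1


sale = {"item": "widget", "buyer": "Ann", "location": "Denver"}
big = AppendOnlyBlock()
dbms = IndexedStore(["item", "buyer", "location"])
for _ in range(1000):
    big.add(sale)
    dbms.add(sale)
```

After the same 1,000 inserts, the append-only block has done no structural work, whereas the indexed store has performed one index update per indexed field per record. That extra work is what buys the fast lookups.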
This section began with the proposition that repetitive data can be found in both the structured and big data environments. At first glance, the repetitive data are the same or are very similar. But when you look at the infrastructure and the mechanics implied in the infrastructure, it is seen that the repetitive data in each of the environments are indeed very different.
Chapter 1.3
The Great Divide
Abstract
Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is between structured data and unstructured data. As a rule, there are far more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive unstructured big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the great divide.
The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has much greater business value than repetitive big data.
Keywords
Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data
Classifying Corporate Data
Corporate data can be classified in many different ways. One of the major classifications is by structured versus unstructured data. And unstructured data can be further broken into two categories—repetitive unstructured data and nonrepetitive unstructured data. This division of data is shown in Fig.