Professional Documents
Culture Documents
Some may
have an ODS (operational data store), while some may have multiple data
marts. Some may have a small number of data sources, while some may
have dozens of data sources. In view of this, it is far more reasonable to
present the different layers of a data warehouse architecture rather than
discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer
The picture below shows the relationships among the different components
of the data warehouse architecture:
This represents the different data sources that feed data into the data
warehouse. The data source can be of any format -- plain text file,
relational database, other types of database, Excel file, etc., can all act as a
data source.
Many different types of data can be a data source:
Operations -- such as sales data, HR data, product data, inventory data,
marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some
minimal data cleansing, but there is unlikely any major data transformation.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data
mart. Having one common area makes it easier for subsequent data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a
transactional nature to an analytical nature. This layer is also where data cleansing happens.
The ETL design phase is often the most time-consuming phase in a data warehousing project,
and an ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types
of entities can be found here: data warehouse, data mart, and operational data store (ODS). In
any given system, you may have just one of the three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the
underlying data transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in a form of a tabular /
graphical report in a browser, an emailed report that gets automatically generated and sent
everyday, or an alert that warns users of exceptions, among others. Usually an OLAP
tool and/or a reporting tool is used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A
logical data model would be an example of something that's in the metadata layer. A metadata
tool is often used to manage metadata.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job
status, system performance, and user access history.
Aggregation
within itself.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies
on the fact that most common queries will analyze a subset or an
aggregation of the detailed data.
ensures that all the system sources are used in the most effective way.
process does not generally operate during the regular load of information
into data warehouse.
Bottom Tier - The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use the back end tools and utilities to
feed data into the bottom tier. These back end tools and utilities perform the Extract,
Clean, Load, and refresh functions.
Middle Tier - In the middle tier, we have the OLAP Server that can be implemented
in either of the following ways.
o
Top-Tier - This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to
build a virtual warehouse. Building a virtual warehouse requires excess capacity on
operational database servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.
Points to remember about data marts:
The implementation data mart cycles is measured in short periods of time, i.e., in
weeks rather than months or years.
The life cycle of a data mart may be complex in long run, if its planning and design
are not organization-wide.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an
entire organization
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes
or beyond.
Load Manager
This component performs the operations required to extract and load process.
The size and complexity of the load manager varies between specific solutions from one
data warehouse to other.
Perform simple transformations into structure similar to the one in the data
warehouse.
Fast Load
In order to minimize the total load window the data need to be loaded into the
warehouse in the fastest possible time.
It is more effective to load the data into relational database prior to applying
transformations and checks.
Gateway technology proves to be not suitable, since they tend not be performant
when large data volumes are involved.
Simple Transformations
While loading it may be required to perform simple transformations. After this has been
completed we are in position to do the complex checks. Suppose we are loading the EPOS
sales transaction we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
Backup/Recovery tool
SQL Scripts
Creates indexes, business views, partition views against the base data.
Transforms and merges the source data into the published data warehouse.
Archives the data that has reached the end of its captured life.
Note: A warehouse Manager also analyzes query profiles to determine index and
aggregations are appropriate.
Query Manager
Query manager is responsible for directing the queries to the suitable tables.
By directing the queries to appropriate tables, the speed of querying and response
generation can be increased.
Query manager is responsible for scheduling the execution of the queries posed by
the user.
Stored procedures
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and
then archived to tape. The detailed information part of data warehouse keeps the detailed
information in the starflake schema. Detailed information is loaded into the data warehouse
to supplement the aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored
and how it is used.
Note: If detailed information is held offline to minimize disk storage, we should make sure
that the data has been extracted, cleaned up, and transformed into starflake schema before it
is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These
aggregations are generated by the warehouse manager. Summary Information must be
treated as transient. It changes on-the-go in order to respond to the changing query profiles.
Points to remember about summary information.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed
information.
Note: Do not data mart for any other reason since the operation cost of
data marting could be very high. Before data marting, make sure that
data marting strategy is appropriate for your particular solution.
As the merchant is not interested in the products they are not dealing
with, the data marting is a subset of the data dealing which the product
group of interest. The following diagram shows data marting for different
users.
Given below are the issues to be taken into account while determining
the functional split:
The merchant could query the sales trend of other products to analyze what
is happening to the sales.
The summaries are data marted in the same way as they would have
been designed within the data warehouse. Summary tables help to utilize
all dimension data in the starflake schema.
Network Access
Network Access
A data mart could be on a different location from the data warehouse, so
we should ensure that the LAN or WAN has the capacity to handle the
data volumes being transferred within thedata mart load process.
Network capacity.
Relational OLAP
ROLAP servers are placed between relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMS.
ROLAP includes the following:
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, the
storage utilization may be low if the data set is sparse. Therefore, many
MOLAP server use two levels of data storage representation to handle
dense and sparse data sets.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will
discuss OLAP operations in multidimensional data.
Roll-up
Drill-down
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following
ways:
By dimension reduction
Initially the concept hierarchy was "street < city < province < country".
When roll-up is performed, one or more dimensions from the data cube are
removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of
the following ways:
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter
to the level of month.
When drill-down is performed, one or more dimensions from the data cube
are added.
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube
and provides a new sub-cube. Consider the following diagram that shows
how slice works.
Here Slice is performed for the dimension "time" using the criterion time =
"Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a
new sub-cube. Consider the following diagram that shows the dice
operation.
The dice operation on the cube based on the following selection criteria
involves three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in
view in order to provide an alternative presentation of data. Consider the
following diagram that shows the pivot operation.
In this the item and location axes in 2-D slice are rotated.
OLAP vs OLTP
Sr.No
.
Data Warehouse
(OLAP)
Operational Database
(OLTP)
Involves historical
processing of information.
Involves day-to-day
processing.
It focuses on Information
out.
Number or users is in
hundreds.
Number of users is in
thousands.
10
Number of records
accessed is in millions.
11
12
Highly flexible.
Spatial database
From Wikipedia, the free encyclopedia
A spatial database, or geodatabase is a database that is optimized to store and query data that
represents objects defined in a geometric space. Most spatial databases allow representing
simple geometric objects such as points, lines and polygons. Some spatial databases handle
more complex structures such as 3D objects, topological coverages, linear networks, and TINs.
While typical databases are designed to manage various numeric and character types of data,
additional functionality needs to be added for databases to process spatial data types efficiently.
These are typically called geometry or feature. The Open Geospatial Consortium created
the Simple Features specification and sets standards for adding spatial functionality to database
systems.[1]
Spatial Measurements: Computes line length, polygon area, the distance between
geometries, etc.
Spatial Functions: Modify existing features to create new ones, for example by providing
a buffer around them, intersecting features, etc.
Observer Functions: Queries which return specific information about a feature such as
the location of the center of a circle
Some databases support only simplified or modified sets of these operations, especially in cases
of NoSQL systems like MongoDB and CouchDB.
Mobile communication
Mobile hardware
Mobile software
Mobile communication
The mobile communication in this case, refers to the infrastructure put in
place to ensure that seamless and reliable communication goes on.
These would include devices such as protocols, services, bandwidth, and
portals necessary to facilitate and support the stated services. The data
format is also defined at this stage. This ensures that there is no collision
with other existing systems which offer the same service.
Mobile Hardware
Mobile hardware includes mobile devices or device components that
receive or access the service of mobility. They would range from portable
laptops, smartphones, tablet Pc's, Personal Digital Assistants.
These devices will have a receptor medium that is capable of sensing and
receiving signals. These devices are configured to operate in full- duplex,
whereby they are capable of sending and receiving signals at the same
time. They don't have to wait until one device has finished
communicating for the other device to initiate communications.
Above mentioned devices use an existing and established network to
operate on. In most cases, it would be a wireless network.
Mobile software
Mobile software is the actual program that runs on the mobile hardware.
It deals with the characteristics and requirements of mobile applications.
This is the engine of the mobile device. In other terms, it is the operating
system of the appliance. It's the essential component that operates the
mobile device.
Since portability is the main factor, this type of computing ensures that
users are not tied or pinned to a single physical location, but are able to
operate from anywhere. It incorporates all aspects of wireless
communications.
Multimedia database
From Wikipedia, the free encyclopedia
1Contents of MMDB
1.1Comparison of multimedia data types
4Application areas
5See also
6References
Contents of MMDB[edit]
A Multimedia Database (MMDB) hosts one or more multimedia data types[3] (i.e. text, images,
graphic objects, audio, video, animation sequences. These data types are broadly categorized
into three classes:
Text
Elements
Printable characters
Time-dependence
No
Graphic
Vectors, regions
No
Image
Pixels
No
Audio
Sound, Volume
Yes
Video
Yes
Media format data: information about the format of the media data after it goes through
the acquisition, processing, and encoding phases.
Media keyword data: the keyword descriptions, usually relating to the generation of the
media data.
Media feature data: content dependent data such as contain information about the
distribution of colours, the kinds of textures and the different shapes present in an image.
The last three types are called metadata as they describe several different aspects of the media
data. The media keyword data and media feature data are used as indices for searching
purpose. The media format data is used to present the retrieved information.
Integration
Data independence
Separate the database and the management from the application programs
Concurrency control
Data objects can be saved and re-used by different transactions and program
invocations
Privacy
Multimedia databases should have the ability to uniformly query data (media data, textual data)
represented in different formats and have the ability to simultaneously query different media
sources and conduct classical database operations across them. (Query support)
They should have the ability to retrieve media objects from a local storage device in a good
manner. (Storage support)
They should have the ability to take the response generated by a query and develop a
presentation of that response in terms of audio-visual