
Module 6: Introduction to Big Data and Data Analytics

6.1: KTP 1 Introduction to Big Data
6.2: KTP 2 Introduction to Data Analytics
6.3: Summary
6.4: References

Module Overview:

Driven by the increasing power and availability of analytic technology, the scale and scope of
data available to oversight organizations for audit work continue to increase at a rapid pace.
Obtaining access to data, analyzing it, and developing insights will remain an essential part of
the work that Supreme Audit Institutions (SAIs) do, both now and in the future.

This module provides an overview of big data: what the characteristics of big data are, what
kinds of data sources can be distinguished and how these data can be categorized, as well as how
data should be processed and stored. We will also discuss how data analytics helps auditors. A
common model for performing data analytics will be introduced and some examples of audit
practice using the respective data analysis technologies will be presented. We end with a
discussion of the opportunities and challenges of data analytics that SAIs may face in the era of
big data.

Module Learning Objective:

At the end of the module, participants will be able to describe big data and data analytics,
covering data sources, data types and the process of data analytics, as evaluated by the mentors.

6.1 KTP 1: Introduction to Big Data

6.1.1 Overview
Big data is a popular term used to describe the exponential growth and availability of data
created by people, applications, and organizations. Wikipedia defines the term as a collection of
data sets so large and complex that they become difficult to process using on-hand database
management tools or traditional data processing applications. The proliferation of structured
and unstructured data, combined with technical advances in storage, processing power, and
analytic tools, has enabled big data to become a competitive advantage for governments that
use it to gain insights into national governance and assist in rational decision-making.

Given that big data is closely related to national governance and sustainable development in
many countries, and that the mission of SAIs is to enhance accountability, promote good
governance and monitor the fulfillment of sustainable development, embracing big data and
strengthening big-data-assisted audit is both a development trend and a realistic choice.
However, it also raises the level of knowledge and skills required for auditors to work with big
data effectively.

This session provides an overview of big data: what the characteristics of big data are, what
kinds of data sources can be distinguished and how these data can be categorized, as well as
how data should be processed and stored.

6.1.2 Characteristics of Big Data


The general consensus is that there are specific attributes that define big data. The most
common characteristics of big data are known as the four Vs: volume, variety, velocity and
veracity.

Volume refers to the quantity of data: big data is frequently defined in terms of massive data
sets, with measures such as petabytes and zettabytes commonly referenced, and these vast
amounts of data are generated every second. The more data we acquire, the more insights we
can extract from it. Thus, the volume of data plays a crucial role in determining the value that
can be derived from it.

Variety refers to the increasingly diversified sources and types of data requiring management
and analysis. We used to store data from sources like spreadsheets and databases; now data
also comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. We
therefore need to integrate these complex and multiple data types - structured and unstructured
- from an array of systems and sources, both internal and external. The more varied the data we
acquire, the more multifaceted a view we can develop. However, the variety of unstructured data
creates problems for storing, mining and analyzing the data.

Velocity deals with the accelerating speed at which data flows in from sources such as business
transactions, machines, networks and human interaction with social media applications, mobile
devices, etc. The flow of data is massive and continuous. Rapid data ingestion and rapid analysis
capabilities provide us with timely and correct insights.

Veracity refers to the biases, noise and abnormality in the data being generated. Is the data that
is being stored and mined meaningful to the problem being analyzed? Given the increasing
volume of data being generated at an unprecedented rate, and in ever more diverse forms, there
is a clear need for us to manage the uncertainty associated with particular types of data.

However, as systems become more efficient and the need to process data faster continues to
increase, the original data management dimensions have expanded to include other
characteristics unique to big data. The additional characteristics are validity, variability,
visualization, volatility and value.

Validity, like veracity, means that the data are correct and accurate for the intended use. The
validity of big data sources and of the subsequent analysis must be ensured if we are to use the
results for decision making.

Variability is different from variety. It refers to data whose meaning is constantly changing. This
is particularly the case when gathering data relies on language processing.

Visualization. Analytics results from big data are often hard to interpret. Therefore, translating
vast amounts of data into readily presentable graphics and charts that are easy to understand is
critical to end-user satisfaction and may highlight additional insights.

Volatility refers to how long the data is valid and how long it should be stored. In this world of
real-time data, we need to determine at what point the data is no longer relevant to the current
analysis.

Value. Last, but arguably the most important of all, is value. The other characteristics of big data
are meaningless if we don't derive value from the data. Organizations, societies and
governments can all benefit from big data. Value is generated when new insights are translated
into actions that create positive outcomes.

Since the data collected by the SAI come from different sources (inside and outside the
government), with multiple data types (structured and unstructured) and in huge quantities,
audit data naturally have the characteristics of big data.

6.1.3 Data Source and Data Types


SAIs have many different data sources at their disposal for filling their audit data environment.
By an audit data environment we mean the technical infrastructure where all data coming from
all kinds of sources are stored, processed, and accessed for use in audits. In this environment we
nowadays see a combination of “traditional” data warehouse technology and modern big data
tooling, such as Hadoop. This combination of technologies makes it possible to deal with the
challenges posed by the existence of different types of data sources, and with specific challenges
in data volume and variety. Within these data sources, we distinguish between internal versus
external data sources and structured versus unstructured data sources. The different
combinations of internal versus external and structured versus unstructured data are shown in
Figure 6.1.

Figure 6.1: Two Dimensions of Data: Data Source and Data Type

6.1.3.1 External data sources versus internal data sources


A first important distinction is between external and internal data sources. External data sources
are not present within the audited entities and are frequently acquired from (or collected by)
other government departments or third parties and added to the audit data environment. For
example, if we are going to audit local hospitals, focusing on the performance of government
subsidies for the medical treatment of low-income people, we should first acquire the poverty
datasets from the social security administration. These provide information on low-income
people, such as social security number, gender, age and income level. This information can be
linked to medical treatment data within the hospital to find out whether all low-income patients
have received subsidies and whether any patients with higher incomes have received subsidies
improperly, as illustrated in the sketch below. Other external information may, for example,
involve the relationships among people. Credit card firms such as Mastercard and Visa have a
huge population of customers and good knowledge of whether one consumer is related to
another, obtained by analyzing the use and repayment of their credit cards. This could be quite
useful for auditors to detect fraud and corruption.
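
As a minimal sketch of how such a linkage might look in practice, the following Python (pandas) example joins a hypothetical poverty dataset to hypothetical hospital subsidy records; the column names (social_security_no, income_level, subsidy_amount) and the flagging rules are illustrative assumptions, not part of any specific audit.

    import pandas as pd

    # Hypothetical external dataset acquired from the social security administration
    poverty = pd.DataFrame({
        "social_security_no": ["S001", "S002", "S003"],
        "income_level": ["low", "low", "high"],
    })

    # Hypothetical internal hospital data on treatments and subsidies paid
    treatments = pd.DataFrame({
        "social_security_no": ["S001", "S002", "S003", "S004"],
        "subsidy_amount": [120.0, 0.0, 80.0, 50.0],
    })

    # Link the external and internal data on the common key (social security number)
    linked = treatments.merge(poverty, on="social_security_no", how="left")

    # Low-income patients who did not receive any subsidy
    missed = linked[(linked["income_level"] == "low") & (linked["subsidy_amount"] == 0)]

    # Patients with a higher (or unknown) income who nevertheless received a subsidy
    improper = linked[(linked["income_level"] != "low") & (linked["subsidy_amount"] > 0)]

    print(missed)
    print(improper)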

Another type of fast-growing external data is data from social media. These data come from
parties like Facebook, Twitter, LinkedIn, and WeChat, all of which have huge user bases, so that
often a substantial proportion of the people connected to an audited entity participate in these
social media. Linking these data to the audit data environment is quite a challenge and not yet
a common practice, mainly because of the highly unstructured way in which these data are
created. Some companies collect these data by using platforms such as Radian6 for social media
monitoring. Such data could also be used to perceive social trends and assist the strategic
planning of the SAI.

Internal data sources are already present within audited entities and may include financial
statement data, transaction data, invoice data, contact data, and usage data. These internal data
sources are gathered and stored in the audited entity’s information systems. Internal data form
the basis for auditors’ work, since the audited entity’s behavior is recorded there and can be
described and understood by extracting insights from these data.

6.1.3.2 Structured versus unstructured data


An alternative way to categorize data sources is to make a distinction between structured and
unstructured data sources. Structured data come in a fixed format, based on a detailed record
and variable structure, good labeling of values in the database, and high data quality. The invoice
data mentioned earlier (internal data) and social security data (external data) are good examples
of highly structured data. On the other side of the coin we have unstructured data. These data
are often very bulky, have no fixed format, contain a lot of free-format text and often need a
good deal of interpretation and reduction in order to create usable information. Examples of
such data sources are data from an audited entity’s management documents (internal data),
where decisions or remarks are often registered in free-format text, and social media data
(external data) contained in Twitter messages, Facebook comments, etc.

Historically, the majority of data stored within audited entities has been structured data,
maintained within relational - or even legacy hierarchical or flat-file - databases. This allows for
repeatable queries, as much of the data is held in relational tables, and it is often easier to control
than unstructured data, due to defined ownership and vendor-supported database solutions.
However, the use of unstructured data is growing and becoming more common within audited
entities. This type of data is not confined to traditional data structures or constraints. It is
typically more difficult to manage, due to its evolving and unpredictable nature, and it is usually
sourced from large, disparate, and often external data sources. Consequently, new solutions
have been developed to manage and analyze this type of data. See Figure 6.2 for a diagram that
shows the difference between structured and unstructured data.
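
The difference can be illustrated with a small, hypothetical Python sketch: structured records can be queried directly by column, while unstructured free text must first be interpreted (here by a naive keyword scan) before it yields comparable information. The invoice fields and remark texts are assumptions made for the example.

    import pandas as pd

    # Structured data: fixed columns allow repeatable queries
    invoices = pd.DataFrame({
        "invoice_id": [1, 2, 3],
        "supplier": ["Acme", "Beta", "Acme"],
        "amount": [1000.0, 250.0, 400.0],
    })
    total_per_supplier = invoices.groupby("supplier")["amount"].sum()

    # Unstructured data: free-format text needs interpretation before analysis
    remarks = [
        "Payment approved by the director after a late delivery.",
        "Contract extended; supplier Acme flagged for quality issues.",
    ]
    flagged = [text for text in remarks if "flagged" in text.lower()]

    print(total_per_supplier)
    print(flagged)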

Figure 6.2: Examples of Structured and Unstructured Data

6.1.4 Data Storage

6.1.4.1 Data Warehouse versus Data Lake


A large repository of enterprise-wide data specifically designed for analytics and reporting is
known as a data warehouse (see Figure 6.3). Data warehouses are typically relational databases
that store information from multiple sources. Data is uploaded from operational systems that
contain transaction data to the data warehouse, which stores complete information about one
or more subjects. ETL (extract, transform, and load) or ELT (extract, load, and transform) tools
are configured to move data from the operational systems to the data warehouse. The data is
loaded in the format and structure of the data warehouse, and is often aggregated. The biggest
drawback is flexibility: traditional data warehouses can only operate on data they know about.
They have a fixed schema and are not very flexible at handling unstructured data. They are good
for transaction analytics, in which decisions must be made quickly based upon a defined set of
data elements, but they are less effective in applications in which relationships aren't well
defined.
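
As an illustrative sketch of such an ETL step, the example below extracts hypothetical transaction records, transforms them into an aggregated form, and loads the result into a SQLite database standing in for the data warehouse; the table and column names are assumptions made for the example.

    import sqlite3
    import pandas as pd

    # Extract: hypothetical transaction data from an operational system
    transactions = pd.DataFrame({
        "department": ["Health", "Health", "Education"],
        "year": [2023, 2023, 2023],
        "amount": [500.0, 300.0, 700.0],
    })

    # Transform: aggregate to the level stored in the warehouse
    summary = transactions.groupby(["department", "year"], as_index=False)["amount"].sum()

    # Load: write the aggregated data into the warehouse (SQLite used as a stand-in)
    with sqlite3.connect("warehouse.db") as conn:
        summary.to_sql("spending_summary", conn, if_exists="replace", index=False)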

Figure 6.3: Data Warehouse

Data lakes (see Figure 6.4) are becoming an increasingly popular solution to support big data
storage and data discovery. Data lakes are similar to data warehouses in that they store large
amounts of data from various sources, but they can store data in its natural format, which
facilitates the collocation of data in various schemata and structural forms. The idea of a data
lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy
of the source system data) to transformed data used for various tasks including reporting,
visualization, analytics and machine learning. The data lake includes structured data from
relational databases and unstructured data from different sources, thus creating a centralized
data store accommodating all forms of data. This technology gives organizations the ability to
store and use big data, allowing them to embrace non-traditional data types. Hadoop, an open
source big data framework that has grown rapidly over the last decade, aligns well with the data
lake concept because of the sheer amount of data it can handle.

Figure 6.4: Data Lake

Table 6.1 below highlights a few of the key differences between a data warehouse and a data
lake.

Table 6.1: Data Warehouse vs. Data Lake

•  Data. A data warehouse only stores data that has been modeled/structured, while a data
   lake is no respecter of data: it stores everything, structured and unstructured.
•  Processing. Before we can load data into a data warehouse, we first need to give it some
   shape and structure - i.e., we need to model it. That's called schema-on-write. With a data
   lake, we just load in the raw data and give it shape and structure only when we are ready to
   use it. That's called schema-on-read (see the sketch after this list).
•  Storage. One of the primary features of big data technologies like Hadoop is that the cost
   of data storage is relatively low compared to a data warehouse. There are two key reasons
   for this: first, Hadoop is open source software, so the licensing and community support are
   free; and second, Hadoop is designed to be installed on low-cost commodity hardware.
•  Agility. A data warehouse is, by definition, a highly structured repository. It is not technically
   hard to change its structure, but doing so can be very time-consuming given all the processes
   that are tied to it. A data lake, on the other hand, lacks the rigid structure of a data warehouse,
   which gives developers the ability to easily configure and reconfigure their models, queries,
   and apps.
•  Security. Data warehouse technologies have been around for decades, while big data
   technologies are relatively new. Thus, the ability to secure data in a data warehouse is much
   more mature than securing data in a data lake. It should be noted, however, that significant
   effort is currently being placed on security in the big data industry. It's not a question of if,
   but when.
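
The contrast between schema-on-write and schema-on-read can be sketched as follows; the tables, records and field names are hypothetical, and SQLite and JSON lines merely stand in for warehouse and lake storage.

    import json
    import sqlite3
    import pandas as pd

    # Schema-on-write (data warehouse style): the table structure is defined first,
    # and only data that fits that schema can be loaded.
    with sqlite3.connect(":memory:") as conn:
        conn.execute("CREATE TABLE payments (payment_id INTEGER, amount REAL)")
        conn.execute("INSERT INTO payments VALUES (1, 99.5)")
        print(conn.execute("SELECT * FROM payments").fetchall())

    # Schema-on-read (data lake style): raw records are stored as-is, and structure
    # is imposed only at the moment the data is read for analysis.
    raw_records = [
        '{"payment_id": 2, "amount": 120.0, "note": "late payment"}',
        '{"payment_id": 3, "amount": 75.0}',
    ]
    payments = pd.DataFrame([json.loads(line) for line in raw_records])
    print(payments)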

6.1.4.2 Database structures


Within an audited entity, several databases with relevant operational data are usually present.
For example, hospitals may have databases on patient information (e.g. disease, prescription,
and treatment), medication inventory, invoices, doctor information, etc. In a data warehouse all
these data sources are integrated into one large database. The central focus of this database is
the patient. However, not only patient information should be retrievable from this database, but
also other information, for example information on doctor performance.

An important element of good databases is that the data are arranged in such a way that they
can easily be retrieved by users. The appropriate database structure may, however, depend on
the user. The director of a specific division in the hospital may be interested in the performance
of subordinate doctors, while the inventory manager likely wants to know which kinds of
medicine are most often prescribed. An auditor focusing on procurement is most likely interested
in suppliers, including quantities and prices. To overcome this hurdle, relational database
structures are currently the standard. Relational databases use key variables that link several
tables to each other. For example, in a management context the patient id and the doctor id are
usually key variables; in an inventory context the doctor id can again be a key variable, as can
the medicine id. Based on the type of information required, a primary key variable, a secondary
key variable, etc. can be distinguished. In a patient database the patient id is the primary key
variable, while in an inventory database the medicine id is the primary key variable.
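
A minimal sketch of how such key variables link tables is given below; the table layouts and doctor names are hypothetical (the ids follow the example in Table 6.2 below), and SQLite is used only as a convenient stand-in for the audited entity's relational database.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (patient_id TEXT PRIMARY KEY, "
                 "patient_name TEXT, doctor_id TEXT, disease_id TEXT)")
    conn.execute("CREATE TABLE doctors (doctor_id TEXT PRIMARY KEY, doctor_name TEXT)")
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?, ?)", [
        ("PA1001", "Tom", "DO001", "DI0001"),
        ("PA1002", "Johnson", "DO001", "DI0002"),
    ])
    conn.executemany("INSERT INTO doctors VALUES (?, ?)", [
        ("DO001", "Dr. A"),
        ("DO002", "Dr. B"),
    ])

    # The doctor id is the key variable linking the two tables, so patient
    # records can be combined with doctor information in a single query.
    rows = conn.execute(
        "SELECT p.patient_id, p.patient_name, d.doctor_name "
        "FROM patients p JOIN doctors d ON p.doctor_id = d.doctor_id"
    ).fetchall()
    print(rows)
    conn.close()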

Table 6.2 shows an example of the structure of a patient database. In this example, the patient
id is the primary key variable. The patient name provides further information on that patient.
The doctor id is a secondary key variable, and the disease id is another. One can easily add more
key variables, such as prescription and medicine ids. From this database one can, in principle,
derive a database with the other non-primary key variables as key variables. For example, if one
wanted to take the disease id as a key variable, a database could be created through aggregation
and transformation procedures which shows how many patients suffer from each specific
disease and which doctor is most experienced in treating that disease (see Table 6.3).

Patient id    Patient name    Doctor id    Disease id
PA1001        Tom             DO001        DI0001
PA1002        Johnson         DO001        DI0002
PA1003        Lucy            DO002        DI0003
PA1004        Thomas          DO002        DI0004

Table 6.2: Example of a simple data table with the patient as the central element

Disease id    Number of patients    Most experienced doctor id
DI0001        356                   DO002
DI0002        168                   DO004
DI0003        122                   DO003
DI0004        67                    DO001

Table 6.3: Example of a disease data table derived from the patient database
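
A minimal sketch of such an aggregation is shown below, using pandas on the few illustrative rows of Table 6.2; "most experienced" is approximated here by the doctor who treated the most cases of each disease, which is an assumption made for the example.

    import pandas as pd

    # Patient-level data as in Table 6.2
    patients = pd.DataFrame({
        "patient_id": ["PA1001", "PA1002", "PA1003", "PA1004"],
        "doctor_id": ["DO001", "DO001", "DO002", "DO002"],
        "disease_id": ["DI0001", "DI0002", "DI0003", "DI0004"],
    })

    # Aggregate to a disease-level view: number of patients per disease and,
    # as a simple proxy for experience, the doctor who treated most cases
    per_disease = patients.groupby("disease_id").agg(
        number_of_patients=("patient_id", "count"),
        most_frequent_doctor_id=("doctor_id", lambda s: s.mode().iloc[0]),
    ).reset_index()

    print(per_disease)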

6.1.5 Data quality


Data quality has several dimensions: completeness of the data, the data being up to date, and
the absence of mistakes in the data. Completeness refers to whether all available data are
present for all records. Mistakes frequently occur, for example, in a customer management
database, especially with regard to customer descriptors such as contact number and address.
These mistakes may arise when customers write down unclear addresses on forms, perhaps on
purpose, or when typographical errors mean that data entry is not done correctly. Being up to
date concerns whether the data are updated on a sufficiently frequent basis. A database that is
not up to date can easily contain mistakes in all kinds of variables. For example, if a customer has
moved to another address, the address in the database will be wrong if it has not been updated.
Or if integration of databases is not done frequently, a recent product purchase or a recent
defection might not yet have been included, leading to unreliable information. Mistakes can
potentially have strong negative consequences for the audited entity. For example, a social
security department may continue paying a pension to someone who has recently passed away.
Moreover, an audited entity's data not being up to date may also lead to incorrect audit findings.
There are several options available to solve problems with data quality. One option is to use
reference databases for cleaning historical data quality issues in the databases; these can also
be used as reference tables during data entry. Another option is to use a software approach to
data cleaning, for example by using tools to recognize duplicates within a dataset, or to recognize
different ways of registering a specific address, as sketched below.
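
The sketch below illustrates both options on hypothetical supplier records: exact duplicates are detected, and a small, assumed reference mapping standardizes different spellings of the same address.

    import pandas as pd

    # Hypothetical supplier records with a duplicate and inconsistent address spellings
    suppliers = pd.DataFrame({
        "supplier_id": ["S01", "S01", "S02"],
        "address": ["12 Main St.", "12 Main St.", "12 main street"],
    })

    # Recognize exact duplicate records
    duplicates = suppliers[suppliers.duplicated()]

    # Use a simple reference mapping to standardize address variants
    reference = {"12 main street": "12 Main St.", "12 main st.": "12 Main St."}
    cleaned = suppliers.drop_duplicates().copy()
    cleaned["address"] = cleaned["address"].str.lower().map(reference).fillna(cleaned["address"])

    print(duplicates)
    print(cleaned)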

Data will never be perfect. One important problem auditors may face is missing values for
variables in the dataset. For example, there may be no information on contact number or address
for some of the audited entity's suppliers. Missing values are a common occurrence; they reduce
the representativeness of the dataset and can distort inferences and conclusions drawn from the
data. One easy way to deal with missing data is to throw away observations with missing values.
This may, however, cause sampling problems, especially when these missing values occur
frequently in a non-random fashion - for example, mainly for suppliers relating to a specific
project. Data may be missing because suppliers were not asked to provide them, but missing
data may also have another meaning. For instance, suppliers may deliberately withhold data
because the quality of the products they provided may not be so good, or because fraud may be
hidden behind the purchase. In the latter case missing values can potentially be predictors of
behavior; for example, auditors might assume that suppliers with missing values are more likely
to commit fraud. In general, auditors should therefore carefully analyze the reasons for multiple
missing values and consider whether these missing values occur in a random or non-random
fashion. Understanding the reasons for and the nature of missing values is important for handling
the remaining data appropriately. If the missing values occur at random, one could probably
delete the observations with missing values or replace the missing values with, for example, the
mean, median or mode of the available values, as sketched below.
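
The sketch below shows both approaches on hypothetical supplier data: dropping observations with missing values, and replacing a missing numeric value with the median of the available values.

    import pandas as pd

    # Hypothetical supplier data with missing contact numbers and prices
    suppliers = pd.DataFrame({
        "supplier_id": ["S01", "S02", "S03", "S04"],
        "contact_number": ["555-0101", None, "555-0103", None],
        "unit_price": [10.0, 12.0, None, 11.0],
    })

    # Inspect how often values are missing before deciding how to handle them
    print(suppliers.isna().sum())

    # Option 1: drop observations with any missing value
    dropped = suppliers.dropna()

    # Option 2: replace a missing numeric value with the median of the available values
    imputed = suppliers.copy()
    imputed["unit_price"] = imputed["unit_price"].fillna(imputed["unit_price"].median())

    print(dropped)
    print(imputed)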

6.1.6 Summary
In this session, we started with the characteristics of big data: four common Vs and five additional
Vs that define it. We also made a distinction between different data sources, grouping them in
two ways: structured versus unstructured and internal versus external. We stressed that big data
is not just about internal structured sources, but also involves dealing with the other sources.
The description of database structures and of possible issues with data quality and missing
values should help in tackling practical issues around working with big data.
