
CP5293 BIG DATA ANALYTICS                                L T P C
                                                         3 0 0 3
OBJECTIVES:
To understand the competitive advantages of big data analytics
To understand the big data frameworks
To learn data analysis methods
To learn stream computing
To gain knowledge on Hadoop related tools such as HBase, Cassandra, Pig, and Hive for big
data analytics.

UNIT I INTRODUCTION TO BIG DATA 7


Big Data – Definition, Characteristic Features – Big Data Applications - Big Data vs Traditional Data -
Risks of Big Data - Structure of Big Data - Challenges of Conventional Systems - Web Data – Evolution of
Analytic Scalability - Evolution of Analytic Processes, Tools and methods - Analysis vs Reporting -
Modern Data Analytic Tools.
UNIT II HADOOP FRAMEWORK 9
Distributed File Systems - Large-Scale FileSystem Organization – HDFS concepts - MapReduce Execution,
Algorithms using MapReduce, Matrix-Vector Multiplication – Hadoop YARN.
UNIT III DATA ANALYSIS 13
Statistical Methods: Regression modelling, Multivariate Analysis - Classification: SVM & Kernel Methods - Rule
Mining - Cluster Analysis, Types of Data in Cluster Analysis, Partitioning Methods, Hierarchical Methods,
Density Based Methods, Grid Based Methods, Model Based Clustering Methods, Clustering High
Dimensional Data - Predictive Analytics – Data analysis using R.
UNIT IV MINING DATA STREAMS 7
Streams: Concepts – Stream Data Model and Architecture - Sampling data in a stream - Mining Data
Streams and Mining Time-series data - Real Time Analytics Platform (RTAP) Applications - Case Studies
- Real Time Sentiment Analysis, Stock Market Predictions.
UNIT V BIG DATA FRAMEWORKS 9
Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase Clients –
Examples – Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt
– Pig Data Model – Pig Latin – developing and testing Pig Latin scripts. Hive – Data Types and File
Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries.
TOTAL: 45 PERIODS
OUTCOMES: At the end of this course, the students will be able to:
Understand how to leverage the insights from big data analytics
Analyze data by utilizing various statistical and data mining approaches
Perform analytics on real-time streaming data
Understand the various NoSQL alternative database models
REFERENCES:
1. Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with
Advanced Analytics”, Wiley and SAS Business Series, 2012.
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools,
Techniques, NoSQL, and Graph", 2013.
3. Richard Cotton, "Learning R: A Step-by-Step Function Guide to Data Analysis", O'Reilly Media, 2013.
4. Michael Berthold, David J. Hand, “Intelligent Data Analysis”, Springer, Second Edition, 2007.
5. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
6. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot
Persistence", Addison-Wesley Professional, 2012.
Unit 1
Part A
1. What is big data?
Big data refers to data sets that are so voluminous and complex that traditional data-processing
application software is inadequate to deal with them. Big data challenges include capturing
data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating
and information privacy. There are three dimensions to big data known as Volume, Variety and
Velocity.
2. What are the characteristics of big data?
Big data can be described by the following characteristics
Volume
The quantity of generated and stored data. The size of the data determines the value and
potential insight, and whether it can actually be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to effectively use the
resulting insight.
Velocity
In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
Variability
Inconsistency of the data set can hamper processes to handle and manage it.
Veracity
The data quality of captured data can vary greatly, affecting accurate analysis.

3. Difference between big data and traditional data


Traditional database systems are based on structured data, i.e. traditional data is
stored in a fixed format or in fields in a file.
Big data uses semi-structured and unstructured data and improves the variety of
the data gathered from different sources such as customers, audience or subscribers.
4. Explain the risks of big data.
Questions about big data and analytics raise risks that have three components: risk of
error, legal impact, and ethical breach. Anticipation of such effects could prompt public and
government scrutiny, leading to regulation that could constrain the use of big data for positive
purposes.
5. What is big data analytics?
Big data analytics is the process of examining large and varied data sets -- i.e., big data --
to uncover hidden patterns, unknown correlations, market trends, customer preferences and other
useful information that can help organizations make more-informed business decisions.
6. What is the main source of big data?
Big data sources are repositories of large volumes of data. ... This
brings more information to users' applications without requiring that the data be held in a single
repository or cloud vendor proprietary data store. Examples of big data sources are Amazon
Redshift, HP Vertica, and MongoDB.
7. What is web data?
Web data is data that comes from a large or diverse number of sources. Web
data is developed with the help of Semantic Web tools such as RDF, OWL, and SPARQL.
Web data also allows sharing of information through the HTTP protocol or a SPARQL endpoint.
8. List out some data analytic tools.
Trifacta
Rapid Miner
Rattle GUI
Qlikview
Weka
KNIME
Orange.
9. What are the challenges of big data?
Data challenges
Volume, velocity, veracity, variety; data discovery and comprehensiveness; scalability
Process challenges
Capturing data; aligning data from different sources; transforming data into a form suitable
for data analysis; modeling data (mathematically, by simulation); understanding output,
visualizing results and display issues on mobile devices
Management challenges
Security, privacy, governance, ethical issues
10. What are the trends in big data analytics?
1. Big Data Analytics in the cloud
2. Hadoop: The new enterprise data operating system
3. Big Data lakes
4. More predictive analytics
5. SQL on Hadoop: Faster, better
6. More, better NoSQL
7. Deep learning
8. In-memory analytics
Unit II
Part A
1. What is Hadoop YARN?
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is one of the key features in the second-generation Hadoop 2 version of the
Apache Software Foundation's open source distributed processing framework.
YARN is a component of the MapReduce project
created to overcome some performance issues in Hadoop's original design. MapReduce Version
2 is a re-write of the original MapReduce code that runs as an application on top of YARN.
2. What is DFS?
A distributed file system is a client/server-based application that allows clients to access
and process data stored on the server as if it were on their own computer. When a user accesses a
file on the server, the server sends the user a copy of the file, which is cached on the user's
computer while the data is being processed and is then returned to the server.
3. What is Hadoop?
Hadoop is an open source big data framework deployed on a distributed cluster of nodes
that allows processing of big data. Hadoop uses commodity hardware for large-scale
computation, and hence provides a cost benefit to enterprises.
4. Define MapReduce concepts.
MapReduce™ is the heart of Apache™ Hadoop®. It is this programming paradigm that
allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The
MapReduce concept is fairly simple to understand for those who are familiar with clustered
scale-out data processing solutions.
The Hadoop MapReduce framework takes these concepts and uses them to process large
volumes of information. A MapReduce program has two components: one that implements the
mapper, and another that implements the reducer.
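
To make the mapper/reducer split concrete, here is a minimal, framework-free Python sketch of the
classic word-count job. It only simulates what Hadoop would do across a cluster (the grouping step
stands in for Hadoop's shuffle phase), and the function names are illustrative, not part of any
Hadoop API.

from collections import defaultdict

def mapper(document):
    # Emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum all the counts that arrived for the same key.
    return (word, sum(counts))

def run_job(documents):
    # Shuffle step: group intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    docs = ["big data needs big clusters", "hadoop processes big data"]
    print(run_job(docs))   # e.g. {'big': 3, 'data': 2, ...}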
5. What is HDFS?
HDFS is a distributed file system that provides high-performance access to data across
Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for
managing pools of big data and supporting big data analytics applications.
6. List out the core components of Hadoop.
MapReduce – A software programming model for processing large sets of data in
parallel
HDFS – The Java-based distributed file system that can store all kinds of data without
prior organization.
YARN – A resource management framework for scheduling and handling resource
requests from distributed applications.
7. What are the key advantages of Hadoop?
A key advantage of using Hadoop is its fault tolerance. ... When it comes to handling
large data sets in a safe and cost-effective manner, Hadoop has the advantage over relational
database management systems, and its value for any size of business will continue to increase as
unstructured data continues to grow.
8. List out the Hadoop applications.
Making Hadoop Applications More Widely Accessible. Apache Hadoop, the open source
MapReduce framework, has dramatically lowered the cost barriers to processing and analyzing
big data. ...
A Graphical Abstraction Layer on Top of Hadoop Applications. ...
Hadoop Applications, Seamlessly Integrated.
9. What are the characteristics of Hadoop?
The prominent characteristics of Hadoop: Hadoop provides a reliable shared storage
(HDFS) and analysis system (MapReduce). Hadoop is highly scalable and, unlike relational
databases, Hadoop scales linearly. Due to linear scaling, a Hadoop cluster can contain tens,
hundreds, or even thousands of servers.
10. What is matrix inversion?
Matrix inversion is a fundamental operation for solving linear equations in many
computational applications, especially in various emerging big data applications. However, it is
a challenging task to invert large-scale matrices of extremely high order (several thousands or
millions), which are common in most web-scale systems such as social networks and
recommendation systems. One well-known approach is an LU decomposition-based block-recursive
algorithm for large-scale matrix inversion.
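
The block-recursive algorithm mentioned above is beyond a short answer, but the basic idea of
inverting a matrix via LU decomposition can be sketched with NumPy/SciPy (an assumption here,
since no library is named in the text): factor A once, then solve against the columns of the identity
matrix. This is only a small, single-machine illustration, not the large-scale distributed algorithm.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def invert_via_lu(A):
    # Factor A as P*L*U once, then reuse the factorization to solve
    # A * X = I; the solution X is the inverse of A.
    lu, piv = lu_factor(A)
    identity = np.eye(A.shape[0])
    return lu_solve((lu, piv), identity)

if __name__ == "__main__":
    A = np.array([[4.0, 3.0], [6.0, 3.0]])
    A_inv = invert_via_lu(A)
    print(np.allclose(A @ A_inv, np.eye(2)))   # True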

Unit III
Part A
1. What is classification?
Classification is a general process related to categorization, the process in which ideas
and objects are recognized, differentiated, and understood.
A classification system is an approach to accomplishing classification.
2. What is clustering?
Clustering can be considered the most important unsupervised learning problem; as with
every other problem of this kind, it deals with finding a structure in a collection of unlabeled
data. A loose definition of clustering could be "the process of organizing objects into groups
whose members are similar in some way".
A cluster is therefore a collection of objects which are "similar" to one another and are
"dissimilar" to the objects belonging to other clusters.
3. What are the different types of regression models?
Linear Regression. It is one of the most widely known modeling techniques.
Logistic Regression
Polynomial Regression
Stepwise Regression
Ridge Regression
Lasso Regression.
Elastic Net Regression
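
As a hedged illustration of the first two entries (using Python and scikit-learn here, which are
assumptions on our part since the syllabus itself works in R): ordinary linear regression fits a
straight line to a continuous target, while logistic regression predicts the probability of a binary
outcome. The data values below are made up.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a continuous target from one feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
lin = LinearRegression().fit(X, y)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)

# Logistic regression: predict the probability of a 0/1 outcome.
X_cls = np.array([[20], [35], [50], [65]])   # e.g. age (illustrative data)
y_cls = np.array([0, 0, 1, 1])               # e.g. whether an event occurred
log = LogisticRegression().fit(X_cls, y_cls)
print("P(event | age=45):", log.predict_proba([[45]])[0, 1])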
4. What is the difference between correlation and regression?
Correlation and linear regression are not the same. Correlation quantifies the degree to
which two variables are related. Correlation does not fit a line through the data points. You
simply are computing a correlation coefficient (r) that tells you how much one variable tends to
change when the other one does.
5. What is rule mining?
Association rule mining is a procedure which is meant to find frequent patterns,
correlations, associations, or causal structures from data sets found in various kinds of databases
such as relational databases, transactional databases, and other forms of data repositories.
6. What is predictive analytics?
Predictive analytics encompasses a variety of statistical techniques from predictive
modelling, machine learning, and data mining that analyze current and historical facts to make
predictions about future or otherwise unknown events.
7. List out the clustering methods.
Partitioning Methods
Hierarchical Methods
Density based methods
Grid based methods
Model based clustering methods
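
As a short sketch (again using Python's scikit-learn rather than R, as an assumed stand-in), the
snippet below runs one representative algorithm from three of the families listed above: k-means
(partitioning), agglomerative clustering (hierarchical), and DBSCAN (density based).

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two small, well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

print(KMeans(n_clusters=2, n_init=10).fit_predict(X))         # partitioning method
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))   # hierarchical method
print(DBSCAN(eps=1.0, min_samples=2).fit_predict(X))          # density based method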
8. What is R?
R is a language and environment for statistical computing and graphics. Programming
with Big Data in R (pbdR) is a series of R packages and an environment for statistical
computing with big data by using high-performance statistical computation.
Two main implementations in R using MPI are Rmpi and pbdMPI of pbdR.
9. What are the characteristics of data analysis?
There are five data characteristics that are the building blocks of an efficient data
analytics solution: accuracy, completeness, consistency, uniqueness, and timeliness.
10. What is data analysis?
Data analysis, also known as analysis of data or data analytics, is a process of
inspecting, cleansing, transforming, and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting decision-making.
It is the process of evaluating data using analytical and logical reasoning to examine each
component of the data provided. ... There are a variety of specific data analysis methods, some
of which include data mining, text analytics, business intelligence, and data visualization.
Unit IV
Part A
1. What do you mean by a data stream?
In connection-oriented communication, a data stream is a sequence of digitally encoded
coherent signals (packets of data or data packets) used to transmit or receive information that is
in the process of being transmitted.

2. Differentiate between DBMS and DSMS.

Database Systems (DBS):
• Persistent relations (relatively static, stored)
• One-time queries
• Random access
• "Unbounded" disk store
• Only current state matters
• No real-time services
• Relatively low update rate
• Data at any granularity
• Assume precise data
• Access plan determined by query processor, physical DB design

DSMS:
• Transient streams (on-line analysis)
• Continuous queries (CQs)
• Sequential access
• Bounded main memory
• Historical data is important
• Real-time requirements
• Possibly multi-GB arrival rate
• Data at fine granularity
• Data stale/imprecise
• Unpredictable/variable data arrival and characteristics

3. List out the applications of DSMS.

• Sensor networks: monitoring of sensor data from many sources, complex filtering, activation
of alarms, aggregation and joins over single or multiple streams
• Network traffic analysis: analyzing Internet traffic in near real-time to compute traffic
statistics and detect critical conditions
• Financial tickers: on-line analysis of stock prices, discovering correlations, identifying trends
• On-line auctions
• Transaction log analysis, e.g., Web, telephone calls

4. What is data stream mining?

Data stream mining is the process of extracting knowledge structures from continuous,
rapid data records. A data stream is an ordered sequence of instances that in many applications
of data stream mining can be read only once or a small number of times using limited
computing and storage capabilities.

5. What do you mean by real-time analytics?

Real-time analytics is the use of, or the capacity to use, data and related resources as soon
as the data enters the system. ... Real-time analytics is also known as dynamic analysis, real-time
analysis, real-time data integration and real-time intelligence.

6. What is the definition of real-time data?


Real-time data (RTD) is information that is delivered immediately after collection. There
is no delay in the timeliness of the information provided. Real-time data is often used for
navigation or tracking. Some uses of the term "real-time data" confuse it with the term
dynamic data.

7. Why do we need RTAP?

RTAP addresses the following issues in traditional or existing RDBMS systems:

 Server-based licensing is too expensive for large DB servers

 Slow processing speed

 Little support for tools for data extraction outside the data warehouse

 Copying large datasets into the system is too slow

 Workload differences among jobs

 Data kept in files and folders is difficult to manage

8. What is regression?
• Predicts the quantity or probability of an outcome
• What is the likelihood of heart attack, given age, weight, …?
• What is the expected profit a customer will generate?
• What is the forecasted price of a stock?
• Algorithms: Logistic, Linear, Polynomial, Transform
9. What is a Real Time Analytics Platform (RTAP)?
A Real Time Analytics Platform (RTAP) analyzes data, correlates it and predicts outcomes
on a real-time basis. The platform enables enterprises to track things in real time on a
worldwide basis and helps in timely decision making. The platform allows us to build a range
of powerful analytic applications.

10. What is sampling data in a stream?

Sampling from a finite stream is a special case of sampling from a stationary window in
which the window boundaries correspond to the first and last stream elements. The foregoing
schemes fall into the category of equal-probability sampling because each window element is
equally likely to be included in the sample.
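
One standard equal-probability scheme for a stream whose length is not known in advance is
reservoir sampling; the sketch below (in Python) keeps a fixed-size sample in which every element
seen so far is equally likely to appear. It is offered as one concrete instance of the
equal-probability sampling described above, not as the only scheme.

import random

def reservoir_sample(stream, k):
    # Keep k elements such that every stream element seen so far
    # is equally likely to be in the sample.
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # keep the new item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))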

Unit V
Part A
1. What is NoSQL?
A NoSQL (originally referring to "non SQL" or "non relational") database provides a
mechanism for storage and retrieval of data that is modeled in means other than the tabular
relations used in relational databases. ... NoSQL databases are increasingly used in big data and
real-time web applications.
2. Why do we need NoSQL?
A relational database may require vertical, and sometimes horizontal, expansion of servers
to expand as data or processing requirements grow. An alternative, more cloud-friendly approach
is to employ NoSQL. ... NoSQL is a whole new way of thinking about a database. NoSQL is not
a relational database.
3. What is HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's
Bigtable and is written in Java. It is developed as part of the Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-
like capabilities for Hadoop.
4. What is the difference between HBase and Hive?
Despite providing SQL functionality, Hive does not provide interactive querying yet - it
only runs batch processes on Hadoop. Apache HBase is a NoSQL key/value store which runs on
top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than
MapReduce jobs.
5. What is the difference between Pig and Hive?
Depending on the purpose and type of data, you can choose to use either the Hive
Hadoop component or the Pig Hadoop component, based on the following difference:
1) The Hive Hadoop component is used mainly by data analysts,
whereas the Pig Hadoop component is generally used by researchers and programmers.
6. What is Pig in Hadoop?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and appeals to developers already familiar with scripting
languages and SQL.
7. What is Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which
makes MapReduce programming high level, similar to that of SQL for relational database
management systems.
8. What are Pig, Hive and HBase?
Pig is used for data transformation tasks. If you have a file and want to extract useful
information from it, join two files, or perform any other transformation, then use Pig. Hive is used to
query these files by defining a "virtual" table and running SQL-like queries on those
tables. HBase is a full-fledged NoSQL database.
9. What is a Cassandra client?
cassandra-client is a Node.js CQL 2 driver for Apache Cassandra 0.8 and later. CQL is a
query language for Apache Cassandra. You use it in much the same way you would use SQL
for a relational database. The Cassandra documentation can help you learn the syntax.
10. List out the types of built-in operators in Hive.
Types of built-in operators in Hive are:

 Relational Operators
 Arithmetic Operators
 Logical Operators
 Operators on Complex types
 Complex type Constructors
Unit 1

Part B
1. Explain the structure of big data.
As you read about big data, you will come across a lot of discussion on the concept of
data being structured, unstructured, semi-structured, or even multi-structured. Big data is often
described as unstructured and traditional data as structured. The lines aren’t as clean as such
labels suggest, however. Let’s explore these three types of data structure from a layman’s
perspective. Highly technical details are out of scope for this book. Most traditional data sources
are fully in the structured realm. This means traditional data sources come in a clear, predefined
format that is specified in detail. There is no variation from the defined formats on a day-to-day
or update-to-update basis. For a stock trade, the first field received might be a date in a
MM/DD/YYYY format. Next might be an account number in a 12-digit numeric format. Next
might be a stock symbol that is a three- to five-digit character field. And so on. Every piece of
information included is known ahead of time, comes in a specified format, and occurs in a
specified order. This makes it easy to work with.
Unstructured data sources are those that you have little or no control over. You are going
to get what you get. Text data, video data, and audio data all fall into this classification. A picture
has a format of individual pixels set up in rows, but how those pixels fit together to create the
picture seen by an observer is going to vary substantially in each case. There are sources of big
data that are truly unstructured such as those preceding. However, most data is at least semi-
structured. Semi-structured data has a logical flow and format to it that can be understood, but
the format is not user-friendly. Sometimes semi-structured data is referred to as multi-structured
data. There can be a lot of noise or unnecessary data intermixed with the nuggets of high value in
such a feed. Reading semi-structured data to analyze it isn’t as simple as specifying a fixed file
format. To read semi-structured data, it is necessary to employ complex rules that dynamically
determine how to proceed after reading each piece of information. Web logs are a perfect
example of semi-structured data. Web logs are pretty ugly when you look at them; however, each
piece of information does, in fact, serve a purpose of some sort. Whether any given piece of a
web log serves your purposes is another question. See Figure 1.1 for an example of a raw web
log.
WHAT STRUCTURE DOES YOUR BIG DATA HAVE?
Many sources of big data are actually semi-structured or multistructured, not
unstructured. Such data does have a logical flow to it that can be understood so that information
can be extracted from it for analysis. It just isn’t as easy to deal with as traditional structured data
sources. Taming semi-structured data is largely a matter of putting in the extra time and effort to
figure out the best way to process it.
There is logic to the information in the web log even if it isn’t entirely clear at first
glance. There are fields, there are delimiters, and there are values just like in a structured source.
However, they do not follow each other consistently or in a set way. The log text generated by a
click on a web site right now can be longer or shorter than the log text generated by a click from
a different page one minute from now. In the end, however, it’s important to understand that
semi-structured data does have an underlying logic. It is possible to develop relationships
between various pieces of it. It simply takes more effort than structured data. Analytic
professionals will be more intimidated by truly unstructured data than by semi-structured data.
They may have to wrestle with semi-structured data to bend it to their will, but they can do it.
Analysts can get semi-structured data into a form that is well structured and can incorporate it
into their analytical processes. Truly unstructured data can be much harder to tame and will
remain a challenge for organizations even as they tame semi-structured data.
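
As a small illustration of the "complex rules" needed to read semi-structured data, the Python
sketch below parses one line in the Apache combined log format with a regular expression. The
sample line is made up, and a real pipeline would need more defensive handling of malformed lines.

import re

# Fields of the Apache combined log format:
# host, timestamp, request, status, bytes, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.168.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
        '200 2326 "http://example.com/start" "Mozilla/5.0"')

match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()
    print(record["host"], record["status"], record["request"])
else:
    # Semi-structured data: lines that do not fit the rule must be handled separately.
    print("unparsed line")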

Algorithms Using MapReduce


MapReduce is not a solution to every problem, not even every problem that profitably
can use many compute nodes operating in parallel. As we mentioned in Section 2.1.2, the entire
distributed-file-system milieu makes sense only when files are very large and are rarely updated
in place. Thus, we would not expect to use either a DFS or an implementation of MapReduce for
managing online retail sales, even though a large on-line retailer such as Amazon.com uses
thousands of compute nodes when processing requests over the Web. The reason is that the
principal operations on Amazon data involve responding to searches for products, recording
sales, and so on, processes that involve relatively little calculation and that change the database.
On the other hand, Amazon might use MapReduce to perform certain analytic queries on large
amounts of data, such as finding for each user those users whose buying patterns were most
similar.
The original purpose for which the Google implementation of MapReduce was created
was to execute very large matrix-vector multiplications as are needed in the calculation of
PageRank (See Chapter 5). We shall see that matrix-vector and matrix-matrix calculations fit
nicely into the MapReduce style of computing. Another important class of operations that can
use MapReduce effectively are the relational-algebra operations. We shall examine the
MapReduce execution of these operations as well.
2.3.1 Matrix-Vector Multiplication by MapReduce
Suppose we have an n×n matrix M, whose element in row i and column j will be denoted
m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-
vector product is the vector x of length n, whose ith element x_i is given by

    x_i = \sum_{j=1}^{n} m_{ij} v_j

If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort of
calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n
is in the tens of billions. Let us first assume that n is large, but not so large that vector v cannot
fit in main memory and thus be available to every Map task. The matrix M and the vector v each
will be stored in a file of the DFS. We assume that the row-column coordinates of each matrix
element will be discoverable, either from its position in the file, or because it is stored with
explicit coordinates, as a triple (i, j, m_ij). We also assume the position of element v_j in the vector
v will be discoverable in the analogous way.
The Map Function: The Map function is written to apply to one element of M. However, if v is
not already read into main memory at the compute node executing a Map task, then v is first
read, in its entirety, and subsequently will be available to all applications of the Map function
performed at this Map task. Each Map task will operate on a chunk of the matrix M. From each
matrix element m_ij it produces the key-value pair (i, m_ij v_j). Thus, all terms of the sum that make
up the component x_i of the matrix-vector product will get the same key, i.
The Reduce Function: The Reduce function simply sums all the values associated with a given
key i. The result will be a pair (i, x_i).
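
A minimal single-machine Python sketch of the Map and Reduce functions just described (assuming
v fits in memory): the mapper turns each stored triple (i, j, m_ij) into the pair (i, m_ij * v_j), and the
reducer sums the values for each key i.

from collections import defaultdict

def map_matrix_element(i, j, m_ij, v):
    # Emit the key-value pair (i, m_ij * v_j) for one matrix element.
    return (i, m_ij * v[j])

def matrix_vector_mapreduce(triples, v):
    # Shuffle: collect all partial products that share the same row index i.
    grouped = defaultdict(list)
    for i, j, m_ij in triples:
        key, value = map_matrix_element(i, j, m_ij, v)
        grouped[key].append(value)
    # Reduce: sum the values for each key i, giving the component x_i.
    return {i: sum(values) for i, values in grouped.items()}

if __name__ == "__main__":
    # M = [[1, 2], [3, 4]] stored as (i, j, m_ij) triples; v = [10, 20].
    triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
    v = [10, 20]
    print(matrix_vector_mapreduce(triples, v))   # {0: 50, 1: 110}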
2.3.2 If the Vector v Cannot Fit in Main Memory
However, it is possible that the vector v is so large that it will not fit in its entirety in main
memory. It is not required that v fit in main memory at a compute node, but if it does not then
there will be a very large number of disk accesses as we move pieces of the vector into main
memory to multiply components by elements of the matrix. Thus, as an alternative, we can
divide the matrix into vertical stripes of equal width and divide the vector into an equal number
of horizontal stripes, of the same height. Our goal is to use enough stripes so that the portion of
the vector in one stripe can fit conveniently into main memory at a compute node. Figure 2.4
suggests what the partition looks like if the matrix and vector are each divided into five stripes.
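
The striping idea can be sketched in a few lines of NumPy (a single-machine stand-in for what the
Map tasks would do): split M into vertical stripes and v into matching horizontal stripes, compute a
partial product per stripe, and sum the partial results.

import numpy as np

def striped_matrix_vector(M, v, num_stripes):
    # Column indices for each vertical stripe of M / horizontal stripe of v.
    n = M.shape[1]
    stripes = np.array_split(np.arange(n), num_stripes)
    x = np.zeros(M.shape[0])
    for cols in stripes:
        # Each stripe only needs its own slice of v in memory at one time.
        x += M[:, cols] @ v[cols]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.random((6, 10))
    v = rng.random(10)
    print(np.allclose(striped_matrix_vector(M, v, num_stripes=5), M @ v))   # True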
