OBJECTIVES:
To understand the competitive advantages of big data analytics
To understand the big data frameworks
To learn data analysis methods
To learn stream computing
To gain knowledge on Hadoop related tools such as HBase, Cassandra, Pig, and Hive for big
data analytics.
Unit III
Part A
1, What is classification?
Classification is a general process related to categorization, the process in which ideas
and objects are recognized, differentiated, and understood.
A classification system is an approach to accomplishing classification.
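To make the idea concrete, below is a minimal sketch of training a classifier with scikit-learn
(assumed to be installed); the feature values and category labels are invented purely for illustration.

# Classification: learn a rule from labeled examples, then categorize new objects.
from sklearn.tree import DecisionTreeClassifier

# Each row describes one object with two numeric features (invented values).
X = [[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]]
# Known category labels for the training objects.
y = ["small", "small", "large", "large"]

clf = DecisionTreeClassifier()
clf.fit(X, y)                      # learn the classification rule
print(clf.predict([[6.5, 3.0]]))   # assign a category to a new, unseen object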
2, What is clustering?
Clustering can be considered the most important unsupervised learning problem; so, as
every other problem of this kind, it deals with finding a structure in a collection of unlabeled
data. A loose definition of clustering could be “the process of organizing objects into groups
whose members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” between them and are
“dissimilar” to the objects belonging to other clusters.
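A minimal clustering sketch, assuming scikit-learn is available; the points are invented, carry no
labels, and k-means groups them by similarity.

# Clustering: group unlabeled points so that members of a group are similar.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # centroid of each discovered cluster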
3, What are the different types of regression models?
Linear Regression. It is one of the most widely known modeling techniques
Logistic Regression
Polynomial Regression
Stepwise Regression
Ridge Regression
Lasso Regression.
Elastic Net Regression
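As a rough illustration of three of the models listed above, the sketch below fits linear, ridge,
and lasso regression with scikit-learn (assumed available) on synthetic data invented for this example.

# Compare coefficients from plain, ridge, and lasso linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                # synthetic predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))   # fitted coefficients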
4, What is the difference between correlation and regression?
Correlation and linear regression are not the same. Correlation quantifies the degree to
which two variables are related. Correlation does not fit a line through the data points. You
simply are computing a correlation coefficient (r) that tells you how much one variable tends to
change when the other one does.
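The difference can be seen on one small, invented data set: correlation returns only a coefficient
r, while regression fits a line with a slope and an intercept. NumPy is assumed to be available.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]              # correlation: strength of association only
slope, intercept = np.polyfit(x, y, 1)   # regression: fit the line y = slope*x + intercept
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")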
5, What is rule mining?
Association rule mining is a procedure which is meant to find frequent patterns,
correlations, associations, or causal structures from data sets found in various kinds of databases
such as relational databases, transactional databases, and other forms of data repositories.
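A toy sketch of the support/confidence idea behind association rule mining, in plain Python; the
market-basket transactions are invented for this example.

# Rule {bread} -> {milk}: how often do the items occur together, and how often
# does milk appear given that bread appears?
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread"}, {"milk"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, confidence={confidence:.2f}")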
6, What is predictive analytics?
Predictive analytics encompasses a variety of statistical techniques from predictive
modelling, machine learning, and data mining that analyze current and historical facts to make
predictions about future or otherwise unknown events.
7, List out the Clustering methods?
Partitioning Methods
Hierarchical Methods
Density-based methods
Grid-based methods
Model-based clustering methods
8, What is R?
R is a language and environment for statistical computing and graphics. Programming with
Big Data in R (pbdR) is a series of R packages and an environment for statistical computing
with big data using high-performance statistical computation.
Two main implementations in R using MPI are Rmpi and pbdMPI of pbdR.
9, What are the characteristics of data analysis?
There are five data characteristics that are the building blocks of an efficient data
analytics solution:
accuracy,
completeness,
consistency,
uniqueness, and timeliness.
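Two of these characteristics, completeness and uniqueness, can be checked directly, as in the
sketch below; pandas is assumed to be available and the records are invented.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],                       # duplicate id on purpose
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],  # one missing value
})

completeness = df["email"].notna().mean()   # share of non-missing email values
uniqueness = df["customer_id"].is_unique    # True only if every id is distinct
print(f"completeness={completeness:.0%}, unique ids={uniqueness}")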
10, What is data analysis?
Data analysis, also known as analysis of data or data analytics, is a process of
inspecting, cleansing, transforming, and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting decision-making.
It is the process of evaluating data using analytical and logical reasoning to examine each
component of the data provided. There are a variety of specific data analysis methods, some
of which include data mining, text analytics, business intelligence, and data visualization.
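A small inspect, cleanse, transform, and summarize pass with pandas (assumed available); the
sales records are invented, and modeling itself is illustrated under the classification and
regression questions above.

import pandas as pd

raw = pd.DataFrame({
    "region": ["north", "north", "south", None],
    "sales":  ["100", "250", "bad", "300"],
})

print(raw.dtypes)                                           # inspect the raw data
df = raw.dropna(subset=["region"]).copy()                   # cleanse: drop incomplete rows
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")   # transform: coerce to numbers
print(df.groupby("region")["sales"].sum())                  # summarize per region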
Unit IV
Part A
1, What do you mean by data stream?
In connection-oriented communication, a data stream is a sequence of digitally encoded
coherent signals (packets of data or data packets) used to transmit or receive information that is
in the process of being transmitted.
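As a minimal illustration, the generator below plays the role of a connection delivering packets
over time, and the receiver processes them one by one rather than waiting for the whole sequence;
the packet contents are invented.

import time

def packet_stream():
    """Yield one small 'packet' at a time, simulating arrival over a connection."""
    for seq in range(5):
        yield {"seq": seq, "payload": f"chunk-{seq}"}
        time.sleep(0.1)          # simulated transmission delay

received = []
for packet in packet_stream():   # consume the stream incrementally
    received.append(packet["payload"])
print(" ".join(received))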
8, What is regression?
• Predicts the quantity or probability of an outcome
• What is the likelihood of heart attack, given age, weight, …?
• What is the expected profit a customer will generate?
• What is the forecasted price of a stock?
• Algorithms: Logistic, Linear, Polynomial, Transform
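A hedged sketch of predicting the probability of an outcome with logistic regression in
scikit-learn (assumed available); the age/weight records and labels are invented and are not
real medical data.

from sklearn.linear_model import LogisticRegression

# Features: [age, weight]; label 1 means the outcome occurred, 0 means it did not.
X = [[40, 70], [50, 85], [60, 95], [35, 60], [70, 100], [45, 75]]
y = [0, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba([[55, 90]])[0, 1])   # estimated probability of the outcome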
9, What is Real Time Analytics Platform (RTAP)?
Real Time Analytics Platform (RTAP) analyzes data, correlates and predicts outcomes
on a real time basis. The platform enables enterprises to track things in real time on a
worldwide basis and helps in timely decision-making. It also enables us to build a range of
powerful analytic applications.
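A toy sketch of the real-time idea: keep a sliding-window aggregate up to date as events arrive so
that a decision can be made immediately instead of in a later batch job; the readings and threshold
are invented.

from collections import deque

window = deque(maxlen=3)                 # retain only the most recent 3 readings
for reading in [10, 12, 11, 40, 42, 41]:
    window.append(reading)
    moving_avg = sum(window) / len(window)
    if moving_avg > 30:                  # simple real-time rule on the live stream
        print(f"alert: average {moving_avg:.1f} after reading {reading}")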
Unit V
Part A
1, What is NoSQL?
A NoSQL (originally referring to "non SQL" or "non relational") database provides a
mechanism for storage and retrieval of data that is modeled in means other than the tabular
relations used in relational databases. NoSQL databases are increasingly used in big data and
real-time web applications.
2, Why do we need NoSQL?
A relational database may require vertical, and sometimes horizontal, expansion of servers
to scale as data or processing requirements grow. An alternative, more cloud-friendly approach
is to employ NoSQL. NoSQL is a whole new way of thinking about a database; it is not a
relational database.
3, What is HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's
Bigtable and is written in Java. It is developed as part of Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-
like capabilities for Hadoop.
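A hedged sketch of reading and writing HBase from Python with the happybase client; it assumes
happybase is installed and an HBase Thrift server is reachable, and the host, table, and column
names below are hypothetical.

import happybase

connection = happybase.Connection("hbase-host")   # hypothetical Thrift server host
table = connection.table("web_events")            # hypothetical table

# Cells are addressed by row key and column family:qualifier, stored as bytes.
table.put(b"user42", {b"clicks:page": b"/home", b"clicks:count": b"3"})
print(table.row(b"user42"))                       # real-time read by row key
connection.close()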
4, What is the difference between HBase and Hive?
Despite providing SQL functionality, Hive does not provide interactive querying yet - it
only runs batch processes on Hadoop. Apache HBase is a NoSQL key/value store which runs on
top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than
MapReduce jobs.
5, What is the difference between Pig and Hive?
Depending on the purpose and type of data, you can choose to use either the Hive Hadoop
component or the Pig Hadoop component based on the differences below:
1) The Hive Hadoop component is used mainly by data analysts, whereas the Pig Hadoop
component is generally used by researchers and programmers.
6, What is Pig in Hadoop?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and appeals to developers already familiar with scripting
languages and SQL.
7, What is Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which
makes MapReduce programming high level, similar to that of SQL for relational database
management systems.
8, What are Pig, Hive, and HBase?
PIG is used for data transformation tasks. If you have a file and want to extract useful
information from it, join two files, or perform any other transformation, use PIG. HIVE is used
to query these files by defining a "virtual" table and running SQL-like queries on those
tables. HBase is a full-fledged NoSQL database.
9, What is Cassandra Client?
cassandra-client is a Node.js CQL 2 driver for Apache Cassandra 0.8 and later. CQL is a
query language for Apache Cassandra. You use it in much the same way you would use SQL
for a relational database. The Cassandra documentation can help you learn the syntax.
10, List out the types of built-in operators in HIVE.
Types of Built-in Operators in HIVE are:
Relational Operators
Arithmetic Operators
Logical Operators
Operators on Complex types
Complex type Constructors
Unit I
Part B
1, Explain the structure of big data.
As you read about big data, you will come across a lot of discussion on the concept of
data being structured, unstructured, semi-structured, or even multi-structured. Big data is often
described as unstructured and traditional data as structured. The lines aren’t as clean as such
labels suggest, however. Let’s explore these three types of data structure from a layman’s
perspective. Highly technical details are out of scope for this book. Most traditional data sources
are fully in the structured realm. This means traditional data sources come in a clear, predefined
format that is specified in detail. There is no variation from the defined formats on a day-to-day
or update-to-update basis. For a stock trade, the first field received might be a date in a
MM/DD/YYYY format. Next might be an account number in a 12-digit numeric format. Next
might be a stock symbol that is a three- to five-digit character field. And so on. Every piece of
information included is known ahead of time, comes in a specified format, and occurs in a
specified order. This makes it easy to work with.
Unstructured data sources are those that you have little or no control over. You are going
to get what you get. Text data, video data, and audio data all fall into this classification. A picture
has a format of individual pixels set up in rows, but how those pixels fit together to create the
picture seen by an observer is going to vary substantially in each case. There are sources of big
data that are truly unstructured such as those preceding. However, most data is at least semi-
structured. Semi-structured data has a logical flow and format to it that can be understood, but
the format is not user-friendly. Sometimes semi-structured data is referred to as multi-structured
data. There can be a lot of noise or unnecessary data intermixed with the nuggets of high value in
such a feed. Reading semi-structured data to analyze it isn’t as simple as specifying a fixed file
format. To read semi-structured data, it is necessary to employ complex rules that dynamically
determine how to proceed after reading each piece of information. Web logs are a perfect
example of semi-structured data. Web logs are pretty ugly when you look at them; however, each
piece of information does, in fact, serve a purpose of some sort. Whether any given piece of a
web log serves your purposes is another question. See Figure 1.1 for an example of a raw web
log.
WHAT STRUCTURE DOES YOUR BIG DATA HAVE?
Many sources of big data are actually semi-structured or multi-structured, not
unstructured. Such data does have a logical flow to it that can be understood so that information
can be extracted from it for analysis. It just isn’t as easy to deal with as traditional structured data
sources. Taming semi-structured data is largely a matter of putting in the extra time and effort to
figure out the best way to process it.
There is logic to the information in the web log even if it isn’t entirely clear at first
glance. There are fields, there are delimiters, and there are values just like in a structured source.
However, they do not follow each other consistently or in a set way. The log text generated by a
click on a web site right now can be longer or shorter than the log text generated by a click from
a different page one minute from now. In the end, however, it’s important to understand that
semi-structured data does have an underlying logic. It is possible to develop relationships
between various pieces of it. It simply takes more effort than structured data. Analytic
professionals will be more intimidated by truly unstructured data than by semi-structured data.
They may have to wrestle with semi-structured data to bend it to their will, but they can do it.
Analysts can get semi-structured data into a form that is well structured and can incorporate it
into their analytical processes. Truly unstructured data can be much harder to tame and will
remain a challenge for organizations even as they tame semi-structured data.
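To make the web-log example concrete, the sketch below applies one such rule in Python: a regular
expression for a line in the common (Apache-style) log format. The sample line is invented, and
real logs often deviate from this format, which is exactly why semi-structured data takes extra
effort to process.

import re

line = '192.0.2.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)
match = pattern.match(line)
if match:   # the fields are only available when the rule actually matched
    print(match.group("ip"), match.group("status"), match.group("request"))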