Reporting: The process of organizing data into informational summaries in order to monitor
how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.
Reporting translates raw data into information. Analysis transforms data and information
into insights. Reporting helps companies monitor their online business and alerts them when
data falls outside expected ranges. Good reporting should raise questions about the business
from its end users. The goal of analysis is to answer those questions by interpreting the data at a
deeper level and providing actionable recommendations. Performing analysis may raise
additional questions, but the aim is to identify answers, or at least potential answers that can
be tested. In summary, reporting shows you what is happening, while analysis focuses on
explaining why it is happening and what you can do about it.
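The contrast above can be sketched in a few lines of Python. The revenue figures, expected range, and outage log below are all hypothetical, chosen only to illustrate the reporting step (summarize and alert) versus the analysis step (explain why and suggest action):

```python
# Hypothetical daily-revenue data (illustrative values only).
daily_revenue = {"Mon": 1200, "Tue": 1150, "Wed": 310, "Thu": 1250, "Fri": 1180}

# Reporting: organize the data into a summary and alert on values
# that fall outside the expected range.
total = sum(daily_revenue.values())
expected_low, expected_high = 1000, 1500
alerts = [day for day, rev in daily_revenue.items()
          if not expected_low <= rev <= expected_high]
print(f"Weekly revenue: {total}; out-of-range days: {alerts}")

# Analysis: interpret the anomaly at a deeper level by joining it
# against other context (here, a hypothetical outage log) to reach
# an actionable recommendation.
outages = {"Wed": "checkout service down 09:00-17:00"}
for day in alerts:
    cause = outages.get(day, "no known cause; investigate further")
    print(f"{day}: revenue {daily_revenue[day]} -> likely cause: {cause}")
```

The report only tells us Wednesday was abnormal; the analysis step explains it and points at what to fix.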
2. a. Big data streaming is a process in which big data is processed quickly in order to extract real-time
insights from it. The data being processed is data in motion: big data streaming is a speed-focused
approach in which a continuous stream of data is processed with the aim of extracting insights and
useful trends from it. A continuous stream of (typically unstructured) data is sent for analysis into
memory before being stored onto disk, and this happens across a cluster of servers. Speed matters
most in big data streaming, because the value of data decreases with time if it is not processed
quickly.
Real-time streaming data analysis is a single-pass analysis. Analysts cannot choose to reanalyze the
data once it has streamed past.
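The single-pass constraint can be made concrete with a short sketch. A Python generator models data in motion: each value can be read exactly once, so every aggregate must be updated as the event arrives (function and sample values below are illustrative):

```python
def stream_stats(events):
    """Compute count, mean, and max in a single pass over an event stream."""
    count, total, peak = 0, 0, float("-inf")
    for value in events:
        # Each event is seen exactly once; all aggregates update in-place.
        count += 1
        total += value
        peak = max(peak, value)
    return count, total / count, peak

# A generator models data in motion: once consumed, it cannot be replayed.
readings = (v for v in [3, 7, 4, 9, 2])
print(stream_stats(readings))  # -> (5, 5.0, 9)
```

Trying to iterate over `readings` a second time yields nothing, which mirrors why streamed data cannot simply be reanalyzed unless it was also persisted.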
The plan shown in Figure 2-1 isn’t a bad design, and with the right choice of tools to carry out the queuing
and analytics and to build the dashboard, you’d be in fairly good shape for this one goal. But you’d be
missing out on a much better way to design your system in order to take full advantage of the data and to
improve your overall administration, operations, and development activities.
Instead, we recommend a radical change in how a system is designed.
The idea is to use data streams throughout your overall architecture—data streaming becomes the default
way to handle data rather than a specialty. The goal is to streamline (pun not intended) your whole operation
such that data is more readily available to those who need it, when they need it, for real-time analytics and
much more, without a great deal of inconvenient administrative burden.
Key Aspects of a Universal Stream-based Architecture
The idea that you can build applications to draw real-time insights from data before it is persisted is in itself a
big change from traditional ways of handling data. Even machine learning models are being developed with
streaming algorithms that can make decisions about data in real time and learn at the same time. Fast
performance is important in these systems, so in-memory processing methods and technologies are
attracting a lot of attention.
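A minimal sketch of such a streaming algorithm, one that makes a decision about each data point and learns from it at the same time, is an online anomaly detector. The class below is illustrative, not any particular library's API; it maintains a running mean and variance in memory using Welford's online update, flags points far from the mean, and then updates itself:

```python
import math

class OnlineAnomalyDetector:
    """Decide-then-learn on each point of a stream (Welford's online update)."""

    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold  # flag points this many std-devs from the mean

    def observe(self, x):
        # Decide first, using only what has been learned so far.
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) / std > self.threshold
        # Then learn from the new point, entirely in memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = OnlineAnomalyDetector()
stream = [10, 11, 9, 10, 11, 10, 9, 50]  # 50 is an injected outlier
flags = [detector.observe(x) for x in stream]
print(flags)  # only the final outlier is flagged
```

Nothing here is persisted to disk: the model's entire state is three numbers, which is what makes in-memory, single-pass processing fast enough for data in motion.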
3. Hadoop:
Hadoop is an open source distributed processing framework that manages data processing and storage for
big data applications running in clustered systems. It is at the center of a growing ecosystem of big
data technologies that are primarily used to support advanced analytics initiatives, including predictive
analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and
unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational
databases and data warehouses provide.
Hadoop Distributed File System:
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage attached
to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be
stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes. The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes
are responsible for serving read and write requests from the file system’s clients. The DataNodes also
perform block creation, deletion, and replication upon instruction from the NameNode.
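The block-splitting and block-to-DataNode mapping described above can be modeled in a few lines. This is a toy simulation, not HDFS code: the block size, replication factor, and round-robin placement policy are simplified stand-ins (real HDFS defaults are a 128 MB block size and 3 replicas, with rack-aware placement):

```python
BLOCK_SIZE = 10          # bytes; toy value (HDFS default is 128 MB)
REPLICATION_FACTOR = 2   # toy value (HDFS default is 3)
DATANODES = ["dn1", "dn2", "dn3"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """A file is split internally into one or more fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, datanodes, replication=REPLICATION_FACTOR):
    """NameNode's role: decide the mapping of blocks to DataNodes.

    Here a simple round-robin placement stands in for HDFS's real,
    rack-aware placement policy.
    """
    return {i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            for i in range(len(blocks))}

file_bytes = b"The quick brown fox jumps over the lazy dog"
blocks = split_into_blocks(file_bytes)
mapping = assign_blocks(blocks, DATANODES)
print(len(blocks), mapping)
```

The 43-byte "file" becomes five blocks, each replicated on two DataNodes; the NameNode holds only this mapping, while the DataNodes hold the actual block data.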
Components:
Hadoop Common – The common module contains libraries and utilities which are required by
other modules of Hadoop.
Hadoop Distributed File System (HDFS) – The distributed file system that stores data
on commodity machines. It is the core of the Hadoop framework and provides very high
aggregate bandwidth across the cluster.
Hadoop YARN – The resource-management platform responsible for managing
compute resources across the cluster and using them to schedule users' applications.
Hadoop MapReduce – This is the programming model used for large scale data processing.
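The MapReduce programming model can be illustrated with a classic word count, expressed here as an in-process Python sketch of the three phases (map, shuffle, reduce) that the framework would otherwise distribute across the cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the grouped values for each key (sum the counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insights", "data in motion"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # -> {'big': 2, 'data': 2, 'insights': 1, 'in': 1, 'motion': 1}
```

In real Hadoop MapReduce, the map and reduce functions run in parallel on many nodes and the shuffle moves data between them over the network; the programmer only writes the two functions.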