Realized by:
o Elys CHEBBI
o Emna DRIDI
o Firas CHAABENE
o Hla KHOUILDI
2015-2016
Acknowledgment
We would like to take this opportunity to thank our tutors, who were there every
minute we needed them. Our experience with them was extremely rewarding. In
addition to sharpening our knowledge, we thoroughly enjoyed working with such a
great team of professors. It was also a pleasure to watch and learn from our
fellow colleagues.
Once again, they set the tone for the year by helping us run a successful
project. We appreciated their unfailing attention to detail. With all the
last-minute snags, we worried that something would fall through the cracks, but
they anticipated every contingency.
We are fortunate to have them donate their time on occasions like this.
On behalf of all the students of the class, we would like to express our
heartfelt thanks for all they have done.
General Introduction
Nowadays, we live in an age where information is power. In any business enterprise,
it is imperative that everyone has the critical information they need to fulfill
their business obligations accurately and effectively. To deal with frequent and
rapid market changes, and to enable quick decision making, every business needs to
understand Business Intelligence and Business Analytics.
The term Business Intelligence (BI) represents tools and systems that play a key role in
the strategic planning of a business. These systems allow a company to gather, store, access
and analyze corporate data, which helps in recognizing the strengths and weaknesses as well as
the threats and opportunities surrounding the business.
As part of our academic projects, we were asked to implement a decision support system
(router management). The present report is a brief presentation of our project during this
semester; it describes the key elements of our project and contains four chapters:
The first chapter, General Context, presents the project context and introduces the
problematic. It also presents the needs and the offered solutions.
The second chapter, Analysis, presents the functional and non-functional
requirements and, finally, our use case.
The third chapter, Modeling, describes the architecture of a data warehouse. It also
presents the operations and the types of OLAP.
The fourth chapter, Implementation, presents the tools we used and the project
realization phases: ETL, Reporting, Big Data and Data Mining.
Finally, we end our report with a general conclusion that summarizes the work we have
achieved and presents our outlook.
Chapter 1:
General context
Introduction
In this chapter, we are going to present the context of our project, starting with a
study of the existing situation, then specifying the objectives of our project and
finally presenting the different solutions.
1. Project context
As part of our academic projects, we were asked to design a decision-making system
used to correlate the log files of a network. We had to store the data, build the
necessary relations, then extract, transform and load it into a data warehouse, an
efficient system used for producing reports and carrying out analysis. The project
covers five modules:
o Modeling
o ETL
o Reporting
o Big Data
o Data mining
Our five-member group, composed of serious engineering students from the Private
Higher School of Engineering and Technology (ESPRIT), took the lead on a project that
consists of implementing a decision support system (router management).
Our team had four full months to do the work; we worked collaboratively, yet as
independently from each other as possible.
We were followed weekly by supervisors from the course teaching team, who oversaw
our project team's progress, provided expert advice on the problem domains, and took
part in grading our team.
2. Problematic
o Different vendors mean different log formats.
o Log files contain an enormous amount of data, so it is easy to miss the crucial events.
o Identifying anomalies and probes over time can be difficult.
o The format is not easy to read, and the messages can be obscure and require expertise.
o Looking at all events takes time and can consume a lot of disk space.
3. Solutions
In order to grapple with all the problems seen above, we were asked to conceive a Business
Intelligence system that deals with log file correlation. Our solution guarantees:
o Predictive visualization
Conclusion
In this first chapter, we presented the context of our project in order to clarify our
objectives and help the reader better understand our work.
Chapter 2:
Analysis
Introduction
Every data warehouse must be able to meet the expectations of its users. This cannot,
of course, be done without a thorough study of their needs.
This chapter's main purpose is to present and describe the approach used for detecting
those needs, as well as a summary of the results.
1. Functional Requirements
The official definition of a functional requirement specifies what the system should do:
"A requirement specifies a function that a system or component must be able to perform."
Our system must guarantee:
o Veracity of prediction
o Rapid results
2. Non-Functional Requirements
The official definition of a non-functional requirement specifies how the system should
behave: "A non-functional requirement is a statement of how a system must behave; it is a
constraint upon the system's behavior." Our system must provide:
o Reliability
o Data integrity
o Maintainability
3. Use case
The actors of this system are the administrator and the decision maker, who are provided
with a list of features.
Conclusion
In this second chapter, we presented the functional and non-functional requirements and,
finally, our use case.
Chapter 3:
Modeling
Introduction
Modeling is the most important phase of a Business Intelligence project for a successful
implementation of a decision support system. In this chapter, we will define a data
warehouse and present its architecture.
1.1 Definition
A data warehouse is a system used for reporting and data analysis, and is considered a
core component of a Business Intelligence environment. DWs are central repositories of
integrated data from one or more disparate sources. They store current and historical data
and are used for creating analytical reports for knowledge workers throughout the
enterprise. A data warehouse is:
o Integrated: a data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a product,
but in a data warehouse there will be only a single way of identifying a product.
o Non-volatile: once data is in the data warehouse, it will not change; historical
data in a data warehouse is never altered.
o Time-variant: historical data is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, 12 months, or even older, from a data
warehouse. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold only the most recent
address of a customer, whereas a data warehouse can hold all addresses associated with
a customer.
1.2 Architecture of a data warehouse
In this section we present the architecture and the different functions of a data warehouse.
The following are the functions of data warehouse tools and utilities: extracting data from
the sources, cleaning it, transforming it, and loading it into the warehouse format.
A data warehouse contains data from various sources. While loading data from these
sources, the data can be cleansed and transformed according to the analysis needs, then
loaded into the data warehouse in a common format that can be used for reporting and
analysis purposes. Since the data from all sources is now available in a common format,
and data issues such as missing values or incorrect formats are corrected while loading,
it becomes much easier for the end user to perform the required analysis using the
automated data analysis tools available on the market.
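As an illustration, the cleansing step can be sketched in Python (a hypothetical example; the field names and formats are assumptions, not the project's actual schema):

```python
from datetime import datetime

def cleanse(record):
    """Normalize one raw source record into the common warehouse format."""
    clean = dict(record)
    # Fix an incorrect date format: some sources may use DD/MM/YYYY.
    if "/" in clean.get("date", ""):
        clean["date"] = datetime.strptime(clean["date"], "%d/%m/%Y").date().isoformat()
    # Fill a missing value with an explicit default.
    if not clean.get("email"):
        clean["email"] = "unknown"
    # Lower-case emails so the same user matches across sources.
    clean["email"] = clean["email"].lower()
    return clean

raw = [
    {"date": "03/01/2016", "email": "Admin@Example.COM"},
    {"date": "2016-01-04", "email": ""},
]
cleaned = [cleanse(r) for r in raw]
print(cleaned[0]["date"])   # 2016-01-03
print(cleaned[1]["email"])  # unknown
```

In the real project these rules are applied inside the ETL tool rather than by hand, but the principle is the same: one common format comes out, whatever went in.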
2. Data warehouse: OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get an insight into the information through
fast, consistent, and interactive access to it. There are three main types of OLAP servers:
o Relational OLAP (ROLAP): ROLAP servers are placed between the relational back-end
server and the client front-end tools. To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
o Multidimensional OLAP (MOLAP): MOLAP uses array-based multidimensional storage
engines. With multidimensional data stores, the storage utilization may be low if the
data set is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
o Hybrid OLAP (HOLAP): HOLAP servers combine ROLAP and MOLAP; they allow storing large
volumes of detailed information while benefiting from the faster computations of MOLAP.
Since OLAP servers are based on a multidimensional view of data, we will discuss the main
OLAP operations on multidimensional data:
o Roll-up: aggregates the data by climbing up a concept hierarchy of a dimension or by
dimension reduction.
o Drill-down: the reverse of roll-up; it navigates from less detailed data to more
detailed data.
Our data warehouse design relies on two kinds of objects:
Dimension: a dimension gives context to numerical data. You can think of a dimension as
how you would want to view your numerical data. Dimensions are how we want to see our
data and can come in many different flavors.
Fact: a fact table is an object in our data warehouse that contains transactional data.
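To make these notions concrete, here is a minimal sketch in Python of a fact table keyed to a dimension and rolled up along it (the table and column names are illustrative assumptions, not the project's actual schema):

```python
from collections import defaultdict

# Dimension: gives context to the facts (one row per user).
dim_user = {
    1: {"email": "a@x.com", "city": "Tunis"},
    2: {"email": "b@x.com", "city": "Sousse"},
}

# Fact table: transactional data, keyed to the dimension.
fact_access = [
    {"user_id": 1, "session_time": 120, "input_octets": 500},
    {"user_id": 1, "session_time": 60,  "input_octets": 300},
    {"user_id": 2, "session_time": 30,  "input_octets": 100},
]

def roll_up(facts, dim, attr, measure):
    """Aggregate a measure by climbing from the user level to a coarser attribute."""
    totals = defaultdict(int)
    for row in facts:
        totals[dim[row["user_id"]][attr]] += row[measure]
    return dict(totals)

# Roll-up: sum the session time per city instead of per user.
print(roll_up(fact_access, dim_user, "city", "session_time"))
# {'Tunis': 180, 'Sousse': 30}
```

Drill-down is simply the reverse direction: going from the per-city totals back to the individual per-user rows of the fact table.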
Conclusion
Throughout this chapter, we presented the architecture of a data warehouse, then the
operations and types of OLAP, and finally our data warehouse design.
Chapter 4:
Implementation
Introduction
In this last chapter, we will move on to the project implementation, shedding light on
the various tools and technologies, as well as the different steps of our solution.
1. Tools
1.1 Talend Open Studio
Talend is the first provider of open-source data integration software. Its main product
is Talend Open Studio, whose first version was released in 2006, after three years of
intense research and development investment. It is an open-source project for data
integration based on Eclipse RCP that primarily supports ETL-oriented implementations,
and it is provided for on-premises deployment as well as in a software-as-a-service
(SaaS) delivery model. Talend Open Studio is mainly used for integration between
operational systems, as well as for ETL (Extract, Transform, Load) for Business
Intelligence and Data Warehousing, and for migration. Talend offers a new vision,
reflected in the way it utilizes technology as well as in its business model, and is
among the most open and powerful data integration solutions on the market today.
1.2 PostgreSQL
PostgreSQL is a powerful, open-source object-relational database management system,
known for its reliability and standards compliance. We used it to store the source log
data.
1.3 Mondrian Schema Workbench
The Mondrian Schema Workbench allows you to visually create and test Mondrian OLAP
cube schemas. It provides:
o A schema editor integrated with the underlying data source for validation
o Testing of MDX queries against the schema and database
o Browsing of the underlying database structure
1.4 Tableau
Tableau's visualization and analytics products aim to help business managers, analysts
and executives see the relationships between different data points, regardless of the
users' technical skill level.
Tableau provides reporting, dashboarding and scorecards, BI search, ad hoc analysis and
queries, online analytical processing, data discovery, spreadsheet integration, and
other data analytics and analysis functions. Ad hoc analysis and data visualization
make up the core of Tableau's products.
1.5 Qlik Sense
Qlik Sense gives users a smart analytics tool that can generate personalized reports
and very detailed dashboards. It is designed for businesses that are looking for a way
to fully explore huge amounts of data. Suitable for small to huge organizations, and
even for professionals who work individually, Qlik Sense helps users make sense of
their data.
1.6 Facepager
Facepager was made for fetching publicly available data from Facebook, Twitter and
other platforms. All data is stored in a SQLite database and may be exported to CSV.
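A CSV file exported by Facepager can then be read with a few lines of Python (the column names and sample rows below are assumptions for illustration, not Facepager's exact export schema):

```python
import csv
import io

# Simulate a small Facepager-style CSV export (normally read from a file).
exported = io.StringIO(
    "object_id,object_type,message\n"
    "101,post,New release announced\n"
    "102,comment,Great tool for log analysis\n"
)

rows = list(csv.DictReader(exported))
posts = [r for r in rows if r["object_type"] == "post"]
print(len(rows), len(posts))  # 2 1
```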
1.8 Hadoop
Hadoop is an Apache open-source project that develops software for scalable,
distributed computing:
o Hadoop is a framework for the distributed processing of large data sets across
clusters of computers using simple programming models.
o Hadoop easily deals with the complexities of high volume, velocity and variety of data.
o Hadoop detects and handles failures at the application layer.
1.9 Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing
data summarization, query, and analysis. Hive provides a mechanism to query the data
using a SQL-like language called HiveQL; Hive translates HiveQL queries into MapReduce
jobs that interact with the files stored in HDFS.
1.10 R language
R is an open-source language and environment for statistical computing and graphics;
we used it for data mining and for generating word clouds.
2. Realization
2.1. ETL
The first step of our project consists of feeding the data warehouse from a database
that contains several pieces of information (email, date, IP addresses, MAC addresses,
port numbers, ...).
The basic approach to feeding a data warehouse typically consists of three phases:
E (Extract), T (Transform), and L (Load):
o Extract: the first part of an ETL process involves extracting the data from the
source system. In many cases this represents the most important aspect of ETL, since
extracting data correctly sets the stage for the success of the subsequent processes.
In our case, we extracted data from the database under PostgreSQL.
o Transform: in the data transformation stage, a series of rules or functions is
applied to the extracted data in order to prepare it for loading into the end target:
filtering (selecting only certain columns to load), joining together data from multiple
sources (lookup, merge), transposing rows and columns, etc.
o Load: loads the data into the end target, which may be a simple delimited flat file
or a data warehouse.
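The three phases above can be sketched end to end in Python (a simplified illustration only; in the project the source is a PostgreSQL database and the target is the warehouse, while here both are replaced by in-memory lists):

```python
def extract(source_rows):
    """E: pull raw rows from the source system."""
    return list(source_rows)

def transform(rows):
    """T: filter columns and enforce types before loading."""
    out = []
    for r in rows:
        out.append({
            "email": r["email"].lower(),   # normalize case
            "octets": int(r["octets"]),    # enforce numeric type
        })
    return out

def load(rows, target):
    """L: append the prepared rows to the target store."""
    target.extend(rows)
    return len(rows)

source = [{"email": "User@X.com", "octets": "500", "junk": "ignored"}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
print(loaded, warehouse[0]["email"])  # 1 user@x.com
```

Talend generates an equivalent pipeline for us as a graphical job, so we never write this code by hand; the sketch only shows what each phase contributes.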
In this section, we will present the different steps of feeding the data warehouse using
Talend:
Figure 2: Feeding of the IP address dimension
As shown in the figure above, we have fed our Dim_IpAddress dimension as follows:
tUniqRow: compares entries and sorts out duplicate entries from the input flow.
tMap: the tMap component is part of the Processing family of components. tMap is one of
the core components and is primarily used for mapping input data to output data, that
is, mapping one schema to another. As well as performing mapping functions, tMap may
also be used to join multiple inputs and to write multiple outputs.
tPostgresqlCommit: validates the data processed through the job into the connected
database.
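The behavior of these two components can be imitated in a few lines of Python (purely illustrative; Talend actually generates Java jobs, and the field names here are assumptions):

```python
def uniq_rows(rows, key):
    """Like tUniqRow: keep the first occurrence of each key, drop duplicates."""
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def map_schema(rows):
    """Like tMap: map the input schema onto the output (dimension) schema."""
    return [{"ip_address": r["ip"], "host": r["hostname"]} for r in rows]

incoming = [
    {"ip": "10.0.0.1", "hostname": "gw1"},
    {"ip": "10.0.0.1", "hostname": "gw1"},   # duplicate, dropped
    {"ip": "10.0.0.2", "hostname": "gw2"},
]
dim_rows = map_schema(uniq_rows(incoming, "ip"))
print(len(dim_rows), dim_rows[0]["ip_address"])  # 2 10.0.0.1
```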
Figure 4: Dim_Ip_Address
Fact table
Figure 6: Fact table
2.2. Analysis
We used the "Schema Workbench" tool to build our OLAP cube: first we created our fact
table, then we added our dimensions.
After creating the cube, we generated with "Schema Workbench" an XML file that we then
used in Mondrian.
Figure 8: XML file generated by Schema Workbench
We placed our XML file in the Mondrian directory, then we configured the connection in
"web.xml", "datasources.xml" and "mondrian.jsp". After that, we used MDX queries to
view the different measures.
2.3. Reporting
The goal of the Tableau and Qlik Sense reports is to provide decision makers with
analysis tables and a set of indicators, giving them a real platform for decision
support.
In this section, we will present the different interfaces with a textual description.
The interface below shows the sum of access_session_time for each user, according to
the email address:
The following interface shows the sum of access input octets for each user:
The interface below shows the sum of access input octets for each user:
Using Qlik Sense:
This interface shows the session time rate per login, in seconds.
This table contains the MAC address and IP address of each user, according to their
email:
2.4. Big Data
Big data is a term that describes the large volume of data, both structured and
unstructured, that inundates a business on a day-to-day basis. But it is not the amount
of data that is important; it is what organizations do with the data that matters. Big
data can be analyzed for insights that lead to better decisions and strategic business
moves.
While the term "big data" is relatively new, the act of gathering and storing large
amounts of information for eventual analysis is ages old. The concept gained momentum
in the early 2000s when industry analyst Doug Laney articulated the now-mainstream
definition of big data as the three Vs: volume, velocity and variety.
The first step consists of extracting data from Facebook using Facepager, in CSV
format:
Figure 16: Extract data using Facepager
We also used VMware Workstation and Cloudera in order to structure our data into a
table.
2.5. Data mining
In this part, we applied data mining algorithms to analyze our data and support
decision making.
Principal Component Analysis (PCA): using this algorithm, we study the relevance of the
variables of our data warehouse. PCA computes the correlation of each variable at each
iteration and neglects those that have a low correlation value.
In our case, PCA allowed us to discover that just two variables are relevant: the
session time (in seconds) and the input data in bytes.
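The variable-screening idea can be sketched in pure Python with Pearson correlation (a simplified illustration of the principle on invented data, not the actual PCA computation we ran on the warehouse):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [1, 2, 3, 4, 5]
variables = {
    "session_time": [2, 4, 6, 8, 10],   # perfectly correlated with the target
    "noise":        [5, 1, 4, 2, 3],    # weakly correlated
}

# Keep only the variables whose |correlation| with the target is high.
relevant = [name for name, xs in variables.items()
            if abs(pearson(xs, target)) > 0.8]
print(relevant)  # ['session_time']
```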
Semantic and sentiment recognition from the Web:
Facebook and Twitter pages are probably the best source of information on how
individuals use a social networking site, because all posts, likes, and comments can be
collected and analyzed.
We were interested in analyzing the Splunk page on Twitter to know what users think
about log file correlation. We found the following results:
We then analyzed the texts posted as statuses, comments and hashtags, and found that
the most common posts include words that deal with networking and log file correlation.
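The word-frequency counting behind such an analysis can be sketched with Python's Counter (the sample posts below are invented for illustration, not the actual tweets we collected):

```python
from collections import Counter
import re

posts = [
    "Splunk makes log correlation easy",
    "Great networking tool for log analysis",
    "Correlation of log files saves time",
]

# Tokenize, lower-case, and drop very short words.
words = [w for p in posts
         for w in re.findall(r"[a-z]+", p.lower()) if len(w) >= 3]
top = Counter(words).most_common(2)
print(top)  # [('log', 3), ('correlation', 2)]
```

A word cloud is simply a visual rendering of these same counts, with font size proportional to frequency.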
The same results were found with the word cloud generated using R:
Conclusion
In this last chapter, we presented the tools we used and the different steps of our
solution.
General conclusion
By the end of fourteen weeks of hard work and perseverance, we managed to finish our
academic project on time, applying what we knew theoretically in order to implement a
decision support system.
During this period, we faced some technical problems due to an unstable environment and
a lack of experience. We consulted documentation and were inspired by various
tutorials. This experience was very beneficial for us on both the professional and the
personal level. We did not only apply the theoretical knowledge we had gained, but we
also broadened our skills in many domains.
This project gave us the opportunity to interact with specialist teachers and to hear
their feedback, which enabled us to develop our communication skills, our ability to
work in a team and, above all else, to gain real-world experience that we would never
acquire through classes alone.