Realized by:
o Elys CHEBBI
o Emna DRIDI
o Firas CHAABENE
o Hla KHOUILDI
2015-2016
Acknowledgment
We would like to take this opportunity to thank our tutors, who were there every
minute we needed them. Our experience with them was extremely rewarding. In
addition to sharpening our knowledge, we thoroughly enjoyed working with such a
great team of professors. It was also a pleasure to watch and learn from our
fellow colleagues.
Once again, they set the tone for the year by helping us run a successful
project. We appreciated their unfailing attention to detail. With all the
last-minute snags, we worried that something would fall through the cracks, but
they anticipated every contingency.
We are fortunate to have them donate their time on occasions like this.
On behalf of all the students of the class, we would like to express our
heartfelt thanks for all they have done.
General Introduction
Nowadays, we live in an age where information is power. In any business enterprise,
it is imperative that everyone has the critical information they need to fulfill
their business obligations accurately and effectively. To deal with frequent and
rapid market changes, and to enable quick decision making, every business needs to
understand Business Intelligence and Business Analytics.
The term Business Intelligence (BI) represents tools and systems that play a key role in
the strategic planning of a business. These systems allow a company to gather, store, access
and analyze corporate data, which helps in recognizing the strengths and weaknesses as well as
the threats and opportunities surrounding the business.
As part of our academic projects, we were asked to implement a decision support system
(router management). The present report is a brief presentation of our project during this
semester; it describes the key elements of our project and contains four chapters:
The first chapter, General Context, presents the project context and introduces the
problematic. It also presents the needs and the offered solutions.
The second chapter, Analysis, presents the functional and non-functional
requirements and, finally, our use case.
The third chapter, Modeling, describes the architecture of a data warehouse. It also
presents the operations and the types of OLAP.
The fourth chapter, Implementation, presents the tools we used and the project
realization phases: ETL, Reporting, Big Data and Data Mining.
Finally, we end our report with a general conclusion that summarizes the work we have
achieved and presents our outlook.
Chapter 1:
General context
Introduction
In this chapter, we are going to present the context of our project, starting with a
study of the existing situation, then specifying the objectives of our project and
finally presenting the different solutions.
1. Project context
As part of our academic projects, we were asked to design a decision-making system
used to correlate the log files of a network. We had to store the data, build the
necessary relations, then extract, transform and load it into a data warehouse, an
efficient system used for producing reports and carrying out analysis. The project
covers five modules:
o Modeling
o ETL
o Reporting
o Big Data
o Data mining
Our five-member group, composed of serious engineering students from the Private
Higher School of Engineering and Technology (ESPRIT), took the lead on a project that
consists of implementing a decision support system (router management).
Our team had four full months to do the work; we worked collaboratively, yet as
independently from each other as possible.
We were followed weekly by supervisors from the course teaching team, who oversaw
our project team's progress, provided expert advice on the problem domains, and took
part in grading our team.
2. Problematic
o Different vendors mean different log formats.
o Log files contain an enormous amount of data, so it is easy to miss the crucial events.
o Identifying anomalies and probes over time can be difficult.
o The format is not easy to read, and the messages can be obscure and require expertise.
o Looking at all events takes time and can consume a lot of disk space.
3. Solutions
In order to grapple with all the problems seen above, we were asked to conceive a Business
Intelligence system that deals with log file correlation. Our solution guarantees:
o Predictive visualization
Conclusion
In this first chapter, we presented the context of our project in order to clarify our
objectives and help the reader better understand our work.
Chapter 2:
Analysis
Introduction
Every data warehouse must be able to meet the expectations of its users. This cannot,
of course, be done without a thorough study of their needs.
This chapter's main purpose is to present and describe the approach used for detecting
those needs, as well as a summary of the results.
1. Functional Requirements
The official definition of a functional requirement specifies what the system should do:
"A requirement specifies a function that a system or component must be able to perform."
Our system must guarantee:
o Veracity of prediction
o Rapid results
2. Non-Functional Requirements
The official definition of a non-functional requirement specifies how the system should
behave: "A non-functional requirement is a statement of how a system must behave; it is a
constraint upon the system's behavior." Our system must provide:
o Reliability
o Data integrity
o Maintainability
3. Use case
The actors of this system are the administrator and the decision maker, who are provided
with a list of features.
Conclusion
In this second chapter, we presented the functional and non-functional requirements and,
finally, our use case.
Chapter 3:
Modeling
Introduction
Modeling is the most important phase of a Business Intelligence project for a successful
implementation of a decision support system. In this chapter, we will define a data
warehouse and present its architecture.
1.1 Definition
A data warehouse is a system used for reporting and data analysis, and is considered a
core component of a Business Intelligence environment. DWs are central repositories of
integrated data from one or more disparate sources. They store current and historical data
and are used for creating analytical reports for knowledge workers throughout the
enterprise. A data warehouse is:
o Integrated: a data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a product,
but in a data warehouse there will be only a single way of identifying a product.
o Non-volatile: once data is in the data warehouse, it will not change; historical
data in a data warehouse is never altered.
o Time-variant: historical data is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, 12 months, or even older, from a data
warehouse. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold only the most recent
address of a customer, whereas a data warehouse can hold all addresses associated with
a customer.
1.2 Architecture of a data warehouse
In this section we present the architecture and the different functions of a data warehouse.
The following are the functions of data warehouse tools and utilities: extracting data from
the sources, cleaning it, transforming it, and loading it into the warehouse format.
A data warehouse contains data from various sources. While loading data from these
sources, the data can be cleansed and transformed according to the analysis needs, then
loaded into the data warehouse in a common format that can be used for reporting and
analysis purposes. Since the data from all sources is now available in a common format,
and data issues such as missing values or incorrect formats are corrected while loading,
it becomes much easier for the end user to perform the required analysis using the
automated data analysis tools available on the market.
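As an illustration, the cleansing step can be sketched in Python (a hypothetical example; the field names and formats are assumptions, not the project's actual schema):

```python
from datetime import datetime

def cleanse(record):
    """Normalize one raw source record into the common warehouse format."""
    clean = dict(record)
    # Fix an incorrect date format: some sources may use DD/MM/YYYY.
    if "/" in clean.get("date", ""):
        clean["date"] = datetime.strptime(clean["date"], "%d/%m/%Y").date().isoformat()
    # Fill a missing value with an explicit default.
    if not clean.get("email"):
        clean["email"] = "unknown"
    # Lower-case emails so the same user matches across sources.
    clean["email"] = clean["email"].lower()
    return clean

raw = [
    {"date": "03/01/2016", "email": "Admin@Example.COM"},
    {"date": "2016-01-04", "email": ""},
]
cleaned = [cleanse(r) for r in raw]
print(cleaned[0]["date"])   # 2016-01-03
print(cleaned[1]["email"])  # unknown
```

In the real project these rules are applied inside the ETL tool rather than by hand, but the principle is the same: one common format comes out, whatever went in.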
2. Data warehouse: OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get an insight into the information through
fast, consistent, and interactive access to it. There are three main types of OLAP servers:
o Relational OLAP (ROLAP): ROLAP servers are placed between the relational back-end
server and the client front-end tools. To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
o Multidimensional OLAP (MOLAP): MOLAP uses array-based multidimensional storage
engines. With multidimensional data stores, the storage utilization may be low if the
data set is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
o Hybrid OLAP (HOLAP): HOLAP servers combine ROLAP and MOLAP; they allow storing large
volumes of detailed information while benefiting from the faster computations of MOLAP.
Since OLAP servers are based on a multidimensional view of data, we will discuss the main
OLAP operations on multidimensional data:
o Roll-up: aggregates the data by climbing up a concept hierarchy of a dimension or by
dimension reduction.
o Drill-down: the reverse of roll-up; it navigates from less detailed data to more
detailed data.
Our data warehouse design relies on two kinds of objects:
Dimension: a dimension gives context to numerical data. You can think of a dimension as
how you would want to view your numerical data. Dimensions are how we want to see our
data and can come in many different flavors.
Fact: a fact table is an object in our data warehouse that contains transactional data.
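To make these notions concrete, here is a minimal sketch in Python of a fact table keyed to a dimension and rolled up along it (the table and column names are illustrative assumptions, not the project's actual schema):

```python
from collections import defaultdict

# Dimension: gives context to the facts (one row per user).
dim_user = {
    1: {"email": "a@x.com", "city": "Tunis"},
    2: {"email": "b@x.com", "city": "Sousse"},
}

# Fact table: transactional data, keyed to the dimension.
fact_access = [
    {"user_id": 1, "session_time": 120, "input_octets": 500},
    {"user_id": 1, "session_time": 60,  "input_octets": 300},
    {"user_id": 2, "session_time": 30,  "input_octets": 100},
]

def roll_up(facts, dim, attr, measure):
    """Aggregate a measure by climbing from the user level to a coarser attribute."""
    totals = defaultdict(int)
    for row in facts:
        totals[dim[row["user_id"]][attr]] += row[measure]
    return dict(totals)

# Roll-up: sum the session time per city instead of per user.
print(roll_up(fact_access, dim_user, "city", "session_time"))
# {'Tunis': 180, 'Sousse': 30}
```

Drill-down is simply the reverse direction: going from the per-city totals back to the individual per-user rows of the fact table.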
Conclusion
Throughout this chapter, we presented the architecture of a data warehouse, then the
operations and types of OLAP, and finally our data warehouse design.
Chapter 4:
Implementation
Introduction
In this last chapter, we will move on to the project implementation, shedding light on
the various tools and technologies, as well as the different steps of our solution.
1. Tools
1.1 Talend Open Studio
Talend is the first provider of open-source data integration software. Its main product
is Talend Open Studio, whose first version was released in 2006, after three years of
intense research and development investment. It is an open-source project for data
integration based on Eclipse RCP that primarily supports ETL-oriented implementations,
and it is provided for on-premises deployment as well as in a software-as-a-service
(SaaS) delivery model. Talend Open Studio is mainly used for integration between
operational systems, as well as for ETL (Extract, Transform, Load) for Business
Intelligence and Data Warehousing, and for migration. Talend offers a new vision,
reflected in the way it utilizes technology as well as in its business model, and is
among the most open and powerful data integration solutions on the market today.
1.2 PostgreSQL
PostgreSQL is a powerful, open-source object-relational database management system,
known for its reliability and standards compliance. We used it to store the source log
data.
1.3 Mondrian Schema Workbench
The Mondrian Schema Workbench allows you to visually create and test Mondrian OLAP
cube schemas. It provides:
o A schema editor integrated with the underlying data source for validation
o Testing of MDX queries against the schema and database
o Browsing of the underlying database structure
1.4 Tableau
Tableau's visualization and analytics products aim to help business managers, analysts
and executives see the relationships between different data points, regardless of the
users' technical skill level.
Tableau provides reporting, dashboarding and scorecards, BI search, ad hoc analysis and
queries, online analytical processing, data discovery, spreadsheet integration, and
other data analytics and analysis functions. Ad hoc analysis and data visualization
make up the core of Tableau's products.
1.5 Qlik Sense
Qlik Sense gives users a smart analytics tool that can generate personalized reports
and very detailed dashboards. It is designed for businesses that are looking for a way
to fully explore huge amounts of data. Suitable for small to huge organizations, and
even for professionals who work individually, Qlik Sense helps users make sense of
their data.
1.6 Facepager
Facepager was made for fetching publicly available data from Facebook, Twitter and
other platforms. All data is stored in a SQLite database and may be exported to CSV.
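A CSV file exported by Facepager can then be read with a few lines of Python (the column names and sample rows below are assumptions for illustration, not Facepager's exact export schema):

```python
import csv
import io

# Simulate a small Facepager-style CSV export (normally read from a file).
exported = io.StringIO(
    "object_id,object_type,message\n"
    "101,post,New release announced\n"
    "102,comment,Great tool for log analysis\n"
)

rows = list(csv.DictReader(exported))
posts = [r for r in rows if r["object_type"] == "post"]
print(len(rows), len(posts))  # 2 1
```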
1.8 Hadoop
Hadoop is an Apache open-source project that develops software for scalable,
distributed computing:
o Hadoop is a framework for the distributed processing of large data sets across
clusters of computers using simple programming models.
o Hadoop easily deals with the complexities of high volume, velocity and variety of data.
o Hadoop detects and handles failures at the application layer.
1.9 Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing
data summarization, query, and analysis. Hive provides a mechanism to query the data
using a SQL-like language called HiveQL; Hive translates HiveQL queries into MapReduce
jobs that interact with the files stored in HDFS.
1.10 R language
R is an open-source language and environment for statistical computing and graphics;
we used it for data mining and for generating word clouds.
2. Realization
2.1. ETL
The first step of our project consists of feeding the data warehouse from a database
that contains several pieces of information (email, date, IP addresses, MAC addresses,
port numbers, ...).
The basic approach to feeding a data warehouse typically consists of three phases:
E (Extract), T (Transform), and L (Load):
o Extract: the first part of an ETL process involves extracting the data from the
source system. In many cases this represents the most important aspect of ETL, since
extracting data correctly sets the stage for the success of the subsequent processes.
In our case, we extracted data from the database under PostgreSQL.
o Transform: in the data transformation stage, a series of rules or functions is
applied to the extracted data in order to prepare it for loading into the end target:
filtering (selecting only certain columns to load), joining together data from multiple
sources (lookup, merge), transposing rows and columns, etc.
o Load: loads the data into the end target, which may be a simple delimited flat file
or a data warehouse.
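The three phases above can be sketched end to end in Python (a simplified illustration only; in the project the source is a PostgreSQL database and the target is the warehouse, while here both are replaced by in-memory lists):

```python
def extract(source_rows):
    """E: pull raw rows from the source system."""
    return list(source_rows)

def transform(rows):
    """T: filter columns and enforce types before loading."""
    out = []
    for r in rows:
        out.append({
            "email": r["email"].lower(),   # normalize case
            "octets": int(r["octets"]),    # enforce numeric type
        })
    return out

def load(rows, target):
    """L: append the prepared rows to the target store."""
    target.extend(rows)
    return len(rows)

source = [{"email": "User@X.com", "octets": "500", "junk": "ignored"}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
print(loaded, warehouse[0]["email"])  # 1 user@x.com
```

Talend generates an equivalent pipeline for us as a graphical job, so we never write this code by hand; the sketch only shows what each phase contributes.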
In this section, we will present the different steps of feeding the data warehouse using
Talend:
Figure 2: Feeding of the IP address dimension
As shown in the figure above, we have fed our Dim_IpAddress dimension as follows:
tUniqRow: compares entries and sorts out duplicate entries from the input flow.
tMap: the tMap component is part of the Processing family of components. tMap is one of
the core components and is primarily used for mapping input data to output data, that
is, mapping one schema to another. As well as performing mapping functions, tMap may
also be used to join multiple inputs and to write multiple outputs.
tPostgresqlCommit: validates the data processed through the job into the connected
database.
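The behavior of these two components can be imitated in a few lines of Python (purely illustrative; Talend actually generates Java jobs, and the field names here are assumptions):

```python
def uniq_rows(rows, key):
    """Like tUniqRow: keep the first occurrence of each key, drop duplicates."""
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def map_schema(rows):
    """Like tMap: map the input schema onto the output (dimension) schema."""
    return [{"ip_address": r["ip"], "host": r["hostname"]} for r in rows]

incoming = [
    {"ip": "10.0.0.1", "hostname": "gw1"},
    {"ip": "10.0.0.1", "hostname": "gw1"},   # duplicate, dropped
    {"ip": "10.0.0.2", "hostname": "gw2"},
]
dim_rows = map_schema(uniq_rows(incoming, "ip"))
print(len(dim_rows), dim_rows[0]["ip_address"])  # 2 10.0.0.1
```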
Figure 4: Dim_Ip_Address
Fact table
Figure 6: Fact table
2.2. Analysis
We used the "Schema Workbench" tool to build our OLAP cube: first we created our fact
table, then we added our dimensions.
After creating the cube, we generated with "Schema Workbench" an XML file that we then
used in Mondrian.
Figure 8: XML file generated by Schema Workbench
We placed our XML file in the Mondrian directory, then we configured the connection in
"web.xml", "datasources.xml" and "mondrian.jsp". After that, we used MDX queries to
view the different measures.
2.3. Reporting
The goal of the Tableau and Qlik Sense reports is to provide decision makers with
analysis tables and a set of indicators, giving them a real platform for decision
support.
In this section, we will present the different interfaces with a textual description.
The interface below shows the sum of access_session_time for each user, according to
the email address:
The following interface shows the sum of access input octets for each user:
The interface below shows the sum of access input octets for each user:
Using Qlik Sense:
This interface shows the session time rate per login, in seconds.
This table contains the MAC address and IP address of each user, according to their
email:
2.4. Big Data
Big data is a term that describes the large volume of data, both structured and
unstructured, that inundates a business on a day-to-day basis. But it is not the amount
of data that is important; it is what organizations do with the data that matters. Big
data can be analyzed for insights that lead to better decisions and strategic business
moves.
While the term "big data" is relatively new, the act of gathering and storing large
amounts of information for eventual analysis is ages old. The concept gained momentum
in the early 2000s when industry analyst Doug Laney articulated the now-mainstream
definition of big data as the three Vs: volume, velocity and variety.
The first step consists of extracting data from Facebook using Facepager, in CSV
format:
Figure 16: Extract data using Facepager
We also used VMware Workstation and Cloudera in order to structure our data into a
table.
2.5. Data mining
In this part, we applied data mining algorithms to analyze our data and support
decision making.
Principal Component Analysis (PCA): using this algorithm, we study the relevance of the
variables of our data warehouse. PCA computes the correlation of each variable at each
iteration and neglects those that have a low correlation value.
In our case, PCA allowed us to discover that just two variables are relevant: the
session time (in seconds) and the input data in bytes.
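The variable-screening idea can be sketched in pure Python with Pearson correlation (a simplified illustration of the principle on invented data, not the actual PCA computation we ran on the warehouse):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [1, 2, 3, 4, 5]
variables = {
    "session_time": [2, 4, 6, 8, 10],   # perfectly correlated with the target
    "noise":        [5, 1, 4, 2, 3],    # weakly correlated
}

# Keep only the variables whose |correlation| with the target is high.
relevant = [name for name, xs in variables.items()
            if abs(pearson(xs, target)) > 0.8]
print(relevant)  # ['session_time']
```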
Semantic and sentiment recognition from the Web:
Facebook and Twitter pages are probably the best source of information on how
individuals use a social networking site, because all posts, likes, and comments can be
collected and analyzed.
We were interested in analyzing the Splunk page on Twitter to know what users think
about log file correlation. We found the following results:
We then analyzed the texts posted as statuses, comments and hashtags, and found that
the most common posts include words that deal with networking and log file correlation.
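The word-frequency counting behind such an analysis can be sketched with Python's Counter (the sample posts below are invented for illustration, not the actual tweets we collected):

```python
from collections import Counter
import re

posts = [
    "Splunk makes log correlation easy",
    "Great networking tool for log analysis",
    "Correlation of log files saves time",
]

# Tokenize, lower-case, and drop very short words.
words = [w for p in posts
         for w in re.findall(r"[a-z]+", p.lower()) if len(w) >= 3]
top = Counter(words).most_common(2)
print(top)  # [('log', 3), ('correlation', 2)]
```

A word cloud is simply a visual rendering of these same counts, with font size proportional to frequency.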
The same results were found with the word cloud generated using R:
Conclusion
In this last chapter, we presented the tools we used and the different steps of our
solution.
General conclusion
By the end of fourteen weeks of hard work and perseverance, we managed to finish our
academic project on time, applying what we knew theoretically in order to implement a
decision support system.
During this period, we faced some technical problems due to an unstable environment and
a lack of experience. We consulted documentation and were inspired by various
tutorials. This experience was very beneficial for us on both the professional and the
personal level. We did not only apply the theoretical knowledge we had gained, but we
also broadened our skills in many domains.
This project gave us the opportunity to interact with specialist teachers and to hear
their feedback, which enabled us to develop our communication skills, our ability to
work in a team and, above all else, to gain real-world experience that we would never
acquire through classes alone.