PGDIT Semester III Curriculum

Semester III
POSTGRADUATE DIPLOMA IN IT (PGDIT) -BIG DATA

CURRICULUM
Semester - 1

                                                                      Hours/week            Total Marks
Paper No          Title of the Paper                                  L    P    T  Credits  IA   UE   Total
14PGDIT-BDA101    Programming Java, Agile and Raptor                  3    1    1  4        30   70   100
14PGDIT-BDA102    Linux fundamentals and Python                       3    1    -  4        30   70   100
14PGDIT-BDA103    Big data – 1 (Storage and processing - Pig, Hive)   3    2    -  4        30   70   100
14PGDIT-BDA104    Big data – 2 (HBase and time series)                3    2    -  4        30   70   100
14PGDIT-DA105     Learning Lab – 1 (Agile and Raptor)                 -    3    -  1        CA=50     50
14PGDIT-DA107AL (or)
14PGDIT-DA107BL (or)
14PGDIT-DA107CL   Learning Lab – 2 (Big data)                         -    3    -  1        CA=50     50
Total                                                                 12   12   1  18                 500
Total hours/week: 25
Semester - 2

                                                                      Hours/week            Total Marks
Paper No.            Title of the Paper                               L    P    T  Credits  IA   UE   Total
14PGDIT-BDA201       Advanced Big Data – 1 (SPARK)                    3    2    -  4        30   70   100
14PGDIT-BDA202       Advanced Big Data – 2 (SPARK streaming)          3    2    -  4        30   70   100
14PGDIT-BDA203       Advanced Big Data – 3 (SPARK – Machine Learning) 3    1    -  4        30   70   100
14PGDIT-BT202A (or)  Subject (Security/Social media/Data Lake)        3    1    1  4        30   70   100
14PGDIT-BDA203       Learning Lab – 1a (applied big data)             -    3    -  1        CA=50     50
14PGDIT-BDA204L      Learning Lab – 1a (applied big data)             -    3    -  1        CA=50     50
Total                                                                 12   12   1  18                 500
Total hours/week: 25
Semester 1 Project work:

Subject Code      Title of the project              Credits  IA   UE (Dissertation + Viva)  Total
14PGDIT-BDA205L   Project – Data Engineering – 1    4        13   50 + 25                   100
14PGDIT-BDA206    Project – Data Engineering – 1    3        12

Final Project work (Second semester):

Subject Code      Title of the project              Credits  IA   UE (Dissertation + Viva)  Total
14PGDIT-BDA203    Project – 1a (applied big data)   4        13   50 + 25                   100
14PGDIT-BDA204L   Project – 1a (applied big data)   3        12

Total credits = 18 + 7 + 18 + 7 = 50
Total hours = 25 + 25 = 50 hours/week
Total marks: 1200
SYLLABI
SEMESTER I
Programming Java, Agile and Raptor

XX Hours
1. Agile Programming 15 Hrs

Roles in Agile - Cross-functional Team - How an Agile Team Plans its Work? - What is a User Story? - Relationship
of User Stories and Tasks - When a Story is Done - What is Acceptance Criteria? - How the Requirements are
Defined?
Twelve Principles of Agile Manifesto - Agile – Characteristics - Iterative/incremental and Ready to Evolve - Face-
to-face Communication - Feedback Loop - User Story - Iteration – Release planning - Who is Involved? -
Prerequisites of Planning - Materials Required - Planning Data - Output
2. Raptor 15 Hrs
Program design and development process
 Problem definition
 Pseudo-code
 Flowcharting
 Code modularization
 Coding, testing, and debugging
 Sequence, selection, and iteration patterns
 Array processing
 File processing
Values and Variables
 Integer Values
 Variables and Assignment
 Identifiers
 Additional Integer Types
 Floating-point Types
 Constants
 Other Numeric Types
 Characters
 Enumerated Types
Expressions and Arithmetic
 Expressions
 Mixed Type Expressions
 Operator Precedence and Associativity
 Comments
 Compile-time Errors
 Run-time Errors
 Logic Errors
 Compiler Warnings
 Arithmetic Examples
 Integer Implementation
 Floating-point Implementation
 Bitwise Operators
 Algorithms
Conditional Execution
 Type bool
 Boolean Expressions
 The Simple if Statement
 Compound Statements
 The if/else Statement
 Compound Boolean Expressions
 Nested Conditionals
 Multi-way if/else Statements
Iteration
 The while Statement
 Nested Loops
 Abnormal Loop Termination
 The break statement
 The goto Statement
 The continue Statement
 Infinite Loops
 Iteration Examples
 Drawing a Tree
 Printing Prime Numbers
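The nested-loop and break patterns listed under Iteration can be illustrated with the prime-printing exercise. The following is a minimal Python sketch, assuming trial division is the intended approach; the syllabus does not prescribe an algorithm or a language for this exercise, and the function name `primes_up_to` is hypothetical.

```python
def primes_up_to(limit):
    """Return all primes <= limit using trial division in nested while loops."""
    primes = []
    candidate = 2
    while candidate <= limit:                  # outer loop: each candidate number
        is_prime = True
        divisor = 2
        while divisor * divisor <= candidate:  # inner loop: trial divisors
            if candidate % divisor == 0:
                is_prime = False
                break                          # leave the inner loop early
            divisor += 1
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes


print(primes_up_to(30))
```

The inner loop stops at the square root of the candidate, since any larger factor would pair with a smaller one that has already been tested.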
Using Functions
 Introduction to Using Functions
 Standard Math Functions
 Maximum and Minimum
 clock Function
 Character Functions
 Random Numbers
Arrays
 Static Arrays
 Pointers and Arrays
 Dynamic Arrays
 Copying an Array
 Multidimensional Arrays
 Command-line Arguments
 Vectors vs. Arrays
 Prime Generation with a Vector
Custom Objects
 Object Basics
 Instance Variables
 Member Functions
 Constructors
 Defining a New Numeric Type
 Encapsulation
Handling Exceptions
 Motivation
 Exception Examples
 Custom Exceptions
 Catching Multiple Exceptions
 Exception Mechanics
 Using Exceptions
3. Basic Java 10 Hrs
 Creating Java Projects

 Variables, Datatypes and Operators
 Primitive Data Types - The Byte, Short, Int And Long
 Primitive Data Types - Float And Double
 Primitive Data Types - Char And Boolean
 Understanding Strings
 Operators In Java And Operator Precedence
 Expressions, Statements, Code blocks, Methods and more
 Keywords And Expressions
 Statements, Whitespace and Indentation (Code Organization)
 Code Blocks And The If Then Else Control Statements
 Methods In Java
 Method Overloading
 Control Flow Statements
 The switch statement
 The for Statement
 The while and do while statements
 Project Euler exercises (basic – 20)
4. Intermediate Java 10 Hrs
 OOP Part - Classes, Constructors and Inheritance

 Classes
 Constructors
 Inheritance
 Composition
 Encapsulation
 Polymorphism
Advanced data types
 Arrays, Java inbuilt Lists, Autoboxing and Unboxing

 Arrays
 List and ArrayList
 Autoboxing and Unboxing
 LinkedList
 Inner and Abstract Classes & Interfaces
 Java Generics
 Naming Conventions
 Packages
 Scope
 Access Modifiers
 The static keyword
 The final keyword
 Java Collections
 Binary Search
 Collections List Methods
 Comparable and Comparator
 Maps
 Immutable Classes
 Sets & HashSet
 Sorted Collections
 TreeMap and Unmodifiable Maps
 Project Euler exercises (intermediate – 20)
5. Advanced Java 10 Hrs
 Basic Input & Output including java.util

 Exceptions
 Stack Trace and Call Stack
 Catching and throwing Exceptions
 Multi Catch Exceptions
 Introduction to I/O
 Writing content - FileWriter class and Finally block
 FileReader and Closeable
 BufferedReader
 Load Big Location and Exits Files
 Buffered Writer and Challenge
 Byte Streams
 Reading Binary Data and End of File Exceptions
 Object Input Output including Serialization
 Random Access File
 Java NIO
 Separators Temp Files and File Stores
 Concurrency and Threads Introduction
 Multiple Threads
 Synchronisation
 Producer and Consumer
 Lambda Expressions
 Scope and Functional Programming
 Regular Expressions
 Debugging and Unit Testing
 Databases
 Creating Databases With JDBC in Java
 JDBC Insert, Update, Delete
 executeQuery() and using Constants
 Result Set Meta Data
 Transactions
 Inserting Records With JDBC
 Handling Updates
Books & References
Text books:
1. Java for Programmers, Deitel and Deitel, Prentice Hall, 2016
Reference Books:
 Thinking in Java, Bruce Eckel, Prentice Hall, 2012

 Effective Java, 2nd edition, Joshua Bloch, Addison-Wesley, 2008
Linux fundamentals and Python programming

XX Hours
1 : Linux Basics
 Introduction
 Linux and the Operating System
 Graphical Environments and Interfaces
 Getting Help
 Text Editors
 Shells, bash, and the Command Line
 System Components
 System Administration
 Essential Command Line Tools
 Command and Tool Details
 Users and Groups
 Bash Scripting
 Files and Filesystems
Linux Intermediate
 Filesystem Layout
 Linux Filesystems
 Compiling, Linking and Libraries
 Java Installation and Environment
 Python and dependency installation
 Building RPM and Debian Packages
2: Git and version control
 Introduction to Git
 Git Installation
 Git and Revision Control Systems
 Using Git: an Example
 Git Concepts and Architecture
 Managing Files and the Index
 Commits
 Branches
 Diffs
 Merges
 Managing Local and Remote Repositories
 Using Patches
3 : Basic Python
 Overview of Python- Starting with Python

 Introduction to installation of Python
 Understand Jupyter notebook & Customize Settings
 Concept of Packages/Libraries - Important packages(NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc)
 Installing & loading Packages & Name Spaces
 Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
 List and Dictionary Comprehensions
 Variable & Value Labels – Date & Time Values
 Basic Operations - Mathematical - string - date
 Reading and writing data
 Simple plotting
 Control flow & conditional statements
 Debugging & Code profiling
 Creating classes and modules and calling them
 Packages in Python for analytics - NumPy, SciPy, pandas, scikit-learn, statsmodels, NLTK, etc.
Working with Data in Python
 Importing data from various sources (CSV, TXT, Excel, Access, etc.)
 Database Input (Connecting to database)
 Viewing Data objects - subsetting, methods
 Exporting Data to various formats
 Important Python modules: pandas, Beautiful Soup
 Cleansing Data with Python
 Data Manipulation steps(Sorting, filtering, duplicates, merging, appending, subsetting, derived variables,
sampling, Data type conversions, renaming, formatting etc)
 Data manipulation tools(Operators, Functions, Packages, control structures, Loops, arrays etc)
 Python Built-in Functions (Text, numeric, date, utility functions)
 Python User Defined Functions
 Stripping out extraneous information
 Normalizing data
 Formatting data
 Important Python modules for data manipulation (pandas, NumPy, re, math, string, datetime, etc.)
4: Data Analysis in Python
 Introduction to exploratory data analysis

 Descriptive statistics, Frequency Tables and summarization
 Univariate Analysis (Distribution of data & Graphical Analysis)
 Bivariate Analysis(Cross Tabs, Distributions & Relationships, Graphical Analysis)
 Creating graphs (bar, pie, line chart, histogram, boxplot, scatter, density, etc.)
 Important Packages for Exploratory Analysis(NumPy Arrays, Matplotlib, seaborn, Pandas and scipy.stats
etc)
5: Python for Big Data
 Introduction to ODBC and database programming in Python

 Introduction to Python streaming for Hadoop
 Introduction to PySpark
 Sample programs for practice:
 Word frequency count and visualization
 Sales performance report development
 Monte Carlo simulation of stock price
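As an illustration of the first practice program, a word frequency count can be sketched with the standard library alone. This is a minimal sketch, not the course's reference solution: the tokenizing rule, the sample sentence, and the helper names `word_frequencies` and `text_bar_chart` are assumptions, and a plain-text bar chart stands in for a Matplotlib plot so the block runs without extra packages.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased words; a word is a run of letters or apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def text_bar_chart(counts, top_n=5):
    """Render the top_n most common words as a simple text bar chart."""
    lines = []
    for word, n in counts.most_common(top_n):
        lines.append(f"{word:<10} {'#' * n} {n}")
    return "\n".join(lines)


sample = "Big data is big. Data about data is metadata."
counts = word_frequencies(sample)
print(text_bar_chart(counts))
```

In a notebook, the same `Counter` could be passed to a plotting library for a graphical bar chart.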
Books & References
 Based on Linux Foundation training reference https://training.linuxfoundation.org/linux-courses/development-training, LFD301 - Introduction to Linux, Open Source Development, & Git
 Mastering Linux, Paul S. Wang, CRC Press
 Learn Enough Git to Be Dangerous, Michael Hartl, LearnEnough.com (https://www.learnenough.com/git-tutorial)
 Python: Journey from Novice to Expert (Module 1 only), Dusty Phillips, Fabrizio Romano, Rick van
Hattem, Packt Publishing
 Python: Data Analytics and Visualization, Ashish Kumar, Kirthi Raman, Martin Czygan, Phuong Vo.T.H,
Packt Publishing
Big data – 1 (Storage and processing- Pig, Hive)
1: Basics of big data
Unit I: Course Introduction
 Understanding of lab setup details

 Prerequisites to run the preconfigured VMWare Virtual image, software & hardware requirements
 Login credentials
 Logging authorization.
Unit II: Introduction to Big Data
 Introduction to Big Data

 Definition
 Different types of Data
 Identifying the demand for Big Data and its use cases.
Unit III: Introduction to HPE IDOL
 What is HPE IDOL
 HPE IDOL use cases
 Complexity of the powerful infrastructure software by examining the technology from a high-level
perspective
 Understand the architecture of HPE IDOL
 Different components used
 Understanding of license server and its validity
 HPE IDOL Server configuration
 Different types of connectors supported and their uses, based on the respective ports.
2. Configuration and administration
Unit IV: HPE IDOL Administration
 Different services of the basic architecture and their functionalities
 Understanding the HPE IDOL software through its web-based graphical user interface: navigating the different tabs and their uses, such as checking status, the synchronous process, adding or creating a database, adding content, and initializing and indexing documents
 What is indexing?
 Different indexing options and their uses.
Unit V: HPE IDOL Configuration
 Understanding the different sections in the IDOL server configuration file

 Start editing the sections as per the requirements.
 Configuring the license server
 Overview of the file system, configurations file, connector framework server configuration file and the
different uses of the different sections.
Unit VI: Exploring the connectors
 Understanding the different types of connectors available in HPE IDOL software

 Start working by configuring the File system connectors
 Connector Framework server
 HTTP connectors fetching and indexing manually From/To XML files.
3. Social media
Unit VII: Social Media Connector Configuration
 Different configuration files

 Start working on the live web pages/ social media connectors
 Understanding the security of the Social Media
 Creating the Apps
 App Keys
 Secret Keys
 Retrieving the data and placing in the respective database
Unit Work: Using user accounts created on social media websites such as Facebook and Twitter, retrieve data with the Facebook and Twitter social media connectors. As an assignment, configure a LinkedIn social media connector.
Unit VIII: HPE IDOL Media Servers
 Understanding the media server and its configuration

 Feasible hardware for the media server configuration.
 Understanding how to ingest, analyze and encode the media
 Understanding the Media Server architecture, its system requirements and software dependencies.
 Introduction to the Video Logger Software
 Optical character recognition image server
 Speech server configuration.
Unit IX: Retrieval using IDOL Find, end user search Interface
 Understanding of search engine and its uses

 Setting the conceptual parameters
 Understand the execution of keyword, proximity and Boolean searches
 Conceptualizing and deploying find.
Unit X: Action Commands
 Understanding different action commands

 Action Command Syntax
 GET request
 Query actions
 Get content actions
 List actions
 Saving an output
Unit XI: Parametric Search
 Advanced search using the functions, and indexing the parametric data
 Parametric parameters and their uses
Unit XII: Introduction to Hadoop Big Data
 Introduction about Hadoop Big Data

 Types of data such as Structured Data, Semi Structured Data and Unstructured data and its uses.
4. Architecture
Unit XIII: Hadoop Architecture
 Understanding the prerequisites of hardware and software
 Understanding various configurations and services of Hadoop.
 Understand difference between the regular file system and Hadoop distributed file system.
Unit XIV: Introduction to MapReduce
 Concept of MapReduce
 Different roles of the user
 Working with the JobTracker and TaskTracker
 Flow of MapReduce
 Different concepts of MapReduce.
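The flow of MapReduce (map, shuffle, reduce) can be imitated in a few lines of plain Python. The sketch below is only an in-memory model of the three phases for a word count, under the assumption that word count is a suitable first job; it does not use Hadoop, and the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


result = run_job(["hadoop stores data", "spark processes data", "hadoop scales"])
print(result)
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework performs the shuffle; here everything runs sequentially in one process.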
5. Advanced concepts
Unit XV: Advanced HDFS and MapReduce
 Advanced Hadoop file system

 Hadoop-related concepts such as the steps for decommissioning a DataNode, advanced MapReduce concepts and various joins in MapReduce
Unit XVI: Ecosystem and Its Components
 Hadoop ecosystem Structure

 Different components of Hadoop ecosystem and the different roles
Unit XVII: Basic Hadoop Administration, troubleshooting and security
 Identification of different parameters for performance monitoring

 Performance tuning
 Configure the security parameters in Hadoop
Unit XVIII: Configuring & Integrating Hadoop using IDOL Connector
 Understanding the configuration file of IDOL Connector for Hadoop

 Different section of the configuration file
 Specific changes as per the configuration along with password encryption
 Setting up a secured communication.
Big Data – 2 (HBase and Time Series) XX Hours
1. Introduction to HBase
o The problem with distributed computing
o Installing HBase
o The role of HBase in the Hadoop ecosystem
o How is HBase different from RDBMS?
o HBase Data Model
o Introducing CRUD operations
o HBase is different from Hive
CRUD operations using the HBase Shell
o 1 - Creating a table for User Notifications

o 2 - Inserting a row
o 3 - Updating a row
o 4 - Retrieving a row
o 5 - Retrieving a range of rows
o 6 - Deleting a row
o 7 - Deleting a table
CRUD operations using the Java API
o 8 - Creating a table with HBaseAdmin

o 9 - Inserting a row using a Put object
o 10 - Inserting a list of Puts
o 11 - Retrieving data - Get and Result objects
o 12 - A list of Gets
o 13 - Deleting a row
o 14 - A list of Deletes
o 15 - Mix and match with batch operations
o 16 - Scanning a range of rows
o 17 - Deleting a table
2. HBase Architecture and advanced operations
o HBase Architecture
Advanced operations - Filters and Counters
o 18 - Filter by Row id - RowFilter

o 19 - Filter by column value - SingleColumnValueFilter
o 20 - Apply multiple conditions - FilterList
o 21 - Retrieve rows within a time range
o 22 - Atomically incrementing a value with Counters
3. MapReduce with HBase
o 23 : A MapReduce task to count Notifications by Type
o 23 continued: Implementing the MapReduce in Java
o Demo : Running a MapReduce task
4. Build a Notification Service
o 24 : Implement a Notification Hierarchy
o 25: Implement a Notifications Manager
5. Time series and OpenTSDB
o 26 : Time series data
o 27: Sources of time series
o 28:OpenTSDB architecture
o 29:Inserting the time series data
o 30:Querying TS data
o 31: Aggregation
o Dashboard for application monitoring
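Aggregation of time series data usually means grouping points into fixed time buckets and summarizing each bucket. The following is a minimal pure-Python sketch of averaging by bucket; it is not OpenTSDB's API, and the function name `downsample_avg`, the (timestamp, value) tuple layout and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


def downsample_avg(points, bucket_seconds):
    """Average (timestamp, value) points within fixed-width time buckets.

    Timestamps are integer seconds; each bucket is labelled by its start time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical CPU-load samples: (seconds since start, load percentage)
cpu_load = [(0, 10.0), (20, 20.0), (60, 30.0), (90, 50.0), (125, 40.0)]
print(downsample_avg(cpu_load, 60))
```

Other summaries such as sum, minimum or maximum can be substituted for the average in the final line of the function.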
Advanced Big Data – 1 (SPARK) XX Hours
1 : Introduction to Spark
 Introduction to Apache Spark

 Streaming Data Vs. In Memory Data
 Map Reduce Vs. Spark
 Modes of Spark
 Spark Installation Demo
 Overview of Spark on a cluster
 Spark Standalone Cluster
 Invoking Spark Shell
 Creating the Spark Context
 Loading a File in Shell
 Performing Some Basic Operations on Files in Spark Shell
 Caching Overview
 Distributed Persistence
 Spark Streaming Overview(Example: Streaming Word Count)
2: Data Processing with Spark
 Analyze Hive and Spark SQL Architecture

 Analyze Spark SQL
 Context in Spark SQL
 Implement a sample example for Spark SQL
 Integrating Hive and Spark SQL
 Support for JSON and Parquet file formats
 Implement data visualization in Spark
 Loading of Data
3. Data Processing with Spark using Hive
 Hive Queries through Spark

 Analyze Hive queries
 Implementing sentiment analysis with Hive
 Performance Tuning Tips in Spark
 Shared Variables: Broadcast Variables & Accumulators
4: Spark graph
 Basic graph analysis

 GraphFrames API

 GraphFrames motif finding
 Persisting graph data
 GraphFrames ETL
 GraphFrames Property Graph analysis
 Project – Social network analysis
Books & References
 Workbook designed & developed for the PGDBDA

 Learning Spark, Matei Zaharia, O’Reilly, 2015
Advanced Big Data – 2 (SPARK – Streaming) XX Hours
1. Introduction to Spark streaming

Architecture and Components of Spark and Spark Streaming
Batch versus real-time data processing
Architecture of Spark
Architecture of Spark Streaming
First Spark Streaming program
2. Processing Distributed Log Files in Real Time

Log files – structure and formats
Spark packaging structure and client APIs
Resilient distributed datasets and discretized streams
Data loading from distributed and varied sources
Load log files
Computing metrics and presentation
3. Applying Transformations to Streaming Data
Understanding and applying transformation functions
Basic transformations
Advanced transformations
Data pipeline for real time computing
Performance tuning
4. Persisting Log Analysis Data

Output operations in Spark Streaming
Integration with Cassandra
Integration with Advanced Spark Libraries
Querying streaming data in real time
Summary
5. Deploying in Production
Spark deployment models
High availability and fault tolerance
Monitoring streaming jobs
Summary
Reference books:
Learning Real-time Processing with Spark Streaming, Sumit Gupta, Packt Publishing, September 2015
Advanced Big Data – 3 (SPARK – Machine Learning) XX Hours
1: Introduction
Machine learning: goals, results, supervised/unsupervised – Spark as a tool for Big Data –
Python as the language of Spark – Spark structures for machine learning – sample use cases –
life cycle of machine learning – data preparation for machine learning – binning – outlier
treatment – missing values – feature selection
2: Linear methods
Linear regression – linear relations – normal distribution – heteroscedasticity – dummy variables
– correlation – regression coefficients – Use case: financial modelling
Logistic Regression – Odds – Log odds – linear relations – assumptions in logistic regression –
L1/L2 regularization – Use case: healthcare prediction
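The relation between probability, odds and log odds that underlies logistic regression can be written out directly. This is a small sketch using only the standard library; the function names and the idea of passing explicit `weights` and `bias` are assumptions for illustration, not part of any Spark API.

```python
import math


def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Log odds (logit): the quantity logistic regression models linearly."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Linear combination of features on the log-odds scale, then sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


print(odds(0.8), log_odds(0.5), sigmoid(log_odds(0.8)))
```

A probability of 0.5 corresponds to log odds of zero, which is why a model with all weights at zero predicts 0.5 for every input.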
SVM (Support Vector Machines) – risk boundaries – hyperparameters – Vapnik space – SVM
classifier – SVM regression – linear SVM – non-linear SVM – single-class SVM – Use case:
anomaly detection
3: Non-linear and independent methods
Naive Bayes – probability – priors – conditional probability – posterior probability – discrete
variables – continuous variables – Use case: spam filtering
Decision Trees – selection of variable – impurity measures – entropy – Gini – chi-square –
splitting variables – Use case: Diabetes diagnosis
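The two impurity measures named above, entropy and Gini, can be computed from class proportions, and a candidate split is scored by the size-weighted impurity of the groups it produces. The sketch below uses only the standard library; the helper names and the toy labels in the usage lines are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the samples that falls into each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def weighted_impurity(groups, measure):
    """Impurity of a split: each group's impurity weighted by its size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)


parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]
print(entropy(parent), gini(parent), weighted_impurity(split, gini))
```

A splitting variable is preferred when it lowers the weighted impurity the most relative to the parent node.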
Random forests – diversity metrics – voting – bagging – boosting – random trees - Use case:
Credit scoring
4: Unsupervised methods
Clustering (K-Means) – distance metrics – Euclidean – City block – power distance – similarity
metrics – dissimilarity metrics – number of clusters – elbow criterion – cluster validity –
applications of clustering – Use case: topic grouping
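Two of the distance metrics listed above, Euclidean and city block, and the assignment step of K-Means that relies on them can be sketched as follows. This is a standard-library illustration only, not a full K-Means implementation; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_centroid(point, centroids, distance=euclidean):
    """Index of the centroid closest to point: the K-Means assignment step."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


print(euclidean((0, 0), (3, 4)), city_block((0, 0), (3, 4)))
print(nearest_centroid((1, 1), [(0, 0), (10, 10)]))
```

K-Means alternates this assignment step with recomputing each centroid as the mean of its assigned points until the assignments stop changing.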
Principal Component Analysis (PCA) – dimension reduction – covariance matrix – linear
combinations – new variables – assumptions in PCA – non-linear PCA – Use case: stock analysis
5: Advanced applications of big data
Recommendation (Collaborative filtering) – Basket analysis – affinity modeling – item-based
recommendation – user-based recommendation – slope one recommendation – Use case: Amazon-like
product recommendation