
Semester III

POSTGRADUATE DIPLOMA IN IT (PGDIT) - BIG DATA


CURRICULUM

Semester - 1

(L, P, T = hours/week; IA, UE, Total = marks)

Paper No | Title of the Paper | L | P | T | Credits | IA | UE | Total
14PGDIT-BDA101 | Programming Java, Agile and Raptor | 3 | 1 | 1 | 4 | 30 | 70 | 100
14PGDIT-BDA102 | Linux fundamentals and Python | 3 | 1 | - | 4 | 30 | 70 | 100
14PGDIT-BDA103 | Big data – 1 (Storage and processing - Pig, Hive) | 3 | 2 | - | 4 | 30 | 70 | 100
14PGDIT-BDA104 | Big data – 2 (HBase and time series) | 3 | 2 | - | 4 | 30 | 70 | 100
14PGDIT-DA105 | Learning Lab – 1 (Agile and Raptor) | - | 3 | - | 1 | CA=50 | | 50
14PGDIT-DA107AL (or) 14PGDIT-DA107BL (or) 14PGDIT-DA107CL | Learning Lab – 2 (Big data) | - | 3 | - | 1 | CA=50 | | 50
Total | | 12 | 12 | 1 | 18 | | | 500
Total hours/week: 25

Semester - 2

(L, P, T = hours/week; IA, UE, Total = marks)

Paper No. | Title of the Paper | L | P | T | Credits | IA | UE | Total
14PGDIT-BDA201 | Advanced Big Data – 1 (SPARK) | 3 | 2 | - | 4 | 30 | 70 | 100
14PGDIT-BDA202 | Advanced Big Data – 2 (SPARK streaming) | 3 | 2 | - | 4 | 30 | 70 | 100
14PGDIT-BDA203 | Advanced Big Data – 3 (SPARK – Machine Learning) | 3 | 1 | - | 4 | 30 | 70 | 100
14PGDIT-BT202A (or) | Subject (Security/Social media/Data Lake) | 3 | 1 | 1 | 4 | 30 | 70 | 100
14PGDIT-BDA203 | Learning Lab – 1a (applied big data) | - | 3 | - | 1 | CA=50 | | 50
14PGDIT-BDA204L | Learning Lab – 1a (applied big data) | - | 3 | - | 1 | CA=50 | | 50
Total | | 12 | 12 | 1 | 18 | | | 500
Total hours/week: 25

Semester 1 Project work:

Subject Code | Title of the project | Credits | IA | UE (Dissertation + Viva) | Total
14PGDIT-BDA205L | Project – Data Engineering – 1 | 4 | 13 | 50 + 25 | 100
14PGDIT-BDA206 | Project – Data Engineering – 1 | 3 | 12 | |

Final Project work (Second semester):

Subject Code | Title of the project | Credits | IA | UE (Dissertation + Viva) | Total
14PGDIT-BDA203 | Project – 1a (applied big data) | 4 | 13 | 50 + 25 | 100
14PGDIT-BDA204L | Project – 1a (applied big data) | 3 | 12 | |

Total credits = 18 + 7 + 18 + 7 = 50
Total hours = 25 + 25 = 50 hours/week
Total marks: 1200

SYLLABI

SEMESTER I

Programming Java, Agile and Raptor


XX Hours

1. Agile Programming 15 Hrs


Roles in Agile - Cross-functional Team - How Does an Agile Team Plan its Work? - What is a User Story? - Relationship of User Stories and Tasks - When is a Story Done? - What are Acceptance Criteria? - How are the Requirements Defined?
Twelve Principles of the Agile Manifesto - Agile Characteristics - Iterative/incremental and Ready to Evolve - Face-to-face Communication - Feedback Loop - User Story - Iteration - Release planning - Who is Involved? - Prerequisites of Planning - Materials Required - Planning Data - Output

2. Raptor 15 Hrs
Program design and development process
• Problem definition
• Pseudo-code
• Flowcharting
• Code modularization
• Coding, testing, and debugging
• Sequence, selection, and iteration patterns
• Array processing
• File processing
Values and Variables
• Integer Values
• Variables and Assignment
• Identifiers
• Additional Integer Types
• Floating-point Types
• Constants
• Other Numeric Types
• Characters
• Enumerated Types
Expressions and Arithmetic
• Expressions
• Mixed Type Expressions
• Operator Precedence and Associativity
• Comments
• Compile-time Errors
• Run-time Errors
• Logic Errors
• Compiler Warnings
• Arithmetic Examples
• Integer Implementation
• Floating-point Implementation
• Bitwise Operators
• Algorithms
Conditional Execution
• Type bool
• Boolean Expressions
• The Simple if Statement
• Compound Statements
• The if/else Statement
• Compound Boolean Expressions
• Nested Conditionals
• Multi-way if/else Statements
Iteration
• The while Statement
• Nested Loops
• Abnormal Loop Termination
• The break Statement
• The goto Statement
• The continue Statement
• Infinite Loops
• Iteration Examples
• Drawing a Tree
• Printing Prime Numbers
Using Functions
• Introduction to Using Functions
• Standard Math Functions
• Maximum and Minimum
• clock Function
• Character Functions
• Random Numbers
Arrays
• Static Arrays
• Pointers and Arrays
• Dynamic Arrays
• Copying an Array
• Multidimensional Arrays
• Command-line Arguments
• Vectors vs. Arrays
• Prime Generation with a Vector

Custom Objects
• Object Basics
• Instance Variables
• Member Functions
• Constructors
• Defining a New Numeric Type
• Encapsulation

Handling Exceptions
• Motivation
• Exception Examples
• Custom Exceptions
• Catching Multiple Exceptions
• Exception Mechanics
• Using Exceptions

3. Basic Java 10 Hrs

• Creating Java Projects
• Variables, Datatypes and Operators
• Primitive Data Types - The Byte, Short, Int and Long
• Primitive Data Types - Float and Double
• Primitive Data Types - Char and Boolean
• Understanding Strings
• Operators in Java and Operator Precedence
• Expressions, Statements, Code Blocks, Methods and more
• Keywords and Expressions
• Statements, Whitespace and Indentation (Code Organization)
• Code Blocks and the If Then Else Control Statements
• Methods in Java
• Method Overloading
• Control Flow Statements
• The switch Statement
• The for Statement
• The while and do while Statements
• Project Euler exercises (basic – 20)
4. Intermediate Java 10 Hrs

• OOP Part - Classes, Constructors and Inheritance
• Classes
• Constructors
• Inheritance
• Composition
• Encapsulation
• Polymorphism

Advanced data types

• Arrays, Java built-in Lists, Autoboxing and Unboxing
• Arrays
• List and ArrayList
• Autoboxing and Unboxing
• LinkedList
• Inner and Abstract Classes & Interfaces
• Java Generics
• Naming Conventions
• Packages
• Scope
• Access Modifiers
• The static keyword
• The final keyword
• Java Collections
• Binary Search
• Collections List Methods
• Comparable and Comparator
• Maps
• Immutable Classes
• Sets & HashSet
• Sorted Collections
• TreeMap and Unmodifiable Maps
• Project Euler exercises (intermediate – 20)

5. Advanced Java 10 Hrs

• Basic Input & Output including java.util
• Exceptions
• Stack Trace and Call Stack
• Catching and Throwing Exceptions
• Multi Catch Exceptions
• Introduction to I/O
• Writing content - FileWriter class and finally block
• FileReader and Closeable
• BufferedReader
• Load Big Location and Exits Files
• BufferedWriter and Challenge
• Byte Streams
• Reading Binary Data and End of File Exceptions
• Object Input Output including Serialization
• Random Access File
• Java NIO
• Separators, Temp Files and File Stores
• Concurrency and Threads Introduction
• Multiple Threads
• Synchronisation
• Producer and Consumer
• Lambda Expressions
• Scope and Functional Programming
• Regular Expressions
• Debugging and Unit Testing
• Databases
• Creating Databases with JDBC in Java
• JDBC Insert, Update, Delete
• executeQuery() and using Constants
• Result Set Meta Data
• Transactions
• Inserting Records with JDBC
• Handling Updates

Books & References

Text books:
1. Java for Programmers, Deitel and Deitel, Prentice Hall, 2016

Reference Books:

• Thinking in Java, Bruce Eckel, Prentice Hall, 2012
• Effective Java, 2nd edition, Joshua Bloch, Addison-Wesley, 2008

Linux fundamentals and Python programming


XX Hours

1: Linux Basics

• Introduction
• Linux and the Operating System
• Graphical Environments and Interfaces
• Getting Help
• Text Editors
• Shells, bash, and the Command Line
• System Components
• System Administration
• Essential Command Line Tools
• Command and Tool Details
• Users and Groups
• Bash Scripting
• Files and Filesystems

Linux Intermediate

• Filesystem Layout
• Linux Filesystems
• Compiling, Linking and Libraries
• Java Installation and Environment
• Python and dependency installation
• Building RPM and Debian Packages

2: Git and version control

• Introduction to Git
• Git Installation
• Git and Revision Control Systems
• Using Git: an Example
• Git Concepts and Architecture
• Managing Files and the Index
• Commits
• Branches
• Diffs
• Merges
• Managing Local and Remote Repositories
• Using Patches

3: Basic Python

• Overview of Python - Starting with Python
• Introduction to installation of Python
• Understand Jupyter notebook & Customize Settings
• Concept of Packages/Libraries - Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc.)
• Installing & loading Packages & Name Spaces
• Data Types & Data objects/structures (Strings, Tuples, Lists, Dictionaries)
• List and Dictionary Comprehensions
• Variable & Value Labels – Date & Time Values
• Basic Operations - Mathematical - string - date
• Reading and writing data
• Simple plotting
• Control flow & conditional statements
• Debugging & Code profiling
• How to create classes and modules and how to call them
• Packages in Python for Analytics - NumPy, SciPy, pandas, scikit-learn, statsmodels, nltk, etc.
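
As a small illustration of the "List and Dictionary Comprehensions" topic, the sketch below builds lists and a dictionary from an existing list in single expressions. The variable names and the sample word list are illustrative assumptions, not part of the prescribed workbook.

```python
# Hypothetical sample data: names of tools covered elsewhere in this curriculum.
words = ["spark", "hive", "pig", "hbase"]

# List comprehension: one output element per input element.
lengths = [len(w) for w in words]

# List comprehension with a filter: keep only words longer than three letters.
long_words = [w.upper() for w in words if len(w) > 3]

# Dictionary comprehension: map each word to its length.
length_by_word = {w: len(w) for w in words}

print(lengths)
print(long_words)
print(length_by_word)
```

Each comprehension is equivalent to a short for loop that appends to a list or assigns into a dictionary; the comprehension form is usually preferred when the body is a single expression.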

Working with Data in Python

• Importing Data from various sources (CSV, txt, Excel, Access, etc.)
• Database Input (Connecting to database)
• Viewing Data objects - subsetting, methods
• Exporting Data to various formats
• Important Python modules: Pandas, BeautifulSoup
• Cleansing Data with Python
• Data Manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting, etc.)
• Data manipulation tools (operators, functions, packages, control structures, loops, arrays, etc.)
• Python Built-in Functions (text, numeric, date, utility functions)
• Python User Defined Functions
• Stripping out extraneous information
• Normalizing data
• Formatting data
• Important Python modules for data manipulation (Pandas, NumPy, re, math, string, datetime, etc.)

4: Data Analysis in Python

• Introduction to exploratory data analysis
• Descriptive statistics, Frequency Tables and summarization
• Univariate Analysis (Distribution of data & Graphical Analysis)
• Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
• Creating Graphs (bar, pie, line chart, histogram, boxplot, scatter, density, etc.)
• Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, seaborn, Pandas, scipy.stats, etc.)

5: Python for Big Data

• Introduction to ODBC and database programming in Python
• Introduction to Python streaming for Hadoop
• Introduction to PySpark
• Sample programs for practice:
• Word frequency count and visualization
• Sales performance report development
• Monte Carlo simulation of stock price
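
The first practice program, word frequency count and visualization, might be approached along the following lines. This is a minimal sketch using only the standard library; the function names, the sample sentence, and the plain-text bar chart (standing in for a Matplotlib plot) are assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Runs of letters, digits and apostrophes count as words; everything else separates them.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


def text_bar_chart(counts, top_n=5):
    """Render the top_n most common words as a plain-text bar chart."""
    lines = []
    for word, count in counts.most_common(top_n):
        lines.append(f"{word:<10} {'#' * count} ({count})")
    return "\n".join(lines)


# Hypothetical input; in practice the text would be read from a file.
sample = "Big data is big. Data about data is metadata."
counts = word_frequencies(sample)
print(text_bar_chart(counts, top_n=3))
```

The same `Counter` can be handed to a plotting library for a graphical chart, or the counting step can be moved into a Hadoop streaming or PySpark job once the input no longer fits on one machine.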

Books & References

• Based on Linux Foundation training reference https://training.linuxfoundation.org/linux-courses/development-training, LFD301 - Introduction to Linux, Open Source Development, & GIT
• Mastering Linux, Paul S. Wang, CRC Press
• Learn Enough Git to Be Dangerous, Michael Hartl, LearnEnough.com (https://www.learnenough.com/git-tutorial)
• Python: Journey from Novice to Expert (Module 1 only), Dusty Phillips, Fabrizio Romano, Rick van Hattem, Packt Publishing
• Python: Data Analytics and Visualization, Ashish Kumar, Kirthi Raman, Martin Czygan, Phuong Vo.T.H, Packt Publishing

Big data – 1 (Storage and processing- Pig, Hive)

1: Basics of big data

Unit I: Course Introduction

• Understanding of lab setup details
• Prerequisites to run the preconfigured VMware virtual image, software & hardware requirements
• Login credentials
• Logging authorization

Unit II: Introduction to Big Data

• Introduction to Big Data
• Definition
• Different types of data
• Identifying the demand for Big Data and its use cases

Unit III: Introduction to HPE IDOL

• What is HPE IDOL?
• HPE IDOL use cases
• Complexity of the powerful infrastructure software by examining the technology from a high-level perspective
• Understanding the architecture of HPE IDOL
• Different components used
• Understanding of the license server and its validity
• HPE IDOL Server configuration
• Different types of connectors supported and their uses based on the respective ports

2. Configuration and administration

Unit IV: HPE IDOL Administration

• Different services of the basic architecture and their functionalities
• Understanding the HPE IDOL software through its web-based graphical user interface: navigating the different tabs and their uses, such as checking the status, the synchronous process, adding or creating a database, adding content, and initializing and indexing documents
• What is indexing?
• Different indexing options and their uses

Unit V: HPE IDOL Configuration

• Understanding the different sections in the IDOL server configuration file
• Editing the sections as per the requirements
• Configuring the license server
• Overview of the file system, the configuration file, the Connector Framework Server configuration file and the uses of the different sections

Unit VI: Exploring the connectors

• Understanding the different types of connectors available in the HPE IDOL software
• Configuring the File System connectors
• Connector Framework Server
• HTTP connectors: fetching and indexing manually from/to XML files

3. Social media

Unit VII: Social Media Connector Configuration

• Different configuration files
• Working on live web pages/social media connectors
• Understanding the security of social media
• Creating the Apps
• App Keys
• Secret Keys
• Retrieving the data and placing it in the respective database

Unit Work: Using user accounts created on social media websites such as Facebook and Twitter, students work with the Facebook and Twitter social media connectors; configuring a LinkedIn social media connector is set as an assignment.

Unit VIII: HPE IDOL Media Servers

• Understanding the media server and its configuration
• Feasible hardware for the media server configuration
• Understanding how to ingest, analyze and encode media
• Understanding the Media Server architecture, its system requirements and software dependencies
• Introduction to the Video Logger software
• Optical character recognition image server
• What is speech server configuration?

Unit IX: Retrieval using IDOL Find, end user search Interface

• Understanding of the search engine and its uses
• Setting the conceptual parameters
• Understanding the execution of keyword, proximity and Boolean searches
• Conceptualizing and deploying Find

Unit X: Action Commands

• Understanding different action commands
• Action Command Syntax
• GRE Request
• Query actions
• Get content actions
• List actions
• Saving an output

Unit XI: Parametric Search

• Advanced search using the functions, and indexing the parametric data
• Parametric parameters and their uses

Unit XII: Introduction to Hadoop Big Data

• Introduction to Hadoop Big Data
• Types of data such as Structured Data, Semi-Structured Data and Unstructured Data and their uses

4. Architecture

Unit XIII: Hadoop Architecture

• Understanding the prerequisites of hardware and software
• Understanding various configurations and services of Hadoop
• Understanding the difference between a regular file system and the Hadoop Distributed File System (HDFS)

Unit XIV: Introduction to MapReduce

• Concept of MapReduce
• Different roles of the user
• Working with the JobTracker and TaskTracker
• Flow of MapReduce
• Different concepts of MapReduce
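
The flow of MapReduce listed in Unit XIV (map, then shuffle, then reduce) can be sketched on a single machine as below. This is only an in-memory illustration of the programming model, assuming a word-count job; it does not use Hadoop, and the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical two-line input standing in for a file stored in HDFS.
result = run_job(["hadoop stores data", "mapreduce processes data"])
print(result)
```

In a real cluster the mappers and reducers run as separate tasks on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow between the phases.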

5. Advanced concepts

Unit XV: Advanced HDFS and MapReduce

• Advanced Hadoop file system
• Hadoop-related concepts such as the steps for decommissioning a DataNode, advanced MapReduce concepts and various joins in MapReduce

Unit XVI: Ecosystem and Its Components

• Hadoop ecosystem structure
• Different components of the Hadoop ecosystem and their roles

Unit XVII: Basic Hadoop Administration, troubleshooting and security

• Identification of different parameters for performance monitoring
• Performance tuning
• Configuring the security parameters in Hadoop

Unit XVIII: Configuring & Integrating Hadoop using IDOL Connector

• Understanding the configuration file of the IDOL Connector for Hadoop
• Different sections of the configuration file
• Specific changes as per the configuration, along with password encryption
• Setting up secured communication

Big Data – 2 (HBase and Time Series) XX Hours

1. Introduction to HBase
o The problem with distributed computing
o Installing HBase
o The role of HBase in the Hadoop ecosystem
o How is HBase different from RDBMS?
o HBase Data Model
o Introducing CRUD operations
o How is HBase different from Hive?

CRUD operations using the HBase Shell

o 1 - Creating a table for User Notifications
o 2 - Inserting a row
o 3 - Updating a row
o 4 - Retrieving a row
o 5 - Retrieving a range of rows
o 6 - Deleting a row
o 7 - Deleting a table

CRUD operations using the Java API

o 8 - Creating a table with HBaseAdmin
o 9 - Inserting a row using a Put object
o 10 - Inserting a list of Puts
o 11 - Retrieving data - Get and Result objects
o 12 - A list of Gets
o 13 - Deleting a row
o 14 - A list of Deletes
o 15 - Mix and match with batch operations
o 16 - Scanning a range of rows
o 17 - Deleting a table
2. HBase Architecture and advanced operations
o HBase Architecture

Advanced operations - Filters and Counters

o 18 - Filter by Row id - RowFilter
o 19 - Filter by column value - SingleColumnValueFilter
o 20 - Apply multiple conditions - FilterList
o 21 - Retrieve rows within a time range
o 22 - Atomically incrementing a value with Counters
3. MapReduce with HBase
o 23 - A MapReduce task to count Notifications by Type
o 23 continued - Implementing the MapReduce in Java
o Demo - Running a MapReduce task
4. Build a Notification Service
o 24 - Implement a Notification Hierarchy
o 25 - Implement a Notifications Manager
5. Time series and OpenTSDB
o 26 - Time series data
o 27 - Sources of time series
o 28 - OpenTSDB architecture
o 29 - Inserting the time series data
o 30 - Querying TS data
o 31 - Aggregation
o Dashboard for application monitoring

Advanced Big Data – 1 (SPARK) XX Hours

1: Introduction to Spark

• Introduction to Apache Spark
• Streaming Data vs. In-Memory Data
• MapReduce vs. Spark
• Modes of Spark
• Spark Installation Demo
• Overview of Spark on a cluster
• Spark Standalone Cluster
• Invoking Spark Shell
• Creating the Spark Context
• Loading a File in Shell
• Performing Some Basic Operations on Files in Spark Shell
• Caching Overview
• Distributed Persistence
• Spark Streaming Overview (Example: Streaming Word Count)

2: Data Processing with Spark

• Analyze Hive and Spark SQL Architecture
• Analyze Spark SQL
• Context in Spark SQL
• Implement a sample example for Spark SQL
• Integrating Hive and Spark SQL
• Support for JSON and Parquet File Formats
• Implement Data Visualization in Spark
• Loading of Data

3: Data Processing with Spark using Hive

• Hive Queries through Spark
• Analyze Hive queries
• Implementing sentiment analysis with Hive
• Performance Tuning Tips in Spark
• Shared Variables: Broadcast Variables & Accumulators

4: Spark graph

• Basic graph analysis
• GraphFrames API
• GraphFrames motif finding
• Persisting graph data
• GraphFrames ETL
• GraphFrames Property Graph analysis
• Project – Social network analysis

Books & References

• Workbook designed & developed for the PGDBDA
• Learning Spark, Matei Zaharia, O’Reilly, 2015

Advanced Big Data – 2 (SPARK – Streaming) XX Hours

1. Introduction to Spark streaming


Architecture and Components of Spark and Spark Streaming
Batch versus real-time data processing
Architecture of Spark
Architecture of Spark Streaming
First Spark Streaming program

2. Processing Distributed Log Files in Real Time


Log files – structure and formats
Spark packaging structure and client APIs
Resilient distributed datasets and discretized streams
Data loading from distributed and varied sources
Load log files
Computing metrics and presentation
3. Applying Transformations to Streaming Data
Understanding and applying transformation functions
Basic transformations
Advanced transformations
Data pipeline for real time computing
Performance tuning

4. Persisting Log Analysis Data


Output operations in Spark Streaming
Integration with Cassandra
Integration with Advanced Spark Libraries
Querying streaming data in real time
Summary

5. Deploying in Production
Spark deployment models
High availability and fault tolerance
Monitoring streaming jobs
Summary

Reference books:

Learning Real-time Processing with Spark Streaming, Sumit Gupta, Packt Publishing, September 2015

Advanced Big Data – 3 (SPARK – Machine Learning) XX Hours

1: Introduction

Machine learning: goals, results, supervised/unsupervised – Spark as a tool for Big Data – Python as the language of Spark – Spark structures for machine learning – sample use cases – life cycle of machine learning – data preparation for machine learning – binning – outlier treatment – missing values – feature selection

2: Linear methods

Linear regression – linear relations – normal distribution – heteroscedasticity – dummy variables – correlation – regression coefficients - Use case: financial modelling

Logistic regression – odds – log odds – linear relations – assumptions in logistic regression – L1/L2 regularization - Use case: healthcare prediction
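
The relationship between probability, odds and log odds named above can be made concrete with a few lines of Python. This is a minimal sketch of the definitions only, assuming natural logarithms; the function names and the example weights are illustrative, and no fitting or regularization is performed.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p, for 0 < p < 1."""
    return p / (1.0 - p)


def log_odds(p):
    """Logit: the natural logarithm of the odds."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Logistic regression models the log odds as a linear function of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and feature values, chosen only for illustration.
print(odds(0.8))
print(log_odds(0.5))
print(predict_probability([0.4, -0.2], 0.1, [1.0, 3.0]))
```

Because the model is linear in the log odds, each coefficient describes how much the log odds change per unit change in its feature, which is the "linear relations" assumption listed in the topic.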

SVM (Support Vector Machines) – risk boundaries – hyperparameters – Vapnik space – SVM classifier – SVM regression – linear SVM – non-linear SVM – single-class SVM - Use case: anomaly detection

3: Non-linear and independent methods

Naive Bayes – probability – priors – conditional probability – posterior probability – discrete variables – continuous variables - Use case: spam filtering

Decision trees – selection of variable – impurity measures – entropy – Gini – chi-square – splitting variables - Use case: diabetes diagnosis
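
Two of the impurity measures listed above, entropy and Gini, can be computed directly from the class labels in a node. The sketch below assumes entropy in bits (base-2 logarithm) and uses hypothetical "yes"/"no" labels; the helper names are illustrative and this is not the Spark MLlib implementation.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for an even two-class split."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class split."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def weighted_impurity(groups, measure):
    """Impurity after a split: each child node weighted by its share of the samples."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * measure(group) for group in groups)


# Hypothetical node with an even class mix, and one candidate split of it.
parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes", "no"], ["no"]]
print(entropy(parent), gini(parent))
print(weighted_impurity(split, entropy), weighted_impurity(split, gini))
```

A splitting variable is typically chosen to give the largest drop from the parent's impurity to the weighted impurity of the children.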

Random forests – diversity metrics – voting – bagging – boosting – random trees - Use case:
Credit scoring

4: Unsupervised methods

Clustering (K-Means) – distance metrics – Euclidean – city block – power distance – similarity metrics – dissimilarity metrics – number of clusters – elbow criterion – cluster validity – applications of clustering - Use case: topic grouping
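
The distance metrics named above can be sketched as follows, together with the assignment step of K-Means that uses them. The Minkowski form is used here as one common reading of "power distance", which is an assumption; the function names and the sample points are illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Straight-line distance: Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_centroid(point, centroids, distance=euclidean):
    """K-Means assignment step: index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


# Hypothetical two-dimensional points and centroids.
print(euclidean((0, 0), (3, 4)))
print(city_block((0, 0), (3, 4)))
print(nearest_centroid((1, 1), [(0, 0), (10, 10)]))
```

Swapping the `distance` argument changes which centroid a point is assigned to, so the choice of metric directly shapes the resulting clusters.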

Principal Component Analysis (PCA) – dimension reduction – covariance matrix – linear combinations – new variables – assumptions in PCA – non-linear PCA - Use case: stock analysis

5: Advanced applications of big data

Recommendation (Collaborative filtering) – basket analysis – affinity modeling – item-based recommendation – user-based recommendation – slope one recommendation - Use case: Amazon-like product recommendation
