
FAST-NU, Islamabad

Instructor: Mr. Naveed Iqbal TA: Muhammad Kamran

DWH, Fall 2010 - Project 2

Project Title: MOLAP system for Cotton to improve the Cotton yield in Pakistan

Agriculture is the backbone of Pakistan’s economy, and cotton is one of the main crops
sown in Pakistan. In this project you will use DWH concepts to develop a decision
support system for cotton production in Pakistan. Provided you do this project yourself,
you will INSHA ALLAH see the advantages of a DWH for decision support when
working with large amounts of data coming from different source systems in different
formats, which can open the way to a highly paid job as a DWH specialist. This project
will also have a very large impact on your grade in DWH, so before starting it, kindly
read this document carefully to understand what you are supposed to do.

Submission Deadline: Monday, November 15, 2010


Read These Pages in Full Before Starting the Project
You have been provided with:
1. Data obtained from the source systems.
2. Business Questions to answer given at the end of this document.

Scope:
Many organizations analyze their business-critical data using Online Analytical
Processing (OLAP) technology. OLAP-based data analysis provides an efficient way to
query multidimensional datasets and drill down into the data to find patterns and
ultimately improve the business. A cube contains a set of attributes called
dimensions, which roughly correspond to database fields, except that they also carry a
hierarchy of levels. For example, a time dimension may be divided into levels of years,
quarters, months, and weeks. A cube also contains a collection of measures, which are
the actual data values and are typically numeric. For example, a retail cube will allow
you to view unit sales (measure) according to store location (dimension) and time of
year (dimension).

The first step involves data cleansing and transformation. All the cleansing/transformation
work can be done in any RDBMS. You do have the option to use programs of your own, but
solutions that use SQL will, of course, have an edge in project marks. Please do not
underestimate this project, and start early, as the data given to you has many problems
and needs a lot of cleansing work. You need to identify the anomalies in the data and use
your creativity and innovation to eliminate them. A brief overview of the methodology you
may like to follow is given below. You may fill in the gaps yourselves, and we will be
there to guide you as well. First of all, load all the data arriving from the source into
a staging database using a suitable loading strategy. The schema for the staging tables
will be similar to the source data; only add the Tehsil (an administrative unit of a
District) name to the schema. The format for the source data in all cases is as follows:

1. District Name
2. Mouza Name
3. Farmer Name, Father Name
4. Area
5. Variety of Crop
6. Sowing Date
7. Visit Date
8. Pest Population1
9. Pest Population2
10. Pest Population3
11. Pest Population4
12. Pest Population5
13. Pest Population6
14. Pest Population7
15. Pest Population8
16. Pest Population9
17. Pest Population10
18. Pest Population11
19. Pest Population12
20. Pesticide Used
21. Pesticide Spray Date
22. Pesticide Dosage
23. CLCV (Disease)
24. Plant Height

Data Profiling and Analytics


Once you have the data in the staging area inside RDBMS, perform data profiling for all
the fields. By data profiling we mean the following statistics:
1. No. of unique values (for each column)
2. No. of nulls (for each column)
3. Invalid values (for each column, you have to use your knowledge and understanding.)
4. Total no. of farmers
5. Total no. of Pest vs. Predators
6. Total no. of people who have taken more than one variety in a season.
8. Average no. of farmers using a particular pesticide.
9. Average no. of farmers in each Mouza/Town.
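
As an illustration of the first two statistics, here is a minimal Java sketch that counts the unique and null values of a single column. The class name `ColumnProfile`, its fields, and the decision to treat blank strings as nulls and to compare values case-insensitively are all assumptions made for this example. In the project itself the same figures would more naturally come from SQL run against the staging tables.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: profiles one column of staging data held in memory.
final class ColumnProfile {
    final int distinctCount; // number of unique non-null values
    final int nullCount;     // number of null or blank values

    private ColumnProfile(int distinctCount, int nullCount) {
        this.distinctCount = distinctCount;
        this.nullCount = nullCount;
    }

    // Assumption: a blank string counts as a null, and values are compared
    // after trimming and ignoring case.
    static ColumnProfile of(List<String> values) {
        Set<String> distinct = new HashSet<>();
        int nulls = 0;
        for (String v : values) {
            if (v == null || v.trim().isEmpty()) {
                nulls++;
            } else {
                distinct.add(v.trim().toLowerCase(Locale.ROOT));
            }
        }
        return new ColumnProfile(distinct.size(), nulls);
    }
}
```

Calling `ColumnProfile.of(...)` once per column gives a quick first view of where the nulls and spelling variations are concentrated.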

Data Cleansing
After the profiling, you should have identified the anomalies and data cleansing issues.
Here are some examples:
1. Separate first and last names both for the farmer and the father. Standardize the
first and last names. (Hint: Find all the unique names in the data and create a lookup
table with two columns, i.e. correct_name and variation. Use the lookup table to update
the name fields with standardized names. Never hardcode names in your SQL.)

In some cases, you will find the format as
“farmerName” s/o “fathername” s/o
In the above case you have to remove the last “s/o”, as it is a typo.

2. In some cases, for fields such as plant height or pesticide dosage, the units may be
missing. Identify these cases. If you are able to remedy them, congratulations: you have
started thinking the way a professional DWH developer should.

3. Some dates are not valid.
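
The name cleansing described in item 1 could be sketched in Java roughly as follows. This is only an illustration: the class `NameCleaner`, the method `addVariation`, and the in-memory map standing in for the lookup table are hypothetical, and the sketch assumes that "s/o" is the only separator between farmer and father names. It removes a dangling trailing "s/o", splits the field into farmer and father names, and standardizes each word through the lookup. Splitting first and last names is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of cleansing item 1 for the "Farmer Name, Father Name" field.
final class NameCleaner {
    // Stand-in for the lookup table (variation -> correct_name).
    private final Map<String, String> lookup = new HashMap<>();

    void addVariation(String variation, String correctName) {
        lookup.put(variation.toLowerCase(Locale.ROOT), correctName);
    }

    // Returns {farmerName, fatherName}; fatherName is "" when it is absent.
    String[] split(String raw) {
        // Normalize whitespace first so the patterns below stay simple.
        String s = raw.trim().replaceAll("\\s+", " ");
        // Drop a dangling "s/o" left at the end of the field (the typo case).
        s = s.replaceAll("(?i)\\s+s/o\\s*$", "");
        // Split on the first remaining "s/o" separator.
        String[] parts = s.split("(?i)\\s+s/o\\s+", 2);
        String farmer = standardize(parts[0]);
        String father = parts.length > 1 ? standardize(parts[1]) : "";
        return new String[] { farmer, father };
    }

    // Replaces each word that has a known variation with its standard spelling.
    private String standardize(String name) {
        StringBuilder out = new StringBuilder();
        for (String word : name.trim().split(" ")) {
            if (word.isEmpty()) {
                continue;
            }
            String fixed = lookup.getOrDefault(word.toLowerCase(Locale.ROOT), word);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(fixed);
        }
        return out.toString();
    }
}
```

Keeping the variations in a table (here, a map) rather than in the code follows the hint above: new spellings can be added as data without touching the cleansing logic.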


If you identify and report any additional interesting exceptions/anomalies, you will get
EXTRA CREDIT for them.

Please remember that all scripts/SQL you write for the cleansing or querying must be
included clearly in the project report.

More Tips
1. Try to use everything you know about data quality management. The more you
identify the data quality defects and correct them, the more points you get.

2. Be careful with the anomalies of date values.

3. Do not forget to use the INSERT/SELECT and CREATE TABLE AS SQL constructs.
They will be of great help during this phase (ETL phase that is).

4. Also consider the use of derived tables and columns. They will help you simplify and
cut down your SQL.

5. Use SQL Assistant for running the cleansing/transformation SQL. Programs are allowed
in Java or C# only.
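
For the date anomalies mentioned in the tips above, a small Java sketch of strict date validation is given here. The class `DateCheck`, the `dd/MM/yyyy` input format, and the rule that a visit or spray date should not precede the sowing date are assumptions made for illustration; check them against the actual source data before relying on them.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Optional;

// Hypothetical sketch: strict validation of date strings from the staging area.
final class DateCheck {
    // Assumption: source dates look like dd/MM/yyyy; adjust to the real format.
    // "uuuu" (not "yyyy") is needed for the year when using STRICT resolution.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("dd/MM/uuuu").withResolverStyle(ResolverStyle.STRICT);

    // Returns the parsed date, or empty when the value is missing
    // or is not a real calendar date (for example 31/02/2010).
    static Optional<LocalDate> parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(raw.trim(), FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    // Assumed business rule: a visit or spray date should not precede the sowing date.
    static boolean isPlausible(LocalDate sowingDate, LocalDate laterDate) {
        return !laterDate.isBefore(sowingDate);
    }
}
```

Rows for which `parse` returns empty, or for which `isPlausible` is false, are candidates for the anomaly list in the project report rather than for silent deletion.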

After ETL, you will have to transfer the data into the DWH and generate the aggregates
for the pest scouting data. Aggregates can be generated by a program of your own that
uses nested queries in nested loops and writes the results to text files. Alternatively,
you can use SQL Server Analysis Services, third-party tools, or the CUBE clause.
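
To make the idea of cube aggregates concrete, the following Java sketch computes in memory what a CUBE over two dimensions produces: one total for every combination of keeping or aggregating away each dimension. The class `CubeSketch`, the `Fact` fields (district, year, pest count), and the `"ALL"` marker are hypothetical names chosen for this example, not part of the project data model.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the totals a CUBE over (district, year) would contain.
final class CubeSketch {
    static final String ALL = "ALL"; // marks a dimension that has been aggregated away

    // One fact row: two dimension values and one numeric measure (illustrative names).
    static final class Fact {
        final String district;
        final String year;
        final long pestCount;

        Fact(String district, String year, long pestCount) {
            this.district = district;
            this.year = year;
            this.pestCount = pestCount;
        }
    }

    // Sums pestCount for every combination of "keep" or "aggregate" per dimension.
    // Keys look like "Multan|2009", "Multan|ALL", "ALL|2009" and "ALL|ALL".
    static Map<String, Long> cube(List<Fact> facts) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Fact f : facts) {
            for (String d : new String[] { f.district, ALL }) {
                for (String y : new String[] { f.year, ALL }) {
                    totals.merge(d + "|" + y, f.pestCount, Long::sum);
                }
            }
        }
        return totals;
    }
}
```

With n dimensions the same pattern yields 2^n grouping combinations per fact row, which is why precomputed aggregates grow quickly and are usually written out once rather than recomputed per query.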

You have to generate pivot tables that show the aggregated data. You can use any utility
such as Microsoft Office Web Components (OWC), RadarCube (a powerful API designed to
create true OLAP applications), the Dundas Charting/OLAP tool, etc. You have to provide
functionality for roll-up and drill-down, e.g. it should be possible to view aggregates
by year as well as by month. You also have to create graphs on the aggregates.

· Graphs can be generated with the help of JavaScript.
· Or you can use any third-party tool such as the Dundas charting tool.
· Graphs must be dynamic: as we drill down or roll up, the graphs must change with the
data.
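
The roll-up and drill-down requirement above (year versus month) can be sketched in Java as follows. The class `TimeHierarchy` and its method names are hypothetical, and the sketch assumes month-level totals are already available as a map keyed by `java.time.YearMonth`.

```java
import java.time.YearMonth;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of roll-up and drill-down along a year > month time hierarchy.
final class TimeHierarchy {
    // Roll up: collapse month-level totals into year-level totals.
    static Map<Integer, Long> rollUpToYear(Map<YearMonth, Long> monthly) {
        Map<Integer, Long> yearly = new TreeMap<>();
        for (Map.Entry<YearMonth, Long> e : monthly.entrySet()) {
            yearly.merge(e.getKey().getYear(), e.getValue(), Long::sum);
        }
        return yearly;
    }

    // Drill down: return only the month-level totals that belong to one year.
    static Map<YearMonth, Long> drillDown(Map<YearMonth, Long> monthly, int year) {
        Map<YearMonth, Long> months = new TreeMap<>();
        for (Map.Entry<YearMonth, Long> e : monthly.entrySet()) {
            if (e.getKey().getYear() == year) {
                months.put(e.getKey(), e.getValue());
            }
        }
        return months;
    }
}
```

A dynamic graph would simply be redrawn from whichever of the two maps matches the level the user is currently viewing.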

After this task you have to submit the final project with all the components integrated
and a proper GUI. You also have to submit a report which includes:

• Class diagram
• Variable, type and description.
• Function, type and description.
• Class description.
• Detailed Design
o DWH DB Design – Star Schema having fact tables, dimension tables,
and helper tables if required.
o You may use Dimensional Modeling or De-Normalization.
• High Level System Design.
• Report of using the tool on data entered with findings, if any.

For your assistance, a sample OLAP Tool is shown below. You can build yours according to
your own style/demand.

Extra Credit:
Extra credit will be given to any group that does an innovative task beyond those
specified above, e.g. you can provide functionality for drill-down or roll-up by clicking
on the graph, or compression of the cube. You have an open world: go beyond the specified
limits and come up with a solution so unique that we feel proud to arrange a
special demonstration of your project for faculty and industry professionals, and
then just imagine the level of respect, honor and pride you will achieve. A straight
A+ grade will be a small reward by comparison. PLEASE DON’T UNDERESTIMATE
YOURSELF AS WELL AS THESE WORDS. WE ARE SURE YOU WILL MAKE
US PROUD TO BE THE INSTRUCTORS OF THIS COURSE. CHEER UP,
GOOD LUCK AND MAY ALLAH HELP YOU TO CROSS THE LIMITS.

Notes:
• Plan your work on a daily basis so that you do not miss the deadline, as this project
needs some brainstorming and smart work.
• The deadline is hard: no credit will be given for late submissions.
• The groups for the project will be 3-4 persons per group.
• The application should be developed in C# (preferably).
• Code should be commented like this:

For Classes

/// <summary>
/// Summary description for Class
/// </summary>
For Functions

/// <summary>
/// Function to select data from the Database.
/// </summary>
/// <param name="abc">string Abc to perform xyz task</param>
/// <returns>integer n</returns>
/// <remarks>Developed by Your name on DD/MM/YYYY</remarks>

For Variables etc.

/// <summary>
/// Global/Local/Public/Private Variable to store …….
/// </summary>

Deliverables:
• Properly commented Code
• Normalized and de-normalized databases
• Project report (soft copy) along with a hard copy of the project report. Please be specific.

Submission Guidelines:

• Submit your work to the course folder on margala in a zip file named like
DWH10_Project2_Roll#1_Roll#2_Roll#3_Roll#4.zip, which corresponds to
Project 2 submitted by the mentioned roll numbers.
• Also mail your work (code and report only) to
mohammad.kamranpk@yahoo.com.
• The subject line of the mail should be like DWH10-Project2_Roll#_Roll#...
• A zip file that contains a virus or is corrupted will not be graded in any case.
• You can Cc the submission to yourself to check whether the file is delivered
correctly.
• Follow these guidelines properly, as they carry marks.

Also join the Yahoo group DWH10. For any query, mail the DWH group at
DWH10@yahoogroups.com
____________________________ Good Luck _________________________________

Business Questions to Answer


Following is the list of business questions we want our DWH to answer. The DWH
should be tuned according to these business questions. By tuning we mean indexing
choices and other factors that can affect performance. Obviously, one would not like to
improve the performance of the queries that are used infrequently at the cost of those
queries that are used frequently.
1. Which group of pesticides is effective against a certain group of pests?
2. What is the effect of predators on pest population?
3. What is the effect of pesticides on population of pests and predators?
4. Which pests have been dominant in the last X years?
5. Which pesticides are commonly used in a specific area?
6. What are the major varieties being sown in different agro-ecological zones?
7. What is the effect of sowing date on pest population?
8. What is the ratio of increase in pesticide usage over the last few years?

Note: Please do not limit yourself to the above questions. During project demos,
you may be asked other questions to check the scalability and performance of your
solution. Be innovative.
