INTRODUCTION
This chapter provides an overview of this research project and discusses the
research background, problem statement, research objectives, research scope
and significance of the research.
E-filing provides access to a large database that consists of a list of
electronic files. According to Olson, Edwards and Monty (2003), e-filing is a
highly secure and reliable method for sending, receiving and managing legal
documents. Finding files manually takes time, whereas e-filing provides
secured access that lets users identify the needed files easily without
searching manually through large shelves. Olson et al. (2003) also stated that
state courts, federal courts and law firms across the country are
increasingly using e-filing to improve access to documents, maximize
resources and streamline filing and service activities. It is much easier to
know the status of the needed files and identify their location before going
to the physical files.
Within the e-filing web-based system, staff can easily gather information
about the status of files and identify suitable files that meet their
requirements. The system is developed using a data mining technique,
specifically clustering. According to Phyu (2009), data mining involves the
use of sophisticated data analysis tools to discover previously unknown, valid
patterns and relationships in large data sets. Data mining consists of more
than collecting and managing data; it also includes analysis and prediction.
Garofalakis, Rastogi, Seshadri and Shim (1999) stated that there are three
popular data mining techniques: association rules, classification and
clustering. This research identified a suitable searching method among these
data mining techniques (association, classification or clustering) in order to
develop a prototype of an e-filing web-based system.
All these steps create barriers to giving the best response for each action.
By developing this system, staff can find the files that satisfy their needs,
which creates an interactive environment for them.
1.3 Aim
The aim of this research project is to provide a suitable searching method
using data mining techniques for an e-filing web-based system.
To achieve the aim of the project, the objectives are divided into four. The
objectives are:
The significance of this development is that the system can be used by staff
in Majlis Daerah Kerian. E-filing will act as an information center where
staff can gather information about the status of files. Besides that, it
provides staff with an interactive environment for choosing the files that
meet their requirements.
1.6 Scope of Study
1.7 Limitation
An important task in this study was gathering information from staff in
Majlis Daerah Kerian who are involved in filing management. This was done
through interviews, which required arranging schedules and finding the right
interviewees in order to hold proper and effective sessions. Interview time
was the main constraint, because the researcher had to reschedule whenever an
interviewee canceled a session. This made it difficult for the researcher to
gather all of the information and raised the possibility of missing some
important information. The interview sessions were conducted at Majlis Daerah
Kerian, Parit Buntar, Perak.
Another limitation is that there are three different data mining techniques,
and the researcher must select the technique that best suits the objective.
The researcher needs to study each data mining technique properly and find
related journals that support the findings.
Next, a large number of data mining tools are available, but not every tool
supports every kind of data mining technique. The researcher therefore needs
to study the tools based on their function and usability with the selected
technique. Furthermore, the tool used in this research is new to the
researcher, so time is required to become familiar with it.
The researcher’s experience is another limiting factor, as this is the
researcher’s first research project. However, the researcher can learn and be
properly guided by the research plan and by instruction from the supervisor
and examiner.
1.8 Outcomes/Deliverables
The outcome of the research project is a suitable searching method using a
data mining technique for an e-filing web-based system.
This research project has both a theoretical and a practical part. The
theoretical part describes the concepts and literature review of e-filing and
data mining techniques. The practical part consists of an analysis of data
gathered from the interview sessions and secondary data from the literature
review.
Chapter 2 is the literature review on e-filing and data mining
techniques. This literature acts as a reference for this research
project.
Chapter 3 describes the research approach and methodology used in
this research project. The choice of method, how data is gathered and
the strategy used to perform an analysis of the data are explained.
Chapter 4 discusses the construction of the system’s prototype.
Chapter 5 discusses the findings and the analysis from the interview
sessions and secondary data.
Chapter 6 provides the conclusion and recommendations
for further research.
1.10 Summary
This chapter explains the background of the problem and its proposed
solution together with a brief explanation of the solution. The important
aspects of the project, such as the research background, objectives, scope
and significance, are included in this chapter. The methodology diagram shown
in Figure 3.1 in Chapter 3 and the other contents of this chapter will be used
in the following chapters as the basis for direction.
The next chapter discusses the literature review for the research project.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter describes in detail the related literature that supports the
research project. The literature review also clarifies the relationship
between the study and previous work conducted on the topic. This chapter
covers an overview of e-filing and data mining, a brief explanation of each
data mining technique and the steps in selecting data mining tools.
2.2 E-Filing
According to Olson et al. (2003), there are several reasons why rules are
important for electronic filing:
To define the electronic filing system: electronic filing
and services can mean anything, so the exact information
regarding the type of files must be clearly defined in order to
provide guidance on where and how to access the files.
Short title
Clear definitions of files
Give authority
Determine authorized users
Give effective date
Signature to identify responsible user
2.3 What is Data Mining?
Tang, Steinbach and Kumar (2006) stated that data mining is the
process of automatically discovering useful information in large
database repositories. Data mining techniques are deployed to scour
large databases in order to find novel and useful patterns that might
otherwise remain unknown.
Many other terms found in articles and journals carry a similar or
slightly different meaning, such as knowledge mining from databases,
knowledge extraction, data archaeology, data dredging or data
analysis.
Figure 2.1 : The Process of knowledge discovery in database.
Tang et al. (2006) stated that the input data can be stored in a variety
of formats (flat files, spread-sheets, or relational tables) and may
reside in a centralized data repository or be distributed across multiple
sites. The purpose of preprocessing is to transform the raw input data
into an appropriate format for subsequent analysis. The steps involved
in data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and
selecting records and features that are relevant to the data mining task
at hand. Because of the many ways data can be collected and stored,
data preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
“Closing the loop” is the phrase often used to refer to the process of
integrating data mining results into decision support systems. For
example, in business applications, the insights offered by data mining
results can be integrated with campaign management tools so that
effective marketing promotions can be conducted and tested. Such
integration requires a postprocessing step that ensures that only valid
and useful results are incorporated into the decision support system.
Statistical measures or hypothesis testing methods can also be applied
during postprocessing to eliminate spurious data mining results.
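The preprocessing steps described above can be illustrated with a minimal Python sketch. It assumes that records are simple dictionaries; the function name `preprocess`, the field names (`file_no`, `title`, `shelf`) and the sample records are hypothetical and are not taken from any system described in this research. The sketch fuses records from two sources, drops incomplete records as a crude stand-in for noise removal, removes duplicates and keeps only the selected features.

```python
from typing import Dict, List

Record = Dict[str, str]


def preprocess(sources: List[List[Record]], features: List[str]) -> List[Record]:
    """Fuse several sources, clean the records and keep only the chosen features."""
    # Fuse data from multiple sources into one list.
    fused = [record for source in sources for record in source]

    cleaned: List[Record] = []
    seen = set()
    for record in fused:
        # Feature selection: keep only the fields relevant to the mining task.
        selected = {name: record.get(name, "").strip() for name in features}
        # Treat records with a missing value as noise (an assumption of this sketch).
        if any(value == "" for value in selected.values()):
            continue
        # Remove duplicate observations, ignoring letter case.
        key = tuple(selected[name].lower() for name in features)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(selected)
    return cleaned


# Hypothetical sample data from two sources.
registry = [
    {"file_no": "MDK/01", "title": "Licence", "shelf": "A"},
    {"file_no": "MDK/02", "title": "", "shelf": "B"},
]
log_book = [
    {"file_no": "mdk/01", "title": "licence", "shelf": "A"},
    {"file_no": "MDK/03", "title": "Tender", "shelf": "C"},
]
records = preprocess([registry, log_book], ["file_no", "title"])
print(records)
```

In practice the cleaning rules depend on the data source, so the duplicate key and the definition of noise used here would need to be adapted.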
According to Shyu, Chen and Haruechaiyasak (2005), data mining or
knowledge discovery in databases has emerged recently as an active
research area for extracting implicit, previously unknown, and
potentially useful information from large databases. Their work brings
data mining techniques into the information retrieval (IR) context,
specifically as information filtering tools for a recommender system
framework.
Data Collection: This initial step involves the collection of data sets
for executing the data mining algorithms. Three data components are
considered: (a) textual content (i.e., index terms or keywords), (b) link
structure (embedded hyperlinks within Web pages), and (c) user log
records.
mining algorithms. This step includes the data reduction and selection
techniques to improve the efficiency of the data mining algorithms.
Chen et al. (1996) also provide a list of challenges faced during the
development of data mining techniques, which are:
knowledge. This also encourages a systematic study of
measuring the quality of the discovered knowledge,
including interestingness and reliability, by construction of
statistical, analytical and simulative models and tools.
g. Protection of privacy and data security.
Protecting data security and guarding against the invasion
of privacy are important when data can be viewed from many
different angles and at different abstraction levels.
Security measures can prevent the disclosure of sensitive
information.
Chen et al. (1996) stated the kinds of techniques that can be utilized
during classification, which are:
specific the area that system will perform. For example, a
system is a relational data miner if it discovers knowledge
from relational data, or an object-oriented one if it mines
knowledge from object-oriented databases. In general, a
data miner can be classified according to its mining of
knowledge from the following different kinds of databases:
relational databases, transaction databases, object oriented
databases, deductive databases, spatial databases, temporal
databases, multimedia databases, heterogeneous databases,
active databases, legacy databases, and the Internet
information-base.
2.4.3 Association Rules
Chen et al. (1996) stated that the problem of mining association rules
can be decomposed into the following two steps:
Figure 2.3 represents the general architecture of the Mining Association
Rule (MAR) model. The MAR model consists of two main modules:
pre-processing and processing. The first module, pre-processing, is used
to transform data and to identify and remove inconsistent data from
databases. Next, processing is executed to generate rules and evaluate
the generated rules.
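The rule generation and evaluation performed by the processing module can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard support and confidence measures, considers rules between single items only, and the function names, thresholds and sample transactions are hypothetical rather than part of the MAR model itself.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions: List[FrozenSet[str]],
               min_support: float,
               min_confidence: float) -> List[Rule]:
    """Generate rules 'A -> B' between single items and keep the strong ones."""
    items = sorted(set().union(*transactions))
    rules: List[Rule] = []
    for a, b in combinations(items, 2):
        pair_support = support(frozenset({a, b}), transactions)
        if pair_support < min_support:
            continue  # the pair is not frequent, so no rule is generated from it
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support(frozenset({lhs}), transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical market basket transactions.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.9)
print(rules)
```

With these sample baskets only the rule from butter to bread passes both thresholds, because every basket containing butter also contains bread.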
2.4.4 Classification
Chen et al. (1996) also stated the objectives of classification, which
are:
a. Analyze the training data.
b. Develop an accurate description or a model for each class
using the features available in the data.
Figure 2.4 : Hierarchical Classification Process
(Khodra & Widyantoro, 2007)
2.4.5 Clustering
Qiu, Davis and Ikem (2004) stated that clustering techniques are
heuristic in nature. Almost all techniques have a number of arbitrary
parameters that can be “adjusted” to improve results.
finds all clusters at once. This is in contrast to traditional
hierarchical schemes, which bisect a cluster to get two clusters
or merge two clusters to get one.
Figure 2.5 depicts a typical sequencing of the first three of these steps,
including a feedback path where the grouping process output could
affect subsequent feature extraction and similarity computations.
Pattern representation refers to the number of classes, the number of
available patterns, and the number, type, and scale of the features
available to the clustering algorithm. Some of this information may
not be controllable by the practitioner. Feature selection is the process
of identifying the most effective subset of the original features to use
in clustering. Feature extraction is the use of one or more
transformations of the input features to produce new salient features.
Either or both of these techniques can be used to obtain an appropriate
set of features to use in clustering. Pattern proximity is usually
measured by a distance function defined on pairs of patterns. A
variety of distance measures are in use in the various communities. A
simple distance measure like Euclidean distance can often be used to
reflect dissimilarity between two patterns, whereas other similarity
measures can be used to characterize the conceptual similarity
between patterns. The grouping step can be performed in a number of
ways. The output clustering (or clusterings) can be hard (a partition of
the data into groups) or fuzzy (where each pattern has a variable
degree of membership in each of the output clusters).
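The notions of pattern proximity and of hard versus fuzzy output can be illustrated with a short Python sketch. It assumes that the cluster centers are already known and that patterns are numeric feature vectors. The fuzzy memberships are computed by normalised inverse distance, which is a simplification chosen for this sketch and not a specific published fuzzy clustering algorithm; all names and sample values are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    """Euclidean distance between two patterns with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hard_assign(patterns: List[Point], centers: List[Point]) -> List[int]:
    """Hard clustering: each pattern belongs to exactly one (the nearest) center."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in patterns
    ]


def fuzzy_memberships(pattern: Point, centers: List[Point]) -> List[float]:
    """Fuzzy output: a degree of membership in every cluster, summing to 1."""
    distances = [euclidean(pattern, c) for c in centers]
    at_center = [d == 0.0 for d in distances]
    if any(at_center):
        # The pattern coincides with one or more centers: share membership among them.
        return [1.0 / sum(at_center) if hit else 0.0 for hit in at_center]
    weights = [1.0 / d for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-feature patterns and two assumed cluster centers.
patterns = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers = [(1.0, 1.5), (8.5, 8.0)]
labels = hard_assign(patterns, centers)
memberships = fuzzy_memberships((5.0, 5.0), centers)
print(labels, memberships)
```

A pattern lying between the two centers receives a partial membership in both clusters, whereas the hard assignment forces it into the nearer one.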
association and clustering. These techniques serve the same objective of
data mining, but differ in terms of their function and suitability for the
system.
Clustering divides data into groups (clusters) that are meaningful, useful, or both.
If meaningful groups are the goal, then the clusters should capture the natural
structure of the data. The concept of clustering has been around for a long
time. It has several applications, particularly in the context of information
retrieval and in organizing web resources. The main purpose of clustering is
to locate information and in the present day context, to locate most relevant
electronic resources. In database management, data clustering is a technique
in which, the information that is logically similar is physically stored
together. In order to increase the efficiency of search and the retrieval in
database management, the number of disk accesses is to be minimized. In
clustering, since the objects of similar properties are placed in one class of
objects, a single access to the disk can retrieve the entire class. If the
clustering takes place in some abstract algorithmic space, we may group a
population into subsets with similar characteristic, and then reduce the
problem space by acting on only a representative from each subset.
Clustering is ultimately a process of reducing a mountain of data to
manageable piles. Examples include analyzing the large amounts of genetic
information that are now available, grouping search results into a small
number of clusters, identifying different types of depression and segmenting
customers into a small number of groups for additional analysis and
marketing activities (Ravichandra, 2003).
Table 2.1 : Differences of Classification, Association and Clustering
techniques

Differences: Definition
  Classification: Data classification is the process which finds the common
  properties among a set of objects in a database and classifies them into
  different classes, according to a classification model. (Chen et al., 1996)
  Association: Association rules provide a useful mechanism for discovering
  correlations among items belonging to customer transactions in a market
  basket database (Garofalakis et al., 1999)
  Clustering: Clustering as the process of grouping physical or abstract
  objects into classes of similar objects. (Chen et al., 1996)

Differences: Concept
  Classification: Classification, which is the task of assigning objects to
  one of several predefined categories, is a pervasive problem that
  encompasses many diverse applications. (Tang et al., 2006)
  Association: Association is useful for discovering interesting
  relationships hidden in large data sets. (Tang et al., 2006)
  Clustering: Cluster divides data into groups (clusters) that are
  meaningful, useful, or both. (Ravichandra, 2003)

  Classification: Classification is useful in the Web context to build
  taxonomies
  Association: It will discover the patterns from a large transaction data
  Clustering: It helps data miner to construct meaningful

Differences: Examples
  Classification: Detecting spam email messages based upon the message
  header and content, categorizing cells as malignant or benign based upon
  the results of MRI scans and classifying galaxies based upon their shapes.
  (Tang et al., 2006)
  Association: Huge amounts of customer purchase data are collected daily at
  the checkout counters of grocery stores. Retailers are interested in
  analyzing the data to learn about the purchasing behavior of their
  customers. (Tang et al., 2006)
  Clustering: Analyze the large amounts of genetic information that are now
  available, group the search result into a small number of clusters,
  identify different types of depression and to segment customers into a
  small number of groups for additional analysis and marketing activities.
  (Ravichandra, 2003)
Although classification, association and clustering are similar in terms of
information retrieval, they differ in how the information is retrieved,
analyzed and delivered. Classification assigns objects to one of several
predefined categories in order to develop a model for each class using the
features available in the data. Association is useful for discovering
correlations among data in order to identify interesting relationships hidden
in large data sets, especially for market analysis. Clustering, however,
groups physical or abstract objects into classes of similar objects to provide
a simplified list of data. In other words, it divides data into groups that
are similar, meaningful and useful.
2.6 Selecting Data Mining Tools
The number of data mining tools is increasing, and selecting an effective
tool has become more challenging. The data mining tool market has become more
crowded in recent years, with more than 50 commercial data mining tools
listed on the KDnuggets website (http://www.kdnuggets.com). KDnuggets
describes itself as the Data Mining Community’s Top Resource since 1997 for
data mining and analytics news, tools, jobs, courses, data and more.
a. Performance
Performance, detailed in Table 2.2, is the ability to handle a
variety of data sources in an efficient manner. From a computational
perspective, hardware configuration has a major impact on tool
performance. Besides, some data mining algorithms are more efficient
than others. However, this category focuses on the qualitative
aspects of a tool’s ability to easily handle data under a variety of
hardware configurations. The criteria that should be considered are
platform variety, software architecture, heterogeneous data access,
data size, efficiency, interoperability and robustness.
Table 2.2 : Computational Performance Criteria (Collier et al., 1999)
Criteria: Description
Platform Variety: Does the software run on a wide variety of computer
platforms? More importantly, does it run on typical business user platforms?
Software Architecture: Does the software use client-server architecture or a
stand-alone architecture? Does the user have a choice of architectures?
Heterogeneous Data Access: How well does the software interface with a
variety of data sources (RDBMS, ODBC, CORBA, etc)? Does it require any
auxiliary software to do so? Is the interface seamless?
Data Size: How well does the software scale to large data sets? Is
performance linear or exponential?
Efficiency: Does the software produce results in a reasonable amount of time
relative to the data size, the limitations of the algorithm, and other
variables?
Interoperability: Does the tool interface with other KDD support tools
easily? If so, does it use a standard architecture such as CORBA or some
other proprietary API?
Robustness: Does the tool run consistently without crashing? If the tool
cannot handle a data mining analysis, does it fail early or when the analysis
appears to be nearly complete? Does the tool require monitoring and
intervention or can it be left to run on its own?
b. Functionality
There is a variety of capabilities, techniques and methodologies
for data mining (Table 2.3). Software functionality helps to assess
how well a tool adapts to different data mining problems. The
criteria in the functionality aspect are algorithm variety,
prescribed methodology, model validation, data type flexibility,
algorithm modifiability, data sampling, reporting, model exporting,
user interface, learning curve, user types, data visualization, error
reporting, action history and domain variety.
c. Usability
Usability concerns how well a tool accommodates different levels
and types of users (Table 2.4). One problem with easy-to-use mining
tools is their potential misuse. The criteria that should be
considered are data cleansing, value substitution, data filtering,
binning, deriving attributes, randomization, record deletion,
handling blanks, metadata manipulation and result feedback.
Bialynicka (2008) stated that the data mining tools that suit
clustering are:
Scatter
Grouper
Carrot²
Vivisimo
Scatter is designed for browsing and supports online clustering based on two
novel clustering algorithms, buckshot and fractionation. Buckshot is fast and
suited to online clustering, while fractionation is accurate and suited to
offline initial clustering of the entire set (Bialynicka, 2008).
Grouper is suitable for online purposes and operates on query result
snippets. It clusters together documents with large common subphrases
(Bialynicka, 2008).
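The idea of clustering snippets by common subphrases can be illustrated with a simplified Python sketch. It is not Grouper's actual algorithm: it only indexes fixed-length word sequences and groups the snippets that share one. The function names, the phrase length and the sample snippets are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def phrases(text: str, length: int) -> Set[Tuple[str, ...]]:
    """Return every run of `length` consecutive words in the text (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}


def group_by_shared_phrase(snippets: List[str], length: int = 2) -> Dict[str, List[int]]:
    """Map each phrase shared by two or more snippets to the snippet indices."""
    index: Dict[str, List[int]] = defaultdict(list)
    for doc_id, snippet in enumerate(snippets):
        for phrase in sorted(phrases(snippet, length)):
            index[" ".join(phrase)].append(doc_id)
    # A phrase that occurs in only one snippet does not form a cluster.
    return {phrase: ids for phrase, ids in index.items() if len(ids) > 1}


# Hypothetical search result snippets.
snippets = [
    "land tax assessment notice",
    "business licence renewal form",
    "land tax payment receipt",
    "licence renewal reminder letter",
]
clusters = group_by_shared_phrase(snippets)
print(clusters)
```

Each shared phrase doubles as a readable label for its cluster, which is one reason phrase-based grouping is attractive for search results.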
However, for this research project, the researcher used Carrot², a free tool
that is available for learning purposes.
2.7 Summary
This chapter provides an overview of e-filing and data mining techniques
based on a literature review of several journals. Rules in e-filing, an
overview of data mining and challenges in data mining are discussed. The
researcher also reviews three basic data mining techniques: classification,
association and clustering. The researcher then compares them and selects the
data mining technique suitable as a searching method for the e-filing
web-based system (Refer Table 2.1). Based on the comparison in Table 2.1, the
researcher found that clustering is the suitable searching method for the
e-filing web-based system. Besides, after reviewing several journals
regarding data mining tools, the researcher used Carrot² (an open source
search results clustering engine), a free tool that is available for learning
purposes.
The next chapter discusses the research approach and the methodology for the
research project.
CHAPTER 3
3.1 Introduction
This chapter describes the methodology and approaches that were used in the
research, from problem identification to development of the system. To
achieve the objectives of this project, the right approach must be applied to
reach the best conclusions. This research used five major steps to develop
the prototype of the e-filing web-based system using data mining techniques:
problem identification and planning, requirement gathering, requirement
analysis, design model and develop prototype. An overview of this methodology
is shown in Figure 3.1.
3.2 Problem Identification and Planning
This phase identifies the goal, scope, budget, schedule, technology and
system development process, methods and tools to ensure that everything is
in the right place. However, it depends on what the researcher wants to plan
according to the stakeholder requirements.
Before planning the project, the researcher should know the current situation
and the problems of the old system. An understanding of potential problems is
the main process in making the development successful. After the researcher
identifies the problems, the scope of the project is defined. The goal must
be determined and the objectives of the project must address the problems
that have been identified. After analyzing all the problems and identifying
what tasks need to be done, a measurable and achievable project plan is
scheduled using Microsoft Project. For this research, Microsoft Project is
used to produce a Gantt chart (Refer Appendix A - Project Planning) as a
guideline for the researcher in order to finish the project. Besides, this
phase involves the following list of steps:
3.3 Requirement Gathering
The main advantage of interviews is that the answers of the
interviewees are more spontaneous, without extended reflection. This
can be done using a top-down approach, where the interviewer starts
with a general question and progresses to specific questions about the
task. Interviews should be planned in advance by defining a set of
interview questions to be asked. This not only helps to ensure
consistency between interviews conducted with different interviewees
but also helps to focus on the purpose of the interview session.
The secondary data for this research were collected from many
resources such as articles, journals, books and other related
academic publications about e-filing and data mining. They are
important for gaining a deeper understanding of e-filing and data mining.
This is the next stage after all data have been collected in the requirement
gathering phase. The primary data collected need to be analyzed to define the
system requirements for developing the e-filing web-based system. The
collected data need to be studied and analyzed properly in order to have
accurate, reliable and relevant information during the development. These
requirements helped the researcher to identify the use cases that produce the
system functions, and finally the researcher produced the Software
Requirement Specification (SRS) documentation.
Besides, the secondary data collected during the requirement gathering phase
are useful for identifying a suitable searching method using data mining
techniques. The researcher compared three popular data mining techniques
(association, classification and clustering) and finally arrived at a
suitable searching method among them. The tool used during this phase is
Rational Rose.
The model is designed and determined before proceeding with the actual
construction of the database and system. The system interface, classes,
objects and their relations are designed using Rational Rose. All diagrams
related to this research, including the class diagram, use case diagram and
sequence diagram, are designed based on the results of the requirement
analysis phase.
After all the objects and classes are illustrated clearly with their
attributes and methods, the database is developed. This activity is
accomplished using a MySQL database. At the end of this activity, a
detailed design (database model) is produced. The deliverable of this phase
is documented in the Software Design Document (SDD).
The develop prototype phase involves building the application using the
appropriate development technologies. In this phase, the researcher develops
the prototype of the e-filing web-based system using data mining techniques.
Apache is used as the web server, MySQL as the database server and PHP as the
programming language for the development. Dreamweaver is used as the
workspace for writing program code and Carrot² as the data mining tool. At
the end of this phase, the e-filing prototype system using a data mining
technique is produced.
3.7 Summary
The research methodology describes the research strategy that is used in this
research project. For this research project, a plan of action is laid out that
shows how the problem will be investigated, what information will be
collected using which method and how this information will be analyzed to
come to the conclusion. It consists of problem identification and planning,
requirement gathering, requirement analysis, design model and develop
prototype.
The methodology stated above was followed to develop the e-filing web-based
system in order to achieve the project’s objectives as well as to fulfill the
requirements specified by the user. With an understandable and achievable
methodology, the project was carried out in a proper manner and consequently
completed effectively.
The next chapter discusses the prototype construction for the research project.
CHAPTER 4
PROTOTYPE CONSTRUCTION
4.1 Introduction
Specified below is the list of software tools that are selected during the
development process. These include operating system and other applications
that are compulsory for the system to be developed and deployed.
4.2.2 Software Tools Installation
Referring to Table 4.1, the installation of the tools, namely Apache,
MySQL Server version 5, Rational Rose Enterprise Edition, Adobe
Photoshop CS3 and Macromedia Dreamweaver MX 2004, is explained
further as follows.
a. Apache
b. MySQL Version 5
d. Adobe Photoshop CS3
f. Carrot²
collections of documents, e.g. search results, into thematic
categories.
4.4.1 Requirement Analysis Phase
a. System Design
b. Detailed Design
a. Coding
Figure 4.1 : Coding index.php
c. Interface
Figure 4.2 shows the main page of the system. This page
appears after an authorized user (staff) enters the system.
This page shows the list of menus for staff to handle the
system.
(Refer Appendix F – Description of Interface System)
Figure 4.2 : The main page interface of e-filing
4.5 Summary
The next chapter discusses the results and findings for the research project.
CHAPTER 5
5.1 Introduction
This chapter explains how the collected data are organized, analyzed and
finalized to be used in the development phase of the research. The results of
the research that has been conducted are explained in depth in this chapter.
They include the findings and results gathered from the interviews and
discussions.
In developing a Software Requirements Specification (SRS) of good quality,
it is important to correctly elicit requirements from stakeholders. The
interview sessions were conducted with Encik Gobibaskaran A/L
Govindaraju, the Head of Information Technology at Majlis Daerah Kerian,
and Puan Shalina Mat Piah, the Administrative Assistant at Majlis Daerah
Kerian. The interview questions are divided into two categories. The first
category focuses on the current problems faced by staff in Majlis Daerah
Kerian. All the necessary data about the current problems were collected
through this category. The second category focuses on the functional
requirements for the system to be developed. The sample interview
questions can be found in Appendix C.
Interviewee :
Puan Shalina Mat Piah, Administrative Assistant,
Majlis Daerah Kerian.
The results gained from the first category of the interview questions
are presented in Table 5.1 below.
Table 5.1 : The problems that have been identified from the interviews.
Problem: PQ.1
  Researcher: Is the current manual system easy and comfortable for you?
  Interviewee: No
Problem: PQ.2
  Researcher: Please describe the current manual system for managing and
  searching files.
  Interviewee: It involves many steps:
    Searching for the number of the required file using the log book.
    Determining the file name using the file number.
    Checking for the needed file on many big shelves, which takes a long
    time.
    Checking each staff member’s table or other departments in Majlis
    Daerah Kerian if the file is not on the shelf.
Problem: PQ.3
  Researcher: Is it easy to identify the suitable files manually according
  to your requirement?
  Interviewee: No
Problem: PQ.4
  Researcher: Why do you think it is not easy to identify the suitable
  files manually?
  Interviewee:
    Difficult to search for the suitable files.
    Difficult to know the status of the files.
    It takes a long time.
    There are thousands of files on the shelf.
    Sometimes, files are exchanged between departments.
Problem: PQ.5
  Researcher: In your opinion, is it important for MDK to have a web-based
  system that will act as an information center for staff to gather
  information about the status of the files?
  Interviewee: Yes, of course
51
5.2.2 Functional Requirements
Interviewee :
Encik Gobibaskaran A/L Govindaraju, Head of IT Department,
Majlis Daerah Kerian.
The second category of interview questions focused on the functional
requirements of the system. The requirements and suggestions gathered from
the interviews are presented in Table 5.2 below.

information and maintain files.

RQ.4
  Researcher  : What is your suggestion about the language to develop the
                system?
  Interviewee : Use an open source language that suits any platform, such as
                PHP.

RQ.5
  Researcher  : What is your suggestion about the database to develop the
                system?
  Interviewee : MySQL database
Based on Table 5.2, several processes for the system were identified. These
requirements describe the functionality of the e-filing web-based system.
They were collected and analyzed to produce the new system.
5.3 Use Case Diagram
[Figure 5.2: use case diagram. Labels recoverable from the figure: actor
Admin; use cases View Files, Delete Files, Validate User, Delete Staff.]
Figure 5.2 above shows the use case diagram for the e-filing web-based
system. It illustrates the functionality available to the administrator,
manager and staff. All three must register before they can use the system and
must then log in. After logging in, the admin can maintain user accounts,
view files and delete files. The manager can maintain user information, file
information and customer information, and can delete staff. Staff can
maintain user information, file information and customer information.
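The role-to-use-case mapping described above can be sketched as a small lookup table. This is only an illustration in Python, not the code of the prototype; the names PERMISSIONS and is_allowed are hypothetical, and the use case labels are taken from the paragraph above.

```python
# Hypothetical role-to-use-case map, following the description of Figure 5.2.
PERMISSIONS = {
    "admin": {"maintain user account", "view files", "delete files"},
    "manager": {
        "maintain user information",
        "maintain files information",
        "maintain customer information",
        "delete staff",
    },
    "staff": {
        "maintain user information",
        "maintain files information",
        "maintain customer information",
    },
}


def is_allowed(role, use_case, logged_in=True):
    """Return True if a logged-in user with this role may perform the use case."""
    if not logged_in:
        return False  # every actor must log in before using the system
    # Unknown roles get an empty permission set.
    return use_case in PERMISSIONS.get(role, set())
```

With this sketch, is_allowed("manager", "delete staff") is true, while the same call for "staff" is false, which mirrors the difference between the two roles in the use case description.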
The use cases are described in Table 5.3.
5.4 Class Diagram
[Class diagram for the e-filing web-based system. Classes recoverable from
the figure are listed below. The association labels visible in the figure are
validate, manage and have, with multiplicities 1, 0..* and 0..n.]

advisor <<entity>>
  Attributes : advisor_no <<PK>>, advisor_ic, advisor_name, advisor_hp,
               advisor_email, dept_name
advisor_form <<boundary>>
  Attributes : advisor_no, advisor_ic, advisor_name, advisor_hp,
               advisor_email
  Operations : add_advisor(), update_advisor(), display_advisor()
advisor_control <<control>>
  Operations : set_advisor_detail(), set_advisor_update()

login <<entity>>
  Attributes : user_name <<PK>>, user_password, user_level, user_dept
login_form <<boundary>>
  Attributes : user_id, user_name, user_password
  Operations : validate_user(), update_user(), delete_user(), display_user()
login_control <<control>>
  Operations : set_user_update(), remove_user()

file <<entity>>
  Attributes : file_id <<PK>>, file_name, file_status, file_remark,
               open_date, update_date, staff_no, dept_name
file_form <<boundary>>
  Attributes : file_id, file_name, file_status, file_remark, open_date,
               update_date, staff_no, dept_name
  Operations : add_files(), update_files(), delete_files(), display_files()
file_control <<control>>
  Operations : search_files(), set_file_detail(), set_file_update(),
               remove_file()

staff <<entity>>
  Attributes : staff_no <<PK>>, staff_ic, staff_name, staff_add1, staff_add2,
               staff_city, staff_postcode, staff_state, staff_hp,
               staff_email, dept_name, advisor_no
staff_form <<boundary>>
  Attributes : staff_no, staff_ic, staff_name, staff_add1, staff_add2,
               staff_city, staff_postcode, staff_state, staff_hp,
               staff_email, dept_name, advisor_no
  Operations : add_staff(), update_staff(), delete_staff(), display_staff()
staff_control <<control>>
  Operations : search_staff(), set_staff_detail(), set_staff_update(),
               removeStaff()

customer <<entity>>
  Attributes : cust_id <<PK>>, file_id, cust_ic, cust_name, cust_add1,
               cust_add2, cust_city, cust_postcode, cust_state, cust_phone,
               staff_no
customer_form <<boundary>>
  Attributes : file_id, cust_ic, cust_name, cust_add1, cust_add2, cust_city,
               cust_postcode, cust_state, cust_phone, staff_no
  Operations : add_cust(), update_cust(), delete_cust(), display_cust()
customer_control <<control>>
  Operations : search_cust(), set_cust_detail(), set_cust_update(),
               remove_cust()
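To make the entity and control roles in the class diagram more concrete, the file entity and its control class can be sketched as follows. This is a minimal illustration in Python, not the prototype's code. The attribute and operation names come from the diagram, but the parameter lists, the in-memory dictionary used as storage and the substring matching in search_files() are assumptions, since the diagram does not specify them.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Entity 'file'; attribute names follow the class diagram."""
    file_id: str
    file_name: str
    file_status: str
    file_remark: str = ""
    open_date: str = ""
    update_date: str = ""
    staff_no: str = ""
    dept_name: str = ""


class FileControl:
    """Control class 'file_control', backed by an in-memory dict (an assumption)."""

    def __init__(self):
        self._files = {}

    def set_file_detail(self, record):
        # Store a new file record under its primary key.
        self._files[record.file_id] = record

    def set_file_update(self, file_id, **changes):
        # Update selected attributes of an existing record.
        record = self._files[file_id]
        for name, value in changes.items():
            if not hasattr(record, name):
                raise AttributeError(name)
            setattr(record, name, value)

    def remove_file(self, file_id):
        # Removing an unknown file_id is silently ignored.
        self._files.pop(file_id, None)

    def search_files(self, keyword):
        # Case-insensitive substring match on the file name.
        keyword = keyword.lower()
        return [r for r in self._files.values() if keyword in r.file_name.lower()]
```

In this sketch the boundary class (file_form) would collect the field values from the user and pass them to FileControl, which keeps the form free of storage logic.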
5.5 Clustering as the Suitable Searching Method
5.5.1 Introduction
For this research project, it is important to select a suitable searching
method based on data mining techniques. The researcher reviewed three main
data mining techniques: classification, association and clustering. These
techniques share the same overall objective of data mining but differ in
their function and in their suitability for the system.
Aliakbary, Khayyamian and Abolhassani (2008) stated that clustering
search results helps the user to get an overview of the returned results
and to focus on the desired clusters. Most search result clustering
methods use the title, URL and snippets returned by a search engine as the
source of information for creating the clusters.
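The idea of building clusters from titles and snippets can be illustrated with a very small sketch. It groups results under the terms that the most titles and snippets share. This is a deliberately simple frequent-term grouping written in Python for illustration only; it is not the algorithm of Aliakbary et al. (2008) or of Carrot², and the function names, the stop word list and the result format (a dict with "title" and "snippet" keys) are all assumptions.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stop word list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokens(text):
    """Return the set of lower-case words in a text, without stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def cluster_results(results, query, max_clusters=3):
    """Group results under the terms shared by the most titles and snippets."""
    query_terms = tokens(query)
    # Query terms appear in every result, so they make poor cluster labels.
    doc_terms = [tokens(r["title"] + " " + r["snippet"]) - query_terms
                 for r in results]
    frequency = Counter(term for terms in doc_terms for term in terms)
    # Keep only terms shared by at least two results; break ties alphabetically.
    labels = sorted((t for t, n in frequency.items() if n >= 2),
                    key=lambda t: (-frequency[t], t))[:max_clusters]
    clusters = defaultdict(list)
    for result, terms in zip(results, doc_terms):
        matched = [label for label in labels if label in terms]
        # A result may belong to several clusters; unmatched ones go to "other".
        for label in matched or ["other"]:
            clusters[label].append(result["title"])
    return dict(clusters)
```

Given a handful of file records whose snippets mention "licence" or "tax", the function returns one group per shared term plus an "other" group, so the user can open the relevant group instead of scanning one long list.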
Figure 5.4 : Google’s One Dimensional Result List
Figure 5.4 below shows a good search result list produced with the
clustering technique. Using the same keywords as in Figure 5.3 above,
“clustering search result”, it returns only about 194 results, which is
more accurate, simpler and easier to choose from.
Figure 5.5 below shows a search result list produced with the clustering
technique that is available on the World Wide Web
(http://search.carrot2.org).
5.5.4 Clustering Search Result from e-filing web-based system
Figure 5.7 below shows the search result list produced with the clustering
technique in the e-filing web-based system.
Figure 5.7 : Good clustering result list from e-filing web-based system
Figure 5.8 below shows the data mining tool provided by Carrot², an open
source framework for building search clustering engines. The necessary
code was added to the system to cluster search results.
Figure 5.8 : Data Mining Tool by Carrot²
5.6 Summary
The next chapter discusses the conclusion and recommendations for the
research project.
CHAPTER 6
6.1 Introduction
This chapter summarizes what the researcher has done, from defining the
objectives to obtaining the findings through the development of the
prototype of the e-filing web-based system using data mining techniques.
It also concludes the report for this project, states the limitations of the
software and gives recommendations for those who wish to pursue further
research on the development of the e-filing web-based system.
6.2 Conclusions
The first objective of the research project was to identify the requirements
for e-filing at Majlis Daerah Kerian. This objective was achieved through
requirement gathering: interview sessions were conducted with staff of Majlis
Daerah Kerian to learn the current problems and the functional requirements
for the e-filing web-based system. The deliverable for this objective is
documented in Appendix D: Software Requirement Specification (SRS).
The second objective of the research project was to identify the searching
method based on data mining techniques. For this phase, the researcher
reviewed many resources, such as articles, journals, books and other academic
publications about e-filing and data mining, in order to gain a deeper
understanding of both. This secondary data was used to identify a suitable
searching method using data mining techniques. The researcher compared three
popular data mining techniques (association, classification and clustering)
in order to identify a suitable technique for the searching method in the
e-filing web-based system. This objective was achieved when the researcher
found that clustering is the suitable searching method for the e-filing
web-based system.
After the second objective had been achieved, the research proceeded with the
third objective, designing the e-filing web-based system. This objective was
achieved through the design stage, which consists of system design and
detailed design. The system design highlighted the importance of interface
design and of human-computer interface characteristics through the proper
choice of colors, buttons and fonts. In addition, an overall system structure
was produced to illustrate how the system works as a whole. The detailed
design addressed the design of the classes and the detailed workings of the
system, describing the attributes, operations and classes. The deliverables
for the third objective are documented in Appendix E: Software Design
Document (SDD).
By developing the e-filing web-based system for Majlis Daerah Kerian, it is
expected that staff will have an interactive environment in which to choose
the files that meet their requirements. It is also expected that the system
will help staff to identify the files they need faster and more accurately,
as a result of the searching method based on the selected data mining
technique. The system is also expected to become an information center where
staff of Majlis Daerah Kerian can gather information about the status of
files.
Although all the objectives have been achieved, the e-filing web-based
system using data mining techniques is far from complete and has its own
limitations. Many improvements can still be considered to enhance this
project. The limitations and recommendations for this project are discussed
below.
6.3 Limitations
a. The interview sessions for gathering information about the current
problems and functional requirements were conducted only with the Head
of Information Technology and the Administrative Assistant of Majlis
Daerah Kerian. Interviewing only two people provides limited
information about the requirements.
c. There are many journal articles on data mining techniques, but the
researcher had difficulty understanding them because the subject was
unfamiliar.
d. There are three different data mining techniques, and the researcher had
to select the one that best suits the objective. The researcher needed to
study each technique properly and find related journal articles that
support the findings.
e. There are a large number of data mining tools available, but not all of
them support every kind of data mining technique. The researcher therefore
needed to study the tools based on their function and usability with the
selected data mining technique. Furthermore, the tool used in this
research was new to the researcher, so time was needed to become familiar
with it.
6.4 Recommendations
b. It is suggested that this system can be used by other local governments,
not only Majlis Daerah Kerian.
c. It is suggested that the project be made available online through the
Internet so that it can be accessed by everyone at any time and anywhere,
because the project currently has limited access through a Local Area
Network (LAN) only.
REFERENCES
Abbott, D.W., Matkovsky, I.P., & Elder, J.F. (1998). An Evaluation of High-end
Data Mining Tools for Fraud Detection. IEEE Transaction on Knowledge and
Data Engineering, 2836.
Aliakbary, S., Khayyamian, M., & Abolhassani, H. (2008). Using Social Annotations
for Search Result Clustering. Retrieved February 10, 2010, from
http://www.springerlink.com/index/v770wm385n256p68.pdf
Apache. (2002). Retrieved February 14, 2010, from The Apache Software
Foundation: http://apache.org/
Bennett, S., McRobb, S., & Farmer, R. (2006). Object-Oriented Systems Analysis
and Design Using UML Third Edition. McGraw-Hill Education(UK)
Limited.
Bialynicka, I. (2008). Clustering Web Search Results. Retrieved March 2, 2010, from
http://medialab.di.unipi.it/web/Search+QA/Seminar/Clustering.ppt
Chen, M., Han, J., & Yu, S.Y. (1996). Data Mining : An Overview from a Database
Perspective. IEEE Transaction on Knowledge and Data Engineering, 8, 6.
Collier, K., Carey, B., Sautter, D., & Marjaniemi, C. (1999). A Methodology for
Evaluating and Selecting Data Mining Software. IEEE Transaction on
Knowledge and Data Engineering, 2-4.
Defit, S., & Md Sap, M. N. (2009). Mining Association Rule from Large Databases.
Retrieved October 10, 2009, from http://fsksm.utm.edu.my
Garofalakis, M. N., Rastogi, R., Seshadri, S., & Shim, K. (1999). Data Mining and
the Web : Past, Present and Future. Retrieved July 17, 2009, from
http://www.softnet.tuc.gr/~minos/Papers/widm99.pdf
IBM Corporation. (2006). IBM Rational Rose. Retrieved March 1, 2010, from
http://ftp.software.ibm.com/software/rational/web/datasheets/rose_ds.pdf
Jain, A. K., Murty, M. N., & Flynn, P. J. (2000). Data Clustering: A Review. ACM
Computing.
Olson, T., Edwards, M., & Monty, H.A. (2003). A Guide to Model Rules for
Electronic Filing and Service. Retrieved July 15, 2009, from
http://www.ncsconline.org/WC/Publications/External_ElFileModelRulesLexisPub.pdf
Qiu, M., Davis, S., & Ikem, F. (2004). Evaluation of Clustering Techniques in Data
Mining Tools. Retrieved January 5, 2010, from
http://www.iacis.org/iis/2004_iis/PDFfiles/QiuDavisIkem.pdf
Shyu, M. L., Chen, S. C., & Haruechaiyasak, C. (2005). Retrieved February 12,
2010, from
http://www.hlt.nectec.or.th/Publications/Conferences/A%20Data%20Mining%20Framework%20for%20Building%20A%20Web-Page%20Recommender%20System.pdf
Tang, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining.
Boston : Pearson Education.
Zhang, H., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search
Engine Results. Retrieved February 6, 2010, from http://www.ieee.org
APPENDICES
APPENDIX A
PROJECT PLANNING
APPENDIX B
PROGRESS SLIDE PRESENTATION
APPENDIX C
INTERVIEW QUESTION
APPENDIX D
SOFTWARE
REQUIREMENT SPECIFICATION
(SRS)
APPENDIX E
SOFTWARE DESIGN DOCUMENT
(SDD)
APPENDIX F
DESCRIPTION OF SYSTEM
INTERFACE
APPENDIX G
IN-PROGRESS ASSESSMENT
UNIVERSITI TEKNOLOGI MARA
BSc. (Hons)
INFORMATION SYSTEM ENGINEERING
MAY 2010
DECLARATION
This declaration certifies that this thesis and all of its submitted contents
are original, excluding those that have been acknowledged specifically in the
references. The contents of this thesis are of my own endeavor, and any ideas
or quotations from the work of other people, published or otherwise, are fully
acknowledged in accordance with the standard referencing practices of the
discipline.
APPROVAL
By
This thesis is prepared under the direction of thesis coordinators, Assoc. Prof. Wan
Nor Amalina Wan Hariri and Assoc. Prof. Rashidah Md. Rawi, Information System
Engineering Program, and it has been approved by the thesis supervisor, Puan
Norisan Abd Karim. It was submitted to the Faculty of Computer and Mathematical
Sciences and was accepted in partial fulfillment of the requirement for the degree of
Bachelor of Science.
Approved by:
__________________________
Madam Norisan Abd Karim
Thesis Supervisor
Date: 24th May 2010
DEDICATION
ACKNOWLEDGEMENT
Firstly, I would like to express my gratitude to Allah S.W.T for giving me
the strength to complete this project. Without His blessing and permission,
this project could not have been completed.
Special thanks to Mr. Gobibaskaran and Puan Shalina for giving me the
opportunity to conduct the interview sessions that helped me in gathering the
requirements for this project.
TABLE OF CONTENTS
TITLE PAGE
ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1 Research Background
1.2 Problem Statement
1.3 Aim
1.4 Objective of the Research
1.5 Significance of Research
1.6 Scope of Study
1.7 Limitation
1.8 Outcomes/Deliverables
1.9 Layout of Dissertation
1.10 Summary
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
2.2 E-Filing
2.2.1 Introduction to E-Filing
2.2.2 Purposes of the Rules in E-Filing
2.2.3 Proposed Model Rules for E-Filing
2.3 What is Data Mining
2.3.1 Definition of Data Mining
2.3.2 Data Mining & Knowledge Discovery
2.3.3 Challenges of Data Mining
2.4 Data Mining Techniques
2.4.1 Overview of Data Mining Techniques
2.4.2 Classifying Data Mining Techniques
2.4.3 Association Rules
2.4.4 Classification
2.4.5 Clustering
2.5 Selecting Data Mining Techniques
2.6 Selecting Data Mining Tools
2.7 Summary
CHAPTER 3
RESEARCH APPROACH AND METHODOLOGY
3.1 Introduction
3.2 Problem Identification and Planning
3.3 Requirement Gathering
3.3.1 Primary Data
3.3.2 Secondary Data
3.4 Requirement Analysis
3.5 Design Model
3.6 Develop Prototype
3.7 Summary
CHAPTER 4
PROTOTYPE CONSTRUCTION
4.1 Introduction
4.2 Software Requirement
4.2.1 Software Tools
4.2.2 Software Tools Installation
4.3 Hardware Requirements
4.4 Development Phase
4.4.1 Requirement Analysis Phase
4.4.2 Design Phase
4.4.3 Development Phase
4.5 Summary
CHAPTER 5
RESULT AND FINDINGS
5.1 Introduction
5.2 Interview Results
5.2.1 Current Problems
5.2.2 Functional Requirements
5.3 Use Case Diagram
5.4 Class Diagram
5.5 Clustering as the Suitable Searching Method
5.5.1 Introduction
5.5.2 Why Clustering Search Result
5.5.3 Examples of Clustering Search Result
5.5.4 Clustering Search Result from e-filing web-based system
5.6 Summary
CHAPTER 6
CONCLUSION AND RECOMMENDATIONS
6.1 Introduction
6.2 Conclusions
6.3 Limitations
6.4 Recommendations
REFERENCES
APPENDICES
LIST OF TABLES
LIST OF FIGURES
ABSTRACT