
3.1 Architectural Design
3.1.1 Problem Specification
Searching web documents is a time-consuming task, because the number of web documents
increases significantly every day and a linear search over them is slow. This project aims to
reduce the detail and diversity of the data, and the resulting information overload, by grouping
similar documents together in the database. The web pages are categorized to make search
faster and more efficient.
Classification of web pages means grouping web documents by class: each document is
preprocessed and indexed under its category, so that similar documents are stored together in
the database.
3.1.2 ER Diagram
[Figure: ER diagram. Entities and attributes: Words (wid); Document (doc ID, URL);
Matrix Value (wid, docid, Value); User (user id, password, email);
Class (Class ID, Class name, doc ID). Relationship labels: has, has, gives, uses.]
3.1.3 Data Flow Diagram
Context level or Zero level DFD
This 0th-level Data Flow Diagram shows an abstract view of the project. The user gives the
URL of a web page as input, and the web page is classified into a class based on its textual
content.
Level 1 DFD
[Figure: data flow diagram. Labels: Web page grouping; Web page classification System; URL;
User input of URL; Download the web page; Parse textual content; Store the occurrence details
of words in database; Analyze the types of words in the web page; Classify the webpage.]
3.1.4 Structure Charts
[Figure: structure chart. Node labels: Web Page Classification; get user login credentials;
get URL of the web page; register user if not registered; read username and password;
validate user credentials; read new username, email and passwords; check for username
availability and duplicate entry; register new user; validate web page availability; parse
each word in the web page; store occurrence details of each word in database; analyze the
words and classify the web page.]
3.1.5 Data definition / Dictionary:
Table: User
o The User table stores the details of the users who can enter URLs to classify web pages.
o This table is consulted when a user logs in to the system.
o The table contains the fields user id, email and password.

Field     Type     Constraint   Description
User id   Varchar  Primary key  Unique identifier
Email     Varchar  Not null     Used for password recovery
Password  Varchar  Not null     Used for user's validation
Table: Words
o This table stores all the dictionary words in the database.
o The stored words are compared with the words present in the web page at the URL
entered by the user.
o Each word is a row of the value matrix, which holds the occurrences of the words in
each of the documents.
o The table contains only one field, word, which is itself unique.

Field  Type     Constraint   Description
word   Varchar  Primary key  Unique identifier
Table: Document
o This table stores the web page documents, which are the columns of the value matrix.
o The table contains the URL of each web page entered by the user and a unique id for
each document, used to refer to the value matrix.
o The table contains two fields: the document id and the document URL.

Field  Type     Constraint   Description
docId  Int      Primary key  Unique identifier
URL    Varchar  Not null     File's remote or local location
Table: Matrix Value
o This table represents the value matrix of words versus documents and keeps a count of
each word.
o It stores which words occur in each document and the frequency of each word in that
document.
o The matrix entries are updated each time a new word occurs or the same word occurs
again.
o Words are the rows, document ids are the columns, and each value counts the
occurrences.

Field  Type     Constraint   Description
docId  Int      Foreign key  Unique identifier for documents
Word   Varchar  Foreign key  Unique identifier for words
Value  Int      Null         Counter of occurrences of the word in each document
Table: Class
o This table stores the class to which each web page belongs.
o The table has three fields: class id, class name and document ID.

Field       Type     Constraint   Description
Class ID    Int      Primary key  Unique identifier for class
Class Name  Varchar  Not null     Name of the class
Doc ID      Int      Foreign key  Reference to the document
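
The five tables above could be created as in the following minimal sketch, which uses Python's
built-in sqlite3 module. The table names, column names, column sizes and the composite primary
key on the value matrix are assumptions made for illustration; the data dictionary does not
specify them.

```python
import sqlite3

# Hypothetical DDL mirroring the data dictionary. Names and sizes are
# illustrative assumptions, not the project's actual schema.
SCHEMA = """
CREATE TABLE users (
    user_id  VARCHAR(64)  PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    password VARCHAR(255) NOT NULL
);
CREATE TABLE words (
    word VARCHAR(64) PRIMARY KEY
);
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY,
    url    VARCHAR(2048) NOT NULL
);
CREATE TABLE matrix_value (
    doc_id INTEGER     REFERENCES documents(doc_id),
    word   VARCHAR(64) REFERENCES words(word),
    value  INTEGER,
    PRIMARY KEY (doc_id, word)  -- assumption: one cell per (document, word)
);
CREATE TABLE classes (
    class_id   INTEGER PRIMARY KEY,
    class_name VARCHAR(64) NOT NULL,
    doc_id     INTEGER REFERENCES documents(doc_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all five tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Storing the matrix as one row per (document, word) pair keeps it sparse: a word that never
occurs in a document simply has no row.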
3.1.6 Module Specification
3.1.6.1 User
Input:
o The User module verifies the user's login credentials before the system can be
used.
o In this module the user provides his or her details and logs in to the system.
o A user who does not have an account can register with the system.
o Registration collects the user's details: name, email id, and password.
Processing:
o The registration details are used to create a new user account for the user.
o The login details are used to allow the user to use the system.
Output:
o After the registration details are processed, a user account is created in the
database.
o After the login details are processed, the user is allowed to use the system in a
new session.
3.1.6.2 Parsing
Input:
o The user inputs a URL to the system to specify the location of the web page.
o The file specified by the URL may be either remote or local.
Processing:
o The URL given by the user is checked for availability. Further processing
is done only if the file is present.
o The file is fetched from the server and its textual content is
processed word by word.
o Each word of the file is stored in the database and is used as a row of the
value matrix.
o The URL of the web page is represented by a unique identifier, which is used
as a column of the value matrix.
Output:
o The value matrix is updated with every new document: new words are added to
the dictionary, and the frequency of occurrence of each word in the document is
updated.
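
The word-by-word processing described above can be sketched as follows, using only the Python
standard library. The class name TextExtractor, the function word_counts and the rule that a
word is a run of the letters a to z are assumptions made for illustration; the design does not
specify how words are tokenized.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_counts(html: str) -> Counter:
    """Return how many times each lowercase word occurs in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: a word is a maximal run of ASCII letters.
    return Counter(re.findall(r"[a-z]+", text))


sample = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Cricket score: cricket wins</p></body></html>"
)
counts = word_counts(sample)
print(counts["cricket"])
```

The sample page is passed in as a string so that the sketch runs without network access; in the
system itself the page would first be downloaded from the URL given by the user.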
3.1.6.3 Classification:
Input:
o The frequency count in value matrix for each word in the particular
document is used for further classification of documents.
Processing:
o The types of words in the document are analyzed and the document is then
classified accordingly.
o The classification details are stored in the database with the class type and
the ID of the document.
Output:
o The web page is assigned to its class.
3.2 Detailed design
3.2.1 Design Decisions
In the development of this project, the procedure-oriented approach has been used. The
procedural approach suits projects that execute step by step, as a series of steps performed
in order. Hence the procedural approach is taken to model this problem domain well.
3.2.2 Logic Design:
PDL for User
If the user is not registered
    Redirect to the registration page
    Collect the user details
    Check that all the data are in the correct form
    If not correct
        Generate an error message while submitting data from the user
    Else
        Store the data into the User table in the database
        Register the user, generate user-id and password
    End If
Else if the user is registered
    Read the user credentials
    Verify the user credentials
    If the user details are not valid
        Throw an error message
    Else
        Allow the user to login
    End If
End If
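
The registration and login logic above can be sketched as follows. The in-memory dictionary
stands in for the User table, and the function names, the email check and the salted password
hashing are assumptions made for illustration; the PDL only says that credentials are stored
and verified.

```python
import hashlib
import hmac
import os

# In-memory stand-in for the User table: user_id -> (email, salt, digest).
users = {}


def _digest(password: str, salt: bytes) -> bytes:
    # Assumption: passwords are stored as salted PBKDF2 digests, not plain text.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def register(user_id: str, email: str, password: str) -> None:
    """Validate the details, reject duplicates, then store the new user."""
    if not user_id or "@" not in email or not password:
        raise ValueError("invalid registration data")
    if user_id in users:
        raise ValueError("user id already taken")
    salt = os.urandom(16)
    users[user_id] = (email, salt, _digest(password, salt))


def login(user_id: str, password: str) -> bool:
    """Return True only if the user exists and the password matches."""
    record = users.get(user_id)
    if record is None:
        return False
    _, salt, digest = record
    return hmac.compare_digest(digest, _digest(password, salt))


register("alice", "alice@example.com", "correct horse")
print(login("alice", "correct horse"))
print(login("alice", "wrong password"))
```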
PDL for Parsing
Read the URL of the web page from the user
If the URL is not valid
    Throw an error message
Else
    Read the web page
    Process the textual content word by word
    Create a value matrix in the database, storing the words of the document
    as rows and the documents, represented by their unique IDs, as columns
    Update the frequency of occurrence of words in the value matrix
End If
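
The value-matrix update in the PDL above can be sketched as follows with Python's sqlite3
module. The table layout, the function names and the example URL are assumptions made for
illustration. Each occurrence of a word either increments an existing cell or creates a new
cell with the value 1.

```python
import sqlite3


def open_matrix_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical matrix tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (word TEXT PRIMARY KEY);
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, url TEXT NOT NULL);
        CREATE TABLE matrix_value (
            doc_id INTEGER REFERENCES documents(doc_id),
            word   TEXT    REFERENCES words(word),
            value  INTEGER,
            PRIMARY KEY (doc_id, word)
        );
    """)
    return conn


def add_document(conn: sqlite3.Connection, url: str) -> int:
    """Store the URL and return the document's unique id."""
    cur = conn.execute("INSERT INTO documents (url) VALUES (?)", (url,))
    return cur.lastrowid


def record_word(conn: sqlite3.Connection, doc_id: int, word: str) -> None:
    """Count one occurrence of word in the given document."""
    conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
    cur = conn.execute(
        "UPDATE matrix_value SET value = value + 1 WHERE doc_id = ? AND word = ?",
        (doc_id, word),
    )
    if cur.rowcount == 0:  # first occurrence of this word in this document
        conn.execute(
            "INSERT INTO matrix_value (doc_id, word, value) VALUES (?, ?, 1)",
            (doc_id, word),
        )


conn = open_matrix_db()
doc_id = add_document(conn, "http://example.com/sports.html")  # hypothetical URL
for w in ["cricket", "score", "cricket"]:
    record_word(conn, doc_id, w)
conn.commit()
rows = conn.execute(
    "SELECT word, value FROM matrix_value WHERE doc_id = ? ORDER BY word",
    (doc_id,),
).fetchall()
print(rows)
```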
PDL for Classification
Read the frequency of each word in each document
Check the type of each word
Classify the document by analyzing the types of words
Store the classification details in the database, updating the class ID
and class name using the document ID
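
The design does not state how the type of a word determines the class, so the following is only
one possible sketch. It assumes a hypothetical keyword list per class and picks the class whose
keywords occur most often in the document. The class names and keywords are invented for
illustration.

```python
# Hypothetical mapping from class name to the words that indicate it.
CLASS_KEYWORDS = {
    "sports": {"cricket", "score", "match", "team"},
    "technology": {"software", "computer", "network", "code"},
}


def classify(frequencies, class_keywords=CLASS_KEYWORDS):
    """Return the class whose keywords occur most often, or None if none occur.

    frequencies maps each word to its count in one document, that is, one
    column of the value matrix.
    """
    scores = {
        name: sum(frequencies.get(word, 0) for word in keywords)
        for name, keywords in class_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


freqs = {"cricket": 2, "score": 1, "computer": 1}
print(classify(freqs))
```

With these counts the sports keywords occur three times and the technology keywords once, so
the document is assigned to the sports class. A document containing none of the keywords is
left unclassified.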
