
UNIT I

INTRODUCTION TO DBMS - Syllabus


File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose
of Database System - Database System Terminologies - Database
characteristics - Data models - Types of data models - Components of
DBMS - Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule - Entity-Relationship model - Extended ER - Normalization -
Functional Dependencies, Anomaly - 1NF to 5NF - Domain Key Normal Form -
Denormalization.

Databases
Definition: "a collection of related data"

represents some aspect of the real world (universe of discourse) generally
relevant to an enterprise/company/organization

logically coherent

organized to reflect relationships among the data

persistent

mirrors the state of the company/organization/enterprise, an asset in its own
right

usually built for a specific purpose and for a set of users -- but a good
design should allow for uses that are unanticipated.

Data: Any raw information that needs to be stored in the database for persistence is
termed data.

Database Management System


A DBMS is a system of programs
Includes facilities to:

1. Define and modify the database structure


2. Construct the database on a storage medium
3. Manipulate the database: queries and updates
4. Maintain integrity with and security over the database
Meta-data (data about the database) is used to represent #1 and #2; the database administrator
supplies the meta-data
SQL is the most common language for #3.

Operations on the database are referred to as transactions.

The database administrator along with the DBMS itself covers #4.
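The four facilities listed above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in DBMS; the student table and its columns are hypothetical names chosen only for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# 1. Define the database structure (table and column names are hypothetical).
conn.execute(
    "CREATE TABLE student ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " credits INTEGER NOT NULL CHECK (credits >= 0))"
)

# 2. Construct the database by loading initial data.
conn.executemany(
    "INSERT INTO student (id, name, credits) VALUES (?, ?, ?)",
    [(1, "John", 30), (2, "Mary", 45)],
)

# 3. Manipulate the database: one update and one query.
conn.execute("UPDATE student SET credits = credits + 3 WHERE id = 1")
rows = conn.execute("SELECT name, credits FROM student ORDER BY id").fetchall()

# 4. Maintain integrity: the DBMS itself rejects data that violates a constraint.
try:
    conn.execute("INSERT INTO student (id, name, credits) VALUES (3, 'Bob', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here the CREATE TABLE statement is the meta-data for #1, and the CHECK clause is one small piece of #4 that the DBMS enforces without any help from application code.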

Database Approaches / History


Flat files

separate files, each with a tabular organization

historically used punched cards and/or tapes

magnetic disks

today CSV files (comma separated values; text files; spreadsheet files); data
mining input;

Hierarchical (tree organization of data )

earliest approach of integrated data, IBM

today return to this approach with XML (eXtensible Markup Language, text
files)

Network (linked lists, directed graphs)

Efficient storage and retrieval

Complex design and navigation

Developed by CODASYL (Conference on Data Systems Languages), which
brought us COBOL

today the network approach is found in object-oriented databases (OODB)

Relational database (primary approach today)

Tables (relations) of rows (tuples) and columns (attributes)

Tables and attributes are named

Relationships between tables are established by common values

Mathematically based on set theory

SQL is the workhorse query language and is often adapted for other
paradigms

Object oriented (OODB)

Embedded in Java or C++ (extension of OO)

Unifies object heap space in memory and secondary storage

Return to network approach

Modeling Design of databases


Attempt to model semantics of the database

entity-relationship (ER) modeling

extended ER (EER)

Unified Modeling Language (UML)

Characteristics of modern database systems


Main Characteristics of database approach:
1. Self-Description: A database system includes, in addition to the stored data that is of
relevance to the organization, a complete definition/description of the database's
structure and constraints. This meta-data (i.e., data about data) is stored in the so-called
system catalog, which contains a description of the structure of each file, the type and
storage format of each field, and the various constraints on the data (i.e., conditions that
the data must satisfy).
The system catalog is used not only by users (e.g., who need to know the names of tables
and attributes, and sometimes data type information and other things), but also by the
DBMS software, which certainly needs to "know" how the data is structured/organized in
order to interpret it in a manner consistent with that structure. Recall that a DBMS is
general purpose, as opposed to being a specific database application. Hence, the structure
of the data cannot be "hard-coded" in its programs (such as is the case in typical file
processing approaches), but rather must be treated as a "parameter" in some sense.
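As a rough illustration of self-description, the sketch below asks a database to describe itself. It assumes SQLite (via Python's sqlite3 module), whose catalog is exposed through the sqlite_master table and the table_info pragma; other systems expose theirs differently. The course table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT, max_enrollment INTEGER)"
)

# The catalog lists the tables the database contains...
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# ...and describes each column. Each row of table_info is
# (cid, name, type, notnull, dflt_value, pk); we keep name and type.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(course)")]
```

Both users and the DBMS software read the same description, which is the point made above.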
2. Insulation between Programs and Data; Data Abstraction:
Program-Data Independence: In traditional file processing, the structure of the data
files accessed by an application is "hard-coded" in its source code. (E.g., Consider a file
descriptor in a COBOL program: it gives a detailed description of the layout of the
records in a file by describing, for each field, how many bytes it occupies.)
If, for some reason, we decide to change the structure of the data (e.g., by adding the first
two digits to the YEAR field, in order to make the program Y2K compliant!), every
application in which a description of that file's structure is hard-coded must be changed!
In contrast, DBMS access programs, in most cases, do not require such changes, because
the structure of the data is described (in the system catalog) separately from the programs
that access it and those programs consult the catalog in order to ascertain the structure of
the data (i.e., providing a means by which to determine boundaries between records and
between fields within records) so that they interpret that data properly.
In other words, the DBMS provides a conceptual or logical view of the data to
application programs, so that the underlying implementation may be changed without the
programs being modified. (This is referred to as program-data independence.)
Also, which access paths (e.g., indexes) exist are listed in the catalog, helping the DBMS
to determine the most efficient way to search for items in response to a query.
Note: In fairness to COBOL, it should be pointed out that it has a COPY feature that allows
different application programs to make use of the same file descriptor stored in a
"library". This provides some degree of program-data independence, but not nearly as
much as a good DBMS does. End of note.
Example by which to illustrate this concept: Suppose that you are given the task of
developing a program that displays the contents of a particular data file. Specifically,
each record should be displayed as follows:
Record #i:
value of first field
value of second field
...
...
value of last field

To keep things very simple, suppose that the file in question has fixed-length records of
57 bytes with six fixed-length fields of lengths 12, 4, 17, 2, 15, and 7 bytes, respectively,
all of which are ASCII strings. Developing such a program would not be difficult.
However, the obvious solution would be tailored specifically for a file having the
particular structure described here and would be of no use for a file with a different
structure.
Now suppose that the problem is generalized to say that the program you are to develop
must be able to display any file having fixed-length records with fixed-length fields that
are ASCII strings. Impossible, you say? Well, yes, unless the program has the ability to
access a description of the file's record layout (i.e., lengths of its records and the fields
therein), in which case the problem is not hard at all. This illustrates the power of
metadata, i.e., data describing other data.
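The generalized display program described above might look like the following sketch. The function names are hypothetical; the only assumption is that the record layout (the list of field lengths) is supplied as data rather than written into the code.

```python
def split_record(record: bytes, field_lengths: list[int]) -> list[str]:
    """Cut one fixed-length record into fields using the layout metadata."""
    if len(record) != sum(field_lengths):
        raise ValueError("record length does not match the layout")
    fields, start = [], 0
    for length in field_lengths:
        fields.append(record[start:start + length].decode("ascii"))
        start += length
    return fields


def display_file(data: bytes, field_lengths: list[int]) -> str:
    """Format every complete record in `data` in the 'Record #i:' style."""
    record_length = sum(field_lengths)
    lines = []
    for i in range(len(data) // record_length):
        record = data[i * record_length:(i + 1) * record_length]
        lines.append(f"Record #{i + 1}:")
        lines.extend(split_record(record, field_lengths))
    return "\n".join(lines)


# The layout from the text: six fields, 57 bytes per record.
LAYOUT = [12, 4, 17, 2, 15, 7]
```

Passing a different list of lengths lets the same two functions display a file with a completely different structure, which is what "treating the structure as a parameter" amounts to.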
3. Multiple Views of Data: Different users (e.g., in different departments of an
organization) have different "views" or perspectives on the database. For example, from
the point of view of a Bursar's Office employee, student data does not include anything
about which courses were taken or which grades were earned. (This is an example of a
subset view.)
As another example, a Registrar's Office employee might think that GPA is a field of data
in each student's record. In reality, the underlying database might calculate that value
each time it is needed. This is called virtual (or derived) data.
A view designed for an academic advisor might give the appearance that the data is
structured to point out the prerequisites of each course.
A good DBMS has facilities for defining multiple views. This is not only convenient for
users, but also addresses security issues of data access. (E.g., The Registrar's Office view
should not provide any means to access financial data.)
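A minimal sketch of the two kinds of view just described, assuming SQLite through Python's sqlite3 module. The table names, column names, and the unweighted average used for GPA are simplifying assumptions, not a real registrar's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, balance_due REAL);
    CREATE TABLE grade (student_id INTEGER, course TEXT, points REAL);

    -- Subset view: financial data only, no courses or grades.
    CREATE VIEW bursar_view AS
        SELECT id, name, balance_due FROM student;

    -- Derived data: GPA is computed on each access, never stored.
    CREATE VIEW registrar_view AS
        SELECT s.id, s.name, AVG(g.points) AS gpa
        FROM student s JOIN grade g ON g.student_id = s.id
        GROUP BY s.id, s.name;

    INSERT INTO student VALUES (1, 'John', 250.0);
    INSERT INTO grade VALUES (1, 'CS101', 4.0), (1, 'MA201', 3.0);
""")

# Column names visible through each view.
bursar_columns = [d[0] for d in conn.execute("SELECT * FROM bursar_view").description]
registrar_columns = [d[0] for d in conn.execute("SELECT * FROM registrar_view").description]
gpa = conn.execute("SELECT gpa FROM registrar_view WHERE id = 1").fetchone()[0]
```

The registrar's view has a gpa column even though no table stores one, and it offers no path to balance_due.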
4. Data Sharing and Multi-user Transaction Processing: As you learned about (or will)
in the OS course, the simultaneous access of computer resources by multiple
users/processes is a major source of complexity. The same is true for multi-user DBMS's.
Arising from this is the need for concurrency control, which is supposed to ensure that
several users trying to update the same data do so in a "controlled" manner so that the
results of the updates are as though they were done in some sequential order (rather than
interleaved, which could result in data being incorrect).
This gives rise to the concept of a transaction, which is a process that makes one or more
accesses to a database and which must have the appearance of executing in isolation from
all other transactions (even ones that access the same data at the "same time") and of
being atomic (in the sense that, if the system crashes in the middle of its execution, the
database contents must be as though it did not execute at all).
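The atomicity half of this definition can be sketched as follows, again assuming SQLite via Python's sqlite3 module. The account table and the transfer function are hypothetical, and a single connection cannot demonstrate isolation between concurrent users, so only the all-or-nothing behaviour is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move `amount` between accounts as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account"))
```

If the second UPDATE would overdraw the source account, the credit already applied to the target is undone as well, so the database looks as though the transfer never started.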

Applications such as airline reservation systems are known as online transaction
processing applications.

Capabilities/Advantages of DBMS's
1. Controlling Redundancy: Data redundancy (such as tends to occur in the
"file processing" approach) leads to wasted storage space, duplication of
effort (when multiple copies of a datum need to be updated), and a higher
likelihood of the introduction of inconsistency.

On the other hand, redundancy can be used to improve performance of queries. Indexes,
for example, are entirely redundant, but help the DBMS in processing queries more
quickly.
Another example of using redundancy to improve performance is to store an "extra" field
in order to avoid the need to access other tables (as when doing a JOIN, for example).
See Figure 1.6 (page 18): the StudentName and CourseNumber fields need not be there.
A DBMS should provide the capability to automatically enforce the rule that no
inconsistencies are introduced when data is updated. (Figure 1.6 again, in which
Student_name does not match Student_number.)
2. Restricting Unauthorized Access: A DBMS should provide a security and
authorization subsystem, which is used for specifying restrictions on user
accounts. Common kinds of restrictions are to allow read-only access (no
updating), or access only to a subset of the data (e.g., recall the Bursar's and
Registrar's office examples from above).
3. Providing Persistent Storage for Program Objects: Object-oriented
database systems make it easier for complex runtime objects (e.g., lists,
trees) to be saved in secondary storage so as to survive beyond program
termination and to be retrievable at a later time.
4. Providing Storage Structures for Efficient Query Processing: The
DBMS maintains indexes (typically in the form of trees and/or hash tables)
that are utilized to improve the execution time of queries and updates. (The
choice of which indexes to create and maintain is part of physical database
design and tuning and is the responsibility of the DBA.)

The query processing and optimization module is responsible for choosing an efficient
query execution plan for each query submitted to the system.

5. Providing Backup and Recovery: The subsystem having this responsibility
ensures that recovery is possible in the case of a system crash during
execution of one or more transactions.
6. Providing Multiple User Interfaces: For example, query languages for
casual users, programming language interfaces for application programmers,
forms and/or command codes for parametric users, menu-driven interfaces
for stand-alone users.
7. Representing Complex Relationships Among Data: A DBMS should have
the capability to represent such relationships and to retrieve related data
quickly.
8. Enforcing Integrity Constraints: Most database applications are such that
the semantics (i.e., meaning) of the data require that it satisfy certain
restrictions in order to make sense. Perhaps the most fundamental constraint
on a data item is its data type, which specifies the universe of values from
which its value may be drawn. (E.g., a Grade field could be defined to be of
type Grade_Type, which, say, we have defined as including precisely the
values in the set { "A", "A-", "B+", ..., "F" }.)

Another kind of constraint is referential integrity, which says that if the database includes
an entity that refers to another one, the latter entity must exist in the database. For
example, if (R56547, CIL102) is a tuple in the Enrolled_In relation, indicating that a
student with ID R56547 is taking a course with ID CIL102, there must be tuples in the
Student and Course relations, respectively, that describe a student and a course with
those ID's.
9. Permitting Inferencing and Actions Via Rules: In a deductive database
system, one may specify declarative rules that allow the database to infer
new data! E.g., Figure out which students are on academic probation. Such
capabilities would take the place of application programs that would be used
to ascertain such information otherwise.
Active database systems go one step further by allowing "active rules" that
can be used to initiate actions automatically.
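Returning to item 8, the sketch below shows both kinds of constraint being enforced by the DBMS rather than by application code. It assumes SQLite via Python's sqlite3 module; the lowercase table names follow the Student, Course and Enrolled_In example, the course CIL103 is made up, and the grade list in the CHECK clause is an abbreviated, assumed subset of the full grade set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE student (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE course (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE enrolled_in (
        student_id TEXT REFERENCES student(id),
        course_id TEXT REFERENCES course(id),
        -- Domain constraint; an illustrative subset of grades, not the full list.
        grade TEXT CHECK (grade IN ('A', 'A-', 'B+', 'B', 'F')),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES ('R56547', 'John');
    INSERT INTO course VALUES ('CIL102', 'Course one'), ('CIL103', 'Course two');
""")


def try_insert(row):
    """Return True if the DBMS accepts the enrollment row, False if it rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO enrolled_in VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert(("R56547", "CIL102", "A-"))       # both referenced tuples exist
dangling = try_insert(("R99999", "CIL102", "A"))  # no such student: referential integrity
bad_grade = try_insert(("R56547", "CIL103", "Z"))  # 'Z' is outside the grade domain
enrolled = conn.execute("SELECT COUNT(*) FROM enrolled_in").fetchone()[0]
```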

Database Users and their responsibilities


Database Administrators (DBA)

oversee design

manage resources and other users

authorization/security control to database

coordinating and monitoring its use

acquiring software resources and hardware resources as needed

the DBA is also accountable for problems such as breach of security or poor
system response time

Database Designers

specify the structure of data that will be stored in the database

identify the data to be stored


Systems analysts -- specify the system using input from the customer; provide a complete description
of functionality from the customers' and users' point of view

Applications programmers -- implement application programs (transactions) that access data
and support enterprise rules

Project managers

System administrator -- maintains the transaction processing system: monitors interconnection of
HW and SW modules, deals with failures and congestion

End Users: These are persons who access the database for querying, updating, and report
generation. They are the main reason for the database's existence!

Casual end users: use database occasionally, needing different information each time;
use query language to specify their requests; typically middle- or high-level managers.

Naive/Parametric end users: Typically the biggest group of users; frequently
query/update the database using standard canned transactions that have been carefully
programmed and tested in advance. Examples:
o bank tellers check account balances, post withdrawals/deposits
o reservation clerks for airlines, hotels, etc., check availability of seats/rooms and
make reservations.

o shipping clerks (e.g., at UPS) who use buttons, bar code scanners, etc., to update
status of in-transit packages.

Sophisticated end users: engineers, scientists, business analysts who implement their
own applications to meet their complex needs.

Stand-alone users: Use "personal" databases, possibly employing a special-purpose
(e.g., financial) software package.

Workers Behind the Scene

DBMS system designers/implementors: provide the DBMS software that
is at the foundation of all this!

tool developers: design and implement software tools facilitating database
system design, performance monitoring, creation of graphical user interfaces,
prototyping, etc.

operators and maintenance personnel: responsible for the day-to-day
operation of the system.

Three Level Database Architecture


Data and Related Structures
Data are actually stored as bits, or numbers and strings, but it is difficult to work with data at this
level. It is necessary to view data at different levels of abstraction.
Schema:

Description of data at some level. Each level has its own schema.

We will be concerned with three forms of schemas:

physical,

conceptual, and

external.

Physical Data Level


The physical schema describes details of how data is stored: files, indices, etc. on the random
access disk system. It also typically describes the record layout of files and type of files (hash, B-tree, flat).
Early applications worked at this level - explicitly dealt with details. E.g., minimizing physical
distances between related data and organizing the data structures within the file (blocked records,
linked lists of blocks, etc.)
Problem:

Routines are hardcoded to deal with physical representation.

Changes to data structures are difficult to make.

Application code becomes complex since it must deal with details.

Rapid implementation of new features very difficult.

Conceptual Data Level


Also referred to as the Logical level. Hides details of the physical level.

In the relational model, the conceptual schema presents data as a set of tables.

The DBMS maps data access between the conceptual and physical schemas automatically.

Physical schema can be changed without changing application:

DBMS must change mapping from conceptual to physical.

Referred to as physical data independence.
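Physical data independence can be observed in a small experiment. The sketch assumes SQLite via Python's sqlite3 module, and the table, index name, and data are hypothetical. The application's query text is identical before and after the physical schema changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "John", "CS"), (2, "Mary", "Math"), (3, "Bob", "CS")],
)

# The "application": written once against the conceptual schema.
QUERY = "SELECT name FROM student WHERE major = ? ORDER BY name"

before = conn.execute(QUERY, ("CS",)).fetchall()

# Change the physical schema only: add an access path on `major`.
conn.execute("CREATE INDEX idx_student_major ON student (major)")

after = conn.execute(QUERY, ("CS",)).fetchall()
```

Only the DBMS's internal mapping changed; the query and its answer did not.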

External Data Level


In the relational model, the external schema also presents data as a set of relations. An external
schema specifies a view of the data in terms of the conceptual level. It is tailored to the needs of
a particular category of users. Portions of the stored data should not be seen by some users; hiding
them begins to implement a level of security and simplifies the view for these users.
Examples:

Students should not see faculty salaries.

Faculty should not see billing or payment data.

Information that can be derived from stored data might be viewed as if it were stored.

GPA not stored, calculated when needed.

Applications are written in terms of an external schema. The external view is computed when
accessed. It is not stored. Different external schemas can be provided to different categories of
users. Translation from external level to conceptual level is done automatically by DBMS at run
time. The conceptual schema can be changed without changing application:

Mapping from external to conceptual must be changed.

Referred to as conceptual data independence.

Data Independence
Logical data independence

Immunity of external models to changes in the logical model

Occurs at user interface level

Physical data independence

Immunity of logical model to changes in internal model

Occurs at logical interface level

Database Models
A database model is a theory or specification describing how a database is
structured and used. Several such models have been suggested.
The common models include

Network Model - Any links supporting quick access.

Hierarchical Model - Links but no cycles (hierarchy).

Relational Model - Data Independence.

Object Oriented Model - Entity Abstraction.

Network Model
The popularity of the network data model coincided with the popularity of the
hierarchical data model. Some data were more naturally modeled with more than
one parent per child. So, the network model permitted the modeling of many-to-many relationships in data. In 1971, the Conference on Data Systems Languages
(CODASYL) formally defined the network model. The basic data modeling construct
in the network model is the set construct. A set consists of an owner record type, a
set name, and a member record type. A member record type can have that role in
more than one set, hence the multiparent concept is supported. An owner record
type can also be a member or owner in another set. The data model is a simple
network, and link and intersection record types (called junction records by IDMS)
may exist, as well as sets between them. Thus, the complete network of
relationships is represented by several pairwise sets; in each set some (one) record
type is owner (at the tail of the network arrow) and one or more record types are
members (at the head of the relationship arrow). Usually, a set defines a 1:M
relationship, although 1:1 is permitted. The CODASYL network model is based on
mathematical set theory.

Hierarchical Model
The hierarchical data model organizes data in a tree structure. There is a hierarchy
of parent and child data segments. This structure implies that a record can have
repeating information, generally in the child data segments. Data is stored as a series of
records, each of which has a set of field values attached to it. The model collects all the instances of
a specific record together as a record type. These record types are the equivalent of
tables in the relational model, with the individual records being the equivalent
of rows. To create links between these record types, the hierarchical model uses
Parent Child Relationships. These are a 1:N mapping between record types. This is
done by using trees, "borrowed" from maths just as the relational model borrows
set theory. For example, an organization might store information about an employee,
such as name, employee number, department, salary. The organization might also
store information about an employee's children, such as name and date of birth.
The employee and children data form a hierarchy, where the employee data
represents the parent segment and the children data represents the child segment.
If an employee has three children, then there would be three child segments
associated with one employee segment. In a hierarchical database the parent-child
relationship is one to many. This restricts a child segment to having only one parent
segment. Hierarchical DBMSs were popular from the late 1960s, with the
introduction of IBM's Information Management System (IMS) DBMS, through the
1970s.

Relational Model
(RDBMS - relational database management system) A database based on the
relational model developed by E.F. Codd. A relational database allows the definition
of data structures, storage and retrieval operations and integrity constraints. In such
a database the data and relations between them are organised in tables. A table is a
collection of records and each record in a table contains the same fields.
Properties of Relational Tables:
# Values Are Atomic
# Each Row is Unique
# Column Values Are of the Same Kind
# The Sequence of Columns is Insignificant
# The Sequence of Rows is Insignificant
# Each Column Has a Unique Name
Certain fields may be designated as keys, which means that searches for specific
values of that field will use indexing to speed them up. Where fields in two different
tables take values from the same set, a join operation can be performed to select
related records in the two tables by matching values in those fields. Often, but not
always, the fields will have the same name in both tables. For example, an "orders"
table might contain (customer-ID, product-code) pairs and a "products" table might
contain (product-code, price) pairs so to calculate a given customer's bill you would
sum the prices of all products ordered by that customer by joining on the product-code fields of the two tables. This can be extended to joining multiple tables on
multiple fields. Because these relationships are only specified at retrieval time,
relational databases are classed as dynamic database management systems. The
relational database model is based on relational algebra.
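The customer-bill example above can be written out as a short sketch. It assumes SQLite via Python's sqlite3 module; the customer IDs, product codes, and prices are invented sample data, and each row of orders is assumed to stand for one unit ordered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id TEXT, product_code TEXT);
    CREATE TABLE products (product_code TEXT PRIMARY KEY, price REAL);
    INSERT INTO products VALUES ('P1', 10.0), ('P2', 2.5);
    INSERT INTO orders VALUES ('C1', 'P1'), ('C1', 'P2'), ('C1', 'P2'), ('C2', 'P1');
""")


def bill(customer_id):
    """Sum the prices of everything one customer ordered."""
    row = conn.execute(
        "SELECT COALESCE(SUM(p.price), 0) "
        "FROM orders o JOIN products p ON p.product_code = o.product_code "
        "WHERE o.customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]
```

The relationship between the two tables exists only in the JOIN condition of the query, which is what "specified at retrieval time" means here.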
Object-Oriented Model

Uses E-R modeling as a basis, extended to include encapsulation and inheritance

Objects have both state and behavior

State is defined by attributes

Behavior is defined by methods (functions or procedures)

Designer defines classes with attributes, methods, and relationships


Class constructor method creates object instances

Each object has a unique object ID

Classes related by class hierarchies

Database objects have persistence

Both conceptual-level and logical-level model

The Entity-Relationship Model

Database Design
Goal of design is to generate a formal specification of the database schema
Methodology:
1. Use E-R model to get a high-level graphical view of essential components of
enterprise and how they are related
2. Then convert E-R diagram to SQL DDL, or whatever database model you are
using

E-R Model is not SQL based. It's not limited to any particular DBMS. It is a conceptual and
semantic model that captures meaning rather than an actual implementation.
The E-R Model: The enterprise is viewed as a set of

Entities

Relationships among entities

Symbols used in E-R Diagram

Entity rectangle

Attribute oval

Relationship diamond

Link line

Entities and Attributes

Entity: an object that is involved in the enterprise and that can be distinguished from
other objects. (An entity is an instance, so it is not shown in the E-R diagram.)

Can be person, place, event, object, concept in the real world

Can be physical object or abstraction

Ex: "John", "CSE305"

Entity Type: set of similar objects or a category of entities; they are well defined

A rectangle represents an entity set

Ex: students, courses

We often just say "entity" and mean "entity type"

Attribute: describes one aspect of an entity type; usually [and best when] single valued and
indivisible (atomic)

Represented by oval on E-R diagram

Ex: name, maximum enrollment

May be multi-valued -- use double oval on E-R diagram

May be composite -- attribute has further structure; also use oval for
composite attribute, with ovals for components connected to it by lines

May be derived -- a virtual attribute, one that is computable from existing
data in the database; use dashed oval. This helps reduce redundancy

Entity Types
An entity type is named and is described by set of attributes

Student: Id, Name, Address, Hobbies

Domain: possible values of an attribute.

Note that the value for an attribute can be a set or list of values, sometimes
called "multi-valued" attributes

This is in contrast to the pure relational model which requires atomic values

E.g., (111111, John, 123 Main St, (stamps, coins))

Key: subset of attributes that uniquely identifies an entity (candidate key)


Entity Schema:

The meta-information of entity type name, attributes (and associated domain), key constraints
Entity Types tend to correspond to nouns; attributes are also nouns albeit descriptions of the
parts of entities
May have null values for some entity attribute instances -- no mapping to domain for those
instances

Keys
Superkey: an attribute or set of attributes that uniquely identifies an entity--there can be many of
these
Composite key: a key requiring more than one attribute
Candidate key: a superkey such that no proper subset of its attributes is also a superkey
(minimal superkey has no unnecessary attributes)
Primary key: the candidate key chosen to be used for identifying entities and accessing records.
Unless otherwise noted "key" means "primary key"
Alternate key: a candidate key not used for primary key
Secondary key: attribute or set of attributes commonly used for accessing records, but not
necessarily unique

Foreign key: term used in relational databases (but not in the E-R model) for an attribute that
is the primary key of another table and is used to establish a relationship with that table where it
appears as an attribute also.
So a foreign key value occurs in the table and again in the other table. This conflicts with the
idea that a value is stored only once; the idea that a fact is stored once is not undermined

Rectangle -- Entity
Ellipses -- Attribute (underlined attributes are [part of] the primary key)
Double ellipses -- multi-valued attribute
Dashed ellipses-- derived attribute, e.g. age is derivable from birthdate and current date.
[Drawing notes: keep all attributes above the entity. Lines have no arrows. Use straight lines
only]

Graphical Representation in E-R diagram

Relationships
Relationship: connects two or more entities into an association/relationship

"John" majors in "Computer Science"

Relationship Type: set of similar relationships

Student (entity type) is related to Department (entity type) by MajorsIn
(relationship type).

Relationship Types may also have attributes in the E-R model. When they are mapped to the
relational model, the attributes become part of the relation. Represented by a diamond on E-R
diagram.
Relationship types can have descriptive attributes like entity sets
Relationships tend to be verbs or verb phrases; attributes of relationships are again nouns

Attributes and Roles


An attribute of a relationship type adds additional information to the relationship

e.g., "John" majors in "CS" since 2000

John and CS are related

2000 describes the relationship - it's the value of the since attribute of
MajorsIn relationship type

The role of a relationship type names one of the related entities. The name of the entity is usually
the role name.
e.g., "John" is value of Student role, "CS" value of Department role of MajorsIn
relationship type
(John, CS, 2000) describes a relationship
Problem: relationships can relate elements of same entity type
e.g., ReportsTo relationship type relates two elements of Employee entity type:

Bob reports to Mary since 2000

We do not have distinct names for the roles. It is not clear who reports to whom.

Solution: the role name of relationship type need not be same as name of entity type from which
participants are drawn

ReportsTo has roles Subordinate and Supervisor and attribute Since

Values of Subordinate and Supervisor both drawn from entity type Employee

Optional to name role of each entity-relationship, but helpful in cases of

Recursive relationship -- entity set relates to itself

Multiple relationships between same entity sets

Roles are edges labeled with role names (omitted if role name = name of entity set). Most
attributes have been omitted.

Degree of relationship
The number of roles in the relationship

Binary -- links two entity sets; set of ordered pairs (most common)

Ternary -- links three entity sets; ordered triples (rare). If a relationship exists among the
three entities, all three must be present

N-ary -- links n entity sets; ordered n-tuples (very rare). If a relationship exists among the
entities, then all must be present. Cannot represent subsets.

Note: ternary relationships may sometimes be replaced by two binary relationships, but a ternary
relationship and two binary ones are not necessarily semantically equivalent.

Cardinality of Relationships
Cardinality is the number of entity instances to which another entity set can map under the
relationship. This does not reflect a requirement that an entity has to participate in a relationship.
Participation is another concept.
One-to-one: X-Y is 1:1 when each entity in X is associated with at most one entity in Y, and
each entity in Y is associated with at most one entity in X.
One-to-many: X-Y is 1:M when each entity in X can be associated with many entities in Y, but
each entity in Y is associated with at most one entity in X.

Many-to-many: X-Y is M:M if each entity in X can be associated with many entities in Y, and
each entity in Y can be associated with many entities in X ("many" means one or more, and sometimes
zero)
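The three cases above can be told apart mechanically for a given set of relationship instances, as in this sketch. The function name is hypothetical and the sample pairs in the usage note are invented. A cardinality is a constraint on the relationship type, so the function only reports what a particular sample exhibits.

```python
from collections import defaultdict


def cardinality(pairs):
    """Classify a binary relationship instance, given as (x, y) pairs."""
    ys_per_x, xs_per_y = defaultdict(set), defaultdict(set)
    for x, y in pairs:
        ys_per_x[x].add(y)
        xs_per_y[y].add(x)
    x_many = any(len(ys) > 1 for ys in ys_per_x.values())  # some x maps to many y
    y_many = any(len(xs) > 1 for xs in xs_per_y.values())  # some y maps to many x
    if x_many and y_many:
        return "M:M"
    if x_many:
        return "1:M"
    if y_many:
        return "M:1"
    return "1:1"
```

For example, department-to-student pairs such as ("CS", "John") and ("CS", "Mary") classify as 1:M, while student-to-course enrollment pairs in which students share courses classify as M:M.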

Relationship Participation Constraints

Total participation

Every member of entity set must participate in the relationship

Represented by double
line from entity rectangle to relationship diamond

E.g., A Class entity cannot exist unless related to a Faculty member entity in
this example, not necessarily at Juniata.

In a relational model we will use the references clause.

Key constraint

If every entity participates in exactly one relationship, both a total
participation and a key constraint hold

E.g., if a class is taught by only one faculty member.

Partial participation

Not every entity instance must participate

Represented by single line from entity rectangle to relationship diamond

E.g., A Textbook entity can exist without being related to a Class or vice
versa.
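The participation and key constraints above can be expressed as simple checks over relationship instances. This is a sketch with hypothetical function names and invented class and faculty names; the pairs are (class, faculty member) instances of a "taught by" relationship.

```python
def total_participation(entities, pairs):
    """True if every entity appears in at least one relationship instance."""
    participating = {entity for entity, _ in pairs}
    return set(entities) <= participating


def key_constraint(pairs):
    """True if each entity appears in at most one relationship instance."""
    seen = set()
    for entity, _ in pairs:
        if entity in seen:
            return False
        seen.add(entity)
    return True


CLASSES = ["CS101", "CS305", "MA201"]
TAUGHT_BY = [("CS101", "Smith"), ("CS305", "Jones"), ("MA201", "Smith")]
```

In this sample every class has exactly one faculty member, so both checks pass for CLASSES. If a class were dropped from TAUGHT_BY, its participation would be only partial.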

Existence Dependency and Weak Entities

Existence dependency: Entity Y is existence dependent on entity X if each instance of Y must
have a corresponding instance of X
In that case, Y must have total participation in its relationship with X
If Y does not have its own candidate key, Y is called a weak entity, and X is its strong entity
Weak entity may have a partial key, called a discriminator, that distinguishes instances of the
weak entity that are related to the same strong entity
Use double rectangle for weak entity, with double diamond for relationship connecting it to its
associated strong entity
Note: not all existence dependent entities are weak -- the lack of a key is essential to definition

Schema of a Relationship Type
Contains the following features:
Role names, Ri, and their corresponding entity sets. Roles must be single valued (the number of
roles is called the degree of the relationship type)
Attribute names, Aj, and their corresponding domains. Attributes in the E-R model may be set or
multi-valued.
Key: Minimum set of roles and attributes that uniquely identify a relationship
Relationship: <e1, ..., en; a1, ..., ak>

ei is an entity, a value from Ri's entity set

aj is a set of attribute values with elements from the domain of Aj

Example ER diagram

Mapping the ER Model to Relational DBs

Database Design
Goal of design is to generate a formal specification of the database schema
Methodology:
1. Use E-R model to get a high-level graphical view of essential components of
enterprise and how they are related
2. Then convert E-R diagram to SQL Data Definition Language (DDL), or
whatever database model you are using

E-R Model is not SQL based.

The E-R Model: The database is viewed as a graphical drawing of

Entities and attributes

Relationships among those entities

--not tables!

Relational Model: The database is viewed as a set of

Tables

and their attributes (keys)

--we could include constraints but will not at this stage.

Representation of Entity Type in Relational Model

Mapping #1: Each entity type always corresponds to a relation

---> Person(....)
Mapping #2: The attributes of a relation contain at least the simple attributes of an entity
type

Attributes are single valued

There may be additional attributes (foreign keys)

Persons(SSN, FirstName, LastName, Address, Birthdate)


Problem: Recall that the entity type can have multi-valued attributes.
Possible solution: Use several rows to represent a single entity

(111111, John, 123 Main St, stamps)

(111111, John, 123 Main St, coins)

Problems with this solution:

Redundancy of the other attributes (never good)

Key of entity type no longer can be key of relation

So the resulting relation must be further transformed. Normalization is the process we will
study to help deal with this; it would result in:
Persons(SSN, FirstName, LastName, Address, Birthdate)
Hobbies(SSN, Hobby)
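
A minimal sketch of the two-relation result above, using Python's built-in sqlite3 module with an in-memory database. The column types and the helper names build_person_db and hobbies_of are assumptions made for illustration; only the relation and attribute names come from the notes.

```python
import sqlite3


def build_person_db():
    """Create Persons and Hobbies and load the John example (types are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Persons (
            SSN       INTEGER PRIMARY KEY,
            FirstName TEXT,
            LastName  TEXT,
            Address   TEXT,
            Birthdate TEXT
        );
        CREATE TABLE Hobbies (
            SSN   INTEGER REFERENCES Persons (SSN),
            Hobby TEXT,
            PRIMARY KEY (SSN, Hobby)
        );
    """)
    # The person is stored once; attributes not given in the notes stay NULL.
    conn.execute(
        "INSERT INTO Persons (SSN, FirstName, Address) VALUES (?, ?, ?)",
        (111111, "John", "123 Main St"),
    )
    # Each hobby is one row keyed by (SSN, Hobby).
    conn.executemany(
        "INSERT INTO Hobbies VALUES (?, ?)",
        [(111111, "stamps"), (111111, "coins")],
    )
    return conn


def hobbies_of(conn, ssn):
    """Return the hobbies recorded for one person, sorted by name."""
    rows = conn.execute(
        "SELECT Hobby FROM Hobbies WHERE SSN = ? ORDER BY Hobby", (ssn,)
    )
    return [row[0] for row in rows]
```

With this layout SSN alone remains the key of Persons, and the name and address are no longer repeated for every hobby.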

Relationship mapping
Relationship: connects two or more entities into an association/relationship

John majors in Computer Science

Relationship Type: set of similar relationships

Student (entity type) related to Department (entity type) by MajorsIn
(relationship type).

Distinction

relation (relational model) - set of tuples

relationship (E-R Model) - describes relationship between entities of an
enterprise

Entity types and most relationship types in the E-R model are mapped to relations (relational
model)

Mapping #3: 1-1 and 1-many relationships between separate
entities need not be mapped to a relation; the primary key
attributes of the "1" relation become foreign key attributes of the
"many" relation

If no "Since" attribute, the relations could be (with some appropriate attribute renaming and
additions)
Students(StudId, Name, Dept)
Departments(Dept, Chair)
Relationship Types may also have attributes in the E-R model.

Mapping #4: Any attributes of the 1-1 or 1-many relationship may be
attached to the "many" relation.

Students(StudId, Name, Dept, Since)
Departments(Dept, Chair)

Mapping #5: many-many relationships are always mapped to a
separate relation

Textbooks(ISBN, Title, Author, Copyright, Edition, Price)
Class(ClassNo, Name, Room, Days, Time)
TextUses(ISBN, ClassNo)

Mapping #6: The attributes of many-many relationships become part
of the relationship type relation, as well as the primary key
attributes of the related entity types

TextUses(ISBN, ClassNo, Optional)

Projects(ProjId, Name, TotalCost, StartDate)
Parts(UPC, PartName, Weight, WSPrice)
Suppliers(SupId, Name, Address)
Sold(ProjId, UPC, SupId, Date, Price)
Relationships tend to be verbs; attributes of relationships are nouns or adverbs
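
The many-many mappings (#5 and #6) can be sketched as follows with sqlite3. The sample ISBNs, titles, class numbers, and the helper names are hypothetical, and only a few of the attributes listed above are kept so the example stays short.

```python
import sqlite3


def build_textuses_db():
    """Textbooks and Class are entity relations; TextUses is the M:M relationship."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Textbooks (ISBN TEXT PRIMARY KEY, Title TEXT);
        CREATE TABLE Class (ClassNo TEXT PRIMARY KEY, Name TEXT);
        CREATE TABLE TextUses (
            ISBN     TEXT REFERENCES Textbooks (ISBN),
            ClassNo  TEXT REFERENCES Class (ClassNo),
            Optional INTEGER,              -- attribute of the relationship (Mapping #6)
            PRIMARY KEY (ISBN, ClassNo)    -- keys of both related entity types
        );
        INSERT INTO Textbooks VALUES ('isbn-1', 'Book One');
        INSERT INTO Textbooks VALUES ('isbn-2', 'Book Two');
        INSERT INTO Class VALUES ('C1', 'Databases');
        INSERT INTO Class VALUES ('C2', 'Algorithms');
        INSERT INTO TextUses VALUES ('isbn-1', 'C1', 0);
        INSERT INTO TextUses VALUES ('isbn-1', 'C2', 1);
        INSERT INTO TextUses VALUES ('isbn-2', 'C1', 1);
    """)
    return conn


def titles_for_class(conn, class_no):
    """Follow the relationship relation from a class to its textbooks."""
    rows = conn.execute(
        """SELECT t.Title
           FROM TextUses AS u JOIN Textbooks AS t ON u.ISBN = t.ISBN
           WHERE u.ClassNo = ?
           ORDER BY t.Title""",
        (class_no,),
    )
    return [row[0] for row in rows]
```

One textbook can appear in several classes and one class can use several textbooks, which is why neither entity relation can carry the other's key as a single foreign key column.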

Roles
Problem: recursive relationships can relate elements of same entity type
e.g., the ReportsTo relationship type relates two elements of the Employee entity type:

Bob reports to Mary since 2000

We do not always have distinct names for the roles

It is not clear who reports to whom
Solution: the role name of a relationship type need not be the same as the name of the entity type from which
participants are drawn

ReportsTo has roles Subordinate and Supervisor and attribute Since

Values of Subordinate and Supervisor are both drawn from entity type
Employee

Mapping #7: If the cardinality of a recursive relationship is 1-many or 1-1, then a second
attribute of the same domain as the key may be added to the entity relation to establish the
relationship. Attributes of the relationship can also be added to the entity relation, but there may
be a good reason to create a separate relation with the attributes and keys of the entities.

Employees(EmpID, Name, Address, Salary, SupervisorID)
Persons(PID, Name, Address, SpouseID, Mdate)

Mapping #8: For many-many recursive relationships, create a relation that includes the
attributes of the relationship, with the primary key of the entity included twice, once for
each role.
Assume multiple marriages are now recorded, thus many-to-many
MarriedTo(HusbandID, WifeID, MarDate, DivDate)
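
As an illustration of Mapping #7, the sketch below stores the ReportsTo relationship as a SupervisorID column in Employees and resolves the two roles with a self-join. The EmpID values and the function names are assumptions; Bob and Mary come from the example above, and the Address and Salary columns are omitted for brevity.

```python
import sqlite3


def build_employees():
    """Employees with a self-referencing foreign key for the Supervisor role."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employees (
            EmpID        INTEGER PRIMARY KEY,
            Name         TEXT,
            SupervisorID INTEGER REFERENCES Employees (EmpID)  -- same domain as the key
        );
        INSERT INTO Employees VALUES (1, 'Mary', NULL);
        INSERT INTO Employees VALUES (2, 'Bob', 1);
    """)
    return conn


def reports_to(conn):
    """Return (subordinate, supervisor) name pairs via a self-join."""
    sql = """SELECT sub.Name, sup.Name
             FROM Employees AS sub
             JOIN Employees AS sup ON sub.SupervisorID = sup.EmpID
             ORDER BY sub.Name"""
    return conn.execute(sql).fetchall()
```

The table aliases sub and sup play the part of the role names Subordinate and Supervisor, so the direction of the relationship is no longer ambiguous.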

Examples

S2000Courses (CrsCode, SectNo, Enroll)
Professor (Id, DeptId, Name)
Teaching (CrsCode, SecNo, Id, RoomNo)

Real SQL code


CREATE TABLE WorksIn (
    Since DATE,            -- attribute
    Status CHAR (10),      -- attribute
    ProfId INTEGER,        -- role (key of Professor)
    DeptId CHAR (4),       -- role (key of Department)
    PRIMARY KEY (ProfId),  -- since a professor works in at most one department
    FOREIGN KEY (ProfId) REFERENCES Professor (Id),
    FOREIGN KEY (DeptId) REFERENCES Department
);

CREATE TABLE Sold (
    Price INTEGER,         -- attribute
    Date DATE,             -- attribute
    ProjId INTEGER,        -- role
    SupplierId INTEGER,    -- role
    PartNumber INTEGER,    -- role
    PRIMARY KEY (ProjId, SupplierId, PartNumber, Date),
    FOREIGN KEY (ProjId) REFERENCES Project (Id),
    FOREIGN KEY (SupplierId) REFERENCES Supplier (Id),
    FOREIGN KEY (PartNumber) REFERENCES Part (Number)
);

The Relational Data Model


History of Relational Model

1970 paper by E.F. Codd, "A Relational Model of Data for Large Shared Data
Banks," proposed relational model

System R, prototype developed at IBM Research Lab at San Jose, California,
late 1970s

Peterlee Test Vehicle, IBM UK Scientific Lab

INGRES, University of California at Berkeley, in Unix

System R results used in developing DB2 from IBM and also Oracle

Early microcomputer based DBMSs were relational - dBase, R:base, Paradox

Microsoft's Access, now most popular microcomputer-based DBMS, is relational

Oracle, DB2, Informix, Sybase, Microsoft's SQL Server, MySQL, PostgreSQL - most popular
enterprise DBMSs, all relational

Advantages of Relational Model

Based on mathematical notion of relation

Can use power of mathematical abstraction

Can develop body of results using theorem and proof method of
mathematics; results then apply to many different applications

Can use expressive, exact mathematical notation

Theory provides tools for improving design

Basic structure is simple, easy to understand

Separates logical from physical level

Data operations easy to express, using a few powerful commands

Operations do not require user to know storage structures used

Data Structures
Relations are represented abstractly as tables

Tables are related to one another

Table holds information about objects or entities

Rows (tuples) correspond to individual entities

Each tuple is distinct; no duplicate tuples

Order of tuples is immaterial

Cardinality of relation = number of tuples

Columns correspond to attributes

Each column has a distinct name, the name of the attribute it represents

Order of attributes not important

Each cell contains at most one value

A column contains values from one domain

Domains consist of atomic values

Arity = number of attributes, sometimes called the degree of the relation

Example: Relations

Student table tells facts about students

Faculty table shows facts about faculty

Class table shows facts about classes, including what faculty member
teaches each

Enroll table relates students to classes

Student

stuId   lastName   firstName   major     credits
S1001   Smith      Tom         History   90
S1002   Chin       Ann         Math      36
S1005   Lee        Perry       History
S1010   Burns      Edward      Art       63
S1013   McCarthy   Owen        Math
S1015   Jones      Mary        Math      42
S1020   Rivera     Jane        CSC       15

Class

classNumber   facId   schedule   room
ART103A       F101    MWF9       H221
CSC201A       F105    TuThF10    M110
CSC203A       F105    MThF12     M110
HST205A       F115    MWF11      H221
MTH101B       F110    MTuTh9     H225
MTH103C       F110    MWF11      H225

Faculty

facId   name     department   rank
F101    Adams    Art          Professor
F105    Tanaka   CSC          Instructor
F110    Byrne    Math         Assistant
F115    Smith    History      Associate
F221    Smith    CSC          Professor

Enroll

stuId   classNumber   grade
S1001   ART103A
S1001   HST205A
S1002   ART103A
S1002   CSC201A
S1002   MTH103C
S1010   ART103A
S1010   MTH103C
S1020   CSC201A
S1020   MTH101B

Mathematical Relations
For two sets D1 and D2, the Cartesian product, D1 X D2, is the set of all ordered pairs in which
the first element is from D1 and the second is from D2. The domains for the two sets are arbitrary.
A relation, then, is any subset of the Cartesian product.
One can form a Cartesian product of 3 sets; a relation is any subset of the ordered triples so
formed.
This can extend to n sets, using n-tuples.
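
To make the definition concrete, here is a small Python sketch that builds a Cartesian product with itertools.product and checks whether a set of tuples is a relation over given domains. The domains D1 and D2 and the helper name is_relation_on are made up for the example.

```python
from itertools import product

# Two arbitrary (assumed) domains.
D1 = {1, 2}
D2 = {"a", "b"}

# D1 X D2: every ordered pair (first from D1, second from D2).
CARTESIAN = set(product(D1, D2))

# Any subset of the Cartesian product is a relation.
RELATION = {(1, "a"), (2, "b")}


def is_relation_on(tuples, *domains):
    """True if every tuple is drawn from the Cartesian product of the domains."""
    return set(tuples) <= set(product(*domains))
```

Because is_relation_on accepts any number of domains, the same check covers ordered triples and n-tuples in general.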
Database Relations

A relation schema, named R, is a set of attributes A1, A2, ..., An with their corresponding domains
D1, D2, ..., Dn
A relation r on relation schema R is a set of mappings from the attributes to their domains;
that is, r is a set of n-tuples (A1:d1, A2:d2, ..., An:dn) such that d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn
In a table representing the relation, list the Ai's as column headings, and let the (d1, d2, ..., dn)
become the n-tuples, the rows of the table

Relation Schema
A schema defines the following
Relation name
Attribute names and domains
Integrity constraints
e.g.,:

The values of a particular attribute in all tuples are unique

The values of a particular attribute in all tuples are greater than 0

Default values

Relational Database

Finite set of relations

Each relation consists of a schema definition and an instance of the relation

Database schema = set of relation schemas (and other things)

Database instance = set of (corresponding) relation instances

Example

Student (StuId: INT, LastName: STRING, FirstName: STRING, major: STRING,
credits: DEC)

Faculty (FacId: STRING, Name: STRING, Dept: DEPTS, Rank: RANKS)

Class (FacId: STRING, Schedule: STRING, Room: STRING, ClassNum:
COURSES)

Enroll (ClassNum: COURSES, StudId: DEC, Grade: GRADES)

Department(DeptId: DEPTS, Name: STRING)

TableName (attr1:type, attr2:type, ... ) is a simplified non-SQL description of the table.

Relation Keys
Relations never have duplicate tuples, so you can always tell tuples apart; implies there is always
a key (which may be a composite of all attributes, in worst case)
Superkey: set of attributes that uniquely identifies tuples
Candidate key: superkey such that no proper subset of itself is also a superkey (i.e. it has no
unnecessary attributes)

Primary key: candidate key chosen for unique identification of tuples


Cannot verify a key by looking at an instance; need to consider semantic information to ensure
uniqueness
A foreign key is an attribute or combination of attributes whose values match the primary key of some relation
(called its home relation). Usually the home relation is some other relation, but there can be cases
of self-referencing (recursive relationship)

Key Constraint
Values in a column (or columns) of a relation are unique: at most one row in a relation instance
can contain a particular value(s)
Key - set of attributes satisfying key constraint

e.g., Id in Student,

e.g., (StudId, CrsCode, Semester) in Transcript

Minimality - no subset of a key is a key. When you determine a key, this rule should be applied.

(StudId, CrsCode) is not a key of Transcript

Superkey - set of attributes containing key

(Id, Name) is a superkey of Student, but as a key, it's not minimal

Every relation has a key. The goal is to determine the "best" key, but a relation can have several
keys:

primary key (Id in Student) (cannot be null) -- only one is designated per
relation

candidate key ((Name, Address) in Student) is a potential key and is
sometimes used as information to the DBMS to set up an index for efficient
lookup.
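
The uniqueness and minimality tests can be sketched in Python against the sample Faculty instance shown earlier. This is only an instance-level check: it can show that a set of attributes is not a key, but a set that passes is merely consistent with the current rows, which is why semantic knowledge is still needed. The function names are assumptions.

```python
from itertools import combinations

FACULTY = [
    {"facId": "F101", "name": "Adams", "department": "Art", "rank": "Professor"},
    {"facId": "F105", "name": "Tanaka", "department": "CSC", "rank": "Instructor"},
    {"facId": "F110", "name": "Byrne", "department": "Math", "rank": "Assistant"},
    {"facId": "F115", "name": "Smith", "department": "History", "rank": "Associate"},
    {"facId": "F221", "name": "Smith", "department": "CSC", "rank": "Professor"},
]


def is_superkey(rows, attrs):
    """True if no two rows of this instance agree on all of attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def minimal_keys(rows, all_attrs):
    """Superkeys of this instance that contain no smaller superkey."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(all_attrs, size):
            contains_smaller = any(set(k) <= set(combo) for k in keys)
            if not contains_smaller and is_superkey(rows, combo):
                keys.append(combo)
    return keys
```

On this instance (name, department) happens to be unique, yet nothing prevents a second Smith from joining CSC later, so it should not be declared a key on the evidence of the data alone.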

Foreign Key Constraint
Also known as referential integrity => an item named in one relation must correspond to tuple(s)
in another that describes the item
Examples:

Transcript (CrsCode) references Course(CrsCode)

Professor(DeptId) references Department(DeptId)

We say "a1 is a foreign key of R1 referring to a2 in R2", meaning that if v is the value of a1, then
there is a unique tuple in R2 in which a2 has the same value v

This is a special case of referential integrity: a2 must be a candidate key of R2
(CrsCode is a key of Course), i.e., not necessarily the primary key (often is,
however)

If no such row exists in R2 then we have a violation of referential integrity

Not all rows of R2 need to be referenced: the relationship is not symmetric (some
course might not be taught)

Value of a foreign key might not be specified (DeptId column of some
professor might be null)

Example

Note the foreign key might consist of several columns:

(CrsCode, Semester) of Transcript references (CrsCode, Sem) of Teaching

In general, when R1(a1, ..., an) references R2(b1, ..., bn):

There exists a 1 - 1 relationship between a1, ..., an and b1, ..., bn

ai and bi must have the same base domains (although not necessarily the
same names)

b1, ..., bn is a candidate key of R2
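
The behaviour described above can be observed with a short sqlite3 sketch. SQLite only enforces foreign keys after PRAGMA foreign_keys = ON is issued on the connection; the sample department code, the column types, and the helper names are assumptions made for the example.

```python
import sqlite3


def make_fk_db():
    """Professor(DeptId) references Department(DeptId), with enforcement on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE Department (DeptId TEXT PRIMARY KEY);
        CREATE TABLE Professor (
            Id     INTEGER PRIMARY KEY,
            DeptId TEXT REFERENCES Department (DeptId)
        );
        INSERT INTO Department VALUES ('CS');
    """)
    return conn


def try_insert_professor(conn, prof_id, dept_id):
    """Return True if the row is accepted, False on an integrity violation."""
    try:
        conn.execute("INSERT INTO Professor VALUES (?, ?)", (prof_id, dept_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

A professor in 'CS' is accepted, one in a department that has no row is rejected, and one with a NULL DeptId is accepted, matching the note that a foreign key value might not be specified.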

Types of Integrity
Data Integrity
Data integrity validates the data before it is stored in the columns of the table.
SQL Server supports four types of data integrity:
Entity Integrity

Entity Integrity can be enforced through indexes, UNIQUE constraints and PRIMARY KEY
constraints.
Domain Integrity

Domain integrity validates data for a column of the table.
It can be enforced using:

Foreign key constraints,

Check constraints,

Default definitions,

NOT NULL.

Referential Integrity

FOREIGN KEY and CHECK constraints are used to enforce Referential Integrity.
User-Defined Integrity

It enables you to create business logic which is not possible to develop using system constraints.
You can use stored procedure, trigger and functions to create user-defined integrity.
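
The constraint-based kinds of integrity can be tried out in a few lines. The sketch below uses SQLite through Python rather than SQL Server, so it shows the general idea of PRIMARY KEY, NOT NULL, DEFAULT, and CHECK rather than SQL Server's exact behaviour; the table layout and helper names are assumptions.

```python
import sqlite3


def make_student_db():
    """One table exercising entity and domain integrity constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Student (
            StuId    TEXT PRIMARY KEY,                       -- entity integrity
            LastName TEXT NOT NULL,                          -- value required
            Credits  INTEGER DEFAULT 0 CHECK (Credits >= 0)  -- default and range check
        )""")
    return conn


def accepts(conn, sql, params=()):
    """Run one statement; report whether the constraints allowed it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

For example, inserting a student without a Credits value stores the default 0, while a negative Credits value, a missing LastName, or a repeated StuId is rejected.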

EF Codd Rules
A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model as introduced by E. F. Codd. Most popular
commercial and open source databases currently in use are based on the relational model.
A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables
and the relationship among the data is also stored in the form of tables.
E.F. Codd, the mathematician who proposed the relational model, introduced 12 rules for the relational model for
databases, commonly known as Codd's rules. The rules mainly define what is required for a
DBMS to be considered relational, i.e., an RDBMS. There is also one more rule, Rule 0,
which specifies that the system should use its relational facilities to manage the database. The
rules and their descriptions are as follows:
Rule 0: Foundation Rule
A relational database management system should be capable of using its relational facilities
(exclusively) to manage the database.
Rule 1: Information Rule
All information in the database is to be represented in one and only one way. This is achieved by
values in column positions within rows of tables.
Rule 2: Guaranteed Access Rule
All data must be accessible with no ambiguity; that is, each and every datum (atomic value) is
guaranteed to be logically accessible by resorting to a combination of table name, primary key
value and column name.
Rule 3: Systematic treatment of null values
Null values (distinct from empty character string or a string of blank characters and distinct from
zero or any other number) are supported in the fully relational DBMS for representing missing
information in a systematic way, independent of data type.

Rule 4: Dynamic On-line Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary data, so
authorized users can apply the same relational language to its interrogation as they apply to
regular data. Authorized users can access the database structure using a common language,
e.g. SQL.
Rule 5: Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However,
there must be at least one language whose statements are expressible, per some well-defined
syntax, as character strings and whose ability to support all of the following is comprehensive:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. Transaction boundaries (begin, commit, and rollback).

Rule 6: View Updating Rule
All views that are theoretically updateable are also updateable by the system.
Rule 7: High-level Insert, Update, and Delete
The system must support insert, update, and delete operations on sets of rows, not just on a single row at a
time.
Rule 8: Physical Data Independence
Application programs and terminal activities remain logically unimpaired whenever any changes
are made in either storage representation or access methods.
Rule 9: Logical Data Independence
Application programs and terminal activities remain logically unimpaired when information
preserving changes of any kind that theoretically permit unimpairment are made to the base
tables.

Rule 10: Integrity Independence
Integrity constraints specific to a particular relational database must be definable in the relational
data sublanguage and storable in the catalog, not in the application programs.
Rule 11: Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application programs and
terminal activities to remain logically unimpaired whether and whenever data are physically
centralized or distributed.
Rule 12: Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that
low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in
the higher-level (multiple-records-at-a-time) relational language.
On the basis of the above rules, no fully relational DBMS is available today.
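
Rule 4 (the catalog is queried like ordinary data) can be illustrated with SQLite, which exposes its schema through a table named sqlite_master. This is just one system's catalog, chosen because it ships with Python; the table names created here and the function name are assumptions.

```python
import sqlite3


def table_names(conn):
    """List user tables by running an ordinary SELECT against the catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def demo_catalog():
    """Create two tables, then read their names back from the catalog."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Student (StuId TEXT PRIMARY KEY, LastName TEXT)")
    conn.execute("CREATE TABLE Faculty (FacId TEXT PRIMARY KEY, Name TEXT)")
    return table_names(conn)
```

The same SELECT syntax used for Student or Faculty rows is used to ask which tables exist, which is the point of the rule.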

Functional Dependencies

Objectives of Normalization
Develop a good description of the data, its relationships and constraints
Produce a stable set of relations that

Is a faithful model of the enterprise

Is highly flexible

Reduces redundancy-saves space and reduces inconsistency in data

Is free of update, insertion and deletion anomalies

Normal Forms

First normal form -1NF

Second normal form-2NF

Third normal form-3NF

Boyce-Codd normal form-BCNF

Fourth normal form-4NF

Fifth normal form-5NF

Domain/Key normal form-DKNF

Each is contained within the previous form; each has stricter rules than the previous form

Limitations of E-R Designs


E-R modeling provides a set of guidelines, but does not result in a unique database schema.
Nor does it provide a way of evaluating alternative schemas.
Normalization theory provides a mechanism for analyzing and refining the schema produced by
an E-R design, or any other design.

Redundancy
Dependencies between attributes within a relation cause redundancy
Ex. All addresses in the same town have the same zip code
SSN    Name    Town         Zip
1234   Joe     Huntingdon   16652
2345   Mary    Huntingdon   16652
3456   Tom     Huntingdon   16652
5948   Harry   Alexandria   16603

There's clearly redundant information stored here.
Consistency and integrity are harder to maintain even in this simple example, e.g., ensuring
that the zip code always refers to the same city and that the city is spelled consistently.
Note that we don't have a zip-code-to-city fact stored unless there is a person from that zip code.

Redundancy and Other Problems


Set-valued or multi-valued attributes in the E-R diagram result in multiple rows in corresponding
table
Example: Person (SSN, Name, Address, Hobbies)

A person entity with multiple hobbies yields multiple rows in table Person

Hence, the association between Name and Address for the same person is
stored redundantly

SSN is key of entity set, but (SSN, Hobbies) is key of corresponding relation
below

The relation Person can't describe people without hobbies,
but more important is the replication of what would be the key value

SSN
1111
1111
2222

Anomalies
An anomaly is an inconsistent, incomplete, or contradictory state of the database

Insertion anomaly - user is unable to insert a new record of data when it
should be possible to do so, because not all other information is available.

Deletion anomaly - when a record is deleted, other information that is tied
to it is also deleted

Update anomaly - a record is updated, but other appearances of the same
items are not updated

Redundancy leads to the following anomalies:


Update anomaly: A change in Address must be made in several places. Updating one fact may
require updating multiple tuples.
Deletion anomaly: Deleting one fact may delete other information. Suppose a person gives up
all hobbies. Do we:
Set Hobby attribute to null? No, since Hobby is part of key
Delete the entire row? No, since we lose other information in the row
Insertion anomaly: To record one fact may require more information than is available. Hobby
value must be supplied for any inserted row since Hobby is part of key

Decomposition
Solution: use two relations to store Person information
Person1 (SSN, Name, Address)
Hobbies (SSN, Hobby)
The decomposition is more general: people without hobbies can now be described
No update anomalies:

Name and address stored once

A hobby can be separately supplied or deleted
Decomposition is the process of breaking a relation into two or more relations to eliminate the
redundancies and corresponding anomalies.
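
The effect of the decomposition on the update and deletion anomalies can be seen in a small sqlite3 sketch. The person, addresses, and hobbies are invented sample data, and build_both_designs and addresses are hypothetical helper names.

```python
import sqlite3


def build_both_designs():
    """Load the same facts into the single-table and the decomposed design."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Single relation: the key is (SSN, Hobby), so the address repeats per hobby.
        CREATE TABLE Person (SSN INTEGER, Name TEXT, Address TEXT, Hobby TEXT NOT NULL,
                             PRIMARY KEY (SSN, Hobby));
        INSERT INTO Person VALUES (1111, 'Joe', '123 Main', 'stamps');
        INSERT INTO Person VALUES (1111, 'Joe', '123 Main', 'coins');

        -- Decomposed design: name and address stored once.
        CREATE TABLE Person1 (SSN INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
        CREATE TABLE Hobbies (SSN INTEGER REFERENCES Person1 (SSN), Hobby TEXT,
                              PRIMARY KEY (SSN, Hobby));
        INSERT INTO Person1 VALUES (1111, 'Joe', '123 Main');
        INSERT INTO Hobbies VALUES (1111, 'stamps');
        INSERT INTO Hobbies VALUES (1111, 'coins');
    """)
    return conn


def addresses(conn, table, ssn):
    """Distinct addresses on record for one SSN (table is one of our fixed names)."""
    rows = conn.execute(
        f"SELECT DISTINCT Address FROM {table} WHERE SSN = ?", (ssn,)
    )
    return sorted(row[0] for row in rows)
```

Updating the address in only one Person row leaves two different addresses on record for the same SSN, whereas Person1 has a single row to change. Deleting all of a person's rows from Hobbies leaves the Person1 row, and with it the name and address, intact.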

Normalization Theory
The result of E-R analysis needs further refinement.
Appropriate decomposition can solve problems. What is appropriate?
The underlying theory is referred to as normalization theory and is based on functional
dependencies (and other kinds, like multivalued dependencies)

Informal Guidelines for Relation Design


Want to keep the semantics of the relation attributes clear. The information in a tuple should
represent exactly one fact or an entity. The hidden or buried entities are what we want to
discover and eliminate.

Design a relation schema so that it is easy to explain its meaning.

Do not combine attributes from multiple entity types and relationship types
into a single relation. Use a view if you want to present a simpler layout to
the end user.

A relation schema should correspond to one entity type or relationship type.

Minimize redundant information in tuples, thus reducing update anomalies

If anomalies are present, try to decompose the relation into two or more to
represent the separate facts, or document the anomalies well for
management in the applications programs.

Minimize the use of null values. Nulls have multiple interpretations:

The attribute does not apply to this tuple

The attribute value is unknown

The attribute value is absent

The attribute value might represent an actual value

If nulls are likely (non-applicable) then consider decomposition of the relation into two or more
relations that hold only the non-null valued tuples.

Do not permit the creation of spurious tuples

Too much decomposition of relations into smaller ones may also lose information or generate
erroneous information

Be sure that relations can be logically joined using natural join and the result
doesn't generate relationships that don't exist

Functional Dependencies
FD's are constraints on well-formed relations and represent a formalism on the
infrastructure of relation.

Definition: A functional dependency (FD) on a relation schema R is a constraint X → Y, where
X and Y are subsets of attributes of R.
Definition: an FD is a relationship between an attribute "Y" and a determinant (1 or more other
attributes) "X" such that for a given value of a determinant the value of the attribute is uniquely
defined.

X is a determinant

X determines Y

Y is functionally dependent on X

X → Y

X → Y is trivial if Y ⊆ X

Definition: An FD X → Y is satisfied in an instance r of R if for every pair of tuples, t and s: if t
and s agree on all attributes in X then they must agree on all attributes in Y
A key constraint is a special kind of functional dependency: all attributes of relation occur on the
right-hand side of the FD:

SSN → SSN, Name, Address
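
The "satisfied in an instance" definition translates directly into code. The sketch below checks it against the SSN/Name/Town/Zip rows from the Redundancy example; the function name satisfies_fd is an assumption. As with keys, an instance can refute an FD but cannot prove that it holds for the schema in general.

```python
PEOPLE = [
    {"SSN": 1234, "Name": "Joe", "Town": "Huntingdon", "Zip": "16652"},
    {"SSN": 2345, "Name": "Mary", "Town": "Huntingdon", "Zip": "16652"},
    {"SSN": 3456, "Name": "Tom", "Town": "Huntingdon", "Zip": "16652"},
    {"SSN": 5948, "Name": "Harry", "Town": "Alexandria", "Zip": "16603"},
]


def satisfies_fd(rows, lhs, rhs):
    """True if rows that agree on every attribute in lhs also agree on rhs."""
    seen = {}  # maps each lhs value to the rhs value first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X value, different Y value
    return True
```

Here Zip determines Town and SSN determines every other attribute, while Town does not determine SSN because three people share Huntingdon.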

Example Functional Dependencies
Let R be
NewStudent(stuId, lastName, major, credits, status, socSecNo)
FDs in R include

{stuId} → {lastName}, but not the reverse

{stuId} → {lastName, major, credits, status, socSecNo, stuId}

{socSecNo} → {stuId, lastName, major, credits, status, socSecNo}

{credits} → {status}, but not {status} → {credits}

ZipCode → AddressCity

16652 is Huntingdon's ZIP

ArtistName → BirthYear

Picasso was born in 1881

Autobrand → Manufacturer, Engine type

Pontiac is built by General Motors with gasoline engine

Author, Title → PublDate

Shakespeare's Hamlet was published in 1600

Trivial Functional Dependency

The FD X → Y is trivial if set {Y} is a subset of set {X}

Examples: If A and B are attributes of R,

{A} → {A}

{A,B} → {A}

{A,B} → {B}

{A,B} → {A,B}

are all trivial FDs and will not contribute to the evaluation of normalization.

FD Axioms
Understanding: Functional dependencies are recognized by analysis of the real world; there is no
automation or algorithm. Finding or recognizing them is the database designer's task.
FD manipulations:

Soundness -- no incorrect FD's are generated

Completeness -- all FD's can be generated

Axiom Name: Reflexivity
Axiom: if a is a set of attributes and b ⊆ a, then a → b
Example: SSN,Name → SSN

Axiom Name: Augmentation
Axiom: if a → b holds and c is a set of attributes, then ca → cb
Example: SSN → Name then SSN,Phone → Name,Phone

Axiom Name: Transitivity
Axiom: if a → b holds and b → c holds, then a → c holds
Example: SSN → Zip and Zip → City then SSN → City

Axiom Name: Union or Additivity *
Axiom: if a → b and a → c hold then a → bc holds
Example: SSN → Name and SSN → Zip then SSN → Name,Zip

Axiom Name: Decomposition or Projectivity *
Axiom: if a → bc holds then a → b and a → c hold
Example: SSN → Name,Zip then SSN → Name and SSN → Zip

Axiom Name: Pseudotransitivity *
Axiom: if a → b and cb → d hold then ac → d holds
Example: Address → Project and Project,Date → Amount then Address,Date → Amount

(NOTE)

ab → c does NOT imply a → b and b → c

* Derived rules; reflexivity, augmentation, and transitivity are Armstrong's Axioms (basic axioms)

Closure
Find all FD's for attributes a in a relation R
a+ denotes the set of attributes that are functionally determined by a
IF attribute(s) a IS/ARE A SUPERKEY OF R THEN a+ SHOULD BE THE WHOLE
RELATION R. This is our goal. Any attributes in a relation not part of the closure
indicates a problem with the design.
Algorithm for Closure

result := a; // start with superkey a
WHILE (more changes to result) DO
    FOREACH (FD b → c in R) DO
        IF b ⊆ result
        THEN result := result ∪ c
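
The closure algorithm above can be written out in Python roughly as follows. FDs are represented as (left-hand side, right-hand side) pairs of sets, which is an assumed representation, and the sample FDs reuse the SSN, Name, Zip, and City attributes from the axiom examples. The helpers implies and is_superkey are additions built on top of the closure.

```python
def closure(attrs, fds):
    """Return attrs+, the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:                       # WHILE (more changes to result)
        changed = False
        for lhs, rhs in fds:             # FOREACH FD b -> c
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)       # result := result UNION c
                changed = True
    return result


def implies(fds, lhs, rhs):
    """X -> Y follows from fds exactly when Y is contained in X+."""
    return set(rhs) <= closure(lhs, fds)


def is_superkey(attrs, all_attrs, fds):
    """attrs is a superkey when its closure is the whole relation schema."""
    return closure(attrs, fds) == set(all_attrs)


# Sample FDs: SSN -> Name, SSN -> Zip, Zip -> City
FDS = [({"SSN"}, {"Name"}), ({"SSN"}, {"Zip"}), ({"Zip"}, {"City"})]
```

With these FDs the closure of {SSN} is all four attributes, so SSN is a superkey, while the closure of {Zip} is only {Zip, City}.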

Normalization
Process to revise relational schemas to hold desirable properties
1NF, 2NF, 3NF, BCNF, 4NF, 5NF
Properties of bad design

Repetition of information

Inability to represent certain information

Loss of information

Decomposition
Replace an "unnormalized" relation by a set of normalized relations
If R is a relation scheme then
{R1, R2, ..., Rn} is a decomposition
if R = R1 ∪ R2 ∪ ... ∪ Rn
Desirable properties of decomposition

Lossless join decomposition

Dependency preservation

all FDs are represented in the resulting relations

Minimum repetition of information

Lossless Join Decomposition
If r is a relation on scheme R
and ri is the projection of r onto Ri, then
r is a subset of the natural join of the ri's
A lossless join decomposition is one in which the ri's, when joined, produce exactly r,
i.e., no spurious tuples are generated and none are lost.
Consider the following relation

enroll (stId, crsNo, dateEnrolled, roomNo, instructor)

Suppose we decompose the above relation into two relations enroll1 and enroll2 as follows

enroll1 (stId, crsNo, dateEnrolled)

enroll2 (dateEnrolled, roomNo, instructor)

There are many problems with this decomposition but we focus on one aspect at the moment. Let
an instance of the relation enroll be
stId     crsNo   dateEnrolled   roomNo   instructor
830057   CP302   1FEB2004       MP006    Gupta
830057   CP303   1FEB2004       MP006    Jones
820159   CP302   10JAN2004      MP006    Gupta
825678   CP304   1FEB2004       CE122    Wilson
826789   CP305   15JAN2004      EA123    Smith

and let the decomposed relations enroll1 and enroll2 be:

enroll1
stId     crsNo   dateEnrolled
830057   CP302   1FEB2004
830057   CP303   1FEB2004
820159   CP302   10JAN2004
825678   CP304   1FEB2004
826789   CP305   15JAN2004

enroll2
dateEnrolled   roomNo   instructor
1FEB2004       MP006    Gupta
1FEB2004       MP006    Jones
10JAN2004      MP006    Gupta
1FEB2004       CE122    Wilson
15JAN2004      EA123    Smith

All the information that was in the relation enroll appears to be still available in enroll1 and
enroll2 but this is not so. Suppose, we wanted to retrieve the student numbers of all students
taking a course from Wilson, we would need to join enroll1 and enroll2. The join would have 11
tuples as follows:

stId     crsNo   dateEnrolled   roomNo   instructor
830057   CP302   1FEB2004       MP006    Gupta
830057   CP302   1FEB2004       MP006    Jones
830057   CP303   1FEB2004       MP006    Gupta
830057   CP303   1FEB2004       MP006    Jones
830057   CP302   1FEB2004       CE122    Wilson
830057   CP303   1FEB2004       CE122    Wilson

(add further tuples ...)


The join contains tuples that were not in the original!
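
The 11-tuple join can be reproduced with a few lines of Python that project the enroll instance onto the two schemas and then natural-join the projections. The functions project and natural_join are small illustrative helpers written for this example, not library calls.

```python
ATTRS = ("stId", "crsNo", "dateEnrolled", "roomNo", "instructor")
ENROLL = {
    ("830057", "CP302", "1FEB2004", "MP006", "Gupta"),
    ("830057", "CP303", "1FEB2004", "MP006", "Jones"),
    ("820159", "CP302", "10JAN2004", "MP006", "Gupta"),
    ("825678", "CP304", "1FEB2004", "CE122", "Wilson"),
    ("826789", "CP305", "15JAN2004", "EA123", "Smith"),
}
ATTRS1 = ("stId", "crsNo", "dateEnrolled")
ATTRS2 = ("dateEnrolled", "roomNo", "instructor")


def project(rows, attrs, keep):
    """Projection: keep only the named columns and drop duplicate tuples."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}


def natural_join(r1, attrs1, r2, attrs2):
    """Natural join: pair tuples that agree on every shared attribute."""
    common = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[attrs1.index(a)] == t2[attrs2.index(a)] for a in common):
                out.add(t1 + tuple(t2[attrs2.index(a)] for a in extra))
    return out


ENROLL1 = project(ENROLL, ATTRS, ATTRS1)
ENROLL2 = project(ENROLL, ATTRS, ATTRS2)
JOINED = natural_join(ENROLL1, ATTRS1, ENROLL2, ATTRS2)
SPURIOUS = JOINED - ENROLL  # tuples the join invents
```

The join keeps every original tuple but adds spurious ones, because dateEnrolled, the only shared attribute, is not a key of either projection.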

Null Values in Tuples


Relations should be designed such that their tuples will have as few NULL values as possible
Attributes that are NULL frequently could be placed in separate relations (with the
primary key)
Reasons for nulls:

attribute not applicable or invalid

attribute value unknown (may exist)

value known to exist, but unavailable

Spurious Tuples
Bad designs for a relational database may result in erroneous results for certain JOIN
operations
The "lossless join" property is used to guarantee meaningful results for join operations
The relations should be designed to satisfy the lossless join condition. No spurious tuples
should be generated by doing a natural-join of any relations.
There are two important properties of decompositions:
(a) non-additive or losslessness of the corresponding join

(b) preservation of the functional dependencies.


Note that property (a) is extremely important and cannot be sacrificed. Property (b) is less
stringent and may be sacrificed.

First Normal Form (1NF): Disallows composite attributes,
multivalued attributes, and nested relations, i.e., attributes
whose values for an individual tuple are non-atomic

flat file

no repeating fields, sets or lists--atomic or single-valued values

no missing values

by definition of relations, all relations are 1NF

Here we can see that there is a non-atomic value {Bellaire, Sugarland, Houston}. As
in fig. c, the values can be written as separate tuples
(or)
the schema can be converted into 2 relations as below
{Dnumber, Dname, Dmgrssn}
{Dnumber, Dlocation}
(or)
the schema can be converted into one relation with a separate column for each of the 3 location values. But
this table will have null values for departments with fewer locations
{Dnumber, Dname, Dmgrssn, Dlocation1, Dlocation2, Dlocation3}

The 2nd option is considered best because it reduces redundancy and does not
introduce null values

Counter-Example for 1NF

See Figure 5.4(a) NewStu Table (Assume students can have double majors)
stuId   lastName   major          credits   status     socSecNo
S1001   Smith      History        90        Senior     100429500
S1003   Jones      Math           95        Senior     010124567
S1006   Lee        CSC, Math      15        Freshman   088520876
S1010   Burns      Art, English   63        Junior     099320985
S1060   Jones      CSC            25        Freshman   064624738

NewStu(StuId, lastName, major, credits, status, socSecNo) Assume students can have more
than one major

1NF Decomposition
The major attribute is not single-valued for each tuple
Ensuring 1NF

Best solution: For each multi-valued attribute, create a new table, in which you place the key of
the original table and the multi-valued attribute. Keep the original table, with its key
NewStu2(stuId, lastName, credits,status, socSecNo)
Majors(stuId, major)

NewStu2:
stuId   lastName   credits   status     socSecNo
S1001   Smith      90        Senior     100429500
S1003   Jones      95        Senior     010124567
S1006   Lee        15        Freshman   088520876
S1010   Burns      63        Junior     099320985
S1060   Jones      25        Freshman   064624738

Majors:
stuId   major
S1001   History
S1003   Math
S1006   CSC
S1006   Math
S1010   Art
S1010   English
S1060   CSC
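
The decomposition above can be sketched in Python. This is only an illustration: the list-valued major field and the helper split_multivalued are assumptions made for the example, not part of the original notes.

```python
# NewStu rows with the multi-valued attribute held as a list (assumed representation).
new_stu = [
    {"stuId": "S1001", "lastName": "Smith", "major": ["History"],
     "credits": 90, "status": "Senior", "socSecNo": "100429500"},
    {"stuId": "S1003", "lastName": "Jones", "major": ["Math"],
     "credits": 95, "status": "Senior", "socSecNo": "010124567"},
    {"stuId": "S1006", "lastName": "Lee", "major": ["CSC", "Math"],
     "credits": 15, "status": "Freshman", "socSecNo": "088520876"},
    {"stuId": "S1010", "lastName": "Burns", "major": ["Art", "English"],
     "credits": 63, "status": "Junior", "socSecNo": "099320985"},
    {"stuId": "S1060", "lastName": "Jones", "major": ["CSC"],
     "credits": 25, "status": "Freshman", "socSecNo": "064624738"},
]


def split_multivalued(rows, key, attr):
    """Move a multi-valued attribute into its own (key, attr) table."""
    base = [{k: v for k, v in row.items() if k != attr} for row in rows]
    pairs = [{key: row[key], attr: value} for row in rows for value in row[attr]]
    return base, pairs


new_stu2, majors = split_multivalued(new_stu, "stuId", "major")
for row in majors:
    print(row["stuId"], row["major"])
```

Every value in both resulting tables is atomic, and joining them on stuId recovers each student's majors.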

Another method for 1NF


If the number of repeats is limited, make additional columns for multiple values
Student(stuId, lastName, major1, major2, credits, status, socSecNo)
stuId   lastName   major1    major2    credits   status     socSecNo
S1001   Smith      History             90        Senior     100429500
S1003   Jones      Math                95        Senior     010124567
S1006   Lee        CSC       Math      15        Freshman   088520876
S1010   Burns      Art       English   63        Junior     099320985
S1060   Jones      CSC                 25        Freshman   064624738

What is Full Functional Dependency


In relation R, a set of attributes B is fully functionally dependent on a set of attributes A if B is
functionally dependent on A but not functionally dependent on any proper subset of A
This means every attribute in A is needed to functionally determine B
Partial Functional Dependency Example

NewClass(courseNo, stuId, stuLastName, facId, schedule, room, grade)

FDs:
{courseNo,stuId} -> {lastName}
{courseNo,stuId} -> {facId}
{courseNo,stuId} -> {schedule}
{courseNo,stuId} -> {room}
{courseNo,stuId} -> {grade}
courseNo -> facId //** partial FD
courseNo -> schedule //** partial FD
courseNo -> room //** partial FD
stuId -> lastName //** partial FD
plus trivial FDs that are partial
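
Partial dependencies can be found mechanically by computing attribute closures. The sketch below is a minimal illustration: the function names closure and partial_dependencies are assumptions for the example, and the FD list is a compact restatement of the NewClass dependencies above.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(attrs, key, fds):
    """Map each non-key attribute to a proper subset of the key that determines it."""
    found = {}
    for attr in sorted(set(attrs) - set(key)):
        for size in range(1, len(key)):
            for subset in combinations(sorted(key), size):
                if attr in closure(subset, fds):
                    found.setdefault(attr, set(subset))
    return found


new_class = {"courseNo", "stuId", "lastName", "facId", "schedule", "room", "grade"}
key = {"courseNo", "stuId"}
fds = [
    ({"courseNo"}, {"facId", "schedule", "room"}),
    ({"stuId"}, {"lastName"}),
    ({"courseNo", "stuId"}, {"grade"}),
]

partial = partial_dependencies(new_class, key, fds)
print(partial)
```

Only grade is missing from the result, because it is the one attribute that needs the whole key.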

Second Normal Form (2NF)


A relation is in second normal form (2NF) if it is in first normal form and all the non-key attributes are fully functionally dependent on the key. That is, the relation is:

1NF

each non-key attribute is functionally dependent on a candidate key

if the key is composite, no FD exists between a non-key attribute and a subkey (just part of the key)

If the key has only one attribute, and R is 1NF, R is automatically 2NF


2NF is a scientific accident; has little practical or theoretical value

2NF Decomposition
Converting to 2NF

Identify each partial FD

Remove the attributes that depend on each of the determinants so identified

Place these determinants in separate relations along with their dependent attributes

In original relation keep the composite key and any attributes that are fully
functionally dependent on all of it

Even if the composite key has no dependent attributes, keep that relation to
connect logically the others

Example

The EMP_PROJ relation in Figure 15.3(b) is in 1NF but is not in 2NF. The nonprime attribute Ename violates 2NF because of FD2, as do the nonprime attributes Pname and Plocation because of FD3. The functional dependencies FD2 and FD3 make Ename, Pname, and Plocation partially dependent on the primary key {Ssn, Pnumber} of EMP_PROJ, thus violating the 2NF test.

Another 2NF Example

NewClass(courseNo, stuId, stuLastName, facId, schedule, room, grade )


FDs grouped by determinant:
{courseNo} -> {courseNo, facId, schedule, room}
{stuId} -> {stuId, lastName}
{courseNo,stuId} -> {courseNo, stuId, facId, schedule, room, lastName, grade}
Create tables grouped by determinants:
Course(courseNo, facId, schedule, room)
Stu(stuId, lastName)
Keep relation with original composite key, with attributes FD on it, if any
NewStu2(courseNo, stuId, grade)
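
The grouping by determinant can be sketched as a small routine. This is one possible reading of the steps above, not a definitive algorithm; the function to_2nf and its return format, a list of (key, attributes) pairs, are assumptions for illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def to_2nf(attrs, key, fds):
    """Split off attributes that depend on a proper subset of the key."""
    nonkey = set(attrs) - set(key)
    relations, moved = [], set()
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            dependents = (closure(subset, fds) & nonkey) - moved
            if dependents:
                relations.append((set(subset), set(subset) | dependents))
                moved |= dependents
    # Keep the composite key with whatever still depends on all of it.
    relations.append((set(key), set(key) | (nonkey - moved)))
    return relations


new_class = {"courseNo", "stuId", "lastName", "facId", "schedule", "room", "grade"}
fds = [
    ({"courseNo"}, {"facId", "schedule", "room"}),
    ({"stuId"}, {"lastName"}),
    ({"courseNo", "stuId"}, {"grade"}),
]

result = to_2nf(new_class, {"courseNo", "stuId"}, fds)
for rel_key, rel_attrs in result:
    print(sorted(rel_key), sorted(rel_attrs))
```

The three relations produced correspond to Course, Stu, and the remaining relation on the composite key.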

What is Transitive Dependency?


If A, B, and C are attributes of relation R, such that A -> B and B -> C, then C is transitively dependent on A
Example:
NewStudent (stuId, lastName, major, credits, status)
FD:
credits -> status
By transitivity:
stuId -> credits and credits -> status implies stuId -> status
Transitive dependencies cause update, insertion, deletion anomalies.

Third Normal Form (3NF): A relation schema R is in third normal form (3NF) if it is in 2NF and no non-prime attribute A in R is transitively dependent on the primary key
Equivalently, a relation is in third normal form (3NF) if whenever a non-trivial functional dependency X -> A exists, then either X is a superkey or A is a member of some candidate key
To be 3NF, a relation must be 2NF and have no transitive dependencies
No non-key attribute determines another non-key attribute. Here "key" includes every candidate key
3NF Decomposition

Remove the dependent attribute, status, from the relation

Create a new table with the dependent attribute and its determinant, credits

Keep the determinant in the original table

Example

The relation schema EMP_DEPT in Figure 15.3(a) is in 2NF, since no partial dependencies on a key exist. However, EMP_DEPT is not in 3NF because of the transitive dependency of Dmgr_ssn (and also Dname) on Ssn via Dnumber. We can normalize EMP_DEPT by decomposing it into two 3NF relations.

Another example

PRESIDENTS(Pres, Spouse, Party, Founded)

Pres -> Spouse

Pres -> Party

Party -> Founded

Decomposes into:
PRESIDENTS(Pres, Spouse, Party)
PARTIES(Party, Founded)

NewStudent(stuId, lastName, major, credits, status)

credits -> status

Decomposes into:
NewStu2(stuId, lastName, major, credits)
Stats(credits, status)
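
The 3NF condition (X is a superkey or A is part of a candidate key) can be tested directly. The sketch below checks only the FDs that are listed and assumes the candidate keys are supplied by hand; the function violations_3nf and the FD stuId -> lastName, major, credits are assumptions made for the example.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def violations_3nf(attrs, fds, candidate_keys):
    """List (X, A) pairs where X is not a superkey and A is not a prime attribute."""
    prime = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= set(attrs):
            continue  # X is a superkey, so this FD is allowed
        for a in sorted(set(rhs) - set(lhs)):
            if a not in prime:
                bad.append((tuple(sorted(lhs)), a))
    return bad


fds = [
    ({"stuId"}, {"lastName", "major", "credits"}),  # assumed: stuId is the key
    ({"credits"}, {"status"}),
]
new_student = {"stuId", "lastName", "major", "credits", "status"}

before = violations_3nf(new_student, fds, [{"stuId"}])
after_new_stu2 = violations_3nf({"stuId", "lastName", "major", "credits"},
                                [fds[0]], [{"stuId"}])
after_stats = violations_3nf({"credits", "status"}, [fds[1]], [{"credits"}])
print(before, after_new_stu2, after_stats)
```

Before the decomposition the check reports credits -> status; afterwards both NewStu2 and Stats come back clean.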

Boyce/Codd Normal Form (BCNF)


A relation is in Boyce/Codd Normal Form (BCNF) if whenever a non-trivial functional dependency X -> A exists, then X is a superkey
Stricter than 3NF, which allows A to be part of a candidate key
If there is just one single candidate key, the forms are equivalent

3NF

for all FD's each determinant is a candidate key

3NF relations are BCNF if there is only one candidate key and the key is not
composite

Generally can reach 3NF or BCNF immediately

Example

Suppose that we have thousands of lots in the relation but the lots are from only two counties: DeKalb and Fulton. Suppose also that lot sizes in DeKalb County are only 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0 acres, whereas lot sizes in Fulton County are restricted to 1.1, 1.2, ..., 1.9, and 2.0 acres. In such a situation we would have the additional functional dependency FD5: Area -> County_name. Area is not a superkey of the relation, so FD5 violates BCNF, and the relation is therefore decomposed into two new relations, LOTS1AX and LOTS1AY.

Another Example

NewFac (facName, dept, office, rank, dateHired)


FDs:
office -> dept
facName, dept -> office, rank, dateHired
facName, office -> dept, rank, dateHired

NewFac is not BCNF because office is not a superkey
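
The BCNF test for NewFac can be sketched as follows. The helper bcnf_violations is an assumed name for illustration, and it checks only the three listed FDs.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attrs, fds):
    """Non-trivial FDs whose left-hand side is not a superkey."""
    return [
        (sorted(lhs), sorted(rhs))
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not closure(lhs, fds) >= set(attrs)
    ]


new_fac = {"facName", "dept", "office", "rank", "dateHired"}
fds = [
    ({"office"}, {"dept"}),
    ({"facName", "dept"}, {"office", "rank", "dateHired"}),
    ({"facName", "office"}, {"dept", "rank", "dateHired"}),
]

violations = bcnf_violations(new_fac, fds)
print(violations)
```

The closure of {office} is only {office, dept}, so office -> dept is the one FD whose determinant is not a superkey.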

BCNF Decomposition Attempt


To make it BCNF, remove the dependent attributes to a new relation, with the determinant as the
key
Project into
Fac1 (office, dept)
Fac2 (facName, office, rank, dateHired)
Note that we have lost a functional dependency: we can no longer see that {facName, dept} is a determinant, since facName and dept are now in different relations
A BCNF decomposition may not be dependency preserving, so we might have to settle for 3NF

Properties of Decompositions
Starting with a universal relation that contains all the attributes, we can decompose into relations
by projection
A decomposition of a relation R is a set of relations {R1,R2,...,Rn} such that each Ri is a subset of
R and the union of all of the Ri is R.
Desirable properties of decompositions:

Attribute preservation - every attribute is in some relation

Dependency preservation - see previous example

Lossless decomposition - discussed later

Dependency Preservation
If R is decomposed into {R1,R2,...,Rn} so that for each functional dependency X -> Y all the attributes in X ∪ Y appear in the same relation, Ri, then all FDs are preserved

Allows DBMS to check each FD constraint by checking just one table for each

Attribute Preservation Condition


All attributes must be preserved through the process of normalization.
Start with universal relation schema R
R = {A1,A2,...,An}, the set of attributes
D is a decomposition of R such that
D = {R1,R2,...,Rm}
and R = R1 ∪ R2 ∪ ... ∪ Rm

Lossless Join Condition


A decomposition should not have spurious tuples generated when a natural join operation is
applied to the relations in the resulting decomposition
A decomposition (R1, ..., Rn) of a schema, R, is lossless if every valid instance, r, of R can be reconstructed from its components through a natural join, where each component is the projection ri = πRi(r)

Lossless Join Decomposition Algorithm

1. set D := {R}
2. WHILE there exists a Q in D that is not in BCNF DO
   Find an FD X -> Y in Q that violates BCNF and replace Q in D by (Q - Y) and (X ∪ Y)
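
A minimal Python sketch of this loop is given below, under stated assumptions: violations are searched by brute force over subsets of each schema (fine for small examples only), and the names find_violation and bcnf_decompose are illustrative. It is run on the NewFac relation from the earlier example.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(q, fds):
    """Return (X, Y) with X -> Y holding inside q and X not a superkey of q."""
    for size in range(1, len(q)):
        for x in combinations(sorted(q), size):
            x = set(x)
            determined = closure(x, fds) & q
            y = determined - x
            if y and determined != q:
                return x, y
    return None


def bcnf_decompose(r, fds):
    d = [set(r)]
    while True:
        for q in d:
            violation = find_violation(q, fds)
            if violation:
                x, y = violation
                d.remove(q)
                d.extend([q - y, x | y])  # replace Q by (Q - Y) and (X u Y)
                break
        else:
            return d  # no schema in D violates BCNF


fds = [
    ({"office"}, {"dept"}),
    ({"facName", "dept"}, {"office", "rank", "dateHired"}),
    ({"facName", "office"}, {"dept", "rank", "dateHired"}),
]
result = bcnf_decompose({"facName", "dept", "office", "rank", "dateHired"}, fds)
print([sorted(q) for q in result])
```

Each split shares the attributes of X between the two parts, and X is a key of (X ∪ Y), which is why the join of the parts is lossless.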

MVD and Normalization Examples


Multivalued Dependencies
A multivalued dependency (MVD) X ->> Y specified on relation schema R, where X and Y are both subsets of R, specifies the following constraint on any relation state r of R: If two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4 should also exist in r with the following properties, where we use Z to denote (R - (X ∪ Y)):

t3[X] = t4[X] = t1[X] = t2[X].

t3[Y] = t1[Y] and t4[Y] = t2[Y].

t3[Z] = t2[Z] and t4[Z] = t1[Z].

An MVD X ->> Y in R is called a trivial MVD if (a) Y is a subset of X, or (b) X ∪ Y = R.
Inference Rules for Functional and Multivalued Dependencies:

IR1 (reflexive rule for FDs): If X ⊇ Y, then X -> Y.
IR2 (augmentation rule for FDs): {X -> Y} |= XZ -> YZ.
IR3 (transitive rule for FDs): {X -> Y, Y -> Z} |= X -> Z.
IR4 (complementation rule for MVDs): {X ->> Y} |= {X ->> (R - (X ∪ Y))}.
IR5 (augmentation rule for MVDs): If X ->> Y and W ⊇ Z, then WX ->> YZ.
IR6 (transitive rule for MVDs): {X ->> Y, Y ->> Z} |= X ->> (Z - Y).
IR7 (replication rule for FD to MVD): {X -> Y} |= X ->> Y.
IR8 (coalescence rule for FDs and MVDs): If X ->> Y and there exists W with the properties that (a) W ∩ Y is empty, (b) W -> Z, and (c) Y ⊇ Z, then X -> Z.

MVD Example
Course ->> Instructor
Course ->> Text
Course (X)   Instructor (Y)   Text (R - XY)
Intro        Kruse            Intro to CS
Intro        Wright           Intro to CS
CS1          Thomas           Intro to Java
CS1          Thomas           CS Theory Survey
CS2          Rhodes           Java Data Structures
CS2          Rhodes           Unix
CS2          Kruse            Java Data Structures
CS2          Kruse            Unix
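
The MVD definition can be checked against this table with a short Python sketch. The function satisfies_mvd and the lower-case attribute names are assumptions for illustration; the rows are the ones shown above.

```python
def satisfies_mvd(rows, x, y):
    """Check X ->> Y: for every t1, t2 agreeing on X, the tuple that takes
    X and Y from t1 and the remaining attributes from t2 must also exist."""
    attrs = list(rows[0])
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                t3 = {a: (t1[a] if a in x or a in y else t2[a]) for a in attrs}
                if t3 not in rows:
                    return False
    return True


data = [
    ("Intro", "Kruse", "Intro to CS"),
    ("Intro", "Wright", "Intro to CS"),
    ("CS1", "Thomas", "Intro to Java"),
    ("CS1", "Thomas", "CS Theory Survey"),
    ("CS2", "Rhodes", "Java Data Structures"),
    ("CS2", "Rhodes", "Unix"),
    ("CS2", "Kruse", "Java Data Structures"),
    ("CS2", "Kruse", "Unix"),
]
rows = [dict(zip(("course", "instructor", "text"), t)) for t in data]

holds = satisfies_mvd(rows, ["course"], ["instructor"])
# Dropping one CS2 combination breaks the instructor-by-text cross product.
broken = satisfies_mvd(rows[:-1], ["course"], ["instructor"])
print(holds, broken)
```

Because the loop visits every ordered pair (t1, t2), the tuple t4 of the definition is covered when the roles of t1 and t2 are swapped.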

4NF: A relation schema R is in 4NF with respect to a set of dependencies F (that includes functional dependencies and multivalued dependencies) if, for every nontrivial multivalued dependency X ->> Y in F+, X is a superkey for R.
Equivalently, a relation R is in 4NF if for every MVD of the form A ->> B in F+ at least one of the following holds:

A ->> B is a trivial MVD

A is a superkey

Example:
Decomposing a relation state of EMP that is not in 4NF:

(a) EMP relation with additional tuples.


(b) Two corresponding 4NF relations EMP_PROJECTS and EMP_DEPENDENTS.

Dependency Preservation
If f is an FD in F, but f is not in F1 ∪ F2, there are two possibilities:

f ∈ (F1 ∪ F2)+

If the constraints in F1 and F2 are maintained, f will be maintained automatically.

f not in (F1 ∪ F2)+

f can be checked only by first taking the join of r1 and r2. This is costly.

Example

Schema (R, F) where

R = {SSN, Name, Address, Hobby}

F = {SSN -> Name, Address}

can be decomposed into

R1 = {SSN, Name, Address} F1 = {SSN -> Name, Address}

and R2 = {SSN, Hobby} F2 = { }

Since F = F1 ∪ F2 the decomposition is dependency preserving

Example
Schema: (ABC; F) with F = {A -> B, B -> C, C -> B}
Decomposition:

(AC, F1), F1 = {A -> C} Note: A -> C not in F, but is in F+

(BC, F2), F2 = {B -> C, C -> B}

A -> B not in (F1 ∪ F2), but A -> B ∈ (F1 ∪ F2)+.

So F+ = (F1 ∪ F2)+ and thus the decomposition is still dependency preserving
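
Whether an FD survives a decomposition can be tested without computing F+ in full. The sketch below uses the standard iterative test (grow the left-hand side using closures restricted to each schema); the function name preserved is an assumption. It is applied to the ABC example and, for contrast, to the earlier NewFac decomposition.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserved(fd, schemas, fds):
    """True if fd follows from the FDs projected onto the given schemas."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in schemas:
            gained = closure(result & set(ri), fds) & set(ri)
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


# ABC example: every FD of F is preserved by the decomposition {AC, BC}.
f_abc = [("A", "B"), ("B", "C"), ("C", "B")]
abc_ok = all(preserved(fd, ["AC", "BC"], f_abc) for fd in f_abc)

# NewFac example: {facName, dept} -> office is lost by Fac1/Fac2.
f_fac = [
    ({"office"}, {"dept"}),
    ({"facName", "dept"}, {"office", "rank", "dateHired"}),
    ({"facName", "office"}, {"dept", "rank", "dateHired"}),
]
fac1 = {"office", "dept"}
fac2 = {"facName", "office", "rank", "dateHired"}
fac_ok = preserved(f_fac[1], [fac1, fac2], f_fac)
print(abc_ok, fac_ok)
```

Single-letter attribute names are written as strings here, so "AC" stands for the schema {A, C}.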

JOIN DEPENDENCY
A join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema R,
specifies a constraint on the states r of R.
The constraint states that every legal state r of R should have a non-additive join
decomposition into R1, R2, ..., Rn; that is, for every such r we have

*(πR1(r), πR2(r), ..., πRn(r)) = r

Note: an MVD is a special case of a JD where n = 2.

A join dependency JD(R1, R2, ..., Rn), specified on relation schema R, is a trivial JD if
one of the relation schemas Ri in JD(R1, R2, ..., Rn) is equal to R.

5NF: A relation schema R is in fifth normal form (5NF) (or Project-Join Normal Form (PJNF)) with respect to a set F of functional, multivalued, and join dependencies if, for every nontrivial join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by F), every Ri is a superkey of R.

In the example above, the supply relation is decomposed into three relations that are now in 5NF. Joining any two of them produces spurious tuples, but joining all three together does not. This means there is a join dependency among the three relations.

Minimal Cover
A minimal cover of a set of dependencies, T, is a set of dependencies, U, such that:

U is equivalent to T (T+ = U+)

All FDs in U have the form X -> A where A is a single attribute

It is not possible to make U smaller (while preserving equivalence) by

Deleting an FD
Deleting an attribute from an FD (either from LHS or RHS)

FDs and attributes that can be deleted in this way are called redundant

Computing Minimal Cover


Example: T = {ABH -> CK, A -> D, C -> E, BGH -> F, F -> AD, E -> F, BH -> E}
step 1: Make RHS of each FD into a single attribute

Algorithm: Use the decomposition inference rule for FDs

Example: F -> AD replaced by F -> A, F -> D; ABH -> CK by ABH -> C, ABH -> K

step 2: Eliminate redundant attributes from LHS.

Algorithm: If FD XB -> A ∈ T (where B is a single attribute) and X -> A is entailed by T, then B was unnecessary

Example: Can an attribute be deleted from ABH -> C?
Compute (AB)+, (AH)+, (BH)+ with respect to T.
Since C ∈ (BH)+, BH -> C is entailed by T and A is redundant in ABH -> C.

step 3: Delete redundant FDs from T

Algorithm: If T - {f} entails f, then f is redundant

If f is X -> A then check if A ∈ X+ computed with respect to T - {f}

Example: BGH -> F is entailed by E -> F, BH -> E, so it is redundant

Note: Steps 2 and 3 cannot be reversed! See the textbook for a counterexample
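
The three steps can be sketched in Python as below. This is a minimal illustration under assumptions: attributes are single letters, FDs are written as (LHS, RHS) string pairs, and the function name minimal_cover is hypothetical. A set of FDs can have more than one minimal cover, so the result depends on the order in which attributes and FDs are examined.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    # Step 1: one attribute on each right-hand side.
    cover = [(frozenset(lhs), a) for lhs, rhs in fds for a in sorted(rhs)]

    # Step 2: drop left-hand-side attributes that are not needed.
    reduced = []
    for lhs, a in cover:
        lhs = set(lhs)
        for b in sorted(lhs):
            if len(lhs) > 1 and a in closure(lhs - {b}, cover):
                lhs = lhs - {b}
        reduced.append((frozenset(lhs), a))
    cover = list(dict.fromkeys(reduced))  # remove duplicates, keep order

    # Step 3: drop FDs that are entailed by the remaining ones.
    for fd in list(cover):
        rest = [f for f in cover if f != fd]
        if fd[1] in closure(fd[0], rest):
            cover = rest
    return cover


T = [("ABH", "CK"), ("A", "D"), ("C", "E"), ("BGH", "F"),
     ("F", "AD"), ("E", "F"), ("BH", "E")]
U = minimal_cover(T)
for lhs, a in U:
    print("".join(sorted(lhs)), "->", a)
```

With this processing order, step 2 shortens ABH -> C to BH -> C and BGH -> F to BH -> F, and step 3 then removes BH -> F because it follows from BH -> E and E -> F.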

Domain-Key Normal Form (DKNF):


Definition:

A relation schema is said to be in DKNF if all constraints and dependencies that should hold on the valid relation states can be enforced simply by enforcing the domain constraints and key constraints on the relation.
The idea is to specify (theoretically, at least) the ultimate normal form that takes into account all possible types of dependencies and constraints.
For a relation in DKNF, it becomes very straightforward to enforce all database
constraints by simply checking that each attribute value in a tuple is of the appropriate
domain and that every key constraint is enforced.
The practical utility of DKNF is limited

Normalization Drawbacks
By limiting redundancy, normalization helps maintain consistency and saves space
But performance of querying can suffer because related information that was stored in a single
relation is now distributed among several
Example: A join is required to get the names and grades of all students taking CS305 in S2002.

Denormalization
Tradeoff: Judiciously introduce redundancy to improve performance of certain queries
Example: Add attribute Name to Transcript
SELECT T.Name, T.Grade
FROM Transcript T
WHERE T.CrsCode = 'CS305' AND T.Semester = 'S2002'

Join is avoided

If queries are asked more frequently than Transcript is modified, added redundancy might improve average performance

But, Transcript is no longer in BCNF since key is (StudId, CrsCode, Semester) and StudId -> Name
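
The tradeoff can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the Student table, the TranscriptD copy that carries the redundant Name column, and the sample rows are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Transcript (
    StudId INTEGER, CrsCode TEXT, Semester TEXT, Grade TEXT,
    PRIMARY KEY (StudId, CrsCode, Semester)
);
INSERT INTO Student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Transcript VALUES
    (1, 'CS305', 'S2002', 'A'),
    (2, 'CS305', 'S2002', 'B'),
    (2, 'CS315', 'S2002', 'A');
""")

# Normalized schema: the names live in Student, so a join is required.
normalized = conn.execute("""
    SELECT S.Name, T.Grade
    FROM Student S JOIN Transcript T ON S.StudId = T.StudId
    WHERE T.CrsCode = 'CS305' AND T.Semester = 'S2002'
    ORDER BY S.Name
""").fetchall()

# Denormalized copy: Name is stored redundantly next to each grade.
conn.executescript("""
CREATE TABLE TranscriptD AS
    SELECT T.StudId, S.Name, T.CrsCode, T.Semester, T.Grade
    FROM Transcript T JOIN Student S ON S.StudId = T.StudId;
""")
denormalized = conn.execute("""
    SELECT T.Name, T.Grade
    FROM TranscriptD T
    WHERE T.CrsCode = 'CS305' AND T.Semester = 'S2002'
    ORDER BY T.Name
""").fetchall()

print(normalized == denormalized)
```

Both queries return the same rows, but the second reads a single table; the price is that a change to a student's name must now be applied to every TranscriptD row for that student.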

Normalization Problems
1. Consider the following relation:
CAR_SALE(Car#, Date_sold, Salesman#, Commission%, Discount_amt)
Assume that a car may be sold by multiple salesmen and hence {Car#, Salesman#} is the primary key. Additional dependencies are:
Date_sold -> Discount_amt
and
Salesman# -> Commission%
Based on the given primary key, is this relation in 1NF, 2NF, or 3NF? Why or why not? How would
you successively normalize it completely?
Answer:
Given the relation schema
Car_Sale(Car#, Salesman#, Date_sold, Commission%, Discount_amt)
with the functional dependencies
Date_sold -> Discount_amt
Salesman# -> Commission%
Car# -> Date_sold
This relation satisfies 1NF but not 2NF (Car# -> Date_sold and Car# -> Discount_amt, so these two attributes are not FFD on the primary key) and not 3NF.
To normalize,
2NF:
Car_Sale1(Car#, Date_sold, Discount_amt)
Car_Sale2(Car#, Salesman#)
Car_Sale3(Salesman#, Commission%)
3NF:
Car_Sale1-1(Car#, Date_sold)
Car_Sale1-2(Date_sold, Discount_amt)
Car_Sale2(Car#, Salesman#)
Car_Sale3(Salesman#, Commission%)
2. Consider the following relation for published books:
BOOK (Book_title, Authorname, Book_type, Listprice, Author_affil, Publisher)
Author_affil refers to the affiliation of the author. Suppose the following dependencies exist:
Book_title -> Publisher, Book_type
Book_type -> Listprice
Authorname -> Author_affil
(a) What normal form is the relation in? Explain your answer.
(b) Apply normalization until you cannot decompose the relations further. State the reasons behind
each decomposition.
Answer:
Given the relation
Book(Book_title, Authorname, Book_type, Listprice, Author_affil, Publisher)
and the FDs
Book_title -> Publisher, Book_type
Book_type -> Listprice
Authorname -> Author_affil
(a) The key for this relation is {Book_title, Authorname}. This relation is in 1NF and not in 2NF as no attributes are FFD on the key. It is also not in 3NF.
(b) 2NF decomposition:

Book0(Book_title, Authorname)
Book1(Book_title, Publisher, Book_type, Listprice)
Book2(Authorname, Author_affil)
This decomposition eliminates the partial dependencies.
3NF decomposition:
Book0(Book_title, Authorname)
Book1-1(Book_title, Publisher, Book_type)
Book1-2(Book_type, Listprice)
Book2(Authorname, Author_affil)
This decomposition eliminates the transitive dependency of Listprice
3. Consider the following relation:
R (Doctor#, Patient#, Date, Diagnosis, Treat_code, Charge)
In this relation, a tuple describes a visit of a patient to a doctor along with a treatment code and daily
charge. Assume that diagnosis is determined (uniquely) for each patient by a doctor. Assume that
each treatment code has a fixed charge (regardless of patient). Is this relation in 2NF? Justify your
answer and decompose if necessary. Then argue whether further normalization to 3NF is
necessary, and if so, perform it.
Answer:
From the question's text, we can infer the following functional dependencies:
{Doctor#, Patient#, Date} -> {Diagnosis, Treat_code, Charge}
{Treat_code} -> {Charge}
Because there are no partial dependencies, the given relation is in 2NF already. However, it is not in 3NF because Charge is a nonkey attribute that is determined by another nonkey attribute, Treat_code. We must decompose further:
R (Doctor#, Patient#, Date, Diagnosis, Treat_code)
R1 (Treat_code, Charge)
We could further infer that the treatment for a given diagnosis is functionally dependent, but we should be sure to allow the doctor to have some flexibility when prescribing cures.

References:
1. Jain, V.K. Database Management Systems.
2. Elmasri, R., & Navathe, S. (1994). Fundamentals of Database Systems. 2nd ed. Redwood City, CA: The Benjamin/Cummings Publishing Co. pp. 283-285.
3. Codd, E. (1985). "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" ComputerWorld, October 14 and October 21.
4. http://jcsites.juniata.edu/faculty/rhodes/dbms/relnmodel.htm
