UNIT- I
INTRODUCTION TO DATABASE MANAGEMENT SYSTEM
INTRODUCTION:
The typical file processing system is supported by a conventional operating system. The system
stores permanent records in various files, and it needs different application programs to extract
records from, and add records to, the appropriate files. A file processing system has a number of
major disadvantages.
Mr. Y SUBBA RAYUDU M. Tech
Data Redundancy & Inconsistency
In file processing, every user group maintains its own files for handling its data processing
applications.
Example:
Consider the UNIVERSITY database. Here, two groups of users might be the course
registration personnel and the accounting office. The accounting office also keeps data on
registration and related billing information, whereas the registration office keeps track of
student courses and grades. Storing the same data multiple times is called data redundancy.
This redundancy leads to several problems.
Need to perform a single logical update multiple times.
Storage space is wasted.
Files that represent the same data may become inconsistent.
Data inconsistency arises when the various copies of the same data no longer agree.
Example:
One user group may enter a student's birth date erroneously as JAN-19-1984, whereas the other
user groups may enter the correct value of JAN-29-1984.
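The redundancy and inconsistency problem above can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the notes: two offices each keep their own copy of a student record, and the copies drift apart.

```python
# Hypothetical file-processing setup: two user groups keep their own copy
# of the same student record (data redundancy).
registration_file = {"student_id": 101, "birth_date": "JAN-29-1984"}
accounting_file = {"student_id": 101, "birth_date": "JAN-19-1984"}  # typo on entry

copies = [registration_file, accounting_file]

# A consistency check across the files fails: the copies no longer agree
# (data inconsistency).
dates = {c["birth_date"] for c in copies}
consistent = len(dates) == 1
print(consistent)  # False
```

A DBMS avoids this by storing the birth date once and letting both groups read the same copy.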
Difficulty in Accessing Data
File processing environments do not allow needed data to be retrieved in a convenient and efficient manner.
Example:
Suppose that one of the bank officers needs to find out the names of all customers who live
within a particular area. The bank officer has now two choices: either obtain the list of all
customers and extract the needed information manually or ask a system programmer to write
the necessary application program. Both alternatives are obviously unsatisfactory. Suppose that
such a program is written, and that, several days later, the same officer needs to trim that list to include only customers with large balances; a new program must then be written, and the cycle repeats.
View of Data
A DBMS is a collection of interrelated data and a set of programs that allow users to access and
modify that data. The major purpose of a DBMS is to provide users with an abstract view of the
data. Data must be retrieved efficiently from the system for the system to be usable.
Data abstraction
Data abstraction is amazingly useful because it allows humans to understand and build
complex systems like databases.
A good place to start understanding the definition of data abstraction is to think about the way
the word 'abstract' is used when we talk about a long document. The abstract is the shortened,
simplified form. We often read it to get an overview before reading the entire paper. (Actually
we often read it INSTEAD of reading the paper, but that's another issue.)
The three formal abstraction layers we usually use are:
User model: How the user describes the database
Logical model: More formal and more detailed, often rendered as an entity relationship (ER) model
Physical model: More detail added, such as indexing, data types, etc.
Data abstraction is simply a way of turning a complex problem into a manageable one.
DATABASE SCHEMA
A database schema is the skeleton structure that represents the logical view of the entire
database. It describes how the data is organized and how the relations among the data are
associated. It formulates all the constraints that are to be applied on the data in the relations
that reside in the database. A database schema defines its entities and the relationships among
them. The schema is a descriptive detail of the database, which can be depicted by means of
schema diagrams. All these activities are done by the database designer to help programmers
understand every aspect of the database.
Database schema can be divided broadly in two categories:
Physical Database Schema
This schema pertains to the actual storage of data and its form of storage, like files, indices, etc.
It defines how the data will be stored in secondary storage.
Logical Database Schema
This defines all the logical constraints that need to be applied on the data stored. It defines
tables, views, integrity constraints, etc.
DATABASE INSTANCE
It is important to distinguish these two terms. The database schema is the skeleton of the
database. It is designed before the database exists, and it is very hard to change once the
database is operational. The schema does not contain any data or information.
A database instance is the state of an operational database, with its data, at a given point in
time; it is a snapshot of the database. Database instances change over time. The DBMS ensures
that every instance (state) is a valid state by enforcing all the validations, constraints, and
conditions that the database designers have imposed or that are expected of the DBMS itself.
DATA MODELS
A data model describes how the logical structure of a database is modeled. Data models are
fundamental entities for introducing abstraction in a DBMS. They define how data is connected
to other data and how it is processed and stored inside the system.
The very first data models were flat data models, where all the data was kept on the same
plane. Because these earlier data models were not very scientific, they were prone to
introducing a lot of duplication and update anomalies.
Entity-Relationship Model
The Entity-Relationship model is based on the notion of real-world entities and the relationships
among them. While formulating a real-world scenario into a database model, the ER model
creates entity sets, relationship sets, general attributes, and constraints.
The ER model is best suited for the conceptual design of a database. It is based on:
Entities and their attributes
Relationships among entities
These concepts are explained below.
Entity: An entity in the ER model is a real-world entity that has properties called attributes.
Every attribute is defined by its set of permitted values, called its domain. For example, in a
school database, a student is considered an entity. A student has various attributes like name,
age, class, etc.
Relationship: The logical association among entities is called a relationship. Relationships are
mapped with entities in various ways. Mapping cardinalities define the number of associations
between two entities.
Mapping cardinalities:
One to one
One to many
Many to one
Many to many
RELATIONAL MODEL
The most popular data model in DBMS is the relational model. It is a more scientific model than
the others. This model is based on first-order predicate logic and defines a table as an n-ary relation.
DATABASE LANGUAGES
Data Definition Language (DDL)
DDL statements are used to define the database structure or schema.
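As a brief sketch of DDL in action, the following runs a few DDL statements against an in-memory SQLite database from Python. The table and column names are illustrative assumptions, not from the notes.

```python
# A minimal sketch of DDL statements (CREATE, ALTER) run against an
# in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE defines a new relation (table) and its schema.
cur.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        class      TEXT
    )
""")

# ALTER changes an existing schema; DROP would remove it entirely.
cur.execute("ALTER TABLE student ADD COLUMN age INTEGER")

# The schema is metadata: we can inspect it via the system catalog.
cols = [row[1] for row in cur.execute("PRAGMA table_info(student)")]
print(cols)  # ['student_id', 'name', 'class', 'age']
```

Note that DDL manipulates the schema (the skeleton), not the data: no rows exist yet after these statements.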
TRANSACTION MANAGEMENT
ACID Properties
A transaction may contain several low-level tasks, and a transaction is a very small unit of a
program. A transaction in a database system must maintain certain properties in order to
ensure the accuracy of its completeness and data integrity. These properties are referred to as
the ACID properties and are described below:
Atomicity: Though a transaction involves several low-level operations, this property
states that the transaction must be treated as an atomic unit, that is, either all of its
operations are executed or none. There must be no state in the database where the
transaction is left partially completed. The state should be defined either as before the
execution of the transaction or as after the execution/abortion/failure of the transaction.
Consistency: This property states that after the transaction finishes, the database must
remain in a consistent state. There must not be any possibility that some data is
incorrectly affected by the execution of the transaction. If the database was in a
consistent state before the execution of the transaction, it must remain in a consistent
state after the execution of the transaction.
Durability: This property states that in all cases the updates made by a transaction will
persist, even if the system fails and restarts. If a transaction writes or updates some data
in the database and commits, that data will always be there in the database. If the
transaction commits but the data is not yet written to disk when the system fails, that
data will be updated once the system comes back up.
Isolation: In a database system where more than one transaction is executed
simultaneously and in parallel, the property of isolation states that each transaction
will be carried out and executed as if it were the only transaction in the system. No
transaction will affect the existence of any other transaction.
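Atomicity in particular is easy to demonstrate. The sketch below uses SQLite transactions from Python: a transfer either applies both of its updates or neither. The account table and amounts are illustrative assumptions.

```python
# A small sketch of atomicity: a failed transfer is rolled back, leaving
# the database exactly as it was before the transaction started.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 70 WHERE name = 'A'")
        raise RuntimeError("simulated failure mid-transaction")
        conn.execute("UPDATE account SET balance = balance + 70 WHERE name = 'B'")
except RuntimeError:
    pass

# The partial debit was rolled back: no state where the transaction is
# left partially completed.
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'A': 100, 'B': 50}
```

If the `raise` line is removed, both updates commit together, which is the other half of the all-or-nothing guarantee.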
Serializability
When more than one transaction is executed by the operating system in a multiprogramming
environment, there is a possibility that the instructions of one transaction are interleaved with
those of some other transaction.
Serial Schedule: A schedule in which transactions are aligned in such a way that one
transaction is executed first. When the first transaction completes its cycle, the next
transaction is executed. Transactions are ordered one after the other. This type of schedule
is called a serial schedule, as transactions are executed in a serial manner.
Result Equivalence: If two schedules produce the same results after execution, they are said
to be result equivalent. However, they may yield the same result for some values and
different results for other values, which is why this equivalence is not generally considered
significant.
View equivalent schedules are view serializable and conflict equivalent schedules are conflict
serializable. All conflict serializable schedules are view serializable too.
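The standard test for conflict serializability builds a precedence graph (an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj on the same data item) and checks it for a cycle. The sketch below is a simple brute-force version; the schedule format of (transaction, operation, item) triples is an assumption for illustration.

```python
# A sketch of conflict-serializability testing via a precedence graph.
# Conflicts: two operations of different transactions on the same item,
# at least one of which is a write.

def conflict_serializable(schedule):
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))  # ti must precede tj in any serial order
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    # Cycle detection by depth-first search along the current path.
    def has_cycle(node, path):
        if node in path:
            return True
        path.add(node)
        cyclic = any(has_cycle(n, path) for n in graph.get(node, ()))
        path.discard(node)
        return cyclic

    return not any(has_cycle(n, set()) for n in graph)

# Interleaved writes on X between T1 and T2: the graph has a cycle.
s1 = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]
print(conflict_serializable(s1))  # False

# A serial schedule is trivially conflict serializable.
s2 = [("T1", "R", "X"), ("T1", "W", "X"), ("T2", "R", "X")]
print(conflict_serializable(s2))  # True
```

In s1 the edges T1 -> T2 and T2 -> T1 form a cycle, so no equivalent serial order exists.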
States of Transactions
Active: In this state the transaction is being executed. This is the initial state of every
transaction.
Failed: If any check made by the database recovery system fails, the transaction is said to
be in the failed state, from where it can no longer proceed.
Aborted: If any of the checks fails and the transaction has reached the failed state, the
recovery manager rolls back all of its write operations on the database, to bring the
database back to the state it was in prior to the start of the transaction's execution.
Transactions in this state are called aborted. The database recovery module can select
one of two operations after a transaction aborts:
o Re-start the transaction
o Kill the transaction
Primary Storage
The memory storage that is directly accessible by the CPU comes under this category. The
CPU's internal memory (registers), fast memory (cache), and main memory (RAM) are directly
accessible to the CPU, as they are all placed on the motherboard or CPU chipset. This storage is
typically very small, ultra-fast, and volatile. It needs a continuous power supply to maintain its
state; in case of a power failure, all data is lost.
Secondary Storage
The need to store data for a longer period of time, and to retain it even after the power supply is
interrupted, gave birth to secondary data storage. All memory devices that are not part of the
CPU chipset or motherboard come under this category: broadly, magnetic disks, optical disks
(DVD, CD, etc.), flash drives, and magnetic tapes, none of which are directly accessible by the CPU.
Hard disk drives, which contain the operating system and are generally not removed from the
computer, are considered secondary storage; all others are called tertiary storage.
Tertiary Storage
The third level in the memory hierarchy is called tertiary storage. It is used to store huge
amounts of data. Because this storage is external to the computer system, it is the slowest in
speed. These storage devices are mostly used to back up the entire system. Optical disks and
magnetic tapes are widely used as tertiary storage.
DATA QUERYING
Queries are the primary mechanism for retrieving information from a database and consist of
questions presented to the database in a predefined format. Many database management
systems use the Structured Query Language (SQL) standard query format.
Choosing parameters from a menu: In this method, the database system presents a list of
parameters from which you can choose. This is perhaps the easiest way to pose a query
because the menus guide you, but it is also the least flexible.
Query by example (QBE): In this method, the system presents a blank record and lets
you specify the fields and values that define the query.
Query language: Many database systems require you to make requests for information
in the form of a stylized query that must be written in a special query language. This is
the most complex method because it forces you to learn a specialized language, but it is
also the most powerful.
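The query-language approach can be sketched with SQL on an in-memory SQLite database. The customer data below is an illustrative assumption; the point is that the bank officer's ad-hoc request becomes a one-line declarative query rather than a new application program.

```python
# A brief sketch of posing a query in SQL rather than writing a new
# application program for each request.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, area TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Asha", "Downtown"), ("Ravi", "Uptown"), ("Meena", "Downtown")],
)

# "Names of all customers who live within a particular area":
rows = conn.execute(
    "SELECT name FROM customer WHERE area = ? ORDER BY name", ("Downtown",)
).fetchall()
print([r[0] for r in rows])  # ['Asha', 'Meena']
```

Trimming the list further is just another WHERE clause, not another program.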
DATABASE ARCHITECTURE
The design of a Database Management System highly depends on its architecture. It can be
centralized or decentralized or hierarchical. DBMS architecture can be seen as single tier or
multi-tier. n-tier architecture divides the whole system into related but independent n modules,
which can be independently modified, altered, changed or replaced.
In 1-tier architecture, the DBMS is the only entity: the user sits directly on the DBMS and uses
it, and any changes done here are made directly on the DBMS itself. This tier does not provide
handy tools for end users; it is preferred by database designers and programmers.
If the architecture of the DBMS is 2-tier, there must be some application that uses the DBMS.
Programmers use 2-tier architecture to access the DBMS by means of an application. Here the
application tier is entirely independent of the database in terms of operation, design, and
programming.
3-tier architecture
The most widely used architecture is the 3-tier architecture, which separates its tiers from each
other on the basis of users. It is described as follows:
Database (Data) Tier: At this tier, only database resides. Database along with its query
processing languages sits in layer-3 of 3-tier architecture. It also contains all relations and their
constraints.
Application (Middle) Tier: At this tier reside the application server and the programs that
access the database. For a user, this application tier presents an abstracted view of the
database; users are unaware of any existence of the database beyond the application. For the
database tier, the application tier is its user, and the database tier is not aware of any user
beyond the application tier. This tier thus works as a mediator between the two.
User (Presentation) Tier: The end users sit on this tier, and from their perspective this tier is
everything. They do not know about any existence or form of the database beyond this layer. At
this layer, multiple views of the database can be provided by the application; all views are
generated by programs that reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
DATABASE USERS
A DBMS is used by various users for various purposes. Some are involved in retrieving data and
some in backing it up. Some of these users are described as follows:
Administrators: A group of users who maintain the DBMS and are responsible for administrating
the database. They oversee its usage and decide by whom it should be used. They create user
accounts and apply limitations to maintain isolation and enforce security. Administrators also
look after DBMS resources like system licenses, required software applications and tools, and
other hardware-related maintenance.
Designers: This is the group of people who actually work on the design of the database. Actual
database work starts with requirement analysis, followed by a good design process. These
people keep a close watch on what data should be kept, and in what format. They identify
and design the whole set of entities, relations, constraints, and views.
End Users: This group contains the people who actually take advantage of the database system.
End users can be simple viewers who pay attention to logs or market rates, or they can be as
sophisticated as business analysts who make the most of the system.
Database Administrator [DBA]
Centralized control of the database is exercised by a person or group of persons under the
supervision of a high-level administrator. This person or group is referred to as the database
administrator (DBA). They are the users most familiar with the database and are responsible
for creating, modifying, and maintaining its three levels. The DBA is responsible for managing
the DBMS's use and ensuring that the database functions properly. The DBA administers the
three levels of the database and, in consultation with the overall user community, sets up the
definition of the global view for the various users and applications. The DBA is also responsible
for the definition and implementation of the internal level, including the storage structures and
access methods to be used for optimum performance of the DBMS, for granting permissions to
the users of the database, and for storing the profile of each user in the database.
History of Database System
Although various rudimentary DBMSs had been in use prior to IBM Corp.'s release of
Information Management System (IMS) in 1966, IMS was the first commercially available
DBMS. IMS was considered a hierarchical database, in which standardized data records were
organized within other standardized data records, creating a hierarchy of information about a
single entry. In the late 1960s, firms like Honeywell Corp. and General Electric Corp.
developed DBMSs based on a network data model, but the next major database management
breakthrough came in 1970 when a research scientist at IBM first outlined his theory for
relational databases. Six years later, IBM completed a prototype for a relational DBMS.
In 1977, computer programmers Larry Ellison and Robert Miner co-founded Oracle Systems
Corp. Their combined experience designing specialized database programs for governmental
organizations landed the partners a $50,000 contract from the Central Intelligence Agency
(CIA) to develop a customized database program. While working on the CIA project, Ellison
and Miner became interested in IBM's efforts to develop a relational database, which involved
Structured Query Language (SQL). Recognizing that SQL would allow computer users to
retrieve data from a variety of sources and sensing that SQL would become a database industry
standard, Ellison and Miner began working on developing a program similar to the relational
DBMS being developed by IBM. In 1978, Oracle released its own relational DBMS, the
world's first relational database management system (RDBMS) using SQL. Oracle began
shipping its RDBMS the following year, nearly two years before IBM shipped its first version
of DB2, which would become a leading RDBMS competing with the database management
applications of industry giants like Microsoft Corp. and Oracle. Relational databases eventually
outpaced all other database types, mainly because they allowed for highly complex queries and
could support various tools which enhanced their usefulness.
In 1983, Oracle developed the first portable RDBMS, which allowed firms to run their DBMS
on various machines including mainframes, workstations, and personal computers. Soon
thereafter, the firm also launched a distributed DBMS, based on SQL-Star software, which
granted users the same kind of access to data stored on a network they would have if the data
were housed in a single computer. By the end of the decade, Oracle had grown into the world's
leading enterprise DBMS provider with more than $100 million in sales.
It wasn't long before DBMSs were developed for use on individual PCs. In 1993, Microsoft
Corp. created an application called Access. The program competed with FileMaker Inc.'s
FileMaker Pro, a database application initially designed for Macintosh machines.
INTRODUCTION TO DATABASE DESIGN
Logical Database Design: Convert the conceptual model to a schema in the chosen
data model of the DBMS. For a relational database, this means converting the
conceptual to a relational schema (logical schema).
Schema Refinement: Look for potential problems in the original choice of schema and
try to redesign.
Physical Database Design: Direct the DBMS toward a choice of underlying data layout
(e.g., indexes and clustering) in hopes of optimizing performance.
Applications and Security Design: How will the underlying database interact with
surrounding applications?
Entity: An entity is a real-world object or concept which is distinguishable from other objects.
It may be something tangible, such as a particular student or building. It may also be somewhat
more conceptual, such as CS A-341, or an email address.
Attributes: These are used to describe a particular entity (e.g. name, SS#, height).
Domain: Each attribute comes from a specified domain (e.g., name may be a 20-character
string; SS# is a nine-digit integer).
Entity set: a collection of similar entities (i.e., those which are distinguished using the same set
of attributes). As an example, I may be an entity, whereas Faculty might be an entity set to
which I belong. Note that entity sets need not be disjoint. I may also be a member of Staff or
of Softball Players.
Key: a minimal set of attributes for an entity set, such that each entity in the set can be
uniquely identified. In some cases, there may be a single attribute (such as SS#) which serves
as a key, but in some models you might need multiple attributes as a key ("Bob from
Accounting"). There may be several possible candidate keys. We will generally designate one
such key as the primary key.
ER diagrams:
It is often helpful to visualize an ER model via a diagram. There are many variant conventions
for such diagrams; we will adapt the one used in the text.
Diagram conventions
BEYOND ER DESIGN
ER Model
The entity relationship model defines the conceptual view of a database. It works around
real-world entities and the associations among them. At the view level, the ER model is
considered a good option for designing databases.
Entity
An entity is a real-world thing, either animate or inanimate, that is easily identifiable and
distinguishable. For example, in a school database, students, teachers, classes, and courses
offered can be considered entities. All entities have some attributes or properties that give
them their identity.
An entity set is a collection of similar types of entities. An entity set may contain entities whose
attributes share similar values. For example, a Students set may contain all the students of a
school; likewise, a Teachers set may contain all the teachers of the school, from all faculties.
Entity sets need not be disjoint.
Attributes
Entities are represented by means of their properties, called attributes. All attributes have
values. For example, a student entity may have name, class, age as attributes.
There exists a domain or range of values that can be assigned to attributes. For example, a
student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be
negative, etc.
Types of Attributes
Simple attribute
Simple attributes are atomic values, which cannot be divided further. For example, student's
phone-number is an atomic value of 10 digits.
Composite attribute
Composite attributes are made of more than one simple attribute. For example, a student's
complete name may have first_name and last_name.
Derived attribute
Derived attributes are attributes that do not exist physically in the database; their values are
derived from other attributes present in the database. For example, average_salary in a
department should not be saved in the database; instead, it can be derived. As another example,
age can be derived from date_of_birth.
Single-valued attribute
Single-valued attributes contain a single value. For example: Social_Security_Number.
Multi-value attribute
A multi-valued attribute may contain more than one value. For example, a person can have
more than one phone number, email address, etc.
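The attribute types above can be sketched in a small Python class. This is a hypothetical illustration: the Student class, its fields, and the fixed reference date are assumptions, chosen to show a composite attribute (name split into parts), a multi-valued attribute (phones), and a derived attribute (age computed from date_of_birth rather than stored).

```python
# Sketch: composite, multi-valued, and derived attributes on one entity.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Student:
    first_name: str          # parts of a composite attribute (name)
    last_name: str
    date_of_birth: date      # stored simple attribute
    phones: list = field(default_factory=list)  # multi-valued attribute

    @property
    def age(self):
        # Derived attribute: not stored, computed from date_of_birth.
        today = date(2024, 1, 1)  # fixed reference date for the example
        return today.year - self.date_of_birth.year

s = Student("Jan", "Kowalski", date(1984, 1, 29), phones=["555-0101", "555-0102"])
print(s.age)          # 40
print(len(s.phones))  # 2
```

Storing age as well as date_of_birth would reintroduce redundancy: the two could drift apart, which is exactly why derived attributes are computed instead.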
These attribute types can come together in the following ways:
o simple single-valued attributes
o simple multi-valued attributes
o composite single-valued attributes
o composite multi-valued attributes
A key is an attribute, or a collection of attributes, that uniquely identifies an entity within an
entity set. For example, the roll_number of a student makes him or her identifiable among students.
o Super Key: A set of attributes (one or more) that collectively identifies an entity in an
entity set.
o Candidate Key: A minimal super key is called a candidate key, that is, a super key of
which no proper subset is a super key. An entity set may have more than one candidate
key.
o Primary Key: One of the candidate keys, chosen by the database designer to uniquely
identify the entity set.
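These definitions can be checked mechanically against a small relation instance by brute force: a subset of attributes is a super key if no two tuples agree on it, and a candidate key is a super key with no smaller super key inside it. The sample data below is an assumption for illustration.

```python
# Sketch: find super keys and candidate keys of a tiny relation instance.
from itertools import combinations

attributes = ("roll_number", "name", "class")
rows = [
    (1, "Asha", "10A"),
    (2, "Ravi", "10A"),
    (3, "Asha", "10B"),
]

def is_super_key(attr_subset):
    idx = [attributes.index(a) for a in attr_subset]
    projected = [tuple(r[i] for i in idx) for r in rows]
    return len(set(projected)) == len(rows)  # values unique per tuple

super_keys = [
    set(c)
    for r in range(1, len(attributes) + 1)
    for c in combinations(attributes, r)
    if is_super_key(c)
]
# Candidate key: a super key no proper subset of which is a super key.
candidate_keys = [k for k in super_keys if not any(s < k for s in super_keys)]
print(sorted(sorted(k) for k in candidate_keys))  # [['class', 'name'], ['roll_number']]
```

Here {roll_number} and {name, class} both turn out to be candidate keys, so the designer would pick one (typically roll_number) as the primary key.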
Relationship
The association among entities is called a relationship. For example, the employee entity has
the relation works_at with department; another example is a student who enrolls in some
course. Here, Works_at and Enrolls are called relationships.
Relationship Set
A set of relationships of similar type is called a relationship set. Like entities, a relationship too
can have attributes; these attributes are called descriptive attributes.
Degree of Relationship
The number of participating entities in a relationship defines the degree of the relationship.
o Binary = degree 2
o Ternary = degree 3
o n-ary = degree n
Mapping Cardinalities
Cardinality defines the number of entities in one entity set which can be associated to the
number of entities of other set via relationship set.
One-to-one: One entity from entity set A can be associated with at most one entity of
entity set B, and vice versa.
One-to-many: One entity from entity set A can be associated with more than one entity
of entity set B, but an entity from entity set B can be associated with at most one entity
of A.
Many-to-one: More than one entity from entity set A can be associated with at most one
entity of entity set B, but an entity from entity set B can be associated with more than
one entity from entity set A.
Many-to-many: One entity from A can be associated with more than one entity from B,
and vice versa.
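Given a relationship set stored as pairs, the mapping cardinality can be read off the data by counting how many partners each entity has on either side. The helper below is a hypothetical sketch; the pair data is illustrative.

```python
# Sketch: classify the mapping cardinality of a relationship set given
# as (a, b) pairs between entity sets A and B.
from collections import Counter

def cardinality(pairs):
    b_per_a = Counter(a for a, _ in pairs)  # how many B's each A maps to
    a_per_b = Counter(b for _, b in pairs)  # how many A's each B maps to
    left = "one" if max(a_per_b.values()) == 1 else "many"   # A's per B
    right = "one" if max(b_per_a.values()) == 1 else "many"  # B's per A
    return f"{left}-to-{right}"

# Each department has one manager, each manager one department.
print(cardinality([("d1", "m1"), ("d2", "m2")]))               # one-to-one
# Many employees work in the same department.
print(cardinality([("e1", "d1"), ("e2", "d1"), ("e3", "d1")])) # many-to-one
```

Note this only classifies the instance at hand; the schema-level cardinality constraint is a design decision about all legal instances, not just the current one.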
If both entity sets of a relationship set have key constraints, we would call this a "one-to-one"
relationship set. In general, note that key constraints can apply to relationships between more
than two entities, as in the following example.
Participation Constraints
Recall that a key constraint requires that each entity of a set participate in at most one
relationship. Dual to this, we may ask whether each entity of a set is required to participate in
at least one relationship.
If this is required, we call this a total participation constraint; otherwise the participation
is partial. In our ER diagrams, we will represent a total participation constraint by using
a thick line.
Weak Entities
There are times you might wish to define an entity set even though its attributes do not
formally contain a key (recall the definition for a key).
Usually, this is the case only because the information represented in such an entity set is
interesting only when combined, through an identifying relationship set, with another entity set
we call the identifying owner.
We will call such a set a weak entity set, and insist on the following:
The weak entity set must exhibit a key constraint with respect to the identifying
relationship set.
The weak entity set must have total participation in the identifying relationship set.
Together, this assures us that we can uniquely identify each entity from the weak set by
considering the primary key of its identifying owner together with a partial key from the weak
entity.
In our ER diagrams, we will represent a weak entity set by outlining the entity and the
identifying relationship set with dark lines. The required key constraint and total participation
are diagrammed with our existing conventions. We underline the partial key with a dotted line.
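The two requirements on a weak entity set translate directly into SQL: the weak table's primary key combines the owner's key with the partial key, and a cascading foreign key enforces total participation in the identifying relationship. The employee/dependent tables below are illustrative assumptions, run on in-memory SQLite.

```python
# Sketch: a weak entity set (dependent) identified by its owner's key
# (emp_id) together with a partial key (pname).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforcement is off by default
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE dependent (
        emp_id INTEGER NOT NULL,         -- identifying owner's key
        pname  TEXT NOT NULL,            -- partial key of the weak entity
        age    INTEGER,
        PRIMARY KEY (emp_id, pname),
        FOREIGN KEY (emp_id) REFERENCES employee (emp_id)
            ON DELETE CASCADE            -- total participation: no orphans
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha')")
conn.execute("INSERT INTO dependent VALUES (1, 'Ravi', 12)")

# Deleting the owner removes its dependents as well.
conn.execute("DELETE FROM employee WHERE emp_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM dependent").fetchone()[0]
print(remaining)  # 0
```

This mirrors the diagram convention: the composite primary key plays the role of owner key plus underlined partial key.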
Class Hierarchies
As with object-oriented programming, it is often convenient to classify an entity set as a
subclass of another. In this case, the child entity set inherits the attributes of the parent entity
set. We denote this scenario using an "ISA" triangle, as in the following ER diagram:
Aggregation: The ER model offers a construct called aggregation. We identify an existing
relationship set by enclosing it in a larger dashed box, and then we allow it to participate in
another relationship set.
A motivating example follows:
The Works_In relationship can be made ternary (associating an employee, a department, and an
interval). What are the pros and cons?
If the duration is described through descriptive attributes, only a single such duration can be
modeled. That is, we could not express an employment history involving someone who left the
department yet later returned.
Should a concept be modeled as an entity or a relationship?
Consider a situation in which a manager controls several departments. Let's presume that a
company budgets a certain amount (budget) for each department. Yet it also wants managers to
have access to some discretionary budget (dbudget). There are two corporate models. A
discretionary budget may be created for each individual department; alternatively, there may be
a discretionary budget for each manager, to be used as she desires.
Which scenario is represented by the following ER diagram? If you want the alternate
interpretation, how would you adjust the model?
Every policy must be owned by some employee.
Dependents is a weak entity set, and each dependent entity is uniquely identified by
taking pname in conjunction with the policyid of a policy entity (which, intuitively, covers the
given dependent).
The best way to model this is to switch away from the ternary relationship set, and instead use
two distinct binary relationship sets.
If we did not need the until or since attributes, we could model the identical setting using the
following ternary relationship:
Let's compare these two models. What if we wanted to add an additional constraint to each:
that each sponsorship (of a project by a department) be monitored by at most one employee?
Can you add this constraint to either of the above models?
RELATIONAL DATA MODEL
The relational data model is the primary data model, used widely around the world for data
storage and processing. This model is simple and has all the properties and capabilities required
to process data with storage efficiency.
Concepts
Tables: In the relational data model, relations are saved in the format of tables. This format
stores the relations among entities. A table has rows and columns, where rows represent
records and columns represent attributes.
Tuple: A single row of a table, which contains a single record for that relation, is called a tuple.
Relation instance: A finite set of tuples in the relational database system represents a relation
instance. Relation instances do not have duplicate tuples.
Relation schema: This describes the relation name (table name) and the attributes and their names.
Relation key: Each row has one or more attributes that can identify the row in the relation
(table) uniquely; these are called the relation key.
Attribute domain: Every attribute has some pre-defined value scope, known as attribute
domain.
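The concepts above map cleanly onto Python data structures, as a minimal sketch: a relation schema is a name plus an attribute list, and a relation instance is a set of tuples, so duplicates are impossible by construction. The data is illustrative.

```python
# Sketch: a relation schema and a relation instance as Python values.
schema = {"relation": "student", "attributes": ("roll_number", "name", "class")}

# A relation instance: a finite SET of tuples over the schema's domains.
instance = {
    (1, "Asha", "10A"),
    (2, "Ravi", "10B"),
    (1, "Asha", "10A"),  # duplicate tuple: collapses away in a set
}

print(len(instance))              # 2 -- no duplicate tuples in an instance
print(len(schema["attributes"]))  # 3 -- the arity of the relation
```

The set-of-tuples view is exactly why a relation key must exist: with no duplicate tuples, the full attribute set always distinguishes rows, and some minimal subset of it serves as the key.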
Relational Model Constraints
Domain Constraints: A relation schema specifies the domain of each field in the
relation instance. These domain constraints in the schema specify the condition that
each instance of the relation has to satisfy: the values that appear in a column must be
drawn from the domain associated with that column. Thus, the domain of a field is
essentially the type of that field.
Key Constraints
A Key Constraint is a statement that a certain minimal subset of the fields of a relation is a
unique identifier for a tuple.
Super Key: An attribute, or set of attributes, that uniquely identifies a tuple within a
relation. However, a super key may contain additional attributes that are not necessary
for unique identification.
Example: The customer_id of the relation customer is sufficient to distinguish one tuple
from other. Thus,customer_id is a super key. Similarly, the combination
of customer_id and customer_name is a super key for the relation customer. Here
the customer_name is not a super key, because several people may have the same
name. We are often interested in super keys for which no proper subset is a super key.
Such minimal super keys are called candidate keys.
Candidate Key:A super key such that no proper subset is a super key within the
relation.There are two parts of the candidate key definition:
o Two distinct tuples in a legal instance cannot have identical values in all the
fields of a key.
o No proper subset of the set of fields in a candidate key is a unique identifier for a
tuple. A relation may have several candidate keys.
Example: The combination of customer_name and customer_street is sufficient to
distinguish the members of the customer relation. Then both {customer_id} and
{customer_name, customer_street} are candidate keys.
Although customer_id and customer_name together can distinguish customer tuples,
their combination does not form a candidate key, since the customer_id alone is a
candidate key.
Primary Key:The candidate key that is selected to identify tuples uniquely within the
relation. Out of all the available candidate keys, a database designer identifies
one as the primary key. The candidate keys that are not selected as the primary key are called
alternate keys.
Features of the primary key:
o Primary key will not allow duplicate values.
o Primary key will not allow null values.
o Only one primary key is allowed per table.
Example: For the student relation, we can choose student_id as the primary key.
Foreign Key:Foreign keys represent the relationships between tables. A foreign key is
a column (or a group of columns) whose values are derived from the primary key of
some other table. The table in which the foreign key is defined is called the foreign
table or detail table. The table that defines the primary key and is referenced by
the foreign key is called the primary table or master table.
Features of foreign key:
o Records cannot be inserted into a detail table if corresponding records in the
master table do not exist.
o Records of the master table cannot be deleted or updated if corresponding
records in the detail table actually exist.
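The primary- and foreign-key behaviour described above can be sketched in SQLite (run from Python); the dept and student tables here are hypothetical examples, not from the text, and SQLite needs foreign-key checking switched on explicitly:

```python
import sqlite3

# Hypothetical dept (master) and student (detail) tables to illustrate
# primary-key and foreign-key enforcement.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite disables FK checks by default
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dname TEXT)")
conn.execute("""CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    sname      TEXT,
    dept_id    INTEGER REFERENCES dept(dept_id))""")
conn.execute("INSERT INTO dept VALUES (10, 'CSE')")
conn.execute("INSERT INTO student VALUES (1, 'Ravi', 10)")

# A duplicate primary key value is rejected.
try:
    conn.execute("INSERT INTO student VALUES (1, 'Kiran', 10)")
except sqlite3.IntegrityError as e:
    print("primary key violation:", e)

# A detail row whose master row does not exist is rejected.
try:
    conn.execute("INSERT INTO student VALUES (2, 'Kiran', 99)")
except sqlite3.IntegrityError as e:
    print("foreign key violation:", e)
```

Both failed inserts leave the table unchanged, which is exactly the "records cannot be inserted into a detail table if corresponding records in the master table do not exist" rule.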
General Constraints
Domain, primary key, and foreign key constraints are considered to be a fundamental part of
the relational data model. Sometimes, however, it is necessary to specify more general
constraints.
Example: we may require that student ages be within a certain range of values. Given such an
IC, the DBMS rejects inserts and updates that violate the constraint.
Current database systems support such general constraints in the form of table
constraints and assertions. Table constraints are associated with a single table and checked
whenever that table is modified. In contrast, assertions involve several tables and are checked
whenever any of these tables is modified.
Example: a table constraint which ensures that the salary of an employee is always above 1000:
CREATE TABLE employee
(eid integer, ename varchar2(20), salary real,
CHECK (salary > 1000));
Example: an assertion which enforces the constraint that the number of boats plus the number of
sailors should be less than 100.
CREATE ASSERTION smallClub CHECK ((SELECT COUNT (S.sid) FROM Sailors S) +
(SELECT COUNT (B.bid) FROM Boats B) < 100);
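The table constraint above can be tried directly in SQLite from Python (CREATE ASSERTION is part of the SQL standard but is rarely implemented, so only the CHECK constraint is demonstrated here; the row values are made-up examples):

```python
import sqlite3

# The employee table with the CHECK constraint from the text; most systems,
# including SQLite, enforce CHECK at every insert/update.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    eid    INTEGER,
    ename  TEXT,
    salary REAL,
    CHECK (salary > 1000))""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 5000)")     # satisfies CHECK
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 800)")  # violates CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the first row survives; the insert that violates salary > 1000 is refused by the DBMS, not by application code.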
Referential/Enforcing Integrity Constraints
This integrity constraint works on the concept of the foreign key. A key attribute of a relation can
be referred to in another relation, where it is called a foreign key.
The referential integrity constraint states that if a relation refers to a key attribute of a different or
the same relation, that key element must exist.
Querying Relational Data:
A logical data model will normally be derived from and or linked back to objects in a
conceptual data model.
INTRODUCTION TO VIEWS
A view is a virtual table in the database defined by a query. A view does not exist in the database
as a stored set of data values. To reduce redundant data to the minimum possible, Oracle allows
the creation of an object called a view.
The reasons for creating a view are:
When data security is required.
When data redundancy is to be kept to the minimum while maintaining datasecurity.
There are 3 types of views:
A horizontal view restricts a user's access to selected rows of a table.
A vertical view restricts a user's access to selected columns of a table.
A joined view draws its data from two or three different tables and presents the query
results as a single virtual table. Once the view is defined, one can use a single-table query
against the view for requests that would otherwise each require a two- or three-table join.
Advantages of views
Security: a user's access to the database can be restricted to a specific number of rows of a
table.
Query simplicity: by using joined views data can be accessed from different tables.
Data integrity: if data is accessed and entered through a view, the DBMS can
automatically check the data to ensure that it meets specified integrity constraints.
Disadvantages of views
Performance: The DBMS translates the query against the view into queries against the underlying
source tables. If a view is defined by a multi-table query, then even a simple query against the
view becomes a complicated join, and it may take a long time to complete. This also applies
to insert, delete, and update operations.
Update restrictions: when a user tries to update rows of a view, the DBMS must translate
the request into an update on rows of the underlying source table. This is
possible for simple views, but more complicated views cannot be updated.
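A vertical view of the kind described above can be sketched in SQLite; the employee table and its rows are hypothetical illustrations, and the view hides the salary column from its users:

```python
import sqlite3

# A vertical view that exposes only non-sensitive columns; querying the view
# re-reads the underlying table at run time, so no data is duplicated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER, ename TEXT, city TEXT, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?,?,?,?)",
                 [(1, 'Ramesh', 'Delhi', 2000), (2, 'Khilan', 'Kota', 1500)])
conn.execute("CREATE VIEW emp_public AS SELECT eid, ename, city FROM employee")
print(conn.execute("SELECT * FROM emp_public").fetchall())
# -> [(1, 'Ramesh', 'Delhi'), (2, 'Khilan', 'Kota')]  (salary is not visible)
```

Because the view stores no data of its own, an update to the employee table is immediately visible through emp_public.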
Removing constraints
ALTER TABLE enables you to remove column or table constraints. For example, to remove
the unique constraint you just created, use
ALTER TABLE SALESMAN
DROP CONSTRAINT uk_salesmancode;
UNIT - II
RELATIONAL ALGEBRA
Relational algebra is a procedural query language, which takes instances of relations as input
and yields instances of relations as output. It uses operators to perform queries. An operator can
be either unary or binary. Operators accept relations as their input and yield relations as their
output. Relational algebra is performed recursively on a relation, and intermediate results are
also considered relations.
Fundamental operations of Relational algebra:
Select
Project
Union
Set difference
Cartesian product
Rename
These are defined briefly as follows:
Select Operation (σ)
Selects tuples that satisfy the given predicate from a relation.
Notation: σp(r)
Where p stands for the selection predicate and r stands for the relation. p is a propositional logic
formula which may use connectors like and, or, and not. These terms may use relational
operators like: =, ≠, ≥, <, >, ≤.
Examples:
σsubject="database"(Books)
Output: Selects tuples from Books where subject is 'database'.
σsubject="database" and price="450"(Books)
Output: Selects tuples from Books where subject is 'database' and price is 450.
σsubject="database" and price < "450" or year > "2010"(Books)
Output: Selects tuples from Books where subject is 'database' and price is less than 450, or the
publication year is greater than 2010, that is, published after 2010.
Project Operation (π)
Projects the listed column(s) from a relation; since a relation is a set, duplicate rows are
eliminated from the result.
Examples:
πsubject, author(Books)
Selects and projects the columns named subject and author from the relation Books.
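Select and project map directly onto the WHERE clause and the column list of a SQL query; a minimal sketch in SQLite, using a hypothetical Books relation matching the examples above (DISTINCT mirrors the set semantics of π):

```python
import sqlite3

# Hypothetical Books relation for the sigma/pi examples above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (title TEXT, subject TEXT, author TEXT, price INTEGER)")
conn.executemany("INSERT INTO Books VALUES (?,?,?,?)",
                 [('DB Concepts', 'database', 'Silberschatz', 500),
                  ('Intro to OS', 'os', 'Tanenbaum', 450)])

# sigma_{subject='database'}(Books): WHERE performs the selection.
print(conn.execute("SELECT * FROM Books WHERE subject = 'database'").fetchall())

# pi_{subject, author}(Books): the column list performs the projection.
print(conn.execute("SELECT DISTINCT subject, author FROM Books").fetchall())
```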
Union Operation (∪)
The union operation performs a binary union between two given relations and is defined as:
r ∪ s = { t | t ∈ r or t ∈ s }
Notation: r ∪ s
Where r and s are either database relations or relation result sets (temporary relations).
For a union operation to be valid, the following conditions must hold:
r and s must have the same number of attributes.
Attribute domains must be compatible.
Duplicate tuples are automatically eliminated.
Examples:
πauthor(Books) ∪ πauthor(Articles)
Output: Projects the names of authors who have either written a book or an article or
both.
Set Difference (−)
The result of the set difference operation is the tuples which are present in one relation but not in the
second relation.
Notation: r − s
Finds all tuples that are present in r but not in s.
Example:
πauthor(Books) − πauthor(Articles)
Output: Gives the names of authors who have written books but not articles.
Cartesian Product (×)
Combines information of two different relations into one.
Notation: r × s
Where r and s are relations and their output will be defined as:
r × s = { q t | q ∈ r and t ∈ s }
σauthor = 'tutorialspoint'(Books × Articles)
Output: Yields a relation as result which shows all books and articles written by
tutorialspoint.
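Union, set difference, and Cartesian product correspond to SQL's UNION, EXCEPT, and CROSS JOIN; a small sketch in SQLite with hypothetical single-column Books and Articles relations:

```python
import sqlite3

# Hypothetical union-compatible relations (same single author attribute).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (author TEXT)")
conn.execute("CREATE TABLE Articles (author TEXT)")
conn.executemany("INSERT INTO Books VALUES (?)", [('Codd',), ('Date',)])
conn.executemany("INSERT INTO Articles VALUES (?)", [('Date',), ('Ullman',)])

# Union: authors of a book or an article or both; duplicates eliminated.
print(conn.execute("SELECT author FROM Books UNION SELECT author FROM Articles").fetchall())

# Difference: authors of books but not articles (only Codd).
print(conn.execute("SELECT author FROM Books EXCEPT SELECT author FROM Articles").fetchall())

# Cartesian product: every Books row paired with every Articles row (2 x 2 = 4 rows).
print(conn.execute("SELECT * FROM Books CROSS JOIN Articles").fetchall())
```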
Rename Operation (ρ)
Results of relational algebra are also relations, but without any name. The rename operation
allows us to rename the output relation. The rename operation is denoted with the small Greek letter
rho (ρ).
Notation: ρx(E)
Where the result of expression E is saved with the name x.
Additional operations are:
Set intersection
Assignment
Natural join
JOIN Operator
JOIN is used to combine related tuples from two relations:
In its simplest form the JOIN operator is just the cross product of the two relations.
As the join becomes more complex, tuples are removed within the cross product to
make the result of the join more meaningful.
JOIN allows you to evaluate a join condition between the attributes of the relations on
which the join is undertaken.
The notation used is
R ⋈join condition S
Natural Join
Invariably the JOIN involves an equality test, and thus is often described as an equi-join. Such
joins result in two attributes in the resulting relation having exactly the same value. A `natural
join' will remove the duplicate attribute(s).
In most systems a natural join will require that the attributes have the same name to
identify the attribute(s) to be used in the join. This may require a renaming mechanism.
If you do use natural joins make sure that the relations do not have two attributes with
the same name by accident.
Outer Joins
Notice that much of the data is lost when applying a join to two relations. In some cases this
lost data might hold useful information. An outer join retains the information that would have
been lost from the tables, replacing missing data with nulls.
There are three forms of the outer join, depending on which data is to be kept.
LEFT OUTER JOIN - keep data from the left-hand table
RIGHT OUTER JOIN - keep data from the right-hand table
FULL OUTER JOIN - keep data from both tables
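A sketch of the LEFT OUTER JOIN in SQLite (the dept and emp tables are hypothetical; missing matches from the right-hand table come back as NULL, which Python shows as None):

```python
import sqlite3

# A department with no employees shows how the outer join retains rows
# that an inner join would discard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (dept_id INTEGER, dname TEXT)")
conn.execute("CREATE TABLE emp (ename TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO dept VALUES (?,?)", [(10, 'CSE'), (20, 'ECE')])
conn.execute("INSERT INTO emp VALUES ('Ravi', 10)")

rows = conn.execute("""SELECT d.dname, e.ename
                       FROM dept d LEFT OUTER JOIN emp e
                       ON d.dept_id = e.dept_id""").fetchall()
print(rows)  # ECE has no employee, so its ename column is None (NULL)
```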
Division
As the name of this operation implies, it involves dividing one relation by another. Division is
in principle a partitioning operation. Thus, 6 ÷ 2 can be paraphrased as partitioning a single
group of 6 into a number of groups of 2 - in this case, 3 groups of 2. The basic terminology
used in arithmetic will be used here as well. Thus in an expression like x ÷ y, x is the dividend
and y the divisor. Division does not always yield whole groups of the divisor, e.g. 7 ÷ 2 gives 3
groups of 2 and a remainder group of 1. Relational division too can leave remainders but, much
like integer division, we ignore remainders and focus only on constructing whole groups of the
divisor.
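Relational division is commonly expressed in SQL with a double NOT EXISTS ("sailors for whom there is no boat they have not reserved"); a sketch using hypothetical Sailors, Boats, and Reserves tables:

```python
import sqlite3

# Divide Reserves by Boats: find sailors who reserved *every* boat.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sailors (sid INTEGER, sname TEXT);
CREATE TABLE Boats (bid INTEGER);
CREATE TABLE Reserves (sid INTEGER, bid INTEGER);
INSERT INTO Sailors VALUES (1, 'Dustin'), (2, 'Lubber');
INSERT INTO Boats VALUES (101), (102);
INSERT INTO Reserves VALUES (1, 101), (1, 102), (2, 101);
""")
rows = conn.execute("""
SELECT S.sname FROM Sailors S
WHERE NOT EXISTS (SELECT * FROM Boats B
                  WHERE NOT EXISTS (SELECT * FROM Reserves R
                                    WHERE R.sid = S.sid AND R.bid = B.bid))
""").fetchall()
print(rows)  # -> [('Dustin',)]  only Dustin reserved every boat
```

Lubber is the "remainder": he reserved one boat but not a whole group of the divisor, so he is excluded, just as the text describes.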
Relational Calculus
In contrast with relational algebra, relational calculus is a non-procedural query language; that
is, it tells what to do but never explains how to do it.
Relational calculus exists in two forms:
Tuple relational calculus (TRC) : The filtering variable ranges over tuples.
Notation: { T | Condition }
Returns all tuples T that satisfy the condition.
Examples:
{ T.name | Author(T) AND T.article = 'database' }
Output: returns tuples with 'name' from Author who has written an article on 'database'.
TRC can be quantified as well. We can use the Existential (∃) and Universal (∀)
quantifiers.
{ R | ∃T ∈ Authors (T.article = 'database' AND R.name = T.name) }
Output: the query will yield the same result as the previous one.
Domain relational calculus (DRC) : In DRC the filtering variable uses domain of attributes
instead of entire tuple values (as done in TRC, mentioned above).
Notation:{ a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}
where a1, a2 are attributes and P stands for formulae built by inner attributes.
Examples:
{< article, page, subject > | ∈ TutorialsPoint ∧ subject = 'database'}
Output: Yields article, page, and subject from the relation TutorialsPoint, where subject
is 'database'.
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators. The expressive power of tuple relational calculus and domain
relational calculus is equivalent to that of relational algebra.
RDBMS
RDBMS stands for Relational Database Management System.
RDBMS is the basis for SQL, and for all modern database systems such as MS SQL
Server, IBM DB2, Oracle, MySQL, and Microsoft Access.
The data in RDBMS is stored in database objects called tables.
A table is a collection of related data entries and it consists of columns and rows.
Table
The data in RDBMS is stored in database objects called tables. The table is a collection of
related data entries and it consists of columns and rows.
Remember, a table is the most common and simplest form of data storage in a relational
database. Following is the example of a CUSTOMERS table:
ID | NAME     | AGE | ADDRESS   | SALARY
1  | Ramesh   | 32  | Ahmedabad | 2000
2  | Khilan   | 25  | Delhi     | 1500
3  | Kaushik  | 23  | Kota      | 2000
4  | Chaitali | 25  | Mumbai    | 6500
5  | Hardik   | 27  | Bhopal    | 8500
6  | Komal    | 22  | MP        | 4500
7  | Muffy    | 24  | Indore    | 10000
Field
Every table is broken up into smaller entities called fields. The fields in the
CUSTOMERS table consist of ID, NAME, AGE, ADDRESS and SALARY.
A field is a column in a table that is designed to maintain specific information about
every record in the table.
Record or Row
A record, also called a row of data, is each individual entry that exists in a table. For example
there are 7 records in the above CUSTOMERS table. Following is a single row of data or
record in the CUSTOMERS table:
1 | Ramesh | 32 | Ahmedabad | 2000
NULL value
A NULL value in a table is a value in a field that appears to be blank, which means a field with
a NULL value is a field with no value.
It is very important to understand that a NULL value is different than a zero value or a field
that contains spaces. A field with a NULL value is one that has been left blank during record
creation.
SQL Constraints
Constraints are the rules enforced on data columns on table. These are used to limit the type of
data that can go into a table. This ensures the accuracy and reliability of the data in the
database.
Constraints could be column level or table level. Column level constraints are applied only to
one column whereas table level constraints are applied to the whole table.
Following are commonly used constraints available in SQL:
NOT NULL: Ensures that a column cannot have a NULL value.
DEFAULT: Provides a default value for a column when none is specified.
UNIQUE: Ensures that all values in a column are different.
PRIMARY Key: Uniquely identifies each row/record in a database table.
FOREIGN Key: References a row/record in another database table.
CHECK Constraint: The CHECK constraint ensures that all values in a column satisfy
certain conditions.
INDEX: Used to create and retrieve data from the database very quickly.
Data Integrity
The following categories of the data integrity exist with each RDBMS:
Entity: There are no duplicate rows in a table.
Domain: Enforces valid entries for a given column by restricting the type, the format,
or the range of values.
Referential: Rows that are used by other records cannot be deleted.
User-Defined: Enforces some specific business rules that do not fall into entity, domain
or referential integrity.
SQL History
In 1971, IBM researchers created a simple non-procedural language called Structured English
Query Language, or SEQUEL. This was based on Dr. Edgar F. (Ted) Codd's design of a
relational model for data storage, where he described a universal programming language for
accessing databases.
In the late 80's, ANSI and ISO (two organizations dealing with standards for a wide
variety of things) came out with a standardized version called Structured Query Language, or
SQL. SQL is pronounced 'Sequel'. There have been several versions of SQL; the latest one
is SQL-99, though SQL-92 is the current universally adopted standard.
SQL is the language used to query all databases. It's simple to learn and appears to do very
little, but it is the heart of a successful database application. Understanding SQL and using it
efficiently is imperative in designing an efficient database application. The better your
understanding of SQL, the more versatile you'll be in getting information out of databases. A
SQL SELECT statement can be broken down into numerous elements, each beginning with a
keyword. Although it is not necessary, common convention is to write these keywords in all
capital letters. In this article, we will focus on the most fundamental and common elements of a
SELECT statement, namely
SELECT
FROM
WHERE
ORDER BY
The SELECT ... FROM Clause
The most basic SELECT statement has only 2 parts:
What columns you want to return
What table(s) those columns come from.
Examples of Basic SQL Queries:
If we want to retrieve all of the information about all of the customers in the Employees table,
we could use the asterisk (*) as a shortcut for all of the columns, and our query looks like
SELECT * FROM Employees
If we want only specific columns (as is usually the case), we can/should explicitly specify them
in a comma-separated list, as in
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
Explicitly specifying the desired fields also allows us to control the order in which the fields
are returned, so that if we wanted the last name to appear before the first name, we could write
SELECT EmployeeID, LastName, FirstName, HireDate, City FROM Employees
The WHERE Clause
The WHERE clause allows you to specify conditions that must be met by the selected data. This will limit the number of rows that answer
the query and are fetched. In many cases, this is where most of the "action" of a query takes
place.
Examples
We can continue with our previous query, and limit it to only those employees living in
London:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London'
If you wanted to get the opposite, the employees who do not live in London, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City <> 'London'
It is not necessary to test for equality; you can also use the standard equality/inequality
operators that you would expect. For example, to get a list of employees who were hired on or
after a given date, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE HireDate >= '1-july-1993'
Of course, we can write more complex conditions. The obvious way to do this is by having
multiple conditions in the WHERE clause. If we want to know which employees were hired
between two given dates, we could write
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE (HireDate >= '1-june-1992') AND (HireDate <= '15-december-1993')
Note that SQL also has a special BETWEEN operator that checks to see if a value is between
two values (including equality on both ends). This allows us to rewrite the previous query as
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993'
We could also use the NOT operator, to fetch those rows that are not between the specified
dates:
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate NOT BETWEEN '1-june-1992' AND '15-december-1993'
Let us finish this section on the WHERE clause by looking at two additional, slightly more
sophisticated, comparison operators.
What if we want to check if a column value is equal to more than one value? If it is only 2
values, then it is easy enough to test for each of those values, combining them with the OR
operator and writing something like
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London' OR City = 'Seattle'
However, if there are three, four, or more values that we want to compare against, the above
approach quickly becomes messy. In such cases, we can use the IN operator to test against a set
of values. If we wanted to see if the City was either Seattle, Tacoma, or Redmond, we would
write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City IN ('Seattle', 'Tacoma', 'Redmond')
As with the BETWEEN operator, here too we can reverse the results obtained and query for
those rows where City is not in the specified list:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City NOT IN ('Seattle', 'Tacoma', 'Redmond')
Finally, the LIKE operator allows us to perform basic pattern-matching using wildcard
characters. For Microsoft SQL Server, the wildcard characters are defined as follows:
Wildcard        Description
% (percent)     matches any string of zero or more characters
_ (underscore)  matches any single character
[]              matches any single character within the specified range (e.g. [a-f])
                or set (e.g. [abcdef]).
[^]             matches any single character not within the specified range (e.g.
                [^a-f]) or set (e.g. [^abcdef]).
Here too, we can opt to use the NOT operator: to find all of the employees whose first name
does not start with 'M' or 'A', we would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE (FirstName NOT LIKE 'M%') AND (FirstName NOT LIKE 'A%')
The ORDER BY Clause
The ORDER BY clause allows us to sort the returned rows by one or more columns. For example, to sort the employees by City, we would write:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY City
If we want the sort order for a column to be descending, we can include the DESC keyword
after the column name.
The ORDER BY clause is not limited to a single column. You can include a comma-delimited
list of columns to sort by; the rows will all be sorted by the first column specified and then by
the next column specified. If we add the Country field to the SELECT clause and want to sort
by Country and City, we would write:
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country, City DESC
Note that to make it interesting, we have specified the sort order for the City column to be
descending (from highest to lowest value). The sort order for the Country column is still
ascending. We could be more explicit about this by writing
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country ASC, City DESC
It is important to note that a column does not need to be included in the list of selected
(returned) columns in order to be used in the ORDER BY clause. If we don't need to see/use
the Country values, but are only interested in them as the primary sorting field we could write
the query as
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY Country ASC, City DESC
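The SELECT / WHERE / ORDER BY clauses above can be exercised end-to-end in SQLite; the Employees rows here are small made-up examples, not the real Northwind data:

```python
import sqlite3

# Minimal Employees table to run the WHERE + ORDER BY pattern from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, City TEXT)")
conn.executemany("INSERT INTO Employees VALUES (?,?,?)",
                 [(1, 'Nancy', 'Seattle'), (2, 'Andrew', 'London'),
                  (3, 'Janet', 'London'), (4, 'Steven', 'Tacoma')])

# Filter with WHERE, then sort the surviving rows with ORDER BY.
rows = conn.execute("""SELECT FirstName FROM Employees
                       WHERE City = 'London'
                       ORDER BY FirstName ASC""").fetchall()
print(rows)  # -> [('Andrew',), ('Janet',)]
```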
INTRODUCTION TO NESTED QUERIES
Nested Queries
A Subquery or Inner query or Nested query is a query within another SQL query and embedded
within the WHERE clause.A subquery is used to return data that will be used in the main query
as a condition to further restrict the data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along
with the operators like =, <, >, >=, <=, IN, BETWEEN etc.
There are a few rules that subqueries must follow:
Subqueries must be enclosed within parentheses.
A subquery can have only one column in the SELECT clause, unless multiple columns
are in the main query for the subquery to compare its selected columns.
An ORDER BY cannot be used in a subquery, although the main query can use an
ORDER BY. The GROUP BY can be used to perform the same function as the ORDER
BY in a subquery. Subqueries that return more than one row can only be used with
multiple-value operators, such as the IN operator.
The SELECT list cannot include any references to values that evaluate to a BLOB,
ARRAY, CLOB, or NCLOB.
WHERE id IN ( SELECT stud
FROM assign
WHERE id = 1);
ANY and SOME
The right-hand side of this form of ANY is a parenthesized sub query, which must return
exactly one column. The left-hand expression is evaluated and compared to each row of the sub
query result using the given operator, which must yield a Boolean result. The result of ANY is
TRUE if any true result is obtained.
SOME is a synonym for ANY. IN is equivalent to = ANY.
ALL
The right-hand side of this form of ALL is a parenthesized sub query, which must return
exactly one column. The left-hand expression is evaluated and compared to each row of the sub
query result using the given operator, which must yield a Boolean result. The result of ALL is
TRUE if all rows yield TRUE (including the special case where the sub query returns no
rows). NOT IN is equivalent to <> ALL.
Row-wise comparison
The left-hand side is a list of scalar expressions. The right-hand side can be either a list of
scalar expressions of the same length, or a parenthesized sub query, which must return exactly
as many columns as there are expressions on the left-hand side. Furthermore, the sub query
cannot return more than one row. (If it returns zero rows, the result is taken to be NULL.) The
left-hand side is evaluated and compared row-wise to the single sub query result row, or to the
right-hand expression list. Presently, only = and <> operators are allowed in row-wise
comparisons. The result is TRUE if the two rows are equal or unequal, respectively.
CORRELATED NESTED QUERIES
SQL correlated subqueries are used to select data from a table referenced in the outer query.
The subquery is known as correlated because the subquery is related to the outer query. In
this type of query, a table alias (also called a correlation name) must be used to specify which
table reference is to be used.
An alias is an alternative name for a table, introduced by putting it directly after the table
name in the FROM clause. This is suitable when anybody wants to obtain information from
two separate tables.
SELECT a.ord_num,a.ord_amount,a.cust_code,a.agent_code
FROM orders a
WHERE a.agent_code=( SELECT b.agent_code
FROM agents b
WHERE b.agent_name='Alex');
Using EXISTS
SELECT employee_id, manager_id, first_name, last_name
FROM employees a
WHERE EXISTS (SELECT employee_id
FROM employees b
WHERE b.manager_id = a.employee_id)
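The correlated EXISTS query above lists employees who manage at least one other employee; a sketch in SQLite with a small, made-up self-referencing employees table:

```python
import sqlite3

# manager_id points back at employee_id in the same table; the inner query is
# re-evaluated for each row a of the outer query (that is the correlation).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, first_name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?,?,?)",
                 [(100, 'Steven', None), (101, 'Neena', 100), (102, 'Lex', 100)])

rows = conn.execute("""SELECT a.first_name
                       FROM employees a
                       WHERE EXISTS (SELECT 1 FROM employees b
                                     WHERE b.manager_id = a.employee_id)""").fetchall()
print(rows)  # -> [('Steven',)]  only Steven is anyone's manager
```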
SET-COMPARISON OPERATORS
SQL Operators
There are two types of operators, namely comparison operators and logical operators. These
operators are used mainly in the WHERE clause and HAVING clause to filter the data to be
selected.
Comparison Operators:Comparison operators are used to compare the column data with
specific values in a condition.Comparison Operators are also used along with the SELECT
statement to filter data based on specific conditions.
Comparison Operators  Description
=                     equal to
<>, !=                is not equal to
<                     less than
>                     greater than
>=                    greater than or equal to
<=                    less than or equal to
Logical Operators:There are three Logical Operators namely AND, OR and NOT.
SQL Comparison Keywords
There are other comparison keywords available in sql which are used to enhance the search
capabilities of a sql query. They are "IN", "BETWEEN...AND", "IS NULL", "LIKE".
Comparison Operators  Description
LIKE                  column value is similar to specified character(s).
IN                    column value is equal to any one of a specified set of values.
BETWEEN...AND         column value is between two values, including the end values
                      specified in the range.
IS NULL               column value does not exist.
SQL LIKE Operator
The LIKE operator is used to list all rows in a table whose column values match a specified
pattern. It is useful when you want to search rows to match a specific pattern, or when you do
not know the entire value. For this purpose we use a wildcard character '%'.
To select all the students whose name begins with 'S'
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE 'S%';
The above select statement searches for all the rows where the first letter of the column
first_name is 'S' and rest of the letters in the name can be any character.
There is another wildcard character you can use with LIKE operator. It is the underscore
character, ' _ ' . In a search string, the underscore signifies a single character.
To display all the names with 'a' second character,
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE '_a%';
NOTE: Each underscore acts as a placeholder for only one character, so you can use more than
one underscore. Eg: '__i%' has two underscores towards the left; 'S__j%' has two
underscores between the characters 'S' and 'j'.
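The '%' and '_' wildcards can be tried in SQLite; the student_details rows below are hypothetical examples chosen so that each pattern matches a different subset:

```python
import sqlite3

# 'S%'  : names starting with S;  '_a%' : names whose second letter is 'a'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_details (first_name TEXT)")
conn.executemany("INSERT INTO student_details VALUES (?)",
                 [('Sita',), ('Rahul',), ('Sam',), ('Anil',)])

print(conn.execute(
    "SELECT first_name FROM student_details WHERE first_name LIKE 'S%'").fetchall())
# matches Sita and Sam
print(conn.execute(
    "SELECT first_name FROM student_details WHERE first_name LIKE '_a%'").fetchall())
# matches Rahul and Sam (second character is 'a')
```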
SQL BETWEEN ... AND Operator
The BETWEEN ... AND operator is used to compare data for a range of values.
To find the names of the students between age 10 to 15 years, the query would be like,
SELECT first_name, last_name, age
FROM student_details
WHERE age BETWEEN 10 AND 15;
SQL IN Operator
The IN operator is used when you want to compare a column with more than one value. It is
similar to an OR condition.
If you want to find the names of students who are studying either Maths or Science, the query
would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject IN ('Maths', 'Science');
You can include more subjects in the list like ('maths','science','history')
NOTE:The data used to compare is case sensitive.
SQL IS NULL Operator
A column value is NULL if it does not exist. The IS NULL operator is used to display all the
rows for columns that do not have a value.
If you want to find the names of students who do not participate in any games, the query would
be as given below
SELECT first_name, last_name
FROM student_details
WHERE games IS NULL
There would be no output if every student in the table student_details participates in a
game; otherwise, the names of the students who do not participate in any games would be
displayed.
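The key point about IS NULL is that an ordinary comparison with NULL never matches; a sketch in SQLite with a hypothetical student_details table where one student has no game:

```python
import sqlite3

# games = NULL yields unknown for every row, so no rows are returned;
# IS NULL is the only way to find the missing values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_details (first_name TEXT, games TEXT)")
conn.executemany("INSERT INTO student_details VALUES (?,?)",
                 [('Ravi', 'cricket'), ('Sita', None)])

print(conn.execute(
    "SELECT first_name FROM student_details WHERE games = NULL").fetchall())
# -> []  (comparison with NULL is unknown, so nothing qualifies)
print(conn.execute(
    "SELECT first_name FROM student_details WHERE games IS NULL").fetchall())
# -> [('Sita',)]
```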
AGGREGATE OPERATORS
The SQL Aggregate Functions are functions that provide mathematical operations. If you need
to add, count or perform basic statistics, these functions will be of great help.
The functions include:
count() - counts a number of rows
sum() - compute sum
avg() - compute average
min() - compute minimum
max() - compute maximum
Use of SQL Aggregate Functions
SQL Aggregate Functions are used as follows. If a grouping of values is needed, also include
the GROUP BY clause. Use a column name or expression as the parameter to the aggregate
function. The parameter '*' represents all rows.
SELECT <column_name1>, <column_name2><aggregate_function(s)>
FROM <table_name>
GROUP BY <column_name1>, <column_name2>
Example
The following example Aggregate Functions are applied to the employee_count of the branch
table. The region_nbr is the level of grouping.Here are the contents of the table:
Table: BRANCH
branch_nbr | branch_name | region_nbr | employee_count
108        | New York    | 100        | 10
110        | Boston      | 100        | 6
212        | Chicago     | 200        | 5
404        | San Diego   | 400        | 6
415        | San Jose    | 400        | 3
Grouping by region_nbr and applying the aggregate functions to employee_count gives:

SELECT region_nbr,
       count(branch_nbr),
       sum(employee_count),
       min(employee_count),
       max(employee_count),
       avg(employee_count)
FROM branch
GROUP BY region_nbr

region_nbr | count | sum | min | max | avg
100        | 2     | 16  | 6   | 10  | 8
200        | 1     | 5   | 5   | 5   | 5
400        | 2     | 9   | 3   | 6   | 4.5
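The grouped aggregates can be reproduced in SQLite using the BRANCH data above (rows are sorted in Python before printing, since the output order of GROUP BY is not guaranteed by SQL):

```python
import sqlite3

# The BRANCH table from the text; region_nbr is the level of grouping.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE branch (branch_nbr INTEGER, branch_name TEXT,
                                     region_nbr INTEGER, employee_count INTEGER)""")
conn.executemany("INSERT INTO branch VALUES (?,?,?,?)",
                 [(108, 'New York', 100, 10), (110, 'Boston', 100, 6),
                  (212, 'Chicago', 200, 5), (404, 'San Diego', 400, 6),
                  (415, 'San Jose', 400, 3)])

rows = conn.execute("""SELECT region_nbr, COUNT(*), SUM(employee_count),
                              MIN(employee_count), MAX(employee_count),
                              AVG(employee_count)
                       FROM branch
                       GROUP BY region_nbr""").fetchall()
for r in sorted(rows):
    print(r)
# (100, 2, 16, 6, 10, 8.0)
# (200, 1, 5, 5, 5, 5.0)
# (400, 2, 9, 3, 6, 4.5)
```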
NULL VALUES
The SQL NULL is the term used to represent a missing value. A NULL value in a table is a value in a field that appears to be blank. A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different from a zero value or a field that contains spaces.
Syntax:
The basic syntax of NULL while creating a table:
CREATE TABLE CUSTOMERS (
    ID      INT             NOT NULL,
    NAME    VARCHAR (20)    NOT NULL,
    AGE     INT             NOT NULL,
    ADDRESS CHAR (25),
    SALARY  DECIMAL (18, 2),
    PRIMARY KEY (ID)
);
Here, NOT NULL signifies that the column should always accept an explicit value of the given data type. There are two columns (ADDRESS and SALARY) where we did not use NOT NULL, which means these columns could be NULL.
A field with a NULL value is one that has been left blank during record creation.
Example:
The NULL value can cause problems when selecting data because, when comparing an unknown value to any other value, the result is always unknown and is not included in the final results.
You must use the IS NULL or IS NOT NULL operators in order to check for a NULL value.
Consider the following table, CUSTOMERS, having the following records (Komal and Muffy have no SALARY value):

ID  NAME      AGE  ADDRESS    SALARY
1   Ramesh    32   Ahmedabad  2000.00
2   Khilan    25   Delhi      1500.00
3   kaushik   23   Kota       2000.00
4   Chaitali  25   Mumbai     6500.00
5   Hardik    27   Bhopal     8500.00
6   Komal     22   MP
7   Muffy     24   Indore

The following query uses the IS NOT NULL operator:

SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS
WHERE SALARY IS NOT NULL;

This would produce the following result:

ID  NAME      AGE  ADDRESS    SALARY
1   Ramesh    32   Ahmedabad  2000.00
2   Khilan    25   Delhi      1500.00
3   kaushik   23   Kota       2000.00
4   Chaitali  25   Mumbai     6500.00
5   Hardik    27   Bhopal     8500.00

The following query uses the IS NULL operator:

SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS
WHERE SALARY IS NULL;

This would produce the following result:

ID  NAME   AGE  ADDRESS  SALARY
6   Komal  22   MP
7   Muffy  24   Indore
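The two queries can be run end to end as a sketch with Python's sqlite3 module; the CUSTOMERS rows below are the sample records, with NULL salaries inserted as Python None.

```python
import sqlite3

# Sample CUSTOMERS data; Komal and Muffy get a NULL SALARY (None).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT, name TEXT, age INT,"
             " address TEXT, salary REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    (1, "Ramesh", 32, "Ahmedabad", 2000.00),
    (2, "Khilan", 25, "Delhi", 1500.00),
    (3, "kaushik", 23, "Kota", 2000.00),
    (4, "Chaitali", 25, "Mumbai", 6500.00),
    (5, "Hardik", 27, "Bhopal", 8500.00),
    (6, "Komal", 22, "MP", None),
    (7, "Muffy", 24, "Indore", None),
])

# "salary = NULL" would match nothing; IS NULL / IS NOT NULL is required.
with_salary = conn.execute(
    "SELECT name FROM customers WHERE salary IS NOT NULL ORDER BY id").fetchall()
no_salary = conn.execute(
    "SELECT name FROM customers WHERE salary IS NULL ORDER BY id").fetchall()
print([n for (n,) in no_salary])  # the two customers without a salary
```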
There are three Logical Operators namely, AND, OR, and NOT. These operators compare two
conditions at a time to determine whether a row can be selected for the output. When retrieving
data using a SELECT statement, you can use logical operators in the WHERE clause, which
allows you to combine more than one condition.
Logical Operator  Description
OR                For the row to be selected at least one of the conditions must be true.
AND               For a row to be selected all the specified conditions must be true.
NOT               For a row to be selected the specified condition must be false.
If you want to select rows that satisfy at least one of the given conditions, you can use the
logical operator, OR.
Example: if you want to find the names of students who are studying either Maths or Science,
the query would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject = 'Maths' OR subject = 'Science'
first_name  last_name  subject
Anajali     Bhagwat    Maths
Shekar      Gowda      Maths
Rahul       Sharma     Science
Stephen     Fleming    Science
The following table describes how the logical OR operator selects a row.

Column1 Satisfied?  Column2 Satisfied?  Row Selected
YES                 YES                 YES
YES                 NO                  YES
NO                  YES                 YES
NO                  NO                  NO

If you want to select rows that satisfy all the given conditions, you can use the logical operator AND. The following table describes how the logical AND operator selects a row.

Column1 Satisfied?  Column2 Satisfied?  Row Selected
YES                 YES                 YES
YES                 NO                  NO
NO                  YES                 NO
NO                  NO                  NO
If you want to find rows that do not satisfy a condition, you can use the logical operator, NOT.
NOT results in the reverse of a condition. That is, if a condition is satisfied, then the row is not
returned.
Example: If you want to find out the names of the students who do not play football, the query
would be like:
SELECT first_name, last_name, games
FROM student_details
WHERE NOT games = 'Football'
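The OR, AND and NOT examples above can be combined into one runnable sketch; the student_details rows are made up to be consistent with the earlier result tables (the games values are assumptions).

```python
import sqlite3

# Hypothetical student_details rows matching the OR example above;
# the games column values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_details (first_name TEXT,"
             " last_name TEXT, subject TEXT, games TEXT)")
conn.executemany("INSERT INTO student_details VALUES (?, ?, ?, ?)", [
    ("Anajali", "Bhagwat", "Maths", "Cricket"),
    ("Shekar", "Gowda", "Maths", "Football"),
    ("Rahul", "Sharma", "Science", "Football"),
    ("Stephen", "Fleming", "Science", "Chess"),
])

either = conn.execute(
    "SELECT first_name FROM student_details"
    " WHERE subject = 'Maths' OR subject = 'Science'"
    " ORDER BY first_name").fetchall()
both = conn.execute(
    "SELECT first_name FROM student_details"
    " WHERE subject = 'Science' AND games = 'Football'").fetchall()
not_football = conn.execute(
    "SELECT first_name FROM student_details"
    " WHERE NOT games = 'Football' ORDER BY first_name").fetchall()
print(either, both, not_football)
```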
OUTER JOINS
All the joins mentioned above, that is Theta Join, Equi Join and Natural Join, are called inner joins. An inner join includes only tuples with matching attributes; the rest are discarded from the resulting relation. There exist methods by which all tuples of either relation are included in the resulting relation.
There are three kinds of outer joins:
Left outer join (R ⟕ S)
All tuples of the left relation, R, are included in the resulting relation, and if there exist tuples in R without any matching tuple in S, then the S-attributes of the resulting relation are made NULL.
Left
A    B
100  Database
101  Mechanics
102  Electronics

Right
C    D
100  Alex
102  Maya
104  Mira

Left outer join output
A    B            C    D
100  Database     100  Alex
101  Mechanics    ---  ---
102  Electronics  102  Maya
Right outer join (R ⟖ S)
All tuples of the right relation, S, are included in the resulting relation, and if there exist tuples in S without any matching tuple in R, then the R-attributes of the resulting relation are made NULL.

Right outer join output
A    B            C    D
100  Database     100  Alex
102  Electronics  102  Maya
---  ---          104  Mira

Full outer join (R ⟗ S)
All tuples of both relations are included in the resulting relation; tuples without a match on the other side have that side's attributes made NULL.

Full outer join output
A    B            C    D
100  Database     100  Alex
101  Mechanics    ---  ---
102  Electronics  102  Maya
---  ---          104  Mira
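The left outer join table above can be reproduced as a sketch in SQL; this uses sqlite3 with the two sample relations as made-up tables (unmatched attributes come back as None, SQL's NULL).

```python
import sqlite3

# The Left and Right sample relations from the outer-join example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_rel  (a INT, b TEXT);
    CREATE TABLE right_rel (c INT, d TEXT);
    INSERT INTO left_rel  VALUES (100, 'Database'), (101, 'Mechanics'),
                                 (102, 'Electronics');
    INSERT INTO right_rel VALUES (100, 'Alex'), (102, 'Maya'), (104, 'Mira');
""")

# Unmatched left tuple 101 gets NULL (None) for the right attributes.
rows = conn.execute("""
    SELECT a, b, c, d
    FROM left_rel LEFT OUTER JOIN right_rel ON a = c
    ORDER BY a
""").fetchall()
for r in rows:
    print(r)
```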
Employee
Name    Age  Gender  Location  Salary
Henry   54   Male    New York  100000
Tina    36   Female  Moscow    80000
John    24   Male    London    40000
Sophie  29   Female  London    60000

INTEGRITY CONSTRAINTS
An integrity constraint defines a business rule for a table column. When enabled, the rule will be enforced by Oracle (and so will always be true). To create an integrity constraint, all existing table data must satisfy the constraint.
Default values are also subject to integrity constraint checking (defaults are included as part of an INSERT statement before the statement is parsed).
If the results of an INSERT or UPDATE statement violate an integrity constraint, the statement will be rolled back.
Integrity constraints are stored as part of the table definition, in the data dictionary.
If multiple applications access the same table, they will all adhere to the same rule.
The following integrity constraints are supported by Oracle:
NOT NULL
UNIQUE
CHECK constraints for complex integrity rules
PRIMARY KEY
FOREIGN KEY integrity constraints - referential integrity actions on update and delete, such as ON DELETE CASCADE and ON DELETE SET NULL
Constraint States
The current status of an integrity constraint can be changed to any of the following four options using the CREATE TABLE or ALTER TABLE statement: ENABLE VALIDATE, ENABLE NOVALIDATE, DISABLE VALIDATE, and DISABLE NOVALIDATE.
DISABLE VALIDATE disables the constraint, drops the index on the constraint, and disallows any modification of the constrained columns.
For a UNIQUE constraint, this enables you to load data from a nonpartitioned table into a partitioned table using the ALTER TABLE ... EXCHANGE PARTITION statement.
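As a hedged sketch of the constraint types listed above, SQLite (via Python's sqlite3) can declare the same NOT NULL, UNIQUE, CHECK, PRIMARY KEY and FOREIGN KEY rules; the dept/emp tables are invented, and Oracle-specific features such as constraint states are not shown.

```python
import sqlite3

# Hypothetical dept/emp schema exercising the constraint types above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT NOT NULL UNIQUE);
    CREATE TABLE emp (
        empno  INTEGER PRIMARY KEY,
        ename  TEXT NOT NULL,
        sal    REAL CHECK (sal >= 0),
        deptno INTEGER REFERENCES dept(deptno) ON DELETE CASCADE
    );
    INSERT INTO dept VALUES (10, 'SALES');
    INSERT INTO emp VALUES (1, 'Henry', 1000, 10);
""")

# Each of these violates a different constraint; the statement is rejected
# and has no effect, mirroring the rollback behavior described above.
violations = []
for stmt in [
    "INSERT INTO emp VALUES (2, NULL, 500, 10)",    # NOT NULL violation
    "INSERT INTO emp VALUES (3, 'Tina', -5, 10)",   # CHECK violation
    "INSERT INTO emp VALUES (4, 'John', 700, 99)",  # FOREIGN KEY violation
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as e:
        violations.append(str(e))

conn.execute("DELETE FROM dept WHERE deptno = 10")  # ON DELETE CASCADE fires
remaining = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
print(len(violations), remaining)
```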
TRIGGERS
A database trigger is a procedure written in PL/SQL, Java, or C that will run implicitly when
data is modified or when some user or system actions occur.
Triggers can be used in many ways e.g. to enforce complex integrity constraints or to audit data
modifications. Triggers should not be used to enforce business rules or referential integrity
rules that could be implemented with simple constraints.
Triggers are implicitly fired by Oracle when a triggering event occurs, no matter which user is
connected or which application is being used.
A row trigger is fired once for each row affected by an UPDATE statement.
A statement trigger is fired once, regardless of the number of rows in the table.
BEFORE triggers execute the trigger action before the triggering statement is executed. This
type of trigger is commonly used if the trigger will derive specific column values or if the
trigger action will determine whether the triggering statement should be allowed to complete.
Appropriate use of a BEFORE trigger can eliminate unnecessary processing of the triggering
statement.
AFTER triggers execute the trigger action after the triggering statement is executed.
For any given table you can have multiple triggers of the same type for the same statement, e.g. multiple AFTER UPDATE triggers on the same table.
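As an illustrative sketch of a row-level AFTER UPDATE audit trigger, the following uses SQLite's trigger syntax through Python's sqlite3 (Oracle would use PL/SQL with :OLD/:NEW instead of OLD/NEW); the emp and emp_audit tables are made up.

```python
import sqlite3

# A row trigger fires once per affected row; here it audits salary changes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empno INTEGER PRIMARY KEY, sal REAL);
    CREATE TABLE emp_audit (empno INT, old_sal REAL, new_sal REAL);

    CREATE TRIGGER emp_sal_audit
    AFTER UPDATE OF sal ON emp
    FOR EACH ROW
    BEGIN
        INSERT INTO emp_audit VALUES (OLD.empno, OLD.sal, NEW.sal);
    END;

    INSERT INTO emp VALUES (1, 1000), (2, 2000);
    UPDATE emp SET sal = sal + 100;  -- the trigger fires once for each row
""")
audit = conn.execute("SELECT * FROM emp_audit ORDER BY empno").fetchall()
print(audit)
```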
UNIT - III
INTRODUCTION TO SCHEMA REFINEMENT
Conceptual database design gives us a set of relation schemas and integrity constraints (ICs) that can be regarded as a good starting point for the final database design. This initial design must be refined by taking the ICs into account more fully than is possible with just the ER model constructs, and also by considering performance criteria and typical workloads.
Introduction to Schema Refinement:
We now present an overview of the problems that schema refinement is intended to address
and a refinement approach based on decompositions.
Redundant storage of information is the root cause of these problems.
Although decomposition can eliminate redundancy, it can lead to problems of its own and
should be used with caution.
Problems caused by Redundancy:
Redundant Storage
Update Anomalies
Insertion Anomalies
Deletion Anomalies
Hourly_Emps (SSN, Name, Lot, Rating, Hourly_wages, Hours_worked)

SSN  Name    Lot  Rating  Hourly_wages  Hours_worked
123  Rajesh  48   8       10            40
456  Ajay    22   8       10            30
326  Arun    35   5       7             30
434  Kamal   35   5       7             32
612  Nitin   35   8       10            40
Decompositions
The problems arising from redundancy can be solved by replacing a relation with a collection of smaller relations.
A decomposition of a relation schema R consists of replacing the relation schema by two (or more) relation schemas that each contain a subset of the attributes of R and together include all attributes of R.
Hourly_Emps2 (SSN, Name, Lot, Rating, Hours_worked)
Wages( Rating, Hourly_wages)
Problems related to Decomposition
Unless we are careful, decomposing a relation schema can create more problems than it solves. We need to ask two questions repeatedly:
1. Is there reason to decompose a relation? To answer this question, several normal forms have been proposed for relations. If a relation schema is in one of these normal forms, we know that certain kinds of problems cannot arise.
2. What problems (if any) does a given decomposition cause? Two properties of decompositions are of particular interest: the lossless-join property and the dependency-preservation property, discussed later in this unit.
Functional Dependencies
Consider an instance of a relation with attributes ABCD on which the FD AB → C holds:

A   B   C   D
a1  b1  c1  d1
a1  b1  c1  d2
a1  b2  c2  d1
a2  b1  c3  d1

Inserting the tuple <a1, b1, c2, d1> would violate AB → C, because the instance already contains a tuple with the same A and B values but a different C value.
Closure of a Set of FDs
We say that an FD f is implied by a given set F of FDs if f holds on every relation instance that satisfies all dependencies in F; that is, f holds whenever all FDs in F hold.
The set of all FDs implied by a given set F of FDs is called the closure of F, denoted by F+.
Three rules, called Armstrong's Axioms, can be applied repeatedly to infer all FDs implied by a set F of FDs.
Armstrong's Axioms
Here X, Y and Z denote sets of attributes of relation R:
Reflexivity: If X ⊇ Y, then X → Y.
Augmentation: If X → Y, then XZ → YZ for any Z.
Transitivity: If X → Y and Y → Z, then X → Z.
Two additional rules follow from the axioms:
Union: If X → Y and X → Z, then X → YZ.
Decomposition: If X → YZ, then X → Y and X → Z.
Contracts (contractid, supplierid, projectid, deptid, partid, qty, value)
This can be denoted as CSJDPQV.
The meaning of a tuple is that the contract with contractid C is an agreement that supplier S will supply Q items of part P to project J associated with department D; the value V of this contract is equal to value.
The ICs known to hold are:
1. The contract id C is a key: C → CSJDPQV
2. A project purchases a given part using a single contract: JP → C
3. A department purchases at most one part from a supplier: SD → P
Some additional FDs hold in the closure of the set of given FDs:
From JP → C, C → CSJDPQV and transitivity, JP → CSJDPQV
From SD → P and augmentation, SDJ → JP
From SDJ → JP, JP → CSJDPQV and transitivity, SDJ → CSJDPQV
From C → CSJDPQV, using decomposition, C → C, C → S, C → J, etc.
And we may infer a number of FDs from reflexivity.
Attribute Closure
If we just want to check whether a given dependency, say X → Y, is in the closure of a set F of FDs, we can do so efficiently without computing F+.
We first compute the attribute closure X+ with respect to F, which is the set of attributes A such that X → A can be inferred using the Armstrong Axioms. We can find the attribute closure using this algorithm:

closure = X
repeat until there is no change:
{
    if there is an FD V → W in F such that V ⊆ closure,
    then set closure = closure ∪ W
}
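The algorithm above translates directly into a small Python sketch; FDs are represented as (lhs, rhs) pairs of attribute strings, using the Contracts example from earlier.

```python
# Attribute closure per the algorithm above: repeatedly absorb the
# right side of any FD whose left side is already inside the closure.
def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# Contracts example: C -> CSJDPQV, JP -> C, SD -> P
fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(sorted(attribute_closure("SDJ", fds)))  # SDJ determines all attributes
```

For instance, starting from SDJ the loop adds P (via SD → P), then C (via JP → C), then the rest (via C → CSJDPQV), confirming SDJ → CSJDPQV.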
Superkey: A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S ⊆ R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S].
Prime Attribute: An attribute of relation schema R is called a prime attribute of R if it is a member of some candidate key of R.
A couple of additional rules (that follow from the axioms):
Union: If X → Y and X → Z, then X → YZ
e.g., if sid → acode and sid → city, then sid → acode, city
Decomposition: If X → YZ, then X → Y and X → Z
e.g., if sid → acode, city then sid → acode and sid → city
Examples: Derive the Union rule from the axioms (Reflexivity, Augmentation, and Transitivity). Derive the Decomposition rule from Reflexivity and Transitivity. Corollary: Given any set of FDs F, we can convert F into an equivalent set of FDs F', s.t. every FD in F' is of the form X → A, where X is a set of attributes and A is a single attribute.
Normalization
If a database design is not perfect, it may contain anomalies, which are like a bad dream for the database itself. Managing a database with anomalies is next to impossible.
Update anomalies: if data items are scattered and are not linked to each other properly, then there may be instances where we try to update one data item that has copies scattered at several places; a few instances get updated properly while a few are left with their old values. This leaves the database in an inconsistent state.
Deletion anomalies: we try to delete a record, but parts of it are left undeleted because, without our being aware of it, the data is also saved somewhere else.
Insertion anomalies: we try to insert data in a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a consistent state, free from anomalies of any kind.
First Normal Form
This is defined in the definition of relations (tables) itself. This rule defines that all the attributes in a relation must have atomic domains. Values in an atomic domain are indivisible units.
We re-arrange the relation (table) as below to convert it to First Normal Form: each attribute must contain only a single value from its pre-defined domain.
Second Normal Form
Before we learn about second normal form, we need to understand the following:
Prime attribute: an attribute that is part of the prime key is a prime attribute.
Non-prime attribute: an attribute that is not a part of the prime key is said to be a non-prime attribute.
Second normal form says that every non-prime attribute should be fully functionally dependent on the prime key attribute. That is, if X → A holds, then there should not be any proper subset Y of X for which Y → A also holds.
We see here in the Student_Project relation that the prime key attributes are Stu_ID and Proj_ID. According to the rule, non-key attributes, i.e. Stu_Name and Proj_Name, must be dependent upon both and not on any of the prime key attributes individually. But we find that Stu_Name can be identified by Stu_ID and Proj_Name can be identified by Proj_ID independently. This is called partial dependency, which is not allowed in Second Normal Form.
We broke the relation in two as depicted in the above picture, so that no partial dependency exists.
Third Normal Form
For a relation to be in Third Normal Form, it must be in Second Normal Form and the following must be satisfied:
No non-prime attribute is transitively dependent on the prime key attribute.
For any non-trivial functional dependency X → A, either
X is a superkey, or
A is a prime attribute.
We find that in the above depicted Student_detail relation, Stu_ID is the key and the only prime key attribute. We find that City can be identified by Stu_ID as well as by Zip itself. Neither is Zip a superkey nor is City a prime attribute. Additionally, Stu_ID → Zip and Zip → City, so there exists a transitive dependency.
We broke the relation into the two relations depicted above to bring it into 3NF.
Boyce-Codd Normal Form:
BCNF is an extension of Third Normal Form in a strict way. BCNF states that
For any non-trivial functional dependency X → A, X must be a super-key.
In the above depicted picture, Stu_ID is the super-key in the Student_Detail relation and Zip is the super-key in the ZipCodes relation. So, Stu_ID → Stu_Name, Zip and Zip → City confirm that both relations are in BCNF.
Lossless-Join Decomposition:
Let R be a relation schema and let F be a set of FDs over R. A decomposition of R into two schemas with attribute sets X and Y is said to be a lossless-join decomposition with respect to F if, for every instance r of R that satisfies the dependencies in F, πX(r) ⋈ πY(r) = r. In other words, we can recover the original relation from the decomposed relations.
From the definition it is easy to see that r is always a subset of the natural join of the decomposed relations. If we take projections of a relation and recombine them using natural join, we typically obtain some tuples that were not in the original relation.
Example:
By replacing the instance r shown in the figure with the instances πSP(r) and πPD(r), we lose some information.

Instance r
S   P   D
s1  p1  d1
s2  p2  d2
s3  p1  d3

πSP(r)
S   P
s1  p1
s2  p2
s3  p1

πPD(r)
P   D
p1  d1
p2  d2
p1  d3

πSP(r) ⋈ πPD(r)
S   P   D
s1  p1  d1
s2  p2  d2
s3  p1  d3
s1  p1  d3
s3  p1  d1

Fig: Instances illustrating Lossy Decompositions
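The lossy decomposition in the figure can be recomputed as a sketch in plain Python, using sets of tuples for relations and a comprehension for the natural join on the shared attribute P.

```python
# The instance r from the figure, as a set of (S, P, D) tuples.
r = {("s1", "p1", "d1"), ("s2", "p2", "d2"), ("s3", "p1", "d3")}

sp = {(s, p) for (s, p, d) in r}  # projection of r onto SP
pd = {(p, d) for (s, p, d) in r}  # projection of r onto PD

# Natural join of the two projections on the shared attribute P.
joined = {(s, p, d) for (s, p) in sp for (p2, d) in pd if p == p2}

extra = joined - r  # spurious tuples introduced by the decomposition
print(sorted(extra))
```

The join always contains r, but here it also contains the two spurious tuples (s1, p1, d3) and (s3, p1, d1), which is exactly why this decomposition is lossy.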
Theorem:
Let R be a relation and F be a set of FDs that hold over R. The decomposition of R into relations with attribute sets R1 and R2 is lossless if and only if F+ contains either the FD R1 ∩ R2 → R1 or the FD R1 ∩ R2 → R2.
Consider the Hourly_Emps relation. It has attributes SNLRWH, and the FD R → W causes a violation of 3NF. We dealt with this violation by decomposing the relation into SNLRH and RW. Since R is common to both decomposed relations and R → W holds, this decomposition is lossless-join.
Dependency-Preserving Decomposition:
Consider the Contracts relation with attributes CSJDPQV. The given FDs are C → CSJDPQV, JP → C, and SD → P. Because SD is not a key, the dependency SD → P causes a violation of BCNF.
We can decompose Contracts into relations with schemas CSJDQV and SDP to address this violation. The decomposition is lossless-join. But there is one problem: if we want to enforce the integrity constraint JP → C, it requires an expensive join of the two relations. We say that this decomposition is not dependency-preserving.
Let R be a relation schema that is decomposed into two schemas with attribute sets X and Y, and let F be a set of FDs over R. The projection of F on X is the set of FDs in the closure F+ that involve only attributes in X. We denote the projection of F on attributes X as FX. Note that a dependency U → V in F+ is in FX only if all the attributes in U and V are in X.
The decomposition of relation schema R with FDs F into schemas with attribute sets X and Y is dependency-preserving if (FX ∪ FY)+ = F+.
Example:
Consider the relation R with attributes ABC, decomposed into relations with attributes AB and BC. The set of FDs over R includes A → B, B → C, and C → A.
The closure of F contains all dependencies in F plus A → C, B → A, and C → B. Consequently, FAB contains A → B and B → A, and FBC contains B → C and C → B. Therefore, FAB ∪ FBC contains A → B, B → C, B → A and C → B. The closure of FAB ∪ FBC now includes C → A (which follows from C → B and B → A). Thus the decomposition preserves the dependency C → A.
Design process
1. Determine the purpose of the database - This helps prepare for the remaining steps.
2. Find and organize the information required - Gather all of the types of information
to record in the database, such as product name and order number.
3. Divide the information into tables - Divide information items into major entities or
subjects, such as Products or Orders. Each subject then becomes a table.
4. Turn information items into columns - Decide what information needs to be stored in
each table. Each item becomes a field, and is displayed as a column in the table. For
example, an Employees table might include fields such as Last Name and Hire Date.
5. Specify primary keys - Choose each table's primary key. The primary key is a column, or a set of columns, that is used to uniquely identify each row. An example might be Product ID or Order ID.
6. Set up the table relationships - Look at each table and decide how the data in one
table is related to the data in other tables. Add fields to tables or create new tables to
clarify the relationships, as necessary.
7. Refine the design - Analyze the design for errors. Create tables and add a few records
of sample data. Check if results come from the tables as expected. Make adjustments to
the design, as needed.
8. Apply the normalization rules - Apply the data normalization rules to see if tables are
structured correctly. Make adjustments to the tables
Multivalued dependency
In database theory, a multivalued dependency is a full constraint between two sets of attributes
in a relation.
In contrast to the functional dependency, the multivalued dependency requires that certain tuples be present in a relation. Therefore, a multivalued dependency is a special case of a tuple-generating dependency. The multivalued dependency plays a role in 4NF database normalization.
A multivalued dependency is a special case of a join dependency, with only two sets of values involved, i.e. it is a 2-ary join dependency.
Formal definition
The formal definition is given as follows.
Let R be a relational schema and let X ⊆ R and Y ⊆ R be sets of attributes. The multivalued dependency X ↠ Y (which can be read as X multidetermines Y) holds on R if, for all pairs of tuples t1 and t2 in R such that t1[X] = t2[X], there exist tuples t3 and t4 in R such that:
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[R − XY] = t2[R − XY] and t4[R − XY] = t1[R − XY]
In simpler words: if we denote by (x, y, z) a tuple whose X, Y and R − XY values are collectively equal to x, y and z, then whenever the tuples (x, y1, z1) and (x, y2, z2) exist in R, the tuples (x, y1, z2) and (x, y2, z1) should also exist in R.
Example
Consider this example of a relation of university courses, the books recommended for the
course, and the lecturers who will be teaching the course:
University courses

Course  Book          Lecturer
AHA     Silberschatz  John D
AHA     Nederpelt     William M
AHA     Silberschatz  William M
AHA     Nederpelt     John D
AHA     Silberschatz  Christian G
AHA     Nederpelt     Christian G
OSO     Silberschatz  John D
OSO     Silberschatz  William M
Because the lecturers attached to the course and the books attached to the course are
independent of each other, this database design has a multivalued dependency; if we were to
add a new book to the AHA course, we would have to add one record for each of the lecturers
on that course, and vice versa.
Put formally, there are two multivalued dependencies in this relation: {course} ↠ {book} and, equivalently, {course} ↠ {lecturer}.
Databases with multivalued dependencies thus exhibit redundancy. In database normalization, fourth normal form requires that every multivalued dependency X ↠ Y is either trivial or X is a superkey.
Properties
If X ↠ Y, then X ↠ R − XY.
If X ↠ Y and Z ⊆ W, then XW ↠ YZ.
If X ↠ Y and Y ↠ Z, then X ↠ Z − Y.
If X → Y, then X ↠ Y.
If X ↠ Y and Y → Z, then X → Z − Y.
The above rules are sound and complete.
A decomposition of R into (X, Y) and (X, R − Y) is a lossless-join decomposition if and only if X ↠ Y holds in R.
Every FD is an MVD, because if X → Y, then swapping Y's between tuples that agree on X doesn't create new tuples.
Splitting doesn't hold: as with FDs, we cannot generally split the left side of an MVD; but unlike FDs, we cannot split the right side either, so sometimes you have to leave several attributes on the right side.
Closure of a set of MVDs is the set of all MVDs that can be inferred using the following rules (Armstrong's axioms):
Complementation: If X ↠ Y, then X ↠ R − XY
Augmentation: If X ↠ Y and Z ⊆ W, then XW ↠ YZ
Transitivity: If X ↠ Y and Y ↠ Z, then X ↠ Z − Y
Replication: If X → Y, then X ↠ Y
Coalescence: If X ↠ Y and there exists W such that W ∩ Y = ∅, W → Z, and Z ⊆ Y, then X → Z
Full Constraint
A constraint which expresses something about all attributes in a database (in contrast to an embedded constraint). That a multivalued dependency is a full constraint follows from its definition, since it says something about the attributes R − Y as well.
Tuple-generating dependency
A dependency which explicitly requires certain tuples to be present in the relation.
Trivial multivalued dependency 1
A multivalued dependency which involves all the attributes of a relation, i.e. X ∪ Y = R. A trivial multivalued dependency implies, for tuples t1 and t2, tuples t3 and t4 which are equal to t1 and t2.
Trivial multivalued dependency 2
A multivalued dependency for which Y ⊆ X.
MVD Example
Course ->> Instructor
Course ->> Text

Course (X)  Instructor (Y)  Text (R − XY)
Intro       Kruse           Intro to CS
Intro       Wright          Intro to CS
CS1         Thomas          Intro to Java
CS1         Thomas          CS Theory Survey
CS2         Rhodes          Java Data Structures
CS2         Rhodes          Unix
CS2         Kruse           Java Data Structures
CS2         Kruse           Unix
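The formal definition of X ↠ Y can be checked mechanically on an instance. The sketch below (not from the source) tests the swapped-tuple condition on the course table above; attributes are addressed by column index, with x_idx and y_idx giving the positions of X and Y.

```python
# Test whether X ->> Y holds on an instance: for every pair of tuples
# t1, t2 agreeing on X, the tuple taking X and Y from t1 and the
# remaining attributes from t2 must also be present.
def mvd_holds(rows, x_idx, y_idx):
    rows = set(rows)
    for t1 in rows:
        for t2 in rows:
            if all(t1[i] == t2[i] for i in x_idx):
                t3 = tuple(t1[i] if i in x_idx or i in y_idx else t2[i]
                           for i in range(len(t1)))
                if t3 not in rows:
                    return False
    return True

# (Course, Instructor, Text) rows from the MVD example above.
courses = [
    ("Intro", "Kruse", "Intro to CS"), ("Intro", "Wright", "Intro to CS"),
    ("CS1", "Thomas", "Intro to Java"), ("CS1", "Thomas", "CS Theory Survey"),
    ("CS2", "Rhodes", "Java Data Structures"), ("CS2", "Rhodes", "Unix"),
    ("CS2", "Kruse", "Java Data Structures"), ("CS2", "Kruse", "Unix"),
]
print(mvd_holds(courses, x_idx={0}, y_idx={2}))  # Course ->> Text
```

Dropping any one of the CS2 rows breaks the dependency, since a required instructor/text combination would then be missing.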
A table is in fifth normal form (5NF) or Project-Join Normal Form (PJNF) if it is in 4NF and it
cannot have a lossless decomposition into any number of smaller tables.
Properties of 5NF:
Anomalies can occur in relations in 4NF if the primary key has three or more fields.
5NF is based on the concept of join dependence - if a relation cannot be decomposed
any further then it is in 5NF.
Pair wise cyclical dependency means that:
o You always need to know two values (pair wise).
o For any one you must know the other two (cyclical).
Example to understand 5NF
Take the following table structure as an example of a buying table. This is used to track buyers, what they buy, and from whom they buy. Take the following sample data:
buyer  vendor         item
Sally  Liz Claiborne  Blouses
Mary   Liz Claiborne  Blouses
Sally  Jordach        Jeans
Mary   Jordach        Jeans
Sally  Jordach        Sneakers
Problem:- The problem with the above table structure is that if Claiborne starts to sell Jeans
then how many records must you create to record this fact? The problem is there are pair wise
cyclical dependencies in the primary key. That is, in order to determine the item you must
know the buyer and vendor, and to determine the vendor you must know the buyer and the
item, and finally to know the buyer you must know the vendor and the item.
Solution:- The solution is to break this one table into three tables: Buyer-Vendor, Buyer-Item, and Vendor-Item. The following tables are in 5NF.
Buyer-Vendor
buyer  vendor
Sally  Liz Claiborne
Mary   Liz Claiborne
Sally  Jordach
Mary   Jordach

Buyer-Item
buyer  item
Sally  Blouses
Mary   Blouses
Sally  Jeans
Mary   Jeans
Sally  Sneakers

Vendor-Item
vendor         item
Liz Claiborne  Blouses
Jordach        Jeans
Jordach        Sneakers
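That the three binary tables lose nothing can be verified as a sketch in plain Python: joining them back together reproduces exactly the original five buying rows, which is the join dependency 5NF relies on.

```python
# Original buying table as a set of (buyer, vendor, item) tuples.
buying = {
    ("Sally", "Liz Claiborne", "Blouses"), ("Mary", "Liz Claiborne", "Blouses"),
    ("Sally", "Jordach", "Jeans"), ("Mary", "Jordach", "Jeans"),
    ("Sally", "Jordach", "Sneakers"),
}

# The three binary projections: Buyer-Vendor, Buyer-Item, Vendor-Item.
bv = {(b, v) for (b, v, i) in buying}
bi = {(b, i) for (b, v, i) in buying}
vi = {(v, i) for (b, v, i) in buying}

# Join all three back together on the shared attributes.
rejoined = {(b, v, i) for (b, v) in bv for (b2, i) in bi
            if b == b2 and (v, i) in vi}
print(rejoined == buying)
```

Note that joining only two of the projections would produce spurious rows (for example Sally-Liz Claiborne-Jeans); the third table is what filters them out.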
Note: There is also one more normal form, i.e. 6NF. A table is in sixth normal form (6NF) or Domain-Key Normal Form (DKNF) if it is in 5NF and all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation.
Inclusion Dependencies:
Inclusion dependencies support an essential semantics of the standard relational data model. An inclusion dependency is defined as the existence of attributes (the left term) in a table R whose values must be a subset of the values of the corresponding attributes (the right term) in another table S. When the right term forms a unique column or a primary key (K) for the table S, the inclusion dependency is key-based (also named a referential integrity restriction, rir). In this case, the left term is a foreign key (FK) in R and the restriction is stated as R[FK] << S[K]. On the contrary, if the right term does not constitute the key of S, the inclusion dependency is non-key-based (simply, an inclusion dependency, id). Ids are expressed as R[X] ⊆ S[Z], with R[X] and S[Z] being the left and right terms respectively. Both rirs and ids are often called referential constraints.
UNIT-4
TRANSACTION MANAGEMENT
Transactions:
Transaction Concept
Transaction State
Implementation of Atomicity and Durability
Concurrent Executions
Serializability
Recoverability
Implementation of Isolation
Transaction Definition in SQL
Testing for Serializability.
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various data items.
A transaction must see a consistent database.
During transaction execution the database may be temporarily inconsistent.
When the transaction is committed, the database must be consistent.
Two main issues to deal with:
Failures of various kinds, such as hardware failures and system crashes
Concurrent execution of multiple transactions
ACID Properties
Atomicity: Either all operations of the transaction are properly reflected in the database
or none are.
Consistency: Execution of a transaction in isolation preserves the consistency of the
database.
Isolation: Although multiple transactions may execute concurrently, each transaction
must be unaware of other concurrently executing transactions. Intermediate transaction
results must be hidden from other concurrently executed transactions.
That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
Durability: After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
Example of Fund Transfer
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A − 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Isolation requirement: if, between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).
Isolation can be ensured trivially by running transactions serially, that is one after the other. However, executing multiple transactions concurrently has significant benefits, as we will see.
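The six-step transfer can be sketched as an actual transaction. The following uses Python's sqlite3 (the account table and the insufficient-funds rule are made-up illustrations): either both writes commit, or a failure rolls both back, so the sum A + B is preserved either way.

```python
import sqlite3

# Two accounts; the transfer must keep the sum of balances constant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 200)])
conn.commit()

def transfer(amount):
    try:
        conn.execute("UPDATE account SET balance = balance - ?"
                     " WHERE name = 'A'", (amount,))
        bal_a = conn.execute("SELECT balance FROM account"
                             " WHERE name = 'A'").fetchone()[0]
        if bal_a < 0:
            raise ValueError("insufficient funds")  # consistency check
        conn.execute("UPDATE account SET balance = balance + ?"
                     " WHERE name = 'B'", (amount,))
        conn.commit()                # both writes become durable together
    except ValueError:
        conn.rollback()              # atomicity: undo the partial update

transfer(50)    # succeeds: A = 50, B = 250
transfer(500)   # fails and rolls back: balances unchanged
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```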
Transaction States
Active, the initial state; the transaction stays in this state while it is executing
Partially committed, after the final statement has been executed.
Failed, after the discovery that normal execution can no longer proceed.
Aborted, after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted:
restart the transaction only if no internal logical error
kill the transaction
Committed, after successful completion.
Concurrent Executions
Multiple transactions are allowed to run concurrently in the system. Advantages are:
increased processor and disk utilization, leading to better transaction
throughput: one transaction can be using the CPU while another is reading from
or writing to the disk
reduced average response time for transactions: short transactions need not
wait behind long ones.
Concurrency control schemes are mechanisms to achieve isolation, i.e., to control the interaction among the concurrent transactions in order to prevent them from destroying the consistency of the database.
Schedules
Schedules are sequences that indicate the chronological order in which instructions of concurrent transactions are executed.
A schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction.
Example Schedules
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The
following is a serial schedule in which T1 is followed by T2.
Schedule 1
Let T1 and T2 be the transactions defined previously. The following schedule is not a
serial schedule, but it is equivalent to Schedule 1.
Schedule 2
In both of the above schedules, the sum A + B is preserved.
The following concurrent schedule does not preserve the value of the sum A + B.
Serializability
Basic Assumption Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.
Conflict Serializability
Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there
exists some item Q accessed by both li and lj, and at least one of these instructions
wrote Q.
1. li = read(Q), lj = read(Q). They don't conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
T3        T4
read(Q)
          write(Q)
write(Q)
We are unable to swap instructions in the above schedule to obtain either the serial
schedule <T3, T4>, or the serial schedule <T4, T3>.
Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2
follows T1, by series of swaps of non-conflicting instructions. Therefore Schedule 3 is
conflict serializable.
Schedule 3
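The swap-based argument above can be automated with a precedence-graph test; this Python sketch (not from the source) adds an edge Ti → Tj for each conflicting pair of operations and declares the schedule conflict serializable exactly when the graph is acyclic.

```python
# A schedule is a list of (txn, op, item) triples, op in {"R", "W"}.
def conflict_serializable(schedule):
    # Edge (t1, t2): some operation of t1 conflicts with a later one of t2.
    edges = set()
    for i, (t1, op1, q1) in enumerate(schedule):
        for (t2, op2, q2) in schedule[i + 1:]:
            if t1 != t2 and q1 == q2 and (op1 == "W" or op2 == "W"):
                edges.add((t1, t2))
    txns = {t for (t, _, _) in schedule}

    def reachable(a, b, seen=()):
        return any(y == b or (y not in seen and reachable(y, b, seen + (y,)))
                   for (x, y) in edges if x == a)

    # Serializable iff no transaction can reach itself (no cycle).
    return not any(reachable(t, t) for t in txns)

# The T3/T4 schedule above: r3(Q), w4(Q), w3(Q) gives a cycle T3 <-> T4.
bad = [("T3", "R", "Q"), ("T4", "W", "Q"), ("T3", "W", "Q")]
# A serial schedule has edges in one direction only.
good = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
print(conflict_serializable(bad), conflict_serializable(good))
```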
View Serializability
Let S and S′ be two schedules with the same set of transactions. S and S′ are view
equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value
was produced by transaction Tj (if any), then transaction Ti must in schedule S′ also read
the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q)
operation in schedule S must perform the final write(Q) operation in schedule S′.
As can be seen, view equivalence is based purely on reads and writes.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Schedule 9 below is a schedule that is view serializable but not conflict serializable.
Schedule 9
Every view serializable schedule that is not conflict serializable has blind writes.
Schedule 8 given below produces the same outcome as the serial schedule <T1, T5>, yet is
not conflict equivalent or view equivalent to it.
Schedule 8
Determining such equivalence requires analysis of operations other than read and write.
Recoverability
Recoverable schedule: if a transaction Tj reads a data item previously written by a
transaction Ti, then the commit operation of Ti must appear before the commit operation of Tj.
The following schedule is not recoverable if T9 commits immediately after the read:
Schedule 11
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent
database state. Hence the database must ensure that schedules are recoverable.
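The recoverability condition above can be checked directly from a schedule. A sketch, where the tuple encoding of schedules is an assumption for illustration:

```python
def is_recoverable(schedule):
    """schedule: list of (txn, action, item) triples, where action is
    'read', 'write', or 'commit' (item is None for commits)."""
    commit_pos = {t: i for i, (t, a, _) in enumerate(schedule) if a == "commit"}
    last_writer = {}
    for t, a, q in schedule:
        if a == "write":
            last_writer[q] = t
        elif a == "read" and q in last_writer and last_writer[q] != t:
            wi = last_writer[q]
            # the writer must commit before the reader commits
            if t in commit_pos and (wi not in commit_pos
                                    or commit_pos[wi] > commit_pos[t]):
                return False
    return True
```

With this encoding, the Schedule 11 situation (T9 reads T8's write and commits while T8 has not committed) is flagged as not recoverable.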
Cascading rollback: a single transaction failure leads to a series of transaction
rollbacks. Consider the following schedule, where none of the transactions has yet
committed (so the schedule is recoverable):
Implementation of Isolation
Schedules must be conflict or view serializable, and recoverable, for the sake of
database consistency, and preferably cascadeless.
A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
Concurrency-control schemes trade off the amount of concurrency they allow against
the amount of overhead they incur.
Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
Commit work commits current transaction and begins a new one.
Rollback work causes current transaction to abort.
Levels of consistency specified by SQL-92:
Serializable (the default)
Repeatable read
Read committed
Read uncommitted
Example 1
[Flattened schedule: transactions T2, T3, T4 and T5 issue read and write operations on data items X, Y, Z, U, V and W; the precedence graph of this schedule is discussed below.]
Cycle-detection algorithms exist which take order n^2 time, where n is the number of
vertices in the graph. (Better algorithms take order n + e, where e is the number of edges.)
If precedence graph is acyclic, the serializability order can be obtained by a topological
sorting of the graph. This is a linear order consistent with the partial order of the graph.
For example, a serializability order for the schedule above would be T5 → T1 → T3 → T2 → T4.
Testing a schedule for view serializability is NP-complete; thus the existence of an
efficient algorithm is unlikely.
However practical algorithms that just check some sufficient conditions for view
serializability can still be used.
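When the precedence graph is acyclic, a topological sort yields the serializability order described above. A sketch using Kahn's algorithm (transaction and edge encodings are illustrative):

```python
def serializability_order(transactions, edges):
    """Topological sort of the precedence graph; returns None if the
    graph has a cycle (i.e., the schedule is not conflict serializable)."""
    indeg = {t: 0 for t in transactions}
    out = {t: [] for t in transactions}
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    ready = sorted(t for t in transactions if indeg[t] == 0)
    order = []
    while ready:
        u = ready.pop(0)        # take a transaction with no predecessors
        order.append(u)
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order if len(order) == len(transactions) else None
```

For a graph with edges T5→T1, T1→T3, T3→T2, T2→T4, this returns the order T5, T1, T3, T2, T4.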
Example of a transaction performing locking:
T2: lock-S(A);
read(A);
unlock(A);
lock-S(B);
read(B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability: if A and B get updated
in between the read of A and the read of B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and
releasing locks. Locking protocols restrict the set of possible schedules.
Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to
release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its
lock on A.
Such a situation is called a deadlock.
To handle a deadlock, one of T3 or T4 must be rolled back and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary
evil.
Starvation is also possible if concurrency control manager is badly designed. For
example:
A transaction may be waiting for an X-lock on an item, while a sequence of
other transactions request and are granted an S-lock on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent starvation.
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj
that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Lock Conversions
Two-phase locking with lock conversions:
First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the programmer to insert the
various locking instructions.
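The phase rules above can be sketched as a small rule-checker. This is an illustrative sketch, not a real lock manager: it only tracks a single transaction's growing/shrinking phases and its lock conversions; all names are assumptions.

```python
class TwoPhaseTxn:
    """Checks the two-phase rule with conversions: acquisitions and
    upgrades only in the growing phase, downgrades only afterwards."""
    def __init__(self):
        self.locks = {}          # item -> 'S' or 'X'
        self.shrinking = False   # becomes True at the first release

    def lock(self, item, mode):
        # acquire a new lock, or upgrade S -> X (growing phase only)
        if self.shrinking:
            raise RuntimeError("cannot acquire or upgrade locks in the shrinking phase")
        if self.locks.get(item) == "X" and mode == "S":
            raise RuntimeError("downgrading is only allowed in the shrinking phase")
        self.locks[item] = mode

    def unlock(self, item):
        self.shrinking = True    # first release starts phase two
        del self.locks[item]

    def downgrade(self, item):
        # convert X -> S; permitted only once the shrinking phase begins
        self.shrinking = True
        assert self.locks.get(item) == "X", "can only downgrade an X lock"
        self.locks[item] = "S"
```

A transaction that upgrades A from S to X, releases it, and then tries to lock B violates the protocol and is rejected.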
if Ti has a lock-X on D
  then
    write(D)
  else begin
    if necessary, wait until no other transaction has any lock on D;
    if Ti has a lock-S on D
      then
        upgrade lock on D to lock-X
      else
        grant Ti a lock-X on D;
    write(D)
  end;
All locks are released after commit or abort
Implementation of Locking
A Lock manager can be implemented as a separate process to which transactions send
lock and unlock requests
The lock manager replies to a lock request by sending a lock grant messages (or a
message asking the transaction to roll back, in case of a deadlock)
The requesting transaction waits until its request is answered
The lock manager maintains a data structure called a lock table to record granted locks
and pending requests
The lock table is usually implemented as an in-memory hash table indexed on the name
of the data item being locked
Lock Table
Black rectangles indicate granted locks, white ones indicate waiting requests
Lock table also records the type of lock granted or requested
New request is added to the end of the queue of requests for the data item, and granted
if it is compatible with all earlier locks
Unlock requests result in the request being deleted, and later requests are checked to see
if they can now be granted
If transaction aborts, all waiting or granted requests of the transaction are deleted
lock manager may keep a list of locks held by each transaction, to implement
this efficiently
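The lock table described above can be sketched as a hash table of FIFO request queues. Class and field names are illustrative assumptions; each queue entry records the transaction, the requested mode, and whether the request has been granted.

```python
from collections import defaultdict

class LockTable:
    """In-memory hash table indexed on the data-item name; each entry
    is a FIFO queue of [txn, mode, granted] requests."""
    def __init__(self):
        self.table = defaultdict(list)

    def _regrant(self, item):
        # grant requests in FIFO order while they remain compatible
        queue = self.table[item]
        for i, req in enumerate(queue):
            prefix = queue[:i]
            ok = not prefix or (req[1] == "S" and all(m == "S" for _, m, _ in prefix))
            if ok:
                req[2] = True
            else:
                break            # later requests keep waiting (FIFO)

    def request(self, txn, item, mode):
        # new request goes to the end of the queue for that item
        self.table[item].append([txn, mode, False])
        self._regrant(item)
        return self.table[item][-1][2]   # True if granted immediately

    def release(self, txn, item):
        # delete the transaction's entries, then re-check waiters
        self.table[item] = [r for r in self.table[item] if r[0] != txn]
        self._regrant(item)
```

Two shared requests on A are granted together; a subsequent exclusive request waits until both are released.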
Graph-Based Protocols
Graph-based protocols are an alternative to two-phase locking
Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.
If di → dj, then any transaction accessing both di and dj must access di before
accessing dj.
This implies that the set D may now be viewed as a directed acyclic graph, called a
database graph.
The tree-protocol is a simple kind of graph protocol.
Tree Protocol
Only exclusive locks are allowed.
The first lock by Ti may be on any data item. Subsequently, a data item Q can be locked by
Ti only if the parent of Q is currently locked by Ti.
Data items may be unlocked at any time.
The tree protocol ensures conflict serializability as well as freedom from deadlock.
Unlocking may occur earlier in the tree-locking protocol than in the two-phase locking
protocol.
shorter waiting times, and increase in concurrency
Because the protocol is deadlock-free, no rollbacks are required.
However, the abort of a transaction can still lead to cascading rollbacks.
In addition, in the tree-locking protocol, a transaction may have to lock data items that it
does not access:
increased locking overhead, and additional waiting time
potential decrease in concurrency
Schedules not possible under two-phase locking are possible under the tree protocol, and
vice versa.
Timestamp-Based Protocols:
Each transaction is issued a timestamp when it enters the system. If an old transaction
Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that
TS(Ti) <TS(Tj).
The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
In order to assure such behavior, the protocol maintains for each data item Q two timestamp
values:
W-timestamp(Q) is the largest time-stamp of any transaction that executed
write(Q) successfully.
R-timestamp(Q) is the largest time-stamp of any transaction that executed
read(Q) successfully.
The timestamp ordering protocol ensures that any conflicting
operations are executed in timestamp order.
Suppose a transaction Ti issues a read(Q):
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q
that was already overwritten. Hence, the read operation is
rejected, and Ti is rolled back.
2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is
executed, and R-timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).
Suppose that transaction Ti issues write(Q).
If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed
previously, and the system assumed that that value would never be produced. Hence,
the write operation is rejected, and Ti is rolled back.
If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
Hence, this write operation is rejected, and Ti is rolled back.
Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
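The read and write tests above can be sketched directly. The class below is an illustrative model: it returns False wherever the protocol would roll the transaction back, and restart logic is omitted.

```python
class TimestampOrdering:
    """Timestamp-ordering checks for read(Q) and write(Q)."""
    def __init__(self):
        self.r_ts = {}   # R-timestamp(Q)
        self.w_ts = {}   # W-timestamp(Q)

    def read(self, ts, q):
        if ts < self.w_ts.get(q, 0):
            return False                 # Q already overwritten: roll back
        self.r_ts[q] = max(self.r_ts.get(q, 0), ts)
        return True

    def write(self, ts, q):
        if ts < self.r_ts.get(q, 0) or ts < self.w_ts.get(q, 0):
            return False                 # value needed earlier, or obsolete: roll back
        self.w_ts[q] = ts
        return True
```

For instance, once a transaction with timestamp 2 has read Q, a write by timestamp 1 is rejected; and after timestamp 3 writes Q, a read by timestamp 2 is rejected.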
Example use of the protocol: a partial schedule for several data items for transactions
with timestamps 1, 2, 3, 4, 5:
[Flattened schedule: transactions T1 through T5 issue read and write operations on data items X, Y and Z; two of the operations violate the timestamp-ordering tests, causing their transactions to abort.]
The protocol guarantees serializability, since all arcs in the precedence graph are of the form:
transaction with smaller timestamp → transaction with larger timestamp.
Thus, there will be no cycles in the precedence graph.
The timestamp protocol ensures freedom from deadlock, as no transaction ever waits.
But the schedule may not be cascade-free, and may not even be recoverable.
Validation-Based Protocol
Execution of transaction Ti is done in three phases:
1. Read and execution phase: transaction Ti writes only to
temporary local variables.
2. Validation phase: transaction Ti performs a "validation test"
to determine whether its local variables can be written to the database without violating
serializability.
3. Write phase: if Ti is validated, the updates are applied to the
database; otherwise, Ti is rolled back.
The three phases of concurrently executing transactions can be interleaved, but each transaction must go through the three phases in that order.
Justification: either the first condition is satisfied, and there is no overlapped execution, or
the second condition is satisfied and
1. the writes of Tj do not affect reads of Ti, since they occur after Ti
has finished its reads.
2. the writes of Ti do not affect reads of Tj, since Tj does not read
any item written by Ti.
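The validation test can be sketched following the justification above: an earlier transaction Ti is acceptable to Tj if it finished before Tj started, or if its write set does not intersect Tj's read set. The dictionary encoding and field names are assumptions for illustration, and the overlap condition is simplified:

```python
def validate(tj, earlier):
    """Return True if Tj passes validation against all earlier
    transactions; each transaction is a dict with 'start', 'finish',
    'reads', and 'writes' fields (illustrative names)."""
    for ti in earlier:
        if ti["finish"] < tj["start"]:
            continue                   # no overlapped execution at all
        if ti["writes"] & tj["reads"]:
            return False               # Tj may have read a stale value
    return True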
Schedule Produced by Validation
Example of a schedule produced using validation:
T1: read(B); read(A); (validate); display(A + B)
T2: read(B); B := B - 50; read(A); A := A + 50; (validate); write(B); write(A)
Multiple Granularity
Allow data items to be of various sizes and define a hierarchy of data granularities,
where the small granularities are nested within larger ones
Can be represented graphically as a tree (but don't confuse it with the tree-locking protocol)
When a transaction locks a node in the tree explicitly, it implicitly locks all the node's
descendants in the same mode.
Granularity of locking (level in tree where locking is done):
fine granularity (lower in tree): high concurrency, high locking overhead
coarse granularity (higher in tree): low locking overhead, low concurrency
[Lock compatibility matrix for the modes IS, IX, S, SIX and X.]
Multiple Granularity Locking Scheme
Transaction Ti can lock a node Q, using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may be locked in
any mode.
3. A node Q can be locked by Ti in S or IS mode only if the parent
of Q is currently locked by Ti in either IX or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode only if the
parent of Q is currently locked by Ti in either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node
(that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are
currently locked by Ti.
Observe that locks are acquired in root-to-leaf order, whereas they are released in leaf-to-root order.
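The compatibility matrix and rules 3 and 4 can be written down as small lookup tables. The dictionaries below encode the standard matrix for IS, IX, S, SIX and X; helper names are illustrative:

```python
# Which requested modes are compatible with a lock already held.
COMPAT = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

def compatible(held, requested):
    return requested in COMPAT[held]

# Rules 3 and 4: parent modes that permit locking a child in a mode.
PARENT_NEEDED = {
    "S":   {"IX", "IS"},
    "IS":  {"IX", "IS"},
    "X":   {"IX", "SIX"},
    "IX":  {"IX", "SIX"},
    "SIX": {"IX", "SIX"},
}

def may_lock_child(parent_mode, child_mode):
    return parent_mode in PARENT_NEEDED[child_mode]
```

For example, IS on the parent suffices to take S on a child, but X on a child requires IX or SIX on the parent.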
Recovery System
Failure Classification
Storage Structure
Recovery and Atomicity
Log-Based Recovery
Shadow Paging
Recovery With Concurrent Transactions
Buffer Management
Failure with Loss of Nonvolatile Storage
Advanced Recovery Techniques
ARIES Recovery Algorithm
Remote Backup Systems
Failure Classification
Transaction failure:
Logical errors: transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to
an error condition (e.g., deadlock)
System crash: a power failure or other hardware or software failure causes the system
to crash.
Fail-stop assumption: non-volatile storage contents are assumed to not be
corrupted by system crash
Database systems have numerous integrity checks to prevent corruption
of disk data
Disk failure: a head crash or similar disk failure destroys all or part of disk storage
Destruction is assumed to be detectable: disk drives use checksums to detect
failures
Recovery Algorithms
Recovery algorithms are techniques to ensure database consistency and transaction
atomicity and durability despite failures
Recovery algorithms have two parts
Actions taken during normal transaction processing to ensure enough information
exists to recover from failures
Actions taken after a failure to recover the database contents to a state that ensures
atomicity, consistency and durability
Storage Structure
Volatile storage:
does not survive system crashes
examples: main memory, cache memory
Nonvolatile storage:
survives system crashes
examples: disk, tape, flash memory
Stable-storage implementation: maintain multiple copies of each block on separate disks; copies can be at remote sites to protect against disasters such as fire or flooding.
Failure during data transfer can still result in inconsistent copies: Block transfer can
result in
Successful completion
Partial failure: destination block has incorrect information
Total failure: destination block was never updated
Protecting storage media from failure during data transfer (one solution):
Execute output operation as follows (assuming two copies of each block):
Write the information onto the first physical block.
When the first write successfully completes, write the same information
onto the second physical block.
The output is completed only after the second write successfully
completes.
Copies of a block may differ due to failure during output operation. To recover from
failure:
First find inconsistent blocks:
Expensive solution: Compare the two copies of every disk block.
Better solution:
Record in-progress disk writes on non-volatile storage (Non-volatile
RAM or special area of disk).
Use this information during recovery to find blocks that may be
inconsistent, and only compare copies of these.
Used in hardware RAID systems
If either copy of an inconsistent block is detected to have an error (bad
checksum), overwrite it by the other copy. If both have no error, but are
different, overwrite the second block by the first block.
Data Access
Physical blocks are those blocks residing on the disk.
Buffer blocks are the blocks residing temporarily in main memory.
Block movements between disk and main memory are initiated through the following
two operations:
input(B) transfers the physical block B to main memory.
output(B) transfers the buffer block B to the disk, and replaces the appropriate
physical block there.
Each transaction Ti has its private work area in which local copies of all data items
accessed and updated by it are kept.
Ti's local copy of a data item X is called xi.
We assume, for simplicity, that each data item fits in, and is stored inside, a single
block.
A transaction transfers data items between system buffer blocks and its private work area
using the following operations:
read(X) assigns the value of data item X to the local variable xi.
write(X) assigns the value of local variable xi to data item X in the buffer
block.
Both these operations may necessitate the issue of an input(BX) instruction
before the assignment, if the block BX in which X resides is not already in
memory.
Transactions
Perform read(X) while accessing X for the first time;
All subsequent accesses are to the local copy.
After last access, transaction executes write(X).
output(BX) need not immediately follow write(X). System can perform the output
operation when it deems fit.
We consider two approaches:
log-based recovery, and
shadow paging
We assume (initially) that transactions run serially, that is, one after the other.
Log-Based Recovery
A log is kept on stable storage.
The log is a sequence of log records, and maintains a record of update activities
on the database.
When transaction Ti starts, it registers itself by writing a <Ti start> log record.
A write(X) operation results in a log record <Ti, X, V> being written, where V is the
new value for X.
Note: the old value is not needed for this scheme.
The write is not performed on X at this time, but is deferred.
When Ti partially commits, <Ti commit> is written to the log.
Finally, the log records are read and used to actually execute the previously deferred
writes.
During recovery after a crash, a transaction needs to be redone if and only if both
<Ti start> and <Ti commit> are there in the log.
Redoing a transaction Ti (redo Ti) sets the value of all data items updated by the
transaction to the new values.
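The redo-if-committed rule of deferred modification can be sketched as follows; the tuple encoding of the log is an assumption for illustration:

```python
def recover_deferred(log):
    """Deferred-modification recovery: redo Ti iff both <Ti start> and
    <Ti commit> appear. Records are ('start', T), ('write', T, X, new),
    or ('commit', T). Returns the reconstructed data items."""
    committed = {r[1] for r in log if r[0] == "commit"}
    db = {}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, x, v = rec
            db[x] = v          # redo: set the item to its new value
    return db
```

In the T0/T1 example, T0's writes of A and B are redone, while T1's write of C is ignored because no <T1 commit> record exists.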
Crashes can occur while
the transaction is executing the original updates, or
while recovery action is being taken
Example transactions T0 and T1 (T0 executes before T1):
T0: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B)
T1: read(C); C := C - 100; write(C)
Below we show the log as it appears at three instances of time.
Log                      Output
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                         A = 950
                         B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                         C = 600
                         BB, BC
<T1 commit>
                         BA
Note: BX denotes the block containing X.
Recovery procedure has two operations instead of one:
undo(Ti) restores the value of all data items updated by Ti to their old values,
going backwards from the last log record for Ti
redo(Ti) sets the value of all data items updated by Ti to the new values, going
forward from the first log record for Ti
Both operations must be idempotent:
that is, even if the operation is executed multiple times, the effect is the same as
if it were executed once
This is needed since operations may get re-executed during recovery
When recovering after failure:
Transaction Ti needs to be undone if the log contains the record
<Ti start>, but does not contain the record <Ti commit>.
Transaction Ti needs to be redone if the log contains both the record <Ti start>
and the record <Ti commit>.
Undo operations are performed first, then redo operations.
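The undo-then-redo procedure can be sketched on an in-memory database; log records carry both old and new values, as in the immediate-modification scheme. All encodings are illustrative assumptions:

```python
def recover_immediate(log, db):
    """Undo transactions with <start> but no <commit> (backwards,
    restoring old values), then redo committed ones (forwards,
    applying new values). Records are ('start', T),
    ('write', T, X, old, new), or ('commit', T)."""
    started = {r[1] for r in log if r[0] == "start"}
    committed = {r[1] for r in log if r[0] == "commit"}
    # undo first, scanning the log backwards
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] in started - committed:
            db[rec[2]] = rec[3]          # restore the old value
    # then redo, scanning the log forwards
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            db[rec[2]] = rec[4]          # apply the new value
    return db
```

Both passes are idempotent: running the function twice on the same log and database yields the same result as running it once.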
Immediate DB Modification Recovery Example
Below we show the log as it appears at three instances of time.
3. Need only consider the part of the log following the above start record. The earlier part of
the log can be ignored during recovery, and can be erased whenever desired.
4. For all transactions (starting from Ti or later) with no <Ti commit>, execute
undo(Ti). (Done only in the case of immediate modification.)
5. Scanning forward in the log, for all transactions starting from Ti or later with a
<Ti commit>, execute redo(Ti).
Example of Checkpoints
Shadow Paging
Shadow paging is an alternative to log-based recovery; this scheme is useful if
transactions execute serially.
Idea: maintain two page tables during the lifetime of a transaction: the current page
table, and the shadow page table.
Store the shadow page table in nonvolatile storage, so that the state of the database
prior to transaction execution may be recovered.
The shadow page table is never modified during execution.
To start with, both page tables are identical. Only the current page table is used for
data item accesses during execution of the transaction.
Whenever any page is about to be written for the first time:
1. A copy of this page is made onto an unused page.
2. The current page table is then made to point to the copy.
3. The update is performed on the copy.
Sample Page Table
To commit a transaction :
1. Flush all modified pages in main memory to disk
2. Output current page table to disk
3. Make the current page table the new shadow page table, as follows:
keep a pointer to the shadow page table at a fixed (known) location on disk.
to make the current page table the new shadow page table, simply update the
pointer to point to the current page table on disk.
Once the pointer to the shadow page table has been written, the transaction is committed.
No recovery is needed after a crash: new transactions can start right away, using the
shadow page table.
Pages not pointed to from the current/shadow page table should be freed (garbage
collected).
Advantages of shadow-paging over log-based schemes
no overhead of writing log records
recovery is trivial
Disadvantages :
Copying the entire page table is very expensive
Can be reduced by using a page table structured like a B+-tree
No need to copy entire tree, only need to copy paths in the tree
that lead to updated leaf nodes
Commit overhead is high even with above extension
Need to flush every updated page, and page table
Data gets fragmented (related pages get separated on disk)
After every transaction completion, the database pages containing old versions
of modified data need to be garbage collected
Hard to extend algorithm to allow transactions to run concurrently
Easier to extend log based schemes
Checkpoints are performed as before, except that the checkpoint log record is now of
the form <checkpoint L>, where L is the list of transactions active at the time of the
checkpoint, since several transactions may be active when a checkpoint is performed.
When recovering after a crash, the system scans the log backwards from the most recent
record, stopping when a <checkpoint L> record is found.
During a forward scan from the checkpoint record, perform redo for each log record that belongs to a
transaction on the redo-list
Example of Recovery
Go over the steps of the recovery algorithm on the following log:
<T0 start>
<T0, A, 0, 10>
<T0 commit>
<T1 start>
<T1, B, 0, 10>
<T2 start>
<T2, C, 0, 10>
<T2, C, 10, 20>
<checkpoint {T1, T2}>
<T3 start>
<T3, A, 10, 20>
<T3, D, 0, 10>
<T3 commit>
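One way to work through this example is to compute the redo- and undo-lists mechanically. The sketch below encodes the log as tuples (an illustrative assumption) and applies the rule that transactions in L, or started after the checkpoint, are redone if committed and undone otherwise:

```python
def redo_undo_lists(log):
    """Analysis for recovery with a <checkpoint L> record. Records are
    ('start', T), ('write', T, X, old, new), ('commit', T), or
    ('checkpoint', L) where L is the active-transaction list."""
    ckpt = max(i for i, r in enumerate(log) if r[0] == "checkpoint")
    candidates = set(log[ckpt][1])          # transactions active at checkpoint
    committed = set()
    for rec in log[ckpt + 1:]:              # scan forward from the checkpoint
        if rec[0] == "start":
            candidates.add(rec[1])
        elif rec[0] == "commit":
            committed.add(rec[1])
    redo = sorted(candidates & committed)
    undo = sorted(candidates - committed)
    return redo, undo
```

On the log above this yields redo-list [T3] and undo-list [T1, T2]: T0 committed before the checkpoint and is ignored, T3 committed after it, and T1 and T2 never committed.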
Log Record Buffering
Log record buffering: log records are buffered in main memory, instead of being
output directly to stable storage.
Log records are output to stable storage when a block of log records in the
buffer is full, or when a log force operation is executed.
Log force is performed to commit a transaction by forcing all its log records (including
the commit record) to stable storage.
Several log records can thus be output using a single output operation, reducing the I/O
cost.
The rules below must be followed if log records are buffered:
Log records are output to stable storage in the order in which they are created.
Transaction Ti enters the commit state only when the log record
<Ti commit> has been output to stable storage.
Before a block of data in main memory is output to the database, all log records
pertaining to data in that block must have been output to stable storage.
This rule is called the write-ahead logging or WAL rule.
Strictly speaking, WAL only requires undo information to be
output.
Database Buffering
Database maintains an in-memory buffer of data blocks
When a new block is needed, if buffer is full an existing block needs to be
removed from buffer
If the block chosen for removal has been updated, it must be output to disk
As a result of the write-ahead logging rule, if a block with uncommitted updates is
output to disk, log records with undo information for the updates are output to the log
on stable storage first.
No updates should be in progress on a block when it is output to disk. Can be ensured
as follows.
Before writing a data item, transaction acquires exclusive lock on block
containing the data item
Lock can be released once the write is completed.
Such locks held for short duration are called latches.
Before a block is output to disk, the system acquires an exclusive latch on the
block
Ensures no update can be in progress on the block
Buffer Management
Database buffer can be implemented either
in an area of real main-memory reserved for the database, or
in virtual memory
Implementing buffer in reserved main-memory has drawbacks:
Memory is partitioned before-hand between database buffer and applications,
limiting flexibility.
Needs may change, and although operating system knows best how memory
should be divided up at any time, it cannot change the partitioning of memory.
Database buffers are generally implemented in virtual memory in spite of some
drawbacks:
When operating system needs to evict a page that has been modified, to make
space for another page, the page is written to swap space on disk.
When database decides to write buffer page to disk, buffer page may be in swap
space, and may have to be read from swap space on disk and output to the
database on disk, resulting in extra I/O!
Known as dual paging problem.
Ideally when swapping out a database buffer page, operating system should pass
control to database, which in turn outputs page to database instead of to swap
space (making sure to output log records first)
Dual paging can thus be avoided, but common operating systems do not
support such functionality.
To dump the contents of the database to stable storage:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all buffer blocks onto the disk.
3. Copy the contents of the database to stable storage.
4. Output a record <dump> to the log on stable storage.
To recover from a disk failure:
restore the database from the most recent dump.
Consult the log and redo all transactions that committed after the dump
Can be extended to allow transactions to be active during the dump; this is known as a fuzzy dump or online dump.
Operation logging is done as follows:
When the operation starts, log <Ti, Oj, operation-begin>. Here Oj is a unique
identifier of the operation instance.
While the operation is executing, normal log records with physical redo and
physical undo information are logged.
When the operation completes, <Ti, Oj, operation-end, U> is logged, where U
contains the information needed to perform a logical undo.
If crash/rollback occurs before operation completes:
the operation-end log record is not found, and
the physical undo information is used to undo operation.
If crash/rollback occurs after the operation completes:
the operation-end log record is found, and in this case
logical undo is performed using U; the physical undo information for the
operation is ignored.
Redo of operation (after crash) still uses physical redo information.
Rollback of transaction Ti is done as follows:
1. Scan the log backwards.
2. If a log record <Ti, Oj, operation-end, U> is found, perform the logical undo using
the information U, log <Ti, Oj, operation-abort>, and then skip all preceding log
records for Ti until the record <Ti, Oj, operation-begin> is found.
3. If a redo-only record is found, ignore it.
4. If a <Ti, Oj, operation-abort> record is found, skip all preceding log records for Ti
until the record <Ti, Oj, operation-begin> is found.
undo-list contains transactions that are incomplete, that is, have neither commit nor abort log records.
When <Ti start> is found for a transaction Ti in undo-list, write a <Ti abort> log
record.
Stop the scan when <Ti start> records have been found for all Ti in undo-list.
This undoes the effects of incomplete transactions (those with neither commit nor
abort log records). Recovery is now complete.
Checkpointing is done as follows:
1. Output all log records in memory to stable storage
2. Output to disk all modified buffer blocks
3. Output to log on stable storage a < checkpoint L> record.
Transactions are not allowed to perform any actions while checkpointing is in progress.
Fuzzy checkpointing allows transactions to progress while the most time-consuming
parts of checkpointing are in progress.
Fuzzy checkpointing is done as follows:
Temporarily stop all updates by transactions
Write a <checkpoint L> log record and force the log to stable storage
Note list M of modified buffer blocks
Now permit transactions to proceed with their actions
Output to disk all modified buffer blocks in list M
blocks should not be updated while being output
Follow WAL: all log records pertaining to a block must be output before
the block is output
Store a pointer to the checkpoint record in a fixed position last_checkpoint on
disk
When recovering using a fuzzy checkpoint, start scan from the checkpoint record
pointed to by last_checkpoint
Log records before last_checkpoint have their updates reflected in database on
disk, and need not be redone.
Incomplete checkpoints, where system had crashed while performing
checkpoint, are handled safely
ARIES uses a number of techniques:
Stores LSNs in pages to identify what updates have already been
applied to a database page
Physiological redo
Dirty page table to avoid unnecessary redos during recovery
Fuzzy checkpointing that only records information about dirty pages, and does
not require dirty pages to be written out at checkpoint time
ARIES Optimizations
Physiological redo
Affected page is physically identified, action within page can be
logical
Used to reduce logging overheads
ARIES Data Structures
Log sequence number (LSN) identifies each log record
Must be sequentially increasing
Typically an offset from beginning of log file to allow fast access
Easily extended to handle multiple log files
Each page contains a PageLSN which is the LSN of the last log record whose effects
are reflected on the page
To update a page:
X-latch the page, and write the log record
Update the page
Record the LSN of the log record in PageLSN
Unlock the page
A page flush to disk S-latches the page
Thus the page state on disk is operation consistent
Have a field UndoNextLSN to note next (earlier) record to be undone
Records in between would have already been undone
Required to avoid repeated undo of already undone actions
DirtyPageTable
List of pages in the buffer that have been updated
Contains, for each such page
PageLSN of the page
RecLSN is an LSN such that log records before this LSN have
already been applied to the page version on disk
The analysis pass determines which pages were dirty (disk version not up to date) at the time of crash, and
RedoLSN: the LSN from which redo should start
Redo pass:
Repeats history, redoing all actions from RedoLSN
RecLSN and PageLSNs are used to avoid redoing actions already
reflected on page
Undo pass:
Rolls back all incomplete transactions
Transactions whose abort was complete earlier are not undone
Key idea: no need to undo these transactions: earlier undo
actions were logged, and are redone as required
ARIES Recovery: Analysis
Analysis pass
Starts from last complete checkpoint log record
Reads in DirtyPageTable from log record
Sets RedoLSN = min of RecLSNs of all pages in DirtyPageTable
In case no pages are dirty, RedoLSN = the checkpoint record's LSN
Sets undo-list = list of transactions in checkpoint log record
Reads LSN of last log record for each transaction in undo-list from checkpoint
log record
Scans forward from checkpoint
If any log record found for transaction not in undo-list, adds transaction to
undo-list
Whenever an update log record is found
If page is not in DirtyPageTable, it is added with RecLSN set to LSN of
the update log record
If transaction end log record found, delete transaction from undo-list
Keeps track of last log record for each transaction in undo-list
May be needed for later undo
At end of analysis pass:
RedoLSN determines where to start redo pass
RecLSN for each page in DirtyPageTable used to minimize redo work
All transactions in undo-list need to be rolled back
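The analysis pass described above can be sketched as a forward scan from the checkpoint. A hedged illustration: the tuple layout of log records and all names here are assumptions for the sketch, not ARIES data formats.

```python
def analysis_pass(log, checkpoint_lsn, ckpt_dirty_pages, ckpt_active_txns):
    """Rebuild DirtyPageTable and undo-list by scanning forward from the checkpoint.

    log: list of (lsn, txn, kind, page) tuples in LSN order.
    Returns (RedoLSN, DirtyPageTable, undo-list)."""
    dirty = dict(ckpt_dirty_pages)        # page -> RecLSN, from checkpoint record
    undo_list = set(ckpt_active_txns)     # transactions in checkpoint record
    for lsn, txn, kind, page in log:
        if lsn <= checkpoint_lsn:
            continue
        if kind == "update":
            undo_list.add(txn)            # record for txn not in undo-list: add it
            dirty.setdefault(page, lsn)   # page newly dirtied: RecLSN = this LSN
        elif kind == "end":
            undo_list.discard(txn)        # transaction-end record: drop from undo-list
    redo_lsn = min(dirty.values()) if dirty else checkpoint_lsn
    return redo_lsn, dirty, undo_list

# Checkpoint at LSN 10 recorded page P1 dirty since LSN 8 and T1 active:
log = [(11, "T2", "update", "P2"),
       (12, "T1", "end", None),
       (13, "T2", "update", "P1")]
redo_lsn, dirty, undo_list = analysis_pass(log, 10, {"P1": 8}, {"T1"})
# redo_lsn == 8 (min RecLSN); undo_list == {"T2"}; T1 completed, so it is dropped
```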
ARIES Redo Pass
Redo Pass: Repeats history by replaying every action not already reflected in the page on disk,
as follows:
Scans forward from RedoLSN. Whenever an update log record is found:
1. If the page is not in DirtyPageTable or the LSN of the log record is less than the
RecLSN of the page in DirtyPageTable, then skip the log record
2. Otherwise fetch the page from disk. If the PageLSN of the page fetched from
disk is less than the LSN of the log record, redo the log record
NOTE: if either test is negative the effects of the log record have already appeared on the page.
First test avoids even fetching the page from disk!
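The two skip tests can be expressed directly in code. A minimal sketch: `fetch_page_lsn` stands in for reading PageLSN off the on-disk page and is an assumed helper, not a real API.

```python
def should_redo(log_lsn, page_id, dirty_page_table, fetch_page_lsn):
    """Return True iff this update log record must be redone on the page."""
    # Test 1: page not in DirtyPageTable, or record predates its RecLSN ->
    # the effect is already on disk; skip without even fetching the page.
    if page_id not in dirty_page_table or log_lsn < dirty_page_table[page_id]:
        return False
    # Test 2: only now fetch the page; redo only if its PageLSN is older
    # than this log record.
    return fetch_page_lsn(page_id) < log_lsn

dpt = {"P1": 5}                      # P1 dirty, RecLSN = 5
disk_page_lsn = {"P1": 7}.get        # on-disk PageLSN of P1 is 7

should_redo(3, "P1", dpt, disk_page_lsn)   # False: 3 < RecLSN, no fetch needed
should_redo(6, "P1", dpt, disk_page_lsn)   # False: PageLSN 7 already covers LSN 6
should_redo(9, "P1", dpt, disk_page_lsn)   # True: PageLSN 7 < 9, must redo
```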
ARIES Undo Actions
When an undo is performed for an update log record
Generate a CLR containing the undo action performed (actions performed
during undo are logged physically or physiologically).
CLR for record n is noted as n' in the figure below
Set UndoNextLSN of the CLR to the PrevLSN value of the update log record
Arrows indicate UndoNextLSN value
ARIES supports partial rollback
Used e.g. to handle deadlocks by rolling back just enough to release required locks
Figure indicates forward actions after partial rollbacks
records 3 and 4 initially, later 5 and 6, then full rollback
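The UndoNextLSN chaining above can be sketched as a rollback loop: when the scan meets a CLR it jumps directly to the CLR's UndoNextLSN, so work undone in an earlier partial rollback is never undone again. The dict-based record layout is an assumption for illustration only.

```python
def rollback(log, last_lsn):
    """Undo a transaction's updates, emitting CLRs; skips already-undone work.

    Each record: {"lsn", "type", "prev_lsn", "undo_next_lsn"}."""
    by_lsn = {r["lsn"]: r for r in log}
    clrs = []
    lsn = last_lsn
    while lsn is not None:
        rec = by_lsn[lsn]
        if rec["type"] == "CLR":
            lsn = rec["undo_next_lsn"]      # jump past work already undone
        else:
            clrs.append({"type": "CLR", "undoes": lsn,
                         "undo_next_lsn": rec["prev_lsn"]})
            lsn = rec["prev_lsn"]           # continue with the earlier record
    return clrs

# Updates 1, 2, 3; a partial rollback already undid record 3 (CLR at LSN 4):
log = [
    {"lsn": 1, "type": "update", "prev_lsn": None, "undo_next_lsn": None},
    {"lsn": 2, "type": "update", "prev_lsn": 1,    "undo_next_lsn": None},
    {"lsn": 3, "type": "update", "prev_lsn": 2,    "undo_next_lsn": None},
    {"lsn": 4, "type": "CLR",    "prev_lsn": 3,    "undo_next_lsn": 2},
]
clrs = rollback(log, 4)   # full rollback undoes only records 2 and 1
```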
Recovery Independence
Pages can be recovered independently of others
E.g. if some disk pages fail they can be recovered from a backup while
other pages are being used
Savepoints:
Transactions can record savepoints and roll back to a savepoint
Useful for complex transactions
Also used to rollback just enough to release locks on deadlock
Fine-grained locking:
Index concurrency algorithms that permit tuple level locking on indices can be
used
These require logical undo, rather than physical undo, as in advanced
recovery algorithm
Recovery optimizations: For example:
Dirty page table can be used to prefetch pages during redo
Out of order redo is possible:
redo can be postponed on a page being fetched from disk, and
performed when page is fetched.
Meanwhile other log records can continue to be processed
Remote Backup Systems
Detection of failure: Backup site must detect when primary site has failed
to distinguish primary site failure from link failure maintain several
communication links between the primary and the remote backup.
Transfer of control:
To take over control, the backup site first performs recovery using its copy of the
database and all the log records it has received from the primary.
Thus, completed transactions are redone and incomplete transactions are
rolled back.
When the backup site takes over processing, it becomes the new primary
To transfer control back to the old primary when it recovers, the old primary must
receive redo logs from the old backup and apply all updates locally.
Time to recover: To reduce delay in takeover, backup site periodically processes the
redo log records (in effect, performing recovery from previous database state), performs
a checkpoint, and can then delete earlier parts of the log.
Hot-Spare configuration permits very fast takeover:
Backup continually processes redo log records as they arrive, applying the
updates locally.
When failure of the primary is detected the backup rolls back incomplete
transactions, and is ready to process new transactions.
Alternative to remote backup: distributed database with replicated data
Remote backup is faster and cheaper, but less tolerant to failure
Mr. Y SUBBA RAYUDU M. Tech
Page 131
DBMS
Ensure durability of updates by delaying transaction commit until update is logged at
backup; avoid this delay by permitting lower degrees of durability.
One-safe: commit as soon as the transaction's commit log record is written at the primary
Problem: updates may not arrive at backup before it takes over.
Two-very-safe: commit when the transaction's commit log record is written at both the
primary and the backup
Reduces availability since transactions cannot commit if either site fails.
Two-safe: proceed as in two-very-safe if both primary and backup are active. If only
the primary is active, the transaction commits as soon as its commit log record is written
at the primary.
Better availability than two-very-safe; avoids problem of lost transactions in
one-safe.
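The three degrees of durability can be summarized as one decision function. A hedged sketch: the mode names follow the text, but the parameter names are chosen for illustration.

```python
def may_commit(mode, logged_at_primary, logged_at_backup, backup_alive):
    """Return True iff the transaction may commit under the given durability mode."""
    if mode == "one-safe":
        # Commit once the commit record is on the primary's log;
        # risk: updates may not reach the backup before takeover.
        return logged_at_primary
    if mode == "two-very-safe":
        # Commit only when logged at both sites; no commits if either site is down.
        return logged_at_primary and logged_at_backup
    if mode == "two-safe":
        # Both sites up: behave like two-very-safe; backup down: like one-safe.
        if backup_alive:
            return logged_at_primary and logged_at_backup
        return logged_at_primary
    raise ValueError(f"unknown mode: {mode}")
```

For example, a transaction whose commit record has reached only the primary can commit under one-safe, cannot under two-very-safe, and can under two-safe only if the backup is down.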