
MSc.

Information Technology

Database Management System


Semester I

Amity University

A Database Management System is a primary ingredient of modern computing systems. Although database concepts, technology and architectures have been developed and consolidated over the last three decades, many aspects are subject to technological evolution and revolution. Thus, developing study material on this classical and yet continuously evolving field is a great challenge.

Key features: This study material provides a widespread treatment of databases, dealing with the complete syllabus for both an introductory course and an advanced course on databases. It offers a balanced view of concepts, languages and architectures, with concrete reference to current technology and to commercial database management systems (DBMS). It originates from the authors' experience in teaching both UG and PG classes for theory and application.

The study material is composed of seven chapters. Chapters 1 and 2 are designed to expose students to the fundamental principles of database management and RDBMS concepts. They give an idea of how to design a database and develop its schema. Discussion of design techniques starts with the introduction of the elements of the E-R (Entity-Relationship) model and proceeds through a well-defined, staged process from conceptual design to logical design, which produces a relational schema. Chapters 3 and 4 are devoted to advanced concepts, including normalization, functional dependency and the use of the Structured Query Language, required for mastering database technology. Chapter 5 describes the fundamental and advanced concepts of the procedural query language commonly known as PL/SQL, which extends the power of the Structured Query Language. PL/SQL technology is like an engine that executes PL/SQL blocks and subprograms. This engine can be started in the Oracle server or in application development tools such as Oracle Forms, Oracle Reports etc.

Chapters 6 and 7 focus on advanced concepts of database systems, including transaction management, concurrency control techniques, and backup and recovery methods.

Updated Syllabus
Course Contents:
Module I: Introduction to DBMS
Introduction to DBMS, Architecture of DBMS, Components of DBMS, Traditional data models (Network, Hierarchical and Relational), Database users, Database languages, Schemas and instances, Data independence

Module II: Data Modeling
Entity sets, attributes and keys, Relationships (ER), Database modeling using entities, Weak and strong entity types, Enhanced entity-relationship (EER), Entity Relationship Diagrams, Design of an E-R database schema, Object modeling, Specialization and generalization

Module III: Relational Database Model
Basic definitions, Properties of the relational model, Keys, Constraints, Integrity rules, Relational algebra, Relational calculus

Module IV: Relational Database Design
Functional dependencies, Normalization, Normal forms (1st, 2nd, 3rd, BCNF), Lossless decomposition, Join dependencies, 4th & 5th normal forms

Module V: Query Language
SQL components (DDL, DML, DCL), SQL constructs (SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY), Nested tables, Views, Correlated queries, Objects in Oracle

Module VI: PL/SQL
Introduction, Basic block, Structure of a PL/SQL program, Control statements, Exception handling, Cursor concept, Procedures, functions and triggers

Module VII: Database Security and Authorization
Basic security issues, Discretionary access control, Mandatory access control, Statistical database security

Module VIII: Transaction Management and Concurrency Control Techniques
Transaction concept, ACID properties, Schedules and recoverability, Serial and non-serial schedules, Serializability, Concurrency techniques: locking protocols, timestamping protocol, multiversion technique, Deadlock concept - detection and resolution

Module IX: Backup and Recovery
Database recovery techniques based on immediate and deferred update, ARIES recovery algorithm, Shadow pages and write-ahead logging

Text & References:

Text:
Fundamentals of Database Systems, Elmasri & Navathe, Pearson Education, Asia
Database Management System, Leon & Leon, Vikas Publications
Database System Concepts, Korth & Sudarshan, TMH

References:
Introduction to Database Systems, Bipin C. Desai, Galgotia
Oracle 9i: The Complete Reference, Oracle Press

Index:
Chapter 1: Introduction to DBMS and Data Modeling
Chapter 2: Relational Database Model
Chapter 3: Functional Dependency and Normalization
Chapter 4: Structured Query Language
Chapter 5: Procedural Query Language
Chapter 6: Transaction Management & Concurrency Control Techniques
Chapter 7: Database Recovery, Backup & Security

Chapter-1
INTRODUCTION TO DBMS AND DATA MODELING
1. Introductory Concepts
Data: Data is a collection of facts upon which a conclusion is based (information or knowledge has value; data has cost). Data can be represented in terms of numbers, characters, pictures, sounds and figures.

Data item: The smallest named unit of data that has meaning in the real world (examples: last name, locality, STD_Code).

Database: An interrelated collection of data that serves the needs of multiple users within one or more organizations, i.e. an interrelated collection of records of potentially many types.

Database administrator (DBA): A person or group of persons responsible for the effective use of database technology in an organization or enterprise. The DBA is said to be the custodian or owner of the database.

Database Management System (DBMS): A DBMS is a collection of software programs which allows large, structured sets of data to be stored, modified, extracted and manipulated in different ways. A DBMS also provides security features that protect against unauthorized users trying to gain access to confidential information, and prevents data loss in case of a system crash. Depending on specific users' requirements, users are allowed access to either all of the database or a specific database subschema, through the use of passwords. The DBMS is also responsible for the database's integrity, ensuring that no two users are able to update the same record at the same time, as well as preventing duplicate entries, such as two employees being given the same employee number. The following are examples of database applications: 1. Computerized library systems. 2. Automated teller machines. 3. Airline reservation systems.

4. Inventory Management systems.

There are innumerable Database Management System (DBMS) software products available in the market. Some of the most popular ones include Oracle, IBM's DB2, Microsoft Access, Microsoft SQL Server and MySQL. MySQL is one of the most popular database management systems among online businesses. Microsoft Access, another popular DBMS, is not a fully object-oriented system, even though it exhibits certain aspects of object orientation.

Example: A database may contain detailed student information. Certain users may only be allowed access to student names, addresses and phone numbers, while other users may be able to view payment details or marks details of students. Access and change logs can be programmed to add even more security to a database, recording the date, time and details of any user making any alteration to the database.

Furthermore, database management systems employ a query language and report writers to interrogate the database and analyze its data. Queries allow users to search, sort and analyze specific data by granting users efficient access to the required information. For example, one would use a query command to make the system retrieve data regarding all courses of a particular department. The most common query language used to access database systems is the Structured Query Language (SQL).
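As a small illustration, a query of this kind could be written in SQL against a hypothetical COURSE table (the table and column names here are illustrative assumptions, not part of any particular system):

-- List all courses offered by the Computer Science department
SELECT course_code, course_name, credits
FROM course
WHERE department = 'Computer Science'
ORDER BY course_code;

The WHERE clause restricts the rows returned to one department, which is exactly the kind of selective retrieval described above.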

2. Objectives of Database Management:


Data availability: make an integrated collection of data available to a wide variety of users
* at reasonable cost: performance in query and update, eliminate or control data redundancy
* in meaningful format: data definition language, data dictionary
* with easy access: query language (4GL, SQL, forms, windows, menus)

Data integrity: ensure correctness and validity
* primary key constraints / foreign key constraints / check constraints
* concurrency control and multi-user updates
* audit trail

Privacy (the goal) and security (the means)
* schema / sub-schema
* passwords

Management control: DBA - lifecycle control, training, maintenance

Data independence (a relative term): avoids reprogramming of applications, allows easier conversion and reorganization of data
* Physical data independence: application programs are unaffected by changes in the storage structure or physical method of data access.
* Logical data independence: application programs are unaffected by changes in the logical schema.

3. Database Models: Database information normally consists of subjects, such as customers, employees or suppliers, as well as activities such as orders, payments or purchases. This information must be organized into related record types through a process known as database design. The DBMS that is chosen must be able to manage different relationships, which is where database models come in.

3.1 Hierarchical databases organize data under the premise of a basic parent/child relationship. Each parent can have many children, but each child can have only one parent. In hierarchical databases, attributes of specific records are listed under an entity type, and entity types are connected to each other through one-to-many relationships, also known as 1:N mapping. Originally, hierarchical relationships were most commonly used in mainframe systems, but with the advent of increasingly complex relationship systems they have become too restrictive and are thus rarely used in modern databases. If any of the one-to-many relationships is compromised (for example, an employee having more than one manager), the database structure switches from hierarchical to a network.

3.2 Network model: In the network model of a database it is possible for a record to have multiple parents, making the system more flexible compared to the strict single-parent model of the hierarchical database. The model is made to accommodate many-to-many relationships, which allows for a more realistic representation of the relationships between entities. Even though the network database model enjoyed popularity for a short while, it never really lifted off the ground in terms of staging a revolution. It is now rarely used because of the availability of more competitive models that offer the higher flexibility demanded in today's ever-advancing age.

3.3 Relational databases (RDBMS) are completely unique when compared to the aforementioned models, as the design of the records is organized around a set of tables (with unique identifiers) to represent both the data and their relationships. The fields to be used for matching are often indexed in order to speed up the process, and the data can be retrieved and manipulated in a number of ways without the need to reorganize the original database tables. Working under the assumption that file systems (which often use the hierarchical or network models) are not considered databases, the relational database model is the most commonly used system today. While the concepts behind the hierarchical and network database models are older than the relational model, the latter was in fact the first one to be formally defined.

After the relational DBMS soared to popularity, the most recent development in DBMS technology came in the form of the object-oriented database model, which offers more flexibility than the hierarchical, network and relational models put together. Under this model, data exists in the form of objects, which include both the data and the data's behavior. Certain modern information systems contain such convoluted combinations of information that traditional data models (including the RDBMS) remain too restrictive to adequately model this complex data. The object-oriented model also exhibits better cohesion and coupling than prior models, resulting in a database which is not only more flexible and more manageable but also the most able when it comes to modeling real-life processes. However, due to the immaturity of this model, certain problems are bound to arise, some major ones being the lack of an SQL equivalent as well as a lack of standardization. Furthermore, the most common use of the object-oriented model is to have an object point to the child or parent OID (object ID) to be retrieved, leaving many programmers with the impression that the object-oriented model is simply a reincarnation of the network model at best. That is, however, an over-simplification of an innovative technology.

4. Components of a DBMS
The main components of a Database Management System (DBMS) are described below.

4.1. Database Engine: The database engine is the foundation for storing, processing and securing data. The database engine provides controlled access and rapid transaction processing to meet the requirements of the most demanding data-consuming applications within your enterprise. Use the database engine to create relational databases for online transaction processing or online analytical processing. This includes creating tables for storing data, and database objects such as indexes, views and stored procedures for viewing, managing and securing data. You can use SQL Server Management Studio to manage the database objects, and SQL Server Profiler for capturing server events.

4.2. Data dictionary: A data dictionary is a reserved space within a database which is used to store information about the database itself. A data dictionary is a set of tables and views which can only be read and never altered. Most data dictionaries contain different information about the data used in the enterprise. In terms of the database representation of the data, the data dictionary defines all schema objects including views, tables, clusters, indexes, sequences, synonyms, procedures, packages, functions, triggers and many more. This ensures that all these objects follow one standard defined in the dictionary. The data dictionary also defines how much space has been allocated for, and/or is currently in use by, each of the schema objects. A data dictionary is used when finding information about users, objects, schemas and storage structures. Every time a data definition language (DDL) statement is issued, the data dictionary is modified. A data dictionary may contain information such as:
Database design information
Stored SQL procedures
User permissions
User statistics
Database process information
Database growth statistics
Database performance statistics

4.3. Query Processor: A relational database consists of many parts, but at its heart are two major components: the storage engine and the query processor. The storage engine writes data to and reads data from the disk. It manages records, controls concurrency, and maintains log files. The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan. The user or program interacts with the query processor, and the query processor in turn interacts with the storage engine. The query processor isolates the user from the details of execution: the user specifies the result, and the query processor determines how this result is obtained. The query processor components include:
DDL interpreter
DML compiler
Query evaluation engine

4.4. Report writer: Also called a report generator, a report writer is a program, usually part of a database management system, that extracts information from one or more files and presents the information in a specified format. Most report writers allow you to select records that meet certain conditions and to display selected fields in rows and columns. You can also format data into pie charts, bar charts and other diagrams. Once you have created a format for a report, you can save the format specifications in a file and continue reusing it for new data.
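Returning to the data dictionary described in 4.2: in Oracle, for example, parts of the dictionary are exposed as read-only catalog views such as USER_TABLES and USER_TAB_COLUMNS. The sketch below is a hedged illustration of querying them; the view and column names are Oracle-specific and vary between DBMS products.

-- List the tables owned by the current user
SELECT table_name
FROM user_tables;

-- Describe the columns of one table through the dictionary
SELECT column_name, data_type, data_length
FROM user_tab_columns
WHERE table_name = 'STUDENT';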

5. Database Languages
5.1 Data Definition Language (DDL): DDL is used to define the structure of a database. The database structure definition (schema) typically includes the following: defining all data elements; defining data element fields and records; defining the name, field length and field type for each data item; and defining controls for fields that can have only selected values. Typical DDL operations (with their respective keywords in the Structured Query Language, SQL) are:
Creation of tables and definition of attributes (CREATE TABLE ...)
Change of tables by adding or deleting attributes (ALTER TABLE ...)
Deletion of a whole table including its content (DROP TABLE ...)

5.2 Data Manipulation Language (DML): Once the structure is defined, the database is ready for entry and manipulation of data. DML includes the commands to enter and manipulate the data; with these commands the user can add new records, navigate through the existing records, view the contents of various fields, modify the data, delete existing records, and sort the records in a desired sequence. Typical DML operations (with their respective keywords in SQL) are:
Add data (INSERT)
Change data (UPDATE)
Delete data (DELETE)
Query data (SELECT)

5.3 Data Control Language (DCL): Data control commands in SQL control access privileges and security issues of a database system or parts of it. These commands are closely related to the DBMS and can therefore vary in different SQL implementations. Some typical commands are:
GRANT - gives a user access privileges to a database
REVOKE - withdraws access privileges given with the GRANT command (or taken with the DENY command)

Since these commands depend on the actual database management system (DBMS), we will not cover DCL in detail in this module.
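The following sketch brings the three sub-languages together on a hypothetical STUDENT table, using Oracle-style data types; all table, column and user names are illustrative assumptions.

-- DDL: define and later change the structure
CREATE TABLE student (
    roll_no NUMBER(5) PRIMARY KEY,
    name    VARCHAR2(50) NOT NULL,
    course  VARCHAR2(20)
);
ALTER TABLE student ADD (phone VARCHAR2(15));

-- DML: enter and manipulate the data
INSERT INTO student (roll_no, name, course) VALUES (10, 'Asha', 'Computer');
UPDATE student SET course = 'Accounts' WHERE roll_no = 10;
SELECT roll_no, name, course FROM student WHERE course = 'Accounts';
DELETE FROM student WHERE roll_no = 10;

-- DCL: control access privileges
GRANT SELECT, UPDATE ON student TO clerk_user;
REVOKE UPDATE ON student FROM clerk_user;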

6. Database Users
6.1 Database Administrator (DBA): The DBA is a person or a group of persons responsible for the management of the database. The DBA is responsible for authorizing access to the database by granting and revoking permissions to users, for coordinating and monitoring its use, for managing backups and repairing damage due to hardware and/or software failures, and for acquiring hardware and software resources as needed. In small organizations the role of DBA is performed by a single person, while in large organizations there is a group of DBAs who share these responsibilities.

6.2 Database Designers: They are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store the data. It is the responsibility of database designers to communicate with all prospective users of the database in order to understand their requirements, so that they can create a design that meets those requirements.

6.3 End Users: End users are the people who interact with the database through applications or utilities. The various categories of end users are:

Casual End Users - These users occasionally access the database but may need different information each time. They use a sophisticated database query language to specify their requests. For example: high-level managers who access the data weekly or biweekly.

Naive End Users - These users frequently query and update the database using standard types of queries. The operations that can be performed by this class of users are very limited and affect a precise portion of the database. For example: reservation clerks for airlines or hotels check availability for a given request and make reservations. Persons using Automated Teller Machines (ATMs) also fall under this category, as they have access to a limited portion of the database.

Standalone End Users / Online End Users - Those end users who interact with the database directly via an online terminal, or indirectly through menu- or graphics-based interfaces. For example: the user of a text package, or of library management software that stores a variety of library data such as the issue and return of books for fine purposes.

6.4 Application Programmers: Application programmers are responsible for writing application programs that use the database. These programs could be written in general-purpose programming languages such as Visual Basic, Developer, C, FORTRAN or COBOL to manipulate the database. These application programs operate on the data to perform various operations such as retrieving information and creating new records.

7. ADVANTAGES OF DBMS
The DBMS (Database Management System) is preferred over the conventional file processing system due to the following advantages:

Controlling Data Redundancy - In the conventional file processing system, every user group maintains its own files for handling its data. This may lead to:

Duplication of the same data in different files.
Wastage of storage space.
Errors generated due to updating the same data in different files.
Time wasted in entering the same data again and again.
Needless use of computer resources.
Difficulty in combining information.

All of the above-mentioned problems are eliminated in a database management system.

Elimination of Inconsistency - In the file processing system, information is duplicated throughout the system, so changes made in one file may need to be carried over to another file. This may lead to inconsistent data, so we need to remove this duplication of data in multiple files to eliminate inconsistency. For example, consider a student result system. Suppose that in the STUDENT file it is indicated that Roll No = 10 has opted for the 'Computer' course, but in the RESULT file it is indicated that Roll No = 10 has opted for the 'Accounts' course. In this case the two entries for a particular student do not agree with each other, and the database is said to be in an inconsistent state. Hence, to eliminate this conflicting information we need to centralize the database. On centralizing the database, the duplication will be controlled and hence inconsistency will be removed. Data inconsistencies are often encountered in everyday life. Consider another example: we have all come across situations when a new address is communicated to an organization that we deal with (e.g. a telecom provider, gas company or bank), and we find that some of the communications from that organization are received at the new address while others continue to be mailed to the old address. Combining all the data in a database would involve a reduction in redundancy as well as inconsistency, so it is likely to reduce the costs of collecting, storing and updating data.

Better service to the users - A DBMS is often used to provide better services to the users. In conventional systems, availability of information is often poor, since it is normally difficult to obtain information that the existing systems were not designed for. Once several conventional systems are combined to form one centralized database, the availability of information and its timeliness is likely to improve, since the data can now be shared and the DBMS makes it easy to respond to unanticipated information requests. Centralizing the data in the database also means that users can obtain new and combined information easily that would have been impossible to obtain otherwise. Also, use of a DBMS allows users who don't know programming to interact with the data more easily, unlike a file processing system where a programmer may need to write new programs to meet every new demand.

Flexibility of the System is improved - Since changes are often necessary to the contents of the data stored in any system, these changes are made more easily in a centralized database than in a conventional system. Application programs need not be changed when the data in the database changes. This also helps maintain the consistency and integrity of the data in the database.

Integrity can be improved - Since the data of an organization using the database approach is centralized and is used by a number of users at a time, it is essential to enforce integrity constraints. In conventional systems, because the data is duplicated in multiple files, updates or changes may sometimes lead to entry of incorrect data in some of the files where it exists. For example, consider the result system already discussed: since multiple files are maintained, a value may be entered for a course that does not exist. Suppose the course can have the values (Computer, Accounts, Economics, Arts) but we enter the value 'Hindi' for it; this leads to inconsistent data and a lack of integrity. Even if we centralize the database, it may still contain incorrect data. For example: the salary of a full-time employee may be entered as Rs. 500 rather than Rs. 5000; a student may be shown to have borrowed books but has no enrollment; a list of employee numbers for a given department may include numbers of non-existent employees. These problems can be avoided by defining validation procedures that are checked whenever any update operation is attempted.
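As a sketch of such validation, the course and salary checks described above can be declared directly on the tables so that the DBMS rejects invalid values; the table and column names below are illustrative assumptions.

-- Restrict course to the allowed set of values
CREATE TABLE result (
    roll_no NUMBER(5),
    course  VARCHAR2(20)
        CHECK (course IN ('Computer', 'Accounts', 'Economics', 'Arts'))
);

-- Reject obviously wrong salaries for full-time employees
CREATE TABLE employee_salary (
    emp_no NUMBER(5) PRIMARY KEY,
    salary NUMBER(8,2) CHECK (salary >= 5000)
);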

Standards can be enforced - Since all access to the database must be through DBMS, so standards are easier to enforce. Standards may relate to the naming of data, format of data, structure of the data etc. Standardizing stored data formats is usually desirable for the purpose of data interchange or migration between systems.

Security can be improved - In conventional systems, applications are developed in an ad hoc or temporary manner. Often different systems of an organization would access different components of the operational data; in such an environment enforcing security can be quite difficult. Setting up a database makes it easier to enforce security restrictions since the data is now centralized. It is easier to control who has access to what parts of the database. Different checks can be established for each type of access (retrieve, modify, delete etc.) to each piece of information in the database. Consider an example of banking, in which employees at different levels may be given access to different types of data in the database. A clerk may be given the authority to know only the names of all the customers who have a loan in the bank, but not the details of each loan the customer may have. This can be accomplished by giving appropriate privileges to each employee; a sketch of this appears after the next paragraph.

Organization's requirements can be identified - Organizations have sections and departments, and each of these units often considers its own work as the most important and therefore considers its needs as the most important. Once a database has been set up with centralized control, it becomes necessary to identify the organization's requirements and to balance the needs of the different units. So it may become necessary to ignore some requests for information if they conflict with higher-priority needs of the organization. It is the responsibility of the DBA (Database Administrator) to structure the database system to provide the overall service that is best for the organization.

For example, a DBA must choose the best file structures and access methods to give fast response for highly critical applications as compared to less critical applications.
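Returning to the bank-clerk example above, one hedged way to realise such a restriction is to grant access to a view rather than to the underlying tables; the table, view and user names are illustrative assumptions.

-- The clerk sees only which customers hold a loan, not the loan details
CREATE VIEW loan_customers AS
    SELECT DISTINCT customer_name
    FROM borrower;

GRANT SELECT ON loan_customers TO clerk_user;
-- No privileges are granted on the underlying borrower or loan tables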

Overall cost of developing and maintaining systems is lower - It is much easier to respond to unanticipated requests when data is centralized in a database than when it is stored in a conventional file system. Although the initial cost of setting up a database can be large, one normally expects the overall cost of setting up the database and developing and maintaining application programs to be far lower than for similar services using conventional systems, since the productivity of programmers can be higher when using the non-procedural languages that come with a DBMS than when using procedural languages.

Data Model must be developed - Perhaps the most important advantage of setting up a database system is the requirement that an overall data model for the organization be built. In conventional systems, it is more likely that files will be designed as the needs of particular applications demand; the overall view is often not considered. Building an overall view of an organization's data is usually cost-effective in the long term.

Provides backup and recovery - Centralizing a database provides schemes for backup and recovery from failures, including disk crashes, power failures and software errors, which help the database recover from an inconsistent state to the state that existed prior to the occurrence of the failure, though the methods involved are complex.

8. Three-Schema Architecture
The objective of the three-schema architecture is to separate the user application programs from the physical database. The three-schema architecture is an effective tool with which the user can visualize the schema levels in a database system. The three-level ANSI architecture has an important place in database technology development because it clearly separates the user's external level, the system's conceptual level, and the internal storage level for designing a database. In the three-schema architecture, schemas can be defined at three different levels.

8.1 External Schema: An external schema describes a specific user's view of the data and the specific methods and constraints connected with this information. Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group.

8.2 Internal Schema: The internal schema mainly describes the physical storage structure of the database. It describes the data from a view very close to the computer or system in general. It completes the logical schema with technical aspects such as storage methods or helper functions for greater efficiency.

8.3 Conceptual Schema: The conceptual schema describes the structure of the whole database for the entire user community. It hides the details of the physical storage structures and concentrates on describing entities, data types, relationships and constraints. The implementation of the conceptual schema is based on a conceptual schema design in a high-level data model.

9. Data Independence:
With knowledge of the three-schema architecture, the term data independence can be explained as follows: each higher level of the data architecture is immune to changes of the next lower level of the architecture. Data independence is normally thought of in terms of two levels or types. Logical data independence makes it possible to change the structure of the data without modifying the application programs that make use of the data; there is no need to rewrite current applications as part of the process of adding data to or removing data from the system. The second type or level of data independence is known as physical data independence. This has to do with altering the organization or storage procedures related to the data, rather than modifying the data itself. Accomplishing a shift in file organization or in the indexing strategy used for the data does not require any modification to the external structure of the applications, meaning that users of the applications are not likely to notice any difference at all in the function of their programs.

Database Instance: The term instance is typically used to describe a complete database environment, including the RDBMS software, table structure, stored procedures and other functionality. It is most commonly used when administrators describe multiple instances of the same database. Also known as: environment. Example: an organization with an employees database might have three different instances: production (used to contain live data), pre-production (used to test new functionality prior to release into production) and development (used by database developers to create new functionality).

Relational Schema: A relation schema can be thought of as the basic information describing a table or relation. This includes a set of column names, the data types associated with each column, and the name associated with the entire table.
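A relation schema as just described can be written down directly as a table definition. The sketch below uses a hypothetical BOOK relation (all names and types are illustrative assumptions) and also hints at physical data independence:

-- Relation schema: BOOK(isbn, title, price)
CREATE TABLE book (
    isbn  CHAR(13)     PRIMARY KEY,   -- column name and data type
    title VARCHAR2(80) NOT NULL,
    price NUMBER(6,2)
);

-- Physical data independence: adding an index changes only the storage level;
-- no query that uses BOOK has to be rewritten.
CREATE INDEX book_title_idx ON book (title);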

10. Entity - Relationship Model


The Entity-Relationship Model (E-R Model) is a high-level conceptual data model developed by Chen in 1976 to facilitate database design. Conceptual modeling is an important phase in designing a successful database. A conceptual data model is a set of concepts that describe the structure of a database and the associated retrieval and update transactions on the database. A high-level model is chosen so that all the technical aspects are also covered. The E-R data model grew out of the exercise of using commercially available DBMSs to model databases. The E-R model is a generalization of the earlier commercial models like the hierarchical and the network model. It also allows the representation of various constraints as well as their relationships.

To sum up, the Entity-Relationship (E-R) Model is based on a view of the real world that consists of a set of objects called entities and relationships among entity sets, which are basically groups of similar objects. A relationship between entity sets is represented by a named E-R relationship and is of 1:1, 1:N or M:N type, which tells the mapping from one entity set to another. The E-R model is shown diagrammatically using Entity-Relationship (E-R) diagrams, which represent the elements of the conceptual model in a way that shows the meanings and relationships between those elements independent of any particular DBMS and implementation details.

10.1 What are Entity Relationship Diagrams?


Entity Relationship Diagrams (ERDs) illustrate the logical structure of databases.

An ER Diagram

10.2 Entity Relationship Diagram Notations

Entity
An entity is a real-world object (living or non-living) or a concept about which you want to store information.

Weak Entity
A weak entity is an entity that must be defined through a foreign key relationship with another entity, as it cannot be uniquely identified by its own attributes alone.

Key attribute
A key attribute is the unique, distinguishing characteristic of the entity, which can uniquely identify the instances of an entity set. For example, an employee's social security number might be the employee's key attribute.

Multi-valued attribute
A multi-valued attribute can have more than one value. For example, an employee entity can have multiple skill values.

Derived attribute
A derived attribute is based on another attribute. For example, an employee's monthly salary is based on the employee's annual salary.

Relationships
Relationships illustrate how two entities share information in the database structure. In a diagram, the two entities are connected and the relationship notation is placed on the connecting line.

Cardinality
Cardinality specifies how many instances of an entity relate to one instance of another entity. Ordinality is also closely linked to cardinality. While cardinality specifies the occurrences of a relationship, ordinality describes the relationship as either mandatory or optional. In other words, cardinality specifies the maximum number of relationships and ordinality specifies the absolute minimum number of relationships.

Recursive relationship
In some cases, entities can be self-linked. For example, employees can supervise other employees.

10.3 How to Design Effective ER Diagrams


1) Make sure that each entity appears only once per diagram.
2) Name every entity, relationship and attribute on your diagram.
3) Examine relationships between entities closely. Are they necessary? Are any relationships missing? Eliminate any redundant relationships, and don't connect relationships to each other.
4) Use colors to highlight important portions of your diagram; this helps draw attention to its key features.
5) Create a polished diagram by adding shading and color, using the styles available in your diagramming tool or by creating your own.

10.4 Features of the E-R Model:


1. The E-R diagram used for representing the E-R Model can be easily converted into relations (tables) in the relational model.
2. The E-R Model is used by the database developer for the purpose of good database design, so that the data model can then be implemented in various DBMSs.
3. It is helpful as a problem decomposition tool, as it shows the entities and the relationships between those entities.
4. It is inherently an iterative process; on later modifications, new entities can be inserted into the model.

5. It is very simple and easy to understand by various types of users and designers because specific standards are used for their representation.

11. Enhanced Entity Relationship (EER) Diagrams


An EER diagram contains all the essential modeling concepts of an ER diagram and adds extra concepts:
o Specialization/generalization
o Subclass/superclass
o Categories
o Attribute inheritance
Extended ER diagrams use some object-oriented concepts such as inheritance. EER is used to model concepts more accurately than the ER diagram.

Subclasses and Superclasses
In some cases, an entity type has numerous sub-groupings of its entities that are meaningful and need to be explicitly represented because of their importance.

For example, members of entity Employee can be grouped further into Secretary, Engineer, Manager, Technician, Salaried_Employee. The set listed is a subset of the entities that belong to the Employee entity, which means that every entity that belongs to one of the sub sets is also an Employee. Each of these sub-groupings is called a subclass, and the Employee entity is called the superclass.

An entity cannot be a member of a subclass only; it must also be a member of the superclass. An entity can be included as a member of a number of subclasses; for example, a Secretary may also be a salaried employee. However, not every member of the superclass must be a member of a subclass.

Type Inheritance
The type of an entity is defined by the attributes it possesses and the relationship types it participates in.

Because an entity in a subclass represents the same entity from the superclass, it should possess values for its specific attributes as well as for the attributes it has as a member of the superclass.

This means that an entity that is a member of a subclass inherits all the attributes of the entity as a member of the super class; as well, an entity inherits all the relationships in which the super class participates.

(Figure: An Employee entity type with subclasses Secretary, Engineer and Technician; Employee participates in a Work For relationship with Department.)

Specialization
Specialization is the process of defining a set of subclasses of a superclass; it is the top-down refinement into superclasses and subclasses. The set of subclasses is based on some distinguishing characteristic of the superclass.

For example, the set of subclasses for Employee (Secretary, Engineer, Technician) differentiates among employees based on job type. There may be several specializations of an entity type based on different distinguishing characteristics. Another example is the specialization into Salaried_Employee and Hourly_Employee, which distinguishes employees based on their method of pay.
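Although this section is about the EER notation itself, it may help to see one common way such a specialization is later realised as tables: a superclass table plus one table per subclass sharing its key. This is only a hedged sketch; all table and column names are illustrative assumptions.

CREATE TABLE employee (
    ssn      CHAR(9) PRIMARY KEY,
    name     VARCHAR2(50),
    job_type VARCHAR2(20)      -- defining attribute of the specialization
);

-- Each subclass table reuses the superclass key and adds specific attributes
CREATE TABLE secretary (
    ssn          CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    typing_speed NUMBER(3)
);

CREATE TABLE engineer (
    ssn      CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    eng_type VARCHAR2(20)
);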

Notation for Specialization
To represent a specialization, the subclasses that define the specialization are attached by lines to a circle that represents the specialization, which is connected to the superclass. The subset symbol (half-circle) shown on each line connecting a subclass to the superclass indicates the direction of the superclass/subclass relationship. Attributes that apply only to a subclass are attached to the rectangle representing that subclass; they are called specific attributes. A subclass can also participate in specific relationship types. See the example below.
(Figure: Specialization of Employee into Secretary, Engineer and Technician. Employee participates in a Work For relationship with Department, and Engineer participates in a Belongs To relationship with Professional Organization.)

Reasons for Specialization
Certain attributes may apply to some but not all entities of a superclass. A subclass is defined in order to group the entities to which these attributes apply. The second reason for using subclasses is that some relationship types may be participated in only by entities that are members of the subclass.

Summary of Specialization
Specialization allows for: defining a set of subclasses of an entity type; creating additional specific attributes for each subclass; and creating additional specific relationship types between each subclass and other entity types or other subclasses.

Generalization
The reverse of specialization is generalization. Several classes with common features are generalized into a superclass. For example, the entity types Car and Truck share the common attributes License_PlateNo, VehicleID and Price, therefore they can be generalized into the superclass Vehicle.

Constraints on Specialization and Generalization
Several specializations can be defined on an entity type. Entities may belong to subclasses in each of the specializations. A specialization may also consist of a single subclass, such as the Manager specialization; in this case we don't use the circle notation.

Types of Specializations
Predicate-defined or condition-defined specialization occurs in cases where we can determine exactly the entities of each subclass by placing a condition on the value of an attribute in the superclass. An example is where the Employee entity has an attribute Job Type: we can specify the condition of membership in the Secretary subclass by the condition JobType = 'Secretary'.

Example:

The condition is called the defining predicate of the subclass. The condition is a constraint specifying that exactly those entities of the Employee entity type whose attribute value for Job Type is 'Secretary' belong to the subclass. Predicate-defined subclasses are displayed by writing the predicate condition next to the line that connects the subclass to the specialization circle.

Attribute-defined specialization
If all subclasses in a specialization have their membership condition on the same attribute of the superclass, the specialization is called an attribute-defined specialization, and the attribute is called the defining attribute. Attribute-defined specializations are displayed by placing the defining attribute name next to the arc from the circle to the superclass.

User-defined specialization
When we do not have a condition for determining membership in a subclass, the subclass is called user-defined. Membership in a subclass is determined by the database users when they add an entity to the subclass.

Disjointness / Overlap Constraint
The disjointness constraint specifies that the subclasses of the specialization must be disjoint, which means that an entity can be a member of at most one subclass of the specialization. The 'd' in the specialization circle stands for disjoint. If the subclasses are not constrained to be disjoint, they overlap. Overlap means that an entity can be a member of more than one subclass of the specialization. The overlap constraint is shown by placing an 'o' in the specialization circle.

Completeness Constraint
The completeness constraint may be either total or partial.

A total specialization constraint specifies that every entity in the superclass must be a member of at least one subclass of the specialization. Total specialization is shown by using a double line to connect the superclass to the circle. A single line is used to display a partial specialization, meaning that an entity does not have to belong to any of the subclasses.

Disjointness vs. Completeness
Disjointness constraints and completeness constraints are independent. The following combinations of constraints on specializations are possible:

Disjoint, total: Department specialized into Academic and Administrative.
Disjoint, partial: Employee specialized into Secretary, Analyst and Engineer.
Overlapping, total: Part specialized into Manufactured and Purchased.
Overlapping, partial: Movie specialized into Children, Comedy and Drama.

Chapter-1
INTRODUCTION TO DBMS AND DATA MODELING.
End Chapter quizzes:

Q1. An entity is represented by the symbol:
(a) Circle (b) Ellipse (c) Rectangle (d) Square

Q2. A relationship is:
(a) an item in an application (b) a meaningful dependency between entities (c) a collection of related entities (d) related data

Q3. The overall logical structure of a database can be expressed graphically by:
(a) ER diagram (b) Records (c) Relations (d) Hierarchy

Q4. In the three-schema architecture, a specific view of data given to a particular user is defined at the:
(a) Internal level (b) External level (c) Conceptual level (d) Physical level

Q5. By data redundancy in a file-based system we mean that:
(a) Unnecessary data is stored (b) The same data is duplicated in many files (c) Data is unavailable (d) Files have redundant data

Q6. Entities are identified from the word statement of a problem by:
(a) picking words which are adjectives (b) picking words which are nouns (c) picking words which are verbs (d) picking words which are pronouns

Q7. Data independence allows:
(a) sharing the same database by several applications (b) extensive modification of applications (c) no data sharing between applications (d) elimination of several application programs

Q8. Access rights to a database are controlled by the:
(a) top management (b) system designer (c) system analyst (d) database administrator

Q9. Data integrity in a file-based system may be lost because:
(a) the same variable may have different values in different files (b) files are duplicated (c) unnecessary data is stored in files (d) redundant data is stored in files

Q10. Characteristics of an entity set are known as:
(a) Attributes (b) Cardinality (c) Relationship (d) Many-to-many relation

Q11. Vehicle identification number, color, weight, and horsepower best exemplify:
(a) entities (b) entity types (c) data markers (d) attributes

Q12. If each employee can have more than one skill, then skill is referred to as a:
(a) gerund (b) multivalued attribute (c) nonexclusive attribute (d) repeating attribute

Q13. The data structure used in the hierarchical model is a:
(a) Tree (b) Graph (c) Table (d) None of these

Q14. By data security in a DBMS we mean:
(a) preventing access to data (b) allowing access to data only to authorized users (c) preventing changing data (d) introducing integrity constraints

Chapter-2
RELATIONAL DATABASE MODEL
2. Introductory Concepts
Relational Database Management System
A Relational Database Management System (RDBMS) provides a complete and integrated approach to information management. A relational model provides the basis for a relational database. A relational model has three aspects:

Structures
Operations
Integrity rules

Structures consist of a collection of objects or relations that store data. An example of a relation is a table. You can store information in a table and use the table to retrieve and modify data. Operations are used to manipulate data and structures in a database. When using operations, you must adhere to a predefined set of integrity rules. Integrity rules are laws that govern the operations allowed on data in a database. This ensures data accuracy and consistency. Relational database components include:
Table
Row
Column
Field
Primary key
Foreign key

Figure: Relational database components

A Table is a basic storage structure of an RDBMS and consists of columns and rows. A table represents an entity. For example, the S_DEPT table stores information about the departments of an organization. A Row is a combination of column values in a table and is identified by a primary key. Rows are also known as records. For example, a row in the table S_DEPT contains information about one department.

A Column is a collection of one type of data in a table. Columns represent the attributes of an object. Each column has a column name and contains values that are bound by the same type and size. For example, a column in the table S_DEPT specifies the names of the departments in the organization.

A Field is an intersection of a row and a column. A field contains one data value. If there is no data in the field, the field is said to contain a NULL value.

Figure: Table, row, column and field

A Primary key is a column or a combination of columns that is used to uniquely identify each row in a table. For example, the column containing department numbers in the S_DEPT table is created as a primary key and therefore every department number is different. A primary key must contain a value; it cannot contain a NULL value.

A Foreign key is a column or set of columns that refers to a primary key in the same table or another table. You use foreign keys to establish connections between, or within, tables. A foreign key must either match a primary key or else be NULL. Rows are connected logically when required. The logical connections are based upon conditions that define a relationship between corresponding values, typically between a primary key and a matching foreign key. This relational method of linking provides great flexibility, as it is independent of physical links between records.

Figure: Primary and foreign key
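A hedged sketch of how such keys are declared, reusing the S_DEPT example from the text together with a hypothetical S_EMP table (column names and sizes are assumptions):

CREATE TABLE s_dept (
    dept_no   NUMBER(4)    PRIMARY KEY,   -- primary key: unique and never NULL
    dept_name VARCHAR2(30) NOT NULL
);

CREATE TABLE s_emp (
    emp_no   NUMBER(6) PRIMARY KEY,
    emp_name VARCHAR2(50),
    dept_no  NUMBER(4) REFERENCES s_dept(dept_no)  -- foreign key: matches a primary key or is NULL
);

The REFERENCES clause is what lets the DBMS connect rows logically, as described above, without any physical link between the records.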

RDBMS Properties
An RDBMS is easily accessible. You execute commands in the Structured Query Language (SQL) to manipulate data. SQL is the International Standards Organization (ISO) standard language for interacting with an RDBMS.

An RDBMS provides full data independence. The organization of the data is independent of the applications that use it. You do not need to specify the access routes to tables or know how data is physically arranged in a database.

A relational database is a collection of individual, named objects. The basic unit of data storage in a relational database is called a table. A table consists of rows and columns used to store values. For access purposes, the order of rows and columns is insignificant. You can control the access order as required.

Figure: SQL and the database

When querying the database, you use conditional operations such as joins and restrictions. A join combines data from separate database rows. A restriction limits the specific rows returned by a query.

Figure: Conditional operations

An RDBMS enables data sharing between users. At the same time, you can ensure consistency of data across multiple tables by using integrity constraints. An RDBMS uses various types of data integrity constraints. These types include entity, column, referential and user-defined constraints.

The entity constraint ensures the uniqueness of rows, and the column constraint ensures consistency of the type of data within a column. The referential constraint ensures the validity of foreign keys, and user-defined constraints are used to enforce specific business rules. An RDBMS minimizes the redundancy of data; this means that similar data is not duplicated unnecessarily in multiple places.
3. Codd's 12 rules
Codd's 12 rules are a set of twelve rules proposed by E. F. Codd, a pioneer of the relational model for databases, designed to define what is required from a database management system in order for it to be considered relational, i.e., an RDBMS. Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted.

Rule 1: The information rule:
All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

Rule 2: The guaranteed access rule:
All data must be accessible with no ambiguity. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:
The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:
The system must support at least one relational language that:
o has a linear syntax,
o can be used both interactively and within application programs, and
o supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:
All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:
The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.

Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.

Rule 9: Logical data independence:
Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10: Integrity independence:
Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:
The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:
o when a distributed version of the DBMS is first introduced; and
o when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule: If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.

4. Data Integrity and Integrity Rules


Data Integrity is very important concepts in database operations in particular and Data Warehousing and Business Intelligence in general. Because Data Integrity ensured that only data of high quality, correct, consistent is accessible to its user. The database designer is responsible for incorporating elements to promote the accuracy and reliability of stored data within the database. There are many different techniques that can be used to encourage data integrity, with some of these dependants on what database technology is being used. Here we are discussing two most common integrity rule. Integrity rule 1: Entity integrity It says that no component of a primary key may be null. All entities must be distinguishable. That is, they must have a unique identification of some kind. Primary keys perform unique identification function in a relational database. An identifier that was wholly null would be a contradiction in terms. It would be like there was some entity that did not have any unique identification. That is, it was not distinguishable from other entities. If two entities are not distinguishable from each other, then by definition there are not two entities but only one. Integrity rule 2: Referential integrity The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples of the two relations. Suppose we wish to ensure that value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another. This is referential integrity. The referential integrity constraint states that, a tuple in one relation that refers to another relation must refer to the existing tuple in that relation. This means that the referential integrity is

a constraint specified on more than one relation. This ensures that consistency is maintained across the relations.

Table A
DeptID   DeptName    DeptManager
F-1001   Financial   Nathan
S-2012   Software    Martin
H-0001   HR          Jason

Table B
EmpNo   EmpName    DeptID
1001    Tommy      F-1001
1002    Will       S-2012
1003    Jonathan   H-0001
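Referential integrity between Table B and Table A can be enforced by the DBMS itself. The following is only a sketch (the table names DEPT and EMPLOYEE and the column data types are assumptions made for illustration; the text gives only the sample data): entity integrity is enforced by the PRIMARY KEY clauses and referential integrity by the FOREIGN KEY clause.

CREATE TABLE DEPT (
    DeptID      VARCHAR2(10) PRIMARY KEY,     -- entity integrity: the key may not be NULL
    DeptName    VARCHAR2(30),
    DeptManager VARCHAR2(30)
);

CREATE TABLE EMPLOYEE (
    EmpNo   NUMBER(5) PRIMARY KEY,
    EmpName VARCHAR2(30),
    DeptID  VARCHAR2(10),
    -- referential integrity: every DeptID here must already exist in DEPT
    CONSTRAINT fk_emp_dept FOREIGN KEY (DeptID) REFERENCES DEPT (DeptID)
);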

4. Relational algebra
Relational algebra is a procedural query language, which consists of a set of operations that take one or two relations as input and produce a new relation as their result. The fundamental operations that will be discussed in this section are: select, project, union, and set difference. Besides the fundamental operations, the following additional operation will be discussed: set intersection. Each operation will be applied to tables of a sample database. Each table is otherwise known as a relation and each row within the table is referred to as a tuple. The sample database, consisting of tables one might see in a bank, contains the following 6 relations:

Account
branch-name   account-number   balance
Downtown      A-101            500
Mianus        A-215            700
Perryridge    A-102            400
Round Hill    A-305            350
Brighton      A-201            900
Redwood       A-222            700
Brighton      A-217            750

Branch
branch-name   branch-city   assets
Downtown      Brooklyn      9000000
Redwood       Palo Alto     2100000
Perryridge    Horseneck     1700000
Mianus        Horseneck     400000
Round Hill    Horseneck     8000000
Pownal        Bennington    300000
North Town    Rye           3700000
Brighton      Brooklyn      7100000

Customer
customer-name   customer-street   customer-city
Jones           Main              Harrison
Smith           North             Rye
Hayes           Main              Harrison
Curry           North             Rye
Lindsay         Park              Pittsfield
Turner          Putnam            Stamford
Williams        Nassau            Princeton
Adams           Spring            Pittsfield
Johnson         Alma              Palo Alto
Glenn           Sand Hill         Woodside
Brooks          Senator           Brooklyn
Green           Walnut            Stamford

Depositor
customer-name   account-number
Johnson         A-101
Smith           A-215
Hayes           A-102
Turner          A-305
Johnson         A-201
Jones           A-217
Lindsay         A-222

Loan
branch-name   loan-number   amount
Downtown      L-17          1000
Redwood       L-23          2000
Perryridge    L-15          1500
Downtown      L-14          1500
Mianus        L-93          500
Round Hill    L-11          900
Perryridge    L-16          1300

Borrower
customer-name   loan-number
Jones           L-17
Smith           L-23
Hayes           L-15
Jackson         L-14
Curry           L-93
Smith           L-11
Williams        L-17
Adams           L-16

The Select operation is a unary operation, which means it operates on one relation. Its function is to select tuples that satisfy a given predicate. To denote selection, the lowercase Greek letter sigma (σ) is used. The predicate appears as a subscript to σ, and the argument relation is given in parentheses following the σ. For example, to select those tuples of the loan relation where the branch is "Perryridge," we write:

σ branch-name = "Perryridge" (loan)

The results of the query are the following:

branch-name   loan-number   amount
Perryridge    L-15          1500
Perryridge    L-16          1300
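For comparison, the same selection can be written in SQL. This sketch assumes a table named loan with columns branch_name, loan_number and amount corresponding to the sample relation:

SELECT *
FROM loan
WHERE branch_name = 'Perryridge';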

Comparisons like =, <>, <, <=, >, >= can also be used in the selection predicate. An example query using a comparison is to find all tuples in which the amount lent is more than $1200, written as:

σ amount > 1200 (loan)

The project operation is a unary operation that returns its argument relation with certain attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is denoted by the Greek letter pi (π). The attributes that we wish to appear in the result are listed as a subscript to π. The argument relation follows in parentheses. For example, the query to list all loan numbers and the amount of the loan is written as:

π loan-number, amount (loan)

The result of the query is the following:

loan-number   amount
L-17          1000
L-23          2000
L-15          1500
L-14          1500
L-93          500
L-11          900
L-16          1300
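The corresponding SQL, under the same assumed loan table, is sketched below; DISTINCT is added because projection in relational algebra removes duplicate rows, whereas plain SQL keeps them:

SELECT DISTINCT loan_number, amount
FROM loan;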

Another, more complicated, example query is to find those customers who live in Harrison. It is written as:

π customer-name (σ customer-city = "Harrison" (customer))

The union operation yields the results that appear in either or both of two relations. It is a binary operation denoted by the symbol ∪.

An example query would be to find the names of all bank customers who have either an account or a loan or both. To find this result we will need the information in the depositor relation and in the borrower relation. To find the names of all customers with a loan in the bank we would write:

π customer-name (borrower)

and to find the names of all customers with an account in the bank, we would write:

π customer-name (depositor)

Then by using the union operation on these two queries we have the query we need to obtain the wanted results. The final query is written as:

π customer-name (borrower) ∪ π customer-name (depositor)

The result of the query is the following:

customer-name
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Adams

The set intersection operation is denoted by the symbol ∩. It is not a fundamental operation; however, it is a more convenient way to write r - (r - s).

An example query of the operation to find all customers who have both a loan and an account can be written as:

π customer-name (borrower) ∩ π customer-name (depositor)

The results of the query are the following:

customer-name
Hayes
Jones
Smith
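As a sketch, assuming borrower and depositor tables with a customer_name column, the union and intersection queries above correspond to the SQL set operators UNION and INTERSECT:

SELECT customer_name FROM borrower
UNION
SELECT customer_name FROM depositor;

SELECT customer_name FROM borrower
INTERSECT
SELECT customer_name FROM depositor;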

Set Difference Operation
Set difference is denoted by the minus sign (−). It finds tuples that are in one relation but not in another. Thus r − s results in a relation containing tuples that are in r but not in s.

Cartesian Product Operation
The Cartesian product of two relations r and s is denoted by a cross (×), written r × s. The result of r × s is a new relation with a tuple for each possible pairing of tuples from r and s.
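Under the same assumptions about table and column names, set difference and Cartesian product also have direct SQL counterparts (Oracle spells set difference MINUS, while the SQL standard uses EXCEPT):

-- customers with an account but no loan (set difference)
SELECT customer_name FROM depositor
MINUS
SELECT customer_name FROM borrower;

-- Cartesian product: every borrower tuple paired with every loan tuple
SELECT *
FROM borrower, loan;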

Chapter-2
RELATIONAL DATABASE MODEL
End Chapter quizzes:
Q1. Which of the following are characteristics of an RDBMS?
a) Data are organized in a series of two-dimensional tables each of which contains records for one entity.
b) Queries are possible on individual or groups of tables.
c) It cannot use SQL.
d) Tables are linked by common data known as keys.

Q2. The keys that can have NULL values are
a) Primary Key
b) Unique Key
c) Foreign Key
d) Both b and c

Q3. GRANT and REVOKE are
a) DDL statements
b) DML statements
c) DCL statements
d) None of these.

Q4. Rows of a relation are called
a) tuples
b) a relation row
c) a data structure
d) an entity

Q5. Primary Key column in the Table
a) Can't accept NULL values
b) Can't accept duplicate values
c) Can't be more than one
d) All of the above

Q6. A table can have how many primary keys?
a) any number
b) 1
c) 255
d) None of the above

Q7. Projection operation is:
a) Unary operation
b) Ternary operation
c) Binary operation
d) None of the above

Q8. The keys that can have NULL values are
a) Primary Key
b) Unique Key
c) Foreign Key
d) Both b and c

Q9. Referential integrity constraint is specified between two relations
a) True
b) False

Q10. Union operation in relational algebra is performed on
a) Single relation
b) Two relations
c) Both a and b
d) None

Q11. As per Codd's rules, a NULL value is the same as
a) blank space
b) Zero
c) Character string
d) None of the above.

Q12. Relational Algebra is a non-procedural query language
a) True
b) False

Chapter: 3
FUNCTIONAL DEPENDENCY AND NORMALIZATION

1. Functional Dependency
Consider a relation R that has two attributes A and B. The attribute B of the relation is functionally dependent on the attribute A if and only if for each value of A no more than one value of B is associated. In other words, the value of attribute A uniquely determines the value of B and if there were several tuples that had the same value of A then all these tuples will have an identical value of attribute B. That is, if t1 and t2 are two tuples in the relation R and t1(A) = t2(A) then we must have t1(B) = t2(B).

A and B need not be single attributes. They could be any subsets of the attributes of a relation R (possibly single attributes). We may then write

R.A -> R.B

if B is functionally dependent on A (or A functionally determines B). Note that functional dependency does not imply a one-to-one relationship between A and B, although a one-to-one relationship may exist between A and B. A simple example of the above functional dependency is when A is the primary key of an entity (e.g. student number) and B is some single-valued property or attribute of the entity (e.g. date of birth). A -> B must then always hold.

Functional dependencies also arise in relationships. Let C be the primary key of an entity and D be the primary key of another entity. Let the two entities have a relationship. If the relationship is one-to-one, we must have C -> D and D -> C. If the relationship is many-to-one, we would have C -> D but not D -> C. For many-to-many relationships, no functional dependencies hold. For example, if C is student number and D is subject number, there is no functional dependency between them. If however, we were storing marks and grades in the database as well, we would have

(student_number, subject_number) -> marks

and we might have

marks -> grades

The second functional dependency above assumes that the grades are dependent only on the marks. This may sometimes not be true, since the instructor may decide to take other considerations into account in assigning grades, for example, the class average mark.

For example, in the student database that we have discussed earlier, we have the following functional dependencies:

sno -> sname
sno -> address
cno -> cname
cno -> instructor
instructor -> office

These functional dependencies imply that there can be only one name for each sno, only one address for each student and only one subject name for each cno. It is of course possible that several students may have the same name and several students may live at the same address. If we consider cno -> instructor, the dependency implies that no subject can have more than one instructor (perhaps this is not a very realistic assumption). Functional dependencies therefore place constraints on what information the database may store. In the above example, one may be wondering if the following FDs hold:

sname -> sno
cname -> cno

Certainly there is nothing in the instance of the example database presented above that contradicts the above functional dependencies. However, whether the above FDs hold or not would depend on whether the university or college whose database we are considering allows duplicate student names and subject names. If it was the enterprise policy to have unique subject names then cname -> cno holds. If duplicate student names are possible, and one would think there always is the possibility of two students having exactly the same name, then sname -> sno does not hold.

Functional dependencies arise from the nature of the real world that the database models. Often A and B are facts about an entity where A might be some identifier for the entity and B some characteristic. Functional dependencies cannot be automatically determined by studying one or more instances of a database. They can be determined only by a careful study of the real world and a clear understanding of what each attribute means.

We have noted above that the definition of functional dependency does not require that A and B be single attributes. In fact, A and B may be collections of attributes. For example
(sno, cno) -> (mark, date)

When dealing with a collection of attributes, the concept of full functional dependence is an important one. Let A and B be distinct collections of attributes from a relation R and let R.A -> R.B. B is then fully functionally dependent on A if B is not functionally dependent on any subset of A. The above example of students and subjects would show full functional dependence if mark and date are not functionally dependent on either student number (sno) or subject number (cno) alone. This implies that we are assuming that a student may take more than one subject and a subject would be taken by many different students. Furthermore, it has been assumed that there is at most one enrolment of each student in the same subject. The above example illustrates full functional dependence. However the following dependence

(sno, cno) -> instructor

is not full functional dependence because cno -> instructor holds.

As noted earlier, the concept of functional dependency is related to the concept of candidate key of a relation, since a candidate key of a relation is an identifier which uniquely identifies a tuple and therefore determines the values of all other attributes in the relation. Therefore, if a subset X of the attributes of a relation R satisfies the property that all remaining attributes of the relation are functionally dependent on it (that is, on X), then X is a candidate key as long as no attribute can be removed from X while still satisfying the property of functional dependence. In the example above, the attributes (sno, cno) form a candidate key (and the only one) since they functionally determine all the remaining attributes. Functional dependence is an important concept and a large body of formal theory has been developed about it. We discuss the concept of closure that helps us derive all functional dependencies that are implied by a given set of dependencies. Once a complete set of functional dependencies has been obtained, we will study how these may be used to build normalised relations.

Rules about Functional Dependencies
Let F be a set of FDs specified on R:

o We must be able to reason about the FDs in F.
o The schema designer usually explicitly states only those FDs which are obvious.
o Without knowing exactly what all tuples are, we must be able to deduce the other/all FDs that hold on R.
o This is essential when we discuss the design of good relational schemas.

Design of Relational Database Schemas
Problems such as redundancy that occur when we try to cram too much into a single relation are called anomalies. The principal kinds of anomalies that we encounter are:
o Redundancy. Information may be repeated unnecessarily in several tuples.
o Update Anomalies. We may change information in one tuple but leave the same information unchanged in another.
o Deletion Anomalies. If a set of values becomes empty, we may lose other information as a side effect.

2 Normalization
Designing a database usually means translating a data model into a relational schema. The important question is whether there is a design methodology or whether the process is arbitrary. A simple answer to this question is affirmative: there are certain properties that a good database design must possess, as dictated by Codd's rules. There are many different ways of designing a good database. One such methodology is the method involving Normalization. Normalization theory is built around the concept of normal forms. Normalization reduces redundancy. Redundancy is unnecessary repetition of data. It can cause problems with storage and retrieval of data. During the process of normalization, dependencies can be identified, which can cause problems during deletion and updation. Normalization theory is based on the fundamental notion of Dependency. Normalization helps in simplifying the structure of schemas and tables.

To illustrate the normal forms, we will take an example of a database of the following logical design:

Relation S {S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key {S#}
Relation P {P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key {P#}
Relation SP {S#, SUPPLYCITY, P#, PARTQTY}, Primary Key {S#, P#}

Foreign Key {S#} References S
Foreign Key {P#} References P


S#   SUPPLYCITY   P#   PARTQTY
S1   Bombay       P1   3000
S1   Bombay       P2   2000
S1   Bombay       P3   4000
S1   Bombay       P4   2000
S1   Bombay       P5   1000
S1   Bombay       P6   1000
S2   Mumbai       P1   3000
S2   Mumbai       P2   4000
S3   Mumbai       P2   2000
S4   Madras       P2   2000
S4   Madras       P4   3000
S4   Madras       P5   4000

Let us examine the table above to find any design discrepancy. A quick glance reveals that some of the data are being repeated. That is data redundancy, which is of course undesirable. The fact that a particular supplier is located in a city has been repeated many times. This redundancy causes many other related problems. For instance, after an update a supplier may be displayed to be from Madras in one entry while from Mumbai in another. This further gives rise to many other problems. Therefore, for the above reasons, the tables need to be refined. This process of refinement of a given schema into another schema or a set of schemas possessing the qualities of a good database is known as Normalization. Database experts have defined a series of normal forms, each conforming to some specified design criteria.

Decomposition: Decomposition is the process of splitting a relation into two or more relations. This is nothing but the projection process. Decompositions may or may not lose information. As you will learn shortly, the normalization process involves breaking a given relation into one or more relations, and these decompositions should be reversible as well, so that no information is lost in the process. Thus, we will be more interested in the decompositions that incur no loss of information than in the ones in which information is lost.

Lossless decomposition: The decomposition which results in relations without losing any information is known as lossless decomposition or nonloss decomposition. The decomposition that results in loss of information is known as lossy decomposition.

Consider the relation S {S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as shown below.

S
S#   SUPPLYSTATUS   SUPPLYCITY
S3   100            Madras
S5   100            Mumbai

Let us decompose this table into two as shown below:


(1)
SX
S#   SUPPLYSTATUS
S3   100
S5   100

SY
S#   SUPPLYCITY
S3   Madras
S5   Mumbai

(2)
SX
S#   SUPPLYSTATUS
S3   100
S5   100

SY
SUPPLYSTATUS   SUPPLYCITY
100            Madras
100            Mumbai

Let us examine these decompositions. In decomposition (1) no information is lost. We can still say that S3's status is 100 and its location is Madras, and also that supplier S5 has 100 as its status and Mumbai as its location. This decomposition is therefore lossless. In decomposition (2), however, we can still say that the status of both S3 and S5 is 100, but the location of the suppliers cannot be determined from these two tables. The information regarding the location of the suppliers has been lost in this case. This is a lossy decomposition. Certainly, lossless decomposition is more desirable, because otherwise the decomposition will be irreversible. The decomposition process is in fact projection, where some attributes are selected from a table. A natural question arises here: why is the first decomposition lossless while the second one is lossy? How should a given relation be decomposed so that the resulting projections are nonlossy? The answer to these questions lies in functional dependencies and may be given by the following theorem.

Heath's theorem: Let R {A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the functional dependency A -> B, then R is equal to the join of its projections {A, B} and {A, C}.

Let us apply this theorem to the decompositions described above. We observe that relation S satisfies two irreducible FDs:

S# -> SUPPLYSTATUS
S# -> SUPPLYCITY

Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, the theorem confirms that relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#, SUPPLYCITY}. Note, however, that the theorem does not say why the projections {S#, SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that in this decomposition the information about which supplier is located in which city is lost.

An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:

R1 ∩ R2 -> R1
R1 ∩ R2 -> R2

2.1 First Normal Form
A relation is in 1st Normal Form (1NF) if and only if, in every legal value of that relation, every tuple contains exactly one value for each attribute. Although simplest, 1NF relations have a number of discrepancies and therefore 1NF is not the most desirable form of a relation. Let us take a relation (modified to illustrate the point in discussion) as

Rel1 {S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY}
Primary Key {S#, P#}
FD {SUPPLYCITY -> SUPPLYSTATUS}

Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY, meaning that a supplier's status is determined by the location of that supplier, e.g. all suppliers from Madras must have a status of 100. The primary key of the relation Rel1 is {S#, P#}.

Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us insert some sample tuples into this relation
REL1
S#   SUPPLYSTATUS   SUPPLYCITY   P#   PARTQTY
S1   200            Madras       P1   3000
S1   200            Madras       P2   2000
S1   200            Madras       P3   4000
S1   200            Madras       P4   2000
S1   200            Madras       P5   1000
S1   200            Madras       P6   1000
S2   100            Mumbai       P1   3000
S2   100            Mumbai       P2   4000
S3   100            Mumbai       P2   2000
S4   200            Madras       P2   2000
S4   200            Madras       P4   3000
S4   200            Madras       P5   4000

The redundancies in the above relation cause many problems, usually known as update anomalies, that is, problems in INSERT, DELETE and UPDATE operations. Let us see these problems one by one.

INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the information regarding that supplier. Thus, a supplier located in Kolkata is missing from the relation because he has not supplied any part so far.

DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of a supplier (if there is a single entry for that supplier), we not only delete the fact that the supplier supplied a particular part but also the fact that the supplier is located in a particular city. In our case, if we delete the entries corresponding to S# = S2, we lose the information that the supplier is located at Mumbai. This is definitely undesirable. The problem here is that too much information is attached to each tuple; therefore deletion forces us to lose too much information.

UPDATE: If we modify the city of supplier S1 to Mumbai from Madras, we have to make sure that all the entries corresponding to S# = S1 are updated, otherwise inconsistency will be introduced. As a result some entries will suggest that the supplier is located at Madras while others will contradict this fact.

2.2 Second Normal Form
A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally dependent on the primary key. Here it has been assumed that there is only one candidate key, which is of course the primary key. A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction process consists of replacing the 1NF relation by suitable projections. We have seen the problems arising due to the under-normalization (1NF) of the relation. The remedy is to break the relation into two simpler relations:

REL2 {S#, SUPPLYSTATUS, SUPPLYCITY} and REL3 {S#, P#, PARTQTY}

REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because each of the nonkey attributes of REL2 {SUPPLYSTATUS, SUPPLYCITY} is fully functionally dependent on the primary key, that is, S#. By a similar argument, REL3 is also in 2NF. Evidently, these two relations have overcome all the update anomalies stated earlier. Now it is possible to insert the facts regarding supplier S5 even when he has not supplied any part, which was earlier not possible. This solves the insert problem. Similarly, the delete and update problems are also over now. These relations in 2NF are still not free from all the anomalies. REL3 is free from most of the problems we are going to discuss here; however, REL2 still carries some problems. The reason is that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY (S# -> SUPPLYCITY and SUPPLYCITY -> SUPPLYSTATUS); such a dependency is called a transitive dependency. We will see that this transitive dependency gives rise to another set of anomalies.

INSERT: We are unable to insert the fact that a particular city has a particular status until we have some supplier actually located in that city.

DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that city has that particular status.

UPDATE: The status for a given city still has redundancy. This causes the usual redundancy problems related to update.
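As a sketch of this 2NF decomposition in SQL (the data types are assumptions, and S# and P# are written as SNO and PNO so that they are legal SQL identifiers):

CREATE TABLE REL2 (
    SNO          VARCHAR2(5) PRIMARY KEY,
    SUPPLYSTATUS NUMBER(3),
    SUPPLYCITY   VARCHAR2(20)
);

CREATE TABLE REL3 (
    SNO     VARCHAR2(5) REFERENCES REL2 (SNO),
    PNO     VARCHAR2(5),
    PARTQTY NUMBER(6),
    PRIMARY KEY (SNO, PNO)
);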

2.3 Third Normal Form

A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. To convert the 2NF relation into 3NF, once again REL2 is split into two simpler relations, REL4 and REL5, as shown below:

REL4 {S#, SUPPLYCITY} and REL5 {SUPPLYCITY, SUPPLYSTATUS}

Sample relations are shown below.

REL4
S#   SUPPLYCITY
S1   Madras
S2   Mumbai
S3   Mumbai
S4   Madras
S5   Kolkata

REL5
SUPPLYCITY   SUPPLYSTATUS
Madras       200
Mumbai       100
Kolkata      300

Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive dependencies. Every 2NF relation can be reduced into 3NF by decomposing it further and removing any transitive dependency.

2.4 Boyce-Codd Normal Form
The previous normal forms assumed that there was just one candidate key in the relation and that key was also the primary key. Another class of problems arises when this is not the case. Very often there will be more candidate keys than one in a practical database designing situation. To be precise, 1NF, 2NF and 3NF did not deal adequately with the case of relations that had two or more candidate keys, where the candidate keys were composite and overlapped (i.e. had at least one attribute in common).

A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant. Or: a relation is in BCNF if and only if all the determinants are candidate keys. It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in that it makes no explicit reference to first and second normal forms as such, nor to the concept of transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the

case that any given relation can be nonloss-decomposed into an equivalent collection of BCNF relations. Thus, relations REL1 and REL2, which were not in 3NF, are not in BCNF either; also, relations REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key, so REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant {SUPPLYCITY} is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are each in BCNF, because in each case the sole candidate key is the only determinant in the respective relation.

2.5 Comparison of BCNF and 3NF
We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an advantage to 3NF in that we know that it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. Nevertheless, there is a disadvantage to 3NF: if we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and there is the problem of repetition of information. If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a high penalty in system performance or risk the integrity of the data in our database. Neither of these alternatives is attractive. With such alternatives, the limited amount of redundancy imposed by transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to retain dependency preservation and to sacrifice BCNF.

2.6 Multi-valued dependency
Multi-valued dependency may be formally defined as: Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A (in symbols, A ->-> B, read "A multi-determines B," or simply "A double arrow B") if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value.

2.7 Fifth Normal Form

It seems that the sole operation necessary or available in the further normalization process is the replacement of a relation in a nonloss way by exactly two of its projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to discover that there exist relations that cannot be nonloss-decomposed into two projections but can be nonloss-decomposed into three (or more). Using an unpleasant but convenient term, we will describe such a relation as "n-decomposable" (for some n > 2), meaning that the relation in question can be nonloss-decomposed into n projections but not into m for any m < n. A relation that can be nonloss-decomposed into two projections we will call "2-decomposable", and the term "n-decomposable" may be defined similarly.
2.8 Join Dependency:

Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies the Join Dependency (JD) *{A, B, ..., Z} (read "star A, B, ..., Z") if and only if every possible legal value of R is equal to the join of its projections on A, B, ..., Z.

Fifth normal form: A relation R is in 5NF, also called projection-join normal form (PJ/NF), if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R. Let us understand what it means for a JD to be "implied by candidate keys." Relation REL12 is not in 5NF; it satisfies a certain join dependency, namely Constraint 3D, that is certainly not implied by its sole candidate key (that key being the combination of all of its attributes). Now let us understand through an example what it means for a JD to be implied by candidate keys. Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and {SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it satisfies the JD

*{ {S#, SUPPLIERNAME, SUPPLYSTATUS}, {S#, SUPPLYCITY} }

That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS} and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those projections. (This fact does not mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the fact that {S#} is a candidate key (in fact it is implied by Heath's theorem). Likewise, relation REL1 also satisfies the JD

*{ {S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY} }

This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.

To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with respect to projection and join (which accounts for its alternative name, projection-join normal form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking projections. For a relation in 5NF, the only join dependencies are those that are implied by candidate keys, and so the only valid decompositions are ones that are based on those candidate keys.

Chapter-3
FUNCTIONAL DEPENDENCY AND NORMALIZATION
End Chapter quizzes:

Q1. Normalization is a step-by-step process of decomposing:
a) Table
b) Database
c) Group Data item
d) All of the above

Q2. A relation is said to be in 2NF if
(i) it is in 1NF
(ii) non-key attributes are dependent on the key attribute
(iii) non-key attributes are independent of one another
(iv) if it has a composite key, no non-key attribute should be dependent on part of the composite key.
a) i, ii, iii
b) i and ii
c) i, ii, iv
d) i, iv

Q3. A relation is said to be in 3NF if
(i) it is in 2NF
(ii) non-key attributes are independent of one another
(iii) key attribute is not dependent on part of a composite key
(iv) has no multi-valued dependency
a) i and iii
b) i and iv
c) i and ii
d) ii and iv

Q4. A relation is said to be in BCNF when
a) it has overlapping composite keys
b) it has no composite keys
c) it has no multivalued dependencies
d) it has no overlapping composite keys which have related attributes

Q5. Fourth normal form (4NF) relations are needed when
a) there are multivalued dependencies between attributes in a composite key
b) there is more than one composite key
c) there are two or more overlapping composite keys
d) there are multivalued dependencies between non-key attributes

Q6. A good database design
(i) is expandable with growth and changes in the organization
(ii) is easy to change when software changes
(iii) ensures data integrity
(iv) allows access to only authorized users
a) i, ii
b) ii, iii
c) i, ii, iii, iv
d) i, ii, iii

Q7. Given an attribute x, another attribute y is dependent on it, if for a given x
a) there are many y values
b) there is only one value of y
c) there is one or more y values
d) there is none or one y value

Q8. If a non-key attribute is dependent on another non-key attribute, it is known as
a) Full FD
b) Partial FD
c) Transitive FD
d) None of the above

Q9. Decomposition of a relation should always be
a) Lossy
b) Lossless
c) Both a and b
d) None of the above

Chapter: 4
STRUCTURE QUERY LANGUAGE

1. INTRODUCTORY CONCEPTS
1.1 What is SQL?
o SQL stands for Structured Query Language
o SQL allows you to access a database
o SQL is an ANSI standard computer language
o SQL can execute queries against a database
o SQL can retrieve data from a database
o SQL can insert new records in a database
o SQL can delete records from a database
o SQL can update records in a database
o SQL is easy to learn

SQL is an ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. SQL statements are used to retrieve and update data in a database. SQL works with database programs like MS Access, DB2, Informix, MS SQL Server, Oracle, Sybase, etc.

1.2 SQL Database Tables
A database most often contains one or more tables. Each table is identified by a name (e.g. "Customers" or "Orders"). Tables contain records (rows) with data. Below is an example of a table called "Persons":
LastName    FirstName   Address        City
Hansen      Ola         Timoteivn 10   Sandnes
Svendson    Tove        Borgvn 23      Sandnes
Pettersen   Kari        Storgt 20      Stavanger

The table above contains three records (one for each person) and four columns (LastName, FirstName, Address, and City).

2. DATABASE LANGUAGE
2.1 SQL Data Definition Language (DDL)
The Data Definition Language (DDL) part of SQL permits database tables to be created or deleted. We can also define indexes (keys), specify links between tables, and impose constraints between database tables.

The most important DDL statements in SQL are:
o CREATE TABLE - creates a new database table
o ALTER TABLE - alters (changes) a database table
o DROP TABLE - deletes a database table

Create a Table
To create a table in a database:

CREATE TABLE table_name
(
column_name1 data_type,
column_name2 data_type,
.......
)

Example
This example demonstrates how you can create a table named "Person", with four columns. The column names will be "LastName", "FirstName", "Address", and "Age"; a sketch of the statement follows.
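The statement itself is not reproduced in this material; a minimal sketch, with the data types assumed for illustration, would be:

CREATE TABLE Person
(
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
Age int
)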

ALTER TABLE
The ALTER TABLE statement is used to add, drop and modify columns in an existing table.

ALTER TABLE table_name

ADD column_name datatype

ALTER TABLE table_name MODIFY column_name datatype

ALTER TABLE table_name DROP COLUMN column_name

Delete a Table or Database
To delete a table (the table structure, attributes, and indexes will also be deleted):

DROP TABLE table_name

2.2 SQL Data Manipulation Language (DML)
The DML language includes syntax to update, insert, and delete records. These query and update commands together form the Data Manipulation Language (DML) part of SQL:
o UPDATE - updates data in a database table
o DELETE - deletes data from a database table
o INSERT INTO - inserts new data into a database table

The INSERT INTO Statement
The INSERT INTO statement is used to insert new rows into a table.

Syntax
INSERT INTO table_name
VALUES (value1, value2,....)

You can also specify the columns for which you want to insert data:

INSERT INTO table_name (column1, column2,...) VALUES (value1, value2,....)
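For instance, a new row could be added to the "Persons" table used earlier (the values are illustrative):

INSERT INTO Persons (LastName, FirstName, Address, City)
VALUES ('Nilsen', 'Johan', 'Bakken 2', 'Stavanger')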

The Update Statement
The UPDATE statement is used to modify the data in a table.

Syntax
UPDATE table_name
SET column_name = new_value
WHERE column_name = some_value

The DELETE Statement
The DELETE statement is used to delete rows in a table.

Syntax
DELETE FROM table_name
WHERE column_name = some_value
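For example, against the same "Persons" table (the values in the SET and WHERE clauses are illustrative):

UPDATE Persons
SET Address = 'Nissestien 67'
WHERE LastName = 'Pettersen'

DELETE FROM Persons
WHERE LastName = 'Pettersen'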

2.3 SQL Data Query Language (DQL)
It is used to retrieve existing data from the database, using SELECT statements.

SQL SELECT Example
To select the content of columns named "LastName" and "FirstName", from the database table called "Persons", use a SELECT statement like this:

SELECT LastName, FirstName FROM Persons

The WHERE Clause
To conditionally select data from a table, a WHERE clause can be added to the SELECT statement.

Syntax
SELECT column
FROM table
WHERE column operator value

With the WHERE clause, the following operators can be used:

Operator   Description
=          Equal
<>         Not equal
>          Greater than
<          Less than
>=         Greater than or equal
<=         Less than or equal
BETWEEN    Between an inclusive range
LIKE       Search for a pattern

Using the WHERE Clause
To select only the persons living in the city "Sandnes", we add a WHERE clause to the SELECT statement:

SELECT * FROM Persons WHERE City='Sandnes'

"Persons" table LastName Hansen Svendson Svendson Pettersen FirstName Ola Tove Stale Kari Address Timoteivn 10 Borgvn 23 Kaivn 18 Storgt 20 City Sandnes Sandnes Sandnes Stavanger Year 1951 1978 1980 1960

Result
LastName    FirstName   Address        City      Year
Hansen      Ola         Timoteivn 10   Sandnes   1951
Svendson    Tove        Borgvn 23      Sandnes   1978
Svendson    Stale       Kaivn 18       Sandnes   1980

The LIKE Condition
The LIKE condition is used to specify a search for a pattern in a column.

Syntax
SELECT column
FROM table
WHERE column LIKE pattern

A "%" sign can be used to define wildcards (missing letters in the pattern) both before and after the pattern.

Using LIKE
The following SQL statement will return persons with first names that start with an 'O':

SELECT * FROM Persons WHERE FirstName LIKE 'O%'

The ORDER BY keyword is used to sort the result.

Sort the Rows

The ORDER BY clause is used to sort the rows.

Orders:
Company     OrderNumber
Sega        3412
ABC Shop    5678
W3Schools   2312
W3Schools   6798

Example
To display the companies in alphabetical order:

SELECT Company, OrderNumber FROM Orders ORDER BY Company

Result:
Company     OrderNumber
ABC Shop    5678
Sega        3412
W3Schools   6798
W3Schools   2312

Example
To display the companies in alphabetical order AND the order numbers in numerical order:

SELECT Company, OrderNumber FROM Orders ORDER BY Company, OrderNumber

Result:
Company     OrderNumber
ABC Shop    5678
Sega        3412
W3Schools   2312
W3Schools   6798

GROUP BY...
Aggregate functions (like SUM) often need an added GROUP BY functionality. GROUP BY... was added to SQL because aggregate functions (like SUM) return the aggregate of all column values every time they are called, and without the GROUP BY function it was impossible to find the sum for each individual group of column values. The syntax for the GROUP BY function is:

SELECT column, SUM(column) FROM table GROUP BY column

GROUP BY Example
This "Sales" table:

Company     Amount
W3Schools   5500
IBM         4500
W3Schools   7100
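The query itself is not shown in this material; a sketch of the statement and its expected result against the "Sales" table is:

SELECT Company, SUM(Amount)
FROM Sales
GROUP BY Company

Result:
Company     SUM(Amount)
W3Schools   12600
IBM         4500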

3. What is a View?
In SQL, a VIEW is a virtual table based on the result-set of a SELECT statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database. You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data were coming from a single table.

Syntax
CREATE VIEW view_name AS
SELECT column_name(s)
FROM table_name
WHERE condition

Views are of two types: updateable views and non-updateable views. Through an updateable view the values of the underlying table can be modified, whereas through a non-updateable view the base table cannot be updated.
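A small sketch against the "Persons" table used earlier (the view name and condition are illustrative):

CREATE VIEW SandnesPersons AS
SELECT LastName, FirstName
FROM Persons
WHERE City = 'Sandnes'

The view can then be queried like an ordinary table:

SELECT * FROM SandnesPersons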

4. Rename of a Table Column


ALTER TABLE <table> RENAME <oldname> TO <newname>;

RENAME TABLE student TO student_new

This SQL command will rename the student table to student_new.

5. Renames a SQL view in the current database.


RENAME VIEW ViewName1 TO ViewName2

Parameters
ViewName1 - Specifies the name of the SQL view to be renamed.
ViewName2 - Specifies the new name of the SQL view.

6. Renaming Columns & Constraints

In addition to renaming tables and indexes, Oracle9i Release 2 allows the renaming of columns and constraints on tables. In this example, once the TEST1 table is created it is renamed along with its columns, primary key constraint and the index that supports the primary key:

SQL> CREATE TABLE test1 (
  2    col1 NUMBER(10) NOT NULL,
  3    col2 VARCHAR2(50) NOT NULL);

Table created.

SQL> ALTER TABLE test1 ADD (
  2    CONSTRAINT test1_pk PRIMARY KEY (col1));

Table altered.

SQL> DESC test1
Name                 Null?    Type
-------------------- -------- --------------------
COL1                 NOT NULL NUMBER(10)
COL2                 NOT NULL VARCHAR2(50)

SQL> SELECT constraint_name
  2  FROM   user_constraints
  3  WHERE  table_name      = 'TEST1'
  4  AND    constraint_type = 'P';

CONSTRAINT_NAME
------------------------------
TEST1_PK

1 row selected.

SQL> SELECT index_name, column_name
  2  FROM   user_ind_columns
  3  WHERE  table_name = 'TEST1';

INDEX_NAME           COLUMN_NAME
-------------------- --------------------
TEST1_PK             COL1

1 row selected.

SQL> -- Rename the table, columns, primary key
SQL> -- and supporting index.
SQL> ALTER TABLE test1 RENAME TO test;

Table altered.

SQL> ALTER TABLE test RENAME COLUMN col1 TO id;

Table altered.

SQL> ALTER TABLE test RENAME COLUMN col2 TO description;

Table altered.

SQL> ALTER TABLE test RENAME CONSTRAINT test1_pk TO test_pk;

Table altered.

SQL> ALTER INDEX test1_pk RENAME TO test_pk;

Index altered.

SQL> DESC test
Name                 Null?    Type
-------------------- -------- --------------------
ID                   NOT NULL NUMBER(10)
DESCRIPTION          NOT NULL VARCHAR2(50)

SQL> SELECT constraint_name
  2  FROM   user_constraints
  3  WHERE  table_name      = 'TEST'
  4  AND    constraint_type = 'P';

CONSTRAINT_NAME
------------------------------
TEST_PK

1 row selected.

SQL> SELECT index_name, column_name
  2  FROM   user_ind_columns
  3  WHERE  table_name = 'TEST';

INDEX_NAME           COLUMN_NAME
-------------------- --------------------
TEST_PK              ID

1 row selected.

STRUCTURE QUERY LANGUAGE


End Chapter quizzes:
Q1. SELECT statement is used for
a) Updating data in the database
b) Retrieving data from the database
c) Change in the structure of database
d) None of the above

Q2. Select the correct statement
a) ALTER statement is used to modify the structure of the database.
b) UPDATE statement is used to change the data in the table.
c) SELECT statement is used to retrieve the data from the database.
d) All of the above.

Q3. Which of the following statements are NOT TRUE about ORDER BY clauses?
a) Ascending or descending order can be defined with the asc or desc keywords.
b) Only one column can be used to define the sort order in an order by clause.
c) Multiple columns can be used to define sort order in an order by clause.
d) Columns can be represented by numbers indicating their listed order in the select.

Q4. GRANT and REVOKE are
a) DDL statements
b) DML statements
c) DCL statements
d) None of these.

Q5. Oracle 8i can be best described as
a) Object-based DBMS
b) Object-oriented DBMS
c) Object-relational DBMS
d) Relational DBMS

Q6. Select the correct statement.
a) View has no physical existence.
b) Data from the view are retrieved through the Table.
c) Both (a) and (b)
d) None of these.

Q7. INSERT statement is used for
a) Storing data into the Table
b) Deleting data from the Table
c) Both a and b
d) Updating data in the table

Q8. ALTER statement is used for
a) Changing the structure of the table
b) Changing data in the Table
c) Both a and b
d) Deleting data from the table

Q9. RENAME TABLE student TO student_new
a) Renames the column of the Table
b) Changes the Table name student to student_new
c) Renames the row of the table
d) None of the above.

Q10. ORDER BY clause is used to
a) Sort the rows of the table in a particular order
b) Remove the column of the table
c) Rename the Table
d) Both a and c

Chapter: 5
PROCEDURAL QUERY LANGUAGE

1. Introduction to PL/SQL
PL/SQL is a procedural extension for Oracle's Structured Query Language. PL/SQL is not a separate language but rather a technology, meaning that you will not have a separate place or prompt for executing your PL/SQL programs. PL/SQL technology is like an engine that executes PL/SQL blocks and subprograms. This engine can be started in the Oracle server or in application development tools such as Oracle Forms, Oracle Reports etc.

As shown in the above figure, the PL/SQL engine executes procedural statements and sends the SQL part of the statements to the SQL statement processor in the Oracle server. PL/SQL combines the data manipulating power of SQL with the data processing power of procedural languages.

2 Block Structure of PL/SQL:


PL/SQL is a block-structured language. It means that programs of PL/SQL contain logical blocks. A PL/SQL block consists of SQL and PL/SQL statements.

A PL/SQL Block consists of three sections:
o The Declaration section (optional).
o The Execution section (mandatory).
o The Exception (or Error) Handling section (optional).

2.1 Declaration Section:


The Declaration section of a PL/SQL Block starts with the reserved keyword DECLARE. This section is optional and is used to declare any placeholders like variables, constants, records and cursors, which are used to manipulate data in the execution section. Placeholders may be any of Variables, Constants and Records, which stores data temporarily. Cursors are also declared in this section.
Declaring Variables: Variables are declared in DECLARE section of PL/SQL.

DECLARE
   SNO NUMBER (3);
   SNAME VARCHAR2 (15);

2.2 Execution Section:


The Execution section of a PL/SQL Block starts with the reserved keyword BEGIN and ends with END. This is a mandatory section and is the section where the program logic is written to perform any task. The programmatic constructs like loops, conditional statement and SQL statements form the part of execution section.

2.3 Exception Section:


The Exception section of a PL/SQL Block starts with the reserved keyword EXCEPTION. This section is optional. Any errors in the program can be handled in this section, so that the PL/SQL Blocks terminates gracefully. If the PL/SQL Block contains exceptions that cannot be handled,

the Block terminates abruptly with errors. Every statement in the above three sections must end with a semicolon (;). PL/SQL blocks can be nested within other PL/SQL blocks. Comments can be used to document code.

3. How a sample PL/SQL Block looks

DECLARE
   Variable declaration
BEGIN
   Program Execution
EXCEPTION
   Exception handling
END;
Variables and Constants: Variables are used to store query results. Forward references are not allowed. Hence you must first declare the variable and then use it.

Variables can have any SQL data type, such as CHAR, DATE, NUMBER etc or any PL/SQL data type like BOOLEAN, BINARY_INTEGER etc. Declaring Variables: Variables are declared in DECLARE section of PL/SQL.

DECLARE
   SNO NUMBER (3);
   SNAME VARCHAR2 (15);
BEGIN

Assigning values to variables:
SNO := 1001;
or
SNAME := 'JOHN';
etc.

The following sketch explains how to write a simple PL/SQL program and execute it.

SET SERVEROUTPUT ON is a command used to access results from the Oracle Server. A PL/SQL program is terminated by a /. DBMS_OUTPUT is a package and PUT_LINE is a procedure in it. You will learn more about procedures, functions and packages in the following sections of this tutorial.
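The referenced screen shot is not reproduced here; a minimal sketch of such a program (the values printed are illustrative) is:

SET SERVEROUTPUT ON

DECLARE
   SNO NUMBER (3);
   SNAME VARCHAR2 (15);
BEGIN
   SNO := 1001;
   SNAME := 'JOHN';
   DBMS_OUTPUT.PUT_LINE ('Student ' || SNO || ' is ' || SNAME);
END;
/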

Above program can also be written as a text file in Notepad editor and then executed as explained in the following screen shot.

4. Control Statements
This section explains how to structure the flow of control through a PL/SQL program. The control structures of PL/SQL are simple yet powerful. Control structures in PL/SQL can be divided into Conditional (selection), Iterative and Sequential.

4.1 Conditional Control (Selection): This structure tests a condition; depending on whether the condition is true or false, it decides the sequence of statements to be executed.

Syntax for IF-THEN
IF <condition> THEN
   Statements
END IF;

Syntax for IF-THEN-ELSE:
IF <condition> THEN
   Statements
ELSE
   Statements
END IF;


Syntax for IF-THEN-ELSIF:
IF <condition> THEN
   Statements
ELSIF <condition> THEN
   Statements
ELSE
   Statements
END IF;
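The worked examples appear as screen shots in the original; a small sketch of conditional control (the variable name and cut-off values are assumptions) is:

DECLARE
   marks NUMBER (3) := 65;
BEGIN
   IF marks >= 75 THEN
      DBMS_OUTPUT.PUT_LINE ('Distinction');
   ELSIF marks >= 40 THEN
      DBMS_OUTPUT.PUT_LINE ('Pass');
   ELSE
      DBMS_OUTPUT.PUT_LINE ('Fail');
   END IF;
END;
/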

4.2 Iterative Control
The LOOP statement executes the body statements multiple times. The statements are placed between the LOOP ... END LOOP keywords. The simplest form of LOOP statement is an infinite loop. The EXIT statement is used inside a LOOP to terminate it.

Syntax for LOOP ... END LOOP
LOOP
   Statements
END LOOP;

Example:
BEGIN
   LOOP
      DBMS_OUTPUT.PUT_LINE ('Hello');
   END LOOP;
END;

5. CURSOR
For every SQL statement execution a certain area in memory is allocated. PL/SQL allows you to name this area. This private SQL area is called the context area or cursor. A cursor acts as a handle or pointer into the context area. A PL/SQL program controls the context area using the cursor. A cursor represents a structure in memory and is different from a cursor variable. When you declare a cursor, you get a pointer variable, which does not point to anything. When the cursor is opened, memory is allocated and the cursor structure is created. The cursor variable now points to the cursor. When the cursor is closed the memory allocated for the cursor is released. Cursors allow the programmer to retrieve data from a table and perform actions on that data one row at a time. There are two types of cursors: implicit cursors and explicit cursors.

5.1 Implicit cursors
For SQL queries returning a single row, PL/SQL declares implicit cursors. Implicit cursors are simple SELECT statements and are written in the BEGIN block (executable section) of the PL/SQL program. Implicit cursors are easy to code, and they retrieve exactly one row. PL/SQL implicitly declares cursors for all DML statements.

The most commonly raised exceptions here are NO_DATA_FOUND or TOO_MANY_ROWS.

Syntax:
SELECT Ename, sal INTO ena, esa
FROM EMP
WHERE EMPNO = 7845;

Note: Ename and sal are columns of the table EMP and ena and esa are the variables used to store ename and sal fetched by the query.

5.2 Explicit Cursors
Explicit cursors are used in queries that return multiple rows. The set of rows fetched by a query is called the active set; its size is the number of rows that meet the search criteria in the select statement. An explicit cursor is declared in the DECLARE section of the PL/SQL program.

Syntax: CURSOR <cursor-name> IS <select statement>

Sample Code:

DECLARE
   CURSOR emp_cur IS SELECT ename FROM EMP;

BEGIN
   ------
END;

Processing multiple rows is similar to file processing. For processing a file you need to open it, process its records and then close it. Similarly, a user-defined explicit cursor needs to be opened before reading the rows, after which it is closed. Just as a file pointer marks the current position in file processing, the cursor marks the current position in the active set.

5.3 Opening Cursor
Syntax: OPEN <cursor-name>;
Example: OPEN emp_cur;

When a cursor is opened the active set is determined: the rows satisfying the where clause in the select statement are added to the active set. A pointer is established and points to the first row in the active set.

5.4 Fetching from the cursor
To get the next row from the cursor we need to use the FETCH statement.
Syntax: FETCH <cursor-name> INTO <variables>;
Example: FETCH emp_cur INTO ena;

The FETCH statement retrieves one row at a time. The BULK COLLECT clause needs to be used to fetch more than one row at a time.

Closing the cursor: After retrieving all the rows from the active set, the cursor should be closed. Resources allocated for the cursor are now freed. Once the cursor is closed, the execution of a FETCH statement will lead to errors.

CLOSE <cursor-name>;

5.5 Explicit Cursor Attributes
Every cursor defined by the user has 4 attributes. When appended to the cursor name these attributes let the user access useful information about the execution of a multi-row query. The attributes are:
1. %NOTFOUND: A Boolean attribute, which evaluates to true if the last fetch failed, i.e. when there are no rows left in the cursor to fetch.
2. %FOUND: A Boolean attribute, which evaluates to true if the last fetch succeeded.
3. %ROWCOUNT: A numeric attribute, which returns the number of rows fetched by the cursor so far.
4. %ISOPEN: A Boolean attribute, which evaluates to true if the cursor is open, otherwise to false.

In the above example a separate fetch was written for each row; instead, a loop statement could be used. The following example explains the usage of LOOP with a cursor.
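That example is given as a screen shot in the original; a sketch of the same idea, assuming the usual EMP table with an ename column, is:

DECLARE
   CURSOR emp_cur IS SELECT ename FROM EMP;
   ena EMP.ename%TYPE;
BEGIN
   OPEN emp_cur;
   LOOP
      FETCH emp_cur INTO ena;
      EXIT WHEN emp_cur%NOTFOUND;   -- stop once the last fetch fails
      DBMS_OUTPUT.PUT_LINE (ena);
   END LOOP;
   CLOSE emp_cur;
END;
/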

6. Exceptions
An Exception is an error situation, which arises during program execution. When an error occurs exception is raised, normal execution is stopped and control transfers to exception-handling part. Exception handlers are routines written to handle the exception. The exceptions can be internally defined (system-defined or pre-defined) or User-defined exception.

6.1 Predefined Exceptions
A predefined exception is raised automatically whenever there is a violation of Oracle coding rules. Predefined exceptions are those like ZERO_DIVIDE, which is raised automatically when we try to divide a number by zero. Other built-in exceptions are given below. You can handle unexpected Oracle errors using the OTHERS handler. It can handle all raised exceptions that are not handled by any other handler. It must always be written as the last handler in the exception block.

CURSOR_ALREADY_OPEN - Raised when we try to open an already open cursor.
DUP_VAL_ON_INDEX - Raised when you try to insert a duplicate value into a unique column.
INVALID_CURSOR - Occurs when we try to access an invalid cursor.
INVALID_NUMBER - On usage of something other than a number in place of a number value.
LOGIN_DENIED - Raised when a user login is denied.
TOO_MANY_ROWS - When a select query returns more than one row and the destination variable can take only a single value.
VALUE_ERROR - When an arithmetic, value conversion, truncation, or constraint error occurs.

Predefined exception handlers are declared globally in the package STANDARD. Hence we need not define them; we just use them.

The biggest advantage of exception handling is that it improves the readability and reliability of the code. Errors from many statements of code can be handled with a single handler. Instead of checking for an error at every point, we can just add an exception handler and, if any exception is raised, it is handled by that handler. For checking errors at a specific spot it is always better to have those statements in a separate BEGIN ... END block.

Example 1: The following example gives the usage of the ZERO_DIVIDE exception.
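The example is a screen shot in the original; a sketch of such a block (the numbers are illustrative) is:

DECLARE
   a NUMBER := 10;
   b NUMBER := 0;
   c NUMBER;
BEGIN
   c := a / b;                       -- raises ZERO_DIVIDE
   DBMS_OUTPUT.PUT_LINE (c);
EXCEPTION
   WHEN ZERO_DIVIDE THEN
      DBMS_OUTPUT.PUT_LINE ('Division by zero is not allowed');
END;
/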

Example 2: The usage of the NO_DATA_FOUND exception is explained in the following example.

The DUP_VAL_ON_INDEX is raised when a SQL statement tries to create a duplicate value in a column on which a primary key or unique constraints are defined. Example: To demonstrate the exception DUP_VAL_ON_INDEX.

More than one Exception can be written in a single handler as shown below.

EXCEPTION
   WHEN NO_DATA_FOUND OR TOO_MANY_ROWS THEN
      Statements;
END;

6.2 User-defined Exceptions
A user-defined exception has to be defined by the programmer. User-defined exceptions are declared in the declaration section with their type as EXCEPTION. They must be raised explicitly using the RAISE statement, unlike pre-defined exceptions that are raised implicitly. The RAISE statement can also be used to raise internal exceptions.

Declaring an Exception:
DECLARE
   myexception EXCEPTION;
BEGIN
   ------

Raising an Exception:
BEGIN
   RAISE myexception;
   ------

Handling an Exception:
BEGIN
   --------
EXCEPTION
   WHEN myexception THEN
      Statements;
END;

Points To Ponder:

o An Exception cannot be declared twice in the same block.
o Exceptions declared in a block are considered local to that block and global to its sub-blocks.
o An enclosing block cannot access Exceptions declared in its sub-block, whereas it is possible for a sub-block to refer to the Exceptions of its enclosing block.

The following example explains the usage of User-defined Exception

RAISE_APPLICATION_ERROR
To display your own error messages one can use the built-in RAISE_APPLICATION_ERROR. It displays the error message in the same way as Oracle errors. You should use a negative error_number in the range -20000 to -20999, and the error message should not exceed 512 characters. The syntax to call RAISE_APPLICATION_ERROR is:

RAISE_APPLICATION_ERROR (error_number, error_message, { TRUE | FALSE })
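A sketch that combines a user-defined exception with RAISE_APPLICATION_ERROR (the table, threshold and error text are assumptions made for illustration):

DECLARE
   low_salary EXCEPTION;
   esa EMP.sal%TYPE;
BEGIN
   SELECT sal INTO esa FROM EMP WHERE EMPNO = 7845;
   IF esa < 1000 THEN
      RAISE low_salary;
   END IF;
EXCEPTION
   WHEN low_salary THEN
      RAISE_APPLICATION_ERROR (-20001, 'Salary is below the permitted minimum');
END;
/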

FETCH is used twice in the above example to make %FOUND available.

Using the Cursor FOR Loop: The cursor FOR loop can be used to process multiple records. There are two benefits with the cursor FOR loop:
1. It implicitly declares a %ROWTYPE variable and uses it as the loop index.
2. The cursor FOR loop itself opens the cursor, reads the records and then closes the cursor automatically. Hence OPEN, FETCH and CLOSE statements are not necessary in it.

Example:

emp_rec is an automatically created variable of %ROWTYPE. We have not used OPEN, FETCH, and CLOSE in the above example, as the cursor FOR loop does this automatically. The above example can be rewritten with fewer lines of code, as sketched below; this is called an implicit FOR loop.
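The screen-shot versions are not reproduced; a sketch of the cursor FOR loop in its implicit form, over the assumed EMP table, is:

BEGIN
   FOR emp_rec IN (SELECT ename, sal FROM EMP) LOOP
      -- emp_rec is declared implicitly with the %ROWTYPE of the query
      DBMS_OUTPUT.PUT_LINE (emp_rec.ename || ' earns ' || emp_rec.sal);
   END LOOP;
END;
/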

Deletion or Updation Using a Cursor: All the previous examples explained how to retrieve data using cursors. Now we will see how to modify or delete rows in a table using cursors. In order to update or delete rows, the cursor must be defined with the FOR UPDATE clause. The UPDATE or DELETE statement must then use the WHERE CURRENT OF clause. The following example updates the comm of all employees with salary less than 2000 by adding 100 to the existing comm.
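A sketch of that update, assuming the usual EMP table with sal and comm columns:

DECLARE
   CURSOR emp_cur IS
      SELECT sal, comm FROM EMP
      WHERE sal < 2000
      FOR UPDATE;
BEGIN
   FOR emp_rec IN emp_cur LOOP
      UPDATE EMP
      SET comm = NVL (comm, 0) + 100
      WHERE CURRENT OF emp_cur;
   END LOOP;
   COMMIT;
END;
/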

7. PL/SQL subprograms
A subprogram is a named block of PL/SQL. There are two types of subprograms in PL/SQL namely Procedures and Functions. Every subprogram will have a declarative part, an executable part or body, and an exception handling part, which is optional.

Declarative part contains variable declarations. Body of a subprogram contains executable statements of SQL and PL/SQL. Statements to handle exceptions are written in exception part.

When a client executes a procedure or function, the processing is done in the server. This reduces network traffic. The subprograms are compiled and stored in the Oracle database as stored programs and can be invoked whenever required. As they are stored in compiled form, when called they only need to be executed; hence they save the time needed for compilation. Subprograms provide the following advantages:

1. They allow you to write PL/SQL programs that meet our needs.
2. They allow you to break the program into manageable modules.
3. They provide reusability and maintainability for the code.

7.1 Procedures
A procedure is a subprogram used to perform a specific action. A procedure contains two parts: the specification and the body. The procedure specification begins with CREATE and ends with the procedure name or the parameter list. Procedures that do not take parameters are written without parentheses. The body of the procedure starts after the keyword IS or AS and ends with the keyword END.
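The general shape of the procedure syntax can be sketched as follows:

   CREATE [OR REPLACE] PROCEDURE <procedure_name> [(<parameter_list>)]
   [AUTHID DEFINER | CURRENT_USER]
   IS | AS
      <local declarations>
   BEGIN
      <executable statements>
   [EXCEPTION
      <exception handlers>]
   END [<procedure_name>];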

In the above given syntax, things enclosed in angular brackets (< >) are user defined and those enclosed in square brackets ([ ]) are optional.

OR REPLACE is used to overwrite an existing procedure with the same name, if there is any. The AUTHID clause is used to decide whether the procedure should execute with invoker rights (current user, the person who executes it) or with definer rights (owner, the person who created it).

Example:

   CREATE PROCEDURE MyProc (ENO NUMBER)
   AUTHID DEFINER AS
   BEGIN
      DELETE FROM EMP WHERE EMPNO = ENO;
   EXCEPTION
      WHEN NO_DATA_FOUND THEN
         DBMS_OUTPUT.PUT_LINE('No employee with this number');
   END;

Let us assume that the above procedure is created in the SCOTT schema (SCOTT's user area) and is executed by the user SEENU. It will delete rows from the table EMP owned by SCOTT, but not from the EMP owned by SEENU. It is possible to use a procedure owned by one user on tables owned by other users; this is done by setting invoker rights with AUTHID CURRENT_USER.

PRAGMA AUTONOMOUS_TRANSACTION is used to instruct the compiler to treat the procedure as autonomous, i.e. it can commit or roll back the changes it makes independently of the calling transaction.
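A minimal sketch of an autonomous procedure, assuming a hypothetical ERROR_LOG table with LOGGED_ON and MESSAGE columns:

   CREATE OR REPLACE PROCEDURE LOG_MESSAGE (P_TEXT VARCHAR2) AS
      PRAGMA AUTONOMOUS_TRANSACTION;
   BEGIN
      INSERT INTO ERROR_LOG (LOGGED_ON, MESSAGE) VALUES (SYSDATE, P_TEXT);
      COMMIT;   -- commits only this autonomous transaction, not the caller's pending work
   END;
   /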

Parameter Modes
Parameters are used to pass values to the procedure being called. There are three modes that can be used with parameters, based on their usage: IN, OUT, and IN OUT. An IN mode parameter is used to pass values to the called procedure; inside the program an IN parameter acts like a constant, i.e. it cannot be modified. An OUT mode parameter allows you to return a value from the procedure; inside the procedure an OUT parameter acts like an uninitialized variable, therefore its value cannot be assigned to another variable. An IN OUT mode parameter allows you to both pass a value to and return a value from the subprogram. The default mode of an argument is IN.
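A minimal sketch showing the three modes together (the procedure and parameter names are hypothetical):

   CREATE OR REPLACE PROCEDURE ADJUST_SAL (
      P_EMPNO   IN     NUMBER,   -- passed in; acts like a constant inside the procedure
      P_RAISE   IN     NUMBER,
      P_NEW_SAL    OUT NUMBER,   -- returned to the caller
      P_TOTAL   IN OUT NUMBER)   -- passed in, modified and returned
   AS
   BEGIN
      UPDATE EMP SET SAL = SAL + P_RAISE WHERE EMPNO = P_EMPNO
      RETURNING SAL INTO P_NEW_SAL;
      P_TOTAL := P_TOTAL + P_RAISE;
   END;
   /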

Positional vs. Named Notation of Parameters

A procedure is communicated with by passing parameters to it. The parameters passed to a procedure may follow either positional notation or named notation. For example, if a procedure is defined as GROSS (ESAL NUMBER, ECOM NUMBER) and we call this procedure as GROSS (ESA, ECO), then the parameters are passed using positional notation. For named notation we use the following syntax: GROSS (ECOM => ECO, ESAL => ESA).

A procedure can also be executed by invoking it as an executable statement inside an anonymous block, as shown below (PROC1 is the name of the procedure):

   BEGIN
      PROC1;
   END;
   /

Functions: A function is a PL/SQL subprogram which is used to compute a value. A function is similar to a procedure, except that it must have a RETURN clause. The syntax for a function is given below.
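The general shape of the function syntax can be sketched as follows:

   CREATE [OR REPLACE] FUNCTION <function_name> [(<parameter_list>)]
   RETURN <datatype>
   IS | AS
      <local declarations>
   BEGIN
      <executable statements>
      RETURN <value>;
   [EXCEPTION
      <exception handlers>]
   END [<function_name>];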

Examples: A function without arguments is shown below.
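A minimal sketch of a function without arguments, assuming the EMP table:

   CREATE OR REPLACE FUNCTION EMP_COUNT
   RETURN NUMBER AS
      v_count NUMBER;
   BEGIN
      SELECT COUNT(*) INTO v_count FROM EMP;
      RETURN v_count;
   END;
   /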

A function with arguments, and different ways of executing the function, are shown below.
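A minimal sketch, again assuming the EMP table, followed by two ways of executing the function:

   CREATE OR REPLACE FUNCTION GET_SAL (P_EMPNO NUMBER)
   RETURN NUMBER AS
      v_sal EMP.SAL%TYPE;
   BEGIN
      SELECT SAL INTO v_sal FROM EMP WHERE EMPNO = P_EMPNO;
      RETURN v_sal;
   END;
   /

   -- Executing from SQL:
   SELECT GET_SAL(7369) FROM DUAL;

   -- Executing from a PL/SQL block:
   DECLARE
      v_sal NUMBER;
   BEGIN
      v_sal := GET_SAL(7369);
      DBMS_OUTPUT.PUT_LINE('Salary: ' || v_sal);
   END;
   /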

Chapter-5
PROCEDURAL QUERY LANGUAGE
End Chapter quizzes

Q1. Select the correct statement
a) User-defined exceptions are defined by the programmer
b) PL/SQL improves the capacity of SQL
c) %NOTFOUND: It is a Boolean attribute
d) All of the above

Q2. Select the correct statement
a) Declaration section is optional
b) The Execution section is mandatory
c) The Exception (or Error) Handling section is mandatory
d) Only a and c are correct
Q3. A command used to access results from the Oracle Server
a) SET SERVEROUTPUT ON
b) PRINT
c) WRITE
d) OUTPUT_SERVER
Q4. Which cursors are used in queries that return multiple rows?
a) Explicit cursors
b) Implicit cursors
c) Open cursors
d) Both a and c
Q5. Program logic of PL/SQL is written in the:
a) Declaration section
b) Execution section
c) Exception handling section
d) Program section
Q6. Variables and constants are declared in the:
a) Variable section
b) Declaration section
c) Execution section
d) Program section
Q7. There are two types of subprograms in PL/SQL, namely
a) Procedures
b) Cursors
c) Functions
d) Both a and c

Q8. A user-defined exception has to be defined by
a) Programmer
b) User
c) Technical Writer
d) None
Q9. The biggest advantage of exception handling is that it improves
a) Readability
b) Reliability
c) Both a and b
d) None
Q10. NO_DATA_FOUND and TOO_MANY_ROWS are
a) most commonly used functions
b) most commonly raised exceptions
c) Triggers
d) Procedures

Chapter: 6
TRANSACTION MANAGEMENT & CONCURRENCY CONTROL TECHNIQUE

1. Introductory Concept to Database Transaction


A database transaction comprises a logical unit of work performed within a database management system (or a similar system) against a database, and treated in a coherent and reliable way independent of other transactions. Transactions in a database environment have two main purposes:

1. To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status.
2. To provide isolation between programs accessing a database concurrently. Without isolation the programs' outcomes are possibly erroneous.

A database transaction, by definition, must be atomic, consistent, isolated and durable. Database practitioners often refer to these properties of database transactions using the acronym ACID. Transactions provide an "all-or-nothing" proposition: each unit of work performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must be written to durable storage.

Most modern relational database management systems fall into the category of databases that support transactions: transactional databases. In a database system a transaction might consist of one or more data-manipulation statements and queries, each reading and/or writing information in the database. Users of database systems consider consistency and integrity of data as highly important. A simple transaction is usually issued to the database system in a language like SQL wrapped in a transaction, using a pattern similar to the following:

1. Begin the transaction
2. Execute several data manipulations and queries
3. If no errors occur, then commit the transaction and end it
4. If errors occur, then roll back the transaction and end it

If no errors occurred during the execution of the transaction then the system commits the transaction. A transaction commit operation applies all data manipulations within the scope of the transaction and persists the results to the database. If an error occurs during the transaction, or if the user specifies a rollback operation, the data manipulations within the transaction are not persisted to the database. In no case can a partial transaction be committed to the database since that would leave the database in an inconsistent state. Internally, multi-user databases store and process transactions, often by using a transaction ID or XID.
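A minimal sketch of this pattern in PL/SQL, assuming a hypothetical ACCOUNTS table with ACC_NO and BALANCE columns:

   BEGIN
      UPDATE ACCOUNTS SET BALANCE = BALANCE - 500 WHERE ACC_NO = 101;
      UPDATE ACCOUNTS SET BALANCE = BALANCE + 500 WHERE ACC_NO = 102;
      COMMIT;          -- both changes become permanent together
   EXCEPTION
      WHEN OTHERS THEN
         ROLLBACK;     -- any error undoes the whole unit of work
         RAISE;
   END;
   /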

2. ACID properties
When a transaction processing system creates a transaction, it will ensure that the transaction has certain characteristics. The developers of the components that comprise the transaction are assured that these characteristics are in place; they do not need to manage these characteristics themselves. These characteristics are known as the ACID properties. ACID is an acronym for atomicity, consistency, isolation, and durability.

2.1 Atomicity
The atomicity property identifies that the transaction is atomic. An atomic transaction is either fully completed or not begun at all. Any updates that a transaction makes to a system are completed in their entirety. If for any reason an error occurs and the transaction is unable to complete all of its steps, then the system is returned to the state it was in before the transaction was started. An example of an atomic transaction is an account transfer transaction. The money is removed from account A and then placed into account B. If the system fails after removing the money from account A, then the transaction processing system will put the money back into account A, thus returning the system to its original state. This is known as a rollback, as we said at the beginning of this chapter.

2.2 Consistency
A transaction enforces consistency in the system state by ensuring that at the end of any transaction the system is in a valid state. If the transaction completes successfully, then all changes to the system will have been properly made, and the system will be in a valid state. If any error occurs in a transaction, then any changes already made will be automatically rolled back. This will return the system to its state before the transaction was started. Since the system was in a consistent state when the transaction was started, it will once again be in a consistent state.

Looking again at the account transfer system, the system is consistent if the total of all accounts is constant. If an error occurs and the money is removed from account A and not added to account B, then the total in all accounts would have changed. The system would no longer be consistent. By rolling back the removal from account A, the total will again be what it should be, and the system back in a consistent state.

2.3 Isolation
When a transaction runs in isolation, it appears to be the only action that the system is carrying out at one time. If there are two transactions that are both performing the same function and are running at the same time, transaction isolation will ensure that each transaction thinks it has exclusive use of the system. This is important in that as the transaction is being executed, the state of the system may not be consistent. The transaction ensures that the system remains consistent after the transaction ends, but during an individual transaction, this may not be the case. If a transaction was not running in isolation, it could access data from the system that may not be consistent. By providing transaction isolation, this is prevented from happening.

2.4 Durability
A transaction is durable in that once it has been successfully completed, all of the changes it made to the system are permanent. There are safeguards that will prevent the loss of information, even in the case of system failure. By logging the steps that the transaction performs, the state of the system can be recreated even if the hardware itself has failed. The concept of durability allows the developer to know that a completed transaction is a permanent part of the system, regardless of what happens to the system later on.

3 The Concept of Schedules


When transactions are executing concurrently in an interleaved fashion, not only does the action of each transaction become important, but also the order of execution of operations from each of these transactions. Hence, for analyzing any problem, it is not just the history of previous transactions that one should be worrying about, but also the schedule of operations.

3.1 Schedule (history of transactions): We formally define a schedule S of n transactions T1, T2, ..., Tn as an ordering of the operations of the transactions, subject to the constraint that, for each transaction Ti that participates in S, the operations of Ti must appear in the same order in which they appear in Ti. That is, if two operations Ti1 and Ti2 are listed in Ti such that Ti1 is earlier than Ti2, then in the schedule also Ti1 should appear before Ti2. However, if Ti2 appears immediately after Ti1 in Ti, the same may not be true in S, because some other operation Tj1 (of a transaction Tj) may be interleaved between them. In short, a schedule lists the sequence of operations on the database in the same order in which they were effected in the first place.

For the recovery and concurrency control operations, we concentrate mainly on the read and write operations of the transactions, because these operations actually effect changes to the database. The other two (equally) important operations are commit and abort, since they decide when the changes effected have actually become permanent on the database. Since listing each of these operations becomes a lengthy process, we adopt a notation for describing the schedule. The read operations (read_tr), write operations (write_tr), commit and abort we indicate by r, w, c and a, and each of them comes with a subscript to indicate the transaction number.

For example, SA: r1(X); r2(Y); w2(Y); r1(Y); w1(X); a1 indicates the following operations in the same order:

   r1(X)   read_tr(X)   by transaction 1
   r2(Y)   read_tr(Y)   by transaction 2
   w2(Y)   write_tr(Y)  by transaction 2
   r1(Y)   read_tr(Y)   by transaction 1
   w1(X)   write_tr(X)  by transaction 1
   a1      abort        of transaction 1

3.2 Conflicting operations: Two operations in a schedule are said to be in conflict if they satisfy all of these conditions:
i) The operations belong to different transactions.
ii) They access the same item X.
iii) At least one of the operations is a write operation.

For example, the following pairs of operations conflict:
   r1(X) and w2(X)
   w1(X) and r2(X)
   w1(Y) and w2(Y)
The last pair conflicts because both of them try to write the same item. But r1(X); w2(Y) and r1(X); r2(X) do not conflict, because in the first case the read and write are on different data items, and in the second case both are trying to read the same data item, which they can do without any conflict.

3.3 A Complete Schedule: A schedule S of n transactions T1, T2, ..., Tn is said to be a complete schedule if the following conditions are satisfied:
i) The operations listed in S are exactly the same operations as in T1, T2, ..., Tn, including the commit or abort operations. Each transaction is terminated by either a commit or an abort operation.
ii) The operations of any transaction Ti appear in the schedule in the same order in which they appear in the transaction.
iii) Whenever there are conflicting operations, one of the two will occur before the other in the schedule.

A partial order of the schedule is said to occur if the first two conditions of the complete schedule are satisfied, but whenever there are non-conflicting operations in the schedule, they can occur without indicating which should appear first. This can happen because non-conflicting operations can anyway be executed in any order without affecting the actual outcome.

However, in a practical situation, it is very difficult to come across complete schedules. This is because new transactions keep getting included into the schedule. Hence, often one works with a

committed projection C(S) of a schedule S. This set includes only those operations in S that belong to committed transactions, i.e. transactions Ti whose commit operation ci is in S. Put in simpler terms, since uncommitted operations do not get reflected in the actual outcome of the schedule, only those transactions that have completed their commit operations contribute to the set, and this schedule is good enough in most cases.

3.4 Schedules and Recoverability: Recoverability is the ability to recover from transaction failures. The success or otherwise of recoverability depends on the schedule of transactions. If fairly straightforward operations without much interleaving of transactions are involved, error recovery is a straightforward process. On the other hand, if a lot of interleaving of different transactions has taken place, then recovering from the failure of any one of these transactions could be an involved affair. In certain cases, it may not be possible to recover at all. Thus, it would be desirable to characterize the schedules based on their recovery capabilities.

To do this, we observe certain features of recoverability and also of schedules. To begin with, we note that any recovery process most often involves a rollback operation, wherein the operations of the failed transaction will have to be undone. However, we also note that the rollback needs to go back only as long as the transaction T has not committed. If the transaction T has committed once, it need not be rolled back. The schedules that satisfy this criterion are called recoverable schedules and those that do not are called non-recoverable schedules. As a rule, such non-recoverable schedules should not be permitted.

Formally, a schedule S is recoverable if no transaction T in S commits until all transactions T' that have written an item which is read by T have committed. The concept is a simple one. Suppose the transaction T reads an item X from the database, completes its operations (based on this and other values) and commits; i.e. the output values of T become permanent values of the database. But suppose this value X was written by another transaction T' (before it is read by T), and T' aborts after T has committed. What happens? The values committed by T are no longer valid, because the basis of these values (namely X) itself has been changed. Obviously T also needs to be rolled back (if possible), leading to other rollbacks and so on.

The other aspect to note is that in a recoverable schedule, no committed transaction needs to be rolled back. But it is possible that a cascading rollback scheme may have to be effected, in which an uncommitted transaction has to be rolled back because it read a value written by a transaction which later aborted. Such cascading rollbacks can be very time consuming

because at any instant of time, a large number of uncommitted transactions may be operating. Thus, it is desirable to have cascadeless schedules, which avoid cascading rollbacks.

This can be ensured by making sure that transactions read only those values which are written by committed transactions, i.e. there is no fear of any aborted or failed transactions later on. If the schedule has a sequence wherein a transaction T1 has to read a value X written by an uncommitted transaction T2, then the sequence is altered so that the reading is postponed till T2 either commits or aborts.

This delays T1, but avoids any possibility of cascading rollbacks. The third type of schedule is a strict schedule, which, as the name suggests, is highly restrictive in nature. Here, transactions are allowed neither to read nor write an item X until the last transaction that wrote X has committed or aborted. Note that the strict schedule largely simplifies the recovery process, but in many cases it may not be possible to devise strict schedules.

It may be noted that recoverable schedules, cascadeless schedules and strict schedules are each more stringent than the preceding kind. Greater stringency facilitates the recovery process, but the processing may get delayed, or it may even become impossible to devise such a schedule.

4 Serializability
Given two transactions T1 and T2 to be scheduled, they can be scheduled in a number of ways. The simplest way is to schedule them without bothering to interleave them, i.e. schedule all operations of transaction T1 followed by all operations of T2, or alternatively schedule all operations of T2 followed by all operations of T1.

Non-interleaved (serial) schedule A (time runs downwards):
   T1: read_tr(X); X = X + N; write_tr(X); read_tr(Y); Y = Y + N; write_tr(Y)
   T2: read_tr(X); X = X + P; write_tr(X)

Non-interleaved (serial) schedule B (time runs downwards):
   T2: read_tr(X); X = X + P; write_tr(X)
   T1: read_tr(X); X = X + N; write_tr(X); read_tr(Y); Y = Y + N; write_tr(Y)

These can now be termed serial schedules, since the entire sequence of operations of one transaction is completed before the next transaction is started. In the interleaved mode, the operations of T1 are mixed with the operations of T2. This can be done in a number of ways. Two such sequences are given below:

Interleaved (non-serial) schedule C (time runs downwards):
   T1: read_tr(X); X = X + N
   T2: read_tr(X); X = X + P
   T1: write_tr(X); read_tr(Y)
   T2: write_tr(X)
   T1: Y = Y + N; write_tr(Y)

Interleaved (non-serial) schedule D (time runs downwards):
   T1: read_tr(X); X = X + N; write_tr(X)
   T2: read_tr(X); X = X + P; write_tr(X)
   T1: read_tr(Y); Y = Y + N; write_tr(Y)

Formally, a schedule S is serial if, for every transaction T in the schedule, all operations of T are executed consecutively; otherwise it is called non-serial. In such a non-interleaved schedule, if the transactions are independent, one can also presume that the schedule will be correct, since each transaction commits or aborts before the next transaction begins. As long as the

transactions individually are error-free, such sequences of events are guaranteed to give correct results. The problem with such a situation is the wastage of resources. If, in a serial schedule, one of the transactions is waiting for an I/O operation, the other transactions cannot use the system resources and hence the entire arrangement is wasteful of resources. If some transaction T is very long, the other transactions will have to keep waiting till it is completed. Moreover, in systems where hundreds of users operate concurrently, such serial operation becomes unthinkable. Hence, in general, the serial scheduling concept is unacceptable in practice.

However, once the operations are interleaved to overcome these resource problems, the anomalies cited at the beginning of this block can reappear unless the interleaving sequence is well thought out. Hence, a methodology is to be adopted to find out which of the interleaved schedules give correct results and which do not.

A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions. Note that there are n! different serial schedules possible for n transactions. If one goes about interleaving them, the number of possible combinations becomes unmanageably high. To ease our operations, we form two disjoint groups of non-serial schedules: those non-serial schedules that are equivalent to one or more serial schedules, which we call serializable schedules, and those that are not equivalent to any serial schedule and hence are not serializable. Once a non-serial schedule is shown to be serializable, it becomes equivalent to a serial schedule and, by our previous definition of a serial schedule, will be a correct schedule.

But how can one prove the equivalence of a non-serial schedule to a serial schedule? The simplest and most obvious method is to compare their results. If they produce the same results, then they can be considered equivalent; i.e. if two schedules are result equivalent, then they can be considered equivalent. But such an oversimplification is full of problems. Two sequences may produce the

same set of results for one or even a large number of initial values, but still may not be equivalent. Consider the following two sequences:

   S1: read_tr(X); X = X + X; write_tr(X)
   S2: read_tr(X); X = X * X; write_tr(X)

For a value X = 2, both produce the same result. Can we conclude that they are equivalent? Though this may look like a simplistic example, with some imagination one can always come up with more sophisticated examples wherein the bugs of treating them as equivalent are less obvious. But the concept still holds: result equivalence cannot mean schedule equivalence.

A more refined method of finding equivalence is available. It is called conflict equivalence. Two schedules are said to be conflict equivalent if the order of any two conflicting operations is the same in both the schedules. (Note that the conflicting operations essentially belong to two different transactions, access the same data item, and at least one of them is a write_tr(X) operation.) If two such conflicting operations appear in different orders in different schedules, then it is obvious that they produce two different databases in the end and hence they are not equivalent.

4.1 Testing for conflict serializability of a schedule: The following algorithm tests a schedule for conflict serializability.
1. For each transaction Ti participating in the schedule S, create a node labeled Ti in the precedence graph.
2. For each case where Tj executes a read_tr(X) after Ti executes a write_tr(X), create an edge from Ti to Tj in the precedence graph.
3. For each case where Tj executes a write_tr(X) after Ti executes a read_tr(X), create an edge from Ti to Tj in the graph.
4. For each case where Tj executes a write_tr(X) after Ti executes a write_tr(X), create an edge from Ti to Tj in the graph.
5. The schedule S is serializable if and only if there are no cycles in the graph.

If we apply these methods to write the precedence graphs for the four cases of section 4, we get the following precedence graphs.

[Precedence graphs for schedules A, B, C and D: schedules A, B and D each contain a single edge between T1 and T2 (labeled X) and hence no cycle, while schedule C contains edges in both directions between T1 and T2, forming a cycle, and is therefore not serializable.]

Since the precedence graph of schedule D contains no cycle, schedule D is serializable; in fact, we may conclude that schedule D is equivalent to the serial schedule A.

4.2 View equivalence and view serializability: Apart from conflict equivalence of schedules and conflict serializability, another definition of equivalence has been used with reasonable success in the context of serializability. This is called view serializability. Two schedules S and S' are said to be view equivalent if the following conditions are satisfied.

i) The same set of transactions participates in S and S', and S and S' include the same operations of those transactions.

ii) For any operation ri(X) of Ti in S, if the value of X read by the operation has been written by an operation wj(X) of Tj (or if it is the original value of X before the schedule started), the same condition must hold for the value of X read by the operation ri(X) of Ti in S'.

iii) If the operation wk(Y) of Tk is the last operation to write the item Y in S, then wk(Y) of Tk must also be the last operation to write the item Y in S'.

The idea behind view equivalence is that, as long as each read operation of a transaction reads the result of the same write operation in both the schedules, the write operations of each transaction must produce the same results. Hence the read operations are said to see the same view in both the schedules. It can easily be verified that when S and S' operate independently on a database with the same initial state, they produce the same end states. A schedule S is said to be view serializable if it is view equivalent to a serial schedule.


It can also be verified that the definitions of conflict serializability and view serializability are similar if a condition called the constrained write assumption holds on all transactions of the schedule. This condition states that any write operation wi(X) in Ti is preceded by an ri(X) in Ti and that the value written by wi(X) in Ti depends only on the value of X read by ri(X). This assumes that the computation of the new value of X is a function f(X) based on the old value of X read from the database. However, the definition of view serializability is less restrictive than that of conflict serializability under the unconstrained write assumption, where the value written by the operation wi(X) in Ti can be independent of its old value in the database. This is called a blind write.

But the main problem with view serializability is that it is extremely complex computationally and there is no efficient algorithm to do the same.

4.3 Uses of serializability: If one proves that a schedule S is serializable, it is equivalent to saying that S is correct. Hence, it guarantees that the schedule provides correct results. But being serializable is not the same as being serial. A serial schedule is inefficient because of the reasons explained earlier, which leads to under-utilization of the CPU and I/O devices, and in some cases, like a mass reservation system, becomes untenable. On the other hand, a serializable schedule combines the benefits of concurrent execution (efficient system utilization, ability to cater to a larger number of concurrent users) with the guarantee of correctness.

But all is not well yet. The scheduling process is done by the operating system routines after taking into account various factors like system load, time of transaction submission, priority of the process with reference to other processes and a large number of other factors. Also, since a very large number of interleaving combinations are possible, it is extremely difficult to determine beforehand the manner in which the transactions are interleaved. In other words, obtaining the various schedules itself is difficult, let alone

testing them for serializability.

Hence, instead of generating the schedules, checking them for serializability and then using them, most DBMS protocols use a more practical method: they impose restrictions on the transactions themselves. These restrictions, when followed by every participating transaction, automatically ensure serializability of all schedules that are created by these participating transactions.

Also, since transactions are being submitted at different times, it is difficult to determine when a schedule begins and when it ends. Hence serializability theory deals with this problem by considering only the committed projection C(S) of the schedule. Hence, as an approximation, we can define a schedule S as serializable if its committed projection C(S) is equivalent to some serial schedule.

5. The need for concurrency control


Let us imagine a situation wherein a large number of users (probably spread over vast geographical areas) are operating on a concurrent system. Several problems can occur if they are allowed to execute their transaction operations in an uncontrolled manner. Consider a simple example of a railway reservation system. Since a number of people are accessing the database simultaneously, it is obvious that multiple copies of the data items are to be provided so that each user can go ahead with his operations. Let us make the concept a

little more specific. Suppose we are considering the number of reservations in a particular train on a particular date. Two persons at two different places are trying to reserve for this train. By the very definition of concurrency, each of them should be able to perform the operations irrespective of the fact that the other person is also doing the same. In fact, they will not even know that the other person is also booking for the same train. The only way of ensuring this is to make available to each of these users their own copy to operate upon and finally update the master database at the end of their operation.

Now suppose there are 10 seats available. Both the persons, say A and B, want to get this information and book their seats. Since they are to be accommodated concurrently, the system provides them two copies of the data. The simple way is to perform a read_tr(X) so that the value of X is copied onto the variable X of person A (let us call it XA) and of person B (XB). So each of them knows that there are 10 seats available. Suppose A wants to book 8 seats. Since the number of seats he wants (say Y) is less than the available seats, the program can allot him the seats, change the number of available seats (X) to X - Y and can even give him the seat numbers that have been booked for him. The problem is that a similar operation can be performed by B also. Suppose he needs 7 seats. So, he gets his seven seats, replaces the value of X by 3 (10 - 7) and gets his reservation. The problem is noticed only when these blocks are returned to the main database (the disk, in the above case). Before we can analyze these problems, let us look at the problem from a more technical view.

5.1 The lost update problem: This problem occurs when two transactions that access the same database items have their operations interleaved in such a way as to make the value of some database item incorrect. Suppose the transactions T1 and T2 are submitted at (approximately) the same time. Because of the concept of interleaving, each operation is executed for some period of time and then the control is passed on to the other transaction, and this sequence continues. Because of the delay in updating, this creates a problem. This is what happened in the previous example. Let the transactions be called TA and TB.

Time runs downwards:
   TA: read_tr(X)
   TB: read_tr(X)
   TA: X = X - NA
   TB: X = X - NB
   TB: write_tr(X)
   TA: write_tr(X)

Note that the problem occurred because TB read X before TA had recorded its update, so TB's computation did not take TA's change into account; and since TA did its writing later on, its write_tr(X) overwrote the value written by TB, so the update of TB is lost.
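In Oracle SQL, one common way of avoiding such a lost update is to lock the row while reading it, which anticipates the locking techniques of section 6. A minimal sketch, assuming a hypothetical TRAIN_STATUS table with TRAIN_NO and AVAILABLE_SEATS columns:

   DECLARE
      v_seats NUMBER;
   BEGIN
      SELECT AVAILABLE_SEATS INTO v_seats
      FROM   TRAIN_STATUS
      WHERE  TRAIN_NO = 1234
      FOR UPDATE;                  -- exclusive row lock: a concurrent booking must wait here

      UPDATE TRAIN_STATUS
      SET    AVAILABLE_SEATS = v_seats - 8
      WHERE  TRAIN_NO = 1234;

      COMMIT;                      -- releases the lock; the waiting booking now reads 2 seats
   END;
   /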

5.2 Dirty read problem

This happens when a transaction TA updates a data item, but later on (for some reason) the transaction fails. It could be due to a system failure or any other operational reason or the system may have later on noticed that the operation should not have been done and cancels it. To be fair, it also ensures that the original value is restored. But in the meanwhile, another transaction TB has accessed the data and since it has no indication as to what happened later on, it makes use of this data and goes ahead. Once the original value is restored by TA, the values generated by TB are obviously invalid.

Time runs downwards:
   TA: read_tr(X)
   TA: X = X - N
   TA: write_tr(X)
   TB: read_tr(X)
   TB: X = X - N
   TB: write_tr(X)
   TA: failure
   TA: X = X + N
   TA: write_tr(X)

The value written by TA and then withdrawn because of the failed transaction is dirty data; TB, which read this dirty value, produces an invalid result. Hence the problem is called the dirty read problem.

5.3 The Incorrect Summary Problem: Consider two concurrent transactions, again called TA and TB. TB is calculating a summary (average, standard deviation or some such operation) by accessing all elements of a database. (Note that it is not updating any of them; it only reads them and uses the resultant data to calculate some values.) In the meanwhile, TA is updating these values. Since the operations are interleaved, TB will use the not-yet-updated data for some of its operations and the updated data for the others. This is called the incorrect summary problem.

Time runs downwards:
   TB: Sum = 0
   TB: read_tr(A); Sum = Sum + A
   TA: read_tr(X); X = X - N; write_tr(X)
   TB: read_tr(X); Sum = Sum + X
   TB: read_tr(Y); Sum = Sum + Y
   TA: read_tr(Y); Y = Y - N; write_tr(Y)

In the above example, TA updates both X and Y. But since it first updates X and then Y, and the operations are so interleaved that transaction TB uses both of them in between the operations, TB ends up using the new value of X with the old value of Y. In the process, the sum obtained does not correspond either to the old set of values or to the new set of values.

6 Locking techniques for concurrency control

Many of the important techniques for concurrency control make use of the concept of the lock. A lock is a variable associated with a data item that describes the status of the item with respect to the possible operations that can be done on it. Normally every data item is associated with a unique lock. They are used as a method of synchronizing the access of database items by the transactions that are operating concurrently. Such controls, when implemented properly can overcome many of the problems of concurrent operations listed earlier. However, the locks themselves may create a few problems, which we shall be seeing in some detail in subsequent sections.

6.1 Types of locks and their uses:


6.1.1 Binary locks: A binary lock can have two states or values (1 or 0); one of them indicates that the item is locked and the other that it is unlocked. For example, if we presume that 1 indicates the lock is on and 0 indicates it is open, then if the lock on item X is 1, a read_tr(X) cannot access the item as long as the lock's value continues to be 1. We can refer to such a state as lock(X). The concept works like this: the item X can be accessed only when it is free to be used by the transactions. If, say, its current value is being modified, then X cannot (in fact should not) be accessed till the modification is complete. The simple mechanism is to lock access to X as long as the process of modification is on and unlock it for use by the other transactions only when the modifications are complete. So we need two operations: lock_item(X), which locks the item, and unlock_item(X), which opens the lock. Any transaction that wants to make use of the data item first checks the lock status of X with lock_item(X). If the item X is already locked (lock status = 1), the transaction will have to wait. Once the status becomes 0, the transaction accesses the item and locks it (makes its status 1). When the transaction has completed using the item, it issues an unlock_item(X) command, which again sets the status to 0, so that other transactions can access the item.

6.1.2 Shared and Exclusive locks While the operation of the binary lock scheme appears satisfactory, it suffers from a serious drawback. Once a transaction holds a lock (has issued a lock operation), no other transaction can access the data item. But in large concurrent systems, this can

become a disadvantage. While it is obvious that more than one transaction should not go on writing into X, and that while one transaction is writing into it no other transaction should be reading it, no harm is done if several transactions are allowed to read the item simultaneously. This would save the time of all these transactions without in any way affecting the performance. This concept gave rise to the idea of shared/exclusive locks. When only read operations are being performed, the data item can be shared by several transactions; it is only when a transaction wants to write into the item that the lock must be exclusive. Hence the shared/exclusive lock is also sometimes called a multiple-mode lock. A read lock is a shared lock (which can be used by several transactions), whereas a write lock is an exclusive lock. So we need to think of three operations: a read lock, a write lock and unlock. The algorithms can be as follows.

Read_lock(X):
   start: if LOCK(X) = unlocked
             then begin
                LOCK(X) := read-locked;
                no_of_reads(X) := 1
             end
          else if LOCK(X) = read-locked
             then no_of_reads(X) := no_of_reads(X) + 1
          else begin
             wait (until LOCK(X) = unlocked and the lock manager wakes up the transaction);
             go to start
          end;

The read lock operation.

Write_lock(X):
   start: if LOCK(X) = unlocked
             then LOCK(X) := write-locked
          else begin
             wait (until LOCK(X) = unlocked and the lock manager wakes up the transaction);
             go to start
          end;

The write lock operation.

Unlock(X):
   if LOCK(X) = write-locked
      then begin
         LOCK(X) := unlocked;
         wake up one of the waiting transactions, if any
      end
   else if LOCK(X) = read-locked
      then begin
         no_of_reads(X) := no_of_reads(X) - 1;
         if no_of_reads(X) = 0
            then begin
               LOCK(X) := unlocked;
               wake up one of the waiting transactions, if any
            end
      end;

The unlock operation: The algorithms are fairly straightforward, except that during the unlocking operation, if a number of read locks are held, then all of them are to be released before the item itself becomes unlocked. To ensure smooth operation of the shared/exclusive locking system, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read or write operations are performed on X.
2. A transaction T must issue the operation write_lock(X) before any write_tr(X) operation is performed on it.

3. A transaction T must issue the operation unlock(X) after all read_tr(X) and write_tr(X) operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read lock or write lock on X.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read lock or write lock on X.

6.1.3 Two phase locking: A transaction is said to follow two phase locking if its operations can be divided into two distinct phases. In the first phase, all items that are needed by the transaction are acquired by locking them; in this phase, no item is unlocked even if its operations are over. In the second phase, the items are unlocked one after the other. The first phase can be thought of as a growing phase, wherein the store of locks held by the transaction keeps growing. The second phase is called the shrinking phase, wherein the number of locks held by the transaction keeps shrinking.

Example of a two phase locking transaction:

   Phase I (growing):
      read_lock(Y)
      read_tr(Y)
      write_lock(X)
   Phase II (shrinking):
      unlock(Y)
      read_tr(X)
      X = X + Y
      write_tr(X)
      unlock(X)

Two phase locking, though it provides serializability, has a disadvantage. Since the locks are not released immediately after the use of the item is over, but are retained till all the other needed locks are also acquired, the desired amount of interleaving may not be achieved. Worse, while a transaction T may be holding an item X, even though it is not using it, just to satisfy the two phase locking protocol, another transaction T' may genuinely need the item but will be unable to get it till T releases it. This is the price

that is to be paid for the guaranteed serializability provided by the two phase locking system.
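The shared and exclusive lock modes described in section 6.1.2 are exposed directly in Oracle SQL. A minimal sketch, using the EMP table:

   -- Several transactions may hold the shared (read) lock on EMP at the same time:
   LOCK TABLE EMP IN SHARE MODE;

   -- Only one transaction at a time can hold the exclusive (write) lock:
   LOCK TABLE EMP IN EXCLUSIVE MODE;

   -- Row-level exclusive locks can be taken on just the rows being read for update:
   SELECT SAL FROM EMP WHERE EMPNO = 7369 FOR UPDATE;

In each case the locks are held until the transaction commits or rolls back, which in practice gives behaviour close to the two phase locking discipline described above.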

6.2 Deadlock and Starvation:


A deadlock is a situation wherein each transaction in a set of two or more transactions is waiting for some item that is locked by some other transaction in the set. Taking the case of only two transactions T1' and T2': T1' is waiting for an item X which is held by T2', while it is itself holding another item Y. T1' will release Y only when X becomes available from T2' and T1' can complete some operations. Meanwhile, T2' is waiting for Y held by T1', and T2' will release X only when Y, held by T1', is released and T2' has performed some operations on it. It can easily be seen that this is an infinite wait and the deadlock will never get resolved.

A partial schedule leading to deadlock (time runs downwards):
   T1': read_lock(Y); read_tr(Y)
   T2': read_lock(X); read_tr(X)
   T1': write_lock(X)   (waits for T2' to release X)
   T2': write_lock(Y)   (waits for T1' to release Y)

[The status graph: T1' and T2' wait for each other, forming a cycle.]

While in the case of only two transactions it is rather easy to notice the possibility of deadlock (though preventing it may be difficult), the case becomes more complicated when more than two transactions are in a deadlock, and even identifying the deadlock may be difficult.

6.2.1 Deadlock prevention protocols: The simplest way of preventing deadlock is to look at the problem in detail. Deadlock occurs basically because a transaction has locked several items, but could not get one more item

and is not releasing the other items held by it. The solution is to develop a protocol wherein a transaction locks all the items it needs in advance; if it cannot get any one or more of the items, it does not hold the other items either, so that these items can be useful to any other transaction that may be needing them. This method, though it prevents deadlocks, further limits the prospects of concurrency.

A better way to deal with deadlocks is to identify the deadlock when it occurs and then take some decision. The transaction involved in the deadlock may be blocked or aborted or the transaction can preempt and abort the other transaction involved. In a typical case, the concept of transaction time stamp TS (T) is used. Based on when the transaction was started, (given by the time stamp, larger the value of TS, younger is the transaction), two methods of deadlock recovery are devised.

1. Wait-die method: Suppose a transaction Ti tries to lock an item X, but is unable to do so because X is locked by Tj with a conflicting lock. Then if TS(Ti) < TS(Tj) (Ti is older than Tj), Ti is allowed to wait. Otherwise (if Ti is younger than Tj), Ti is aborted and restarted later with the same time stamp. The policy is that the older of the transactions will have already spent sufficient effort and hence should not be aborted.

2. Wound-wait method: If TS(Ti) < TS(Tj) (Ti is older than Tj), abort Tj and restart it later with the same time stamp. On the other hand, if Ti is younger than Tj, then Ti is allowed to wait.

It may be noted that in both cases the younger transaction gets aborted, but the actual method of deciding which transaction to abort is different. Both these methods can be proved to be deadlock free, because no cycles of waiting, as seen earlier, are possible with these arrangements.

There is another class of protocols that do not require any time stamps. They include the no-waiting algorithm and the cautious waiting algorithm. In the no-waiting algorithm, if a transaction cannot get a lock, it is aborted immediately (no waiting) and is restarted again at a later time. But since there is no guarantee that the new situation is deadlock free, it may have to be aborted again. This may lead to a situation where a transaction ends up getting aborted repeatedly.

To overcome this problem, the cautious waiting algorithm was proposed. Here, suppose the transaction Ti tries to lock an item X, but cannot get X since X is already locked by another transaction Tj. The solution is as follows: if Tj is not blocked (not waiting for some other locked item), then Ti is blocked and allowed to wait; otherwise Ti is aborted. This method not only reduces repeated aborting, but can also be proved to be deadlock free, since out of Ti and Tj only one is blocked, after ensuring that the other is not blocked.

6.2.2 Deadlock detection and timeouts: The second method of dealing with deadlocks is to detect deadlocks as and when they happen. The basic problem with the earlier suggested protocols is that they assume that we know what is happening in the system: which transaction is waiting for which item, and so on. But in a typical case of concurrent operations, the situation is fairly complex and it may not be possible to predict the behavior of transactions. In such cases, the easier method is to take on deadlocks as and when they happen and try to resolve them. A simple way to detect a deadlock is to maintain a wait-for graph. One node in the graph is created for each executing transaction. Whenever a transaction Ti is waiting to lock an item X which is currently held by Tj, an edge (Ti -> Tj) is created in the graph. When Tj

releases X, this edge is dropped. It is easy to see that whenever there is a deadlock situation, there will be loops formed in the wait-for graph, so that suitable corrective action can be taken. Again, once a deadlock has been detected, the transaction to be aborted is to be chosen. This is called the victim selection and generally newer transactions are selected for victimization. Another easy method of dealing with deadlocks is the use of timeouts. Whenever a transaction is made to wait for periods longer than a predefined period, the system assumes that a deadlock has occurred and aborts the transaction. This method is simple & with low overheads, but may end up removing the transaction, even when there is no deadlock.

6.3 Starvation:
The other side effect of locking is starvation, which happens when a transaction cannot proceed for indefinitely long periods even though the other transactions in the system continue normally. This may happen if the waiting scheme for locked items is unfair, i.e. if some transactions may never be able to get the items, since one or the other of the high priority

transactions may continuously be using them. Then the low priority transaction will be forced to starve for want of resources. The solution to starvation problems lies in choosing proper priority algorithms like first-come-first-served. If this is not possible, then the priority of a transaction may be increased every time it is made to wait or is aborted, so that eventually it becomes a high priority transaction and gets the required services.

6.4 Concurrency control based on Time Stamp ordering


6.4.1 The concept of time stamps: A time stamp is a unique identifier created by the DBMS and attached to each transaction, which indicates a value that is a measure of when the transaction came into the system. Roughly, a time stamp can be thought of as the starting time of the transaction, denoted by TS(T). Time stamps are generated by a counter that is initially zero and is incremented each time its value is assigned to a transaction. The counter is also given a maximum value, and if the reading goes beyond that value the counter is reset to zero, indicating, most often, that the transaction has lived its life span inside the system and needs to be taken out. A better way of creating such time stamps is to make use of the system time/date facility or even the internal clock of the system.

6.4.2 An algorithm for ordering the time stamp: The basic concept is to order the transactions based on their time stamps. A schedule made of such transactions is then serializable. This concept is called the time stamp ordering (To). The algorithm should ensure that whenever a data item is accessed by conflicting operations in the schedule, the data is available to them in the serializability order. To achieve this, the algorithm uses two time stamp values. 1. Read_Ts (X): This indicates the largest time stamp among the transactions that have successfully read the item X. Note that the largest time stamp actually refers to the youngest of the transactions in the set (that has read X). 2. Write_Ts(X): This indicates the largest time stamp among all the transactions that have successfully written the item-X. Note that the largest time stamp actually refers to the youngest transaction that has written X.

The above two values are often referred to as read time stamp and write time stamp of the item X.

6.4.3 The concept of basic time stamp ordering: Whenever a transaction T tries to read or write an item X, the algorithm compares the time stamp of T with the read time stamp or the write time stamp of the item X, as the case may be. This is done to ensure that T does not violate the order of time stamps. The violation can come about in the following ways.

1. Transaction T is trying to write X:
a) If read_TS(X) > TS(T) or write_TS(X) > TS(T), then abort and roll back T and reject the operation. In plain words, if a transaction younger than T has already read or written X, the time stamp ordering is violated and hence T is to be aborted, and all the values written by T so far need to be rolled back, which may also involve cascaded rolling back.
b) Otherwise, i.e. if read_TS(X) <= TS(T) and write_TS(X) <= TS(T), execute the write_tr(X) operation and set write_TS(X) to TS(T); that is, allow the operation and set the write time stamp of X to that of T, since T is now the latest transaction to have accessed X.

2. Transaction T is trying to read X:
a) If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This is because a younger transaction has already written into X.
b) If write_TS(X) <= TS(T), execute read_tr(X) and set read_TS(X) to the larger of the two values, namely TS(T) and the current read_TS(X).

This algorithm ensures proper ordering and also avoids deadlocks, by penalizing the older transaction when it tries to overhaul an operation done by a younger transaction. Of course, the aborted transaction will be reintroduced later with a new time stamp. However, in the absence of any other monitoring protocol, the algorithm may cause starvation of some transactions.

6.4.4 Strict time stamp ordering: This variation of the time stamp ordering algorithm ensures that the schedules are strict (so that recoverability is enhanced) as well as serializable. In this case, any transaction T that tries to read or write an item X such that write_TS(X) < TS(T) is made to wait until the transaction T' that

originally wrote into X (hence whose time stamp matches the write time stamp of X, i.e. TS(T') = write_TS(X)) has committed or aborted. This algorithm also does not cause any deadlock, since T waits for T' only if TS(T) > TS(T').

6.5 Multi version concurrency control techniques


The main reason why some transactions have to be aborted is that they try to access data items that have been updated (by transactions that are younger than they are). One way of overcoming this problem is to maintain several versions of the data items, so that if a transaction tries to access an updated data item, instead of aborting it, it may be allowed to work on an older version of the data. This concept is called the multiversion method of concurrency control. Whenever a transaction writes a data item, the new value of the item is made available, as is the older version. Normally the transactions are given access to the newer version, but in case of conflicts the policy is to allow the older transaction to have access to the older version of the item.

The obvious drawback of this technique is that more storage is required to maintain the different versions. But in many cases this may not be a major drawback, since most database applications continue to retain the older versions anyway, for the purposes of recovery or for historical purposes.

6.5.1 Multiversion technique based on timestamp ordering
In this method, several versions of the data item X, which we call X1, X2, ..., Xk, are maintained. For each version Xi two timestamps are appended:
i) read_TS(Xi): the read timestamp of Xi indicates the largest of all time stamps of transactions that have read Xi (this, in plain language, means the youngest of the transactions which have read it).
ii) write_TS(Xi): the write timestamp of Xi indicates the timestamp of the

transaction that wrote Xi.

Whenever a transaction T writes into X, a new version Xk+1 is created, with both write_TS(Xk+1) and read_TS(Xk+1) being set to TS(T). Whenever a transaction T reads X, the value of read_TS(Xi) is set to the larger of the two values, namely the current read_TS(Xi) and TS(T). To ensure serializability, the following rules are adopted.

i) If T issues a write_tr(X) operation, and Xi is the version of X having the highest write_TS(Xi) that is less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back T; otherwise create a new version Xk of X with read_TS(Xk) = write_TS(Xk) = TS(T).

In plain words, if the version with the highest write timestamp not exceeding that of T has been read by a transaction younger than T, then we have no option but to abort T and roll back all its effects; otherwise a new version of X is created with its read and write timestamps initialized to that of T.

ii) If a transaction T issues a read_tr(X) operation, find the version Xi with the highest write_TS(Xi) that is also less than or equal to TS(T); then return the value of Xi to T and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).

This only means: find the latest version of X that T is eligible to read and return its value to T. Since T has now read the value, find out whether it is the youngest transaction to read X by comparing its timestamp with the current read_TS of that version. If T is younger (its timestamp is higher), store its timestamp as the read timestamp of the version; otherwise retain the earlier value.

6.5.2 Multiversion two phase locking with certify locks: Note that the motivation behind two phase locking systems has been discussed previously. In the standard locking mechanism, a write lock is an exclusive lock, i.e. only one transaction can use a write-locked data item. However, no harm is done if the item write-locked by a transaction is read by one or more other transactions; on the contrary, it enhances the interleavability of operations, that is, more transactions can be

interleaved. This concept is extended to the multiversion locking system by using what are known as multiple-mode locking schemes. In this, there are three locking modes for an item: read, write and certify, i.e. a unit can be locked for read(X), write(X) or certify(X), or it can remain unlocked. To see how the scheme works, we first see how the normal read/write system works by means of a lock compatibility table.

Lock compatibility table:
              Read    Write
   Read       Yes     No
   Write      No      No

The explanation is as follows. If there is an entry 'Yes' in a particular cell, then if a transaction T holds the type of lock specified in the column header and another transaction T' requests the type of lock specified in the row header, T' can obtain the lock, because the lock modes are compatible. For example, there is a 'Yes' in the first cell; its column header is Read, so if a transaction T holds a read lock and another transaction T' requests a read lock, it can be granted. On the other hand, if T holds a write lock and another transaction T' requests a read lock, it will not be granted, because the action has now shifted to the first row, second column element.

In the modified (multimode) locking system, the concept is extended by adding one more row and column to the table:

              Read    Write   Certify
   Read       Yes     Yes     No
   Write      Yes     No      No
   Certify    No      No      No

The multimode locking system works on the following lines. When one transaction has obtained a write lock on a data item, other transactions may still be given read locks on that item. To ensure this, two versions of X are maintained. X(old) is the version that was written and committed by a previous transaction. When a transaction T wants a write lock, a new version X(new) is created and handed over to T for writing. While T continues to hold the lock on X(new), other transactions can continue to use X(old) under read locks. Once T is ready to commit, it must obtain exclusive certify locks on all the items it wants to commit by writing. Note that the write lock is no longer an exclusive lock under this scheme, since while one transaction holds a write lock on X, one or more other transactions may be holding read locks on the same X. To grant the certify lock, the system waits until all other read locks on the item are released. Note that this process has to be repeated for every item that T wants to commit.

Once all these items are under the certify locks of the transaction, it can commit its values. From then on, X(new) becomes X(old), and a fresh X(new) is created only when another transaction wants a write lock on X. This scheme avoids cascading rollbacks. But since a transaction has to obtain exclusive certify locks on all its items before it can commit, a delay in the commit operation is inevitable. This may also lead to complications such as deadlocks and starvation.
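The commit step described above can be sketched roughly as follows. This is a highly simplified, single-threaded illustration (a real lock manager would block and queue requests); the names Item and commit_with_certify are assumptions made for the sketch.

class Item:
    def __init__(self, value):
        self.old = value        # committed version, visible to readers
        self.new = None         # uncommitted version held by the writer
        self.read_locks = set() # transactions currently reading X(old)
        self.write_lock = None  # at most one writer at a time

def commit_with_certify(item, txn_id):
    if item.write_lock != txn_id or item.new is None:
        return False            # nothing to certify for this transaction
    # The certify lock can be granted only when no other transaction
    # holds a read lock on the item (see the compatibility table above).
    if item.read_locks:
        return False            # in a real system the writer would wait
    # Certify succeeded: the new version becomes the committed version.
    item.old, item.new = item.new, None
    item.write_lock = None
    return True

x = Item(100)
x.write_lock = "T1"; x.new = 150        # T1 writes a new version
x.read_locks.add("T2")                  # T2 still reads X(old) = 100
print(commit_with_certify(x, "T1"))     # False: must wait for T2
x.read_locks.discard("T2")
print(commit_with_certify(x, "T1"))     # True: 150 is now the committed value
print(x.old)                            # 150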

Chapter: 6
TRANSACTION MANAGEMENT & CONCURRENCY CONTROL TECHNIQUES
End Chapter quizzes

Q1. The sequence of operations on the database is called
a) Schedule b) Database Recovery c) Locking d) View

Q2. Two operations in a schedule are said to be in conflict if they satisfy the conditions
a) The operations belong to different transactions
b) They access the same item X
c) At least one of the operations is a write operation
d) All of the above

Q3. If, for every transaction T in the schedule S, all operations of T are executed consecutively, then schedule S is called a
a) Serial schedule b) Non-serial schedule c) Time stamping d) None of the above

Q4. Concurrency control is needed to
a) Manage transactions from a large number of users
b) Maintain consistency of the database
c) Both a and b
d) None of the above

Q5. A time stamp is a unique identifier created by the DBMS, attached to each
a) Data Item b) Transaction c) Schedule d) All of the above

Q6. A read lock is also called a
a) Shared Lock b) Binary Lock c) Write Lock d) Dead Lock

Q7. A write lock is also called an
a) Two Phase Lock b) Exclusive Lock c) Binary Lock d) None of the above

Q8. The ability to recover from failures of transactions is called
a) Recoverability b) Back up c) Database Detection d) Both a and b

Q9. A lock that can have only two states or values (1 or 0) is known as a
a) Binary Lock b) 2 Phase Lock c) Both a and b d) Read Lock

Q10. The property of a transaction which ensures that the transaction is either fully completed or not begun at all is
a) Consistency b) Atomic c) Durability d) Isolation

Chapter: 7
DATABASE RECOVERY, BACKUP & SECURITY

1. Introductory Concept of Database Failures and Recovery


Database operations are not immune to failures of the system on which they operate (both the hardware and the software, including the operating system). The system should ensure that any transaction submitted to it is terminated in one of the following ways.

a) All the operations listed in the transaction are completed, the changes are recorded permanently in the database, and it is indicated that the transaction is complete.

b) In case the transaction has failed to achieve its desired objective, the system should ensure that no change whatsoever is reflected onto the database. Any intermediate changes made to the database are restored to their original values before calling off the transaction and intimating the same to the user.

In the second case, we say the system should be able to recover from the failure.

1.1 Database failures

Database failures can occur in a variety of ways.

i) A system crash: a hardware, software or network error can make the completion of the transaction impossible.

ii) A transaction or system error: the transaction submitted may be faulty, for example creating a situation of division by zero or producing negative numbers which cannot be handled (in a reservation system, a negative number of seats conveys no meaning). In such cases, the system simply discontinues the transaction and reports an error.

iii) User interruption: some programs allow the user to interrupt execution. If the user changes his mind during execution (but before the transaction is complete), he may opt out of the operation.

iv) Local exceptions: certain conditions during operation may force the system to raise what are known as exceptions. For example, a bank account holder may not have sufficient balance for some transaction, or special instructions might have been given in a bank transaction that prevent further continuation of the process. In all such cases, the transactions are terminated.

v) Concurrency control enforcement: in certain cases, when concurrency constraints are violated, the enforcement mechanism simply aborts the transaction, to be restarted later.

The other reasons can be physical problems like theft, fire etc. or system problems like disk failure, viruses etc. In all such cases of failure, a recovery mechanism has to be in place.

1.2 Database Recovery

Recovery most often means bringing the database back to the most recent consistent state after a transaction failure. This obviously demands that status information about the previous consistent states is made available in the form of a log (which has been discussed in one of the previous sections in some detail). A typical algorithm for recovery should proceed on the following lines.

1. If the database has been physically damaged or there are catastrophic crashes like a disk crash etc., the database has to be recovered from the archives.

In many cases, a reconstruction process has to be adopted, using various other sources of information.

2. In situations where the database is not damaged but has lost consistency because of transaction failures etc., the method is to retrace the steps from the state of the crash (which has created the inconsistency) until the previously encountered state of consistency is reached. The method normally involves undoing certain operations and restoring previous values using the log. In general, two broad categories of these retracing operations can be identified. As we have seen previously, most often the transactions do not update the database as and when they complete their operations. So, if a transaction fails or the system crashes before the commit operation, those values need not be retraced, and no undo operation is needed. However, if one is still interested in getting the results out of such transactions, then a redo operation will have to be taken up. Hence, this type of retracing is often called the NO-UNDO/REDO algorithm.

The whole concept works only when the system is working in a deferred update mode. However, this may not always be the case. In certain situations, where the system is working in the immediate update mode, the transactions keep updating the database without waiting for the commit operation. In such cases, the updates will normally be made on the disk as well. Hence, if the system fails while these immediate updates are being made, it becomes necessary to undo the operations using the disk entries. This will help us reach the previous consistent state. From there onwards, the transactions will have to be redone. Hence, this method of recovery is often termed the UNDO/REDO algorithm.

2. Role of check points in recovery

A checkpoint, as the name suggests, indicates that everything is fine up to that point. In a log, when a checkpoint is encountered, it indicates that all values up to that point have been written back to the database on the disk. Any further crash or system failure will have to take care of the data appearing beyond this point only. Put the other way, all transactions that have their commit entries in the log before this point need no rolling back.

The recovery manager of the DBMS will decide at what intervals checkpoints need to be inserted (and, in turn, at what intervals data is to be written back to the disk). It can be either after specific periods of time (say M minutes) or after a specific number of transactions (t transactions) etc. When the protocol decides to take a checkpoint, it does the following (a sketch of this sequence follows the list below):

a) Suspend all transaction executions temporarily.
b) Force write all memory buffers to the disk.
c) Insert a checkpoint record in the log and force write the log to the disk.
d) Resume the execution of transactions.
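A minimal sketch of steps a) to d), assuming a toy DBMS object with an in-memory buffer pool and an append-only log, is given below; all class and function names here are illustrative, not part of any real system.

def write_page_to_disk(page_id, contents):
    pass    # stand-in for the actual disk write

def append_log_to_disk(records):
    pass    # stand-in for force-writing the log tail

class SimpleDBMS:
    def __init__(self):
        self.transactions_suspended = False
        self.dirty_pages = {}          # page id -> updated contents
        self.log = []                  # in-memory tail of the log

    def force_write_buffers(self):
        for page_id, contents in self.dirty_pages.items():
            write_page_to_disk(page_id, contents)
        self.dirty_pages.clear()

    def force_write_log(self):
        append_log_to_disk(self.log)
        self.log.clear()

    def checkpoint(self):
        self.transactions_suspended = True    # a) suspend transactions
        self.force_write_buffers()            # b) flush the memory buffers
        self.log.append(("checkpoint",))      # c) checkpoint record ...
        self.force_write_log()                #    ... and force the log
        self.transactions_suspended = False   # d) resume transactions

dbms = SimpleDBMS()
dbms.dirty_pages[4] = "updated contents of page 4"
dbms.checkpoint()
print(dbms.dirty_pages, dbms.transactions_suspended)   # {} False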

The force writing need not refer only to the modified data items; it can include the various lists and other auxiliary information indicated previously. However, the force writing of all the data pages may take some time and it would be wasteful to halt all transactions until then. A better way is to make use of fuzzy checkpointing.

In fuzzy checkpointing, the checkpoint record is inserted and, while the buffers are being written back (beginning from the previous checkpoint), the transactions are allowed to restart. This way the I/O time is saved. Until all data up to the new checkpoint is written back, the previous checkpoint is held valid for recovery purposes.

3. Write ahead logging

When updating is being done, it is necessary to maintain a log for recovery purposes. Normally, before the updated value is written on to the disk, the earlier value (called the before image, BFIM) is noted down elsewhere on the disk for recovery purposes. This process of recording entries is called write-ahead logging (the log is written ahead of the actual update). It is to be noted that the type of logging also depends on the type of recovery. If a NO-UNDO/REDO type of recovery is being used, then only those new values which could not be written back before the crash need to be logged. But in the UNDO/REDO type, the before-image values as well as the new values that were computed but could not be written back need to be logged.

Two other update mechanisms need brief mention. The cache pages updated by a transaction cannot be written back to the disk by the DBMS manager until and unless the transaction commits. If the system strictly follows this approach, it is called a no-steal approach. However, in some cases the protocol allows the writing of an updated buffer back to the disk even before the transaction commits; this may be done, for example, when some other transaction is in need of the results. This is called the steal approach. Secondly, if all pages updated by a transaction are written back to the disk as soon as it commits, then it is a force approach; otherwise it is called a no-force approach. Most protocols make use of steal/no-force strategies, so that there is no urgency of writing the buffers back to the disk the moment the transaction commits.
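The write-ahead rule can be illustrated with a small sketch: the log record carrying the before image is forced to disk before the new value overwrites the old one in the database. The record layout and the names LogRecord, wal_update and flush_log are assumptions made for this illustration.

from dataclasses import dataclass

@dataclass
class LogRecord:
    txn_id: str
    item: str
    before_image: object    # old value, needed for UNDO
    after_image: object     # new value, needed for REDO

log_on_disk = []                      # stands in for the stable log
database_on_disk = {"X": 100}         # stands in for the database on disk

def flush_log(records):
    log_on_disk.extend(records)       # force-write the log tail

def wal_update(txn_id, item, new_value, log_buffer):
    old_value = database_on_disk[item]
    # Rule: the log record carrying the before image must be on disk
    # *before* the updated value overwrites it in the database.
    log_buffer.append(LogRecord(txn_id, item, old_value, new_value))
    flush_log(log_buffer)
    log_buffer.clear()
    database_on_disk[item] = new_value    # immediate (steal-style) update

buf = []
wal_update("T1", "X", 150, buf)
print(log_on_disk[-1].before_image, database_on_disk["X"])   # 100 150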

However, just the before image (BFIM) and after image (AFIM) values may not be sufficient for successful recovery. A number of lists, including the list of active transactions (those that have started operating but have not yet committed), the list of committed transactions and the list of aborted transactions, need to be maintained to avoid a brute-force method of recovery.

4. Recovery techniques based on Deferred Update:


This is a very simple method of recovery. Theoretically, no transaction can write back into the database until it has committed. Till then, it can only write into a buffer. Thus, in case of any crash, the buffer needs to be reconstructed, but the database itself need not be recovered.

However, in practice, most transactions are very long and it is dangerous to hold all their updates in the buffer, since the buffers can run out of space and may need a page replacement. To avoid situations wherein a page is removed inadvertently, a simple two-pronged protocol is used.

1. A transaction cannot change the database values on the disk until it commits.
2. A transaction does not reach its commit stage until all its update operations are written onto the log and the log itself is force-written onto the disk.

Notice that in case of failures, recovery is by the NO-UNDO/REDO technique, since all the required data will be in the log if a transaction fails after committing.

4.1 An algorithm for recovery using deferred update in a single user environment

In a single user environment, the algorithm is a straight application of the REDO procedure. It uses two lists of transactions: the transactions committed since the last checkpoint, and the transactions that were active when the crash occurred. Apply REDO to all write_tr operations of the committed transactions from the log, and let the active transactions run again. The assumption is that the REDO operations are idempotent, i.e. the operations produce the same results irrespective of the number of times they are redone, provided they start from the same initial state. This is essential to ensure that the recovery operation does not produce a result that is different from the case where no crash had occurred to begin with.

(Though this may look like a trivial constraint, students may verify for themselves that not all DBMS applications satisfy this condition.)

Also, since there was only one transaction active (because it is a single user system) and it had not yet updated the database, all that remains to be done is to restart this transaction.
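A small sketch of this NO-UNDO/REDO recovery with deferred update follows. It assumes a log made of ("write_tr", transaction, item, value) and ("commit", transaction) records; this record layout is an assumption made purely for clarity.

database = {"X": 100, "Y": 200}

log = [
    ("write_tr", "T1", "X", 150),
    ("commit",   "T1"),
    ("write_tr", "T2", "Y", 250),   # T2 had not committed at crash time
]

def redo_committed(log, database):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    active = {rec[1] for rec in log if rec[0] == "write_tr"} - committed
    # REDO all write_tr operations of committed transactions, in log order.
    # REDO is idempotent: applying it again gives the same result.
    for rec in log:
        if rec[0] == "write_tr" and rec[1] in committed:
            _, _, item, value = rec
            database[item] = value
    return active                    # these transactions are simply restarted

to_restart = redo_committed(log, database)
print(database)      # {'X': 150, 'Y': 200} -- T2's deferred write is ignored
print(to_restart)    # {'T2'}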

4.2 Deferred update with concurrent execution: Most DBMS applications, as we have insisted repeatedly, are multi-user in nature and the best way to run them is by concurrent execution. Hence, protocols for recovery from a crash in such cases are of prime importance.

To simplify matters, we presume that we are talking of strict and serializable schedules, i.e. strict two phase locking is in force and the locks remain effective till the transactions commit. In such a scenario, an algorithm for recovery could be as follows:

Use two lists: the list of committed transactions T since the last checkpoint and the list of active transactions T'. REDO all the write operations of the committed transactions, in the order in which they were written into the log. The active transactions are simply cancelled and resubmitted.

Note that once we impose the strict serializability conditions, the recovery process does not differ much from that of the single user system.

Note that in the actual process, a given item X may be updated a number of times, either by the same transaction or by different transactions at different times. What is important to the user is its final value. However, the above algorithm simply updates the value every time an update of it appears in the log. This can be made more efficient in the following manner: instead of starting from the checkpoint and proceeding towards the time of the crash, traverse the log backwards from the time of the crash. Whenever an item is encountered for the first time in this backward scan, update it and note that its value has been restored; any further (earlier) updates of the same item can be ignored. A small sketch of this appears below.
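A sketch of this backward-scan optimisation, using the same assumed log layout as in the earlier sketch, is given below; only the most recent committed write of each item is applied.

def redo_backwards(log, database):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    already_done = set()
    for rec in reversed(log):                      # newest record first
        if rec[0] == "write_tr" and rec[1] in committed:
            _, _, item, value = rec
            if item not in already_done:           # only the latest write counts
                database[item] = value
                already_done.add(item)

db = {"X": 100}
crash_log = [
    ("write_tr", "T1", "X", 120),
    ("write_tr", "T1", "X", 150),   # the later write; the only one applied
    ("commit",   "T1"),
]
redo_backwards(crash_log, db)
print(db)    # {'X': 150}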

This method, though it guarantees correct recovery, has some drawbacks. Since the items remain locked by the transactions until they commit, concurrent execution efficiency comes down. Also, a lot of buffer space is wasted to hold the values till the transactions commit. The number of such values can be large; when long transactions work in concurrent mode, they delay one another's commit operations.

5. Recovery techniques based on immediate update

In these techniques, whenever a write_tr(X) is issued, the data is written on to the database without waiting for the commit operation of the transaction. However, as a rule, the update operation is accompanied by writing onto the log (on the disk), using a write-ahead logging protocol. This helps in undoing the update operations whenever a transaction fails; the rolling back can be done by using the data in the log. Further, if the transaction is made to commit only after all its updates have been written to the database on the disk, there is no need to redo any of its operations after a failure, because the committed values are already in the database. This concept is called the UNDO/NO-REDO recovery algorithm. On the other hand, if a transaction is allowed to commit before all its values have been written to the database, then a general UNDO/REDO type of recovery algorithm is necessary.

5.1 A typical UNDO/REDO algorithm for an immediate-update single user environment

Here, at the time of failure, the changes envisaged by the transaction may have already been recorded in the database. These must be undone. A typical procedure for recovery should proceed on the following lines:

a) The system maintains two lists: the list of committed transactions since the last checkpoint and the list of active transactions (only one active transaction, in fact, because it is a single user system).
b) In case of failure, undo all the write_tr operations of the active transaction, using the information in the log and the UNDO procedure.

c) For undoing a write_tr(X) operation, examine the corresponding log entry write_tr(T, X, old value, new value) and set the value of X to the old value. The undoing must be done in the reverse of the order in which the operations were written onto the log.
d) REDO the write_tr operations of the committed transactions from the log, in the order in which they were written in the log, using the REDO procedure.

5.2 UNDO/REDO recovery based on immediate update with concurrent execution

In the concurrent execution scenario, the process becomes slightly more complex. In the following algorithm, we presume that the log includes checkpoints and that the concurrency protocol uses strict schedules, i.e. the schedule does not allow a transaction to read or write an item until the transaction that wrote the item previously has committed. Hence, the danger of cascading rollbacks is minimal. However, deadlocks can force aborts and UNDO operations. The procedure, sketched in code below, is as follows:

a) Use the two lists maintained by the system: the list of committed transactions (since the last checkpoint) and the list of active transactions.
b) Undo all write_tr(X) operations of the active transactions, which have not yet committed, using the UNDO procedure. The undoing must be done in the reverse of the order in which the operations were written in the log.
c) Redo all write_tr(X) operations of the committed transactions from the log, in the order in which they were written into the log. Normally, the process of redoing the write_tr(X) operations begins at the end of the log and proceeds in the reverse order, so that when an item X is written more than once in the log, only the latest entry is applied, as discussed in a previous section.
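The following compact sketch illustrates the UNDO/REDO procedure with immediate update. It assumes log records of the form ("write_tr", transaction, item, old value, new value) and ("commit", transaction); the layout and the function name undo_redo_recover are assumptions made for this illustration.

def undo_redo_recover(log, database):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # UNDO: roll back uncommitted writes, newest first, using the old values.
    for rec in reversed(log):
        if rec[0] == "write_tr" and rec[1] not in committed:
            _, _, item, old, _new = rec
            database[item] = old
    # REDO: reapply committed writes; scanning backwards and touching each
    # item only once keeps just the latest committed value (see above).
    done = set()
    for rec in reversed(log):
        if rec[0] == "write_tr" and rec[1] in committed and rec[2] not in done:
            _, _, item, _old, new = rec
            database[item] = new
            done.add(item)

db = {"X": 100, "Y": 200}
crash_log = [
    ("write_tr", "T1", "X", 100, 150), ("commit", "T1"),
    ("write_tr", "T2", "Y", 200, 250),        # T2 active at crash time: undone
]
db["X"], db["Y"] = 150, 250                   # immediate updates already on disk
undo_redo_recover(crash_log, db)
print(db)   # {'X': 150, 'Y': 200}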

6. Shadow paging
It is not always necessary that the original database is updated by overwriting the previous values. As discussed in an earlier section, we can make multiple versions of the data items, whenever a new update is made. The concept of shadow paging illustrates this:

[Figure: Shadow paging. The shadow directory (entries 1 to 8) continues to point to the original pages (e.g. Page 2, Page 5, Page 7), while the current directory (entries 1 to 7) points to the newly created copies of the updated pages: Page 7 (new), Page 5 (new), Page 2 (new).]

In a typical case, the database is divided into pages and only those pages that need updating are brought into the main memory (or cache, as the case may be). A shadow directory holds pointers to these pages. Whenever an update is done, a new block of the page is created (indicated by the suffix (new) in the figure) and the updated values are placed there. Note that the new pages are created in the order of updating and not in the serial order of the pages. A current directory holds pointers to these new pages. For all practical purposes, these are the valid pages and they are written back to the database at regular intervals.

Now, if any rollback is to be done, the only operation needed is to discard the current directory and treat the shadow directory as the valid directory.
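A toy sketch of the shadow paging idea is given below, assuming an in-memory page table; the names ShadowPagedDB, write_page, commit and rollback are purely illustrative.

class ShadowPagedDB:
    def __init__(self, pages):
        self.pages = dict(pages)              # page id -> contents on "disk"
        self.shadow = {i: i for i in pages}   # committed directory (never touched by updates)
        self.current = dict(self.shadow)      # working directory
        self._next_block = max(pages) + 1     # where new page copies go

    def write_page(self, page_id, contents):
        # Copy-on-write: the update goes to a fresh block; only the
        # current directory is re-pointed, the shadow directory is not.
        new_block = self._next_block
        self._next_block += 1
        self.pages[new_block] = contents
        self.current[page_id] = new_block

    def read_page(self, page_id):
        return self.pages[self.current[page_id]]

    def commit(self):
        self.shadow = dict(self.current)      # current becomes the new shadow

    def rollback(self):
        self.current = dict(self.shadow)      # just discard the new pointers

db = ShadowPagedDB({1: "a", 2: "b"})
db.write_page(2, "b-updated")
print(db.read_page(2))     # 'b-updated', via the current directory
db.rollback()
print(db.read_page(2))     # 'b' -- the shadow directory was never changed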

One difficulty is that the new, updated pages are kept at unrelated locations and hence the notion of a contiguous database is lost. More importantly, what happens when the new pages are discarded as part of an UNDO strategy? These blocks form garbage in the system. (The same thing happens when a transaction commits: the new pages become the valid pages, while the old pages become garbage.) A mechanism to systematically identify all these pages and reclaim them becomes essential.

7 Database security and authorization


It is common knowledge that databases should be held secure against damage, unauthorized access and unauthorized updates. A DBMS typically includes a database security and authorization subsystem that is responsible for the security of the database against unauthorized accesses and attacks. Traditionally, two types of security mechanisms are in use.

i) Discretionary security mechanisms: here each user (or a group of users) is granted privileges and authorities to access certain records, pages or files and is denied access to others. The discretion normally lies with the database administrator (DBA).

ii) Mandatory security mechanisms: these are standard security mechanisms that are used to enforce multilevel security by classifying the data into different levels and allowing the users (or a group of users) access to certain levels only, based on the security policies of the organization. Here the rules apply uniformly across the board and the discretionary powers are limited.

While all these discussions assume that a user is allowed access to the system, but not to all parts of the database, at another level efforts should be made to prevent unauthorized access to the system by outsiders. This comes under the purview of the security systems.

Another type of security is enforced in statistical databases. Often, large databases are used to provide statistical information about various aspects like, say, income levels, qualifications, health conditions etc. These are derived by collecting a large number of individual data items. A person who is doing the statistical analysis may be allowed access to the statistical data, which is aggregated data, but he should not be allowed access to individual data. I.e. he may know, for example, the average income level of a region, but cannot find out the income level of a particular individual. This problem is more often encountered in government and quasi-government organizations and is studied under the concept of statistical database security.
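The text above only describes the statistical-security problem; one commonly used control, not spelled out in the text, is a minimum query-set-size rule: an aggregate query is answered only if it ranges over enough individuals. The following sketch illustrates that idea with invented names and an arbitrary threshold.

MIN_QUERY_SET_SIZE = 5     # aggregates answered only over >= 5 individuals

people = [
    {"name": "A", "region": "north", "income": 30000},
    {"name": "B", "region": "north", "income": 45000},
    {"name": "C", "region": "south", "income": 52000},
    # ... many more rows in a real statistical database
]

def average_income(rows, region):
    qualifying = [r["income"] for r in rows if r["region"] == region]
    if len(qualifying) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small: could expose an individual")
    return sum(qualifying) / len(qualifying)

try:
    print(average_income(people, "south"))   # only one matching row -> refused
except PermissionError as e:
    print(e)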

It may be noted that in all these cases, the role of the DBA becomes critical. He normally logs into the system under a DBA account or a superuser account, which provides full capabilities to manage the database, ordinarily not available to the other users. Under the superuser account, he can manage the following aspects of security.

i) Account creation: he can create new accounts and passwords for users or user groups.

ii) Privilege granting: he can grant privileges, such as the ability to access certain files or certain records, to the users.

iii) Privilege revocation: the DBA can revoke some or all of the privileges granted to one or several users.

iv) Security level assignment: the security level of a particular user account can be assigned, so that, based on the policies, the users become eligible or not eligible for accessing certain levels of information.
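A hedged sketch of the discretionary access control book-keeping behind items i) to iv) is given below: accounts, privileges, revocation and security levels kept in plain dictionaries. This is only an illustration, not how any particular DBMS stores its authorization catalogue.

accounts = {}          # user -> {"password": ..., "level": ...}
privileges = {}        # user -> set of (object, right) pairs

def create_account(user, password, level="public"):       # i) account creation
    accounts[user] = {"password": password, "level": level}
    privileges[user] = set()

def grant(user, obj, right):                               # ii) privilege granting
    privileges[user].add((obj, right))

def revoke(user, obj, right):                              # iii) privilege revocation
    privileges[user].discard((obj, right))

def set_security_level(user, level):                       # iv) security level assignment
    accounts[user]["level"] = level

def can_access(user, obj, right):
    return (obj, right) in privileges.get(user, set())

create_account("alice", "secret")
grant("alice", "EMPLOYEE", "select")
print(can_access("alice", "EMPLOYEE", "select"))   # True
revoke("alice", "EMPLOYEE", "select")
print(can_access("alice", "EMPLOYEE", "select"))   # False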

Another aspect of having individual accounts is the concept of database audit. It is similar to the system log that is created and used for recovery purposes. If the log entries also include details of the user name and account number that created or used the transactions which wrote those entries, one can have a record of the accesses and other usage made by each user. This concept becomes useful in follow-up actions, including legal examinations, especially in sensitive and high-security installations.

Another concept is the creation of views. While a database record may have a large number of fields, a particular user may be authorized to have information only about certain fields. In such cases, whenever he requests the data item, a view of the data item is created for him, which includes only those fields which he is authorized to access. He may not even know that there are many other fields in the records.

The concept of views becomes very important when large databases, which cater to the needs of various types of users, are being maintained. Every user can have and operate upon his own view of the database, without being bogged down by the details. It also makes the security maintenance operations convenient.

Chapter: 7
DATABASE RECOVERY, BACKUP & SECURITY
End Chapter quizzes

Q1. Database failures can occur due to
a) Transaction failure b) System crash c) Both a and b d) Data backup

Q2. The granting of a right or privilege that enables a subject to have legitimate access to a system or a system's objects is called
a) Authentication b) Authorization c) Data Unlocking d) Data Encryption

Q3. The process of periodically taking a copy of the database and log file on to offline storage media is called
a) Back up b) Data Recovery c) Data Mining d) Data Locking

Q4. The encoding of the data by a special algorithm that renders the data unreadable is called
a) Data hiding b) Encryption c) Data Mining d) Both a and c

Q5. Access right to a database is controlled by
a) top management b) system designer c) system analyst d) database administrator

Q6. A firewall is a system that prevents unauthorized access to or from a
a) Lock b) Private network c) Email d) Data Recovery

Q7. A digital certificate is an attachment to an electronic message used for
a) security purposes b) Recovery purposes c) Database Locking d) Both a and c

Q8. Rollback and Commit affect
a) Only DML statements b) Only DDL statements c) Both a and b d) All statements executed in SQL*PLUS

Q9. Large databases that are used to provide statistical information are known as
a) Geographical Databases b) Statistical Databases c) Web Databases d) Time Databases
