
Operating Systems

This minisite contains notes taken by Chris Northwood whilst studying Computer Science at the
University of York between 2005-09 and the University of Sheffield 2009-10.
They are published here in case others find them useful, but I provide no warranty for their
accuracy, completeness or whether or not they are up-to-date.
The contents of this page have dubious copyright status, as great portions of some of my revision
notes are verbatim from the lecture slides, what the lecturer wrote on the board, or what they
said. Additionally, lots of the images have been captured from the lecture slides.
An operating system is a program that acts as an intermediary between a user of the PC and the
hardware.
System Components
A computer system can be abstracted into various components. Users (people, or other machines and systems) interact with system and application programs (which define the way in which system resources are used to solve the users' computing problems), which interact with the OS (which controls and co-ordinates the use of hardware among the various application programs for the user), which interacts with the hardware (the most basic computing resources - CPU, memory, I/O, etc). When developing an OS, all levels must be considered, not just the OS level.

Users interact with the hardware (monitor, keyboard, mouse, etc) to control the system and application programs. A program needs an efficient interface to the OS to utilise the resources it needs to accomplish the tasks the user wants to do, and to communicate with the user. The OS interface describes the calls that an application program can make into the OS.
The OS is designed for ease of use, in the sense of not having to deal with low-level hardware issues. This is more important than CPU usage - a user doesn't care if the CPU is idle 99% of the time. The goal of OS design is to provide a useful, simple and efficient system:
ease of use
minimise resource usage
maximise resource utilisation
However, abstraction away from the physical machine to provide ease of use can lead to
inefficiencies. There's a trade off required between the extra functionality of the virtual machine
and the overheads required to implement this.
History of Operating Systems
See CAR for Computer History.
1st Generation (1945-1955)
These are systems which didn't have operating systems. Users had complete access to the
machine and the interface was basically raw hardware. Programming was accomplished by
wiring up plugboards to control basic behaviour of the computer - the problems to be solved
were normally numerical calculations.
Operating systems and programming languages were virtually unheard of.
2nd Generation (1955-1965)
These systems had minimal operating systems and were typically mainframe computers used to
support commercial and scientific applications. Jobs and programs were written using punch
cards and the computers were normally application specific.
The minimal OS was required to interact with I/O devices and a subroutine library existed to
manage I/O interaction. The library is loaded into the top of memory and stays there. Application
programs existed (compiler/card reader), but were limited.
Optimisations could be made in reducing the setup time by batching similar jobs. A small computer reads a number of jobs onto magnetic tape and the magtape is run on the mainframe. A number of jobs are then done in succession without wasting CPU time. These systems had a small permanently resident OS which had initial control and controlled automatic job sequencing.
These systems had a new memory layout, and the processor had two modes: Normal, where some instructions (e.g., HALT) are forbidden, which was used by the batch programs; and Privileged, where all instructions are available, which was the level the OS ran in. Some CPUs also implemented protection in Normal mode to prevent the user program overwriting the first section of memory, in which the OS existed. Normal mode also has an instruction to call the OS (hand back control).
Batch systems were good for CPU-bound scientific computing with minimal I/O, but bad for data processing with high I/O, as the CPU is often idle. The speeds of the mechanical I/O devices are considerably slower than that of the electronic computer, and jobs often have to pause when performing I/O. In some cases, the CPU was idle for >80% of the time.
However, with a permanently resident OS in memory, the OS now becomes an overhead, with less physical memory available for the application running at the time. Programming discipline in the application is still required, as the OS cannot protect against the application crashing the machine via bad I/O, infinite loops, etc...
3rd Generation (1965-1980)
The introduction of ICs led to minicomputers, which tried to merge scientific computers with I/O (data processing) computers, and this introduced multiprogramming. Disk drives allowed jobs and programs to be stored on disk, giving the computer access to many jobs. The OS can now load many jobs into memory by giving each one a memory partition and then execute them by multiplexing between them.
In these OSs, the OS picks a job and then passes control to it. Eventually, the job either finishes
or waits for some task (e.g., I/O) to complete. In a non-multiprogrammed system, the CPU would
now be idle. In a multiprogrammed system, the OS switches the CPU to another job.
Eventually, the operation the original job was waiting for completes, making that job eligible for CPU time again. As long as there is at least one job waiting to execute, the CPU is not idle.
When a job completes, the OS can load another job from disk into that memory slot. This use of the disk as a buffer of waiting jobs is called spooling - Simultaneous Peripheral Operation On-Line - and the same idea is still used today for operations such as print queues.
Multiprogramming OSs are quite complex - they now handle quite complex I/O devices (disks,
etc) and swapping between applications when suspended on I/O.
Multiprogramming works best when there are a small number of large jobs - this reduces the OS overhead of managing memory, as there are fewer jobs to manage. This method also encourages a fairer overall use of the computer.
Multiprogramming systems led on to time-sharing/interactive systems. The CPU here is multiplexed between several jobs that are kept in memory and on disk. The CPU is allocated to a job only if the job is in memory. When a job waits on I/O, the OS switches the CPU to another job.
Jobs can also be switched in between memory and the disk. On-line communication between the
user and the system is provided - when the OS finishes the execution of one command, it seeks
the next "control statement" from the user's keyboard.
In time-share systems, the user has the illusion of using the computer continuously and the user
can now interact directly with the computer whilst it is running. This encourages smaller jobs
and more efficient use of the computer. Additionally, many users can share the computer
simultaneously, submitting jobs/commands directly via keyboards/terminals, etc.
The first time-sharing system was CTSS, developed by MIT. This formed the basis for MULTICS, an early commercial OS that had limited success. UNIX was then developed by Bell Labs in the high-level language C (itself based on B).
4th Generation (1980-present)
The fourth generation of operating systems brought about the first mass-user operating systems, as LSI and VLSI enabled low-cost computing, with mini and desktop computers becoming common.
The age of mass computing is based around desktop computers. The desktop PC is dedicated to a single user, and the distinction between desktop computers and minicomputers became increasingly blurred. Many minicomputers are now merely high-end desktop PCs with advanced I/O capabilities and multiprocessors.
The multiprogrammed OS (UNIX, VMS, etc) was developed further for mass-user computers, adding functionality like massive file storage, fast networking, complex application programs and the potential for multiprocessing. Modern desktop systems have dedicated single-user OSs (e.g., CP/M, DOS, MS-DOS, Mac OS, Windows) with complex I/O devices (keyboards, mice, display screens, small printers, graphics tablets, etc) and complex graphical applications, with user convenience and responsiveness considered of high importance.
An increased blurring between desktop computers and minicomputers has led to minicomputer
OSs becoming available for desktop, e.g., Linux - based on the BSD and SysV variants of UNIX.
Much of the current OS trends are based in the relative cost of computers and people. In the beginning, computers were few with very expensive hardware, so the people cost was cheap compared to the computer cost, therefore the goal was to maximise hardware utilisation. Now we have an amazing number of computers with cheap hardware and the human cost is expensive compared to the hardware cost. The goal now is ease of use.
Current OS trends are towards specialist requirements - moving away from the general purpose OS for specialist needs, or tuning existing general purpose OSs. One example is real-time OSs, which require low overheads and high efficiency of resource usage. Multi-processor machines are becoming more common and OSs need to be updated to handle the extra hardware. On the other hand, embedded devices are now becoming very common, and these have tight power, space and weight constraints.
Role of the Operating System
Resource Sharing
The OS contains a set of algorithms that allocates resources to the programs executed on behalf
of the user. These resources include time, power, hardware, etc...
Control Program
The control program controls the operation of the application programs to prevent errors
affecting other programs.
Provision of a Virtual Machine
This hides interfaces to I/O devices, filing systems, etc, and provides a programming interface
for applications.
Kernel
The kernel is the only program resident all the time (all other programs are application programs).
The modern general purpose OS contains various components to accomplish these roles, such as:
Process management
Main memory management
File management
I/O system management
Secondary storage management
Protection system
Command interpreter system
All but the last two in the above list can be considered to be resource management.
Processes
A program is an algorithm expressed in some notation. A process is the activity of executing the
program.
Early systems only allowed one program to be executed at a time. Later systems allowed
multiprogramming, where many programs are loaded into memory at once and only one can be
executed at any time. Programs are termed jobs or processes when executing. A process is
sequential and can only do one thing at a time.
Multiple user processes run in separate (protected) address spaces, and can't directly access other processes' memory. Some user programs can share address space (e.g., an interpreter), but only the code (which is read only) is shared. The kernel, however, can access everything.
As the CPU only has one program counter, the OS must use a system known as process switching.

Process Creation
Processes are created by one of four principal events:
1. System initialisation (typically, this is the OS and system programs)
2. A process creation system call made by a running process (the kernel or a user program)
3. User execution (normally by interacting with some kind of shell)
4. Initiation of a batch job (typically only in batch systems, however)
When a new process is created, the OS must assign memory and create a task/process control
block (TCB/PCB) which holds various data about the currently running process, such as:
Process state
Unique program ID
Program counter
CPU registers
Memory information (this, along with the program counter and CPU registers forms the
process context)
I/O status information (e.g., open files)
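As a rough sketch, the PCB might be represented by a structure like the one below (the field names and sizes here are invented for illustration; real kernels hold much more):

/* Minimal sketch of a process control block - illustrative only;
   the field names and types are invented, not from any real kernel. */
enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

struct pcb {
    int              pid;             /* unique process ID                       */
    enum proc_state  state;           /* current process state                   */
    unsigned long    program_counter; /* saved program counter                   */
    unsigned long    registers[16];   /* saved CPU registers                     */
    void            *memory_info;     /* memory-management information           */
    int              open_files[16];  /* I/O status: open file descriptors       */
    struct pcb      *parent;          /* link used for the process hierarchy     */
};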
Process Hierarchies
When a parent creates a child process, the child process can create its own child processes, and
we get a hierarchy (or tree) of processes. UNIX terms this as a process group, whereas Windows
has no such concept, and all processes are created equal.
Process State
In single processor systems, only one process can be in the running state at any time. If all processes are waiting (or blocked), then the processor is idle until one of the processes becomes unblocked, or ready.
The states possible are:
new: The process is being created
running: Instructions are being executed
waiting: The process is waiting for some event to occur
ready: The process is waiting to be run
terminated: The process has finished execution

The switch between processes is called context switching - where the context of the currently
running program is saved in the process context of the PCB and then the process context from
the new PCB is loaded into the CPU. The context switch time is overhead as the CPU does
nothing useful whilst switching. The time it takes for a switch is dependent on the hardware
support available.
Process Termination
There are various conditions that can terminate processes:
1. Normal exit (voluntary)
2. Exit on an error condition (voluntary)
3. Fatal error (involuntary)
4. Killed by another process (involuntary)
When a process is terminated, all resources used by the process are reclaimed by the OS, with the PCB last (as the PCB contains information on where the resources to be reclaimed are).
System Calls
System calls provide an interface between the running process and the OS. They are available as
an assembly level instruction, normally some kind of software interrupt (INT 0x80 on Linux)
which corresponds to a system call to which the OS reacts appropriately.
Parameters are used in system calls, just as procedure calls. Normally registers, tables or stacks
are used, but in modern processors, tables are the most popular. Another method is using
instructions that force an address space change without using an interrupt, but the effect is the
same.
System calls normally map onto the services provided by the OS.
POSIX is an IEEE standard based on the UNIX interface. It is not an OS, merely a standard API, which is very vague and covers almost all system calls you would ever have to make. Some aspects of most major OSs are POSIX compliant, but there exists no complete implementation of the POSIX standard. Linux is primarily POSIX compliant for the core calls.
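As an illustration of the POSIX interface, the sketch below makes a few core system calls through their C library wrappers (the file path is arbitrary and error handling is minimal):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[128];
    int fd = open("/etc/hostname", O_RDONLY);   /* system call: open a file       */
    if (fd < 0)
        return 1;
    ssize_t n = read(fd, buf, sizeof buf);      /* system call: read from it      */
    if (n > 0)
        write(STDOUT_FILENO, buf, (size_t)n);   /* system call: write to stdout   */
    close(fd);                                  /* system call: release descriptor */
    return 0;
}

Each wrapper ultimately traps into the kernel, e.g., via INT 0x80 or a dedicated system call instruction on Linux/x86.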
Threads
Traditional (heavyweight) processes have a single control path through their code. Many applications are concurrent, but do not want the overhead of multiple processes and inter-process communication (e.g., a word processor has a UI, reading input, a background spell checker, etc). Many OSs provide threads (sometimes called lightweight processes), a basic unit of CPU use. Each is a separate control path through the code, and threads within a process are essentially independent. Threads can access all of the address space of the process, and threads have no protection against each other (therefore, you should program threads safely).
Single and Multi-Threaded Processes
Each thread consists of a thread ID, program counter, register set and a stack. A thread shares with the other threads belonging to its process the code section, the data section and OS resources such as open files. Each thread needs context, but not as much as the process needs, as it inherits some of the context of the process.

Threading has various benefits, such as responsiveness - other threads can continue if one blocks
(e.g., on I/O), resource benefits - there's only one copy of the code and TCB, etc, economy -
thread creation and management is less costly than that of processes, and better utilisations of
multi-processor architectures (threads can run in parallel).
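A minimal POSIX threads sketch (the worker function and thread count are arbitrary) showing several threads sharing the process's data:

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                    /* shared by all threads in the process */

static void *worker(void *arg)
{
    long id = (long)arg;
    shared++;                             /* no protection between threads        */
    printf("thread %ld ran\n", id);
    return NULL;
}

int main(void)                            /* compile with: cc file.c -pthread     */
{
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* wait for all threads to finish       */
    printf("shared = %d\n", shared);
    return 0;
}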
User Threads
Implemented by user-level libraries, these provide threads to an application without the support of the underlying OS. Here, the application handles its own threading, so the OS only sees one process. The application has a thread table, used in the application for keeping track of thread state and context, in a similar way to how the kernel handles TCBs for processes. The state diagram for a thread looks similar to the one for a process:

User threads are usually fast and have low management overheads (as a library, the threads library is within the address space of the process, so calls to the threads library functions are fast). The application can also define its own management (i.e., scheduling) policy, based on better information about its threads than a more general policy would have.
This method does have disadvantages, however. When a thread calls I/O, the whole process is
blocked, as the OS sees the whole process as waiting for I/O. When an application has threads
that are likely to block often, then user threads may not be the best option.
Kernel Threads
Kernel threads are implemented at the OS level, and therefore have higher overheads (a thread control space/table is now needed in the TCB) and are slower, due to syscalls. With kernel threads, the OS schedules all threads, so it can schedule appropriately (e.g., when one thread blocks, only move that thread into waiting instead of the whole process). Threads can also be scheduled over multiple processors, so we have parallelism instead of just concurrency (which can introduce its own set of problems).
Hybrid Threading
Both user and kernel threading have their advantages and disadvantages, so using a hybrid method gives you the best of both worlds. Most real implementations of threading are hybrid.
A common approach is to multiplex a number of user threads onto a number of kernel threads;
each kernel thread has some set of user threads that take turns using the kernel thread. There is
usually a maximum number of kernel threads per user process.
Many-to-One

This is essentially user-level threading (with no thread management in the OS); many user-level threads are mapped to a single kernel thread. All thread management happens in user space, so this is efficient, but it does suffer from the problem of one thread blocking the whole process. The single kernel thread may simply be the process itself, which is how this model is used on systems that don't support kernel threads.
One-to-One

All threads here are kernel threads - each user thread maps directly on to one kernel thread. This
has more concurrency than the many-to-one model, as when one thread blocks another can run.
This does have the overhead of having to create a kernel thread for each user thread, however.
Many-to-Many

This is the normal method of implementing threading. Here, the user library controls everything - which user threads are mapped onto kernel threads at any moment - and also controls the swapping between them. The library tries to manage the threads so that not all of the kernel threads end up blocked at the same time, by swapping waiting user threads out of, and runnable ones into, kernel threads.
This has the advantages of there being control over concurrency in the application, and the
application can allocate user threads to kernel threads in a way to optimise the concurrency and
ensure that at least one user thread is runnable at any time.
This method is quite complex, however, and there are overheads in both the OS and the application. The user thread scheduling policy may also conflict with the kernel thread scheduling policy, introducing inefficiencies.
Threading Issues
fork() creates a separate duplicate process and exec() replaces the entire process with the program specified by its parameter. However, we need to consider what happens in a multi-threaded process. exec() works in the same manner - replacing the entire process including any threads (kernel threads are reclaimed by the kernel) - but if a thread calls fork(), should all threads be duplicated, or is the new process single threaded?
Some UNIX systems work around this by having two versions of fork(): a fork() that duplicates all threads and a fork() that duplicates only the calling thread; usage of these two versions depends entirely on the application. If exec() is called immediately after a fork(), then duplicating all threads is not necessary, but if exec() is not called, all threads are copied (an OS overhead is implied to copy all the kernel threads as well).
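A sketch of the common fork()-then-exec() pattern (the command being run is arbitrary):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                    /* duplicate the calling process            */
    if (pid == 0) {
        /* child: replace the whole process image with a new program */
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(1);                          /* only reached if exec fails               */
    } else if (pid > 0) {
        waitpid(pid, NULL, 0);             /* parent: wait for the child to terminate  */
    } else {
        perror("fork");                    /* process creation failed                  */
    }
    return 0;
}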
Thread Cancellation
Cancellation is termination of a thread before it has completed (e.g., if multiple threads are searching a database and one finds an answer, the others can stop).
There are two types of cancellation: asynchronous and deferred (or synchronous). With asynchronous cancellation, the calling thread can immediately terminate the target thread, but in deferred cancellation, the target thread is informed that it is to be terminated by setting some status in the target thread. When the target thread consults its status, it finds out it is to terminate, performs some clean-up and terminates itself. This requires some co-ordination between threads.
Asynchronous cancellation is useful if the target thread does not respond; otherwise it should not be used, as it can lead to inconsistencies.
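With POSIX threads, deferred cancellation looks roughly like the sketch below: the target thread polls for a pending cancellation request at points of its own choosing (the searcher function is invented for illustration):

#include <pthread.h>

static void *searcher(void *arg)
{
    (void)arg;
    int old;
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);  /* the default mode */
    for (;;) {
        /* ... search one chunk of the database ... */
        pthread_testcancel();   /* cancellation point: exit here if cancellation is pending */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, searcher, NULL);
    /* another thread has found the answer, so ask the searcher to stop */
    pthread_cancel(t);          /* sets the cancellation status in the target thread */
    pthread_join(t, NULL);      /* the thread terminates at its next cancellation point */
    return 0;
}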
Signal Handling
This is used in UNIX to notify a process that an error or event has occurred. A signal is normally generated by an event and is delivered to the process. Once delivered, the signal must be handled.
Synchronous signals are delivered to the same process that performed the operation that caused the signal, e.g., divide by 0 errors, etc...
Asynchronous signals are delivered to running processes and are generated by something external to the process, e.g., a key press.
The signal can be handled by a default signal handler, or a user-defined signal handler. User-
defined overrides the default handler (in most cases).
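A sketch of installing a user-defined handler with sigaction() (SIGINT is chosen arbitrarily as the example signal):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t handled = 0;

static void on_sigint(int sig)
{
    (void)sig;
    handled = 1;                          /* just record that the signal arrived          */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;            /* user-defined handler overrides the default   */
    sigaction(SIGINT, &sa, NULL);         /* asynchronous signal, e.g., Ctrl-C at keyboard */
    while (!handled)
        pause();                          /* sleep until some signal is delivered         */
    printf("caught SIGINT\n");
    return 0;
}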
In multi-threaded systems, we have to consider which thread the signal should be delivered to:
To the thread where the signal applies (how do you know this?)
To every thread (has a lot of overhead)
To certain threads (which ones?)
To a designated thread, which handles all signals (has the problem of there being the
overhead of another thread, and the thread may not be scheduled in the best way to
handle the signal)
CPU Scheduling
CPU scheduling refers to the assignment of the CPU amongst processes that are ready. It selects
from among the processes in memory that are ready to execute and allocates the CPU to one of
them. In terms of process state, the scheduler decides which of the processes that are in the ready
state should be allocated to the CPU and enter the running state.
The CPU scheduling decisions may take place when a process:
1. switches from the running to the waiting state
2. switches from the running to the ready state
3. switches from the waiting to the ready state
4. terminates
A scheduling decision may also take place as the result of a hardware interrupt.
We call scheduling under conditions 1 and 4 nonpreemptive - once the program is running, it
only gives up the CPU voluntarily. Scheduling under conditions 2 and 3 is preemptive - that is,
the running process can be preempted by an interrupt. Problems may arise when a program is
preempted whilst accessing shared data, which gives us a need for mutual exclusion.
Dispatcher
The dispatcher module gives control of the CPU to the process selected by the scheduler. This
involves:
switching context
switching to user mode
jumping to the proper location in the user program to restart that program
With the dispatcher, we have a new metric, dispatch latency, which is the time it takes for the
dispatcher to stop one process and start another running. Dispatch latency is optimal when it is
minimised.
The dispatcher is an example of a mechanism - how scheduling is actually carried out. Policies decide what will be done, and the separation of policies and mechanisms is an important OS principle, allowing flexibility if policy decisions are changed later.
Process Scheduling Queues
This is based upon the states in the process state transition graph. The OS needs to find the TCBs
of all processes efficiently, and needs to find a process waiting on a particular event quickly
when that event occurs. Specifically, the OS needs to find the next process to dispatch (i.e., from
the ready queue) quickly on a context switch. You also need to be able to move processes
between queues efficiently.
A process is only in one queue at a time. For example, you may have a ready queue, which is the set of all processes residing in main memory, ready to execute, and also a number of waiting queues, one for each different I/O device, which is the set of processes waiting on I/O from that device. Processes move between queues as they change state. Pictorially, this may look like this:

With this model, we have two types of schedulers, the short-term scheduler (or CPU scheduler),
which selects which process should be executed next from the ready queue and allocates CPU. It
is invoked very frequently (every 100 or so milliseconds), and must be very fast. Additionally,
we have the long-term scheduler (or job scheduler). This selects which new processes should be
brought into the ready queue, and is invoked very infrequently (in terms of seconds and perhaps
even minutes). The long-term scheduler controls the degree of multi-programming (i.e., the
number of processes in memory). However, how do we decide which jobs should be brought into main memory?
Processes can be described as being either I/O-bound (spends more time doing I/O than
computations, and has many short CPU bursts) or CPU-bound (spends more time doing
computation, has a few very long CPU bursts). The long term scheduler tries to maintain a
balance between these two forms.
Some OSs also have a medium term scheduler, which swaps in/out partially executed processes.

Scheduling Policy
When the OS needs to select another process to execute, the scheduling policy decides which of
the processes to assign to the CPU. Scheduling policy effectively determines the order of the
ready queue so that the next process to run is at the head of the queue.
Scheduling policies are often defined on the CPU-I/O burst cycle. Most process executions consist of a CPU execution and an I/O wait. The characteristics of this cycle help determine policies. The relationship between the relative sizes of CPU and I/O bursts is very important; I/O bursts are very long compared to CPU bursts in most applications. A histogram of CPU-burst times gives us a typical CPU-burst graph, showing many short CPU bursts and a few long CPU bursts, which are the CPU-bound processes.
Some of the goals of scheduling policies are:
CPU utilisation high (keeping the CPU as busy as possible)
Throughput high (the number of processes that complete their execution per time unit)
Turnaround time low (the amount of time it takes to execute a particular process)
Waiting time low (amount of time a process has been in the waiting queue)
Response time low (amount of time it takes from when a request was submitted until the
first response is produced, not output - for time-sharing environments)
Turnaround time alone is a bad goal to optimise towards, as overall processes will take longer.
First-Come, First-Served Scheduling
In First-Come, First-Served (FCFS) scheduling, processes are scheduled in the order that they
arrive in the ready queue.
The average waiting time for FCFS is not optimal, however, as it suffers from the convoy effect,
where short processes are delayed behind long processes.
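As an illustration with made-up burst times: if P1, P2 and P3 arrive in that order with CPU bursts of 24, 3 and 3 ms, the FCFS waiting times are 0, 24 and 27 ms - an average of 17 ms. Had they arrived in the order P2, P3, P1, the waiting times would be 0, 3 and 6 ms - an average of only 3 ms. The long job at the head of the queue is what causes the convoy effect.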
Shortest-Job-First Scheduling
Shortest-Job-First (SJF) scheduling works by associating each process with the length of its next CPU burst, and then using these lengths to schedule the process with the shortest time next.
This has two schemes: non-preemptive (once the CPU is given to the process, it cannot be preempted until it completes its CPU burst); and preemptive (if a new process arrives with a CPU burst length less than the remaining time of the currently executing process, preempt), which is sometimes known as Shortest-Remaining-Time-First (SRTF).
SJF is optimal in that it gives the minimum average waiting time for a given set of processes.
Determining the length of the next CPU burst can only be done by estimating the length. One way of accomplishing this is by using the length of previous CPU bursts and exponential averaging, with the formula τ(n+1) = α·t(n) + (1 - α)·τ(n), where t(n) = the actual length of the nth CPU burst, τ(n+1) = the predicted value for the next CPU burst, and 0 ≤ α ≤ 1.
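A minimal sketch of this estimate in code (α = 0.5 is an arbitrary illustrative choice):

/* Exponential averaging of CPU burst lengths - an illustrative sketch. */
static double predict_next_burst(double actual_burst, double prev_prediction)
{
    const double alpha = 0.5;   /* weighting of recent history, 0 <= alpha <= 1 */
    return alpha * actual_burst + (1.0 - alpha) * prev_prediction;
}
/* e.g., with a previous prediction of 10 ms and an actual burst of 6 ms,
   the next prediction is 0.5*6 + 0.5*10 = 8 ms. */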
Priority Scheduling
A priority number (integer) is associated with each process and the CPU is allocated to the
process with the highest priority (the smallest integer). This can be either preemptive or non-
preemptive. SJF is a priority scheduling policy if the priority is predicted next CPU burst time.
This does have the problem of starvation, where low priority processes may never execute, and
the solution to this is aging, where as time progresses, the priority of the process is increased.
Round Robin
Each process gets a small unit of CPU time (a time quantum), which is usually about 10-100 ms. After this time has elapsed, the process is preempted and added to the end of the ready queue. A smaller time quantum means more context switches and therefore more overhead, so the quantum needs to be set to some appropriate value that is large with respect to the time taken by a context switch, but not so high as to basically become a FIFO policy. If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units at once. No process waits more than (n - 1)q time units.
Average turnaround time does not necessarily improve as the quantum increases, so quantum size is a trade-off between good turnaround and low overheads.
Multi-level Scheduling
Here, the ready queue is partitioned into separate queues, e.g., a foreground queue (for interactive processes) and a background queue (for batch processes). Each queue has its own scheduling algorithm, for example the foreground one may be RR and the background one may be FCFS.
Each queue must also be scheduled for CPU time; for example, we could have a fixed priority scheduling algorithm (serve all from foreground, then from background), but this does have the possibility of starvation, or we could have a time-slice, where each queue gets a certain amount of CPU time which it can schedule among its processes, e.g., 80% to foreground and 20% to background.
Multi-level Feedback Queue
A process can move between the various queues in a multi-level scheduling policy, and this is a
way of implementing aging. The multilevel-feedback-queue scheduler is defined by the
following parameters:
number of queues
scheduling algorithms for each queue
method used to determine when to upgrade a process
method used to determine when to downgrade a process
method used to determine which queue a process will enter when that process needs
service
Thread Scheduling
For user-level threads, any general purpose scheduling policy can be used according to the best
policy for that application. General user-level thread implementations should allow the
application sufficient freedom to define an application-specific scheduling regime, i.e., allow the
application to reorder the thread scheduling queues at any time, with any policy.
Memory Management
Processes (including threads) need physical memory in order to execute. The OS manages memory and allocates memory to processes from physical memory and virtual memory.
Physical memory management enables the OS to safely share physical memory among processes. This allows us to multiprogram the processor, to time-share the system (overlapping computation and I/O) and to prevent one process from using another process's memory.
Memory management techniques must meet the following requirements:
Processes may require more memory than available physical memory
You may need to load processes to different locations each time they execute
You may need to swap all or part of a process to secondary storage during execution (and
then reload it at a different memory location, as above)
Memory protection between processes is required
Some simple memory management techniques include loading the process into memory and
switching between them with no protection between the processes (DOS), or copying the entire
process memory to secondary storage when it performs I/O and then copy it back when it is
ready to restart (early UNIX). This is clearly inefficient.
Most modern OSs work by giving each process a part of memory that it is allowed to access and
check each memory access to ensure that the process stays within its own memory.
Physical Memory Allocation
Main memory must accommodate both the OS and user processes. Usually, the main memory is divided into a number of distinct partitions, such as the resident OS, which is normally a single partition held in low memory, as this is where CPUs place interrupt vectors. User processes normally occupy one or more partitions in the memory above the OS. However, we need to consider how we're going to divide up these user partitions.
The simplest scheme is to partition into regions with fixed boundaries - this is called fixed partitioning and has two types: where the partitions are of equal size, or where they are of unequal size. Any process whose size is less than or equal to the partition size can be loaded into an available partition. If all the partitions are full, the operating system can swap a process out of a partition. However, if a process doesn't fit into a partition, it is the programmer's responsibility to make sure that the program can be split over multiple partitions. With this, main memory use is inefficient: any program, no matter how small, occupies an entire partition. This is called internal fragmentation.
With unequal partitions, the problem is reduced, but not eliminated. Instead of creating x
partitions of equal size, partitions of varying sizes are created, so a program can be loaded into
the one most appropriate to its size.
Another method is dynamic partitioning, where the partitions are of variable length and number
and are created when the process asks for them - so, the process is allocated exactly as much
memory as it requires, eliminating internal fragmentation. It does however leave holes in
memory, as processes are swapped in and out in different places, leaving small gaps between
processes. This is called external fragmentation and could lead to a situation where there is
enough free memory for a process to run, but it is fragmented across main memory. The way to
solve this is using compaction - rearranging the programs in memory so they are contiguous and
all free memory is in one block. This is very, very slow.
There are different ways of implementing allocation of partitions from a list of free holes, such
as:
first-fit: allocate the first hole that is big enough
best-fit: allocate the smallest hole that is big enough; the entire list of holes must be searched, unless it is ordered by size
next-fit: scan holes from the location of the last allocation and choose the next available block that is large enough (can be implemented using a circular linked list)
First-fit is usually the best and fastest, but analysis has shown that under first-fit, fragmentation
can cause up to half of the available memory to be lost in holes (external fragmentation) or
unused memory (internal fragmentation). Best-fit has the smallest amount of external
fragmentation. Next-fit can produce external fragmentation similar to best-fit or first-fit, depending upon the next block found. It does use holes at the end of memory more frequently than the others, perhaps leading to the largest blocks of contiguous memory.
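A sketch of first-fit over a linked list of free holes (the structure is invented for illustration; a real allocator would also split the chosen hole and coalesce neighbouring ones):

#include <stddef.h>

struct hole {
    unsigned long start;      /* start address of the free region */
    unsigned long size;       /* length of the free region        */
    struct hole  *next;       /* next hole in the free list       */
};

/* First-fit: return the first hole big enough for the request, or NULL. */
static struct hole *first_fit(struct hole *free_list, unsigned long request)
{
    for (struct hole *h = free_list; h != NULL; h = h->next)
        if (h->size >= request)
            return h;
    return NULL;
}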
A common approach, especially in embedded systems is the buddy system. Under the buddy system, the entire space is treated as a single block of size 2^U. If a process requests memory of size s, such that 2^(U-1) < s <= 2^U, the entire block is allocated, otherwise the block is split into two equal buddies. This process continues until the smallest block greater than or equal to s is generated. At any time, the buddy system maintains a list of holes of each size 2^i. A hole on the (i + 1) list can be split into two equal holes of size 2^i in the i list.
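A sketch of the size rounding the buddy system performs (illustrative only):

/* Round a request of size s up to the buddy block size: the smallest
   power of two that is >= s (so 2^(k-1) < s <= 2^k). */
static unsigned long buddy_block_size(unsigned long s)
{
    unsigned long block = 1;
    while (block < s)
        block <<= 1;          /* keep doubling until the block is big enough */
    return block;             /* e.g., a 40 KiB request gets a 64 KiB block  */
}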
Implementing Physical Memory Management
There are two main problems in implementing physical memory management - how can
processes be run in different areas of memory, given that they contain lots of addresses, and how
can the OS provide memory protection.
Memory addresses in a program must be translated into physical memory addresses, so programs are created to be relocatable, such as the ELF binaries for Linux. There are three points where the binding of program addresses to memory can occur:
compile time - if the memory location is known at compile time, static absolute addresses can be compiled in; if the starting address changes, the code must be recompiled.
load time - here, the code generated is relocatable and can be loaded at any address, with the final binding occurring when the program is loaded. If the memory locations change, the code just needs to be reloaded.
execution time - the binding is delayed until runtime, so the process can be moved from one memory segment to another during execution. This method does require hardware support.
To work around this, we have the concept of a logical address space that is bound to a separate physical address space, which is the key to memory management. Logical addresses are the ones generated by the CPU, and physical addresses are the ones seen by the memory unit. When using compile-time or load-time address-binding schemes, the logical and physical addresses are the same, but with the execution-time address-binding scheme, the logical (virtual) and physical addresses differ. This is accomplished using a memory management unit.
The memory management unit (MMU) is a hardware device that maps a logical address to a
physical one, so the user program only deals with logical addresses, and never sees the physical
one.
The most simple form of MMU is the relocation register. The relocation register holds the value
of an offset and then adds the offset to all logical addresses given to it to give the physical
address. In this system, logical addresses are in the range 0..max for all processes and physical
addresses are in the range R..R+max, where R is the value in the relocation register. The process
only needs to generate the logical addresses, and only sees the process running in addresses
0..max. This principle of mapping is central to memory management.
The relocation register can provide memory protection with the addition of a limit register,
which contains the allowed range of logical addresses - each logical address must be less than the
limit register. Hardware support is obviously required for this.
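A sketch of what the relocation/limit pair does on each access (this is done in hardware by the MMU; C is used here purely to illustrate the check and the mapping):

#include <stdbool.h>

/* Every memory access is checked and relocated by the MMU; this function
   mimics that in software. Returns false where hardware would trap to the OS. */
static bool translate(unsigned long logical, unsigned long relocation,
                      unsigned long limit, unsigned long *physical)
{
    if (logical >= limit)                 /* outside the allowed range 0..limit-1 */
        return false;                     /* protection violation: trap to the OS */
    *physical = relocation + logical;     /* physical address = R + logical       */
    return true;
}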
Another way of thinking of this is not thinking of a program as one large chunk of contiguous
memory - a process may be broken up into pieces that do not need to be loaded into main
memory all the time. This increases opportunities for multiprogramming as with many processes
in main memory, it is more likely a process will be in the ready state at any particular point. This
is achieved by programming techniques with minimal OS support.
Static vs. Dynamic Loading
So far, we have assumed that the entire program code and data must be loaded into memory for the process to execute - this is called static loading. An alternative approach is dynamic loading, which works by not loading a routine until it is called. This gives better memory-space utilisation, and unused routines are never loaded. This is useful when large amounts of code are required to handle infrequently occurring cases. If dynamic loading is implemented through program design, no special support from the OS is required.
Dynamic loading enables a small part of the process to be loaded initially - usually around the program entry point, with part of the data - and allows a greater degree of multiprogramming, as only parts of the program need to be in physical memory at any time. It is usually the responsibility of the application to ensure that the appropriate parts of the application are loaded at any time. Some OSs do provide libraries to aid loading parts of a program.
Overlays
These are slightly more structured: code is put into phases that are loaded over one another. Again, the motivation is to keep only the code and data that are actually needed in memory. When one phase completes, the application arranges to overlay its memory with the next phase when required. The total memory required is the same as the memory required by the largest phase. Some OS support may be given to implement this.
This is useful when there is a limited amount of physical RAM, and some secondary storage is
available to hold code/data, e.g., in embedded systems.
Shared Libraries
Shared libraries can be implemented in two ways - using static or dynamic linking. In static
linking, all libraries required by the program are linked into the image (i.e., form part of the
data/code of the program), so the memory allocated must be big enough to hold the application
and the libraries.
With dynamic linking, the libraries required by the program are dynamically linked at execution
time (just-in-time), so the memory allocated must be big enough to hold the application. Shared
libraries are allocated from a different area of memory.
Swapping
A process (or part of a process) can be swapped temporarily out of memory to a backing store (e.g., disk) and then brought back into memory for continued execution. This is how medium-term scheduling is implemented. If you want to reload at a different address, execution-time binding is required.
The issue with swapping is the backing store - a fast disk large enough to accommodate copies of all memory images for all users, and which provides direct access to these memory images, is required. A major part of the swap time overhead is the transfer time. The total transfer time is directly proportional to the amount of memory swapped.
One of the more common swapping variants is called roll-out, roll-in and it is used with priority-
based scheduling algorithms. Lower-priority processes are swapped out so higher-priority
processes can be loaded and executed.
Many different varieties of swapping can be found on many OSs, such as UNIX, Linux,
Windows, etc...
Paging
The problems with using physical memory allocation with unequal, fixed or variable sized
partitions are fragmentation and inefficiencies (algorithms to track holes, and compaction).
Paging and segmentation are memory allocation techniques that largely eliminate these
problems.
Many of these problems arise as memory allocation is of contiguous blocks of memory to
processes. Paging removes this restriction, as the logical address space of a process can be
noncontiguous. The physical memory space is divided into fixed-size blocks called frames, and
logical memory is divided into blocks of the same size called pages. Pages are then allocated and
fit into frames.
To implement paging, we need to keep track of all free frames. To run a program of size n pages,
we need to find n free frames and load the program into them. There is a problem with internal
fragmentation if a process size is not an exact number of frames. Only part of one frame is lost to
this fragmentation, however, and if we make the frame size quite small, this isn't a significant
problem.
With paging, the logical address generated by the CPU is divided into a page number p and a
page offset d. The page number is used as an index into a page table which contains the base
address of each page in physical memory and the page offset is combined with the base address
to define the physical memory address that is sent to the memory unit.
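A sketch of that translation, assuming a 4 KiB page size and a simple one-level page table stored as an array of frame base addresses (both assumptions are for illustration only):

#define PAGE_SIZE 4096UL   /* illustrative page size */

/* Split a logical address into page number p and offset d, then combine
   the frame base address found in the page table with the offset. */
static unsigned long translate_page(unsigned long logical,
                                    const unsigned long *page_table)
{
    unsigned long p = logical / PAGE_SIZE;   /* index into the page table       */
    unsigned long d = logical % PAGE_SIZE;   /* offset within the page          */
    return page_table[p] + d;                /* address sent to the memory unit */
}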

Usually, page tables are a fixed length, often the same size as a frame. If a process does not
require all entries in the page table, the redundant entries are set to be invalid. If a reference is
made to an invalid page table entry, the OS must handle a page fault and there is the potential for
an erroneous logical address to have been generated by the application process.
The page table is normally kept in main memory, so in this scheme every data/instruction access requires two memory lookups: one for the page table and then one for the actual instruction. The two-memory-access problem can be solved using a special piece of hardware called a translation look-aside buffer (TLB).
TLBs are implemented using associative memory, which allows for a parallel search of page
numbers in order to get the frame number. The TLB is not a part of normal physical memory.
Memory protection in paging is implemented by associating a protection bit with each frame.
The valid-invalid bit is attached to each entry in the page table, where "valid" indicates that the
associated page is in a process' legal address space and is thus a legal page, whereas "invalid"
indicates that the page is not in the legal address space of the process.
Modern systems allow processes to have large logical address spaces, so the page table itself can become very large. Hierarchical page tables can be used to break up the logical address space into multiple page tables.
See hash tables in ADS.
In address spaces > 32 bits, hashed page tables have become common. Here, the virtual page number is hashed into a page table.
Often, having one page table per process can be expensive in terms of physical memory. The
alternative is the inverted page table, where there is one entry for each real page of memory and
the entry consists of the virtual address of the page stored in that real memory location, with
information about the process that owns that page. This decreases memory usage, but increases
the time needed to search the table when a page reference occurs. Hash tables can be used to
limit the search to one, or at most a few, page-table entries.
Segmentation
Paging separates the user's view of memory from actual physical memory. A user thinks of the program as a collection of segments. A segment is a logical unit, such as: the main program, procedures, functions, methods, objects, local variables, global variables, common blocks, the stack, the symbol table and arrays. Segmentation is a memory management scheme that supports this user view of memory.
In segmentation, you have a logical address consisting of a segment number and an offset and a
segment table that maps two-dimensional physical addresses. Each table entry has a base that
contains the starting physical address where segments reside in memory and the limit that
specifies the length of the segment.

With each entry in the segment table, a validation bit (indicating whether or not the segment is a legal one) and read/write/execute privileges are associated. Since segments can vary in length, memory allocation is a dynamic storage-allocation problem, which can lead to external fragmentation.
Segmented Paging
Both paging and segmentation have advantages and disadvantages, and most successful processor families are moving towards a mixture of segmentation and paging. Segmented paging usually uses a segment table, a multi-level page table and an offset.
Virtual Memory
Virtual memory is the separation of user logical memory from physical memory. Only part of the program needs to be in memory for execution (i.e., only the part needed for the current execution - this follows from the principle of locality, where program and data references inside a process tend to cluster and only a few pieces of a process will be needed in any period of time). The major advantages of virtual memory are that the logical address space can be much larger than the physical address space and that more processes can be partly in physical memory at any time.
Virtual memory can be much larger than physical memory, with n pages of virtual memory mapped to m pages of physical memory and other pages on secondary storage.
Virtual memory can be implemented assuming a virtual address space of 0..max, where max is usually a power of 2, e.g., 2^32 bytes, or 4 GiB, in 32-bit systems. The process can be loaded anywhere in physical memory, and often the physical memory size is significantly smaller than the virtual address space of an individual process.
Physical memory nowadays is very large, so virtual memory is only useful for large data sets,
and for compilation models.
Demand Paging
Demand paging brings a page into memory only when it is needed. This means there is less I/O,
less memory, a faster response time and more multiprogramming.
A page is needed when there is a reference to it (i.e., an instruction/data address). An invalid
address will lead to an abort, and a not-in-memory request will lead to a fetch to memory - this is
called a page fault.
When the first reference to a page occurs, the page will not be in memory under this system, so a page fault occurs. This is a non-maskable interrupt (it can't be covered up by the user or OS) and it is synchronous - the running process that caused the interrupt is known. This allows the OS to bring the appropriate page into main memory and to detect invalid memory references - the OS contains a range of valid addresses for a process. The process often does not need the full range of virtual memory, hence code size, data size and stack size can be used to check if a memory reference is valid. This information is stored on a per-process basis in the TCB.
When a page fault occurs, the OS looks at the internal table to decide if the process is permitted
to make the memory reference and if it is, it finds an empty frame and swaps the page into that
frame, updates the page table and the internal OS tables and continues the instruction that caused
the page fault - as with other standard interrupts.
Pure demand paging is where you start executing a program with no pages loaded into memory.
This will trigger a page fault immediately, however.
Demand paging is more complex when a TLB is involved. When a virtual address is given, the
processor consults the TLB, and if a TLB hit occurs, the real address is returned and main
memory is consulted in the usual manner. If the page table entry is not found in the TLB (a
miss), the page number is used to index the process page table, and then a check occurs to see if
the page is in main memory. If it's not, a page fault is issued and the non-TLB process as above
is followed, with the TLB being updated as well as the page table.
Page Replacement
Page replacement is needed when there are no free physical memory frames.
When we want to load a page into memory, we try to find a free frame to put the page into. If
there is no free frame, then we use a page replacement algorithm to select a victim frame, and
then the victim is copied back onto disk and the new frame is loaded into its place and the page
and frame tables are updated.
The aim for designing a page replacement algorithm is to give the lowest page-fault rate. An
algorithm is evaluated by running a particular string of memory references and computing the
number of page faults on that string. An ideal page fault graph would look like:

Some algorithms, however, give more page faults if the number of frames is increased - this is called Belady's Anomaly.
The simplest page replacement algorithm is the first-in-first-out method, which can be
implemented using a circular linked list.
Another algorithm is the least recently used (LRU) algorithm, which replaces the page that has
not been referenced for the longest time. By the principle of locality, this should be the page that
is least likely to be referenced in the future. Each page could be tagged with the time of last
reference, but this would require a great deal of overhead. LRU can be easily implemented using
a stack of page numbers; when a page is referenced, it is moved or placed on the top of the stack,
and when looking for something for a replacement, the bottom of the stack is all that you need.
The second chance algorithm uses a reference bit, which is initially set to 0 and then set to 1 when the page is accessed. The second chance algorithm uses a circular list of pages. When searching for a page to replace, if a page has a reference bit of 1, the bit is set to 0 and the page is left in memory, and the next page in circular order is considered, subject to the same rules.
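A sketch of the second chance (or clock) algorithm over a circular list of frames (the data structure is invented for illustration):

struct frame {
    int           page;         /* page currently held in this frame     */
    int           referenced;   /* reference bit, set when page accessed */
    struct frame *next;         /* frames kept in a circular list        */
};

/* Move the "clock hand" around the circular list until a frame with a
   clear reference bit is found; pages passed over get a second chance. */
static struct frame *choose_victim(struct frame *hand)
{
    while (hand->referenced) {
        hand->referenced = 0;   /* clear the bit: second chance given */
        hand = hand->next;
    }
    return hand;                /* this frame's page will be replaced */
}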
We can also consider counting algorithms, where a counter is kept of the number of references
that have been made to each page. The least frequently used (LFU) algorithm replaces the page
with the smallest count. The most frequently used (MFU) algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used.
The optimal algorithm replaces the page that will not be used for the longest period of time. It is
impossible to have perfect knowledge of future events, but if you already know what's happening
(as with a reference string), you can use this algorithm to measure how well your algorithm
performs.
The OS may wish some frames to remain in physical memory frames, such as parts of the OS
itself, control structures, I/O buffers, etc..., which is why we can use page and frame locking.
When a frame is locked, it may not be replaced. This usually requires a lock bit to be added with
each frame, and is usually supported in the OS, not hardware.
Each process requires a minimum number of pages. If indirect referencing is used, each level of reference requires a page, so each level of indirection increases the number of pages required.
Frame Allocation
Again, there are different methods of allocating frames. If there are 100 frames and 5 processes, each process can be given 20 frames - this is called equal allocation. A different way is called proportional allocation, where allocation is given according to the size of the process: the allocation for process p_i is a_i = (s_i / S) × m, where s_i = the size of process p_i, S = Σ s_i, and m = the total number of frames.
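As a worked example with made-up numbers: with m = 64 frames and two processes of sizes s1 = 10 and s2 = 127 pages, S = 137, so p1 receives about 10/137 × 64 ≈ 4 frames and p2 about 127/137 × 64 ≈ 59 frames.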
A variation of the proportional allocation scheme is the priority allocation scheme, where the
proportions are computed based on priorities, rather than size (a higher priority process will
execute quicker if it has more frames).
If a process p_i develops a page fault, you can either select a replacement from one of its frames or
a replacement from a process with a lower priority number. This brings us to the question of
global or local replacement. Global replacement will select a replacement frame from the set of
all frames, so one process can take a frame from another. In priority based allocation schemes,
this may allow higher priority processes to increase their allocation of frames by taking a frame
from a low priority process. Problems arise as the page fault behaviour of one process will also
depend on the behaviour of other processes.
In local replacement, each process selects from only its own set of allocated frames, so the
process' page fault behaviour is entirely dependent upon itself, and not upon other processes. The
process execution time becomes more consistent across multiple executions.
Thrashing
If a process does not have enough frames allocated, the page fault rate can get very high, so a lot
of swapping back and forth between main memory and secondary storage is done, and the result
is thrashing, which is where more work is done by the OS swapping and paging than is done by
the application in doing something useful. This has severe performance implications.
A common form of thrashing is due to the OS monitoring utilisation: many page faults occur, decreasing utilisation, so the OS increases the degree of multiprogramming to try to raise it, which in turn increases the page faults and decreases utilisation further...
Thrashing occurs under global frame replacement more as frames can be stolen from other
(active) processes. Local frame replacement limits thrashing to within individual processes. If a
process starts thrashing, it cannot steal frames from another process and cause it to start
thrashing.
To prevent thrashing, a process must allocate a sufficient number of frames. We can predict a
sufficient amount of frames using the locality model. The diagram below shows how memory
accesses change over execution of a program.

Thrashing will occur when the size of the locality > memory (pages) allocated to the process.
The working set model assumes locality - it defines a per-process working set window which
contains the most recent page references for a process. If a page is in active use, it will have been
accessed in the working set window, and will be in the working set. If a page is not being used, it
will drop from the working set sometime after its last reference. The working set approximates
the locality of a process.
Formally, we have Δ, the working set window width in time, and WSS_i, the working set size of process P_i, which is the total number of pages accessed in the most recent Δ. The working set will vary with time; if Δ is too small, it will not cover the entire locality, and if Δ is too large, it will cover several localities. If Δ = ∞, it will cover the entire program. The most important property is the total size of the working sets in the system, D = Σ WSS_i. If D > m (the total number of frames), thrashing will occur, and a process must be suspended.
The working set can be implemented with an interval timer and reference bits. For example, if Δ = 10,000, each page has bits to indicate whether it has been accessed in the last 5,000, 10,000 or 15,000 cycles. On a page access, the 5,000 bit is set, and a timer interrupt every 5,000 cycles shifts all the bits along one, with the 5,000 reference bit set to 0. If any of the bits in memory = 1, the page is in the working set.
This is not entirely accurate, as you cannot tell within an interval of 5,000 when a page reference actually occurred. You can improve this by interrupting more frequently and increasing the number of reference bits, but this has a greater overhead and reduces the time available to do useful work.
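As a rough sketch of this approximation (the page count, data layout and function names are illustrative assumptions, not from the notes), each page keeps a short history of reference bits that is shifted on every timer interrupt:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_PAGES 256   /* assumed number of pages, for illustration */

    /* Each page keeps a 3-bit history: accessed in the last 5,000,
       10,000 or 15,000 cycles (one bit per 5,000-cycle interval). */
    static uint8_t history[NUM_PAGES];

    /* Called when a page is referenced. */
    void on_page_access(int page) {
        history[page] |= 0x1;                      /* set the "last 5,000 cycles" bit */
    }

    /* Called from the timer interrupt every 5,000 cycles. */
    void on_timer_interrupt(void) {
        for (int p = 0; p < NUM_PAGES; p++)
            history[p] = (history[p] << 1) & 0x7;  /* shift history, clear newest bit */
    }

    /* A page is in the working set if it was referenced in any interval. */
    bool in_working_set(int page) {
        return history[page] != 0;
    }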
The working set model is quite clumsy and expensive to implement; a page-fault frequency scheme is a better alternative. An "acceptable" page-fault rate is defined with upper and lower bounds: if a process's page-fault rate falls below the lower bound, the process loses a frame, and if the rate rises above the upper bound, the process gains a frame.
Inter Process Communication
We need processes to co-operate to carry out particular jobs, e.g., one process requests a service
from another (such as requesting an I/O operation from the kernel, as the kernel is still a
process), pipelined processes working together (one process can't proceed until the previous one
has finished) and where multiple processes work on the same task, for example one might read
from disk to memory, and another reads from memory to the audio device. This can be
accomplished using inter process communication (IPC).
IPC has numerous advantages, such as:
Information sharing (e.g., of shared files)
Computation speed-up (e.g., a program can be split across multiple computers)
Modularity (for software engineering reasons)
Convenience (e.g., to edit, compile and print in parallel)
Independent processes cannot affect or be affected by the execution of another process (all processes are in a separate protected memory address space). Co-operating processes can affect or be affected by the execution of another process, via some form of IPC mechanism.
The OS must solve some of the problems that IPC causes, however. It must provide mechanisms to control interaction between the processes and must order accesses to shared resources by processes. The OS must also provide mechanisms by which processes can wait until a resource is available.
A paradigm for co-operating processes is the producer-consumer problem. A producer process
produces information that is consumed by a consumer process. There are two models which can
be used here, one is the unbounded-buffer, which places no practical limit on the size of the
buffer, and the buffer is just expanded as the producer fills it and the other is the bounded-buffer,
which assumes there is a fixed buffer size and the producer must wait for the consumer to
remove data if the buffer is full.
See lecture notes
for full details
One way to implement the bounded-buffer solution is to use a circular array/list, but this only
works if the processes can share memory. Another implementation that maximises the use of the
buffer is to use an integer called a counter which is then incremented when a producer writes to
the buffer and decremented when a consumer reads from it. These operations must be performed atomically, however, which is not guaranteed if a context switch can occur part-way through the increment or decrement.
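A minimal sketch of the shared-counter bounded buffer (buffer size and names are illustrative); the comments mark the non-atomic counter updates where the race arises if a context switch happens mid-update:

    #define BUFFER_SIZE 8

    int buffer[BUFFER_SIZE];
    int in = 0, out = 0;       /* next free slot / next full slot */
    int counter = 0;           /* number of items currently in the buffer */

    /* Producer process/thread */
    void producer(int item) {
        while (counter == BUFFER_SIZE)
            ;                              /* busy-wait: buffer full */
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
        counter++;                         /* NOT atomic: load, add, store */
    }

    /* Consumer process/thread */
    int consumer(void) {
        while (counter == 0)
            ;                              /* busy-wait: buffer empty */
        int item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        counter--;                         /* NOT atomic: races with counter++ */
        return item;
    }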
We now have a race condition, which is the situation where several processes access and
manipulate shared data concurrently and the final value of the shared data depends on which
process finishes last, so the data may become inconsistent. To prevent race conditions and to
ensure consistent data, concurrent processes must be synchronised. This requires mechanisms to
ensure the orderly execution of co-operating processes.
Critical Sections
Here, we can use critical sections. If we have n processes competing to use some shared data,
each process has a code segment called the critical section, in which the shared data is accessed.
The problem is to ensure that when one process is executing in its critical section, no other
process is allowed to execute in its critical section. A solution to this problem must satisfy three requirements:
Mutual exclusion (mutexes) - if a process is executing in its critical section, then no other processes can execute in their critical sections
Progress - if no process is executing in its critical section and there exist some processes that wish to enter their critical sections, then the selection of the process that will enter its critical section next cannot be postponed indefinitely
Bounded waiting - a bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted
One way to accomplish this is to use commands such as EnterCriticalSection() and LeaveCriticalSection(). Processes may share some common variables to synchronise their actions. For example, we could have an integer variable called turn: if its value matches the value associated with a process, that process can enter its critical section, and on leaving it updates the turn variable to pass the turn on. This satisfies the mutual exclusion requirement, but not progress.
A second algorithm is to use guard flags, held in an array of flags. When a process is ready to enter its critical section, it sets its flag to true, waits until all other processes have completed their critical sections (their flags are false), executes its own critical section, and then sets its flag back to false. This again satisfies mutual exclusion, but not progress.
The final algorithm is a mixture of turn and flags, which waits for the turn of a process that is
waiting to enter its critical section to finish and then enter its own. This meets all three
requirements, and solves the critical section problem for two processes.
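A minimal sketch of this combined turn-and-flags approach for two processes, essentially Peterson's algorithm (variable names are illustrative, and a real implementation would also need to consider memory-ordering issues on modern hardware):

    #include <stdbool.h>

    /* Shared between the two processes; i is 0 or 1, j = 1 - i. */
    volatile bool flag[2] = { false, false };  /* flag[i]: process i wants to enter */
    volatile int turn = 0;                     /* which process we defer to */

    void enter_critical_section(int i) {
        int j = 1 - i;
        flag[i] = true;        /* announce intention to enter */
        turn = j;              /* give the other process priority */
        while (flag[j] && turn == j)
            ;                  /* busy-wait while the other wants in and has the turn */
    }

    void leave_critical_section(int i) {
        flag[i] = false;       /* no longer interested */
    }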
However, we need to solve the problem for n processes, and one way of implementing this is to
use the Bakery (or Lamport) algorithm. Before entering its critical section, a process receives a
number, and the holder of the smallest number enters its critical section. If processes i and j
receive the same number, then if i < j, then i is served first, else j is served first. The number
scheme used always generates numbers in an increasing order of enumeration.
As in many aspects of software, hardware can make programming easier. Hardware exists that
can solve the critical section problem. If an operation TestAndSet existed, which could test and
modify the contents of a word atomically, mutual exclusion could be implemented. This does not
satisfy the bounded wait requirement, however.
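A hedged sketch of how a TestAndSet-style instruction could be used to build a simple lock; the C function below only models the behaviour of the atomic hardware instruction and is not itself atomic:

    #include <stdbool.h>

    /* Models the hardware instruction: atomically returns the old value
       of *target and sets it to true. In reality this is a single atomic
       machine instruction, not ordinary C. */
    bool test_and_set(volatile bool *target) {
        bool old = *target;
        *target = true;
        return old;
    }

    volatile bool lock = false;

    void acquire(void) {
        while (test_and_set(&lock))
            ;                  /* spin (busy-wait) until the lock was free */
    }

    void release(void) {
        lock = false;
    }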
The software algorithms discussed above for exclusion of processes from critical sections are
problematic as they rely on the continuous testing of variables (busy waiting), which is
inefficient and wasteful of CPU cycles. The protocols are complex and unenforceable, relying on
programmer discipline and co-operation. The OS implements efficient IPC mechanisms instead.
Signals
We've already discussed
signal handling in threads
in the threads section.
POSIX/Linux signals are a dataless communication either between processes, or between the OS
and a process. Signals can be thought of as software interrupts and require a handler within a
process. When the event that causes the signal occurs, the signal is said to be generated and when
the appropriate action for the process in response to the signal is taken, the signal is said to be
delivered. In the interim, the signal is said to be pending. Signals come in a number of types, either system-defined (normally up to 31), each corresponding to a particular event, or user-defined (normally integers > 31). Some of these signals include:
SIGINT (2) - interrupt generated from a terminal
SIGILL (4) - interrupt generated by an illegal instruction
SIGFPE (8) - floating point exception
SIGKILL (9) - kill a process
SIGSEGV (11) - segmentation violation
SIGALRM (14) - alarm clock timeout
SIGCHLD (17) - sent to parent when child exits
Synchronous signals occur at the same place, every time in the program (e.g., SIGFPE, SIGILL,
SIGSEGV) and are caused by the program itself. Asynchronous signals may occur at any time
from a source other than the process (another process e.g., SIGKILL, a terminal driver on a key
press e.g., SIGINT on Ctrl+C, on an alarm clock, e.g., SIGALRM) and a process can choose how
to handle the specific signal.
A signal can be treated in the following ways: it can be ignored (except for SIGKILL); the default action set by the OS can be executed (the default action depends on the signal, and may be terminating the process, suspending the process, terminating the process and saving a copy of its memory image, continuing a suspended process, or just ignoring the signal); or a specific signal-catching function can be invoked.
There are essential system calls for signalling:
kill(pid, signal) - send a signal of value signal to the process with ID pid.
sigaction(handler, signal) - installs a handler for a specific signal (handlers are specific to
each process)
sigprocmask(..., mask, ...) - signals can be blocked or masked in much the same manner
as hardware interrupts can be masked by the OS
sigpending(mask) - signals that are raised whilst they are masked out can be examined
sigsuspend(mask) - a process can wait until a signal arrives
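A small illustrative example of these calls in POSIX C (the handler and the choice of SIGINT are assumptions for the example): install a handler with sigaction() and send the signal with kill():

    #include <signal.h>
    #include <unistd.h>

    static void on_sigint(int signum) {
        (void)signum;
        /* Only async-signal-safe calls should be made here; write() is safe. */
        const char msg[] = "caught SIGINT\n";
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    }

    int main(void) {
        struct sigaction sa = { 0 };
        sa.sa_handler = on_sigint;          /* signal-catching function */
        sigemptyset(&sa.sa_mask);           /* no extra signals blocked in the handler */
        sigaction(SIGINT, &sa, NULL);       /* install the handler for SIGINT */

        kill(getpid(), SIGINT);             /* send ourselves the signal */
        return 0;
    }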
Semaphores
Semaphores are "a non-negative integer, which apart from initialisation can only be acted on by
two standard, atomic, uninterruptible operations, WAIT (which decrements the value of the
semaphore) and SIGNAL (which increments the value)" - E. Dijkstra
You must guarantee that a maximum of one process can execute a WAIT or SIGNAL operation
on the same semaphore at the same time (atomicity).
There can be two types of semaphore, a counting semaphore (the integer value can range over an
unrestricted domain) and a binary semaphore, where the binary value can range only between 0
and 1. This can be simpler to implement.
Semaphores can be used to implement mutual exclusion for the critical section of n processes.
The shared data is a binary-semaphore mutex which is WAITed on when a program wants to
enter the critical section and then SIGNALed when the critical section completes.
Semaphores can also be used to implement synchronisation. For example, B can only be
executed in program j only once A has executed in program i. When a semaphore is used to
indicate a condition, it is called a condition variable and synchronisation using a condition
variable is called condition synchronisation.
This can introduce two problems, however: starvation, which is indefinite blocking, where a process may never be removed from the queue in which it is suspended; and deadlock, where two or more processes are waiting indefinitely for an event that can only be caused by one of the waiting processes.
Counting semaphores can be used for resource management. If you consider a resource with m
allocatable elements, the counting semaphore is set to m and then users can wait for a resource
by waiting on the semaphore and then free up a resource by signalling a semaphore.
The classic problems of synchronisation, such as the bounded-buffer problem, the readers and writers problem and the dining-philosophers problem, can be solved using semaphores and mutexes.
OS level semaphores ensure mutual exclusion over a shared resource only if the application is disciplined: a process should access shared data only while it is holding the appropriate semaphore, and it should issue a signal to release the semaphore when it finishes the critical section.
Semaphore operations still require atomicity. If an operation is in the OS, it can be accessed by a process using a system call. When it is in the OS, the call will execute atomically, or at least appear atomic from the user's point of view, even if interrupts are enabled. True atomicity can only be achieved with hardware support, such as the TestAndSet instruction discussed earlier.
POSIX semaphores support two interfaces (Linux does not support these fully). Unnamed
semaphores are used within a process (or between related processes) and are created by a call to
sem_init(...) by a process. Operations include sem_wait(...) for a wait and sem_post(...) for a
signal. sem_trywait(...) also exists, which works like wait if the semaphore has a greater count
than 0, but doesn't wait if the semaphore is 0, returning immediately to the calling process. These
semaphores are destroyed by a call to sem_destroy(...), or if the creating process dies.
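A hedged sketch of an unnamed POSIX semaphore used as a binary semaphore (mutex) protecting a shared counter between threads of one process (thread creation and error checking are omitted):

    #include <semaphore.h>

    sem_t mutex;                 /* unnamed semaphore shared within the process */
    int shared_counter = 0;

    void init(void) {
        /* pshared = 0: shared between threads of this process; initial value 1 */
        sem_init(&mutex, 0, 1);
    }

    void increment(void) {
        sem_wait(&mutex);        /* WAIT: decrements, blocks if already 0 */
        shared_counter++;        /* critical section */
        sem_post(&mutex);        /* SIGNAL: increments, waking a waiter if any */
    }

    void cleanup(void) {
        sem_destroy(&mutex);
    }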
Named semaphores can be used between any processes. The namespace is the same as the filesystem (so the name of a semaphore is effectively a filename). These are created or opened using sem_open(...) and can then be manipulated as above. sem_close(...) removes the link between a process and a semaphore; named semaphores can live on past the termination of the creating process, so must be explicitly destroyed using sem_unlink(...). Here, the OS merely stores a list of processes waiting on a particular semaphore (i.e., processes that are blocked on a semaphore), but hardware support is still required for atomicity.
Linux implements SysV semaphores (i.e., semaphores as they existed in System V UNIX), which are not POSIX compliant. They are system-wide resources, managed by the OS, and any process can access a semaphore, which is stored in the semaphore table of the OS. Allocation occurs via semget(...), which creates a set of semaphores (possibly containing just one) or associates with an existing set. All operations on semaphores occur using semop(semaphore, op, ...), and semctl(...) allows you to examine or change the value of a semaphore, e.g., to initialise it, as well as telling you how many processes are waiting on the semaphore and allowing you to destroy it.
Semaphores can also be used to provide mutual exclusion between threads, such as provided by
the Linux pthreads library which provides mutexes. pthread_mutex_init() provides initialisation
for a mutex, pthread_mutex_lock() obtains a lock for a mutex and pthread_mutex_unlock() will
give up a lock. The semaphores between pthreads are supported by the standard Linux kernel
semaphores.
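A minimal pthreads sketch of these calls protecting a shared counter (error checking and thread creation omitted for brevity):

    #include <pthread.h>

    pthread_mutex_t lock;
    long total = 0;

    void setup(void) {
        pthread_mutex_init(&lock, NULL);   /* default mutex attributes */
    }

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* obtain the lock */
            total++;                       /* critical section */
            pthread_mutex_unlock(&lock);   /* give up the lock */
        }
        return NULL;
    }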
Conditional synchronisation can also be used, where a condition enables threads to atomically
block and test a condition under the protection of mutual exclusion, until the condition is true. If
the condition is false, the thread blocks on the variable and automatically releases the mutex. If
another thread changes the condition, it may wake up waiting threads by signalling the
associated condition variable. Waiting threads will wake up and then reacquire the mutex and
reevaluate the condition. pthread_cond_init() will initialise a condition variable, which must be used together with a mutex initialised by pthread_mutex_init(). pthread_cond_wait() takes a condition variable and a mutex as parameters; pthread_cond_signal() takes just the condition variable.
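A hedged sketch of this wait/signal pattern: one thread blocks until a flag becomes true, another sets the flag and signals (names are illustrative; static initialisers are used here in place of the _init() calls):

    #include <pthread.h>
    #include <stdbool.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    bool data_ready = false;

    void wait_for_data(void) {
        pthread_mutex_lock(&m);
        while (!data_ready)                 /* re-evaluate the condition on wake-up */
            pthread_cond_wait(&cond, &m);   /* atomically releases m and blocks */
        /* ... consume the data while still holding m ... */
        pthread_mutex_unlock(&m);
    }

    void announce_data(void) {
        pthread_mutex_lock(&m);
        data_ready = true;                  /* change the condition */
        pthread_cond_signal(&cond);         /* wake one waiting thread */
        pthread_mutex_unlock(&m);
    }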
Using the low-level IPC mechanisms mentioned above relies on programmer discipline for the
correct use of semaphores to wait and signal in the correct order at the correct point in the
algorithm. Additionally, it requires an agreement between communicators (i.e., between the
reader and the writer) on the precise protocol to be used, and problems can be encountered if the
writer assumes writers have priority and readers assume that the readers have priority over the
shared data. Sometimes, hardware support is required to provide atomicity (TestAndSet,
interrupt disabling, etc).
High-level IPC provides a structured IPC less susceptible to programmer error. It incorporates
both the means of shared data control and the shared data itself, as opposed to low level IPC,
such as semaphores, where the shared data control is distinct from the shared data (note, the
POSIX implementation assumes that the shared data in a file could be accessed from processes
without waiting for the appropriate semaphore).
Message Passing
Message passing is a mechanism for processes to communicate and synchronise their actions. In
a message system, processes communicate with each other without resorting to shared variables.
If P and Q wish to communicate, they need to establish a communication link between them and
exchange messages via send/receive.
This does require OS support and a number of implementation issues must be resolved - where
are messages stored whilst they are in transit, how many processes can join in with a
communication, and is the message size fixed or variable, is the link uni or bi-directional,
synchronous or asynchronous, etc...
With message passing, processes must name each other explicitly, such as: send(P, message), to send a message to process P, and receive(Q, message), to receive a message from process Q. With this method, links are established automatically and a link is associated with exactly one pair of communicating processes; between each pair there exists exactly one link. The link may be unidirectional, but it is usually bidirectional.
A method of indirect communication involves messages being sent to and received from mailboxes. Each mailbox has a unique ID and processes can communicate only if they share a mailbox. This introduces the problem of deciding who gets a message if a mailbox is shared, however. The primitives in this method are send(A, message) and receive(A, message), which send to and receive messages from mailbox A.
In this case, a link is only established if processes share a common mailbox, and a link may be
associated with many processes. Each pair of processes may share several communication links
and the links may be uni- or bi-directional.
Mailboxes can be owned by either the OS or a process. If it is OS owned (in OS address space), the mailbox is independent and not attached to any process, so operations and system calls are required for a process to create a new mailbox, send and receive messages through the mailbox and delete mailboxes. If a mailbox is process owned (in the address space of a process), the process with the mailbox receives messages and other processes send messages to the mailbox. An operation or system call is still required to write to the mailbox.
Message passing can be considered to either be blocking or non-blocking. Blocking is considered
synchronous; in a blocking send, the sending process is blocked until the message is received by
the receiver or a mailbox, and in a blocking receive, the receiver is blocked until the message is
available. Non-blocking is considered asynchronous; in a non-blocking send, the sending process sends the message and then resumes execution, and for a non-blocking receive, the receive call will either get a message or a NULL. The send and receive primitives may be either blocking or non-blocking, depending on the implementation.
The link (direct communication) or mailbox (indirect communication) will contain messages that have been sent but not yet received. This is called a buffer, and buffers can be implemented in more than one way:
Zero capacity - 0 messages are held, and the sender must wait for the receiver
Bounded capacity - here there is a buffer with a finite length of n messages and the
sender must wait if the link is full
Unbounded capacity - infinite length, and the sender never waits
Client-Server Communication
When the processes are on different machines, network communication is required. Here, three
forms of IPC are available: sockets, remote procedure calls (RPCs) and remote method
invocation (Java). These will be discussed in the NDS course.
In POSIX, message queues between processes are used. Messages are stored in FIFO order, and the queue is set up so that the process either sends or receives blocking (on a full or empty queue respectively), or sends or receives non-blocking. A process reading from an empty queue will block.
In POSIX, a process creates a message queue, or connects to an existing one, using mq_open(name, ...), where name is a filename in the same namespace as POSIX semaphores. Sending and receiving messages occur using mq_send(...) and mq_receive(...), and a process can close its connection to a message queue with mq_close(...); the message queue will still exist after this call. To destroy a message queue, mq_unlink(...) is used, and the queue will be destroyed once all processes attached to the queue have closed it.
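A hedged sketch of these message queue calls (the queue name, sizes and flags are illustrative assumptions, and return values should really be checked):

    #include <mqueue.h>
    #include <fcntl.h>
    #include <string.h>

    int main(void) {
        struct mq_attr attr = { 0 };
        attr.mq_maxmsg  = 10;                  /* bounded queue: at most 10 messages */
        attr.mq_msgsize = 64;                  /* each message at most 64 bytes */

        /* Create (or connect to) a queue; the name lives in the same
           namespace as POSIX named semaphores. */
        mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);

        const char *msg = "hello";
        mq_send(q, msg, strlen(msg) + 1, 0);   /* priority 0; blocks if the queue is full */

        char buf[64];
        mq_receive(q, buf, sizeof(buf), NULL); /* blocks if the queue is empty */

        mq_close(q);                           /* detach from the queue */
        mq_unlink("/demo_queue");              /* destroy once every process has closed it */
        return 0;
    }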
High-level Synchronisation Mechanisms
Semaphores provide a convenient and effective mechanism for process synchronisation, but their
incorrect use can result in problems difficult to detect. To combat these problems, a number of
high-level synchronisation constructs have been proposed: critical regions and monitors.
These constructs are not implemented at the OS level, but usually as part of the programming language. There is no atomicity at the application level.
Critical regions are a high-level synchronisation construct. A shared variable v of type T is
declared as v : shared T. The variable v is only accessible inside the statement region v when B
do S, where B is a boolean expression, which means that when statement S is being executed, no
other process can access variable v.
Regions referring to the same shared variable exclude each other in time (i.e., mutual exclusion
is enforced). When a process tries to execute the region statement, the boolean expression B is
evaluated. If B is true, S is executed, but if B is false, the process is delayed until B becomes true
and no other process is in the region associated with v.
Critical regions are implemented using semaphores. With the shared variable x, the following are
associated: semaphore mutex, first-delay, second-delay; int first-count, second-count;. Mutually
exclusive access to the critical section is provided by mutex. If a process can not enter the critical
section because B is false, it initially waits on first-delay and is moved to second-delay before it
is allowed to re-evaluate B. A count is kept of the number of processes waiting on first-delay and second-delay using first-count and second-count respectively.
This algorithm assumes a FIFO ordering in the queueing of processes for a semaphore. For an
arbitrary queueing discipline, a more complicated implementation is required.
Monitors are high-level synchronisation constructs that allows the safe sharing of an abstract
data type among concurrent processes. The monitor encapsulates shared data together with
procedures that act upon that data. Only the procedures can be accessed from outside the
monitor; the data inside the monitor can not be directly accessed from outside the monitor. Only
one process can be active inside a monitor at a time.
To allow a process to wait within a monitor, a condition variable must be declared as: condition
x, y;. The condition variable can only be used with the operations wait and signal. x.wait() means
that the process invoking the operation is suspended until another process invokes x.signal(),
which resumes exactly one suspended process. If no process is suspended, then the signal
operation has no effect.
A conditional-wait construct also exists: x.wait(c), where c is an integer expression evaluated when the wait operation is executed. The value is a priority number stored with the name of the process that is suspended. When x.signal() is executed, the process with the smallest associated priority number is resumed next.
Two conditions need to be checked to establish the correctness of a system:
User processes must always make their calls on the monitor in a correct sequence.
You must ensure that an uncooperative process does not ignore the mutual-exclusion
gateway provided by the monitor and try to access the shared resource directly, without
using the access protocols.
Deadlocks
Deadlocks occur when a set of blocked processes are each holding a resource and waiting to
acquire a resource held by another process in the set. Deadlocks can be characterised by the
bridge-crossing problem:
[Figure: the bridge-crossing problem - two cars approaching a single-lane bridge from opposite directions]
There is only one lane across the bridge. We can view each section of the bridge as a resource,
and if a deadlock occurs, it can be resolved if one car backs up (pre-empt resources and
rollback), but sometimes that will require the cars following it to be backed up also. Starvation
can also occur, if only cars from one direction are allowed to pass.
Deadlock can arise if four conditions hold simultaneously in a situation:
1. Mutual exclusion: only one process at a time can use the resource
2. Hold and wait: a process holding at least one resource is waiting to acquire additional
resources held by other processes
3. No preemption: a resource can only be released voluntarily by the process holding it,
after the process has completed its task
4. Circular wait: there exists a set {P_0, P_1, ..., P_n} of waiting processes such that P_0 is waiting for a resource that is held by P_1, which is waiting for a resource that is held by P_2, ..., P_{n-1} is waiting for a resource that is held by P_n, which is waiting for a resource that is held by P_0.
We could plot a resource allocation graph, which consists of a set of vertices V and a set of edges E, such that V is partitioned into two types: P = {P_1, P_2, ..., P_n}, the set of all processes in the system, and R = {R_1, R_2, ..., R_m}, the set of all resource types in the system. Each process utilises a resource as follows: a request is a directed edge P_i → R_j, and if the resource cannot be assigned immediately, the requesting process is blocked; resource use (assignment) is indicated by a directed edge R_j → P_i, that is, the process is using the resource. When a resource is released, edges are removed.
[Figure: a resource allocation graph with no cycles]
The above resource allocation graph has no deadlocks, as there are no circularities. P_3 has all the resources it needs; when it frees R_3, then P_2 can proceed; when P_2 releases R_1, then P_1 can proceed.
[Figure: a resource allocation graph containing a cycle]
The above resource allocation graph does have a deadlock, however. Mutual exclusion holds over all of the resources, a number of processes hold resources while waiting for others, and no process wants to release a resource. P_3 holds R_3 and wants R_2; P_2 holds R_1 and R_2 and wants R_3; and P_1 holds R_2 and wants R_1. This is a circular wait, and we have all the conditions for deadlock.
The following basic rules can be applied: if the graph contains no cycles, no deadlock has occurred; if the graph contains a cycle and there is only one instance per resource type, then deadlock has occurred; if the graph contains a cycle but there are several instances per resource type, there is only the possibility of deadlock.
Handling Deadlocks
The three basic methods for handling deadlocks are prevention/avoidance (ensuring that the system will never enter a deadlock state), detection and recovery (allow the system to enter a deadlock state and then recover from it) and the head-in-the-sand approach (ignore the problem and pretend that deadlocks never occur - this is the approach used by most OSs, including UNIX, Linux and Windows).
To implement prevention, we need to ensure that one or more of the deadlock conditions do not
hold:
Mutual exclusion is not required for sharable resources, but it must hold for non-sharable
resources.
Hold and wait must guarantee that whenever a process requests a resource, it does not
hold any other resources. The process must request and be allocated all of its resources
before it begins execution, or the process can only request resources when the process
currently has none. This does have the possibility of low resource utilisation and
starvation.
No preemption - If a process that is holding some resources requests another resource that
can not be immediately allocated to it, then all resources currently being held are
released. Preempted resources are added to the list of resources for which the process is
waiting and the process will only be restarted when it can regain its old resources, as well
as the new ones it is requesting.
Circular wait - impose a total ordering of all resource types and require that each process
requests resources in an increasing order of enumeration.
Deadlock prevention algorithms ensure that deadlock will never occur if (at least one of) the restrictions is enforced. But there is a cost to this: restraining how resource requests are made can lead to low device utilisation and reduced system throughput.
Alternatively, with some additional a priori information, deadlock can be avoided on a per-request basis. The simplest method is that each process declares the maximum number of resources of each type that it may need; a dynamic alternative is to examine the resource-allocation state on each request to ensure that there can never be a circular-wait condition. The resource allocation state is defined by the number of available and allocated resources, and the maximum demands of the processes.
System States
When a process requests an available resource, the system must decide if immediate allocation
would leave the system in a safe state - a state where there exists a sequence of all processes so
that all processes can get their required resources and complete. If the system is not in a safe
state, it is in an unsafe state, which means the possibility of deadlock exists (but is not
necessarily going to happen). The deadlock state is a subset of the unsafe states.
Avoidance requires that the system will never enter an unsafe state, so an algorithm must pick a
safe sequence of events and processes for execution. One such algorithm is the resource-allocation graph algorithm. A claim edge P_i → R_j, represented by a dashed line, indicates that process P_i may request resource R_j. A claim edge is converted to a request edge when the process requests the resource, and when the resource is released by the process, the assignment edge reconverts to a claim edge. Resources must be claimed a priori in the system.
File Systems
Files are an integral part of most computer systems. Input to applications can be by means of a
file, and output can be saved in a file for long-term storage. File systems can store large amounts
of data, and the storage is persistent. The file management system is considered part of the OS.
The file management system is the means by which a process may access files. The programmer does not need to develop file management software; this is provided by the OS. The objectives of a file management system are to meet the data management needs of the user, to guarantee that the data in the file is valid, to minimise the potential for lost or destroyed data and to optimise performance.
We can also break down what the user requirements for data management are into:
being able to create, delete, read and change files
controlled access to other users' files
control of the accesses allowed to the user's own files
restructure the user's files in a form appropriate to the problem
move data between files
backup and recover the user's files in case of damage
access and share the user's files using symbolic names
To enable this, the file management system typically provides functions to identify and locate a selected file, to organise files in a structure (often accomplished using directories) which describes the locations of all files and their attributes, and to provide mechanisms for protection.
The file management systems provides secondary storage management, that is, it allocates files
to free disk blocks and manages free storage for available disk blocks.
Files
A file is a contiguous, logical address space (apart from sparse files); however, it may not be stored contiguously. Files may be of different types, but from the perspective of the OS, files have no structure - a file is just a sequence of machine words or bytes. However, from the user or application perspective, files may have a simple record structure made up of lines (of fixed or variable length) or complex structures, which have special control characters in the structure to make up files of various types.
Files also have attributes assigned to them:
Name - in a human readable form
Type - for systems that support different types
Location - pointer to a file location on the device
Size - current file size
Protection - e.g., read, write, execute bits
Time, date and user identification - used for protection, security and usage monitoring
These file attributes are kept in the directory structure.
The OS only provides basic file operations on files, such as: create, write, read, seek, delete and
truncate, so more complex operations must be made up of combinations of these operations, e.g.,
copy is a create, followed by a read from a source file and a write to the new file.
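As a hedged illustration of composing copy from the basic operations, using POSIX-style calls (file names, buffer size and permissions are arbitrary, and error handling is omitted):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Copy = create the destination, then repeatedly read from the
       source and write to the destination. */
    int copy_file(const char *src, const char *dst) {
        int in  = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* "create" */

        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof(buf))) > 0)   /* read from the source file */
            write(out, buf, (size_t)n);                /* write to the new file */

        close(in);
        close(out);
        return 0;
    }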
File types, traditionally implemented using file extensions (in Microsoft Windows, this is a
legacy from DOS), enables the OS to invoke an appropriate application for each file type. UNIX
and its variants use a "magic number" at the start of the file to declare its type.
File systems have criteria for file organisation, such as rapid access (especially needed when accessing a single data item), ease of update (not always a concern, however, e.g., CD-ROM), economy of storage (we want to reduce wasted space, although this is less of a concern in modern computing), simple maintenance and reliability.
Sequential Access Files
In these types of files, access starts at the beginning of the file, and to read to the middle of the
file, you must start at the beginning and read the file until you reach the position you want, and
then rewind to get back to the beginning. This method of file access is slow.
A fixed format is used for structuring data into records. Records are normally the same length,
and the fields in a record are in fixed order. One field (usually the first) is a key, which uniquely
identifies the record within that file. Usually, records are stored in key order and batch updates
are used to reduce the update time (records are cached and then ordered before an update is
done). This method is designed for sequential access devices (tapes), but sequential files can be
implemented on random access devices, such as disks.
Additions to and deletions from the file can cause problems. The file is stored in a sequential
ordering of record keys, so insertion requires "moving" some of the records on the storage
medium. This is very expensive in time. Usually, new records are placed in a new file, called the
overflow, log or transaction file and a periodic batch update occurs that merges the overflow file
with the original file. Alternatively, records are organised as a linked list. More overhead in
terms of storage is required, and additional processing is required also. This only really works
(on sequential storage devices) if the records remain in key order. Deletion requests can also be
stored in the log file.
To avoid slow sequential access, and to better support random access storage devices, indexed sequential files can be used. The index provides the lookup capability to get into the vicinity of a desired record faster. The index is searched for a key field that is equal to or less than the desired key value, and that index entry contains a pointer into the main file. The search then continues in the main file from the location indicated by the pointer. The records are still stored in order, and new
records are added to an overflow file (a record in the main file preceding the new record is
updated to contain a pointer to the new record in the overflow file). The overflow is then
periodically merged with the main file during a batch update.
Comparison: If a file contains 1,000,000 records, in a standard sequential file, the average lookup
case is 500,000 accesses to find a record in the sequential file. If an index contains 1000 entries,
it will take on average 500 accesses to find the key, followed by 500 accesses in the main file.
On average, it is now 1000 accesses.
Direct Access Files
Direct access files exploit the capability of random access devices (such as disks) to directly access any address on the device. No sequential ordering of data in the file is required, and this method can be used where rapid access is required, such as in swapping or conventional user files. It is not that useful for backups or archives where only occasional access is required (these can therefore be implemented on cheaper media, such as tapes).
Directories
Directories are used to organise files, and they themselves are files owned by the OS. Directories
contain information about files, such as name, attributes, location, ownership, size, access/update
times and protection information. They provide a mapping between file names and the files
themselves.
From a user perspective, they need to search for a file (by name or attribute), create a file, delete
a file, list a directory and rename a file, etc...
The directory must be organised to obtain efficiency (e.g., to locate a file quickly), naming (so it
is convenient for users, two users can have the same name for different files and the same file
can have several different names) and grouping (a logical grouping of files by properties).
There are multiple ways of implementing directories. One such way is the single-level directory,
where a single directory exists for all users. However, if all users share the same directory, they
must ensure that all file names are unique, and there is the grouping problem of little structure or
organisation.
In a two-level directory setup, each user has their own directory, which allows a user to have the
same filename as another user, this allows for efficient searching, but has the problem of there
being no grouping capability between users.
Another is the tree-structured directory, which allows for efficient searching and grouping
capability, without any sharing of files. In this model, we have the concept of a current, or
working, directory. There are issues with this model, however: there are absolute and relative path names, and creating a new file or directory is done in the current directory. Also, how do you delete a directory? Most OSs require you to explicitly state that you want to remove all the files in the directory, or to do so beforehand.
A tree structure prohibits the sharing of files and directories, however. Acyclic graphs remove this restriction. They are more complex than trees, and two different names can refer to the same file (aliasing), so there can be multiple path names for the same file. This does introduce a deletion problem, however - you cannot delete a file until all references to it are resolved. This is difficult, so we can solve it using backpointers (so we can delete all the pointers to a file), backpointers in a daisy-chain organisation, or an entry-hold-count solution.
Acyclic graphs are complex to manage, however. You need to ensure there are no cycles, which
links can introduce. Merely having subdirectories and files can not introduce cycles. Another
way of guaranteeing no cycles is to only allow links to files, not subdirectories, or by using a
cycle detection algorithm every time a new link is added to determine whether or not it is okay
(this is expensive).
A file has to be opened before it is of any use, and similarly, a file system must be mounted
before any files or directories are available.
Sharing of files on multi-user systems is desirable. Sharing can be done through a protection
scheme. A simple method is to ensure that only one user has write access to a file at a time. A
complex method is using a variation of semaphores implemented by some file systems
(including in POSIX) called file locks. On distributed systems, files may be shared across a
network.
Protection
Protection exists to make sure that files are not read or modified by unauthorised persons.
Protection mechanisms are provided by the OS to safeguard information inside the computer.
Objects and the rights that they have in particular contexts are identified in the aim to prohibit
processes from accessing objects they are not authorised to access.
A domain is associated with a set of pairs <objects, rights>, where rights will be a subset of
operations that are allowed on the object, and a domain could be a user or a process.
It is possible for an object to be in two domains at the same time with or without the same rights.
Different mechanisms are used for identifying domains, in UNIX a combination of user ID and
group ID is used.
Another method of providing protection is through access control lists and capabilities. It is an abstract model, where matrix rows correspond to domains and columns correspond to objects. An ACL corresponds to a column - the access rights for an object in different domains - and a capability list corresponds to a row, which is associated with each process and is the set of operations that are permitted.
An access matrix might look like this:
Capability Access List File 1 File 2 File 3 Card Reader Printer
neil read read
andy read print
DomainX read execute
DomainY read/write read/write
The capability list is a row of the matrix, and a capability then is like an abstract data type, that
is, it points to an object and specifies which operations are allowed on it (it may also have a field
'type' indicating what type of object it is). The list then specifies what capabilities are accessible
in this domain.
In UNIX, the file owner (by default, the creator) is able to control what should be done, and by
whom. Types of accesses include read, write and execute. In UNIX, this is implemented as a
bitmask with three domains, owner, group and other. Groups in this case are created by a
sysadmin, not the user.
File systems reside on secondary storage and contain both programs and data, as well as swapped-out pages managed by the virtual memory system. The main issues to be addressed here are how to store files (a contiguous logical structure) on secondary storage (disk) and how to map high-level file operations onto low-level disk operations, whilst making efficient use of the disk with fast read/write times.
Disks provide a series of blocks, usually of a fixed size, which are often related to the memory frame/page size (typically as a multiple), so that a whole number of file blocks can be loaded into one physical memory frame.
Disks are random access, that is, you can directly address any given block on the disk, and it is
simple to access any file sequentially or randomly. For I/O efficiency, transfers between memory
and disk are often performed in units of blocks.
For convenient access to disks, the OS imposes a logical file system, which defines how files
look to the user (i.e., operations, file attributes, directories, etc...). Algorithms and data structures
are needed to map the logical file structure onto the physical disk. This file organisation maps
logical file structure onto physical block addresses. A basic file system issues generic commands
to a device driver to read and write blocks. Finally, the I/O control contains device drivers to
translate basic file system commands into commands for a specific disk.
Several structures are used to implement file systems, such as on disk structures including the
boot control block, the partition control block, the directory structure to organise the files and a
per file file control block (FCB) containing permissions, dates (access, create, modify), owner,
group, size and blocks. In memory structures are also used, such as the partition table, the
directory structure (with recently accessed directories) and an open file table (both system wide,
and per process).
[Figure: (a) file system structures for opening a file; (b) file system structures for reading a file]
The above diagram shows the necessary file system structures provided by the OS to open a file
(a) and reading a file (b).
Virtual File Systems
Virtual file systems exist so that a single interface can be provided by the OS to many different
types of file systems. Different file systems can be mounted on different mount points within the
same directory structure, and operations have the same effect on all the different file systems.
This logical file system is called the Virtual File System, or VFS and it allows the same system
call interface (the API) to be used for different types of file system. The API is to the VFS
interface, rather than any specific type of file system, and the VFS can distinguish between local
and remote files.
Directory Implementation
The simplest implementation of directories is a linear list of file names with pointers to the data blocks. This is time-consuming, however (e.g., a search is required when a file is created to ensure uniqueness), as linear searches are expensive; this cost can be offset by keeping frequently used directory information in a cache. Keeping the names in order could reduce searches, but it would make inserting and deleting names more costly.
Another implementation is to use a hash table, which decreases directory search time. There is a
problem with collisions (situations where two file names hash to the same location) in this
approach, however. A method is needed for knowing that a hash location has a number of
corresponding file names, and for determining which name to use. This type of hash table is
called a chained-overflow hash table.
File Allocation Methods
In contiguous allocation, each file occupies a set of contiguous blocks on the disk. This is
relatively simple to implement, as only a starting location (block #) and length (number of
blocks) are required. Random access is also relatively easy. This is wasteful of space (dynamic
storage-allocation problem) due to fragmentation (see memory management earlier) and files can
not easily grow.
A slightly more improved method is linked allocation, where each file is a linked list of disk
blocks, which may be scattered anywhere on the disk, and where each block contains a pointer to
the next block. Again, this is simple as only the starting address is required and there is no
wasted space in this free space management system. However, there is no random access, as you
have to go down the linked list to find the appropriate block.
The file allocation table (FAT) system is a variant of linked allocation used in MS-DOS and OS/2. A section at the beginning of the disk is dedicated to a FAT, containing an entry for each disk block. The directory entry then only needs the block number for the start of the file; the FAT, indexed by a block number, gives the number of the next block, and so on.
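A minimal sketch of following a FAT chain (the in-memory FAT array, the end-of-chain marker and the starting block number, which would come from the directory entry, are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define FAT_EOF 0xFFFFFFFFu   /* assumed end-of-chain marker */

    /* fat[b] holds the number of the block that follows block b in its file. */
    void print_file_blocks(const uint32_t *fat, uint32_t first_block) {
        uint32_t block = first_block;      /* from the directory entry */
        while (block != FAT_EOF) {
            printf("block %u\n", block);   /* here: read/process this disk block */
            block = fat[block];            /* FAT indexed by block number gives the next */
        }
    }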
Linked allocation solves the problems of contiguous allocation, but it does not support random access unless a FAT is used. Random access can also be achieved with an index table - all the pointers are brought together into an index block, which exists per file, where the ith entry points to the ith block of the file. With large files, however, large index tables are needed, which is a disadvantage; multi-level indexing can be introduced to combat this. Additionally, the index table is an overhead.
Free Space Management
Since disk space is limited, space needs to be reused from deleted files and free space needs to be
monitored. One way of implementing this is using a bit vector which shows when a block is free
or occupied. An alternative to this bit vector is the linked list. A pointer is kept to the first free
block, which contains a pointer to the next free block, etc... This is inefficient, as to traverse the
list requires multiple disk block reads, however this is an uncommon operation, as most
operations require just one block. However, the linked list does not require any extra space, unlike the bit vector, although there is no easy way to obtain contiguous space.
These methods can be modified using grouping, storing the addresses of the next n free blocks in
the first free block; next n addresses of free blocks in the next free block, etc... - this is more
efficient in the number of blocks that need to be read; and counting, which takes advantage of the fact that free blocks usually occur together. The first free block contains the number of free blocks immediately following it (i.e., forming a contiguous free store), together with the address of the next free block at the start of another contiguous free store.
POSIX and Linux File Systems
In these implementations, the logical file system provides simple file operations, including:
create, open, read, write, close, mount (mounting a file system at a mount point), stat (returns
information about a file), lseek (repositions the current position in the file) and truncate (causes
file to be truncated to a specific size).
This assumes a structure where all files have distinct absolute paths, using a tree hierarchy. It is possible to set up cycles via symbolic links.
The VFS provides uniform access to all file systems, including special files (e.g., /proc, /dev).
The file system organisation is a multi-level indexing scheme, with an i-node (index-node)
associated with each file, which contains attributes and disk addresses of the files' blocks. There
is no distinction between files and directories other than having a different mode type field.
In an i-node, the first few disk addresses are stored (which for small files, will be all the disk
addresses) and there are links to further i-nodes which contain further disk addresses. This may
go up to further indirection, which gives us a very flexible system at the expense of complexity
and storage space.
For directories, each entry contains a file name and an i-node number. Here, the i-node contains
information about the time, size, type, ownership and disk blocks contained in the i-node. If you
imagine that the file system knows the root directory and wants to look up /usr/neil/ops/exam,
then from the root i-node, the system locates the i-node number of the file /usr, and then looks
for the next component neil, and so on. Relative paths work in the same way, where there is a
known current directory i-node.
The Ext2 Linux general-purpose filesystem implements this, with a mostly static allocation of i-nodes. The disk is split up into block groups of one or more disk cylinders (i.e., those physically close together), and blocks are used for i-nodes and file data blocks. Bitmaps are used to keep track of free data blocks and i-nodes in the group. When allocating disk blocks, the file system attempts to keep related information together (i.e., the i-nodes of all files in a directory in the same block group), and new directories are created in block groups with the smallest number of directories.
This scheme works well as long as there is > 10% free space.
With journalling file systems, files are stored both on disk and in main memory (a memory cache of disk blocks is used for efficiency); however, we have to consider what happens when the system fails. On the disk, file system data structures may be corrupted, but we must ensure there is no loss of data (i.e., consistency is maintained). Journalling (or log-based transaction) file systems attempt to maintain consistency.
For a file operation, updates to the disk structure are written to a log, and when complete, the
changes are considered committed. The updates for a file operation form a transaction, and actual
updates to disk structures are performed so that if there is a crash, it is known how much of the
transaction has been performed (and which other transactions remain in the log). ReiserFS under
Linux is an example of a journalling file system.
I/O
Handling I/O efficiently is important to the user, and therefore to the OS. All user interaction
with the machine is done using I/O and most user invoked operations (i.e., applications) will use
I/O at some point.
Operating systems provide I/O device handling based on utilising low-level hardware interfaces
to devices (i.e., accessing a device via special instructions). Device drivers are provided to handle the idiosyncrasies of individual devices; they are programmed in terms of a low-level hardware interface and provide a standard high-level application interface to devices, to enable their easy use by application software.
See ICS for information
on computer organisation
Devices are addressed either using direct mapping, where devices exist in a special I/O range and
special I/O instructions exist in the CPU to implement this, or by memory mapping, where
devices appear as addresses in the normal memory range and normal processor instructions are
used to access these addresses.
Devices tend to be accessible in the same place for a particular architecture, and each address range maps to the I/O control registers for the device, accessed by special I/O instructions. Some devices have both direct and memory-mapped address space (e.g., graphics cards).
I/O ports typically consist of a status register, which contains bits read by the host that indicate device state (e.g., data ready for a read), a control register which enables the host to change the mode of a device (e.g., from full to half-duplex for serial ports), a data-in register which is read by the host to get input, and a data-out register, which is written to by the host for data to be sent to the device. This introduces the issue of how the host knows when a particular condition has arisen in the status register.
Polling is one way to solve this problem, and is a host initiated approach. The device status
register is polled until the wanted condition occurs, but this way is inefficient as there is a busy-
wait cycle to wait for I/O from a device. The alternative solution is using interrupts, which is
initiated by both the host and the device. A CPU interrupt request line is triggered by the device
and the interrupt handler receives interrupts and uses an interrupt vector to dispatch an interrupt
to the correct handler based on priority. Most interrupts are maskable, so that some interrupts can be ignored or delayed, and the interrupt mechanism can also be used for exceptions. There is no need for busy-waiting in this model.
In the IA32 architecture, interrupts 0-31 are unmaskable, as some are exceptions, and 32-255 are
maskable and are used for devices.
For information on
DMA, see ICS
DMA is used to avoid programmed I/O for large data movement and is implemented using a
DMA controller to bypass the CPU and transfer data directly between the I/O device and
memory.
Application Interfaces
There are many different hardware devices, but application processes want a common interface. Devices can vary in many dimensions: transfer style - character (deals with bytes or words of data, e.g., keyboards, mice, serial ports; commands include get and put, and libraries are often implemented on top to allow in-line editing), stream (a stream of bits) or block (deals with blocks of data, such as disks; commands include read, write and seek, and most raw I/O filesystem access uses this); sequential or random access; sharable or dedicated; device speed; and read-write, read-only or write-only. Timer devices also exist, which provide the current time, elapsed time, or just a standard timer.
I/O can be either blocking or non-blocking. Under a blocking scheme, a process is suspended until the I/O is completed (this is easy to use and understand, but is insufficient for some needs, although blocking can be programmed around using threads). Under a non-blocking scheme, an I/O call returns as much as is available and returns quickly with a count of bytes read or written. A final scheme is asynchronous I/O, which is more difficult to use: the program runs while the I/O executes, and the I/O subsystem signals the process when the I/O is completed.
Usually, hardware devices are controlled in the same manner as files (create, write, read, delete,
etc...), so the filesystem namespace can be used for devices (/dev in UNIX). The kernel
implements I/O through a series of layers as below:
[Figure: layers of the kernel I/O subsystem]
The I/O subsystem of the kernel implements I/O scheduling and some I/O request ordering via a per-device queue, although some OSs do try to implement fairness. Additionally, buffering is implemented, where data is stored in memory while it is transferred between devices (this copes with device speed mismatch, transfer size mismatch and "copy semantics").
Buffering can be implemented using many different methods, such as caching, where fast memory holds a copy of the data (always just a copy, and this is the key to performance), or spooling, where output for a device is held (this is useful if the device can only serve one request at a time, e.g., printing). Device reservation can also be used, where exclusive access is provided to a device and system calls for allocation and deallocation are used. However, deadlock (discussed above) can be encountered here.
The kernel has data structures for keeping state information for I/O components, including open file tables, network connections and character device state, as well as many other data structures to track buffers, memory allocation and "dirty" blocks. Some OSs use object-oriented methods and message passing to implement I/O.
Some OSs use a file based approach to kernel I/O structures. Devices are named as files and
accesses to devices are by file operations (with some exceptions, e.g., seek in a character device
may have no effect). In Linux, an additional command ioctl() provides direct access to the device
driver, perhaps to set the device into a particular mode (e.g., changing speed or other parameters
of a serial port).
I/O Performance
I/O is a major factor in system performance, as it makes demands on the CPU to execute the
device driver or kernel I/O code and it forces context switches due to interrupts. Data copying
and network traffic are especially stressful.
Ways of improving performance include reducing the number of context switches, data copying
and interrupts by using large transfers, smart controllers and polling. DMA can also be used and
the CPU, memory, bus and I/O performance need to be balanced for highest throughput.
This brings us to the question of where functionality should be implemented to get the best performance - at the application, kernel or hardware level?