
Advanced Course on Bioinformatic and Comparative Genome Analysis

UFSC Florianopolis - June 30 - July 12, 2008

UFSC - CTC

Unix/perl
Prof. Mario Dantas
mario@inf.ufsc.br
http://www.inf.ufsc.br

Motivation
The large amount of computing resources
available in many organizations can be gathered
to solve problems from several research areas.
Biology is an example of an area whose
experiments can be improved through the use of
these distributed resources.

Motivation
Workflow:

- represents an execution flow in which
data are passed between tasks
according to previously defined rules.
Ontology:
An ontology can be expressed as a
formal, explicit specification of a
shared conceptualization.
GRID 2007 - September 20, 2007

Motivation
[Diagram: a Grid Resource Broker and Grid Information Services coordinating resources R1 ... RN across the grid]

Several Virtual Organizations
Each organization develops its own ontology
There is no single, shared ontology

Motivation
The Architecture Proposal
- Integration Portal module
- Information Provider module
- Matchmaker module

Motivation

Semantic Integration Approach

Motivation

Reference Ontology (RO)

Motivation
Graphic interface for editing
queries

Motivation
Case Study 1 (semantic matching)

Motivation
Case Study 2 (checking queries)

Motivation
SuMMIT


Motivation
SuMMIT Mobile GUI

Monitoring Interface

Motivation
SuMMIT Agent


Motivation
SuMMIT Workflow Manager


Motivation
SuMMIT Operation: Automation and Coordination


Motivation
SuMMIT Resource Selector


Objective
This course will introduce the basics of
UNIX/Perl programming; by the end,
participants will be able to write
simple scripts.

References
There are several books and sites that
can help you develop and improve your
knowledge of Unix and Perl; some
examples are:

Books

Available Free Online

http://proquest.safaribooksonline.com

Books

Interesting Sites
http://directory.fsf.org/project/emacs/
http://stein.cshl.org/genome_informatics/
http://www.cs.usfca.edu/~parrt/course/601/lectures/unix.util.html

Interesting Sites
http://people.genome.duke.edu/~jes12//courses/perl_duke_2005/
http://www.pasteur.fr/~tekaia/BCGA_useful_links.html
http://google.com/linux

Course Outline
Introduction

Operating system overview


UNIX utilities
Scripting languages
Programming tools


In the Beginning

UNICS: 1969 PDP-7 minicomputer


PDP-7 goes away, rewritten on PDP-11 to
help patent lawyers
V1: 1971
V3: 1973 (pipes, C language)
V6: 1976 (rewritten in C, base for BSD)
V7: 1979 (Licensed, portable)

PDP-11

Big Reason for V6 Success

Commercial Success

AIX
SunOS, Solaris
Ultrix, Digital Unix
HP-UX
Irix
UnixWare -> Novell -> SCO -> Caldera -> SCO
Xenix -> SCO
Standardization (Posix, X/Open)

But Then The Feuding Began


Unix International vs. Open Software Foundation
(to compete with desktop PCs)
Battle of the Window Managers

Openlook

Motif

Threat of Windows NT resolves battle with CDE

Send in the Clones


Linux
Written in 1991 by Linus Torvalds
Most popular UNIX variant
Free with GNU license

BSD Lite

FreeBSD (1993, focus on PCs)


NetBSD (1993, focus on portability)
OpenBSD (1996, focus on security)
Free with BSD license
Development less centralized

Today: Unix is Big

Popular Success!

Linux at Google & Elsewhere

Darwin
Apple abandoned old Mac OS for UNIX

Purchased NeXT in December 1996


Unveiled in 2000
Based on 4.4BSD-Lite
Aqua UI written over Darwin
Open Source

Why did UNIX succeed?

Technical strengths!
Research, not commercial
PDP-11 was popular with an unusable OS
AT&T's legal concerns
Not allowed to enter computer business but
needed to write software to help with switches
Licensed cheaply or free

The Open Source Movement


Has fueled much growth in UNIX
Keeps up with pace of change
More users, developers
More platforms, better
performance, better code

Many vendors switching to Linux

SCO vs. Linux


Jan 2002: SCO releases Ancient Unix: BSD-style
licensing of V5/V6/V7/32V/System III
March 2003: SCO sues IBM for $3 billion. Alleges
contributions to Linux come from proprietary licensed
code
AIX is based on System V r4, now owned by SCO

Aug 2003: Evidence released


Code traced to Ancient UNIX
Isn't in 90% of all running Linux distributions
Already dropped from Linux in July

Aug 2005: Linux Kernel Code May Have Been in SCO

Does Linux borrow from ancient UNIX or System V R4?

UNIX vs. Linux

The UNIX Philosophy


Small is beautiful
Easy to understand
Easy to maintain
More efficient
Better for reuse

Make each program do one thing well


More complex functionality by combining
programs
Make every program a filter

The UNIX Philosophy


Portability over efficiency

..continued

Most efficient implementation is rarely portable


Portability better for rapidly changing hardware

Use flat ASCII files


Common, simple file format (yesterday's XML)
Example of portability over efficiency

Reusable code
Good programmers write good code;
great programmers borrow good code

The UNIX Philosophy


..continued

Scripting increases leverage and portability

print $(who | awk '{print $1}' | sort | uniq) | sed 's/ /,/g'

List the logins of a system's users on a single line.

Build prototypes quickly (high-level
interpreted languages)

Lines of source code in each tool:

who     755
awk   3,412
sort  2,614
uniq    302
sed   2,093
total 9,176 lines

The UNIX Philosophy


..continued

Avoid captive interfaces


The user of a program isn't always human
Look nice, but code is big and ugly
Problems with scale

Silence is golden
Only report if something is wrong

Think hierarchically

UNIX Highlights / Contributions

Portability (variety of hardware; C implementation)


Hierarchical file system; the file abstraction
Multitasking and multiuser capability for minicomputer
Inter-process communication
Pipes: output of one program fed into the input of another

Software tools
Development tools
Scripting languages
TCP/IP

The Operating System


The government of your computer
Kernel: Performs critical system functions
and interacts with the hardware
Systems utilities: Programs and libraries
that provide various functions through
systems calls to the kernel

Kernel Basics
The kernel is
a program loaded into memory during the
boot process, and always stays in physical
memory.
responsible for managing CPU and memory
for processes, managing file systems, and
interacting with devices.

UNIX Structural Layout


[Diagram: User Space (shell scripts, utilities, C programs, compilers) calls into the Kernel (system calls, signal handler, scheduler, swapper, device drivers), which drives the Devices (terminal, disk, printer, RAM)]

Kernel Subsystems
Process management
Schedule processes to run on CPU
Inter-process communication (IPC)

Memory management
Virtual memory
Paging and swapping

I/O system
File system
Device drivers
Buffer cache

System Calls
Interface to the kernel
Over 1,000 system calls available on Linux
3 main categories
File/device manipulation
e.g. mkdir(), unlink()

Process control
e.g. fork(), execve(), nice()

Information manipulation
e.g. getuid(), time()

Logging In
Need an account and password first
Enter at login: prompt
Password not echoed
After successful login, you will see a shell prompt
Entering commands
At the shell prompt, type in commands
Typical format: command options arguments
Examples: who, date, ls, cat myfile, ls -l
Case sensitive
exit to log out

Remote Login
Use Secure Shell
(SSH)
Windows
e.g. PuTTY

UNIX-like OS
ssh name@access.cims.nyu.edu

UNIX on Windows
Two recommended UNIX emulation environments:

UWIN (AT&T)
http://www.research.att.com/sw/tools/uwin

Cygwin (GPL)
http://www.cygwin.com

Linux Distributions

Slackware - the original
Debian - collaboration of volunteers
Red Hat / Fedora - commercial success
Ubuntu - currently most popular; based on
Debian, focus on desktop
Gentoo - portability
Knoppix - live distribution

Course Outline
Introduction

Operating system overview


UNIX utilities
Scripting languages
Programming tools

Unix System Structure


[Diagram: the user interacts with the shell and utilities (C programs, scripts; e.g. ls, ksh, gcc, find), which invoke the kernel through system calls such as open(), fork(), exec(); the kernel drives the hardware]

Kernel Subsystems
File system
Deals with all input and output
Includes files and terminals
Integration of storage devices

Process management
Deals with programs and program interaction

How processes share CPU, memory and signals


Scheduling
Interprocess Communication
Memory management

UNIX variants have different implementations
of these subsystems.

What is a shell?
The user interface to the operating system
Functionality:
Execute other programs
Manage files
Manage processes

A program like any other


Executed when you log on

Most Commonly Used Shells


/bin/sh - The Bourne Shell / POSIX shell
/bin/csh - C shell
/bin/tcsh - Enhanced C shell
/bin/ksh - Korn shell
/bin/bash - Free ksh clone
Basic form of shell:
while (read command) {
parse command
execute command
}

Shell Interactive Use


When you log in, you interactively use the shell:

Command history
Command line editing
File expansion (tab completion)
Command expansion
Key bindings
Spelling correction
Job control

Shell Scripting
A set of shell commands that
constitute an executable program
A shell script is a regular text file that
contains shell or UNIX commands
Very useful for automating repetitive tasks,
building administrative tools, and storing
commands for later execution
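As a minimal sketch (the script name greet.sh and its contents are hypothetical, not from the course material), a shell script is just commands in a file:

```shell
#!/bin/sh
# greet.sh -- a minimal shell script (hypothetical example).
# Make it executable with: chmod +x greet.sh
# Run it as:              ./greet.sh Unix
name=${1:-world}        # first command-line argument, default "world"
echo "Hello, $name!"    # ordinary commands, run as if typed at the prompt
```

Stored this way, the same commands can be re-run later without retyping them.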

Simple Commands
simple command: a sequence of nonblank
arguments separated by blanks or tabs.
1st argument (numbered zero) usually specifies
the name of the command to be executed.
Any remaining arguments:
Are passed as arguments to that command.
Arguments may be filenames, pathnames, directories
or special options (up to command)
Special characters are interpreted by shell

A simple example
$ ls -l /bin
-rwxr-xr-x 1 root sys 43234 Sep 26 2001 date
$

($ is the prompt; ls is the command; -l and /bin are the arguments)

Execute a basic command

Parsing the line into command and arguments is called
splitting

Types of Arguments
$ tar -c -v -f archive.tar main.c main.h

Options/Flags
Convention: -X or --longname

Parameters
May be files, may be strings
Depends on command

Getting Help on UNIX


man: display entries from UNIX online
documentation
whatis, apropos
Manual entries organization:

1. Commands
2. System calls
3. Subroutines
4. Special files
5. File format and conventions
6. Games
http://en.wikipedia.org/wiki/Unix_manual

Example Man Page


ls(1)                    USER COMMANDS                    ls(1)

NAME
ls - list files and/or directories

SYNOPSIS
ls [ options ] [ file ... ]
DESCRIPTION
For each directory argument ls lists the contents; for each file argument the name and requested information are listed. The
current directory is listed if no file arguments appear. The listing is sorted by file name by default, except that file arguments
are listed before directories.
...

OPTIONS
-a, --all
List entries starting with .; turns off --almost-all.
-F, --classify
Append a character for typing each entry.
-l, --long|verbose
Use a long listing format.
-r, --reverse
Reverse order while sorting.
-R, --recursive
List subdirectories recursively.
...

SEE ALSO
chmod(1), find(1), getconf(1), tw(1)

Fundamentals of Security
UNIX systems have one or more users,
identified with a number and name.
A set of users can form a group. A user
can be a member of multiple groups.
A special user (id 0, name root) has
complete control.
Each user has a primary (default)
group.

How are Users & Groups used?


Used to determine if file or process
operations can be performed:
Can a given file be read? written to?
Can this program be run?
Can I use this piece of hardware?
Can I stop a particular process that's running?

A simple example
$ ls -l /bin
-rwxr-xr-x 1 root sys 43234 Sep 26 2001 date
$

(r = read, w = write, x = execute)

The UNIX File Hierarchy

Hierarchies are Ubiquitous

Definition: Filename
A sequence of characters other than slash.

Case sensitive.

[Diagram: a tree with directories tmp, usr (containing dmr and wm4, where wm4 holds .profile and foo), etc, and bin (containing who and date); the filenames who, date, and .profile are highlighted]

Definition: Directory
Holds a set of files or other directories.

Case sensitive.

[Diagram: the same tree; the directories etc, usr, dmr, and bin are highlighted]

Definition: Pathname
A sequence of directory names followed by a simple
filename, each separated from the previous one by a /
[Diagram: the tree rooted at /; the pathname /usr/wm4/.profile is traced from the root through usr and wm4 to .profile]

Definition: Working Directory


A directory that file names refer to by default.
One per process.
[Diagram: the same tree; the working directory wm4 is highlighted]

Definition: Relative Pathname


A pathname relative to the working directory (as
opposed to absolute pathname)
[Diagram: the same tree, with working directory wm4]
.. refers to the parent directory
. refers to the current directory
Examples (relative to wm4):
.profile
./.profile
../wm4/.profile
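A quick way to see these rules in action (the directory names below are made up for the demonstration):

```shell
# Sketch: how ., .., and plain names resolve against the working directory.
mkdir -p /tmp/pathdemo/wm4      # hypothetical layout
cd /tmp/pathdemo/wm4
touch .profile
ls .profile        # plain relative name, resolved in the working directory
ls ./.profile      # same file, with the current directory made explicit
ls ../wm4/.profile # up to /tmp/pathdemo, then back down into wm4
```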

Files and Directories


Files are just a sequence of bytes
No file types (data vs. executable)
No sections
Example of UNIX philosophy

Directories are a list of files and status of the


files:
Creation date
Attributes
etc.

Tilde Expansion
Each user has a home directory
Most shells (ksh, csh) support ~ operator:
~ expands to my home directory
~/myfile

/home/kornj/myfile

~user expands to user's home directory


~unixtool/file2

/home/unixtool/file2

Useful because home directory locations
vary by machine
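A short interactive sketch of tilde expansion (the file name myfile is hypothetical; the file need not exist, since the shell only rewrites the word):

```shell
echo ~           # prints your home directory, e.g. /home/kornj
echo ~/myfile    # prints $HOME/myfile -- expansion happens before echo runs
echo ~root       # another user's home directory, typically /root
```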

Mounting File Systems


When UNIX is started, the directory hierarchy
corresponds to the file system located on a
single disk called the root device.
Mounting allows root to splice the root
directory of a file system into the existing
directory hierarchy.
File systems created on other devices can be
attached to the original directory hierarchy
using the mount mechanism.
The commands mount and umount manage these attachments.

Mounting File Systems


[Diagram: the root device holds the hierarchy /, /a, /a/b; an external device with its own directories is mounted at mount point /a/b.
Mount table: root device -> /, external device -> /a/b]

Printing File Contents


The cat command can be used to copy the
contents of a file to the terminal. When invoked
with a list of file names, it concatenates them.
Some options:
-n  number output lines (starting from 1)
-v  display control-characters in visible form (e.g. ^C)

Interactive commands more and less show a
page at a time

Common Utilities for Managing Files and
Directories

pwd - print working directory
ed, vi, emacs - create/edit files
ls - list contents of directory
rm - remove file
mv - rename (move) file
cp - copy a file
touch - create an empty file or update its timestamp
mkdir and rmdir - create and remove directories
wc - count the lines, words, and characters in a file
file - determine file contents
du - report disk usage

File Permissions
UNIX provides a way to protect files based on
users and groups.
Three types of permissions:
read, process may read contents of file
write, process may write contents of file
execute, process may execute file

Three sets of permissions:


permissions for owner
permissions for group (1 group per file)
permissions for other

Directory permissions
Same types and sets of permissions as for
files:
read: process may read the directory
contents (i.e., list files)
write: process may add/remove files in the
directory
execute: process may open files in directory
or subdirectories

Utilities for Manipulating File Attributes

chmod - change file permissions
chown - change file owner
chgrp - change file group
umask - user file-creation mode mask

Only the owner or super-user can change file
attributes. Upon creation, the default permissions
given to a file are modified by the process umask value.

Chmod command
Symbolic access modes {u,g,o} /
{r,w,x}
example: chmod +r file

Octal access modes

octal  read  write  execute
0      no    no     no
1      no    no     yes
2      no    yes    no
3      no    yes    yes
4      yes   no     no
5      yes   no     yes
6      yes   yes    no
7      yes   yes    yes
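Reading the table, a three-digit octal mode gives one digit each for owner, group, and other. A small sketch (the file name is hypothetical):

```shell
touch /tmp/example_file
chmod 754 /tmp/example_file   # 7=rwx (owner), 5=r-x (group), 4=r-- (other)
ls -l /tmp/example_file       # first column shows -rwxr-xr--
chmod u-x /tmp/example_file   # same idea in symbolic form: drop owner execute
```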

File System Internals

The Open File Table


I/O operations are done on files by first
opening them, reading/writing/etc., then
closing them.
The kernel maintains a global table
containing information about each open file.
Inode  Mode        Count  Position
1023   read        1      0
1331   read/write  2      50

The File Descriptor Table


Each process contains a
table of files it has opened.
Inherits open files from
parent
Each open file is associated
with a number or handle,
called file descriptor, (fd).
Each entry of this table
points to an entry in the
open file table.
Always starts at 0

Why not directly use the open file table?
Convenient for kernel
Indirection makes security easier
Numbering scheme can be local to
process (0 .. 128)
Extra information stored:
Should the open file be inherited by children?
(close-on-exec flag)

Standard in/out/err
The first three entries in the file descriptor
table are special by convention:

[Diagram: a cat process with fd 0 = input, fd 1 = output, fd 2 = error messages]

Entry 0 is for input
Entry 1 is for output
Entry 2 is for error messages

What about reading/writing to the screen?

Devices
Besides files, input and output can go
from/to various hardware devices

UNIX innovation: Treat these just like files!


/dev/tty, /dev/lpr, /dev/modem

By default, standard in/out/err are opened
with /dev/tty

Redirection
Before a command is executed, the input and
output can be changed from the default
(terminal) to a file
Shell modifies file descriptors in child process
The child program knows nothing about this


Redirection of input/output
Redirection of output: >
example: $ ls > my_files

Redirection of input: <
example: $ mail kornj < input.data

Append output: >>
example: $ date >> logfile

Bourne Shell derivatives: fd>
example: $ ls 2> error_log
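The forms combine freely; a sketch (the file names are made up):

```shell
ls /bin > listing.txt            # stdout to a file (truncates it first)
date >> listing.txt              # append instead of truncate
wc -l < listing.txt              # stdin from a file
ls /no/such/dir 2> errors.txt    # stderr (fd 2) to a file
ls /bin /no/such/dir > out.txt 2>&1   # both streams into one file
```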

Using Devices
Redirection works with devices (just like
files)
Special files in /dev directory
Example: /dev/tty
Example: /dev/lp
Example: /dev/null
cat big_file > /dev/lp
cat big_file > /dev/null

Links
Directories are a list of files and directories.
Each directory entry links to a file on the disk

[Diagram: directory mydir with entries hello, file2, subdir; the entry hello links to a file containing "Hello World!"]

Two different directory entries can link to the same file


In same directory or across different directories

Moving a file does not actually move any data around.


Creates link in new location
Deletes link in old location

ln command

Symbolic links
Symbolic links are different than regular links (often
called hard links). Created with ln -s
Can be thought of as a directory entry that points to
the name of another file.
Does not change link count for file
When original deleted, symbolic link remains

They exist because:


Hard links don't work across file systems
Hard links only work for regular files, not directories

[Diagram: a hard link is a directory entry pointing directly at the file's contents; a symbolic link is a directory entry pointing at another name, which in turn resolves to the contents]
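The difference shows up when the original name is removed; a sketch session (run in a scratch directory; file names are hypothetical):

```shell
cd "$(mktemp -d)"
echo hello > original
ln original hardlink    # a second directory entry for the same file
ln -s original symlink  # a new file that stores the name "original"
rm original             # removes one name; the data still has a link left
cat hardlink            # still prints hello
cat symlink             # fails: the name it points to is gone
```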

Example

[Diagram: the tree with /usr/wm4/.profile. With a hard link, a second directory entry elsewhere in the tree points at the same file contents. With a symbolic link, the new entry instead holds the pathname /usr/wm4/.profile]

Can a file have no links?

[Diagram: a cat process holds open a file whose last directory entry has been removed; the data survives until the process closes it]

Tree Walking
How can we find a set of files in the
hierarchy?
One possibility:
ls -l -R /

What about:
All files below a given directory in the hierarchy?
All files since Jan 1, 2001?
All files larger than 10K?

find utility
find pathlist expression
find recursively descends through pathlist
and applies expression to every file.
expression can be:
-name pattern
true if file name matches pattern. Pattern may
include shell patterns such as *, must be in quotes
to suppress shell interpretation.
Eg: find / -name '*.c'

find utility (continued)


-perm [+-]mode
Find files with given access mode; mode must be in octal.
E.g.: find . -perm 755
-type ch
Find files of type ch (c=character, b=block, f=plain file,
etc.). E.g.: find /home -type f
-user userid/username
Find by owner userid or username
-group groupid/groupname
Find by group groupid or groupname
-size size
File size is at least size

many more

find: logical operations


! expression - returns the logical negation of expression
op1 -a op2 - matches both patterns op1 and op2
op1 -o op2 - matches either op1 or op2
( ) - group expressions together

find: actions
-print
prints out the name of the current file (default)

-exec cmd
Executes cmd, where cmd must be terminated by an
escaped semicolon (\; or ';').
If you specify {} as a command line argument, it is
replaced by the name of the current file just found.
exec executes cmd once per file.
Example:
find -name "*.o" -exec rm "{}" ";"

find Examples
Find all files beneath home directory beginning with f
find ~ -name 'f*' -print

Find all files beneath home directory modified in the last day
find ~ -mtime -1 -print

Find all files beneath home directory larger than 10K
find ~ -size +10k -print

Count words in files under home directory
find ~ -exec wc -w {} \; -print

Remove core files
find / -name core -exec rm {} \;

diff: comparing two files


diff: compares two files and outputs a
description of their differences
Usage: diff [options] file1 file2
-i: ignore case
test1:      test2:
apples      apples
oranges     oranges
walnuts     grapes

$ diff test1 test2
3c3
< walnuts
---
> grapes

Other file comparison utilities


cmp
Tests two files for equality
If equal, nothing returned. If different, location of
first differing byte returned
Faster than diff for checking equality

comm
Reads two files and outputs three columns:
Lines in first file only
Lines in second file only
Lines in both files

Must be sorted
Options: fields to suppress ( [-123] )
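A sketch with two small sorted files (contents chosen to match the diff example above):

```shell
printf 'apples\noranges\nwalnuts\n' > test1   # already sorted
printf 'apples\ngrapes\noranges\n' > test2    # already sorted
comm test1 test2       # three tab-indented columns: only-1, only-2, both
comm -12 test1 test2   # suppress columns 1 and 2: lines common to both
```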

Course Outline
Introduction

Operating system overview


UNIX utilities
Scripting languages
Programming tools

Kernel Data Structures


Process table: contains an entry for every
process in the system, with information
about each process.
Open-file table: contains at least one entry
for every open file in the system.

[Diagram: user space holds each process's code and data; kernel space holds per-process info in the process table, whose entries reference the open file table]

Unix Processes
Process: An entity of execution
Definitions
program: collection of bytes stored in a file that can
be run
image: computer execution environment of program
process: execution of an image

Unix can execute many processes
simultaneously.

Process Creation
Interesting trait of UNIX:
The fork system call clones the current process
The exec system call replaces the current process image
A fork is typically followed by an exec

Process Setup
All of the per process information is copied
with the fork operation
Working directory
Open files

Copy-on-write makes this efficient


Before exec, these values can be
modified

fork and exec


Example: the shell
while (1) {
    display_prompt();
    read_input(cmd, params);
    pid = fork();               /* create child  */
    if (pid != 0)
        waitpid(-1, &stat, 0);  /* parent waits  */
    else
        execve(cmd, params, 0); /* child execs   */
}

Unix process genealogy


[Diagram: process generation. The init process (PID 1) forks one init process per terminal; each execs getty, which execs login, which execs /bin/sh]

Background Jobs
By default, executing a command in the
shell will wait for it to exit before printing
out the next prompt
Trailing a command with & allows the shell
and command to run simultaneously
$ /bin/sleep 10 &
[1] 3424
$

Program Arguments
When a process is started, it is sent a list
of strings
argv, argc

The process can use this list however it
wants to

Ending a process
When a process ends, there is a return
code associated with the process
This is a non-negative integer
0 means success
>0 represents various kinds of failure, up to the
process
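In the shell, the return code of the last command is available in the special variable $?; a sketch:

```shell
true                    # a command that always succeeds
echo $?                 # prints 0
echo nothing | grep xyzzy
echo $?                 # grep exits 1 when no line matches
ls /no/such/dir
echo $?                 # nonzero; the exact value is up to the command
```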

Process Information Maintained


Working directory
File descriptor table
Process id
number used to identify process

Process group id
number used to identify set of processes

Parent process id
process id of the process that created the process

Process Information Maintained


Umask
Default file permissions for new file
We haven't talked about these yet:

Effective user and group id


The user and group this process is running with
permissions as

Real user and group id


The user and group that invoked the process

Environment variables

Setuid and Setgid Mechanisms


The kernel can set the effective user and
group ids of a process to something
different than the real user and group
Files executed with a setuid or setgid flag set
cause these values to change

Make it possible to do privileged tasks:


Change your password

Open up a can of worms for security if
buggy

Environment of a Process
A set of name-value pairs associated with
a process
Keys and values are strings
Passed to children processes
Cannot be passed back up
Common examples:
PATH: Where to search for programs
TERM: Terminal type

The PATH environment variable


Colon-separated list of directories.
Non-absolute pathnames of executables
are only executed if found in the list.
Searched left to right

Example:
$ myprogram
sh: myprogram not found
$ PATH=/bin:/usr/bin:/home/kornj/bin
$ myprogram
hello!

Having . In Your Path


$ ls
foo
$ foo
sh: foo: not found

$ ./foo
Hello, foo.

What not to do:


$ PATH=.:/bin
$ ls
foo
$ cd /usr/badguy
$ ls
Congratulations, your files have been removed
and you have just sent email to Prof. Korn
challenging him to a fight.

Shell Variables
Shells have several mechanisms for creating
variables. A variable is a name representing a
string value. Example: PATH
Shell variables can save time and reduce typing
errors

Allow you to store and manipulate information


Eg: ls $DIR > $FILE

Two types: local and environmental


local are set by the user or by the shell itself
environmental come from the operating system
and are passed to children

Variables (cont)
Syntax varies by shell
varname=value          # sh, ksh
set varname = value    # csh

To access the value: $varname

Turn a local variable into an environment variable:
export varname         # sh, ksh
setenv varname value   # csh
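A short sh/ksh-style session showing the difference between a local and an exported variable (the variable names are made up):

```shell
DIR=/tmp              # local: only this shell sees it
export DIR            # now children inherit DIR
FILE=listing.txt      # stays local
sh -c 'echo "$DIR"'   # child prints /tmp
sh -c 'echo "$FILE"'  # child prints an empty line: FILE was not exported
```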

Environmental Variables
NAME    MEANING
$HOME   Absolute pathname of your home directory
$PATH   A list of directories to search for commands
$MAIL   Absolute pathname of your mailbox
$USER   Your user id
$SHELL  Absolute pathname of login shell
$TERM   Type of your terminal
$PS1    Prompt

Inter-process Communication
Ways in which processes communicate:
Passing arguments, environment
Read/write regular files
Exit values
Signals
Pipes

Signals
Signal: A message a process can send to a process
or process group, if it has appropriate permissions.
Message type represented by a symbolic name
For each signal, the receiving process can:
Explicitly ignore signal
Specify action to be taken upon receipt (signal handler)
Otherwise, default action takes place (usually process is
killed)

Common signals:
SIGKILL, SIGTERM, SIGINT
SIGSTOP, SIGCONT
SIGSEGV, SIGBUS

An Example of Signals
When a child exits, it sends a
SIGCHLD signal to its parent.
If a parent wants to wait for a child to
exit, it tells the system it wants to catch
the SIGCHLD signal
When a parent does not issue a wait, it
ignores the SIGCHLD signal

Process Subsystem utilities


ps - monitors status of processes
kill - send a signal to a pid
wait - parent process waits for one of its children to terminate
nohup - makes a command immune to the hangup and terminate signals
sleep - sleep in seconds
nice - run processes at low priority

Pipes

One of the cornerstones of UNIX

Pipes
General idea: The input of one program
is the output of the other, and vice versa

Both programs run at the same time

Pipes (2)
Often, only one end of the pipe is used

[Diagram: process 1's standard out feeds process 2's standard in]

Could this be done with files?

File Approach
Run first program, save output into file
Run second program, using file as input

[Diagram: process 1 writes a file to disk; process 2 later reads it]

Unnecessary use of the disk
Slower
Can take up a lot of space
Makes no use of multi-tasking

More about pipes


What if a process tries to read data but nothing
is available?
UNIX puts the reader to sleep until data available

What if a process can't keep up reading from
the process that's writing?
UNIX keeps a buffer of unread data
This is referred to as the pipe size.

If the pipe fills up, UNIX puts the writer to sleep until
the reader frees up space (by doing a read)

Multiple readers and writers possible with pipes.

More about Pipes


Pipes are often chained together
Called filters

[Diagram: each filter's standard out feeds the next filter's standard in]

Interprocess Communication
For Unrelated Processes
FIFO (named pipes)
A special file that when opened represents a pipe

System V IPC
message queues
semaphores
shared memory
Sockets (client/server model)

Pipelines
Output of one program becomes input to
another
Uses concept of UNIX pipes

Example: $ who | wc -l
counts the number of users logged in

Pipelines can be long

What's the difference?


Both of these commands send input to command from a
file instead of the terminal:

$ cat file | command


vs.

$ command < file

An Extra Process
$ cat file | command
[Diagram: cat reads file and writes into a pipe; command reads the pipe - two processes]

$ command < file
[Diagram: command reads the file directly - one process]

Introduction to Filters
A class of Unix tools called filters.
Utilities that read from standard input,
transform the file, and write to standard out

Using filters can be thought of as data-oriented
programming.
Each step of the computation transforms the data
stream.

Examples of Filters
Sort
Input: lines from a file
Output: lines from the file sorted

Grep
Input: lines from a file
Output: lines that match the argument

Awk
Programmable filter

cat: The simplest filter


The cat command copies its input to output
unchanged (identity filter). When supplied a list
of file names, it concatenates them onto stdout.
Some options:
-n  number output lines (starting from 1)
-v  display control-characters in visible form (e.g. ^C)

cat file*
ls | cat -n

head
Display the first few lines of a specified file
Syntax: head [-n] [filename...]
-n - number of lines to display, default is 10
filename... - list of filenames to display

When more than one filename is
specified, the start of each file's listing
displays:
==> filename <==

tail
Displays the last part of a file
Syntax: tail +|-number [lbc] [-f] [filename]
or: tail +|-number [l] [-rf] [filename]
+number - begins copying at distance number from the
beginning of the file; if number isn't given, defaults to 10
-number - begins from the end of the file
l,b,c - number is in units of lines/blocks/characters
r - print in reverse order (lines only)
-f - if input is not a pipe, do not terminate after the end of
file has been copied but loop. This is useful to
monitor a file being written by another process

head and tail examples


head /etc/passwd
head *.c
tail +20 /etc/passwd
ls -lt | tail -3
head -100 /etc/passwd | tail -5
tail -f /usr/local/httpd/access_log

tee
[Diagram: tee copies its standard input both to standard output and to the files in file-list]

Copy standard input to standard output
and one or more files
Captures intermediate results from a filter
in the pipeline

tee (cont.)
Syntax: tee [ -ai ] file-list
-a - append to output file rather than overwrite,
default is to overwrite (replace) the output file
-i - ignore interrupts
file-list - one or more file names for capturing
output

Examples
ls | head -10 | tee first_10 | tail -5
who | tee user_list | wc

Unix Text Files: Delimited Data


Tab-separated
John    99
Anne    75
Andrew  50
Tim     95
Arun    33
Sowmya  76

Pipe-separated
COMP1011|2252424|Abbot, Andrew John |3727|1|M
COMP2011|2211222|Abdurjh, Saeed |3640|2|M
COMP1011|2250631|Accent, Aac-Ek-Murhg |3640|1|M
COMP1021|2250127|Addison, Blair |3971|1|F
COMP4012|2190705|Allen, David Peter |3645|4|M
COMP4910|2190705|Allen, David Pater |3645|4|M

Colon-separated
root:ZHolHAHZw8As2:0:0:root:/root:/bin/ksh
jas:nJz3ru5a/44Ko:100:100:John Shepherd:/home/jas:/bin/ksh
cs1021:iZ3sO90O5eZY6:101:101:COMP1021:/home/cs1021:/bin/bash
cs2041:rX9KwSSPqkLyA:102:102:COMP2041:/home/cs2041:/bin/csh
cs3311:mLRiCIvmtI9O2:103:103:COMP3311:/home/cs3311:/bin/sh

cut: select columns


The cut command prints selected parts of input
lines.
can select columns (assumes tab-separated input)
can select a range of character positions

Some options:
-f listOfCols: print only the specified columns (tab-separated) on output
-c listOfPos: print only chars in the specified positions
-d c: use character c as the column separator

Lists are specified as ranges (e.g. 1-5) or comma-separated (e.g. 2,4,5).

cut examples
cut -f 1 < data
cut -f 1-3 < data
cut -f 1,4 < data
cut -f 4- < data
cut -d'|' -f 1-3 < data
cut -c 1-4 < data
Unfortunately, there's no way to refer to "last
column" without counting the columns.
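A common workaround (not covered in the slides) is to reverse each line with rev, cut the first field, then reverse back; the file name and contents below are made up for illustration:

```shell
# Print the LAST tab-separated column of each line.
# rev reverses the characters of each line, so the last field
# becomes the first field, which cut can then select.
printf 'a\tb\tc\nd\te\tf\n' > /tmp/cutdemo
rev /tmp/cutdemo | cut -f 1 | rev
```

This prints the last column of each line (here: c, then f).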

paste: join columns


The paste command displays several text files "in
parallel" on output.
If the inputs are files a, b, c,
the first line of output is composed
of the first lines of a, b, c;
the second line of output is composed
of the second lines of a, b, c; and so on.

For example, if file a contains lines 1 and 2,
file b contains 3 and 4, and file c contains 5 and 6,
then paste a b c produces:

1	3	5
2	4	6

Lines from each file are separated by a tab character.

If files are different lengths, output has all lines from
longest file, with empty strings for missing lines.

paste example
cut -f 1 < data > data1
cut -f 2 < data > data2
cut -f 3 < data > data3
paste data1 data3 data2 > newdata

sort: Sort lines of a file


The sort command copies input to output but
ensures that the output is arranged in
ascending order of lines.
By default, sorting is based on ASCII comparisons
of the whole line.

Other features of sort:


understands text data that occurs in columns.
(can also sort on a column other than the first)
can distinguish numbers and sort appropriately
can sort files "in place" as well as behaving like a
filter
capable of sorting very large files

sort: Options
Syntax: sort [-dftnr] [-o filename] [filename(s)]
-d Dictionary order; only letters, digits, and whitespace
are significant in determining sort order
-f Ignore case (fold into lower case)
-t Specify delimiter
-n Numeric order, sort by arithmetic value instead of
first digit
-r Sort in reverse order
-o filename - write output to filename; filename can be
the same as one of the input files

Many more options

sort: Specifying fields


Delimiter: -tc (c is the delimiter character)
Old way:
+f[.c][options] [-f[.c][options]]
e.g. +2.1 -3  +0 -2  +3n

Inclusive
Start from 1

sort Examples
sort +2nr < data
sort -k2nr data
sort -t: +4 /etc/passwd
sort -o mydata mydata
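A minimal, runnable sketch of the -t/-k syntax; the file name and data are made up:

```shell
# Sort colon-separated records numerically on field 2.
printf 'bob:30\nann:4\ncal:100\n' > /tmp/ages
sort -t: -k2n /tmp/ages
# ann:4
# bob:30
# cal:100
```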

uniq: list UNIQue items


Remove or report adjacent duplicate lines
Syntax: uniq [-cdu] [input-file] [output-file]
-c Supersedes the -u and -d options and
generates an output report with each line
preceded by an occurrence count
-d Write only the duplicated lines
-u Write only those lines which are not
duplicated
The default output is the union (combination) of
-d and -u

wc: Counting results


The word count utility, wc, counts the
number of lines, characters or words
Options:
-l Count lines
-w Count words
-c Count characters

Default: count lines, words and chars

wc and uniq Examples


who | sort | uniq -d
wc my_essay
who | wc
sort file | uniq | wc -l
sort file | uniq -d | wc -l
sort file | uniq -u | wc -l
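The sort | uniq -c | sort -rn combination is the classic frequency-count pipeline; the input below is invented for the demo:

```shell
# sort brings duplicates together, uniq -c counts adjacent
# duplicates, and sort -rn ranks by count (highest first).
printf 'b\na\nb\na\nb\n' | sort | uniq -c | sort -rn
```

The top line of output is the most frequent item with its count (here, 3 b).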

tr: TRanslate Characters


Copies standard input to standard output with
substitution or deletion of selected characters
Syntax: tr [ -cds ] [ string1 ] [ string2 ]
-d delete all input characters contained in string1
-c complements the characters in string1 with respect
to the entire ASCII character set
-s squeeze all strings of repeated output characters
in the last operand to single characters

tr (continued)
tr reads from standard input.
Any character that does not match a character in
string1 is passed to standard output unchanged
Any character that does match a character in string1
is translated into the corresponding character in
string2 and then passed to standard output

Examples
tr s z - replaces all instances of s with z
tr so zx - replaces all instances of s with z and o with x
tr a-z A-Z - replaces all lower case characters with
upper case characters
tr -d a-c - deletes all a-c characters

tr uses
Change delimiter
tr '|' ':'

Rewrite numbers
tr ',.' '.,'

Import DOS files


tr -d '\r' < dos_file

Find printable ASCII in a binary file


tr -cd '\na-zA-Z0-9' < binary_file
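Putting tr together with sort and uniq gives the classic word-frequency pipeline (a sketch; the input text is made up):

```shell
# -c complements the set, -s squeezes runs: every run of
# non-letters becomes a single newline, i.e. one word per line.
printf 'The cat and the hat\n' |
tr -cs 'A-Za-z' '\n' |
tr 'A-Z' 'a-z' |
sort | uniq -c | sort -rn
```

The most frequent word ("the", count 2) comes out on top.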

xargs
Unix limits the size of arguments and
environment that can be passed down to child processes
What happens when we have a list of 10,000
files to send to a command?
xargs solves this problem
Reads arguments as standard input
Sends them to commands that take file lists
May invoke program several times depending on size
of arguments
a1 ... a300  ->  xargs cmd  ->  cmd a1 a2 ... a100
                                cmd a101 ... a200
                                cmd a201 ... a300

find utility and xargs


find . -type f -print | xargs wc -l
-type f for files
-print to print them out

xargs invokes wc 1 or more times
wc -l a b c d e f g
wc -l h i j k l m n o

Compare to: find . -type f -exec wc -l {} \;
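A small runnable sketch of the find | xargs pattern; the directory and files are created just for the demo:

```shell
# Build a couple of files, then let xargs hand the whole
# list to a single wc invocation (or a few, for huge lists).
mkdir -p /tmp/xargsdemo && cd /tmp/xargsdemo
printf 'one\n' > a.txt
printf 'one\ntwo\n' > b.txt
find . -type f -print | xargs wc -l
```

wc's final "total" line reports 3 lines across both files.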

Next Time
Regular Expressions
Allow you to search for text in files
grep command

We will soon learn how to write scripts that


use these utilities in interesting ways.

Previously
Basic UNIX Commands
Files: rm, cp, mv, ls, ln
Processes: ps, kill

Unix Filters

cat, head, tail, tee, wc


cut, paste
find
sort, uniq
comm, diff, cmp
tr

Subtleties of commands

Executing commands with find


Specification of columns in cut
Specification of columns in sort
Methods of input
Standard in
File name arguments
Special "-" filename

Options for uniq

Today
Regular Expressions
Allow you to search for text in files
grep command

Stream manipulation:
sed
But first, a command we didn't cover last time

xargs
Unix limits the size of arguments and
environment that can be passed down to child processes
What happens when we have a list of 10,000
files to send to a command?
xargs handles this problem
Reads arguments as standard input
Sends them to commands that take file lists
May invoke program several times depending on size
of arguments
a1 ... a300  ->  xargs cmd  ->  cmd a1 a2 ... a100
                                cmd a101 ... a200
                                cmd a201 ... a300

find utility and xargs


find . -type f -print | xargs wc -l
-type f for files
-print to print them out

xargs invokes wc 1 or more times
wc -l a b c d e f g
wc -l h i j k l m n o

Compare to: find . -type f -exec wc -l {} \;

The -n option can be used to limit number of args

Regular Expressions

What Is a Regular Expression?


A regular expression (regex) describes a set
of possible input strings.
Regular expressions descend from a
fundamental concept in Computer Science
called finite automata theory
Regular expressions are endemic to Unix

vi, ed, sed, and emacs


awk, tcl, perl and Python
grep, egrep, fgrep
compilers

Regular Expressions
The simplest regular expressions are a
string of literal characters to match.
The string matches the regular
expression if it contains the substring.

regular expression: cks

UNIX Tools rocks.      match
UNIX Tools sucks.      match
UNIX Tools is okay.    no match

Regular Expressions
A regular expression can match a string in
more than one place.

regular expression: apple

Scrapple from the apple.    match 1, match 2

Regular Expressions
The . regular expression can be used to
match any character.
regular expression: o.

For me to poop on.    match 1, match 2

Character Classes
Character classes [] can be used to
match any specific set of characters.

regular expression: b[eor]at

beat a brat on a boat    match 1 ("beat"), match 2 ("brat"), match 3 ("boat")

Negated Character Classes


Character classes can be negated with the
[^] syntax.

regular expression: b[^eo]at

beat a brat on a boat    match ("brat")

More About Character Classes


[aeiou] will match any of the characters a, e, i, o,
or u
[kK]orn will match korn or Korn

Ranges can also be specified in character


classes
[1-9] is the same as [123456789]
[abcde] is equivalent to [a-e]
You can also combine multiple ranges
[abcde123456789] is equivalent to [a-e1-9]
Note that the - character has a special meaning in a
character class but only if it is used within a range,
[-123] would match the characters -, 1, 2, or 3

Named Character Classes


Commonly used character classes can be
referred to by name (alpha, lower, upper,
alnum, digit, punct, cntrl)
Syntax [:name:]
[a-zA-Z]     [[:alpha:]]
[a-zA-Z0-9]  [[:alnum:]]
[45a-z]      [45[:lower:]]

Important for portability across languages
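A quick runnable sketch of a named class in a grep pattern (the sample input is invented):

```shell
# Keep only the lines that consist entirely of digits.
# [[:digit:]] is the portable spelling of [0-9].
printf 'abc\n123\na1\n42\n' | grep '^[[:digit:]][[:digit:]]*$'
# 123
# 42
```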

Anchors
Anchors are used to match at the beginning or end of a line (or both).
^ means beginning of the line
$ means end of the line

regular expression: ^b[eor]at

beat a brat on a boat    match ("beat")

regular expression: b[eor]at$

beat a brat on a boat    match ("boat")

^word$   matches a line consisting only of word

^$       matches an empty line

Repetition
The * is used to define zero or more
occurrences of the single regular
expression preceding it.

regular expression: ya*y

I got mail, yaaaaaaaaaay!    match

regular expression: oa*o

For me to poop on.    match ("oo")

.*   matches any string (zero or more of any character)

Match length
A match will be the longest string that
satisfies the regular expression.

regular expression: a.*e

Scrapple from the apple.

The match runs from the first "a" (in "Scrapple") to the
last "e" - the longest string that satisfies the regex.

Repetition Ranges
Ranges can also be specified
{ } notation can specify a range of
repetitions for the immediately preceding
regex
{n} means exactly n occurrences
{n,} means at least n occurrences
{n,m} means at least n occurrences but no
more than m occurrences

Example:
.{0,} same as .*
a{2,} same as aaa*

Subexpressions
If you want to group part of an expression so
that * or { } applies to more than just the
previous character, use ( ) notation
Subexpresssions are treated like a single
character
a* matches 0 or more occurrences of a
abc* matches ab, abc, abcc, abccc, ...
(abc)* matches abc, abcabc, abcabcabc, ... (and the empty string)
(abc){2,3} matches abcabc or abcabcabc
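Subexpression repetition needs the egrep-style (ERE) syntax; grep -E is the modern spelling of egrep, and the input lines here are made up:

```shell
# ^(abc){2}$ matches exactly two back-to-back copies of "abc".
printf 'abc\nabcabc\nabcabcabc\n' | grep -E '^(abc){2}$'
# abcabc
```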

grep
grep comes from the ed (Unix text editor)
search command global regular expression
print or g/re/p
This was such a useful command that it was
written as a standalone utility
There are two other variants, egrep and fgrep
that comprise the grep family
grep is the answer to the moments where you
know you want the file that contains a specific
phrase but you can't remember its name

Family Differences
grep - uses regular expressions for pattern
matching
fgrep - file grep, does not use regular
expressions, only matches fixed strings but can
get search strings from a file
egrep - extended grep, uses a more powerful
set of regular expressions but does not support
backreferencing, generally the fastest member
of the grep family
agrep - approximate grep; not standard

Syntax
Regular expression concepts we have seen
so far are common to grep and egrep.
grep and egrep have slightly different syntax
grep: BREs
egrep: EREs (enhanced features we will discuss)

Major syntax differences:


grep: \( and \), \{ and \}
egrep: ( and ), { and }

Protecting Regex
Metacharacters
Since many of the special characters used
in regexes also have special meaning to the
shell, it's a good idea to get in the habit of
single quoting your regexes
This will protect any special characters from
being operated on by the shell
If you habitually do it, you won't have to worry
about when it is necessary

Escaping Special Characters


Even though we are single quoting our regexes so
the shell won't interpret the special characters,
some characters are special to grep (e.g. * and .)
To get literal characters, we escape the character
with a \ (backslash)
Suppose we want to search for the character
sequence a*b*
Unless we do something special, this will match zero or
more a's followed by zero or more b's, not what we
want
a\*b\* will fix this - now the asterisks are treated as
regular characters

Egrep: Alternation
Regex also provides an alternation character |
for matching one or another subexpression
(T|Fl)an will match Tan or Flan
^(From|Subject): will match the From and Subject
lines of a typical email message
It matches a beginning of line followed by either the characters
From or Subject followed by a :

Subexpressions are used to limit the scope of the


alternation
At(ten|nine)tion then matches Attention or
Atninetion, not Atten or ninetion as would happen
without the parentheses - Atten|ninetion

Egrep: Repetition Shorthands


The * (star) has already been seen to specify
zero or more occurrences of the immediately
preceding character
+ (plus) means one or more
abc+d will match abcd, abccd, or abccccccd
but will not match abd
Equivalent to {1,}

Egrep: Repetition Shorthands


cont
The ? (question mark) specifies an optional
character, the single character that immediately
precedes it
July? will match Jul or July
Equivalent to {0,1}
Also equivalent to (Jul|July)
The *, ?, and + are known as quantifiers because
they specify the quantity of a match
Quantifiers can also be used with subexpressions
(a*c)+ will match c, ac, aac or aacaacac but will not
match a or a blank line

Grep: Backreferences
Sometimes it is handy to be able to refer to a
match that was made earlier in a regex
This is done using backreferences
\n is the backreference specifier, where n is a number

Looks for nth subexpression


For example, to find if the first word of a line is
the same as the last:
^\([[:alpha:]]\{1,\}\) .* \1$
The \([[:alpha:]]\{1,\}\) matches 1 or more letters

Practical Regex Examples


Variable names in C
[a-zA-Z_][a-zA-Z_0-9]*

Dollar amount with optional cents


\$[0-9]+(\.[0-9][0-9])?

Time of day
(1[012]|[1-9]):[0-5][0-9] (am|pm)

HTML headers <h1> <H1> <h2>


<[hH][1-4]>

grep Family
Syntax
grep [-hilnv] [-e expression] [filename]
egrep [-hilnv] [-e expression] [-f filename] [expression]
[filename]
fgrep [-hilnxv] [-e string] [-f filename] [string] [filename]
-h Do not display filenames
-i Ignore case
-l List only filenames containing matching lines
-n Precede each matching line with its line number
-v Negate matches
-x Match whole line only (fgrep only)
-e expression Specify expression as option
-f filename
Take the regular expression (egrep)
or a list of strings (fgrep) from filename

grep Examples

grep 'men' GrepMe


grep 'fo*' GrepMe
egrep 'fo+' GrepMe
egrep -n '[Tt]he' GrepMe
fgrep 'The' GrepMe
egrep 'NC+[0-9]*A?' GrepMe
fgrep -f expfile GrepMe

Find all lines with signed numbers


$ egrep '[-+][0-9]+\.?[0-9]*' *.c
bsearch.c: return -1;
compile.c: strchr("+1-2*3", t->op)[1] - 0, dst,
convert.c: Print integers in a given base 2-16 (default 10)
convert.c: sscanf(argv[i+1], "%d", &base);
strcmp.c: return -1;
strcmp.c: return +1;

egrep has its limits: For example, it cannot match all lines that
contain a number divisible by 7.

Fun with the Dictionary


/usr/dict/words contains about 25,000 words
egrep hh /usr/dict/words
beachhead
highhanded
withheld
withhold

egrep as a simple spelling checker: Specify plausible


alternatives you know
egrep "n(ie|ei)ther" /usr/dict/words
neither

How many words have 3 a's, each one letter apart?
egrep a.a.a /usr/dict/words | wc -l
54
egrep u.u.u /usr/dict/words
cumulus

Other Notes
Use /dev/null as an extra file name
Will print the name of the file that matched
grep test bigfile
This is a test.

grep test /dev/null bigfile


bigfile:This is a test.

Return code of grep is useful

grep fred filename > /dev/null && rm filename
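The exit-status idiom above can be sketched safely (echo instead of rm; the file is made up):

```shell
# grep exits 0 when it finds a match, non-zero otherwise,
# so && runs the next command only on a match.
printf 'fred was here\n' > /tmp/grepdemo
grep fred /tmp/grepdemo > /dev/null && echo "found fred"
# found fred
```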

Quick Reference

input line:          This is one line of text
regular expression:  o.*o

fgrep, grep, egrep:
x           Ordinary characters match themselves
            (NEWLINES and metacharacters excluded)
xyz         Ordinary strings match themselves

grep, egrep:
\m          Matches literal character m
^           Start of line
$           End of line
.           Any single character
[xy^$z]     Any of x, y, ^, $, or z
[^xy^$z]    Any one character other than x, y, ^, $, or z
[a-z]       Any single character in given range
r*          Zero or more occurrences of regex r
r1r2        Matches r1 followed by r2

grep:
\(r\)       Tagged regular expression, matches r
\n          Set to what matched the nth tagged expression
            (n = 1-9)
\{n,m\}     Repetition

egrep:
r+          One or more occurrences of r
r?          Zero or one occurrences of r
r1|r2       Either r1 or r2
(r1|r2)r3   Either r1r3 or r2r3
(r1|r2)*    Zero or more occurrences of r1|r2 (e.g., r1,
            r1r1, r2r1, r1r1r2r1, ...)
{n,m}       Repetition

Sed: Stream-oriented, Non-Interactive, Text Editor


Look for patterns one line at a time, like grep
Change lines of the file
Non-interactive text editor
Editing commands come in as script
There is an interactive editor ed which accepts the
same commands

A Unix filter
Superset of previously mentioned tools

Sed Architecture
Input -> Pattern Space (holds one input line) -> Output
                       ^
                  scriptfile

Commands in a sed script are applied


in order to each line.
If a command changes the input,
subsequent command will be applied to
the modified line in the pattern space,
not the original input line.
The input file is unchanged (sed is a
filter).
Results are sent to standard output

Scripts
A script is nothing more than a file of commands
Each command consists of up to two addresses
and an action, where the address can be a
regular expression or line number.

address    action
address    action
   :          :

Each "address action" pair is one command;
a script is a list of such commands.

Sed Flow of Control


sed then reads the next line in the input file and
restarts from the beginning of the script file
All commands in the script file are compared to,
and potentially act on, all lines in the input file
script

cmd 1

cmd 2

...

cmd n

Each command is executed only if the line matches
the command's address.
After the last command, the pattern space is printed
from input to output (only without -n).

sed Syntax
Syntax: sed [-n] [-e] [command] [file]
sed [-n] [-f scriptfile] [file]
-n - only print lines specified with the print
command (or the p flag of the substitute (s)
command)
-f scriptfile - next argument is a filename
containing editing commands
-e command - the next argument is an editing
command rather than a filename, useful if multiple
commands are specified
If the first line of a scriptfile is #n, sed acts as
though -n had been specified

sed Commands
sed commands have the general form
[address[, address]][!]command [arguments]

sed copies each input line into a pattern space


If the address of the command matches the line in the
pattern space, the command is applied to that line
If the command has no address, it is applied to each
line as it enters pattern space
If a command changes the line in pattern space,
subsequent commands operate on the modified line

When all commands have been read, the line in


pattern space is written to standard output and a
new line is read into pattern space

Addressing
An address can be either a line number or
a pattern, enclosed in slashes: /pattern/
A pattern is described using regular
expressions (BREs, as in grep)
If no pattern is specified, the command will
be applied to all lines of the input file
To refer to the last line: $

Addressing (continued)
Most commands will accept two addresses
If only one address is given, the command operates
only on that line
If two comma separated addresses are given, then
the command operates on a range of lines between
the first and second address, inclusively

The ! operator can be used to negate an
address, i.e., address!command causes
command to be applied to all lines that do not
match address

Commands
command is a single letter
Example: Deletion: d
[address1][,address2]d
Delete the addressed line(s) from the pattern
space; line(s) not passed to standard output.
A new line of input is read and editing
resumes with the first command of the script.

Address and Command


Examples
d                  deletes all lines
6d                 deletes line 6
/^$/d              deletes all blank lines
1,10d              deletes lines 1 through 10
1,/^$/d            deletes from line 1 through the first blank line
/^$/,$d            deletes from the first blank line through
                   the last line of the file
/^$/,10d           deletes from the first blank line through line 10
/^ya*y/,/[0-9]$/d  deletes from the first line that begins
                   with yay, yaay, yaaay, etc. through
                   the first line that ends with a digit

Multiple Commands
Braces {} can be used to apply multiple commands to
an address
[/pattern/[,/pattern/]]{
command1
command2
command3
}

Strange syntax:
The opening brace must be the last character on a line
The closing brace must be on a line by itself
Make sure there are no spaces following the braces

Sed Commands
Although sed contains many editing commands,
we are only going to cover the following subset:

s - substitute
a - append
i - insert
c - change

d - delete
p - print
y - transform
q - quit

Print
The Print command (p) can be used to force
the pattern space to be output, useful if the -n
option has been specified
Syntax: [address1[,address2]]p
Note: if the -n option has not been specified, p
will cause the line to be output twice!
Examples:
1,5p will display lines 1 through 5
/^$/,$p will display the lines from the first
blank line through the last line of the file
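A runnable sketch of -n with p (the input lines are invented):

```shell
# -n suppresses the default print, so only the addressed
# lines (2 through 3) appear on output.
printf 'a\nb\nc\nd\n' | sed -n '2,3p'
# b
# c
```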

Substitute
Syntax: [address(es)]s/pattern/replacement/
[flags]
pattern - search pattern
replacement - replacement string for pattern
flags - optionally any of the following
n       a number from 1 to 512 indicating which
        occurrence of pattern should be replaced
g       global, replace all occurrences of pattern
        in pattern space
p       print contents of pattern space

Substitute Examples
s/Puff Daddy/P. Diddy/
Substitute P. Diddy for the first occurrence of Puff
Daddy in pattern space

s/Tom/Dick/2
Substitutes Dick for the second occurrence of Tom in
the pattern space

s/wood/plastic/p
Substitutes plastic for the first occurrence of wood and
outputs (prints) pattern space

Replacement Patterns
Substitute can use several special
characters in the replacement string
& - replaced by the entire string matched in
the regular expression for pattern
\n - replaced by the nth substring (or
subexpression) previously specified using \(
and \)
\ - used to escape the ampersand (&) and
the backslash (\)

Replacement Pattern Examples


"the UNIX operating system "
s/.NI./wonderful &/
"the wonderful UNIX operating system "
cat test1
first:second
one:two
sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one
sed 's/\([[:alpha:]]\)\([^ \n]*\)/\2\1ay/g'
Pig Latin ("unix is fun" -> "nixuay siay unfay")
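The field-swap example above runs as-is; here it is as a pipeline instead of reading the test1 file:

```shell
# \( \) tags the text before and after the colon;
# \2:\1 in the replacement swaps the two halves.
printf 'first:second\none:two\n' | sed 's/\(.*\):\(.*\)/\2:\1/'
# second:first
# two:one
```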

Append, Insert, and Change


Syntax for these commands is a little strange
because they must be specified on multiple
lines
append
[address]a\
text
insert
[address]i\
text
change
[address(es)]c\
text
append/insert for single lines only, not range

Append and Insert


Append places text after the current line in pattern space
Insert places text before the current line in pattern space
Each of these commands requires a \ following it.
text must begin on the next line.
If text begins with whitespace, sed will discard it
unless you start the line with a \

Example:

/<Insert Text Here>/i\


Line 1 of inserted text\
\
Line 2 of inserted text

would leave the following in the pattern space


Line 1 of inserted text
Line 2 of inserted text
<Insert Text Here>

Change
Unlike Insert and Append, Change can be
applied to either a single line address or a range
of addresses
When applied to a range, the entire range is
replaced by text specified with change, not each
line
Exception: If the Change command is executed with
other commands enclosed in { } that act on a range
of lines, each line will be replaced with text

No subsequent editing allowed

Change Examples
Remove mail headers, i.e., the address specifies a
range of lines beginning with a line that begins with
"From " until the first blank line.

/^From /,/^$/c\
<Mail Headers Removed>

The first example replaces all the lines with a single
occurrence of <Mail Headers Removed>.

/^From /,/^$/{
s/^From //p
c\
<Mail Header Removed>
}

The second example replaces each line with
<Mail Header Removed>

Using !
If an address is followed by an exclamation point
(!), the associated command is applied to all lines
that don't match the address or address range
Examples:
1,5!d would delete all lines except 1 through 5
/black/!s/cow/horse/ would substitute
horse for cow on all lines except those that
contained black
The brown cow -> The brown horse
The black cow -> The black cow

Transform
The Transform command (y) operates like tr, it
does a one-to-one or character-to-character
replacement
Transform accepts zero, one or two addresses
[address[,address]]y/abc/xyz/
every a within the specified address(es) is
transformed to an x. The same is true for b to y and
c to z
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
changes all lower case characters on the
addressed line to upper case


If you only want to transform specific characters (or a
word) in the line, it is much more difficult and requires
use of the hold space

Quit
Quit causes sed to stop reading new input lines
and stop sending them to standard output
It takes at most a single line address
Once a line matching the address is reached, the
script will be terminated
This can be used to save time when you only want
to process some portion of the beginning of a file

Example: to print the first 100 lines of a file (like


head) use:
sed '100q' filename
sed will, by default, send the first 100 lines of
filename to standard output and then quit processing

Pattern and Hold spaces


Pattern space: Workspace or temporary
buffer where a single line of input is held
while the editing commands are applied
Hold space: Secondary temporary buffer
for temporary storage only
in -> Pattern space <-> Hold space -> out
(the commands h, H, g, G, x copy or exchange text
between the pattern space and the hold space)

Sed Advantages
Regular expressions
Fast
Concise

Sed Drawbacks
Hard to remember text from one line to
another
Not possible to go backward in the file
No way to do forward references
like /..../+1
No facilities to manipulate numbers
Cumbersome syntax

Course Outline
Introduction

Operating system overview


UNIX utilities
Scripting languages
Programming tools

What is a shell?
The user interface to the operating system
Functionality:
Execute other programs
Manage files
Manage processes

Full programming language


A program like any other
This is why there are so many shells

Shell History
There are
many choices
for shells
Shell features
evolved as
UNIX grew

Most Commonly Used Shells


/bin/csh
/bin/tcsh
/bin/sh
/bin/ksh
/bin/bash

C shell
Enhanced C Shell
The Bourne Shell / POSIX shell
Korn shell
Korn shell clone, from GNU

Ways to use the shell


Interactively
When you log in, you interactively use the
shell

Scripting
A set of shell commands that constitute an
executable program

Review: UNIX Programs


Means of input:
Program arguments
[control information]
Environment variables
[state information]
Standard input [data]

Means of output:
Return status code [control information]
Standard out [data]
Standard error [error messages]

Shell Scripts
A shell script is a regular text file that
contains shell or UNIX commands
Before running it, it must have execute
permission:
chmod +x filename

A script can be invoked as:


sh name [ arg ]
sh < name [ args ]
name [ arg ]

Shell Scripts
When a script is run, the kernel determines
which shell it is written for by examining the first
line of the script
If 1st line starts with #!pathname-of-shell, then it invokes pathname and sends
the script as an argument to be interpreted
If #! is not specified, the current shell
assumes it is a script in its own language
leads to problems

Simple Example
#!/bin/sh
echo Hello World

Scripting vs. C Programming


Advantages of shell scripts
Easy to work with other programs
Easy to work with files
Easy to work with strings
Great for prototyping. No compilation

Disadvantages of shell scripts


Slower
Not well suited for algorithms & data
structures

The C Shell
C-like syntax (uses { }'s)
Inadequate for scripting

Poor control over file descriptors


Difficult quoting "I say \"hello\"" doesn't work
Can only trap SIGINT
Can't mix flow control and commands

Survives mostly because of interactive features.


Job control
Command history
Command line editing, with arrow keys (tcsh)

http://www.faqs.org/faqs/unix-faq/shell/csh-whynot

The Bourne Shell

Slight differences on various systems


Evolved into standardized POSIX shell
Scripts will also run with ksh, bash
Influenced by ALGOL

Simple Commands
simple command: sequence of non-blank
arguments separated by blanks or tabs.
1st argument (numbered zero) usually specifies
the name of the command to be executed.
Any remaining arguments:
Are passed as arguments to that command.
Arguments may be filenames, pathnames, directories
or special options
ls -l /

/bin/ls
-l
/

Background Commands
Any command ending with "&" is run in the
background.
firefox &

wait will block until the command finishes

Complex Commands
The shell's power is in its ability to hook
commands together
We've seen one example of this so far
with pipelines:
cut -d: -f2 /etc/passwd | sort | uniq

We will see others

Redirection of input/output
Redirection of output: >
example:$ ls -l > my_files

Redirection of input: <


example: $ cat <input.data

Append output: >>


example: $ date >> logfile

Arbitrary file descriptor redirection: fd>


example: $ ls -l 2> error_log

Multiple Redirection
cmd 2>file
send standard error to file
standard output remains the same

cmd > file 2>&1


send both standard error and standard output to
file

cmd > file1 2>file2


send standard output to file1
send standard error to file2

Here Documents
Shell provides alternative ways of supplying
standard input to commands (an anonymous file)
Shell allows in-line input redirection using <<
called here documents
Syntax:
command [arg(s)] << arbitrary-delimiter
command input
:
:
arbitrary-delimiter
arbitrary-delimiter should be a string that does
not appear in text

Here Document Example


#!/bin/sh
mail steinbrenner@yankees.com <<EOT
Sorry, I really blew it this
year. Thanks for not firing me.
Yours,
Joe
EOT
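The same mechanism works with any command that reads standard input; here is a minimal sketch with cat (the text is made up, and variables like $USER are expanded inside the here document):

```shell
# Everything up to the EOT delimiter is fed to cat's stdin.
cat <<EOT
Hello from $USER
This is line two.
EOT
```

The output depends on the current value of $USER.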

Shell Variables
To set:
name=value

Read: $var
Variables can be local or environment.
Environment variables are part of UNIX
and can be accessed by child processes.
Turn local variable into environment:
export variable

Variable Example
#!/bin/sh
MESSAGE="Hello World"
echo $MESSAGE

Environmental Variables
NAME    MEANING
$HOME   Absolute pathname of your home directory
$PATH   A list of directories to search for commands
$MAIL   Absolute pathname to mailbox
$USER   Your login name
$SHELL  Absolute pathname of login shell
$TERM   Type of your terminal
$PS1    Prompt

Here Documents Expand Vars


#!/bin/sh
mail steinbrenner@yankees.com <<EOT
Sorry, I really blew it this
year. Thanks for not firing me.
Yours,
$USER
EOT

Parameters
A parameter is one of the following:
A variable
A positional parameter, starting from 1
A special parameter
To get the value of a parameter: ${param}
Can be part of a word (abc${foo}def)
Works within double quotes

The {} can be omitted for simple variables,


special parameters, and single digit positional
parameters.

Positional Parameters
The arguments to a shell script
$1, $2, $3

The arguments to a shell function


Arguments to the set built-in command
set this is a test
$1=this, $2=is, $3=a, $4=test

Manipulated with shift


shift 2
$1=a, $2=test

Parameter 0 is the name of the shell or the shell


script.

Example with Parameters


#!/bin/sh
# Parameter 1: word
# Parameter 2: file
grep $1 $2 | wc -l
$ countlines ing /usr/dict/words
3277

Special Parameters

$#    Number of positional parameters
$-    Options currently in effect
$?    Exit value of last executed command
$$    Process number of current process
$!    Process number of background process
$*    All arguments on command line
"$@"  All arguments on command line
      individually quoted "$1" "$2" ...

Command Substitution
Used to turn the output of a command into a string
Used to create arguments or variables
Command is placed with grave accents ` ` to
capture the output of command
$ date
Wed Sep 25 14:40:56 EDT 2001
$ NOW=`date`
$ grep `generate_regexp` myfile.c
$ sed "s/oldtext/`ls | head -1`/g"
$ PATH=`myscript`:$PATH

File name expansion


Used to generate a set of arguments from files
Wildcards (patterns)
* matches any string of characters
? matches any single character
[list] matches any character in list
[lower-upper] matches any character in range lower-upper inclusive
[!list] matches any character not in list

This is the same syntax that find uses

File Expansion
If multiple matches, all are returned
and treated as separate arguments:
$ /bin/ls
file1 file2
$ cat file1
a
$ cat file2
b
$ cat file*
a
b

Handled by the shell (programs don't see the wildcards):

argv[0]: /bin/cat
argv[1]: file1
argv[2]: file2

NOT

argv[0]: /bin/cat
argv[1]: file*

Compound Commands
Multiple commands
Separated by semicolon or newline

Command groupings
pipelines

Subshell
( command1; command2 ) > file

Boolean operators
Control structures

Boolean Operators
Exit value of a program (exit system call) is a
number
0 means success
anything else is a failure code

cmd1 && cmd2


executes cmd2 if cmd1 is successful

cmd1 || cmd2
executes
cmd2 if cmd1
is not successful
$ ls bad_file > /dev/null && date
$ ls bad_file > /dev/null || date
Wed Sep 26 07:43:23 2006
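The short-circuit behavior can be seen in a self-contained script (messages are illustrative):

```shell
#!/bin/sh
# && runs the right side only if the left succeeded (exit 0);
# || runs it only if the left failed (non-zero exit)
true  && echo "ran after success"
false || echo "ran after failure"
false && echo "never printed"
true  || echo "never printed either"
```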

Control Structures
if expression
then
command1
else
command2
fi

What is an expression?
Any UNIX command. Evaluates to true if the
exit code is 0, false if the exit code > 0
Special command /bin/test exists that does
most common expressions
String compare
Numeric comparison
Check file properties

[ is often a built-in version of /bin/test, for
syntactic sugar
Good example of UNIX tools working together

Examples
if test "$USER" = "kornj"
then
echo "I know you"
else
echo "I don't know you"
fi
if [ -f /tmp/stuff ] && [ `wc -l < /tmp/stuff` -gt 10 ]
then
echo "The file has more than 10 lines in it"
else
echo "The file is nonexistent or small"
fi

test Summary

String based tests
-z string             Length of string is 0
-n string             Length of string is not 0
string1 = string2     Strings are identical
string1 != string2    Strings differ
string                String is not NULL

Numeric tests
int1 -eq int2         First int equal to second
int1 -ne int2         First int not equal to second
-gt, -ge, -lt, -le    greater, greater/equal, less, less/equal

File tests
-r file               File exists and is readable
-w file               File exists and is writable
-f file               File is a regular file
-d file               File is a directory
-s file               File exists and is not empty

Logic
!                     Negate result of expression
-a, -o                and operator, or operator
( expr )              groups an expression

Arithmetic
No arithmetic built in to /bin/sh
Use external command /bin/expr
expr expression
Evaluates expression and sends the result to
standard output.
Yields a numeric or string result
expr 4 "*" 12
expr "(" 4 + 3 ")" "*" 2

Particularly useful with command substitution:
X=`expr $X + 2`
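For example, summing a range with expr and command substitution (a sketch; newer shells would use $(( )) instead):

```shell
#!/bin/sh
# Sum 1..5 using expr for all arithmetic
i=1
sum=0
while [ $i -le 5 ]
do
    sum=`expr $sum + $i`
    i=`expr $i + 1`
done
echo "sum=$sum"      # prints sum=15
```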

Control Structures Summary


if then fi
while do done
until do done
for do done
case in esac

for loops
Different than C:
for var in list
do
command
done

Typically used with positional parameters or a


list of files:
sum=0
for var in "$@"
do
sum=`expr $sum + $var`
done
echo The sum is $sum
for file in *.c ; do echo "We have $file"
done

Case statement
Like a C switch statement for strings:
case $var in
opt1) command1
command2
;;
opt2) command
;;
*)
command
;;
esac

* is a catch all condition

Case Example
#!/bin/sh
for INPUT in "$@"
do
case $INPUT in
hello)
echo "Hello there."
;;
bye)
echo "See ya later."
;;
*)
echo "I'm sorry?"
;;
esac
done
echo "Take care."

Case Options
opt can be a shell pattern, or a list of shell
patterns delimited by |
Example:
case $name in
*[0-9]*)
echo "That doesn't seem like a name."
;;
J*|K*)
echo "Your name starts with J or K, cool."
;;
*)
echo "You're not special."
;;
esac

Types of Commands
All behave the same way
Programs
Most that are part of the OS in /bin

Built-in commands
Functions
Aliases

Built-in Commands
Built-in commands are internal to the shell and
do not create a separate process. Commands
are built-in because:
They are intrinsic to the language (exit)
They produce side effects on the current process (cd)
They perform faster
No fork/exec

Special built-ins
: . break continue eval exec export exit
readonly return set shift trap unset

Important Built-in Commands


exec
cd :
shift
set:
wait
umask
exit
eval
time
export
trap

:
replaces shell with program
change working directory
:
rearrange positional parameters
set positional parameters
:
wait for background proc. to exit
:
change default file permissions
:
quit the shell
:
parse and execute string
:
run command and print times
:
put variable into environment
:
set signal handlers

Important Built-in Commands


continue
break
return
:
.

:
:
:
:
:

continue in loop
break in loop
return from function
true
read file of commands into
current shell; like #include

Functions
Functions are similar to scripts and other
commands except:
They can produce side effects in the callers script.
Variables are shared between caller and callee.
The positional parameters are saved and restored
when invoking a function.
Syntax:
name ()
{
commands
}
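Both properties can be seen in a short sketch (function and variable names are made up):

```shell
#!/bin/sh
# Variables are shared with the caller; $1, $2, ... are
# saved before the call and restored afterwards
greet()
{
    msg="hello from $1"     # visible to the caller afterwards
}
set -- outer1 outer2        # set the script's positional parameters
greet inner
echo "$msg"                 # prints: hello from inner
echo "$1"                   # prints: outer1 (restored)
```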

Aliases
Like macros (#define in C)
Shorter to define than functions, but more
limited
Not recommended for scripts
Example:
alias rm='rm -i'

Command Search Rules


Special built-ins
Functions
command bypasses search for functions

Built-ins not associated with PATH


PATH search
Built-ins associated with PATH

Parsing and Quoting

How the Shell Parses


Part 1: Read the command:
Read one or more lines as needed
Separate into tokens using space/tabs
Form commands based on token types

Part 2: Evaluate a command:


Expand word tokens (command substitution, parameter
expansion)

Split words into fields


File expansion
Setup redirections, environment
Run command with arguments

Useful Program for Testing


/home/unixtool/bin/showargs
#include <stdio.h>
int main(int argc, char *argv[])
{
int i;
for (i=0; i < argc; i++) {
printf("Arg %d: %s\n", i, argv[i]);
}
return(0);
}

Shell Comments

Comments begin with an unquoted #


Comments end at the end of the line
Comments can begin whenever a token begins
Examples
# This is a comment
# and so is this
grep foo bar # this is a comment
grep foo bar# this is not a comment

Special Characters
The shell processes the following characters specially
unless quoted:
| & ( ) < > ; " ' $ ` space tab newline

The following are special whenever patterns are


processed:
* ? [ ]

The following are special at the beginning of a word:


# ~

The following is special when processing assignments:


=

Token Types
The shell uses spaces and tabs to split the
line or lines into the following types of
tokens:
Control operators (||)
Redirection operators (<)
Reserved words (if)
Assignment tokens
Word tokens

Operator Tokens
Operator tokens are recognized everywhere
unless quoted. Spaces are optional before and
after operator tokens.
I/O Redirection Operators:
> >> >| >& < << <<- <&
Each I/O operator can be immediately preceded by a
single digit

Control Operators:
| & ; ( ) || && ;;

Shell Quoting
Quoting causes characters to lose special
meaning.
\ Unless quoted, \ causes next character to
be quoted. In front of new-line causes lines to
be joined.
''
Literal quotes. Cannot contain '
""
Removes special meaning of all
characters except $, ", \ and `. The \ is only
special before one of these characters and newline.

Quoting Examples
$ cat file*
a
b
$ cat "file*"
cat: file* not found
$ cat file1 > /dev/null
$ cat file1 ">" /dev/null
a
cat: >: cannot open
FILES="file1 file2"
$ cat "$FILES"
cat: file1 file2 not found

Simple Commands
A simple command consists of three types of
tokens:

Assignments (must come first)


Command word tokens
Redirections: redirection-op + word-op
The first token must not be a reserved word
Command terminated by new-line or ;

Example:
foo=bar z=`date`
echo $HOME
x=foobar > q$$ $xyz z=3

Word Splitting
After parameter expansion, command
substitution, and arithmetic expansion,
the characters that are generated as a
result of these expansions that are not
inside double quotes are checked for split
characters
Default split character is space or tab
Split characters are defined by the value
of the IFS variable (IFS="" disables)

Word Splitting Examples


FILES="file1 file2"
cat $FILES
a
b
IFS=
cat $FILES
cat: file1 file2: cannot open

IFS=x v=exit
echo exit $v "$v"
exit e it exit
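The effect of IFS is easiest to see by splitting a colon-separated value (illustrative variable names):

```shell
#!/bin/sh
# Change IFS so an unquoted expansion splits on colons
v=a:b:c
old_ifs=$IFS
IFS=:
set -- $v          # unquoted, so it is split into 3 words
IFS=$old_ifs       # restore the default split characters
echo "$#"          # prints: 3
echo "$2"          # prints: b
```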

Pathname Expansion
After word splitting, each field that
contains pattern characters is replaced by
the pathnames that match
Quoting prevents expansion
set -o noglob disables
Not in original Bourne shell, but in POSIX

Parsing Example
DATE=`date` echo $foo > \
/dev/null

DATE=`date`    echo    $foo     > /dev/null
assignment     word    param    redirection

echo         -> /bin/echo      (PATH expansion)
$foo         -> hello there    (split by IFS)
> /dev/null                    (redirection)

Script Examples
Rename files to lower case
Strip CR from files
Emit HTML for directory contents

Rename files
#!/bin/sh
for file in *
do
lfile=`echo $file | tr A-Z a-z`
if [ $file != $lfile ]
then
mv $file $lfile
fi
done

Remove DOS Carriage Returns


#!/bin/sh
TMPFILE=/tmp/file$$
if [ "$1" = "" ]
then
tr -d '\r'
exit 0
fi
trap 'rm -f $TMPFILE' 1 2 3 6 15
for file in "$@"
do
if tr -d '\r' < $file > $TMPFILE
then
mv $TMPFILE $file
fi
done

Generate HTML
$ dir2html.sh > dir.html

The Script
#!/bin/sh
[ "$1" != "" ] && cd "$1"
cat <<HUP
<html>
<h1> Directory listing for $PWD </h1>
<table border=1>
<tr>
HUP
num=0
for file in *
do
genhtml $file   # this function is on the next page
done
cat <<HUP
</tr>
</table>
</html>
HUP

Function genhtml
genhtml()
{
file=$1
echo "<td><tt>"
if [ -f $file ]
then
echo "<font color=blue>$file</font>"
elif [ -d $file ]
then
echo "<font color=red>$file</font>"
else
echo "$file"
fi
echo "</tt></td>"
num=`expr $num + 1`
if [ $num -gt 4 ]
then
echo "</tr><tr>"
num=0
fi
}

Korn Shell / bash Features

Command Substitution
Better syntax with $(command)
Allows nesting
x=$(cat $(generate_file_list))

Backward compatible with ` ` notation

Expressions
Expressions are built-in with the [[ ]] operator
if [[ $var = "" ]]

Gets around parsing quirks of /bin/test, allows checking


strings against patterns
Operations:

string == pattern
string != pattern
string1 < string2
file1 -nt file2
file1 -ot file2
file1 -ef file2
&&, ||

Patterns
Can be used to do string matching:
if [[ $foo = *a* ]]
if [[ $foo = [abc]* ]]

Similar to regular expressions, but


different syntax

Additional Parameter Expansion

${#param} Length of param


${param#pattern} Left strip min pattern
${param##pattern} Left strip max pattern
${param%pattern} Right strip min pattern
${param%%pattern} Right strip max pattern
${param-value} Default value if param not
set
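For instance, stripping the directory and extension parts of a pathname (the path value is made up):

```shell
#!/bin/sh
# Pattern stripping on a pathname
path=/usr/local/bin/prog.sh
echo ${#path}        # length of the string: 22
echo ${path##*/}     # longest */ prefix removed: prog.sh
echo ${path%/*}      # shortest /* suffix removed: /usr/local/bin
echo ${path%.*}      # shortest .* suffix removed: /usr/local/bin/prog
```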

Variables
Variables can be arrays
foo[3]=test
echo ${foo[3]}

Indexed by number
${#arr} is length of the array
Multiple array elements can be set at once:
set -A foo a b c d
echo ${foo[1]}
Set command can also be used for positional
params: set a b c d; print $2

Printing
Built-in print command to replace echo
Much faster
Allows options:
-u#

print to specific file descriptor

Functions
Alternative function syntax:
function name {
commands
}

Allows for local variables


$0 is set to the name of the function

Additional Features
Built-in arithmetic: Using $((expression ))
e.g., print $(( 1 + 1 * 8 / x ))

Tilde file expansion

~       $HOME
~user   home directory of user
~+      $PWD
~-      $OLDPWD

Course Outline
Introduction

Operating system overview


UNIX utilities
Scripting languages
Programming tools


The eval built-in


eval arg
Causes all the tokenizing and expansions to
be performed again

trap command
trap specifies command that should be evaled
when the shell receives a signal of a particular
value.
trap [ [command] {signal}+]
If command is omitted, signals are ignored
Especially useful for cleaning up temporary files
trap 'echo "please, don't interrupt!"' SIGINT
trap 'rm /tmp/tmpfile' EXIT

Reading Lines
read is used to read a line from a file and
to store the result into shell variables
read -r prevents special processing
Uses IFS to split into words
If no variable specified, uses REPLY
read
read -r NAME
read FIRSTNAME LASTNAME
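A typical pattern is reading a file line by line (the file name and field names are made up):

```shell
#!/bin/sh
# Read whitespace-separated fields from each line of a file
printf 'alice 30\nbob 25\n' > /tmp/people.$$
while read -r name age
do
    echo "$name is $age"
done < /tmp/people.$$
rm -f /tmp/people.$$
```

This prints "alice is 30" and "bob is 25"; the -r keeps any backslashes in the input intact.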


KornShell 93

Variable Attributes
By default variables hold strings of unlimited length
Attributes can be set with typeset:

readonly (-r) cannot be changed


export (-x) value will be exported to env
upper (-u) letters will be converted to upper case
lower (-l) letters will be converted to lower case
ljust (-L width) left justify to given width
rjust (-R width) right justify to given width
zfill (-Z width) justify, fill with leading zeros
integer (-i [base]) value stored as integer

float (-E [prec]) value stored as C double


nameref (-n) a name reference

Name References
A name reference is a type of variable that
references another variable.
nameref is an alias for typeset -n
Example:
user1="jeff"
user2="adam"
typeset -n name="user1"
print $name
jeff

New Parameter Expansion


${param/pattern/str} Replace first pattern
with str
${param//pattern/str} Replace all
patterns with str
${param:offset:len} Substring with offset

Patterns Extended
Additional pattern types so that shell patterns are
equally expressive as regular expressions
Used for:
file expansion
[[ ]]
case statements
parameter expansion


ANSI C Quoting
$'' Uses C escape sequences
$'\t'

$'Hello\nthere'

printf added that supports C like printing:


printf "You have %d apples" $x

Extensions

%b ANSI escape sequences


%q Quote argument for reinput
\E Escape character (033)
%P convert ERE to shell pattern
%H convert using HTML conventions
%T date conversions using date formats

Associative Arrays

Arrays can be indexed by string


Declared with typeset -A
Set: name["foo"]="bar"
Reference ${name["foo"]}
Subscripts: ${!name[@]}

Networking, HTTP, CGI

Network Application
Client application and server application
communicate via a network protocol
A protocol is a set of rules on how the client and
server communicate

web client  <---- HTTP ---->  web server
(both are user-level programs; each kernel carries the traffic)

TCP/IP Suite
application layer:      client   <------------------>  server
transport layer:        TCP/UDP  <------------------>  TCP/UDP
internet layer:         IP       <------------------>  IP
network access layer:   drivers/hardware <-(ethernet)-> drivers/hardware

Data Encapsulation
Application layer:     | Data |
Transport layer:       | H1 | Data |
Internet layer:        | H2 | H1 | Data |
Network access layer:  | H3 | H2 | H1 | Data |

Network Access/Internet Layers


Network Access Layer
Deliver data to devices on the same physical network
Ethernet

Internet Layer

Internet Protocol (IP)


Determines routing of datagram
IPv4 uses 32-bit addresses (e.g. 128.122.20.15)
Datagram fragmentation and reassembly

Transport Layer
Transport Layer
Host-host layer
Provides error-free, point-to-point connection between
hosts

User Datagram Protocol (UDP)


Unreliable, connectionless

Transmission Control Protocol (TCP)


Reliable, connection-oriented
Acknowledgements, sequencing, retransmission

Ports
Both TCP and UDP use 16-bit port numbers
A server application listens to a specific port for
connections
Ports used by popular applications are well-defined
SSH (22), SMTP (25), HTTP (80)
1-1023 are reserved (well-known)

Clients use ephemeral ports (OS dependent)

Name Service
Every node on the network normally has a
hostname in addition to an IP address
Domain Name System (DNS) maps IP
addresses to names
e.g. 128.122.81.155 is access1.cims.nyu.edu

DNS lookup utilities: nslookup, dig


Local name address mappings stored
in /etc/hosts

Sockets
Sockets provide access to TCP/IP on UNIX
systems
Sockets are communications endpoints
Invented in Berkeley UNIX
Allows a network connection to be opened as a
file (returns a file descriptor)

machine 1  <---- socket connection ---->  machine 2

Major Network Services


Telnet (Port 23)
Provides virtual terminal for remote user
The telnet program can also be used to connect to
other ports

FTP (Port 20/21)


Used to transfer files from one machine to another
Uses port 20 for data, 21 for control

SSH (Port 22)


For logging in and executing commands on
remote machines
Data is encrypted

Major Network Services cont.


SMTP (Port 25)
Host-to-host mail transport
Used by mail transfer agents (MTAs)

IMAP (Port 143)


Allow clients to access and manipulate emails
on the server

HTTP (Port 80)


Protocol for WWW

Ksh93: /dev/tcp
Files in the form

/dev/tcp/hostname/port result in a

socket connection to the given service:


exec 3<>/dev/tcp/smtp.cs.nyu.edu/25   # SMTP
print -u3 "EHLO cs.nyu.edu"
print -u3 "QUIT"
while IFS= read -u3
do
print -r "$REPLY"
done

HTTP
Hypertext Transfer Protocol
Use port 80

Language used by web browsers (IE,


Netscape, Firefox) to communicate with
web servers (Apache, IIS)

HTTP request:  "Get me this document"
HTTP response: "Here is your document"

Resources
Web servers host web resources, including
HTML files, PDF files, GIF files, MPEG movies,
etc.
Each web object has an associated MIME type
HTML document has type text/html
JPEG image has type image/jpeg

Web resource is accessed using a Uniform


Resource Locator (URL)
http://www.cs.nyu.edu:80/courses/fall06/G22.2245-001/index.html
protocol

host

port

resource

HTTP Transactions
HTTP request to web server
GET /v40images/nyu.gif HTTP/1.1
Host: www.nyu.edu

HTTP response to web client

HTTP/1.1 200 OK
Content-type: image/gif
Content-length: 3210

Sample HTTP Session


GET / HTTP/1.1
Host: www.cs.nyu.edu

request

HTTP/1.1 200 OK
Date: Wed, 19 Oct 2005 06:59:49 GMT
Server: Apache/2.0.49 (Unix) mod_perl/1.99_14 Perl/v5.8.4
mod_ssl/2.0.49 OpenSSL/0.9.7e mod_auth_kerb/4.13 PHP/5.0.0RC3
Last-Modified: Thu, 12 Sep 2002 17:09:03 GMT
Content-Length: 163
Content-Type: text/html; charset=ISO-8859-1

response

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title></title>
<meta HTTP-EQUIV="Refresh" CONTENT="0;URL=csweb/index.html">
<body>
</body>
</html>

Status Codes
Status code in the HTTP response indicates if a
request is successful
Some typical status codes:
200   OK
302   Found; Resource in different URI
401   Authorization required
403   Forbidden
404   Not Found

Gateways
Interface between resource and a web
server

HTTP --> Web Server --> Gateway --> resource

CGI
Common Gateway Interface is a standard interface for
running helper applications to generate dynamic
contents
Specify the encoding of data passed to programs

Allow HTML documents to be created on the fly


Transparent to clients
Client sends regular HTTP request
Web server receives HTTP request, runs CGI program, and
sends contents back in HTTP responses

CGI programs can be written in any language

CGI Diagram

HTTP request  --> Web Server --> spawn process --> Script
HTTP response <-- Web Server <-- Document      <-- Script

HTML
Document format used on the web
<html>
<head>
<title>Some Document</title>
</head>
<body>
<h2>Some Topics</h2>
This is an HTML document
<p>
This is another paragraph
</body>
</html>

HTML
HTML is a file format that describes a web
page.
These files can be made by hand, or
generated by a program
A good way to generate an HTML file is by
writing a shell script

Forms
HTML forms are used to collect user input
Data sent via HTTP request
Server launches CGI script to process data
<form method=POST
action=http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi>
Enter your query: <input type=text name=Search>
<input type=submit>
</form>

Input Types
Text Field
<input type=text name=zipcode>

Radio Buttons
<input type=radio name=size value=S> Small
<input type=radio name=size value=M> Medium
<input type=radio name=size value=L> Large

Checkboxes
<input type=checkbox name=extras value=lettuce> Lettuce
<input type=checkbox name=extras value=tomato> Tomato

Text Area
<textarea name=address cols=50 rows=4>
</textarea>

Submit Button
Submits the form for processing by the
CGI script specified in the form tag
<input type=submit value="Submit Order">

HTTP Methods
Determine how form data are sent to web
server
Two methods:
GET
Form variables stored in URL

POST
Form variables sent as content of HTTP request

Encoding Form Values


Browser sends form variable as name-value
pairs
name1=value1&name2=value2&name3=value3

Names are defined in form elements


<input type=text name=ssn maxlength=9>

Special characters are replaced with %## (2-digit
hex number), spaces replaced with +
e.g. 10/20 Wed is encoded as 10%2F20+Wed
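A minimal sketch of this encoding for just these two characters (real CGI libraries handle the full %XX set; the `encode` function name is made up):

```shell
#!/bin/sh
# Encode '/' as %2F and spaces as '+', as in the example above
encode()
{
    echo "$1" | sed -e 's,/,%2F,g' -e 's/ /+/g'
}
encode "10/20 Wed"     # prints: 10%2F20+Wed
```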

GET/POST examples
GET:
GET /cgi-bin/myscript.pl?name=Bill%20Gates&company=Microsoft HTTP/1.1
Host: www.cs.nyu.edu
POST:
POST /cgi-bin/myscript.pl HTTP/1.1
Host: www.cs.nyu.edu
other headers
name=Bill%20Gates&company=Microsoft

GET or POST?
GET method is useful for
Retrieving information, e.g. from a database
Embedding data in URL without form element

POST method should be used for forms with


Many fields or long fields
Sensitive information
Data for updating database

GET requests may be cached by clients
(browsers or proxies), but not POST requests

Parsing Form Input


Method stored in HTTP_METHOD
GET: Data encoded into QUERY_STRING
POST: Data in standard input (from body
of request)
Most scripts parse input into an
associative array
You can parse it yourself
Or use available libraries (better)

CGI Environment Variables

DOCUMENT_ROOT
HTTP_HOST
HTTP_REFERER
HTTP_USER_AGENT
HTTP_COOKIE
REMOTE_ADDR
REMOTE_HOST
REMOTE_USER
REQUEST_METHOD
SERVER_NAME
SERVER_PORT

CGI Script: Example

Part 1: HTML Form


<html>
<center>
<H1>Anonymous Comment Submission</H1>
</center>
Please enter your comment below which will
be sent anonymously to <tt>kornj@cs.nyu.edu</tt>.
If you want to be extra cautious, access this
page through <a
href="http://www.anonymizer.com">Anonymizer</a>.
<p>
<form action=cgi-bin/comment.cgi method=post>
<textarea name=comment rows=20 cols=80>
</textarea>
<input type=submit value="Submit Comment">
</form>
</html>

Part 2: CGI Script (ksh)


#!/home/unixtool/bin/ksh
. cgi-lib.ksh        # Read special functions to help parse
ReadParse
PrintHeader

print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj

print "<H2>You submitted the comment</H2>"
print "<pre>"
print -r -- "${Cgi.comment}"
print "</pre>"

Debugging
Debugging can be tricky, since error
messages don't always print well as HTML
One method: run interactively
$ QUERY_STRING='birthday=10/15/03'
$ ./birthday.cgi
Content-type: text/html
<html>
Your birthday is <tt>10/15/03</tt>.
</html>

How to get your script run


This can vary by web server type
http://www.cims.nyu.edu/systems/resources/webhosting/index.html

Typically, you give your script a name that


ends with .cgi
Give the script execute permission
Specify the location of that script in the
URL

CGI Security Risks


Sometimes CGI scripts run as owner of the
scripts
Never trust user input - sanity-check
everything
If a shell command contains user input, run
without shell escapes
Always encode sensitive information, e.g.
passwords
Also use HTTPS

Clean up - don't leave sensitive data around

CGI Benefits
Simple
Language independent
UNIX tools are good for this because
Work well with text
Integrate programs well
Easy to prototype
No compilation (CGI scripts)

Example: Find words in Dictionary
<form action=dict.cgi>
Regular expression: <input type=entry
name=re value=".*">
<input type=submit>
</form>

Example: Find words in Dictionary
#!/home/unixtool/bin/ksh
PATH=$PATH:.
. cgi-lib.ksh
ReadParse
PrintHeader
print "<H1> Words matching <tt>${Cgi.re}</tt> in the dictionary
</H1>\n";
print "<OL>"
grep "${Cgi.re}" /usr/dict/words | while read word
do
print "<LI> $word"
done
print "</OL>"

What is Perl?
Practical Extraction and Report Language
Scripting language created by Larry Wall in the
mid-80s
Functionality and speed somewhere between
low-level languages (like C) and high-level
ones (like shell)
Influence from awk, sed, and C Shell
Easy to write (after you learn it), but sometimes
hard to read
Widely used in CGI scripting

A Simple Perl Script


hello:
#!/usr/bin/perl -w     # -w turns on warnings
print "Hello, world!\n";

$ chmod a+x hello
$ ./hello
Hello, world!
$ perl -e 'print "Hello, world!\n";'
Hello, world!

Another Perl Script


$;=$_;$/='0#](.+,a()$=(\}$+_c2$sdl[h*du,(1ri)b$2](n} /
1)1tfz),}0(o{=4s)1rs(2u;2(u",bw-2b $
hc7s"tlio,tx[{ls9r11$e(1(9]q($,$2)=)_5{4*s{[9$,lh$2,_.
(ia]7[11f=*2308t$$)]4,;d/
{}83f,)s,65o@*ui),rt$bn;5(=_stf*0l[t(o$.o$rsrt.c!(i([$a]
$n$2ql/d(l])t2,$.+{i)$_.$zm+n[6t(e1+26[$;)+]61_l*,*)],
(41${/@20)/z1_0+=)(2,,4c*2)\5,h$4;$91r_,pa,)$[4r)
$=_$6i}tc}!,n}[h$]$t 0rd)_$';open(eval$/);
$_=<0>;for($x=2;$x<666;$a.=++$x){s}{{.|.}};push@@,$&;
$x==5?$z=$a:++$}}for(++$/..substr($a,1885))
{$p+=7;$;.=$@[$p%substr($a,$!,3)+11]}eval$;

Data Types
Basic types: scalar, lists, hashes
Support OO programming and user-defined types
What Type?
Type of variable determined by special
leading character
$foo    scalar
@foo    list
%foo    hash
&foo    function

Data types have separate name spaces

Scalars
Can be numbers
$num = 100;
$num = 223.45;
$num = -1.3e38;

Can be strings
$str = "unix tools";
$str = 'Who\'s there?';
$str = "good evening\n";
$str = "one\ttwo";

Backslash escapes and variable names are interpreted


inside double quotes

Special Scalar Variables


$0      Name of script
$_      Default variable
$$      Current PID
$?      Status of last pipe or system call
$!      System error message
$/      Input record separator
$.      Input record number
undef   Acts like 0 or empty string

Operators
Numeric: + - * / % **
String concatenation: .
$state = "New" . "York";    # NewYork

String repetition: x
print "bla" x 3;            # blablabla

Binary assignments:
$val = 2; $val *= 3;        # $val is 6
$state .= "City";           # NewYorkCity

Comparison Operators
Comparison                   Numeric   String
Equal                        ==        eq
Not Equal                    !=        ne
Greater than                 >         gt
Less than                    <         lt
Less than or equal to        <=        le
Greater than or equal to     >=        ge

Boolean Values
if ($ostype eq "unix") { ... }
if ($val) { ... }
No boolean data type
undef is false
0 is false; Non-zero numbers are true
"" and "0" are false; other strings are true
The unary not (!) negates the boolean
value

undef and defined


$f = 1;
while ($n < 10) {
# $n is undef at 1st iteration
$f *= ++$n;
}

Use defined to check if a value is undef


if (defined($val)) { }

Lists and Arrays

List: ordered collection of scalars


Array: Variable containing a list
Each element is a scalar variable
Indices are integers starting at 0

Array/List Assignment
@teams = ("Knicks", "Nets", "Lakers");
print $teams[0];        # prints Knicks
$teams[3] = "Celtics";  # add new elt
@foo = ();              # empty list
@nums = (1..100);       # list of 1-100
@arr = ($x, $y*6);
($a, $b) = ("apple", "orange");
($a, $b) = ($b, $a);    # swap $a $b
@arr1 = @arr2;

More About Arrays and Lists


Quoted words - qw
@planets = qw/ earth mars jupiter /;
@planets = qw{ earth mars jupiter };

Last element's index: $#planets
Not the same as the number of elements in the array!

Last element: $planets[-1]

Scalar and List Context


@colors = qw< red green blue >;
Array interpolated as string:
print "My favorite colors are @colors\n";
Prints "My favorite colors are red green blue"

Array in scalar context returns the number of


elements in the list
$num = @colors + 5;    # $num gets 8

Scalar expression in list context


@num = 88;    # a one-element list (88)

pop and push


push and pop: arrays used as stacks
push adds element to end of array
@colors = qw# red green blue #;
push(@colors, "yellow");
# same as
@colors = (@colors, "yellow");
push @colors, @more_colors;

pop removes the last element of the array and returns it:
$lastcolor = pop(@colors);

shift and unshift


shift and unshift: similar to push and pop on
the left side of an array
unshift adds elements to the beginning
@colors = qw# red green blue #;
unshift @colors, "orange";
First element is now "orange"

shift removes element from beginning


$c = shift(@colors); # $c gets "orange"

sort and reverse


reverse returns list with elements in reverse
order
@list1 = qw# NY NJ CT #;
@list2 = reverse(@list1); # (CT,NJ,NY)

sort returns list with elements in ASCII order


@day = qw/ tues wed thurs /;
@sorted = sort(@day); #(thurs,tues,wed)
@nums = sort 1..10; # (1 10 2 3 4 5 6 7 8 9)

reverse and sort do not modify their


arguments
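Because sort uses ASCII order by default, numbers need a comparison block — a standard Perl idiom not shown on the slide:

```perl
#!/usr/bin/perl -w
use strict;

my @nums = (10, 2, 33, 4);
my @ascii   = sort @nums;               # ASCII order: 10 2 33 4
my @numeric = sort { $a <=> $b } @nums; # numeric order: 2 4 10 33
print "@ascii\n";                       # 10 2 33 4
print "@numeric\n";                     # 2 4 10 33
```

The `{ $a <=> $b }` block compares the special variables $a and $b numerically; reversing them (`$b <=> $a`) sorts descending.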

Iterate over a List


foreach loops through a list of values
@teams = qw# Knicks Nets Lakers #;
foreach $team (@teams) {
print "$team win\n";
}

Value of control variable restored at end of loop


Synonym for the for keyword
$_ is the default
foreach (@teams) {
$_ .= " win\n";
print; # print $_
}

Hashes
Associative arrays - indexed by strings (keys)
$cap{Hawaii} = "Honolulu";
%cap = ( "New York", "Albany", "New Jersey",
"Trenton", "Delaware", "Dover" );

Can use => (big arrow or comma arrow) in


place of , (comma)
%cap = ( "New York"   => "Albany",
         "New Jersey" => "Trenton",
         "Delaware"   => "Dover" );

Hash Element Access


$hash{$key}
print $cap{"New York"};
print $cap{"New" . " York"};

Unwinding the hash


@cap_arr = %cap;
Gets unordered list of key-value pairs

Assigning one hash to another


%cap2 = %cap;
%cap_of = reverse %cap;
print $cap_of{Trenton};

# New Jersey

Hash Functions
keys returns a list of keys
@state = keys %cap;

values returns a list of values


@city = values %cap;

Use each to iterate over all (key, value) pairs


while ( ($state, $city) = each %cap )
{
print "Capital of $state is $city\n";
}

Hash Element Interpolation


Unlike a list, entire hash cannot be interpolated
print "%cap\n";
Prints the literal text %cap followed by a newline

Individual elements can


foreach $state (sort keys %cap) {
print "Capital of $state is $cap{$state}\n";
}

More Hash Functions


exists checks if a hash element has ever been
initialized
print "Exists\n" if exists $cap{Utah};
Can be used for array elements
A hash or array element can only be defined if it
exists

delete removes a key from the hash


delete $cap{"New York"};

Merging Hashes
Method 1: Treat them as lists
%h3 = (%h1, %h2);

Method 2 (save memory): Build a new hash by


looping over all elements
%h3 = ();
while (($k,$v) = each(%h1)) {
$h3{$k} = $v;
}
while (($k,$v) = each(%h2)) {
$h3{$k} = $v;
}
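A runnable version of Method 2 (the hash contents are made up for illustration; $k/$v follow the slide):

```perl
#!/usr/bin/perl -w
use strict;

my %h1 = (a => 1, b => 2);
my %h2 = (b => 3, c => 4);
my %h3 = ();
while (my ($k, $v) = each %h1) { $h3{$k} = $v; }
while (my ($k, $v) = each %h2) { $h3{$k} = $v; }
# %h2 wins on duplicate keys, just like %h3 = (%h1, %h2)
print "$h3{a} $h3{b} $h3{c}\n";   # 1 3 4
```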

Subroutines
sub myfunc { }
$name = "Jane";

sub print_hello {
print "Hello $name\n"; # global $name
}
&print_hello;
# print Hello Jane
print_hello;
# print Hello Jane
print_hello();
# print Hello Jane

Arguments
Parameters are assigned to the special array @_
Individual parameter can be accessed as $_[0],
$_[1],
sub sum {
my $x;
# private variable $x
foreach (@_) { # iterate over params
$x += $_;
}
return $x;
}
$n = &sum(3, 10, 22); # n gets 35

More on Parameter Passing


Any number of scalars, lists, and hashes can be
passed to a subroutine
Lists and hashes are flattened
func($x, @y, %z);
Inside func:
$_[0] is $x
$_[1] is $y[0]
$_[2] is $y[1], etc.

Scalars in @_ are implicit aliases (not copies) of


the ones passed: changing values of $_[0],
etc. changes the original variables

Return Values
The return value of a subroutine is the last
expression evaluated, or the value returned by
the return operator
sub myfunc {
my $x = 1;
$x + 2; #returns 3
}

sub myfunc {
my $x = 1;
return $x + 2;
}
Can also return a list: return @somelist;

If return is used without an expression (failure),


undef or () is returned depending on context
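Related to that context-dependent failure value, though not shown on the slide: the core builtin wantarray lets a subroutine detect whether it was called in list or scalar context. A minimal sketch:

```perl
#!/usr/bin/perl -w
use strict;

# wantarray is true in list context, false in scalar context
sub context {
    return wantarray ? "list" : "scalar";
}

my @l = context();   # list context: ("list")
my $s = context();   # scalar context: "scalar"
print "$l[0] $s\n";  # list scalar
```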

Lexical Variables
Variables can be scoped to the enclosing block
with the my operator
sub myfunc {
my $x;
my($a, $b) = @_;

# copy params

Can be used in any block, such as if block or


while block
Without enclosing block, the scope is the source file

use strict
The use strict pragma enforces some
good programming rules
All new variables need to be declared with
my
#!/usr/bin/perl -w
use strict;
$n = 1;

# <-- perl will complain

Another Subroutine Example


@nums = (1, 2, 3);
$num = 4;
@res = dec_by_one(@nums, $num);
# @res = (0, 1, 2, 3)
minus_one(@nums, $num);
# (@nums, $num) was (1, 2, 3, 4); now (0, 1, 2, 3)

sub dec_by_one {
my @ret = @_; # make a copy
for my $n (@ret) { $n-- }
return @ret;
}
sub minus_one {
for (@_) { $_-- }
}

Reading from STDIN


STDIN is the builtin filehandle to the std input
Use the line input operator around a file handle
to read from it
$line = <STDIN>; # read next line
chomp($line);

chomp removes trailing string that corresponds


to the value of $/ (usually the newline character)

Reading from STDIN example


while (<STDIN>) {
chomp;
print "Line $. ==> $_\n";
}
Line 1 ==> [Contents of line 1]
Line 2 ==> [Contents of line 2]

<>
Diamond operator <> helps Perl programs
behave like standard Unix utilities (cut, sed, ...)
Lines are read from list of files given as command
line arguments (@ARGV), otherwise from stdin
while (<>) {
chomp;
print "Line $. from $ARGV is $_\n";
}

$ ./myprog file1 file2 reads from file1, then file2; with no arguments, reads from standard input
$ARGV is the current filename

Filehandles
Use open to open a file for reading/writing
open LOG, "syslog";   # read
open LOG, "<syslog";  # read
open LOG, ">syslog";  # write
open LOG, ">>syslog"; # append

When you're done with a filehandle, close it
close LOG;
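The slides use global bareword filehandles; modern Perl style (not shown above) prefers lexical filehandles with the three-argument open, which keeps the mode separate from the filename. A self-contained sketch (the temp-file name is illustrative):

```perl
#!/usr/bin/perl -w
use strict;

my $file = "/tmp/demo_open.$$";    # illustrative temp-file name

# three-argument open: mode and filename are separate arguments
open(my $out, '>', $file) or die "Cannot create $file: $!";
print $out "hello\n";
close $out;

open(my $in, '<', $file) or die "Cannot open $file: $!";
my $line = <$in>;                  # read one line back
close $in;
unlink $file;                      # clean up
print $line;                       # hello
```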

Errors
When a fatal error is encountered, use die to
print out error message and exit program
die "Something bad happened\n" if ...;
Always check return value of open
open LOG, ">>syslog"
or die "Cannot open log: $!";

For non-fatal errors, use warn instead


warn "Temperature is below 0!"
if $temp < 0;

Reading from a File


open MSG, "/var/log/messages"
or die "Cannot open messages: $!\n";
while (<MSG>) {
chomp;
# do something with $_
}
close MSG;

Reading Whole File


In scalar context, <FH> reads the next line
$line = <LOG>;

In list context, <FH> reads all remaining lines


@lines = <LOG>;

Undefine $/ to read the rest of file as a string


undef $/;
$all_lines = <LOG>;
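Undefining $/ globally affects every later read; wrapping it with local in a block (a common idiom, though not on the slide) restores it automatically. A self-contained sketch with an illustrative temp file:

```perl
#!/usr/bin/perl -w
use strict;

my $file = "/tmp/demo_slurp.$$";   # illustrative temp-file name
open(my $out, '>', $file) or die "Cannot create $file: $!";
print $out "line one\nline two\n";
close $out;

my $all;
{
    local $/;                      # $/ is restored when the block ends
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    $all = <$in>;                  # whole file as one string
    close $in;
}
unlink $file;
print length($all), "\n";          # 18
```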

Writing to a File
open LOG, ">/tmp/log"
or die "Cannot create log: $!";
print LOG "Some log messages\n";
printf LOG "%d entries processed.\n",
$num;
close LOG;
(no comma after filehandle)

File Tests examples


die "The file $filename is not readable"
if ! -r $filename;
warn "The file $filename is not owned by
you" unless -o $filename;
print "This file is old" if -M $filename >
365;

File Tests list


-r  File or directory is readable
-w  File or directory is writable
-x  File or directory is executable
-o  File or directory is owned by this user
-e  File or directory exists
-z  File exists and has zero size
-s  File or directory exists and has nonzero size (value in bytes)

File Tests list


-f  Entry is a plain file
-d  Entry is a directory
-l  Entry is a symbolic link
-M  Modification age (in days)
-A  Access age (in days)

$_ is the default operand

Manipulating Files and Dirs

unlink removes files


unlink "file1", "file2"
or warn "failed to remove file: $!";

rename renames a file


rename "file1", "file2";

link creates a new (hard) link


link "file1", "file2"
or warn "can't create link: $!";

symlink creates a soft link


symlink "file1", "file2" or warn "...";

Manipulating Files and Dirs cont.


mkdir creates directory
mkdir "mydir", 0755
or warn "Cannot create mydir: $!";
rmdir removes empty directories
rmdir "dir1", "dir2", "dir3";
chmod modifies permissions on file or directory
chmod 0600, "file1", "file2";

if - elsif - else
if ( $x > 0 ) {
print "x is positive\n";
}
elsif ( $x < 0 ) {
print "x is negative\n";
}
else {
print "x is zero\n";
}

unless
Like the opposite of if
unless ($x < 0) {
print "$x is non-negative\n";
}
unlink $file unless -A $file < 100;

while and until


while ($x < 100) {
$y += $x++;
}
until is like the opposite of while
until ($x >= 100) {
$y += $x++;
}

for
for (init; test; incr) { }
# sum of squares of 1 to 5
for ($i = 1; $i <= 5; $i++) {
$sum += $i*$i;
}

next
next skips the remainder of the current
iteration (like continue in C)
# only print non-blank lines
while (<>) {
if ( $_ eq "\n" ) { next; }
else { print; }
}

last
last exits loop immediately (like break in
C)
# print up to first blank line
while (<>) {
if ( $_ eq "\n" ) { last; }
else { print; }
}

Logical AND/OR
Logical AND : &&
if (($x > 0) && ($x < 10)) { }

Logical OR : ||
if (($x < 0) || ($x > 0)) { }

Both short-circuit: the second


expression is evaluated only if necessary

Ternary Operator
Same as the ternary operator (?:) in C
expr1 ? expr2 : expr3
Like if-then-else: If expr1 is true, expr2 is
used; otherwise expr3 is used
$weather = ($temp > 50) ? "warm" : "cold";

Regular Expressions
Use EREs (egrep style)
Plus the following character classes

\w
word characters: [A-Za-z0-9_]
\d
digits: [0-9]
\s
whitespaces: [\f\t\n\r ]
\b
word boundaries
\W, \D, \S, \B are complements of the corresponding
classes above

Can use \t to denote a tab

Backreferences
Support backreferences
Subexpressions are referred to using \1, \2,
etc. in the RE and $1, $2, etc. outside the
RE
if (/^this (red|blue|green) (bat|ball) is \1/)
{
($color, $object) = ($1, $2);
}
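The pattern above becomes runnable when matched against a concrete string (the input string is made up for illustration):

```perl
#!/usr/bin/perl -w
use strict;

my ($color, $object);
# \1 inside the RE requires the same color to repeat;
# $1 and $2 capture the subexpressions outside the RE
if ("this red ball is red" =~ /^this (red|blue|green) (bat|ball) is \1/) {
    ($color, $object) = ($1, $2);
}
print "$color $object\n";   # red ball
```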

Matching
Pattern match operator: /RE/ is a shortcut for m/RE/
Returns true if there is a match
Match against $_
Can also use m(RE), m<RE>, m!RE!, etc.
if (/^\/usr\/local\//) { }
if (m%/usr/local/%) { }

Case-insensitive match
if (/new york/i) { };

Matching cont.
To match an RE against something other than
$_, use the binding operator =~
if ($s =~ /\bblah/i) {
print "Found blah!";
}
!~ negates the match
while (<STDIN> !~ /^#/) { }

Variables are interpolated inside REs


if (/^$word/) { }

Substitutions
Sed-like search and replace with s///
s/red/blue/;
$x =~ s/\w+$/$`/;

m/// does not modify variable; s/// does

Global replacement with /g


s/(.)\1/$1/g;

Transliteration operator: tr/// or y///


tr/A-Z/a-z/;
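Small runnable demos of the three operators above (the strings are illustrative):

```perl
#!/usr/bin/perl -w
use strict;

my $s = "red fish, red bird";
$s =~ s/red/blue/;            # first match only: "blue fish, red bird"
$s =~ s/red/blue/g;           # now all: "blue fish, blue bird"

my $t = "aabbcc";
$t =~ s/(.)\1/$1/g;           # collapse doubled characters: "abc"

my $u = "Hello";
(my $v = $u) =~ tr/A-Z/a-z/;  # transliterate a copy; $u is unchanged
print "$s | $t | $v\n";       # blue fish, blue bird | abc | hello
```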

RE Functions
split string using RE (whitespace by default)
@fields = split /:/, "::ab:cde:f";
# gets ("", "", "ab", "cde", "f")

join strings into one


$str = join "-", @fields;

# gets "--ab-cde-f"

grep something from a list


Similar to UNIX grep, but not limited to using RE
@selected = grep(!/^#/, @code);
@matched = grep { $_>100 && $_<150 } @nums;

Modifying elements in returned list actually modifies


the elements in the original list
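The three functions above, combined in one runnable sketch (the number list is made up):

```perl
#!/usr/bin/perl -w
use strict;

my @fields = split /:/, "::ab:cde:f";   # ("", "", "ab", "cde", "f")
my $str    = join "-", @fields;         # "--ab-cde-f"

# grep with a block: keep values strictly between 100 and 150
my @big = grep { $_ > 100 && $_ < 150 } (50, 120, 200, 149);
print "$str\n";                         # --ab-cde-f
print "@big\n";                         # 120 149
```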

Running Another program


Use the system function to run an external
program
With one argument, the shell is used to run the
command
Convenient when redirection is needed
$status = system("cmd1 args > file");

To avoid the shell, pass system a list


$status = system($prog, @args);
die "$prog exited abnormally: $?" unless $status == 0;

Capturing Output
If output from another program needs to
be collected, use the backticks
my $files = `ls *.c`;
Collect all output lines into a single string
my @files = `ls *.c`;
Each element is an output line

The shell is invoked to run the command

Environment Variables
Environment variables are stored in the
special hash %ENV
$ENV{PATH} = "/usr/local/bin:" .
$ENV{PATH};

Example: Word Frequency


#!/usr/bin/perl -w
# Read a list of words (one per line) and
# print the frequency of each word
use strict;
my(@words, %count, $word);
chomp(@words = <STDIN>); # read and chomp all lines
for $word (@words) {
$count{$word}++;
}
for $word (keys %count) {
print "$word was seen $count{$word} times.\n";
}

Good Ways to Learn Perl


a2p
Translates an awk program to Perl

s2p
Translates a sed script to Perl

perldoc
Online Perl documentation
$ perldoc perldoc    # perldoc man page
$ perldoc perlintro  # Perl introduction
$ perldoc -f sort    # Perl sort function man page
$ perldoc CGI        # CGI module man page

Modules
Perl modules are libraries of reusable
code with specific functionalities
Standard modules are distributed with
Perl, others can be obtained from
Include modules in your program with use,
e.g. use CGI incorporates the CGI module
Each module has its own namespace

CGI Programming

Forms
HTML forms are used to collect user input
Data sent via HTTP request
Server launches CGI script to process data
<form method="POST"
action="http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi">
Enter your query: <input type="text" name="Search">
<input type="submit">
</form>

Input Types
Text Field
<input type="text" name="zipcode">

Radio Buttons
<input type="radio" name="size" value="S"> Small
<input type="radio" name="size" value="M"> Medium
<input type="radio" name="size" value="L"> Large

Checkboxes
<input type="checkbox" name="extras" value="lettuce"> Lettuce
<input type="checkbox" name="extras" value="tomato"> Tomato

Text Area
<textarea name="address" cols="50" rows="4">

</textarea>

Submit Button
Submits the form for processing by the
CGI script specified in the form tag
<input type="submit" value="Submit Order">

HTTP Methods
Determine how form data are sent to web
server
Two methods:
GET
Form variables stored in URL

POST
Form variables sent as content of HTTP request

Encoding Form Values


Browser sends form variables as name-value
pairs
name1=value1&name2=value2&name3=value3

Names are defined in form elements


<input type="text" name="ssn" maxlength="9">

Special characters are replaced with %## (2-digit
hex number); spaces are replaced with +
e.g. "11/8 Wed" is encoded as 11%2F8+Wed
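The %## encoding can be reproduced with a short substitution. This is a simplified sketch (the `form_encode` name is made up; real code would use the URI::Escape module):

```perl
#!/usr/bin/perl -w
use strict;

# Simplified form-encoding sketch: escape everything except
# letters, digits, and spaces; then turn spaces into '+'.
sub form_encode {
    my $s = shift;
    $s =~ s/([^A-Za-z0-9 ])/sprintf("%%%02X", ord($1))/ge;
    $s =~ tr/ /+/;
    return $s;
}

print form_encode("11/8 Wed"), "\n";   # 11%2F8+Wed
```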

HTTP GET/POST examples


GET:
GET /cgi-bin/myscript.pl?name=Bill%20Gates&
company=Microsoft HTTP/1.1
HOST: www.cs.nyu.edu
POST:
POST /cgi-bin/myscript.pl HTTP/1.1
HOST: www.cs.nyu.edu
other headers
name=Bill%20Gates&company=Microsoft

GET or POST?
GET method is useful for
Retrieving information, e.g. from a database
Embedding data in URL without form element

POST method should be used for forms with


Many fields or long fields
Sensitive information
Data for updating database

GET requests may be cached by clients


browsers or proxies, but not POST requests

Parsing Form Input


Method stored in REQUEST_METHOD
GET: Data encoded into QUERY_STRING
POST: Data in standard input (from body
of request)
Most scripts parse input into an
associative array
You can parse it yourself
Or use available libraries (better)

CGI Script: Example

Part 1: HTML Form


<html>
<center>
<H1>Anonymous Comment Submission</H1>
</center>
Please enter your comment below which will
be sent anonymously to <tt>kornj@cs.nyu.edu</tt>.
If you want to be extra cautious, access this
page through <a
href="http://www.anonymizer.com">Anonymizer</a>.
<p>
<form action="cgi-bin/comment.cgi" method="post">
<textarea name="comment" rows="20" cols="80">
</textarea>
<input type="submit" value="Submit Comment">
</form>
</html>

Part 2: CGI Script (ksh)


#!/home/unixtool/bin/ksh
. cgi-lib.ksh    # Read special functions to help parse
ReadParse
PrintHeader

print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj

print "<H2>You submitted the comment</H2>"
print "<pre>"
print -r -- "${Cgi.comment}"
print "</pre>"

Perl CGI Module


Interface for parsing and interpreting query
strings passed to CGI scripts
Methods for generating HTML
Methods to handle errors in CGI scripts
Two interfaces: procedural and OO
Ask for the procedural interface:
use CGI qw(:standard);

A Perl CGI Script


#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
my $bday = param("birthday");
# Print headers (text/html is the default)
print header(-type => 'text/html');
# Print <html>, <head>, <title>, <body> tags etc.
print start_html("Birthday");
# Your HTML body
print "Your birthday is $bday.\n";
# Print </body></html>
print end_html();

Debugging Perl CGI Scripts


Debugging CGI scripts is tricky - error messages
don't always come up on your browser
Check if the script compiles
$ perl -wc cgiScript

Run script with test data


$ perl -w cgiScript prod=MacBook price=1800
Content-Type: text/html
<html>

</html>

How to get your script run


This can vary by web server type
http://www.cims.nyu.edu/systems/resources/webhosting/index.html

Typically, you give your script a name that


ends with .cgi and/or put it in a special
directory (e.g. cgi-bin)
Give the script execute permission
Specify the location of that script in the
URL

CGI Security Risks


Sometimes CGI scripts run as owner of the
scripts
Never trust user input - sanity-check
everything
If a shell command contains user input, run
without shell escapes
Always encode sensitive information, e.g.
passwords
Also use HTTPS

Clean up - don't leave sensitive data around

CGI Benefits
Simple
Language independent
UNIX tools are good for this because
Work well with text
Integrate programs well
Easy to prototype
No compilation (CGI scripts)
