Awk By Example
David L. Totsch
Technical Consultant
Hewlett-Packard Company
david.totsch@hp.com
One of the attractive and powerful features of HP-UX is its strong set of
data manipulation tools such as grep(1), sed(1), shell scripts and
awk(1). One of the most powerful of these tools is also one of the least
used. This session has been designed to quickly introduce users,
programmers and system administrators to the abilities of the awk(1)
programming language. The approach will be to present this tool in the
way a technical user usually learns -- by looking at how someone
already using the tool is using it and trying progressively more complex
implementations. By adding awk(1) to their skill set, everyone writing
shell scripts, parsing data, converting programs, writing prototype
systems and creating formatted reports can take further advantage of
the productivity and functionality gains HP-UX has to offer. Awk(1) is
also a great place to learn the proper techniques and mindset as a
precursor to Perl. This tutorial will provide a functional introduction to
awk programming that can be put to use immediately upon return to your
HP-UX system.
Awk(1) is a UNIX utility that defies description. Of those that have used
its unique capabilities, many describe it as a text pattern scanner -- it is
very effective at scanning text for a pattern and taking an action on it.
Others describe it as nothing more than a more sophisticated version of
sed(1). Those that have delved a little further into its capabilities have
found awk(1) to be a very capable programming language. Some
frustrated users describe awk(1) as nothing more than a very complex
system command (and one to be avoided because of its complexity).
In between these two extremes, you will find some users who view it as
nothing more than a handy report processor. Which assessment should
you believe? They are all correct! As we will see, all of those
descriptions, and then some, fit awk(1).
Occasionally, someone mentions Perl when they hear me talk about
awk(1). Perl is a good tool. It has the advantage of being available on
multiple platforms. But, I believe you should apply the appropriate tool
to the job. I have also seen some people try to write Perl code who
lacked a firm grasp of Regular Expressions. Those same coders were
not satisfied with the Perl utility. It is my belief, and intent, that teaching
awk(1) will build the Regular Expression skills you will need to be
effective with Perl. This is because awk(1) is “regular expression
driven”: awk(1) scans its input for records that match regular
expressions and takes the actions you specify on them. You will quickly
realize the relationship simply by using awk(1).
Origins
One of the most useful commands in early UNIX was ed(1). OK, quit
laughing. It is true. Original UNIX was not so full-featured as the
versions we are accustomed to. Probably the most common instruction
passed to ed(1) was “g/RE/p”. The meaning? The “g” stands for global
-- every line in the file. “RE” means a regular expression. The “p”
instructs ed(1) to print the line the regular expression is found on. The
usage was so common that a separate command was created: grep(1).
The name was derived from the instruction passed to ed(1): global
regular expression print.
Many users found ed(1) and grep(1) very useful. Eventually, the full file
scan and edit capability of ed(1) made its way into a filter-style command
named sed(1). But, like all good computer users, UNIX users longed for
more.
Three gentlemen who had collaborated on other portions of UNIX, Aho,
Weinberger and Kernighan, also longed for a more full-featured editor.
They put their heads together and created awk(1). The result was
something more than a highly programmable editor -- it was more of a
mini programming language. Without a succinct name to give the utility,
their initials stuck. [No ego here.]
$ lpstat -a
lj4si accepting requests since Jan 28 17:29
colorjet accepting requests since Jan 28 17:29
mopier accepting requests since Jan 28 17:29
forms accepting requests since Jan 28 17:29
$
$ lpstat -a | awk '{print $1}'
lj4si
colorjet
mopier
forms
Contriving the first example of an awk(1) script was not simple. This
one is benign. The desire is to list the available printers on a system. A
good definition would be to list those printers that are currently
accepting new print jobs: lpstat -a. But, that output also prints the date
and time that the printer began accepting print jobs. That is very
extraneous information. To get just the print queue names, we need to
pull just the first word of each line. Above is the simple, in-line awk(1)
script to do just that.
There is always more than one way to do anything in UNIX, and the
astute student will point out that cut(1) could be used to perform the
same task. This is true. But, pretend with me for a moment that the
delimiter was not just a single space, but the text had been justified
using multiple spaces. How would cut(1) have reacted? Well, the same
since we wanted the first field. But, consider for a moment that you had
wanted the second field. Telling cut(1) that the field delimiter is a space
falls apart -- if there are multiple consecutive spaces, cut(1) sees a null
field between each space. We probably will not get what we want. By
default, awk(1) uses white space (spaces and tabs) as field delimiters.
Also by default, it will span across white space. This means that
multiple spaces and tabs are seen as a single field delimiter. This
behavior is extremely handy when dealing with data files.
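To see the difference, try a contrived fragment like this sketch (the
sample text is made up):

$ echo "alpha   beta" | cut -d' ' -f2

$ echo "alpha   beta" | awk '{print $2}'
beta

With three spaces between the words, cut(1) reports the second of several
null fields (a blank line), while awk(1) spans the run of spaces and
finds “beta”.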
$ lpstat -a
lj4si accepting requests since Jan 28 17:29
colorjet accepting requests since Jan 28 17:29
mopier accepting requests since Jan 28 17:29
forms accepting requests since Jan 28 17:29
$
$ lpstat -a | awk '{print $5,$6,$7,$1}'
Jan 28 17:29 lj4si
Jan 28 17:29 colorjet
Jan 28 17:29 mopier
Jan 28 17:29 forms
OK, we did a little cut(1) bashing -- or rather, we simply pointed out its limits.
Here is another way cut(1) fails us: we cannot use it to re-order fields
on a line. Cut(1) will only give us the fields in the original order. On the
other hand, awk(1) can report the fields in any order we wish.
In the example above, we have listed the printers accepting jobs but
have listed the date and time fields first, instead of last. Now we have
the advantage of re-ordering the fields reported.
-F fs    field separator
Let's pause for a moment and turn our attention from awk(1) capabilities
to how awk(1) programs can be invoked. [I have called them programs
here instead of scripts. Please note the instructions are interpreted, not
compiled.]
There are two methods to choose from when passing a script to awk(1):
1) you can pass the program as an argument on the command line or 2)
you can pass the name of the file containing your awk(1) program. For
short programs, you will get in the habit of putting the text on the
command-line. For longer, more involved, more permanent programs,
you will want to store them in files.
When you pass the program as part of the command-line, you need to
enclose the program in single-quotes. This is not an awk(1)
requirement; it is a shell requirement. As we will see later, some of the
syntax will catch the attention of the shell (the shell will want to interpret
it instead of passing the text to awk(1)). The single-quotes tell the shell
to keep its hands off.
You can have awk(1) read data from either standard-input or data files.
You can pass multiple data files, too. Awk(1) will work on them in the
order given. Later, we will see that awk(1) can also detect when the
input changes to another file.
Finally, we see that we can specify the field delimiter on the command-
line. Be forewarned that this turns off the spanning that is the default.
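As a quick illustration of both invocation styles and the -F option (the
program file name here is hypothetical; the password file is simply a
handy colon-delimited input):

$ awk -F: '{print $1}' /etc/passwd

$ echo '{print $1}' > first.awk
$ awk -F: -f first.awk /etc/passwd

Both commands print the first colon-delimited field (the login name) of
every record in /etc/passwd.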
Command-Line Illustrations
Now that we can successfully select where awk(1) finds its program and
what data it will work with, let's jump into learning more about the
language. You have been exposed to some examples without much
description.
We have seen that the fields on lines of data can be referenced by the
field number preceded with a dollar-sign. For example, $5 refers to
field number five (numbering begins at one). To reference the entire
record we use $0. There are two internal variables that are very handy.
“NF” refers to the number of fields on the current data line. “NR” refers
to the number of records, or lines, read so far.
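The slide's program was along these lines (a sketch reconstructed from
the description that follows):

{ print NR, $0, "FIELD COUNT:", NF }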
Now, to explain the example above. You can see the contents of the
data file for yourself. The awk(1) program prints out the following: the
number of records read so far, the entire data record, the literal string
“FIELD COUNT:” and the number of fields in the record. I think we can
find ways to use these capabilities to our advantage.
NF is an integer
(the number of fields in this record)
There is one more feature you need to know about internal variables to
complete your understanding. Let's take a look at the number-of-fields
variable NF. By itself, NF returns an integer number indicating the
number of fields in the current record. If you precede NF with a dollar-
sign, the information returned is the contents of the last field in the
record. For example, if there are four fields in the record, NF will return
“4” and $NF will return the contents of the fourth field.
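A quick demonstration from the shell:

$ echo one two three four | awk '{print NF, $NF}'
4 four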
Just to drive the point home, let's look at the variable NR. This reports
the number of records read so far: 1, 2, 3, 4, up to the final record. If we
reference $NR, we will receive the contents of the first field on the first
record, the contents of the second field on the second record, the
contents of the third field on the third record and so on. An example
follows. What if the number of records exceeds the number of fields on
the line? The field doesn't exist and awk(1) is happy to report a null
value.
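Here it is, run over a small three-line input (the sample data is made up):

$ printf "a b c\nd e f\ng h i\n" | awk '{print $NR}'
a
e
i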
As discussed above, this is not a very useful example, but it illustrates
the fact that the dollar-sign instructs awk(1) to return the contents of the
field referenced.
PATTERN {action}
PATTERN {action}
.
.
.
Now let's take a look at the structure of an awk(1) program. The general
syntax is rather simple: you specify a pattern to look for and an action or
actions to take when that pattern is matched. Just remember this: each
record is tested against the patterns in order, and the actions for an
earlier pattern run before the later patterns are tested. This means that
actions which modify a record can impact the match test for subsequent
pattern-action pairs.
/regular_expression/
$1 ~ /regular_expression/
$1 !~ /regular_expression/
NF != 3
$2 == 5
$1 == "literal_string"
$2 >= 4 || $3 <= 20
BEGIN
END
So far, all of our examples have lacked a pattern. This means that the
action is to be applied to every line. Here is a list of potential patterns
you could use.
The first example matches the entire record against the regular expression.
It is shorthand for “$0 ~ /regular_expression/”.
The second example limits the test to the first field. The next example
is the inverse (when the first field is not matched by the RE).
“NF != 3” looks for any record where the number of fields is anything
other than three. If you want records where the second field is exactly
five, use “$2 == 5”. Or, if you are not looking for numbers, you can look
for literal strings.
The next example is rather complicated. It restricts the actions to
records where the second field is four or more, or the third field is twenty
or less.
The last two examples are special patterns. “BEGIN” denotes actions
you want to take before you start reading records. This is handy to
initialize variables and set the field delimiters. The “END” pattern is for
actions that you want to take after you have read your last record. This
is great for printing out running totals and other summary information.
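A small sketch putting BEGIN and END to work (the seven-field count is
just an assumption that the input is a standard password file):

BEGIN { FS=":" }
NF != 7 { bad++ }
END { print bad+0, "malformed records" }

Run against /etc/passwd, this counts records that do not have the
expected seven colon-delimited fields.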
/BOZO/,/PROFANDY/

This last form is a range pattern: it selects every record from one
matching the first regular expression through the next record matching
the second, inclusive.
• Output
• Data Manipulation
• Flow Control
Now that we can successfully select records, what kind of actions can
we take? As we have seen in the simple examples, we can output
selected records, fields or information in internal variables. We also
have the ability to manipulate data. This means that we can change the
contents of fields or modify defined variables. Record selection can be
seen as a type of data flow, but we also have some flow control at our
disposal once we select a record to process.
Output, as in most high-level languages, is fairly simple. There are two
types you can select from: 1) the no-frills basic method and 2) the
format-controlled type that is patterned after C.
The basic method allows you to pass arguments to the “print”
command. Arguments may be quoted strings or variables. When they
have commas between them, the Output Field Separator is placed
between them. There are examples on the pages that follow.
The format controlled method provides much more control over how
strings and numbers are output. If you are already familiar with how C
programs output, you have an advantage. Similar to C, the “printf”
statement accepts arguments; the first argument is a format string used
to control the other arguments. Examples are on the following pages.
BEGIN { OFS="|" }
{ print $1 $2 "test",$3,"wow" }
onetwotest|three|wow
In this example we see the use of the “print” command. The awk(1)
program first sets the Output Field Separator to a vertical bar. Then we
select every line and output the first, second and third fields along with
two literal strings.
Notice that there is not a comma between the first and second fields.
Therefore, the output has no delimiter between them. Then we have the
literal string “test”. Since, on the print statement, it is followed by a
comma, there is an OFS in the output. I think you can follow the rest.
The print command also automatically generates a new line.
BEGIN { OFS="|" }
{ printf("%s%s%s%s\n",$1,$3,OFS,"wow") }
onethree|wow
In this example, we see the usage of the “printf” command. Again, the
OFS has been set to a vertical bar. Also notice that OFS has been
used as an argument to the printf statement; with printf, separators are
not output automatically. On the next page we will see the formatting
constructs available. You should also notice that you have to intentionally
specify a new-line.
                            |1234567890|
printf("|%c|",100)          |d|
printf("|%5d|",100)         |  100|
printf("|%7.2f|",100.5)     | 100.50|
printf("|%s|","MySystm")    |MySystm|
printf("|%-10s|","MySystm") |MySystm   |
printf("|%10s|","MySystm")  |   MySystm|
printf("|%5s|","MySystm")   |MySystm|
                            |1234567890|
Here we see a short list of printf format strings. All of the examples are
constructed such that a vertical-bar is output before and after the
example.
%c outputs a single character. See how our example translated the
ASCII value for 100 to a “d”.
%d is for decimal. A number between the percent-sign and the “d”
indicates the width of the field. Your decimal number will be right-
justified.
%f is for floating-point numbers. When a field width is used, the first
number is the minimum total number of characters to output. The second
number is the number of digits printed after the decimal point. Note that
the 2 in “%7.2f” does not mean two significant digits; it means two digits
to the right of the decimal point.
%s indicates a string. When no width is specified, enough space is
used to output the entire string, no more, no less. When a width is
specified, the string is right-justified. If there is insufficient width for the
string, the entire string is still output! That could mess up your pretty
output justification. If the width is preceded with a dash, the text is left-
justified in the field.
• Built-In Functions
–String
–Numeric
• Operators
Here are the built-in functions that manipulate strings. You can make
substitutions, find things, and get the length of a string. You can also
take a string, let's say a field, and break it up into an array. For
example, you may have a field in a record that contains hours, minutes
and seconds, colon-delimited. Using the function split, you could break
up the field into an array called “time”, with the first element the hours,
the second element the minutes and so on.
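A sketch of that very case (the field position and the array name come
from the description above):

{
    split($2, time, ":")
    print "hours:", time[1], "minutes:", time[2], "seconds:", time[3]
}

This assumes the colon-delimited timestamp sits in the second field of
each record.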
{
    # print characters 8-11 of the record, then characters 4-7
    print substr($0,8,4) substr($0,4,4)
}
{
    # length of the hypotenuse of a right triangle with legs $1 and $2
    print sqrt( ($1 * $1) + ($2 * $2) )
}
or
{
    print sqrt( ($1 ^ 2) + ($2 ^ 2) )
}
BEGIN { maxwidth=79 }   # assumed here; the variant below sets it the same way
{
    if ( $1 <= maxwidth )
    {
        # draw a bar of $1 asterisks
        for (i=$1; i > 0; i--) printf("*")
        printf("\n")
    }
    else
    {
        printf("Value %d at line %d > %d\n",$1,NR,maxwidth)
    }
}
BEGIN {
    maxwidth=79
}
$1 <= maxwidth {
    for (i=$1; i > 0; i--) printf("*")
    printf("\n")
}
$1 > maxwidth {
    printf("Value %d at line %d > %d\n",$1,NR,maxwidth)
}
awk -f prog.awk a=- c=2 d=3 data0 a=k b=y c=x data1 data2
Arrays within awk(1) are very powerful. You can manipulate array
structures with the POSIX and Korn shells, but awk(1) throws in some
very interesting and useful features.
Awk(1) does not require you to define any structure, not even arrays.
Just remember that awk(1) treats everything as a string and you will be
just fine. This also means that array subscripts are strings. I don't
simply mean that 52 is treated as a string; a valid subscript could be
something like “/etc/passwd” or “mares eat oats”. What the subscript
references is also a string, one that can be interpreted as a number if
you wish.
You can also trick awk(1) into multi-dimensional arrays. You simply
concatenate two or more strings to make the reference.
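A sketch of the trick, counting how often each pair of first and second
fields occurs (the comma is just a literal separator character):

{ count[$1 "," $2]++ }
END { for (pair in count) print pair, count[pair] }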
As we have already discussed, you have a test for array membership.
Since array references are hashed strings, the “in” operator can easily
tell you if an array reference exists for a given string. We will see an
example in a moment.
{ array[NR]=$0 }   # save every record, indexed by record number
END {
    for (i=NR;i>0;i--) print array[i]   # print them back in reverse order
}
BEGIN {
    primaries["red"]=1
    primaries["blue"]=2
    primaries["yellow"]=3
}
{
    if ( $1 in primaries ) print $1,"is a primary color"
    else print $1,"is NOT a primary color"
}
You could also avoid hard-coding the values of the array. For instance,
what if the first file passed to the program contained the valid names for
primary colors? Subsequent files would contain the data to check.
Your pattern selection would be something like this: FNR == NR. FNR is
the number of records read from the current file, while NR counts records
across all input; the two are equal only while the first file is being read.
When you move to the second data file, NR continues to increment while
FNR restarts at one.
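A sketch of that approach (the file names in the invocation are
hypothetical):

FNR == NR { primaries[$1]=1; next }   # load valid color names from the first file
{
    if ( $1 in primaries ) print $1,"is a primary color"
    else print $1,"is NOT a primary color"
}

Invoked as: awk -f primaries.awk colors.list data1 data2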
{ cntarray[$3]=cntarray[$3]+1 }
END { for ( i in cntarray ) print i, cntarray[i]}
Yet another array example. Can you guess what this one is doing?
On every record, we are using the contents of field number three as an
array reference and incrementing the number that reference points to by
one. Remember, when we use an array reference for the first time, the
value it points to is null, or zero, unless we have otherwise set it. When
we are done with the data, we process a for statement. Notice the
unique syntax. This time, the “in” operator steps the variable through
the unique subscripts of the array “cntarray”. The output is then each
array reference followed by the number of times it occurred in the input.
Result: we have counted the number of times individual values appear
in the third field.
Could you find an application for this rather short awk(1) program? Let's
say you have a data set in which the third field contains the part number
for an order. The awk(1) program above could quickly tell you what parts
you need and how many of each...
Let's take on a more aggressive example. I was once given a data file
that contained data from a performance monitor. Each line of the file
had a date and time stamp, the application name and the percentage of
CPU the application was using during that interval. The intervals were
on a five-minute basis, but some intervals might be missing (the
application had no usage or the collector tool was inoperative).
There were two requirements: average the CPU percentage across
every hour and then output a single line for every hour that had values
for every application.
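The averaging pass might be sketched like this (the field positions --
date, time, application, CPU percentage -- are assumptions based on the
description above):

{
    hour = substr($2, 1, 2)      # keep only the hour of an HH:MM stamp
    key = $1 "," hour "," $3     # date,hour,application
    sum[key] += $4
    cnt[key]++
}
END {
    for (k in sum) print k "," sum[k] / cnt[k]
}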
The output of our averaging is comma-delimited: we get the date
reference, the hour of the day, the application name and the average.
The awk(1) program that lists the data all on one line for each day/hour
combination is a bit more tricky.
On every record we populate three arrays. The array “datapoint”
records the CPU percentage for every element. Notice the reference is
the date, hour and application name. The array “timepoint” simply
records all of the date/hour combinations. The array “appname” simply
captures the names of all applications (the array references end up
being a set of application names that appeared).
The fun begins when we start the output. First, we want a line that
tells us the heading for each column (the list of applications).
Remember that we do not know what order they will be output in, but
each pass through the list will be in the same order. After printing the
column headings we need to print the real data.
To print the data, we need to range across the date/hour references, so
we use the array “timepoint”. For each “tick” in “timepoint” (an hour of a
given day), we need to range across all applications, printing the data. If
a reference does not exist in the array (for instance, the application did
not have a record for that hour), awk(1) will conveniently return a zero.
Once this is finished, we are done. Just remember that the data will need
to be sorted to get it into date/hour order.
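A sketch of that second pass, reconstructed from the description (it
assumes the comma-delimited date,hour,application,average output of
the first pass):

BEGIN { FS="," }
{
    datapoint[$1 "," $2 "," $3] = $4   # value for each date/hour/application
    timepoint[$1 "," $2] = 1           # every date/hour seen
    appname[$3] = 1                    # every application seen
}
END {
    printf("date,hour")
    for (app in appname) printf(",%s", app)   # column headings
    printf("\n")
    for (tick in timepoint) {
        printf("%s", tick)
        for (app in appname) printf(",%s", datapoint[tick "," app] + 0)
        printf("\n")
    }
}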
Finally, after passing through two awk(1) programs, we have the data
we want. Notice that the date/hour references are not in order. You will
need a quick pass through sort(1) to fix that.
When you start using awk(1), you will probably begin with short
programs that are easily specified as a command-line argument. After a
while, you will begin to write longer programs that no longer fit
comfortably on the command line and are better stored in a separate file.
A little later, you will find yourself writing utilities exclusively in awk(1).
Writing a short wrapper shell script to do nothing more than make a call
to awk(1) isn't very efficient. Think about it: you will spawn a shell to
interpret your script, which will do nothing more than set up a call to
awk(1). How do we get the shell script out of the picture?
The shell script used to call awk(1) can be eliminated. The loader
process (the process used to load an executable into memory) knows to
examine the first bytes of the file. If they happen to be “#!”, the loader
will use the remainder of that line as the program to interpret the text.
Some call this capability the courtesy loader. We can leverage this for
awk(1) and make our programs self-contained.
What the loader will do is take the text after the “#!” verbatim and then
append the pathname of the file as an argument. Notice that the first line
in the example is a call to awk(1) using the “-f” option. Recall that this
option specifies the file awk(1) is to interpret. But, the line has nothing
following the option! This is because the loader will provide the full path
name for it. Therefore, we have a self-contained awk(1) program.
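A self-contained awk(1) program of that shape might look like this sketch
(the interpreter path, the file name and the queue-counting logic are
assumptions, not the original slide's program):

#!/usr/bin/awk -f
# count how many queues are accepting requests in "lpstat -a" output
$2 == "accepting" { accepting++ }
END { printf("%d queues accepting requests\n", accepting + 0) }

Saved as count_queues.awk and made executable, it runs as:
$ lpstat -a | ./count_queues.awk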
The program above summarizes the output of lpstat(1).
Multiple-Line Records
DATA (mlr.data):

David L. Totsch
Hewlett-Packard Co.
Englewood, CO

Interex
P.O. Box 3429
Sunnyvale, CA

Hewlett-Packard
Cupertino, CA

Program (mlr.awk):

BEGIN {
    RS=""
    FS="\n"
}
{
    print $1
}

RUN:

$ awk -f mlr.awk mlr.data
David L. Totsch
Interex
Hewlett-Packard
$

Setting RS to the empty string puts awk(1) into multi-line (paragraph)
mode: blank lines separate records. With FS set to a newline, each line
of an address block becomes a field, so “print $1” prints the first line of
each block.
• experiment
• walk before you run
• you might find AWK useful for:
– making sure every record of a file has the same
field count
– manipulating numeric information
– creating reports from raw data
– gathering specific information from reports
– data conversions
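A sketch of the data-checking one-liner discussed in the notes below
(the field count of five comes from that description):

NF != 5 { print NR, NF, $0 } # flag records lacking exactly five fields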
{ total=total + $2 }
END { print total } # total of field two
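The maximum-value scanner discussed below might be sketched like
this (the variable names are assumptions):

$3 > max { max=$3; maxrec=$0 } # remember the record with the largest third field
END { print maxrec } # print it after the last record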
{
for (i=NF;i>0;i--) printf("%s ",$i)
printf("\n")
} # reverse the order of fields on each line
The first example awk(1) program scans the input for lines that do not
have exactly five fields (sort of a data checking function). If the line
does not have exactly five fields, the line number, number of fields and
entire record are printed out.
The second example totals the contents of field number two. How could
you modify the same awk(1) program to print an average? HINT: NR
still holds the number of records read even during the “END”
processing.
The next example scans the input records looking for a maximum value
in field number three. It then prints that record. Notice how the pattern
scanning ability has been used to make the evaluation. The first record
sets “max” to the value of the third field in the first record. Remember
that awk(1) will automatically return null if the variable is not set.
The last example is also more interesting than it is useful. In previous
examples, we have seen an awk(1) program that will reverse the input,
making the last line first. This program takes each line and reverses the
order of the fields (last field first, and so on). How about reversing each
line character by character? Here it is:
{
    for(I=length($0);I>0;I--) printf("%c",substr($0,I,1))
    printf("\n")
}