Awk By Example
David L. Totsch
Technical Consultant
Hewlett-Packard Company
david.totsch@hp.com
One of the attractive and powerful features of HP-UX is its strong set of
data manipulation tools such as grep(1), sed(1), shell scripts and
awk(1). One of the most powerful of these tools is also one of the least
used. This session has been designed to quickly introduce users,
programmers and system administrators to the abilities of the awk(1)
programming language. The approach will be to present this tool in the
way a technical user usually learns -- by looking at how someone
already using the tool is using it and trying progressively more complex
implementations. By adding awk(1) to their skill set, everyone writing
shell scripts, parsing data, converting programs, writing prototype
systems and creating formatted reports can take further advantage of
the productivity and functionality gains HP-UX has to offer. Awk(1) is
also a great place to learn the proper techniques and mindset as a
precursor to Perl. This tutorial will provide a functional introduction to
awk programming that can be put to use immediately upon return to your
HP-UX system.
Awk(1) is a UNIX utility that defies description. Of those that have used
its unique capabilities, many describe it as a text pattern scanner -- it is
very effective at scanning text for a pattern and taking an action on it.
Others describe it as nothing more than a more sophisticated version of
sed(1). Those that have delved a little further into its capabilities have
found awk(1) to be a very capable programming language. Some
frustrated users describe awk(1) as nothing more than a very complex
system command (and one to be avoided because of its complexity).
In between these two extremes, you will find some users who view it as
nothing more than a handy report processor. Which assessment should
you believe? They are all correct! As we will see, all of those
descriptions, and then some, fit awk(1).
Occasionally, someone mentions Perl when they hear me talk about
awk(1). Perl is a good tool. It has the advantage of being available on
multiple platforms. But, I believe you should apply the appropriate tool
to the job. I have also seen some people try to write Perl code who
lacked a firm grasp of Regular Expressions. Those same coders were
not satisfied with the Perl utility. It is my belief, and intent, that teaching
awk(1) will build the Regular Expression skills you will need to be
effective with Perl. This is because awk(1) is “regular expression
driven”: awk(1) scans its input for records that match regular
expressions and takes the actions you specify on them. You will quickly
realize the relationship simply by using awk(1).
Origins
One of the most useful commands in early UNIX was ed(1). OK, quit
laughing. It is true. Original UNIX was not so full-featured as the
versions we are accustomed to. Probably the most common instruction
passed to ed(1) was “g/RE/p”. The meaning? The “g” stands for global
-- every line in the file. “RE” means a regular expression. The “p”
instructs ed(1) to print the line the regular expression is found on. The
usage was so common that a separate command was created: grep(1).
The name was derived from the instruction passed to ed(1): global
regular expression print.
Many users found ed(1) and grep(1) very useful. Eventually, the full file
scan and edit capability of ed(1) made its way into a filter-style command
named sed(1). But, like all good computer users, UNIX users longed for
more.
Three gentlemen who had collaborated on other portions of UNIX, Aho,
Weinberger and Kernighan, also longed for a more full-featured editor.
They put their heads together and created awk(1). The result was
something more than a highly programmable editor -- it was more of a
mini programming language. Without a succinct name to give the utility,
their initials stuck. [No ego here.]
$ lpstat -a
lj4si accepting requests since Jan 28 17:29
colorjet accepting requests since Jan 28 17:29
mopier accepting requests since Jan 28 17:29
forms accepting requests since Jan 28 17:29
$
$ lpstat -a | awk '{print $1}'
lj4si
colorjet
mopier
forms
Contriving the first example of an awk(1) script was not simple. This
one is benign. The desire is to list the available printers on a system. A
good definition would be to list those printers that are currently
accepting new print jobs: lpstat -a. But, that output also prints the date
and time that the printer began accepting print jobs. That is very
extraneous information. To get just the print queue names, we need to
pull just the first word of each line. Above is the simple, in-line awk(1)
script to do just that.
There is always more than one way to do anything in UNIX, and the
astute student will point out that cut(1) could be used to perform the
same task. This is true. But, pretend with me for a moment that the
delimiter was not just a single space, but the text had been justified
using multiple spaces. How would cut(1) have reacted? Well, the same
since we wanted the first field. But, consider for a moment that you had
wanted the second field. Telling cut(1) that the field delimiter is a space
falls apart -- if there are multiple consecutive spaces, cut(1) sees a null
field between each space. We probably will not get what we want. By
default, awk(1) uses white space (spaces and tabs) as field delimiters.
Also by default, it will span across white space. This means that
multiple spaces and tabs are seen as a single field delimiter. This
behavior is extremely handy when dealing with data files.
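To see the difference, try a contrived fragment like this sketch (the
sample text is made up):

$ echo "alpha   beta" | cut -d' ' -f2

$ echo "alpha   beta" | awk '{print $2}'
beta

With three spaces between the words, cut(1) reports the second of several
null fields (a blank line), while awk(1) spans the run of spaces and
finds “beta”.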
$ lpstat -a
lj4si accepting requests since Jan 28 17:29
colorjet accepting requests since Jan 28 17:29
mopier accepting requests since Jan 28 17:29
forms accepting requests since Jan 28 17:29
$
$ lpstat -a | awk '{print $5,$6,$7,$1}'
Jan 28 17:29 lj4si
Jan 28 17:29 colorjet
Jan 28 17:29 mopier
Jan 28 17:29 forms
OK, we did a little cut(1) bashing -- or rather, we simply pointed out its limits.
Here is another way cut(1) fails us: we cannot use it to re-order fields
on a line. Cut(1) will only give us the fields in the original order. On the
other hand, awk(1) can report the fields in any order we wish.
In the example above, we have listed the printers accepting jobs but
have listed the date and time fields first, instead of last. Now we have
the advantage of re-ordering the fields reported.
-F fs    field separator
Let's pause for a moment and turn our attention from awk(1) capabilities
to how awk(1) programs can be invoked. [I have called them programs
here instead of scripts. Please note the instructions are interpreted, not
compiled.]
There are two methods to choose from when passing a script to awk(1):
1) you can pass the program as an argument on the command line or 2)
you can pass the name of the file containing your awk(1) program. For
short programs, you will get in the habit of putting the text on the
command-line. For longer, more involved, more permanent programs,
you will want to store them in files.
When you pass the program as part of the command-line, you need to
enclose the program in single-quotes. This is not an awk(1)
requirement; it is a shell requirement. As we will see later, some of the
syntax will catch the attention of the shell (the shell will want to interpret
it instead of passing the text to awk(1)). The single-quotes tell the shell
to keep its hands off.
You can have awk(1) read data from either standard-input or data files.
You can pass multiple data files, too. Awk(1) will work on them in the
order given. Later, we will see that awk(1) can also detect when the
input changes to another file.
Finally, we see that we can specify the field delimiter on the command-
line. Be forewarned that this turns off the spanning that is the default.
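As a quick illustration of both invocation styles and the -F option (the
program file name here is hypothetical; the password file is simply a
handy colon-delimited input):

$ awk -F: '{print $1}' /etc/passwd

$ echo '{print $1}' > first.awk
$ awk -F: -f first.awk /etc/passwd

Both commands print the first colon-delimited field (the login name) of
every record in /etc/passwd.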
Command-Line Illustrations
Now that we can successfully select where awk(1) finds its program and
what data it will work with, let's jump into learning more about the
language. You have been exposed to some examples without much
description.
We have seen that the fields on lines of data can be referenced by the
field number preceded with a dollar-sign. For example, $5 refers to
field number five (numbering begins at one). To reference the entire
record we use $0. There are two internal variables that are very handy.
“NF” refers to the number of fields on the current data line. “NR” refers
to the number of records, or lines, read so far.
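The slide's program was along these lines (a sketch reconstructed from
the description that follows):

{ print NR, $0, "FIELD COUNT:", NF }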
Now, to explain the example above. You can see the contents of the
data file for yourself. The awk(1) program prints out the following: the
number of records read so far, the entire data record, the literal string
“FIELD COUNT:” and the number of fields in the record. I think we can
find ways to use these capabilities to our advantage.
NF is an integer
(the number of fields in this record)
There is one more feature you need to know about internal variables to
complete your understanding. Let's take a look at the number-of-fields
variable NF. By itself, NF returns an integer number indicating the
number of fields in the current record. If you precede NF with a dollar-
sign, the information returned is the contents of the last field in the
record. For example, if there are four fields in the record, NF will return
“4” and $NF will return the contents of the fourth field.
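A quick demonstration from the shell:

$ echo one two three four | awk '{print NF, $NF}'
4 four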
Just to drive the point home, let's look at the variable NR. This reports
the number of records read so far: 1, 2, 3, 4, up to the final record. If we
reference $NR, we will receive the contents of the first field on the first
record, the contents of the second field on the second record, the
contents of the third field on the third record and so on. An example
follows. What if the number of records exceeds the number of fields on
the line? The field doesn't exist and awk(1) is happy to report a null
value.
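Here it is, run over a small three-line input (the sample data is made up):

$ printf "a b c\nd e f\ng h i\n" | awk '{print $NR}'
a
e
i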
As discussed above, this is not a very useful example, but it illustrates
the fact that the dollar-sign instructs awk(1) to return the contents of the
field referenced.
PATTERN {action}
PATTERN {action}
.
.
.
Now let's take a look at the structure of an awk(1) program. The general
syntax is rather simple: you specify a pattern to look for and an action or
actions to take when that pattern is matched. Just remember this: each
record is tested against the patterns in order, and the actions for an
earlier pattern run before the later patterns are tested. This means that
actions which modify a record can impact the match test for subsequent
pattern-action pairs.
/regular_expression/
$1 ~ /regular_expression/
$1 !~ /regular_expression/
NF != 3
$2 == 5
$1 == "literal_string"
$2 >= 4 || $3 <= 20
BEGIN
END
So far, all of our examples have lacked a pattern. This means that the
action is to be applied to every line. Here is a list of potential patterns
you could use.
The first example matches the entire record against the regular expression.
It is shorthand for “$0 ~ /regular_expression/”.
The second example limits the test to the first field. The next example
is the inverse (when the first field is not matched by the RE).
“NF != 3” looks for any record where the number of fields is anything
other than three. If you want records where the second field is exactly
five, use “$2 == 5”. Or, if you are not looking for numbers, you can look
for literal strings.
The next example is rather complicated. It restricts the actions to
records where the second field is four or more, or the third field is twenty
or less.
The last two examples are special patterns. “BEGIN” denotes actions
you want to take before you start reading records. This is handy to
initialize variables and set the field delimiters. The “END” pattern is for
actions that you want to take after you have read your last record. This
is great for printing out running totals and other summary information.
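A small sketch putting BEGIN and END to work (the seven-field count is
just an assumption that the input is a standard password file):

BEGIN { FS=":" }
NF != 7 { bad++ }
END { print bad+0, "malformed records" }

Run against /etc/passwd, this counts records that do not have the
expected seven colon-delimited fields.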
/BOZO/,/PROFANDY/

This last form is a range pattern: it selects every record from one
matching the first regular expression through the next record matching
the second, inclusive.
• Output
• Data Manipulation
• Flow Control
Now that we can successfully select records, what kind of actions can
we take? As we have seen in the simple examples, we can output
selected records, fields or information in internal variables. We also
have the ability to manipulate data. This means that we can change the
contents of fields or modify defined variables. Record selection can be
seen as a type of data flow, but we also have some flow control at our
disposal once we select a record to process.
Output, as in most high-level languages, is fairly simple. There are two
types you can select from: 1) the no-frills basic method and 2) the
format-controlled type that is patterned after C.
The basic method allows you to pass arguments to the “print”
command. Arguments may be quoted strings or variables. When they
have commas between them, the Output Field Separator is placed
between them. There are examples on the pages that follow.
The format controlled method provides much more control over how
strings and numbers are output. If you are already familiar with how C
programs output, you have an advantage. Similar to C, the “printf”
statement accepts arguments; the first argument is a format string used
to control the other arguments. Examples are on the following pages.
BEGIN { OFS="|" }
{ print $1 $2 "test",$3,"wow" }
onetwotest|three|wow
In this example we see the use of the “print” command. The awk(1)
program first sets the Output Field Separator to a vertical bar. Then we
select every line and output the first, second and third fields along with
two literal strings.
Notice that there is not a comma between the first and second fields.
Therefore, the output has no delimiter between them. Then we have the
literal string “test”. Since, on the print statement, it is followed by a
comma, there is an OFS in the output. I think you can follow the rest.
The print command also automatically generates a new line.
BEGIN { OFS="|" }
{ printf("%s%s%s%s\n",$1,$3,OFS,"wow") }
onethree|wow
In this example, we see the usage of the “printf” command. Again, the
OFS has been set to a vertical bar. Also notice that OFS has been
used as an argument to the printf statement; with printf, separators are
not output automatically. On the next page we will see the formatting
constructs available. You should also notice that you have to intentionally
specify a new-line.
                            |1234567890|
printf("|%c|",100)          |d|
printf("|%5d|",100)         |  100|
printf("|%7.2f|",100.5)     | 100.50|
printf("|%s|","MySystm")    |MySystm|
printf("|%-10s|","MySystm") |MySystm   |
printf("|%10s|","MySystm")  |   MySystm|
printf("|%5s|","MySystm")   |MySystm|
                            |1234567890|
Here we see a short list of printf format strings. All of the examples are
constructed such that a vertical-bar is output before and after the
example.
%c outputs a single character. See how our example translated the
ASCII value for 100 to a “d”.
%d is for decimal. A number between the percent-sign and the “d”
indicates the width of the field. Your decimal number will be right-
justified.
%f is for floating-point numbers. When a field width is used, the first
number is the minimum total number of characters to output. The second
number is the number of digits printed after the decimal point. Note that
the 2 in “%7.2f” does not mean two significant digits; it means two digits
to the right of the decimal point.
%s indicates a string. When no width is specified, enough space is
used to output the entire string, no more, no less. When a width is
specified, the string is right-justified. If there is insufficient width for the
string, the entire string is still output! That could mess up your pretty
output justification. If the width is preceded with a dash, the text is left-
justified in the field.
• Built-In Functions
–String
–Numeric
• Operators
Here are the built-in functions that manipulate strings. You can make
substitutions, find things, and get the length of a string. You can also
take a string, let's say a field, and break it up into an array. For
example, you may have a field in a record that contains hours, minutes
and seconds, colon-delimited. Using the function split, you could break
up the field into an array called “time”, with the first element the hours,
the second element the minutes and so on.
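A sketch of that very case (the field position and the array name come
from the description above):

{
    split($2, time, ":")
    print "hours:", time[1], "minutes:", time[2], "seconds:", time[3]
}

This assumes the colon-delimited timestamp sits in the second field of
each record.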
{
    # print characters 8-11 of the record, then characters 4-7
    print substr($0,8,4) substr($0,4,4)
}
{
    # length of the hypotenuse of a right triangle with legs $1 and $2
    print sqrt( ($1 * $1) + ($2 * $2) )
}
or
{
    print sqrt( ($1 ^ 2) + ($2 ^ 2) )
}
BEGIN { maxwidth=79 }   # assumed here; the variant below sets it the same way
{
    if ( $1 <= maxwidth )
    {
        # draw a bar of $1 asterisks
        for (i=$1; i > 0; i--) printf("*")
        printf("\n")
    }
    else
    {
        printf("Value %d at line %d > %d\n",$1,NR,maxwidth)
    }
}
BEGIN {
    maxwidth=79
}
$1 <= maxwidth {
    for (i=$1; i > 0; i--) printf("*")
    printf("\n")
}
$1 > maxwidth {
    printf("Value %d at line %d > %d\n",$1,NR,maxwidth)
}
awk -f prog.awk a=- c=2 d=3 data0 a=k b=y c=x data1 data2
Arrays within awk(1) are very powerful. You can manipulate array
structures with the POSIX and Korn shells, but awk(1) throws in some
very interesting and useful features.
Awk(1) does not require you to define any structure, not even arrays.
Just remember that awk(1) treats everything as a string and you will be
just fine. This also means that array subscripts are strings. I don't
simply mean that 52 is treated as a string; a valid subscript could be
something like “/etc/passwd” or “mares eat oats”. What the subscript
references is also a string, one that can be interpreted as a number if
you wish.
You can also trick awk(1) into multi-dimensional arrays. You simply
concatenate two or more strings to make the reference.
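A sketch of the trick, counting how often each pair of first and second
fields occurs (the comma is just a literal separator character):

{ count[$1 "," $2]++ }
END { for (pair in count) print pair, count[pair] }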
As we have already discussed, you have a test for array membership.
Since array references are hashed strings, the “in” operator can easily
tell you if an array reference exists for a given string. We will see an
example in a moment.
{ array[NR]=$0 }   # save every record, indexed by record number
END {
    for (i=NR;i>0;i--) print array[i]   # print them back in reverse order
}
BEGIN {
    primaries["red"]=1
    primaries["blue"]=2
    primaries["yellow"]=3
}
{
    if ( $1 in primaries ) print $1,"is a primary color"
    else print $1,"is NOT a primary color"
}
You could also avoid hard-coding the values of the array. For instance,
what if the first file passed to the program contained the valid names for
primary colors? Subsequent files would contain the data to check.
Your pattern selection would be something like this: FNR == NR. FNR is
the number of records read from the current file, while NR counts records
across all input; the two are equal only while the first file is being read.
When you move to the second data file, NR continues to increment while
FNR restarts at one.
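A sketch of that approach (the file names in the invocation are
hypothetical):

FNR == NR { primaries[$1]=1; next }   # load valid color names from the first file
{
    if ( $1 in primaries ) print $1,"is a primary color"
    else print $1,"is NOT a primary color"
}

Invoked as: awk -f primaries.awk colors.list data1 data2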
{ cntarray[$3]=cntarray[$3]+1 }
END { for ( i in cntarray ) print i, cntarray[i]}
Yet another array example. Can you guess what this one is doing?
On every record, we are using the contents of field number three as an
array reference and incrementing the number that reference points to by
one. Remember, when we use an array reference for the first time, the
value it points to is null, or zero, unless we have otherwise set it. When
we are done with the data, we process a for statement. Notice the
unique syntax. This time, the “in” operator steps the variable through
the unique subscripts of the array “cntarray”. The output is then each
array reference followed by the number of times it occurred in the input.
Result: we have counted the number of times individual values appear
in the third field.
Could you find an application for this rather short awk(1) program? Let's
say you have a data set in which the third field contains the part number
for an order. The awk(1) program above could quickly tell you what parts
you need and how many of each...
Let's take on a more aggressive example. I was once given a data file
that contained data from a performance monitor. Each line of the file
had a date and time stamp, the application name and the percentage of
CPU the application was using during that interval. The intervals were
on a five-minute basis, but some intervals might be missing (the
application had no usage or the collector tool was inoperative).
There were two requirements: average the CPU percentage across
every hour and then output a single line for every hour that had values
for every application.
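The averaging pass might be sketched like this (the field positions --
date, time, application, CPU percentage -- are assumptions based on the
description above):

{
    hour = substr($2, 1, 2)      # keep only the hour of an HH:MM stamp
    key = $1 "," hour "," $3     # date,hour,application
    sum[key] += $4
    cnt[key]++
}
END {
    for (k in sum) print k "," sum[k] / cnt[k]
}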
The output of our averaging is comma-delimited: we get the date
reference, the hour of the day, the application name and the average.
The awk(1) program that lists the data all on one line for each day/hour
combination is a bit more tricky.
On every record we populate three arrays. The array “datapoint”
records the CPU percentage for every element. Notice the reference is
the date, hour and application name. The array “timepoint” simply
records all of the date/hour combinations. The array “appname” simply
captures the names of all applications (the array references end up
being a set of application names that appeared).
The fun begins when we start the output. First, we want a line that
tells us the heading for each column (the list of applications).
Remember that we do not know what order they will be output in, but
each pass through the list will be in the same order. After printing the
column headings we need to print the real data.
To print the data, we need to range across the date/hour references, so
we use the array “timepoint”. For each “tick” in “timepoint” (an hour of a
given day), we need to range across all applications, printing the data. If
a reference does not exist in the array (for instance, the application did
not have a record for that hour), awk(1) will conveniently return a zero.
Once this is finished, we are done. Just remember that the data will need
to be sorted to get it into date/hour order.
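A sketch of that second pass, reconstructed from the description (it
assumes the comma-delimited date,hour,application,average output of
the first pass):

BEGIN { FS="," }
{
    datapoint[$1 "," $2 "," $3] = $4   # value for each date/hour/application
    timepoint[$1 "," $2] = 1           # every date/hour seen
    appname[$3] = 1                    # every application seen
}
END {
    printf("date,hour")
    for (app in appname) printf(",%s", app)   # column headings
    printf("\n")
    for (tick in timepoint) {
        printf("%s", tick)
        for (app in appname) printf(",%s", datapoint[tick "," app] + 0)
        printf("\n")
    }
}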
Finally, after passing through two awk(1) programs, we have the data
we want. Notice that the date/hour references are not in order. You will
need a quick pass through sort(1) to fix that.
When you start using awk(1), you will probably begin with short
programs that are easily specified as a command-line argument. After a
while, you will begin to write longer programs that no longer fit
comfortably on the command line and are better stored in a separate file.
A little later, you will find yourself writing utilities exclusively in awk(1).
Writing a short wrapper shell script to do nothing more than make a call
to awk(1) isn't very efficient. Think about it: you will spawn a shell to
interpret your script, which will do nothing more than set up a call to
awk(1). How do we get the shell script out of the picture?
The shell script used to call awk(1) can be eliminated. The loader
process (the process used to load an executable into memory) knows to
examine the first bytes of the file. If they happen to be “#!”, the loader
will use the remainder of that line as the program to interpret the text.
Some call this capability the courtesy loader. We can leverage this for
awk(1) and make our programs self-contained.
What the loader will do is take the text after the “#!” verbatim and then
append the pathname of the file as an argument. Notice that the first line
in the example is a call to awk(1) using the “-f” option. Recall that this
option specifies the file awk(1) is to interpret. But, the line has nothing
following the option! This is because the loader will provide the full path
name for it. Therefore, we have a self-contained awk(1) program.
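A self-contained awk(1) program of that shape might look like this sketch
(the interpreter path, the file name and the queue-counting logic are
assumptions, not the original slide's program):

#!/usr/bin/awk -f
# count how many queues are accepting requests in "lpstat -a" output
$2 == "accepting" { accepting++ }
END { printf("%d queues accepting requests\n", accepting + 0) }

Saved as count_queues.awk and made executable, it runs as:
$ lpstat -a | ./count_queues.awk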
The program above summarizes the output of lpstat(1).
Multiple-Line Records
DATA (mlr.data):

David L. Totsch
Hewlett-Packard Co.
Englewood, CO

Interex
P.O. Box 3429
Sunnyvale, CA

Hewlett-Packard
Cupertino, CA

Program (mlr.awk):

BEGIN {
    RS=""
    FS="\n"
}
{
    print $1
}

RUN:

$ awk -f mlr.awk mlr.data
David L. Totsch
Interex
Hewlett-Packard
$

Setting RS to the empty string puts awk(1) into multi-line (paragraph)
mode: blank lines separate records. With FS set to a newline, each line
of an address block becomes a field, so “print $1” prints the first line of
each block.
• experiment
• walk before you run
• you might find AWK useful for:
– making sure every record of a file has the same
field count
– manipulating numeric information
– creating reports from raw data
– gathering specific information from reports
– data conversions
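A sketch of the data-checking one-liner discussed in the notes below
(the field count of five comes from that description):

NF != 5 { print NR, NF, $0 } # flag records lacking exactly five fields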
{ total=total + $2 }
END { print total } # total of field two
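The maximum-value scanner discussed below might be sketched like
this (the variable names are assumptions):

$3 > max { max=$3; maxrec=$0 } # remember the record with the largest third field
END { print maxrec } # print it after the last record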
{
for (i=NF;i>0;i--) printf("%s ",$i)
printf("\n")
} # reverse the order of fields on each line
The first example awk(1) program scans the input for lines that do not
have exactly five fields (sort of a data checking function). If the line
does not have exactly five fields, the line number, number of fields and
entire record are printed out.
The second example totals the contents of field number two. How could
you modify the same awk(1) program to print an average? HINT: NR
still holds the number of records read even during the “END”
processing.
The next example scans the input records looking for a maximum value
in field number three. It then prints that record. Notice how the pattern
scanning ability has been used to make the evaluation. The first record
sets “max” to the value of the third field in the first record. Remember
that awk(1) will automatically return null if the variable is not set.
The last example is also more interesting than it is useful. In previous
examples, we have seen an awk(1) program that will reverse the input,
making the last line first. This program takes each line and reverses the
order of the fields (last field first, and so on). How about reversing each
line character by character? Here it is:
{
    for(I=length($0);I>0;I--) printf("%c",substr($0,I,1))
    printf("\n")
}