Awk

From Leo's Notes
Last edited on 3 July 2021, at 03:56.

Awk is an excellent text parser and scripting language. It can be run inline as part of a Unix command pipeline, which makes it extremely useful for adding more complicated behavior to a shell script.

Introduction to Awk

Awk's scripting language is structured as a sequence of patterns and actions:

# A pattern definition
PATTERN  { ACTION1; ACTION2; }

# A real example
/^[a-z0-9]/ { print $1 }

# Another example
/^[0-9]\s*/ { Sum += $1 }

A pattern is typically a regular expression, a boolean expression, or a special pattern. Regular expressions are always enclosed in forward slashes. Boolean expressions combine conditions with the && (and), || (or), and ! (not) operators. Special patterns are awk built-ins that trigger on specific conditions: BEGIN runs before the first line is read and END runs after the last line.

# To run an action before the first input, use BEGIN
BEGIN { Sum = 0 }

# Get a number and add it to sum using regular expressions
/^[0-9]+ / { Sum += $1 }

# Combine a regular expression with a comparison using boolean operators
/^[0-9]+ / && $1 >= 0 { print "Adding " $1 }

# To run an action after the last input, use END
END { print "Sum is " Sum }
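
To see how BEGIN, a regular expression pattern, and END fit together, here is a minimal sketch that feeds three made-up lines into awk. The sample input and the sum_output variable are assumptions chosen for illustration; only lines starting with a number followed by a space are added.

```shell
# Hypothetical input: two lines start with a number, one does not
sum_output=$(printf '10 apples\noranges\n5 pears\n' | awk '
BEGIN      { Sum = 0 }
/^[0-9]+ / { Sum += $1 }
END        { print "Sum is " Sum }
')
echo "$sum_output"
```

This should print Sum is 15, since the oranges line does not match the pattern.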

Fields

Fields are the 'words' of the current line, delimited by one or more whitespace characters. $1 references the first word of the current line, $2 the second word, and so on. $0 represents the entire line.

# Here's an example line:
#
# $1     $2            $3          $4  $5 $6   $7
# GET    /index.html   HTTP/1.0    Dec 20 2020 01:23:45
/index\.html/ {
   print "Someone accessed index.html on", $4, $5, $6, $7
}

Tip: You may alter the current line being processed by assigning a modified value to $0.
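
A small sketch of what assigning to $0 does, assuming a made-up request line: awk re-splits the new record, so NF and the numbered fields reflect the replacement text. The rewritten variable name is only for illustration.

```shell
rewritten=$(echo 'GET /index.html HTTP/1.0' | awk '
{
   # Replace the whole record; the fields are split again from the new text
   $0 = "POST /upload"
   print NF ": " $2
}')
echo "$rewritten"
```

The original line had three fields; after the assignment NF is 2 and $2 is /upload.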

Scripting

Functions

Functions are defined in the C style:

function do_something(parameter1, parameter2, ...) {
   ACTIONS
}

They can then be called like any other function when defining actions.

/^GET/ { do_something($0) }
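
For a complete, runnable version of the above, here is a sketch with a hypothetical describe() function and made-up request lines. Note that awk requires the opening parenthesis to follow the function name directly when calling a user-defined function.

```shell
labels=$(printf 'GET /index.html\nPOST /form\nGET /about.html\n' | awk '
# describe() is a hypothetical helper that builds a label from two fields
function describe(method, path) {
   return method " request for " path
}

/^GET/ { print describe($1, $2) }
')
echo "$labels"
```

Only the two GET lines produce output.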

Variables

Awk has built-in variables:

  • NR - current line number
  • NF - number of fields in current line
  • OFS - output field separator
  • FS - field separator, specified with -F
  • RS - record separator
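
A quick sketch showing several of these variables at once, using two made-up colon-separated lines. FS is set with -F:, OFS is set in BEGIN, and the commas in print are what insert OFS between the values.

```shell
report=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '
BEGIN { OFS = " | " }
{ print NR, NF, $1 }
')
echo "$report"
```

Each output line shows the line number, the field count, and the first field, for example 1 | 3 | root.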

You can also pass your own custom variables with the -v var=value parameter. Eg:

$ echo | awk \
   -v hostname="$(hostname)" \
   -v timestamp="$(date +%s)" '
{
   printf("Hostname is %s at time %s\n", hostname, timestamp)
}
'
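
Another common use of -v is handing a shell variable to a pattern. In this sketch, Threshold and limit are hypothetical names and the input numbers are made up; values passed with -v that look numeric are compared numerically.

```shell
Threshold=10
over=$(printf '5\n12\n30\n' | awk -v limit="$Threshold" '$1 > limit { print $1 }')
echo "$over"
```

This prints 12 and 30, the values above the threshold.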


Inlining with Bash

When writing a bash script, you may inline Awk as part of a command pipeline by passing the Awk script within a set of single quotes. We use single quotes because we do not want Bash to treat Awk fields as variables (Eg. values such as $2 should not be replaced).

#!/bin/bash

cat /etc/hosts | awk '
/^server/ { print $2 }
'

Tip: To include a single quote inside the script, use '"'"'. This closes the single-quoted string, adds a ' wrapped in double quotes, and reopens the single-quoted string. This is useful if, for instance, you need a single quote in a command string that Awk hands to a subshell.
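
Here is the quoting trick in a runnable form, as a minimal sketch with a made-up input word. The shell joins the three quoted pieces, so awk receives the program { print "It's a small " $1 }.

```shell
quoted=$(echo 'world' | awk '{ print "It'"'"'s a small " $1 }')
echo "$quoted"
```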

Tasks

Line Matching

With Awk, it's easy to do something for each matching line using the regex matching operator.

Print every line before a line match

Simple awk code:

# cat list.txt | awk '/PATTERN/ { exit } { print $0 }'

Awk tests each line against the pattern. On the first match, awk exits; until then, every line is printed.
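
To try this without a file, here is a sketch that uses printf for hypothetical input and STOP as the pattern.

```shell
before=$(printf 'alpha\nbeta\nSTOP\ngamma\n' | awk '/STOP/ { exit } { print $0 }')
echo "$before"
```

Only alpha and beta are printed; the matching line and everything after it are skipped.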

Print the number of lines before a matching line

To print the number of lines before the first matching line, we do something similar to the previous example but now we keep an accumulator (n):

awk 'BEGIN { n = 0 } /PATTERN/ { print n; exit } { n++ }'

Eg: Suppose I have a file with the contents:

leo
spoon
cake
fork

To count the lines before 'cake' (this prints 2), do:

# cat list.txt | awk 'BEGIN { n = 0 } /cake/ { print n; exit } { n++ }'
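
If you want the line number of the match itself rather than the count of lines before it, the built-in NR already holds it. The sketch below runs both variants on the same four-line list; the variable names are just for illustration.

```shell
# Count of lines before the match, using the accumulator
count_before=$(printf 'leo\nspoon\ncake\nfork\n' | awk 'BEGIN { n = 0 } /cake/ { print n; exit } { n++ }')

# Line number of the match itself, using NR
line_number=$(printf 'leo\nspoon\ncake\nfork\n' | awk '/cake/ { print NR; exit }')

echo "$count_before $line_number"
```

The accumulator gives 2 and NR gives 3.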

Retrieve a section of text with line matching

To grab a specific section from a .spec file (where sections begin with %):

if [ $# -ne 2 ] ; then
	echo "Usage: $0 file.spec section"
	exit 1
fi

Section="$2"

cat "$1" | awk -v Section="$Section" '
BEGIN {
	# Awk has no true/false keywords; use 0 and 1
	InSection = 0
}
/^%/ {
	# A line starting with % opens a new section
	if ($0 ~ Section) {
		InSection = 1
	} else {
		InSection = 0
	}

	next
}
{
	if (InSection) {
		print $0
	}
}
'

String Matching

Use the match(string, regex, output_array) function to parse out specific values with a regex. The third (array) argument is a GNU Awk (gawk) extension.

# Given a string Job <87010>, User <asdf>, Project <default>
# we can parse out the Job ID with:

match($0, /^Job <([^>]+)>.*/, arr)
print "Job ID: " arr[1]
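
If gawk is not available, a similar result can be had with the two-argument match(), which sets RSTART and RLENGTH. This is a sketch of that alternative technique, not the array-based approach above; the offsets assume the fixed prefix "Job <".

```shell
job_id=$(echo 'Job <87010>, User <asdf>, Project <default>' | awk '
match($0, /^Job <[^>]+>/) {
   # The match spans "Job <87010>"; skip the 5-character prefix "Job <" and drop the trailing ">"
   print substr($0, RSTART + 5, RLENGTH - 6)
}')
echo "Job ID: $job_id"
```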

Converting Values

Convert Number as Bytes to Human Readable Value

I wanted to sort directory sizes in reverse order, but to do that properly the value from du needs to be in kilobytes. I also didn't want to run everything through du a second time just to get the human-readable value. So, I did this:

$ du -sk ./*/ ./*/*/ ./*/*/*/ \
   | sort -rn \
   | awk 'BEGIN {
      split("KB MB GB TB PB", type)
   }
   {
      y = 0
      x = $1
      # Try the largest unit first and step down until the value is at least 1
      for (i = 4; y < 1 && i >= 0; i--)
         y = x / (2 ^ (10 * i))
      print y type[i+2] " " $2
   }'

If you want to count from bytes instead of kilobytes, add B to the front of the list passed to split ("B KB MB GB TB PB") and replace i = 4 with i = 5.
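
To experiment with the unit conversion without running du, here is a sketch on made-up sizes in kilobytes. It uses a different loop than the one above (divide by 1024 until the value drops below 1024) and rounds to one decimal; the paths and variable names are hypothetical.

```shell
sizes=$(printf '512 ./small\n2048 ./medium\n3145728 ./large\n' | LC_ALL=C awk '
BEGIN { split("KB MB GB TB PB", unit) }
{
   size = $1
   u = 1
   # Step up one unit for every factor of 1024, stopping at PB
   while (size >= 1024 && u < 5) {
      size /= 1024
      u++
   }
   printf("%.1f%s %s\n", size, unit[u], $2)
}')
echo "$sizes"
```

The three lines come out as 512.0KB, 2.0MB, and 3.0GB.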

Convert ps Elapsed Time to Seconds

To convert the elapsed time (in POSIX locale formatted as [[dd-]hh:]mm:ss) to seconds using awk:

Elapsed=`ps -p $Pid -o etime=`
Elapsed=`echo $Elapsed | tr - : | tr : ' '` 

Seconds=`echo $Elapsed | awk ' 
	NF == 2 { print ($1 * 60) + $2 } 
	NF == 3 { print ((($1 * 60) + $2) * 60) + $3 } 
	NF == 4 { print ((((($1 * 24) + $2) * 60) + $3) * 60) + $4 } 
	{}'` 
	
echo $Seconds

For example, a process with an elapsed time of 66-00:12:58 has been running for 5703178 seconds. The Awk command checks how many fields were found (NF) and applies the matching calculation.
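
The same calculation can be checked on a fixed string without a running process. This sketch wraps it in a hypothetical shell function, etime_to_seconds, and splits on : directly with -F: instead of converting to spaces first.

```shell
# etime_to_seconds is a hypothetical helper; it expects [[dd-]hh:]mm:ss
etime_to_seconds() {
   echo "$1" | tr - : | awk -F: '
      NF == 2 { print $1 * 60 + $2 }
      NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
      NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

seconds=$(etime_to_seconds '66-00:12:58')
echo "$seconds"
```

For 66-00:12:58 this gives 5703178, matching the example above.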

String Manipulation

String Upper/Lower Casing

There is a tolower() function that can lowercase an entire string.

You could use this to mass-rename a bunch of files to lower case for instance.

## Lower case every file in the current directory
$ for i in * ; do mv "$i" "$(echo "$i" | awk '{ print tolower($0) }')"; done

## Alternatively, use `tr 'A-Z' 'a-z'` to do the lowercasing.
$ for i in * ; do mv "$i" "$(echo "$i" | tr 'A-Z' 'a-z')"; done
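
Awk also has a matching toupper() function. A minimal sketch on a made-up string, without touching any files:

```shell
cased=$(echo 'Hello World' | awk '{ print tolower($0); print toupper($0) }')
echo "$cased"
```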

Retrieving all columns from column N onward

You can retrieve a specific column using $N, where N is the column number. To get every value from a specific column onward, you can use this function, which joins column x and all following columns with spaces and returns the result:

function after(x,    out, i) {
        # out and i are declared as extra parameters so they stay local
        out = $x
        for (i = x + 1; i <= NF; i++) out = out " " $i
        return out
}

printf("From column 11 onward: %s\n", after(11))
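
Here is a runnable sketch of the same idea on a short made-up line. The function is named from_column here (a hypothetical name) and includes column x itself; the extra parameters out and i keep those variables local to the function.

```shell
tail_fields=$(echo 'a b c d e' | awk '
function from_column(x,    out, i) {
   out = $x
   for (i = x + 1; i <= NF; i++) out = out " " $i
   return out
}

{ printf("From column 3: %s\n", from_column(3)) }')
echo "$tail_fields"
```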

Executing Commands

If your awk script needs to call an external process, pipe the command string into getline, followed by the name of the variable that should receive a line of the command's stdout.

Eg. To get the path of an executable from a given PID, use:

"readlink /proc/" $2 "/exe" | getline proc
printf("%s\n", proc)

If you intend to run many commands, you should close the pipe or else you will get a fatal: cannot open pipe 'xyz' (Too many open files) error. Do so by using the close function:

cmd="date -d\""$1" "$2"\" \"+%s\""; cmd | getline timestamp; close(cmd)
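
As a self-contained sketch of the run, read, close cycle, the example below calls basename once per input line. The paths are made up, and the input is assumed to be safe to paste into a shell command (no spaces or quotes).

```shell
names=$(printf '/usr/bin/awk\n/etc/hosts\n' | awk '
{
   cmd = "basename " $1
   # getline returns 1 when a line was read from the command
   if ((cmd | getline name) > 0) print name
   # Close the pipe so open file descriptors do not pile up
   close(cmd)
}')
echo "$names"
```

Each command is closed right after its output is read, so the number of open pipes stays at one no matter how many lines are processed.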