Carsten Nase: parsing ASCII data

Not finished. Use with care!

Parsing ASCII data files with perl, awk, sed, grep, and others

This tutorial treats the non-interactive parsing of ASCII text files containing numerical data. Recipes for common problems are presented and illustrated with the sample data file parsing.dat shown below.

# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118

Programs

These programs are useful tools for parsing files containing numerical data (or, of course, simple text). Some of them are quite simple but useful (e.g. cat, tac, head, or tail). Others are powerful and versatile scripting languages (e.g. perl).

perl: practical extraction and report language.
awk: pattern scanning and processing language.
gawk: the GNU Project's implementation of the AWK programming language.
sed: a stream editor.
grep, egrep, fgrep, zgrep: print lines matching a pattern.
agrep: search a file for a string or regular expression, with approximate matching capabilities.
cat: concatenate files and print on the standard output.
tac: concatenate and print files in reverse (cat -> tac).
head: output the first part (the head) of files.
tail: output the last part (the tail) of files.
comm: compare two sorted files line by line.
diff: compare files line by line.
paste: merge lines of files.
join: join lines of two files on a common field.
csplit: split a file into sections determined by context lines.
cut: remove sections from each line of files.
tr: translate or delete characters.
sort: sort lines of text files.
uniq: remove duplicate lines from a sorted file.
nl: number lines of files.
wc (word count): print the number of bytes, words, and lines in files.
less: the opposite of more. "Less is a program similar to more, but which allows backward movement in the file as well as forward movement. Also, less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi."
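
A few of the smaller tools from the list can be tried on the sample data roughly as follows. This is only a sketch: it assumes the sample data is stored as parsing.dat in a scratch directory created with mktemp, and the variable names (datalines, ndata, col1) are made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

# The sample data file used throughout this tutorial.
cat > parsing.dat <<'EOF'
# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118
EOF

# Keep only the data lines: drop comment lines, then empty lines.
datalines=$(grep -v '^#' parsing.dat | grep -v '^$')

# wc counts the data lines; tr removes padding some wc versions add.
ndata=$(printf '%s\n' "$datalines" | wc -l | tr -d ' ')
echo "data lines: $ndata"

# head and tail show the first and the last data line.
printf '%s\n' "$datalines" | head -n 1
printf '%s\n' "$datalines" | tail -n 1

# tr squeezes runs of blanks, cut takes field 1, paste joins the lines.
col1=$(printf '%s\n' "$datalines" | tr -s ' ' | cut -d ' ' -f 1 | paste -s -d ' ' -)
echo "first column: $col1"
```

The data lines are filtered once and reused, so each tool only has to do one small job.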

Pipes and redirections

The easiest way to parse efficiently is to combine the tools mentioned above into a sequence of instructions. To do so, we connect them with the pipe symbol |, which feeds the standard output of one command into the standard input of the next.

computer> command1 < infile | command2 | command3 > outfile

computer> denotes the shell prompt. We have already used the redirection symbols < (read the standard input from a file) and > (write the standard output to a file). If you want to append redirected output to a file, use the >> operator.

computer> command < infile >> outfile

Otherwise the content of outfile (if it exists) is overwritten. Some commands do not read from standard input but directly from a file. In this case

computer> command infile

has to be used. Never write to a file that you are reading at the same time. The shell truncates the output file before the command starts to read it.

computer> command < file > file   # destroys file! Do NOT use!
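
If a file really has to be replaced by its filtered version, one common workaround is to write to a temporary file first and rename it afterwards. The following is a minimal sketch of that idea; the file names data.txt and data.txt.tmp are hypothetical.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

# A small hypothetical input file with one comment line.
printf '# comment\n1 2\n3 4\n' > data.txt

# Write the filtered result to a temporary file first and rename it
# only if grep succeeded. Note that grep exits with a non-zero status
# when it selects no line at all; in that case data.txt is left as is.
grep -v '^#' < data.txt > data.txt.tmp && mv data.txt.tmp data.txt

cat data.txt
```

The original file is only touched by mv, after the complete output has been written.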

Removing comments

Data files often contain comments starting with a hash #. To parse these files we have to ignore the comment lines. grep -v (or grep --invert-match) helps us.

computer> grep -v "^#" < parsing.dat

0       0        1.000000        0.000000
[...]
# We dropped the trailing lines.

The -v option inverts the sense of matching, to select non-matching lines. The matching pattern is ^#. The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. We use ^# to find a hash # at the beginning ^ of a line.

Regular expressions

A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. The first and very simple regular expression we used was ^#. Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

Now let us get rid of the bothersome empty line at the beginning of the output. To match empty lines we use the regular expression ^$. It matches lines whose end $ directly follows the beginning ^. A nerdy description of an empty line, but it works.

computer> grep -v "^#\|^$" < parsing.dat
0       0        1.000000        0.000000
[...]

To drop the escape character \ and write | instead of \| we have to use egrep or grep -E.

computer>  grep -E -v "^#|^$" < parsing.dat
computer> egrep    -v "^#|^$" < parsing.dat

Grouping expressions

Sometimes it is useful to group subexpressions with brackets ().

computer>  grep -E -v "^(#|$)" < parsing.dat
computer> egrep    -v "^(#|$)" < parsing.dat

computer>  grep    -v "^\(#\|$\)" < parsing.dat   # using non-extended regular expressions

So flip(flop|flap) matches flipflop and flipflap. This is somewhat lengthy, as only the vowel differs between flop and flap. flipfl(o|a)p would have done the same. We can be even shorter ...
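
The grouped pattern can be checked quickly on a handful of words. The word list below is made up for this sketch; the anchors ^ and $ make sure the whole line has to match.

```shell
# A made-up word list, one word per line.
words=$(printf 'flipflop\nflipflap\nflipflip\nflop\n')

# Extended regular expression with a group: o or a between "flipfl" and "p".
matches=$(printf '%s\n' "$words" | grep -E '^flipfl(o|a)p$')
printf '%s\n' "$matches"

# grep -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "$words" | grep -E -c '^flipfl(o|a)p$')
echo "matching words: $count"
```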

Bracket expressions

A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^, then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit. And flipfl[oa]p matches flipflop and flipflap.

Within a bracket expression, a range expression consists of two characters separated by a hyphen -. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd] and [0-9] is equivalent to [0123456789].
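
Bracket expressions also help with data files that are less tidy than our sample. The sketch below assumes a hypothetical file messy.dat in which comments may be indented and "empty" lines may contain blanks. It uses the POSIX character class [[:space:]], which is not introduced above and is an addition of this example.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

# Hypothetical input: an indented comment, a plain comment, a line of
# blanks, an empty line, and two data lines.
printf '  # indented comment\n# comment\n   \n\n1 2\n-3 4\n' > messy.dat

# Drop every line that, after optional whitespace, is a comment or ends.
kept=$(grep -E -v '^[[:space:]]*(#|$)' messy.dat)
printf '%s\n' "$kept"

# Keep only lines that start with a digit; the range [0-9] excludes "-3 4".
digits=$(grep '^[0-9]' messy.dat)
printf '%s\n' "$digits"
```

The second pattern is stricter than the first: a line with a leading sign no longer matches.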

Repetition operators

A regular expression may be followed by one of several repetition operators.

`?`	The preceding item is optional and matched at most once.
`*`	The preceding item will be matched zero or more times.
`+`	The preceding item will be matched one or more times.
`{n}`	The preceding item is matched exactly n times.
`{n,}`	The preceding item is matched n or more times.
`{n,m}`	The preceding item is matched at least n times, but not more than m times.

So ([+-]?[0-9]+) matches an integer number with an optional plus or minus sign in front.
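
The repetition operators can be tried on a list of tokens, one per line. The token list is made up for this sketch. The option -x (match the whole line only) is used so that 3.14 does not count as a match for the integer pattern.

```shell
# A made-up list of tokens, one per line.
tokens=$(printf '42\n-7\n+13\n3.14\n-0.416147\nabc\n')

# Integers: optional sign, then one or more digits.
ints=$(printf '%s\n' "$tokens" | grep -E -x '[+-]?[0-9]+')
printf '%s\n' "$ints"

# Fixed-point numbers with exactly six digits after the decimal point.
six=$(printf '%s\n' "$tokens" | grep -E -x '[+-]?[0-9]+\.[0-9]{6}')
printf '%s\n' "$six"
```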

Sorting data

Some useful options for the sort command.

`-b`	`--ignore-leading-blanks`	ignore leading blanks
`-d`	`--dictionary-order`	consider only blanks and alphanumeric characters
`-f`	`--ignore-case`	fold lower case to upper case characters
`-g`	`--general-numeric-sort`	compare according to general numerical value
`-n`	`--numeric-sort`	compare according to string numerical value
`-r`	`--reverse`	reverse the result of comparisons
`-k`	`--key=POS1[,POS2]`	start a key at POS1, end it at POS2 (origin 1)

For example, we can sort our sample file with respect to column number 3.

computer> egrep -v "^#|^$" < parsing.dat | sort -g -k3
3       9       -0.989992        0.141120
9       81      -0.911130        0.412118
4       16      -0.653644       -0.756802
2       4       -0.416147        0.909297
8       64      -0.145500        0.989358
5       25       0.283662       -0.958924
1       1        0.540302        0.841471
7       49       0.753902        0.656987
6       36       0.960170       -0.279415
0       0        1.000000        0.000000
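
The options can be combined. As a sketch, the following sorts the sample data by column 4 in descending order. Writing the key as -k4,4 restricts it to that single column, whereas -k4 alone would extend the key to the end of the line. The variable name order is made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

cat > parsing.dat <<'EOF'
# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118
EOF

# Numeric sort on column 4 only, largest value first.
grep -E -v '^(#|$)' parsing.dat | sort -g -r -k4,4

# The resulting order of the first column, joined into one line.
order=$(grep -E -v '^(#|$)' parsing.dat | sort -g -r -k4,4 |
        awk '{ print $1 }' | paste -s -d ' ' -)
echo "row order: $order"
```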

To reverse the order of the lines we can use the inverse of cat, i.e. tac.

computer> egrep -v "^#|^$" < parsing.dat | tac
9       81      -0.911130        0.412118
8       64      -0.145500        0.989358
7       49       0.753902        0.656987
6       36       0.960170       -0.279415
5       25       0.283662       -0.958924
4       16      -0.653644       -0.756802
3       9       -0.989992        0.141120
2       4       -0.416147        0.909297
1       1        0.540302        0.841471
0       0        1.000000        0.000000

Reformatting: selecting and reordering columns

Let us now use awk to select specific columns of a data file and reorder them according to our demands. Each field in the input record may be referenced by its position, $1, $2, and so on. $0 is the whole record.

Print column 1 and column 3, separated by a tab:

computer> egrep -v "^#|^$" < parsing.dat | awk '{ print $1 "\t" $3; }'
0       1.000000
1       0.540302
2       -0.416147
3       -0.989992
4       -0.653644
5       0.283662
6       0.960170
7       0.753902
8       -0.145500
9       -0.911130

Note the single quotes around the awk program, the curly braces {} around the action, and the semicolon ; that ends the statement.
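
awk can also do the filtering itself, which saves the separate grep step. In the sketch below the pattern !/^#/ && NF > 0 selects lines that are neither comments nor empty (NF is the number of fields), and the action prints column 3 before column 1. The variable name swapped is made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

cat > parsing.dat <<'EOF'
# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118
EOF

# Skip comments and empty lines inside awk, then swap the column order.
swapped=$(awk '!/^#/ && NF > 0 { print $3 "\t" $1; }' parsing.dat)
printf '%s\n' "$swapped"
```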

We might also use printf, well known from C, to format the output.

computer> egrep -v "^#|^$" < parsing.dat | awk '{ printf("%d\t%9.6f\n", $1, $3); }'
0        1.000000
1        0.540302
2       -0.416147
3       -0.989992
[...]
computer> egrep -v "^#|^$" < parsing.dat | awk '{ printf("%d\t%12.5e\n", $1, $3); }'
0        1.00000e+00
1        5.40302e-01
2       -4.16147e-01
3       -9.89992e-01
[...]

Calculations on data sets

A large set of arithmetic operations can be applied to the elements of the data file. First let us test whether sin²x+cos²x=1.

computer> egrep -v "^#|^$" < parsing.dat | \
          awk '{ printf("%d\t%9.6f\t%9.6f\n", $1, $3*$3+$4*$4, sin($1)^2+cos($1)^2); }'
0        1.000000        1.000000
1        1.000000        1.000000
2        0.999999        1.000000
3        0.999999        1.000000
4        1.000000        1.000000
5        0.999999        1.000000
6        0.999999        1.000000
7        1.000000        1.000000
8        1.000000        1.000000
9        0.999999        1.000000

Column 2 is spoiled by rounding errors: the values in the data file are rounded to six decimal places.
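
To see how large the rounding error actually is, one can let awk track the largest deviation of column 3 squared plus column 4 squared from 1. This is only a sketch; the variable names d, max, and maxdev are made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

cat > parsing.dat <<'EOF'
# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118
EOF

# For each data line compute |col3^2 + col4^2 - 1| and keep the maximum.
# awk has no abs() function, so the sign is flipped by hand.
maxdev=$(awk '!/^#/ && NF > 0 {
                d = $3*$3 + $4*$4 - 1
                if (d < 0) d = -d
                if (d > max) max = d
              }
              END { printf("%.2e\n", max) }' parsing.dat)
echo "largest deviation from 1: $maxdev"
```

With six decimal places in the input, the deviation should stay at the order of 1e-6.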

Awk has the following built-in operators and arithmetic functions.

operator
`(...)`	Grouping
`$`	Field reference.
`+ -`	Addition and subtraction.
`+ - !`	Unary plus, unary minus, and logical negation.
`++ --`	Increment and decrement, both prefix and postfix.
`* / %`	Multiplication, division, and modulus.
`^`	Exponentiation (`**` may also be used, and `**=` for the assignment operator).
`< > <= >= != ==`	The regular relational operators.
`= += -= *= /= %= ^=`	Assignment.
function
`atan2(y, x)`	Returns the arctangent of `y/x` in radians.
`cos(expr)`	Returns the cosine of expr, which is in radians.
`exp(expr)`	The exponential function.
`int(expr)`	Truncates to integer.
`log(expr)`	The natural logarithm function.
`sin(expr)`	Returns the sine of expr, which is in radians.
`sqrt(expr)`	The square root function.
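
As a small illustration of the built-in functions, atan2 can recover the angle from the cosine in column 3 and the sine in column 4. The sketch below is restricted to the rows with column 1 at most 3, because atan2 returns values between -pi and pi and larger angles would come back shifted by a multiple of 2 pi. The variable name angles is made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

# The first four data lines of the sample file are enough here.
cat > parsing.dat <<'EOF'
# comment

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
EOF

# atan2(y, x) with y = sine (column 4) and x = cosine (column 3).
angles=$(awk '!/^#/ && NF > 0 && $1 <= 3 {
                printf("%d\t%.3f\n", $1, atan2($4, $3))
              }' parsing.dat)
printf '%s\n' "$angles"
```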

Often the value of Pi is needed in simple calculations. Define this variable in awk in a BEGIN block. The special patterns BEGIN and END may be used to capture control before the first input line has been read and after the last input line has been read, respectively.

computer> awk 'BEGIN { pi=4*atan2(1,1); printf("%.15f\n", pi); }'
3.141592653589793
computer> egrep -v "^#|^$" < parsing.dat | \
          awk 'BEGIN { pi=4*atan2(1,1); } \
	       { printf("%18.15f\t%18.15f\n", $1, $1*pi); s += $1;} \
	       END { printf("------------------\n%18.15f\n", s); }'
 0.000000000000000       0.000000000000000
 1.000000000000000       3.141592653589793
 2.000000000000000       6.283185307179586
 3.000000000000000       9.424777960769379
 4.000000000000000      12.566370614359172
 5.000000000000000      15.707963267948966
 6.000000000000000      18.849555921538759
 7.000000000000000      21.991148575128552
 8.000000000000000      25.132741228718345
 9.000000000000000      28.274333882308138
------------------
45.000000000000000
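
The END block is also a natural place for simple statistics. As a sketch, the following counts the data lines, averages column 1, and sums column 2. The variable names n, s1, s2, and stats are made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

cat > parsing.dat <<'EOF'
# comment
# another comment, followed by an empty line

0       0        1.000000        0.000000
1       1        0.540302        0.841471
2       4       -0.416147        0.909297
3       9       -0.989992        0.141120
4       16      -0.653644       -0.756802
5       25       0.283662       -0.958924
6       36       0.960170       -0.279415
7       49       0.753902        0.656987
8       64      -0.145500        0.989358
9       81      -0.911130        0.412118
EOF

# Accumulate while reading; report once in the END block.
# The n > 0 test avoids a division by zero on files without data lines.
stats=$(awk '!/^#/ && NF > 0 { n++; s1 += $1; s2 += $2 }
             END { if (n > 0) printf("n=%d mean1=%.2f sum2=%d\n", n, s1/n, s2) }' parsing.dat)
echo "$stats"
```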

Skipping lines: making a data set more sparse

We use this small perl script to do the job.

#!/usr/bin/perl -w

use strict;

unless ($ARGV[0]) {
    die "Sorry, I need at least one argument: sparse.pl line [offset=0] < infile > outfile.\n";
}

my $l = $ARGV[0]; # print every l-th line
my $o = 0;        # offset default
if ($ARGV[1]) { $o = $ARGV[1]; }

# skip the first $o lines
for (my $i=0; $i<$o; $i++) { <STDIN>; }

my $n = 0;
 LINE: while (<STDIN>) {
     $n++;
     if (($n-1) % $l == 0) { print $_; }
 };

The syntax of sparse.pl is sparse.pl line [offset=0] < infile > outfile. The first offset lines are skipped; after that every line-th line is printed, starting with the first remaining one. The default value for offset is zero, so the argument may be omitted.

computer> egrep -v "^#|^$" < parsing.dat | sparse.pl 3 0
0       0        1.000000        0.000000
3       9       -0.989992        0.141120
6       36       0.960170       -0.279415
9       81      -0.911130        0.412118
computer> egrep -v "^#|^$" < parsing.dat | sparse.pl 2 1
1       1        0.540302        0.841471
3       9       -0.989992        0.141120
5       25       0.283662       -0.958924
7       49       0.753902        0.656987
9       81      -0.911130        0.412118
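
For comparison, the same selection can be sketched as an awk one-liner wrapped in a shell function. The function name sparse and the awk variables step and offset are made up for this example; NR is awk's built-in line counter. This is not the script above, only an equivalent under the assumption that the input has already been stripped of comments and empty lines.

```shell
# Hypothetical helper: print every step-th line after skipping offset lines.
sparse() {
    awk -v step="$1" -v offset="${2:-0}" \
        'NR > offset && (NR - offset - 1) % step == 0'
}

# Ten numbered lines stand in for the ten data lines of parsing.dat.
rows=$(printf '%s\n' 0 1 2 3 4 5 6 7 8 9)

every3=$(printf '%s\n' "$rows" | sparse 3 0 | paste -s -d ' ' -)
echo "sparse 3 0: $every3"

every2=$(printf '%s\n' "$rows" | sparse 2 1 | paste -s -d ' ' -)
echo "sparse 2 1: $every2"
```

The selected rows agree with the first column of the two sparse.pl runs shown above.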

Symmetrizing data sets

Assume we have a data set {F(x)} with {x} in [-A,+A] on a grid that is symmetric about zero. Zero may be included in {x} (odd number of data points) or not (even number of data points). We now want to calculate the symmetrized version F(x)+F(-x) of F(x).

x_i	F	Symmetrized F
x_{-n} = -A	F(x_{-n})	F(x_{-n}) + F(x_{n})
x_{-n+1}	F(x_{-n+1})	F(x_{-n+1}) + F(x_{n-1})
x_{-n+2}	F(x_{-n+2})	F(x_{-n+2}) + F(x_{n-2})
...	...	...
x_{0} = 0	F(0)	2 F(0)
...	...	...
x_{n-2}	F(x_{n-2})	F(x_{-n+2}) + F(x_{n-2})
x_{n-1}	F(x_{n-1})	F(x_{-n+1}) + F(x_{n-1})
x_{n} = +A	F(x_{n})	F(x_{-n}) + F(x_{n})

We use this small perl script to do the job.

#!/usr/bin/perl -w

use strict;

my $n='[0-9e\.\+\-]+'; # a number
my $s='[ \t]';         # a separator

my $x = 0.0;
my $f = 0.0;

my @x;
my @f;
my @fsym;

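my $l = 0;         # number of distinct x values read so far
$x[1] = -1e10;     # placeholder until the first data line is read
my $eps = 1e-12;   # tolerance for comparing x values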

 LINE: while (<STDIN>) {
   SWITCH: {
       /^\#/ && do {
	   next LINE;
       };
       /^$/ && do {
	   next LINE;
       };
       /^($s*)($n)($s+)($n)/ && do {
	   $x = $2;
	   $f = $4;
	   if (abs($x) < $eps) { $x = 0.0; }
	   if ($l == 0) { $l = 1; }
	   else { if (abs($x[$l]-$x) > $eps) { $l++; } }
	   $x[$l] = $x;
	   $f[$l] = $f;
	   next LINE;
       };
       next LINE;
   };
 };

# number of pairs (x_i, x_{l-i+1}); for odd $l the middle point pairs with itself
my $z = int(($l + 1) / 2);

for (my $i=1; $i<=$z; $i++) {
    if (abs($x[$i] + $x[$l-$i+1]) > 1e-10) {
	die "Sorry, interval [", $x[1], ", ", $x[$l], "] seems not to be symmetric:\n",
	"i=", $i, " ", " ", $x[$i], " != ", $x[$l-$i+1], "\n";
    }
}

for (my $i=1; $i<=$z; $i++) {
    $fsym[$i] = $fsym[$l-$i+1] = $f[$i] + $f[$l-$i+1];
}

print "# Found $l values in [", $x[1], ", ", $x[$l], "].\n";

for (my $i=1; $i<=$l; $i++) {
    printf("%18.12f\t%18.12f\n", $x[$i], $fsym[$i]);
}

A small example working on sym.dat.

computer> cat sym.dat
  -10.000000000000	    0.000000000000
   -9.000000000000	    0.105170918076
   -8.000000000000	    0.221402758160
   -7.000000000000	    0.349858807576
   -6.000000000000	    0.491824697641
   -5.000000000000	    0.648721270700
   -4.000000000000	    0.822118800391
   -3.000000000000	    1.013752707470
   -2.000000000000	    1.225540928492
   -1.000000000000	    1.459603111157
    0.000000000000	    1.718281828459
    1.000000000000	    2.004166023946
    2.000000000000	    2.320116922737
    3.000000000000	    2.669296667619
    4.000000000000	    3.055199966845
    5.000000000000	    3.481689070338
    6.000000000000	    3.953032424395
    7.000000000000	    4.473947391727
    8.000000000000	    5.049647464413
    9.000000000000	    5.685894442279
   10.000000000000	    6.389056098931
computer> ./sym.pl < sym.dat
# Found 21 values in [-10.000000000000, 10.000000000000].
  -10.000000000000	    6.389056098931
   -9.000000000000	    5.791065360355
   -8.000000000000	    5.271050222573
   -7.000000000000	    4.823806199303
   -6.000000000000	    4.444857122036
   -5.000000000000	    4.130410341038
   -4.000000000000	    3.877318767236
   -3.000000000000	    3.683049375089
   -2.000000000000	    3.545657851229
   -1.000000000000	    3.463769135103
    0.000000000000	    3.436563656918
    1.000000000000	    3.463769135103
    2.000000000000	    3.545657851229
    3.000000000000	    3.683049375089
    4.000000000000	    3.877318767236
    5.000000000000	    4.130410341038
    6.000000000000	    4.444857122036
    7.000000000000	    4.823806199303
    8.000000000000	    5.271050222573
    9.000000000000	    5.791065360355
   10.000000000000	    6.389056098931
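
For small files the same idea can be sketched in awk alone: read all values into arrays and combine line i with line n-i+1 in the END block. This is not the Perl script above, only a simplified variant. It assumes that the x values are sorted, and unlike the script it does not merge repeated x values. The file name small.dat and the variable names are made up for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch).
cd "$(mktemp -d)" || exit 1

# A hypothetical five-point data set on a symmetric grid.
printf '%s\n' '# x  F(x)' '-2 1' '-1 2' '0 3' '1 5' '2 8' > small.dat

sym=$(awk '!/^#/ && NF >= 2 { n++; x[n] = $1; f[n] = $2 }
           END {
             # Check that x[i] and x[n-i+1] are mirror images of each other.
             for (i = 1; i <= n; i++) {
               d = x[i] + x[n - i + 1]
               if (d < 0) d = -d
               if (d > 1e-10) {
                 print "interval seems not to be symmetric at line " i | "cat 1>&2"
                 exit 1
               }
             }
             # Symmetrized value: F(x_i) + F(-x_i).
             for (i = 1; i <= n; i++)
               printf("%g\t%g\n", x[i], f[i] + f[n - i + 1])
           }' small.dat)
printf '%s\n' "$sym"
```

As in the table above, the value at x = 0 is doubled because the middle point is paired with itself.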