Homepage of Dr. Carsten Nase, TU Dortmund, Department of Physics, Uhrig Group, 44221 Dortmund
This tutorial treats the non-interactive parsing of ASCII text files containing numerical data. Recipes for common problems are presented and illustrated via a sample data file.
# comment
# another comment, followed by an empty line

0 0 1.000000 0.000000
1 1 0.540302 0.841471
2 4 -0.416147 0.909297
3 9 -0.989992 0.141120
4 16 -0.653644 -0.756802
5 25 0.283662 -0.958924
6 36 0.960170 -0.279415
7 49 0.753902 0.656987
8 64 -0.145500 0.989358
The programs used in this tutorial are useful tools for parsing files containing numerical data (or, of course, simple text). Some of them are quite simple but useful (e.g. cat, tac, head, or tail). Others are powerful and versatile scripting languages (e.g. perl).
The easiest way to parse efficiently is to combine the tools mentioned above in a sequence of instructions. To do so we use the pipe symbol |.
computer> command1 < infile | command2 | command3 > outfile
Here computer> denotes the shell prompt.
We have already used the redirection symbols < (which refers to the standard input) and > (standard output). If you want to append redirected output, use the >> operator.
computer> command < infile >> outfile
Otherwise the content of outfile (if it exists) is overwritten. Some commands do not read from standard input but directly from a file. In this case
computer> command infile
has to be used. Never write to a file that you are reading from at the same time.
computer> command < file > file   # destroys file! Do NOT use!
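The redirection destroys the file because the shell opens it for writing, and thereby truncates it, before the command has read anything. A common workaround is sketched here, assuming a scratch file created with mktemp. The file name data.txt is made up for illustration, and sort merely stands in for any filter.

```shell
# Work in a scratch directory so that nothing of value is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\nc\n' > data.txt          # hypothetical input file

# Write the result to a temporary file first, then replace the original.
tmp=$(mktemp ./data.XXXXXX)
sort < data.txt > "$tmp" && mv "$tmp" data.txt

result=$(cat data.txt)
printf '%s\n' "$result"
```

The mv only runs if the filter succeeded, so a failing command leaves the original file untouched.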
Data files often contain comments starting with a hash #. To parse these files we have to ignore the comment lines. grep -v (or grep --invert-match) helps us.
computer> grep -v "^#" < parsing.dat

0 0 1.000000 0.000000
[...] # We dropped the trailing lines.
The -v option inverts the sense of matching, so that non-matching lines are selected. The matching pattern is ^#. The caret ^ and the dollar sign $ are metacharacters that match the empty string at the beginning and at the end of a line, respectively. We use ^# to find a hash # at the beginning ^ of a line.
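The effect of the anchor can be seen in a small sketch. The three input lines are invented for illustration: a comment, a data line with a trailing remark, and a plain data line.

```shell
# Hypothetical input: one comment line, one data line with a remark, one plain data line.
input=$(printf '# header\n1 2 # remark\n3 4\n')

# ^# only matches a hash at the very beginning of a line ...
kept=$(printf '%s\n' "$input" | grep -v '^#')
# ... whereas a bare # matches a hash anywhere in the line.
kept_strict=$(printf '%s\n' "$input" | grep -v '#')

printf '%s\n' "$kept"
printf '%s\n' "$kept_strict"
```

With the anchor, the data line carrying a remark survives; without it, that line is dropped as well.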
A regular expression is a pattern that describes a set of strings.
Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
The first and very simple regular expression we used was ^#. Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
Now let us get rid of the bothersome empty line at the beginning of the output. To match empty lines we use the regular expression ^$. It matches lines whose end $ immediately follows the beginning ^. A nerdy description of an empty line, but it works.
computer> grep -v "^#\|^$" < parsing.dat
0 0 1.000000 0.000000
[...]
To drop the escape character \ and use | instead of \|, we have to use egrep or grep -E.
computer> grep -E -v "^#|^$" < parsing.dat
computer> egrep -v "^#|^$" < parsing.dat
Sometimes it is useful to group subexpressions with parentheses ().
computer> grep -E -v "^(#|$)" < parsing.dat
computer> egrep -v "^(#|$)" < parsing.dat
computer> grep -v "^\(#\|$\)" < parsing.dat   # using non-extended regular expressions
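That these spellings select the same lines can be checked with a short sketch. The input is a made-up miniature of parsing.dat, and the \| form assumes GNU grep, where alternation in basic regular expressions is available as an extension.

```shell
# Hypothetical miniature of parsing.dat: two comments, one empty line, two data lines.
sample=$(printf '# comment\n# another comment\n\n0 0 1.000000 0.000000\n1 1 0.540302 0.841471\n')

a=$(printf '%s\n' "$sample" | grep -v '^#\|^$')      # basic syntax, GNU extension \|
b=$(printf '%s\n' "$sample" | grep -E -v '^#|^$')    # extended syntax
c=$(printf '%s\n' "$sample" | grep -E -v '^(#|$)')   # extended syntax with grouping

printf '%s\n' "$a"
```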
So flip(flop|flap) matches flipflop and flipflap. This is somewhat lengthy, as only the vowel differs between flop and flap; flipfl(o|a)p would have done the same. We can be even shorter ...
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^, then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit. And flipfl[oa]p matches flipflop and flipflap.
Within a bracket expression, a range expression consists of two characters separated by a hyphen -. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd] and [0-9] is equivalent to [0123456789].
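A minimal sketch of lists, negated lists, and ranges follows. The four test words are invented for illustration, and LC_ALL=C is set so that the range does not depend on the locale.

```shell
# Hypothetical test words differing only in the sixth character.
words=$(printf 'flipflop\nflipflap\nflipflup\nflipfl9p\n')

# [oa] accepts exactly the two listed vowels.
listed=$(printf '%s\n' "$words" | grep 'flipfl[oa]p')
# [^oa] accepts any single character except o and a.
negated=$(printf '%s\n' "$words" | grep 'flipfl[^oa]p')
# [0-9] is a range; the C locale pins the collating sequence.
digit=$(printf '%s\n' "$words" | LC_ALL=C grep 'flipfl[0-9]p')

printf '%s\n' "$listed" "$negated" "$digit"
```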
A regular expression may be followed by one of several repetition operators.
operator | meaning |
---|---|
? | The preceding item is optional and matched at most once. |
* | The preceding item will be matched zero or more times. |
+ | The preceding item will be matched one or more times. |
{n} | The preceding item is matched exactly n times. |
{n,} | The preceding item is matched n or more times. |
{n,m} | The preceding item is matched at least n times, but not more than m times. |
So ([+-]?[0-9]+) matches an integer number with an optional plus or minus sign in front.
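The repetition operators combine naturally into a pattern for the floating point numbers found in data files. The following is only a sketch: the pattern stored in the variable number and the list of test tokens are assumptions made for illustration, not a complete definition of a number.

```shell
# Hypothetical tokens: four valid numbers and three strings that should be rejected.
tokens=$(printf '42\n-7\n+3.14\n1.0e-05\n1e\nabc\n1.2.3\n')

# optional sign, digits, optional fraction, optional exponent
number='^[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$'
numbers=$(printf '%s\n' "$tokens" | grep -E "$number")

# {2} demands exactly two repetitions of the preceding item.
two_digits=$(printf '%s\n' "$tokens" | grep -E '^[0-9]{2}$')

printf '%s\n' "$numbers"
printf '%s\n' "$two_digits"
```

The anchors ^ and $ force the whole line to be a number; without them 1.2.3 would match as well, because it contains a number.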
Some useful options for the sort command.
short | long | meaning |
---|---|---|
-b | --ignore-leading-blanks | ignore leading blanks |
-d | --dictionary-order | consider only blanks and alphanumeric characters |
-f | --ignore-case | fold lower case to upper case characters |
-g | --general-numeric-sort | compare according to general numerical value |
-n | --numeric-sort | compare according to string numerical value |
-r | --reverse | reverse the result of comparisons |
-k | --key=POS1[,POS2] | start a key at POS1, end it at POS2 (origin 1) |
For example, we can sort our sample file with respect to column number 3.
computer> egrep -v "^#|^$" < parsing.dat | sort -g -k3
3 9 -0.989992 0.141120
9 81 -0.911130 0.412118
4 16 -0.653644 -0.756802
2 4 -0.416147 0.909297
8 64 -0.145500 0.989358
5 25 0.283662 -0.958924
1 1 0.540302 0.841471
7 49 0.753902 0.656987
6 36 0.960170 -0.279415
0 0 1.000000 0.000000
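The difference between -n and -g shows up as soon as numbers are written in exponential notation, and -k3,3 restricts the key to column 3 alone, whereas -k3 uses everything from column 3 to the end of the line. A small sketch with invented values; LC_ALL=C is set so that the decimal point is interpreted independently of the locale.

```shell
# Hypothetical values; 1e2 is the largest number but starts with the smallest digit.
values=$(printf '1e2\n5\n20\n')

# -n stops reading at the letter e and therefore treats 1e2 as 1.
by_n=$(printf '%s\n' "$values" | LC_ALL=C sort -n | paste -s -d ' ' -)
# -g understands the exponent and treats 1e2 as 100.
by_g=$(printf '%s\n' "$values" | LC_ALL=C sort -g | paste -s -d ' ' -)

# Sort a hypothetical three-column table by column 3 only and list column 1.
table=$(printf 'a 1 0.5\nb 2 -0.4\nc 3 1e-3\n')
by_key=$(printf '%s\n' "$table" | LC_ALL=C sort -g -k3,3 | cut -d ' ' -f 1 | paste -s -d ' ' -)

printf '%s\n' "$by_n" "$by_g" "$by_key"
```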
To reverse the data we can use the inverse of cat, i.e. tac.
computer> egrep -v "^#|^$" < parsing.dat | tac
9 81 -0.911130 0.412118
8 64 -0.145500 0.989358
7 49 0.753902 0.656987
6 36 0.960170 -0.279415
5 25 0.283662 -0.958924
4 16 -0.653644 -0.756802
3 9 -0.989992 0.141120
2 4 -0.416147 0.909297
1 1 0.540302 0.841471
0 0 1.000000 0.000000
Let us now use awk to select specific columns of a data file and reorder them according to our demands. Each field in the input record may be referenced by its position, $1, $2, and so on. $0 is the whole record.
Print column 1 and column 3 separated by a tab:
computer> egrep -v "^#|^$" < parsing.dat | awk '{ print $1"\t"$3; }'
0    1.000000
1    0.540302
2    -0.416147
3    -0.989992
4    -0.653644
5    0.283662
6    0.960170
7    0.753902
8    -0.145500
9    -0.911130
Note the single quotes, the curly braces {}, and the semicolon ;.
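The filtering step can also be done by awk itself, since every action may be preceded by a pattern. The following sketch assumes that comment lines start with a hash and that empty lines contain no fields; the input is a made-up miniature of parsing.dat.

```shell
# Hypothetical miniature of parsing.dat: one comment, one empty line, two data lines.
sample=$(printf '# comment\n\n0 0 1.000000 0.000000\n1 1 0.540302 0.841471\n')

# The action only runs for lines that do not start with # and have at least one field.
cols=$(printf '%s\n' "$sample" | awk '!/^#/ && NF > 0 { print $1 "\t" $3 }')

printf '%s\n' "$cols"
```

NF is the number of fields in the current record, so NF > 0 also skips lines that contain only blanks.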
We can also use printf, well known from C, for the output:
computer> egrep -v "^#|^$" < parsing.dat | awk '{ printf("%d\t%9.6f\n", $1, $3); }'
0     1.000000
1     0.540302
2    -0.416147
3    -0.989992
[...]
computer> egrep -v "^#|^$" < parsing.dat | awk '{ printf("%d\t%12.5e\n", $1, $3); }'
0     1.00000e+00
1     5.40302e-01
2    -4.16147e-01
3    -9.89992e-01
[...]
A large set of arithmetic operations can be applied to the elements of the data file. First let us test whether sin^2(x) + cos^2(x) = 1.
computer> egrep -v "^#|^$" < parsing.dat | \
awk '{ printf("%d\t%9.6f\t%9.6f\n", $1, $3*$3+$4*$4, sin($1)^2+cos($1)^2); }'
0     1.000000     1.000000
1     1.000000     1.000000
2     0.999999     1.000000
3     0.999999     1.000000
4     1.000000     1.000000
5     0.999999     1.000000
6     0.999999     1.000000
7     1.000000     1.000000
8     1.000000     1.000000
9     0.999999     1.000000
Column 2 is spoiled by rounding errors, because the data file stores the cosine and sine values with only six decimal places.
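Because of this, tabulated data should be compared against an exact value only within a tolerance. The sketch below assumes that every stored value is off by at most 5e-7 (half a unit in the sixth decimal place), so the sum of the squares can deviate from 1 by roughly 1.4e-6 at worst. The tolerance 2e-6 and the variable name tol are choices made for this example.

```shell
# First three data lines of the sample file.
sample=$(printf '0 0 1.000000 0.000000\n1 1 0.540302 0.841471\n2 4 -0.416147 0.909297\n')

checks=$(printf '%s\n' "$sample" | awk -v tol=2e-6 '{
  d = $3*$3 + $4*$4 - 1        # deviation from the exact identity
  if (d < 0) d = -d            # awk has no abs(), so flip the sign by hand
  printf("%d\t%s\n", $1, (d <= tol) ? "ok" : "off")
}')

printf '%s\n' "$checks"
```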
Awk has the following built-in operators and arithmetic functions.
operator | description |
---|---|
(...) | Grouping |
$ | Field reference. |
+ - | Addition and subtraction. |
+ - ! | Unary plus, unary minus, and logical negation. |
++ -- | Increment and decrement, both prefix and postfix. |
* / % | Multiplication, division, and modulus. |
^ | Exponentiation (** may also be used, and **= for the assignment operator). |
< > <= >= != == | The regular relational operators. |
= += -= *= /= %= ^= | Assignment. |

function | description |
---|---|
atan2(y, x) | Returns the arctangent of y/x in radians. |
cos(expr) | Returns the cosine of expr, which is in radians. |
exp(expr) | The exponential function. |
int(expr) | Truncates to integer. |
log(expr) | The natural logarithm function. |
sin(expr) | Returns the sine of expr, which is in radians. |
sqrt(expr) | The square root function. |
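As an illustration of these functions, the following sketch converts columns 3 and 4 of the sample data, read as Cartesian coordinates, into a radius and an angle. The variable names r and phi are chosen for this example. Since atan2 returns values between -pi and pi, the angle reproduces column 1 only as long as column 1 is smaller than pi.

```shell
# First three data lines of the sample file.
sample=$(printf '0 0 1.000000 0.000000\n1 1 0.540302 0.841471\n2 4 -0.416147 0.909297\n')

polar=$(printf '%s\n' "$sample" | awk '{
  r   = sqrt($3*$3 + $4*$4)    # radius, close to 1 for these data
  phi = atan2($4, $3)          # angle in radians
  printf("%d\t%.4f\t%.4f\n", $1, r, phi)
}')

printf '%s\n' "$polar"
```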
Often the value of Pi is used in simple calculations. Define this variable in awk in a BEGIN block. The special patterns BEGIN and END may be used to capture control before the first input line has been read and after the last input line has been read, respectively.
computer> awk 'BEGIN { pi=4*atan2(1,1); printf("%.15f\n", pi); }'
3.141592653589793
computer> egrep -v "^#|^$" < parsing.dat | \
awk 'BEGIN { pi=4*atan2(1,1); } \
{ printf("%18.15f\t%18.15f\n", $1, $1*pi); s += $1;} \
END { printf("------------------\n%18.15f\n", s); }'
 0.000000000000000     0.000000000000000
 1.000000000000000     3.141592653589793
 2.000000000000000     6.283185307179586
 3.000000000000000     9.424777960769379
 4.000000000000000    12.566370614359172
 5.000000000000000    15.707963267948966
 6.000000000000000    18.849555921538759
 7.000000000000000    21.991148575128552
 8.000000000000000    25.132741228718345
 9.000000000000000    28.274333882308138
------------------
45.000000000000000
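The same three-part structure yields simple statistics. Here is a sketch that computes the mean of column 3; the variable names s and n are chosen for this example, and the input consists of the first three data lines of the sample file.

```shell
# First three data lines of the sample file.
sample=$(printf '0 0 1.000000 0.000000\n1 1 0.540302 0.841471\n2 4 -0.416147 0.909297\n')

mean=$(printf '%s\n' "$sample" | awk '
  BEGIN { s = 0; n = 0 }                          # runs once, before the first line
        { s += $3; n++ }                          # runs for every input line
  END   { if (n > 0) printf("%.6f\n", s / n) }    # runs once, after the last line
')

printf '%s\n' "$mean"
```

The test n > 0 avoids a division by zero when the input is empty.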
To thin out a data file, i.e. to print only every n-th line, we use this small perl script.
#!/usr/bin/perl -w
use strict;

unless ($ARGV[0]) {
  die "Sorry, I need at least one argument: sparse.pl line [offset=0] < infile > outfile.";
}
my $l = $ARGV[0];   # print every l-th line
my $o = 0;          # offset default
if ($ARGV[1]) { $o = $ARGV[1]; }

for (my $i=0; $i<$o; $i++) { <STDIN>; }   # skip the first o lines

my $n = 0;
LINE: while (<STDIN>) {
  $n++;
  if (($n-1) % $l == 0) { print $_; }
};
The syntax of sparse is sparse.pl line [offset=0] < infile > outfile. After the first offset lines have been skipped, every line-th line is printed, beginning with the first remaining one. The default value for offset is zero, so this argument may be omitted.
computer> egrep -v "^#|^$" < parsing.dat | sparse.pl 3 0
0 0 1.000000 0.000000
3 9 -0.989992 0.141120
6 36 0.960170 -0.279415
9 81 -0.911130 0.412118
computer> egrep -v "^#|^$" < parsing.dat | sparse.pl 2 1
1 1 0.540302 0.841471
3 9 -0.989992 0.141120
5 25 0.283662 -0.958924
7 49 0.753902 0.656987
9 81 -0.911130 0.412118
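For comparison, the same thinning can be sketched as an awk one-liner. The shell function name thin and the awk variable names step and offset are made up for this example; NR is the number of the current input line, counted from 1.

```shell
thin() {
  # $1 = step, $2 = offset (optional, defaults to 0)
  awk -v step="$1" -v offset="${2:-0}" 'NR > offset && (NR - offset - 1) % step == 0'
}

# seq 0 9 stands in for ten data lines numbered like column 1 of the sample file.
every_third=$(seq 0 9 | thin 3 | paste -s -d ' ' -)
odd_rows=$(seq 0 9 | thin 2 1 | paste -s -d ' ' -)

printf '%s\n' "$every_third" "$odd_rows"
```

A pattern without an action prints the matching line, so no explicit print is needed.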
Assume we have a data set {F(x)} with {x} in [-A,+A]. The zero may be included in {x} (odd number of data points) or not (even number of data points). Now the symmetrized version of F(x), i.e. F(x) + F(-x), shall be calculated.
x_i | F | Symmetrized F |
---|---|---|
x_{-n} = -A | F(x_{-n}) | F(x_{-n}) + F(x_n) |
x_{-n+1} | F(x_{-n+1}) | F(x_{-n+1}) + F(x_{n-1}) |
x_{-n+2} | F(x_{-n+2}) | F(x_{-n+2}) + F(x_{n-2}) |
... | ... | ... |
x_0 = 0 | F(0) | 2 F(0) |
... | ... | ... |
x_{n-2} | F(x_{n-2}) | F(x_{-n+2}) + F(x_{n-2}) |
x_{n-1} | F(x_{n-1}) | F(x_{-n+1}) + F(x_{n-1}) |
x_n = +A | F(x_n) | F(x_{-n}) + F(x_n) |
We use this small perl script to do the job.
#!/usr/bin/perl -w
use strict;

my $n = '[0-9e\.\+\-]+';   # a number
my $s = '[ \t]';           # a separator
my $x = 0.0;
my $f = 0.0;
my @x;
my @f;
my @fsym;
my $l = 0;                 # number of distinct x values read so far
$x[1] = -1e10;
my $eps = 1e-12;

LINE: while (<STDIN>) {
  SWITCH: {
    /^\#/ && do { next LINE; };   # skip comment lines
    /^$/ && do { next LINE; };    # skip empty lines
    /^($s*)($n)($s+)($n)/ && do {
      $x = $2;
      $f = $4;
      if (abs($x) < $eps) { $x = 0.0; }
      if ($l == 0) { $l = 1; }
      else {
        # a repeated x value overwrites the previous entry
        if (abs($x[$l]-$x) > $eps) { $l++; }
      }
      $x[$l] = $x;
      $f[$l] = $f;
      next LINE;
    };
    next LINE;
  };
};

my $z = 0;   # number of index pairs (i, l-i+1)
if (abs($l % 2) == 1) { $z = ($l+1)/2; }
if (abs($l % 2) == 0) { $z = $l /2; }

for (my $i=1; $i<=$z; $i++) {
  if (abs($x[$i] + $x[$l-$i+1]) > 1e-10) {
    die "Sorry, interval [", $x[1], ", ", $x[$l], "] seems not to be symmetric:\n",
        "i=", $i, " ", " ", $x[$i], " != ", $x[$l-$i+1], "\n";
  }
}

for (my $i=1; $i<=$z; $i++) {
  $fsym[$i] = $fsym[$l-$i+1] = $f[$i] + $f[$l-$i+1];
}

print "# Found $l values in [", $x[1], ", ", $x[$l], "].\n";
for (my $i=1; $i<=$l; $i++) {
  printf("%18.12f\t%18.12f\n", $x[$i], $fsym[$i]);
}
A small example working on sym.dat.
computer> cat sym.dat
-10.000000000000 0.000000000000
-9.000000000000 0.105170918076
-8.000000000000 0.221402758160
-7.000000000000 0.349858807576
-6.000000000000 0.491824697641
-5.000000000000 0.648721270700
-4.000000000000 0.822118800391
-3.000000000000 1.013752707470
-2.000000000000 1.225540928492
-1.000000000000 1.459603111157
0.000000000000 1.718281828459
1.000000000000 2.004166023946
2.000000000000 2.320116922737
3.000000000000 2.669296667619
4.000000000000 3.055199966845
5.000000000000 3.481689070338
6.000000000000 3.953032424395
7.000000000000 4.473947391727
8.000000000000 5.049647464413
9.000000000000 5.685894442279
10.000000000000 6.389056098931
computer> ./sym.pl < sym.dat
# Found 21 values in [-10.000000000000, 10.000000000000].
-10.000000000000 6.389056098931
-9.000000000000 5.791065360355
-8.000000000000 5.271050222573
-7.000000000000 4.823806199303
-6.000000000000 4.444857122036
-5.000000000000 4.130410341038
-4.000000000000 3.877318767236
-3.000000000000 3.683049375089
-2.000000000000 3.545657851229
-1.000000000000 3.463769135103
0.000000000000 3.436563656918
1.000000000000 3.463769135103
2.000000000000 3.545657851229
3.000000000000 3.683049375089
4.000000000000 3.877318767236
5.000000000000 4.130410341038
6.000000000000 4.444857122036
7.000000000000 4.823806199303
8.000000000000 5.271050222573
9.000000000000 5.791065360355
10.000000000000 6.389056098931
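For well-behaved input, the pairing of line i with line l-i+1 can also be sketched with tac, paste, and awk. This is not a replacement for the perl script: the sketch assumes that the file is free of comments and empty lines, sorted by x, contains every x only once, and is symmetric about zero, none of which it checks. The file names f.dat and f.rev and the three data points are invented for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical data: x in column 1, F(x) in column 2.
printf '%s\n' '-1.0 0.5' '0.0 1.0' '1.0 2.5' > "$workdir/f.dat"

# Pasting the file next to its reversed copy puts x_i and x_{l-i+1} on the same line.
tac "$workdir/f.dat" > "$workdir/f.rev"
sym=$(paste "$workdir/f.dat" "$workdir/f.rev" |
      awk '{ printf("%.1f\t%.1f\n", $1, $2 + $4) }')

printf '%s\n' "$sym"
rm -r "$workdir"
```

After the paste, columns 1 and 2 hold x and F(x), while columns 3 and 4 hold the mirrored point, so $2 + $4 is the symmetrized value.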