Text processing/1

You are encouraged to solve this task according to the task description, using any language you may know.

Often, data produced by one program is in the wrong format for later use by another program or person. In such situations, another program can be written to parse and transform the original data into a format useful to the other. The term "data munging" is often used in programming circles for this task.

A request on the comp.lang.awk newsgroup led to a typical data munging task:

I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: $1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ...  <val24> <flag24>

Some test data is available at: 
... (no longer available at the original location)

I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e. values with
flag<=0).
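
The "maximum data gap" part reduces to a simple run-length scan: walk the flags in order, extend a counter while readings are invalid, and reset it on each valid reading. Below is a minimal Python sketch of just that logic on a made-up list of flags; the variable names are illustrative and not taken from the solutions further down.

<python># Sketch of the maximum-gap logic on a toy list of flags (illustrative only).
flags = [1, 1, -2, -2, -2, 1, -2, 1]

run = 0        # current run of invalid readings (flag <= 0)
max_run = 0    # longest such run seen so far
for flag in flags:
    if flag < 1:
        run += 1
        if run > max_run:
            max_run = run
    else:
        run = 0

print max_run  # prints 3: the longest gap is three consecutive invalid readings</python>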

The data is free to download and use, and is in the following format:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

Only a sample of the data showing its format is given above. The full example file, readings.txt, is at the bottom of the page.

Structure your program to show statistics for each line of the file (similar to the original Python, Perl, and AWK examples below), followed by summary statistics for the file. When showing example output, show just a few line statistics followed by the full end summary.
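
Whatever the language, each record splits on whitespace into a date followed by alternating values and flags, so index slicing (or a stride-2 loop) separates the pairs. Here is a minimal Python sketch of just the parsing step on a truncated sample record; the names are illustrative only.

<python># Sketch of parsing one record into its date and value/flag pairs
# (illustrative only; a real record has 24 pairs, this one is truncated to 3).
line = "1991-03-30\t10.000\t1\t0.000\t-2\t10.000\t1"

fields = line.split()
date   = fields[0]
values = [float(f) for f in fields[1::2]]   # fields 1, 3, 5, ...
flags  = [int(f)   for f in fields[2::2]]   # fields 2, 4, 6, ...

print date, zip(values, flags)  # 1991-03-30 [(10.0, 1), (0.0, -2), (10.0, 1)]</python>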

AWK

<c># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{

 nodata = 0;             # Current run of consecutive flags<=0 in lines of file
 nodata_max=-1;          # Max consecutive flags<=0 in lines of file
 nodata_maxline="!";     # ... and date(s) of line(s) where it occurs

} FNR==1 {

 # Accumulate input file names
 if(infiles){
   infiles = infiles ", " FILENAME
 } else {
   infiles = FILENAME
 }

} {

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0
 # extract field info, skipping initial date field
 for(field=2; field<=NF; field+=2){
   datum=$field; 
   flag=$(field+1); 
   if(flag<1){
     nodata++
   }else{
     # check run of data-absent fields
     if(nodata_max==nodata && (nodata>0)){
       nodata_maxline=nodata_maxline ", " $1
     }
     if(nodata_max<nodata && (nodata>0)){
       nodata_max=nodata
       nodata_maxline=$1
     }
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line+=datum
     num_line++;
   }
 }
 # totals for the file so far
 tot_file += tot_line
 num_file += num_line
 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
        $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0
 # debug prints of original data plus some of the computed values
 #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
 #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline


}

END{

 printf "\n"
 printf "File(s)  = %s\n", infiles
 printf "Total    = %10.3f\n", tot_file
 printf "Readings = %6i\n", num_file
 printf "Average  = %10.3f\n", tot_file / num_file
 printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline

}</c>

Sample output:

bash$ awk -f readings.awk readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$ 

Perl

<perl># Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {

 $nodata = 0;             # Current run of consecutive flags<=0 in lines of file
 $nodata_max=-1;          # Max consecutive flags<=0 in lines of file
 $nodata_maxline="!";     # ... and date(s) of line(s) where it occurs

} foreach (@ARGV) {

 # Accumulate input file names
 if($infiles ne ""){
   $infiles = "$infiles, $_";
 } else {
   $infiles = $_;
 }

}

while (<>){

 $tot_line=0;             # sum of line data
 $num_line=0;             # number of line data items with flag>0
 # extract field info, skipping initial date field
 chomp;
 @fields = split(/\s+/);
 $nf = @fields;
 $date = $fields[0];
 for($field=1; $field<$nf; $field+=2){
   $datum = $fields[$field] +0.0; 
   $flag  = $fields[$field+1] +0; 
   if($flag < 1){
     $nodata++;
   }else{
     # check run of data-absent fields
     if($nodata_max==$nodata and ($nodata>0)){
       $nodata_maxline = "$nodata_maxline, $fields[0]";
     }
     if($nodata_max<$nodata and ($nodata>0)){
       $nodata_max = $nodata;
       $nodata_maxline=$fields[0];
     }
     # re-initialise run of nodata counter
     $nodata = 0; 
     # gather values for averaging
     $tot_line += $datum;
     $num_line++;
   }
 }
 # totals for the file so far
 $tot_file += $tot_line;
 $num_file += $num_line;
 printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
        $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n"; printf "File(s) = %s\n", $infiles; printf "Total = %10.3f\n", $tot_file; printf "Readings = %6i\n", $num_file; printf "Average = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",

      $nodata_max, $nodata_maxline;</perl>

Sample output:

bash$ perl -f readings.pl readings.txt | tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

Python

<python># Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0            # Current run of consecutive flags<=0 in lines of file
nodata_max = -1       # Max consecutive flags<=0 in lines of file
nodata_maxline = []   # ... and date(s) of line(s) where it occurs

tot_file = 0          # Sum of file data
num_file = 0          # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():

 tot_line=0;             # sum of line data
 num_line=0;             # number of line data items with flag>0
 # extract field info
 field = line.split()
 date  = field[0]
 data  = [float(f) for f in field[1::2]]
 flags = [int(f)   for f in field[2::2]]
 for datum, flag in zip(data, flags):
   if flag<1:
     nodata += 1
   else:
     # check run of data-absent fields
     if nodata_max==nodata and nodata>0:
       nodata_maxline.append(date)
     if nodata_max<nodata and nodata>0:
       nodata_max=nodata
       nodata_maxline=[date]
     # re-initialise run of nodata counter
     nodata=0; 
     # gather values for averaging
     tot_line += datum
     num_line += 1
 # totals for the file so far
 tot_file += tot_line
 num_file += num_line
 print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
       date, 
       len(data) -num_line, 
       num_line, tot_line, 
       tot_line/num_line if (num_line>0) else 0)

print "" print "File(s) = %s" % (", ".join(infiles),) print "Total = %10.3f" % (tot_file,) print "Readings = %6i" % (num_file,) print "Average = %10.3f" % (tot_file / num_file,)

print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (

   nodata_max, ", ".join(nodata_maxline))</python>

Sample output:

bash$ /cygdrive/c/Python26/python readings.py readings.txt|tail
Line:  2004-12-29  Reject:  1  Accept: 23  Line_tot:     56.300  Line_avg:      2.448
Line:  2004-12-30  Reject:  1  Accept: 23  Line_tot:     65.300  Line_avg:      2.839
Line:  2004-12-31  Reject:  1  Accept: 23  Line_tot:     47.300  Line_avg:      2.057

File(s)  = readings.txt
Total    = 1358393.400
Readings = 129403
Average  =     10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
bash$

Example Data: readings.txt


I had hoped to add the full file to this page in a scrolling window, but forgot about its size. It is too large as a text file for Google Docs, so I zipped it, embedded the zip in an OpenOffice Writer file, and put that on Google Docs here.

(I would appreciate someone putting the zipped file on Rosetta Code and then adjusting the sentence above.)