Text processing/2

From Rosetta Code
Revision as of 03:38, 11 September 2014 by rosettacode>Gerard Schildberger (→‎{{header|REXX}}: removed OVERFLOW from PRE html tag.)
Task
Text processing/2
You are encouraged to solve this task according to the task description, using any language you may know.

The following data shows a few lines from the file readings.txt (as used in the Data Munging task).

The data comes from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, constituting a line of 49 white-space-separated fields, where the white-space can be one or more space or tab characters.

The fields (from the left) are:

 DATESTAMP [ VALUEn FLAGn ] * 24

i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with that instrument, in which case that instrument's value should be ignored.

A sample from the full data file readings.txt is:

1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1

The task:

  1. Confirm the general field format of the file.
  2. Identify any DATESTAMPs that are duplicated.
  3. Report the number of records that have good readings for all instruments.
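
All of the solutions below follow the same outline. As a neutral reference, here is a minimal sketch of the three checks in Python; the `analyze` helper, its regular expression, and the raise-on-bad-format behaviour are illustrative choices, not part of the task:

```python
import re

# One record: a datestamp, then 24 whitespace-separated value/flag pairs.
LINE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}(\s+-?\d+\.\d+\s+-?\d+){24}$')

def analyze(lines):
    """Return (duplicated datestamps, count of all-good records)."""
    seen, dups, good = set(), [], 0
    for n, line in enumerate(lines, 1):
        if not LINE_RE.match(line):            # 1. confirm the field format
            raise ValueError("bad format at line %d" % n)
        fields = line.split()
        date, flags = fields[0], [int(f) for f in fields[2::2]]
        if date in seen:                       # 2. duplicated DATESTAMPs
            dups.append(date)
        seen.add(date)
        if all(f >= 1 for f in flags):         # 3. all 24 readings good
            good += 1
    return dups, good
```

Run against readings.txt, the two printed quantities correspond to tasks 2 and 3.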

Ada

<lang ada>with Ada.Calendar;           use Ada.Calendar;
with Ada.Text_IO;            use Ada.Text_IO;
with Strings_Edit;           use Strings_Edit;
with Strings_Edit.Floats;    use Strings_Edit.Floats;
with Strings_Edit.Integers;  use Strings_Edit.Integers;

with Generic_Map;

procedure Data_Munging_2 is

  package Time_To_Line is new Generic_Map (Time, Natural);
  use Time_To_Line;
  File    : File_Type;
  Line_No : Natural := 0;
  Count   : Natural := 0;
  Stamps  : Map;

begin

  Open (File, In_File, "readings.txt");
  loop
     declare
        Line    : constant String := Get_Line (File);
        Pointer : Integer := Line'First;
        Flag    : Integer;
        Year, Month, Day : Integer;
        Data    : Float;
        Stamp   : Time;
        Valid   : Boolean := True;
     begin
        Line_No := Line_No + 1;
        Get (Line, Pointer, SpaceAndTab);
        Get (Line, Pointer, Year);
        Get (Line, Pointer, Month);
        Get (Line, Pointer, Day);
        Stamp := Time_Of (Year_Number (Year), Month_Number (-Month), Day_Number (-Day));
        begin
           Add (Stamps, Stamp, Line_No);
        exception
           when Constraint_Error =>
              Put (Image (Year) & Image (Month) & Image (Day) & ": record at " & Image (Line_No));
              Put_Line (" duplicates record at " & Image (Get (Stamps, Stamp)));
        end;
        Get (Line, Pointer, SpaceAndTab);
        for Reading in 1..24 loop
           Get (Line, Pointer, Data);
           Get (Line, Pointer, SpaceAndTab);
           Get (Line, Pointer, Flag);
           Get (Line, Pointer, SpaceAndTab);
           Valid := Valid and then Flag >= 1;
        end loop;
        if Pointer <= Line'Last then
           Put_Line ("Unrecognized tail at " & Image (Line_No) & ':' & Image (Pointer));
        elsif Valid then
           Count := Count + 1;
        end if;
     exception
        when End_Error | Data_Error | Constraint_Error | Time_Error =>
           Put_Line ("Syntax error at " & Image (Line_No) & ':' & Image (Pointer));
     end;
  end loop;

exception

  when End_Error =>
     Close (File);
     Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");

end Data_Munging_2;</lang>

Sample output:

1990-3-25: record at 85 duplicates record at 84
1991-3-31: record at 456 duplicates record at 455
1992-3-29: record at 820 duplicates record at 819
1993-3-28: record at 1184 duplicates record at 1183
1995-3-26: record at 1911 duplicates record at 1910
Valid records 5017 of 5471 total

Aime

<lang aime>void check_format(list l) {

   integer i;
   text s;
   if (l_length(l) != 49) {
       error("wrong number of fields");
   }
   s = lf_q_text(l);
   if (length(s) != 10 || s[4] != '-' || s[7] != '-') {
       error("bad date format");
   }
   atoi(delete(delete(s, 7), 4));
   i = 1;
   while (i < 49) {
       l_r_real(l, i, atof(l_q_text(l, i)));
       i += 1;
       l_r_integer(l, i, atoi(l_q_text(l, i)));
       i += 1;
   }

}

integer main(void) {

   integer goods;
   file f;
   list l;
   record r;
   goods = 0;
   f_affix(f, "readings.txt");
   while (f_list(f, l, 0) != -1) {
       if (!trap(check_format, l)) {
           if (r_key(r, l_head(l))) {
               v_form("duplicate ~ line\n", l_head(l));
           } else {
               integer i;
               r_put(r, l_head(l), 0);
               i = 2;
               while (i < 49) {
                   if (l_q_integer(l, i) != 1) {
                       break;
                   }
                   i += 2;
               }
               if (49 < i) {
                   goods += 1;
               }
           }
       }
   }
   o_integer(goods);
   o_text(" good unique lines\n");
   return 0;

}</lang>

Output:

(the "readings.txt" file needs to be converted to UNIX end-of-line)

duplicate 1990-03-25 line
duplicate 1991-03-31 line
duplicate 1992-03-29 line
duplicate 1993-03-28 line
duplicate 1995-03-26 line
5013 good unique lines
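
The line-ending conversion mentioned above can be done in many ways; one minimal sketch in Python (the helper name and the in-place rewrite are illustrative choices):

```python
def to_unix_eol(path):
    # Rewrite a file so DOS (CRLF) line endings become UNIX (LF) line endings.
    # Reading in binary mode avoids any newline translation by Python itself.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))
```

For the task, `to_unix_eol("readings.txt")` would be run once before the Aime program.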

AutoHotkey

<lang autohotkey>; Author: AlephX Aug 17 2011
data = %A_scriptdir%\readings.txt

Loop, Read, %data%
{
    Lines := A_Index
    StringReplace, dummy, A_LoopReadLine, %A_Tab%,, All UseErrorLevel
    Loop, parse, A_LoopReadLine, %A_Tab%
    {
        wrong := 0
        if A_index = 1
        {
            Date := A_LoopField
            if (Date == OldDate)
            {
                WrongDates = %WrongDates%%OldDate% at %Lines%`n
                TotwrongDates++
                Wrong := 1
                break
            }
        }
        else
        {
            if (A_loopfield/1 < 0)
            {
                Wrong := 1
                break
            }
        }
    }

    if (wrong == 1)
        totwrong++
    else
        valid++

    if (errorlevel <> 48)
    {
        if (wrong == 0)
        {
            totwrong++
            valid--
        }
        unvalidformat++
    }

    olddate := date
}

msgbox, Duplicate Dates:`n%wrongDates%`nRead Lines: %lines%`nValid Lines: %valid%`nwrong lines: %totwrong%`nDuplicates: %TotWrongDates%`nWrong Formatted: %unvalidformat%`n
</lang>

Sample Output:

Duplicate Dates:
1990-03-25 at 85
1991-03-31 at 456
1992-03-29 at 820
1993-03-28 at 1184
1995-03-26 at 1911

Read Lines: 5471
Valid Lines: 5129
wrong lines: 342
Duplicates: 5
Wrong Formatted: 0

AWK

A series of AWK one-liners is shown, as this is often how such a job is done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created that combined multi-line versions of the scripts below.

Gradually tie down the format.

(In each case offending lines will be printed)

If there are any scientific-notation fields then there will be an e in the file:
<lang awk>bash$ awk '/[eE]/' readings.txt
bash$</lang>
Quick check on the number of fields:
<lang awk>bash$ awk 'NF != 49' readings.txt
bash$</lang>
Full check on the file format using a regular expression:
<lang awk>bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt
bash$</lang>
Full check on the file format as above but using regular expressions allowing intervals (gnu awk):
<lang awk>bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}$/ )' readings.txt
bash$</lang>


Identify any DATESTAMPs that are duplicated.

Accomplished by counting how many times the first field occurs and noting any second occurrences.
<lang awk>bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$</lang>


What number of records have good readings for all instruments.

<lang awk>bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' readings.txt
Total records 5471 OK records 5017 or 91.7017 %
bash$</lang>

C

<lang c>#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

typedef struct { const char *s; int ln, bad; } rec_t;

int cmp_rec(const void *aa, const void *bb)
{
	const rec_t *a = aa, *b = bb;
	return a->s == b->s ? 0 : !a->s ? 1 : !b->s ? -1 : strncmp(a->s, b->s, 10);
}

int read_file(const char *fn)
{
	int fd = open(fn, O_RDONLY);
	if (fd == -1) return 0;

	struct stat s;
	fstat(fd, &s);

	char *txt = malloc(s.st_size);
	read(fd, txt, s.st_size);
	close(fd);

	int i, j, lines = 0, k, di, bad;
	for (i = lines = 0; i < s.st_size; i++)
		if (txt[i] == '\n') {
			txt[i] = '\0';
			lines++;
		}

	rec_t *rec = calloc(sizeof(rec_t), lines);
	const char *ptr, *end;
	rec[0].s = txt;
	rec[0].ln = 1;
	for (i = 0; i < lines; i++) {
		if (i + 1 < lines) {
			rec[i + 1].s = rec[i].s + strlen(rec[i].s) + 1;
			rec[i + 1].ln = i + 2;
		}
		if (sscanf(rec[i].s, "%4d-%2d-%2d", &di, &di, &di) != 3) {
			printf("bad line %d: %s\n", i, rec[i].s);
			rec[i].s = 0;
			continue;
		}
		ptr = rec[i].s + 10;

		for (j = k = 0; j < 25; j++) {
			if (!strtod(ptr, (char**)&end) && end == ptr) break;
			k++, ptr = end;
			if (!(di = strtol(ptr, (char**)&end, 10)) && end == ptr) break;
			k++, ptr = end;
			if (di < 1) rec[i].bad = 1;
		}

		if (k != 48) {
			printf("bad format at line %d: %s\n", i, rec[i].s);
			rec[i].s = 0;
		}
	}

	qsort(rec, lines, sizeof(rec_t), cmp_rec);
	for (i = 1, bad = rec[0].bad, j = 0; i < lines && rec[i].s; i++) {
		if (rec[i].bad) bad++;
		if (strncmp(rec[i].s, rec[j].s, 10)) {
			j = i;
		} else
			printf("dup line %d: %.10s\n", rec[i].ln, rec[i].s);
	}

	free(rec);
	free(txt);
	printf("\n%d out %d lines good\n", lines - bad, lines);
	return 0;
}

int main()
{
	read_file("readings.txt");
	return 0;
}</lang>

Output:
dup line 85: 1990-03-25
dup line 456: 1991-03-31
dup line 820: 1992-03-29
dup line 1184: 1993-03-28
dup line 1911: 1995-03-26

5017 out 5471 lines good

C++

Library: Boost

<lang cpp>#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>
#include <string>
#include <set>
#include <cstdlib>
#include <algorithm>

using namespace std ;

boost::regex e ( "\\s+" ) ;

int main( int argc , char *argv[ ] ) {

  ifstream infile( argv[ 1 ] ) ; 
  vector<string> duplicates ;
  set<string> datestamps ; //for the datestamps
  if ( ! infile.is_open( ) ) { 
     cerr << "Can't open file " << argv[ 1 ] << '\n' ;
     return 1 ; 
  }   
  int all_ok = 0  ;//all_ok for lines in the given pattern e
  int pattern_ok = 0 ; //overall field pattern of record is ok
  while ( infile ) { 
     string eingabe ;
     getline( infile , eingabe ) ;
     boost::sregex_token_iterator i ( eingabe.begin( ), eingabe.end( ) , e , -1 ), j  ;//we tokenize on empty fields
     vector<string> fields( i, j ) ;
     if ( fields.size( ) == 49 ) //we expect 49 fields in a record
        pattern_ok++ ;
     else
        cout << "Format not ok!\n" ;
     if ( datestamps.insert( fields[ 0 ] ).second ) { //not duplicated
        int howoften = ( fields.size( ) - 1 ) / 2 ;//number of measurement
                                                   //devices and values
        for ( int n = 1 ; atoi( fields[ 2 * n ].c_str( ) ) >= 1 ; n++ ) {
           if ( n == howoften ) {
              all_ok++ ;
              break ;
           }
        }
     }
     else {
        duplicates.push_back( fields[ 0 ] ) ;//first field holds datestamp
     }
  }
  infile.close( ) ;
  cout << "The following " << duplicates.size() << " datestamps were duplicated:\n" ;
  copy( duplicates.begin( ) , duplicates.end( ) ,
        ostream_iterator<string>( cout , "\n" ) ) ;
  cout << all_ok << " records were complete and ok!\n" ;
  return 0 ;

}</lang>

Output:
Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31

C#

<lang csharp>using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.IO;

namespace TextProc2 {

   class Program
   {
       static void Main(string[] args)
       {
           Regex multiWhite = new Regex(@"\s+");
           Regex dateEx = new Regex(@"^\d{4}-\d{2}-\d{2}$");
           Regex valEx = new Regex(@"^\d+\.{1}\d{3}$");
           Regex flagEx = new Regex(@"^[1-9]{1}$");
           
           int missformcount = 0, totalcount = 0;
           Dictionary<int, string> dates = new Dictionary<int, string>();
           using (StreamReader sr = new StreamReader("readings.txt"))
           {
               string line = sr.ReadLine();
               while (line != null)
               {
                   line = multiWhite.Replace(line, @" ");                    
                   string[] splitLine = line.Split(' ');
                   if (splitLine.Length != 49)
                       missformcount++;
                   if (!dateEx.IsMatch(splitLine[0]))                        
                       missformcount++;                    
                   else
                       dates.Add(totalcount + 1, dateEx.Match(splitLine[0]).ToString());
                   int err = 0;                    
                   for (int i = 1; i < splitLine.Length; i++)
                   {
                       if (i%2 != 0)
                       {
                           if (!valEx.IsMatch(splitLine[i]))                          
                               err++;
                       }
                       else
                       {
                           if (!flagEx.IsMatch(splitLine[i]))
                               err++;                                                        
                       }                        
                   }
                   if (err != 0) missformcount++;
                   line = sr.ReadLine();
                   totalcount++;                    
               }
           }
           int goodEntries = totalcount - missformcount;
           Dictionary<string,List<int>> dateReverse = new Dictionary<string,List<int>>();
           foreach (KeyValuePair<int, string> kvp in dates)
           {
               if (!dateReverse.ContainsKey(kvp.Value))
                   dateReverse[kvp.Value] = new List<int>();
               dateReverse[kvp.Value].Add(kvp.Key);
           }
           Console.WriteLine(goodEntries + " valid Records out of " + totalcount);
           foreach (KeyValuePair<string, List<int>> kvp in dateReverse)
           {
               if (kvp.Value.Count > 1)
                   Console.WriteLine("{0} is duplicated at Lines : {1}", kvp.Key, string.Join(",", kvp.Value));                    
           }
       }
   }

}</lang>

Output:

5017 valid Records out of 5471
1990-03-25 is duplicated at Lines : 84,85
1991-03-31 is duplicated at Lines : 455,456
1992-03-29 is duplicated at Lines : 819,820
1993-03-28 is duplicated at Lines : 1183,1184
1995-03-26 is duplicated at Lines : 1910,1911

COBOL

Works with: OpenCOBOL

<lang cobol> IDENTIFICATION DIVISION.

      PROGRAM-ID. text-processing-2.
      ENVIRONMENT DIVISION.
      INPUT-OUTPUT SECTION.
      FILE-CONTROL.
          SELECT readings ASSIGN Input-File-Path
              ORGANIZATION LINE SEQUENTIAL
              FILE STATUS file-status.
      
      DATA DIVISION.
      FILE SECTION.
      FD  readings.
      01  reading-record.
          03  date-stamp          PIC X(10).
          03  FILLER              PIC X.
          03  input-data          PIC X(300).
      LOCAL-STORAGE SECTION.
      78  Input-File-Path         VALUE "readings.txt".
      78  Num-Data-Points         VALUE 48.
      01  file-status             PIC XX.
      01  current-line            PIC 9(5).
      01  num-date-stamps-read    PIC 9(5).
      01  read-date-stamps-area.
          03  read-date-stamps    PIC X(10) OCCURS 1 TO 10000 TIMES
                                  DEPENDING ON num-date-stamps-read
                                  INDEXED BY date-stamp-idx.
      01  offset                  PIC 999.
      01  data-len                PIC 999.
      01  data-flag               PIC X.
          88  data-not-found      VALUE "N".
      01  data-field              PIC X(25).
      01  i                       PIC 99.
      01  num-good-readings       PIC 9(5).
      01  reading-flag            PIC X.
          88 bad-reading          VALUE "B".
      01  delim                   PIC X.
      PROCEDURE DIVISION.
      DECLARATIVES.
      readings-error SECTION.
          USE AFTER ERROR ON readings
          DISPLAY "An error occurred while using " Input-File-Path
          DISPLAY "Error code " file-status
          DISPLAY "The program will terminate."
          CLOSE readings
          GOBACK
          .
      END DECLARATIVES.
      main-line.
          OPEN INPUT readings
          *> Process each line of the file.
          PERFORM FOREVER
              READ readings
                  AT END
                      EXIT PERFORM
              END-READ
              ADD 1 TO current-line
              IF reading-record = SPACES
                  DISPLAY "Line " current-line " is blank."
                  EXIT PERFORM CYCLE
              END-IF
              PERFORM check-duplicate-date-stamp
              *> Check there are 24 data pairs and see if all the
              *> readings are ok.
              INITIALIZE offset, reading-flag, data-flag
              PERFORM VARYING i FROM 1 BY 1 UNTIL Num-Data-Points < i
                  PERFORM get-next-field
                  IF data-not-found
                      DISPLAY "Line " current-line " has missing "
                          "fields."
                      SET bad-reading TO TRUE
                      EXIT PERFORM
                  END-IF
                  *> Every other data field is the instrument flag.
                  IF FUNCTION MOD(i, 2) = 0 AND NOT bad-reading
                      IF FUNCTION NUMVAL(data-field) <= 0
                          SET bad-reading TO TRUE
                      END-IF
                  END-IF
                  ADD data-len TO offset
              END-PERFORM
              IF NOT bad-reading
                  ADD 1 TO num-good-readings
              END-IF
          END-PERFORM
          CLOSE readings
          *> Display results.
          DISPLAY SPACE
          DISPLAY current-line " lines read."
          DISPLAY num-good-readings " have good readings for all "
              "instruments."
          GOBACK
          .
      check-duplicate-date-stamp.
          SEARCH read-date-stamps
              AT END
                  ADD 1 TO num-date-stamps-read
                  MOVE date-stamp
                      TO read-date-stamps (num-date-stamps-read)
              WHEN read-date-stamps (date-stamp-idx) = date-stamp
                  DISPLAY "Date " date-stamp " is duplicated at "
                      "line " current-line "."
          END-SEARCH
          .
      get-next-field.
          INSPECT input-data (offset:) TALLYING offset
              FOR LEADING X"09"
          *> The fields are normally delimited by a tab.
          MOVE X"09" TO delim
          PERFORM find-num-chars-before-delim
          *> If the delimiter was not found...
          IF FUNCTION SUM(data-len, offset) > 300
              *> The data may be delimited by a space if it is at the
              *> end of the line.
              MOVE SPACE TO delim
              PERFORM find-num-chars-before-delim
              IF FUNCTION SUM(data-len, offset) > 300
                  SET data-not-found TO TRUE
                  EXIT PARAGRAPH
              END-IF
          END-IF
          IF data-len = 0
              SET data-not-found TO TRUE
              EXIT PARAGRAPH
          END-IF
          MOVE input-data (offset:data-len) TO data-field
          .
      find-num-chars-before-delim.
          INITIALIZE data-len
          INSPECT input-data (offset:) TALLYING data-len
              FOR CHARACTERS BEFORE delim
          .</lang>
Output:
Date 1990-03-25 is duplicated at line 00084.
Date 1991-03-31 is duplicated at line 00455.
Date 1992-03-29 is duplicated at line 00819.
Date 1993-03-28 is duplicated at line 01183.
Date 1995-03-26 is duplicated at line 01910.
 
05470 lines read.
05016 have good readings for all instruments.

D

<lang d>void main() {

   import std.stdio, std.array, std.string, std.regex, std.conv,
          std.algorithm;
   auto rxDate = `^\d\d\d\d-\d\d-\d\d$`.regex;
   // Works but eats lot of RAM in DMD 2.064.
   // auto rxDate = ctRegex!(`^\d\d\d\d-\d\d-\d\d$`);
   int[string] repeatedDates;
   int goodReadings;
   foreach (string line; "readings.txt".File.lines) {
       try {
           auto parts = line.split;
           if (parts.length != 49)
               throw new Exception("Wrong column count");
           if (parts[0].match(rxDate).empty)
               throw new Exception("Date is wrong");
           repeatedDates[parts[0]]++;
           bool noProblem = true;
           for (int i = 1; i < 48; i += 2) {
               if (parts[i + 1].to!int < 1)
                   // don't break loop because it's validation too.
                   noProblem = false;
               if (!parts[i].isNumeric)
                   throw new Exception("Reading is wrong: "~parts[i]);
           }
           if (noProblem)
               goodReadings++;
       } catch(Exception ex) {
           writefln(`Problem in line "%s": %s`, line, ex);
       }
   }
   writefln("Duplicated timestamps: %-(%s, %)",
           repeatedDates.byKey.filter!(k => repeatedDates[k] > 1));
   writeln("Good reading records: ", goodReadings);

}</lang>

Output:
Duplicated timestamps: 1990-03-25, 1991-03-31, 1992-03-29, 1993-03-28, 1995-03-26
Good reading records: 5017

Erlang

Uses a function from Text_processing/1. It does some correctness checks for us.
<lang Erlang>
-module( text_processing2 ).

-export( [task/0] ).

task() ->
    Name = "priv/readings.txt",
    try
        File_contents = text_processing:file_contents( Name ),
        [correct_field_format(X) || X <- File_contents],
        {_Previous, Duplicates} = lists:foldl( fun date_duplicates/2, {"", []}, File_contents ),
        io:fwrite( "Duplicates: ~p~n", [Duplicates] ),
        Good = [X || X <- File_contents, is_all_good_readings(X)],
        io:fwrite( "Good readings: ~p~n", [erlang:length(Good)] )
    catch
        _:Error -> io:fwrite( "Error: Failed when checking ~s: ~p~n", [Name, Error] )
    end.

correct_field_format( {_Date, Value_flags} ) ->
    Correct_number = value_flag_records(),
    {correct_field_format, Correct_number} = {correct_field_format, erlang:length(Value_flags)}.

date_duplicates( {Date, _Value_flags}, {Date, Acc} ) -> {Date, [Date | Acc]};
date_duplicates( {Date, _Value_flags}, {_Other, Acc} ) -> {Date, Acc}.

is_all_good_readings( {_Date, Value_flags} ) ->
    value_flag_records() =:= erlang:length( [ok || {_Value, ok} <- Value_flags] ).

value_flag_records() -> 24.
</lang>

Output:
12> text_processing2:task().
Duplicates: ["1995-03-26","1993-03-28","1992-03-29","1991-03-31","1990-03-25"]
Good readings: 5017

F#

<lang fsharp> open System.Collections.Generic

let file = @"readings.txt"

let dates = HashSet(HashIdentity.Structural)
let mutable ok = 0

do

 for line in System.IO.File.ReadAllLines file do
   match String.split [' '; '\t'] line with
   | [] -> ()
   | date::xys ->
       if dates.Contains date then
         printf "Date %s is duplicated\n" date
       else
         dates.Add date
       let f (b, t) h = not b, if b then int h::t else t
       let _, states = Seq.fold f (false, []) xys
       if Seq.forall (fun s -> s >= 1) states then
         ok <- ok + 1
 printf "%d records were ok\n" ok

</lang>

Prints:

 Date 1990-03-25 is duplicated
 Date 1991-03-31 is duplicated
 Date 1992-03-29 is duplicated
 Date 1993-03-28 is duplicated
 Date 1995-03-26 is duplicated
 5017 records were ok

Go

<lang go>package main

import (

   "bufio"
   "fmt"
   "io"
   "os"
   "strconv"
   "strings"

)

var fn = "readings.txt"

func main() {

   f, err := os.Open(fn)
   if err != nil {
       fmt.Println(err)
       return
   }
   defer f.Close()
   var allGood, uniqueGood int
   // map records not only dates seen, but also if an all-good record was
   // seen for the key date.
   m := make(map[string]bool)
   for lr := bufio.NewReader(f); ; {
       line, pref, err := lr.ReadLine()
       if err == io.EOF {
           break
       }
       if err != nil {
           fmt.Println(err)
           return
       }
       if pref {
           fmt.Println("Unexpected long line.")
           return
       }
       f := strings.Fields(string(line))
       if len(f) != 49 {
           fmt.Println("unexpected format,", len(f), "fields.")
           return
       }
       good := true
       for i := 1; i < 49; i += 2 {
           flag, err := strconv.Atoi(f[i+1])
           if err != nil {
               fmt.Println(err)
               return
           }
           if flag > 0 { // value is good
               _, err := strconv.ParseFloat(f[i], 64)
               if err != nil {
                   fmt.Println(err)
                   return
               }
           } else { // value is bad
               good = false
           }
       }
       if good {
           allGood++
       }
       previouslyGood, seen := m[f[0]]
       if seen {
           fmt.Println("Duplicate datestamp:", f[0])
           if !previouslyGood && good {
               m[string([]byte(f[0]))] = true
               uniqueGood++
           }
       } else {
           m[string([]byte(f[0]))] = good
           if good {
               uniqueGood++
           }
       }
   }
   fmt.Println("\nData format valid.")
   fmt.Println(allGood, "records with good readings for all instruments.")
   fmt.Println(uniqueGood,
       "unique dates with good readings for all instruments.")

}</lang>

Output:

Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26

Data format valid.
5017 records with good readings for all instruments.
5013 unique dates with good readings for all instruments.

Haskell

<lang haskell> import Data.List (nub, (\\))

data Record = Record {date :: String, recs :: [(Double, Int)]}

duplicatedDates rs = rs \\ nub rs

goodRecords = filter ((== 24) . length . filter ((>= 1) . snd) . recs)

parseLine l = let ws = words l in Record (head ws) (mapRecords (tail ws))

mapRecords [] = []
mapRecords [_] = error "invalid data"
mapRecords (value:flag:tail) = (read value, read flag) : mapRecords tail

main = do

 inputs <- (map parseLine . lines) `fmap` readFile "readings.txt"
 putStr (unlines ("duplicated dates:": duplicatedDates (map date inputs)))
 putStrLn ("number of good records: " ++ show (length $ goodRecords inputs))

</lang>

this script outputs:

duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017

Icon and Unicon

The following works in both languages. It assumes there is nothing wrong with duplicated timestamps that are on well-formed records.

<lang unicon>procedure main(A)

   dups := set()
   goodRecords := 0
   lastDate := badFile := &null
   f := A[1] | "readings.txt"
   fin := open(f) | stop("Cannot open file '",f,"'")

   while (fields := 0, badReading := &null, line := read(fin)) do {
       line ? {
           ldate := tab(many(&digits ++ '-')) | (badFile := "yes", next)
           if \lastDate == ldate then insert(dups, ldate)
           lastDate := ldate
           while tab(many(' \t')) do {
               (value := real(tab(many(&digits++'-.'))),
                tab(many(' \t')),
                flag := integer(tab(many(&digits++'-'))),
                fields +:= 1) | (badFile := "yes")
               if flag < 1 then badReading := "yes"
               }
           }
       if fields = 24 then goodRecords +:= (/badReading, 1)
       else badFile := "yes"
       }
   if (\badFile) then write(f," has field format issues.")
   write("There are ",goodRecords," records with all good readings.")
   if *dups > 0 then {
       write("The following dates have multiple records:")
       every writes(" ",!sort(dups))
       write()
       }

end</lang>

Sample run:

->tp2
There are 5017 records with all good readings.
The following dates have multiple records:
 1990-03-25 1991-03-31 1992-03-29 1993-03-28 1995-03-26
->

J

<lang j> require 'tables/dsv dates'

  dat=: TAB readdsv jpath '~temp/readings.txt'
  Dates=: getdate"1 >{."1 dat
  Vals=:  _99 ". >(1 + +: i.24){"1 dat
  Flags=: _99 ". >(2 + +: i.24){"1 dat
  # Dates                      NB. Total # lines

5471

  +/ *./"1 ] 0 = Dates         NB. # lines with invalid date formats

0

  +/ _99 e."1 Vals,.Flags      NB. # lines with invalid value or flag formats

0

  +/ *./"1   [0 < Flags        NB. # lines with only valid flags

5017

  ~. (#~ (i.~ ~: i:~)) Dates   NB. Duplicate dates

1990 3 25 1991 3 31 1992 3 29 1993 3 28 1995 3 26</lang>

Java

Translation of: C++
Works with: Java version 1.5+

<lang java5>import java.util.*;
import java.util.regex.*;
import java.io.*;

public class DataMunging2 {

   public static final Pattern e = Pattern.compile("\\s+");
   public static void main(String[] args) {
       try {
           BufferedReader infile = new BufferedReader(new FileReader(args[0]));
           List<String> duplicates = new ArrayList<String>();
           Set<String> datestamps = new HashSet<String>(); //for the datestamps
           String eingabe;
           int all_ok = 0;//all_ok for lines in the given pattern e
           while ((eingabe = infile.readLine()) != null) { 
               String[] fields = e.split(eingabe); //we tokenize on empty fields
               if (fields.length != 49) //we expect 49 fields in a record
                   System.out.println("Format not ok!");
               if (datestamps.add(fields[0])) { //not duplicated
                   int howoften = (fields.length - 1) / 2 ; //number of measurement
                                                            //devices and values
                   for (int n = 1; Integer.parseInt(fields[2*n]) >= 1; n++) {
                       if (n == howoften) {
                           all_ok++ ;
                           break ;
                       }
                   }
               } else {
                   duplicates.add(fields[0]); //first field holds datestamp
               }
           }
           infile.close();
           System.out.println("The following " + duplicates.size() + " datestamps were duplicated:");
           for (String x : duplicates)
               System.out.println(x);
           System.out.println(all_ok + " records were complete and ok!");
       } catch (IOException e) {
           System.err.println("Can't open file " + args[0]);
           System.exit(1);
       }
   }

}</lang>

The program produces the following output:

The following 5 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
5013 records were complete and ok!

JavaScript

Works with: JScript

<lang javascript>// wrap up the counter variables in a closure.
function analyze_func(filename) {

   var dates_seen = {};
   var format_bad = 0;
   var records_all = 0;
   var records_good = 0;
   return function() {
       var fh = new ActiveXObject("Scripting.FileSystemObject").openTextFile(filename, 1); // 1 = for reading
       while ( ! fh.atEndOfStream) {
           records_all ++;
           var allOK = true;
           var line = fh.ReadLine();
           var fields = line.split('\t');
           if (fields.length != 49) {
               format_bad ++;
               continue;
           }
           var date = fields.shift();
           if (has_property(dates_seen, date)) 
               WScript.echo("duplicate date: " + date);
           else
               dates_seen[date] = 1;
           while (fields.length > 0) {
               var value = parseFloat(fields.shift());
               var flag = parseInt(fields.shift(), 10);
               if (isNaN(value) || isNaN(flag)) {
                   format_bad ++;
               }
               else if (flag <= 0) {
                   allOK = false;
               }
           }
           if (allOK)
               records_good ++;
       }
       fh.close();
       WScript.echo("total records: " + records_all);
       WScript.echo("Wrong format: " + format_bad);
       WScript.echo("records with no bad readings: " + records_good);
   }

}

function has_property(obj, propname) {

   return typeof(obj[propname]) == "undefined" ? false : true;

}

var analyze = analyze_func('readings.txt');
analyze();</lang>

Lua

<lang lua>filename = "readings.txt"
io.input( filename )

dates = {}
duplicated, bad_format = {}, {}
num_good_records, lines_total = 0, 0

while true do

   line = io.read( "*line" )
   if line == nil then break end
   
   lines_total = lines_total + 1
   date = string.match( line, "%d+%-%d+%-%d+" )
   if dates[date] ~= nil then
       duplicated[#duplicated+1] = date
   end    
   dates[date] = 1
   
   count_pairs, bad_values = 0, false
   for v, w in string.gmatch( line, "%s(%d+[%.%d+]*)%s(%-?%d)" ) do        
       count_pairs = count_pairs + 1        
       if tonumber(w) <= 0 then 
           bad_values = true 
       end        
   end
   if count_pairs ~= 24 then 
       bad_format[#bad_format+1] = date
   end
   if not bad_values then
       num_good_records = num_good_records + 1
   end

end

print( "Lines read:", lines_total )
print( "Valid records: ", num_good_records )
print( "Duplicate dates:" )
for i = 1, #duplicated do

   print( "   ", duplicated[i] )

end
print( "Bad format:" )
for i = 1, #bad_format do

   print( "   ", bad_format[i] )

end</lang>
Output:

Lines read:	5471
Valid records: 	5017
Duplicate dates:
   	1990-03-25
   	1991-03-31
   	1992-03-29
   	1993-03-28
   	1995-03-26
Bad format:

Mathematica

<lang Mathematica>data = Import["Readings.txt","TSV"];
Print["duplicated dates: "];
Select[Tally@data[[;;,1]], #[[2]]>1&][[;;,1]]//Column
Print["number of good records: ", Count[(Times@@#[[3;;All;;2]])& /@ data, 1], " (out of a total of ", Length[data], ")"]</lang>

duplicated dates: 
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26

number of good records: 5017 (out of a total of 5471)

MATLAB / Octave

<lang MATLAB>function [val,count] = readdat(filename)
% READDAT reads readings.txt file
%
% The value of boolean parameters can be tested with
%    exist(parameter,'var')

if nargin<1,
   filename = 'readings.txt';
end;

fid = fopen(filename);
if fid<0, error('cannot open file %s\n',filename); end;
[val,count] = fscanf(fid,'%04d-%02d-%02d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d %f %d \n');
fclose(fid);

count = count/51;

if (count<1) || count~=floor(count),

    error('file has incorrect format\n')

end;

val = reshape(val,51,count)';  % make matrix with 51 rows and count columns, then transpose it.

d = datenum(val(:,1:3)); % compute timestamps

printf('The following records are followed by a duplicate:');
dix = find(diff(d)==0)   % check for two consecutive timestamps with zero difference

printf('number of valid records: %i\n ', sum( all( val(:,5:2:end) >= 1, 2) ) );</lang>

>> [val,count]=readdat;
The following records are followed by a duplicate:dix =

     84
    455
    819
   1183
   1910

number of valid records: 5017

OCaml

<lang ocaml>#load "str.cma"
open Str

let strip_cr str =

 let last = pred(String.length str) in
 if str.[last] <> '\r' then (str) else (String.sub str 0 last)

let map_records =

 let rec aux acc = function
   | value::flag::tail ->
       let e = (float_of_string value, int_of_string flag) in
       aux (e::acc) tail
   | [_] -> invalid_arg "invalid data"
   | [] -> (List.rev acc)
 in
 aux [] ;;

let duplicated_dates =

 let same_date (d1,_) (d2,_) = (d1 = d2) in
 let date (d,_) = d in
 let rec aux acc = function
   | a::b::tl when same_date a b ->
       aux (date a::acc) tl
   | _::tl ->
       aux acc tl
   | [] ->
       (List.rev acc)
 in
 aux [] ;;

let record_ok (_,record) =

 let is_ok (_,v) = (v >= 1) in
 let sum_ok =
   List.fold_left (fun sum this ->
     if is_ok this then succ sum else sum) 0 record
 in
 (sum_ok = 24)

let num_good_records =

 List.fold_left  (fun sum record ->
   if record_ok record then succ sum else sum) 0 ;;

let parse_line line =

 let li = split (regexp "[ \t]+") line in
 let records = map_records (List.tl li)
 and date = (List.hd li) in
 (date, records)

let () =

 let ic = open_in "readings.txt" in
 let rec read_loop acc =
   try
     let line = strip_cr(input_line ic) in
     read_loop ((parse_line line) :: acc)
   with End_of_file ->
     close_in ic;
     (List.rev acc)
 in
 let inputs = read_loop [] in
 Printf.printf "%d total lines\n" (List.length inputs);
 Printf.printf "duplicated dates:\n";
 let dups = duplicated_dates inputs in
 List.iter print_endline dups;
 Printf.printf "number of good records: %d\n" (num_good_records inputs);
</lang>

this script outputs:

5471 total lines
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017

Perl

<lang perl>use List::MoreUtils 'natatime';
use constant FIELDS => 49;

binmode STDIN, ':crlf';
  # Read the newlines properly even if we're not running on
  # Windows.

my ($line, $good_records, %dates) = (0, 0);
while (<>)

  {++$line;
   my @fs = split /\s+/;
   @fs == FIELDS or die "$line: Bad number of fields.\n";
   for (shift @fs)
      {/\d{4}-\d{2}-\d{2}/ or die "$line: Bad date format.\n";
       ++$dates{$_};}
   my $iterator = natatime 2, @fs;
   my $all_flags_okay = 1;
   while ( my ($val, $flag) = $iterator->() )
      {$val =~ /\d+\.\d+/ or die "$line: Bad value format.\n";
       $flag =~ /\A-?\d+/ or die "$line: Bad flag format.\n";
       $flag < 1 and $all_flags_okay = 0;}
   $all_flags_okay and ++$good_records;}

print "Good records: $good_records\n",

  "Repeated timestamps:\n",
  map {"  $_\n"}
  grep {$dates{$_} > 1}
  sort keys %dates;</lang>

Output:

Good records: 5017
Repeated timestamps:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26

Perl 6

Translation of: Perl
Works with: Rakudo version 2010.11

<lang perl6>my $fields = 49;

my ($good-records, %dates) = 0;
for 1 .. * Z $*IN.lines -> $line, $s {

   my @fs = split /\s+/, $s;
   @fs == $fields or die "$line: Bad number of fields";
   given shift @fs {
       m/\d**4 \- \d**2 \- \d**2/ or die "$line: Bad date format";
       ++%dates{$_};
   }
   my $all-flags-okay = True;
   for @fs -> $val, $flag {
       $val ~~ /\d+ \. \d+/ or die "$line: Bad value format";
       $flag ~~ /^ \-? \d+/ or die "$line: Bad flag format";
       $flag < 1 and $all-flags-okay = False;
   }
   $all-flags-okay and ++$good-records;

}

say 'Good records: ', $good-records;
say 'Repeated timestamps:';
say '  ', $_ for grep { %dates{$_} > 1 }, sort keys %dates;</lang>

Output:

Good records: 5017
Repeated timestamps:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26

The first version demonstrates that you can program Perl 6 almost like Perl 5. Here's a more idiomatic Perl 6 version that runs several times faster:

<lang perl6>my $good-records;
my $line;
my %dates;

for lines() {

   $line++;
   / ^
   (\d ** 4 '-' \d\d '-' \d\d)
   [ \h+ \d+'.'\d+ \h+ ('-'?\d+) ] ** 24
   $ /
       or note "Bad format at line $line" and next;
   %dates.push: $0 => $line;
   $good-records++ if $1.all >= 1;

}

say "$good-records good records out of $line total";

say 'Repeated timestamps (with line numbers):';
.say for sort %dates.pairs.grep: *.value.elems > 1;</lang>
Output:

5017 good records out of 5471 total
Repeated timestamps (with line numbers):
1990-03-25	84 85
1991-03-31	455 456
1992-03-29	819 820
1993-03-28	1183 1184
1995-03-26	1910 1911

Note how this version does validation with a single Perl 6 regex that is much more readable than the typical regex, and arguably expresses the data structure more straightforwardly. Here we use normal quotes for literals, and \h for horizontal whitespace.
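For comparison, the same one-pattern validation idea can be sketched in Python (an illustrative aside, not one of the task solutions; the pattern is a simplified stand-in for the Perl 6 regex above):

```python
import re

# One anchored pattern validates a whole record: a datestamp followed by
# exactly 24 whitespace-separated value/flag pairs.
RECORD = re.compile(r'\d{4}-\d{2}-\d{2}(?:\s+\d+\.\d+\s+-?\d+){24}$')

good = '1991-03-30' + '\t10.000\t1' * 24
bad  = '1991-03-30' + '\t10.000\t1' * 23   # one pair short

print(bool(RECORD.match(good)))   # True
print(bool(RECORD.match(bad)))    # False
```

The `{24}` quantifier plays the role of the `** 24` repetition in the Perl 6 regex.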

Variables like $good-records that are going to be autoincremented do not need to be initialized. (Perl 6 allows hyphens in variable names, as you can see.)

The .push method on a hash is magical and loses no information; if a duplicate key is found in the pushed pair, an array of values is automatically created of the old value and the new value pushed. Hence we can easily track all the lines that a particular duplicate occurred at.

The .all method does "junctional" logic: it autothreads through comparators as any English speaker would expect. Junctions can also short-circuit as soon as they find a value that doesn't match, and the evaluation order is up to the computer, so it can be optimized or parallelized.

The final line simply greps out the pairs from the hash whose value is an array with more than 1 element. (Those values that are not arrays nevertheless have a .elems method that always reports 1.) The .pairs is merely there for clarity; grepping a hash directly has the same effect. Note that we sort the pairs after we've grepped them, not before; this works fine in Perl 6, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case.
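For readers coming from other languages, the hash-push duplicate tracking and the junctional flag check described above map naturally onto a dict of lists and `all()`; a rough Python analogue (illustrative only, with made-up sample data):

```python
from collections import defaultdict

# Collect every line number at which a datestamp occurs (analogous to
# pushing pairs into %dates), then keep only datestamps seen more than
# once (analogous to the final grep over pairs).
lines_by_date = defaultdict(list)
for lineno, date in enumerate(['1990-03-24', '1990-03-25', '1990-03-25'], start=1):
    lines_by_date[date].append(lineno)

dups = {d: nums for d, nums in sorted(lines_by_date.items()) if len(nums) > 1}
print(dups)                          # {'1990-03-25': [2, 3]}

# The .all junction corresponds to checking every flag at once:
flags = [1, 1, -2]
print(all(f >= 1 for f in flags))    # False
```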

PHP

<lang php>$handle = fopen("readings.txt", "rb");
$missformcount = 0;
$totalcount = 0;
$dates = array();
while (!feof($handle)) {
    $buffer = fgets($handle);
    $line = preg_replace('/\s+/', ' ', $buffer);
    $line = explode(' ', trim($line));
    $datepattern = '/^\d{4}-\d{2}-\d{2}$/';
    $valpattern = '/^\d+\.{1}\d{3}$/';
    $flagpattern = '/^[1-9]{1}$/';

    if (count($line) != 49)
        $missformcount++;
    if (!preg_match($datepattern, $line[0], $check))
        $missformcount++;
    else
        $dates[$totalcount + 1] = $check[0];

    $errcount = 0;
    for ($i = 1; $i < count($line); $i++) {
        if ($i % 2 != 0) {
            if (!preg_match($valpattern, $line[$i], $check)) $errcount++;
        } else {
            if (!preg_match($flagpattern, $line[$i], $check)) $errcount++;
        }
    }
    if ($errcount != 0) $missformcount++;
    $totalcount++;
}
fclose($handle);

$good = $totalcount - $missformcount;
$duplicates = array_diff_key($dates, array_unique($dates));
echo 'Valid records ' . $good . ' of ' . $totalcount . " total\n";
echo "Duplicates :\n";
foreach ($duplicates as $key => $val) {
    echo $val . ' at Line : ' . $key . "\n";
}</lang>

Valid records 5017 of 5471 total
Duplicates :
1990-03-25 at Line : 85
1991-03-31 at Line : 456
1992-03-29 at Line : 820
1993-03-28 at Line : 1184
1995-03-26 at Line : 1911

PL/I

<lang pli> /* To process readings produced by automatic reading stations. */

check: procedure options (main);

  declare 1 date, 2 (yy, mm, dd) character (2),
          (j1, j2) character (1);
  declare old_date character (6);
  declare line character (330) varying;
  declare R(24) fixed decimal, Machine(24) fixed binary;
  declare (i, k, n, faulty static initial (0)) fixed binary;
  declare input file;
  open file (input) title ('/READINGS.TXT,TYPE(CRLF),RECSIZE(300)');
  on endfile (input) go to done;
   old_date = '';
  k = 0;
  do forever;
     k = k + 1;
     get file (input) edit (line) (L);
     get string(line) edit (yy, j1, mm, j2, dd) (a(4), a(1), a(2), a(1), a(2));
     line = substr(line, 11);
     do i = 1 to length(line);
        if substr(line, i, 1) = '09'x then substr(line, i, 1) = ' ';
     end;
     line = trim(line);
     n = tally(line, ' ') - tally (line, '  ') + 1;
     if n ^= 48 then
        do;
           put skip list ('There are ' || n || ' readings in line ' || k);
        end;
     n = n/2;
     line = line || ' ';
     get string(line) list ((R(i), Machine(i) do i = 1 to n));
     if any(Machine < 1) ^= '0'B then
        faulty = faulty + 1;
     if old_date ^= ' ' then if old_date = string(date) then
        put skip list ('Dates are the same at line' || k);
     old_date = string(date);
  end;

done:

  put skip list ('There were ' || k || ' sets of readings');
  put skip list ('There were ' || faulty || ' faulty readings' );
  put skip list ('There were ' || k-faulty || ' good readings' );

end check; </lang>

PicoLisp

Put the following into an executable file "checkReadings":

<lang PicoLisp>#!/usr/bin/picolisp /usr/lib/picolisp/lib.l

(load "@lib/misc.l")

(in (opt)

  (until (eof)
     (let Lst (split (line) "^I")
        (unless
           (and
              (= 49 (length Lst))     # Check total length
              ($dat (car Lst) "-")    # Check for valid date
              (fully                  # Check data format
                 '((L F)
                    (if F                         # Alternating:
                       (format L 3)               # Number
                       (>= 9 (format L) -9) ) )   # or flag
                 (cdr Lst)
                 '(T NIL .) ) )
           (prinl "Bad line format: " (glue " " Lst))
           (bye 1) ) ) ) )

(bye)</lang>
Then it can be called as

$ ./checkReadings readings.txt

PowerShell

<lang powershell>$dateHash = @{}
$goodLineCount = 0
get-content c:\temp\readings.txt |

   ForEach-Object {
       $line = $_.split(" |`t",2)
       if ($dateHash.containskey($line[0])) {
           $line[0] + " is duplicated"
       } else {
           $dateHash.add($line[0], $line[1])
       }
       $readings = $line[1].split()
       $goodLine = $true
       if ($readings.count -ne 48) { $goodLine = $false; "incorrect line length : $line[0]"  }
       for ($i=0; $i -lt $readings.count; $i++) {
           if ($i % 2 -ne 0) {                                
               if ([int]$readings[$i] -lt 1) {
                   $goodLine = $false
               }
           }
       }
       if ($goodLine) { $goodLineCount++ } 
   }

[string]$goodLineCount + " good lines" </lang>

Output:

1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017 good lines

An alternative using regular expression syntax:

<lang powershell>$dateHash = @{}
$goodLineCount = 0
ForEach ($rawLine in ( get-content c:\temp\readings.txt) ){

   $line = $rawLine.split(" |`t",2)
   if ($dateHash.containskey($line[0])) {
       $line[0] + " is duplicated"
   } else {
       $dateHash.add($line[0], $line[1])
   }
   $readings = [regex]::matches($line[1],"\d+\.\d+\s-?\d")
   if ($readings.count -ne 24) { "incorrect number of readings for date " + $line[0] }
   $goodLine = $true
   foreach ($flagMatch in [regex]::matches($line[1],"\d\.\d*\s(?<flag>-?\d)")) {
       if ([int][string]$flagMatch.groups["flag"].value -lt 1) { 
           $goodLine = $false 
       }
   }
   if ($goodLine) { $goodLineCount++}

}
[string]$goodLineCount + " good lines"</lang>

Output:

1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017 good lines

PureBasic

Using regular expressions.
<lang PureBasic>Define filename.s = "readings.txt"

#instrumentCount = 24

Enumeration

 #exp_date
 #exp_instruments
 #exp_instrumentStatus

EndEnumeration

Structure duplicate

 date.s
 firstLine.i
 line.i

EndStructure

NewMap dates() ;records line date occurs first
NewList duplicated.duplicate()
NewList syntaxError()
Define goodRecordCount, totalLines, line.s, i
Dim inputDate.s(0)
Dim instruments.s(0)

If ReadFile(0, filename)

 CreateRegularExpression(#exp_date, "\d+-\d+-\d+")
 CreateRegularExpression(#exp_instruments, "(\t|\x20)+(\d+\.\d+)(\t|\x20)+\-?\d")
 CreateRegularExpression(#exp_instrumentStatus, "(\t|\x20)+(\d+\.\d+)(\t|\x20)+")
 Repeat
   line = ReadString(0, #PB_Ascii)
   If line = "": Break: EndIf
   totalLines + 1
 
   ExtractRegularExpression(#exp_date, line, inputDate())
   If FindMapElement(dates(), inputDate(0))
     AddElement(duplicated())
     duplicated()\date = inputDate(0)
     duplicated()\firstLine = dates()
     duplicated()\line = totalLines
   Else
     dates(inputDate(0)) = totalLines
   EndIf
   
   ExtractRegularExpression(#exp_instruments, Mid(line, Len(inputDate(0)) + 1), instruments())
   Define pairsCount = ArraySize(instruments()), containsBadValues = #False
   For i =  0 To pairsCount
     If Val(ReplaceRegularExpression(#exp_instrumentStatus, instruments(i), "")) < 1
       containsBadValues = #True
       Break
     EndIf
   Next
   
   If pairsCount <> #instrumentCount - 1
     AddElement(syntaxError()): syntaxError() = totalLines
   EndIf
   If Not containsBadValues
     goodRecordCount + 1
   EndIf
 ForEver
 CloseFile(0)
 
 If OpenConsole()
   ForEach duplicated()
     PrintN("Duplicate date: " + duplicated()\date + " occurs on lines " + Str(duplicated()\line) + " and " + Str(duplicated()\firstLine) + ".")
   Next
   ForEach syntaxError()
     PrintN( "Syntax error in line " + Str(syntaxError()))
   Next
   PrintN(#CRLF$ + Str(goodRecordCount) + " of " + Str(totalLines) + " lines read were valid records.")
   
   Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
   CloseConsole()
 EndIf

EndIf</lang>
Sample output:

Duplicate date: 1990-03-25 occurs on lines 85 and 84.
Duplicate date: 1991-03-31 occurs on lines 456 and 455.
Duplicate date: 1992-03-29 occurs on lines 820 and 819.
Duplicate date: 1993-03-28 occurs on lines 1184 and 1183.
Duplicate date: 1995-03-26 occurs on lines 1911 and 1910.

5017 of 5471 lines read were valid records.

Python

<lang python>import re
import zipfile
import StringIO

def munge2(readings):

  datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
  valuPat = re.compile(r'[-+]?\d+\.\d+')
  statPat = re.compile(r'-?\d+')
  allOk, totalLines = 0, 0
  datestamps = set([])
  for line in readings:
     totalLines += 1
     fields = line.split('\t')
     date = fields[0]
     pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]
     lineFormatOk = datePat.match(date) and \
        all( valuPat.match(p[0]) for p in pairs ) and \
        all( statPat.match(p[1]) for p in pairs )
     if not lineFormatOk:
        print 'Bad formatting', line
        continue
     if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
        print 'Missing values', line
        continue
     if date in datestamps:
        print 'Duplicate datestamp', line
        continue
     datestamps.add(date)
     allOk += 1
  print 'Lines with all readings: ', allOk
  print 'Total records: ', totalLines
#zfs = zipfile.ZipFile('readings.zip','r')
#readings = StringIO.StringIO(zfs.read('readings.txt'))

readings = open('readings.txt','r')
munge2(readings)</lang>
The results indicate 5013 good records, which differs from the Awk implementation. The final few lines of the output are as follows

Missing values 2004-12-29	2.900	1	2.700	1	2.800	1	3.300	1	2.900	1	2.300	1	0.000	0	1.700	1	1.900	1	2.300	1	2.600	1	2.900	1	2.600	1	2.600	1	2.600	1	2.700	1	2.300	1	2.200	1	2.100	1	2.000	1	2.100	1	2.100	1	2.300	1	2.400	1

Missing values 2004-12-30	2.400	1	2.600	1	2.600	1	2.600	1	3.000	1	0.000	0	3.300	1	2.600	1	2.900	1	2.400	1	2.300	1	2.900	1	3.500	1	3.700	1	3.600	1	4.000	1	3.400	1	2.400	1	2.500	1	2.600	1	2.600	1	2.800	1	2.400	1	2.200	1

Missing values 2004-12-31	2.400	1	2.500	1	2.500	1	2.400	1	0.000	0	2.400	1	2.400	1	2.400	1	2.200	1	2.400	1	2.500	1	2.000	1	1.700	1	1.400	1	1.500	1	1.900	1	1.700	1	2.000	1	2.000	1	2.200	1	1.700	1	1.500	1	1.800	1	1.800	1

Lines with all readings:  5013
Total records:  5471

Second Version

Modification of the version above to:

  • Remove continue statements so it counts as the AWK example does.
  • Generate mostly summary information that is easier to compare to other solutions.

<lang python>import re
import zipfile
import StringIO

def munge2(readings, debug=False):

  datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
  valuPat = re.compile(r'[-+]?\d+\.\d+')
  statPat = re.compile(r'-?\d+')
  totalLines = 0
  dupdate, badform, badlen, badreading = set(), set(), set(), 0
  datestamps = set([])
  for line in readings:
     totalLines += 1
     fields = line.split('\t')
     date = fields[0]
     pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]

     lineFormatOk = datePat.match(date) and \
        all( valuPat.match(p[0]) for p in pairs ) and \
        all( statPat.match(p[1]) for p in pairs )
     if not lineFormatOk:
        if debug: print 'Bad formatting', line
        badform.add(date)
        
     if len(pairs)!=24 or any( int(p[1]) < 1 for p in pairs ):
        if debug: print 'Missing values', line
     if len(pairs)!=24: badlen.add(date)
     if any( int(p[1]) < 1 for p in pairs ): badreading += 1

     if date in datestamps:
        if debug: print 'Duplicate datestamp', line
        dupdate.add(date)
     datestamps.add(date)
  print 'Duplicate dates:\n ', '\n  '.join(sorted(dupdate)) 
  print 'Bad format:\n ', '\n  '.join(sorted(badform)) 
  print 'Bad number of fields:\n ', '\n  '.join(sorted(badlen)) 
  print 'Records with good readings: %i = %5.2f%%\n' % (
     totalLines-badreading, (totalLines-badreading)/float(totalLines)*100 )
  print 'Total records: ', totalLines

readings = open('readings.txt','r')
munge2(readings)</lang>

bash$  /cygdrive/c/Python26/python  munge2.py 
Duplicate dates:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26
Bad format:
  
Bad number of fields:
  
Records with good readings: 5017 = 91.70%

Total records:  5471
bash$ 

R

<lang R># Read in data from file
dfr <- read.delim("d:/readings.txt",
   colClasses=c("character", rep(c("numeric", "integer"), 24)))
dates <- strptime(dfr[,1], "%Y-%m-%d")

# Any bad values?
dfr[which(is.na(dfr))]

# Any duplicated dates
dates[duplicated(dates)]

# Number of rows with no bad values
flags <- as.matrix(dfr[,seq(3,49,2)])>0
sum(apply(flags, 1, all))</lang>

Racket

<lang racket>#lang racket
(read-decimal-as-inexact #f)

;; files-to-read is a sequence, so it could be either a list or vector of files

(define (text-processing/2 files-to-read)

 (define seen-datestamps (make-hash))
 (define (datestamp-seen? ds) (hash-ref seen-datestamps ds #f))
 (define (datestamp-seen! ds pos) (hash-set! seen-datestamps ds pos))
 
 (define (fold-into-pairs l (acc null))
   (match l ['() (reverse acc)]
     [(list _) (reverse (cons l acc))]
     [(list-rest a b tl) (fold-into-pairs tl (cons (list a b) acc))]))
 
 (define (match-valid-field line pos)
   (match (string-split line)
     ;; if we don't hit an error, then the file is valid
      ((list-rest (not (pregexp #px"[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}")) _)
      (error 'match-valid-field "invalid format non-datestamp at head: ~a~%" line))
     
     ;; check for duplicates
     ((list-rest (? datestamp-seen? ds) _)
      (printf "duplicate datestamp: ~a at line: ~a (first seen at: ~a)~%"
              ds pos (datestamp-seen? ds))
      #f)
     
     ;; register the datestamp as seen, then move on to rest of match
     ((list-rest ds _) (=> next-match-rule) (datestamp-seen! ds pos) (next-match-rule))
     
     ((list-rest
       _
       (app fold-into-pairs
            (list (list (app string->number (and (? number?) vs))
                        (app string->number (and (? integer?) statuss)))
                  ...)))
      (=> next-match-rule)
      (unless (= (length vs) 24) (next-match-rule))
      (not (for/first ((s statuss) #:unless (positive? s)) #t)))
     
     ;; if we don't hit an error, then the file is valid
     (else (error 'match-valid-field "bad field format: ~a~%" line))))
 
 (define (sub-t-p/1)
   (for/sum ((line (in-lines))
             (line-number (in-naturals 1)))
     (if (match-valid-field line line-number) 1 0)))  
 (for/sum ((file-name files-to-read))
   (with-input-from-file file-name sub-t-p/1)))

(printf "~a records have good readings for all instruments~%"

       (text-processing/2 (current-command-line-arguments)))</lang>

Example session:

$ racket 2.rkt readings/readings.txt
duplicate datestamp: 1990-03-25 at line: 85 (first seen at: 84)
duplicate datestamp: 1991-03-31 at line: 456 (first seen at: 455)
duplicate datestamp: 1992-03-29 at line: 820 (first seen at: 819)
duplicate datestamp: 1993-03-28 at line: 1184 (first seen at: 1183)
duplicate datestamp: 1995-03-26 at line: 1911 (first seen at: 1910)
5013 records have good readings for all instruments

REXX

This REXX program processes the file mentioned in ''Text processing 1'' and performs further validation of the dates, flags, and data.

Some of the checks performed are:

  • checks for duplicated date records.
  • checks for a bad date (YYYY-MM-DD) format, including:
  • wrong length
  • year > current year
  • year < 1970 (to allow for posthumous data)
  • mm < 1 or mm > 12
  • dd < 1 or dd > days for the month
  • yyyy, dd, mm isn't numeric
  • missing data (or flags)
  • flag isn't an integer
  • flag contains a decimal point
  • data isn't numeric

In addition, commas may be inserted into all of the presented numbers.
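The date checks listed above can be sketched outside REXX as well; here is a minimal Python illustration (an assumption-laden sketch, using the same 1970-to-current-year window, not a translation of the REXX program):

```python
from datetime import date

def valid_datestamp(s, earliest=1970):
    """Apply the YYYY-MM-DD checks listed above: overall length,
    numeric parts, year window, and a real month/day (leap years
    handled by datetime)."""
    parts = s.split('-')
    if len(s) != 10 or len(parts) != 3:
        return False
    if not all(p.isdigit() for p in parts):
        return False
    y, m, d = (int(p) for p in parts)
    if not earliest <= y <= date.today().year:
        return False
    try:
        date(y, m, d)   # rejects mm/dd out of range, incl. leap-year rules
        return True
    except ValueError:
        return False

print(valid_datestamp('1991-03-31'))   # True
print(valid_datestamp('1991-02-29'))   # False: 1991 is not a leap year
```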

The program has (negated) code to write the report to a file in addition to the console.

<lang rexx>/*REXX program to process instrument data from a data file.           */
numeric digits 20                     /*allow for bigger numbers.      */
ifid='READINGS.TXT'                   /*the input file.                */
ofid='READINGS.OUT'                   /*the output file.               */
grandSum=0                            /*grand sum of whole file.       */
grandflg=0                            /*grand num of flagged data.     */
grandOKs=0
longFlag=0                            /*longest period of flagged data.*/
contFlag=0                            /*longest continuous flagged data*/
oldDate =0                            /*placeholder of penultimate date*/
w       =16                           /*width of fields when displayed.*/
dupDates=0                            /*count of duplicated timestamps.*/
badflags=0                            /*count of bad flags (¬ integer).*/
badDates=0                            /*count of bad dates (bad format)*/
badData =0                            /*count of bad datas (¬ numeric).*/
ignoredR=0                            /*count of ignored records (bad).*/
maxInstruments=24                     /*maximum number of instruments. */
yyyyCurr=right(date(),4)              /*get the current year (today).  */
monDD.  =31                           /*number of days in every month. */
                                      /*February is figured on the fly.*/
monDD.4 =30
monDD.6 =30
monDD.9 =30
monDD.11=30

 do records=1  while lines(ifid)\==0  /*read until finished.           */
 rec=linein(ifid)                     /*read the next record (line).   */
 parse var rec datestamp Idata        /*pick off the dateStamp & data. */
 if datestamp==oldDate then do        /*found a duplicate timestamp.   */
                            dupDates=dupDates+1     /*bump the counter.*/
                            call sy datestamp copies('~',30),
                                     'is a duplicate of the',
                                     "previous datestamp."
                            ignoredR=ignoredR+1     /*bump ignoredRecs.*/
                            iterate   /*ignore this duplicate record.  */
                            end
 parse var datestamp yyyy '-' mm '-' dd   /*obtain YYYY, MM, and DD.   */
 monDD.2=28+leapyear(yyyy)            /*how long is February in YYYY ? */
                                      /*check for various bad formats. */
 if verify(yyyy||mm||dd,1234567890)\==0 |,
    length(datestamp)\==10   |,
    length(yyyy)\==4         |,
    length(mm  )\==2         |,
    length(dd  )\==2         |,
    yyyy<1970                |,
    yyyy>yyyyCurr            |,
    mm=0   | dd=0            |,
    mm>12  | dd>monDD.mm then do
                              badDates=badDates+1
                              call sy datestamp copies('~'),
                                                'has an illegal format.'
                              ignoredR=ignoredR+1   /*bump ignoredRecs.*/
                              iterate   /*ignore this bad date record. */
                              end
 oldDate=datestamp                    /*save datestamp for next read.  */
 sum=0
 flg=0
 OKs=0
   do j=1  until Idata=''            /*process the instrument data.  */
   parse var Idata data.j flag.j Idata
   if pos('.',flag.j)\==0 |,          /*flag have a decimal point  -or-*/
      \datatype(flag.j,'W') then do   /*is the flag not a whole number?*/
                                 badflags=badflags+1    /*bump counter.*/
                                 call sy datestamp copies('~'),
                                         'instrument' j "has a bad flag:",
                                         flag.j
                                 iterate       /*ignore it & it's data.*/
                                 end
   if \datatype(data.j,'N') then do   /*is the flag not a whole number?*/
                                 badData=badData+1      /*bump counter.*/
                                 call sy datestamp copies('~'),
                                         'instrument' j "has bad data:",
                                         data.j
                                 iterate       /*ignore it & it's flag.*/
                                 end
   if flag.j>0 then do                /*if good data, ...              */
                    OKs=OKs+1
                    sum=sum+data.j
                    if contFlag>longFlag then do
                                              longdate=datestamp
                                              longFlag=contFlag
                                              end
                    contFlag=0
                    end
               else do                /*flagged data ...               */
                    flg=flg+1
                    contFlag=contFlag+1
                    end
   end   /*j*/
 if j>maxInstruments then do
                          badData=badData+1             /*bump counter.*/
                          call sy datestamp copies('~'),
                                  'too many instrument datum'
                          end
 if OKs\==0 then avg=format(sum/OKs,,3)
            else avg='[n/a]'
 grandOKs=grandOKs+OKs
 _=right(comma(avg),w)
 grandSum=grandSum+sum
 grandFlg=grandFlg+flg
 if flg==0  then  call sy datestamp ' average='_
            else  call sy datestamp ' average='_ '  flagged='right(flg,2)
 end   /*records*/

records=records-1                     /*adjust for reading end-of-file.*/
if grandOKs\==0 then grandAvg=format(grandsum/grandOKs,,3)
                else grandAvg='[n/a]'

call sy
call sy copies('=',60)
call sy '    records read:' right(comma(records ),w)
call sy ' records ignored:' right(comma(ignoredR),w)
call sy '       grand sum:' right(comma(grandSum),w+4)
call sy '   grand average:' right(comma(grandAvg),w+4)
call sy '   grand OK data:' right(comma(grandOKs),w)
call sy '   grand flagged:' right(comma(grandFlg),w)
call sy ' duplicate dates:' right(comma(dupDates),w)
call sy '       bad dates:' right(comma(badDates),w)
call sy '        bad data:' right(comma(badData ),w)
call sy '       bad flags:' right(comma(badflags),w)
if longFlag\==0 then call sy ' longest flagged:' right(comma(longFlag),w) "  ending at " longdate
call sy copies('=',60)
call sy
exit                                  /*stick a fork in it, we're done.*/
/*──────────────────────────────────LEAPYEAR subroutine─────────────────*/
leapyear: procedure;  arg y           /*year could be: Y, YY, YYY, YYYY*/
if length(y)==2 then y=left(right(date(),4),2)y     /*adjust for YY year.*/
if y//4\==0 then return 0             /*not divisible by 4? Not a leapyear.*/
return y//100\==0 | y//400==0         /*apply 100 and 400 year rule.   */
/*──────────────────────────────────SY subroutine───────────────────────*/
sy: procedure;  parse arg stuff;  say stuff

   if  1==0  then call lineout ofid,stuff
   return

/*──────────────────────────────────COMMA subroutine────────────────────*/
comma: procedure;  parse arg _,c,p,t;  arg ,cu;  c=word(c ",",1)

      if cu=='BLANK' then c=' ';o=word(p 3,1);p=abs(o);t=word(t 999999999,1)
      if \datatype(p,'W')|\datatype(t,'W')|p==0|arg()>4 then return _;n=_'.9'
      #=123456789;k=0;if o<0 then do;b=verify(_,' ');if b==0 then return _
      e=length(_)-verify(reverse(_),' ')+1;end;else do;b=verify(n,#,"M")
      e=verify(n,#'0',,verify(n,#"0.",'M'))-p-1;end
      do j=e to b by -p while k<t;_=insert(c,_,j);k=k+1;end;return _</lang>

output

  ∙
  ∙
  ∙
1991-03-31  average=          23.542
1991-03-31 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ is a duplicate of the previous datestamp.
1991-04-01  average=          23.217   flagged= 1
1991-04-02  average=          19.792
1991-04-03  average=          13.958
  ∙
  ∙
  ∙
============================================================
      records read:            5,471
   records ignored:                5
     grand     sum:        1,357,152.400
     grand average:               10.496
     grand OK data:          129,306
     grand flagged:            1,878
   duplicate dates:                5
         bad dates:                0
         bad  data:                0
         bad flags:                0
   longest flagged:              589  ending at  1993-03-05
============================================================

Ruby

<lang ruby>require 'set'

def munge2(readings, debug=false)

  datePat = /^\d{4}-\d{2}-\d{2}/
  valuPat = /^[-+]?\d+\.\d+/
  statPat = /^-?\d+/
  totalLines = 0
  dupdate, badform, badlen, badreading = Set[], Set[], Set[], 0
  datestamps = Set[]
  for line in readings
     totalLines += 1
     fields = line.split(/\t/)
     date = fields.shift
     pairs = fields.each_slice(2).to_a

     lineFormatOk = date =~ datePat &&
       pairs.all? { |x,y| x =~ valuPat && y =~ statPat }
     if !lineFormatOk
        puts 'Bad formatting ' + line if debug
        badform << date
     end
        
     if pairs.length != 24 ||
          pairs.any? { |x,y| y.to_i < 1 }
        puts 'Missing values ' + line if debug
     end
     if pairs.length != 24
        badlen << date
     end
     if pairs.any? { |x,y| y.to_i < 1 }
        badreading += 1
     end

     if datestamps.include?(date)
        puts 'Duplicate datestamp ' + line if debug
        dupdate << date
     end
     datestamps << date
  end
  puts 'Duplicate dates:', dupdate.sort.map { |x| '  ' + x }
  puts 'Bad format:', badform.sort.map { |x| '  ' + x }
  puts 'Bad number of fields:', badlen.sort.map { |x| '  ' + x }
  puts 'Records with good readings: %i = %5.2f%%' % [
     totalLines-badreading, (totalLines-badreading)/totalLines.to_f*100 ]
  puts
  puts 'Total records:  %d' % totalLines

end

open('readings.txt','r') do |readings|

  munge2(readings)

end</lang>

Scala

Works with: Scala version 2.8

<lang scala>object DataMunging2 {

 import scala.io.Source
 import scala.collection.immutable.{TreeMap => Map}
 val pattern = """^(\d+-\d+-\d+)""" + """\s+(\d+\.\d+)\s+(-?\d+)""" * 24 + "$" r;
 def main(args: Array[String]) {
   val files = args map (new java.io.File(_)) filter (file => file.isFile && file.canRead)
   val (numFormatErrors, numValidRecords, dateMap) =
     files.iterator.flatMap(file => Source fromFile file getLines ()).
       foldLeft((0, 0, new Map[String, Int] withDefaultValue 0)) {
       case ((nFE, nVR, dM), line) => pattern findFirstMatchIn line map (_.subgroups) match {
         case Some(List(date, rawData @ _*)) =>
           val allValid = (rawData map (_ toDouble) iterator) grouped 2 forall (_.last > 0)
            (nFE, nVR + (if (allValid) 1 else 0), dM updated (date, dM(date) + 1))
         case None => (nFE + 1, nVR, dM)
       }
     }
   dateMap foreach {
     case (date, repetitions) if repetitions > 1 => println(date+": "+repetitions+" repetitions")
     case _ =>
   }
   println("""|
              |Valid records: %d
              |Duplicated dates: %d
              |Duplicated records: %d
              |Data format errors: %d
              |Invalid data records: %d
              |Total records: %d""".stripMargin format (
             numValidRecords,
             dateMap filter { case (_, repetitions) => repetitions > 1 } size,
             dateMap.valuesIterable filter (_ > 1) map (_ - 1) sum,
             numFormatErrors,
             dateMap.valuesIterable.sum - numValidRecords,
             dateMap.valuesIterable.sum))
 }

}</lang>

Sample output:

1990-03-25: 2 repetitions
1991-03-31: 2 repetitions
1992-03-29: 2 repetitions
1993-03-28: 2 repetitions
1995-03-26: 2 repetitions

Valid records: 5017
Duplicated dates: 5
Duplicated records: 5
Data format errors: 0
Invalid data records: 454
Total records: 5471

Tcl

<lang tcl>set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1]
set total [llength $data]
set correct $total
set datestamps {}

foreach line $data {

   set formatOk true
   set hasAllMeasurements true
   set date [lindex $line 0]
   if {[llength $line] != 49} { set formatOk false }
   if {![regexp {\d{4}-\d{2}-\d{2}} $date]} { set formatOk false }
   if {[lsearch $datestamps $date] != -1} { puts "Duplicate datestamp: $date" } {lappend datestamps $date}
   foreach {value flag} [lrange $line 1 end] {
       if {$flag < 1} { set hasAllMeasurements false }
       if {![regexp -- {[-+]?\d+\.\d+} $value] || ![regexp -- {-?\d+} $flag]} {set formatOk false}
   }   
   if {!$hasAllMeasurements} { incr correct -1 }
   if {!$formatOk} { puts "line \"$line\" has wrong format" }

}

puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"</lang>

$ tclsh munge2.tcl 
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26
5017 records with good readings = 91.7016998721%
Total records: 5471

Second version

To demonstrate a different method of iterating over the file, and different ways to verify data types:

<lang tcl>set total [set good 0]
array set seen {}
set fh [open readings.txt]
while {[gets $fh line] != -1} {

   incr total
   set fields [regexp -inline -all {[^ \t\r\n]+} $line]
   if {[llength $fields] != 49} {
       puts "bad format: not 49 fields on line $total"
       continue
   }
   if { ! [regexp {^(\d{4}-\d\d-\d\d)$} [lindex $fields 0] -> date]} {
        puts "bad format: invalid date on line $total: '[lindex $fields 0]'"
       continue
   }
   if {[info exists seen($date)]} {
       puts "duplicate date on line $total: $date"
   }
   incr seen($date)
   
   set line_format_ok true
   set readings_ignored 0
   foreach {value flag} [lrange $fields 1 end] {
       if { ! [string is double -strict $value]} {
           puts "bad format: value not a float on line $total: '$value'"
           set line_format_ok false
       }
       if { ! [string is int -strict $flag]} {
           puts "bad format: flag not an integer on line $total: '$flag'"
           set line_format_ok false
       }
       if {$flag < 1} {incr readings_ignored}
   }
   if {$line_format_ok && $readings_ignored == 0} {incr good}

}
close $fh

puts "total: $total"
puts [format "good:  %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]</lang>

Results:

duplicate date on line 85: 1990-03-25
duplicate date on line 456: 1991-03-31
duplicate date on line 820: 1992-03-29
duplicate date on line 1184: 1993-03-28
duplicate date on line 1911: 1995-03-26
total: 5471
good:  5017 = 91.70%

Ursala

Compiled and run in a single step, with the input file accessed as a list of strings pre-declared in readings_dot_txt.

<lang Ursala>#import std

#import nat

readings = (*F ~&c/;digits+ rlc ==+ ~~ -={` ,9%cOi&,13%cOi&}) readings_dot_txt

valid_format = all -&length==49,@tK27 all ~&w/`.&& ~&jZ\digits--'-.',@tK28 all ~&jZ\digits--'-'&-

duplicate_dates = :/'duplicated dates:'+ ~&hK2tFhhPS|| -[(none)]-!

good_readings = --' good readings'@h+ %nP+ length+ *~ @tK28 all ~='0'&& ~&wZ/`-

#show+

main = valid_format?(^C/good_readings duplicate_dates,-[invalid format]-!) readings</lang>

Output:

5017 good readings
duplicated dates:
1995-03-26
1993-03-28
1992-03-29
1991-03-31
1990-03-25

Vedit macro language

This implementation does the following checks:

  • Checks for duplicate date fields. Note: duplicates can still be counted as valid records, as in other implementations.
  • Checks date format.
  • Checks that value fields have 1 or more digits followed by decimal point followed by 3 digits
  • Reads flag value and checks if it is positive
  • Requires 24 value/flag pairs on each line
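For reference, the checks listed above can be sketched language-neutrally in Python. This is an illustrative outline only, not part of the Vedit entry; `check_record` and its regular expressions are hypothetical helpers:

```python
import re

# Date must be YYYY-MM-DD; values must be digits, '.', then three digits.
DATE_RE  = re.compile(r'^\d{4}-\d{2}-\d{2}$')
VALUE_RE = re.compile(r'^\d+\.\d{3}$')

def check_record(line, seen_dates):
    """Classify one whitespace-separated record from readings.txt."""
    fields = line.split()
    if len(fields) != 49:                  # datestamp + 24 value/flag pairs
        return 'bad field count'
    date, rest = fields[0], fields[1:]
    if not DATE_RE.match(date):
        return 'bad date format'
    duplicate = date in seen_dates         # duplicates may still be valid
    seen_dates.add(date)
    for value, flag in zip(rest[0::2], rest[1::2]):
        if not VALUE_RE.match(value):
            return 'bad value format'
        if int(flag) < 1:                  # flag < 1: reading is ignored
            return 'flagged reading'
    return 'duplicate' if duplicate else 'ok'
```

A duplicate date is reported separately from validity, mirroring the note above that duplicated records can still count as valid.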

<lang vedit>#50 = Buf_Num           // Current edit buffer (source data)
File_Open("|(PATH_ONLY)\output.txt")

#51 = Buf_Num           // Edit buffer for output file

Buf_Switch(#50)

#11 = #12 = #13 = #14 = #15 = 0

Reg_Set(15, "xxx")

While(!At_EOF) {

   #10 = 0
   #12++
   // Check for repeated date field
   if (Match(@15) == 0) {
       #20 = Cur_Line
       Buf_Switch(#51)   // Output file
       Reg_ins(15) IT(": duplicate record at ") Num_Ins(#20)
       Buf_Switch(#50)   // Input file
       #13++
   }
   // Check format of date field
   if (Match("|d|d|d|d-|d|d-|d|d|w", ADVANCE) != 0) {
       #10 = 1
       #14++
   }
   Reg_Copy_Block(15, BOL_pos, Cur_Pos-1)
   // Check data fields and flags:
   Repeat(24) {
       if ( Match("|d|*.|d|d|d|w", ADVANCE) != 0 || Num_Eval(ADVANCE) < 1) {
           #10 = 1
           #15++
           Break
       }
       Match("|W", ADVANCE)
   }
   if (#10 == 0) { #11++ }             // record was OK
   Line(1, ERRBREAK)

}

Buf_Switch(#51)         // buffer for output data
IN
IT("Valid records: ") Num_Ins(#11)
IT("Duplicates: ") Num_Ins(#13)
IT("Date format errors: ") Num_Ins(#14)
IT("Invalid data records:") Num_Ins(#15)
IT("Total records: ") Num_Ins(#12)</lang>

Sample output:

<lang vedit>1990-03-25: duplicate record at 85
1991-03-31: duplicate record at 456
1992-03-29: duplicate record at 820
1993-03-28: duplicate record at 1184
1995-03-26: duplicate record at 1911

Valid records: 5017
Duplicates: 5
Date format errors: 0
Invalid data records: 454
Total records: 5471</lang>

zkl

<lang zkl>   // the RegExp engine has a low limit on groups so
   // I can't use it to select all fields, only verify them
re:=RegExp(0'|^(\d+-\d+-\d+)| + 0'|\s+\d+\.\d+\s+-*\d+| * 24 + ".+$");
w:=Utils.Helpers.zipW(File("readings.txt"),[1..]); //-->lazy (line,line #)
reg datep,N, good=0, dd=0;
foreach line,n in (w){

  N=n;		// since n is local to this scope
  if (not re.search(line)){ println("Line %d: malformed".fmt(n)); continue; }
  date:=line[re.matchedNs[1].xplode()];  // I can group the date field
  if (datep==date){ dd+=1; println("Line %4d: dup date: %s".fmt(n,date)); }
  datep=date;
  if (line.replace("\t"," ").split(" ").filter()[1,*]  // blow fields apart, drop date
        .pump(Void,T.fp(Void.Read,1), // get (reading,status)
           fcn(_,s){  // stop on first problem status and return True
              if(s.strip().toInt()<1) T(Void.Stop,True) else False
      })) continue;
  good+=1;

}
println("%d records read, %d duplicate dates, %d valid".fmt(N,dd,good));</lang>

Output:
Line   85: dup date: 1990-03-25
Line  456: dup date: 1991-03-31
Line  820: dup date: 1992-03-29
Line 1184: dup date: 1993-03-28
Line 1911: dup date: 1995-03-26
5471 records read, 5 duplicate dates, 5017 valid