Text processing/2
Revision as of 00:29, 15 November 2008
You are encouraged to solve this task according to the task description, using any language you may know.
The following data shows a few lines from the file readings.txt (as used in the Data Munging task).
The data comes from a pollution monitoring station with twenty four instruments monitoring twenty four aspects of pollution in the air. Periodically a record is added to the file, constituting a line of 49 white-space separated fields, where white-space can be one or more space or tab characters.
The fields (from the left) are:
DATESTAMP [ VALUEn FLAGn ] * 24
i.e. a datestamp followed by twenty four repetitions of a floating point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with it, in which case that instrument's value should be ignored.
A sample from the full data file readings.txt is:
1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1
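The record layout above can be illustrated with a short parser. This is a hedged Python 3 sketch; the function name and the assertion message are invented here and are not part of any of the solutions below:

```python
def parse_record(line):
    """Split one white-space separated record into a datestamp
    and 24 (value, flag) pairs."""
    fields = line.split()
    assert len(fields) == 49, "expected 1 datestamp + 24 value/flag pairs"
    date = fields[0]
    pairs = [(float(fields[i]), int(fields[i + 1]))
             for i in range(1, 49, 2)]
    return date, pairs

# A record like the first sample line: all 24 flags are 1.
date, pairs = parse_record("1991-03-30" + " 10.000 1" * 24)
good = all(flag >= 1 for _, flag in pairs)
```

A record is "good" exactly when every one of its 24 flags is >= 1, which is the test each solution below performs.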
The task:
- Confirm the general field format of the file
- Identify any DATESTAMPs that are duplicated.
- Report how many records have good readings for all instruments.
AWK
A series of AWK one-liners are shown, as this is often what is done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created that combined multi-line versions of the scripts below.
Gradually tie down the format.
(In each case offending lines will be printed)
If there are any scientific-notation fields then there will be an e in the file:
bash$ awk '/[eE]/' readings.txt
bash$
Quick check on the number of fields:
bash$ awk 'NF != 49' readings.txt
bash$
Full check on the file format using a regular expression:
bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt
bash$
Full check on the file format as above but using regular expressions allowing intervals (gnu awk):
bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}$/)' readings.txt
bash$
Identify any DATESTAMPs that are duplicated.
Accomplished by counting how many times the first field occurs and noting any second occurrences.
bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$
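The same count-and-report-on-second-occurrence idiom translates directly to other languages. For instance, a minimal Python 3 sketch (the helper name and the three sample lines are hypothetical, chosen to mirror awk's `++count[$1]==2`):

```python
from collections import defaultdict

def duplicated_datestamps(lines):
    """Yield each datestamp the second time it is seen,
    mirroring awk's ++count[$1]==2 so each duplicate is
    reported exactly once."""
    count = defaultdict(int)
    for line in lines:
        stamp = line.split()[0]
        count[stamp] += 1
        if count[stamp] == 2:
            yield stamp

sample = ["1991-03-31 10.0 1",
          "1991-03-31 40.0 1",
          "1991-04-01 0.0 -2"]
dups = list(duplicated_datestamps(sample))
```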
What number of records have good readings for all instruments.
bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' readings.txt
Total records 5471 OK records 5017 or 91.7017 %
bash$
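The flag positions this one-liner relies on (awk's 1-based fields $3, $5, …, $49, i.e. $(2*i+3) for i = 0..23) can be mirrored as a rough Python 3 sketch. The helper name is hypothetical, assuming the 49-field layout described above:

```python
def record_ok(line):
    """True when all 24 flags are >= 1.  awk's 1-based field
    $(2*i+3) is 0-based index 2*i+2 after splitting."""
    fields = line.split()
    return all(int(fields[2 * i + 2]) >= 1 for i in range(24))

# All flags 1 -> good; any flag < 1 -> bad.
good = record_ok("1991-03-30" + " 10.000 1" * 24)
bad = record_ok("1991-03-31" + " 40.000 1" + " 0.000 -2" * 23)
```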
Perl
<perl>use List::MoreUtils 'natatime';

use constant FIELDS => 49;

binmode STDIN, ':crlf';
# Read the newlines properly even if we're not running on
# Windows.

my ($line, $good_records, %dates) = (0, 0);
while (<>)
   {++$line;
    my @fs = split /\s+/;
    @fs == FIELDS or die "$line: Bad number of fields.\n";
    for (shift @fs)
       {/\d{4}-\d{2}-\d{2}/ or die "$line: Bad date format.\n";
        ++$dates{$_};}
    my $iterator = natatime 2, @fs;
    my $all_flags_okay = 1;
    while ( my ($val, $flag) = $iterator->() )
       {$val =~ /\d+\.\d+/ or die "$line: Bad value format.\n";
        $flag =~ /\A-?\d+/ or die "$line: Bad flag format.\n";
        $flag < 1 and $all_flags_okay = 0;}
    $all_flags_okay and ++$good_records;}

print "Good records: $good_records\n",
    "Repeated timestamps:\n",
    map {" $_\n"}
    grep {$dates{$_} > 1}
    sort keys %dates;</perl>
Output:
<pre>Good records: 5017
Repeated timestamps:
 1990-03-25
 1991-03-31
 1992-03-29
 1993-03-28
 1995-03-26</pre>
Python
<python>import re
import zipfile
import StringIO

def munge2(readings):
   datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
   valuPat = re.compile(r'[-+]?\d+\.\d+')
   statPat = re.compile(r'-?\d+')
   allOk, totalLines = 0, 0
   datestamps = set([])
   for line in readings:
      totalLines += 1
      fields = line.split('\t')
      date = fields[0]
      pairs = [(fields[i], fields[i+1]) for i in range(1, len(fields), 2)]

      lineFormatOk = datePat.match(date) and \
         all( valuPat.match(p[0]) for p in pairs ) and \
         all( statPat.match(p[1]) for p in pairs )
      if not lineFormatOk:
         print 'Bad formatting', line
         continue

      if len(pairs) != 24 or any( int(p[1]) < 1 for p in pairs ):
         print 'Missing values', line
         continue

      if date in datestamps:
         print 'Duplicate datestamp', line
         continue
      datestamps.add(date)
      allOk += 1

   print 'Lines with all readings: ', allOk
   print 'Total records: ', totalLines

#zfs = zipfile.ZipFile('readings.zip','r')
#readings = StringIO.StringIO(zfs.read('readings.txt'))
readings = open('readings.txt','r')
munge2(readings)</python>

The results indicate 5013 good records, which differs from the AWK implementation. The final few lines of the output are as follows:
<pre>Missing values 2004-12-29	2.900	1	2.700	1	2.800	1	3.300	1	2.900	1	2.300	1	0.000	0	1.700	1	1.900	1	2.300	1	2.600	1	2.900	1	2.600	1	2.600	1	2.600	1	2.700	1	2.300	1	2.200	1	2.100	1	2.000	1	2.100	1	2.100	1	2.300	1	2.400	1
Missing values 2004-12-30	2.400	1	2.600	1	2.600	1	2.600	1	3.000	1	0.000	0	3.300	1	2.600	1	2.900	1	2.400	1	2.300	1	2.900	1	3.500	1	3.700	1	3.600	1	4.000	1	3.400	1	2.400	1	2.500	1	2.600	1	2.600	1	2.800	1	2.400	1	2.200	1
Missing values 2004-12-31	2.400	1	2.500	1	2.500	1	2.400	1	0.000	0	2.400	1	2.400	1	2.400	1	2.200	1	2.400	1	2.500	1	2.000	1	1.700	1	1.400	1	1.500	1	1.900	1	1.700	1	2.000	1	2.000	1	2.200	1	1.700	1	1.500	1	1.800	1	1.800	1
Lines with all readings:  5013
Total records:  5471</pre>
Second Version
Modification of the version above to:
- Remove continue statements so it counts as the AWK example does.
- Generate mostly summary information that is easier to compare to other solutions.
<python>import re
import zipfile
import StringIO

def munge2(readings, debug=False):
   datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
   valuPat = re.compile(r'[-+]?\d+\.\d+')
   statPat = re.compile(r'-?\d+')
   totalLines = 0
   dupdate, badform, badlen, badreading = set(), set(), set(), 0
   datestamps = set([])
   for line in readings:
      totalLines += 1
      fields = line.split('\t')
      date = fields[0]
      pairs = [(fields[i], fields[i+1]) for i in range(1, len(fields), 2)]
      lineFormatOk = datePat.match(date) and \
         all( valuPat.match(p[0]) for p in pairs ) and \
         all( statPat.match(p[1]) for p in pairs )
      if not lineFormatOk:
         if debug: print 'Bad formatting', line
         badform.add(date)
      if len(pairs) != 24 or any( int(p[1]) < 1 for p in pairs ):
         if debug: print 'Missing values', line
         if len(pairs) != 24:
            badlen.add(date)
         if any( int(p[1]) < 1 for p in pairs ):
            badreading += 1
      if date in datestamps:
         if debug: print 'Duplicate datestamp', line
         dupdate.add(date)
      datestamps.add(date)

   print 'Duplicate dates:\n ', '\n  '.join(sorted(dupdate))
   print 'Bad format:\n ', '\n  '.join(sorted(badform))
   print 'Bad number of fields:\n ', '\n  '.join(sorted(badlen))
   print 'Records with good readings: %i = %5.2f%%\n' % (
      totalLines - badreading, (totalLines - badreading) / float(totalLines) * 100 )
   print 'Total records: ', totalLines

readings = open('readings.txt','r')
munge2(readings)</python>
bash$ /cygdrive/c/Python26/python munge2.py
Duplicate dates:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26
Bad format:
Bad number of fields:
Records with good readings: 5017 = 91.70%
Total records:  5471
bash$
Tcl
set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1]
set total [llength $data]
set correct $total
set datestamps {}
foreach line $data {
    set formatOk true
    set hasAllMeasurements true
    set date [lindex $line 0]
    if {[llength $line] != 49} { set formatOk false }
    if {![regexp {\d{4}-\d{2}-\d{2}} $date]} { set formatOk false }
    if {[lsearch $datestamps $date] != -1} {
        puts "Duplicate datestamp: $date"
    } else {
        lappend datestamps $date
    }
    foreach {value flag} [lrange $line 1 end] {
        if {$flag < 1} { set hasAllMeasurements false }
        if {![regexp -- {[-+]?\d+\.\d+} $value] || ![regexp -- {-?\d+} $flag]} {
            set formatOk false
        }
    }
    if {!$hasAllMeasurements} { incr correct -1 }
    if {!$formatOk} { puts "line \"$line\" has wrong format" }
}
puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"
$ tclsh munge2.tcl
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26
5017 records with good readings = 91.7016998721%
Total records: 5471