Text processing/2: Difference between revisions
Thundergnat (talk | contribs) m (syntax highlighting fixup automation)
Line 28:
{{trans|Python}}
<
V datePat = re:‘\d{4}-\d{2}-\d{2}’
V valuPat = re:‘[-+]?\d+\.\d+’
Line 73:
print("Records with good readings: #. = #2.2%\n".format(
totalLines - badreading, (totalLines - badreading) / Float(totalLines) * 100))
print(‘Total records: ’totalLines)</syntaxhighlight>
{{out}}
Line 94:
=={{header|Ada}}==
{{libheader|Simple components for Ada}}
<
with Ada.Text_IO; use Ada.Text_IO;
with Strings_Edit; use Strings_Edit;
Line 156:
Close (File);
Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");
end Data_Munging_2;</syntaxhighlight>
Sample output
<pre>
Line 168:
=={{header|Aime}}==
<
{
integer i;
Line 220:
0;
}</syntaxhighlight>
{{out}} (the "readings.txt" file needs to be converted to UNIX end-of-line format)
<pre>duplicate 19900325 line
Line 231:
=={{header|AutoHotkey}}==
<
data = %A_scriptdir%\readings.txt
Line 283:
msgbox, Duplicate Dates:`n%wrongDates%`nRead Lines: %lines%`nValid Lines: %valid%`nwrong lines: %totwrong%`nDuplicates: %TotWrongDates%`nWrong Formatted: %unvalidformat%`n
</syntaxhighlight>
Sample Output:
Line 310:
If there are any scientific-notation fields, then there will be an e in the file:
<
bash$</syntaxhighlight>
Quick check on the number of fields:
<
bash$</syntaxhighlight>
Full check on the file format using a regular expression:
<
bash$</syntaxhighlight>
Full check on the file format as above but using regular expressions allowing intervals (gnu awk):
<
bash$</syntaxhighlight>
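The three checks described above (a scientific-notation scan, a field count, and a full regular-expression match) can be sketched in Python. This is only an illustration, not the awk commands from this entry: the function names are hypothetical, and the record layout (a date followed by 24 value/flag pairs, 49 whitespace-separated fields in all) is assumed from the task description.

```python
import re

# Assumed layout: a date, then 24 (decimal value, integer flag) pairs.
RECORD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(?:\s+[-+]?\d+\.\d+\s+-?\d+){24}\s*$"
)

def has_scientific_notation(line):
    """True if the line contains an 'e' or 'E', as an exponent would."""
    return "e" in line.lower()

def field_count(line):
    """Number of whitespace-separated fields (49 for a well-formed record)."""
    return len(line.split())

def is_well_formed(line):
    """True if the whole line matches the assumed record pattern."""
    return RECORD_RE.match(line) is not None
```

A line passes the quick check when `field_count(line) == 49`, and the full check when `is_well_formed(line)` is true.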
Line 326:
Accomplished by counting how many times the first field occurs and noting any second occurrences.
<
1990-03-25
1991-03-31
Line 332:
1993-03-28
1995-03-26
bash$</syntaxhighlight>
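The counting approach described above, tallying how often each first field occurs and reporting those seen more than once, might look like this in Python. It is a sketch rather than a translation of the awk command; `duplicate_dates` is a hypothetical name, and each record is assumed to start with its date field.

```python
from collections import Counter

def duplicate_dates(lines):
    """Return the sorted dates that appear as the first field more than once."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sorted(date for date, n in counts.items() if n > 1)
```

For example, `duplicate_dates(open("readings.txt"))` would list every repeated datestamp in the file.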
Line 338:
<div style="width:100%;overflow:scroll">
<
Total records 5471 OK records 5017 or 91.7017 %
bash$</syntaxhighlight>
</div>
=={{header|C}}==
<
#include <string.h>
#include <stdlib.h>
Line 427:
read_file("readings.txt");
return 0;
}</syntaxhighlight>
{{out}}
Line 441:
=={{header|C sharp|C#}}==
<
using System.Collections.Generic;
using System.Text.RegularExpressions;
Line 512:
}
}
}</syntaxhighlight>
<pre>
Line 525:
=={{header|C++}}==
{{libheader|Boost}}
<
#include <fstream>
#include <iostream>
Line 576:
cout << all_ok << " records were complete and ok!\n" ;
return 0 ;
}</syntaxhighlight>
{{out}}
Line 592:
=={{header|Clojure}}==
<syntaxhighlight lang="clojure">
(defn parse-line [s]
(let [[date & data-toks] (str/split s #"\s+")
Line 630:
(clojure.string/join " " (sort (:dupl-dates m)))))
(println (format "%d lines with no missing data" (:n-full-recs m)))))
</syntaxhighlight>
{{out}}
Line 641:
=={{header|COBOL}}==
{{works with|OpenCOBOL}}
<
PROGRAM-ID. text-processing-2.
Line 803:
INSPECT input-data (offset:) TALLYING data-len
FOR CHARACTERS BEFORE delim
.</syntaxhighlight>
{{out}}
Line 818:
=={{header|D}}==
<
import std.stdio, std.array, std.string, std.regex, std.conv,
std.algorithm;
Line 854:
repeatedDates.byKey.filter!(k => repeatedDates[k] > 1));
writeln("Good reading records: ", goodReadings);
}</syntaxhighlight>
{{out}}
<pre>Duplicated timestamps: 1990-03-25, 1991-03-31, 1992-03-29, 1993-03-28, 1995-03-26
Line 860:
=={{header|Eiffel}}==
<syntaxhighlight lang="eiffel">
class
APPLICATION
Line 976:
end
</syntaxhighlight>
{{out}}
<pre>
Line 996:
=={{header|Erlang}}==
Uses function from [[Text_processing/1]]. It does some correctness checks for us.
<syntaxhighlight lang="erlang">
-module( text_processing2 ).
Line 1,028:
value_flag_records() -> 24.
</syntaxhighlight>
{{out}}
<pre>
Line 1,037:
=={{header|F Sharp|F#}}==
<
let file = @"readings.txt"
Line 1,057:
ok <- ok + 1
printf "%d records were ok\n" ok
</syntaxhighlight>
Prints:
<
Date 1990-03-25 is duplicated
Date 1991-03-31 is duplicated
Line 1,066:
Date 1995-03-26 is duplicated
5017 records were ok
</syntaxhighlight>
=={{header|Factor}}==
{{works with|Factor|0.99 2020-03-02}}
<
prettyprint sequences sequences.extras sets splitting ;
Line 1,080:
[ "Duplicates:" print [ "\t" split1 drop ] map duplicates . ]
[ [ " \t" split rest <odds> [ string>number 0 <= ] none? ] count ]
bi pprint " records were good." print</syntaxhighlight>
{{out}}
<pre>
Line 1,100:
Rather than copy today's data to a PDATA holder so that on the next read the new data may be compared to the old, a two-row array is used, with IT flip-flopping 1,2,1,2,1,2,... Comparison of the data as numerical values rather than text strings means that different texts that evoke the same value will not be regarded as different. If the data format were invalid, there would be horrible messages. There aren't, so ... the values should be read and plotted...
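The flip-flop idea can be illustrated outside Fortran. The Python sketch below is not the program that follows: the names `parse_date` and `repeated_days` are hypothetical, the index alternates 0,1 rather than 1,2, and only the date is compared. Dates are parsed to integers first, so texts that denote the same value compare equal.

```python
def parse_date(text):
    """Turn 'YYYY-MM-DD' into a tuple of integers (year, month, day)."""
    year, month, day = (int(part) for part in text.split("-"))
    return (year, month, day)

def repeated_days(records):
    """Yield (record number, date) whenever a record repeats the previous date.

    `records` is an iterable of (date, values) pairs, date as an integer tuple.
    """
    rows = [None, None]  # two-row holder instead of a separate "previous" copy
    it = 0               # which row holds the current record
    for n, record in enumerate(records, start=1):
        it = 1 - it      # flip-flop: the other row now holds the previous record
        rows[it] = record
        previous = rows[1 - it]
        if previous is not None and previous[0] == record[0]:
            yield n, record[0]
```

Because only the index changes, no data is copied between reads.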
<syntaxhighlight lang="fortran">
Crunches a set of hourly data. Starts with a date, then 24 pairs of value,indicator for that day, on one line.
INTEGER Y,M,D !Year, month, and day.
Line 1,163:
900 CLOSE(IN) !Done.
END !Spaghetti rules.
</syntaxhighlight>
Output:
Line 1,175:
=={{header|Go}}==
<
import (
Line 1,249:
fmt.Println(uniqueGood,
"unique dates with good readings for all instruments.")
}</syntaxhighlight>
{{out}}
<pre>
Line 1,264:
=={{header|Haskell}}==
<
import Data.List (nub, (\\))
Line 1,283:
putStr (unlines ("duplicated dates:": duplicatedDates (map date inputs)))
putStrLn ("number of good records: " ++ show (length $ goodRecords inputs))
</syntaxhighlight>
This script outputs:
Line 1,299:
duplicated timestamps that are on well-formed records.
<
dups := set()
goodRecords := 0
Line 1,331:
}
end</syntaxhighlight>
Sample run:
Line 1,344:
=={{header|J}}==
<
dat=: TAB readdsv jpath '~temp/readings.txt'
Dates=: getdate"1 >{."1 dat
Line 1,363:
1992 3 29
1993 3 28
1995 3 26</syntaxhighlight>
=={{header|Java}}==
{{trans|C++}}
{{works with|Java|1.5+}}
<
import java.util.regex.*;
import java.io.*;
Line 1,411:
}
}
}</syntaxhighlight>
The program produces the following output:
<pre>
Line 1,425:
=={{header|JavaScript}}==
{{works with|JScript}}
<
function analyze_func(filename) {
var dates_seen = {};
Line 1,474:
var analyze = analyze_func('readings.txt');
analyze();</syntaxhighlight>
=={{header|jq}}==
Line 1,480:
For this problem, it is convenient to use jq in a pipeline: the first invocation of jq will convert the text file into a stream of JSON arrays (one array per line):
<
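The conversion step, turning each text line into one JSON array, can be sketched in Python for readers unfamiliar with jq. This is not the jq command used in this entry: `lines_to_json_arrays` is a hypothetical name, and keeping every field as a string is an assumption, made because the validation helpers below apply regular-expression tests and `tonumber` to the fields.

```python
import json

def lines_to_json_arrays(lines):
    """Yield one JSON array (as text) per non-blank input line."""
    for line in lines:
        fields = line.split()  # split on runs of tabs or spaces
        if fields:
            yield json.dumps(fields)
```

Each yielded string corresponds to one array in the stream that the second stage of the pipeline consumes.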
The second part of the pipeline performs the task requirements. The following program is used in the second invocation of jq.
'''Generic Utilities'''
<
def runs:
reduce .[] as $item
Line 1,500:
def is_integral: test("^[-+]?[0-9]+$");
def is_date: test("[12][0-9]{3}-[0-9][0-9]-[0-9][0-9]");</syntaxhighlight>
'''Validation''':
<
def validate_line(nr):
def validate_date:
Line 1,521:
def validate_lines:
. as $in
| range(0; length) as $i | ($in[$i] | validate_line($i + 1));</syntaxhighlight>
'''Check for duplicate timestamps'''
<
[.[][0]] | sort | runs | map( select(.[1]>1) );</syntaxhighlight>
'''Number of valid readings for all instruments''':
<
# but does check the validity of the record, including the date format:
def number_of_valid_readings:
Line 1,538:
and all(range(0; 24) | $in[2*. + 2] | (is_integral and tonumber >= 1) );
map(select(check)) | length ;</syntaxhighlight>
'''Generate Report'''
<
"\nChecking for duplicate timestamps:",
duplicate_timestamps,
"\nThere are \(number_of_valid_readings) valid rows altogether."</syntaxhighlight>
{{out}}
'''Part 1: Simple demonstration'''
To illustrate that the program does report invalid lines, we first use the six lines at the top but mangle the last line.
<
field 1 in line 6 has an invalid date: 991-04-03
line 6 has 47 fields
Line 1,564:
]
There are 5 valid rows altogether.</syntaxhighlight>
'''Part 2: readings.txt'''
<
Checking for duplicate timestamps:
[
Line 1,592:
]
There are 5017 valid rows altogether.</syntaxhighlight>
=={{header|Julia}}==
Refer to the code at https://rosettacode.org/wiki/Text_processing/1#Julia. Add at the end of that code the following:
<syntaxhighlight lang="julia">
dupdate = df[nonunique(df[:,[:Date]]),:][:Date]
println("The following rows have duplicate DATESTAMP:")
Line 1,602:
println("All values good in these rows:")
println(df[df[:ValidValues] .== 24,:])
</syntaxhighlight>
{{output}}
<pre>
Line 1,646:
=={{header|Kotlin}}==
<
import java.io.File
Line 1,687:
percent = allGood.toDouble() / count * 100.0
println("Number which are all good : $allGood (${"%5.2f".format(percent)}%)")
}</syntaxhighlight>
{{out}}
Line 1,706:
=={{header|Lua}}==
<
io.input( filename )
Line 1,749:
for i = 1, #bad_format do
print( " ", bad_format[i] )
end</syntaxhighlight>
Output:
<pre>Lines read: 5471
Line 1,765:
The file is expected in the user directory. Use Win Dir$ to open an Explorer window there and copy readings.txt into it.
<
Document a$, exp$
\\ automatically find the encoding and the line break
Line 1,798:
}
TestThis
</syntaxhighlight>
{{out}}
<pre>
Line 1,813:
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<
Select[Tally@data[[;;,1]], #[[2]]>1&][[;;,1]]//Column
Print["number of good records: ", Count[(Times@@#[[3;;All;;2]])& /@ data, 1],
" (out of a total of ", Length[data], ")"]</syntaxhighlight>
{{out}}
<pre>duplicated dates:
Line 1,828:
=={{header|MATLAB}} / {{header|Octave}}==
<
% READDAT reads readings.txt file
%
Line 1,856:
dix = find(diff(d)==0) % check for two consecutive timestamps with zero difference
printf('number of valid records: %i\n ', sum( all( val(:,5:2:end) >= 1, 2) ) );</syntaxhighlight>
<pre>>> [val,count]=readdat;
Line 1,871:
=={{header|Nim}}==
<
const NumFields = 49
Line 1,934:
echo "Total Records: ", totalRecords
echo "Records with wrong format: ", badRecords
echo "Records where all instruments were OK: ", goodInstruments</syntaxhighlight>
{{out}}
Line 1,947:
=={{header|OCaml}}==
<
open Str
Line 2,014:
Printf.printf "number of good records: %d\n" (num_good_records inputs);
;;</syntaxhighlight>
This script outputs:
Line 2,028:
=={{header|Perl}}==
<
use constant FIELDS => 49;
Line 2,055:
map {" $_\n"}
grep {$dates{$_} > 1}
sort keys %dates;</syntaxhighlight>
Output:
Line 2,067:
=={{header|Phix}}==
<!--<
<span style="color: #000080;font-style:italic;">-- demo\rosetta\TextProcessing2.exw</span>
<span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span> <span style="color: #000080;font-style:italic;">-- (include version/first of next three lines only)</span>
Line 2,101:
<span style="color: #0000FF;">?</span><span style="color: #008000;">"done"</span>
<span style="color: #0000FF;">{}</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">wait_key</span><span style="color: #0000FF;">()</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
Line 2,113:
=={{header|PHP}}==
<
$missformcount = 0;
$totalcount = 0;
Line 2,147:
foreach ($duplicates as $key => $val){
echo $val . ' at Line : ' . $key . '<br>';
}</syntaxhighlight>
<pre>Valid records 5017 of 5471 total
Duplicates :
Line 2,157:
=={{header|Picat}}==
<
go =>
Line 2,181:
check_field(Field) =>
Field == "-2" ; Field == "-1" ; Field == "0".</syntaxhighlight>
{{out}}
Line 2,196:
=={{header|PicoLisp}}==
Put the following into an executable file "checkReadings":
<
(load "@lib/misc.l")
Line 2,217:
(bye 1) ) ) ) )
(bye)</syntaxhighlight>
Then it can be called as
<pre>$ ./checkReadings readings.txt</pre>
=={{header|PL/I}}==
<
/* To process readings produced by automatic reading stations. */
Line 2,275:
put skip list ('There were ' || k-faulty || ' good readings' );
end check;
</syntaxhighlight>
=={{header|PowerShell}}==
<
$goodLineCount = 0
get-content c:\temp\readings.txt |
Line 2,301:
}
[string]$goodLineCount + " good lines"
</syntaxhighlight>
Output:
Line 2,312:
An alternative using regular expression syntax:
<
$dateHash = @{}
$goodLineCount = 0
Line 2,333:
}
[string]$goodLineCount + " good lines"
</syntaxhighlight>
Output:
Line 2,347:
=={{header|PureBasic}}==
Using regular expressions.
<
#instrumentCount = 24
Line 2,418:
CloseConsole()
EndIf
EndIf</syntaxhighlight>
Sample output:
<pre>Duplicate date: 1990-03-25 occurs on lines 85 and 84.
Line 2,429:
=={{header|Python}}==
<
import zipfile
import StringIO
Line 2,469:
#readings = StringIO.StringIO(zfs.read('readings.txt'))
readings = open('readings.txt','r')
munge2(readings)</syntaxhighlight>
The results indicate 5013 good records, which differs from the Awk implementation. The final few lines of the output are as follows:
<pre style="height:10ex;overflow:scroll">
Line 2,488:
* Generate mostly summary information that is easier to compare to other solutions.
<
import zipfile
import StringIO
Line 2,532:
readings = open('readings.txt','r')
munge2(readings)</syntaxhighlight>
<pre>bash$ /cygdrive/c/Python26/python munge2.py
Duplicate dates:
Line 2,550:
=={{header|R}}==
<
dfr <- read.delim("d:/readings.txt", colClasses=c("character", rep(c("numeric", "integer"), 24)))
dates <- strptime(dfr[,1], "%Y-%m-%d")
Line 2,562:
# Number of rows with no bad values
flags <- as.matrix(dfr[,seq(3,49,2)])>0
sum(apply(flags, 1, all))</syntaxhighlight>
=={{header|Racket}}==
<
(read-decimal-as-inexact #f)
;; files to read is a sequence, so it could be either a list or vector of files
Line 2,614:
(printf "~a records have good readings for all instruments~%"
(text-processing/2 (current-command-line-arguments)))</syntaxhighlight>
Example session:
<pre>$ racket 2.rkt readings/readings.txt
Line 2,641:
Note that we sort the pairs after we've grepped them, not before; this works fine in Raku, sorting on the key and value as primary and secondary keys. Finally, pairs and arrays provide a default print format that is sufficient without additional formatting in this case.
<syntaxhighlight lang="raku">
my $line;
my %dates;
Line 2,659:
say 'Repeated timestamps (with line numbers):';
.say for sort %dates.pairs.grep: *.value.elems > 1;</syntaxhighlight>
Output:
<pre>5017 good records out of 5471 total
Line 2,688:
<br><br>
The program has (negated) code to write the report to a file in addition to the console.
<
numeric digits 20 /*allow for bigger numbers. */
ifid='READINGS.TXT' /*name of the input file. */
Line 2,827:
return y//100\==0 | y//400==0 /*apply the 100 and the 400 year rule.*/
/*────────────────────────────────────────────────────────────────────────────*/
sy: say arg(1); call lineout ofid,arg(1); return</syntaxhighlight>
'''output''' when using the default input file:
<pre style="height:35ex">
Line 2,857:
=={{header|Ruby}}==
<
def munge2(readings, debug=false)
Line 2,909:
open('readings.txt','r') do |readings|
munge2(readings)
end</syntaxhighlight>
=={{header|Scala}}==
{{works with|Scala|2.8}}
<
import scala.io.Source
import scala.collection.immutable.{TreeMap => Map}
Line 2,951:
dateMap.valuesIterable.sum))
}
}</syntaxhighlight>
Sample output:
Line 2,971:
=={{header|Sidef}}==
{{trans|Raku}}
<
var dates = Hash();
Line 2,984:
say "#{good_records} good records out of #{$.} total";
say 'Repeated timestamps:';
say dates.to_a.grep{ .value > 1 }.map { .key }.sort.join("\n");</syntaxhighlight>
{{out}}
<pre>
Line 3,001:
Developed using the Snobol4 dialect Spitbol for Linux, version 4.0
<
v = array(24)
Line 3,052:
end
</syntaxhighlight>
{{out}}
<pre>1990-03-25: datestamp at row 85 duplicates datestamp at 84
Line 3,068:
=={{header|Tcl}}==
<
set total [llength $data]
set correct $total
Line 3,092:
puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"</syntaxhighlight>
<pre>$ tclsh munge2.tcl
Duplicate datestamp: 1990-03-25
Line 3,107:
To demonstrate a different method of iterating over the file, and different ways to verify data types:
<
array set seen {}
set fh [open readings.txt]
Line 3,145:
puts "total: $total"
puts [format "good: %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]</syntaxhighlight>
Results:
<pre>duplicate date on line 85: 1990-03-25
Line 3,158:
compiled and run in a single step, with the input file accessed as a list of strings
pre-declared in readings_dot_txt
<
#import nat
Line 3,171:
#show+
main = valid_format?(^C/good_readings duplicate_dates,-[invalid format]-!) readings</syntaxhighlight>
output:
<pre>5017 good readings
Line 3,182:
=={{header|VBScript}}==
<
Set objFile = objFSO.OpenTextFile(objFSO.GetParentFolderName(WScript.ScriptFullName) &_
"\readings.txt",1)
Line 3,235:
objFile.Close
Set objFSO = Nothing</syntaxhighlight>
{{Out}}
Line 3,256:
* Reads flag value and checks if it is positive
* Requires 24 value/flag pairs on each line
<
File_Open("|(PATH_ONLY)\output.txt")
#51 = Buf_Num // Edit buffer for output file
Line 3,303:
IT("Date format errors: ") Num_Ins(#14)
IT("Invalid data records:") Num_Ins(#15)
IT("Total records: ") Num_Ins(#12)</syntaxhighlight>
Sample output:
<
1991-03-31: duplicate record at 456
1992-03-29: duplicate record at 820
Line 3,315:
Date format errors: 0
Invalid data records: 454
Total records: 5471</syntaxhighlight>
=={{header|Wren}}==
Line 3,322:
{{libheader|Wren-fmt}}
{{libheader|Wren-sort}}
<
import "/pattern" for Pattern
import "/fmt" for Fmt
Line 3,364:
Fmt.print("Number of invalid records : $d ($5.2f)\%", invalid, percent)
percent = allGood/count * 100
Fmt.print("Number which are all good : $d ($5.2f)\%", allGood, percent)</syntaxhighlight>
{{out}}
Line 3,383:
=={{header|zkl}}==
<
// I can't use it to select all fields, only verify them
re:=RegExp(0'|^(\d+-\d+-\d+)| + 0'|\s+\d+\.\d+\s+-*\d+| * 24 + ".+$");
Line 3,401:
good+=1;
}
println("%d records read, %d duplicate dates, %d valid".fmt(N,dd,good));</syntaxhighlight>
{{out}}
<pre>