Jump to content

Merge and aggregate datasets: Difference between revisions

m
syntax highlighting fixup automation
m (syntax highlighting fixup automation)
Line 71:
{{trans|Python: Stdlib csv only}}
 
<langsyntaxhighlight lang="11l">V patients_csv =
‘PATIENT_ID,LASTNAME
1001,Hopper
Line 125:
 
L(record) result
print(‘| ’record.map(r -> r.center(10)).join(‘ | ’)‘ |’)</langsyntaxhighlight>
 
{{out}}
Line 138:
 
=={{header|AutoHotkey}}==
<langsyntaxhighlight AutoHotkeylang="autohotkey">Merge_and_aggregate(patients, visits){
ID := [], LAST_VISIT := [], SCORE_SUM := [], VISIT := []
for i, line in StrSplit(patients, "`n", "`r"){
Line 161:
output .= ID "`t" name "`t" LAST_VISIT[id] "`t" SCORE_SUM[id] "`t" SCORE_SUM[id]/VISIT[id] "`n"
return output
}</langsyntaxhighlight>
Examples:<langsyntaxhighlight AutoHotkeylang="autohotkey">patients =
(
PATIENT_ID,LASTNAME
Line 186:
 
MsgBox % Merge_and_aggregate(patients, visits)
return</langsyntaxhighlight>
{{out}}
<pre>PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG
Line 196:
 
=={{header|AWK}}==
<syntaxhighlight lang="awk">
<lang AWK>
# syntax: GAWK -f MERGE_AND_AGGREGATE_DATASETS.AWK RC-PATIENTS.CSV RC-VISITS.CSV
# files may appear in any order
Line 235:
exit(0)
}
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 247:
=={{header|C++}}==
Uses C++20
<langsyntaxhighlight lang="cpp">#include <iostream>
#include <optional>
#include <ranges>
Line 323:
cout << " |\n";
}
}</langsyntaxhighlight>
 
{{out}}
Line 336:
</pre>
=={{header|C sharp}}==
<langsyntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.Globalization;
Line 408:
}
 
}</langsyntaxhighlight>
{{out}}
<pre>
Line 420:
=={{header|Datalog}}==
The following implementation is for Souffle. Souffle does not currently include dates and times in the base distribution, but it is straightforward to use C timestamps:
<langsyntaxhighlight Clang="c">// datetime.cpp
#include <ctime>
#include <cstdint>
Line 429:
return mktime(&tmInfo); // localtime
}
}</langsyntaxhighlight>
 
Rather than combine the summaries, the date and score summaries have been presented separately.
<langsyntaxhighlight lang="prolog">/* rosetta.dl */
#define NaN to_float("NaN")
.functor from_date(date:symbol) : number
Line 467:
score_mean = mean score: {Visit(id, _, score), score != NaN}.
.output SummaryDates
.output SummaryScores</langsyntaxhighlight>
 
Then this is called using:
<langsyntaxhighlight lang="bash">g++ -shared -fPIC datetime.cpp -o libfunctors.so
souffle -D- rosetta.dl</langsyntaxhighlight>
{{output}}
<pre>---------------
Line 493:
=={{header|F_Sharp|F#}}==
Note that the scores are right justified to copy the task description. It would be more natural to leave them right justified.
<langsyntaxhighlight lang="fsharp">
// Merge and aggregate datasets. Nigel Galloway: January 6th., 2021
let rFile(fName)=seq{use n=System.IO.File.OpenText(fName)
Line 503:
let fG n g=let z=G.[n]|>Seq.sumBy(fun n->try float n.[2] with :? System.FormatException->0.0)
fN n g (G.[n]|>Seq.sort|>Seq.last).[1] (if z=0.0 then "" else string z) (if z=0.0 then "" else string(z/(float(Seq.length G.[n]))))
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 518:
=={{header|Go}}==
{{trans|Wren}}
<langsyntaxhighlight lang="go">package main
 
import (
Line 655:
}
mergePrint(merges)
}</langsyntaxhighlight>
 
{{out}}
Line 673:
Merging of fields and databases is defined as a monoid operation for corresponding types.
 
<langsyntaxhighlight lang="haskell">import Data.List
import Data.Maybe
import System.IO (readFile)
Line 738:
where
go [] = Nothing
go s = Just $ drop 1 <$> span (/= ch) s</langsyntaxhighlight>
 
<pre>let patients = readDB <$> readFile "patients.csv"
Line 765:
===Pretty tabulation===
 
<langsyntaxhighlight lang="haskell">
tabulateDB (DB ps) header cols = intercalate "|" <$> body
where
Line 787:
, \p -> case scores p of {[] -> []; s -> show (mean s)} ]
 
mean lst = sum lst / genericLength lst</langsyntaxhighlight>
 
<pre>*Main> main
Line 802:
 
In other words, we can set things up like this:
<langsyntaxhighlight Jlang="j">NB. setup:
require'jd pacman'
load JDP,'tools/csv_load.ijs'
Line 835:
csvload 'visits';1
 
jd'ref patients PATIENTID visits PATIENTID'</langsyntaxhighlight>
 
And, then we can run our query:
 
<langsyntaxhighlight Jlang="j">require'jd'
 
echo jd {{)n
Line 853:
p:patients,
v:p.visits
}} -.LF</langsyntaxhighlight>
 
Which displays this result:
Line 868:
Another approach would be to use J's csv library:
 
<langsyntaxhighlight Jlang="j">require'csv'
patients=: fixcsv {{)n
PATIENTID,LASTNAME
Line 904:
labels=. ;:'PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG'
labels,:id;nm;vdt;sum;avg
}}</langsyntaxhighlight>
 
Here:
<langsyntaxhighlight Jlang="j"> task''
┌──────────┬────────┬──────────┬─────────┬─────────┐
│PATIENT_ID│LASTNAME│LAST_VISIT│SCORE_SUM│SCORE_AVG│
Line 916:
│4004 │Wirth │2020-11-05│15.4 │7.7 │
│5005 │Kurtz │ │ │ │
└──────────┴────────┴──────────┴─────────┴─────────┘</langsyntaxhighlight>
 
If the empty score in visits was a display concern, we might instead do it this way:
 
<langsyntaxhighlight Jlang="j">task=: {{
P=. <@:>"1|:/:~}.patients
V=. <@:>"1|:/:~}.visits
Line 936:
labels=. ;:'PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG'
labels,:id;nm;vdt;sum;avg
}}</langsyntaxhighlight>
 
Which gives us:
 
<langsyntaxhighlight Jlang="j"> task''
┌──────────┬────────┬──────────┬─────────┬─────────┐
│PATIENT_ID│LASTNAME│LAST_VISIT│SCORE_SUM│SCORE_AVG│
Line 949:
│4004 │Wirth │2020-11-05│15.4 │7.7 │
│5005 │Kurtz │ │ │ │
└──────────┴────────┴──────────┴─────────┴─────────┘</langsyntaxhighlight>
 
=={{header|jq}}==
Line 960:
 
'''Ingesting CSV data'''
<syntaxhighlight lang="jq">
<lang jq>
# objectify/1 takes an array of atomic values as inputs, and packages
# these into an object with keys specified by the "headers" array and
Line 978:
([]; . + [ $row|objectify($headers) ]);
 
</syntaxhighlight>
</lang>
'''Aggregation functions'''
<syntaxhighlight lang="jq">
<lang jq>
# output {LAST_VISIT}
def LAST_VISIT($patient_id):
Line 990:
| {SCORE_SUM: add, count: length}
| {SCORE_SUM, SCORE_AVG: (if .SCORE_SUM and .count > 0 then .SCORE_SUM/.count else null end)};
</langsyntaxhighlight>'''The task'''<syntaxhighlight lang ="jq">
# Read the two tables:
INDEX($patients | [splits("\n")] | map(split(",")) | csv2jsonHelper[]; .PATIENT_ID) as $patients
Line 1,001:
# ... but display it as a sequence of JSON objects
| .[]
</syntaxhighlight>
</lang>
'''Invocation'''
 
Line 1,019:
=={{header|Julia}}==
{{trans|Python}}
<langsyntaxhighlight lang="julia">using CSV, DataFrames, Statistics
 
# load data from csv files
Line 1,058:
end
println(df_result)
</langsyntaxhighlight>{{out}}
<pre>
5×5 DataFrame
Line 1,072:
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<langsyntaxhighlight Mathematicalang="mathematica">a = ImportString["PATIENT_ID,LASTNAME
1001,Hopper
4004,Wirth
Line 1,096:
"SCORE_SUM" -> If[DeleteMissing@#[[All, "SCORE"]] =!= {}, Total@DeleteMissing@#[[All, "SCORE"]], Missing[]],
"SCORE_AVG" -> If[DeleteMissing@#[[All, "SCORE"]] =!= {}, Mean@DeleteMissing@#[[All, "SCORE"]], Missing[]]|> & /@
gr // Dataset</langsyntaxhighlight>
{{out}}
<pre> PATIENT_ID LASTNAME VISIT_DATE SCORE_SUM SCORE_AVG
Line 1,105:
5005 5005 Kurtz \[LongDash] \[LongDash] \[LongDash]</pre>
=={{header|Mercury}}==
<langsyntaxhighlight Prologlang="prolog">:- module rosetta.
 
:- interface.
Line 1,174:
(if MaxDate1 = date(0,0,0) then MaxDate = no else MaxDate = MaxDate1)),
Solutions),
foldl(io.write_line, Solutions, !IO).</langsyntaxhighlight>
{{out}}
<pre>{Id, Lastname, SumScores, AvgScores, MaxDate}:
Line 1,185:
=={{header|Nim}}==
===CSV files and tables===
<langsyntaxhighlight Nimlang="nim">import algorithm, parsecsv, strformat, strutils, tables
 
const NoValue = -1.0
Line 1,237:
let scoreSum = if count == 0: "" else: &"{sum:>4.1f}"
let scoreAvg = if count == 0: "" else: &"{sum / count.toFloat: >4.2f}"
echo &"| {id:^10} | {name:^10} | {lastVisit:^10} | {scoreSum:>7} | {scoreAvg:>6} |"</langsyntaxhighlight>
 
{{out}}
Line 1,250:
{{trans|Python}}
We use the high level standard library "db_sqlite" rather than the low level one "sqlite3".
<langsyntaxhighlight Nimlang="nim">import parseCsv, db_sqlite, sequtils, strutils
 
const FNames = ["patients1.csv", "patients2.csv"]
Line 1,302:
echo "| PATIENT_ID | LASTNAME | LAST_VISIT | SCORE_SUM | SCORE_AVG |"
for row in result:
echo "| " & row.mapit(it.center(10)).join(" | ") & '|'</langsyntaxhighlight>
 
{{out}}
Line 1,313:
 
=={{header|Phix}}==
<!--<langsyntaxhighlight Phixlang="phix">(phixonline)-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">patients_txt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"""
PATIENT_ID,LASTNAME
Line 1,371:
<span style="color: #0000FF;">{</span><span style="color: #7060A8;">id</span><span style="color: #0000FF;">,</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #000000;">dt</span><span style="color: #0000FF;">,</span><span style="color: #000000;">scstr</span><span style="color: #0000FF;">,</span><span style="color: #000000;">avstr</span><span style="color: #0000FF;">})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<!--</langsyntaxhighlight>-->
{{out}}
<pre>
Line 1,385:
Weird requirement: "Use the most common libraries only when built-in functionality is not sufficient."
Not even a "use strict;" :)
<langsyntaxhighlight lang="perl">#!/usr/bin/perl
my $fmt = '| %-11s' x 5 . "|\n";
printf $fmt, qw( PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG);
Line 1,421:
3003,2020-11-12,
4004,2020-11-05,7.0
1001,2020-11-19,5.3</langsyntaxhighlight>
{{out}}
<pre>
Line 1,433:
=={{header|Prolog}}==
Implemented using SWI Prolog:
<langsyntaxhighlight lang="prolog">patient(1001,'Hopper').
patient(4004,'Wirth').
patient(3003,'Kemeny').
Line 1,468:
summaryScores(Id, Lastname, ScoreSum, ScoreMean),
writeln(summaryScores(Id, Lastname, ScoreSum, ScoreMean)),
fail.</langsyntaxhighlight>
{{output}}
<pre>summaryDates(1001,Hopper,2020-11-19)
Line 1,479:
false.</pre>
Implemented using XSB Prolog (which allows for user-defined aggregates):
<langsyntaxhighlight lang="prolog">:- import bagMax/2, bagCount/2, bagSum/2, bagReduce/4 from aggregs.
:- import julian_date/7, date_string/3 from iso8601.
:- import load_csv/2, add_cvt_type_hook/2 from proc_files.
Line 1,527:
sum(X,Y,Z) :- Z is X+Y.
:- hilog successor.
successor(X,_Y,Z) :- Z is X+1.</langsyntaxhighlight>
 
=={{header|PureBasic}}==
<langsyntaxhighlight PureBasiclang="purebasic">Structure Person
Name$
EndStructure
Line 1,590:
PrintN("") : scs=0 : c=0
Next
Input()</langsyntaxhighlight>
{{out}}
<pre>| PATIENT_ID | LASTNAME | LAST_VISIT | SCORE_SUM | SCORE_AVG |
Line 1,602:
 
===Python: Using pandas library===
<langsyntaxhighlight Pythonlang="python"># to install pandas library go to cmd prompt and type:
# cd %USERPROFILE%\AppData\Local\Programs\Python\Python38-32\Scripts\
# pip install pandas
Line 1,646:
df_result = df_group.agg({'VISIT_DATE': 'max', 'SCORE': [lambda x: x.sum(min_count=1),'mean']})
 
print(df_result)</langsyntaxhighlight>
<pre>
PATIENT_ID LASTNAME LAST_VISIT SCORE
Line 1,660:
Using only standard libraries and input from csv files.
 
<langsyntaxhighlight lang="python">import csv
 
fnames = 'patients.csv patients_visits.csv'.split()
Line 1,695:
#%%
for record in result:
print(f"| {' | '.join(f'{r:^10}' for r in record)} |")</langsyntaxhighlight>
 
{{out}}
Line 1,710:
(The SQL SELECT statement is modelled on that of the SQL entry elsewhere on this page).
 
<langsyntaxhighlight lang="python">import sqlite3
import csv
 
Line 1,774:
result = join_tables_and_group(conn)
for record in result:
print(f"| {' | '.join(f'{str(r):^10}' for r in record)} |")</langsyntaxhighlight>
 
{{out}}
Line 1,785:
 
=={{header|R}}==
<langsyntaxhighlight Rlang="r"># load data from csv files
# setwd("C:\Temp\csv\")
# df_patient <- read.csv(file="patients.csv", header = TRUE, sep = ",")
Line 1,826:
df_result <- merge(df_patient, df_agg, by = 'PATIENT_ID', all.x = TRUE)
 
print(df_result)</langsyntaxhighlight>
{{out}}
<pre>
Line 1,838:
 
=={{header|Raku}}==
<syntaxhighlight lang="raku" perl6line>my @names = map { ( <PATIENT_ID LASTNAME> Z=> .list ).hash },
( 1001, 'Hopper' ),
( 4004, 'Wirth' ),
Line 1,874:
my @out_field_names = <PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG>;
my @rows = @result.sort(*.<PATIENT_ID>).map(*.{@out_field_names});
say .map({$_ // ''}).fmt('%-10s', ' | ') for @out_field_names, |@rows;</langsyntaxhighlight>
{{out}}
<pre>
Line 1,886:
 
=={{header|REXX}}==
<langsyntaxhighlight lang="rexx">/* REXX */
patients='patients.csv'
l=linein(patients)
Line 1,966:
swl=swl wa.i
End
Return strip(swl)</langsyntaxhighlight>
{{out}}
<pre>| PATIENT_ID | LASTNAME | LAST_VISIT | SCORE_SUM | SCORE_AVG |
Line 1,975:
| 5005 | Kurtz | | | |</pre>
=={{header|SAS}}==
<langsyntaxhighlight lang="sas"> %let datefmt=E8601DA10.;
data patient;
infile "patient.csv" dsd dlm=',';
Line 1,994:
avg(score) as avg_score from visit group by id)
natural right join patient
order by id;</langsyntaxhighlight>
{{out}}
<pre>: id lastname max_date sum_score avg_score
Line 2,008:
Note: the PATIENT_ID 5005 is missing from the result, because MATCH FILES cannot add rows for missing keys, i.e. AFAIK a "left join" is not possible is SPSS.
 
<langsyntaxhighlight SPSSlang="spss">* set working directory to location of .csv files
 
CD 'C:\Temp\csv\'.
Line 2,073:
/last_visit = MAX(VISIT_DATE)
/score_avg = MEAN(SCORE)
/score_sum = SUM(SCORE).</langsyntaxhighlight>
 
{{out}}
Line 2,086:
=={{header|SQL}}==
 
<langsyntaxhighlight SQLlang="sql">-- drop tables
DROP TABLE IF EXISTS tmp_patients;
DROP TABLE IF EXISTS tmp_visits;
Line 2,144:
p.LASTNAME
ORDER BY
p.PATIENT_ID;</langsyntaxhighlight>
{{out}}
<pre>
Line 2,156:
 
=={{header|Transd}}==
<langsyntaxhighlight lang="scheme">#lang transd
 
MainModule: {
Line 2,214:
))
))
}</langsyntaxhighlight>{{out}}
<pre>
PATIENT_ID| LASTNAME|VISIT_DATE| SCORE| SCORE_AVG|
Line 2,225:
 
=={{header|TutorialD}}==
<syntaxhighlight lang="text">BEGIN;
TYPE Date UNION;
TYPE DateValid IS {Date POSSREP {year INTEGER, month INTEGER, day INTEGER}};
Line 2,260:
(SUMMARIZE visit BY {id}: {maxdate := MAX(date)} UNION
(EXTEND (patient {id} MINUS (visit {id})): {maxdate := DateUnknown()})) JOIN
patient</langsyntaxhighlight>
{{out}}
<pre>
Line 2,273:
{{libheader|Wren-sort}}
{{libheader|Wren-fmt}}
<langsyntaxhighlight lang="ecmascript">import "/fmt" for Fmt
import "/sort" for Sort
 
Line 2,368:
 
var merges = Patient.ids.map { |id| Merge.new(id) }.toList
Merge.print(merges)</langsyntaxhighlight>
 
{{out}}
10,333

edits

Cookies help us deliver our services. By using our services, you agree to our use of cookies.