Merge and aggregate datasets: Difference between revisions

Line 71:

<~~lang~~ 11l>V patients_csv =

<syntaxhighlight lang="11l">V patients_csv =

‘PATIENT_ID,LASTNAME

1001,Hopper

Line 125:

L(record) result

print(‘| ’record.map(r -> r.center(10)).join(‘ | ’)‘ |’)</~~lang~~>

print(‘| ’record.map(r -> r.center(10)).join(‘ | ’)‘ |’)</syntaxhighlight>

Line 138:

=={{header|AutoHotkey}}==

<~~lang~~ ~~AutoHotkey~~>Merge_and_aggregate(patients, visits){

<syntaxhighlight lang="autohotkey">Merge_and_aggregate(patients, visits){

ID := [], LAST_VISIT := [], SCORE_SUM := [], VISIT := []

for i, line in StrSplit(patients, "`n", "`r"){

Line 161:

output .= ID "`t" name "`t" LAST_VISIT[id] "`t" SCORE_SUM[id] "`t" SCORE_SUM[id]/VISIT[id] "`n"

return output

}</~~lang~~>

}</syntaxhighlight>

Examples:<~~lang~~ ~~AutoHotkey~~>patients =

Examples:<syntaxhighlight lang="autohotkey">patients =

(

PATIENT_ID,LASTNAME

Line 186:

MsgBox % Merge_and_aggregate(patients, visits)

return</~~lang~~>

return</syntaxhighlight>

<pre>PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG

Line 196:

=={{header|AWK}}==

# syntax: GAWK -f MERGE_AND_AGGREGATE_DATASETS.AWK RC-PATIENTS.CSV RC-VISITS.CSV

# files may appear in any order

Line 235:

exit(0)

}

</syntaxhighlight>

</lang>

<pre>

Line 247:

=={{header|C++}}==

Uses C++20

<~~lang~~ cpp>#include <iostream>

<syntaxhighlight lang="cpp">#include <iostream>

#include <optional>

#include <ranges>

Line 323:

cout << " |\n";

}

}</~~lang~~>

}</syntaxhighlight>

Line 336:

</pre>

=={{header|C sharp}}==

<~~lang~~ csharp>using System;

<syntaxhighlight lang="csharp">using System;

using System.Collections.Generic;

using System.Globalization;

Line 408:

}

}</~~lang~~>

}</syntaxhighlight>

<pre>

Line 420:

=={{header|Datalog}}==

The following implementation is for Souffle. Souffle does not currently include dates and times in the base distribution, but it is straightforward to use C timestamps:

<~~lang~~ C>// datetime.cpp

<syntaxhighlight lang="c">// datetime.cpp

#include <ctime>

#include <cstdint>

Line 429:

return mktime(&tmInfo); // localtime

}

}</~~lang~~>

}</syntaxhighlight>

Rather than combine the summaries, the date and score summaries have been presented separately.

<~~lang~~ prolog>/* rosetta.dl */

<syntaxhighlight lang="prolog">/* rosetta.dl */

#define NaN to_float("NaN")

.functor from_date(date:symbol) : number

Line 467:

score_mean = mean score: {Visit(id, _, score), score != NaN}.

.output SummaryDates

.output SummaryScores</~~lang~~>

.output SummaryScores</syntaxhighlight>

Then this is called using:

<~~lang~~ bash>g++ -shared -fPIC datetime.cpp -o libfunctors.so

<syntaxhighlight lang="bash">g++ -shared -fPIC datetime.cpp -o libfunctors.so

souffle -D- rosetta.dl</~~lang~~>

souffle -D- rosetta.dl</syntaxhighlight>

<pre>---------------

Line 493:

=={{header|F_Sharp|F#}}==

Note that the scores are right justified to copy the task description. It would be more natural to leave them right justified.

<~~lang~~ fsharp>

// Merge and aggregate datasets. Nigel Galloway: January 6th., 2021

let rFile(fName)=seq{use n=System.IO.File.OpenText(fName)

Line 503:

let fG n g=let z=G.[n]|>Seq.sumBy(fun n->try float n.[2] with :? System.FormatException->0.0)

fN n g (G.[n]|>Seq.sort|>Seq.last).[1] (if z=0.0 then "" else string z) (if z=0.0 then "" else string(z/(float(Seq.length G.[n]))))

</syntaxhighlight>

</lang>

<pre>

Line 518:

=={{header|Go}}==

<~~lang~~ go>package main

<syntaxhighlight lang="go">package main

import (

Line 655:

}

mergePrint(merges)

}</~~lang~~>

}</syntaxhighlight>

Line 673:

Merging of fields and databases is defined as a monoid operation for corresponding types.

<~~lang~~ haskell>import Data.List

<syntaxhighlight lang="haskell">import Data.List

import Data.Maybe

import System.IO (readFile)

Line 738:

where

go [] = Nothing

go s = Just $ drop 1 <$> span (/= ch) s</~~lang~~>

go s = Just $ drop 1 <$> span (/= ch) s</syntaxhighlight>

<pre>let patients = readDB <$> readFile "patients.csv"

Line 765:

===Pretty tabulation===

<~~lang~~ haskell>

tabulateDB (DB ps) header cols = intercalate "|" <$> body

where

Line 787:

, \p -> case scores p of {[] -> []; s -> show (mean s)} ]

mean lst = sum lst / genericLength lst</~~lang~~>

mean lst = sum lst / genericLength lst</syntaxhighlight>

<pre>*Main> main

Line 802:

In other words, we can set things up like this:

<~~lang~~ J>NB. setup:

<syntaxhighlight lang="j">NB. setup:

require'jd pacman'

load JDP,'tools/csv_load.ijs'

Line 835:

csvload 'visits';1

jd'ref patients PATIENTID visits PATIENTID'</~~lang~~>

jd'ref patients PATIENTID visits PATIENTID'</syntaxhighlight>

And, then we can run our query:

<~~lang~~ J>require'jd'

<syntaxhighlight lang="j">require'jd'

echo jd {{)n

Line 853:

p:patients,

v:p.visits

}} -.LF</~~lang~~>

}} -.LF</syntaxhighlight>

Which displays this result:

Line 868:

Another approach would be to use J's csv library:

<~~lang~~ J>require'csv'

<syntaxhighlight lang="j">require'csv'

patients=: fixcsv {{)n

PATIENTID,LASTNAME

Line 904:

labels=. ;:'PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG'

labels,:id;nm;vdt;sum;avg

}}</~~lang~~>

}}</syntaxhighlight>

Here:

<~~lang~~ J> task''

<syntaxhighlight lang="j"> task''

┌──────────┬────────┬──────────┬─────────┬─────────┐

│PATIENT_ID│LASTNAME│LAST_VISIT│SCORE_SUM│SCORE_AVG│

Line 916:

│4004 │Wirth │2020-11-05│15.4 │7.7 │

│5005 │Kurtz │ │ │ │

└──────────┴────────┴──────────┴─────────┴─────────┘</~~lang~~>

└──────────┴────────┴──────────┴─────────┴─────────┘</syntaxhighlight>

If the empty score in visits was a display concern, we might instead do it this way:

<~~lang~~ J>task=: {{

<syntaxhighlight lang="j">task=: {{

P=. <@:>"1|:/:~}.patients

V=. <@:>"1|:/:~}.visits

Line 936:

labels=. ;:'PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG'

labels,:id;nm;vdt;sum;avg

}}</~~lang~~>

}}</syntaxhighlight>

Which gives us:

<~~lang~~ J> task''

<syntaxhighlight lang="j"> task''

┌──────────┬────────┬──────────┬─────────┬─────────┐

│PATIENT_ID│LASTNAME│LAST_VISIT│SCORE_SUM│SCORE_AVG│

Line 949:

│4004 │Wirth │2020-11-05│15.4 │7.7 │

│5005 │Kurtz │ │ │ │

└──────────┴────────┴──────────┴─────────┴─────────┘</~~lang~~>

└──────────┴────────┴──────────┴─────────┴─────────┘</syntaxhighlight>

=={{header|jq}}==

Line 960:

'''Ingesting CSV data'''

# objectify/1 takes an array of atomic values as inputs, and packages

# these into an object with keys specified by the "headers" array and

Line 978:

([]; . + [ $row|objectify($headers) ]);

</syntaxhighlight>

</lang>

'''Aggregation functions'''

# output {LAST_VISIT}

def LAST_VISIT($patient_id):

Line 990:

| {SCORE_SUM: add, count: length}

| {SCORE_SUM, SCORE_AVG: (if .SCORE_SUM and .count > 0 then .SCORE_SUM/.count else null end)};

</~~lang~~>'''The task'''<lang jq>

</syntaxhighlight>'''The task'''<syntaxhighlight lang="jq">

# Read the two tables:

INDEX($patients | [splits("\n")] | map(split(",")) | csv2jsonHelper[]; .PATIENT_ID) as $patients

Line 1,001:

# ... but display it as a sequence of JSON objects

| .[]

</syntaxhighlight>

</lang>

'''Invocation'''

Line 1,019:

=={{header|Julia}}==

<~~lang~~ julia>using CSV, DataFrames, Statistics

<syntaxhighlight lang="julia">using CSV, DataFrames, Statistics

# load data from csv files

Line 1,058:

end

println(df_result)

</~~lang~~>{{out}}

</syntaxhighlight>{{out}}

<pre>

5×5 DataFrame

Line 1,072:

=={{header|Mathematica}}/{{header|Wolfram Language}}==

<~~lang~~ ~~Mathematica~~>a = ImportString["PATIENT_ID,LASTNAME

<syntaxhighlight lang="mathematica">a = ImportString["PATIENT_ID,LASTNAME

1001,Hopper

4004,Wirth

Line 1,096:

"SCORE_SUM" -> If[DeleteMissing@#[[All, "SCORE"]] =!= {}, Total@DeleteMissing@#[[All, "SCORE"]], Missing[]],

"SCORE_AVG" -> If[DeleteMissing@#[[All, "SCORE"]] =!= {}, Mean@DeleteMissing@#[[All, "SCORE"]], Missing[]]|> & /@

gr // Dataset</~~lang~~>

gr // Dataset</syntaxhighlight>

<pre> PATIENT_ID LASTNAME VISIT_DATE SCORE_SUM SCORE_AVG

Line 1,105:

5005 5005 Kurtz \[LongDash] \[LongDash] \[LongDash]</pre>

=={{header|Mercury}}==

<~~lang~~ ~~Prolog~~>:- module rosetta.

<syntaxhighlight lang="prolog">:- module rosetta.

:- interface.

Line 1,174:

(if MaxDate1 = date(0,0,0) then MaxDate = no else MaxDate = MaxDate1)),

Solutions),

foldl(io.write_line, Solutions, !IO).</~~lang~~>

foldl(io.write_line, Solutions, !IO).</syntaxhighlight>

<pre>{Id, Lastname, SumScores, AvgScores, MaxDate}:

Line 1,185:

=={{header|Nim}}==

===CSV files and tables===

<~~lang~~ ~~Nim~~>import algorithm, parsecsv, strformat, strutils, tables

<syntaxhighlight lang="nim">import algorithm, parsecsv, strformat, strutils, tables

const NoValue = -1.0

Line 1,237:

let scoreSum = if count == 0: "" else: &"{sum:>4.1f}"

let scoreAvg = if count == 0: "" else: &"{sum / count.toFloat: >4.2f}"

echo &"| {id:^10} | {name:^10} | {lastVisit:^10} | {scoreSum:>7} | {scoreAvg:>6} |"</~~lang~~>

echo &"| {id:^10} | {name:^10} | {lastVisit:^10} | {scoreSum:>7} | {scoreAvg:>6} |"</syntaxhighlight>

Line 1,250:

We use the high level standard library "db_sqlite" rather than the low level one "sqlite3".

<~~lang~~ ~~Nim~~>import parseCsv, db_sqlite, sequtils, strutils

<syntaxhighlight lang="nim">import parseCsv, db_sqlite, sequtils, strutils

const FNames = ["patients1.csv", "patients2.csv"]

Line 1,302:

for row in result:

echo "| " & row.mapit(it.center(10)).join(" | ") & '|'</~~lang~~>

echo "| " & row.mapit(it.center(10)).join(" | ") & '|'</syntaxhighlight>

Line 1,313:

=={{header|Phix}}==

constant patients_txt = split("""

PATIENT_ID,LASTNAME

Line 1,371:

{id,name,dt,scstr,avstr})

end for

<pre>

Line 1,385:

Weird requirement: "Use the most common libraries only when built-in functionality is not sufficient."

Not even a "use strict;" :)

<~~lang~~ perl>#!/usr/bin/perl

<syntaxhighlight lang="perl">#!/usr/bin/perl

my $fmt = '| %-11s' x 5 . "|\n";

printf $fmt, qw( PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG);

Line 1,421:

3003,2020-11-12,

4004,2020-11-05,7.0

1001,2020-11-19,5.3</~~lang~~>

1001,2020-11-19,5.3</syntaxhighlight>

<pre>

Line 1,433:

=={{header|Prolog}}==

Implemented using SWI Prolog:

<~~lang~~ prolog>patient(1001,'Hopper').

<syntaxhighlight lang="prolog">patient(1001,'Hopper').

patient(4004,'Wirth').

patient(3003,'Kemeny').

Line 1,468:

summaryScores(Id, Lastname, ScoreSum, ScoreMean),

writeln(summaryScores(Id, Lastname, ScoreSum, ScoreMean)),

fail.</~~lang~~>

fail.</syntaxhighlight>

<pre>summaryDates(1001,Hopper,2020-11-19)

Line 1,479:

false.</pre>

Implemented using XSB Prolog (which allows for user-defined aggregates):

<~~lang~~ prolog>:- import bagMax/2, bagCount/2, bagSum/2, bagReduce/4 from aggregs.

<syntaxhighlight lang="prolog">:- import bagMax/2, bagCount/2, bagSum/2, bagReduce/4 from aggregs.

:- import julian_date/7, date_string/3 from iso8601.

:- import load_csv/2, add_cvt_type_hook/2 from proc_files.

Line 1,527:

sum(X,Y,Z) :- Z is X+Y.

:- hilog successor.

successor(X,_Y,Z) :- Z is X+1.</~~lang~~>

successor(X,_Y,Z) :- Z is X+1.</syntaxhighlight>

=={{header|PureBasic}}==

<~~lang~~ ~~PureBasic~~>Structure Person

<syntaxhighlight lang="purebasic">Structure Person

Name$

EndStructure

Line 1,590:

PrintN("") : scs=0 : c=0

Input()</~~lang~~>

Input()</syntaxhighlight>

Line 1,602:

===Python: Using pandas library===

<~~lang~~ ~~Python~~># to install pandas library go to cmd prompt and type:

<syntaxhighlight lang="python"># to install pandas library go to cmd prompt and type:

# cd %USERPROFILE%\AppData\Local\Programs\Python\Python38-32\Scripts\

# pip install pandas

Line 1,646:

df_result = df_group.agg({'VISIT_DATE': 'max', 'SCORE': [lambda x: x.sum(min_count=1),'mean']})

print(df_result)</~~lang~~>

print(df_result)</syntaxhighlight>

<pre>

PATIENT_ID LASTNAME LAST_VISIT SCORE

Line 1,660:

Using only standard libraries and input from csv files.

<~~lang~~ python>import csv

<syntaxhighlight lang="python">import csv

fnames = 'patients.csv patients_visits.csv'.split()

Line 1,695:

#%%

for record in result:

print(f"| {' | '.join(f'{r:^10}' for r in record)} |")</~~lang~~>

print(f"| {' | '.join(f'{r:^10}' for r in record)} |")</syntaxhighlight>

Line 1,710:

(The SQL SELECT statement is modelled on that of the SQL entry elsewhere on this page).

<~~lang~~ python>import sqlite3

<syntaxhighlight lang="python">import sqlite3

import csv

Line 1,774:

result = join_tables_and_group(conn)

for record in result:

print(f"| {' | '.join(f'{str(r):^10}' for r in record)} |")</~~lang~~>

print(f"| {' | '.join(f'{str(r):^10}' for r in record)} |")</syntaxhighlight>

Line 1,785:

=={{header|R}}==

<~~lang~~ R># load data from csv files

<syntaxhighlight lang="r"># load data from csv files

# setwd("C:\Temp\csv\")

# df_patient <- read.csv(file="patients.csv", header = TRUE, sep = ",")

Line 1,826:

df_result <- merge(df_patient, df_agg, by = 'PATIENT_ID', all.x = TRUE)

print(df_result)</~~lang~~>

print(df_result)</syntaxhighlight>

<pre>

Line 1,838:

=={{header|Raku}}==

<lang ~~perl6~~>my @names = map { ( <PATIENT_ID LASTNAME> Z=> .list ).hash },

<syntaxhighlight lang="raku" line>my @names = map { ( <PATIENT_ID LASTNAME> Z=> .list ).hash },

( 1001, 'Hopper' ),

( 4004, 'Wirth' ),

Line 1,874:

my @out_field_names = <PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG>;

my @rows = @result.sort(*.<PATIENT_ID>).map(*.{@out_field_names});

say .map({$_ // ''}).fmt('%-10s', ' | ') for @out_field_names, |@rows;</~~lang~~>

say .map({$_ // ''}).fmt('%-10s', ' | ') for @out_field_names, |@rows;</syntaxhighlight>

<pre>

Line 1,886:

=={{header|REXX}}==

<~~lang~~ rexx>/* REXX */

<syntaxhighlight lang="rexx">/* REXX */

patients='patients.csv'

l=linein(patients)

Line 1,966:

swl=swl wa.i

End

Return strip(swl)</~~lang~~>

Return strip(swl)</syntaxhighlight>

Line 1,975:

| 5005 | Kurtz | | | |</pre>

=={{header|SAS}}==

<~~lang~~ sas> %let datefmt=E8601DA10.;

<syntaxhighlight lang="sas"> %let datefmt=E8601DA10.;

data patient;

infile "patient.csv" dsd dlm=',';

Line 1,994:

avg(score) as avg_score from visit group by id)

natural right join patient

order by id;</~~lang~~>

order by id;</syntaxhighlight>

<pre>: id lastname max_date sum_score avg_score

Line 2,008:

Note: the PATIENT_ID 5005 is missing from the result, because MATCH FILES cannot add rows for missing keys, i.e. AFAIK a "left join" is not possible is SPSS.

<~~lang~~ ~~SPSS~~>* set working directory to location of .csv files

<syntaxhighlight lang="spss">* set working directory to location of .csv files

CD 'C:\Temp\csv\'.

Line 2,073:

/last_visit = MAX(VISIT_DATE)

/score_avg = MEAN(SCORE)

/score_sum = SUM(SCORE).</~~lang~~>

/score_sum = SUM(SCORE).</syntaxhighlight>

Line 2,086:

=={{header|SQL}}==

<~~lang~~ ~~SQL~~>-- drop tables

<syntaxhighlight lang="sql">-- drop tables

DROP TABLE IF EXISTS tmp_patients;

DROP TABLE IF EXISTS tmp_visits;

Line 2,144:

p.LASTNAME

ORDER BY

p.PATIENT_ID;</~~lang~~>

p.PATIENT_ID;</syntaxhighlight>

<pre>

Line 2,156:

=={{header|Transd}}==

<~~lang~~ scheme>#lang transd

<syntaxhighlight lang="scheme">#lang transd

MainModule: {

Line 2,214:

))

}</~~lang~~>{{out}}

}</syntaxhighlight>{{out}}

<pre>

Line 2,225:

=={{header|TutorialD}}==

<lang>BEGIN;

<syntaxhighlight lang="text">BEGIN;

TYPE Date UNION;

TYPE DateValid IS {Date POSSREP {year INTEGER, month INTEGER, day INTEGER}};

Line 2,260:

(SUMMARIZE visit BY {id}: {maxdate := MAX(date)} UNION

(EXTEND (patient {id} MINUS (visit {id})): {maxdate := DateUnknown()})) JOIN

patient</~~lang~~>

patient</syntaxhighlight>

<pre>

Line 2,273:

<~~lang~~ ecmascript>import "/fmt" for Fmt

<syntaxhighlight lang="ecmascript">import "/fmt" for Fmt

import "/sort" for Sort

Line 2,368:

var merges = Patient.ids.map { |id| Merge.new(id) }.toList

Merge.print(merges)</~~lang~~>

Merge.print(merges)</syntaxhighlight>