Merge and aggregate datasets

Task

Merge and aggregate two datasets as provided in .csv files into a new resuling dataset. Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient. Either load the data from the .csv files or create the required data structures hard-coded.

patients.csv file contents:

PATIENT_ID,LASTNAME
1001,Hopper
4004,Wirth
3003,Kemeny
2002,Gosling
5005,Kurtz

visits.csv file contents:

PATIENT_ID,VISIT_DATE,SCORE
2002,2020-09-10,6.8
1001,2020-09-17,5.5
4004,2020-09-24,8.4
2002,2020-10-08,
1001,,6.6
3003,2020-11-12,
4004,2020-11-05,7.0
1001,2020-11-19,5.3

Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand. Merge and group per patient id and name, get the maximum visit date, get the sum of scores per patient and the average score per patient to get the resulting datasset:

| PATIENT_ID | LASTNAME | LAST_VISIT |  SCORE_SUM | SCORE_AVG |
| 1001       |  Hopper  | 2020-11-19 |  17.4      | 5.80      |
| 2002       |  Gosling | 2020-10-08 |   6.8      | 6.80      |
| 3003       |  Kemeny  | 2020-11-12 |            |           |
| 4004       |  Wirth   | 2020-11-05 |  15.4      | 7.70      |
| 5005       |  Kurtz   |            |            |           |

Related tasks

Perl

Weird requirement: "Use the most common libraries only when built-in functionality is not sufficient." Not even a "use strict;" :) <lang perl>#!/usr/bin/perl my $fmt = '| %-11s' x 5 . "|\n"; printf $fmt, qw( PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG); my ($names, $visits) = do { local $/; split /^\n/m, }; my %score; for ( $visits =~ /^\d.*/gm )

 {
 my ($id, undef, $score) = split /,/;
 $score{$id} //= [, ];
 $score and $score{$id}[0]++, $score{$id}[1] += $score;
 }

for ( sort $names =~ /^\d.*/gm )

 {
 my ($id, $name) = split /,/;
 printf $fmt, $id, $name, ( sort $visits =~ /^$id,(.*?),/gm,  )[-1],
   $score{$id}[0]
     ? ( $score{$id}[1], $score{$id}[1] / $score{$id}[0])
     : (, );
 }

__DATA__ PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz

PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08, 1001,,6.6 3003,2020-11-12, 4004,2020-11-05,7.0 1001,2020-11-19,5.3</lang>

Output:

| PATIENT_ID | LASTNAME   | LAST_VISIT | SCORE_SUM  | SCORE_AVG  |
| 1001       | Hopper     | 2020-11-19 | 17.4       | 5.8        |
| 2002       | Gosling    | 2020-10-08 | 6.8        | 6.8        |
| 3003       | Kemeny     | 2020-11-12 |            |            |
| 4004       | Wirth      | 2020-11-05 | 15.4       | 7.7        |
| 5005       | Kurtz      |            |            |            |

Python

The Python example is incomplete, the VISIT_DATE is not aggregated. The date value are purposefully provided as ISO format, so that they can also be sorted as strings. Is it possible to aggregate strings without using extra libraries?

to install pandas library goto cmd prompt and type:
cd %USERPROFILE%\AppData\Local\Programs\Python\Python38-32\Scripts\
pip install pandas

import pandas as pd

load datasets

df_patients = pd.read_csv (r'patients.csv') df_visits = pd.read_csv (r'visits.csv')

megre on PATIENT_ID

df_merge = df_patients.merge(df_visits, on='PATIENT_ID', how='left')

group by sum/mean

df_group = df_merge.groupby(['PATIENT_ID','LASTNAME'], as_index=False) df_result = df_group.agg({'SCORE': ['sum','mean']})

print(df_result) </lang>

R

<lang R># load datasets from file

df_patient <- read.csv(file="patients.csv", header = TRUE, sep = ",")
df_visits <- read.csv(file="visits.csv", header = TRUE, sep = ",")

load datasets hard coded

df_patient <- read.table(text = " PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz ", header = TRUE, sep = ",")

df_visits <- read.table(text = " PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08,NULL 1001,NULL,6.6 3003,2020-11-12,NULL 4004,2020-11-05,7.0 1001,2020-11-19,5.3 ", header = TRUE, sep = ",")

result

df_result <- data.frame(

 cbind(
   PATIENT_ID = names(tapply(df_visits$VISIT_DATE, list(df_visits$PATIENT_ID), max, na.rm=TRUE)),
   last_visits = tapply(df_visits$VISIT_DATE, list(df_visits$PATIENT_ID), max, na.rm=TRUE),
   score_avg = tapply(df_visits$SCORE, list(df_visits$PATIENT_ID), mean, na.rm=TRUE),
   score_sum = tapply(df_visits$SCORE, list(df_visits$PATIENT_ID), sum, na.rm=TRUE)
 )

)

combine two datasets in r
all.x = all the non matching cases of df_patient are appended to the result as well (i.e. 'left join')

df_result2 <- merge(df_patient, df_result, by = 'PATIENT_ID', all.x = TRUE)</lang>

SQL

<lang SQL> -- drop tables DROP TABLE IF EXISTS tmp_patients; DROP TABLE IF EXISTS tmp_visits;

-- create tables CREATE TABLE tmp_patients( PATIENT_ID INT, LASTNAME VARCHAR(20) );

CREATE TABLE tmp_visits( PATIENT_ID INT, VISIT_DATE DATE, SCORE NUMERIC(4,1) );

-- load data from csv files into tables /* -- Note: for LOAD DATA LOCAL to work you have to enable `local-infile` on both the client and server -- else you'll get the error "#1148 command is not allowed.." LOAD DATA LOCAL INFILE '/home/csvimport/patients.csv' INTO TABLE `tmp_patients` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

LOAD DATA LOCAL INFILE '/home/csvimport/visits.csv' INTO TABLE `tmp_visits` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

/

-- insert records into tables INSERT INTO tmp_patients(PATIENT_ID, LASTNAME) VALUES (1001, 'Hopper'), (4004, 'Wirth'), (3003, 'Kemeny'), (2002, 'Gosling'), (5005, 'Kurtz');

INSERT INTO tmp_visits(PATIENT_ID, VISIT_DATE, SCORE) VALUES (2002, '2020-09-10', 6.8), (1001, '2020-09-17', 5.5), (4004, '2020-09-24', 8.4), (2002, '2020-10-08', NULL), (1001, NULL, 6.6), (3003, '2020-11-12', NULL), (4004, '2020-11-05', 7.0), (1001, '2020-11-19', 5.3);

-- join tables and group SELECT p.PATIENT_ID, p.LASTNAME, MAX(VISIT_DATE) AS LAST_VISIT, SUM(SCORE) AS SCORE_SUM, CAST(AVG(SCORE) AS DECIMAL(10,2)) AS SCORE_AVG FROM tmp_patients p LEFT JOIN tmp_visits v ON v.PATIENT_ID = p.PATIENT_ID GROUP BY p.PATIENT_ID, p.LASTNAME ORDER BY p.PATIENT_ID; </lang>