Merge and aggregate datasets: Difference between revisions

Content added Content deleted

Inline

Revision as of 23:26, 6 December 2020

Merge and aggregate datasets

Task

Merge and aggregate two datasets as provided in .csv files into a new resuling dataset. Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient. Either load the data from the .csv files or create the required data structures hard-coded.

patients.csv file contents:

PATIENT_ID,LASTNAME
1001,Hopper
4004,Wirth
3003,Kemeny
2002,Gosling
5005,Kurtz

visits.csv file contents:

PATIENT_ID,VISIT_DATE,SCORE
2002,2020-09-10,6.8
1001,2020-09-17,5.5
4004,2020-09-24,8.4
2002,2020-10-08,
1001,,6.6
3003,2020-11-12,
4004,2020-11-05,7.0
1001,2020-11-19,5.3

Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand. Merge and group per patient id and last name, get the maximum visit date and get the sum and average of the scores per patient to get the resulting datasset. Note that the visit date is purposefully provided as ISO format, so that it could also be processed as text and sorted alphabetically to determine the maximum date.

| PATIENT_ID | LASTNAME | LAST_VISIT |  SCORE_SUM | SCORE_AVG |
| 1001       |  Hopper  | 2020-11-19 |  17.4      | 5.80      |
| 2002       |  Gosling | 2020-10-08 |   6.8      | 6.80      |
| 3003       |  Kemeny  | 2020-11-12 |            |           |
| 4004       |  Wirth   | 2020-11-05 |  15.4      | 7.70      |
| 5005       |  Kurtz   |            |            |           |

Note

this task is aimed in particular at programming languages that are used in data science and data processing, suchs as Python, R, SPSS, MATLAB etc. General programming languages such as BASIC, C++, Java etc. may not be suited as well for this task, although everyone is free to add examples as they see fit.

Perl

Weird requirement: "Use the most common libraries only when built-in functionality is not sufficient." Not even a "use strict;" :) <lang perl>#!/usr/bin/perl my $fmt = '| %-11s' x 5 . "|\n"; printf $fmt, qw( PATIENT_ID LASTNAME LAST_VISIT SCORE_SUM SCORE_AVG); my ($names, $visits) = do { local $/; split /^\n/m, }; my %score; for ( $visits =~ /^\d.*/gm )

 {
 my ($id, undef, $score) = split /,/;
 $score{$id} //= [, ];
 $score and $score{$id}[0]++, $score{$id}[1] += $score;
 }

for ( sort $names =~ /^\d.*/gm )

 {
 my ($id, $name) = split /,/;
 printf $fmt, $id, $name, ( sort $visits =~ /^$id,(.*?),/gm,  )[-1],
   $score{$id}[0]
     ? ( $score{$id}[1], $score{$id}[1] / $score{$id}[0])
     : (, );
 }

__DATA__ PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz

PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08, 1001,,6.6 3003,2020-11-12, 4004,2020-11-05,7.0 1001,2020-11-19,5.3</lang>

Output:

| PATIENT_ID | LASTNAME   | LAST_VISIT | SCORE_SUM  | SCORE_AVG  |
| 1001       | Hopper     | 2020-11-19 | 17.4       | 5.8        |
| 2002       | Gosling    | 2020-10-08 | 6.8        | 6.8        |
| 3003       | Kemeny     | 2020-11-12 |            |            |
| 4004       | Wirth      | 2020-11-05 | 15.4       | 7.7        |
| 5005       | Kurtz      |            |            |            |

Python

<lang Python># to install pandas library goto cmd prompt and type:

cd %USERPROFILE%\AppData\Local\Programs\Python\Python38-32\Scripts\
pip install pandas

import pandas as pd

load data from csv files

df_patients = pd.read_csv (r'patients.csv') df_visits = pd.read_csv (r'visits.csv')

# load data hard coded, create data frames import io str_patients = """PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz """ df_patients = pd.read_csv(io.StringIO(str_patients), sep = ",") str_visits = """PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08, 1001,,6.6 3003,2020-11-12, 4004,2020-11-05,7.0 1001,2020-11-19,5.3 """ df_visits = pd.read_csv(io.StringIO(str_visits), sep = ",")

typecast from string to datetime so .agg can use 'max'

df_visits['VISIT_DATE'] = pd.to_datetime(df_visits['VISIT_DATE'])

merge on PATIENT_ID

df_merge = df_patients.merge(df_visits, on='PATIENT_ID', how='left')

group by sum/mean

df_group = df_merge.groupby(['PATIENT_ID','LASTNAME'], as_index=False)

note: you can use 'sum' instead of the lambda function but that returns NaN as 0 (zero)

df_result = df_group.agg({'VISIT_DATE': 'max', 'SCORE': [lambda x: x.sum(min_count=1),'mean']})

print(df_result)</lang>

R

<lang R># load data from csv files

df_patient <- read.csv(file="patients.csv", header = TRUE, sep = ",")
df_visits <- read.csv(file="visits.csv", header = TRUE, sep = ",")

load datasets hard coded

df_patient <- read.table(text = " PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz ", header = TRUE, sep = ",")

df_visits <- read.table(text = " PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08,NULL 1001,NULL,6.6 3003,2020-11-12,NULL 4004,2020-11-05,7.0 1001,2020-11-19,5.3 ", header = TRUE, sep = ",")

result

df_result <- data.frame(

 cbind(
   PATIENT_ID = names(tapply(df_visits$VISIT_DATE, list(df_visits$PATIENT_ID), max, na.rm=TRUE)),
   last_visits = tapply(df_visits$VISIT_DATE, list(df_visits$PATIENT_ID), max, na.rm=TRUE),
   score_avg = tapply(df_visits$SCORE, list(df_visits$PATIENT_ID), mean, na.rm=TRUE),
   score_sum = tapply(df_visits$SCORE, list(df_visits$PATIENT_ID), sum, na.rm=TRUE)
 )

)

combine two datasets in r
all.x = all the non matching cases of df_patient are appended to the result as well (i.e. 'left join')

df_result2 <- merge(df_patient, df_result, by = 'PATIENT_ID', all.x = TRUE)</lang>

SQL

<lang SQL> -- drop tables DROP TABLE IF EXISTS tmp_patients; DROP TABLE IF EXISTS tmp_visits;

-- create tables CREATE TABLE tmp_patients( PATIENT_ID INT, LASTNAME VARCHAR(20) );

CREATE TABLE tmp_visits( PATIENT_ID INT, VISIT_DATE DATE, SCORE NUMERIC(4,1) );

-- load data from csv files /* -- Note: LOAD DATA LOCAL requires `local-infile` enabled on both the client and server else you get error "#1148 command is not allowed.." LOAD DATA LOCAL INFILE '/home/csvimport/patients.csv' INTO TABLE `tmp_patients` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES; LOAD DATA LOCAL INFILE '/home/csvimport/visits.csv' INTO TABLE `tmp_visits` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

/

-- load data hard coded INSERT INTO tmp_patients(PATIENT_ID, LASTNAME) VALUES (1001, 'Hopper'), (4004, 'Wirth'), (3003, 'Kemeny'), (2002, 'Gosling'), (5005, 'Kurtz');

INSERT INTO tmp_visits(PATIENT_ID, VISIT_DATE, SCORE) VALUES (2002, '2020-09-10', 6.8), (1001, '2020-09-17', 5.5), (4004, '2020-09-24', 8.4), (2002, '2020-10-08', NULL), (1001, NULL, 6.6), (3003, '2020-11-12', NULL), (4004, '2020-11-05', 7.0), (1001, '2020-11-19', 5.3);

-- join tables and group SELECT p.PATIENT_ID, p.LASTNAME, MAX(VISIT_DATE) AS LAST_VISIT, SUM(SCORE) AS SCORE_SUM, CAST(AVG(SCORE) AS DECIMAL(10,2)) AS SCORE_AVG FROM tmp_patients p LEFT JOIN tmp_visits v ON v.PATIENT_ID = p.PATIENT_ID GROUP BY p.PATIENT_ID, p.LASTNAME ORDER BY p.PATIENT_ID;</lang>

@@ Line 105: / Line 105: @@
 df_visits = pd.read_csv (r'visits.csv')
-# load data hard coded, create data frames
+''' # load data hard coded, create data frames
-# import io
+import io
-# str_patients = """PATIENT_ID,LASTNAME
+str_patients = """PATIENT_ID,LASTNAME
-# 1001,Hopper
+,Hopper
-# 4004,Wirth
+,Wirth
-# 3003,Kemeny
+,Kemeny
-# 2002,Gosling
+,Gosling
-# 5005,Kurtz
+,Kurtz
-# """
+"""
-# df_patients = pd.read_csv(io.StringIO(str_patients), sep = ",")
+df_patients = pd.read_csv(io.StringIO(str_patients), sep = ",")
-# str_visits = """PATIENT_ID,VISIT_DATE,SCORE
+str_visits = """PATIENT_ID,VISIT_DATE,SCORE
-# 2002,2020-09-10,6.8
+,2020-09-10,6.8
-# 1001,2020-09-17,5.5
+,2020-09-17,5.5
-# 4004,2020-09-24,8.4
+,2020-09-24,8.4
-# 2002,2020-10-08,
+,2020-10-08,
-# 1001,,6.6
+,,6.6
-# 3003,2020-11-12,
+,2020-11-12,
-# 4004,2020-11-05,7.0
+,2020-11-05,7.0
-# 1001,2020-11-19,5.3
+,2020-11-19,5.3
-# """
+"""
-# df_visits = pd.read_csv(io.StringIO(str_visits), sep = ",")
+df_visits = pd.read_csv(io.StringIO(str_visits), sep = ",")
+'''
 # typecast from string to datetime so .agg can use 'max'