Merge and aggregate datasets: Difference between revisions

Content added Content deleted

Inline

Revision as of 14:42, 6 December 2020

Merge and aggregate datasets

Task

Merge and aggregate two datasets as provided in .csv files into a new resuling dataset. Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient. Either load the data from the .csv files or create the required data structures hard-coded.

patients.csv file contents:

PATIENT_ID,LASTNAME
1001,Hopper
4004,Wirth
3003,Kemeny
2002,Gosling
5005,Kurtz

visits.csv file contents:

PATIENT_ID,VISIT_DATE,SCORE
2002,2020-09-10,6.8
1001,2020-09-17,5.5
4004,2020-09-24,8.4
2002,2020-10-08,
1001,,6.6
3003,2020-11-12,
4004,2020-11-05,7.0
1001,2020-11-19,5.3

Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand. Merge and group per patient id and name, get the maximum visit date, get the sum of scores per patient and the average score per patient to get the resulting datasset:

| PATIENT_ID | LASTNAME | LAST_VISIT |  SCORE_SUM | SCORE_AVG |
| 1001       |  Hopper  | 2020-11-19 |  17.4      | 5.80      |
| 2002       |  Gosling | 2020-10-08 |   6.8      | 6.80      |
| 3003       |  Kemeny  | 2020-11-12 |            |           |
| 4004       |  Wirth   | 2020-11-05 |  15.4      | 7.70      |
| 5005       |  Kurtz   |            |            |           |

Related tasks

R

<lang R># load datasets from file

df_patient <- read.csv(file="patients.csv", header = TRUE, sep = ",")
df_visit <- read.csv(file="visit.csv", header = TRUE, sep = ",")

load datasets hard coded

df_patient <- read.table(text = " PATIENT_ID,LASTNAME 1001,Hopper 4004,Wirth 3003,Kemeny 2002,Gosling 5005,Kurtz ", header = TRUE, sep = ",")

df_visit <- read.table(text = " PATIENT_ID,VISIT_DATE,SCORE 2002,2020-09-10,6.8 1001,2020-09-17,5.5 4004,2020-09-24,8.4 2002,2020-10-08,NULL 1001,NULL,6.6 3003,2020-11-12,NULL 4004,2020-11-05,7.0 1001,2020-11-19,5.3 ", header = TRUE, sep = ",")

result

df_result <- data.frame(

 cbind(
   PATIENT_ID = names(tapply(df_visit$VISIT_DATE, list(df_visit$PATIENT_ID), max, na.rm=TRUE)),
   last_visit = tapply(df_visit$VISIT_DATE, list(df_visit$PATIENT_ID), max, na.rm=TRUE),
   score_avg = tapply(df_visit$SCORE, list(df_visit$PATIENT_ID), mean, na.rm=TRUE),
   score_sum = tapply(df_visit$SCORE, list(df_visit$PATIENT_ID), sum, na.rm=TRUE)
 )

)

combine two datasets in r
all.x = all the non matching cases of df_patient are appended to the result as well (i.e. 'left join')

df_result2 <- merge(df_patient, df_result, by = 'PATIENT_ID', all.x = TRUE)</lang>

SQL

<lang SQL> -- drop tables DROP TABLE IF EXISTS tmp_patients; DROP TABLE IF EXISTS tmp_visit;

-- create tables CREATE TABLE tmp_patients( PATIENT_ID INT, LASTNAME VARCHAR(20) );

CREATE TABLE tmp_visit( PATIENT_ID INT, VISIT_DATE DATE, SCORE NUMERIC(4,1) );

-- load data from csv files into tables /* -- Note: for LOAD DATA LOCAL to work you have to enable `local-infile` on both the client and server -- else you'll get the error "#1148 command is not allowed.." LOAD DATA LOCAL INFILE '/home/csvimport/patients.csv' INTO TABLE `tmp_patients` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

LOAD DATA LOCAL INFILE '/home/csvimport/visits.csv' INTO TABLE `tmp_visit` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

/

-- insert records into tables INSERT INTO tmp_patients(PATIENT_ID, LASTNAME) VALUES (1001, 'Hopper'), (4004, 'Wirth'), (3003, 'Kemeny'), (2002, 'Gosling'), (5005, 'Kurtz');

INSERT INTO tmp_visit(PATIENT_ID, VISIT_DATE, SCORE) VALUES (2002, '2020-09-10', 6.8), (1001, '2020-09-17', 5.5), (4004, '2020-09-24', 8.4), (2002, '2020-10-08', NULL), (1001, NULL, 6.6), (3003, '2020-11-12', NULL), (4004, '2020-11-05', 7.0), (1001, '2020-11-19', 5.3);

-- join tables and group SELECT p.PATIENT_ID, p.LASTNAME, MAX(VISIT_DATE) AS LAST_VISIT, SUM(SCORE) AS SCORE_SUM, CAST(AVG(SCORE) AS DECIMAL(10,2)) AS SCORE_AVG FROM tmp_patients p LEFT JOIN tmp_visit v ON v.PATIENT_ID = p.PATIENT_ID GROUP BY p.PATIENT_ID, p.LASTNAME ORDER BY p.PATIENT_ID; </lang>

@@ Line 1: / Line 1: @@
 {{task|Basic language learning}}Merge and aggregate datasets
-[[Category:Simple]]
 ;Task:
-Merge and aggregate two datasets as provided in .csv files.
+Merge and aggregate two datasets as provided in .csv files into a new resuling dataset.
 Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient.
+Either load the data from the .csv files or create the required data structures hard-coded.
-For missing values use /0, NA or NULL or whatever is appropriate.
-Either load the data from CSV files or create the required data structures hard-coded.
-Create the result in-memory or output it to screen or file, whichever is appropriate for the programming language at hand.
 patients.csv file contents:
@@ Line 32: / Line 28: @@
 </nowiki>
+Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand.
-Merge and aggregate these two datasets to get resulting datasset:
+Merge and group per patient id and name, get the maximum visit date, get the sum of scores per patient and the average score per patient to get the resulting datasset:
  <nowiki>| PATIENT_ID | LASTNAME | LAST_VISIT |  SCORE_SUM | SCORE_AVG |
 | 1001       |  Hopper  | 2020-11-19 |  17.4      | 5.80      |
@@ Line 89: / Line 86: @@
 =={{header|SQL}}==
-birthday.sql
 <lang SQL>
 -- drop tables