Talk:Merge and aggregate datasets

Duplication of task goals if not task name

So... this task is pretty much an exact duplicate of CSV data manipulation, which has been around for 7+ years and has some 85 entries. Admittedly, this task has slightly better-defined goals and is less trivial, but a large percentage of the code from there could be lifted and used unchanged here.

Some overlap of tasks is inevitable, and honestly I think this one is probably more useful for demonstrating work with real-world data than the other. I hesitate to make any unilateral decisions (unlike with the recent deluge of "Find words containing whatever" tasks that we've been hit with), but I also don't want to needlessly proliferate trivial variations. Thoughts? --Thundergnat (talk) 19:16, 7 December 2020 (UTC)

Missing fields in the CSV files. There might be a lot of overlap, but no "exact duplication", and handling of missing fields, although not highlighted, is a significant difference I think. --Paddy3118 (talk) 19:40, 7 December 2020 (UTC)

My motivation for submitting this task was that I was recently working with R scripts for the first time. I'm reasonably experienced with programming, but I had quite a hard time getting it to work.

The examples and tutorials on Stack Overflow and other places are generally either too trivial or too specific to one exact use case. Merging, grouping and aggregating different datasets is something I encounter a lot in my work.

So that's why I submitted this task (after also asking here), and made sure to include the most common "hurdles", like missing records, missing values, multiple aggregator functions at once, working with date values and unordered source files. --BdR (talk) 22:49, 7 December 2020 (UTC)
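For illustration, the hurdles listed above (missing records, missing values, several aggregates at once, date values, unordered input) can be sketched in a few lines of Python. This is only a minimal sketch with made-up data and hypothetical names (`patients`, `visits`, `merge_and_aggregate`), not any of the task's actual entries; dates are assumed to be ISO-formatted strings, so that `max` orders them chronologically.

```python
from statistics import mean

# Made-up sample data (hypothetical, not the task's files), deliberately unordered.
patients = [(3003, "Gamma"), (1001, "Alpha"), (2002, "Beta")]
visits = [
    (2002, "2020-09-10", 6.8),
    (1001, "2020-09-17", 5.5),
    (1001, "2020-11-19", None),  # missing score
    (2002, "2020-10-08", None),  # missing score
    (1001, "", 6.5),             # missing date
]

def merge_and_aggregate(patients, visits):
    """Left-join visits onto patients, then aggregate per patient."""
    by_patient = {}
    for pid, date, score in visits:
        by_patient.setdefault(pid, []).append((date, score))

    result = []
    for pid, name in sorted(patients):
        rows = by_patient.get(pid, [])  # a patient may have no visits at all
        dates = [d for d, _ in rows if d]
        scores = [s for _, s in rows if s is not None]
        result.append((
            pid,
            name,
            max(dates) if dates else None,    # last visit (ISO strings sort by date)
            sum(scores) if scores else None,  # total score
            mean(scores) if scores else None, # average score
        ))
    return result

result = merge_and_aggregate(patients, visits)
for row in result:
    print(row)
```

Missing values are skipped rather than treated as zero, and a patient without any visits still appears in the output with empty aggregates.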

..."two datasets as provided in .csv files"...

Many examples don't read the csv from files. --Paddy3118 (talk) 19:42, 7 December 2020 (UTC)

Loading from a .csv file is the shortest code and more practical, but for quickly copying and testing the code examples, hard-coded data is easier. So, when possible, I try to include both and then comment out the .csv load lines; see the Python example code. --BdR (talk) 21:58, 7 December 2020 (UTC)

Agreed, I find it very useful to have demos that "just run", and like you I usually add comments that show how to read the exact same stuff from a file. I have also added some links to related tasks. --Pete Lomax (talk) 07:25, 8 December 2020 (UTC)

Of course that "just run" is more than just a little bit handy for repl-it, tio, and the like. --Pete Lomax (talk) 02:35, 10 December 2020 (UTC)
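The pattern discussed above (hard-coded data by default, with the file load left in as a comment) might look something like the following. It is a rough sketch using only the standard `csv` and `io` modules; the sample rows, the `load_rows` helper and the `patients.csv` file name are assumptions made up for this illustration.

```python
import csv
import io

# Hypothetical sample data, embedded so the example "just runs".
PATIENTS_CSV = """PATIENT_ID,LASTNAME
1001,Alpha
2002,Beta
3003,Gamma
"""

def load_rows(source):
    """Parse CSV text from any file-like object into a list of dicts."""
    return list(csv.DictReader(source))

# Default: use the hard-coded text, wrapped as a file-like object.
patients = load_rows(io.StringIO(PATIENTS_CSV))

# Alternative: read the same data from a file (hypothetical file name).
# with open("patients.csv", newline="") as f:
#     patients = load_rows(f)

for row in patients:
    print(row["PATIENT_ID"], row["LASTNAME"])
```

Because both paths go through the same parser, switching between them only means swapping which lines are commented out.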

The task says "Either load the data from the .csv files or create the required data structures hard-coded.", so I took that to mean reading the files wasn't required. The current implementations cover the full spectrum. Go, SQL, Wren, and now C++ took the hard-coded approach. Perl and Raku parse a text block. Julia, Phix, and R work as if they are reading a file. Python, REXX, and SPSS actually do read .csv files. To me, the interesting part of this task is combining the tables; I think this is the only task to do that. Reading from a .csv is covered by the CSV data manipulation task. What should be required? Garbanzo (talk) 03:49, 5 January 2021 (UTC)

Cleaned Note

I removed the note about generalized programming languages. The solutions may not be as clean as in a specialized language, but it should still be possible. I also hard-coded the data for the C++ entry. To me, the interesting part of this task is joining two tables and dealing with nulls. Garbanzo (talk) 03:05, 4 January 2021 (UTC)

If you don't complete the task by being able to read the files, then the C++ solution is not as comparable to the solutions that do implement that part. Yes, it is just setup, but reading from csv files is a pretty common way of getting data for your "interesting bits".

If a well-known and easy-to-use source of C++ libraries (Boost?) has a csv reader, then you could employ that, but I'm not a great C++ programmer. --Paddy3118 (talk) 14:29, 4 January 2021 (UTC)

My apologies - The task description does allow input from other than .csv files. --Paddy3118 (talk) 13:07, 5 January 2021 (UTC)

Thanks. The description is clearer now. Garbanzo (talk) 06:07, 6 January 2021 (UTC)