Talk:Commatizing numbers

several numbers in the input

Shouldn't 2000 become 2,000? --Walterpachl (talk) 05:07, 11 April 2014 (UTC)

If you're referring to the 4th string to be commatized ===US$0017440 millions'..., then no. The 1st numeric string is 0017440, not 2000 (which is the 2nd numeric string). -- Gerard Schildberger (talk) 05:26, 11 April 2014 (UTC)
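To make the "first numeric string only" point concrete, here is a minimal Python sketch. The function name `commatize_first`, its parameters, and the shortened sample string are all made up for illustration; they are not taken from the task or from any posted solution. It also assumes that a numeric string starts at the first nonzero digit, so leading zeros are left alone, and it ignores decimal points and exponents entirely.

```python
import re

def commatize_first(text, sep=",", period=3):
    """Insert `sep` every `period` digits into the FIRST digit run only."""
    # Assumption: the run starts at the first nonzero digit,
    # so any leading zeros before it are left untouched.
    match = re.search(r"[1-9][0-9]*", text)
    if match is None:
        return text  # no digits to commatize

    digits = match.group()
    groups = []
    # Peel groups of `period` digits off the right-hand end.
    while digits:
        groups.append(digits[-period:])
        digits = digits[:-period]
    grouped = sep.join(reversed(groups))

    # Splice the grouped digits back in; later digit runs are not touched.
    return text[:match.start()] + grouped + text[match.end():]

# Hypothetical, shortened stand-in for the string under discussion.
print(commatize_first("US$0017440 millions (in 2000 dollars)"))
```

With these assumptions, only 17440 gains a separator, and the later 2000 is left as it is.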

Ambiguous task spec. Oddly specific for some things while it glosses over others completely.

It seems to me that this task is really 3 subtasks that have been mashed together. That isn't necessarily a problem, but there is no separation of subtasks in the task description. It is all or nothing. As I interpret it, the first, and ostensibly the primary task, is to add appropriate separators to numbers to display them as numeric strings. That one I have no problem with. It is a fine, useful and commonly performed operation.

The second subtask, as I read it, is to algorithmically pick out numeric strings from mixed alphanumeric strings. Again, a fine, useful operation, though a little more touchy-feely and open to interpretation.
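As a rough illustration of how much interpretation this second step involves, the sketch below uses the simplest possible rule: every maximal run of ASCII digits counts as a "numeric string". The helper name `digit_runs` and the sample input are hypothetical.

```python
import re

def digit_runs(text):
    """Return every maximal run of ASCII digits in `text`, left to right."""
    return re.findall(r"[0-9]+", text)

# Hypothetical input: is 1234e56 one number or two? Is 3.14159 one or two?
print(digit_runs("1234e56 and 3.14159"))
# ['1234', '56', '3', '14159']
```

Under this rule an exponent and a decimal fraction each split into two separate runs, so any real solution has to choose a less naive definition, and that choice is exactly where the interpretation comes in.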

The third subtask, and the one I have the real problem with, is to algorithmically determine WHAT a particular numeric string represents so the "correct" arbitrary, inconsistent, seemingly pulled-out-of-the-air set of rules can be applied. I would be happier if the spec followed Wikipedia's guidelines for digit grouping (which incidentally don't mention scientific notation at all) but that's neither here nor there. Task writers are free to impose arbitrary restrictions.

My issue is that the task implies and assumes that there is some general algorithm that can be determined, but language is too messy and inconsistent to do that. Numeric strings VERY commonly are identifiers, not numbers: zip codes, phone numbers, credit card numbers, dates (it could be argued that 2014 is numeric, but really it just identifies a certain period of time to make it easier to refer to it). How about numbers in binary? hexadecimal? base 36?

Perhaps if someone could demonstrate how the current implementations "correctly" commatize the following strings:

   "The number 8675309 had to be taken out of service in every area code after Tommy Tutones song 'Jenny' was released"
   "Including in Beverly Hills 90210"
   "Use the credit card number 6011123456789876"
   "1946 was a good year for some; 6/9/1946 was a good day at most." 
   "hex 1234e56 = decimal 1091030"
   "bogus₍₃₆₎ = 1001010110101011001010100₍₂₎." 
   "123456789 is often used as an 'unused' demonstration social security number."

If the answer is "Well you need to know what the numeric string represents ahead of time and only use when appropriate.", then don't pretend that there is some general algorithm that can be applied. Knowing ahead of time is not an algorithm.--Thundergnat (talk) 15:54, 19 April 2014 (UTC)
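For reference, this is what a deliberately context-blind commatizer does with a few of the strings above. It is only a sketch under stated assumptions, not any of the current implementations: the name `commatize_all` is invented, every run of ASCII digits is grouped in threes from the right, and leading zeros, decimal points and exponents get no special treatment.

```python
import re

def commatize_all(text, sep=",", period=3):
    """Group EVERY run of ASCII digits in `text`, ignoring all context."""
    def group(match):
        digits = match.group()
        # Length of the leftmost (possibly shorter) group.
        head = len(digits) % period or period
        parts = [digits[:head]]
        parts += [digits[i:i + period] for i in range(head, len(digits), period)]
        return sep.join(parts)

    return re.sub(r"[0-9]+", group, text)

samples = [
    "Including in Beverly Hills 90210",
    "Use the credit card number 6011123456789876",
    "hex 1234e56 = decimal 1091030",
]
for s in samples:
    print(commatize_all(s))
# Including in Beverly Hills 90,210
# Use the credit card number 6,011,123,456,789,876
# hex 1,234e56 = decimal 1,091,030
```

The zip code and the card number are grouped as if they were quantities, and the hex value is split at the `e`, which is the behaviour being questioned here.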

I share your frustration with this task. Like you, I do not think that this task represents any sort of "general algorithm", and I'd much rather see efforts directed at geometric or mechanical modelling, or manipulation of audio data, or something interesting like that. That said, I would not characterize it as "ambiguous" so much as "completely arbitrary". And perhaps it would indeed be better to create one or more of the three modular tasks you've implied, and add links to them from here? That said, I guess I also need to acknowledge that knowing ahead of time is actually kind of important. --Rdm (talk) 16:07, 19 April 2014 (UTC)

Well, it's a wiki. How about we give the original author some time to respond, and meanwhile think about what a rewrite could look like? --Paddy3118 (talk) 18:00, 19 April 2014 (UTC)

The task isn't to determine whether a number in a string should be commatized given its context, but to find a number (a series of digits, or some characters that can be seen as a number) in a string and add commas (or some other characters, based on various supplied options or parameters). Examples of numbers that shouldn't be commatized (telephone numbers, years as in dates, addresses, numbers that may or may not be in base ten, and other such inappropriate numbers) aren't part of the task. The scope in those cases would be daunting, if not next to impossible, even if "only" some of the common rules for English were used, and I wouldn't want anybody to even try to list the rules for such an endeavor. There are just too many exceptions where a set of digits wouldn't be considered a number that should be commatized. So, the examples used in the task could've been: blah blah blah blah blah blab5679979yadda yadda yadda, but that would've been ··· well, boring beyond belief. The numeric strings supplied in the task had extra context added only to illustrate some situations in which commatizing could be used; that extra context wasn't meant to be used in determining whether the numbers should be commatized or not. This task isn't about examining the words (in whatever human language shown), interpreting them, and determining what the number means and whether it is appropriate to commatize it. This isn't an AI problem, but simply a mechanism to add commas (or whatever) to a series of (numeric) digits, irrespective of their context. -- Gerard Schildberger (talk) 18:25, 21 April 2014 (UTC)