Talk:Determine if a string has all the same characters

What is a character?

For old-style strings where one character equals one byte, it's not really a problem. Nowadays with Unicode and multibyte characters, and much worse with Unicode equivalence it is. How are languages with Unicode support expected to deal with this?

Wikipedia gives the example of the character "ñ", which can be encoded as U+00F1, or alternatively as U+006E followed by U+0303. In Python, the latter is by default a two-"character" string, which can be normalized with the unicodedata.normalize function. However, Notepad++ and MS Word both correctly display it as ñ.
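
A minimal Python 3 sketch of the point above (the code points are the ones from the Wikipedia example; nothing here is specific to this task):

    import unicodedata

    precomposed = "\u00F1"         # "ñ" as a single code point
    decomposed  = "\u006E\u0303"   # "n" followed by a combining tilde

    print(len(precomposed), len(decomposed))   # 1 2   -- two "characters" by default
    print(precomposed == decomposed)           # False -- Python does not normalize
    nfc = unicodedata.normalize("NFC", decomposed)
    print(len(nfc), nfc == precomposed)        # 1 True -- equal after normalization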

And while we are at it, note that, while "EEE" is a string which has all the same characters, "EΕЕ" is not.
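
A naive Python check (a sketch only; it compares code points, with no normalization or homoglyph folding, and the second string is assumed to be Latin E, Greek capital Epsilon and Cyrillic capital Ie) shows the difference:

    def all_same(s):
        # True if every code point in s is identical; the empty string counts as "all same".
        return len(set(s)) <= 1

    print(all_same("EEE"))            # True
    print(all_same("E\u0395\u0415"))  # False -- Latin E, Greek Epsilon, Cyrillic Ie
    print([hex(ord(c)) for c in "E\u0395\u0415"])  # ['0x45', '0x395', '0x415']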

Of course, the same comment applies to the other task.

Eoraptor (talk) 13:45, 30 October 2019 (UTC)




Not just the other task, of course; the same comment applies to numerous others.


There are many Rosetta Code tasks that deal with a character or characters, and many other tasks that use the word character or characters; as far as I recall, most of them (if not all, unless noted otherwise) refer to a byte, to ASCII/EBCDIC character(s), to an 8-bit character, or to some similar designation/nomenclature.


Some of the Rosetta Code tasks that I found which deal specifically with character/characters (words that appear in their task titles) are (and I'm sure that some were missed/overlooked):

   •  Character code
   •  Character codes
   •  Character matching
   •  Determine if a string has all the same characters
   •  Determine if a string has all unique characters
   •  Display an extended (non ASCII) character onto the terminal
   •  Given a character value in your language, print its code
   •  Idiomatically determine all the characters that can be used for symbols
   •  Just in time processing on a character stream
   •  Read a file character by character/UTF8
   •  Remove the first and last characters from a string/Top and tail
   •  Special characters
   •  Split a character string based on change of character
   •  String Character Length
   •  Strip a set of characters from a string
   •  Strip control codes and extended characters from a string
   •  Terminal control/Display an extended character
   •  Terminal Control/Reading a character at a specific location on the screen
   •  Words Of Equal Characters


There are way too many other Rosetta Code tasks to mention that deal with   characters   in some form or another.


Since Rosetta Code deals (essentially) with computer programming languages, it almost always (in my opinion) deals with characters that are, in essence, the same as bytes (8 bits).   This Rosetta Code task is in that tradition.   I didn't expect to have to define what a character is, just as none of the other tasks found it necessary to do so.

If somebody wants to add a definition to all the existing Rosetta Code tasks that use the word character,   I would not stand in their way.


As for the question

                    How are languages with Unicode support expected to deal with this?

I'm not able to answer it competently.   Perhaps that query could be answered/addressed by learned people with more experience with Unicode support in languages they know well.   Defining   character(s)   at this late point would really add a lot of verbiage to quite a few Rosetta Code tasks,   not to mention discussion pages.     -- Gerard Schildberger (talk) 15:56, 30 October 2019 (UTC)


As an aside,   I always thought of   Thundergnat   as quite the character,   but he is much more than 8 bits;   a description I hope he takes as a compliment.       😉       -- Gerard Schildberger (talk) 15:56, 30 October 2019 (UTC)

Agreed, other tasks may suffer, but not all tasks which deal with characters. The String length task is an obvious example, but it clearly states the requirements. It does not seem excessive to ask for the same precision, especially for a new task. By the way, you write "deals with characters that are, in essence, the same as bytes (8 bits)", but the characters I deal with professionally are often UTF-8, and they always would be if I weren't on Windows - and I suspect I'm not alone. Regarding characters and Unicode, I suggest this article by Joel Spolsky. The first Unicode standard was published in 1992; maybe we can be expected to know about it by now. Eoraptor (talk) 16:19, 30 October 2019 (UTC)
By the way, you just used Unicode, though maybe you didn't realize it: the smiley in your answer is U+1F609. See here. Eoraptor (talk) 16:28, 30 October 2019 (UTC)
Yes, I was aware.   ASCII   '03'x   is so primitive.         -- Gerard Schildberger (talk) 16:47, 30 October 2019 (UTC)
Specifying a "character" without also specifying an encoding is pretty vague, but I don't necessarily think that is always a bad thing. In this particular task, I think is it somewhat useful to leave it up to interpretation since that way it doesn't lock out languages that may not be so modern encoding aware (Unicode) but also doesn't constrain unnecessarily the one that are. It might be useful to encourage some verbiage in each languages task entry about any such constraints or abilities, but I am somewhat against enforcing any particular encoding. Rather err on the side of inclusivity and deal with some fuzzy definitions than enforce rigid compliance and remove room to explore. As a point of fact, I took some pains to demonstrate in these tasks how my particular favorite language deals with some thorny issues when dealing with multi-byte utf-8 encoded Unicode. (Such as Unicode equivalence. :-) )
                    "As an aside, I always thought of Thundergnat as quite the character, but he is much more than 8 bits..."
I resemble that remark. --Thundergnat (talk) 23:47, 30 October 2019 (UTC)
As far as Unicode is concerned, and bearing in mind it's not needed for the 'compulsory' examples which Gerard has set anyway, I agree with the thrust of what Thundergnat has said: a 'character' should be defined in whatever way seems most natural for the language you're using.
In the case of Go, a character (or 'rune' as we prefer to call it) is simply a Unicode code point expressed as a 4-byte integer. String literals are encoded as UTF-8 and are not normalized by default (though there is a supplemental package which can do this). Consequently, an accented character is not the same as the corresponding unaccented character plus the combining accent. Also, unlike Perl 6, there appears to be no easy way to deal with emoji ZWJ sequences at the present time. I've therefore had to be careful in the Go examples to only use emojis which are complete in themselves. --PureFox (talk) 17:11, 31 October 2019 (UTC)
OK, I'm fine with that. It means that different programs will give different results for the same input, but that seems to be the consensus, and we are not going to reimplement ICU, nor dumb down languages which are able to deal with Unicode. By the way, the languages I use (Python, R, Stata mostly) don't normalize by default either. Eoraptor (talk) 18:23, 31 October 2019 (UTC)