Talk:Determine if a string has all the same characters: Difference between revisions

From Rosetta Code
Revision as of 14:06, 30 October 2019

What is a character?

For old-style strings where one character equals one byte, this is not really a problem. With Unicode and multibyte characters it is, and Unicode equivalence (https://en.wikipedia.org/wiki/Unicode_equivalence) makes it much worse. How are languages with Unicode support expected to deal with this?

Wikipedia gives the example of the character "ñ", which can be encoded as U+00F1 or alternatively as U+006E followed by U+0303. In Python, the latter is a two-"character" string by default, which can be normalized with the unicodedata.normalize function. However, Notepad++ and MS Word correctly display both as ñ.
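
To make the "ñ" case concrete, here is a minimal Python sketch. The helper names all_same and all_same_normalized are hypothetical, and the choice of the NFC form is an assumption; the task itself does not say which comparison is intended. It only illustrates that a naive code-point comparison and a normalized one can disagree.

```python
import unicodedata

composed = "\u00f1"      # ñ as the single code point U+00F1
decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)

def all_same(s: str) -> bool:
    """Naive check (hypothetical helper): every code point in s is identical."""
    return len(set(s)) <= 1

def all_same_normalized(s: str, form: str = "NFC") -> bool:
    """Same check after Unicode normalization; the NFC default is an assumption."""
    return all_same(unicodedata.normalize(form, s))

mixed = composed + decomposed            # displays as two ñ

print(len(composed), len(decomposed))    # code point counts of the two encodings
print(all_same(mixed))                   # compares raw code points
print(all_same_normalized(mixed))        # compares after composing to NFC
```

Normalization only helps when a precomposed form exists. A base letter with a combining mark that has no precomposed code point would still count as several "characters" here, so a grapheme-aware comparison might be needed for the general case.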

And while we are at it, note that while "EEE" is a string which has all the same characters, "EΕЕ" is not: its three letters look alike but are different code points.
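
The look-alike case can be inspected the same way. The sketch below assumes the three letters in the second string are Latin E (U+0045), Greek Ε (U+0395) and Cyrillic Е (U+0415), and writes them as escapes so nothing depends on how the page is rendered; describe is a hypothetical helper.

```python
import unicodedata

# Assumed code points for the look-alike example: Latin, Greek, Cyrillic.
lookalikes = "\u0045\u0395\u0415"

def describe(s: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character of s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

for code_point, name in describe(lookalikes):
    print(code_point, name)

# Normalization does not merge look-alikes: they are distinct letters,
# not equivalent encodings of one letter.
distinct_raw = len(set(lookalikes))
distinct_nfkc = len(set(unicodedata.normalize("NFKC", lookalikes)))
print(distinct_raw, distinct_nfkc)
```

So, unlike the "ñ" case, no normalization form makes this string uniform. Whether a solution should report it as "all the same" is a question about the task, not about the encoding.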

Of course, the same comment applies to the other task.

Eoraptor (talk) 13:45, 30 October 2019 (UTC)