Talk:URL decoding: Difference between revisions

From Rosetta Code
Revision as of 00:28, 28 May 2015

Task update suggestion: support for non-ASCII characters encoded as UTF-8. -- 3havj7t3nps8z8wij3g9 (talk) 05:31, 26 May 2015 (UTC)

in what way? --Rdm (talk) 08:21, 26 May 2015 (UTC)
Say, for example, a Google search for `Abdu'l-Bahá gives https://www.google.com/search?q=%60Abdu%27l-Bah%C3%A1 .. how would one decode %60Abdu%27l-Bah%C3%A1 to `Abdu'l-Bahá? -- 3havj7t3nps8z8wij3g9 (talk) 16:04, 26 May 2015 (UTC)
Any existing implementation should have no problem with the url https://www.google.com/search?q=%60Abdu%27l-Bah%C3%A1 - so it would be reasonable to add that as a test case. --Rdm (talk) 18:29, 26 May 2015 (UTC)
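For reference, here is a minimal sketch of what decoding that test case involves, written in Python rather than awk. It assumes the escapes are UTF-8 (as in the example url) and treats each %XX as one raw byte, only interpreting the bytes as text at the very end. The function names `url_decode_bytes` and `url_decode` are made up for this illustration and are not taken from any entry on the task page.

```python
import string


def url_decode_bytes(text: str) -> bytes:
    """Turn each %XX escape into one raw byte; copy everything else through."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        is_escape = (
            text[i] == "%"
            and len(pair) == 2
            and all(c in string.hexdigits for c in pair)
        )
        if is_escape:
            out.append(int(pair, 16))  # e.g. "C3" becomes the byte 0xC3
            i += 3
        else:
            out.extend(text[i].encode("utf-8"))
            i += 1
    return bytes(out)


def url_decode(text: str) -> str:
    """Decode the escapes, then interpret the resulting bytes as UTF-8 (an assumption)."""
    return url_decode_bytes(text).decode("utf-8")


print(url_decode("%60Abdu%27l-Bah%C3%A1"))
```

The point of the two-step design is that %C3 and %A1 are not two characters but two bytes of one character, so a decoder only needs to emit the bytes faithfully; it does not need a table of every possible UTF-8 character.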
OK, added it as a test case. I know it breaks the Awk code. I left a note saying where to find working gawk code, but it lists every potential UTF-8 character, so it's large (and, given the possibilities, not even complete). I suspect other languages could have similar problems. -- 3havj7t3nps8z8wij3g9 (talk) 00:47, 27 May 2015 (UTC)
I had no serious problem with the existing awk implementation on your new example. I did have two minor issues I needed to deal with:
  1. The url being decoded is hardcoded into the example. I dealt with this by replacing the hardcoded url. A more general solution might place the url on stdin.
  2. I was using LC_ALL=C, which prevented display of text as utf-8. I dealt with this by unsetting that environment variable. (LC_CTYPE and LANG might have similar effects, but I was not using them.)
I suspect that if you were encountering issues, they might be similar. --Rdm (talk) 04:31, 27 May 2015 (UTC)
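As a rough illustration of the "url on stdin" idea from point 1, the sketch below decodes one url per line from any line-oriented stream. It is Python, not the awk entry, and it leans on the standard library's `urllib.parse.unquote`, which decodes escapes as UTF-8 by default. The helper name `decode_lines` and the use of `io.StringIO` as a stand-in for standard input are assumptions made for this example.

```python
import io
from urllib.parse import unquote


def decode_lines(stream):
    """Yield the percent-decoded form of each non-empty line in `stream`."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield unquote(line)  # %XX escapes are decoded as UTF-8 by default


# Stand-in for standard input; pass sys.stdin instead to read real input.
sample = io.StringIO("https://www.google.com/search?q=%60Abdu%27l-Bah%C3%A1\n")
for decoded in decode_lines(sample):
    print(decoded)
```

Nothing about the decoding changes when the url arrives this way; the only difference is that the url is no longer hardcoded in the program.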
If you pass the url from stdin, is the decode work being done by the shell? When I ran it on one machine it worked, but on another machine it didn't. I looked at the code and saw it wasn't multi-byte aware, and so figured the working machine was a fluke. As you say, something with LC and LANG makes sense. On the machine where it didn't work, these are set:
  • LANG=en_US.UTF-8
  • LC_ALL=en_US.UTF-8
  • LC_COLLATE=en_US.UTF-8.
On the machine where it worked, none of those are set. For Rosetta Code purposes, having awk rely on the shell environment to do a urldecode doesn't seem right? At least it's not very portable. (And I don't understand why enabling UTF-8 would make it not work.) -- 3havj7t3nps8z8wij3g9 (talk) 19:51, 27 May 2015 (UTC)
If you pass the url from stdin it's not hardcoded, that's all.
Meanwhile, it's the same character sequence emitted regardless of the environment. The issue with those variables is how they get handled by the OS (which is responsible for putting the font on the screen).
As for why it's broken for you when you set those variables, my guess is that whatever is interpreting those variables expects to be talking to something other than what it's really talking to, whereas if you erased them the characters would be passed through unchanged (and handled as utf-8 by a different part of the OS). --Rdm (talk) 00:28, 28 May 2015 (UTC)
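The "same character sequence, different handling" point can be illustrated with a small Python sketch. It is not a diagnosis of either machine discussed above; it only shows that the two bytes produced by decoding %C3%A1 are fixed, while what a reader sees depends on which encoding is assumed when they are displayed. Latin-1 is used here purely as an example of a mismatched assumption.

```python
# The two bytes produced by decoding %C3%A1, independent of any locale setting.
raw = bytes([0xC3, 0xA1])

as_utf8 = raw.decode("utf-8")      # one character: U+00E1 (a with acute)
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 followed by U+00A1

print(len(raw), repr(as_utf8), repr(as_latin1))
```

In both cases the decoder emitted identical bytes; only the interpretation step differs.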