Talk:URL decoding: Difference between revisions

:::::::::::::::: Maybe think of this as utf-8 code points (which are what the awk code emits, in either version) vs characters (which in utf-8 can each be a sequence of one or more code points)? ---[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 09:13, 29 May 2015 (UTC)
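The code-point-vs-character distinction above can be made concrete with raw UTF-8 bytes. The following is an illustrative shell sketch (not part of the task); the octal escapes are just one way to emit the bytes:

```shell
# "á" as ONE code point, U+00E1 (precomposed); its UTF-8 encoding is the
# two bytes C3 A1:
precomposed=$(printf '\303\241')
# The same user-perceived character as TWO code points: "a" (U+0061) followed
# by a combining acute accent (U+0301); UTF-8 bytes 61 CC 81:
combining=$(printf 'a\314\201')
# The two usually render identically, but they are different byte sequences:
printf '%s vs %s\n' "$precomposed" "$combining"
```

So a "character" the user sees may be one or several code points, and each code point may be one or several bytes in UTF-8 — which is exactly why byte-level awk code and character-level expectations can disagree.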
:::::::::::::::::You are correct, my mistake. Kevin's function is based on the extended ASCII table. For the character "á" it encodes %E1 (which works in most browsers), but Kevin's function can't independently decode UTF-8 %C3%A1 back to "á"; that depends on the OS (locale settings). The reason UTF-8 matters is that most browsers and HTML pages encode in UTF-8, so when you do web scraping and want to extract a URL (say, an href link tag), it's encoded in UTF-8, and if you then want to display part of that URL (say, the name of a search term) you have to convert it back to visible characters. I've yet to find an OS-independent way to do it (in Awk) that doesn't rely on an external tool (such as Bill Poser's [http://billposer.org/Software/uni2ascii.html ascii2uni] .. which isn't very portable as an external tool). Really what I'm looking for is an Awk program that will convert RFC 2396 URI escapes (e.g. %C3%A9) to Unicode, independent of locale settings. -- [[User:3havj7t3nps8z8wij3g9|3havj7t3nps8z8wij3g9]] ([[User talk:3havj7t3nps8z8wij3g9|talk]]) 16:14, 29 May 2015 (UTC)
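One way to sketch the locale-independent decoder asked for above is to have awk turn each %XX escape into a raw byte and leave the resulting UTF-8 byte sequence untouched, so a UTF-8 terminal or browser renders it without the script needing to understand the locale. This is a sketch under assumptions, not a vetted task solution: the sample input is invented, and the `LC_ALL=C` wrapper assumes the awk implementation's `%c` emits a single raw byte for values 128–255 in the C locale (true of gawk and mawk, but worth verifying on other awks):

```shell
decoded=$(printf '%s' 'caf%C3%A9' | LC_ALL=C awk '
BEGIN {
    # Lookup table: two uppercase hex digits -> the corresponding byte.
    # Start at 1 because many awks cannot hold a NUL byte in a string.
    for (i = 1; i < 256; i++)
        byte[sprintf("%02X", i)] = sprintf("%c", i)   # raw byte in C locale
}
{
    out = ""
    n = length($0)
    for (j = 1; j <= n; j++) {
        c = substr($0, j, 1)
        hx = toupper(substr($0, j + 1, 2))
        if (c == "%" && j + 2 <= n && (hx in byte)) {
            out = out byte[hx]     # emit the decoded byte, keep UTF-8 intact
            j += 2
        } else if (c == "+") {
            out = out " "          # form-urlencoding convention, not RFC 2396
        } else {
            out = out c
        }
    }
    print out
}')
printf '%s\n' "$decoded"
```

Here %C3%A9 decodes to the two bytes C3 A9, which any UTF-8-aware display shows as "é" regardless of the locale the script itself ran under; an extended-ASCII escape like %E1 would come out as the lone byte E1, which is not valid UTF-8 on its own — matching the behaviour described above.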