Talk:URL encoding

I believe that some symbols are usually not encoded in URLs. The exact list varies, but it usually includes the period (.) and hyphen (-), and sometimes the underscore (_). Should we leave those unencoded as well? This is complicated by the fact that there are several standards for URI syntax (RFC 1738, RFC 3986), additional restrictions for specific protocols such as HTTP (e.g. the plus character (+) is encoded in form data), as well as many slightly different implementations across languages (and sometimes even within the same language). So any solution that uses library functions will invariably encode a slightly smaller set of characters than the task specification calls for. It would be hard to keep all the solutions consistent. --98.210.210.193 07:06, 17 June 2011 (UTC)
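To make the point concrete, here is a minimal sketch in Python, with urllib.parse.quote standing in for the library routines discussed above (other languages' functions draw the line slightly differently):

 # Python's quote never encodes letters, digits, '_', '.', '-', or '~';
 # which other characters survive depends on the 'safe' parameter.
 from urllib.parse import quote

 s = "http://foo bar/"
 print(quote(s))           # http%3A//foo%20bar/   ('/' is safe by default)
 print(quote(s, safe=""))  # http%3A%2F%2Ffoo%20bar%2F   (matches the task's example)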

Nailing this down would help, since there are two tasks dependent on this (URL encoding and decoding). Sorting out and making sense of the current set of RFCs is probably a prerequisite:
* RFC 3986 is about URIs and updates RFC 1738; these two appear to be the most relevant RFCs.
* RFC 1738 is about URLs.
* Superseding RFCs may supersede only some of the functionality (such as for a protocol like gopher).
* Superseded RFCs should be ignored.
* As this task seems to be about HTTP URLs, we should ignore the RFCs for other protocols like mail, tn3270, etc. There are also RFCs that extend functionality, such as extensions for protocols like WebDAV, which would seem not to be part of the core task. Also, some of these RFCs have been marked as 'historic', a polite way of saying obsolete.
This task is not restricted to HTTP URLs; it can be applied to any string that can be encoded into this format.
I believe the example of an encoded URL is in error (or not described properly). Specifically:
: The string "http://foo bar/" would be encoded as "http%3A%2F%2Ffoo%20bar%2F".
It would only be encoded this way if the URL were being passed as data within another URL. See the RFC sections on Reserved Characters and When to Encode or Decode.
The task is to demonstrate the encoding mechanism itself, rather than when to apply it, so we can assume it will be used in applications where the URL string requires encoding. --Markhobley 13:01, 17 June 2011 (UTC)
Fair enough. --Dgamey 01:43, 19 June 2011 (UTC)
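For reference, a minimal sketch of the mechanism itself, assuming the RFC 3986 rule that every byte outside the unreserved set is percent-encoded; it reproduces the task's example string:

 # Percent-encode every byte except the RFC 3986 unreserved characters.
 UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz0123456789-._~")

 def url_encode(s):
     # Encode to UTF-8 first so multi-byte characters become %XX%XX sequences.
     return "".join(chr(b) if chr(b) in UNRESERVED else "%%%02X" % b
                    for b in s.encode("utf-8"))

 print(url_encode("http://foo bar/"))  # http%3A%2F%2Ffoo%20bar%2F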
There probably should be some required input(s) and output(s). I noticed the Perl example is very cryptic, using a library, and provides no output. The output it would produce doesn't match the 'example' string, as it only encodes data in the path portion of the URL and not the entire URL.
--Dgamey 09:54, 17 June 2011 (UTC)

The point of encoding strings is to avoid confusion. Some characters, such as '+' and '?', tend to be metacharacters used by the CGI interface ('?' for the beginning of the query string, '+' for separating parameters), while '\r' and '\n' must be encoded because they signify end of input; encoding also lets text outside the printable 7-bit ASCII range travel as "normal text", so dumber server or client software won't get totally confused. I don't know how much we need to conform to various RFCs here; maybe common sense would suffice. In principle you could escape the "http" too and still conform to most standards, but that would be utterly pointless, wouldn't it? --Ledrug 02:23, 19 June 2011 (UTC)
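A sketch of that "common sense" rule, with an assumed character set rather than any RFC's exact list: escape the CGI metacharacters, '%' itself, the control characters, and anything outside printable ASCII, and pass the rest through:

 # Escape CGI metacharacters, '%' itself, control characters, and
 # any byte outside printable 7-bit ASCII; pass everything else through.
 METACHARS = set("+?&=#%/:")   # an assumed list, per the discussion above

 def encode_sensibly(s):
     out = []
     for b in s.encode("utf-8"):
         c = chr(b)
         if 0x20 < b < 0x7F and c not in METACHARS:
             out.append(c)              # ordinary printable ASCII
         else:
             out.append("%%%02X" % b)   # everything else is percent-encoded
     return "".join(out)

 print(encode_sensibly("http://foo bar/\r\n"))  # http%3A%2F%2Ffoo%20bar%2F%0D%0A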