Talk:Find URI in text: Difference between revisions

:it is not necessary to copy the example input exactly. if you can think of other examples that are worth testing, please include them too.
:as for the expected output, this is a question of the balance between following the rfc and handling user expectations. for example, a <code> . </code> or <code> , </code> at the end of a URI is most likely not part of the URI according to user expectation, but it is a legal character in the RFC. which rule is better? i don't know. until someone can show a live URI that has <code> . </code> or <code> , </code> at the end i am inclined to remove them. in contrast the <code>()</code> case is somewhat easier to decide. if there is a <code>(</code> before the URI, then clearly the <code>)</code> at the end is also not part of the URI, but there are edge cases too.--[[User:EMBee|eMBee]] 06:58, 8 January 2012 (UTC)
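:A minimal sketch of the trimming heuristic discussed above, in Python. The names <code>CANDIDATE</code> and <code>find_uris</code> are hypothetical, and the pattern is deliberately loose rather than the RFC 3986 grammar. One assumption to note: instead of looking for a <code>(</code> before the URI, it drops a trailing <code>)</code> only when the candidate contains more closing than opening parentheses, which also keeps balanced pairs inside the URI.

```python
import re

# Hypothetical, deliberately loose candidate pattern (NOT the RFC 3986 grammar):
# a scheme, a colon, then a run of characters that are not whitespace, <, > or ".
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:[^\s<>\"]+")


def find_uris(text):
    """Return URI candidates with trailing '.', ',' and unbalanced ')' removed."""
    results = []
    for match in CANDIDATE.finditer(text):
        uri = match.group(0)
        while uri:
            last = uri[-1]
            if last in ".,":
                # Legal per the RFC, but most likely sentence punctuation.
                uri = uri[:-1]
            elif last == ")" and uri.count(")") > uri.count("("):
                # More ')' than '(' inside the candidate: the ')' most likely
                # closes a parenthesis opened before the URI.
                uri = uri[:-1]
            else:
                break
        results.append(uri)
    return results
```

:With this sketch, <code>find_uris("(http://example.com/a).")</code> gives <code>['http://example.com/a']</code>, while a balanced pair such as the one in <code>http://en.wikipedia.org/wiki/Foo_(bar)</code> is kept. A live URI that really ends in <code> . </code> or <code> , </code> would be truncated, which is exactly the trade-off described above.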
== Unicode and URIs ==
[http://www.ietf.org/rfc/rfc3986.txt RFC 3986] defines URIs and does not allow Unicode; however, the IETF addresses this in [http://www.ietf.org/rfc/rfc3987.txt RFC 3987] via the IRI mechanism, which is related but separate. The syntactic definitions are very similar, with most of the elements extended. Two lower-level elements are added, 'iprivate' and 'ucschar', which are specific ranges of Unicode code points outside US-ASCII (the <code>%x</code> notation in the grammar is ABNF for hexadecimal values, not percent-encoding). These elements percolate up through most of the higher syntax elements such as the authority, paths, and segments, which have i-versions. Other elements such as 'scheme' and the IP address elements are left alone. There is also no 'ireserved' element. --[[User:Dgamey|Dgamey]] 14:50, 8 January 2012 (UTC)
: Having worked on a couple of projects that involve parsing things defined by RFCs, I've found that, unless it's a use-once-and-throw-away solution, straying from the RFCs or reinterpreting them is generally asking for trouble. --[[User:Dgamey|Dgamey]] 14:50, 8 January 2012 (UTC)
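: For reference, the 'ucschar' and 'iprivate' elements mentioned above can be sketched as code point range tables in Python. The ranges are transcribed from the ABNF in RFC 3987; the names <code>UCSCHAR_RANGES</code>, <code>IPRIVATE_RANGES</code>, <code>is_ucschar</code> and <code>is_iprivate</code> are hypothetical and only illustrate the idea.

```python
# Code point ranges transcribed from the RFC 3987 ABNF.
# All names below are hypothetical, chosen for illustration.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    # Plane 14 starts later: %xE1000-EFFFD.
    + [(0xE1000, 0xEFFFD)]
)

IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(char, ranges):
    """True if the single character's code point falls in any (low, high) range."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(char):
    """True if char matches the 'ucschar' rule (allowed via 'iunreserved')."""
    return _in_ranges(char, UCSCHAR_RANGES)


def is_iprivate(char):
    """True if char matches the 'iprivate' rule (private-use code points)."""
    return _in_ranges(char, IPRIVATE_RANGES)
```

: A character class built from these tables could extend an ASCII-only URI matcher to IRIs without otherwise departing from the RFC grammar.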