Find URI in text

From Rosetta Code
Revision as of 12:29, 3 January 2012 by rosettacode>EMBee (Pike)
Find URI in text is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Write a function to search plain text for URIs.

The function should return a list of URIs found in the text.

The definition of a URI is given in RFC 3986.

When searching for URIs, "Appendix C. Delimiting a URI in Context" of that RFC is particularly relevant.
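Appendix C notes that URIs in running text are commonly wrapped in double quotes or angle brackets (sometimes with a "URL:" prefix), and that whitespace inside the delimiters should be ignored when the URI is extracted. A minimal Python sketch of that idea follows. The name extract_delimited is made up for illustration, and the sketch trusts the delimiters completely: it validates nothing beyond the scheme, so a quoted phrase such as "note: this" would also be reported.

```python
import re

# Scheme per RFC 3986: a letter followed by letters, digits, "+", "-" or ".",
# terminated by a colon.
_SCHEME = r"[A-Za-z][A-Za-z0-9+.-]*:"

# A URI wrapped in <...> (optionally prefixed with "URL:") or in "...".
_DELIMITED = re.compile(
    r'<(?:URL:)?\s*(' + _SCHEME + r'[^<>]*)>'
    r'|"(' + _SCHEME + r'[^"]*)"'
)

def extract_delimited(text):
    """Return URIs enclosed in angle brackets or double quotes (hypothetical helper)."""
    uris = []
    for match in _DELIMITED.finditer(text):
        raw = match.group(1) or match.group(2)
        # Whitespace inside the delimiters is treated as line wrapping and dropped.
        uris.append(re.sub(r"\s+", "", raw))
    return uris
```

This only covers text whose author delimited the URIs explicitly; undelimited URIs, as in the sample text of this task, need the heuristics discussed in the issues listed next.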

Consider the following issues:

  • . , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as delimiters.
  • a user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal in a URI).
  • URIs can use schemes other than http:// or https://
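One possible way to deal with these three issues is sketched below in Python. It is a heuristic under stated assumptions, not a full RFC 3986 parser: any scheme is accepted, non-ASCII characters are kept, and trailing punctuation or an unmatched closing parenthesis is assumed to belong to the surrounding prose. The names find_uris and trim_uri are chosen for illustration only. Because every "word:" followed by a non-space character counts as a candidate, text such as "Note:see" is reported as well.

```python
import re

# Candidate: a scheme, a colon, then a run of characters up to whitespace
# or one of the delimiters < > ". Non-ASCII characters are deliberately
# accepted, since users paste them from the location bar.
_CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"]+')

# Legal in a URI, but at the very end they are more likely prose punctuation.
_TRAILING = ".,;'?!"

def trim_uri(candidate):
    """Drop trailing characters that probably belong to the surrounding text."""
    while candidate:
        last = candidate[-1]
        if last in _TRAILING:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # An unmatched ")" most likely closes a parenthesis in the prose.
            candidate = candidate[:-1]
        else:
            break
    return candidate

def find_uris(text):
    """Return a list of URI-like substrings found in text."""
    uris = []
    for match in _CANDIDATE.finditer(text):
        uri = trim_uri(match.group(0))
        # Skip candidates with nothing left after the scheme's colon.
        if not uri.endswith(":"):
            uris.append(uri)
    return uris
```

The parenthesis rule keeps a closing parenthesis only while the candidate contains a matching opening one, so a Wikipedia-style title in parentheses stays intact while the ")" in "(see http://example.com)" is dropped. A URI that genuinely ends in "?" or "." would be shortened by this sketch.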

Sample text:

this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org).

Regular expressions are a fine way to solve the task, but alternative approaches are welcome too. (Otherwise, this task would degrade into 'how to apply a regular expression'.)

Pike

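<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
    "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org).";

// Return all substrings of uritext that look like "scheme://rest".
// Note: trailing punctuation such as '.' is not stripped.
array find_uris(string uritext)
{
    array uris = ({});
    int pos = 0;
    // Every "://" marks a candidate; pos is the index of its ':'.
    while ((pos = search(uritext, "://", pos+1)) > 0)
    {
        // Scheme length: scan backwards (at most 20 characters) over scheme characters.
        int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
        // Length of the rest: scan forwards up to whitespace or one of < > ".
        int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);
        // A URI wrapped in parentheses: the closing ')' belongs to the text.
        if (pos-prepos > 0 && uritext[pos-prepos-1] == '(' && uritext[pos+postpos+2] == ')')
            postpos--;
        // Skip "://" without a scheme in front or without anything after it.
        if (prepos && postpos > 0)
            uris += ({ uritext[pos-prepos..pos+postpos+2] });
    }
    return uris;
}</lang>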