Find URI in text


Revision as of 02:20, 4 January 2012

Write a function to search plain text for URIs.

The function should return a list of URIs found in the text.

The definition of a URI is given in RFC 3986.

For finding URIs in running text, "Appendix C. Delimiting a URI in Context" is particularly relevant.
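
As an illustration of the delimiting style described in Appendix C, the following Python sketch extracts URIs that the writer has wrapped in angle brackets, optionally with a leading "URL:" prefix, and drops any whitespace embedded by line breaks. The function name `uris_in_angle_brackets` and the sample text are assumptions made for this example; the scheme check is a simple filter for the sketch, not a full RFC 3986 validation.

```python
import re

# Text between "<" and ">", optionally introduced by the older "URL:" prefix.
ANGLE = re.compile(r"<(?:URL:)?\s*([^<>]*)>")
# RFC 3986 scheme syntax: a letter followed by letters, digits, "+", "-" or ".".
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")

def uris_in_angle_brackets(text):
    """Return the angle-bracket-delimited URIs found in text (hypothetical helper)."""
    found = []
    for match in ANGLE.finditer(text):
        # Whitespace inside the brackets is treated as a line-break artifact.
        candidate = re.sub(r"\s+", "", match.group(1))
        if SCHEME.match(candidate):
            found.append(candidate)
    return found
```

This only covers the easy case where the writer has delimited the URI explicitly; the sample text in this task is harder because the URIs are not delimited at all.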

Consider the following issues:

. , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as delimiters.
A user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal in a URI).
URIs can use schemes other than http:// or https://
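
The first two issues can be illustrated with a small Python sketch. It assumes a candidate URI has already been cut out of the text at whitespace. `trim_trailing` is one possible heuristic (the helper names and the inclusion of "!" are assumptions of this example): it drops trailing punctuation and removes a closing parenthesis only when it has no matching opening one. `percent_encode_non_ascii` then converts non-ASCII characters to their UTF-8 percent-encoded form, which is appropriate for the path but not for a non-ASCII host name.

```python
from urllib.parse import quote

TRAILING_PUNCTUATION = ".,;'?!"
# RFC 3986 reserved characters plus "%", which quote() should leave untouched.
KEEP = ":/?#[]@!$&'()*+,;=%"

def trim_trailing(candidate):
    """Strip trailing delimiters that probably belong to the surrounding prose."""
    while candidate:
        last = candidate[-1]
        if last in TRAILING_PUNCTUATION:
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            # Unbalanced closing parenthesis: treat it as a text delimiter.
            candidate = candidate[:-1]
        else:
            break
    return candidate

def percent_encode_non_ascii(uri):
    """Percent-encode characters outside the legal URI set, using UTF-8."""
    return quote(uri, safe=KEEP)
```

Counting parentheses is a deliberate simplification: it keeps the closing parenthesis of `Erich_Kästner_(camera_designer)` but cannot tell whether an unbalanced opening parenthesis belongs to the URI.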

sample text:

this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (Otherwise, this task would degrade into 'finding and applying the best regular expression'.)

Pike

<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
                 "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).\n"
                 "and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)\n";

array find_uris(string uritext) {

   array uris=({}); 
   int pos=0; 
   while((pos = search(uritext, "://", pos+1))>0)
   { 
       int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]); 
       int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);

       if ((<'.',',','?','!',';'>)[uritext[pos+postpos+2]])
           postpos--;
       if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos+2]==')')
           postpos--;
       if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos+2]=='\'')
           postpos--;  
       uris+= ({ uritext[pos-prepos..pos+postpos+2] });
   }
   return uris;

}

find_uris(uritext);
Result: ({ /* 3 elements */

                "http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
                "http://mediawiki.org/)",
                "http://en.wikipedia.org/wiki/-)"
       })</lang>
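
For readers unfamiliar with Pike, the scanning strategy above can be restated as a Python sketch. It looks for each "://", walks backwards over at most 20 scheme characters, walks forwards to the next whitespace, angle bracket or double quote, and then applies the same three trimming rules. This is a loose transliteration written for illustration, not the author's code; the constant names are assumptions of the example.

```python
import string

SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")
STOP_CHARS = set('\n\r\t <>"')

def find_uris(text):
    """Find URIs by locating '://' and expanding in both directions."""
    uris = []
    pos = text.find("://", 1)
    while pos > 0:
        # Walk back over the scheme, at most 20 characters.
        begin = pos
        while begin > 0 and pos - begin < 20 and text[begin - 1] in SCHEME_CHARS:
            begin -= 1
        # Walk forward until a character that cannot be part of the URI.
        end = pos + 3
        while end < len(text) and text[end] not in STOP_CHARS:
            end += 1
        before = text[begin - 1] if begin > 0 else ""
        # Drop one trailing punctuation mark.
        if text[end - 1] in ".,?!;":
            end -= 1
        # Drop a closing parenthesis or quote that pairs with one before the URI.
        if before == "(" and text[end - 1] == ")":
            end -= 1
        if before == "'" and text[end - 1] == "'":
            end -= 1
        uris.append(text[begin:end])
        pos = text.find("://", pos + 1)
    return uris
```

As in the Pike result, the closing parenthesis after `http://mediawiki.org/` is kept, because the rule only fires when the opening parenthesis sits directly in front of the URI.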

TXR

<lang txr>@(define path (path))@\
  @(local x y)@\
  @(cases)@\
    (@(path x))@(path y)@(bind path `(@x)@y`)@\
  @(or)@\
    @{x /[.,;'!?][^ \t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @{x /[^ .,;'!?()\t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @(bind path "")@\
  @(end)@\
@(end)
@(define url (url))@\
  @(local proto domain path)@\
  @{proto /[A-Za-z]+/}://@{domain /[^ \/\t\f\v]+/}@\
  @(cases)/@(path path)@\
    @(bind url `@proto://@domain/@path`)@\
  @(or)@\
    @(bind url `@proto://@domain`)@\
  @(end)@\
@(end)
@(collect)
@  (all)
@line
@  (and)
@     (coll)@(url url)@(end)@(flatten url)
@  (end)
@(end)
@(output)
LINE 
    URLS
----------------------
@  (repeat)
@line
@    (repeat)
    @url
@    (end)
@  (end)
@(end)</lang>

Test file:

$ cat url-data 
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
Confuse the parser: http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)

Run:

$ txr url.txr url-data 
LINE 
    URLS
----------------------
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
    http://mediawiki.org/
Confuse the parser: http://en.wikipedia.org/wiki/-)
    http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    ftp://domain.name/path
leading junk ftp://domain.name/path/embedded?punct/uation.
    ftp://domain.name/path/embedded?punct/uation
leading junk ftp://domain.name/dangling_close_paren)
    ftp://domain.name/dangling_close_paren
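
The grammar used by the TXR solution can be mirrored in a few lines of Python, which may help when comparing the output above with other implementations. The sketch below is an interpretation of that grammar, checked only against the seven sample lines, and the names `match_path` and `find_urls` are assumptions of this example. A path is a sequence of ordinary characters, balanced parenthesised groups, and punctuation marks that are immediately followed by a non-whitespace character.

```python
import re

PUNCT = set(".,;'!?")
SPACE = set(" \t\f\v")
# Scheme, "://", then a domain running up to the first slash or whitespace.
HEAD = re.compile(r"[A-Za-z]+://[^ /\t\f\v]+")

def match_path(line, i):
    """Return the index just past the path that starts at line[i]."""
    while i < len(line):
        c = line[i]
        if c == "(":
            # A parenthesis is accepted only if a matching ")" closes the group.
            j = match_path(line, i + 1)
            if j < len(line) and line[j] == ")":
                i = j + 1
            else:
                return i
        elif c in PUNCT:
            # Punctuation counts only when more URI text follows directly.
            if i + 1 < len(line) and line[i + 1] not in SPACE:
                i += 2
            else:
                return i
        elif c == ")" or c in SPACE:
            return i
        else:
            i += 1
    return i

def find_urls(line):
    """Collect all URLs in a single line of text."""
    urls = []
    pos = 0
    while True:
        m = HEAD.search(line, pos)
        if m is None:
            return urls
        end = m.end()
        if end < len(line) and line[end] == "/":
            end = match_path(line, end + 1)
        urls.append(line[m.start():end])
        pos = end
```

Applied line by line to the test file, this sketch yields the same URLs as the run shown above, including the truncation to `ftp://domain.name/path` for the unbalanced bracket.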