Find URI in text

From Rosetta Code

Revision as of 02:20, 4 January 2012

Find URI in text is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Write a function to search plain text for URIs.

The function should return a list of URIs found in the text.

The definition of a URI is given in RFC 3986.
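
As a rough illustration of that definition, RFC 3986 (Appendix B) gives a regular expression for splitting an already-isolated URI reference into its components. The sketch below wraps it in a helper; the name `split_uri` is made up for this example. Note that this expression only parses a string that is already known to be a URI; it does not locate URIs inside running text.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_PARTS = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper)."""
    m = URI_PARTS.match(uri)  # always matches, since every part is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),      # None when there is no "?"
        "fragment": m.group(9),   # None when there is no "#"
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```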

For searching URIs, "Appendix C. Delimiting a URI in Context" is particularly noteworthy.
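
Appendix C recommends wrapping URIs in angle brackets (or double quotes) when they appear in plain text, mentions the older `URL:` prefix, and says that whitespace added to break a long URI across lines should be ignored on extraction. A minimal sketch of that convention follows; the function name `extract_delimited` is an assumption of this example, and it only handles the angle-bracket form.

```python
import re

# "<...>" with an optional "URL:" prefix, as described in RFC 3986, Appendix C.
ANGLE = re.compile(r'<(?:URL:)?\s*([^<>]*)>')
# A scheme followed by ":" at the start of the candidate.
SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:')

def extract_delimited(text):
    """Return URIs written as <...>, with embedded whitespace removed."""
    uris = []
    for m in ANGLE.finditer(text):
        candidate = re.sub(r'\s+', '', m.group(1))  # drop line breaks and spaces
        if SCHEME.match(candidate):                 # skip things like "<b>"
            uris.append(candidate)
    return uris

print(extract_delimited("found under <http://www.w3.org/Addressing/>, see <b>"))
```

This does not help with the undelimited sample text of this task, where the end of the URI has to be guessed.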

Consider the following issues:

  • . , ; ' ? ( ) are legal characters in a URI, but they are often used in plain text as delimiters.
  • a user may type a URI as seen in the browser location bar, with non-ASCII characters (which are not legal in a URI).
  • URIs can use schemes other than http:// or https://
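
The three issues above can be sketched in Python roughly as follows. This is only one possible heuristic, not a reference solution: it assumes that every URI of interest contains `://`, accepts any scheme, lets non-ASCII characters through, and strips a run of trailing punctuation. The names `find_candidate_uris` and `percent_encode_non_ascii` are invented for this example.

```python
import re
from urllib.parse import quote

# Any scheme (letter, then letters/digits/+/./-), "://", then everything up to
# whitespace, "<", ">" or a double quote. Non-ASCII characters are accepted.
CANDIDATE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*://[^\s<>"]+')
# Legal URI characters that usually act as sentence punctuation at the end.
TRAILING = ".,;'?!"

def find_candidate_uris(text):
    """Return candidate URIs with trailing punctuation removed."""
    return [m.group(0).rstrip(TRAILING) for m in CANDIDATE.finditer(text)]

def percent_encode_non_ascii(uri):
    """Percent-encode (as UTF-8) anything outside the URI character set."""
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%")

print(find_candidate_uris("see http://example.com/a,b. and ftp://host/x?"))
print(percent_encode_non_ascii(
    "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)"))
```

Parentheses are deliberately left alone here, so a closing parenthesis that belongs to the surrounding sentence stays attached to the URI.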

sample text:

this URI contains an illegal character, parentheses and a misplaced full stop:
http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).
and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)
")" is handled the wrong way by the mediawiki parser.
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)
if you have other interesting URIs for testing, please add them here:

Regular expressions to solve the task are fine, but alternative approaches are welcome too. (Otherwise, this task would degrade into 'finding and applying the best regular expression'.)
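
As one example of such an alternative, here is a small hand-written scanner in Python that uses no regular expressions. It is a sketch under stated assumptions: URIs are located by searching for `://`, a closing parenthesis with no matching opening parenthesis inside the URI ends it, and trailing punctuation is stripped. The name `scan_uris` is hypothetical.

```python
import string

SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")
STOP_CHARS = set('<>"')
TRAILING = ".,;'?!"

def scan_uris(text):
    """Find URIs by scanning around each "://" and tracking parentheses."""
    uris = []
    pos = text.find("://")
    while pos != -1:
        # Walk backwards over scheme characters.
        start = pos
        while start > 0 and text[start - 1] in SCHEME_CHARS:
            start -= 1
        # A scheme has to begin with a letter.
        while start < pos and text[start] not in string.ascii_letters:
            start += 1
        # Walk forwards until whitespace, a stop character, or an unmatched ")".
        end = pos + 3
        depth = 0
        while end < len(text) and not text[end].isspace() and text[end] not in STOP_CHARS:
            if text[end] == "(":
                depth += 1
            elif text[end] == ")":
                if depth == 0:
                    break
                depth -= 1
            end += 1
        uri = text[start:end].rstrip(TRAILING)
        # Keep it only if there is a scheme and something after "://".
        if start < pos and len(uri) > pos - start + 3:
            uris.append(uri)
        pos = text.find("://", end)
    return uris

print(scan_uris("(handled by http://mediawiki.org/). ftp://domain.name/path(balanced_brackets)/ending.in.dot."))
```

An unclosed opening parenthesis is not treated specially by this sketch; the scan simply continues to the next whitespace.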

Pike

<lang Pike>string uritext = "this URI contains an illegal character, parentheses and a misplaced full stop:\n"
                 "http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (which is handled by http://mediawiki.org/).\n"
                 "and another one just to confuse the parser: http://en.wikipedia.org/wiki/-)\n";

array find_uris(string uritext)
{
    array uris=({});
    int pos=0;
    while((pos = search(uritext, "://", pos+1))>0)
    {
        // length of the scheme, scanned backwards from "://"
        int prepos = sizeof(array_sscanf(reverse(uritext[pos-20..pos-1]), "%[a-zA-Z0-9+.-]%s")[0]);
        // length of the rest, up to whitespace, '<', '>' or '"'
        int postpos = sizeof(array_sscanf(uritext[pos+3..], "%[^\n\r\t <>\"]%s")[0]);
        // drop one trailing punctuation character
        if ((<'.',',','?','!',';'>)[uritext[pos+postpos+2]])
            postpos--;
        // drop a closing parenthesis if the URI is preceded by an opening one
        if (uritext[pos-prepos-1]=='(' && uritext[pos+postpos+2]==')')
            postpos--;
        // drop a closing single quote if the URI is preceded by one
        if (uritext[pos-prepos-1]=='\'' && uritext[pos+postpos+2]=='\'')
            postpos--;
        uris+= ({ uritext[pos-prepos..pos+postpos+2] });
    }
    return uris;
}

find_uris(uritext);
Result: ({ /* 3 elements */
                "http://en.wikipedia.org/wiki/Erich_K\303\244stner_(camera_designer)",
                "http://mediawiki.org/)",
                "http://en.wikipedia.org/wiki/-)"
       })</lang>

TXR

<lang txr>@(define path (path))@\
  @(local x y)@\
  @(cases)@\
    (@(path x))@(path y)@(bind path `(@x)@y`)@\
  @(or)@\
    @{x /[.,;'!?][^ \t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @{x /[^ .,;'!?()\t\f\v]/}@(path y)@(bind path `@x@y`)@\
  @(or)@\
    @(bind path "")@\
  @(end)@\
@(end)
@(define url (url))@\
  @(local proto domain path)@\
  @{proto /[A-Za-z]+/}://@{domain /[^ \/\t\f\v]+/}@\
  @(cases)/@(path path)@\
    @(bind url `@proto://@domain/@path`)@\
  @(or)@\
    @(bind url `@proto://@domain`)@\
  @(end)@\
@(end)
@(collect)
@  (all)
@line
@  (and)
@     (coll)@(url url)@(end)@(flatten url)
@  (end)
@(end)
@(output)
LINE 
    URLS
----------------------
@  (repeat)
@line
@    (repeat)
    @url
@    (end)
@  (end)
@(end)</lang>

Test file:

$ cat url-data 
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
Confuse the parser: http://en.wikipedia.org/wiki/-)
ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
leading junk ftp://domain.name/path/embedded?punct/uation.
leading junk ftp://domain.name/dangling_close_paren)

Run:

$ txr url.txr url-data 
LINE 
    URLS
----------------------
Blah blah http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer). (Handled by http://mediawiki.org/).
    http://en.wikipedia.org/wiki/Erich_Kästner_(camera_designer)
    http://mediawiki.org/
Confuse the parser: http://en.wikipedia.org/wiki/-)
    http://en.wikipedia.org/wiki/-
ftp://domain.name/path(balanced_brackets)/foo.html
    ftp://domain.name/path(balanced_brackets)/foo.html
ftp://domain.name/path(balanced_brackets)/ending.in.dot.
    ftp://domain.name/path(balanced_brackets)/ending.in.dot
ftp://domain.name/path(unbalanced_brackets/ending.in.dot.
    ftp://domain.name/path
leading junk ftp://domain.name/path/embedded?punct/uation.
    ftp://domain.name/path/embedded?punct/uation
leading junk ftp://domain.name/dangling_close_paren)
    ftp://domain.name/dangling_close_paren