Web scraping

From Rosetta Code

Create a program that downloads the time from this URL: http://tycho.usno.navy.mil/cgi-bin/timer.pl and then prints the current UTC time by extracting just the UTC time from the web page's HTML.

Only use libraries that come at no extra monetary cost with the programming language and that are widely available and popular, such as CPAN for Perl or Boost for C++.

Java

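<java>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebTime {
    public static void main(String[] args) {
        try {
            URL address = new URL("http://tycho.usno.navy.mil/cgi-bin/timer.pl");
            URLConnection conn = address.openConnection();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                // Skip lines until one mentions UTC; stop at end of stream
                // instead of dereferencing a null line
                while ((line = in.readLine()) != null && !line.contains("UTC")) {
                    // keep reading
                }
                if (line != null && line.length() > 4) {
                    // Drop the 4-character markup prefix before the timestamp
                    System.out.println(line.substring(4));
                } else {
                    System.err.println("no UTC line found in the page.");
                }
            }
        } catch (IOException e) {
            System.err.println("error connecting to server.");
            e.printStackTrace();
        }
    }
}</java>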

OCaml

<ocaml>let () =
  let status, header, page_content =
    make_request ~url:Sys.argv.(1) ~kind:GET ()
  in
  print_endline status;
  List.iter print_endline header;
  print_endline page_content
</ocaml>

There are libraries for this, but it is instructive to see how to achieve it with a plain socket; see the implementation of the function make_request used above on this page.


Perl

<perl>use LWP::UserAgent;

my $url = 'http://tycho.usno.navy.mil/cgi-bin/timer.pl';
LWP::UserAgent->new->get($url)->content() =~ /<BR>(.+? UTC)/
    and print "$1\n";</perl>

Python

<python>import urllib

page = urllib.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl')
for line in page:
    if ' UTC\n' in line:
        print line.strip()[4:]
        break
page.close()</python>

Sample output:

Aug. 20, 19:50:38 UTC
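
The Python example above targets Python 2, where `urllib.urlopen` and the `print` statement exist. A minimal Python 3 sketch of the same idea follows. It assumes the page marks the timestamp line as `<BR>Aug. 20, 19:50:38 UTC`, possibly followed by more text, which is inferred from the sample output and the four-character offset used above rather than confirmed. The names `extract_utc` and `fetch_utc` are hypothetical helpers introduced here for illustration.

```python
import re
import urllib.request

URL = "http://tycho.usno.navy.mil/cgi-bin/timer.pl"

# Assumed line shape: "<BR>Aug. 20, 19:50:38 UTC" with optional trailing text.
UTC_LINE = re.compile(r"<BR>(.+? UTC)\b", re.IGNORECASE)


def extract_utc(html):
    """Return the first '... UTC' timestamp found in html, or None."""
    match = UTC_LINE.search(html)
    return match.group(1) if match else None


def fetch_utc(url=URL, timeout=10):
    """Download the page and return its UTC timestamp, or None if absent."""
    with urllib.request.urlopen(url, timeout=timeout) as page:
        # The page is assumed to be plain ASCII; undecodable bytes are replaced.
        html = page.read().decode("ascii", errors="replace")
    return extract_utc(html)
```

Keeping the extraction in its own function lets it be checked against a saved copy of the page without a network connection; `print(fetch_utc())` would perform the actual download.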