Web Scraping

From Rosetta Code

Programming Task
This is a programming task. It lays out a problem which Rosetta Code users are encouraged to solve, using languages they know.

Code examples should be formatted along the lines of one of the existing prototypes.

Create a program that downloads the time from this URL: http://tycho.usno.navy.mil/cgi-bin/timer.pl and then prints the current UTC time by extracting just the UTC time from the web page's HTML.

If possible, only use libraries that come at no extra monetary cost with the programming language and that are widely available and popular such as CPAN for Perl or Boost for C++.
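The extraction step on its own can be sketched without any network access. The example below is a minimal sketch in Python: the sample markup in SAMPLE_HTML and the helper name extract_utc are assumptions made for illustration, and the real page may be laid out differently.

```python
import re

# Hypothetical sample of the page body; the real markup may differ.
SAMPLE_HTML = """<TITLE>What time is it?</TITLE>
<H2> US Naval Observatory Master Clock Time</H2> <H3><PRE>
<BR>Aug. 20, 19:50:38 UTC
<BR>Aug. 20, 03:50:38 PM EDT
</PRE></H3>
"""


def extract_utc(html):
    """Return the text between a <BR> and the first ' UTC' after it, or None."""
    # The lazy .+? stops at the first " UTC" and never crosses a newline.
    match = re.search(r"<BR>(.+? UTC)", html)
    return match.group(1) if match else None


print(extract_utc(SAMPLE_HTML))
```

Keeping the extraction in a function that takes a string means it can be checked against saved copies of the page before it is wired to a download.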


AWK

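This is inspired by the GETURL example in the gawk manual. It relies on gawk's TCP/IP networking (the /inet special files and the |& coprocess operator), so it must be run with gawk rather than another awk.

#! /usr/bin/gawk -f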

BEGIN {
  purl = "/inet/tcp/0/tycho.usno.navy.mil/80"
  ORS = RS = "\r\n\r\n"
  print "GET /cgi-bin/timer.pl HTTP/1.0" |& purl
  purl |& getline header
  while ( (purl |& getline ) > 0 )
  {
     split($0, a, "\n")
     for(i=1; i <= length(a); i++)
     {
        if ( a[i] ~ /UTC/ )
        {
          sub(/^<BR>/, "", a[i])
          printf "%s\n", a[i]
        }
     }
  }
  close(purl)
}

Java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
 
 
public class WebTime{
	public static void main(String[] args){
		try{
			URL address = new URL(
					"http://tycho.usno.navy.mil/cgi-bin/timer.pl");
			URLConnection conn = address.openConnection();
			BufferedReader in = new BufferedReader(
					new InputStreamReader(conn.getInputStream()));
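			String line;
			// Stop at the first line mentioning UTC, or at end of input,
			// where readLine() returns null.
			while((line = in.readLine()) != null && !line.contains("UTC"));
			if(line != null){
				// Drop the leading "<BR>" (4 characters).
				System.out.println(line.substring(4));
			}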
		}catch(IOException e){
			System.err.println("error connecting to server.");
			e.printStackTrace();
		}
	}
}
 

OCaml

let () =
  let _,_, page_content = make_request ~url:Sys.argv.(1) ~kind:GET () in
 
  let lines = Str.split (Str.regexp "\n") page_content in
  let str =
    List.find
      (fun line ->
        try ignore(Str.search_forward (Str.regexp "UTC") line 0); true
        with Not_found -> false)
      lines
  in
  let str = Str.global_replace (Str.regexp "<BR>") "" str in
  print_endline str;
;;

The URL to fetch is given as the first command-line argument. There are libraries for fetching a page, but it is rather interesting to see how a socket can be used to achieve this, so see the implementation of the make_request function used above on this page.
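To illustrate what fetching a page over a plain socket involves, here is a minimal sketch in Python. It is not the OCaml make_request implementation: the function names, the HTTP/1.0 request format, and the three-part return value (status line, headers, body) are assumptions chosen for illustration.

```python
import socket


def build_request(host, path):
    """Build a minimal HTTP/1.0 GET request as bytes."""
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status_line, header_lines, body)."""
    # Headers end at the first blank line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    return lines[0], lines[1:], body.decode("latin-1")


def make_request(host, path, port=80):
    """Send the request over TCP and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has closed the connection
                break
            chunks.append(data)
    return split_response(b"".join(chunks))
```

A call such as make_request("tycho.usno.navy.mil", "/cgi-bin/timer.pl") would return the page body as the third element, which can then be searched for the UTC line. HTTP/1.0 is used so that the server closes the connection when the response is complete, which keeps the read loop simple.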

Perl

use LWP::Simple;
 
my $url = 'http://tycho.usno.navy.mil/cgi-bin/timer.pl';
get($url) =~ /<BR>(.+? UTC)/
    and print "$1\n";

Python

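import urllib.request
 
# In Python 3, urlopen yields bytes, so each line is decoded before matching.
page = urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl')
for raw in page:
    line = raw.decode('latin-1')
    if ' UTC\n' in line:
        # Drop the leading "<BR>" (4 characters).
        print(line.strip()[4:])
        break
page.close()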

Sample output:

Aug. 20, 19:50:38 UTC


Ruby

require "open-uri"
 
# URI.open is needed on Ruby 3.0 and later, where Kernel#open no longer handles URLs.
URI.open('http://tycho.usno.navy.mil/cgi-bin/timer.pl') do |p|
    p.each_line do |line|
      if line =~ /UTC\n/
        puts line[4..-1]
        break
      end
    end
end

UNIX Shell

Works with: Bourne Again SHell

This solution uses curl, which is free to download (and very easy to install on GNU/Linux systems), together with utility programs such as grep and sed that are popular at least in the GNU and *n*x world.

#! /bin/bash
curl -s http://tycho.usno.navy.mil/cgi-bin/timer.pl |
   grep 'UTC$' |
   sed -e 's/^<BR>//'