Web scraping: Difference between revisions
Revision as of 04:22, 7 December 2008
You are encouraged to solve this task according to the task description, using any language you may know.
Create a program that downloads the time from this URL: http://tycho.usno.navy.mil/cgi-bin/timer.pl and then prints the current UTC time by extracting just the UTC time from the web page's HTML.
If possible, please only use libraries that come at no extra monetary cost or difficulty.
AWK
This is inspired by the GETURL example in the gawk manual.
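#! /usr/bin/awk -f
BEGIN {
  # gawk-specific special file name: opens a TCP connection to port 80
  purl = "/inet/tcp/0/tycho.usno.navy.mil/80"
  # A blank line ends the request (ORS) and separates header from body (RS)
  ORS = RS = "\r\n\r\n"
  print "GET /cgi-bin/timer.pl HTTP/1.0" |& purl
  purl |& getline header    # first record is the HTTP response header
  while ( (purl |& getline ) > 0 )
  {
    split($0, a, "\n")
    for(i=1; i <= length(a); i++)
    {
      if ( a[i] ~ /UTC/ )
      {
        sub(/^<BR>/, "", a[i])
        printf "%s\n", a[i]
      }
    }
  }
  close(purl)
}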
Java
<java>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebTime{
  public static void main(String[] args){
    try{
      URL address = new URL(
          "http://tycho.usno.navy.mil/cgi-bin/timer.pl");
      URLConnection conn = address.openConnection();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream()));
      String line;
      // Stop at the first line containing "UTC", or at end of input.
      while((line = in.readLine()) != null && !line.contains("UTC"));
      if(line != null){
        System.out.println(line.substring(4)); // drop the leading "<BR>"
      }
      in.close();
    }catch(IOException e){
      System.err.println("error connecting to server.");
      e.printStackTrace();
    }
  }
}
</java>
OCaml
<ocaml>let () =
  let _,_, page_content = make_request ~url:Sys.argv.(1) ~kind:GET () in
  let lines = Str.split (Str.regexp "\n") page_content in
  let str =
    List.find
      (fun line ->
        try ignore(Str.search_forward (Str.regexp "UTC") line 0); true
        with Not_found -> false)
      lines
  in
  let str = Str.global_replace (Str.regexp "<BR>") "" str in
  print_endline str
</ocaml>
There are libraries for this, but it is rather interesting to see how to use a socket to achieve it, so see the implementation of the above function make_request on this page.
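The socket approach mentioned here can be illustrated independently of OCaml. The following is a minimal sketch in Python, not the make_request implementation referred to above: the function names `build_request`, `split_response` and `http_get` are hypothetical, and it assumes a plain HTTP/1.0 exchange in which the server closes the connection once the response has been sent.

```python
import socket


def build_request(host, path):
    """Compose a minimal HTTP/1.0 GET request as bytes."""
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (header, body) at the first blank line."""
    header, _, body = raw.partition(b"\r\n\r\n")
    return header.decode("latin-1"), body.decode("latin-1")


def http_get(host, path, port=80):
    """Send the request over a TCP socket and return (header, body)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # assumed HTTP/1.0 behaviour: server closes when done
                break
            chunks.append(data)
    return split_response(b"".join(chunks))
```

With network access, `http_get("tycho.usno.navy.mil", "/cgi-bin/timer.pl")` would return the header and the page body, and the body can then be searched for the UTC line exactly as in the other solutions.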
Perl
<perl>use LWP::UserAgent;

my $url = 'http://tycho.usno.navy.mil/cgi-bin/timer.pl';
LWP::UserAgent->new->get($url)->content() =~ /<BR>(.+? UTC)/
    and print "$1\n";</perl>
Python
<python>import urllib
page = urllib.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl')
for line in page:
    if ' UTC\n' in line:
        print line.strip()[4:]
        break
page.close()</python>
Sample output:
Aug. 20, 19:50:38 UTC
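The Python solution above is written for Python 2 (`urllib.urlopen` and the `print` statement). A possible Python 3 variant is sketched below. It assumes the page still places the UTC time on a line that starts with `<BR>` and ends in ` UTC`; the helper names `extract_utc` and `fetch_utc` and the `SAMPLE` fragment are made up for illustration and are not real server output.

```python
import urllib.request

URL = "http://tycho.usno.navy.mil/cgi-bin/timer.pl"  # URL from the task description


def extract_utc(html):
    """Return the first line ending in ' UTC' without its leading <BR>, or None."""
    for line in html.splitlines():
        line = line.strip()
        if line.endswith(" UTC"):
            # Assumed page layout: each time line is prefixed with <BR>.
            if line.upper().startswith("<BR>"):
                line = line[4:]
            return line
    return None


def fetch_utc(url=URL):
    """Download the page and extract the UTC line (needs network access)."""
    with urllib.request.urlopen(url) as page:
        return extract_utc(page.read().decode("latin-1"))


# Offline demonstration on a made-up page fragment.
SAMPLE = (
    "<TITLE>What time is it?</TITLE>\n"
    "<BR>Aug. 20, 19:50:38 UTC\n"
    "<BR>Aug. 20, 03:50:38 PM EDT\n"
)
print(extract_utc(SAMPLE))
```

Keeping the extraction separate from the download means the parsing can be checked without contacting the server; `fetch_utc()` is only needed for the live page.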
Ruby
require "open-uri"
open('http://tycho.usno.navy.mil/cgi-bin/timer.pl') do |p|
  p.each_line do |line|
    if line =~ /UTC\n/
      puts line[4,line.length]
      break
    end
  end
end
UNIX Shell
This solution uses curl, which can be downloaded for free (and installed very easily on GNU/Linux systems), and utility programs such as grep and sed, which are popular at least in the GNU and *n*x world.
#! /bin/bash
curl -s http://tycho.usno.navy.mil/cgi-bin/timer.pl |
grep 'UTC$' |
sed -e 's/^<BR>//'