Web scraping: Difference between revisions
Revision as of 15:57, 3 February 2009
You are encouraged to solve this task according to the task description, using any language you may know.
Create a program that downloads the time from this URL: http://tycho.usno.navy.mil/cgi-bin/timer.pl and then prints the current UTC time by extracting just the UTC time from the web page's HTML.
If possible, only use libraries that come at no extra monetary cost with the programming language and that are widely available and popular such as CPAN for Perl or Boost for C++.
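The extraction step of the task can be tried offline before any download code is written. The following is a minimal sketch, not a reference solution: <tt>SAMPLE_HTML</tt> is a hypothetical fragment modelled on what the solutions on this page assume (a line that starts with <tt>&lt;BR&gt;</tt> and ends with <tt>UTC</tt>), and the names <tt>UTC_LINE</tt> and <tt>extract_utc</tt> are made up for illustration.

```python
import re

# Hypothetical sample of the page body; the real markup may differ.
SAMPLE_HTML = """<html><body>
<H3>US Naval Observatory Master Clock Time</H3>
<BR>Aug. 20, 19:50:38 UTC
<BR>Aug. 20, 03:50:38 PM EDT
</body></html>
"""

# A line that starts with <BR> and ends with " UTC"; group 1 is the time.
UTC_LINE = re.compile(r"^<BR>(.+? UTC)\s*$", re.MULTILINE)


def extract_utc(html):
    """Return the UTC time string found in html, or None if absent."""
    match = UTC_LINE.search(html)
    return match.group(1) if match else None


print(extract_utc(SAMPLE_HTML))
```

Returning <tt>None</tt> when no line matches lets the caller decide how to report a changed page layout.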
AWK
This is inspired by the [http://www.gnu.org/software/gawk/manual/gawkinet/html_node/GETURL.html#GETURL GETURL] example in the gawk manual.
<pre>#! /usr/bin/awk -f

BEGIN {
  purl = "/inet/tcp/0/tycho.usno.navy.mil/80"
  ORS = RS = "\r\n\r\n"
  print "GET /cgi-bin/timer.pl HTTP/1.0" |& purl
  purl |& getline header
  while ( (purl |& getline ) > 0 )
  {
     split($0, a, "\n")
     for(i=1; i <= length(a); i++) {
        if ( a[i] ~ /UTC/ ) {
          sub(/^<BR>/, "", a[i])
          printf "%s\n", a[i]
        }
     }
  }
  close(purl)
}
</pre>
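The AWK script talks HTTP directly: it sends a request terminated by a blank line, reads the header up to the first <tt>\r\n\r\n</tt>, and scans the rest for the UTC line. The same exchange can be sketched in Python with a plain socket. This is an illustration under assumptions, not a port of the script: the helper names are invented here, the <tt>Host</tt> header is an addition that the AWK version does not send, and <tt>latin-1</tt> is an assumed encoding.

```python
import socket

HOST = "tycho.usno.navy.mil"  # host taken from the task description
PATH = "/cgi-bin/timer.pl"


def build_request(host, path):
    """Build a minimal HTTP/1.0 GET request; a blank line ends the headers."""
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (header, body) at the first blank line."""
    header, _, body = raw.partition(b"\r\n\r\n")
    return header.decode("latin-1"), body.decode("latin-1")


def utc_lines(body):
    """Return the body lines that mention UTC, with a leading <BR> removed."""
    result = []
    for line in body.splitlines():
        if "UTC" in line:
            result.append(line[4:] if line.startswith("<BR>") else line)
    return result


def fetch_raw(host, path, port=80):
    """Send the request over TCP and read until the server closes the socket."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

With network access, <tt>utc_lines(split_response(fetch_raw(HOST, PATH))[1])</tt> would chain the three steps; the parsing helpers can be exercised on a canned response without any connection.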
Java
<lang java>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class WebTime{
  public static void main(String[] args){
    try{
      URL address = new URL(
          "http://tycho.usno.navy.mil/cgi-bin/timer.pl");
      URLConnection conn = address.openConnection();
      // try-with-resources closes the reader even if reading fails
      try(BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream()))){
        String line;
        // stop at end of stream (null) instead of throwing a NullPointerException
        while((line = in.readLine()) != null){
          if(line.contains("UTC")){
            // drop the leading "<BR>" tag (4 characters)
            System.out.println(line.substring(4));
            break;
          }
        }
      }
    }catch(IOException e){
      System.err.println("error connecting to server.");
      e.printStackTrace();
    }
  }
}
</lang>
OCaml
<lang ocaml>let () =
  let _,_, page_content = make_request ~url:Sys.argv.(1) ~kind:GET () in

  let lines = Str.split (Str.regexp "\n") page_content in
  let str =
    List.find
      (fun line ->
        try ignore(Str.search_forward (Str.regexp "UTC") line 0); true
        with Not_found -> false)
      lines
  in
  let str = Str.global_replace (Str.regexp "<BR>") "" str in
  print_endline str;
;;</lang>
Libraries exist for this, but it is instructive to see how a plain socket can do the job; the implementation of the <tt>make_request</tt> function used above is on [[Web_Scraping/OCaml|this page]].
Perl
<lang perl>use LWP::Simple;

my $url = 'http://tycho.usno.navy.mil/cgi-bin/timer.pl';
get($url) =~ /<BR>(.+? UTC)/
    and print "$1\n";</lang>
Python
<lang python>import urllib
page = urllib.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl')
for line in page:
    if ' UTC\n' in line:
        print line.strip()[4:]
        break
page.close()</lang>
Sample output:
<pre>Aug. 20, 19:50:38 UTC</pre>
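The solution above is written for Python 2. In Python 3, <tt>urlopen</tt> lives in <tt>urllib.request</tt> and the response yields bytes, so each line has to be decoded first. A possible adaptation is sketched below; the function names, the <tt>latin-1</tt> encoding, and the 10-second timeout are assumptions made for this sketch, and it presumes the page still uses the line format the original relies on.

```python
from urllib.request import urlopen

URL = 'http://tycho.usno.navy.mil/cgi-bin/timer.pl'


def first_utc(lines, encoding='latin-1'):
    """Return the first line ending in ' UTC', minus its 4-character <BR> prefix."""
    for raw in lines:
        # HTTP responses yield bytes; plain strings are accepted for testing.
        line = raw.decode(encoding) if isinstance(raw, bytes) else raw
        if line.rstrip('\r\n').endswith(' UTC'):
            return line.strip()[4:]
    return None


def fetch_utc(url=URL):
    """Download the page and return its UTC line (needs network access)."""
    with urlopen(url, timeout=10) as page:
        return first_utc(page)
```

Calling <tt>print(fetch_utc())</tt> would perform the download. Keeping the line filter in <tt>first_utc</tt> separate from the network call means it can be checked against canned input.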
Ruby
<lang ruby>require "open-uri"

open('http://tycho.usno.navy.mil/cgi-bin/timer.pl') do |p|
  p.each_line do |line|
    if line =~ /UTC\n/
      puts line[4..-1]
      break
    end
  end
end</lang>
UNIX Shell
This solution uses curl, which can be downloaded for free (and very easily on GNU/Linux systems), together with utility programs such as grep and sed that are popular at least in the GNU and *n*x world.
<pre>#! /bin/bash
curl -s http://tycho.usno.navy.mil/cgi-bin/timer.pl |
   grep 'UTC$' |
   sed -e 's/^<BR>//'
</pre>