WiktionaryDumps to words

From Rosetta Code

Revision as of 05:18, 27 December 2020

WiktionaryDumps to words is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found on its talk page.
NOTE
Please help address the issues with this task on the discussion page. If you add another language, be aware that the task may change in the future, and that you will need to update your example.

Use the Wiktionary dump (input) to create a file equivalent to "/usr/share/dict/french" (output). The dump is a large bzip2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain-text file containing one French word per line. It is available on Ubuntu in the wfrench package.
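The solutions below print page titles in dump order, whereas a dictionary file lists one word per line. As a rough sketch of the final formatting step, the following Java snippet sorts the titles and drops duplicates. The class and method names are hypothetical, and it assumes (the task does not say so) that titles containing a space or a namespace colon should be discarded and that plain code-point ordering is acceptable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class DictFormat {

    // Hypothetical helper: keep single-word titles, drop duplicates, sort.
    static List<String> toWordList(Collection<String> titles) {
        // TreeSet removes duplicates and keeps natural String order.
        TreeSet<String> words = new TreeSet<>();
        for (String t : titles) {
            String w = t.trim();
            // Assumption: phrases and namespaced pages are not dictionary words.
            if (!w.isEmpty() && !w.contains(" ") && !w.contains(":")) {
                words.add(w);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        List<String> sample =
            Arrays.asList("voir", "quelque chose", "livrer", "voir", "obiit");
        // One word per line, as in a /usr/share/dict style file.
        for (String w : toWordList(sample)) {
            System.out.println(w);
        }
    }
}
```

With the sample list above, this prints livrer, obiit and voir, one per line. Natural String order places accented letters after "z", so a locale-aware java.text.Collator may be preferable for French.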

Java

<lang java>import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class MyHandler extends DefaultHandler {

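   private static final String TITLE = "title";
   private static final String TEXT = "text";
   // Compiled once; DOTALL lets '.' match newlines, so the heading is found anywhere in the page text.
   private static final Pattern FRENCH = Pattern.compile(".*==French==.*", Pattern.DOTALL);
   private String lastTag = "";
   private String title = "";
   // SAX may deliver a single text node in several characters() calls, so buffer the pieces.
   private final StringBuilder buffer = new StringBuilder();
   @Override
   public void characters(char[] ch, int start, int length) throws SAXException {
       if (lastTag.equals(TITLE) || lastTag.equals(TEXT)) {
           buffer.append(ch, start, length);
       }
   }
   @Override
   public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
       lastTag = qName;
       buffer.setLength(0);
   }
   @Override
   public void endElement(String uri, String localName, String qName) throws SAXException {
       // The complete content of the element is only known once its end tag is reached.
       switch (lastTag) {
           case TITLE:
               title = buffer.toString();
               break;
           case TEXT:
               Matcher mat = FRENCH.matcher(buffer);
               if (mat.matches()) {
                   System.out.println(title);
               }
               break;
       }
       lastTag = "";
       buffer.setLength(0);
   }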

}

public class WiktoWords {

   public static void main(java.lang.String[] args) {
       try {
           SAXParserFactory spFactory = SAXParserFactory.newInstance();
           SAXParser saxParser = spFactory.newSAXParser();
           MyHandler handler = new MyHandler();
           saxParser.parse(new InputSource(System.in), handler);
       } catch(Exception e) {
           // Report the failure instead of exiting silently.
           System.err.println(e);
           System.exit(1);
       }
   }

}</lang>

Output:
$ javac WiktoWords.java
$ wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
    java WiktoWords 
hélice
pingouin
égoïsme
écholocation
nitroglycérine
croque-mitaine
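
To try the SAX approach without downloading the full dump, the same idea can be exercised on a small in-memory document. The sketch below is illustrative only: the class names and the two-page sample XML are made up, and the sample imitates just the title and text elements of a dump page. It buffers character data until the end tag, since a SAX parser is allowed to deliver one text node in several chunks.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SampleDumpCheck {

    // Collects titles of pages whose <text> contains the ==French== heading.
    static class Collector extends DefaultHandler {
        final List<String> words = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();
        private String title = "";

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            buffer.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A text node may arrive in several chunks, so accumulate it.
            buffer.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                title = buffer.toString();
            } else if (qName.equals("text") && buffer.indexOf("==French==") >= 0) {
                words.add(title);
            }
            buffer.setLength(0);
        }
    }

    // Hypothetical helper: parse an XML string and return the matching titles.
    static List<String> frenchTitles(String xml) {
        try {
            Collector c = new Collector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), c);
            return c.words;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample, not taken from the real dump.
        String xml =
            "<mediawiki>"
            + "<page><title>voir</title><revision><text>==French==\n===Verb===</text></revision></page>"
            + "<page><title>house</title><revision><text>==English==\n===Noun===</text></revision></page>"
            + "</mediawiki>";
        System.out.println(frenchTitles(xml)); // prints [voir]
    }
}
```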

OCaml

Using the library xmlm:

<lang ocaml>let () =

 let i = Xmlm.make_input ~strip:true (`Channel stdin) in
 let title = ref "" in
 let tag_path = ref [] in
 let push_tag tag =
   tag_path := tag :: !tag_path
 in
 let pop_tag () =
   match !tag_path with [] -> ()
   | _ :: tl -> tag_path := tl
 in
 let last_tag_is tag =
   match !tag_path with [] -> false
   | hd :: _ -> hd = tag
 in
 let reg = Str.regexp_string "==French==" in
 let matches s =
   try let _ = Str.search_forward reg s 0 in true
   with Not_found -> false
 in
 while not (Xmlm.eoi i) do
   match Xmlm.input i with
   | `Dtd _ -> ()
   | `El_start ((_, tag_name), _) -> push_tag tag_name
   | `El_end -> pop_tag ()
   | `Data s ->
       if last_tag_is "title"
       then title := s;
       if last_tag_is "text"
       then begin
         if matches s
         then print_endline !title
       end
 done</lang>
Output:
$ wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
  ocaml str.cma -I $(ocamlfind query xmlm) xmlm.cma to_words.ml
livrer
observateur
qui a bu boira
quelque chose
grande parure
obiit
pleuvoir
voir
...