WiktionaryDumps to words

Revision as of 04:42, 27 December 2020 by rosettacode>Blue Prawn (→‎{{header|Java}}: Fixed regex using Pattern.DOTALL)

Use the Wiktionary dump (input) to create a file equivalent to "/usr/share/dict/french" (output). The dump is a large bz2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain text file containing one French word per line. On Ubuntu it is provided by the package wfrench.
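
The target format described above (one word per line) can be sketched in Java as follows. This is only an illustration: the class name WordList and the method toDictFile are hypothetical, and sorting the words and dropping duplicates are assumptions about what "equivalent" means here, not requirements stated by the task.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

public class WordList {

    // Builds dictionary-file content from page titles:
    // unique entries, sorted (assumption), one per line.
    static String toDictFile(Collection<String> titles) {
        SortedSet<String> words = new TreeSet<>();
        for (String t : titles) {
            String w = t.trim();
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(w).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints "livrer" and "voir", each on its own line.
        System.out.print(toDictFile(Arrays.asList("voir", "livrer", "voir")));
    }
}
```

TreeSet orders strings by code unit, which differs from a French locale collation for accented words; a java.text.Collator could be passed to the TreeSet if locale order matters.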

WiktionaryDumps to words is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
NOTE
Please help addressing the issues about this task on the discussion page. If you add another language, be aware that it may change in the future, and that you will need to update your example.

Java

<lang java>import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class MyHandler extends DefaultHandler {

   private static final String TITLE = "title";
   private static final String TEXT = "text";
   // Compiled once; find() locates the heading anywhere in the page text.
   private static final Pattern FRENCH = Pattern.compile("==French==");
   // characters() may be called several times per element, so text is accumulated.
   private final StringBuilder buffer = new StringBuilder();
   private String lastTag = "";
   private String title = "";
   @Override
   public void characters(char[] ch, int start, int length) throws SAXException {
       if (lastTag.equals(TITLE) || lastTag.equals(TEXT)) {
           buffer.append(ch, start, length);
       }
   }
   @Override
   public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
       lastTag = qName;
       buffer.setLength(0);
   }
   @Override
   public void endElement(String uri, String localName, String qName) throws SAXException {
       // The element is complete here, so its whole text is in the buffer.
       switch (qName) {
           case TITLE:
               title = buffer.toString();
               break;
           case TEXT:
               Matcher mat = FRENCH.matcher(buffer);
               if (mat.find()) {
                   System.out.println(title);
               }
               break;
       }
       lastTag = "";
   }

}

public class WiktoWords {

   public static void main(String[] args) {
       try {
           SAXParserFactory spFactory = SAXParserFactory.newInstance();
           SAXParser saxParser = spFactory.newSAXParser();
           MyHandler handler = new MyHandler();
           saxParser.parse(new InputSource(System.in), handler);
       } catch(Exception e) {
           // Report the failure instead of exiting silently.
           e.printStackTrace();
           System.exit(1);
       }
   }

}</lang>

Output:
$ javac WiktoWords.java
$ wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
    java WiktoWords 
hélice
pingouin
égoïsme
écholocation
nitroglycérine
croque-mitaine
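
To try the filter without downloading the full dump, the same SAX approach can be run on a small in-memory document. The sketch below is only an illustration: the class name WiktoWordsSample, the method frenchTitles and the two-page sample are hypothetical, and the sample keeps only the title and text elements that the filter reads. It collects each element's text in a buffer and inspects it when the element closes, since a SAX parser may report one element's text through several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WiktoWordsSample {

    // Hypothetical two-page sample, reduced to the elements the filter reads.
    static final String SAMPLE =
        "<mediawiki>"
        + "<page><title>pingouin</title><revision>"
        + "<text>==French==\n===Noun===</text></revision></page>"
        + "<page><title>penguin</title><revision>"
        + "<text>==English==\n===Noun===</text></revision></page>"
        + "</mediawiki>";

    // Returns the titles of pages whose text contains the "==French==" heading.
    static List<String> frenchTitles(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private String title = "";

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                buf.setLength(0); // start collecting this element's text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buf.append(ch, start, length); // text may arrive in chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    title = buf.toString();
                } else if (qName.equals("text")
                        && buf.indexOf("==French==") >= 0) {
                    found.add(title);
                }
                buf.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(frenchTitles(SAMPLE)); // prints [pingouin]
    }
}
```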

OCaml

Using the library xmlm:

<lang ocaml>let () =

 let i = Xmlm.make_input ~strip:true (`Channel stdin) in
 let title = ref "" in
 let tag_path = ref [] in
 let push_tag tag =
   tag_path := tag :: !tag_path
 in
 let pop_tag () =
   match !tag_path with [] -> ()
   | _ :: tl -> tag_path := tl
 in
 let last_tag_is tag =
   match !tag_path with [] -> false
   | hd :: _ -> hd = tag
 in
 while not (Xmlm.eoi i) do
   match Xmlm.input i with
   | `Dtd dtd -> ()
   | `El_start ((uri, tag_name), attrs) -> push_tag tag_name
   | `El_end -> pop_tag ()
   | `Data s ->
       if last_tag_is "title"
       then title := s;
       if last_tag_is "text"
       then begin
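          let reg = Str.regexp_string "==French==" in
          (* search_forward finds the heading anywhere in the text;
             string_match would only test position 0 *)
          match Str.search_forward reg s 0 with
          | _ -> print_endline !title
          | exception Not_found -> ()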
       end
 done</lang>
Output:
wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
  ocaml str.cma -I $(ocamlfind query xmlm) xmlm.cma to_words.ml
livrer
observateur
qui a bu boira
quelque chose
grande parure
obiit
pleuvoir
voir
...