WiktionaryDumps to words: Difference between revisions

From Rosetta Code
(Too vague)
(Trying to make a better explanation)

;NOTE: Please wait for the issues of the discussion page on the vagueness of the task to be addressed before adding to this task.
;NOTE: Please clarify the explanation of the task if you can.

Use the [https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 wiktionary dump] to create a file equivalent to [http://manpages.ubuntu.com/manpages/bionic/man5/french.5.html "/usr/share/dict/french"]. Demonstrate how your language can handle this dump, which is a large bzip2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain-text file containing one French word per line. It is available in Ubuntu in the package '''wfrench'''.
Use the [https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 wiktionary dump] (input) to create a file equivalent to [http://manpages.ubuntu.com/manpages/bionic/man5/french.5.html "/usr/share/dict/french"] (output). The dump is a large bzip2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain-text file containing one French word per line. It is available in Ubuntu in the package '''wfrench'''.


Revision as of 19:31, 9 December 2020

Please clarify the explanation of the task if you can.

Use the wiktionary dump (input) to create a file equivalent to "/usr/share/dict/french" (output). The dump is a large bzip2-compressed XML file of about 800 MB. The "/usr/share/dict/french" file is a plain-text file containing one French word per line. It is available in Ubuntu in the package wfrench.


<lang ocaml>(* to_words.ml: print the title of every page whose text starts with the French heading *)
let () =
  let i = Xmlm.make_input ~strip:true (`Channel stdin) in
  (* title of the page currently being read *)
  let title = ref "" in
  (* stack of the currently open element names, innermost first *)
  let tag_path = ref [] in
  let push_tag tag =
    tag_path := tag :: !tag_path
  in
  let pop_tag () =
    match !tag_path with [] -> ()
    | _ :: tl -> tag_path := tl
  in
  let last_tag_is tag =
    match !tag_path with [] -> false
    | hd :: _ -> hd = tag
  in
  let reg = Str.regexp_string "==French==" in
  while not (Xmlm.eoi i) do
    match Xmlm.input i with
    | `Dtd _ -> ()
    | `El_start ((_, tag_name), _) -> push_tag tag_name
    | `El_end -> pop_tag ()
    | `Data s ->
        if last_tag_is "title"
        then title := s;
        if last_tag_is "text"
        then begin
          (* string_match is anchored at offset 0, so only pages whose
             text begins with this heading are printed *)
          if Str.string_match reg s 0
          then print_endline !title
        end
  done</lang>
Stream the decompressed dump into the script:
<pre>
wget --quiet https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 -O - | bzcat | \
  ocaml str.cma -I $(ocamlfind query xmlm) xmlm.cma to_words.ml
</pre>
Sample output:
<pre>
qui a bu boira
quelque chose
grande parure
</pre>
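
The <code>Str.string_match reg s 0</code> test in the OCaml solution succeeds only when the page text begins with <code>==French==</code>, so a page whose text opens with another language's section would be skipped even if it also has a French section. The following stdlib-only sketch contrasts the two behaviours. The helper names <code>starts_with_heading</code> and <code>contains_heading</code> and the two sample strings are made up for illustration and are not part of the original solution; whether matching the heading anywhere is the better choice for this task is left open.

```ocaml
(* Hypothetical helpers, not part of the original solution. *)
let heading = "==French=="

(* True when [s] begins with the heading, mirroring the behaviour of
   Str.string_match reg s 0 in the solution. *)
let starts_with_heading s =
  let n = String.length heading in
  String.length s >= n && String.sub s 0 n = heading

(* True when the heading occurs anywhere in [s] (naive scan). *)
let contains_heading s =
  let n = String.length heading and len = String.length s in
  let rec scan i =
    i + n <= len && (String.sub s i n = heading || scan (i + 1))
  in
  scan 0

let () =
  (* Made-up page texts: one French-only, one with English first. *)
  let french_only = "==French==\n===Noun===\n" in
  let shared = "==English==\n===Noun===\n\n==French==\n===Noun===\n" in
  Printf.printf "%b %b\n"
    (starts_with_heading french_only) (contains_heading french_only);
  Printf.printf "%b %b\n"
    (starts_with_heading shared) (contains_heading shared)
```

Run on its own, this prints <code>true true</code> for the French-only text and <code>false true</code> for the text that starts with an English section.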