Talk:WiktionaryDumps to words

From Rosetta Code

Too vague

"Demonstrate how your language can handle this dump"? How?

You need to write a task in which all examples do one shared thing, so that the implementations are comparable across languages. If you mean to highlight one type of XML handling over another, for example, then this doesn't do it. --Paddy3118 (talk) 10:00, 9 December 2020 (UTC)

The task, as explained, is to create a file equivalent to "/usr/share/dict/french" (output), using the wiktionary dump as input. Blue Prawn (talk) 19:27, 9 December 2020 (UTC)
I have no desire to download an 800 megabyte compressed file for a Rosetta Code task that is who-knows-how-large uncompressed. Surely the task doesn't need to use a file that large. --Chunes (talk) 20:41, 9 December 2020 (UTC)
You don't need to do so. Please see the OCaml example, which only downloads the first 1 or 2 megabytes. Blue Prawn (talk) 09:13, 10 December 2020 (UTC)
I would need that explaining to me. How does it quit after 1 or 2 megas and how does it tell wget|bzcat| to quit? --Pete Lomax (talk) 09:57, 10 December 2020 (UTC)
On Linux I just use Ctrl-C to terminate all the commands (all the piped programs are terminated at the same time). On Windows, under Cygwin, I do the same. I think it is the same on macOS too. - Blue Prawn (talk) 12:50, 10 December 2020 (UTC)
Also, you don't really have to download 800 megabytes to your hard drive; you can just read it from a stream. Blue Prawn (talk) 13:15, 10 December 2020 (UTC)
I too have some questions.
  1. What does wiktionary have to do with the task? Would any XML encoded word list do? If so, why does the task name include wiktionary?
Because I found it interesting to do something with the wiktionary, as I explained on the Village Pump page. - Blue Prawn (talk) 09:18, 10 December 2020 (UTC)
Also, a word list is available for French with "/usr/share/dict/french", but I don't think that one is available for every language, and the Wiktionary could be a good source for generating these files. If I understood correctly, these word files are useful for spell checking. Blue Prawn (talk) 12:55, 10 December 2020 (UTC)
  2. Is the task supposed to show how to download and extract a large file in your particular language? The reference implementation just shells out and uses other tools.
The task is still a draft; if you think the download and decompression parts should be done in the language, we can update the task (and I will update the OCaml example too). - Blue Prawn (talk) 09:18, 10 December 2020 (UTC)
  3. If the task is just to extract a certain group of entries from an XML file, how does it differ significantly from XML/XPath?
--Thundergnat (talk) 21:57, 9 December 2020 (UTC)
Because we cannot use the DOM method, which builds the whole document in memory, to parse 800MB of XML; we need to use the SAX method instead. Most languages provide two different APIs for SAX and DOM XML parsing, but maybe not all. Blue Prawn (talk) 09:18, 10 December 2020 (UTC)
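To make the SAX point concrete, here is a minimal Python sketch of event-driven parsing, in which only the current page is held in memory. It is not the task's reference implementation. The names PageHandler, extract_words and SAMPLE are hypothetical, and the tiny sample only assumes the usual page/title/revision/text nesting of a MediaWiki dump.

```python
import io
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collect the titles of pages whose wikitext contains a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.words = []
        self._tag = None    # name of the element currently being read
        self._title = []    # text chunks of the current <title>
        self._text = []     # text chunks of the current <text>

    def startElement(self, name, attrs):
        if name == "page":
            # Start of a new page: forget the previous one.
            self._title, self._text = [], []
        self._tag = name

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._tag == "title":
            self._title.append(content)
        elif self._tag == "text":
            self._text.append(content)

    def endElement(self, name):
        if name == "page" and self.marker in "".join(self._text):
            self.words.append("".join(self._title).strip())
        self._tag = None


def extract_words(stream, marker):
    """Parse an XML stream incrementally and return the matching titles."""
    handler = PageHandler(marker)
    xml.sax.parse(stream, handler)
    return handler.words


# Hypothetical two-page sample standing in for the real dump.
SAMPLE = (
    b"<mediawiki>"
    b"<page><title>chat</title><revision><text>==French== Noun</text></revision></page>"
    b"<page><title>cat</title><revision><text>==English== Noun</text></revision></page>"
    b"</mediawiki>"
)

if __name__ == "__main__":
    print(extract_words(io.BytesIO(SAMPLE), "==French=="))
```

A DOM parser would have to build a tree for the whole file before returning anything, whereas the handler above discards each page as soon as its closing tag is seen.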

A common task

You can see on this post that some people are wondering how to do this task:
https://unix.stackexchange.com/questions/48939/add-new-language-to-usr-share-dict-words
The wordlist package in Debian doesn't seem to provide that many languages:
https://packages.debian.org/en/sid/wordlist
If we modify the OCaml script, replacing "==French==" by "==Indonesian==", we can produce the word list for the Indonesian language quite easily.
-- Blue Prawn (talk) 13:10, 10 December 2020 (UTC)
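The language swap described above could be sketched in Python as a filter parameterised by the section heading. This is an illustration only, not the OCaml script itself: the function words_for_language and the sample pages are hypothetical, and it assumes that an entry belongs to a language when its wikitext contains the heading "==Language==".

```python
def words_for_language(pages, language):
    """Return the sorted titles of pages that have a section for `language`.

    `pages` is an iterable of (title, wikitext) pairs. The heading format
    "==Language==" is taken from the discussion above.
    """
    marker = "==" + language + "=="
    return sorted({title for title, text in pages if marker in text})


# Hypothetical entries; real ones would come from the dump.
PAGES = [
    ("chat", "==English==\nNoun\n==French==\nNoun"),
    ("kucing", "==Indonesian==\nNoun"),
    ("cat", "==English==\nNoun"),
]

if __name__ == "__main__":
    for word in words_for_language(PAGES, "Indonesian"):
        print(word)
```

Changing the second argument is the only edit needed to target another language, which is the point made above about "==Indonesian==".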

Edit?

Hi, it has already been 10 days since anyone discussed this.
Can we allow adding new languages now?
Blue Prawn (talk) 18:04, 20 December 2020 (UTC)

That is more a reason to remove the "dumped" task altogether, as the original author doesn't seem to want to address these comments. --Paddy3118 (talk) 00:59, 21 December 2020 (UTC)
Hi Paddy, sorry, English is not my native language, so I'm not sure what you mean by [the "dumped" task].
Do you mean that this task is too simple because it's only about the act of dumping selected content from the input?
(I checked https://en.wiktionary.org/wiki/dump and tried to see which definition would match best, hoping that it's not definition 1, 7 or 8, which are quite pejorative.)
I do want to address the comments, but I already answered them all, and the discussion stopped after that. In French we say "qui ne dit mot consent" (silence is consent), so I thought that everyone now agreed. Is that not the case?
Blue Prawn (talk)

Download 800MB to spell check a document??!!

Maybe you can key Ctrl-C, maybe you only got half the language, and what if it's right at the end of the file? --Pete Lomax (talk) 00:35, 16 February 2021 (UTC)

Managed to get 5 words out of the first 240K, and then terminate download/unpack cleanly without having to key Ctrl-C. --Pete Lomax (talk) 23:52, 13 April 2021 (UTC)
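One way to stop cleanly without Ctrl-C is to decompress incrementally and simply stop reading once enough data has arrived. The Python sketch below is an assumption-laden illustration, not the method used in the entry mentioned above. The function read_prefix is hypothetical, an in-memory buffer stands in for the network stream, and only the first bz2 stream of the input is handled.

```python
import bz2
import io


def read_prefix(stream, max_bytes, chunk_size=64 * 1024):
    """Decompress roughly the first `max_bytes` of a bz2 stream, then stop.

    `stream` is any object with a read(n) method. Nothing more is read from
    it once enough output has been produced.
    """
    decompressor = bz2.BZ2Decompressor()
    out = bytearray()
    while len(out) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break                      # input exhausted
        out += decompressor.decompress(chunk)
        if decompressor.eof:
            break                      # end of the (first) bz2 stream
    return bytes(out[:max_bytes])


if __name__ == "__main__":
    # Stand-in for the compressed dump: one megabyte of repeated text.
    compressed = bz2.compress(b"<page>word</page>\n" * 50_000)
    prefix = read_prefix(io.BytesIO(compressed), 240 * 1024)
    print(len(prefix))
```

The same function should work on any file-like object that supports read(n), such as an HTTP response, so the caller can close the connection after the prefix is returned instead of interrupting a pipeline.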