Talk:XML/Input: Difference between revisions

Revision as of 23:18, 2 June 2009

Interpreting XML?

The name of this task is XML Reading. Are we supposed to interpret the XML structure, or just extract the names in this particular example?

The AWK implementation only extracts any text between double quotes. That would not be useful for any practical purpose. I think the task should at least require extracting only the contents of the fields named "Name". Maybe the example input file should contain some other fields that are not to be extracted. --PauliKL 13:00, 1 June 2009 (UTC)

I'm tempted to say let the AWK example stand with comments about how it is scraping the XML and not properly parsing it; disappointingly many languages have to do it that way anyway and it is a common (if nasty) technique. —Donal Fellows 13:25, 1 June 2009 (UTC)

This task should definitely require structured XML parsing. We already have Web Scraping for more ad-hoc methods. To aid this, I would change the XML to something less trivial. --IanOsgood 19:04, 1 June 2009 (UTC)

I added a ~~entity~~ numeric character reference, since XML processors in general need to be able to handle & and the full character set. --Kevin Reid 00:44, 2 June 2009 (UTC)

Are you suggesting that the program should convert HTML entities and numeric references into some character encoding? I think that should be a separate task. And, AFAIK, it is HTML specific, not XML. --PauliKL 09:03, 2 June 2009 (UTC)

No. Numeric references, a small set of predefined entities, and the permitted character set, are part of the XML specification. All XML parsers must support them. Practically, I think it is better for Rosetta Code if our examples show robust, fully-general solutions rather than just-enough-for-the-example-at-hand. Don't spread code that will break when someone with an accent in their name comes along. --Kevin Reid 12:23, 2 June 2009 (UTC)

The purpose of Rosetta Code is not to provide robust, full applications such as XML parsers. Such an application would require thousands of lines of code. Nobody would write them. We should have simple, clearly defined tasks that solve some specific problem, or can be used as a (small) part of an application. The task has to be specified clearly enough so that the implementations will actually solve the problem instead of using a shortcut to a known answer. I think this task should be about extracting information from an XML file. Character conversion is an entirely different task. Often you would not even want the conversion to be done. And the conversion has nothing to do with code breaking. I have now added a Vedit example that extracts the data but does not do the conversion. And it does not break because of this, it just extracts the data as expected. --PauliKL 16:33, 2 June 2009 (UTC)

Certainly no RC example should contain a full XML parser. This one, however, should use a conformant XML parser library. In a task that is about processing XML, it is misleading to demonstrate half-baked solutions. This is not a matter of doing some "translation" -- it is an inherent part of parsing XML at all. The XML specification states that "REQUIRED" behavior of an XML processor is that "the indicated character is processed in place of the reference itself" when a character reference occurs in attribute values. --Kevin Reid 17:18, 2 June 2009 (UTC)

... then the task should be something like: show how to use a full featured XML parser to parse this doc, rather than extract the list of student names using whatever means desired. Only brand new AWK and Vedit macro language would be broken, as far as I can understand other codes. --ShinTakezou 23:18, 2 June 2009 (UTC)
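To make the point about character references concrete, here is a minimal sketch of the "use a conformant parser library" approach. It is an illustration only, not any of the posted solutions: the `SAMPLE` document and the `student_names` helper are hypothetical, and the sample merely assumes the task input has `Student` elements with a `Name` attribute, one of which contains a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input (an assumption, not the task's actual file).
# The last Name uses a numeric character reference for an accented letter,
# and the nested Pet element also has a Name attribute that must be skipped.
SAMPLE = """<Students>
  <Student Name="April" Gender="F" />
  <Student Name="Dave" Gender="M">
    <Pet Type="dog" Name="Rover" />
  </Student>
  <Student Gender="F" Name="&#x00C9;mily" />
</Students>"""


def student_names(xml_text):
    """Return the Name attribute of every Student element, in document order."""
    root = ET.fromstring(xml_text)
    # iter("Student") visits only Student elements, so Pet names are ignored.
    return [student.get("Name") for student in root.iter("Student")]


names = student_names(SAMPLE)
print(names)
```

With a parser library the reference is replaced by the character it denotes as part of parsing, so the caller needs no separate conversion step, and the `Name` attribute of the nested `Pet` element is not picked up.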

Donal, the problem is that the AWK implementation does not interpret the structure at all. It is quite possible to do some parsing even if there are no ready-made library routines for that. But that does not mean that we should implement a full XML parser. The task should be kept relatively simple.

I notice that the XML input file has now been changed. But the task description needs to be changed, too. --PauliKL 09:14, 2 June 2009 (UTC)

Being the poster of the AWK solution, I have to admit it was a bit tongue-in-cheek - but also true to the XP rule "do the simplest thing that might possibly work", which the original code did for the original task. But rather than implement an XML parser in AWK, I'm rather ok with withdrawing the AWK code. --Suchenwi 10:17, 2 June 2009 (UTC)

I changed the task description slightly so that it now requires a list of student names. Maybe this is enough to specify the task? --PauliKL 14:32, 2 June 2009 (UTC)

Still about AWK, and task definition

Unluckily the task specification does not talk about a full XML parser or anything like it... it says: using whatever means desired... this means that a proper (set of) regular expression(s) in AWK can extract the list of the student names properly... still without knowing too much about the structure of an XML document, and without pretending to be a full-featured XML parser... --ShinTakezou 22:05, 2 June 2009 (UTC)
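For comparison, the regular-expression approach described above might look roughly like the following sketch. It is written in Python rather than AWK purely for illustration; the `SAMPLE` text, the `STUDENT_NAME` pattern, and the `scrape_student_names` helper are all hypothetical, and the pattern assumes double-quoted attributes with no `>` inside attribute values.

```python
import re

# Hypothetical sample input (an assumption, not the task's actual file).
SAMPLE = """<Students>
  <Student Name="April" Gender="F" />
  <Student Name="Dave" Gender="M">
    <Pet Type="dog" Name="Rover" />
  </Student>
  <Student Gender="F" Name="&#x00C9;mily" />
</Students>"""

# Match a Student start tag and capture its double-quoted Name attribute.
# The \b after "Student" keeps the pattern from matching the <Students> root,
# and [^>]*? keeps the search inside a single tag.
STUDENT_NAME = re.compile(r'<Student\b[^>]*?\bName="([^"]*)"')


def scrape_student_names(xml_text):
    """Return the raw Name attribute text of each Student start tag."""
    return STUDENT_NAME.findall(xml_text)


scraped = scrape_student_names(SAMPLE)
print(scraped)
```

This version respects enough structure to skip the `Pet` element, but it returns the attribute text exactly as written, so a character reference such as `&#x00C9;` comes back undecoded. Whether that counts as acceptable is the question the task description has to settle.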
