Talk:Longest string challenge: Difference between revisions

← Older edit

Talk:Longest string challenge (view source)

Revision as of 14:18, 29 September 2011

14,548 bytes added , 12 years ago

m

→‎Conforms to the specs?: ya

Anonymous user

rosettacode>Dgamey

Revision as of 00:00, 16 August 2011 (view source) rosettacode>Dkf (→‎Comments / Feedback: Maybe structuring as Extra Credit?) ← Older edit		Latest revision as of 14:18, 29 September 2011 (view source) rosettacode>Dgamey m (→‎Conforms to the specs?: ya)
(34 intermediate revisions by 6 users not shown)
Line 45: :1. No comparison operators may be used. :2. No arithmetic operations, such as addition and subtraction, may be used. :3. The only datatypes you may ~~are~~ use are integer and string. In particular, you may not use lists. An additional restriction became apparent in the discussion. Line 56: :The guiding principle here should be that when using the language of your choice, try to solve this creatively showing off some of your language capabilities. If you need to bend the restrictions a bit, explain why and try to follow the intent. If you absolutely can't get around one, say why in your description. :Now having said that, the restrictions require some elaboration. ::* In general, the restrictions are meant to avoid the explicit use of these features. ::* "No comparison operators may be used" - At some level there must be some test that allows the solution to get at the length and determine if one string is longer. Comparison operators, in particular any less/greater comparison should be avoided. Various approaches allow for detecting the end of a string. Some of these involve implicitly using equal/not-equal; however, explicitly using equal/not-equal should be acceptable. ::* "No arithmetic operations" - Again, at some level something may have to advance through the string. Often there are ways a language can do this implicitly advance a cursor or pointer without explicitly using a +, - , ++, --, add, subtract, etc. ::* The datatype restrictions are amongst the most difficult to reinterpret. In ~~Icon~~the language of the original challenge strings are atomic datatypes and structured datatypes like lists are quite distinct and have many different operations that apply to them. This becomes a bit fuzzier with languages ~~like~~with ~~APL,~~a C,different orprogramming Jparadigm. The intent would be to avoid using an easy structure to accumulate the longest strings and spit them out. There will be some natural reinterpretation here. ::: To make this a bit more concrete, here are a couple of specific examples: ~~::: For example, in C a string is an array of chars using a couple of arrays as strings is in the spirit using a second array in a non-string like fashion would violate the intent.~~ :::: ~~Also~~In C, ina ~~APL~~string oris Jan ~~arrays~~array ~~are~~of ~~the~~chars, ~~core~~so ofusing ~~the~~a ~~language~~couple soof ~~ruling~~arrays ~~them~~as ~~out~~strings is ~~unfair. Meeting~~in the spirit ~~will~~while ~~come~~using ~~down~~a tosecond ~~how~~array ~~they~~in ~~are~~a ~~used~~non-string like fashion would violate the intent. :::: In APL or J, arrays are the core of the language so ruling them out is unfair. Meeting the spirit will come down to how they are used. ::* The added "No rereading" restriction is for practical reasons, re-reading stdin should be broken. I haven't outright banned the use of other files but I've discouraged them. Somewhere there may be a language that just sings when doing file manipulation and where that makes sense; however, for most there should be a way to accomplish without resorting to an externality. ::: Please keep in mind these are just examples and you may hit new territory finding a solution. There will be other cases like these. Explain your reasoning. You may want to open a discussion on the talk page as well. ::* The added "No rereading" restriction is for practical reasons, re-reading stdin should be broken. I haven't outright banned the use of other files but I've discouraged them as it is basically another form of a list. Somewhere there may be a language that just sings when doing file manipulation and where that makes sense; however, for most there should be a way to accomplish without resorting to an externality. :At the end of the day for the implementer this should be a bit of fun. As an implementer you represent the expertise in your language, the reader may have no knowledge of your language. For the reader it should give them insight into how people think outside the box in other languages. Comments, especially for non-obvious (to the reader) bits will be extremely helpful. While the implementations may be a bit artificial in the context of this task, the general techniques may be useful elsewhere. Task Implement a solution to the basic problem that adheres to the spirit of the restrictions. Describe how you circumvented or got around these 'restrictions' and met the 'spirit' of the challenge. Your supporting description may need to describe any challenges to interpreting the restrictions and how you interpretation. IfYou ~~you~~should ~~make~~state any assumptions, ~~like the inputs are trimmed of trailing blanks~~warnings, ~~you~~or ~~should~~other ~~state~~relevant ~~them~~points. This task is likely to encourage multiple different types of solutions. They should be substantially different approaches. Line 87 ⟶ 89: :::: The remaining references to Icon are historical/contextual such as explaining the nature of the restrictions. It could be replaced by the language used in the original challenge but it's just awkward. We could state that further uses are historical/contextual. :::: In "This becomes a bit fuzzier with languages like APL, C, or J" the languages were meant as an example. We can change this to something nebulous like "languages with a different paradigm" or we can make it clearer that these are just examples. ::::: I have changed the above line as suggested. --[[User:Dgamey\|Dgamey]] 11:51, 16 August 2011 (UTC) :::: The specific examples for C and J could be sanitized or referred to a section here on this page. Or it could be made very clear they are just examples. ::::: I will make it clear these are just examples and add some additional generalizing text. --[[User:Dgamey\|Dgamey]] 11:51, 16 August 2011 (UTC) :::: BTW, If anyone else has some suggested rewording or improvements, I'd welcome it. --[[User:Dgamey\|Dgamey]] 18:07, 15 August 2011 (UTC) :::: Also, I added a bit about comments being helpful in the last paragraph. --[[User:Dgamey\|Dgamey]] 12:09, 16 August 2011 (UTC) ::That said, I am wondering if this kind of page should belong to some special kind of category. Perhaps <nowiki>[[Category:Gimmick]]</nowiki>? And, if so, perhaps "Category:Gimmick" entries (or whatever you want to call them) need to follow different rules from normal tasks? --[[User:Rdm\|Rdm]] 17:00, 15 August 2011 (UTC) ::: Possibly, but Gimmick is a bit harsh. --[[User:Dgamey\|Dgamey]] 18:07, 15 August 2011 (UTC) Line 94 ⟶ 99: :: To be honest, I think this is a very poor task as currently structured. It goes out of its way to make things difficult with those restrictions (e.g., causing problems in languages that unify lists and arrays) and all without the purpose of achieving something that maps to a real life goal. By contrast, being able to handle very large numbers of lines (e.g., 10 million) where they may be long and where there may be many maxima, that actually means something vaguely useful. (Except that hardly never is line length of real relevance; I've never seen it for real.) If you insist on having the restrictions, make them Extra Credit points only. –[[User:Dkf\|Donal Fellows]] 00:00, 16 August 2011 (UTC) ::: I will respectfully disagree with some of your assertions. If every task on the site were purely utilitarian, we could cut out a fair percentage of tasks. Many also don't address real life goals. Anyone is free to write another task to handle millions of lines. Part of being a draft task is that things need working out. This task started with a straight copy and we are seeing if it can be adapted and generalized. Right now it's a bit of an experiment. I'm surprised that I've seen little in terms of positive suggestions about how to evolve this. Simply making it optional defeats the whole point. --[[User:Dgamey\|Dgamey]] 02:38, 16 August 2011 (UTC) :I quite like the rewrite. I wouldn't want to go so far as to add something like <nowiki>{{puzzle}}</nowiki> until we've all had a chance to see how things go with this task; and I think the name puzzle might be so easily miss-used on game tasks we have had in the past, that did not need the distinction. --[[User:Paddy3118\|Paddy3118]] 05:57, 16 August 2011 (UTC) :: Thanks Paddy. I think it needs a wee bit more work. I'm not sure if I should add the change to the main page so people don't miss it or give it a couple of days for feedback here. I also think Puzzle is the wrong name for that reason. It could be a stretch, a challenge, or we could add the word programming to it. --[[User:Dgamey\|Dgamey]] 11:51, 16 August 2011 (UTC) ===Ready to leave Draft?=== The task and talk page seems to have been stable for a while. I'm thinking it's about time to remove draft status. If I don't see more activity, I'm thinking that this should be done at the end of September. --[[User:Dgamey\|Dgamey]] 03:22, 9 September 2011 (UTC) ==Similar to [[Averages/Mode]]== Line 133 ⟶ 147: :::::::: I'm not there yet as I want to hear more. I am beginning to think that the intent isn't being stated in a positive way. Restrictions are by definition negative. What was the original intent here? Really, it was to get people to think outside of their particular box. This problem can be solved very conventionally and that's boring and pedestrian. The point perhaps is in your language of choice how would you solve this creatively showing off some of your language capabilities? This also lets people have a little bit of fun with it. The restrictions could be more guidance/example in this case. --[[User:Dgamey\|Dgamey]] 19:56, 14 August 2011 (UTC) ::::::::: If that is is the point, then rather than placing restrictions, just make that point in the task description, and let the implementer make the decision as to how this is best achieved. Maybe say "Demonstrate this creatively showing off some of your language capabilities." [[User:Markhobley\|Markhobley]] 07:46, 9 September 2011 (UTC) :::::::::: Borrowed from this idea in the last revision. I think there is a good balance between rigid restrictions and anarchy :) --[[User:Dgamey\|Dgamey]] 14:17, 29 September 2011 (UTC) ::::::::: (I find I am enjoying following the discussion and ultimately hope we ''can'' get the 'nuance' right to make this a good task). --[[User:Paddy3118\|Paddy3118]] 09:23, 15 August 2011 (UTC) Line 250 ⟶ 267: :Segfault isn't all that different from <code>kill -9</code>, or even calling <code>exit()</code> for that matter: the process is gone, along with all its memory pages and file handles, leaving not much to be compromised. A crashed program of course can leave behind some inconsistent state around such as half written files, but that's not a problem here. The C code can overrun buffers even if we use <code>fgets</code> (and the fgets length should be 1 less anyway), so the last fix didn't really fix anything, only adding a possibility of wrong result besides crashing. If there is a chance for the program to fail and we are not going to completely prevent it, I'd rather have it fail more obviously. --[[User:Ledrug\|Ledrug]] 03:13, 15 August 2011 (UTC) ::Segfault is not the only possible outcome from buffer overflow. Also, it's my understanding that the length argument to fgets is the buffer size -- if it's 65536 then a maximum of 65535 characters will be read as the final character to be placed in the buffer must be null. That said, if there were some other way to crash the program, I would like to understand it, and I would also like for that issue to be fixed. --[[User:Rdm\|Rdm]] 17:07, 15 August 2011 (UTC) :::The fgets needs to be 1 less because of the strcat of a newline right after it. Even with fgets, the accumulator buffer <code>buf</code> can still be overrun when we add lines to it (suppose the input has 2000 lines each 1000 chars long, for example). If you want to be safe, well, <lang c>#include <stdio.h> #include <string.h> #include <assert.h> int cmp(const char p, const char q) { while (p && q) { p = &p[1]; q = &q[1]; } return p; } int inc(int x) { return (int)&((char )x)[1]; } int dec(int x) { return (int)&((char )x)[-1]; } int gt(int x, int y) { while (y && x) y = dec(y), x = dec(x); return x; } int add(int x, int y) { while(y) x = inc(x), y = dec(y); return x; } / strlen(a) + 1 / int length(char a) { char x = 0; while (a) a = &a[1], x = &x[1]; return (int)x; } #define LINE_MAX 10 #define ACCU_MAX 30 int main() { char line[LINE_MAX]; char buf[ACCU_MAX] = {0}; char last = buf; char next = buf; while (fgets(line, LINE_MAX, stdin)) { /* check that fgets didn't truncate line, or result will be wrong / assert(gt(dec(LINE_MAX), length(line))); if (cmp(last, line)) continue; if (cmp(line, last)) next = buf; last = next; assert(!gt(add(length(buf), length(line)), ACCU_MAX)); strcpy(next, line); while (next) next = &next[1]; } printf("%s", buf); return 0; }</lang> I haven't carefully checked all the comparisons in the above, but it's on the right track at least. --[[User:Ledrug\|Ledrug]] 00:17, 17 August 2011 (UTC) :::: Ok, I see what you are saying. I have updated the C implementation to avoid buffer overflow problems. Note that this will not prevent crashes nor erroneous results -- it's only correct if the lines are short enough and if few enough of them are long. --[[User:Rdm\|Rdm]] 01:13, 17 August 2011 (UTC) ::::: Er I'd say <code>size <<= 1</code> qualifies as arithmetic operator. You could use the <code>add()</code> function above, but it gets really ugly looking, which is why I didn't bother to begin with. --[[User:Ledrug\|Ledrug]] 01:41, 17 August 2011 (UTC) :::::: It could be replaced by a function which returns the next available size (implemented as a large switch statement). --[[User:Rdm\|Rdm]] 04:03, 17 August 2011 (UTC) ::::::: Heh. I said somewhere on this page before, this task is gimmicky, and I wrote some gimmicky code as a proof of concept -- for fun. Then people came around and started to pile common sense stuff on it: <code>const</code> pointers, <code>fgets</code>, <code>realloc</code>, etc. The problem is, this is not a common sense task. We know that the buffer can be overrun, but what are the consequences? That someone will put the code on a webserver with root priviledge and accept unsanitized data? Or will it be run as a system daemon? I highly doubt it. I'd be fine with a giant, red, flashing warning label about buffer overrun, but making it correct by burying the (somewhat) interesting part with two hundred lines of safety code is going out of hand. I would suggest just slap on a warning label and leave it at that. The task requires a handicapped program, and we have a handicapped program, it's fitting. --[[User:Ledrug\|Ledrug]] 05:20, 17 August 2011 (UTC) :::::::: Now you 've pointed it out, it does verge on the absurd :-)<br> There is also much more talk than task entries. --[[User:Paddy3118\|Paddy3118]] 05:59, 17 August 2011 (UTC) ::::::::: I think I'll post the revised task descriptions with a small tweak about warnings. The intent was for demonstration code with some documentation. It would be fine to point out that the code isn't secure in the description. Now we've got all kinds of secondary functions. I'd rather see comments on the line p = &p[1] and the while loop showing the reader with little C what's happening. ::::::::: A separate task or family of tasks showing various safe vs. unsafe practices might be worth considering. --[[User:Dgamey\|Dgamey]] 11:39, 17 August 2011 (UTC) ::In other words, this program will not block for input: <lang c>#include <stdio.h> main() { Line 255 ⟶ 342: fgets(buf, 1, stdin); }</lang> --[[User:Rdm\|Rdm]] 17:10, 15 August 2011 (UTC) : A buffer overflow is a security violation. Input, that overflows the buffer, might overwrite the return address and hijack control of the program. A correct program would check bounds and report an error (or realloc() a longer buffer). Does anyone know how to check bounds without any comparison operators? --[[User:Kernigh\|Kernigh]] 21:34, 15 August 2011 (UTC) ::Use functions with bounds-checking built-in, such as strncpy. (Or one of Microsoft's _s extensions). --[[User:Short Circuit\|Michael Mol]] 23:24, 15 August 2011 (UTC) ::Alternately, XOR your return value, and use C's "0 is false, nonzero is true" behavior in conditional expression evaluation. --[[User:Short Circuit\|Michael Mol]] 23:24, 15 August 2011 (UTC) === bounds checking in C, take 2 === :I do not think that C code with buffer overflows should ever be considered to be simpler than C code that guards against them. It might be faster, but any apparent simplicity is deceptive since the possibility of buffer overflows pushes complexity out onto the user. That said, gets() here requires strcat(), because gets() drops the end of line character. Meanwhile fgets() does not trim off that character. And bounds checking could be implemented using memset() and then using the implementation's cmp() between a reference and an appropriately choosen spot near the end of buffer. So if we are will to pay the minor increase in complexity to use gets instead of fgets, I am not sure why we are not willing to pay a few extra lines to get an implementation without buffer overflow. --[[User:Rdm\|Rdm]] 16:14, 17 August 2011 (UTC) :: 1) I added a second version without buffer issues. 2) With fgets, you still need to check if the line is too long and was truncated; if it is, we are pretty much screwed and need to be able to read in the rest of the line and append it to the line buffer, which is a lot of work because of the "no arithmetics" clause. The second method does handle arbitrarily long input, but see how much more complicated it is. :: It's not that I particularly love gets or buffer overruns, and "faster" is totally not a concern for this task, the key issue is how easy it is to read the code. The original C code was really just demonstrating how to compare the lengths of two strings without directly measuring them, and I deliberately did not pretend it was safe code (I've ''never'' used gets before); declaring const and fgets, to me, feel like false promise of safety. :: That said, if you can check bounds and truncation correctly without making it overly complicated, feel free: I can't think of a way to do it without integer math, like in the second solution. --[[User:Ledrug\|Ledrug]] 18:27, 17 August 2011 (UTC) ::: Given that the current version does not always work, I do not see anything wrong with a replacement version which does not always work. That said, I think I prefer "exit with an error code" over "crash" for behavior in adverse conditions. But, ok, I'll try for a bounds checking version. --[[User:Rdm\|Rdm]] 18:32, 17 August 2011 (UTC) :::: The code I posted above on this talk page does that: it aborts if buffer would be overrun or line is too long. The second code I posted on task page only aborts on realloc failure and should function correctly if that doesn't happen. ''None'' of the changes made by various people to the original code could garantee a correct result or a timely abort, which is why I don't feel it was worth the trouble. --[[User:Ledrug\|Ledrug]] 18:45, 17 August 2011 (UTC) ::::: What do you think of the version I added? --[[User:Rdm\|Rdm]] 21:02, 17 August 2011 (UTC) :::::: <code>if (!longer(bufend, line))</code> isn't really checking for accumulator overflow, is it? Also, if you use <code>memset</code> to mark unused buffer region, they should to be called every time the accumulator is reset (i.e. found a new longest line), or it would be possible that there's a null byte in the tail region of <code>buf</code> before the reset, but accumulated strings haven't reached there yet, and next <code>if (!longer(bufend, line))</code> can give a false positive on overflow. (I think) --[[User:Ledrug\|Ledrug]] 21:24, 17 August 2011 (UTC) :::::EDIT yes that does happen. Modified the code to this: (LINE_MAX is your 1000000, your buffer would have been <code>#define ACCU_MAX (11LINE_MAX + 1)</code>, but the point is the same:<lang c>#define LINE_MAX 6 #define ACCU_MAX (4LINE_MAX+1) int main() { char line[LINE_MAX]; char buf[ACCU_MAX]; char linend= &line[LINE_MAX - 1]; char bufend= &buf[ACCU_MAX - LINE_MAX - 1]; char last = buf; char next = buf; memset(line, 1, LINE_MAX); memset(buf, 1, ACCU_MAX); buf[0]= buf[ACCU_MAX - 1]= 0; while (fgets(line, LINE_MAX, stdin)) { if (!linend) exit(1); if (longer(last, line)) continue; if (!longer(bufend, line)) exit(1); if (longer(line, last)) next = buf; last = next; strcpy(next, line); while (next) next = &next[1]; }</lang> and it aborts on the input <code>echo "aa\nbb\ncc\ndd\nee\nff\naaa" \| ./a.out</code> but works if you remove any of the doubles. --[[User:Ledrug\|Ledrug]] 21:28, 17 August 2011 (UTC) :::::: I do not consider "false positives" to be a bug. That's a (very minor) efficiency issue, but the point was to avoid buffer overflows, not to squeeze every last ounce out of the buffer. So I have one extra byte in the line buffer, and an extra line buffer sized region in the big buffer. If you prefer, imagine that those particular bytes were part of other variables. --[[User:Rdm\|Rdm]] 21:35, 17 August 2011 (UTC) :::::: Eh. If it's garanteed to fail under some situations where it has the resource to succeed, it's not an efficiency issue, it's a correctness issue. It's not even hard to fix:<lang c> while (fgets(line, LINE_MAX, stdin)) { if (!linend) exit(1); if (longer(last, line)) continue; if (longer(line, last)) { memset(buf, 1, ACCU_MAX); buf[0]= buf[ACCU_MAX - 1] = 0; next = buf; } if (!longer(bufend, line)) exit(1); last = next; strcpy(next, line); while (next) next = &next[1]; }</lang> --[[User:Ledrug\|Ledrug]] 21:45, 17 August 2011 (UTC) ::::::: Bytes from <code>bufend</code> on were not "resources" they were "protection". So no, I do not agree that it's a correctness issue. If that's a correctness issue, you might as well argue that a program is incorrect because some memory on the heap has not been allocated. Your ACCU_MAX is the moral equivalent of my <code>bufend</code>. I could agree to calling this an efficiency issue, but not a correctness issue. --[[User:Rdm\|Rdm]] 22:23, 17 August 2011 (UTC) :::: (deindent) Ok I guess that's what's confusing about the code. You could check available buffer at a more natural place: <code>next+1</code>. That's where you'd copy the line string to to begin with:<lang c> char line[LINE_MAX] = {0}; char buf[ACCU_MAX] = {0}; char last = buf; char next = buf; while (fgets(line, LINE_MAX, stdin)) { if (line[LINE_MAX - 2]) exit(1); if (longer(last, line)) continue; if (longer(line, last)) { memset(buf, 1, ACCU_MAX); buf[0] = buf[ACCU_MAX - 1] = 0; next = buf; } if (line && longer(&line[1], &next[1])) exit(1); strcpy(next, line); for (last = next; next; next = &next[1]); }</lang> (I didn't define the LINE_MAX and ACCU_MAX just to be different, it's easier to change the values for testing. They ''are'' related to your <code>bufend</code>). There's still one issue: <code>echo -n "a\naa\naaa"\|./a.out</code> fails, which has nothing to do with the modifications I made. --[[User:Ledrug\|Ledrug]] 23:10, 17 August 2011 (UTC) ::::: The treatment of the <code>echo -n "a\naa\naaa"</code> case depends on the definition of a "line". Since "line" was not defined in the task, I am comfortable saying that that represents an ill-formed file and thus cannot represent a correctness issue. (If I cared about that case, I would exit(1) for any line that did not end with a newline character.) --[[User:Rdm\|Rdm]] 23:29, 17 August 2011 (UTC) == Boring solution v. restrictions == Line 270 ⟶ 423: every write(!L) end</lang> == Conforms to the specs? == Hoping to confirm that the AHK solution didn't break any rules : Looks like it meets the spirit just fine. --[[User:Dgamey\|Dgamey]] 14:18, 29 September 2011 (UTC)