Talk:Bitwise IO: Difference between revisions

← Older edit

Talk:Bitwise IO (view source)

Revision as of 13:13, 28 March 2024

17,114 bytes added , 1 month ago

Add comment about the "naturalness" of big-endian order

Markjreed

1,479

edits

Revision as of 09:51, 23 December 2008 (view source) rosettacode>Dmitry-kazakov (Bits and bytes) ← Older edit		Latest revision as of 13:13, 28 March 2024 (view source) Markjreed (talk \| contribs) (Add comment about the "naturalness" of big-endian order)
(28 intermediate revisions by 5 users not shown)
Line 55: : But you don't need all this philosophy in order to unambiguously specify the task... (:-)) Merry Christmas! --[[User:Dmitry-kazakov\|Dmitry-kazakov]] 09:51, 23 December 2008 (UTC) ==After task rewriting; but still on the task sense dialog== I still can't get the point and how it is (was) related to the way the task is (was) specified. But '''to me''' it sounds a little bit confusing what you are saying. In common computerish world, a byte is a minimal addressable storage unit; nonetheless, it is '''meaningful''' to talk about bits ordering in it, and in fact many processors allow to handle single bits into a byte (or bigger "units"), once it is loaded into a processor register. E.g. motorola 680x0 family has bset, bclr and btst to set, clear or test a single bit into a register (and by convention the bit "labelled" as "0" is the LSB; it is not my way of saying it, it was the one of the ''engineers'' who wrote the manual —shipped from Motorola— where I've read about 680x0 instructions set). The same applies if you want to use logical bitwise operations that all the processors I know have; for instance ''and'', ''or'', ''exclusive or'', ''not''. To keep all the stuff coherent, you must "suppose" (define an already defined) bit ordering. A bit ordering is always defined, or a lot of things would be meaningless when computers are used. The following is an excerpt from RFC 793. <pre> TCP Header Format 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Source Port \| Destination Port \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Sequence Number \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Acknowledgment Number \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Data \| \|U\|A\|P\|R\|S\|F\| \| \| Offset\| Reserved \|R\|C\|S\|S\|Y\|I\| Window \| \| \| \|G\|K\|H\|T\|N\|N\| \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Checksum \| Urgent Pointer \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Options \| Padding \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| data \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ </pre> Numbers on the top define a bit ordering of 32 bits (4 bytes). In order to give "universal" (shareable and portable) meaning to "numbers" that span more than a single byte, the big endian convention is used, and this must be stated, and in fact it is, somewhere (but again, I would refer to the big-endian convention as the "natural" one, since we are used to write numbers from left to right, from the most significant digit to the less significant digit —the unity). In order to check, get or set the flags URG, ACK, PSH, RST, SYN and FIN the bit ordering is essential; You may need to translate from a convention to another (e.g. if I would have loaded the 16 bit of the Source Port into a register of 680x0, which is a big endian processor, I would had the right "number" of the Source Port, since endianness matchs, but to refer to bit here labelled as 15, I should use 0 instead). Despite the bit ordering here given, I can access URG flags by an ''and'' mask, which I can build without considering that labelling: the mask would be simply 0000 0000 0010 0000 0000 0000 0000 0000, which a can write shortly as hex number 00200000, i.e. simply hex 200000, which I could also express (not conveniently but keeping the "meaning") as 2097152 (base 10). Big endianness apart (since the data are stored somewhere by computers that can have different conventions), no more specification is needed to build the mask. The preferred ordering is the one in use on a particular system, and '''could''' be different. But it is not: since bytes are handled atomically, bit ordering into a byte, for all the ways common people are able to access them (i.e. using machine code or maybe assembly), is always the same. So that I can write the byte 01000001 (in hex 41), and be sure that it will be always the same, from "point" to "point" (i.e. transmission in the middle can change the thing, but some layer, at hw level or/and sw level, must "adjust" everything so that an "application" can read 0x41, and interpret it, e.g. as an ASCII "A"). Another example where nobody felt necessary to specify more than what it is obvious by exposing the matter itself, could be the MIPS instructions set you can read at [http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html MIPS reference]. In order to program an assembler for that MIPS, you don't need more information than the one that is already given in the page. Using "mine" implementation of ''bit'' ''oriented'' stream read/write functions, I could code the Add instruction in the following way: <pre> bits_write(0, 6, stdout); bits_write(source, 5, stdout); bits_write(target, 5, stdout); bits_write(destreg, 5, stdout); bits_write(0x20, 11, stdout); </pre> Where source, target and destreg are integers (only the first 5 bits are taken, so that the integer range is from 0 to 31). One could do it with ''and''s, ''or''s and ''shift''s, but then when writing the final 32 bit integer into memory, s/he must pay attention to endiannes, so again a bunch of and+shift could be used to "select" single bytes of the 32 bit integer and write the integer into memory in an endiannes-free way (a way which works on every processor, despite of its endianness). (Here I am also supposing that the processor in use is able to handle 32bit integers, i.e. is a 32 bit processor). As the MIPS reference given above, I don't need to define how the bits define the bytes in the sequence. It is straightforward and "natural", as the MIPS bit encoding is. If I want to write the bit sequence 101010010101001001010101010, I've just to group it 8 by 8: <pre> 1010 1001 0101 0010 0101 0101 010 \_______/ \_______/ \_______/ \____ </pre> and "pad" the last bit "creating" fake 0s until we have a number of bits multiple of 8: <pre> 1010 1001 0101 0010 0101 0101 010X XXXX \_______/ \_______/ \_______/ \_______/ </pre> So, the output bytes will be (in hex): A9 52 55 40. Also, to write that sequence as a whole, I must write <tt>bits_write(0x54A92AA, 27, stdout)</tt>, which could be a little bit confusing, but it is not, since if you write the hex number in base 2 you find exactly the sequence I wanted to write, right aligned (and only this one is a convention I decided in my code, and that could be different for different implementations, but this is also the most logical convention not to create code depending on the maximum size of a register: if the left-aligning would be left to the user of the function, he should code always it so that it takes into account the size of an "int" in that arch; in my implementation, this "count" is left to the functions, which in fact left align the bits taking into account the "real" size of the container —a #define also should allow to use the code on archs that have bytes made of a different number of bits, e.g. 9) Another way of writing the same sequence, and obtain the same bytes as output, could be: <pre> bits_write(1, 1, stdout); bits_write(0, 1, stdout); bits_write(1, 1, stdout); bits_write(0, 1, stdout); </pre> And so on. If the task is not well explained in these details, examples should clarify it. But maybe I am wrong. --[[User:ShinTakezou\|ShinTakezou]] 01:14, 28 December 2008 (UTC) : My point is about [http://en.wikipedia.org/wiki/Bit_numbering bit-endianness], it must be specified. You did it by providing an example in the task description. Note that the talk about writing strings into (binary?) files is superfluous and, worse, it is misleading, for exactly same reasons, BTW. Text is '''encoded''' when written as bytes (or any other storage unit). It is sloppiness of [[C]] language, which lets you mix these issues. If the file were UCS-32 encoded your text output would be rubbish. : TCP header defines bits because it contains fields shorter than one byte, and because the physical layer is bit-oriented, which is '''not''' the case for byte-oriented files. : If MIPS architecture has preferred bit-endianness, why should that wonder anybody? --[[User:Dmitry-kazakov\|Dmitry-kazakov]] 10:32, 28 December 2008 (UTC) The task specifies we are handling ASCII (encoded) strings. Hopely this is enough to avoid loosing information, that it would happen with any other encoding that uses "full" byte. The bit-endianness is just a labelling problem. Even in the wikipage you linked, no matter if the left bit is labelled as 7 or 0, the byte (binary number with at most 8 digit) is still 10010110. That we can read as the "number" 96 in hex (too lazy to get it in decimal now:D); and if I write such a byte "formed" by such bits into a file, I expect that a hexadecimal dump of that file will give me 96 (string) as output. These are details hidden into the code; no matter how you label the bits; the important fact is that when you use the functions to write the bits 10010110 as a "whole", you get the byte 96 into the output; and viceversa, when you read first 8 bits from a file having as first byte 96, you must get 10010110 (i.e. 96 :D). And the same if you write an arbitrary sequence, like 100101101100, as a "whole"; when you read back 12 bits, you get back 100101101100 (which is the "integer" 96C in hex) I still can't get the point of the statement of the second paragraph. When I "code" software at some not too low hw level, I deal with bytes, I can't see the bit-orientation. And it is why the RFC can talk that way letting programmer understand and code in the right way application handling that TCP data, disregarding the physical layer. These things could be an issue when writing lowlevel drivers, dealing with serial communication or whatever... But we are a little ''bit'' higher than this! Files are byte-oriented, and it is the reason why we need to pad with "spurious" bits if the bits sequence we want to write has no a number of bits multiple of 8 (supposing a byte "contains" 8 bit); but if we "expand" the bits of each byte, we have a sequence of bits (and maybe the last bits are padding bits...); this is the "vision" of the task. It does not wonder; it just hasn't specified a bit-endiannes, that as said before, is a labelling problem; encoding of the addu instruction is <pre> 0000 00ss ssst tttt dddd d000 0010 0001 </pre> and nobody is telling that the leftmost bit is the 0, or 31. No matter, since encoding of the instruction remain the same, and it is written into memory in the same way. So here indeed we don't know if MIPS prefers to call the leftmost bit 0 or 31. One could think about what's happening with a little endian processor; to have a feel on it <pre> 0000033A 681000 push word 0x10 </pre> from a disassembly; we have 68 (binary 01101000) followed by 16bit LE encoded value. If bits into the first instruction byte have meaning, we could say the encoding would be: <pre> push byte/word/dword + DATA -> 0110AB00 + DATA </pre> (It is a fantasy example, x86 push is not this way!) Where bits AB specifies if we are pushing a byte a word or a dword (32bit); AB=00 push byte, AB=10 push word, AB=11 push dword (while AB=01 could be used to specify another kind of instruction); and somewhere it will be stated that DATA must be LE encoded. But one must remember that once the DATA is got from memory, into the (8/16/32 bits) register there's no endianness; if stored DATA is hex 1000, this is, into the reg, just the 16 bit "integer" hex 1000. To talk about the encoding of push byte/word/dword, I don't need to specify a bit-endianness. I must know it when using intruction that manipulates single bits of a register (a said before, motorola 680x0 label the LSB as 0). <pre> 00000336 50 push ax 00000395 51 push cx 0000038A 52 push dx 0000045F 53 push bx 00000136 55 push bp 00000300 58 pop ax </pre> These "pushes"/pop suggest us that we could say the encoding of the push REG instruction is something like <pre> 0101PRRR RRR = 000 ax P = 0 push 011 bx 1 pop 001 cx 010 dx 101 bp </pre> It happens that x86 instrunctions are not all of the same length, but it should not throw confusion; the way we use to say how x86 instructions are encoded is the same as the one for the MIPS, the 680x0 or whatelse. And despite of the ''preferred'' endianness(!!) if we like to say it in a bit-wise manner: <pre> push word DATA -> 0110 1000 LLLL LLLL HHHH HHHH L = bits of the LS byte (Low) H = bits of the MS byte (High) </pre> And this way, which is sometime used, don't need to specify any "bit-endianness": it is clear how bits of the LS byte LLLL LLLL must be put. E.g. for the "integer" 0010, LS byte is 10 (binary 00010000) and MS byte is 00, so we fill L and H this way: <pre> LLLL LLLL HHHH HHHH 0001 0000 0000 0000 </pre> The endiannes which could lead to problems is the endianness regarding bytes for "integers" stored with more than a single byte. At this (not so low) level, bit-endianness is just a labelling issue and matter just when using instructions like 680x0 bset, bclr and so on. Hopely the task is clear(er) (at least an OCaml programmer seemed to have got it!), and I've learned «Ada allows specifying a bit order for data type representation» (but underlying implementation will need to map to hardware convention, so it would be faster just to use the "default", I suppose!) --[[User:ShinTakezou\|ShinTakezou]] 00:10, 6 January 2009 (UTC) : Everything in mathematics is just a labeling problem. Mathematics is a process of labeling, no more. As well as computing itself is, by the way. Your following reasoning makes no sense to me. When byte is a container of bits, you cannot name the ordinal number corresponding to the byte '''before''' you label its bits (more precisely define the encoding). The fallacy of your further reasoning is that you use a certain encoding (binary, positional, right to left) without naming it, and then start to argue that there is no any other, that this is natural (so there are others?), that everything else is superfluous etc. In logic A=>A, but proves nothing. : Here are some examples of encoding in between bits and bytes: [http://en.wikipedia.org/wiki/RADIX-50 4-bit character codes], [http://en.wikipedia.org/wiki/Binary-coded_decimal packed decimal numbers]. : This is an example of a serial bit-oriented protocol [http://en.wikipedia.org/wiki/Controller_Area_Network CAN], note how transmission conflicts are resolved in CAN using the identifier's bits put on the wire. Also note that a CAN controller is responsible to deliver CAN messages to the CPU in the endianness of the later. I.e. it must recode sequences of bits on the wire into 8-bytes data frames + identifiers. : More about [http://www.linuxjournal.com/article/6788 endianness] --[[User:Dmitry-kazakov\|Dmitry-kazakov]] 10:08, 6 January 2009 (UTC) :: Sorry at this point I think we can understand each others. I believe I've explained in a rather straightforward (even though too long) way the point, and can't do better myself. In my computer experience, the "problem" and the task is understandable, clear and not ambiguous. In a implementation-driven way I can say that you've got it as I intended iff the output of the program feeded with the bytes sequence (bytes written in hex) <pre> 41 42 41 43 55 53 </pre> :: (which in ASCII can be read as "ABACUS") is <pre> 83 0a 0c 3a b4 c0 </pre> :: i.e. if you save the output in a file and see it with a hexdumper, you see it; e.g. <pre> [mauro@beowulf-1w bitio]$ echo -n "ABACUS" \|./asciicompress \|hexdump -C 00000000 83 0a 0c 3a b4 c0 \|...:..\| 00000006 [mauro@beowulf-1w bitio]$ echo -n "ABACUS" \|./asciicompress \|./asciidecompress ABACUS[mauro@beowulf-1w bitio]$ </pre> ::--[[User:ShinTakezou\|ShinTakezou]] 18:10, 13 January 2009 (UTC) ::: As a small point: you said that most-to-least significant is the "natural" order, but I'd like to point out that that is only true in Western languages that are written left-to-right. In Arabic and Hebrew, decimal digits appear in the same order despite the surrounding text being read right-to-left, so the digits appear in least-to-most significant order. --[[User:Markjreed\|Markjreed]] ([[User talk:Markjreed\|talk]]) 13:12, 28 March 2024 (UTC) == PL/I bitstring longer than REXX'... == because the input seems to be STRINGS followed by '0D0A00'x [[User:Walterpachl\|Walterpachl]] 20:22, 2 November 2012 (UTC) [[User:Carl Johnson\|Carl Johnson]]