Talk:Bitwise IO

::I see Shin's point. In the real world, the LZW symbol stream is always packed into octets. The GIF and TIFF standards each specify a packing scheme. However, the method is not as simple as you may realize: as the symbol table fills, the maximum code length grows by one bit at specific points (e.g. 511 -> 512), and the packing scheme takes advantage of that. Perhaps Shin could reference one of these standards for the task.
::If Shin's actual goal is to measure the compression achieved by LZW against the input stream, that is more easily accomplished. The output symbols start at 9 bits, so simply multiply the number of output symbols by 9 and divide by 8. The task's test string "TOBEORNOTTOBEORTOBEORNOT" (24 bytes) compresses to 16 symbols, which would pack into ((9*16)+7)/8 = 18 bytes. The calculation becomes more complex once there are more than 256 output symbols, because the symbol size increases. (I implemented an 8086 assembly TIFF LZW codec once upon a time.) --[[User:IanOsgood|IanOsgood]] 17:09, 22 December 2008 (UTC)
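
::For concreteness, a minimal sketch of that size calculation in C (fixed 9-bit codes assumed; the names and numbers are just illustrative):

<pre>
#include <stdio.h>

/* Sketch only: bytes needed to pack a run of fixed-width 9-bit codes. */
int main(void)
{
    int symbols = 16, code_bits = 9;
    int bytes = (symbols * code_bits + 7) / 8;  /* round up to whole bytes */
    printf("%d codes of %d bits pack into %d bytes\n",
           symbols, code_bits, bytes);          /* 16 codes -> 18 bytes */
    return 0;
}
</pre>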
 
:::I think I have to rewrite the text of the task. The approach I've used has, of course, a direct analogue in byte streaming. No one would have talked about endianness or fine mathematical points if I had said "write a stream of bytes" and then read them back "keeping the streaming order"; badly put, but a single example would have been enough to understand: <tt>printf("hello world")</tt>, or <tt>fwrite("\x45\x20\x50\xAA", 1, 4, out)</tt>... either C function would complete such a task (fwrite being more general, since printf cannot emit a null byte). I simply wanted something similar, but for bits (and that is what my C implementation does). One could also implement the whole thing so that what was a left shift in my previous description becomes a right shift; it is enough to pick bytes from "the bottom" and walk backward, which means seeking in a file (or, if buffering, processing the bytes from the end of the buffer); that would be inefficient on a pipe, since in order to start decompression one must wait for the end of the stream (which could mean buffering the whole stream, a memory-consuming approach). I did not want a way of measuring efficiency (as you say, that can be done by estimation); I had just implemented, in C, a working LZW compressor: instead of outputting an array, I output bits; I added a big-endian 32-bit integer to store how many bytes the uncompressed file had, and I have a compressor. I needed these bit routines to achieve that aim. The LZW C compression code can't fit into RC for now (I've just written one in Obj-C, which has a dictionary object, done the way the others did it, i.e. outputting an array), since I had to implement a dictionary and a general way to handle strings of arbitrary bytes in order to translate the Java code.
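
:::For concreteness, here is a minimal sketch of the kind of MSB-first bit writer I have in mind (my names, not the actual task code): bits accumulate into a byte from the top down, and the last byte is padded with zeros.

<pre>
#include <stdio.h>

/* Hypothetical sketch of an MSB-first bit writer. */
static unsigned char acc = 0;   /* partial byte being built        */
static int nbits = 0;           /* how many bits of acc are in use */

void bit_put(int bit, FILE *out)
{
    acc = (unsigned char)((acc << 1) | (bit & 1));
    if (++nbits == 8) { fputc(acc, out); acc = 0; nbits = 0; }
}

void bit_flush(FILE *out)       /* pad the final byte with zero bits */
{
    while (nbits != 0)
        bit_put(0, out);
}
</pre>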
 
4. You are dealing with the words of a language, which always have somewhat elastic meanings and usages, and that is what I tried to stress. Whether a number is abstract for a mathematician (what is an abstract object without a representation? does a definition count as a form of representation?), or whether it is defined just as the successor of a successor of a successor... of 0 (e.g. 0SSSS would be a representation of 4, itself a widespread representation), is not so interesting after all. No human being (mathematicians included) is able to use abstract numbers. You can talk about them, but they would be rather useless for a lot of tasks. "Bit" is short for "binary digit", so let's talk about digits. What is a digit? A matter of definition. I say a matching pair {} is a digit; and here digits enter Zermelo–Fraenkel. You could say {} is itself a representation of something. Yes, indeed, every symbol can be considered a representation of something more or less abstract. So mathematicians using symbols to talk about abstract numbers are using a representation of something, and I can define the word "digit" to mean a unit of their representation. I can also define, in some way, a first digit and a last digit. Philosophically, I can say that a representation of something could be the thing itself; my body is a representation of myself, and I also feel that it ''is'' myself ("my body is really me", almost a quotation from a Cohen song). It gets harder once we consider that we communicate through symbols and representations: if representations are nothing (since Real Things exist which are not representations, and not even only representations of themselves), our speech is void. We exist by reference. Everything exists by reference (I use ''referencing'' to mean "assigning a representation", or a definition). Referencing is what makes things exist, and that is why I believe there is such a thing as a digit in a number: I have just "created" it by referencing it in my speech (I have defined it, more or less explicitly). Back down to earth: we use representations of numbers that have a "concept" of digit, and of first and last digit, as I defined before; human languages let us say simply that numbers are made of digits. If you write a mathematical article about number theory, you can write as you prefer; but outside that world and that article, it makes little sense: for all practical and communicative purposes, numbers have digits. By the way, I define the least significant digit of the ZF representation as the innermost {}. As you can see, there is no escape: to express your thoughts (no matter the concepts involved!) you '''need''' symbols, which are always representations of something!
 
Definitions are important; they make things exist, as said before. A byte is an entity made of bits (bits being binary digits): by convention, eight of them. A byte-oriented file has bytes, and bytes, being made of bits, contain bits (sometimes I would put "" around these words, but for now I won't). A byte can be interpreted (interpretation and representation are different, even though one can link them in some conventional way) as a character or as a number. We are interested in the number interpretation. Since a byte is made of 8 binary digits, we understand that a byte is just a number represented in base 2 with at most 8 digits: a number from 0 to 255, no more. We learned about numeral systems, so we know what a least significant digit is (and if not, I can reference my previous talk, and that is enough to continue speaking of LS digits). When I must write my age, I write 30, not 03. The same applies to a number written in base 2: it has a well-known order. Conventional, yes, but a convention so widespread that one can't ignore it, since it is one of the bases of communication; one can change the convention, but then one can't be surprised when a lot of people misunderstand. So we have data called bytes that we can interpret as binary numbers; this interpretation is suitable for many tasks, and there are simple ways to move between the different interpretations. All of it, by chance or not, is related.
And after all, it is all related to how the hardware is built and to the physical processes involved in keeping/retrieving a "''quantum''" of information in a thing we simply call "memory". Of course you are right, the meaning is in the interpretation. And haven't I given you a lot of different (but correlated) interpretations of the task? I talked about a huge number, ''represented'' in base 2, about how to "manipulate" it in order to obtain smaller base-2 numbers, and about how these relate to the inputs of the functions of the task. '''But''' the fact that a byte in a computer is made of bits is a little bit (!) below the level of an application. It is just something an application "must" know beforehand; the application did not decide how bits are packed into a byte, nor the ordering that makes it meaningful to say a+b=c, i.e. to take a human representation that (by chance!) matches what we know about sums of binary numbers (with special conventions for the computer world). Applications just use those conventions. They could indeed change them totally... but you would need to write a lot of code to do so! Relying on the underlying conventions makes everything easier. (In fact, try to use any of the 8!-1 other arrangements of bits in a byte and still use your preferred language's +, = and so on... I suppose you would have to write a lot of code to use them the way you normally do.)
 
I believed my "mathematical" approach of the "huge number {''represented in'' base 2}" was clear; maybe I will rewrite the task using the byte-streaming analogy (print "hello world" and print bin"11010100101001010110" should be similar...). Good Xmas and 2009 to all --[[User:ShinTakezou|ShinTakezou]] 02:01, 23 December 2008 (UTC)
 
: Yes, a number is abstract and independent of its representation. This is why we can have different representations of numbers, and why we can use numbers to describe so many different things. It is important to differentiate abstractions from representations; otherwise we would still be counting on our fingers. One finger is not one.
 
: A digit is not a number; it is a symbol. [http://en.wikipedia.org/wiki/Numerical_digit See].
 
: Byte has many meanings. When a byte is a minimal addressable/transportable storage unit, it becomes meaningless (and even dangerous) to talk about bit ordering within it. Sorry for being stubborn, but communication protocols are my bread and butter. Too often people make nasty errors here, leading to non-portable code, and sometimes putting someone's life in danger.
 
: There is no preferred ordering when we deal with conventional computer architectures, network protocols, parallel communication devices, etc. Surely, since a byte has a finite number of states, you can group the states and then order those groups, declaring them bits; but there is no preferred way to do it. If bits were addressable, then the order of addresses would be such a preferred way. In information systems there are likewise no bits in bytes, because bytes are used merely as a media layer.
 
: This is exactly your case. You have a sequence of bits which you want to store as a sequence of bytes. Certainly you need to define how these bits define the bytes in the sequence.
 
: But you don't need all this philosophy in order to unambiguously specify the task... (:-)) Merry Christmas! --[[User:Dmitry-kazakov|Dmitry-kazakov]] 09:51, 23 December 2008 (UTC)
 
==After the task rewrite; still on the sense of the task==
 
I still can't get the point, or how it is (was) related to the way the task is (was) specified. But '''to me''' what you are saying sounds a little confusing. In the common computing world, a byte is the minimal addressable storage unit; nonetheless, it is '''meaningful''' to talk about bit ordering within it, and in fact many processors allow handling single bits of a byte (or of bigger "units") once it is loaded into a processor register. E.g. the Motorola 680x0 family has bset, bclr and btst to set, clear or test a single bit of a register (and by convention the bit "labelled" 0 is the LSB; that is not my way of saying it, it is the way of the ''engineers'' who wrote the Motorola manual where I read about the 680x0 instruction set). The same applies if you want to use the logical bitwise operations that every processor I know of has: ''and'', ''or'', ''exclusive or'', ''not''. To keep all of this coherent, you must "assume" (define what is already defined) a bit ordering.
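
For instance, the effect of those 680x0 instructions can be sketched with C bitwise operators (the macro names are mine; bit 0 is the least significant bit, as in the Motorola manuals):

<pre>
/* Sketch: what btst/bset/bclr do, written with C bitwise operators.
   Bit n = 0 is the least significant bit, as in the Motorola manuals. */
#define BTST(v, n) (((v) >> (n)) & 1u)    /* test bit n  */
#define BSET(v, n) ((v) |  (1u << (n)))   /* set bit n   */
#define BCLR(v, n) ((v) & ~(1u << (n)))   /* clear bit n */
</pre>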
 
A bit ordering is always defined; otherwise a lot of things would be meaningless when computers are used. The following is an excerpt from RFC 793.
 
<pre>
TCP Header Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</pre>
 
The numbers at the top define a bit ordering over 32 bits (4 bytes). In order to give "universal" (shareable and portable) meaning to "numbers" that span more than a single byte, the big-endian convention is used, and this must be stated, and in fact it is, somewhere (again, I would call the big-endian convention the "natural" one, since we are used to writing numbers from left to right, from the most significant digit down to the least significant one, the units). In order to check, get or set the flags URG, ACK, PSH, RST, SYN and FIN, the bit ordering is essential. You may need to translate from one convention to another (e.g. if I loaded the 16 bits of the Source Port into a register of a 680x0, which is a big-endian processor, I would have the right "number" for the Source Port, since the endianness matches, but to refer to the bit labelled 15 here, I would have to use 0 instead). That labelling apart, I can access the URG flag with an ''and'' mask, which I can build without considering the labelling at all: the mask is simply 0000 0000 0010 0000 0000 0000 0000 0000, which I can write shortly as the hex number 00200000, i.e. simply hex 200000, or even (less conveniently, but keeping the "meaning") as 2097152 (base 10). Big-endianness apart (since the data are stored somewhere by computers that may have different conventions), no further specification is needed to build the mask.
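
A sketch of that mask in C (names are mine): URG is bit 10 of the 32-bit header word holding Data Offset, Reserved, the flags and Window, i.e. bit 21 counting from the least significant end:

<pre>
#include <stdint.h>

/* Sketch: URG is bit 10 of the header word shown above (counting from
   the most significant bit), i.e. bit 21 counting from the LSB. */
#define TCP_URG_MASK ((uint32_t)1u << 21)   /* 0x00200000 = 2097152 */

int tcp_urg_is_set(uint32_t word)
{
    return (word & TCP_URG_MASK) != 0;
}
</pre>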
 
The preferred ordering is the one in use on a particular system, and '''could''' differ. But it does not: since bytes are handled atomically, the bit ordering within a byte, for all the ways ordinary people can access it (i.e. machine code, or perhaps assembly), is always the same. So I can write the byte 01000001 (hex 41) and be sure it will stay the same, from "point" to "point" (i.e. transmission in the middle may change things, but some layer, at the hardware and/or software level, must "adjust" everything so that an "application" reads 0x41 and can interpret it, e.g., as an ASCII "A").
 
Another example where nobody felt it necessary to specify more than what is obvious from the exposition itself is the MIPS instruction set you can read at [http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html MIPS reference]. In order to write an assembler for that MIPS, you don't need more information than what is already given on the page. Using "my" implementation of the ''bit-oriented'' stream read/write functions, I could encode the Add instruction in the following way:
 
<pre>
bits_write(0, 6, stdout);       /* opcode: 6 bits, 0 for Add              */
bits_write(source, 5, stdout);  /* rs: first source register, 5 bits      */
bits_write(target, 5, stdout);  /* rt: second source register, 5 bits     */
bits_write(destreg, 5, stdout); /* rd: destination register, 5 bits       */
bits_write(0x20, 11, stdout);   /* shamt (5 bits of 0) + funct 0x20       */
</pre>
 
Here source, target and destreg are integers (only the low 5 bits are taken, so the range is 0 to 31). One could do it with ''and''s, ''or''s and ''shift''s, but then, when writing the final 32-bit integer into memory, one must pay attention to endianness; so again a bunch of and+shift operations could be used to "select" the single bytes of the 32-bit integer and write it out in an endianness-free way (a way which works on every processor, regardless of its endianness). (Here I am also supposing that the processor in use can handle 32-bit integers, i.e. is a 32-bit processor.)
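
Such an endianness-free selection of bytes could be sketched like this (writing to a stream for concreteness; the function name is mine):

<pre>
#include <stdio.h>
#include <stdint.h>

/* Sketch: select each byte of the 32-bit word with shifts and masks
   and emit them most significant first, whatever the host endianness. */
void write_be32(uint32_t w, FILE *out)
{
    fputc((int)((w >> 24) & 0xff), out);
    fputc((int)((w >> 16) & 0xff), out);
    fputc((int)((w >>  8) & 0xff), out);
    fputc((int)( w        & 0xff), out);
}
</pre>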
 
As with the MIPS reference given above, I don't need to define how the bits define the bytes in the sequence. It is straightforward and "natural", as the MIPS bit encoding is. If I want to write the bit sequence 101010010101001001010101010, I just have to group it 8 by 8:
 
<pre>
1010 1001 0101 0010 0101 0101 010
\_______/ \_______/ \_______/ \____
</pre>
 
and "pad" the last bit "creating" fake 0s until we have a number of bits multiple of 8:
 
<pre>
1010 1001 0101 0010 0101 0101 010X XXXX
\_______/ \_______/ \_______/ \_______/
</pre>
 
So the output bytes will be (in hex): A9 52 55 40. Also, to write that sequence as a whole, I would write <tt>bits_write(0x54A92AA, 27, stdout)</tt>, which could look a little confusing, but is not: if you write that hex number in base 2, you find exactly the sequence I wanted to write, right-aligned. (This right-alignment of the input is the one convention I decided on in my code, and it could differ in other implementations; but it is also the most logical convention if the code is not to depend on the maximum size of a register: if left-aligning were left to the user of the function, s/he would always have to code it taking into account the size of an "int" on that architecture. In my implementation this bookkeeping is left to the functions, which in fact left-align the bits taking into account the "real" size of the container; a #define should also allow using the code on architectures whose bytes are made of a different number of bits, e.g. 9.)
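
Built on the bit_put sketch given earlier, such a multi-bit write could look like this (again a sketch, with my names):

<pre>
/* Sketch: emit the low `count` bits of `value`, most significant first,
   using the bit_put/bit_flush sketch from earlier on this page. */
void bits_write(unsigned long value, int count, FILE *out)
{
    while (count-- > 0)
        bit_put((int)((value >> count) & 1), out);
}

/* bits_write(0x54A92AA, 27, stdout); bit_flush(stdout);
   should produce the bytes A9 52 55 40 shown above. */
</pre>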
 
Another way of writing the same sequence, and obtain the same bytes as output, could be:
<pre>
bits_write(1, 1, stdout); bits_write(0, 1, stdout);
bits_write(1, 1, stdout); bits_write(0, 1, stdout);
</pre>
 
And so on. If the task does not explain these details well enough, the examples should clarify them. But maybe I am wrong. --[[User:ShinTakezou|ShinTakezou]] 01:14, 28 December 2008 (UTC)
 
: My point is about [http://en.wikipedia.org/wiki/Bit_numbering bit-endianness]: it must be specified. You did it by providing an example in the task description. Note that the talk about writing strings into (binary?) files is superfluous and, worse, misleading, for exactly the same reasons, BTW. Text is '''encoded''' when written as bytes (or any other storage unit). It is the sloppiness of the [[C]] language that lets you mix these issues. If the file were UTF-32 encoded, your text output would be rubbish.
 
: The TCP header defines bits because it contains fields shorter than one byte, and because the physical layer is bit-oriented, which is '''not''' the case for byte-oriented files.
 
: If the MIPS architecture has a preferred bit-endianness, why should that surprise anybody? --[[User:Dmitry-kazakov|Dmitry-kazakov]] 10:32, 28 December 2008 (UTC)
 
The task specifies that we are handling ASCII (encoded) strings. Hopefully this is enough to avoid losing information, which would happen with any other encoding that uses the full byte. The bit-endianness is just a labelling problem. Even in the wiki page you linked, no matter whether the left bit is labelled 7 or 0, the byte (a binary number with at most 8 digits) is still 10010110, which we can read as the "number" 96 in hex (too lazy to get it in decimal now :D); and if I write such a byte, "formed" by such bits, into a file, I expect a hexadecimal dump of that file to show 96. These are details hidden in the code; no matter how you label the bits, the important fact is that when you use the functions to write the bits 10010110 as a "whole", you get the byte 96 in the output; and vice versa, when you read the first 8 bits from a file whose first byte is 96, you must get 10010110 (i.e. 96 :D). The same holds if you write an arbitrary sequence, like 100101101100, as a "whole": when you read back 12 bits, you get back 100101101100 (which is the "integer" 96C in hex).
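
The reading side mirrors the writer; a sketch (my names again):

<pre>
#include <stdio.h>

/* Sketch: MSB-first bit reader, the mirror of the writer sketched above. */
static int racc = 0, rbits = 0;

int bit_get(FILE *in)
{
    if (rbits == 0) {               /* refill the accumulator */
        if ((racc = fgetc(in)) == EOF)
            return EOF;
        rbits = 8;
    }
    return (racc >> --rbits) & 1;   /* hand out bits from the top down */
}
</pre>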
 
I still can't get the point of your second paragraph. When I "code" software at some not-too-low hardware level, I deal with bytes; I can't see the bit orientation. That is why the RFC can talk that way and still let programmers understand and correctly code applications handling TCP data, disregarding the physical layer. These things can be an issue when writing low-level drivers, dealing with serial communication, or the like... but we are a little ''bit'' higher than that! Files are byte-oriented, and that is the reason we need to pad with "spurious" bits when the bit sequence we want to write does not have a length that is a multiple of 8 (supposing a byte "contains" 8 bits); but if we "expand" the bits of each byte, we have a sequence of bits (whose last bits are maybe padding bits...); this is the "vision" of the task.
 
It is not surprising; the page simply hasn't specified a bit-endianness, which, as said before, is a labelling problem; the encoding of the addu instruction is
 
<pre>
0000 00ss ssst tttt dddd d000 0010 0001
</pre>
 
and nobody says whether the leftmost bit is bit 0 or bit 31. It does not matter, since the encoding of the instruction remains the same, and it is written into memory the same way. So here, indeed, we don't know whether MIPS prefers to call the leftmost bit 0 or 31.
 
One could wonder what happens with a little-endian processor; to get a feel for it, take
 
<pre>
0000033A 681000 push word 0x10
</pre>
 
from a disassembly: we have 68 (binary 01101000) followed by a 16-bit LE-encoded value. If the bits of the first instruction byte had individual meanings, we could say the encoding is:
 
<pre>
push byte/word/dword + DATA -> 0110AB00 + DATA
</pre>
 
(It is a fantasy example; the x86 push is not really like this!) Here the bits AB specify whether we are pushing a byte, a word or a dword (32 bits): AB=00 pushes a byte, AB=10 a word, AB=11 a dword (while AB=01 could be used to specify another kind of instruction); and somewhere it will be stated that DATA must be LE-encoded. But one must remember that once the DATA is fetched from memory into an (8/16/32-bit) register, there is no endianness anymore: if the stored DATA is hex 1000, then in the register it is just the 16-bit "integer" hex 1000. To talk about the encoding of push byte/word/dword, I don't need to specify a bit-endianness. I need to know it only when using instructions that manipulate single bits of a register (as said before, the Motorola 680x0 labels the LSB as 0).
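
The LE decoding step itself is a one-liner; a sketch (the function name is mine):

<pre>
#include <stdint.h>

/* Sketch: decode the little-endian 16-bit immediate that follows the
   opcode byte; the bytes 10 00 from the listing above give 0x0010. */
uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
</pre>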
 
<pre>
00000336 50 push ax
00000395 51 push cx
0000038A 52 push dx
0000045F 53 push bx
00000136 55 push bp
00000300 58 pop ax
</pre>
 
These "pushes"/pop suggest us that we could say the encoding of the push REG instruction is something like
 
<pre>
0101PRRR     RRR = 000 ax     P = 0 push
                   001 cx         1 pop
                   010 dx
                   011 bx
                   101 bp
</pre>
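
Composing that opcode byte from its fields can be sketched in C (my function name; the field values come from the table above):

<pre>
#include <stdint.h>

/* Sketch: build the one-byte 0101PRRR opcode from its fields.
   encode_pushpop(0, 0) -> 0x50 (push ax); encode_pushpop(1, 0) -> 0x58 (pop ax). */
uint8_t encode_pushpop(int pop, int reg)
{
    return (uint8_t)(0x50 | ((pop & 1) << 3) | (reg & 7));
}
</pre>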
 
It happens that x86 instructions are not all the same length, but that should not cause confusion; the way we say how x86 instructions are encoded is the same as for the MIPS, the 680x0 or whatever else. And regardless of the ''preferred'' endianness(!!), if we like to state it in a bit-wise manner:
 
<pre>
push word DATA -> 0110 1000 LLLL LLLL HHHH HHHH
L = bits of the LS byte (Low)
H = bits of the MS byte (High)
</pre>
 
And this way, which is sometimes used, does not require specifying any "bit-endianness": it is clear how the bits of the LS byte, LLLL LLLL, must be placed. E.g. for the "integer" 0010, the LS byte is 10 (binary 00010000) and the MS byte is 00, so we fill L and H this way:
 
<pre>
LLLL LLLL HHHH HHHH
0001 0000 0000 0000
</pre>
 
The endianness that can lead to problems is the byte endianness of "integers" stored in more than a single byte. At this (not so low) level, bit-endianness is just a labelling issue, and matters only when using instructions like the 680x0's bset, bclr and so on.
 
Hopefully the task is clear(er) now (at least an OCaml programmer seems to have got it!), and I've learned that «Ada allows specifying a bit order for data type representation» (but the underlying implementation will need to map it to the hardware convention, so I suppose it would be faster just to use the "default"!) --[[User:ShinTakezou|ShinTakezou]] 00:10, 6 January 2009 (UTC)
: Everything in mathematics is just a labeling problem. Mathematics is a process of labeling, no more; as is computing itself, by the way. Your subsequent reasoning makes no sense to me. When a byte is a container of bits, you cannot name the ordinal number corresponding to the byte '''before''' you label its bits (more precisely, define the encoding). The fallacy of your further reasoning is that you use a certain encoding (binary, positional, right to left) without naming it, and then start to argue that there is no other, that this one is natural (so there are others?), that everything else is superfluous, etc. In logic A=>A holds, but it proves nothing.
: Here are some examples of encoding in between bits and bytes: [http://en.wikipedia.org/wiki/RADIX-50 4-bit character codes], [http://en.wikipedia.org/wiki/Binary-coded_decimal packed decimal numbers].
: This is an example of a serial bit-oriented protocol: [http://en.wikipedia.org/wiki/Controller_Area_Network CAN]. Note how transmission conflicts are resolved in CAN using the identifier's bits put on the wire. Also note that a CAN controller is responsible for delivering CAN messages to the CPU in the endianness of the latter, i.e. it must recode sequences of bits on the wire into 8-byte data frames + identifiers.
: More about [http://www.linuxjournal.com/article/6788 endianness] --[[User:Dmitry-kazakov|Dmitry-kazakov]] 10:08, 6 January 2009 (UTC)
 
:: Sorry, at this point I think we can't understand each other. I believe I've explained the point in a rather straightforward (even though too long) way, and can't do better myself. In my computing experience, the "problem" and the task are understandable, clear and unambiguous. In an implementation-driven way, I can say that you've got it as I intended iff the output of the program fed with the byte sequence (bytes written in hex)
 
<pre>
41 42 41 43 55 53
</pre>
 
:: (which in ASCII can be read as "ABACUS") is
 
<pre>
83 0a 0c 3a b4 c0
</pre>
 
:: i.e. if you save the output to a file and inspect it with a hex dumper, you see exactly that; e.g.
 
<pre>
[mauro@beowulf-1w bitio]$ echo -n "ABACUS" |./asciicompress |hexdump -C
00000000 83 0a 0c 3a b4 c0 |...:..|
00000006
[mauro@beowulf-1w bitio]$ echo -n "ABACUS" |./asciicompress |./asciidecompress
ABACUS[mauro@beowulf-1w bitio]$
</pre>
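
:: The compressing side of such a filter could be sketched like this (assuming the MSB-first bit_put/bit_flush sketched earlier on this page; only the 7 significant bits of each ASCII byte are written):

<pre>
#include <stdio.h>

/* Sketch of an asciicompress main loop: drop the always-zero top bit
   of each ASCII character and emit the remaining 7 bits, MSB first. */
int main(void)
{
    int c, i;
    while ((c = getchar()) != EOF)
        for (i = 6; i >= 0; i--)
            bit_put((c >> i) & 1, stdout);
    bit_flush(stdout);   /* pad the last byte with zero bits */
    return 0;
}
</pre>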
 
::--[[User:ShinTakezou|ShinTakezou]] 18:10, 13 January 2009 (UTC)
 
::: As a small point: you said that most-to-least significant is the "natural" order, but I'd like to point out that that is only true in Western languages that are written left-to-right. In Arabic and Hebrew, decimal digits appear in the same order despite the surrounding text being read right-to-left, so the digits appear in least-to-most significant order. --[[User:Markjreed|Markjreed]] ([[User talk:Markjreed|talk]]) 13:12, 28 March 2024 (UTC)
 
 
== PL/I bitstring longer than REXX'... ==
 
because the input seems to be STRINGS followed by '0D0A00'x
[[User:Walterpachl|Walterpachl]] 20:22, 2 November 2012 (UTC)
 
 