Talk:Bitwise IO


Real intention

The real intention of this task and its code was to have a set of functions for testing the LZW compressor/decompressor on real binary output (instead of an array or similar output); this way, compression-ratio statistics for the task LZW compression can be computed. --ShinTakezou 16:57, 19 December 2008 (UTC)

Hmm, strictly speaking it is not bit-oriented. It is rather a bit-stream interface to a byte-oriented I/O. Precisely because it is not bit I/O (as, for instance, serial I/O is), you have an endianness issue here. So your task should specify whether the encoding of bits into bytes is big or little endian. Your code looks like big endian. Was that your intention? --Dmitry-kazakov 18:28, 19 December 2008 (UTC)
It is bit-oriented in the way common software I/O can be (it is not my intention to drive serial hardware this way!): you think you are writing one or more bits, but in fact they are sent only in groups of eight, since we can only write bytes (and no more than one byte at a time, or we would surely have endianness issues). Endianness while computing could be a point, but I can't see how it matters here, since endianness issues concern how bytes are stored in memory. Let us take 0xFF packed into a 16-bit word. It will be written into memory, on a little-endian architecture, as 0xFF 0x00. But when you load it, you can consider it logically (I love big endian!) as 0x00FF, and you can be sure that if you perform a left shift you obtain 0x01FE... If you write that back to memory, you again have 0xFE 0x01 on an LE architecture. But if you shift left 9 times, you obtain 0xFE00 with a carry (or whatever the status flag for a bit that slipped off the left end is called). Again, if you write it into memory and pretend to read it back as sequential bytes instead of as a word, you get an issue: in memory it is 0x00 0xFE on an LE architecture. Luckily, even LE processors handle data in registers in this more logical way!
You can object that a variable such as d is stored in memory somewhere, so it is LE-encoded. But all operations in a processor are, so to speak, endianness-logical(!). So when I left-shift a 0x00FF that is stored in memory as 0xFF 0x00, the processor first loads the datum, so that you can now think of it as 0x00FF, then performs the left shift, obtaining e.g. 0xFF00, and then stores it back into memory as 0x00 0xFF. If I read from memory byte by byte, loading and shifting to build a datum longer than one byte, I would have to consider endianness seriously. But that is also why the read and write operations are performed byte by byte rather than by accumulating into a 16- or 32-bit word.
To put it briefly, I can't see any endianness issue. I am not interested in how the processor stores the unsigned int I take the bits from; the fact is that when I do datum << (32-4) for the 4-bit datum 0xF, what I obtain is (luckily) 0xF0000000 (stored as 0x00 0x00 0x00 0xF0). When I shift one more bit to the left, I expect to get 0xE0000000, not 0xE0010000 (stored as 0x00 0x00 0x01 0xE0). I am making it sound harder than it is. It's late for me... hopefully tomorrow I will find better words. --ShinTakezou 01:07, 20 December 2008 (UTC)
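A minimal C sketch of this point (illustrative only, not part of the task code): shifts act on the value of an integer, so emitting the most significant byte first produces the same byte sequence on any architecture, regardless of how the integer is laid out in memory.

<lang c>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t datum = 0xF;         /* the 4-bit datum 0xF              */
    datum <<= (32 - 4);           /* left-align it: 0xF0000000        */
    datum <<= 1;                  /* one more shift: 0xE0000000       */

    /* Emit the value most-significant byte first.  The output is the
       byte sequence E0 00 00 00 on any architecture, because both the
       shifts and the extraction below act on the value of the integer,
       not on its byte layout in memory. */
    for (int i = 3; i >= 0; i--)
        putchar((int)((datum >> (8 * i)) & 0xFF));

    return 0;
}
</lang>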
Endianness is an issue here because you have to specify how a tuple of N bits is to be packed into an integral unit called a byte, since your I/O deals with these units, not with bits. Note, this has nothing to do with the endianness of the machine. If N=8, then, to illustrate the case, let B = (B_1,...,B_8) = (1,0,0,0,1,0,0,1); then B is 137 (decimal) using big-endian and 145 (decimal) using little-endian encoding, and there are N! variants of packing in total. The code you provided looks like big endian. If you don't like the term endian, no problem. You can simply provide a formula, like byte = Σ_i B_i⋅2^(i-1) (= little endian) or byte = Σ_i B_i⋅2^(8-i) (= big endian).
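For concreteness, a small C illustration of the two packings described above (the helper names are hypothetical; the loops use 0-based indices, so the exponents are shifted by one with respect to the formulas above):

<lang c>
#include <stdio.h>

/* Pack eight bits b[0..7] (each 0 or 1) into a byte.  With 0-based
   indices the formulas become sum of b[i]*2^(7-i) ("big endian",
   b[0] most significant) and sum of b[i]*2^i ("little endian",
   b[0] least significant). */
static unsigned pack_be(const int b[8])
{
    unsigned byte = 0;
    for (int i = 0; i < 8; i++)
        byte = (byte << 1) | (unsigned)b[i];
    return byte;
}

static unsigned pack_le(const int b[8])
{
    unsigned byte = 0;
    for (int i = 0; i < 8; i++)
        byte |= (unsigned)b[i] << i;
    return byte;
}

int main(void)
{
    int b[8] = {1, 0, 0, 0, 1, 0, 0, 1};
    /* prints "big endian: 137, little endian: 145" */
    printf("big endian: %u, little endian: %u\n", pack_be(b), pack_le(b));
    return 0;
}
</lang>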
As far as I know, the packing into bytes is the one expected in common mathematics (that is, if we want to talk about endianness, big endian): the less significant bits are on the right, the most significant bits are on the left, so that, the A_i with i ∈ [0,7] being the binary digits, A_0 is the units bit, which is, so to say, A_0⋅2^0. So the binary number 10010001 is simply packed into a byte as 10010001. When we are interested in fewer than 8 bits, they should be counted (always so to say) from right to left; e.g. if I want to write the 4 bits 1001, these must be right-aligned in the variable holding them. But these are conventions the user of the functions must know; they are not mandatory the way I did them. I've implemented everything so that you must always right-align the bits; the functions then shift left so that the intended most significant bit of the datum becomes the leftmost bit in the container (an unsigned int in the C implementation). It is enough that the user of the functions knows in which order your functions extract bits from the data they pass; then it is understood that the first bit (be it the leftmost or the rightmost, according to your convention) must be the first bit of the output stream, so that it will be the first bit you read when you ask for a single bit from that stream. Maybe your misunderstanding comes from the fact that you handle single bits (in your explanation; I have not looked at your code yet) as units held in an array. In C it would be like having char bitarray[N], where each char can only be 0 or 1. Doing it that way, you can pick your preferred convention, i.e. whether bitarray[0] or bitarray[N-1] is the first bit to be output. In C this implementation would be hard to use: if I want to write integers less than 16, which fit into just 4 bits, it is enough to pass the number to the function, say 12 (binary 1100), and tell it I want just (the first) 4 bits; otherwise I (as the user of the functions) would have to split the integer into chars, each representing one bit, pack them into an array with the right endianness, and then pass the array to the function. Maybe this is easier in Ada (I don't know), but it would be harder in C. Consider e.g. the array of integers output by the LZW task: in C, it is enough to take one integer at a time and tell the function to output, say, 11 bits of that integer (of course, 11 bits must be enough to hold the value!); you are then outputting 11-bit words (which are the words of the LZW dictionary). I hope I've understood what you said and expressed my explanation well (too long as usual, sorry). --ShinTakezou 21:53, 20 December 2008 (UTC)
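A minimal sketch in C of the usage pattern described above (the names and the exact interface are assumptions, not the actual task code): the caller passes a right-aligned value plus a bit count, and the writer shifts the bits out most-significant-bit first, flushing whole bytes.

<lang c>
#include <stdio.h>

/* Hypothetical streaming writer: the caller passes a right-aligned
   value and the number of its low-order bits to emit; the writer
   shifts them out most-significant-bit first, flushing whole bytes. */
typedef struct {
    FILE *file;
    unsigned accum;   /* bits not yet flushed           */
    int count;        /* number of valid bits in accum  */
} bit_writer;

static void bit_writer_put(bit_writer *w, unsigned value, int nbits)
{
    for (int i = nbits - 1; i >= 0; i--) {
        w->accum = (w->accum << 1) | ((value >> i) & 1u);
        if (++w->count == 8) {               /* a full byte: flush it */
            fputc((int)(w->accum & 0xFF), w->file);
            w->accum = 0;
            w->count = 0;
        }
    }
}

int main(void)
{
    bit_writer w = { stdout, 0, 0 };
    /* Emitting an 11-bit LZW code needs only the value and the bit
       count, no per-bit array: */
    bit_writer_put(&w, 0x5A3, 11);
    bit_writer_put(&w, 0x12, 5);   /* 11+5 = 16 bits, i.e. 2 whole bytes */
    return 0;
}
</lang>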

Ok, I've seen the code. In your implementation you need this specification, since you can choose. In the C implementation, or in any other low-level implementation, it is not needed, since there is only one choice: we must place the bits so that the leftmost is the most significant and the rightmost the least significant (which is, so to say, the mathematical convention). How these significant bits are aligned in the container is a choice of the implementation. The only important thing is that if I want to output the 4 bits 1100, then the single byte of output must be 0xC0 (i.e. 11000000, where the trailing four zeros are just padding). This way, when reading, the first bit will be 1, the second 1, the third 0 and the fourth 0. If we append one bit after another in reading order, we obtain 1100 (and we must also obtain 1100 if we read the four bits in a single shot). This is what bit-oriented is meant to be. --ShinTakezou 22:07, 20 December 2008 (UTC)
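A worked example of exactly that case (a throwaway sketch, not the task code): writing the 4 bits 1100 and padding with zeros yields the single byte 0xC0, and reading the bits back from the most significant end returns them in the order they were written.

<lang c>
#include <stdio.h>

int main(void)
{
    unsigned value = 0xC;   /* the 4 bits 1100, right-aligned */
    int nbits = 4;

    /* Append the bits most-significant-first into an output byte,
       then pad the remaining positions with zeros on the right. */
    unsigned byte = 0;
    for (int i = nbits - 1; i >= 0; i--)
        byte = (byte << 1) | ((value >> i) & 1u);
    byte <<= (8 - nbits);                    /* byte is now 0xC0 */
    printf("output byte: 0x%02X\n", byte);

    /* Reading back one bit at a time from the most significant end
       gives 1, 1, 0, 0 -- the bits in the order they were written. */
    for (int i = 7; i >= 8 - nbits; i--)
        printf("bit: %u\n", (byte >> i) & 1u);

    return 0;
}
</lang>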

  1. The thing you mean is the binary positional numeral system, which provides a notation for natural numbers. That is also an unrelated issue.
  2. Your task, speaking mathematically, is about a [bijective] mapping of ordered sets of bits onto ordered sets of numbers from the range 0..2^N-1.
  3. I don't want to comment on C, but the argument that something is hard to do in C cannot serve as an argument.
  4. Mathematically there are no "first" and "last" bits of a number. A number is an integral entity. There are encodings and representations of numbers in the memory of a computer. Only a representation of a number may have first bits, provided that the memory bits are ordered in some way. Which, BTW, is not the case on most modern architectures, and was never the case for files. So it does not make any sense anyway.

--Dmitry-kazakov 09:08, 21 December 2008 (UTC)

Interesting. I am planning to rewrite the task text, maybe in a more formal way (and hopefully a clearer one), but right now I am busier with LZW and Christmas stuff. I don't think the following clarifies the task, but I believe it is worth noting.

  1. The whole task can be restated briefly: provide functions to write integer numbers packed into n bits (binary digits), and to read integer numbers packed into n bits. Consider a very long binary number composed of N bits. The bits of this huge number can be grouped into M integral numbers n_i; the grouping is done taking L_1 bits for n_1, L_2 bits for n_2, and so on. This also suggests the reading process (which is clarified by the next sentence, since we must know the grouping convention). About writing: consider M integral numbers n_i; the huge number will be n_1 + n_2⋅2^(L_1) + n_3⋅2^(L_1+L_2) + ..., which can be written as Σ_i n_i⋅2^(Σ_{j<i} L_j), with i from 1 to M, j from 0 to i-1, and L_0 = 0. This should also clarify the reading (though to me it does not make the task clearer for everyone). (In this mathematical phrasing, n_M is the first number the user asked to output; we could say it is the last, as would be natural; but that way, reading from a stream would mean having the whole stream (the huge number) in memory, instead of split into bytes.)
  2. Yes, I suppose you can see it that way. It is not important how you define the thing; what matters is that the result is as expected. If I want to output the number 1100 in 4 bits, I must obtain the byte 11000000 (the last four zeros being padding). If I want to output the number 1001 in 9 bits, I must obtain 00000100 10000000 (the trailing seven zeros being padding). If I want to output the number 1100 in 4 bits, "followed" by the number 1001 in 9 bits, I must obtain 11000000 01001000. You can tell the user of your functions that, to write the number 1100 (and obtain 11000000 as output), s/he must pack the bits into an array so that A[0] = 1, A[1] = 1, A[2] = 0, A[3] = 0, or A[0] = 0, A[1] = 0, A[2] = 1, A[3] = 1, or any other of the N! variants. Why must the output look that way? Conventionally, I want that if I read one bit at a time and build the "number" by shifting the already stored bits to the left, I obtain the intended "number". While writing 1100, I want that, shifting leftward within the requested size (4 bits), the first bit that goes out is the first bit of the output, the second is the second, and so on. So let us have an accumulator of infinite size (not important) and the huge number of N bits, its most significant bits being 1100.... Put the "accumulator" A in front of the huge number H, like A|H (you can consider the whole thing as the same huge number H, since at the beginning the accumulator is empty, which is just like writing a lot of zeros in front of H, i.e. ...00000H, where H stands for N binary digits). Now left-shift the "number" A|H (i.e. multiply it by 2): you obtain 1|100.... The accumulator holds (an infinite sequence of 0s followed by) 1. It is as if you had read one bit into the accumulator. Left-shift again: now we have 11|00.....; left-shift two more times and you obtain 1100|.... Now the accumulator holds the number 1100, since we have read the same number of bits we wrote (of course, this count must be known!). If we want a "new" number, we clear the accumulator and continue the process. Imagining it this way should also make clearer why I called this bit-oriented reading (writing is not so different). When building the tape, I must provide a way to supply the bits I want to write. Here, A and H are swapped: the accumulator A holds exactly the bits I want to write, and H is a growing number, at the beginning just 0 (i.e. an infinite sequence of zeros), though of course at any point of the process it simply holds a sequence of bits. I put the number 1100 into the accumulator (shrunk to 4 bits), having H|1100 (it would have been 01100 if I wanted to write the same number packed into 5 bits). Left-shifting gives H1|1000, then H11|0000, then H110|0000, then H1100|0000. At this point I have "appended" the 4-bit binary number 1100 to H, building a bigger number. All of this suggests an implementation, but I can't say it is mandatory to implement it this way. I just want the following: if the huge number is 1100H and I ask for 4 bits, I must obtain 1100 (as if, in natural language, I take the first four bits, write them apart, and then cross them out of the stream). If the number is H and I say I want to write the number 1100, I must obtain H1100. In base 10, see it as a turn-based game: player 1 writes a number (he may also write leading 0 digits), e.g. 06744. Player 2 writes another number right after the previous one; he wants to write 998, so that on the paper we can now read 06744998.
Now backwards: player 1 must extract the digits in order to obtain his number back; he extracts 06744 and, to signal how many digits he took, crosses them out; player 2 must do the same, of course ignoring the crossed-out digits; he takes 998. If we have N players, we obtain one big "unique" number composed of N numbers. Of course, each player, in order to get his number back, must remember correctly how many digits it had, and must hope that the players before him remembered their digit counts too. E.g., if player 1 takes just 0674, then even though player 2 remembers he wrote 3 digits, he obtains 499, which is wrong... I hope this stream of consciousness helps you to help me understand how to write the task clearly and (oh yes!) briefly! Rather than as maths, I would state it as manipulation of bit data. (A small C sketch of this write/read process is given after this list.)
  3. I can't understand the sentence. High-level languages sometimes make it harder to do things that in assembly would be a lot easier and more "direct". C is high level, but not that high level. The sentence "Maybe this is easier in Ada (I don't know), but it would be harder in C" (did you refer to this "hard"?) refers to the fact that it would be hard (difficult) to implement the task in C the way you implemented it in Ada (i.e. using an array to hold the bits): it would make the code not very usable (e.g. for LZW); the best way (at least in C) is to manipulate bits with shifts, ANDs and ORs (the operations closest to what a processor can do). Not every implementation will follow this approach; I wonder how the code can be used to implement a real-world LZW compressor based on the code given in LZW compression.
  4. Mathematically, there is a "first" bit and a "last" bit. Mathematicians may feel disgust at the expression, but once one learns about positional numeral systems it is easy to understand what can be meant by the first digit and the last digit of a number. Take it as a short form for "the least significant digit" (LSB, for binary) and "the most significant non-zero digit" (though in the computer world, where we are used to considering integers packed into fixed-size containers, this can simply be "the most significant digit", MSB for binary). My usage of "first" is clear from the context (or so I believed), and sometimes refers to the order of the intended output rather than to mathematical matters, so that the first bit of a stream is the one we wrote first, and the last is the one we wrote last, e.g. F1100.....0101L (F marking the first bit, L the last). Users of the functions must know whether the first bit of the output will be the rightmost (or the bit held in the first/last element of the array) or the leftmost (or the bit held in the last/first element of the array) of the input "bit string", so that they can arrange the bits to obtain the output they want. Representations of numbers are numbers, unless you want to consider physical processes, since bits are "representations" of physical processes in the hardware; but no one is interested in these details while programming. I suppose "representation" is what I've called "packing". We pack numbers into finite, fixed-size containers since, for a computer, numbers must be "real", not abstract entities. The concept of LSB (least significant bit) is fine for abstract entities too; the MSB is not, and we must use the "most significant non-zero bit" form. A file has an ordering too. We can imagine a file as a sequence of bytes, so there is a first byte and a last byte. Since each byte "contains" bits, if we look at this bit stream there is again a first bit and a last bit in the sequence, as said before. If we consider it as one whole huge number with N digits (the number 00030 has 5 digits, even if we can write it as 30, which is the same number with only 2 digits), as I explained before, then the first bit, being the LSB, would be the "last" one of the previous discussion. (But I believed the meaning of my usage was clear.) No one is interested in the physical ordering of the bits in memory; they are likely rather scattered; but these are uninteresting details we can disregard while programming: we can consider all the bits as if they had an ordering. The same goes for files: we are not interested in how the data are organized on disk; it is enough that we can get them back in the order we expect!
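A compact C sketch of the write/read process described in points 1 and 2 above (the buffer, names and sizes are assumptions, not the task code): bits are appended most-significant-first into a byte buffer, then read back by shifting the previously read bits to the left.

<lang c>
#include <stdio.h>
#include <stdint.h>

/* Append the low 'nbits' bits of 'value' to the bit stream held in
   buf[], most significant of those bits first; '*used' counts the
   bits already stored. */
static void put_bits(uint8_t buf[], int *used, unsigned value, int nbits)
{
    for (int i = nbits - 1; i >= 0; i--) {
        if ((value >> i) & 1u)
            buf[*used / 8] |= (uint8_t)(1u << (7 - *used % 8));
        (*used)++;
    }
}

/* Read back 'nbits' bits starting at position '*pos', building the
   value by shifting the previously read bits to the left. */
static unsigned get_bits(const uint8_t buf[], int *pos, int nbits)
{
    unsigned value = 0;
    for (int i = 0; i < nbits; i++, (*pos)++)
        value = (value << 1) | ((buf[*pos / 8] >> (7 - *pos % 8)) & 1u);
    return value;
}

int main(void)
{
    uint8_t buf[4] = {0};
    int used = 0, pos = 0;

    put_bits(buf, &used, 0xC, 4);   /* write 1100 in 4 bits */
    put_bits(buf, &used, 0x9, 9);   /* write 1001 in 9 bits */
    printf("bytes: %02X %02X\n", buf[0], buf[1]);   /* C0 48 */

    unsigned first  = get_bits(buf, &pos, 4);       /* 0xC   */
    unsigned second = get_bits(buf, &pos, 9);       /* 0x9   */
    printf("read back: %X %X\n", first, second);
    return 0;
}
</lang>

Writing 1100 in 4 bits followed by 1001 in 9 bits produces the bytes C0 48, i.e. 11000000 01001000, matching the example in point 2.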

I will try to rewrite the task taking these misunderstandings into account, but not before I get the LZW working (which, by the way, is also a test bed for the bits_read/write functions :D!), understand how to split it into smaller tasks, and provided this Christmas doesn't kill me. --ShinTakezou 00:19, 22 December 2008 (UTC)