Read a file character by character/UTF8

From Rosetta Code
Revision as of 20:17, 29 December 2013 by Walterpachl (added REXX version 2 (show my understanding of this task))
Read a file character by character/UTF8 is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Read a file one character at a time, as opposed to reading the entire file at once.

The solution may be implemented as a procedure, which returns the next character in the file on each consecutive call (returning EOF when the end of the file is reached).

The procedure should support the reading of files containing UTF8 encoded wide characters, returning whole characters for each consecutive read.
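
The procedure described above can be sketched as follows. This is a minimal illustration only, assuming Python 3; the names `make_char_reader`, `next_char` and the `EOF` sentinel are hypothetical and chosen for this sketch, not taken from the task.

```python
import io

EOF = None  # hypothetical end-of-file sentinel chosen for this sketch


def make_char_reader(path):
    """Return a function that gives the next character of `path` on each call."""
    # Text mode with an explicit encoding makes read(1) return one decoded
    # character (one code point), however many bytes it occupies on disk.
    # newline="" turns off newline translation, so CR and LF arrive unchanged.
    handle = io.open(path, "r", encoding="utf-8", newline="")

    def next_char():
        if handle.closed:
            return EOF
        ch = handle.read(1)
        if not ch:
            handle.close()  # release the file once the end is reached
            return EOF
        return ch

    return next_char
```

A caller would typically loop with `ch = next_char()` until the result `is EOF`. Calls made after the end of the file keep returning `EOF`.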

See also

Run BASIC

<lang runbasic>open "file.txt" for binary as #f
numChars = 1               ' specify number of characters to read
a$ = input$(#f, numChars)  ' read number of characters specified
b$ = input$(#f, 1)         ' read one character
close #f</lang>

Perl 6

Perl 6 has a built-in method .getc to get a single character from an open file handle. File handles default to UTF-8, so they will handle multi-byte characters correctly.

To read a single character at a time from standard input ($*IN in Perl 6): <lang perl6>.say while defined $_ = $*IN.getc;</lang>

Or, from a file: <lang perl6>my $filename = 'whatever';

my $in = open( $filename, :r ) or die "$!\n";

print $_ while defined $_ = $in.getc;</lang>

Python

Works with: Python version 2.7

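<lang python>
filename = "file.txt"  # placeholder path; substitute the file to read

# Binary mode reads raw bytes, so a multi-byte UTF-8 character
# arrives as several separate one-byte reads.
with open(filename, "rb") as f:
    while True:
        onebyte = f.read(1)
        if not onebyte:      # empty result means end of file
            break
        byte = onebyte[0]
</lang>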
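
The loop above hands back single bytes rather than whole characters. One way to keep the byte-at-a-time reading and still get whole UTF-8 characters is to feed each byte to an incremental decoder from the standard `codecs` module. The following is a sketch, assuming Python 3; the function name `utf8_chars` is hypothetical.

```python
import codecs


def utf8_chars(path):
    """Yield decoded characters from `path`, reading one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while True:
            onebyte = f.read(1)
            if not onebyte:
                break
            # decode() returns an empty string until all bytes of the
            # current character have been supplied.
            ch = decoder.decode(onebyte)
            if ch:
                yield ch
        # final=True raises UnicodeDecodeError if the file ends mid-character.
        decoder.decode(b"", final=True)
```

The decoder buffers incomplete sequences internally, so the caller never sees a partial character. Invalid input raises `UnicodeDecodeError` instead of being passed through.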

Racket

Don't we all love self-reference? <lang racket>
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</lang> Output: <lang racket>
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</lang>

REXX

version 1

REXX doesn't support UTF8 encoded wide characters, just bytes.

The task requires that EOF be returned on reaching the end of the file, so this example is written as a subroutine (procedure). <lang rexx>/*REXX pgm reads/shows a file char by char, returning  'EOF'  when done. */
parse arg f .                          /*F is the fileID to be read.    */
                                       /* [↓]  show the file's contents.*/
if f\==''  then do j=1  until x=='EOF' /*J counts the file's characters.*/
                x=getchar(f);    y=    /*get a character  or  an 'EOF'. */
                if x>>' '   then y=x   /*display   X   if presentable.  */
                say  right(j,12)   'character,  (hex,char)'    c2x(x)    y
                end   /*j*/            /* [↑] only show X if not low hex*/
exit                                   /*stick a fork in it, we're done.*/
/*───────────────────────────────GETCHAR subroutine─────────────────────*/
getchar: procedure;  parse arg z;  if chars(z)==0  then return 'EOF'
                                                        return charin(z)</lang>

input   file:   ABC

123 [¬ a prime]

output   (for the above input file):

           1 character,  (hex,char) 31 1
           2 character,  (hex,char) 32 2
           3 character,  (hex,char) 33 3
           4 character,  (hex,char) 20
           5 character,  (hex,char) 5B [
           6 character,  (hex,char) AA ¬
           7 character,  (hex,char) 20
           8 character,  (hex,char) 61 a
           9 character,  (hex,char) 20
          10 character,  (hex,char) 70 p
          11 character,  (hex,char) 72 r
          12 character,  (hex,char) 69 i
          13 character,  (hex,char) 6D m
          14 character,  (hex,char) 65 e
          15 character,  (hex,char) 5D ]
          16 character,  (hex,char) 0D
          17 character,  (hex,char) 0A
          18 character,  (hex,char) 454F46 EOF

version 2

<lang rexx>/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
* see http://de.wikipedia.org/wiki/UTF-8#Kodierung
* sorry this is in German but the encoding table should be obvious
*--------------------------------------------------------------------*/
oid='utf8.txt';'erase' oid /* first create file containing utf8 chars*/
Call charout oid,'79'x
Call charout oid,'C3A4'x
Call charout oid,'C2AE'x
Call charout oid,'E282AC'x
Call charout oid,'F09D849E'x
Call lineout oid
fid='utf8.txt'             /* then read it and show the contents     */
Do While chars(fid)>0
  c8=get_utf8char(fid)
  Say left(c8,4) c2x(c8)
  End
Say 'EOF'
Exit

get_utf8char: Procedure
  Parse Arg f
  c=charin(f)
  b=c2b(c)
  If left(b,1)=0 Then
    Nop
  Else Do
    p=pos('0',b)
    Do i=1 To p-2
      If chars(f)=0 Then Do
        Say 'illegal contents in file' f
        Leave
        End
      c=c||charin(f)
      End
    End
  Return c

c2b: Return x2b(c2x(arg(1)))</lang> output:

y    79
ä   C3A4
®   C2AE
€  E282AC
𝄞 F09D849E
EOF
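
The lead-byte technique used in version 2 (count the leading 1 bits of the first byte to learn how many continuation bytes follow) can also be sketched in Python. This is an illustration under assumptions, not a translation of the REXX program: the function name `read_utf8_char` is hypothetical, Python 3 is assumed, and unlike the REXX version it rejects a stray continuation byte instead of returning it.

```python
def read_utf8_char(f):
    """Read one UTF-8 encoded character from the binary file object `f`.

    Returns the decoded character, or None at end of file.
    """
    lead = f.read(1)
    if not lead:
        return None
    first = ord(lead)
    # The high bits of the lead byte give the length of the sequence.
    if first < 0x80:                # 0xxxxxxx: single byte
        extra = 0
    elif first >> 5 == 0b110:       # 110xxxxx: one continuation byte
        extra = 1
    elif first >> 4 == 0b1110:      # 1110xxxx: two continuation bytes
        extra = 2
    elif first >> 3 == 0b11110:     # 11110xxx: three continuation bytes
        extra = 3
    else:
        raise ValueError("invalid UTF-8 lead byte: 0x%02X" % first)
    rest = f.read(extra)
    if len(rest) != extra:
        raise ValueError("file ends inside a multi-byte character")
    # decode() also checks that the continuation bytes are well formed.
    return (lead + rest).decode("utf-8")
```

Called repeatedly on a file opened with `open(path, "rb")`, it returns one character per call and `None` once the file is exhausted.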

Ruby

UTF-8 has been the default source encoding since Ruby 2.0. In Ruby 1.9, put the magic comment "#encoding: utf-8" on the first line. <lang ruby>DATA.each_char{|c| p c}

__END__
characters: λ, α, γ</lang>

Tcl

To read a single character from a file, use: <lang tcl>set ch [read $channel 1]</lang>

This will read as many bytes as are needed to obtain one Unicode character, provided a suitable encoding has been configured on the channel. For binary channels, it always consumes exactly one byte.

However, the low-level channel buffering logic may consume more than one byte, which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the lseek OS call. The extent of buffering can be controlled via: <lang tcl>fconfigure $channel -buffersize $byteCount</lang>

When the channel is only being accessed from Tcl (or via Tcl's C API), it is not normally necessary to adjust this option.