Read a file character by character/UTF8

Line 13:
 
=={{header|Action!}}==
<syntaxhighlight lang="action!">byte X
Proc Main()
Line 25:
Close(1)
Return</syntaxhighlight>
 
=={{header|AutoHotkey}}==
{{works with|AutoHotkey 1.1}}
<syntaxhighlight lang="autohotkey">File := FileOpen("input.txt", "r")
while !File.AtEOF
MsgBox, % File.Read(1)</syntaxhighlight>
 
 
=={{header|BASIC256}}==
<syntaxhighlight lang="basic256">f = freefile
filename$ = "file.txt"
 
Line 44:
end while
close f
end</syntaxhighlight>
 
=={{header|C}}==
<syntaxhighlight lang="c">#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
Line 67:
 
return EXIT_SUCCESS;
}</syntaxhighlight>
 
=={{header|C++}}==
<syntaxhighlight lang="cpp">
#include <fstream>
#include <iostream>
Line 94:
return EXIT_SUCCESS;
}
</syntaxhighlight>
 
=={{header|C sharp|C#}}==
<syntaxhighlight lang="csharp">using System;
using System.IO;
using System.Text;
Line 124:
}
}
}</syntaxhighlight>
 
=={{header|Common Lisp}}==
{{works with|CLISP}}{{works with|Clozure CL}}{{works with|CMUCL}}{{works with|ECL (Lisp)}}{{works with|SBCL}}{{works with|ABCL}}
 
<syntaxhighlight lang="lisp">;; CLISP puts the external formats into a separate package
#+clisp (import 'charset:utf-8 'keyword)
 
Line 135:
(loop for c = (read-char s nil)
while c
do (format t "~a" c)))</syntaxhighlight>
 
=={{header|Crystal}}==
Line 142:
The encoding is UTF-8 by default, but it can be explicitly specified.
 
<syntaxhighlight lang="ruby">File.open("input.txt") do |file|
file.each_char { |c| p c }
end</syntaxhighlight>
 
or
 
<syntaxhighlight lang="ruby">File.open("input.txt") do |file|
while c = file.read_char
p c
end
end</syntaxhighlight>
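
The encoding can also be given explicitly. A minimal sketch, assuming the <code>encoding</code> named argument of <code>File.open</code> in current Crystal releases:

<syntaxhighlight lang="ruby"># assumes File.open accepts an encoding named argument (current Crystal)
File.open("input.txt", encoding: "UTF-8") do |file|
  file.each_char { |c| p c }
end</syntaxhighlight>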
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| System.Classes}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Read_a_file_character_by_character_UTF8;
 
Line 186:
end;
readln;
end.</syntaxhighlight>
 
=={{header|Déjà Vu}}==
 
<syntaxhighlight lang="dejavu">#helper function that deals with non-ASCII code points
local (read-utf8-char) file tmp:
!read-byte file
Line 228:
!close file
return
!.</syntaxhighlight>
 
=={{header|Factor}}==
<syntaxhighlight lang="text">USING: kernel io io.encodings.utf8 io.files strings ;
IN: rosetta-code.read-one
 
"input.txt" utf8 [
[ read1 dup ] [ 1string write ] while drop
] with-file-reader</syntaxhighlight>
 
 
=={{header|FreeBASIC}}==
<syntaxhighlight lang="freebasic">Dim As Long f
f = Freefile
 
Line 253:
Wend
Close #f
Sleep</syntaxhighlight>
 
=={{header|FunL}}==
<syntaxhighlight lang="funl">import io.{InputStreamReader, FileInputStream}
 
r = InputStreamReader( FileInputStream('input.txt'), 'UTF-8' )
Line 263:
print( chr(ch) )
r.close()</syntaxhighlight>
 
=={{header|Go}}==
<syntaxhighlight lang="go">package main
 
import (
Line 287:
fmt.Printf("%c", r)
}
}</syntaxhighlight>
 
=={{header|Haskell}}==
Line 293:
{{Works with|GHC|7.8.3}}
 
<syntaxhighlight lang="haskell">#!/usr/bin/env runhaskell
 
{- The procedure to read a UTF-8 character is just:
Line 334:
xs -> forM_ xs $ \name -> do
putStrLn name
withFile name ReadMode processOneFile</syntaxhighlight>
{{out}}
<pre>
Line 350:
First, we know that the first octet of a UTF-8 sequence tells us the length of the sequence needed to represent that character. Specifically: we can convert that value to binary and count the number of leading 1s to find the length of the character in octets (except that the length is always at least 1).
 
<syntaxhighlight lang="j">u8len=: 1 >. 0 i.~ (8#2)#:a.&i.</syntaxhighlight>
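
For example, applying <code>u8len</code> at the J prompt to an ASCII byte and to the lead octet 0xE2 (the first byte of a three-byte sequence such as €) should give:

<syntaxhighlight lang="j">   u8len 'a'
1
   u8len 226{a.   NB. 0xE2 = 11100010, three leading 1s
3</syntaxhighlight>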
 
So now, we can use indexed file read to read a UTF-8 character starting at a specific file index. We read the first octet and then read as many additional octets as that first octet calls for. If that is not possible, we return EOF:
 
<syntaxhighlight lang="j">indexedread1u8=:4 :0
try.
octet0=. 1!:11 y;x,1
Line 361:
'EOF'
end.
)</syntaxhighlight>
 
The length of the result tells us what to add to the file index to find the next available file index for reading.
Line 370:
 
=={{header|Java}}==
<syntaxhighlight lang="java">import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
Line 384:
}
}
}</syntaxhighlight>
 
=={{header|jq}}==
jq being stream-oriented, it makes sense to define `readc` so that it emits a stream of the UTF-8 characters in the input:
<syntaxhighlight lang="jq">def readc:
inputs + "\n" | explode[] | [.] | implode;</syntaxhighlight>
 
Example:
<syntaxhighlight lang="sh">
echo '过活' | jq -Rn 'include "readc"; readc'
"过"
"活"
"\n"</syntaxhighlight>
 
=={{header|Julia}}==
Line 402:
The built-in <code>read(stream, Char)</code> function reads a single UTF8-encoded character from a given stream.
 
<syntaxhighlight lang="julia">open("myfilename") do f
while !eof(f)
c = read(f, Char)
println(c)
end
end</syntaxhighlight>
 
=={{header|Kotlin}}==
<syntaxhighlight lang="scala">// version 1.1.2
 
import java.io.File
Line 425:
}
}
}</syntaxhighlight>
 
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
-- Return whether the given string is a single ASCII character.
function is_ascii (str)
Line 487:
end
end
</syntaxhighlight>
{{out}}
𝄞 A ö Ж € 𝄞 Ε λ λ η ν ι κ ά y ä ® € 成 长 汉
Line 494:
Since revision 27 of version 9.3 of the M2000 Environment, the Chinese letter 长 is displayed in the console (as it is displayed in the editor).
 
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
Line 548:
}
checkit
</syntaxhighlight>
 
using document as final$
 
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
Line 612:
checkit
 
</syntaxhighlight>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">str = OpenRead["file.txt"];
ToString[Read[str, "Character"], CharacterEncoding -> "UTF-8"]</syntaxhighlight>
 
=={{header|NetRexx}}==
Line 626:
 
:The file <tt>data/utf8-001.txt</tt> is a UTF-8 encoded text file containing the following:&nbsp;&#x79;&#xE4;&#xAE;&#x20AC;&#x1D11E;&#x1D122;&#x31;&#x32;.
<syntaxhighlight lang="netrexx">/* NetRexx */
options replace format comments java crossref symbols nobinary
numeric digits 20
Line 739:
say
return
</syntaxhighlight>
{{out}}
<pre>
Line 763:
As the file would in fact be read line by line, even though the characters are yielded one by one, that may be considered cheating. So we provide a function and an iterator which read the bytes one by one.
 
<syntaxhighlight lang="nim">import unicode
 
proc readUtf8(f: File): string =
Line 779:
res.add f.readChar()
if res.validateUtf8() == -1: break
yield res</syntaxhighlight>
 
=={{header|Pascal}}==
<syntaxhighlight lang="pascal">(* Read a file char by char *)
program ReadFileByChar;
var
Line 800:
Close(OutputFile)
end.
</syntaxhighlight>
=={{header|Perl}}==
<syntaxhighlight lang="perl">binmode STDOUT, ':utf8'; # so we can print wide chars without warning
 
open my $fh, "<:encoding(UTF-8)", "input.txt" or die "$!\n";
Line 810:
}
 
close $fh;</syntaxhighlight>
 
If the contents of the ''input.txt'' file are <code>aă€⼥</code> then the output would be:
Line 827:
could easily add this to that file permanently, and document/autoinclude it properly.
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">INVALID_UTF8</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">#FFFD</span>
Line 902:
<span style="color: #008080;">return</span> <span style="color: #000000;">res</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
 
Test code:
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #000080;font-style:italic;">--string utf8 = "aă€⼥" -- (same results as next)</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">utf8</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">utf32_to_utf8</span><span style="color: #0000FF;">({</span><span style="color: #000000;">#0061</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#0103</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#20ac</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#2f25</span><span style="color: #0000FF;">})</span>
Line 933:
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
 
{{out}}
Line 947:
=={{header|PicoLisp}}==
PicoLisp uses UTF-8 unless told otherwise.
<syntaxhighlight lang="picolisp">
(in "wordlist"
   (while (char)
      (process @) ) )
</syntaxhighlight>
 
=={{header|Python}}==
{{works with|Python|2.7}}
<syntaxhighlight lang="python">
def get_next_character(f):
# note: assumes valid utf-8
Line 976:
for c in get_next_character(f):
print(c)
</syntaxhighlight>
 
{{works with|Python|3}}
Python 3 simplifies the handling of text files since you can specify an encoding.
<syntaxhighlight lang="python">def get_next_character(f):
"""Reads one character from the given textfile"""
c = f.read(1)
Line 990:
with open("input.txt", encoding="utf-8") as f:
for c in get_next_character(f):
print(c, sep="", end="")</syntaxhighlight>
 
=={{header|QBasic}}==
<syntaxhighlight lang="qbasic">f = FREEFILE
filename$ = "file.txt"
 
Line 1,002:
PRINT char$;
WEND
CLOSE #f</syntaxhighlight>
 
=={{header|Racket}}==
Don't we all love self-reference?
<syntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
Output:
<syntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
 
=={{header|Raku}}==
Line 1,026:
 
To read a single character at a time from the Standard Input terminal ($*IN in Raku):
<syntaxhighlight lang="raku" line>.say while defined $_ = $*IN.getc;</syntaxhighlight>

Or, from a file:
<syntaxhighlight lang="raku" line>my $filename = 'whatever';

my $in = open( $filename, :r ) orelse .die;

print $_ while defined $_ = $in.getc;</syntaxhighlight>
 
=={{header|REXX}}==
Line 1,040:
<br>The task's requirement stated that '''EOF''' was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
<br>Note that the display of characters that may modify screen behavior, such as tabs, backspaces, line feeds, carriage returns, "bells" and others, is suppressed; their hexadecimal equivalents are displayed instead.
<syntaxhighlight lang="rexx">/*REXX program reads and displays a file char by char, returning 'EOF' when done. */
parse arg iFID . /*iFID: is the fileID to be read. */
/* [↓] show the file's contents. */
Line 1,050:
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
getchar: procedure; parse arg z; if chars(z)==0 then return 'EOF'; return charin(z)</syntaxhighlight>
'''input''' &nbsp; file: &nbsp; '''ABC'''
<br>and was created by the DOS command (under Windows/XP): &nbsp; &nbsp; '''echo 123 [¬ a prime]> ABC'''
Line 1,080:
 
===version 2===
<syntaxhighlight lang="rexx">/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
Line 1,119:
Return c
 
c2b: Return x2b(c2x(arg(1)))</syntaxhighlight>
output:
<pre>y 79
Line 1,129:
 
=={{header|Ring}}==
<syntaxhighlight lang="ring">
fp = fopen("C:\Ring\ReadMe.txt","r")
r = fgetc(fp)
Line 1,137:
end
fclose(fp)
</syntaxhighlight>
Output:
<pre>
Line 1,165:
{{works with|Ruby|1.9}}
 
<syntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
f.each_char{|c| p c}
end</syntaxhighlight>
 
or
 
<syntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
while c = f.getc
p c
end
end</syntaxhighlight>
 
=={{header|Run BASIC}}==
<syntaxhighlight lang="runbasic">open "file.txt" for binary as #f
numChars = 1 ' specify number of characters to read
a$ = input$(#f,numChars) ' read number of characters specified
b$ = input$(#f,1) ' read one character
close #f</syntaxhighlight>
 
=={{header|Rust}}==
Line 1,193:
originally.
 
<syntaxhighlight lang="rust">use std::{
convert::TryFrom,
fmt::{Debug, Display, Formatter},
Line 1,303:
 
Ok(())
}</syntaxhighlight>
 
 
Line 1,314:
the file [http://seed7.sourceforge.net/libraries/utf8.htm#STD_UTF8_OUT STD_UTF8_OUT] is used.
 
<syntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "utf8.s7i";
 
Line 1,331:
close(inFile);
end if;
end func;</syntaxhighlight>
 
{{out}}
Line 1,343:
 
=={{header|Sidef}}==
<syntaxhighlight lang="ruby">var file = File('input.txt') # the input file contains: "aă€⼥"
var fh = file.open_r # equivalent with: file.open('<:utf8')
fh.each_char { |char|
printf("got character #{char} [U+%04x]\n", char.ord)
}</syntaxhighlight>
{{out}}
<pre>
Line 1,358:
=={{header|Smalltalk}}==
{{works with|Smalltalk/X}}
<syntaxhighlight lang="smalltalk">|utfStream|
utfStream := 'input' asFilename readStream asUTF8EncodedStream.
[utfStream atEnd] whileFalse:[
Transcript showCR:'got char ',utfStream next.
].
utfStream close.</syntaxhighlight>
 
=={{header|Tcl}}==
To read a single character from a file, use:
<syntaxhighlight lang="tcl">set ch [read $channel 1]</syntaxhighlight>
This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the <tt>lseek</tt> OS call); the extent of buffering can be controlled via:
<syntaxhighlight lang="tcl">fconfigure $channel -buffersize $byteCount</syntaxhighlight>
When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.
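
For instance, a channel can be configured for UTF-8 before reading it character by character; a minimal sketch (the file name is illustrative):
<syntaxhighlight lang="tcl"># set the channel's encoding, then read one character at a time
set channel [open "input.txt" r]
fconfigure $channel -encoding utf-8
while {![eof $channel]} {
    set ch [read $channel 1]
    puts -nonewline $ch
}
close $channel</syntaxhighlight>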
 
=={{header|Wren}}==
<syntaxhighlight lang="ecmascript">import "io" for File
 
File.open("input.txt") { |file|
Line 1,388:
offset = offset + 1
}
}</syntaxhighlight>
 
=={{header|zkl}}==
zkl doesn't know much about UTF-8 or Unicode but is able to test whether a string or number is valid UTF-8 or not. This code uses that to build a state machine to decode a byte stream into UTF-8 characters.
<syntaxhighlight lang="zkl">fcn readUTF8c(chr,s=""){ // transform UTF-8 character stream
s+=chr;
try{ s.len(8); return(s) }
catch{ if(s.len()>6) throw(__exception) } // 6 bytes max for UTF-8
return(Void.Again,s); // call me again with s & another character
}</syntaxhighlight>
Used to modify a zkl iterator, it can consume any stream-able (files, strings, lists, etc) and provides support for foreach, map, look ahead, push back, etc.
<syntaxhighlight lang="zkl">fcn utf8Walker(obj){
obj.walker(3) // read characters
.tweak(readUTF8c)
}</syntaxhighlight>
<syntaxhighlight lang="zkl">s:="-->\u20AC123"; // --> e2,82,ac,31,32,33 == -->€123
utf8Walker(s).walk().println();
 
Line 1,409:
foreach c in (utf8Walker(Data(Void,s,"\n"))){ print(c) }
 
utf8Walker(Data(Void,0xe2,0x82,"123456")).walk().println(); // € is short 1 byte</syntaxhighlight>
{{out}}
<pre>
Line 1,418:
</pre>
If you wish to push a UTF-8 stream through one or more functions, you can use the same state machine:
<syntaxhighlight lang="zkl">stream:=Data(Void,s,"\n").howza(3); // character stream
stream.pump(List,readUTF8c,"print")</syntaxhighlight>
{{out}}<pre>-->€123</pre>
and returns a list of the eight UTF-8 characters (with newline).
Or, if file "foo.txt" contains the characters:
<syntaxhighlight lang="zkl">File("foo.txt","rb").howza(3).pump(List,readUTF8c,"print");</syntaxhighlight>
produces the same result.
 