Read a file character by character/UTF8

*   [[Read a file line by line]]
<br><br>
 
=={{header|Action!}}==
<syntaxhighlight lang="action!">byte X
Proc Main()
Open (1,"D:FILENAME.TXT",4,0)
Do
X=GetD(1)
Put(X)
Until EOF(1)
Od
Close(1)
Return</syntaxhighlight>
 
=={{header|AutoHotkey}}==
{{works with|AutoHotkey 1.1}}
<syntaxhighlight lang="autohotkey">File := FileOpen("input.txt", "r")
while !File.AtEOF
MsgBox, % File.Read(1)</syntaxhighlight>
 
 
=={{header|BASIC256}}==
<syntaxhighlight lang="basic256">f = freefile
filename$ = "file.txt"
 
open f, filename$
 
while not eof(f)
print chr(readbyte(f));
end while
close f
end</syntaxhighlight>
 
=={{header|C}}==
<syntaxhighlight lang="c">#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
/* ... */
 
return EXIT_SUCCESS;
}</syntaxhighlight>
 
=={{header|C++}}==
<syntaxhighlight lang="cpp">
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <locale>

using namespace std;

int main()
{
    /* If your native locale doesn't use UTF-8 encoding
     * you need to replace the empty string with a
     * locale like "en_US.utf8"
     */
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    wifstream in("input.txt");
    in.imbue(std::locale());   // decode the UTF-8 bytes into wide characters

    wchar_t c;
    while (in.get(c))          // stops at end of file (or on a decoding error)
        wcout << c;

    in.close();
    return EXIT_SUCCESS;
}
</syntaxhighlight>
 
=={{header|C sharp|C#}}==
<langsyntaxhighlight lang="csharp">using System;
using System.IO;
using System.Text;
// ...
}
}
}</syntaxhighlight>
 
=={{header|Common Lisp}}==
{{works with|CLISP}}{{works with|Clozure CL}}{{works with|CMUCL}}{{works with|ECL (Lisp)}}{{works with|SBCL}}{{works with|ABCL}}
 
<langsyntaxhighlight lang="lisp">;; CLISP puts the external formats into a separate package
#+clisp (import 'charset:utf-8 'keyword)
 
;; ...
(loop for c = (read-char s nil)
while c
do (format t "~a" c)))</syntaxhighlight>
 
=={{header|Crystal}}==
The encoding is UTF-8 by default, but it can be explicitly specified.
 
<langsyntaxhighlight lang="ruby">File.open("input.txt") do |file|
file.each_char { |c| p c }
end</syntaxhighlight>
 
or
 
<langsyntaxhighlight lang="ruby">File.open("input.txt") do |file|
while c = file.read_char
p c
end
end</syntaxhighlight>
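
If a different encoding were needed, it could be passed explicitly. A minimal sketch (the encoding name and file are assumptions):

<syntaxhighlight lang="ruby">File.open("input.txt", encoding: "UTF-8") do |file|
  # each_char decodes using the encoding given to File.open
  file.each_char { |c| p c }
end</syntaxhighlight>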
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| System.Classes}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
<lang Delphi>
program Read_a_file_character_by_character_UTF8;
 
// ...
end;
readln;
end.</syntaxhighlight>
 
=={{header|Déjà Vu}}==
 
<langsyntaxhighlight lang="dejavu">#helper function that deals with non-ASCII code points
local (read-utf8-char) file tmp:
!read-byte file
# ...
!close file
return
!.</syntaxhighlight>
 
=={{header|Factor}}==
<syntaxhighlight lang="text">USING: kernel io io.encodings.utf8 io.files strings ;
IN: rosetta-code.read-one
 
"input.txt" utf8 [
[ read1 dup ] [ 1string write ] while drop
] with-file-reader</syntaxhighlight>
 
 
=={{header|FreeBASIC}}==
<syntaxhighlight lang="freebasic">Dim As Long f
f = Freefile
 
Dim As String filename = "file.txt"
Dim As String*1 txt
 
Open filename For Binary As #f
While Not Eof(f)
txt = String(Lof(f), 0)
Get #f, , txt
Print txt;
Wend
Close #f
Sleep</syntaxhighlight>
 
=={{header|FunL}}==
<langsyntaxhighlight lang="funl">import io.{InputStreamReader, FileInputStream}
 
r = InputStreamReader( FileInputStream('input.txt'), 'UTF-8' )
print( chr(ch) )
r.close()</syntaxhighlight>
 
=={{header|Go}}==
<langsyntaxhighlight lang="go">package main
 
import (
// ...
fmt.Printf("%c", r)
}
}</syntaxhighlight>
 
=={{header|Haskell}}==
{{Works with|GHC|7.8.3}}
 
<langsyntaxhighlight lang="haskell">#!/usr/bin/env runhaskell
 
{- The procedure to read a UTF-8 character is just:
xs -> forM_ xs $ \name -> do
putStrLn name
withFile name ReadMode processOneFile</syntaxhighlight>
{{out}}
<pre>
...
</pre>

=={{header|J}}==
First, we know that the first 8-bit value in a utf-8 sequence tells us the length of the sequence needed to represent that character. Specifically: we can convert that value to binary, and count the number of leading 1s to find the length of the character (except the length is always at least 1 character long).
 
<syntaxhighlight lang="j">u8len=: 1 >. 0 i.~ (8#2)#:a.&i.</syntaxhighlight>
 
So now, we can use indexed file read to read a utf-8 character starting at a specific file index. What we do is read the first octet and then read as many additional characters as we need based on whatever we started with. If that's not possible, we will return EOF:
 
<syntaxhighlight lang="j">indexedread1u8=:4 :0
try.
octet0=. 1!:11 y;x,1
NB. ...
'EOF'
end.
)</syntaxhighlight>
 
The length of the result tells us what to add to the file index to find the next available file index for reading.
 
=={{header|Java}}==
The ''FileReader'' class offers a ''read'' method which returns the integer value of the next character on each call.<br />
When the end of the stream is reached, -1 is returned.<br />
You can implement this task by enclosing a ''FileReader'' within a class, and generating a new character via a method return.
<syntaxhighlight lang="java">
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
 
public class MainProgram {
private final FileReader reader;
 
    public MainProgram(String path) throws IOException {
        // the task calls for UTF-8; pass a different Charset here if required
        reader = new FileReader(path, StandardCharsets.UTF_8);
    }

/** @return integer value from 0 to 0xffff, or -1 for EOS */
public int nextCharacter() throws IOException {
return reader.read();
}
 
public void close() throws IOException {
reader.close();
}
}
</syntaxhighlight>
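
For example, the class above might be driven like this (a minimal sketch; the file name <code>input.txt</code> is an assumption):
<syntaxhighlight lang="java">
import java.io.IOException;

public class MainProgramDemo {
    public static void main(String[] args) throws IOException {
        MainProgram chars = new MainProgram("input.txt");
        int c;
        // nextCharacter() returns -1 at end of stream
        while ((c = chars.nextCharacter()) != -1) {
            System.out.print(Character.toChars(c));
        }
        chars.close();
    }
}
</syntaxhighlight>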
 
===Using Java 11===
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
 
public final class ReadFileByCharacter {
public static void main(String[] aArgs) {
Path path = Path.of("input.txt");
try ( BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8) ) {
int value;
while ( ( value = reader.read() ) != END_OF_STREAM ) {
System.out.println((char) value);
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
private static final int END_OF_STREAM = -1;
 
}
</syntaxhighlight>
{{ out }}
<pre>
R
o
s
e
t
t
a
</pre>
 
=={{header|jq}}==
jq being stream-oriented, it makes sense to define `readc` so that it emits a stream of the UTF-8 characters in the input:
<langsyntaxhighlight lang="jq">def readc:
  inputs + "\n" | explode[] | [.] | implode;</syntaxhighlight>
 
Example:
<syntaxhighlight lang="sh">
<lang sh>
echo '过活' | jq -Rn 'include "readc"; readc'
"过"
"活"
"\n"</langsyntaxhighlight>
 
=={{header|Julia}}==
The built-in <code>read(stream, Char)</code> function reads a single UTF8-encoded character from a given stream.
 
<langsyntaxhighlight lang="julia">open("myfilename") do f
while !eof(f)
c = read(f, Char)
println(c)
end
end</syntaxhighlight>
 
=={{header|Kotlin}}==
<langsyntaxhighlight lang="scala">// version 1.1.2
 
import java.io.File
// ...
}
}
}</syntaxhighlight>
 
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
<lang Lua>
-- Return whether the given string is a single ASCII character.
function is_ascii (str)
-- ...
end
end
</syntaxhighlight>
{{out}}
𝄞 A ö Ж € 𝄞 Ε λ λ η ν ι κ ά y ä ® € 成 长 汉

=={{header|M2000 Interpreter}}==
From revision 27 (version 9.3) of the M2000 Environment, the Chinese letter 长 is displayed in the console (as displayed in the editor).
 
<syntaxhighlight lang="m2000 interpreter">
<lang M2000 Interpreter>
Module checkit {
\\ prepare a file
\\ ...
}
checkit
</syntaxhighlight>
 
using document as final$
 
<syntaxhighlight lang="m2000 interpreter">
<lang M2000 Interpreter>
Module checkit {
\\ prepare a file
\\ ...
checkit
 
</syntaxhighlight>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">str = OpenRead["file.txt"];
ToString[Read[str, "Character"], CharacterEncoding -> "UTF-8"]</syntaxhighlight>
 
=={{header|NetRexx}}==
 
:The file <tt>data/utf8-001.txt</tt> is a UTF-8 encoded text file containing the following:&nbsp;&#x79;&#xE4;&#xAE;&#x20AC;&#x1D11E;&#x1D122;&#x31;&#x32;.
<syntaxhighlight lang="netrexx">/* NetRexx */
options replace format comments java crossref symbols nobinary
numeric digits 20
-- ...
say
return
</syntaxhighlight>
{{out}}
<pre>
...
CodePoint: index="009" character_count="1" id="U+00032" hex="0x000032" dec="0000050" oct="0000062" char="2" utf-16="0032" utf-8="32" name="DIGIT TWO"
</pre>
 
=={{header|Nim}}==
Like most systems languages, Nim reads bytes and provides functions to decode them into Unicode runes. The normal way to read a stream of UTF-8 characters would be to read the file line by line and decode each line, either with the “utf8” iterator, which yields the UTF-8 characters as strings (one by one), or with the “runes” iterator, which yields them as Runes (one by one).
 
Since the file would then actually be read line by line, even though the characters are yielded one by one, this might be considered cheating. So we provide a function and an iterator which read the bytes one by one (a sketch of the line-oriented variant follows the code below).
 
<syntaxhighlight lang="nim">import unicode
 
proc readUtf8(f: File): string =
## Return next UTF-8 character as a string.
while true:
result.add f.readChar()
if result.validateUtf8() == -1: break
 
iterator readUtf8(f: File): string =
## Yield successive UTF-8 characters from file "f".
var res: string
while not f.endOfFile:
res.setLen(0)
while true:
res.add f.readChar()
if res.validateUtf8() == -1: break
yield res</syntaxhighlight>
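
For comparison, the line-oriented approach mentioned above might look like this (a minimal sketch, assuming a UTF-8 encoded "input.txt"):

<syntaxhighlight lang="nim">import unicode

# Read the file line by line and let the "runes" iterator split
# each line into its UTF-8 characters.
for line in lines("input.txt"):
  for r in line.runes:
    stdout.write($r)
  stdout.write('\n')</syntaxhighlight>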
 
=={{header|Pascal}}==
<langsyntaxhighlight lang="pascal">(* Read a file char by char *)
program ReadFileByChar;
var
(* ... *)
Close(OutputFile)
end.
</syntaxhighlight>
=={{header|Perl}}==
<langsyntaxhighlight lang="perl">binmode STDOUT, ':utf8'; # so we can print wide chars without warning
 
open my $fh, "<:encoding(UTF-8)", "input.txt" or die "$!\n";
# ...
}
 
close $fh;</syntaxhighlight>
 
If the contents of the ''input.txt'' file are <code>aă€⼥</code> then the output would be:
<pre>
...
</pre>

=={{header|Phix}}==
could easily add this to that file permanently, and document/autoinclude it properly.
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">INVALID_UTF8</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">#FFFD</span>
<span style="color: #000080;font-style:italic;">-- ...</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">res</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
 
Test code:
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #000080;font-style:italic;">--string utf8 = "aă€⼥" -- (same results as next)</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">utf8</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">utf32_to_utf8</span><span style="color: #0000FF;">({</span><span style="color: #000000;">#0061</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#0103</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#20ac</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#2f25</span><span style="color: #0000FF;">})</span>
<span style="color: #000080;font-style:italic;">-- ...</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
 
{{out}}
<pre>
...
</pre>

=={{header|PicoLisp}}==
Pico Lisp uses UTF-8 until told otherwise.
<syntaxhighlight lang="picolisp">
<lang PicoLisp>
(in "wordlist"
(while (char)
(process @) ) )
</syntaxhighlight>
 
=={{header|Python}}==
{{works with|Python|2.7}}
<langsyntaxhighlight lang="python">
def get_next_character(f):
# note: assumes valid utf-8
# ...
for c in get_next_character(f):
print(c)
</syntaxhighlight>
 
{{works with|Python|3}}
Python 3 simplifies the handling of text files since you can specify an encoding.
<langsyntaxhighlight lang="python">def get_next_character(f):
"""Reads one character from the given textfile"""
c = f.read(1)
# ...
with open("input.txt", encoding="utf-8") as f:
for c in get_next_character(f):
print(c, sep="", end="")</syntaxhighlight>
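
The same loop can also be written with the two-argument form of <code>iter</code>, which calls <code>f.read(1)</code> until it returns the empty string (a minimal sketch, assuming the same <code>input.txt</code>):
<syntaxhighlight lang="python">with open("input.txt", encoding="utf-8") as f:
    # iter(callable, sentinel) stops when f.read(1) returns "" at end of file
    for c in iter(lambda: f.read(1), ""):
        print(c, sep="", end="")</syntaxhighlight>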
 
=={{header|QBasic}}==
<syntaxhighlight lang="qbasic">f = FREEFILE
filename$ = "file.txt"
 
OPEN filename$ FOR BINARY AS #f
WHILE NOT EOF(f)
char$ = SPACE$(1)  ' read one byte at a time
GET #f, , char$
PRINT char$;
WEND
CLOSE #f</syntaxhighlight>
 
=={{header|Racket}}==
Don't we all love self reference?
<langsyntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
Output:
<langsyntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
 
=={{header|Raku}}==
 
To read a single character at a time from the standard input terminal ($*IN in Raku):
<syntaxhighlight lang="raku" perl6line>.say while defined $_ = $*IN.getc;</langsyntaxhighlight>
 
Or, from a file:
<syntaxhighlight lang="raku" perl6line>my $filename = 'whatever';
 
my $in = open( $filename, :r ) orelse .die;
 
print $_ while defined $_ = $in.getc;</syntaxhighlight>
 
=={{header|REXX}}==
<br>The task's requirement stated that '''EOF''' was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
<br>Note that displaying of characters that may modify screen behavior such as tab usage, backspaces, line feeds, carriage returns, "bells" and others are suppressed, but their hexadecimal equivalents are displayed.
<langsyntaxhighlight lang="rexx">/*REXX program reads and displays a file char by char, returning 'EOF' when done. */
parse arg iFID . /*iFID: is the fileID to be read. */
/* [↓] show the file's contents. */
/* ... */
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
getchar: procedure; parse arg z; if chars(z)==0 then return 'EOF'; return charin(z)</syntaxhighlight>
'''input''' &nbsp; file: &nbsp; '''ABC'''
<br>and was created by the DOS command (under Windows/XP): &nbsp; &nbsp; '''echo 123 [¬ a prime]> ABC'''
 
===version 2===
<langsyntaxhighlight lang="rexx">/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
Return c
 
c2b: Return x2b(c2x(arg(1)))</syntaxhighlight>
output:
<pre>y 79
...
</pre>
 
=={{header|Ring}}==
<langsyntaxhighlight lang="ring">
fp = fopen("C:\Ring\ReadMe.txt","r")
r = fgetc(fp)
// ...
end
fclose(fp)
</syntaxhighlight>
Output:
<pre>
...
</pre>

=={{header|Ruby}}==
{{works with|Ruby|1.9}}
 
<langsyntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
f.each_char{|c| p c}
end</syntaxhighlight>
 
or
 
<langsyntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
while c = f.getc
p c
end
end</syntaxhighlight>
 
=={{header|Run BASIC}}==
<langsyntaxhighlight lang="runbasic">open file.txt" for binary as #f
numChars = 1 ' specify number of characters to read
a$ = input$(#f,numChars) ' read number of characters specified
b$ = input$(#f,1) ' read one character
close #f</syntaxhighlight>
 
=={{header|Rust}}==
originally.
 
<syntaxhighlight lang="rust">use std::{
convert::TryFrom,
fmt::{Debug, Display, Formatter},
// ...
 
Ok(())
}</syntaxhighlight>
 
 
=={{header|Seed7}}==
the file [http://seed7.sourceforge.net/libraries/utf8.htm#STD_UTF8_OUT STD_UTF8_OUT] is used.
 
<langsyntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "utf8.s7i";
 
# ...
close(inFile);
end if;
end func;</syntaxhighlight>
 
{{out}}
<pre>
...
</pre>
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">var file = File('input.txt') # the input file contains: "aă€⼥"
var fh = file.open_r # equivalent with: file.open('<:utf8')
fh.each_char { |char|
printf("got character #{char} [U+%04x]\n", char.ord)
}</syntaxhighlight>
{{out}}
<pre>
...
</pre>

=={{header|Smalltalk}}==
{{works with|Smalltalk/X}}
<langsyntaxhighlight lang="smalltalk">|utfStream|
utfStream := 'input' asFilename readStream asUTF8EncodedStream.
[utfStream atEnd] whileFalse:[
Transcript showCR:'got char ',utfStream next.
].
utfStream close.</syntaxhighlight>
 
=={{header|Tcl}}==
To read a single character from a file, use:
<langsyntaxhighlight lang="tcl">set ch [read $channel 1]</langsyntaxhighlight>
This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the <tt>lseek</tt> OS call); the extent of buffering can be controlled via:
<syntaxhighlight lang="tcl">fconfigure $channel -buffersize $byteCount</syntaxhighlight>
When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.
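
For example, a channel can be configured to decode UTF-8 before reading it character by character (a minimal sketch; the file name is an assumption):
<syntaxhighlight lang="tcl">set channel [open "input.txt" r]
fconfigure $channel -encoding utf-8
# read returns one decoded character at a time, and "" at end of file
while {[set ch [read $channel 1]] ne ""} {
    puts -nonewline $ch
}
close $channel</syntaxhighlight>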
 
=={{header|V (Vlang)}}==
<syntaxhighlight lang="v (vlang)">
import os
 
fn main() {
file := './file.txt'
mut content_arr := []u8{}
if os.is_file(file) == true {
content_arr << os.read_bytes(file) or {
println('Error: can not read')
exit(1)
}
}
else {
println('Error: can not find file')
exit(1)
}
 
println(content_arr.bytestr())
}
</syntaxhighlight>
 
=={{header|Wren}}==
<syntaxhighlight lang="wren">import "io" for File
 
File.open("input.txt") { |file|
// ...
offset = offset + 1
}
}</syntaxhighlight>
 
=={{header|zkl}}==
zkl doesn't know much about UTF-8 or Unicode but is able to test whether a string or number is valid UTF-8 or not. This code uses that to build a state machine to decode a byte stream into UTF-8 characters.
<langsyntaxhighlight lang="zkl">fcn readUTF8c(chr,s=""){ // transform UTF-8 character stream
s+=chr;
try{ s.len(8); return(s) }
catch{ if(s.len()>6) throw(__exception) } // 6 bytes max for UTF-8
return(Void.Again,s); // call me again with s & another character
}</syntaxhighlight>
Used to modify a zkl iterator, it can consume any stream-able (files, strings, lists, etc) and provides support for foreach, map, look ahead, push back, etc.
<langsyntaxhighlight lang="zkl">fcn utf8Walker(obj){
obj.walker(3) // read characters
.tweak(readUTF8c)
}</syntaxhighlight>
<langsyntaxhighlight lang="zkl">s:="-->\u20AC123"; // --> e2,82,ac,31,32,33 == -->€123
utf8Walker(s).walk().println();
 
// ...
foreach c in (utf8Walker(Data(Void,s,"\n"))){ print(c) }
 
utf8Walker(Data(Void,0xe2,0x82,"123456")).walk().println(); // € is short 1 byte</syntaxhighlight>
{{out}}
<pre>
...
</pre>
If you wish to push a UTF-8 stream through one or more functions, you can use the same state machine:
<langsyntaxhighlight lang="zkl">stream:=Data(Void,s,"\n").howza(3); // character stream
stream.pump(List,readUTF8c,"print")</syntaxhighlight>
{{out}}<pre>-->€123</pre>
and returns a list of the eight UTF-8 characters (with newline).
Or, if the file "foo.txt" contains the same characters:
<langsyntaxhighlight lang="zkl">File("foo.txt","rb").howza(3).pump(List,readUTF8c,"print");</langsyntaxhighlight>
produces the same result.
 