Convert CSV records to TSV: Difference between revisions

Content added Content deleted

Inline

Revision as of 18:00, 13 November 2022

The present task is focused on the conversion of strings of comma-separated values (CSV) to strings of tab-separated values (TSV) subject to the specific set of rules given below. These generally correspond to conventions regarding CSV and TSV files, but this task is primarily concerned with records considered separately. When considering a text file, a "record" should be seen as a line in the file as determined by the NEWLINE conventions of the operating system.

In accordance with the robustness principle of processing, if a string or input line begins with a valid (possibly empty) sequence of comma-separated values but is followed by nonsense, then processing should continue based on the identified values, with the remainder of the line or string being ignored.

A CSV record

Our starting point will be a character set that includes ASCII; the language of regular expressions (which will be denoted by strings of the form r'REGEX'); and the following PEG (parsing expression grammar) grammar for a single CSV record:

  record = field (',' field)*
  field  = (ws* quoted_field ws*) / unquoted_field
  quoted_field = '"' quoted_field_content '"'
  ws     = ' '

  unquoted_field := '[^,"]*'

  quoted_field_content := '(("")|([^"]))*'

Since it has already been stipulated that partitioning of a file into lines is determined by the NEWLINE convention of the computing platform, the NEWLINE cannot appear in either `quoted_field` or `quoted_field_content` so there is no need to express that as a restriction in the PEG.

Robustness

As mentioned above, if an input line or string begins with a valid sequence of comma-separated values but is followed by nonsense, then processing should continue using the valid values.

Files, NEWLINE, and end-of-file

When converting a file of text on a particular platform, the converter should adhere to the conventions regarding lines, NEWLINES and end-of-file for that platform as much as possible, but if the last line in the file is not followed by a NEWLINE, then that line should be processed as though it were followed by a new line.

csv2tsv

When converting a CSV record to a TSV record, only the sequence of `unquoted_field` and `quoted_field_content` values is relevant. Before emitting these values as tab-separated values, the following transformations must be performed:

each literal tab (TAB) must be replaced by the two-character string "\t";
each embedded LF (LINEFEED) must be replaced by the two-character string "\n";
each embedded CR (CARRIAGERETURN) must be replaced by the two-character string "\r";
each backslash that occurs in a two-character substring of the form r'\\[nrt]' must be duplicated;
each literal NUL character (#x0) should be replaced by the two-character string "\0".

Whether or not backslashes in general must be duplicated is optional.

EXAMPLES

In these examples, a line of the form:

'input' => 'tsv' # comment

signifies that the input as given on the left should be transformed into the 'tsv' string given on the right, in both cases minus the surrounding quotes. On either side, <tab> signifies the tab character.

'a,"b"'      => 'a<tab>b'   # <tab> is the tab character
'"a","b""c"' => 'a<tab>b"c' # double-quotes are normalized
''           => ''          # an empty line is translated to an empty line
',a'         => '<tab>a'    # an empty field is translated to an empty field
'a,"'        => 'a<tab>'    # the trailing nonsense is ignored, but the comma is not
' a , "b" '  => ' a <tab>b' # whitespace around quoted fields is ignored
'"12",34'    => '12<tab>34' # numeric strings are not treated specially
'a<tab>b'    => 'a\tb'      # literal tabs in the CSV must be converted to "\t"
'a\tb'       => 'a\\tb'     # ... and so an escaped t must be distinguished
'a\n\rb'     => 'a\\n\\rb'  # literal backslashes are also duplicated for n, r
'a�b'        => 'a\0b'      # NUL is translated to the two-character string "\0"
'a<RETURN>b' => 'a\rb'      # <RETURN> is the control character

TASKS

Display a program, procedure or function, preferably named csv2tsv, which, when given a text file of lines or strings beginning with a CSV record as defined above, produces a corresponding stream of TSV records in accordance with all the above requirements.

When run on different platforms, the program should transform a given sequence of records, when presented in accordance with the NEWLINE conventions of each computing platform, in the same way, adjusting for the NEWLINE conventions of the platform.

Use the following as a test file, noting any discrepancies with the requirements, and specifically stating how strings containing backslashes (besides those preceding r'[nrt]) are transformed.

TEST FILE

a,"b"
"a","b""c"

,a
a,"
 a , "b"
"12",34
a	b, That is a TAB character
a\tb
a\n\rb
a�b, That is a NUL character
a
b, That is a LF (linefeed) character
a\b

jq

Works with: jq

The following program can also be used with gojq, the Go implementation of jq, but NUL (#x0) is left unaltered.

In this entry, the PEG grammar for "record" as defined in the task description is used directly, the point being that PEG operators correspond closely to jq operators, notably PEG's '/' to jq's '//'.

In translating the PEG grammar to a jq program, the main idea is define a pipeline for each grammar rule. A JSON object with keys "remainder" and "result" is passed through this pipeline, consuming the text in .remainder and building up .result.

## The PEG * operator:
def star(E): (E | star(E)) // . ;

## Helper functions:

# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def ws: consume(" *");

### Parse a string into comma-separated values

def quoted_field_content:
  parse("((\"\")|([^\"]))*")
  | .result[-1] |= gsub("\"\""; "\"");

def unquoted_field: parse("[^,\"]*");

def quoted_field: consume("\"") | quoted_field_content | consume("\"");

def field: (ws | quoted_field | ws) // unquoted_field;

def record: field | star(consume(",") | field);

def parse: {remainder: .} | record | .result;

def csv2tsv:
  parse
  | @tsv ;

# Transform an entire file:
inputs | csv2tsv

Output:

Backquotes are uniformly duplicated.
gojq does not, and currently cannot, handle NUL (#x0) properly.

Wren

Library: Wren-ioutil

Library: Wren-str

Backslashes are only duplicated for escaped \t, \n and \r.

import "./ioutil" for FileUtil
import "./str" for Str

// These could be given as command line arguments
// but we hard-code them for the purposes of this task.
var fileName1 = "test.csv"
var fileName2 = "test.tsv"

// This next line assumes a line break of "\r\n" for Windows or "\n" otherwise.
var lines = FileUtil.readLines(fileName1)

// Get rid of an empty last line if there is one.
if (lines.count > 1 && lines[-1] == "") lines = lines[0...-1]

var lc = lines.count

// Normalize fields before rejoining with \t.
for (i in 0...lc) {
    // Str.splitCSv treats quoted fields by default as unquoted if there's any
    // leading whitespace but we can't do that here.
    var fields = Str.splitCsv(lines[i], ",", false)
    for (i in 0...fields.count) {
        var numQuotes = fields[i].count { |c| c == "\"" }
        // Treat it as a quoted field for this task if there's at least two quotes
        // and then remove any surrounding whitespace and the outer quotes.
        if (numQuotes > 1) fields[i] = fields[i].trim().trim("\"")

        fields[i] = fields[i].replace("\"\"", "\"")
                             .replace("\\t", "\\\\t")
                             .replace("\\r", "\\\\r") 
                             .replace("\\n", "\\\\n")
                             .replace("\t", "\\t")
                             .replace("\n", "\\n")
                             .replace("\r", "\\r")                             
                             .replace("\0", "\\0")                              
    }
    // Not sure how 'nonsense' is defined but for now blank
    // the final field if it contains nothing but quotes.
    if (fields[-1].count > 0 && fields[-1].all { |c| c == "\"" }) fields[-1] = ""
    lines[i] = fields.join("\t")
}

// Write lines (as amended) to fileName2.
FileUtil.writeLines(fileName2, lines)

// Write contents of fileName2 to the terminal with tabs replaced by <tab> and
// with each line surrounded with single quotes.
lines = FileUtil.readLines(fileName2)[0...-1]
                .map { |line| "'%(line)'".replace("\t", "<tab>") }
                .join("\n")
System.print(lines)

Output:

'a<tab>b'
'a<tab>b"c'
''
'<tab>a'
'a<tab>'
' a <tab>b'
'12<tab>34'
'a\tb<tab> That is a TAB character'
'a\\tb'
'a\\n\\rb'
'a\0b<tab> That is a NUL character'
'a\rb<tab> That is a LF (linefeed) character'
'a\b'

@@ Line 52: / Line 52: @@
 # each embedded LF (LINEFEED) must be replaced by the two-character string "\n";
 # each embedded CR (CARRIAGERETURN) must be replaced by the two-character string "\r";
-# each backslash that occurs in a two-character substring of the form '\\[nrt]' must be duplicated;
+# each backslash that occurs in a two-character substring of the form r'\\[nrt]' must be duplicated;
 # each literal NUL character (#x0) should be replaced by the two-character string "\0".
@@ Line 92: / Line 92: @@
 Use the following as a test file, noting any discrepancies with the
 requirements, and specifically stating how strings containing backslashes
-(besides those preceding the character 'n', 't' or 'r') are transformed.
+(besides those preceding r'[nrt]) are transformed.
 ; TEST FILE