introduction
A very common file format for transferring data is the Comma (or TAB) Separated Value file. These formats appear simple, but have several complications that make implementing a CSV/TSV parser slightly more challenging than just breaking a row of text by its separator.
It’s probably best to start with some definitions before diving into any details. In this file format, which is a text file where each line of text is like a row in a spreadsheet, and each of those rows is made up of columns.
The generally accepted practice is to call the rows Records, and the columns Fields. The first line is typically included, but not required, and contains the Field names.
basic parsing
Fields are separated by their separator character, which is a comma for a CSV and a TAB for a TSV:
id,animal,sound 1,cat,Meow 2,dog,Bark 3,cow,Moo
EMBEDDED COMMAS
This works perfectly fine for a TSV, since the value in a Field probably wouldn’t contain a TAB character. But what about a CSV? It seems reasonable that the value in a Field could have one or more commas.
id,animal,sound,advantage,size 1,cat,Meow,warm, purring, fuzzy,small 2,dog,Bark,loyal, funny, protective,medium 3,cow,Moo,tippable, clears grass,large
There’s no reasonable way to parse the Fields in this file (note that Field names in the first row, if included, are not required to match the Record Fields). The solution is to require double quotes around a Field that contains a comma:
id,animal,sound,advantage,size 1,cat,Meow,"warm, purring, fuzzy",small 2,dog,Bark,"loyal, funny, protective",medium 3,cow,Moo,"tippable, clears grass",large
Now we can parse this file, since the double quotes indicate the beginning and end of a Field. If we see a starting double quote, we can just keep reading until we see a closing double quote.
embedded double quotes
But what about double quotes themselves? It’s certainly common to see double quotes in different situations in a Field. Text in the field could be quoted, or a double quote could be used to mean inches. If we used the above logic of using an opening and closing double quote to know when a Field begins and ends, something like this will cause trouble:
id,name,size 1,shoe,12 2,round clock,8"x8" 3,used laptop,just "like new!"
The answer is to “escape” the double quotes in some way, and the decision was to use double double quotes. However, this causes its own problem, if we’re using double quotes to determine if commas are part of a Field, or if they are the separator between Fields, how can we tell? The answer, again, is to require double quotes around a Field that contains a double quote (or, more accurately now, an escaped double quote):
id,name,size 1,shoe,12 2,round clock,"8""x8""" 3,used laptop,"just ""like new!"""
This use of escaping double quotes, and requiring double quotes around a Field that contains double quotes is common to both CSV and TSV.
embedded newlines or carriage returns
The final major thing to consider when parsing a CSV or TSV is that a Field could contain a newline or carriage return character:
id,name,description 1,shoe,these are fine shoes 2,round clock,features: 2 hands round, 8" maple wood 3,used laptop,technical specs: 2048 Neurobit 1098X3 CPU 12 Parsec pseudo-drive 15" screen
Since a newline (or carriage return) should indicate the end of the current Field and also the completion of the current Record, how do we deal with them as part of a Field? I’m sure you’ve guessed that double quotes are involved, and they are:
id,name,description 1,shoe,these are fine shoes 2,round clock,"features: 2 hands round, 8"" maple wood" 3,used laptop,"technical specs: 2048 Neurobit 1098X3 CPU 12 Parsec pseudo-drive 15"" screen"
Notice that there are also escaped double quotes and commas in these Fields as well. The opening and closing double quotes allow for those in the Field.
other consideration
In general, this covers the major issues faced when parsing a CSV or TSV file. Note that a TSV can also allow for some special escaped characters:
character | escaped value |
newline (\n character) | \n as two characters |
carriage return (\r character) | \r as two characters |
tab (\t character) | \t as two characters |
backslash | \\ as two characters |
*It?s hard to find knowledgeable people on this topic, but you sound like you know what you?re talking about! Thanks