csv.scm - Comma-Separated Value (CSV) Utilities in Scheme

Version 0.3, 21 July 2004, `http://www.neilvandyke.org/csv-scm/' by
Neil W. Van Dyke <neil@neilvandyke.org>

     Copyright (C) 2004 Neil W. Van Dyke.  This program is Free
     Software; you can redistribute it and/or modify it under the terms
     of the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 2.1 of the License, or (at
     your option) any later version.  This program is distributed in
     the hope that it will be useful, but without any warranty; without
     even the implied warranty of merchantability or fitness for a
     particular purpose.  See the GNU Lesser General Public License
     [LGPL] for details.


The `csv.scm' Scheme library provides utilities for reading various
kinds of what are commonly known as "comma-separated value" (CSV) files.
Since there is no standard CSV format [Creativyst], this library permits
CSV readers to be constructed from a specification of the peculiarities
of a given variant.  A default reader handles the majority of formats.

   One of the main uses of this library is to import legacy data from
old crusty applications into Scheme for data conversion and other
processing.  To that end, this library includes various conveniences
for iterating over parsed CSV rows, and for converting CSV input to
Scheme XML format [SXML].

   This library requires R5RS, [SRFI-6], [SRFI-23], and an
`integer->char' procedure that accepts ASCII values.

   This version 0.1 release is for real-world testing purposes.  Please
report your experiences to the author.

   Other implementations of some kind of CSV reading for Scheme include
Gauche's `text.csv' module [Gauche-CSV], and the Scsh-specific
`record-reader' and related procedures [Scsh].  This library intends to
be portable and more comprehensive.  CSV reading libraries also exist
for popular languages like Java [Ostermiller], Perl [Rath] [Widemann],
and Python [Cole].

Reader Specs

CSV readers are constructed using "reader specs", which are sets of
attribute-value pairs, represented in Scheme as association lists keyed
on symbols.  Each attribute has a default value if not specified
otherwise.  The attributes are:

     Symbol representing the newline, or record-terminator, convention.
     The convention can be a fixed character sequence (`lf', `crlf', or
     `cr', corresponding to combinations of line-feed and
     carriage-return), any string of one or more line-feed and
     carriage-return characters (`lax'), or adaptive (`adapt').
     `adapt' attempts to detect the newline convention at the start of
     the input and assume that convention for the remainder of the
     input.  Default: `lax'

     Non-null list of characters that serve as field separators.
     Normally, this will be a list of one character.  Default: `(#\,)'
     (list of the comma character)

     Character that should be treated as the quoted field delimiter
     character, or `#f' if fields cannot be quoted.  Note that there
     can be only one quote character.  Default: `#\"' (double-quote)

     Boolean for whether or not a sequence of two `quote-char' quote
     characters within a quoted field constitute an escape sequence for
     including a single `quote-char' within the string.  Default: `#t'

     List of characters, possibly null, which comment out the entire
     line of input when they appear as the first character in a line.
     Default: `()' (null list)

     List of characters, possibly null, that are considered "whitespace"
     constituents for purposes of the `strip-leading-whitespace?' and
     `strip-trailing-whitespace?' attributes described below.  Default:
     `(#\space)' (list of the space character)

     Boolean for whether or not leading whitespace in fields should be
     stripped.  Note that whitespace within a quoted field is never
     stripped.  Default: `#f'

     Boolean for whether or not trailing whitespace in fields should be
     stripped.  Note that whitespace within a quoted field is never
     stripped.  Default: `#f'

     Boolean for whether or not newline sequences are permitted within
     quoted fields.  If true, then the newline characters are included
     as part of the field value; if false, then the newline sequence is
     treated as a premature record termination.  Default: `#t'

Making Reader Makers

CSV readers are procedures that are constructed dynamically to close
over a particular CSV input and yield a parsed row value each time the
procedure is applied.  For efficiency reasons, the reader procedures
are themselves constructed by another procedure,
`make-csv-reader-maker', for particular CSV reader specs.

 -- Procedure: make-csv-reader-maker reader-spec
     Constructs a CSV reader constructor procedure from the READER-SPEC,
     with unspecified attributes having their default values.

     For example, given the input file `fruits.csv' with the content:

          apples  |  2 |  0.42
          bananas | 20 | 13.69

     a reader for the file's apparent format can be constructed like:

          (define make-food-csv-reader
             '((separator-chars            . (#\|))
               (strip-leading-whitespace?  . #t)
               (strip-trailing-whitespace? . #t))))

     The resulting `make-food-csv-reader' procedure accepts one
     argument, which is either an input port from which to read, or a
     string from which to read.  Our example input file then can be be
     read by opening an input port on a file and using our new
     procedure to construct a reader on it:

          (define next-row
            (make-food-csv-reader (open-input-file "fruits.csv")))

     This reader, `next-row', can then be called repeatedly to yield a
     parsed representation of each subsequent row.  The parsed format
     is a list of strings, one string for each column.  The null list
     is yielded to indicate that all rows have already been yielded.

          (next-row) => ("apples" "2" "0.42")
          (next-row) => ("bananas" "20" "13.69")
          (next-row) => ()

Making Readers

In addition to being constructed from the result of
`make-csv-reader-maker', CSV readers can also be constructed using

 -- Procedure: make-csv-reader in [reader-spec]
     Construct a CSV reader on the input IN, which is an input port or a
     string.  If READER-SPEC is given, and is not the null list, then a
     "one-shot" reader constructor is constructed with that spec and
     used.  If READER-SPEC is not given, or is the null list, then the
     default CSV reader constructor is used.  For example, the reader
     from the `make-csv-reader-maker' example could alternatively have
     been constructed like:

          (define next-row
             (open-input-file "fruits.csv")
             '((separator-chars            . (#\|))
               (strip-leading-whitespace?  . #t)
               (strip-trailing-whitespace? . #t)))))

Basic Input Conveniences

Several convenience procedures are provided for iterating over the CSV
rows and for converting the CSV into a list.  To the dismay of some
Scheme purists, each of these procedures accepts a READER-OR-IN
argument, which can be a CSV reader, an input port, or a string.  If
not a CSV reader, then the default reader constructor is used.  For
example, all three of the following are equivalent:

     (csv->list                                     STRING  )
     (csv->list (make-csv-reader                    STRING ))
     (csv->list (make-csv-reader (open-input-string STRING)))

 -- Procedure: csv-for-each proc reader-or-in
     Similar to Scheme's `for-each', applies PROC, a procedure of one
     argument, to each parsed CSV row in series.  READER-OR-IN is the
     CSV reader, input port, or string.  The return

 -- Procedure: csv-map proc reader-or-in
     Similar to Scheme's `map', applies PROC, a procedure of one
     argument, to each parsed CSV row in series, and yields a list of
     the values of each application of PROC, in order.  READER-OR-IN is
     the CSV reader, input port, or string.

 -- Procedure: csv->list reader-or-in
     Yields a list of CSV row lists from input READER-OR-IN, which is a
     CSv reader, input port, or string.

Converting CSV to SXML

The `csv->sxml' procedure can be used to convert CSV to [SXML] format,
for processing with various XML tools.

 -- Procedure: csv->sxml reader-or-in [row-element [col-elements]]
     Reads CSV from input READER-OR-IN (which is a CSV reader, input
     port, or string), and yields an SXML representation.  If given,
     ROW-ELEMENT is a symbol for the XML row element.  If ROW-ELEMENT
     is not given, the default is the symbol `row'.  If given
     COL-ELEMENTS is a list of symbols for the XML column elements.  If
     not given, or there are more columns in a row than given symbols,
     column element symbols are of the format `col-N', where N is the
     column number (the first column being number 0, not 1).

     For example, given a CSV-format file `friends.csv' that has the

          Binoche,Ste. Brune,33-1-2-3
          Posey,Main St.,555-5309
          Ryder,Cellblock 9,

     with elements not given, the result is:

          (csv->sxml (open-input-file "friends.csv"))
           (row (col-0 "Binoche") (col-1 "Ste. Brune")  (col-2 "33-1-2-3"))
           (row (col-0 "Posey")   (col-1 "Main St.")    (col-2 "555-5309"))
           (row (col-0 "Ryder")   (col-1 "Cellblock 9") (col-2 "")))

     With elements given, the result is like:

          (csv->sxml (open-input-file "friends.csv")
                     '(name address phone))
          (*TOP* (friend (name    "Binoche")
                         (address "Ste. Brune")
                         (phone   "33-1-2-3"))
                 (friend (name    "Posey")
                         (address "Main St.")
                         (phone   "555-5309"))
                 (friend (name    "Ryder")
                         (address "Cellblock 9")
                         (phone   "")))


The `csv.scm' source code file contains a regression test suite for the
library itself, in procedure `csv-internal:test'.  This test suite is
disabled by default, but can be enabled by editing the source code.


Version 0.3 -- 21 July 2004
     Minor documentation changes.  Test suite now disabled by default.

Version 0.2 -- 1 June 2004
     Fixed strange `case'-related bug exhibited under Gauche 0.8 and in `csv-internal:make-portreader/positional'.  Thanks to
     Grzegorz Chrupa/la for reporting.

Version 0.1 -- 31 May 2004
     First release, for testing with real-world input.


     Dave Cole, "CSV module for Python," Web page, viewed 26 May 2004.

     Creativyst, "The Comma Separated Value (CSV) File Format: Create
     or parse data in this popular pseudo-standard format," Web page,
     viewed 26 May 2004.

     Gauche `text.csv' module, Gauche version 0.8.

     Free Software Foundation, "GNU Lesser General Public License,"
     Version 2.1, February 1999, 59 Temple Place, Suite 330, Boston, MA
     02111-1307 USA.

     Stephen Ostermiller "com.Ostermiller.util.CSVParser," Web page,
     viewed 26 May 2004.

     Christopher Rath and Mark Mielke, "CSV.pm," version 2.0, Web page,
     viewed 26 May 2004.

     Olin Shivers, Brian D. Carlstrom, Martin Gasbichler, and Mike
     Sperber, "Scsh Reference Manual," Scsh release 0.6.6.

     William D. Clinger, "Basic String Ports," SRFI 6, 1 July 1999.

     Stephan Houben, "Error reporting mechanism," SRFI 23, 26 April

     Oleg Kiselyov, "SXML," revision 3.0.

     Jochen Wiedmann, "Text::CSV_XS" Perl 5.6.1 module.