A (Revised) Report on SLAN: Scheme List-Atom Notation

Frank Mitchell

Posted: 2023-02-06
Last Modified: 2023-03-31
Word Count: 3171
Tags: java scheme software programming

Table of Contents

Introduction

SLAN is the Scheme List-Atom Notation1, because I really wanted the acronym SLAN. Like XML, YAML, JSON, TOML, and my own ELTN it’s a language for describing data structures, not performing calculations. In SLAN everything is a list of “atoms” – strings, numbers, “symbols”, and booleans – and other lists.

SLAN somewhat resembles the programming language Scheme, but even smaller: entire data types and constructions have been cut out to make parsing simpler.

History

Since the invention of LISP in 1960 and Scheme around 1975 programmers have used S-expressions – space separated words surrounded by parentheses – as a simple file format. SXML writes XML using S-expressions.

While contemplating how to serialize what’s essentially a rules engine2 as a Java .properties file, I realized I had other choices. The project I’m working on parses JSON, so to avoid infinte regress I decided on something simpler. Thus SLAN was born.

File Format

UPDATED 2023-03-01 to restrict non-ASCII characters.

The SLAN file format consists entirely of ASCII characters representing Scheme-like lists. As a simple serialization format (not a mathematical programming language) it can represent only a few basic data types: strings, “symbols” (basically strings without quotes), simple decimal numbers, booleans, and of course lists.

More lenient parsers may allow non-ASCII characters in a string constant or in comments, but not elsewhere in the file. If the parser does allow non-ASCII characters, the file may only be encoded as UTF-8.

The intent is that SLAN avoids the complexity of character encodings by staying compatible with 1960s technology.

Comments

Following Scheme R6RS, SLAN regards the following as comments:

The parser will treat comment text as whitespace.

Data Types

The following are the only data types in SLAN.

List

Lists are sequences of other data types, including nested lists. SLAN doesn’t care whether one implements lists as linked cons cells, arrays, or an arbitrary data structure that’s simply serialized as nested lists.

Symbol

In SLAN symbols and strings are two ways of writing a sequence of characters, one more restrictive than the other.

Nevertheless, parsers should indicate whether a value is a symbol or a list, in case the application attaches semantic meaning to one of the other. Applications can use a “symbol” like a C or Java enum, the name of a datatype, the name of a (remote?) procedure call, or any other meaning programmers give to names. That said, the current specification has no mechanism to verify symbols against a list of “known” names; the application must do that.

String

After Lists, Strings are the main data type in SLAN. As SLAN exists to serialize and deserialize data structures, a String could denote an arbitrary character sequence, a specific data structure like a date, or even a complicated number.

Number

Numbers in SLAN resemble those in JSON: a whole part, an optional fractional part, and an optional exponent. That’s it. Nothing like a Scheme number.

Boolean

A basic true-or false value.

Empty List

In Scheme the Empty List has special status. Scheme implements lists as linked lists of “cons” cells, and the empty list is effectively a null pointer. In SLAN we continue the tradition of Empty List being the equivalent of nil or null in other languages. It’s the only value equivalent to Boolean false; all others are true.

Syntax

Updated 2023-02-17 for the following:

Updated 2023-02-19 for the following:

Updated 2023-03-01 for the following:

Updated 2023-03-31 for the following:

Here’s the formal syntax of SLAN. Hang on …

stream      ::= WS? list (WS? list)* WS?
list        ::= "(" WS? value (WS value)* WS? ")"
value       ::= list | emptylist | SYMBOL | STRING | NUMBER | BOOLEAN
emptylist   ::= "()" | "(" WS+ ")"

SYMBOL      ::= #ICHAR #SCHAR* | #NCHAR
 #ICHAR     ::= #LETTER | '!' | '$' | '%' | '&' | '*' | '/' | ':'
                 | '<' | '=' | '>' | '?' | '~' | '_' | '^'
 #NCHAR     ::= '.' | '+' | '-'
 #SCHAR     ::= #ICHAR | #DIGIT | #NCHAR
 #LETTER    ::= [A-Za-z]
 #DIGIT     ::= [0-9]

STRING      ::= '\"' #CHAR*  '\"'
 #CHAR      ::= [^"\] | "\" #ESCAPE
 #ESCAPE    ::= '\'' | '\"' | '\\'
                 | 'a' | 'b' | 'f' | 'n' | 'r' | 't' | 'v' 
                 | "x" #HEXDIGIT #HEXDIGIT 
                 | #OCTDIGIT #OCTDIGIT #OCTDIGIT
                 | "u{" #HEXDIGIT+ "}"
 #HEXDIGIT  ::= [0-9A-Za-z]
 #OCTDIGIT  ::= [0-7]

NUMBER      ::= (#SIGN)?(#WHOLE)(#DECIMAL)?(#EXPONENT)?
                | (#SIGN)?(#WHOLE)?(#DECIMAL)(#EXPONENT)?
                | (#SIGN)?(#WHOLE)'/'(#POSITIVE)
 #SIGN      ::= '+' | '-'
 #WHOLE     ::= '0' | #POSITIVE
 #POSITIVE  ::= [1-9][0-9]*
 #DECIMAL   ::= '.'[0-9]+
 #EXPONENT  ::= [eE][+-][0-9]+

BOOLEAN     ::= "#t" | "#f"

WS          ::= ( ' ' | '\f' | '\t' | `\r' | `\n' | '\v' | COMMENT )+
COMMENT     ::= ( ';' ... #NEWLINE ) | ( "#|" ... "|#" )
 #NEWLINE   ::= "\r" | "\n" | "\r\n"

#RESERVED   ::= '\'' | '`' | '@' | ',' | '{' | '}' | '[' | ']'
                | [\x00-\x08] | [\x0E-\x1F] | '\x7F' 
                | [\u{80}..\u{10FFFF}]

The above uses the following notation:

var
A parser production.
VAR
A lexer token. (Not that there’s much difference with a parser production.)
#VAR
Part of a top-level lexer token, defined to make the definition easier to follow, not a token in and of itself.
var | var |
alternatives for a production or token.
var?
Zero or one of var.
var*
Zero or more of var.
var+
One or more of var
... exp
Any sequence of characters until exp (a production or token).
()
a group of productions or tokens.
'' or ""
a literal sequence of characters as a single token.
[]
One of the characters in the set.
[^]
One of the characters not in the set.

Notation notes:

Escape Sequences

ADDED 2023-02-28: “\z”

The Scheme grammar I found didn’t define escapes other than the necessary ones for double quotes and backslashes. I’ve added several from C, JSON and ELTN, and for brevity I’ll simply repeat most of the table from ELTN.

Escape Byte(s) Meaning
\a 0x07 bell
\b 0x08 backspace
\f 0x0c form feed
\n 0x0a newline
\r 0x0d carriage return
\t 0x09 horizontal tab
\v 0x0b vertical tab
\\ \ backslash
\" " quotation mark / double quote
\’ ' apostrophe / single quote
\z skip to the next non-whitespace character
\xXX 0xXX byte value in hexadecimal digits
\DDD 0DDD byte value in octal digits
\u{XXX} utf8(XXX) UTF-8 bytes for code point 0xXXX.

\z I borrowed from Lua. Essentially this:

"This is a string with a newline in it. \z
        ... oh, wait, it's not."

becomes “This is a string with a newline in it. … oh, wait, it’s not.”

The Unicode escape \u{ HEXDIGIT+ } I also borrowed from Lua. The {} are part of the token, and group an arbitrary number of hex digits that may include code points outside the Basic Multilingual Plane (BMP). Implementations may reject values above 0x10FFFF.

Semantic Concerns

Rejected and Postponed Suggestions

Long Strings

I thought about adding “long strings” into this language, either Lua style:

[[
Four score and seven years ago our fathers brought forth on this continent
...
that government of the people, by the people, for the people, shall not perish
from the earth.
]]

or Python style:

"""
Four score and seven years ago our fathers brought forth on this continent
...
that government of the people, by the people, for the people, shall not perish
from the earth.
"""

Both of these evaluate to:

"Four score and seven years ago our fathers brought forth on this continent\n...\nthat government of the people, by the people, for the people, shall not perish\nfrom the earth.\n"

Really, though, how is a program going to decide when it’s time to go long?

ADDED 2023-02-28

Also note that nothing in the grammar prevents a string from containing a newline, tab, or other whitespace. They will be interpreted literally. In scheme something like this:

(set! s 
"Four score and seven years ago our fathers brought forth on this continent
...
that government of the people, by the people, for the people, shall not perish
from the earth.
")

produces the same results as the “long notation” above, apart from whether the line endings are Unix or Windows. See Backslash Escapes, above, for ways to remove unwanted whitespace.

Scheme Character Type

In Scheme a character sometimes behaves like a number that translates to an (ASCII) code point, and sometimes like a very short string. Python demonstrates single-character strings do a good enough job. As I’m writing a data format not a programming language, I’m simply leaving them out.

We may be reaching an end of the era where each character always has exactly one number. Even in Unicode one can make an ü with a single character or with a u and a combining umlaut (ü). Application programmers may neither know or care what numbers translate to what strings, as long as they can slice a string after a ü and not end up with a u and an ¨ … or an 0xC3 and an 0xBC3.

Scheme Numbers

Numbers in Scheme get really complicated really fast: inexact vs. exact, rational, real, and complex numbers in binary, octal, decimal, or hexadecimal. Here’s the full syntax of numbers from an official specification, slighly massaged to fit our conventions above:

num[r]           ::=  prefix[r] complex[r]
complex[r]       ::=  real[r] | real[r] "@" real[r]
                        | real[r] "+" imag[r] | real[r] "-" imag[r]
                        | "+" imag[r] | "-" imag[r]
imag[r]          ::=  "i" | ureal[r] "i"
real[r]          ::=  sign ureal[r]
ureal[r]         ::=  uinteger[r] | uinteger[r] "/" uinteger[r] | decimal[r]
uinteger[r]      ::=  digit[r]+ "#"*
prefix[r]        ::=  radix[r] exactness | exactness radix[r]
decimal[10]      ::=  uinteger[10] exponent
                        | "." digit[10]+ "#"* suffix
                        | digit[10]+ "." digit[10]* "#"* suffix
                        | digit[10]+ "#"+ "." "#"* suffix
suffix           ::=  empty | exponent
exponent         ::=  exponent_marker sign digit[10]+
exponent_marker  ::=  "e" | "s" | "f" | "d" | "l"
sign             ::=  empty | "+" | "-"
exactness        ::=  empty | "#i" | "#e"
radix[2]         ::=  "#b"
radix[8]         ::=  "#o"
radix[10]        ::=  empty | "#d"
radix[16]        ::=  "#x"
digit[2]         ::=  "0" | "1"
digit[8]         ::=  "0" | "1" | ... | "7"
digit[10]        ::=  "0" | "1" | ... | "9"
digit[16]        ::=  digit[10] | "a" | "b" | "c" | "d" | "e" | "f"
empty            ::=  ""

So if I understand this and everything I know from other specifications:

OK. I’ll get right on that.

Maybe someday I’ll add rational numbers and alternate radixes, and pubish a Revised Revised Report on SLAN.4 Don’t hold your breath, though.

Scheme Special Forms

The following forms have no special meaning in SLAN:

Again, it’s a persistence and communication format, not a general purpose language. That said, nothing stops an end-user from imbuing SLAN lists with meaning, so the symbols involved are not disallowed either.

Vector Type(s)

A vector is essentially an array of values rather than a (linked) list. As the distinction doesn’t matter to the SLAN format, I’m simply leaving them out.

Implementations

SLANPP: The SLAN Pull Parser

One might write a Pull Parser to parse SLAN with an interface like this.

public interface SlanPullParser {
    public enum SlanEvent {
        SYNTAX ERROR,
        START_STREAM,
        START_LIST,
        END_LIST,
        SYMBOL,
        STRING,
        NUMBER,
        BOOLEAN_TRUE,
        BOOLEAN_FALSE,
        EMPTY_LIST,
        END_STREAM
    }

    public boolean hasNext() throws IOException;

    public void next() throws IOException;

    public SlanEvent getEvent();

    public String getString();    // value of SYMBOL or STRING

    public boolean isNumber();    // if NUMBER or maybe number-like STRING

    public Number getNumber();    // converts string ... if possible

    public boolean isTrue();      // not BOOLEAN_FALSE or EMPTY_LIST
}

And here’s one way one might use it:

try {
    while (parser.hasNext()) {
        parser.next()
        switch (parser.getEvent())
            case START_STREAM:
                ...
                break;
            case START_LIST:
                ...
                break;
            case STRING:
                String str = parser.getString();
                ...
                break;
            case SYMBOL:
                String sym = parser.getSymbol();
                ...
                break;
            case BOOLEAN_TRUE:
            case BOOLEAN_FALSE:
            case EMPTY_LIST:
                boolean flag = parser.isTrue();
                ...
                break;
            case END_LIST:
                ...
                break;
            case END_STREAM:
                ...
                break;
            default:
                // ERROR ERROR ERROR
                ...
                break;
        }
    }
} finally {
    parser.close();
}

SLAN-List: Linked List

A parser could also simply build nested lists, just like John McCarthy used to make. The application would have to walk it just like any other parse tree.

public interface SlanValue {
    enum Type { LIST, SYMBOL, STRING, NUMBER, BOOLEAN, EMPTY_LIST }

    public Type getType();

    public String getSymbol();    // SYMBOL (using String#intern())

    public String getString();    // SYMBOL, STRING, NUMBER, BOOLEAN (Java)

    public boolean isNumber();    // if NUMBER

    public Number getNumber();    // NUMBER

    public Boolean isTrue();      // any (true if not `#f` or `()`)

    public boolean isList();      // if LIST or EMPTY_LIST

    public SlanValue getHead();   // start of LIST (null if `()`)

    public SlanValue getTail();   // rest of LIST (null if end of list)

    public String toString();     // serializes self as SLAN
}

SLAN-B: The SLAN Builder

This interface allows one to build a SLAN document with methods that (almost) guarantee valid SLAN at the end.

The API providing a Builder assumes responsibility for writing the SLAN out to a file or URL, flushing the connection, and closing it properly.

public interface SlanBuilder {
    /**
     * Add a String.
     * The builder will escape any characters necessary.
     * @param value string
     * @return this
     */
    SlanBuilder addString(String value);

    /**
     * Add a symbol.
     * If value contains characters not allowed in a symbol the
     * resulting symbol will substitute '_'.
     * @param value symbol
     * @return this
     */
    SlanBuilder addSymbol(String value);

    /**
     * Add a number.
     * If a Double or Float, a NaN or +/-Infinity value will substitute
     * the Empty List.
     * @param value number 
     * @return this
     */
    SlanBuilder addNumber(Number value);

    /**
     * Add a SLAN Boolean, "#t" or "#f".
     * @param value boolean value
     * @return this
     */
    SlanBuilder addBoolean(boolean value);

    /**
     * Add an Empty List "()".
     * @return this
     */
    SlanBuilder addEmptyList();

    /**
     * Begin nested list. 
     * Subsequent calls until {@link #addListEnd} add to the nested list.
     * @return a builder for the new list
     */
    SlanBuilder openList();

    /**
     * End nested list.
     * @return the builder for the parent list.
     */
    SlanBuilder closeList();

    /** 
     * Calls a procedure that builds a sublist.
     * Roughly equivalent to:
     * <pre>
     *    this.openList(); 
     *    proc.build(this);
     *    this.closeList();
     * </pre>
     */
    default SlanBuilder addList(SlanBuilder.Procedure proc) {
        SlanBuilder result = this;
        try {
            result = this.openList();
            proc.build(result);
        } finally {
            result = closeList();
        }
        return result;
    }

    /**
     * Interface for Ruby-like method procedures.
     */
    @FunctionalInterface
    public interface Procedure {
        void build(SlanBuilder builder);
    }
}

Other Implementations

REMOVED 2023-02-27: the “Slan Function Oriented Marshaller” because it and related APIs were still in flux.

One could write SLAN parsers and emitters in any other language: C, Go, Python, Ruby, even Scheme. The APIs might vaguely resemble the ones above, adjusted for the capabilities and idioms of the language.

MIT License

Added 2023-02-26, because of the interfaces listed above

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


  1. Formerly the “Scheme-Like Abridged Notation”, but this sounds much better and more descriptive. JSON has Objects (and Arrays), ELTN has Tables, SLAN has Lists … and Atoms. ↩︎

  2. Specifically how to choose among Java classes to wrap an object based on the signature(s) of the wrapper’s constructor(s). For extra credit, imagine the user offers up multiple arguments, e.g. (byte[], int, int). I think I implemented something like this for Rhino back in the day, but I can’t remember the details. ↩︎

  3. The bytes of ü (0xFC) in UTF-8. Obvi. ↩︎

  4. Then, like the Revised Reports on Scheme, there’ll be a Revised Revised Revised Report (R3), an R4, R5, R6 … ↩︎