Prototype: EEP18 Considered Harmful: The problems with Erlang to JSON term translation

2:40 pm ECMA / Javascript, ECMAScript, Erlang, General Interest, Programming, Rants, Tools and Libraries, Web and Web Standards

THIS IS ONLY HALF WRITTEN.  I have been sitting on this post, waiting for the mood to finish it, for months; because EEP18 is now being treated as likely to be implemented, I am immediately publishing the half-written version, because it exposes many (though not all) of the serious, irreconcilable problems with EEP18.

On the mailing list, people are actively trying to bring Erlang up to snuff with regard to web standards.  One of the more unfortunate choices being discussed is JSON as a data notation.  JSON, unfortunately, does not actually map to Erlang in a useful way.  Joe Armstrong has gone as far as to suggest BIFs, which are decidedly unrealistic as well as unnecessary.  My goal is to create a JSON handling library.  However, the mailing list is beginning to put momentum behind an alternative proposal which is currently presented in BIF form.

This post explains why my approach is different.  Many of the issues herein are discussed by the tabled EEP (EEP 18, “JSON BIFs” by Richard O’Keefe), but some are not, and some of these issues are accepted when I believe they should not be.  It is my position that EEP 18 is unacceptably dangerous.  I will explain why.

This paper assumes you are familiar with Erlang and with fundamental containers (the list, the array and the key/value map).  It is very helpful, but not required, to be familiar with JSON, JavaScript or any ECMAScript-derived language such as ActionScript.

Premise

There’s a movement starting to use Erlang for web work.  There are several stumbling blocks to that end.  Among them are the lack of a simple primary webserver, a simple primary Unicode system and a simple primary JSON manager.

The webserver problem is mostly solved: there’s the httpd module, there’s Yaws, there’s MochiWeb, there’s the currently unavailable work at Tobbe’s Red Hot Erlang Blog, there’s even Joe’s HTTPD tutorial.  Yaws and MochiWeb in particular get a lot of action these days.  The situation isn’t amazingly straightforward, but it’s fairly straightforward; we’re in “Good Enough” territory.  (I’m building another webserver that behaves like Factor’s drop-and-go server, based on Joe’s tutorial, but that’s not for here.)

Neither the Unicode problem nor the JSON problem, however, is solved.  The Erlang community has usually had the foresight to deal with complex problems in modules first and to move them into syntax later, but this process seems to be failing with both JSON and Unicode.  It can be argued that some of the choices made in each process are dangerous.  This community will, by and large, remember the regexp module, which is now being replaced by a partially incompatible successor (re) that takes account of limitations and problems in the prior attempt and moves to a stronger RE dialect.  It is important that this ability to revise be retained for JSON and Unicode, both of which are subtly, strikingly difficult problems, and both of which are unlikely to be Gotten Right™ on their respective first attempts.

The Principle of Least Surprise

One of the most important parts of writing libraries is to avoid building nasty shocks in for users.  In transcoding libraries, there is one rule that defines least surprise more powerfully than any other: round-trip translations must not lose data.  No configuration of EEP 18 can achieve this.  Indeed, it is my position that a one-to-one translation of JSON to Erlang terms cannot exist, and that presenting a translation that is not 1:1 as a translation is unacceptable, because people will expect j2e(e2j(X)) == X, and that cannot be true.  This is especially important given that the suggestion that these translations become BIFs seems to be taken seriously; foo_to_bar(X) bifs are currently never lossy, and this would create a worrying change in the meaning of several basic naming practices.

It is of critical importance, in my opinion, that we do not provide libraries which fail round-trip conversion in either direction.  At this time, EEP 18 attempts to satisfy this clause with creation-time configurability; I will explain my stance that this is inadequate below.
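To make the round-trip expectation concrete, here is a minimal sketch.  Everything in it is an assumption made for illustration: the module name roundtrip_sketch, the functions e2j/1 and j2e/1, and the tagged tuples standing in for parsed JSON are all hypothetical, and the mapping is a deliberately naive toy, not the mapping EEP 18 proposes.  JSON objects are left out entirely.

```erlang
-module(roundtrip_sketch).
-export([e2j/1, j2e/1, roundtrips/1, roundtrips_json/1]).

%% Hypothetical toy mapping, NOT the EEP 18 mapping.
%% JSON values are modelled as tagged tuples so no parser is needed.

%% Erlang term -> modelled JSON value.
e2j(true)  -> {json_bool, true};
e2j(false) -> {json_bool, false};
e2j(null)  -> json_null;
e2j(A) when is_atom(A)   -> {json_string, atom_to_list(A)};
e2j(N) when is_number(N) -> {json_number, N};
e2j(T) when is_tuple(T)  -> {json_array, [e2j(E) || E <- tuple_to_list(T)]};
e2j(L) when is_list(L)   -> {json_array, [e2j(E) || E <- L]}.

%% Modelled JSON value -> Erlang term.
j2e({json_bool, B})   -> B;
j2e(json_null)        -> null;
j2e({json_string, S}) -> S;            % comes back as a list of integers
j2e({json_number, N}) -> N;
j2e({json_array, Es}) -> [j2e(E) || E <- Es].

%% Does an Erlang term survive Erlang -> JSON -> Erlang?
roundtrips(X) -> j2e(e2j(X)) =:= X.

%% Does a modelled JSON value survive JSON -> Erlang -> JSON?
roundtrips_json(J) -> e2j(j2e(J)) =:= J.
```

Under this toy mapping, roundtrips([1, 2, 3]) holds, but roundtrips({1, 2, 3}) does not (the tuple comes back as a list), roundtrips(foo) does not (the atom comes back as a list of characters), and roundtrips_json({json_string, "hi"}) does not (the string is re-encoded as an array of numbers).  A different toy would fail in different places; the point is only that each plausible choice loses something.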

Why Translation is Unclear

There are, in fact, quite a few problems that prevent 1:1 translation.  We’ll go over them one by one.

  1. The notations offer different fundamental containers
    1. Erlang offers dense sequence (“tuple”, {}) and singly linked list (“list”, []) containers.  The Erlang standard library offers other containers; I discuss later in this document why I’m not using them.
    2. JSON offers dense sequence (“array”, []) and key-value map (“object”, {}) containers.  That’s it.
  2. The notations offer different fundamental datatypes
    1. JSON has a fundamental string type; Erlang doesn’t.
    2. Erlang has atoms; JSON doesn’t.
    3. JSON has booleans and “null”; Erlang doesn’t.  For transcoding, pretending they’re atoms creates ambiguity, and is therefore unacceptable.
    4. JSON has explicit support for Unicode characters in strings.  Erlang doesn’t have strings at all, but rather lists of characters (in the way that C has arrays of characters).  Those lists are context and usage defined; C++ programmers may think of this as parallel to array strings vs std::string.  Erlang currently has no concept of Unicode (though that’s another issue I’m working on, and one where I diverge from the current mailing list / EEP approach).
    5. JSON and Erlang have very different lists of quoted terms for strings.
      1. Erlang supports embedded octal with shortening, and a bunch of semi-defunct control characters like form feed ("\f") and escape ("\e").
      2. JSON supports embedding 16-bit Unicode values as \uXXXX escapes.
      3. Problematically, JSON does not define whether that embedding is UTF-16, UCS-2 or something else.  Most software implementations assume UTF-16.  This document will carefully avoid the issue, which is a serious defect in this document, and a serious defect in JSON.
    6. Erlang terms are byte-available, meaning Erlang programmers may be aware of endianness; JSON objects are not.  This suggests that the handling library needs to either make a choice about internal endianness, or needs to provide control to the user.
  3. The notation for similar containers is dissimilar
  4. Similar notations are similar, not identical
  5. Dangerous string ambiguities
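The string points in the list above can be checked directly.  The following is a small sketch (the module name string_ambiguity and the function demo/0 are made up for this example) showing that an Erlang “string” is only a list of integers, so an encoder handed such a term cannot know whether a JSON string or a JSON array of numbers was intended.  It also shows a few of the Erlang-only escapes mentioned above.

```erlang
-module(string_ambiguity).
-export([demo/0]).

%% Each match below crashes with badmatch if the claim were false,
%% so demo/0 returning ok means every claim held.
demo() ->
    %% A double-quoted literal is shorthand for a list of integers.
    true = ("abc" =:= [97, 98, 99]),
    %% The empty string and the empty list are the same term.
    true = ("" =:= []),
    %% A "string" passes the ordinary list test like any other list.
    true = is_list("abc"),
    %% io_lib:printable_list/1 is only a guess: it says yes for a list
    %% of small integers whether or not text was meant.
    true = io_lib:printable_list([97, 98, 99]),
    %% Erlang-only escapes: octal (full and shortened), form feed, escape.
    true = ("\377" =:= [255]),
    true = ("\7" =:= [7]),
    true = ("\f" =:= [12]),
    true = ("\e" =:= [27]),
    ok.
```

Calling string_ambiguity:demo() should return ok.  A library that guesses with something like io_lib:printable_list/1 will sooner or later turn a list of small integers into text, or the reverse, which is the kind of shock discussed under least surprise.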

Working from http://sc.tri-bit.com/outgoing/scjson%20parser%20halp.html

5 Responses

  1. Per Gustafsson Says:

    foo_to_bar(X) bifs are currently never lossy

    I do not think this is correct. If you look at list_to_binary and binary_to_list, you find that list_to_binary(binary_to_list(X)) == X holds, but binary_to_list(list_to_binary(X)) == X does not hold, since X could have been a deep list.

    This is a minor point of course, but I don’t understand why you would want to be able to encode any Erlang term in JSON. I do not think that this is necessary, since there are several Erlang values which make no sense outside of Erlang, e.g. ports and pids. I think it would be more reasonable to define a 1-1 mapping between all JSON terms and a subset of Erlang terms, which seems to be what EEP 18 does.

  2. John Haugeland Says:

    I’m not that worried about encoding all Erlang terms as JSON, I’m just using that as an example of the flaws in the current EEP.

    What I am worried about is representing every JSON term in Erlang unambiguously, and with the Unicode requirements of JSON, that isn’t even possible in Erlang right now.

    Above and beyond that, the only way to make strings unambiguous is to present them as binaries, and with Unicode’s UTF8 requirements, that means using binaries to represent a varying width encoding, which is a performance nightmare in the making (especially for a language without meaningful manual iteration).

    See also this document that I’m working on.

  3. Hynek (Pichi) Vychodil Says:

    Not exactly even flat list:
    1> binary_to_list(list_to_binary([255])).
    "\377"
    2> binary_to_list(list_to_binary([256])).
    ** exception error: bad argument
    in function list_to_binary/1
    called as list_to_binary([256])

    The first one shows another problem. There is not only semantic sameness, but also source code form ("[255]" is not the same as "\"\377\""). If you understand JSON as code for a JS semantic object, then you can achieve e2j(j2e(X)) == X in a semantic manner but not in code. I think that is enough, but it should be well documented and frequently mentioned. Anyway, I agree with you that EEP 18 is too fresh and dangerous to implement as a BIF in its current state.

  4. D Smith Says:

    > Not exactly even flat list:
    > 1> binary_to_list(list_to_binary([255])).
    > "\377"
    > 2> binary_to_list(list_to_binary([256])).
    > ** exception error: bad argument
    > in function list_to_binary/1
    > called as list_to_binary([256])

    The \377 is replaced in the stream by your terminal. It looks like a four-character list, but it is a single element in the list with an integer value of 255. Believe me, it’s just your terminal emulator playing with your mind. Don’t believe me? Change the character encoding of your TE.

  5. Chris Anderson Says:

    In CouchDB it’s the e2j(j2e(Json)) == Json we’re interested in. We’ve been using the EEP’s mapping for a few months now, without trouble — we’re mostly excited about the possibility of native speed with a C implementation. JSON encoding and decoding is currently our biggest bottleneck.

    From my perspective, JSON round-trip-ability is the main event. It makes more sense to me, to construct valid-as-JSON Erlang terms, and then convert them for external use.

    This gives the advantage of the Erlang programmer having full control over the JSON output. (At the expense of an automatic 2-way Erlang->JSON->Erlang serialization.)

    Marshaling arbitrary Erlang terms could be done by serializing terms to lists and proplists suitable for 2-way-JSON transformation.

    I don’t feel qualified to speak about the naming issue, but there are enough experienced Erlangers behind the proposal to ease my mind.
