Progress Report 2 | Frank Mitchell's Blog

My coding projects progress in fits and starts. For now I’ve suspended work on ELTN and turned my attention to JSON Pull Parser.

JSON Pull Parser

Unbeknownst to me, Java already has a JSON standard: JSONP. Its focus, however, seems to be enterprise application servers. The top-level javax.json package mostly revolves around trees of JsonValue objects; javax.json.stream resembles a bare-bones version of my pull parser. My intent was a small library with a minimal footprint, yet with enough conveniences to serve as a primary API. For example, JSONP reports a key within a JSON Object once, when it is encountered; my JSONPP saves the name until it finishes parsing the corresponding “value”, which might itself be an Object or Array. (Although I might provide an option to turn that off to reduce overhead.)

Right now JSONPP covers the whole JSON standard. Next I need to make sure I handle invalid input in a sensible way, and report errors with enough information for a human to understand the problem. (Here internationalization rears its ugly head.)

After that, I have a whole list of tasks:

- Clean up the code, add Javadoc documentation, and include the MIT license in every source file.
- Build and run some performance tests.
- Explore possible optimizations, including:
  - using buffers instead of reading one character at a time.
  - writing my own UTF-8 converter. (See A Note on Unicode below.)
  - replacing hand-written code with Java’s regular expression library.
- See if I can adapt JSONPP to the needs of non-blocking I/O.¹
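
As a rough illustration of the buffering idea in the list above, here is a minimal sketch. The class name BufferedCodePointSource and its methods are hypothetical and not part of JSONPP; it only assumes the parser wants one Unicode code point at a time from a java.io.Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper (not JSONPP's actual code): serves code points from a refillable char buffer. */
final class BufferedCodePointSource {
    private final Reader reader;
    private final char[] buffer;
    private int position = 0; // index of the next unread char in the buffer
    private int limit = 0;    // number of valid chars currently in the buffer

    BufferedCodePointSource(Reader reader, int bufferSize) {
        this.reader = reader;
        this.buffer = new char[bufferSize]; // bufferSize must be at least 1
    }

    /** Returns the next UTF-16 code unit, refilling the buffer when it runs dry, or -1 at end of input. */
    private int nextChar() throws IOException {
        if (position >= limit) {
            limit = reader.read(buffer, 0, buffer.length);
            position = 0;
            if (limit <= 0) {
                limit = 0;
                return -1;
            }
        }
        return buffer[position++];
    }

    /** Returns the next Unicode code point, or -1 at end of input. */
    int next() throws IOException {
        int high = nextChar();
        if (high < 0 || !Character.isHighSurrogate((char) high)) {
            return high;
        }
        int low = nextChar();
        if (low >= 0 && Character.isLowSurrogate((char) low)) {
            return Character.toCodePoint((char) high, (char) low);
        }
        if (low >= 0) {
            position--; // un-read the char that did not complete the pair
        }
        return high;    // unpaired surrogate, passed through as is
    }

    public static void main(String[] args) throws IOException {
        BufferedCodePointSource source =
                new BufferedCodePointSource(new StringReader("{\"a\":1}"), 4);
        for (int cp = source.next(); cp >= 0; cp = source.next()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Whether this beats wrapping the input in a java.io.BufferedReader is exactly what the performance tests would have to show.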

A Note on Unicode

If you know all about character encodings and Unicode, skip to the next section.

Computers represent characters using an encoding that maps numbers to characters. ASCII, the most used encoding, covers the keys on a standard American computer keyboard. ASCII uses the numbers 0-127, which a computer can represent with only 7 bits. (Each additional bit doubles the range of numbers.) Languages other than American English, mathematics, and various other domains need additional characters. Microsoft, Apple, and various foreign standards bodies defined 8-bit encodings that assigned 128-255 to various and usually incompatible letters and symbols. Those who don’t use Latin letters at all had to devise yet more encodings. Even today most text files use one of the 8-bit or variable-length encodings, depending on platform and language.

The Unicode standard assigns a number to every character in existence, theoretically. It’s a strict superset of ASCII. Older versions of Unicode required only 16 bits (two bytes) to represent every character or symbol. China and Japan pointed out that the 16-bit standard didn’t cover all the characters and symbols they use. The current standard defines code points up to U+10FFFF, which takes 21 bits to represent.

Java’s native char type is 16 bits (two bytes) long. (Most languages assumed characters take up one byte, except for variable-length encodings like Japan’s Shift-JIS.) This was enough to cover what the new standard calls the Basic Multilingual Plane (BMP) back in 2003, but not now. Fortunately, Unicode’s UTF-16 encoding defines how to represent characters beyond the BMP using a “surrogate pair” of two 16-bit values. Core Java classes can convert any other encoding to UTF-16 and back.
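
To make surrogate pairs concrete, here is a small sketch using only methods of java.lang.Character. The class SurrogateDemo and its helper names are made up for illustration.

```java
/** Illustration only: how one code point beyond the BMP maps to two Java chars. */
final class SurrogateDemo {
    /** Encodes a code point as one or two UTF-16 code units. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Recombines a high and low surrogate into the original code point. */
    static int fromUtf16(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int eAcute = 0x00E9;  // inside the BMP: fits in one char
        int emoji = 0x1F600;  // beyond the BMP: needs a surrogate pair

        System.out.println(Character.charCount(eAcute)); // 1
        System.out.println(Character.charCount(emoji));  // 2

        char[] pair = toUtf16(emoji);
        // Prints: high D83D low DE00
        System.out.printf("high %04X low %04X%n", (int) pair[0], (int) pair[1]);

        String s = new String(pair);
        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```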

JSON takes no position on character encodings, but nearly every character in its grammar is an ASCII character. (The exception is JSON Strings, which can contain any currently defined Unicode character with a few restrictions.) For backward compatibility with older software, Unicode defines a variable-length encoding called UTF-8. ASCII characters are unchanged; other characters require two, three, or now four bytes.
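
The byte counts above follow directly from the code point ranges. The sketch below (class and method names are illustrative) computes the length by range and checks it against what Java’s own encoder produces.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: how many bytes UTF-8 needs for a given code point. */
final class Utf8Length {
    /** Assumes a valid, non-surrogate code point in the range 0..0x10FFFF. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // ASCII, unchanged
        if (codePoint < 0x800) return 2;    // e.g. accented Latin letters
        if (codePoint < 0x10000) return 3;  // rest of the BMP
        return 4;                           // beyond the BMP
    }

    /** The same answer, obtained by actually encoding the character. */
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x00E9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            // Prints 1, 2, 3 and 4 bytes respectively
            System.out.printf("U+%04X: %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```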

A stream of JSON would most commonly use either “clean” 7-bit ASCII or UTF-8 for portability. Any encoding would have to include ASCII symbols used in the JSON format. Therefore, a possible optimization would read raw bytes from a file, socket, or pipe and code my own UTF-8 conversion, optimizing for the ASCII subset. Internally JSONPP uses Unicode “code points” for a number of reasons; Java APIs would convert from UTF-8 to UTF-16 to 32-bit integers.
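
A hand-written converter along these lines might look like the sketch below. It is only an assumption about how such a decoder could be shaped, not JSONPP’s code: Utf8CodePointReader is a hypothetical name, it reads one byte at a time for clarity, and a real version would sit on top of a buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decodes UTF-8 bytes straight to code points, skipping UTF-16. */
final class Utf8CodePointReader {
    private final InputStream in;

    Utf8CodePointReader(InputStream in) {
        this.in = in;
    }

    /** Returns the next code point, or -1 at end of stream. */
    int read() throws IOException {
        int b = in.read();
        if (b < 0x80) {
            return b; // ASCII fast path; also passes through -1 for end of stream
        }
        int extra; // number of continuation bytes expected
        int cp;    // payload bits collected so far
        int min;   // smallest code point allowed for this length (rejects overlong forms)
        if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            throw new IOException("Invalid UTF-8 lead byte: 0x" + Integer.toHexString(b));
        }
        for (int i = 0; i < extra; i++) {
            int next = in.read();
            if (next < 0 || (next & 0xC0) != 0x80) {
                throw new IOException("Invalid or truncated UTF-8 sequence");
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IOException("Overlong or out-of-range UTF-8 sequence");
        }
        return cp;
    }

    public static void main(String[] args) throws IOException {
        byte[] json = "{\"name\":\"caf\u00E9\"}".getBytes(StandardCharsets.UTF_8);
        Utf8CodePointReader reader = new Utf8CodePointReader(new ByteArrayInputStream(json));
        for (int cp = reader.read(); cp >= 0; cp = reader.read()) {
            System.out.printf("U+%04X ", cp);
        }
        System.out.println();
    }
}
```

The first branch is the point of the exercise: for the ASCII characters that make up JSON’s grammar, decoding costs one comparison.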

If the caller specifies an encoding other than ASCII/UTF-8, I’d use the Java APIs because I don’t want to re-code every encoding. If I needed to optimize that I could use an alternate implementation that uses Java chars exclusively.

Old Dog Learns New Tricks

When I, ahem, stopped coding in Java, I was using Java 6. Now, the “old” version is Java 8, with Java 9 and up providing much-needed mechanisms. The default keyword alone surprised me. Implementing methods on an interface? If you could also define fields on all implementors you’d have true multiple inheritance. Java 9’s module mechanism solves a lot of other problems.
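
For anyone else coming from Java 6, here is a minimal example of a default method. The Greeter interface is invented purely for illustration.

```java
/** Illustration only: an interface that carries behavior via a default method (Java 8+). */
interface Greeter {
    /** The one abstract method implementors must supply. */
    String name();

    /** Implemented on the interface itself; implementors inherit it unless they override it. */
    default String greet() {
        return "Hello, " + name() + "!";
    }
}

final class DefaultMethodDemo {
    public static void main(String[] args) {
        // A lambda works because Greeter has exactly one abstract method.
        Greeter greeter = () -> "world";
        System.out.println(greeter.greet()); // prints "Hello, world!"
    }
}
```

Interfaces still can’t declare instance fields, so a default method can only work through other methods of the interface.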

Originally I wrote JSONPP using vi, then recently switched to the Eclipse IDE. Eclipse looked and worked mostly the same. Its greatest advantages are name completion and automated refactoring. The former I depended on even more than previously, both because Java 8 is different and because I’ve forgotten even the core APIs. The latter I probably overused as I extracted methods from repeated code, then inlined those methods because I thought I had a better idea. After I finish bullet-proofing and optimizing I might merge some of the classes I separated.

(to be continued)

¹ Instead of waiting to read or write to a single data stream, non-blocking I/O blocks on multiple input/output streams and returns when one or more streams have read or written data. It’s much more efficient when a server has to handle a lot of connections at once.