CodePoint

Frank Mitchell

Posted: 2023-03-31
Last Modified: 2023-07-21
Word Count: 2031
Tags: java programming

Excerpted and expanded from JSONPP.

A simple API to abstract out Unicode code points from the underlying source – Readers / Writers, InputStreams / OutputStreams, CharSequences / StringBuffers, even a java.nio.ByteBuffer of ASCII, Latin-1, or UTF-8 bytes – took on a life of its own. Since I planned other parser projects, I split it off into its own library called “CodePoint”.

Design

Provider

A facade called CodePoint hides the exact classes used to wrap I/O classes and buffers. CodePoint notionally[1] calls the Service Loader to load an implementation of its two primary methods, getSource() and getSink().
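As a rough illustration of what "notionally" could mean here, the sketch below asks java.util.ServiceLoader for a provider and falls back to a built-in default when none is registered. The names ProviderLookupSketch, Provider, and DefaultProvider are hypothetical stand-ins, not the library's actual classes.

```java
import java.util.ServiceLoader;

final class ProviderLookupSketch {

    // Hypothetical stand-in for the provider interface.
    interface Provider {
        String name();
    }

    // Hypothetical default used when no provider is registered.
    static final class DefaultProvider implements Provider {
        @Override
        public String name() {
            return "default";
        }
    }

    /** Returns the first registered provider, or the default if there is none. */
    static Provider loadProvider() {
        return ServiceLoader.load(Provider.class)
                .findFirst()
                .orElseGet(DefaultProvider::new);
    }
}
```

With no META-INF/services entry on the class path, loadProvider() simply returns the default, which matches the hardcoded behavior the footnote describes.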

The default implementation cross-references the object and its class or interface to wrap with known implementations of sources and sinks. To get a list of “known” implementations, it uses something like[2] the Service Loader configuration file to get a list of classes, inspects the constructors of each one, and builds tables of which sources and sinks can wrap which input and output classes. It assumes each constructor takes the class it can wrap and a Charset used to decode input or encode output. An annotation indicates whether a class or its constructor accepts only specific charsets.
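The table-building step might look something like the following sketch, which assumes every wrapper has a public two-argument constructor of the form (wrapped type, Charset). WrapperTableSketch, ReaderSource, and StreamSource are hypothetical names for illustration; the real lookup code is stubbed out in this post.

```java
import java.io.InputStream;
import java.io.Reader;
import java.lang.reflect.Constructor;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WrapperTableSketch {

    // Hypothetical wrappers standing in for real source classes.
    static final class ReaderSource {
        public ReaderSource(Reader in, Charset cs) { }
    }

    static final class StreamSource {
        public StreamSource(InputStream in, Charset cs) { }
    }

    /** Maps each wrappable input type to the constructor that wraps it. */
    static Map<Class<?>, Constructor<?>> buildTable(List<Class<?>> wrappers) {
        Map<Class<?>, Constructor<?>> table = new LinkedHashMap<>();
        for (Class<?> wrapper : wrappers) {
            for (Constructor<?> ctor : wrapper.getConstructors()) {
                Class<?>[] params = ctor.getParameterTypes();
                // Keep only constructors shaped like (wrapped type, Charset).
                if (params.length == 2 && params[1] == Charset.class) {
                    table.putIfAbsent(params[0], ctor);
                }
            }
        }
        return table;
    }

    /** Finds the first constructor whose wrapped type accepts {@code in}. */
    static Constructor<?> lookup(Map<Class<?>, Constructor<?>> table, Object in) {
        for (Map.Entry<Class<?>, Constructor<?>> entry : table.entrySet()) {
            if (entry.getKey().isInstance(in)) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

Matching with isInstance() is what lets a StringReader find a wrapper registered for Reader; a fuller version would presumably prefer the most specific match rather than the first one.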

Clients of the CodePoint API really only need to know the CodePoint class and the Source and Sink interfaces. Implementers of Sources and Sinks need only write classes with public constructors and list those classes in a configuration file in their library’s jar. (Ideally. I’m still working on it.)

Source

Sources provide a stream of incoming Unicode code points.

Since this API originated as a way to parse files as a stream of Unicode code points, the Source API walks through each code point one at a time. Perhaps it needs bulk read methods, but in the general case those would read Unicode code points, i.e. ints. Nothing stops an implementation from buffering in the background for efficiency.

InputStream in;
Charset cs;

try (CodePointSource source = CodePoint.getSource(in, cs)) {
    while (source.hasNext()) {
        source.next();
        int cp = source.getCodePoint();
        // do something with `cp`
    }
} catch (IOException e) {
    // do something with `e`
}

Most of CodePointSource’s methods throw IOException because they perform I/O. IntStream, in contrast, assumes all the code points are in memory and valid; if those assumptions are violated, it has to throw a RuntimeException. At least Sources are up-front.

Sink

Sinks accept an outgoing stream of Unicode code points as single ints or an IntStream. For convenience, though, they also accept chars and CharSequences.

int cp;
char c;
String text1;
StringBuilder text2;
StringBuffer text3;
OutputStream out;
Charset cs;

try (CodePointSink sink = CodePoint.getSink(out, cs)) {
    sink.putCodePoint(cp);
    sink.append(c).append(text1).append(text2);
    sink.putCodePoints(text3.codePoints()); // or sink.append(text3);
    sink.flush();
} catch (IOException e) {
    // do something with `e`
}

As seen below, all implementers really need to override are putCodePoint(), flush(), and close().

API

Code Point

Note: Actual implementation has been stubbed out.

package com.frank_mitchell.codepoint;

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Objects;

/**
 * Wraps an input or output object with an instance of {@link CodePointSource} or
 * {@link CodePointSink}.
 * This class and its static methods are a facade for an instance of
 * {@link CodePointProvider}.
 *
 * @author Frank Mitchell
 *
 * @see CodePointProvider
 */
public class CodePoint {

    /**
     * Wrap an input object with a {@link CodePointSource}.
     * @param <T> The type of in
     * @param in an object providing a stream of characters
     * @param cs the {@link Charset} of characters from in
     * @return a CodePointSource wrapping {@code in}
     * @throws IOException if wrapping or reading in caused an exception
     */
    public static <T> CodePointSource getSource(T in, Charset cs)
            throws IOException {
        // ...
    }

    /**
     * Wrap an input object with a {@link CodePointSource}.
     * @param <T> The type of in
     * @param clz the type of in when looking for a suitable wrapper.
     * @param in an object providing a stream of characters
     * @param cs the {@link Charset} of characters from in
     * @return a CodePointSource wrapping {@code in}
     * @throws IOException if wrapping or reading in caused an exception
     */
    public static <T> CodePointSource getSource(Class<T> clz, T in, Charset cs)
            throws IOException {
        // ...
    }

    /**
     * Wrap an output object with a {@link CodePointSink}.
     * @param <T> The type of out
     * @param out an object accepting a stream of characters
     * @param cs the {@link Charset} of characters written to out
     * @return a CodePointSink wrapping {@code out}
     * @throws IOException if wrapping or writing to out caused an exception
     */
    public static <T> CodePointSink getSink(T out, Charset cs)
            throws IOException {
        // ...
    }

    /**
     * Wrap an output object with a {@link CodePointSink}.
     * @param <T> The type of out
     * @param clz the type of out when looking for a suitable wrapper.
     * @param out an object accepting a stream of characters
     * @param cs the {@link Charset} of characters written to out
     * @return a CodePointSink wrapping {@code out}
     * @throws IOException if wrapping or writing to out caused an exception
     */
    public static <T> CodePointSink getSink(Class<T> clz, T out, Charset cs) 
            throws IOException {
        // ...
    }
}

Code Point Provider

package com.frank_mitchell.codepoint;

import java.io.IOException;
import java.nio.charset.Charset;

/**
 * Determines a {@link CodePointSource} or {@link CodePointSink} for a given
 * object.
 *
 * It uses the {@link ClassLoader}, configuration files, reflection, and a few
 * heuristics to instantiate an appropriate class. Other implementers of
 * {@link CodePointSource} and {@link CodePointSink} merely need to include in
 * the jar a UTF-8 file named {@code META-INF/codepoint/classes.conf} containing
 * the fully qualified binary names of classes implementing
 * {@link CodePointSource} or {@link CodePointSink}, separated only by
 * whitespace or comments ('#' until the end of the line).
 *
 * Sources and sinks that can only handle a restricted range of {@link Charset}s
 * must indicate which sets with {@link ForCharsets} on the relevant
 * constructor(s).
 *
 * @author Frank Mitchell
 */
public interface CodePointProvider {

    /**
     * Wrap an output object with a {@link CodePointSink}.
     * @param <T> The type of out
     * @param clz the type of out when looking for a suitable wrapper.
     * @param out an object accepting a stream of characters
     * @param cs the {@link Charset} of characters written to {@code out}
     * @return a CodePointSink wrapping {@code out}
     * @throws IOException if wrapping or writing to {@code out}
     *                     caused an exception
     */
    <T> CodePointSink getSink(Class<T> clz, T out, Charset cs)
            throws IOException;

    /**
     * Wrap an input object with a {@link CodePointSource}.
     * @param <T> The type of {@code in}
     * @param clz the type of {@code in} when looking for a suitable wrapper.
     * @param in an object providing a stream of characters
     * @param cs the {@link Charset} of characters from {@code in}
     * @return a CodePointSource wrapping {@code in}
     * @throws IOException if wrapping or reading from {@code in}
     *                     caused an exception
     */
    <T> CodePointSource getSource(Class<T> clz, T in, Charset cs) 
            throws IOException;
}
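The configuration file format described in the Javadoc above (class names separated by whitespace, with '#' starting a comment) is simple enough to parse by hand. The following is a minimal sketch of such a parser; ClassListParserSketch and the com.example class names are hypothetical, and the library's own parser may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class ClassListParserSketch {

    /** Extracts class names from configuration text, skipping '#' comments. */
    static List<String> parse(String text) {
        List<String> names = new ArrayList<>();
        for (String line : text.split("\\R")) {
            // Drop everything from '#' to the end of the line.
            int hash = line.indexOf('#');
            String content = (hash >= 0) ? line.substring(0, hash) : line;
            for (String token : content.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    names.add(token);
                }
            }
        }
        return names;
    }
}
```

For example, the text "com.example.A  com.example.B # trailing" yields the two names com.example.A and com.example.B, which a provider could then pass to Class.forName().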

Code Point Source

package com.frank_mitchell.codepoint;

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/**
 * An iterator over an external sequence of Unicode code points.
 * Using {@code int} instead of {@code char} is a bit of 
 * future-proofing for when streams commonly contain characters
 * outside of the Basic Multilingual Plane (0x0000 - 0xFFFF).
 * Implementers can transparently decode UTF-8 or UTF-16 multi-byte
 * characters into a single code point.  (At least until Unicode expands
 * past 32 bits.)
 * 
 * Unlike standard Java {@link Iterator}s, advancing the iterator and
 * reading the next item in the sequence can be two separate actions.
 * That way one can pass the source to other methods and they can read
 * the last code point read without altering state.
 * 
 * @author Frank Mitchell
 */
public interface CodePointSource extends Closeable {

    /**
     * Return the code point read by the most recent call to {@link #next()}.
     * 
     * @return current code point.
     */
    int getCodePoint();

    /**
     * Whether this source still has code points remaining.
     * This method may read ahead to the next character, which may
     * cause an exception.
     * 
     * @return whether this object has a next code point.
     * 
     * @throws java.io.IOException if read-ahead throws an exception
     */
    boolean hasNext() throws IOException;

    /**
     * Advance to the next code point, which {@link #getCodePoint()}
     * then returns.
     * 
     * @throws IOException if reading from the underlying input fails
     */
    void next() throws IOException;
   
    /**
     * Close the underlying IO or NIO object.
     * 
     * @throws IOException from the underlying object.
     */
    @Override
    void close() throws IOException;
}
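To show how the split between next() and getCodePoint() plays out in an implementation, here is a minimal sketch of a source that reads UTF-16 chars from a Reader and combines surrogate pairs. ReaderSourceSketch, ReaderSource, and codePointsOf() are hypothetical, and the nested interface is a trimmed copy of the one above so that the example compiles on its own.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;
import java.util.stream.IntStream;

final class ReaderSourceSketch {

    // Trimmed copy of the interface above so the sketch compiles on its own.
    interface CodePointSource extends Closeable {
        int getCodePoint();
        boolean hasNext() throws IOException;
        void next() throws IOException;
    }

    /** Hypothetical source that combines surrogate pairs from a Reader. */
    static final class ReaderSource implements CodePointSource {
        private final Reader in;
        private int current = -1;    // code point made current by next()
        private int lookahead = -2;  // -2: nothing read ahead; -1: end of input

        ReaderSource(Reader in) {
            this.in = in;
        }

        @Override
        public int getCodePoint() {
            return current;
        }

        @Override
        public boolean hasNext() throws IOException {
            if (lookahead == -2) {
                lookahead = in.read();
            }
            return lookahead >= 0;
        }

        @Override
        public void next() throws IOException {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            char first = (char) lookahead;
            lookahead = -2;
            current = first;
            if (Character.isHighSurrogate(first)) {
                int second = in.read();
                if (second >= 0 && Character.isLowSurrogate((char) second)) {
                    current = Character.toCodePoint(first, (char) second);
                } else {
                    lookahead = second;  // unpaired surrogate: keep what was read
                }
            }
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /** Collects every code point of {@code text} by driving the source. */
    static int[] codePointsOf(String text) {
        IntStream.Builder out = IntStream.builder();
        try (CodePointSource source = new ReaderSource(new StringReader(text))) {
            while (source.hasNext()) {
                source.next();
                out.add(source.getCodePoint());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.build().toArray();
    }
}
```

Note that hasNext() is where the read-ahead happens, which is why it has to declare IOException just as the interface does.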

Code Point Sink

package com.frank_mitchell.codepoint;

import java.io.Closeable;
import java.io.Flushable;
import java.io.IOException;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

/**
 * Write Unicode code points to external output.
 *
 * @author Frank Mitchell
 */
public interface CodePointSink extends Appendable, Flushable, Closeable {
    /**
     * Writes a single code point to underlying output.
     * @param cp code point
     * @throws IOException if the underlying output throws an exception
     */
    void putCodePoint(int cp) throws IOException;

    /**
     * Write a stream of code points to underlying output.
     * @param cps stream of code points
     * @throws IOException if the underlying output throws an exception
     */
    default void putCodePoints(final IntStream cps) throws IOException {
        PrimitiveIterator.OfInt iter = cps.iterator();
        while (iter.hasNext()){
            putCodePoint(iter.nextInt());
        }
    }

    @Override
    default Appendable append(final char c) throws IOException {
        // TODO: keep a buffer so we can detect surrogates?
        putCodePoint(c);
        return this;
    }

    @Override
    default Appendable append(final CharSequence csq) throws IOException {
        putCodePoints(csq.codePoints());
        return this;
    }

    @Override
    default Appendable append(final CharSequence csq,
                                final int start, 
                                final int end)
            throws IOException {
        return append(csq.subSequence(start, end));
    }

    /**
     * Flush the underlying output's buffers, if any.
     * @throws IOException if the underlying output throws an exception
     */
    @Override
    void flush() throws IOException;

    /**
     * Close the underlying output.
     * @throws IOException if the underlying output throws an exception
     */
    @Override
    void close() throws IOException;
}
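As an illustration of how little an implementer has to write, the sketch below backs a sink with a StringBuilder and supplies only putCodePoint(), flush(), and close(). BuilderSinkSketch, BuilderSink, and demo() are hypothetical names, and the nested interface is a trimmed copy of the one above so that the example compiles on its own.

```java
import java.io.Closeable;
import java.io.Flushable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

final class BuilderSinkSketch {

    // Trimmed copy of the interface above so the sketch compiles on its own.
    interface CodePointSink extends Appendable, Flushable, Closeable {
        void putCodePoint(int cp) throws IOException;

        default void putCodePoints(IntStream cps) throws IOException {
            PrimitiveIterator.OfInt iter = cps.iterator();
            while (iter.hasNext()) {
                putCodePoint(iter.nextInt());
            }
        }

        @Override
        default Appendable append(char c) throws IOException {
            putCodePoint(c);
            return this;
        }

        @Override
        default Appendable append(CharSequence csq) throws IOException {
            putCodePoints(csq.codePoints());
            return this;
        }

        @Override
        default Appendable append(CharSequence csq, int start, int end)
                throws IOException {
            return append(csq.subSequence(start, end));
        }
    }

    /** Hypothetical sink that collects code points in a StringBuilder. */
    static final class BuilderSink implements CodePointSink {
        private final StringBuilder out;

        BuilderSink(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void putCodePoint(int cp) {
            out.appendCodePoint(cp);
        }

        @Override
        public void flush() {
            // Nothing is buffered.
        }

        @Override
        public void close() {
            // Nothing to release.
        }
    }

    /** Writes a few values through the sink and returns the collected text. */
    static String demo() {
        StringBuilder buffer = new StringBuilder();
        try (BuilderSink sink = new BuilderSink(buffer)) {
            sink.putCodePoint(0x1F600);
            sink.append('!').append(" ok");
            sink.putCodePoints("xyz".codePoints());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```

Every append() overload and putCodePoints() funnel into the single putCodePoint() method, so that is the only place an implementation has to encode anything.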

ForCharsets Annotation

package com.frank_mitchell.codepoint;

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.nio.charset.Charset;

/**
 * An annotation on a {@link CodePointSource} or {@link CodePointSink} that
 * informs {@link CodePoint} that its instances only work on
 * specific character sets.
 * For example, a Source might be optimized to read only UTF-8 (and, by
 * extension, ASCII), so it would declare
 * {@code @ForCharsets(names = "UTF-8")}.
 * Listing "ASCII" would be helpful but not necessary, as {@code CodePoint}
 * infers ASCII from UTF-8.
 *
 * @author Frank Mitchell
 */
@Retention(RetentionPolicy.RUNTIME)
public @interface ForCharsets {

    /**
     * Denotes the names of the {@link Charset}s this object handles.
     * Each name should be a legal charset name, as checked by
     * {@link Charset#isSupported(java.lang.String)}.
     *
     * @return the names of the {@link Charset}s a Source or Sink handles
     */
    String[] names();
}
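A provider would presumably read this annotation through reflection when deciding whether a wrapper suits a requested charset. The following is a minimal sketch of that check; ForCharsetsSketch, Utf8OnlySource, AnySource, and declaredCharsets() are hypothetical, and the nested annotation is a copy of the one above so that the example compiles on its own.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ForCharsetsSketch {

    // Copy of the annotation above so the sketch compiles on its own.
    @Retention(RetentionPolicy.RUNTIME)
    @interface ForCharsets {
        String[] names();
    }

    /** Hypothetical source restricted to UTF-8. */
    @ForCharsets(names = {"UTF-8"})
    static final class Utf8OnlySource { }

    /** Hypothetical source with no charset restriction. */
    static final class AnySource { }

    /** Returns the declared charset names, or an empty list if unrestricted. */
    static List<String> declaredCharsets(Class<?> wrapper) {
        ForCharsets restriction = wrapper.getAnnotation(ForCharsets.class);
        if (restriction == null) {
            return Collections.emptyList();
        }
        return Arrays.asList(restriction.names());
    }
}
```

Because the annotation's element is called names rather than value, uses have to spell out names = ... explicitly; the RUNTIME retention is what makes getAnnotation() see it at all.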

MIT License

Copyright 2023 Frank Mitchell

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


  1. As of this writing, it’s hardcoded to create an instance of the default implementation.

  2. ServiceLoader automatically instantiates a “service” class with a zero-argument constructor and caches it for further use. codepoint, on the other hand, instantiates a wrapper class with a constructor for the object’s class, superclass, or implemented interface and an optional java.nio.Charset, then throws it away when reading is done. I thought about changing the protocol to a zero-argument constructor followed by a method to set the current input object, but not only is that hard with generics, it also requires wrappers to reset their state if that method’s called again. I decided to go with the usual Java convention of using an input instance once, closing it, then throwing it away.