FileIO

simple file I/O for node.js

2010-01-11

What is it?

Rather than using promises, FileIO starts with an API built only on callback functions. This approach is efficient and a little lighter than a promise style, and a promise API can be trivially implemented on top of a callback-based one. The callback API leads naturally to continuation-passing style: rather than calling a function and then manipulating its return value, we call a function, say readFile, which will generate some data, possibly asynchronously, and we provide it with a function to handle that data.

In FileIO, API methods that expose asynchronous operations return a /continuable/. A continuable is a function which takes a continuation as an argument, and then perhaps does some asynchronous I/O, and calls the continuation with the result. Here "continuation" can be read as a synonym for "callback function"; the only difference is one of emphasis.

The main difference between a promise style and a callback style is the sequencing of creating the asynchronous request and providing the handler function to deal with the eventual value.

// promise style
var promise=readFile("11.txt")
promise.addCallback(function(data){print(data)})

// callback style
var continuable=readFile("11.txt")
continuable(function(data){print(data)})

In the promise style, the promise is created first, which conceptually spawns some I/O process in the background immediately. Then a handler is attached to the promise to deal with the event that will be emitted when the data becomes available. Promises carry some complexity because the handler is attached /after/ the promise is created. In the callback style, a continuable is returned, which can be passed around, manipulated, and combined with other continuables, much like a promise; unlike a promise, however, no I/O is performed until the continuation is provided. On the second line of the callback example, the function that prints the data is provided, the asynchronous I/O request is made, and eventually the continuation is called with the contents of the file...
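To make the difference in sequencing concrete, here is a minimal sketch. The names laterValue, promiseFrom, and started are hypothetical and not part of FileIO. laterValue stands in for readFile, with a timer in place of real I/O, and promiseFrom shows one way a promise-like object with addCallback might be layered on top of a continuable.

```javascript
// Counts how many times the simulated "I/O" has actually been started.
var started = 0;

// Hypothetical stand-in for readFile: returns a continuable that delivers
// `value` on a later tick. Nothing runs until a continuation is supplied.
function laterValue(value) {
  return function continuable(continuation) {
    started++; // the work begins only here
    setTimeout(function () { continuation(value); }, 0);
  };
}

// Hypothetical promise-like wrapper over a continuable. It starts the work
// at once and stores the result for handlers that are attached afterwards.
function promiseFrom(continuable) {
  var done = false, result, handlers = [];
  continuable(function (value) {
    done = true;
    result = value;
    handlers.forEach(function (handler) { handler(value); });
  });
  return {
    addCallback: function (handler) {
      if (done) { handler(result); } else { handlers.push(handler); }
    }
  };
}

var c = laterValue("contents of 11.txt");
// started is still 0: creating the continuable did no work
c(function (data) { console.log(data); });
// started is now 1, and the data is printed on a later tick
```

The extra state inside promiseFrom (done, result, handlers) is the bookkeeping a promise needs because a handler may arrive before or after the value does; the continuable needs none of it.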

Errors

...or, perhaps, an error.

In the example above we neglected to show how errors are dealt with, so let's rectify that. Any I/O action can fail; for example, readFile will fail if the file does not exist.

// handlers for success and failure
function successHandler(data){
  print("success!\n")
  print(data)}
function failureHandler(error){
  print("failure:\n")
  print(error)}

// promise style
var promise=readFile("11.txt")
promise.addCallback(successHandler)
promise.addErrback(failureHandler)

// callback style
var continuable=readFile("11.txt")
continuable(either(failureHandler,successHandler))

In the promise case, one of two events will be emitted, either a success event, or a failure event. We have the option of adding event handlers for either, neither, or both. In the callback style implemented here, there is only one callback function which receives either a success result or a failure result. In this style it becomes easier to deal with errors than to ignore them, which encourages good habits. To distinguish in the handler between success and failure we use the Either type and helper functions like either.

The Either Type

The Either type is an idiom borrowed directly from Haskell's beautiful type system, where it is commonly used to deal with operations that may either succeed or fail, which is exactly what we have here. (If you doubt that a type system can be beautiful, or if "static typing" makes you think of Java or C++, I highly recommend Haskell!)

In the example above, the continuation passed will be called with a value of type Either Error String.

This means the value is either an Error or a String, and that the two can be told apart unambiguously at runtime. More generally, a value of type Either a b is either a value of type a, called a Left, or a value of type b, called a Right, where a and b can be any types at all. At runtime, what you do with an Either value is test whether it is a Left or a Right, and then extract the value of the corresponding type from it.

The Either type is an incredibly useful, powerful, and practical tool. In JavaScript an Either return value does away with messy and unpredictable type checking, and leads to simple and elegant APIs. JavaScript programmers spend a great deal of time and energy trying to find ways to determine at runtime what the type of some value is, and sometimes this is quite hard or even impossible. Dealing with an Either result is very simple: one tests whether it is a Left or a Right value, and then handles it appropriately. In my experience, JavaScript programming is greatly simplified by the discipline of doing away with runtime type checks altogether, whenever and wherever possible, which turns out to be almost everywhere. (The exception is generally at API boundaries, where type checking is a convenience to the user, to catch errors in the use of the API as early as possible and fail with an appropriate error message.)

In FileIO, an Either value is represented as an array of two elements, the first of which is either 0 or 1, indicating Left or Right, and the second of which is the actual value, of the appropriate type. There are of course many other ways Either could be implemented, and the implementation details matter little; what matters is that if a function receives a value of type Either a b, it can determine the result unambiguously to be a Left or a Right, without resorting to typeof or other error-prone techniques.

If we want to explicitly deal with the Either type in the example above, we could pass a callback which tests whether the argument is a Left or Right and then deals with the error or a success result. Instead here we used the either() helper function, which takes two functions, one for each of the Left and Right types, and returns a function that handles an Either value by testing it and then dispatching to the appropriate function.
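As a rough illustration of the representation described above, an Either and the either() helper might look something like the sketch below. This is not FileIO's actual source. The constructor names left and right are hypothetical, and the sketch assumes that 0 marks a Left and 1 marks a Right.

```javascript
// Assumed encoding: [0, value] is a Left, [1, value] is a Right.
// The names left and right are hypothetical helpers for this sketch.
function left(value) { return [0, value]; }
function right(value) { return [1, value]; }

// either :: (a → c, b → c) → Either a b → c
// Takes one handler per case and returns a function that dispatches on the tag.
function either(onLeft, onRight) {
  return function (e) {
    return e[0] === 0 ? onLeft(e[1]) : onRight(e[1]);
  };
}

var describe = either(
  function (error) { return "failure: " + error.message; },
  function (data) { return "success: " + data; }
);

console.log(describe(right("Alice was beginning to get very tired")));
// prints "success: Alice was beginning to get very tired"
console.log(describe(left(new Error("no such file"))));
// prints "failure: no such file"
```

Because the tag travels with the value, the two cases stay distinct even when both sides carry the same JavaScript type, for example a Left holding an error string and a Right holding file contents.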

Streams

Say we want to open a file and read the data in the file by chunks, dealing with each chunk in turn. This is streaming, which permits dealing with huge data while using only constant memory. How can we provide an interface to a stream in JavaScript?

In FileIO, the answer is, again, using functions. A stream is a function, which takes a consumer as its first argument. A consumer is also a function, similar to the continuations or callback functions used above, but, unlike continuations, a consumer may be called multiple times as the stream generates events.

Let's use the node REPL to demonstrate. If you have node installed you can follow along with the commands below.

inimino@boshi:~/fileIO$ wget http://www.gutenberg.org/files/11/11.txt
inimino@boshi:~/fileIO$ wget http://boshi.inimino.org/3box/fileIO/fileIO_cps.js
inimino@boshi:~/fileIO$ rlwrap node-repl
node> require.paths.push('.') // make sure the module's location is in the require path
node> puts=require('sys').puts // puts is a convenient way to dump a string to the terminal
node> file=require('fileIO_cps') // require the module
node> s=file.streamLines('11.txt','ascii')
[Function]

The value returned by streamLines is a stream which will generate a chunk for each line in the file, followed by eof. Note that it is shown by the REPL as a function.

To begin streaming, we provide the stream with a consumer. The stream will call this consumer repeatedly with 'chunk', 'error', or 'eof' as the first argument, followed by further arguments as appropriate for the message type.

This approach enforces a kind of pure OO programming, in the message-based style of Smalltalk, where objects have no observable or modifiable internal state and only support the operation of sending a message.

First we will try a stream consumer that simply prints the kinds of messages it receives.

node> s(function(message_type){puts(message_type)})
node> 
chunk

Note that the consumer is called asynchronously, so in the transcript above the "node> " prompt returns instantly, and then "chunk" is printed after it, once the stream has opened the file and read from it.

A stream will generate its first message once the consumer is provided to it and any necessary I/O is complete. This is analogous to the behavior of a continuable, which begins its I/O immediately once it receives a continuation. Like a continuable, a stream will not do any I/O until the consumer is provided, so creating a stream is an inexpensive and synchronous operation. The first message of a stream will either be a 'chunk' message or an 'error'. Once an 'error' or 'eof' message is sent to the consumer, the stream will not produce any further chunks and is considered closed.

When a chunk is received, the stream may be called again, with either a 'next' message, to get the next chunk, or a 'close' message, to throw away the rest of the stream and do any necessary cleanup, in this case closing the file handle.

node> s('next')
chunk
node> s('next')
chunk
node> s('next')
chunk

In this case, we notice that the chunk lines are printed before the next "node>" prompt appears. This is because the stream of lines is backed by another stream of larger chunks read from the filesystem. Since the first chunk read from the stream contained more than one line, the line stream doesn't need to do any I/O to return the next line, so it happens synchronously, using the same API. This is another significant difference between Promises or EventEmitters and a callback style. In a callback style, the same API can support either asynchronous retrieval of data, or synchronous callback when the data is already immediately available, and it is not necessary to re-enter the event loop to make any of this work. Promises can do this by storing the data that is generated, in case an event handler is added later, but especially for event streams, the callback style is cleaner and does away with a lot of overhead and complexity. This means callbacks are ideal for operations where events will often occur immediately (perhaps because they are cached or queued), while some may happen asynchronously. The callback style makes the case where a result is immediately available as efficient as it can be.

Rather than print out "chunk" for each line in the rest of the file, we decide we are done with this stream and close it:

node> s('close')
node>

This doesn't return anything, and the consumer function isn't called (the API assumes you know when you've closed the stream yourself). Now that we've closed the stream, we can try to read from it again:

node> s('next')
error

Note that "error" is just the message type; a more informative message is passed as the second parameter to the consumer.
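The protocol walked through above can be sketched without any file I/O. The arrayStream helper below is hypothetical and not part of FileIO: it streams the elements of an in-memory array, always calls the consumer synchronously, and omits the Int that accompanies a real 'chunk' message. The wording of its error message is likewise invented.

```javascript
// Hypothetical helper: a stream over an in-memory array of strings.
function arrayStream(items) {
  var consumer = null, index = 0, closed = false;

  // Send exactly one message to the consumer.
  function emit() {
    if (closed) {
      consumer('error', 'stream is closed');
    } else if (index < items.length) {
      consumer('chunk', items[index++]);
    } else {
      closed = true; // after eof the stream is considered closed
      consumer('eof');
    }
  }

  return function stream(message) {
    if (typeof message === 'function') {
      consumer = message; // providing a consumer starts the stream
      emit();
    } else if (message === 'close') {
      closed = true; // cleanup would go here; the consumer is not called
    } else if (message === 'next' && consumer) {
      emit(); // 'next' before any consumer is attached is ignored
    }
  };
}

var log = [];
var s = arrayStream(['a\n', 'b\n', 'c\n']);
s(function (type, param) { log.push(type); });
s('next');
s('close');
s('next');
console.log(log.join(' ')); // prints "chunk chunk error"
```

The typeof test on the argument is the kind of check reserved for an API boundary: it is what lets a single function accept either a consumer or a control message.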

Let's open a new stream for the same file, and construct a more interesting consumer function this time:

node> s=file.streamLines('11.txt','ascii')
[Function]
node> s(function(msg,param){if(msg=='chunk'){ puts(param.slice(0,-1)); s('next') }else{ puts(msg) }})

This function prints each line from the file to the terminal, and requests the next chunk each time by sending a 'next' message to the stream. If any other kind of message is seen, it prints the message to the terminal. Note that streamLines does not trim the line separator from the lines it returns, and puts() also adds a line terminator, so the output would be double-spaced if we did not strip the last character using .slice(0,-1).
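A consumer of the same shape can also gather chunks instead of printing them. As a sketch only, the hypothetical collect helper below drains a stream and hands the outcome to a single continuation as an Either, assuming the array encoding described earlier ([0, error] for a Left, [1, chunks] for a Right). The twoLines stream is a stand-in, included so that the example runs without a file.

```javascript
// Hypothetical helper: turn a stream into a continuable that yields
// Either Error [String] in the assumed array encoding.
function collect(stream) {
  return function continuable(continuation) {
    var chunks = [];
    stream(function (type, param) {
      if (type === 'chunk') {
        chunks.push(param);
        stream('next'); // ask for the following chunk
      } else if (type === 'eof') {
        continuation([1, chunks]); // Right: every chunk that was read
      } else {
        continuation([0, param]); // Left: the error detail
      }
    });
  };
}

// Stand-in stream over two fixed lines, present only for demonstration.
function twoLines() {
  var lines = ['first\n', 'second\n'], i = 0, consumer = null;
  return function stream(message) {
    if (typeof message === 'function') consumer = message;
    else if (message !== 'next') return; // 'close': nothing to clean up
    if (i < lines.length) consumer('chunk', lines[i++]);
    else consumer('eof');
  };
}

collect(twoLines())(function (result) {
  console.log(result[0] === 1 ? result[1].length + ' lines' : 'error: ' + result[1]);
});
// prints "2 lines"
```

When a stream answers 'next' synchronously, as this stand-in does, each chunk adds a stack frame, so a consumer written this way suits streams that return to the event loop regularly.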

Type Notation

To clarify the API, the type of each function is given in a notation similar to that used in Haskell. Unlike in Haskell, the type annotations are not part of the program and are not verified by the compiler, but they provide a hint as to how an API method is intended to be used.

Here is the currently exported API:

// either :: (a → c, b → c) → Either a b → c
// readFile :: (path::String, encoding::String) → Continuable (Either Error String)
// writeFile :: (path::String, data::String, encoding::String, mode::Int) → Continuable (Either Error _)
// appendFile :: (String, String, String, Int) → Continuable (Either Error _)
// copyFile :: (src::String, dest::String) → Continuable (Either Error _)
// streamFile :: (path::String, encoding::String) → Stream
// streamFileBy :: (path::String, encoding::String, separator::String) → Stream
// streamLines(path,encoding) equivalent to streamFileBy(path,encoding,"\n")
// writeStream :: (path::String, encoding::String, mode::Int) → Stream → Continuable (Either Error Int)

And the stream type:

// type Stream = ( Consumer StreamMsg | StreamCtrlMsg ) → _
// type StreamMsg = "error" | "chunk", String, Int | "eof"
// type StreamCtrlMsg = "next" | "close"
// type Consumer a = a → _ // intended to be called repeatedly
// type Opt a = a | Undefined

The analogy with the Haskell type system isn't quite as useful for impure functions, like the Consumer type, which Haskell doesn't have.