chunkd: single-node data storage service
chunkd is a low-level key/value storage service for anonymous data objects. The intended use is as a low-level piece of large-scale distributed data storage infrastructure, for Project Hail.
Keys are opaque binary data, from 1 byte to 1024 bytes in size.
Values ("the data") may be any size, be it 1 byte, 1 megabyte, 100 gigabytes, anything. Object size is bounded only by available disk storage on the local node.
Currently the service is TCP-based. But its message structure is intended to be portable, making Infiniband/RDMA, SCTP, and other protocols a future possibility.
From chunkd's simple GET/PUT/DELETE API, you can build cluster-wide redundant storage services, exporting data storage capability from any device supported by your OS, via a single, standard, easy to use, device-agnostic interface.
The network protocol consists of TCP-based messages. Each message has a fixed-size binary header, followed by optional data. There is no separate authentication step; each message is signed and checksummed using HMAC.
|NOP||Verify request/response path (including auth)||user||response code|
|GET||Retrieve object data, metadata||user, key, key length||response code, SHA1 checksum, last-modified time, data length, data|
|GET metadata||Retrieve object metadata||user, key, key length||response code, SHA1 checksum, last-modified time, data length|
|PUT||Store object||user, key, key length, data, data length||response code, SHA1 checksum|
|DELETE||Delete object||user, key, key length||response code|
|LIST||Retrieve list of all objects||user||response code, XML document listing all known object metadata associated with the user|
A single chunkd instance is generally designed to be run on a "cloud node", generally a "1U data center server", which probably equates to a physical or virtualized instance of: single multi-core CPU, 2-4GB RAM, gige, single disk. That example lends itself to 1000's of such chunkd storage nodes.
But it is also quite valid to have much larger chunkd nodes, perhaps tied to larger servers, 10gige, multiple disks and/or SAN networks.
Current version 1.0 technical design goals include
- Multiple worker threads, to enable I/O parallelism. I/O parallelism is necessary because it
- is the only way to max out storage hardware command and completion queues
- enables greater optimizations on a TCQ/NCQ-enabled storage device, compared to slower command-at-a-time solutions
- No internal data caching. Leverage kernel pagecache.
- Use POSIX filesystem API for our "database". Avoid sql, db4, sqlite, etc.
- Register ourselves in a CLD cell, for discovery by higher level services.
Early beta. Protocol implementation is feature complete, but server is not yet optimized for performance. Most notably, the server is still single-threaded.
Strong cryptographic authentication is built into chunkd's network protocol, but the server's implementation of authenication is essentially a placeholder, no-op implementation. It should be easy to plug in SASL, Kerberos, or PAM into the current, modular framework.
chunkd is loosely modelled on the chunk server described in the GoogleFS paper (PDF).
Developers: browse the git repo, or check out from git://git.kernel.org/pub/scm/daemon/distsrv/chunkd.git
A wealth of projects large and small awaits interested contributors. Programmers will need to learn git, participate on the hail-devel mailing list, check out the source code, build and set up the project. Contributions from non-programmers are welcome as well -- documentation, feedback, and in general using the software.
Here are some suggestions for projects:
- document and debug chunkd network protocol
- document setup and system administration procedures
- write chunkd client libraries for Java, Python, C++, or other languages