The DistKV tutorial

Installation

This part is easy. pip install distkv.

You now have, or should have, a distkv command-line utility. If not, use this script:

#!/usr/bin/env python3

import sys
# sys.path[0:0] = (".", "../asyncserf")  # for development

from distkv.command import cmd
cmd()

You also need a running Serf <http://serf.io> daemon.

Start the server

You start an initial server with this command:

one $ distkv server -i Testing $(hostname)
Running.

By default, your DistKV server will talk to the local Serf process. You can configure the destination by adapting the config file:

one $ distkv -C server.serf.host=my-serfer server -i Testing $(hostname)

You can now retrieve the root value:

one $ distkv client data get
"Testing"
one $

As the purpose of DistKV is to be a distributed key-value storage, you can start another server on a different host:

two $ distkv server $(hostname)
Running.

This will take a few seconds for the servers to sync up with each other. You can verify that the second server has successfully synced up:

two $ distkv client data get
"Testing"
two $

(NB: The root value is not special; by convention, it identifies the DistKV network.)

You can now kill the first server and restart it:

one $ killall distkv
one $ distkv server $(hostname)
Running.

You must never start a server with the -i option unless you’re creating a new and separate DistKV network.

You can create separate networks by changing the server.root config variable. Such networks do not collide with each other, other than sharing Serf gossip bandwidth.

Data commands

You might want to add an alias for “distkv client data” so that you don’t have to type so much. In bash:

one $ dkd() { distkv client data "$@"; }

Then, you can store arbitrary data at random DistKV nodes:

one $ dkd set -ev 123 one two three
one $ dkd set -ev 1234 one two three four
one $ dkd set -v Duh one two three four five
one $ dkd get one two three
123
one $ dkd get one two three four five
"Duh"
one $

The -e flag tells the set command to evaluate the given data as a Python expression. You can store numbers, True/False/None, binary and Unicode strings, lists/tuples, and hashes composed of these.

All entries’ values are independent. DistKV’s storage is organized hierarchically, (among other reasons) for ease of retrieval:

one $ dkd get -rd_ one
two:
  three:
    _: 123
    four:
      _: 1234
      five:
        _: Duh
one $

DistKV’s internal data are stored under a special null root key. You can use distkv client internal dump to display them. This command behaves like distkv client data get -rd_. It accepts a path prefix.

Persistent storage

DistKV keeps everything in memory (for now). If you want your data to survive a power outage, you might want to tell your server to save them:

one $ distkv client log dest /var/local/lib/distkv.$(date +%Y%m%d).state

This command writes the current state to this file. The server keeps the file open and appends new records to it. The log dest has options to either write an incremental change record, or to just write a one-shot dump.

When you need to restart your DistKV system from scratch, simply pass the newest saved state file:

one $ distkv server -l $(ls -t /var/local/lib/distkv.*.state | head -1) $(hostname)
Running.

Obviously, if your state dump files are incremental, you should instead do something like this:

one $ distkv server -l <(cat /var/local/lib/distkv.*.state) $(hostname)
Running.

These commands are somewhat safe to use on a network that’s already running; your node may run with old state for a few seconds until it retrieves the updates that happened while it was down. An option to delay startup until that process has completed is on the TODO list.

In a typical DistKV network, at most two or three nodes will use persistent storage; all others simply syncs up with their peers whenever they are restarted.

Authorization

DistKV initially doesn’t come up with any authorization scheme. However, advanced uses require the ability to distinguish between users.

Let’s set up a “root” user:

one $ distkv client auth -m password user add name=joe password?=Code
Code: ******
one $ distkv client auth -m password user list
joe
one $ distkv client auth -m password init -s
Authorization switched to password
one $

(The input at the “Code:” prompt is not echoed.)

After this point, you can no longer use DistKV without a password:

one $ dkd get
ClientAuthRequiredError: You need to log in using: password
one $

one $ distkv client -a "password name=joe password?=Code" data get
Code: ******
"Root"
one $

Internal data are stored in a separate DistKV subtree that starts with a None value. You can display it:

one $ distkv client -a "password name=joe password=test123" data get -rd_
null:
  auth:
    _:
      current: password
    password:
      user:
        joe:
          _:
            _aux: null
            password: !!binary |
              7NcYcNGWMxapfjrDQIyYNa2M8PPBvHA1J8MCZVNPda4=
one:
  two:
    three:
      _: 123
      four:
        _: 1234
        five:
          _: Duh

As you can see, passwords are encrypted – hashed, actually. The exact scheme depends on the auth method. The data below None (or “null” in YAML syntax) are otherwise inaccessible.

NB: nothing prevents you from using the string "null" as an ordinary key name:

one $ distkv client -a "password name=joe password=test123" data set -v bar null foo
one $ distkv client -a "password name=joe password=test123" data get -rd_
…
'null':
  foo:
    _: bar

For experimentation, there’s also a _test method which only exposes a user name:

one $ distkv client auth -m _test user add name=joe
one $ distkv client auth -m _test user add name=root
one $ distkv client auth -m _test init
one $ distkv client data get
ClientAuthRequiredError: You need to log in using: _test
one $ dkv() { distkv client -a "_test name=joe" "$@"; }
one $ dkv data get
123
one $

We’ll use that user and alias in the following sections.

ACLs and distributed servers

DistKV servers use the client protocol when they sync up. Thus, when you set up authorization, you must teach your servers to authenticate to their peer:

one $ distkv -C connect.auth="_test name=joe" server $(hostname)

Access restrictions

A user can be restricted from accessing or modifying DistKV data.

Let’s say that we’d like to create a “write-only” data storage:

one $ dkv acl set writeonly -a "xc" wom '#'
one $ dkv data set -ev 42 wom foo bar
one $ dkv data set -ev 43 wom foo bar
ServerError: (<AclEntry:[None, 'acl', 'writeonly', 'wom', '#']@<NodeEvent:<Node: test1 @10> @4 1> ='cx'>, 'w')
one $ dkv data get wom foo
ServerError: (<AclEntry:[None, 'acl', 'writeonly', 'wom', '#']@<NodeEvent:<Node: test1 @10> @4 1> ='cx'>, 'r')
one $

As you can see, this allows the user to write to arbitrary values, but Joe cannot change anything, nor can he read the values which he wrote.

Note that we also created a “root” user who doesn’t have ACL restrictions. If we had not, we’d now be locked out of our DistKV storage because “no matching ACL” means “no access”.

A user who has an ACL set can no longer modify the system, because the None element that separates system data from the rest cannot match a wildcard. ACLs for system entries are on the TODO list.

Code execution

DistKV doesn’t just store passive data: you can also use it to distribute actual computing. We’ll demonstrate that here.

First we feed some interesting code into DtstKV:

one $ dkv code set the answer <<END
> print("Forty-Two!")
> return 42
> END

Then we set up a one-shot run-anywhere instance:

one $ dkv run set -c "the answer" -t 0 a question

This doesn’t actually execute any code because the executor is not part of the DistKV server. (The server may gain an option to do that too, but not yet.) So we run it:

one $ dkv run all
Forty-Two!

(Initially this takes some time, because the run command needs to co-ordinate with other runners. There are none currently, but it can’t know that.)

The code will not run again unless we either re-set --time, or set a repeat timer with --repeat.

Start times are mostly-accurate. There are two reasons why they might not be:

  • the co-ordination system has a periodic window where it waits for the next message. This causes a delay of up to two seconds.
  • TODO: The current leader might decide that it’s too busy and wants to delegate starting a particular job to some other node in the cluster. This incurs some delay, more if the recipient is no longer available.

This method will run the code in question on any node. You can also run code on one specific node; simply do

one $ dkv run -n $(hostname) set -c “same answer” -t 0 a question one $ dkv run -n $(hostname) all

The one-node-only runner and the any-node runner are distinct.

Errors

Nobody is perfect, and neither is code. Sometimes things break. DistKV remembers errors. To demonstrate, let’s first provoke one:

one $ dkv code set the error <<END
> raise RuntimeError("Owch")
> END
one $ dkv run set -c "the error" -t 0 what me worry
one $ dkv run all  # if it's not still running
20:24:13.935 WARNING:distkv.errors:Error ('.distkv', 'error', 'test1', 16373) test1: Exception: Owch

The list of errors is now no longer empty:

one $ dkv error list -d_
[ some YAML ]

You can limit the error list to specific subtrees. This command has the same effect:

one $ dkv error list -d_ .distkv run any

except that the path is shortened for improved useability.

Error details are available; add the -a option. You can also filter errors on a specific node, which only includes that node’s details.

The Python API

Command lines are all well and good, but DistKV gets really interesting when you use it from Python.

Let’s start by simply setting some value:

import anyio
from distkv.client import open_client

async def dkv_example():
   async with open_client() as client:
      client.set("one","two","three", value=("Test",42,False), chain=None)

anyio.run(dkv_example)

That was easy. Now we’d like to update that entry:

async def dkv_example():
   async with open_client() as client:
      res = client.get("one","two","three")
      ret = client.set("one","two","three", value=("Test",v[1]+1,False), chain=res.chain)
      assert res.chain != ret.chain

The chain parameter is important: it tells DistKV which change caused the old value. So if somebody else changed your one two three entry while your program was running, you’d get a collision and the set would fail.

set returns a new chain so you can update your value multiple times.

Deleting an entry clears the chain: the source of a non-existing value doesn’t matter.

Watching for Changes

The result of the previous get was static. If somebody else subsequently changes it, you wouldn’t know. Let’s fix that:

async def dkv_example():
   async with open_client() as client:
      async with client.watch("one", fetch=True) as watcher:
         async for res in watcher:
            if 'path' not in res:
               continue
            path = ' '.join(str(x) for x in res.path)
            if 'value' in res:
               print(f"{path}= {res.value}")
            else:
               print(f"{path}: deleted")

fetch=True will send the current state in addition to any changes. The 'path' not in res test filters the notification that tells you that the subtree you requested is complete.

Active objects

While watching for changes is nice, organizing the resulting objects tends to be tedious. DistKV comes with a method that does this for you:

from distkv.client import ClientRoot, ClientEntry, NotGiven

class OneEntry(ClientEntry):
   async def set(self, value):
      await super().set_value()
      path = ' '.join(str(x) for x in self.subpath)
      if value is NotGiven:
         print(f"{path}= {value}")
      else:
         print(f"{path}: deleted")

class OneRoot(ClientRoot):
   @classmethod
   def child_type(cls, name):
      return OneEntry

async def dkv_example():
   async with open_client() as client:
      async with client.mirror("one", root_type=OneRoot) as root:
         # At this point you have the sub-tree in memory
         assert root['two']['three'].value[1] >= 42

         while True:
            await anyio.sleep(99999)
      pass
      # at this point the sub-tree is still there, but won't be updated

except that in a real program you’d do some real work instead of sleeping.

Verification

Complex data should be clean. Storing "Hello there!" in a value that the rest of your code expects to be an integer is likely to have unwanted effects. For instance, we’d like to enforce that all quota values in our site statistics are integer percentages.

First, we define the type:

one $ ./kv client type set -g 0 -g -2 -g 123 -b 1.2 -b '"Hello"' int <<END
> if int(value) != value: raise ValueError("not an integer")
> END
one $

As you can see, data types must be accompanied by example values that include both “good” and “bad” examples.

You can also declare subtypes:

one $ dkv type set -g 0 -g 99 -g 100 -b -1 -b 101 int percent <<END
> if not (0 <= value <= 100): raise ValueError("not a percentage")
> END
one $

The example values must pass the supertype’s checks.

Now we associate the test with our data:

one $ dkv type match -t int -t percent stats '#' quota

Then we store some value:

one $ dkv data set -v 123 stats foo bar quota
ServerError: ValueError("not an integer")

Oops. We forgot that arguments are strings:

one $ dkv data set -ev 123 stats foo bar quota
ServerError: ValueError('not a percentage')
one $ dkv data set -ev 12 stats foo bar quota

DistKV does not test that existing values match your restrictions.

Data mangling

Structured data are great, but some clients want boring single-value items. For instance, some home automation systems want to use "ON" and "OFF" messages, while your active code is much happier with a bool value – or even a mapping that also carries the time of last change, so that a turn off after 15 minutes rule will actually work.

Let’s write a simple number codec:

one $ dkv codec set -i '"12.5"' 12.5 -o 13.25 '"13.25"' floatstr
Enter the Python script to encode 'value'.
return str(value)
Enter the Python script to decode 'value'.
return float(value)
^D

As you can see, you need to give the codec some examples. Here they’re symmetric but that’s not a requirement; for instance, a bool codec for our home automation system could accept a wide range of true-ish or false-ish strings but it would always output ON and OFF.

Associating this codec with a path is slightly more involved:

one $ dkv codec convert -c floatstr floatval monitor '#' value

This associates

  • the float-to-string codec we just created
  • all paths that start with monitor and end with value

with the name floatval. As not every user needs stringified numbers, we now need to tell DistKV which users to apply this codec to:

one $ dkv auth user modify --aux codec=floatval name=joe

Thus, Joe will read and write values as strings:

one $ dkv data set -v 99.5 monitor a b c value
one $ dkv data set -v 12.3 monitor a b c thing
one $ dkv data get -rd_ monitor
a:
  b:
    c:
      value:
        _:
          99.5
      thing:
        _:
          '12.3'

This is especially helpful if Joe is an MQTT gateway which only transmits (binary) strings.

DistKV currently can’t translate paths, or merge many values to one entry’s attributes.

Your best bet is to use active objects and add some code to their set_value methods that translates between one and the other. There are some caveats:

  • All such data are stored twice.
  • Don’t change a value that didn’t in fact change; otherwise you’ll generate an endless loop.
  • You need to verify that the two trees match when you start up, and decide which is more correct. (The tock stamp will help you here.) Don’t

overwrite changes that arrive while you do that.