The DistKV tutorial¶
Installation¶
This part is easy: run pip install distkv.
You now have, or should have, a distkv command-line utility. If not, use this script:
#!/usr/bin/env python3
import sys
# sys.path[0:0] = (".", "../asyncserf") # for development
from distkv.command import cmd
cmd()
You also need a running Serf <http://serf.io> daemon.
Start the server¶
You start an initial server with this command:
one $ distkv server -i Testing $(hostname)
Running.
By default, your DistKV server talks to the local Serf process. You can change the destination in the config file, or override the setting on the command line with -C:
one $ distkv -C server.serf.host=my-serfer server -i Testing $(hostname)
You can now retrieve the root value:
one $ distkv client data get
"Testing"
one $
As the purpose of DistKV is to be a distributed key-value storage, you can start another server on a different host:
two $ distkv server $(hostname)
Running.
It takes a few seconds for the servers to sync up with each other. You can verify that the second server has successfully synced up:
two $ distkv client data get
"Testing"
two $
(NB: The root value is not special; by convention, it identifies the DistKV network.)
You can now kill the first server and restart it:
one $ killall distkv
one $ distkv server $(hostname)
Running.
You must never start a server with the -i option unless you’re creating a new and separate DistKV network.
You can create separate networks by changing the server.root config variable. Such networks do not collide with each other, other than sharing Serf gossip bandwidth.
Data commands¶
You might want to add an alias for “distkv client data” so that you don’t have to type so much. In bash:
one $ dkd() { distkv client data "$@"; }
Then, you can store arbitrary data at arbitrary DistKV paths:
one $ dkd set -ev 123 one two three
one $ dkd set -ev 1234 one two three four
one $ dkd set -v Duh one two three four five
one $ dkd get one two three
123
one $ dkd get one two three four five
"Duh"
one $
The -e flag tells the set command to evaluate the given data as a Python expression. You can store numbers, True/False/None, binary and Unicode strings, lists/tuples, and hashes composed of these.
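The difference between a plain -v and -ev can be illustrated with a small sketch. The helper name parse_cli_value is hypothetical and not part of DistKV, and ast.literal_eval is used here as a safe stand-in; the real command may evaluate expressions differently.

```python
import ast


def parse_cli_value(text, evaluate=False):
    """Return the value that would be stored for a command-line argument.

    Hypothetical helper: with evaluate=True the text is parsed as a
    Python literal, otherwise it is kept as a plain string.
    """
    if evaluate:
        return ast.literal_eval(text)
    return text


as_number = parse_cli_value("123", evaluate=True)   # like: set -ev 123
as_string = parse_cli_value("Duh")                   # like: set -v Duh
as_tuple = parse_cli_value("('Test', 42, False)", evaluate=True)
```

Without the evaluation step every argument stays a string, which matters as soon as other code expects a number.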
All entries’ values are independent of each other. DistKV’s storage is organized hierarchically, among other reasons for ease of retrieval:
one $ dkd get -rd_ one
two:
  three:
    _: 123
    four:
      _: 1234
      five:
        _: Duh
one $
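The nested output above can be reproduced from a flat mapping of paths to values. This is only a sketch of the layout, assuming that each entry's own value sits under the "_" key next to its children; build_tree is a hypothetical name, not a DistKV function.

```python
def build_tree(entries):
    """Nest {path-tuple: value} into dicts; each value goes under "_"."""
    tree = {}
    for path, value in entries.items():
        node = tree
        for key in path:
            # descend, creating intermediate levels as needed
            node = node.setdefault(key, {})
        node["_"] = value
    return tree


entries = {
    ("two", "three"): 123,
    ("two", "three", "four"): 1234,
    ("two", "three", "four", "five"): "Duh",
}
tree = build_tree(entries)
```

Keeping the value under a separate key is what lets an entry carry both a value and sub-entries, as "three" and "four" do here.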
DistKV’s internal data are stored under a special null root key. You can use distkv client internal dump to display them. This command behaves like distkv client data get -rd_. It accepts a path prefix.
Persistent storage¶
DistKV keeps everything in memory (for now). If you want your data to survive a power outage, you might want to tell your server to save them:
one $ distkv client log dest /var/local/lib/distkv.$(date +%Y%m%d).state
This command writes the current state to this file. The server keeps the file open and appends new records to it. The log dest command has options to either write an incremental change record or just write a one-shot dump.
When you need to restart your DistKV system from scratch, simply pass the newest saved state file:
one $ distkv server -l $(ls -t /var/local/lib/distkv.*.state | head -1) $(hostname)
Running.
Obviously, if your state dump files are incremental, you should instead do something like this:
one $ distkv server -l <(cat /var/local/lib/distkv.*.state) $(hostname)
Running.
These commands are somewhat safe to use on a network that’s already running; your node may run with old state for a few seconds until it retrieves the updates that happened while it was down. An option to delay startup until that process has completed is on the TODO list.
In a typical DistKV network, at most two or three nodes will use persistent storage; all others simply sync up with their peers whenever they are restarted.
Authorization¶
Initially, DistKV comes up without any authorization scheme. However, advanced uses require the ability to distinguish between users.
Let’s set up a “root” user:
one $ distkv client auth -m password user add name=joe password?=Code
Code: ******
one $ distkv client auth -m password user list
joe
one $ distkv client auth -m password init -s
Authorization switched to password
one $
(The input at the “Code:” prompt is not echoed.)
After this point, you can no longer use DistKV without a password:
one $ dkd get
ClientAuthRequiredError: You need to log in using: password
one $
one $ distkv client -a "password name=joe password?=Code" data get
Code: ******
"Root"
one $
Internal data are stored in a separate DistKV subtree whose path starts with a None element. You can display it:
one $ distkv client -a "password name=joe password=test123" data get -rd_
null:
  auth:
    _:
      current: password
    password:
      user:
        joe:
          _:
            _aux: null
            password: !!binary |
              7NcYcNGWMxapfjrDQIyYNa2M8PPBvHA1J8MCZVNPda4=
one:
  two:
    three:
      _: 123
      four:
        _: 1234
        five:
          _: Duh
As you can see, passwords are encrypted – hashed, actually. The exact scheme depends on the auth method. The data below None (or “null” in YAML syntax) are otherwise inaccessible.
NB: nothing prevents you from using the string "null" as an ordinary key name:
one $ distkv client -a "password name=joe password=test123" data set -v bar null foo
one $ distkv client -a "password name=joe password=test123" data get -rd_
…
'null':
  foo:
    _: bar
For experimentation, there’s also a _test method which only exposes a user name:
one $ distkv client auth -m _test user add name=joe
one $ distkv client auth -m _test user add name=root
one $ distkv client auth -m _test init
one $ distkv client data get
ClientAuthRequiredError: You need to log in using: _test
one $ dkv() { distkv client -a "_test name=joe" "$@"; }
one $ dkv data get
123
one $
We’ll use that user and alias in the following sections.
ACLs and distributed servers¶
DistKV servers use the client protocol when they sync up. Thus, when you set up authorization, you must teach your servers to authenticate to their peers:
one $ distkv -C connect.auth="_test name=joe" server $(hostname)
Access restrictions¶
A user can be restricted from accessing or modifying DistKV data.
Let’s say that we’d like to create a “write-only” data storage:
one $ dkv acl set writeonly -a "xc" wom '#'
one $ dkv data set -ev 42 wom foo bar
one $ dkv data set -ev 43 wom foo bar
ServerError: (<AclEntry:[None, 'acl', 'writeonly', 'wom', '#']@<NodeEvent:<Node: test1 @10> @4 1> ='cx'>, 'w')
one $ dkv data get wom foo
ServerError: (<AclEntry:[None, 'acl', 'writeonly', 'wom', '#']@<NodeEvent:<Node: test1 @10> @4 1> ='cx'>, 'r')
one $
As you can see, this allows Joe to create new entries below wom, but he cannot change existing ones, nor can he read back the values he wrote.
Note that we also created a “root” user who doesn’t have ACL restrictions. If we had not, we’d now be locked out of our DistKV storage because “no matching ACL” means “no access”.
A user who has an ACL set can no longer modify the system, because the None element that separates system data from the rest cannot match a wildcard. ACLs for system entries are on the TODO list.
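The wildcard rule can be sketched as a small path matcher. This is an illustration, not DistKV's matcher: it assumes that '#' stands for zero or more path elements, and it implements the statement above that a wildcard never matches the None element. The name path_matches is hypothetical.

```python
def path_matches(pattern, path):
    """Return True if `path` matches `pattern`.

    Assumptions of this sketch: "#" matches zero or more elements,
    and a wildcard never swallows the None system marker.
    """
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        # let "#" swallow 0, 1, 2, ... leading elements of the path
        for n in range(len(path) + 1):
            if n > 0 and path[n - 1] is None:
                break  # wildcards stop at the system marker
            if path_matches(rest, path[n:]):
                return True
        return False
    if not path:
        return False
    return head == path[0] and path_matches(rest, path[1:])


user_ok = path_matches(("wom", "#"), ("wom", "foo", "bar"))
system_ok = path_matches(("#",), (None, "auth"))
```

Here user_ok is true while system_ok is false, which is why a user who is restricted by wildcard ACLs cannot reach the system subtree.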
Code execution¶
DistKV doesn’t just store passive data: you can also use it to distribute actual computing. We’ll demonstrate that here.
First we feed some interesting code into DistKV:
one $ dkv code set the answer <<END
> print("Forty-Two!")
> return 42
> END
Then we set up a one-shot run-anywhere instance:
one $ dkv run set -c "the answer" -t 0 a question
This doesn’t actually execute any code because the executor is not part of the DistKV server. (The server may gain an option to do that too, but not yet.) So we run it:
one $ dkv run all
Forty-Two!
(Initially this takes some time, because the run command needs to co-ordinate with other runners. There are none currently, but it can’t know that.)
The code will not run again unless we either re-set --time, or set a repeat timer with --repeat.
Start times are mostly accurate. There are two reasons why they might not be:
- the co-ordination system has a periodic window where it waits for the next message. This causes a delay of up to two seconds.
- TODO: The current leader might decide that it’s too busy and wants to delegate starting a particular job to some other node in the cluster. This incurs some delay, more if the recipient is no longer available.
This method will run the code in question on any node. You can also run code on one specific node:
one $ dkv run -n $(hostname) set -c "same answer" -t 0 a question
one $ dkv run -n $(hostname) all
The one-node-only runner and the any-node runner are distinct.
Errors¶
Nobody is perfect, and neither is code. Sometimes things break. DistKV remembers errors. To demonstrate, let’s first provoke one:
one $ dkv code set the error <<END
> raise RuntimeError("Owch")
> END
one $ dkv run set -c "the error" -t 0 what me worry
one $ dkv run all # if it's not still running
20:24:13.935 WARNING:distkv.errors:Error ('.distkv', 'error', 'test1', 16373) test1: Exception: Owch
The list of errors is now no longer empty:
one $ dkv error list -d_
[ some YAML ]
You can limit the error list to specific subtrees. This command has the same effect:
one $ dkv error list -d_ .distkv run any
except that the path is shortened for improved usability.
Error details are available; add the -a option. You can also filter errors on a specific node, which only includes that node’s details.
The Python API¶
Command lines are all well and good, but DistKV gets really interesting when you use it from Python.
Let’s start by simply setting some value:
import anyio
from distkv.client import open_client

async def dkv_example():
    async with open_client() as client:
        # set() is a coroutine, so it must be awaited
        await client.set("one","two","three", value=("Test",42,False), chain=None)

anyio.run(dkv_example)
That was easy. Now we’d like to update that entry:
async def dkv_example():
    async with open_client() as client:
        res = await client.get("one","two","three")
        # build the new value from the one we just read
        ret = await client.set("one","two","three", value=("Test",res.value[1]+1,False), chain=res.chain)
        assert res.chain != ret.chain
The chain parameter is important: it tells DistKV which change caused the old value. So if somebody else changed your one two three entry while your program was running, you’d get a collision and the set would fail.
set returns a new chain so you can update your value multiple times.
Deleting an entry clears the chain: the source of a non-existing value doesn’t matter.
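The collision check can be pictured with an in-memory stand-in. This is a sketch under simplifying assumptions, not DistKV's implementation: TinyStore and ChainConflict are hypothetical names, and the "chain" is reduced to a counter, whereas a real chain records which change caused the value.

```python
class ChainConflict(Exception):
    """Raised when the caller's chain does not match the stored one."""


class TinyStore:
    """In-memory stand-in for a server; a 'chain' here is just a counter."""

    def __init__(self):
        self._data = {}  # path tuple -> (value, chain)
        self._tick = 0

    def get(self, *path):
        return self._data.get(path, (None, None))

    def set(self, *path, value, chain):
        _, current = self._data.get(path, (None, None))
        if chain != current:
            raise ChainConflict(f"expected chain {current!r}, got {chain!r}")
        self._tick += 1
        self._data[path] = (value, self._tick)
        return self._tick


store = TinyStore()
first = store.set("one", "two", "three", value=("Test", 42, False), chain=None)
value, chain = store.get("one", "two", "three")
second = store.set("one", "two", "three", value=("Test", value[1] + 1, False), chain=chain)

try:
    # somebody still holding the old chain tries to write
    store.set("one", "two", "three", value="stale", chain=first)
    conflict = False
except ChainConflict:
    conflict = True
```

The writer with the outdated chain is rejected instead of silently overwriting the newer value, which is the behaviour described above.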
Watching for Changes¶
The result of the previous get was static. If somebody else subsequently changes it, you wouldn’t know. Let’s fix that:
async def dkv_example():
    async with open_client() as client:
        async with client.watch("one", fetch=True) as watcher:
            async for res in watcher:
                if 'path' not in res:
                    continue
                path = ' '.join(str(x) for x in res.path)
                if 'value' in res:
                    print(f"{path}= {res.value}")
                else:
                    print(f"{path}: deleted")
fetch=True will send the current state in addition to any changes.
The 'path' not in res test filters the notification that tells you that the subtree you requested is complete.
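The same filtering logic can be tried without a server by folding a list of messages into a snapshot. The message shapes below are assumptions made for this sketch (plain dicts, and a hypothetical "state" marker for the message that carries no path); apply_updates is not a DistKV function.

```python
def apply_updates(messages):
    """Fold watch-style messages into a {path: value} snapshot."""
    state = {}
    for msg in messages:
        if "path" not in msg:
            continue  # e.g. the "initial state is complete" notification
        path = tuple(msg["path"])
        if "value" in msg:
            state[path] = msg["value"]
        else:
            state.pop(path, None)  # no value: the entry was deleted
    return state


messages = [
    {"path": ("two", "three"), "value": 123},
    {"path": ("two", "three", "four"), "value": 1234},
    {"state": "uptodate"},               # hypothetical marker without a path
    {"path": ("two", "three", "four")},  # deletion
]
snapshot = apply_updates(messages)
```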
Active objects¶
While watching for changes is nice, organizing the resulting objects tends to be tedious. DistKV comes with a method that does this for you:
from distkv.client import ClientRoot, ClientEntry, NotGiven

class OneEntry(ClientEntry):
    async def set_value(self, value):
        # called whenever this entry's value arrives or changes
        await super().set_value(value)
        path = ' '.join(str(x) for x in self.subpath)
        if value is NotGiven:
            print(f"{path}: deleted")
        else:
            print(f"{path}= {value}")

class OneRoot(ClientRoot):
    @classmethod
    def child_type(cls, name):
        return OneEntry

async def dkv_example():
    async with open_client() as client:
        async with client.mirror("one", root_type=OneRoot) as root:
            # At this point you have the sub-tree in memory
            assert root['two']['three'].value[1] >= 42
            while True:
                await anyio.sleep(99999)
        # at this point the sub-tree is still there, but won't be updated
except that in a real program you’d do some real work instead of sleeping.
Verification¶
Complex data should be clean. Storing "Hello there!" in a value that the rest of your code expects to be an integer is likely to have unwanted effects. For instance, we’d like to enforce that all quota values in our site statistics are integer percentages.
First, we define the type:
one $ dkv type set -g 0 -g -2 -g 123 -b 1.2 -b '"Hello"' int <<END
> if int(value) != value: raise ValueError("not an integer")
> END
one $
As you can see, data types must be accompanied by example values that include both “good” and “bad” examples.
You can also declare subtypes:
one $ dkv type set -g 0 -g 99 -g 100 -b -1 -b 101 int percent <<END
> if not (0 <= value <= 100): raise ValueError("not a percentage")
> END
one $
The example values must pass the supertype’s checks.
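The role of the example values can be sketched in plain Python. This is an illustration rather than DistKV's code: register and passes are hypothetical helpers, and the sketch reads the rule above as "good and bad examples alike must pass the supertype's checks".

```python
def check_int(value):
    if int(value) != value:
        raise ValueError("not an integer")


def check_percent(value):
    if not (0 <= value <= 100):
        raise ValueError("not a percentage")


def passes(checks, value):
    """True if `value` passes every check in order (supertype first)."""
    try:
        for check in checks:
            check(value)
    except (ValueError, TypeError):
        return False
    return True


def register(parent_checks, check, good, bad):
    """Accept a type only if its examples behave as declared."""
    for value in good + bad:
        if not passes(parent_checks, value):
            raise ValueError(f"example {value!r} fails a supertype check")
    for value in good:
        if not passes([check], value):
            raise ValueError(f"good example {value!r} was rejected")
    for value in bad:
        if passes([check], value):
            raise ValueError(f"bad example {value!r} was accepted")
    return parent_checks + [check]


int_type = register([], check_int, good=[0, -2, 123], bad=[1.2, "Hello"])
percent_type = register(int_type, check_percent, good=[0, 99, 100], bad=[-1, 101])
```

With these definitions, passes(percent_type, 12) holds, while 123 and the string "123" are both rejected, each by a different check in the chain.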
Now we associate the test with our data:
one $ dkv type match -t int -t percent stats '#' quota
Then we store some value:
one $ dkv data set -v 123 stats foo bar quota
ServerError: ValueError("not an integer")
Oops. We forgot that arguments are strings:
one $ dkv data set -ev 123 stats foo bar quota
ServerError: ValueError('not a percentage')
one $ dkv data set -ev 12 stats foo bar quota
DistKV does not test that existing values match your restrictions.
Data mangling¶
Structured data are great, but some clients want boring single-value items. For instance, some home automation systems want to use "ON" and "OFF" messages, while your active code is much happier with a bool value – or even a mapping that also carries the time of last change, so that a “turn off after 15 minutes” rule will actually work.
Let’s write a simple number codec:
one $ dkv codec set -i '"12.5"' 12.5 -o 13.25 '"13.25"' floatstr
Enter the Python script to encode 'value'.
return str(value)
Enter the Python script to decode 'value'.
return float(value)
^D
As you can see, you need to give the codec some examples. Here they’re symmetric but that’s not a requirement; for instance, a bool codec for our home automation system could accept a wide range of true-ish or false-ish strings but it would always output ON and OFF.
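Such an asymmetric bool codec, together with a check of its examples, might look like the following sketch. The word lists and the names encode_bool, decode_bool and check_codec are assumptions for illustration; the example pairs follow the order used by -i (external, internal) and -o (internal, external) above.

```python
TRUE_WORDS = {"on", "true", "yes", "1"}
FALSE_WORDS = {"off", "false", "no", "0"}


def encode_bool(value):
    """Internal bool -> external string; always ON or OFF."""
    return "ON" if value else "OFF"


def decode_bool(text):
    """External string -> internal bool; accepts several spellings."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean word: {text!r}")


def check_codec(encode, decode, in_examples, out_examples):
    """Verify (external, internal) input pairs and (internal, external) output pairs."""
    for external, internal in in_examples:
        if decode(external) != internal:
            raise ValueError(f"decoding {external!r} did not yield {internal!r}")
    for internal, external in out_examples:
        if encode(internal) != external:
            raise ValueError(f"encoding {internal!r} did not yield {external!r}")
    return True


bool_ok = check_codec(
    encode_bool, decode_bool,
    in_examples=[("yes", True), ("Off", False)],
    out_examples=[(True, "ON"), (False, "OFF")],
)
float_ok = check_codec(str, float, [("12.5", 12.5)], [(13.25, "13.25")])
```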
Associating this codec with a path is slightly more involved:
one $ dkv codec convert -c floatstr floatval monitor '#' value
This associates
- the float-to-string codec we just created
- all paths that start with monitor and end with value
with the name floatval. As not every user needs stringified numbers, we now need to tell DistKV which users to apply this codec to:
one $ dkv auth user modify --aux codec=floatval name=joe
Thus, Joe will read and write values as strings:
one $ dkv data set -v 99.5 monitor a b c value
one $ dkv data set -v 12.3 monitor a b c thing
one $ dkv data get -rd_ monitor
a:
  b:
    c:
      value:
        _:
          99.5
      thing:
        _:
          '12.3'
This is especially helpful if Joe is an MQTT gateway which only transmits (binary) strings.
DistKV currently can’t translate paths, or merge many values to one entry’s attributes.
Your best bet is to use active objects and add some code to their set_value methods that translates between one and the other. There are some caveats:
- All such data are stored twice.
- Don’t change a value that didn’t in fact change; otherwise you’ll generate an endless loop.
- You need to verify that the two trees match when you start up, and decide which is more correct. (The tock stamp will help you here.) Don’t overwrite changes that arrive while you do that.
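The second caveat, not writing a value that did not change, can be sketched with two plain dicts standing in for the two trees. The helper mirror_value and the ON/OFF translation are assumptions for illustration; a real implementation would live in the set_value methods of your active objects.

```python
def mirror_value(target, path, value):
    """Copy `value` to `target[path]`; return True only if a write happened."""
    if path in target and target[path] == value:
        return False  # unchanged: writing again would only echo back to us
    target[path] = value
    return True


structured = {("light",): True}  # the tree your code works with
flat = {}                        # the tree with "ON"/"OFF" strings
writes = 0

# structured -> flat: translate the bool to a string
if mirror_value(flat, ("light",), "ON" if structured[("light",)] else "OFF"):
    writes += 1

# the change in `flat` triggers the reverse translation, which must be a no-op
if mirror_value(structured, ("light",), flat[("light",)] == "ON"):
    writes += 1
```

Only the first step writes anything; the reverse translation finds an identical value and stops, so the two handlers do not keep triggering each other.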