0:37:25 So Beelay has a requirement that it
has to work with encrypted chunks.
0:37:30 So we do this compression
and then encryption on top of it,
0:37:34 and then send that to the Sync Server.
0:37:36 The Sync Server can see the
membership, because it has to know who
0:37:39 it can send these chunks around to.
0:37:41 So the Sync Server does have
access to the membership
0:37:44 of each doc, but not the
content of the document.
0:37:47 So if you make a request, it checks:
okay, are you somebody
0:37:50 who has the rights to
have this sent to you, yes or no,
0:37:53 and then it'll send it to you or not.
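As a minimal sketch of that gatekeeping step, assuming a hypothetical server that stores encrypted chunks plus a membership list per document (the names and types here are illustrative, not Beelay's actual API):

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative only: an opaque, already-encrypted chunk. The server never
/// sees plaintext, only ciphertext it can forward.
struct EncryptedChunk {
    ciphertext: Vec<u8>,
}

/// Hypothetical per-document state held by a sync server (or any peer):
/// the membership it is allowed to serve, plus the encrypted chunks.
struct DocState {
    members: HashSet<String>, // in reality these would be public keys
    chunks: Vec<EncryptedChunk>,
}

struct SyncServer {
    docs: HashMap<String, DocState>,
}

impl SyncServer {
    /// Check the membership, then either hand over the ciphertext or refuse.
    fn fetch(&self, doc_id: &str, requester: &str) -> Option<&[EncryptedChunk]> {
        let doc = self.docs.get(doc_id)?;
        if doc.members.contains(requester) {
            Some(doc.chunks.as_slice())
        } else {
            None // not in the membership: nothing gets sent
        }
    }
}
```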
0:37:55 And this isn't only for sync servers.
If you connect to somebody
0:37:58 directly over Bluetooth, you'd
do the same thing, right?
0:38:01 Even if you can
both see the document.
0:38:04 There's nothing special
here about sync servers.
0:38:06 To do this sync, well, we're no
longer syncing individual ops, right?
0:38:10 Like, we could do that, but
then we lose the compression.
0:38:13 It's not great, right?
0:38:15 And ideally, we don't want people to
know, if somebody were to
0:38:19 break into your server, hey, here's how
everything's related to each other, right?
0:38:22 That compression and
encryption also hides
0:38:25 a little bit more of this data.
0:38:27 We do show the links between these
compressed chunks, but
0:38:30 we'll get to that in a second.
0:38:32 Essentially what we want to do is chunk
up the documents in such a way that
0:38:38 there's the fewest number of chunks to
get synced, and the longer the ranges of
0:38:43 Automerge ops that we get to
compress before we encrypt, the better, right?
0:38:48 On the, I'll call it client.
0:38:50 It's not really a client in
a local-first setting, right?
0:38:52 I mean the peer that's
sending to the sync server.
0:38:55 The more stuff that you have,
the better the compression is.
0:38:58 And chunking up the document here
means you're really
0:39:02 chunking up the history of operations
that then gets internally rolled up
0:39:07 into one snapshot of the document.
0:39:09 And that history could be very long.
0:39:11 And there's room for optimization.
0:39:14 That is the compression here,
where if you set, a ton of times,
0:39:19 hey, the name of the document is Peter,
0:39:22 and later you say, no, it's Brooke,
0:39:24 and later you say, no, it's Peter,
0:39:26 no, it's Johannes,
0:39:28 then you can compress it into,
for example, just the latest operation.
0:39:33 Yeah, exactly.
0:39:34 So to get more concrete about how
this works: if you take this
0:39:37 slider all the way to one end and you take
the entire history and run-length encode it,
0:39:40 do this Automerge compression,
0:39:45 you get very, very good compression.
0:39:47 If we take it to the far other
end, we go really granular.
0:39:50 Every op stays on its own, so it's
just each individual
0:39:55 op, and you don't get compression.
0:39:56 So there's something in between
here of, how can we chop up
0:39:59 the history in a way where I get
a nice balance between these two?
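As a toy illustration of the compaction intuition just described, assuming a single repeatedly-overwritten field (this is not Automerge's actual encoding, which is columnar run-length encoding over the whole op history):

```rust
/// Toy illustration only: a run of overwrites of the same field can be
/// compacted down to the last write. Automerge's real compression does not
/// literally drop ops, so treat this purely as a sketch of the idea.
#[derive(Debug, Clone)]
struct SetNameOp {
    counter: u64, // position in the linearized history
    value: String,
}

fn compact_to_latest(ops: &[SetNameOp]) -> Option<SetNameOp> {
    ops.iter().max_by_key(|op| op.counter).cloned()
}

fn main() {
    let ops = vec![
        SetNameOp { counter: 1, value: "Peter".into() },
        SetNameOp { counter: 2, value: "Brooke".into() },
        SetNameOp { counter: 3, value: "Peter".into() },
        SetNameOp { counter: 4, value: "Johannes".into() },
    ];
    // Only the final value matters for the snapshot of this field.
    println!("{:?}", compact_to_latest(&ops));
}
```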
0:40:04 When Automerge receives new ops, it has
to know where in the history to place them.
0:40:10 So you have this partial order,
you have this
0:40:12 typical CRDT lattice.
0:40:14 And then it puts that
into a strict order.
0:40:18 It orders all the events and
then plays over them like a log.
0:40:21 And this new event that you get,
maybe it becomes the first event.
0:40:24 Like, it could go way to the
beginning of history, right?
0:40:26 You don't know, because
everything's eventually consistent.
0:40:29 So if you do that linearization
first and then chop up the documents,
0:40:34 you have this problem where,
0:40:36 if I do this chunking, or you do this
chunking, well, it really depends
0:40:39 on what history we have, right?
0:40:41 And so it makes it very, very difficult
to have a small amount of redundancy.
0:40:46 So we found two techniques
that helped us with this.
0:40:49 One was, we take some particular
operation as a head and we
0:40:55 say, ignore everything else.
0:40:56 Only give me the history
for this operation.
0:40:58 Only its strict ancestors.
0:41:00 So even if there's something concurrent,
forget about all of that stuff.
0:41:04 So that gets us something stable
relative to a certain head.
0:41:08 And then to know where the
chunk boundaries are, we
0:41:13 run a hash hardness metric.
0:41:15 So the number of zeros at the
end of the hash for each op gives
0:41:20 you a knob. You can basically
say, for each individual op,
0:41:23 there may or may not be zeros, and
I'm happy with anything.
0:41:28 Or if I want a range of about
4, then give me two 0s at the
0:41:32 end, because 2 to the power of 2 is 4,
so I'll chunk
0:41:35 it up into runs of about four. And you make this
as big or as small as you want, right?
0:41:38 So now you have some way of
probabilistically chunking up the
0:41:41 documents relative to some head.
0:41:44 And you can say how big you want that to
be based on this hash hardness metric.
0:41:47 The advantage of this is even if
we're doing things relative to
0:41:51 different heads, we're going to
hit the same boundaries at a
0:41:54 given hash hardness level.
0:41:56 So now we're sharing how we're
chunking up the document.
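A sketch of that probabilistic boundary rule, assuming trailing zero bits of each op's hash are the hardness measure (the hash function, types, and names here are stand-ins, not the real Beelay implementation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for a content hash; a real system would use a cryptographic hash.
fn op_hash(op_bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    op_bytes.hash(&mut h);
    h.finish()
}

/// An op closes a chunk if its hash ends in at least `zero_bits` zero bits.
/// With k zero bits required, roughly one op in 2^k is a boundary, so the
/// expected chunk length is about 2^k ops.
fn is_boundary(op_bytes: &[u8], zero_bits: u32) -> bool {
    op_hash(op_bytes).trailing_zeros() >= zero_bits
}

/// Chunk a linearized run of ops (the ancestors of one chosen head) by the
/// rule above. Because the rule only looks at each op's own hash, two peers
/// chunking relative to different heads still agree on the boundaries they
/// both contain.
fn chunk_ops<'a>(ops: &'a [Vec<u8>], zero_bits: u32) -> Vec<&'a [Vec<u8>]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for (i, op) in ops.iter().enumerate() {
        if is_boundary(op, zero_bits) {
            chunks.push(&ops[start..=i]);
            start = i + 1;
        }
    }
    if start < ops.len() {
        chunks.push(&ops[start..]); // trailing ops not yet closed by a boundary
    }
    chunks
}
```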
0:41:59 And we assume that on average,
not all the time, but on
0:42:04 average, older operations will
have been seen by more people,
0:42:08 or, you know, by more and more peers.
0:42:11 So you're going to be appending things
mostly to the end of the document, right?
0:42:17 So you will less frequently
have something concurrent with the
0:42:20 first operation using this system.
0:42:22 That means that we can get really
good compression on older operations.
0:42:28 Let's take, I'm just picking numbers
out of the air here, but let's take
0:42:30 the first two thirds of the document,
which are relatively stable, and compress
0:42:34 those; we get really good compression.
0:42:36 And then encrypt it and
send it to the server.
0:42:38 And then of
the remaining third, let's take the
0:42:42 first two thirds of that and compress
them and send them to the server.
0:42:46 And then at some point we
get to each individual op.
0:42:48 This means that as the
document grows and changes,
0:42:52 we can take these smaller chunks and, as
they get pushed further and further into
0:42:56 history, whoever can actually
read them can recompress those ranges.
0:43:02 So, Alex has this, I think, really
fantastic name for this, which is
0:43:06 sedimentree, because it's almost acting in
sedimentary layers, but it's sedimen-tree
0:43:12 because you get a tree of these layers.
0:43:14 Yeah, it's cute, right?
0:43:15 And so if you want to do a sync,
let's say you're doing a sync
0:43:18 completely fresh, you've
never seen the document before.
0:43:21 You will get the really big chunk,
and then you'll move up a layer,
0:43:25 and you'll get the next biggest
chunk of history, and then you move
0:43:27 up a layer, and then eventually
get the last couple of ops.
0:43:30 So we can get you really good
compression, but again, it's this
0:43:32 balance of these two forces.
0:43:35 Or, if you've already seen the
first half of the document, you
0:43:38 never have to sync that chunk again.
0:43:39 You only need to get these higher
layers of the sedimentree sync.
0:43:44 So that's how we chunk up the document.
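A rough sketch of that layered fetch, under the assumption that a document's history is exposed as strata of encrypted chunk ids ordered from coarsest to finest (hypothetical types, not the actual sedimentree structures):

```rust
/// Hypothetical view of a document's history as strata, coarsest first:
/// level 0 holds one big compressed-and-encrypted chunk covering the oldest
/// ops, higher levels hold progressively smaller chunks, and the top level
/// is the last few loose ops.
struct Stratum {
    level: u32,
    chunk_ids: Vec<String>, // ids of the encrypted chunks at this level
}

/// Walk from the coarsest stratum to the finest and request only the chunks
/// this peer doesn't already have. A completely fresh peer downloads
/// everything; a mostly caught-up peer only touches the top levels.
fn plan_fetch<F: Fn(&str) -> bool>(strata: &[Stratum], already_have: F) -> Vec<String> {
    let mut wanted = Vec::new();
    for stratum in strata {
        for id in &stratum.chunk_ids {
            if !already_have(id) {
                wanted.push(id.clone());
            }
        }
    }
    wanted
}
```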
0:43:46 Additionally, and I'm not at all
going to go into how this thing works,
0:43:49 but if people are into sync systems,
this is a pretty cool paper.
0:43:53 It's called Practical Rateless Set
Reconciliation.
0:43:57 And it does really interesting things
with compressing all the information
0:44:02 you need to know what the other side has.
0:44:04 So in half a round trip, so in one
direction on average, you can get all
0:44:09 the information you need to know what
the delta is between your two sets.
0:44:13 Literally, what's the handful
of ops that we've diverged by, without
0:44:18 having to send all of the hashes?
0:44:20 So if people are into that
stuff, go check out that paper.
0:44:22 It's pretty cool.
0:44:23 But there's a lot of detail in
there that we're not
0:44:25 going to cover on this podcast.
0:44:26 Thanks a lot for explaining.
0:44:29 I suppose it's just the tip of
the iceberg of how Beelay works,
0:44:33 but I think it's important to get a
feeling for how this is a new world
0:44:37 in a way, where it's decentralized,
it is encrypted, et cetera.
0:44:42 There are really hard constraints on what
certain things can do, since you could
0:44:47 say, in your traditional development
mindset, you would just say, yeah,
0:44:52 let's treat the client like it's
just a Kindle with no
0:44:56 CPU in it, and let's have the server do as
much of the heavy lifting as possible.
0:45:01 I think that's the
muscle that we're used to so far.
0:45:04 But in this case, the server, even if it
has a super beefy machine, et cetera,
0:45:11 can't really do that, because it doesn't
have the access to do all of this work.
0:45:15 So the clients need to do it.
0:45:17 And when the clients
independently do so, they need to
0:45:21 eventually end up in the same spot.
0:45:23 Otherwise the entire system falls
over or it gets very inefficient.
0:45:27 So that sounds like a really
elegant system that you're
0:45:30 working on in that regard.
0:45:32 So with Beehive overall, again,
you're starting out here with
0:45:38 Automerge as the system that
drives the requirements, et cetera.
0:45:43 But I think your bigger ambition
here, your bigger goal, is that this
0:45:48 actually becomes a system that
at some point goes beyond just
0:45:54 applying to Automerge, and becomes a
system that applies to many more
0:45:59 local-first technologies in the space.
0:46:01 Say there are application framework authors
or other people building a
0:46:07 sync system, et cetera, and they'd be
interested in saying: hmm, instead
0:46:11 of us trying to come up with our
own research here for what it
0:46:17 means to do authentication and authorization
for our sync system, particularly if
0:46:23 you're doing it in a decentralized way.
0:46:25 What would be a good way for those
frameworks, those technologies, to
0:46:30 jump on the Beehive wagon?
0:46:33 So if they're already using
Automerge, I think that'll be
0:46:37 pretty straightforward, right?
0:46:38 You'll have bindings, it'll just work.
0:46:40 But Beehive doesn't have a hard
dependency on Automerge at all,
0:46:45 because it lives at this layer below.
Early on, we were like, well, should
0:46:50 we just weld it directly into Automerge?
0:46:51 Or, how much does
it really need to know about it?
0:46:55 And where we landed on this was:
you just need to have some kind
0:46:58 of way of saying, here's the
partial order between these events,
0:47:02 and then everything works.
0:47:04 So, just as an intuition:
0:47:07 you could put Git inside of Beehive,
and it would work. I don't think
0:47:11 GitHub's gonna adopt this anytime
soon, but if you had your own
0:47:14 Git syncing system, you
could do this, and it would work.
0:47:18 You just need to have some way of
ordering events next to each other.
0:47:22 And yes, then you have to get a little
bit more into slightly lower level APIs.
0:47:27 When I build stuff, I tend to
work in layers: here's the very
0:47:32 low level primitives, and then here's
a slightly higher level, and a slightly
0:47:35 higher level on top of that.
0:47:37 So people using it from Automerge
will just have add member, remove
0:47:40 member, and everything works.
0:47:41 To go down one layer, you have to wire
into it, here's how to do ordering.
0:47:47 And that's it.
0:47:48 And then everything else should
wire all the way through.
0:47:51 And you have to be able to
pass it serialized bytes.
0:47:53 So Beehive doesn't know anything
about this compression that we were
0:47:56 just talking about that Automerge does.
0:47:58 But you tell it, hey, this is
some batch, this is
0:48:02 some archive that I want to do.
0:48:03 It starts at this timestamp
and ends at that timestamp,
0:48:06 or, you know, logical clock.
0:48:07 Please encrypt this for me.
0:48:09 And it goes, sure, here you go.
0:48:10 Encrypted.
0:48:11 And off it goes.
0:48:12 So it has very, very few assumptions.
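A hypothetical sketch of that integration surface, not Beehive's actual API: the host system only needs to expose a partial order over its events and hand over opaque serialized bytes to be encrypted (all names and the stubbed encryption here are assumptions for illustration):

```rust
/// The host system (Automerge, Git, an event log, ...) exposes a partial
/// (causal) order over its events; the auth layer needs nothing else about
/// their structure.
trait PartiallyOrderedEvents {
    type EventId: Eq + Clone;

    /// Returns true if event `a` happened before (is an ancestor of) `b`.
    fn precedes(&self, a: &Self::EventId, b: &Self::EventId) -> bool;
}

/// An opaque archive the host wants protected: the auth layer doesn't
/// interpret the bytes, it just needs the logical range they cover.
struct ArchiveRequest<Id> {
    start: Id,           // e.g. a logical clock or op id
    end: Id,
    serialized: Vec<u8>, // e.g. a compressed Automerge chunk
}

struct EncryptedArchive {
    ciphertext: Vec<u8>,
}

/// "Please encrypt this for me." Stubbed out here; a real implementation
/// would derive keys from the group's membership state.
fn encrypt_archive<Id>(req: ArchiveRequest<Id>) -> EncryptedArchive {
    EncryptedArchive { ciphertext: req.serialized } // placeholder, no real crypto
}
```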
0:48:15 That's certainly something that I might
also pick up a bit further down the
0:48:18 road myself for LiveStore, where
the underlying substrate to sync data
0:48:23 around is an ordered event log.
0:48:26 And if I'm encrypting those events,
0:48:29 then I think that fulfills perfectly
the requirements that you've listed,
0:48:34 which are very few, for Beehive.
0:48:37 So I'm really looking forward
to when that gets further along.
0:48:40 So speaking of which, where
is Beehive right now?
0:48:43 I've seen the lab notebooks from what
you have been working on at Ink & Switch.
0:48:49 Can I get my hands on
Beehive already right now?
0:48:52 Where is it at?
0:48:54 What are the plans for the coming years?
0:48:56 So at the time that we're recording
this, at least, which is in early
0:48:59 December, there's unfortunately not
a publicly available version of it.
0:49:02 I really hoped we'd have it ready by now,
but unfortunately we're still wrapping
0:49:06 up the last few items in there.
0:49:09 But in Q1, we plan to have a release.
0:49:12 As I mentioned before, there are some
changes required to Automerge to consume it,
0:49:16 specifically to
manage revocation history.
0:49:19 Say somebody got kicked out, but we're
still in this eventually consistent world.
0:49:23 Automerge needs to know
how to manage that.
0:49:24 But
0:49:25 managing things like sync, encryption,
all of that stuff, we hope to have
0:49:30 in, and I'm not going to commit
the team to any particular timeframe
0:49:33 here, but we'll say in
the next coming weeks.
0:49:37 Right now, the team is myself;
0:49:39 John Mumm, who joined a couple of months
into the project, and has been working
0:49:43 on BeeKEM, focused primarily on BeeKEM,
which is, again, I'm just going to
0:49:48 throw out words here for people who
are interested in this stuff, related to
0:49:51 TreeKEM, but we made it concurrent. TreeKEM
is one of the primitives
0:49:55 for MLS, Messaging Layer Security.
0:49:57 He's been doing great work there.
0:49:58 And Alex, amongst the many, many
things that Alex Good does between
0:50:02 writing the sync system and maintaining
Automerge and all of the
0:50:07 community stuff that he does,
has also been lending a hand.
0:50:11 So I'm sure for
Beehive, in a way, you're just
0:50:15 scratching the surface, and there's
probably enough work here to
0:50:19 fill another few years, maybe
even decades, worth of ambitious work.
0:50:24 Can you paint a picture of that? Right now
0:50:28 you're probably working through the kind
of POC or just the table stakes things.
0:50:33 What are some of the way
more ambitious, longer-term things
0:50:36 that you would like to see
under the umbrella of Beehive?
0:50:39 Yeah.
0:50:40 So, there's a few.
0:50:41 Yes.
0:50:42 And we have this running list internally
of, what would a V2 look like?
0:50:45 So, one is adding a
little policy language.
0:50:48 I think the bang
for the buck that you get from having
0:50:51 something like UCAN's policy language
0:50:53 is just so high.
0:50:54 It just gives you so much flexibility.
0:50:56 Hiding the membership from even
the sync server is possible;
0:51:00 it just requires more engineering.
0:51:02 So there are many, many places in
here where zero-knowledge proofs, I
0:51:06 think, would be very useful, for
people who know what those are.
0:51:09 Essentially it would let the sync
server say: yes, I can send you bytes
0:51:14 without knowing anything about you.
0:51:16 Right,
0:51:17 but it would still deny others.
0:51:19 And right now it basically needs
to run more logic to actually
0:51:22 enforce those auth rules.
0:51:25 Yeah.
0:51:25 So today you have to sign a message
that says, I signed this with the
0:51:30 private key matching a public key
that you know about in this membership. We
0:51:36 could hide the entire membership from
the sync server and still do this,
0:51:39 without revealing even who's
making the request, right?
0:51:41 Like, that would be awesome.
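An illustrative sketch of roughly what that check looks like today, per the description above, with signature verification stubbed out (the types and names are hypothetical, not Beehive's real API):

```rust
use std::collections::HashSet;

/// The requester signs the request with their private key; the server checks
/// that the matching public key appears in the document's membership and
/// that the signature is valid.
struct SignedRequest {
    doc_id: String,
    public_key: Vec<u8>,
    payload: Vec<u8>,
    signature: Vec<u8>,
}

fn signature_is_valid(_req: &SignedRequest) -> bool {
    // Placeholder: a real server would verify e.g. an Ed25519 signature here.
    true
}

fn authorize(req: &SignedRequest, membership: &HashSet<Vec<u8>>) -> bool {
    // The key has to be in the membership *and* have produced the signature.
    membership.contains(&req.public_key) && signature_is_valid(req)
}
```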
0:51:43 In fact, and this is a bit of a
tangent, I think there's a number
0:51:45 of places where that class of
technology would be really helpful.
0:51:49 Even for things like, in CRDTs,
there's this challenge where you have
0:51:53 to keep all the history for all time.
0:51:55 And I think with zero-knowledge proofs,
we can actually, and this would
0:51:58 very much be a research project, but I
think it's possible to delete history but
0:52:02 still maintain cryptographic proofs that
things were done correctly, and compress
0:52:06 that down to a couple of bytes,
basically. But that's a bit of a tangent.
0:52:10 I would love to work on that at some
point in the future, but for
0:52:13 Beehive, yeah: hiding more metadata,
hiding the membership
0:52:17 from the group, making
all the signatures post-quantum.
0:52:21 Even the main
recommendations from NIST, the
0:52:26 U.S. government agency that handles
these things, only just came out.
0:52:30 So we're still kind of waiting
for good libraries on it and
0:52:34 all of this stuff and what have you.
0:52:36 But yeah, big chunks of it are already
0:52:40 post-quantum, but making it fully
post-quantum would be great.
0:52:43 And then, yeah, adding all kinds of bells
and whistles and features,
0:52:46 making it faster. It's not going
to have its own compression, because it
0:52:50 relies so heavily on cryptography, so
it doesn't compress super well, right?
0:52:54 So we're going to need to figure
out our own version of what
0:52:58 Automerge has with run-length encoding.
0:52:59 What is our version of that, given
that we can't easily run-length encode
0:53:02 encrypted things, right?
0:53:04 Or signatures, or
all of this.
0:53:06 So there's a lot of stuff
down in the plumbing.
0:53:08 Plus I think this policy language
would be really, really helpful.
0:53:11 That sounds awesome.
0:53:12 Both in terms of new features and
capabilities, no pun intended, being
0:53:16 added here, but also in terms of
removing overhead from the system and
0:53:22 simplifying the surface area by doing
more of the clever work internally,
0:53:27 which simplifies the system overall.
0:53:29 That sounds very intriguing.
0:53:31 The other thing worth noting with
this, I think both to point
0:53:35 a way into the future and also to
draw a boundary around what Beehive
0:53:39 does and doesn't do, is identity.
0:53:41 So Beehive only knows about public
keys, because those are universal.
0:53:46 They work everywhere.
0:53:47 They don't require a naming
system, any of this stuff.
0:53:50 We have lots of ideas and opinions
on how to do a naming system.
0:53:55 But if you look at,
for example, Bluesky, under
0:53:58 the hood, all of the accounts are
managed with public keys, and then
0:54:02 you map a name to them using DNS.
0:54:04 So either you're using
myname.bsky.social,
0:54:07 or you have your own
domain name, like I'm expede.wtf
0:54:12 on Bluesky, for example, right?
0:54:13 Because I own that domain name
and I can edit the TXT record.
0:54:15 And that's great, and it definitely
gives users a lot of agency over
0:54:20 how to name themselves, right?
0:54:21 Or, you know, there are
other related systems.
0:54:24 But it's not local-first,
because it relies on DNS.
0:54:28 So, how could I invite you to a
group without having to know your public
0:54:32 key? We're probably going to ship,
I would say, just because it's
0:54:35 relatively easy to do, a system called
Edge Names, based on pet names, where
0:54:40 basically I say, here's my contact book.
0:54:42 I invited you.
0:54:43 And at the time I
invited you, I named you
0:54:45 Johannes, right?
0:54:46 And I named Peter "Peter", and so on and
so forth, but there's no way to prove
0:54:52 it; that's just my name for them.
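A small sketch of that petname idea, assuming nothing more than a local mapping from my names to public keys (hypothetical types, not a real Beehive API):

```rust
use std::collections::HashMap;

/// My contact book maps *my* names for people to their public keys.
/// The names are purely local; nothing proves that "Johannes" is anyone's
/// name for that key but mine.
struct ContactBook {
    petnames: HashMap<String, Vec<u8>>, // local name -> public key bytes
}

impl ContactBook {
    /// Record my local name for a public key.
    fn name(&mut self, petname: &str, public_key: Vec<u8>) {
        self.petnames.insert(petname.to_string(), public_key);
    }

    /// Inviting "Johannes" to a group really means inviting whatever public
    /// key I happen to have filed under that name.
    fn resolve(&self, petname: &str) -> Option<&Vec<u8>> {
        self.petnames.get(petname)
    }
}
```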
0:54:54 Right.
0:54:54 And having
a more universal system where
0:54:59 I could invite somebody by
their email address, for example, I
0:55:02 think would be really interesting.
0:55:03 Back at Fission, Blaine Cook,
0:55:06 who's also done a bunch of stuff with
Ink & Switch in the past, had proposed
0:55:09 this system, the NameName system,
that would give you local-first names
0:55:12 that were rooted in things like email,
so you could invite somebody with
0:55:17 their email address, and a local-first
system could validate that that person
0:55:21 actually had control over that email.
0:55:23 It was a very interesting system.
0:55:25 So there's a lot of work to be done in
identity as separate from authorization.
0:55:29 Right, yeah.
0:55:30 I feel like there's
so much interesting stuff happening
0:55:35 across the entire spectrum, from
the world that we're currently in,
0:55:40 which is mostly centralized, just
in terms of making things work at
0:55:45 all, and even there, it's hard to keep
things up to date and working,
0:55:50 et cetera, but we want to aim higher.
0:55:54 And one way to improve things a lot is
by going more decentralized, but
0:55:59 there are so many hard problems to
tame, and we're starting to just peel
0:56:04 off the layers from the onion here.
0:56:07 And Automerge, I think, is a great,
canonical case study there: it
0:56:12 started with the data, and now things
are around authorization, et cetera.
0:56:17 And then authentication and
identity; there, we probably have
0:56:21 enough research work ahead of us
for the coming decades.
0:56:25 And it's super, super cool to see that so
many bright minds are working on it.
0:56:29 Maybe one last question
in regards to Beehive.
0:56:34 When there's a lot of cryptography
involved, that also means there are
0:56:38 even more CPU cycles that need
to be spent to make stuff work.
0:56:43 Have you been looking into some
performance benchmarks? When you, let's
0:56:48 say, want to synchronize a certain
history for some Automerge
0:56:54 documents, with Beehive disabled and
with Beehive enabled, do you see
0:57:00 a certain factor of how much it
gets slower with Beehive and
0:57:05 the authorization rules applied, both
on the client as well as on the server?