0:29:09 Yeah, and you know, like we had a very
strict rule, right, which is just:
0:29:13 we do not look at content, right?
0:29:15 And so that was the thing when
debugging issues, the saving grace is
0:29:20 that most of the issues we saw
0:29:22 were metadata issues, around
sync not converging, or getting
0:29:28 to the client thinking it's in sync
with the server but the two disagreeing.
0:29:32 So we had a few, yeah,
like pretty interesting
0:29:35 supporting algorithms for this.
0:29:37 So one of them was just simple
hang detection, like making sure:
0:29:41 when should a client reasonably
expect that they are in sync?
0:29:45 And if they're online and if
they've downloaded all the recent
0:29:49 versions and things are getting
stuck, why are they getting stuck?
0:29:53 So are they getting stuck because
they can't read stuff from the
0:29:55 server, either metadata or data?
0:29:57 Are they getting stuck because they
can't write to the file system and
0:30:00 there's some permission errors?
0:30:02 So I think having very fine-grained
classification of that, and having the
0:30:06 client do that in a way that's not
including any private information, and
0:30:11 sending that up for reports and then
aggregating that over all of the clients
0:30:14 and being able to classify, was a big part
of us being able to get a handle on it.
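A minimal sketch of what such a content-free, fine-grained classification might look like; the category names and the aggregation helper below are illustrative assumptions, not Dropbox's actual taxonomy:

```rust
use std::collections::HashMap;

/// Hypothetical, content-free classification of why a client that should be
/// in sync is stuck. Only categories are reported, never paths or contents,
/// so the counts can be aggregated across all clients.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum StuckReason {
    /// Can't fetch metadata (listings, versions) from the server.
    ServerMetadataUnreachable,
    /// Can't download file contents from the server.
    ServerDataUnreachable,
    /// Local writes fail with a permission error.
    FilesystemPermissionDenied,
    /// None of the known categories matched.
    Unexplained,
}

/// Aggregate stuck-reason reports from many clients into per-category counts.
fn aggregate(reports: &[StuckReason]) -> HashMap<StuckReason, usize> {
    let mut counts = HashMap::new();
    for r in reports {
        *counts.entry(*r).or_insert(0) += 1;
    }
    counts
}
```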
0:30:20 And I think this is just generally
very useful for these sync engines.
0:30:23 The biggest return on investment we
got was from consistency checkers.
0:30:27 So part of sync is that there's the same
data duplicated in many places, right?
0:30:33 Like, so we had the data that's
on the user's local file system.
0:30:37 We had all of the metadata that we stored
in SQLite: we would store what
0:30:41 we think should be on the file system.
0:30:43 We would store what the latest
view from the server was.
0:30:46 We would store things that were
in progress, and then we have
0:30:49 what's stored on the server.
0:30:50 And for each one of those like hops, we
would have a consistency checker that
0:30:55 would go and see if those two matched.
0:30:57 And that was like the
highest return on investment we got.
0:31:02 Because before we had that, people
would write in and they would
0:31:05 complain that Dropbox wasn't working.
0:31:07 And until we had these consistency
checkers, we had no idea the
0:31:10 order of magnitude of how
many issues were happening.
0:31:13 And when we started doing
it, we're like, wow.
0:31:16 There's actually a lot.
0:31:18 So a consistency check in this regard
was mostly like a hash over some
0:31:22 packets that you're sending around.
0:31:24 And with that you could verify, okay, up
until like from A to B to C to D, we're
0:31:30 all seeing the same hash, but suddenly
on the hop from D to E, the hash changes.
0:31:35 Ah-huh.
0:31:36 Let's investigate.
0:31:37 Exactly.
0:31:38 And so, to do that in a way
that's respectful of the users,
0:31:42 even of the resources on their system.
0:31:45 Like we wouldn't just go and blast their
CPU and their disk and their network to go
0:31:50 and churn through a bunch of things.
0:31:51 So we would have a sampling
process where we sample a random
0:31:54 path in the tree on the client
and do the same on the server.
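A rough sketch of that kind of sampled, hash-based consistency check; the trait and names are assumptions for illustration, and the real checkers compared several different pairs of stores:

```rust
/// Content hash of a path as seen by one store (local file system, local
/// metadata DB, last-known server view, or the server itself).
type ContentHash = [u8; 32];

/// Hypothetical read-only view of one replica's state.
trait Replica {
    fn hash_of(&self, path: &str) -> Option<ContentHash>;
}

/// Check one randomly sampled path across two adjacent "hops" and report only
/// whether they agree, keeping the report free of user content.
fn check_sampled_path(path: &str, a: &dyn Replica, b: &dyn Replica) -> bool {
    a.hash_of(path) == b.hash_of(path)
}
```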
0:31:58 We would have stuff with Merkle
trees, and then when things would diverge,
0:32:02 we would try to see, is there a way
we can compare on the client? And,
0:32:07 for example, one of the really
important goals for us as an operational
0:32:12 team was to have the power of zero.
0:32:14 I think it might be from AWS or something.
0:32:17 My co-founder James, has
a really good talk on it.
0:32:19 But we would want to have a metric
saying that the number of unexplained
0:32:25 inconsistencies is zero. And one, 'cause
0:32:28 then the nice thing, right, is that
if it's at zero and it regresses,
0:32:31 you know that it's a regression.
0:32:33 If it's fluctuating at like 15
or like a hundred thousand and it kind
0:32:38 of goes up by 5%, it's very hard to know,
when evaluating a new release, right,
0:32:42 whether that's actually safe or not.
0:32:44 So then that would mean that whenever we
would have an inconsistency due to a bit
0:32:49 flip, which we would see all the time
on client devices, then we would have to
0:32:55 categorize that and then bucket that out.
0:32:57 So we would have a baseline
0:32:59 expectation of how many bit flips there
are across all of the devices on Dropbox.
0:33:03 And we would see that that's
staying consistent or increasing or
0:33:06 decreasing, and that the number of
unexplained things was still at zero.
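In metric terms, the "power of zero" idea is roughly the following; the bucket names here are made-up illustrations:

```rust
/// Hypothetical buckets for inconsistencies found by the checkers.
struct InconsistencyCounts {
    explained_bit_flip: u64,  // matches a known bit-flip signature
    explained_known_bug: u64, // attributed to an already-tracked bug
    unexplained: u64,         // the number that must stay at exactly zero
}

/// The "power of zero": any nonzero unexplained count is a regression,
/// no matter how the explained buckets fluctuate between releases.
fn is_regression(counts: &InconsistencyCounts) -> bool {
    counts.unexplained > 0
}
```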
0:33:10 Now let's take that detour,
since you got me curious.
0:33:13 Uh, what would cause bit
flips on a local device?
0:33:16 I think a few, few causes, one of them
is just that in the data center, most
0:33:20 memory uses error correction and you have
to pay more for it, usually have to pay
0:33:24 more for a motherboard that supports it.
0:33:26 At least back then.
0:33:27 Now, on client
devices, we don't have that.
0:33:30 So this is a little bit above
my pay grade for hardware: cosmic
0:33:34 rays or thermal noise or whatever.
0:33:36 But memory is much more
resilient in the data center.
0:33:36 I think another is just that storage
devices vary greatly in quality.
0:33:44 Like your SSDs and your hard drives are
much higher quality inside the data
0:33:49 center than they are on local devices.
0:33:51 And so.
0:33:53 You know, there's that.
0:33:54 it also could be like I had
mentioned that people have all
0:33:57 types of weird configurations.
0:33:59 Like on Mac there are all these
kernel extensions; on Windows, there are
0:34:03 all of these minifilter drivers.
0:34:05 There are all these things
that are interposing between
0:34:07 Dropbox, the user space process
and writing to the file system.
0:34:11 And if those have any memory safety
issues where they're corrupting memory
0:34:15 'cause they're written in archaic C,
you know, or something, that's
0:34:19 the way things can get corrupted.
0:34:20 I mean, we've seen all types of things.
0:34:22 We've seen network routers
corrupting data, but usually
0:34:26 that fails some checksum, right?
0:34:28 Or we've even seen registers on CPUs
being bad, where the memory gets replaced
0:34:33 and the memory seems like it's fine, but
then it just turns out the CPU has its
0:34:38 own registers on chip that are busted.
0:34:40 And so all of that stuff I
think just can happen at scale.
0:34:44 Right.
0:34:45 that makes sense.
0:34:45 And I'm happy to say that I haven't
yet had to worry about bit flips, whether
0:34:51 it's for storage or other things,
but huge respect to whoever has already
0:34:56 had to tame those parts of the system.
0:34:59 So, you mentioned the consistency check
as probably the biggest lever that you
0:35:05 had to understand what state of health
your sync engine is in in the first place.
0:35:11 Was this the only kind of metric and
proxy for understanding how well
0:35:18 the sync system is working, or were
there some other aspects that gave
0:35:22 you visibility, both macro and micro?
0:35:26 Yeah, I mean, I think, yeah,
the kind of hangs, so knowing
0:35:30 that something gets to a synced state
and knowing the duration, right?
0:35:33 So the performance of that
was one of our top-line metrics.
0:35:38 And the other one was
this consistency check.
0:35:40 And then for specific
operations, right?
0:35:43 Like uploading a file: how much
bandwidth are people able to use?
0:35:47 Because people wanted to
use Dropbox and upload lots,
0:35:53 like huge data, like a huge number of
files where each file is really large.
0:35:57 And then they might do it in
Australia or Japan where they're
0:36:01 far away from a data center.
0:36:03 So latency is high, but bandwidth
is very high too, right?
0:36:06 So making sure that we could
fully saturate their pipes, and all
0:36:09 types of stuff with debugging
0:36:12 things in the internet, right?
0:36:13 People having really bad
routes to AWS and all that.
0:36:16 So we would track things like that.
0:36:18 I think other than that it was
mostly just the usual quality stuff,
0:36:20 like just exceptions and making
sure that features all work.
0:36:25 I think when we rewrote this system,
and we designed it to be very correct,
0:36:30 we moved a lot of these things into
testing before we would release.
0:36:35 So this is, to jump ahead
a little bit, we
0:36:38 decided to rewrite Dropbox's sync engine
from this big Python code base into Rust.
0:36:45 And one of the specific design decisions
was to make things extremely testable.
0:36:49 So we would have everything be
deterministic on a single thread,
0:36:53 have all of the reads and writes
to the network and file system
0:36:56 be through a virtualized API.
0:36:59 So then we could run all of these
simulations of exploring what would
0:37:03 happen if you uploaded a file here and
deleted it concurrently and then had a
0:37:08 network issue that forced you to retry.
0:37:10 And so by simulating all of those in
CI, we would be able to then have very
0:37:14 strong invariants about them,
knowing that a file should never
0:37:18 get deleted in this case, or that
it should always converge, or things
0:37:21 like sharing: that this file should
never get exposed to this other viewer.
0:37:26 I think, like,
having stronger guarantees was something
0:37:31 that we could only really do effectively
once we designed the system to make
0:37:36 it easy to test those guarantees.
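A minimal sketch of that design in Rust; all the names here are illustrative assumptions, not Dropbox's actual code. The sync core only touches the outside world through a trait, so production wires in the real filesystem and network while tests wire in an in-memory simulation:

```rust
/// Hypothetical boundary between the deterministic sync core and the outside
/// world: all non-determinism (time, randomness, I/O) goes through this trait.
trait Environment {
    fn now_ms(&self) -> u64;
    fn random_u64(&mut self) -> u64;
    fn read_file(&mut self, path: &str) -> Result<Vec<u8>, String>;
    fn write_file(&mut self, path: &str, data: &[u8]) -> Result<(), String>;
    fn send_request(&mut self, request: &[u8]) -> Result<Vec<u8>, String>;
}

/// The sync core is generic over the environment, so the same logic runs
/// unchanged in production (real OS APIs) and in simulation (in-memory mocks).
struct SyncCore<E: Environment> {
    env: E,
}

impl<E: Environment> SyncCore<E> {
    /// One deterministic step of the sync rules; all effects go via `self.env`.
    fn step(&mut self) -> Result<(), String> {
        let _started_at = self.env.now_ms();
        // ...apply sync rules here, reading and writing only through `self.env`...
        Ok(())
    }
}
```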
0:37:38 Right.
0:37:39 That makes a lot of sense.
0:37:40 And I think we're seeing more
and more systems, also in the
0:37:43 database world, embrace this.
0:37:45 I think TigerBeetle
is quite popular for that.
0:37:49 I think the folks at Turso are
now also embracing this approach.
0:37:54 I think it goes under the
umbrella of simulation testing.
0:37:57 That sounds very interesting.
0:37:58 Can you explain a little bit more how,
maybe in a much smaller program, would
0:38:03 this basically be just that every
assumption and any potential branch,
0:38:08 any sort of side effect that might
impact the execution of my program,
0:38:13 now I need to make explicit, and it's
almost like a parameter that I put into
0:38:19 the arguments of my functions, and now I
call it under these circumstances, and I
0:38:25 can therefore simulate: oh, if that file
suddenly gives me an unexpected error,
0:38:31 then this is how we're gonna handle it?
0:38:33 Yeah, exactly.
0:38:34 So it's like, and there are techniques,
like the TigerBeetle folks use, and
0:38:38 we do this at Convex in Rust: with the
right abstractions, there are
0:38:42 techniques to make it not so awkward.
0:38:45 But yeah, it is this idea of,
can you pin all of the non-determinism in
0:38:50 the system, whether it's reading
from a random number generator, whether
0:38:54 it's looking at time, whether it's reading
and writing to files or the network?
0:38:58 Can that all be pulled out so
that in production it's just using
0:39:04 the regular APIs for it?
0:39:07 So for any of these
sync engines, there's a core
0:39:10 of the system which represents
all the sync rules, right?
0:39:13 Like when I get a new file
from the server, what do I do?
0:39:16 You know, if there's a concurrent
edit to this, what do I do?
0:39:19 And that core of the code is often
the part that has the most bugs, right?
0:39:23 It doesn't think about
some of the corner cases, or if
0:39:27 there are errors or it needs retries,
or it doesn't handle concurrency.
0:39:30 It might have race conditions.
0:39:32 So I think the core idea
of deterministic
0:39:36 simulation testing is to take that core
and just pull out all of the
0:39:43 non-determinism from it into an interface.
0:39:45 So time, randomness, reading and
writing to the network, reading
0:39:49 and writing to the file system, and
making it so that in production,
0:39:52 those are just using the regular APIs.
0:39:55 But in a testing situation,
those can be using mocks.
0:39:59 Like they could be using things
that, for a particular test
0:40:02 that wants to test a scenario,
set it up in a specific way.
0:40:06 Or it could be randomized, right?
0:40:09 Where it might be that, when reading
time, the test framework might
0:40:14 decide pseudo-randomly to advance it
or to keep it at the current time, or
0:40:18 might serialize things differently.
0:40:21 And that type of ability to have random
search explore the state space of
0:40:27 all the things that are possible is
just one of those like unreasonably
0:40:30 effective ideas, I think for testing.
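As a sketch of what that randomized exploration might look like, here is a simulated clock and failure injector driven by a seeded RNG, so a failing run can be replayed exactly from its seed. This is an illustrative assumption, not anyone's actual harness, and a fuller version would implement the `Environment` trait sketched earlier:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Hypothetical simulated environment: a fake clock plus failure injection,
/// all driven by a seeded RNG for reproducibility.
struct SimEnv {
    rng: StdRng,
    clock_ms: u64,
}

impl SimEnv {
    fn new(seed: u64) -> Self {
        Self { rng: StdRng::seed_from_u64(seed), clock_ms: 0 }
    }

    /// Pseudo-randomly decide whether to advance the simulated clock.
    fn maybe_advance_time(&mut self) {
        if self.rng.gen_bool(0.5) {
            self.clock_ms += self.rng.gen_range(1..1_000);
        }
    }

    /// Pseudo-randomly decide whether the next network call should fail,
    /// forcing the sync core down its retry paths.
    fn inject_network_error(&mut self) -> bool {
        self.rng.gen_bool(0.1)
    }
}
```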
0:40:33 And then there's getting a
system to pass that type of
0:40:37 deterministic simulation testing.
0:40:39 It's not at the threshold of having
formal verification, but in our
0:40:42 experience it's pretty close, and with
a much, much smaller amount of work.
0:40:48 And you mentioned
Haskell at the beginning?
0:40:50 I still remember when, after a lot of
time having spent writing unit tests in
0:40:55 JavaScript, and, back then, in the other
order, I first had JavaScript and then I
0:41:00 learned Haskell, and then I found quick
test, or was it quick test, QuickCheck?
0:41:05 Which one was it?
0:41:06 I think it was QuickCheck, right?
0:41:07 Well, right.
0:41:08 So I found QuickCheck and I could express,
sort of like, hey, this is this type.
0:41:13 It has sort of those aspects to it,
those invariants, and then it would just
0:41:18 go along and test all of those things.
0:41:20 Like, wait, I never thought
of that, but of course, yes.
0:41:23 And then you combine those, and you
would be way too lazy to write unit
0:41:27 tests for the combinatorial explosion
of all of your different things.
0:41:32 And then you can say, sample it
like that, and focus on this.
0:41:36 And so I actually also started
embracing this practice a lot more in the
0:41:40 TypeScript work that I'm doing, through
a great project called Prop Check.
0:41:45 And that is picking up the same
ideas, particularly for those
0:41:52 sorts of scenarios where, okay,
Murphy's Law will come and haunt you.
0:41:56 In distributed systems,
0:41:58 that is typically the case.
0:42:00 Building things in such a way where
all the aspects can be specifically
0:42:05 injected, and the sweet spot:
0:42:07 if you can do so still in an ergonomic
way, I think that's the way to go.
0:42:13 It's so, so valuable, right?
0:42:15 And yeah,
0:42:15 the ability of prop tests,
of QuickCheck, of all of these to
0:42:20 also minimize is just magical, right?
0:42:23 Like it comes up with this crazy
counterexample, and it might be
0:42:27 like a list with 700 elements, but
then it is able to shrink it down to
0:42:31 the, like, real core of the bug.
0:42:33 It's magic, right?
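For a flavor of that property-plus-shrinking workflow in Rust (using the proptest crate as a stand-in for QuickCheck and the TypeScript library mentioned; the buggy function and the property are made-up illustrations):

```rust
use proptest::prelude::*;
use std::collections::HashSet;

/// A deliberately buggy "dedup" that only removes adjacent duplicates,
/// so it fails on unsorted input.
fn dedup(mut v: Vec<u32>) -> Vec<u32> {
    v.dedup();
    v
}

proptest! {
    /// Property: the result should contain no duplicates at all. proptest
    /// generates random vectors, finds a failing case, and then shrinks it
    /// toward a minimal counterexample (e.g. something like [0, 1, 0]).
    #[test]
    fn dedup_removes_all_duplicates(v in proptest::collection::vec(0u32..10, 0..100)) {
        let out = dedup(v);
        let mut seen = HashSet::new();
        prop_assert!(out.iter().all(|x| seen.insert(*x)));
    }
}
```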
0:42:35 And you know, I mean, I think
this is something like, you know.
0:42:38 A totally different theme, right?
0:42:40 Like one thing at Convex we're exploring
a lot is like coding has changed a lot
0:42:44 in the past year with AI coding tools.
0:42:46 And one of the things we've observed
for getting coding tools to work very
0:42:50 well with Convex is that these types
of like very succinct tests that can
0:42:54 be generated easily and have like a
really high strength to weight or power
0:42:59 to weight ratio are just really good
for like autonomous coding, right?
0:43:03 Like, if you are gonna take
the Cursor agent and let it go wild,
0:43:06 what does it take to just let it
operate without you doing anything?
0:43:10 It takes something like a prop test
because then it can just continuously
0:43:13 make changes, run the test, and not know
that it's done until that test passes.
0:43:18 Yeah, that makes a lot of sense.
0:43:20 So let's go back for a moment to the
point where you were just transitioning
0:43:25 from the previous Python-based sync
engine to the Rust-based sync engine.
0:43:32 So you're embracing simulation
testing to have a better sense of
0:43:36 like all the different aspects that
might influence the outcome here.
0:43:41 Walk me through how you went about
0:43:44 deploying that new system.
0:43:46 Were there any sort of big headaches
associated with migrating from the
0:43:52 previous system to the new system?
0:43:54 Since, for everything, you
had sort of a de facto source
0:43:57 of truth, which is the files.
0:43:59 So could you maybe just forget everything
the old system has done and just
0:44:04 treat it as, oh, the user would've
just installed this fresh? Walk me
0:44:09 through how you thought about
that, since migrating systems on such
0:44:14 a big scale is typically quite dread