0:29:09 Yeah, and you know, like we had a very
strict rule, right, which is just:
0:29:13 we do not look at content, right?
0:29:15 And so that was the thing when
debugging issues, the saving grace is
0:29:20 that most of the issues we saw
0:29:22 were metadata issues, around
sync not converging, or getting
0:29:28 to the client thinking it's in sync
with the server but the two disagreeing.
0:29:32 So we had a few, yeah,
like pretty interesting
0:29:35 supporting algorithms for this.
0:29:37 So one of them was just simple
hang detection, like making sure:
0:29:41 when should a client reasonably
expect that they are in sync?
0:29:45 And if they're online and if
they've downloaded all the recent
0:29:49 versions and things are getting
stuck, why are they getting stuck?
0:29:53 So are they getting stuck because
they can't read stuff from the
0:29:55 server, either metadata or data?
0:29:57 Are they getting stuck because they
can't write to the file system and
0:30:00 there's some permission errors?
0:30:02 So I think having very fine-grained
classification of that, and having the
0:30:06 client do that in a way that's not
including any private information, and
0:30:11 sending that up for reports and then
aggregating that over all of the clients
0:30:14 and being able to classify, was a big part
of us being able to get a handle on it.
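A minimal sketch of what such a content-free, fine-grained classification might look like; the category names and the aggregation helper below are illustrative assumptions, not Dropbox's actual taxonomy:

```rust
use std::collections::HashMap;

/// Hypothetical, content-free classification of why a client that should be
/// in sync is stuck. Only categories are reported, never paths or contents,
/// so the counts can be aggregated across all clients.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum StuckReason {
    /// Can't fetch metadata (listings, versions) from the server.
    ServerMetadataUnreachable,
    /// Can't download file contents from the server.
    ServerDataUnreachable,
    /// Local writes fail with a permission error.
    FilesystemPermissionDenied,
    /// None of the known categories matched.
    Unexplained,
}

/// Aggregate stuck-reason reports from many clients into per-category counts.
fn aggregate(reports: &[StuckReason]) -> HashMap<StuckReason, usize> {
    let mut counts = HashMap::new();
    for r in reports {
        *counts.entry(*r).or_insert(0) += 1;
    }
    counts
}
```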
0:30:20 And I think this is just generally
very useful for these sync engines.
0:30:23 The biggest return on investment we
got was from consistency checkers.
0:30:27 So part of sync is that there's the same
data duplicated in many places, right?
0:30:33 Like, so we had the data that's
on the user's local file system.
0:30:37 We had all of the metadata that we stored
in SQLite: we would store what
0:30:41 we think should be on the file system.
0:30:43 We would store what the latest
view from the server was.
0:30:46 We would store things that were
in progress, and then we have
0:30:49 what's stored on the server.
0:30:50 And for each one of those like hops, we
would have a consistency checker that
0:30:55 would go and see if those two matched.
0:30:57 And that was like the
highest return on investment we got.
0:31:02 Because before we had that, people
would write in and they would
0:31:05 complain that Dropbox wasn't working.
0:31:07 And until we had these consistency
checkers, we had no idea the
0:31:10 order of magnitude of how
many issues were happening.
0:31:13 And when we started doing
it, we're like, wow.
0:31:16 There's actually a lot.
0:31:18 So a consistency check in this regard
was mostly like a hash over some
0:31:22 packets that you're sending around.
0:31:24 And with that you could verify, okay, up
until like from A to B to C to D, we're
0:31:30 all seeing the same hash, but suddenly
on the hop from D to E, the hash changes.
0:31:35 Ah-huh.
0:31:36 Let's investigate.
0:31:37 Exactly.
0:31:38 And so, to do that in a way
that's respectful of the users,
0:31:42 even of the resources on their system.
0:31:45 Like we wouldn't just go and blast their
CPU and their disk and their network to go
0:31:50 and churn through a bunch of things.
0:31:51 So we would have a sampling
process where we sample a random
0:31:54 path in the tree on the client
and do the same on the server.
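A rough sketch of that kind of sampled, hash-based consistency check; the trait and names are assumptions for illustration, and the real checkers compared several different pairs of stores:

```rust
/// Content hash of a path as seen by one store (local file system, local
/// metadata DB, last-known server view, or the server itself).
type ContentHash = [u8; 32];

/// Hypothetical read-only view of one replica's state.
trait Replica {
    fn hash_of(&self, path: &str) -> Option<ContentHash>;
}

/// Check one randomly sampled path across two adjacent "hops" and report only
/// whether they agree, keeping the report free of user content.
fn check_sampled_path(path: &str, a: &dyn Replica, b: &dyn Replica) -> bool {
    a.hash_of(path) == b.hash_of(path)
}
```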
0:31:58 We would have stuff with Merkle
trees, and then when things would diverge,
0:32:02 we would try to see, is there a way
we can compare on the client? And,
0:32:07 for example, one of the really
important goals for us as an operational
0:32:12 team was to have the power of zero.
0:32:14 I think it might be from AWS or something.
0:32:17 My co-founder James, has
a really good talk on it.
0:32:19 But we would want to have a metric
saying that the number of unexplained
0:32:25 inconsistencies is zero. And one, 'cause
0:32:28 then the nice thing, right, is that
if it's at zero and it regresses,
0:32:31 you know that it's a regression.
0:32:33 If it's fluctuating at like 15
or like a hundred thousand and it kind
0:32:38 of goes up by 5%, it's very hard to know,
when evaluating a new release, right,
0:32:42 whether that's actually safe or not.
0:32:44 So then that would mean that whenever we
would have an inconsistency due to a bit
0:32:49 flip, which we would see all the time
on client devices, then we would have to
0:32:55 categorize that and then bucket that out.
0:32:57 So we would have a baseline
0:32:59 expectation of how many bit flips there
are across all of the devices on Dropbox.
0:33:03 And we would see that that's
staying consistent or increasing or
0:33:06 decreasing, and that the number of
unexplained things was still at zero.
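In metric terms, the "power of zero" idea is roughly the following; the bucket names here are made-up illustrations:

```rust
/// Hypothetical buckets for inconsistencies found by the checkers.
struct InconsistencyCounts {
    explained_bit_flip: u64,  // matches a known bit-flip signature
    explained_known_bug: u64, // attributed to an already-tracked bug
    unexplained: u64,         // the number that must stay at exactly zero
}

/// The "power of zero": any nonzero unexplained count is a regression,
/// no matter how the explained buckets fluctuate between releases.
fn is_regression(counts: &InconsistencyCounts) -> bool {
    counts.unexplained > 0
}
```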
0:33:10 Now let's take that detour,
since you got me curious.
0:33:13 Uh, what would cause bit
flips on a local device?
0:33:16 I think a few, few causes, one of them
is just that in the data center, most
0:33:20 memory uses error correction and you have
to pay more for it, usually have to pay
0:33:24 more for a motherboard that supports it.
0:33:26 At least back then.
0:33:27 Now, on client
devices, we don't have that.
0:33:30 So this is a little bit above
my pay grade for hardware: cosmic
0:33:34 rays or thermal noise or whatever.
0:33:36 But memory is much more
resilient in the data center.
0:33:36 I think another is just that storage
devices vary greatly in quality.
0:33:44 Like your SSDs and your hard drives are
much higher quality inside the data
0:33:49 center than they are on local devices.
0:33:51 And so.
0:33:53 You know, there's that.
0:33:54 it also could be like I had
mentioned that people have all
0:33:57 types of weird configurations.
0:33:59 Like on Mac there are all these
kernel extensions; on Windows, there are
0:34:03 all of these minifilter drivers.
0:34:05 There are all these things
that are interposing between
0:34:07 Dropbox, the user space process
and writing to the file system.
0:34:11 And if those have any memory safety
issues where they're corrupting memory
0:34:15 'cause they're written in archaic C,
you know, or something, that's
0:34:19 the way things can get corrupted.
0:34:20 I mean, we've seen all types of things.
0:34:22 We've seen network routers
corrupting data, but usually
0:34:26 that fails some checksum, right?
0:34:28 Or we've even seen registers on CPUs
being bad, where the memory gets replaced
0:34:33 and the memory seems like it's fine, but
then it just turns out the CPU has its
0:34:38 own registers on chip that are busted.
0:34:40 And so all of that stuff I
think just can happen at scale.
0:34:44 Right.
0:34:45 that makes sense.
0:34:45 And I'm happy to say that I haven't
yet had to worry about bit flips, whether
0:34:51 it's for storage or other things,
but huge respect to whoever has already
0:34:56 had to tame those parts of the system.
0:34:59 So, you mentioned the consistency check
as probably the biggest lever that you
0:35:05 had to understand what state of health
your sync engine is in in the first place.
0:35:11 Was this the only kind of metric and
proxy for understanding how well
0:35:18 the sync system is working, or were
there some other aspects that gave
0:35:22 you visibility, both macro and micro?
0:35:26 Yeah, I mean, I think, yeah,
the kind of hangs, so knowing
0:35:30 that something gets to a synced state
and knowing the duration, right?
0:35:33 So the performance of that
was one of our top-line metrics.
0:35:38 And the other one was
this consistency check.
0:35:40 And then for specific
operations, right?
0:35:43 Like uploading a file: how much
bandwidth are people able to use?
0:35:47 Because people wanted to
use Dropbox and upload lots,
0:35:53 like huge data, like a huge number of
files where each file is really large.
0:35:57 And then they might do it in
Australia or Japan where they're
0:36:01 far away from a data center.
0:36:03 So latency is high, but bandwidth
is very high too, right?
0:36:06 So making sure that we could
fully saturate their pipes, and all
0:36:09 types of stuff with debugging
0:36:12 things in the internet, right?
0:36:13 People having really bad
routes to AWS and all that.
0:36:16 So we would track things like that.
0:36:18 I think other than that it was
mostly just the usual quality stuff,
0:36:20 like just exceptions and making
sure that features all work.
0:36:25 I think when we rewrote this system,
and we designed it to be very correct,
0:36:30 we moved a lot of these things into
testing before we would release.
0:36:35 So this is, to jump ahead
a little bit, we
0:36:38 decided to rewrite Dropbox's sync engine
from this big Python code base into Rust.
0:36:45 And one of the specific design decisions
was to make things extremely testable.
0:36:49 So we would have everything be
deterministic on a single thread,
0:36:53 have all of the reads and writes
to the network and file system
0:36:56 be through a virtualized API.
0:36:59 So then we could run all of these
simulations of exploring what would
0:37:03 happen if you uploaded a file here and
deleted it concurrently and then had a
0:37:08 network issue that forced you to retry.
0:37:10 And so by simulating all of those in
CI, we would be able to then have very
0:37:14 strong invariants about them,
knowing that a file should never
0:37:18 get deleted in this case, or that
it should always converge, or things
0:37:21 like sharing: that this file should
never get exposed to this other viewer.
0:37:26 I think, like,
having stronger guarantees was something
0:37:31 that we could only really do effectively
once we designed the system to make
0:37:36 it easy to test those guarantees.
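A minimal sketch of that design in Rust; all the names here are illustrative assumptions, not Dropbox's actual code. The sync core only touches the outside world through a trait, so production wires in the real filesystem and network while tests wire in an in-memory simulation:

```rust
/// Hypothetical boundary between the deterministic sync core and the outside
/// world: all non-determinism (time, randomness, I/O) goes through this trait.
trait Environment {
    fn now_ms(&self) -> u64;
    fn random_u64(&mut self) -> u64;
    fn read_file(&mut self, path: &str) -> Result<Vec<u8>, String>;
    fn write_file(&mut self, path: &str, data: &[u8]) -> Result<(), String>;
    fn send_request(&mut self, request: &[u8]) -> Result<Vec<u8>, String>;
}

/// The sync core is generic over the environment, so the same logic runs
/// unchanged in production (real OS APIs) and in simulation (in-memory mocks).
struct SyncCore<E: Environment> {
    env: E,
}

impl<E: Environment> SyncCore<E> {
    /// One deterministic step of the sync rules; all effects go via `self.env`.
    fn step(&mut self) -> Result<(), String> {
        let _started_at = self.env.now_ms();
        // ...apply sync rules here, reading and writing only through `self.env`...
        Ok(())
    }
}
```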
0:37:38 Right.
0:37:39 That makes a lot of sense.
0:37:40 And I think we're seeing more
and more systems, also in the
0:37:43 database world, embrace this.
0:37:45 I think TigerBeetle
is quite popular for that.
0:37:49 I think the folks at Turso are
now also embracing this approach.
0:37:54 I think it goes under the
umbrella of simulation testing.
0:37:57 That sounds very interesting.
0:37:58 Can you explain a little bit more how,
maybe in a much smaller program, would
0:38:03 this basically be just that every
assumption and any potential branch,
0:38:08 any sort of side effect that might
impact the execution of my program,
0:38:13 now I need to make explicit, and it's
almost like a parameter that I put into
0:38:19 the arguments of my functions, and now I
call it under these circumstances, and I
0:38:25 can therefore simulate: oh, if that file
suddenly gives me an unexpected error,
0:38:31 then this is how we're gonna handle it?
0:38:33 Yeah, exactly.
0:38:34 So it's like, and there are techniques,
like the TigerBeetle folks use, and
0:38:38 we do this at Convex in Rust: with the
right abstractions, there are
0:38:42 techniques to make it not so awkward.
0:38:45 But yeah, it is this idea of,
can you pin all of the non-determinism in
0:38:50 the system, whether it's reading
from a random number generator, whether
0:38:54 it's looking at time, whether it's reading
and writing to files or the network?
0:38:58 Can that all be pulled out so
that in production it's just using
0:39:04 the regular APIs for it?
0:39:07 So for any of these
sync engines, there's a core
0:39:10 of the system which represents
all the sync rules, right?
0:39:13 Like when I get a new file
from the server, what do I do?
0:39:16 You know, if there's a concurrent
edit to this, what do I do?
0:39:19 And that core of the code is often
the part that has the most bugs, right?
0:39:23 It doesn't think about
some of the corner cases, or if
0:39:27 there are errors or it needs retries,
or it doesn't handle concurrency.
0:39:30 It might have race conditions.
0:39:32 So I think the core idea
of deterministic
0:39:36 simulation testing is to take that core
and just pull out all of the
0:39:43 non-determinism from it into an interface.
0:39:45 So time, randomness, reading and
writing to the network, reading
0:39:49 and writing to the file system, and
making it so that in production,
0:39:52 those are just using the regular APIs.
0:39:55 But in a testing situation,
those can be using mocks.
0:39:59 Like they could be using things
that, for a particular test
0:40:02 that wants to test a scenario,
set it up in a specific way.
0:40:06 Or it could be randomized, right?
0:40:09 Where it might be that, when reading
time, the test framework might
0:40:14 decide pseudo-randomly to advance it
or to keep it at the current time, or
0:40:18 might serialize things differently.
0:40:21 And that type of ability to have random
search explore the state space of
0:40:27 all the things that are possible is
just one of those like unreasonably
0:40:30 effective ideas, I think for testing.
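As a sketch of what that randomized exploration might look like, here is a simulated clock and failure injector driven by a seeded RNG, so a failing run can be replayed exactly from its seed. This is an illustrative assumption, not anyone's actual harness, and a fuller version would implement the `Environment` trait sketched earlier:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Hypothetical simulated environment: a fake clock plus failure injection,
/// all driven by a seeded RNG for reproducibility.
struct SimEnv {
    rng: StdRng,
    clock_ms: u64,
}

impl SimEnv {
    fn new(seed: u64) -> Self {
        Self { rng: StdRng::seed_from_u64(seed), clock_ms: 0 }
    }

    /// Pseudo-randomly decide whether to advance the simulated clock.
    fn maybe_advance_time(&mut self) {
        if self.rng.gen_bool(0.5) {
            self.clock_ms += self.rng.gen_range(1..1_000);
        }
    }

    /// Pseudo-randomly decide whether the next network call should fail,
    /// forcing the sync core down its retry paths.
    fn inject_network_error(&mut self) -> bool {
        self.rng.gen_bool(0.1)
    }
}
```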
0:40:33 And then there's getting a
system to pass that type of
0:40:37 deterministic simulation testing.
0:40:39 It's not at the threshold of having
formal verification, but in our
0:40:42 experience it's pretty close, and with
a much, much smaller amount of work.
0:40:48 And you mentioned
Haskell at the beginning?
0:40:50 I still remember when, after a lot of
time having spent writing unit tests in
0:40:55 JavaScript, and, back then, in the other
order, I first had JavaScript and then I
0:41:00 learned Haskell, and then I found quick
test, or was it quick test, QuickCheck?
0:41:05 Which one was it?
0:41:06 I think it was QuickCheck, right?
0:41:07 Well, right.
0:41:08 So I found QuickCheck and I could express,
sort of like, hey, this is this type.
0:41:13 It has sort of those aspects to it,
those invariants, and then it would just
0:41:18 go along and test all of those things.
0:41:20 Like, wait, I never thought
of that, but of course, yes.
0:41:23 And then you combine those, and you
would be way too lazy to write unit
0:41:27 tests for the combinatorial explosion
of all of your different things.
0:41:32 And then you can say, sample it
like that, and focus on this.
0:41:36 And so I actually also started
embracing this practice a lot more in the
0:41:40 TypeScript work that I'm doing, through
a great project called Prop Check.
0:41:45 And that is picking up the same
ideas, particularly for those
0:41:52 sorts of scenarios where, okay,
Murphy's Law will come and haunt you.
0:41:56 In distributed systems,
0:41:58 that is typically the case.
0:42:00 Building things in such a way where
all the aspects can be specifically
0:42:05 injected, and the sweet spot:
0:42:07 if you can do so still in an ergonomic
way, I think that's the way to go.
0:42:13 It's so, so valuable, right?
0:42:15 And yeah,
0:42:15 the ability of prop tests,
of QuickCheck, of all of these to
0:42:20 also minimize is just magical, right?
0:42:23 Like it comes up with this crazy
counterexample, and it might be
0:42:27 like a list with 700 elements, but
then it is able to shrink it down to
0:42:31 the, like, real core of the bug.
0:42:33 It's magic, right?
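For a flavor of that property-plus-shrinking workflow in Rust (using the proptest crate as a stand-in for QuickCheck and the TypeScript library mentioned; the buggy function and the property are made-up illustrations):

```rust
use proptest::prelude::*;
use std::collections::HashSet;

/// A deliberately buggy "dedup" that only removes adjacent duplicates,
/// so it fails on unsorted input.
fn dedup(mut v: Vec<u32>) -> Vec<u32> {
    v.dedup();
    v
}

proptest! {
    /// Property: the result should contain no duplicates at all. proptest
    /// generates random vectors, finds a failing case, and then shrinks it
    /// toward a minimal counterexample (e.g. something like [0, 1, 0]).
    #[test]
    fn dedup_removes_all_duplicates(v in proptest::collection::vec(0u32..10, 0..100)) {
        let out = dedup(v);
        let mut seen = HashSet::new();
        prop_assert!(out.iter().all(|x| seen.insert(*x)));
    }
}
```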
0:42:35 And you know, I mean, I think
this is something like, you know.
0:42:38 A totally different theme, right?
0:42:40 Like one thing at Convex we're exploring
a lot is like coding has changed a lot
0:42:44 in the past year with AI coding tools.
0:42:46 And one of the things we've observed
for getting coding tools to work very
0:42:50 well with Convex is that these types
of like very succinct tests that can
0:42:54 be generated easily and have like a
really high strength to weight or power
0:42:59 to weight ratio are just really good
for like autonomous coding, right?
0:43:03 Like, if you are gonna take
the Cursor agent and let it go wild,
0:43:06 what does it take to just let it
operate without you doing anything?
0:43:10 It takes something like a prop test
because then it can just continuously
0:43:13 make changes, run the test, and not know
that it's done until that test passes.
0:43:18 Yeah, that makes a lot of sense.
0:43:20 So let's go back for a moment to the
point where you were just transitioning
0:43:25 from the previous Python-based sync
engine to the Rust-based sync engine.
0:43:32 So you're embracing simulation
testing to have a better sense of
0:43:36 like all the different aspects that
might influence the outcome here.
0:43:41 Walk me through how you went about
0:43:44 deploying that new system.
0:43:46 Were there any sort of big headaches
associated with migrating from the
0:43:52 previous system to the new system?
0:43:54 Since, for everything, you
had sort of a de facto source
0:43:57 of truth, which is the files.
0:43:59 So could you maybe just forget everything
the old system has done and just
0:44:04 treat it as, oh, the user would've
just installed this fresh? Walk me
0:44:09 through how you thought about
that, since migrating systems on such
0:44:14 a big scale is typically quite dread