localfirst.fm
All episodes
April 15, 2025

#23 – Sujay Jayakar: Dropbox, Convex

Sponsored by Jazz
Show notes

Transcript

0:00:00 Intro
0:00:00 There's another kind of interesting decision here on Dropbox by
0:00:04 design was always like a sidecar.
0:00:06 It's always something that just sits and it looks at your files.
0:00:09 Your files are just regular files on the file system.
0:00:12 And if Dropbox, the app isn't running, your files are there and they're safe,
0:00:17 and it's something that you know, regular apps can just read and write
0:00:21 to, and in some sense like Dropbox was unintentionally local-first
0:00:27 from that perspective, right?
0:00:28 Because it's saying that no matter what happens, your data
0:00:31 is just there and you own it.
0:00:33 Welcome to the localfirst.fm podcast.
0:00:36 I'm your host, Johannes Schickling, and I'm a web developer, a
0:00:39 startup founder, and love the craft of software engineering.
0:00:42 For the past few years, I've been on a journey to build a modern high quality
0:00:46 music app using web technologies, and in doing so, I've been following down
0:00:50 the rabbit hole of local-first software.
0:00:52 This podcast is your invitation to join me on that journey.
0:00:56 In this episode, I'm speaking to Sujay Jayakar,
0:00:59 co-founder of Convex and early engineer at Dropbox.
0:01:02 In this conversation, Sujay shares the story on how the Sync Engine
0:01:06 powering Dropbox was built initially and later redesigned to address all
0:01:11 sorts of distributed systems problems.
0:01:13 Before getting started, also a big thank you to Jazz for supporting this podcast.
0:01:19 And now my interview with Sujay.
0:01:22 Hey, Sujay.
0:01:23 So nice to have you on the show.
0:01:24 How are you doing?
0:01:25 Doing great.
0:01:26 Great.
0:01:26 Really happy to be here.
0:01:28 I'm super excited to have you on the show.
0:01:30 I've been using your work for over a decade at this point,
0:01:35 since when I was really getting into using computers productively.
0:01:39 And just recently we had another really interesting guest, Seph Gentle, on
0:01:45 the podcast, who worked on a really fascinating tool called Google Wave
0:01:50 back then that had a big impact on me.
0:01:53 And you've been working on another technology that had a big impact
0:01:56 on me, which is Dropbox and still has a very positive impact on me.
0:02:01 That was all the way back then over 10 years ago in 2014.
0:02:05 I don't think I need to explain to the audience what Dropbox is, but, I
0:02:11 want to hear it from you, like what led you to join Dropbox, I think very
0:02:15 early on and just hearing a little bit just embedded in your personal
0:02:20 context when you joined it, and then we're gonna go dive really deep into
0:02:24 all things syncing related, et cetera.
0:02:27 How does that sound?
0:02:28 Sujay's backstory
0:02:28 Yeah, that sounds great.
0:02:30 It's actually a really funny story.
0:02:31 My career in technology started in 2012.
0:02:34 I was actually studying mathematics.
0:02:37 I was going to go work at the NSA doing cryptography, and I was born in India.
0:02:44 but I'm a naturalized citizen for the United States, and, you have to
0:02:48 be, have security clearance to go do these types of cryptography things.
0:02:52 And you know, my clearance kept on dragging on and on and on, and
0:02:58 they interviewed my roommates, and apparently I'm just a very sketchy
0:03:01 guy. So I had an offer to go work there, but it kept on dragging on.
0:03:06 And then my roommate at the time was a computer science major who wanted
0:03:10 someone to go with him to the career fair, and I just started chatting with the
0:03:15 Dropbox people. And you know, it was about like a hundred people around that time.
0:03:19 And, chatting turned into hanging out at dinner, turned into interviewing
0:03:23 and being a math person, I did my interviews all in Haskell and
0:03:26 didn't know any real programming.
0:03:28 and then yeah, that turned into doing an internship, dropping out of undergrad and.
0:03:34 just following the dream.
0:03:35 And so at Dropbox, I worked on a bunch of things.
0:03:38 I started off working on our, like, growth team.
0:03:40 So I did a lot of, like, email systems.
0:03:43 I worked on this thing called the Space Race, like a promotion.
0:03:47 Oh, I remember that.
0:03:48 Yes.
0:03:49 I think I've earned quite a lot of free storage, which I think
0:03:53 over time has gone down.
0:03:56 But that was a very smart and effective mechanism.
0:03:58 I surely invited all my friends back then.
0:04:01 I couldn't afford a premium plan being a broke student, so that worked.
0:04:07 And then from there worked on the sync engine for some time.
0:04:11 And then right now I'm the co-founder and chief scientist of a startup called Convex
0:04:16 and my three co-founders and I met working on this project called Magic Pocket,
0:04:20 where Dropbox stores hundreds of petabytes now exabytes of files, for users.
0:04:25 And we used to do that in S3.
0:04:27 And so the three of us worked together on a team to build Amazon S3, but in-house
0:04:32 and migrate all of the data over.
0:04:34 So we did that for a few years, and then worked on rewriting the entirety of
0:04:39 the Dropbox sync engine, the thing that runs on all of our desktop computers.
0:04:43 We rewrote it to be really correct and scalable and very flexible,
0:04:47 and shipped that.
0:04:48 After that, I left Dropbox in 2020. I was trying to decide if I wanted
0:04:52 to get back to academics or not.
0:04:54 So I did some research in networking and then decided to start Convex in 2021.
0:04:59 Certainly curious which sort of research had your interest the
0:05:03 most in this sort of transitionary period, but maybe we stash that
0:05:07 for a moment and go back to the beginning when you joined Dropbox.
0:05:11 You mentioned there were around a hundred people working there at the time.
0:05:15 how do I need to imagine the technology behind Dropbox at this point?
0:05:20 It clearly started out with a desktop-focused daemon
0:05:27 process that's running on your machine, somehow keeps track of the files
0:05:33 on your system, and then applies the magic.
0:05:37 So explain to me how things worked back then and what was it like to
0:05:46 Dropbox
0:05:46 Yeah, I mean, it was pretty magical, right?
0:05:49 Because the company had, I think gotten so many things right on the product side
0:05:54 and then those showed up in technology.
0:05:55 But just this feeling of like Dropbox being this product that just worked right?
0:06:00 It was for everyone.
0:06:01 It was not just for technologists, but anyone should be able, anyone who's
0:06:05 comfortable using a computer should be able to install Dropbox and have
0:06:09 a folder of theirs become magical.
0:06:12 And without understanding anything about how it works, they should
0:06:15 just think of it as like an extension of what they know already.
0:06:19 yeah.
0:06:19 And so like the ways that that showed up I think were really interesting.
0:06:22 At the time there was a very strong culture of like reverse engineering.
0:06:26 So to have this daemon that runs locally.
0:06:30 You know, one of the amazing early moments in Dropbox was that,
0:06:34 if you open up Finder or Explorer, you have the overlays on it.
0:06:39 Like, that used to be done by attaching to the Finder
0:06:43 process and injecting code into it,
0:06:47 to the point where, when some folks had gone to talk to Apple at the
0:06:51 time about working with the file system and everything,
0:06:57 there were teams at Apple that asked Dropbox, how did you do that in Finder?
0:07:05 So you wanted to offer the most native experience.
0:07:08 There weren't the necessary APIs for that.
0:07:11 And so you just made it happen.
0:07:12 That's amazing.
0:07:13 Yeah.
0:07:14 Yeah.
0:07:14 And so that that idea of like, how do you create the best user experience,
0:07:19 something that you know, for the purpose of making non-technical users feel very
0:07:26 confident and feel very safe using it.
0:07:28 That was another, I think, really deep like company value of like
0:07:32 being worthy of trust and taking people's files very seriously.
0:07:36 You know, I remember having a friend who was in residency at the
0:07:39 time, and he was telling me that he keeps some of his non-
0:07:44 HIPAA stuff, the things that he looks at, on Dropbox, and, you know,
0:07:50 pulls them up and consults them.
0:07:51 And there's a part of me which is terrified by that, right?
0:07:54 Like we think of software as something where like throwing a 500
0:07:57 error is fine every once in a while.
0:08:00 And at Dropbox there was a culture of making users feel
0:08:03 like they could really trust us.
0:08:04 And then that showed up for things like making sure that, like when
0:08:08 we give feedback to users, if we put that green overlay in finder.
0:08:12 They know that no matter what happens, they could throw their laptop in a pool.
0:08:16 Like, anything could happen and their files are safe.
0:08:20 Like if their house burns down, they don't have to worry about that thing.
0:08:24 And that's like all of that reverse engineering and all of the emphasis
0:08:29 on correctness and durability.
0:08:31 It was all in service of that feeling, which I think was really cool.
0:08:34 so on the engineering side, at the time it was like in hyper growth mode.
0:08:38 So they had a Python desktop client.
0:08:40 Almost all of Dropbox was in Python at the time.
0:08:43 And so there's a pre my py, like big rapidly changing desktop client that
0:08:50 needed to support Mac, windows and Linux and all these different file systems.
0:08:53 and then on the server, it was like we had one big server called Meta Server,
0:08:58 meta, I think was from metadata.
0:09:00 and that like ran almost all of Dropbox.
0:09:03 We stored the metadata in MySQL.
0:09:06 The files were stored in S3, and then we had a separate notification server
0:09:11 for managing pushes and things like that.
0:09:13 And so it was kind of a classic architecture, and it was
0:09:17 starting to reach the limits of its scaling even at that time.
0:09:20 And, those were a lot of the things we worked on over the next 10 years.
0:09:24 Wow.
0:09:25 So was the server also written in Python?
0:10:27 So it was all one big Python shop.
0:09:30 Yeah.
0:09:30 And the server was all written in Python.
0:09:33 We had some pretty funny bugs that were due to that. It's kind of
0:09:39 crazy to think about it now.
0:09:40 You know, working in TypeScript full time now, and to think of, like,
0:09:44 back in the day we just had these hundreds of thousands, millions of lines
0:09:48 of code with no type safety and with all types of crazy metaprogramming and
0:09:55 decorators and metaclasses and stuff.
0:09:57 And yeah, so it was all in Python when I showed up.
0:10:00 it was not all in Python and not all in one big monolithic service when I left.
0:10:04 So you mentioned joining when there were around a hundred people, and you
0:10:09 probably already at this point had multitudes more in terms of users.
0:10:15 Being in hypergrowth, it is sort of this race against time where you only have
0:10:21 so much time to work on something, but growth may be outrun you already and
0:10:26 things are already starting to break.
0:10:28 Or, you know, like, okay, if things are gonna grow like this, this system will
0:10:33 break and it's gonna be pretty bad.
0:10:36 So tell me more about how you were dealing with the constant
0:10:42 race against time to rebuild systems, redesign systems, putting out fires.
0:10:49 What was that like?
0:10:50 Hypergrowth
0:10:50 Yeah, and I think there's kind of an interesting place to take this.
0:10:53 I think the normal things were around scale, right?
0:10:56 Those were, like,
0:10:57 one kind of class of problems: being able to handle the load.
0:11:00 But I think one kind of really interesting, dimension of this that led
0:11:04 to our decision to start rewriting all of the sync engine in 2016 was actually
0:11:09 just like customer debugging load.
0:11:12 You know, we have we had hundreds of millions of active users and they were
0:11:17 using Dropbox in all types of crazy ways.
0:11:20 Like one of the stories is someone was using Dropbox with like, I think
0:11:24 it was running on some, I don't know if it was like a raspberry pie or
0:11:27 something, something on his tractor.
0:11:28 Like the guy ran a farm and he was using Dropbox to sink like
0:11:32 pads in text files to his tractor.
0:11:35 And I might be getting some of the details wrong, but
0:11:37 it's something like that.
0:11:38 And so people would just use Dropbox in all types of crazy ways on crazy
0:11:43 file systems with kernel modules running that are messing things around
0:11:47 or so I think, You know, in terms of getting ahead of scale, I think we found
0:11:52 ourselves around 2015, 2016, in the place where for the syn engine on the
0:11:58 desktop client, the entire team just spent all of its time debugging issues.
0:12:04 We had this principle of like anything that's possible, anything that a
0:12:08 protocol allows, anything that some threading race condition that's
0:12:13 theoretically possible will be possible.
0:12:16 And then we would see it, right?
0:12:17 Like users would write in saying, my files aren't sinking.
0:12:20 And then we would look at it and we would spend months debugging each one of these
0:12:24 issues and trying to read the tea leaves from traces and reports and reproductions.
0:12:30 And it'll be like, oh they mounted this file system over here
0:12:33 and then this one and this one are in a different file system.
0:12:36 So moving the file actually did a copy, but then the X adders
0:12:40 were in, preserved this and that.
0:12:42 You know, in terms of that theme of like getting ahead of scale, like I think there
0:12:46 was first this realization that like the set of possible things that can happen in
0:12:51 the system is just astronomically large.
0:12:54 And all of them will happen if they're allowed to.
0:12:57 And we do not have, no matter how much like incremental time
0:13:01 we put into debugging things, we will never be able to keep up.
0:13:05 And the cost of doing that is that the entire team is working
0:13:08 on maintenance like this.
0:13:09 We couldn't build any new features.
0:13:11 So I think that was the motivation then for the rewrite: can we find points
0:13:15 of leverage where, if we just invest a little bit in technology upfront, like by
0:13:20 architecting things a particular way, can we just eliminate a much bigger set of
0:13:25 potential work from debugging and working with customers and stuff like that.
0:13:29 So maybe this is a good time to take a step back and try to
0:13:33 better understand: what was the Dropbox sync engine actually, back then?
0:13:38 So from just thinking about it through like a user's perspective, I have maybe
0:13:45 two computers, and I have files over here.
0:13:48 I want to make sure that I have the files synced over from here to here.
0:13:53 So I could now think about this as sort of like a Git style, approach.
0:13:59 Maybe there's other ways as well.
0:14:01 walk me through sort of like through the solution space, how this could have been
0:14:05 approached and how was it approached?
0:14:07 is there some sort of like diffing involved between different file states
0:14:12 over time, those are being synced around.
0:14:14 Do you sync around the actual file content itself?
0:14:18 Help me to understand.
0:14:19 Building a mental model, what does it mean back then for the sync engine to work?
0:14:25 Dropbox Sync Engine
0:14:25 Yeah.
0:14:25 Yeah.
0:14:26 It's a super interesting question, right?
0:14:28 Because I think like you're saying, there's so many different paths one
0:14:31 can take and it's, I think one of those things where like if someone
0:14:34 asks, like design Dropbox in an interview question, there's like
0:14:37 definitely not one right answer, right?
0:14:39 It's like there are so many trade-offs and like different forks in the decision tree.
0:14:44 I think one of the first things is that, so you have your desktop A and you
0:14:48 have your, maybe you have your desktop and your laptop, and one of the first
0:14:52 decisions for Dropbox is that we would have a central server in the middle,
0:14:56 that there would be a Dropbox file system in the middle that Dropbox, the company
0:15:01 ran, and we did that from this trust perspective, we wanted to say that we
0:15:05 will run this infallibly when you get that green check mark when it's there.
0:15:11 You know, even if an asteroid destroys the eastern side of the United
0:15:15 States, like we will have things replicated in multiple data centers.
0:15:18 And that you know, and then also it's accessible anywhere
0:15:22 on the internet, right?
0:15:23 You can go to the library.
0:15:24 This is not so common these days but I remember when I was a student, like,
0:15:26 go to the library, log into Dropbox and read all your things right?
0:15:30 rather than having to bring a USB stick around.
0:15:32 And so I think that is the first decision, but it's not necessary, right?
0:15:36 Like there were plenty of distributed, entirely peer to
0:15:39 peer file syncing, designs, right?
0:15:42 And so that was the first decision.
0:15:44 And I think the kind of second decision was that if we imagine our desktop and
0:15:48 our laptop and you have the server in the middle, the desktop might be on
0:15:52 Windows, the laptop might be on Mac OS.
0:15:56 So I think that decision to support multiple platforms.
0:15:59 Is like another really interesting one.
0:16:02 This is like where I think Git and Dropbox can be a little bit different.
0:16:06 And that Git is at the end of the day quite Linux centric.
0:16:09 It's case sensitive for its file system.
0:16:12 It deals with directories and it makes particular assumptions about
0:16:15 how directories should behave.
0:16:17 And that was something with Dropbox.
0:16:19 We wanted to be consumer, we wanted to support everything and we wanted
0:16:22 it to feel very automatic, right?
0:16:24 That, like, someone shouldn't have to understand what a Unicode
0:16:28 normalization disagreement means.
0:16:29 Right?
0:16:30 Where in Git, in really bad settings, you might have to understand
0:16:34 that you write 'u' with an accent differently on Mac and Windows.
0:16:38 So I think that's kind of the next side of it.
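As a concrete illustration of the normalization disagreement mentioned above (a Python sketch; macOS's HFS+ historically stored filenames in NFD form, while most other systems produce NFC):

```python
import unicodedata

# 'é' as a single precomposed code point (NFC, typical on Linux/Windows)
name_nfc = unicodedata.normalize("NFC", "café")
# 'e' plus a combining accent (NFD, how HFS+ historically stored names)
name_nfd = unicodedata.normalize("NFD", "café")

# The two names render identically but differ byte-for-byte, so a naive
# sync engine would treat them as two different files.
assert name_nfc != name_nfd
assert len(name_nfc) == 4 and len(name_nfd) == 5

# A cross-platform engine has to compare names under one normal form:
assert unicodedata.normalize("NFC", name_nfd) == name_nfc
```

This is exactly the kind of detail Dropbox had to paper over so that a non-technical user never sees it.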
0:16:40 So then Dropbox has its design for a file system, and it's
0:16:43 central; it's like the hub, and all the spokes are your phone, your
0:16:48 desktop, your laptop, and whatnot.
0:16:50 and then so to kind of get down to the details a bit more.
0:16:53 So then, yeah, we have a process that runs on your computer, that's the
0:16:56 Dropbox app, and that watches all of the files on your file system, and then
0:17:02 it looks at what's happened and then syncs them up to the Dropbox server.
0:17:07 And then whenever changes happen on the Dropbox server, it syncs them down.
0:17:11 there's another kind of interesting decision here on Dropbox by
0:17:15 design was always like a sidecar.
0:17:17 It's always something that just sits and it looks at your files.
0:17:20 Your files are just regular files on the file system.
0:17:23 And if Dropbox, the app isn't running, your files are there and they're safe,
0:17:28 and it's something that you know, regular apps can just read and write
0:17:32 to, and in some sense like Dropbox was unintentionally local-first
0:17:37 from that perspective, right?
0:17:39 Because it's saying that no matter what happens, your data
0:17:42 is just there and you own it.
0:17:44 and you know, there are other systems, right?
0:17:46 Like if you use NFS, a network file system, then if you unmount it or
0:17:52 if you lose connection to the server,
0:17:54 you might not be able to actually open any files that you have the metadata for.
0:17:58 Right.
0:17:59 And I remember from a user perspective, the local-first aspect, I really went
0:18:04 through like all the stages where I had a computer that wasn't connected
0:18:08 to the internet yet, and that at some point I had an internet connection.
0:18:12 But that was a time where, like, everything depended on files.
0:18:16 Like if I didn't have a file, things wouldn't work.
0:18:20 Everything depended on files.
0:18:22 There were barely any websites where you could do meaningful things.
0:18:26 certainly web apps weren't very common yet.
0:18:30 And then Dropbox made everything seamlessly work together.
0:18:35 And then when web apps and SaaS software came along, I was a
0:18:41 bit confused, because I felt, okay,
0:18:43 it gives me some collaboration, but it seems to be a different kind of collaboration,
0:18:48 since I had collaboration before.
0:18:50 But I also understood the limitations of, when I'm working on the same doc
0:18:56 file, through Dropbox, which gets sort of like the first copy, second
0:19:00 copy, third copy, and now I need to somehow manually reconcile that.
0:19:05 And when I saw Google Docs for the first time.
0:19:09 That was really like a revelation because, oh, now we can do this at the same time.
0:19:14 But at the same time, while I saw that, I still remember the feeling
0:19:19 where, but where are my files?
0:19:20 This is my stuff now.
0:19:22 Where, where is it?
0:19:23 And that trust that you've mentioned with Dropbox, I felt like I lost
0:19:30 some control here, and it required a lot of trust in those tools that I
0:19:35 now started, step by step, embracing.
0:19:37 And frankly, I think a lot of those tools didn't deserve my trust in hindsight.
0:19:41 I still feel like we've lost something by no longer being able to like call
0:19:48 the foundation our own in a way.
0:19:50 And I'm still hoping that we kind of find the best of both worlds where
0:19:54 we get that seamless collaboration that we now take for granted.
0:19:58 Something like that Figma gives us.
0:20:00 but also the control and just being ready for whatever happens, that's
0:20:06 something Dropbox gave us out of the box.
0:20:08 I just wanna share this sort of like anecdote and like almost
0:20:12 emotional confusion as I walk through those different stages
0:20:16 of how we work with software.
0:20:18 Totally.
0:20:19 And we've ended up in a place that's not great in a lot of ways.
0:20:22 Right.
0:20:22 And I think you know, I think part of the sad thing, and maybe from
0:20:27 even like an operating systems design perspective is that I feel like files
0:20:32 have lots of design decisions that are
0:20:35 packaged up together.
0:20:36 You know, like one of the amazing things about files is that
0:20:39 they're self-contained, right?
0:20:41 Like on Google, I don't know what Google's backend looks like for Google
0:20:44 Docs, but they probably have like all of the metadata and pieces of the data
0:20:49 spread and different rows in a database and different things in an object store.
0:20:53 And just even thinking about like the physical implementation of that
0:20:57 data, it's like scattered around probably a bunch of servers, right?
0:21:00 Maybe in different data centers.
0:21:02 And there's something really nice about a file where a file is just
0:21:05 like a piece of state, right?
0:21:08 That is just self-contained.
0:21:09 And I think one of the things I think is very
0:21:13 unfortunate, from an operating systems perspective, is that that decision
0:21:18 has also been coupled with a very anemic API. Like, with files, they're just
0:21:24 sequences of bytes that can be read and written to and appended, and there's
0:21:30 no additional structure beyond that.
0:21:32 And I think, like,
0:21:33 the way that things have evolved is that, to
0:21:37 have more structure, to make things like Google Docs, to be able to
0:21:41 reconcile and have collaboration and interpret things as more than just bytes,
0:21:46 we've also given up this ability to package things together.
0:21:49 macOS had, like, a very kind of baby step in this direction with, I
0:21:53 think they're called bundles.
0:21:54 Like the things where, if you have your .app, they're
0:21:57 actually a zip file, right?
0:21:59 And there's all types of ways, all types of brain damage for how this
0:22:03 like, doesn't actually work well.
0:22:05 You know?
0:22:05 But the idea is kind of interesting, right?
0:22:07 It's like what if files had some more structure and what if you still
0:22:10 considered something, an atomic unit, but then it had pieces of it that
0:22:15 weren't just uninterpretable bytes.
0:22:17 And I think that's like, the path dependent, way that we've
0:22:20 ended up where we are today.
0:22:22 That makes sense.
0:22:23 So going back to the sync engine implementation: did the Python process
0:22:28 back in the day mostly index all of the files and then actually
0:22:34 send across the actual bytes, probably in some chunks, across the wire?
0:22:39 Or was there some more intelligent diffing happening client side,
0:22:45 so that you would only send kind of the changes across the wire? And how do I
0:22:50 need to think about what a change is when I'm dealing with, like, a ton of
0:22:55 bytes before and a ton of bytes after?
0:22:57 Yeah.
0:22:58 Those are really, really good questions.
0:22:59 I think maybe like the first starting point is that like files
0:23:03 in Dropbox were stored, just broken up into four megabyte chunks.
0:23:07 And that was just a decision at the very beginning to pick some size.
0:23:11 And on the server, the way that those chunks were stored is that
0:23:15 each four-megabyte chunk was keyed by its SHA-256 hash.
0:23:20 So we would assume that those are globally unique.
0:23:23 So then if you had the same copy of a bunch of file, or you had
0:23:27 a file copied many times in your Dropbox, we would only store it once.
0:23:31 And that would just happen organically because we would say
0:23:34 like, okay, I looked at this file, it has three chunks A, B, and C.
0:23:39 And then the client would ask the server, do you have A, B, and C?
0:23:43 Like the server would say, yes, I have B and C already, please send A, then we
0:23:47 would upload A. so there was already like at the file level there was this like
0:23:52 kind of very coarse grained Delta sync.
0:23:56 at the four megabyte chunk layer.
0:23:58 and then the kind of, it's funny, these things evolve, right?
0:24:01 Like then the next thing we layered on up top was that in that setting where
0:24:05 you decided B and C were there already and you needed to upload a then with
0:24:09 a, the desktop client could use rsync to know that there was previously a
0:24:15 prime and do a patch between the two and then send just those contents.
0:24:19 the kind of thing that was pretty interesting is that a lot of the content
0:24:23 on Dropbox was very incompressible stuff like video, images, so the
0:24:29 benefits of deduplication both across users or even within a user.
0:24:34 And the benefit of like rsync was not actually as much as one might think,
0:24:40 at least from the like, terms of bandwidth going through the system.
0:24:43 It wasn't that reductive because a lot of this content was just kind of unique and
0:24:48 not getting updated in small patches.
0:24:51 And on your server side, blob store, now that you had those hashes for those four
0:24:56 megabyte chunks, that also means that you could probably deduplicate some content
0:25:02 across users, which makes me think of all sorts of other implications of that.
0:25:09 When do you know it's safe to let go of a chunk?
0:25:12 Do you also now know that you could kind of go backwards and
0:25:16 say, like, oh, from this hash, we know this is sensitive content?
0:25:20 And that would have some further implications. Whatever, we don't need to go too
0:25:25 much into depth on that now, but, yeah.
0:25:28 I'm curious like how you thought of those design decisions and
0:25:32 the possible implications.
0:25:34 Yeah.
0:25:34 Yeah, for the first one yeah, like distributed garbage collection
0:25:38 was a very hard problem for us.
0:25:39 We called it vacuuming, and it was about making Dropbox's economics work
0:25:44 out, because we couldn't afford to keep a lot of content that was deleted
0:25:48 that we couldn't charge users for.
0:25:50 So there was, you know, all this additional complexity where different
0:25:54 users would have the ability to restore for different periods of time.
0:25:58 So we would say like, anything that's deleted, it doesn't actually
0:26:01 get deleted for 30 days or a year or whatnot based on their plan.
0:26:05 so then, yeah, like having to do this like big distributed mark and
0:26:09 sweep garbage collection algorithm across hundreds of petabytes,
0:26:14 exabytes of content that was something that we had to get pretty good at.
0:26:18 And when we designed Magic Pocket, where we implemented S3 in-house, we
0:26:23 had specific primitives for making it a little bit easier to avoid race conditions
0:26:28 where, like, a file was deleted,
0:26:31 and we decided that no one needed it anymore,
0:26:34 but then just at that point in time someone uploads it again: making sure
0:26:38 that we don't accidentally delete it.
0:26:40 So that was like, yeah, definitely a very tricky problem.
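A toy model of that vacuuming pass, assuming a simple in-memory view of the store (the real system was a distributed mark-and-sweep over exabytes, with extra fencing against concurrent re-uploads): a chunk is reclaimable only if no live file references it and its restore window has expired.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # illustrative restore window; real plans varied

def vacuum(chunks, files, deleted_at, now):
    """Return the chunk hashes that are safe to reclaim.

    chunks:     all hashes currently in the blob store
    files:      live files, each a list of chunk hashes (the "mark" roots)
    deleted_at: hash -> when the last reference to it was deleted
    A hash is swept only if it is unreferenced AND past the restore window.
    """
    marked = {h for f in files for h in f}  # mark phase: reachable chunks
    return {                               # sweep phase
        h for h in chunks
        if h not in marked
        and now - deleted_at.get(h, now) >= RETENTION
    }
```

The race Sujay describes is exactly the gap between deciding a hash is sweepable and actually deleting it: a fresh upload of the same hash in that window must win.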
0:26:43 And I think in retrospect this is like an interesting design exercise, right?
0:26:48 And that if deduplication wasn't actually that valuable for us, we could have
0:26:52 eliminated a lot of complexity for this garbage collection by not doing it right.
0:26:58 I think for the second thing, yeah.
0:26:59 So at the beginning when Dropbox started, if you had a file with A, B and C and you
0:27:06 uploaded it, it would just check, does A, B and C exist anywhere in Dropbox?
0:27:11 And that got changed over time to be: do you, as a user,
0:27:17 have access to A, B, and C?
0:27:19 And you know, 'cause otherwise you could use this for all types of purposes, right?
0:27:24 To see if there exists some content anywhere in Dropbox.
0:27:27 And that was something where, in the case where the user was
0:27:32 uploading A, B, and C and, say, none of them were present in their account, we would
0:27:38 actually force them to upload them, incur the bandwidth for doing so, and then
0:27:42 discard it if B and C existed elsewhere.
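That hardened check can be sketched as follows (the function and parameter names are hypothetical): deduplicate only against chunks the uploading user already has, so the upload path can't be used as an oracle for whether some content exists anywhere in Dropbox.

```python
def chunks_to_upload(requested: list[str], user_chunks: set[str]) -> list[str]:
    """Decide which of the client's chunk hashes must actually be sent.

    The early behavior described above amounted to checking a *global* chunk
    set, which leaks whether any Dropbox user anywhere has that content.
    Scoping the check to the user's own account forces the upload (and the
    bandwidth cost) even when the bytes already exist elsewhere; the server
    can still quietly discard duplicates after receiving them.
    """
    return [h for h in requested if h not in user_chunks]
```

With the scoped check, an attacker probing a hash they don't own always gets "please upload," and learns nothing.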
0:27:46 Yeah.
0:27:46 Very interesting.
0:27:47 I mean, this would be an interesting rabbit hole just to go down just the
0:27:50 kind of second order effects of that design decision, particularly at
0:27:54 the scale and importance of Dropbox.
0:27:57 But maybe we save that for another time.
0:27:59 So going back to the sync engine, now that we have a better understanding of, how it
0:28:04 worked in that shape and form back then.
0:28:07 You already mentioned before that, as usage went through
0:28:12 the roof, all sorts of different usage scenarios also expanded.
0:28:17 You had all sorts of more esoteric ways that you didn't even think
0:28:22 before that it would be used.
0:28:25 Now all of that came to light.
0:28:28 I'm curious which sort of helper systems you put in place so that you could
0:28:33 even have a grasp of what's going on, since part of the trust that Dropbox
0:28:39 earned over time was probably also related to privacy.
0:28:44 So you couldn't just read everything that's going on in someone's
0:28:49 system, so you're probably also relying to some degree on the help of a user,
0:28:55 that they, like, send something over.
0:28:57 Yeah.
0:28:57 Walk me through, like, the evolution of that, since, as
0:29:02 an engineer, if there's a bug, reproducing that bug is everything.
0:29:09 Consistency checks
0:29:09 Yeah, and you know, we had a very strict rule, right, where it's just:
0:29:13 we do not look at content, right?
0:29:15 And so that was the thing when debugging issues: the saving grace is
0:29:20 that most of the issues we saw
0:29:22 were more metadata issues, around, like, sync not converging, or getting
0:29:28 to the client thinking it's in sync with the server, but them disagreeing.
0:29:32 so we had a few pretty, yeah, like pretty interesting
0:29:35 supporting algorithms for this.
0:29:37 So one of them was just simple hang detection, like asking:
0:29:41 when should a client reasonably expect that they are in sync?
0:29:45 And if they're online and if they've downloaded all the recent
0:29:49 versions and things are getting stuck, why are they getting stuck?
0:29:53 So are they getting stuck because they can't read stuff from the
0:29:55 server, either metadata or data?
0:29:57 Are they getting stuck because they can't write to the file system and
0:30:00 there's some permission errors?
0:30:02 So I think having very fine-grained classification of that and having the
0:30:06 client do that in a way that's like not including any private information and
0:30:11 sending that up for reports and then aggregating that over all of the clients
0:30:14 and being able to classify was a big part of us being able to get a handle on it.
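A toy sketch of that kind of classification might look like the following; the reason codes are invented for illustration, and the point is that only a coarse category ever leaves the client, never paths or content:

```python
from collections import Counter
from enum import Enum, auto

# Invented reason codes -- only this coarse category is reported.
class StallReason(Enum):
    NONE = auto()
    OFFLINE = auto()
    SERVER_METADATA_READ_FAILED = auto()
    SERVER_DATA_READ_FAILED = auto()
    LOCAL_FS_PERMISSION_DENIED = auto()

def classify_stall(online: bool, metadata_ok: bool,
                   data_ok: bool, fs_writable: bool) -> StallReason:
    """Decide why sync is stuck, checking the most fundamental cause first."""
    if not online:
        return StallReason.OFFLINE
    if not metadata_ok:
        return StallReason.SERVER_METADATA_READ_FAILED
    if not data_ok:
        return StallReason.SERVER_DATA_READ_FAILED
    if not fs_writable:
        return StallReason.LOCAL_FS_PERMISSION_DENIED
    return StallReason.NONE

# Aggregating these coarse reports across many clients gives the
# server-side view of where clients are getting stuck.
reports = [classify_stall(True, True, True, False),
           classify_stall(False, True, True, True),
           classify_stall(True, True, True, True)]
histogram = Counter(r.name for r in reports)
assert histogram["LOCAL_FS_PERMISSION_DENIED"] == 1
```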
0:30:20 And I think this is just generally very useful for these sync engines:
0:30:23 the biggest return on investment we got was from consistency checkers.
0:30:27 So part of sync is that there's the same data duplicated in many places, right?
0:30:33 Like, so we had the data that's on the user's local file system.
0:30:37 We had all of the metadata that we stored in SQLite, where we would store what
0:30:41 we think should be on the file system.
0:30:43 We would store what the latest view from the server was.
0:30:46 We would store things that were in progress, and then we have
0:30:49 what's stored on the server.
0:30:50 And for each one of those like hops, we would have a consistency checker that
0:30:55 would go and see if those two matched.
0:30:57 And that was like the highest return on investment we got.
0:31:02 Because before we had that, people would write in and they would
0:31:05 complain that Dropbox wasn't working.
0:31:07 And until we had these consistency checkers, we had no idea of the
0:31:10 order of magnitude of how many issues were happening.
0:31:13 And when we started doing it, we're like, wow.
0:31:16 There's actually a lot.
0:31:18 So a consistency check in this regard was mostly like a hash over some
0:31:22 packets that you're sending around.
0:31:24 And with that you could verify, okay, up until like from A to B to C to D, we're
0:31:30 all seeing the same hash, but suddenly on the hop from D to E, the hash changes.
0:31:35 Ah-huh.
0:31:36 Let's investigate.
0:31:37 Exactly.
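A minimal sketch of that hop-by-hop comparison; the view names and digest scheme here are illustrative, not Dropbox's actual checker:

```python
import hashlib

def digest(entries: dict) -> str:
    """Order-independent digest of a {path: content_bytes} view of the state."""
    h = hashlib.sha256()
    for path in sorted(entries):
        h.update(path.encode())
        h.update(entries[path])
    return h.hexdigest()

def first_divergent_hop(views):
    """views: ordered list of (name, {path: content_bytes}) pairs,
    e.g. local FS view -> client SQLite metadata -> server view."""
    for (name_a, a), (name_b, b) in zip(views, views[1:]):
        if digest(a) != digest(b):
            return (name_a, name_b)   # this hop needs investigating
    return None                        # all copies agree

state = {"docs/a.txt": b"v1"}
views = [("local_fs", state), ("client_db", state),
         ("server", {"docs/a.txt": b"v2"})]
assert first_divergent_hop(views) == ("client_db", "server")
```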
0:31:38 And so, to do that in a way that's respectful of the users,
0:31:42 even, like, resources on their system,
0:31:45 we wouldn't just go and blast their CPU and their disk and their network to go
0:31:50 and, like, churn through a bunch of things.
0:31:51 So we would have like a sampling process where we sample a random
0:31:54 path in the tree on the client and do the same on the server.
0:31:58 We would have stuff with, like, Merkle trees, and then when things would diverge,
0:32:02 we would try to see, like, is there a way we can compare on the client?
0:32:07 And, for example, one of the really important goals for us as an operational
0:32:12 team was to have, like, the power of zero.
0:32:14 I think it might be from AWS or something.
0:32:17 My co-founder James has a really good talk on it.
0:32:19 But we would want to have a metric saying that the number of unexplained
0:32:25 inconsistencies is zero.
0:32:28 'Cause then the nice thing, right, is that if it's at zero and it regresses,
0:32:31 you know that it's a regression.
0:32:33 If it's fluctuating at like 15 or like a hundred thousand and it kind
0:32:38 of goes up by 5%, it's very hard to know, when evaluating a new release,
0:32:42 whether that's actually safe or not.
0:32:44 so then that would mean that whenever we would have an inconsistency due to a bit
0:32:49 flip, which we would see all the time on client devices, then we would have to
0:32:55 categorize that and then bucket that out.
0:32:57 So we would have a baseline
0:32:59 expectation of how many bit flips there are across all of the devices on Dropbox.
0:33:03 And we would see that that's staying consistent or increasing or
0:33:06 decreasing, and that the number of unexplained things was still at zero.
0:33:10 Now let's take those detours, since you got me curious.
0:33:13 What would cause bit flips on a local device?
0:33:16 I think a few causes. One of them is just that in the data center, most
0:33:20 memory uses error correction, and you have to pay more for it, usually have to pay
0:33:24 more for a motherboard that supports it.
0:33:26 At least back then.
0:33:27 Now, like, on client devices we don't have that.
0:33:30 So this is a little bit above my pay grade for hardware, cosmic
0:33:34 rays or thermal noise or whatever.
0:33:36 But memory is much more resilient in the data center.
0:33:40 I think another is just that storage devices vary greatly in quality.
0:33:44 Like your SSDs and your hard drives are much higher quality inside the data
0:33:49 center than they are on local devices.
0:33:51 And so.
0:33:53 You know, there's that.
0:33:54 It also could be, like I had mentioned, that people have all
0:33:57 types of weird configurations.
0:33:59 Like on Mac there are all these kernel extensions; on Windows, there's
0:34:03 all of these minifilter drivers.
0:34:05 There are all these things that are interposing between
0:34:07 Dropbox, the user-space process, and writes to the file system.
0:34:11 And if those have any memory safety issues where they're corrupting memory,
0:34:15 because they're written in archaic C, you know, or something, that's
0:34:19 the way things can get corrupted.
0:34:20 I mean, we've seen all types of things.
0:34:22 We've seen network routers corrupting data, but usually
0:34:26 that fails some checksum, right?
0:34:28 Or we've seen even registers on CPUs being bad where the memory gets replaced
0:34:33 and the memory seems like it's fine, but then it just turns out the CPU has its
0:34:38 own registers on chip that are busted.
0:34:40 And so all of that stuff I think just can happen at scale.
0:34:44 Right.
0:34:45 that makes sense.
0:34:45 And I'm happy to say that I haven't yet had to worry about bit flips, whether
0:34:51 for storage or other things, but huge respect to whoever has already had
0:34:56 to tame those parts of the system.
0:34:59 So, you mentioned the consistency check as probably the biggest lever that you
0:35:05 had to understand which state of health your sync engine is in in the first place.
0:35:11 Was this the only kind of metric and proxy for understanding how well
0:35:18 the sync system is working, or were there some other aspects that gave
0:35:22 you visibility, both macro and micro?
0:35:26 Yeah, I mean, the kind of hangs, so like knowing
0:35:30 that something gets to a synced state, and knowing the duration, right?
0:35:33 So the kind of performance of that was one of our top line metrics.
0:35:38 And the other one was this consistency check.
0:35:40 And then for specific operations, right?
0:35:43 Like uploading a file: how much bandwidth are people able to use?
0:35:47 Because people wanted to use Dropbox and upload lots,
0:35:53 like huge data, like a huge number of files where each file is really large.
0:35:57 And then they might do it in Australia or Japan where they're
0:36:01 far away from a data center.
0:36:03 So latency is high, but bandwidth is very high too, right?
0:36:06 So making sure that we could fully saturate their pipes and all
0:36:09 types of stuff with debugging
0:36:12 things in the internet, right?
0:36:13 People having really bad routes to AWS and all that.
0:36:16 so we would track things like that.
0:36:18 I think other than that it was mostly just the usual quality stuff,
0:36:20 like just exceptions and making sure that features all work.
0:36:25 I think when we rewrote this system, we designed it to be very correct.
0:36:30 We moved a lot of these things into testing before we would release.
0:36:35 So this is, I think, one of the... to jump ahead a little bit, we
0:36:38 decided to rewrite Dropbox's sync engine from this big Python code base into Rust.
0:36:45 And one of the specific design decisions was to make things extremely testable.
0:36:49 So we would have everything be deterministic on a single thread,
0:36:53 have all of the reads and writes to the network and file system
0:36:56 go through a virtualized API.
0:36:59 So then we could run all of these simulations of exploring what would
0:37:03 happen if you uploaded a file here and deleted it concurrently and then had a
0:37:08 network issue that forced you to retry.
0:37:10 And so by simulating all of those in CI, we would be able to then have very
0:37:14 strong invariants, knowing that, like, a file should never
0:37:18 get deleted in this case, or that it should always converge, or things
0:37:21 like, for sharing, that this file should never get exposed to this other viewer.
0:37:26 I think having stronger guarantees was something
0:37:31 that we only could really do effectively once we designed the system to make
0:37:36 it easy to test those guarantees.
0:37:38 Right.
0:37:39 That makes a lot of sense.
0:37:40 And I think we're seeing more and more systems, also in the
0:37:43 database world, embrace this.
0:37:45 I think TigerBeetle is, is quite popular for that.
0:37:49 I think the folks at Turso are now also embracing this approach.
0:37:54 I think it goes under the umbrella of simulation testing.
0:37:57 that sounds very interesting.
0:37:58 Can you explain a little bit more? Maybe in a much smaller program, would
0:38:03 this basically be: just that every assumption and any potential branch,
0:38:08 any sort of side effect that might impact the execution of my program,
0:38:13 I now need to make explicit, and it's almost like a parameter that I put into
0:38:19 the arguments of my functions, and now I call it under these circumstances, and I
0:38:25 can therefore simulate: oh, if that file suddenly gives me an unexpected error,
0:38:31 then this is how we're gonna handle it.
0:38:33 Yeah, exactly.
0:38:34 So there's techniques, like the TigerBeetle folks use, and like
0:38:38 we do this at Convex in Rust: with the right abstractions, there's
0:38:42 techniques to make it not so awkward.
0:38:45 But yeah, it is this idea of: can you pin all of the non-determinism in
0:38:50 the system, whether it's reading from a random number generator, whether
0:38:54 it's looking at time, whether it's reading and writing to files or the network?
0:38:58 Can that all be pulled out, so that in production it's just using the
0:39:04 regular APIs for it?
0:39:07 so there's like for any of these sync engines, there's a core
0:39:10 of the system which represents all the sync rules, right?
0:39:13 Like when I get a new file from the server, what do I do?
0:39:16 You know, if there's a concurrent edit to this, what do I do?
0:39:19 And that core of the code is often the part that has the most bugs, right?
0:39:23 It doesn't think about some of the corner cases, or errors,
0:39:27 or needed retries, or doesn't handle concurrency.
0:39:30 It might have race conditions.
0:39:32 So I think the core idea for deterministic
0:39:36 simulation testing is to take that core and just pull out all of the
0:39:43 non-determinism from it into an interface.
0:39:45 So time randomness, reading and writing to the network, reading
0:39:49 and writing to the file system, and making it so that in production,
0:39:52 those are just using the regular APIs.
0:39:55 But in a testing situation, those can be using mocks.
0:39:59 Like they could be using things where a particular test
0:40:02 wants to test a scenario by setting it up in a specific way.
0:40:06 Or it could be randomized, right?
0:40:09 Where it might be that when reading from, like, time, the test framework might
0:40:14 decide pseudo-randomly to advance it or to keep it at the current time, or
0:40:18 might serialize things differently.
0:40:21 And that type of ability to have random search explore the state space of
0:40:27 all the things that are possible is just one of those like unreasonably
0:40:30 effective ideas, I think for testing.
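A tiny sketch of that idea, with everything here invented for illustration rather than taken from Dropbox's or Convex's actual frameworks: every source of non-determinism lives behind one environment object; in production it would call the real APIs, while in tests a seeded simulator controls time, randomness, and injected network failures, so any failing run replays exactly from its seed.

```python
import random

class SimEnv:
    def __init__(self, seed: int, fail_rate: float = 0.3):
        self.rng = random.Random(seed)   # the only source of randomness
        self.clock = 0.0
        self.fail_rate = fail_rate

    def now(self) -> float:
        # The framework, not the OS, decides whether time advances.
        self.clock += self.rng.choice([0.0, 0.1, 1.0])
        return self.clock

    def send(self, payload: bytes) -> bool:
        # Pseudo-randomly injected network failure.
        return self.rng.random() >= self.fail_rate

def upload_with_retry(env, payload: bytes, max_attempts: int = 10) -> bool:
    """A toy 'sync core': fully deterministic given a deterministic env."""
    for _ in range(max_attempts):
        if env.send(payload):
            return True
        env.now()   # back off in simulated time, not wall-clock time
    return False

# Same seed, identical execution -- so CI can explore thousands of seeds,
# and any failure is reproducible from its seed alone.
assert upload_with_retry(SimEnv(seed=7), b"x") == upload_with_retry(SimEnv(seed=7), b"x")
```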
0:40:33 And then, like, getting a system to pass that type of
0:40:37 deterministic simulation testing...
0:40:39 it's not at the threshold of having formal verification, but in our
0:40:42 experience it's pretty close, and with a much, much smaller amount of work.
0:40:48 And you were mentioning Haskell at the beginning?
0:40:50 I still remember when, after a lot of time having spent writing unit tests in
0:40:55 JavaScript, and, back then, in the other order, I first had JavaScript and then I
0:41:00 learned Haskell, and then I found QuickTest, or was it QuickCheck?
0:41:05 Which one was it?
0:41:06 I think it was QuickCheck, right?
0:41:07 Well, right.
0:41:08 So I found QuickCheck, and I could express sort of like: hey, this is this type,
0:41:13 it has sort of those aspects to it, those invariants, and then it would just
0:41:18 go along and test all of those things.
0:41:20 Like, wait, I never thought of that, but of course, yes.
0:41:23 And then you combine those, and you would be way too lazy to write unit
0:41:27 tests for the combinatorial explosion of, like, all of your different things.
0:41:32 And then you can say, sample it like that, and like, focus on this.
0:41:36 and so I actually also, started embracing this practice a lot more in the
0:41:40 TypeScript work that I'm doing through a great project called Prop Check.
0:41:45 and that is, picking up the same ideas and for particularly those
0:41:52 sort of scenarios where, okay, Murphy's Law will come and haunt you.
0:41:56 this is in distributed systems.
0:41:58 That is typically the case.
0:42:00 Building things in such a way where all the aspects can be specifically
0:42:05 injected, and the sweet spot is
0:42:07 if you can do so while still being ergonomic. I think that's the way to go.
0:42:13 It's so, so valuable, right?
0:42:15 And yeah.
0:42:15 And yeah, the ability, for prop tests, for QuickCheck, for all of these, to
0:42:20 also minimize is just magical, right?
0:42:23 Like it comes up with this crazy counter example and it might be
0:42:27 like a list with 700 elements, but then is able to shrink it down to
0:42:31 the, like, real core of the bug.
0:42:33 It's magic, right?
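A hand-rolled sketch of that generate-then-shrink loop, which QuickCheck-style tools automate; the property and generator here are invented for illustration:

```python
import random

def prop_no_duplicates(xs: list) -> bool:
    """The (deliberately false) property under test: no list has duplicates."""
    return len(xs) == len(set(xs))

def shrink(xs, prop):
    """Greedily drop elements while the input still fails the property."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):      # still a counterexample: keep shrinking
                xs, changed = candidate, True
                break
    return xs

def find_minimal_counterexample(prop, seed=0, tries=100):
    rng = random.Random(seed)
    for _ in range(tries):
        xs = [rng.randrange(50) for _ in range(rng.randrange(100))]
        if not prop(xs):
            return shrink(xs, prop)
    return None

# A big random counterexample shrinks down to the real core of the bug:
# just two equal elements.
minimal = find_minimal_counterexample(prop_no_duplicates)
assert minimal is not None and len(minimal) == 2 and minimal[0] == minimal[1]
```

Real libraries shrink far more cleverly (halving, structure-aware strategies), but the shape is the same: random search finds a counterexample, then greedy reduction keeps only what is needed to reproduce the failure.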
0:42:35 And you know, I mean, I think this is something like, you know,
0:42:38 a totally different theme, right?
0:42:40 Like one thing at Convex we're exploring a lot is like coding has changed a lot
0:42:44 in the past year with AI coding tools.
0:42:46 And one of the things we've observed for getting coding tools to work very
0:42:50 well with Convex is that these types of like very succinct tests that can
0:42:54 be generated easily and have like a really high strength to weight or power
0:42:59 to weight ratio are just really good for like autonomous coding, right?
0:43:03 Like, if you are gonna take like cursor agent and let it go wild,
0:43:06 like what does it take to just let it operate without you doing anything?
0:43:10 It takes something like a prop test because then it can just continuously
0:43:13 make changes, run the test, and not know that it's done until that test passes.
0:43:18 Yeah, that makes a lot of sense.
0:43:20 So let's go back for a moment to the point where you were just transitioning
0:43:25 from the previous Python based sync engine to the Rust based sync engine.
0:43:32 So you're embracing simulation testing to have a better sense of
0:43:36 like all the different aspects that might influence the outcome here.
0:43:41 walk me through, like, how you went about
0:43:44 deploying that new system.
0:43:46 Were there any sort of big headaches associated with migrating from the
0:43:52 previous system to the new system?
0:43:54 since, for everything, you had sort of a de facto source
0:43:57 of truth, which are the files.
0:43:59 So could you maybe just forget everything the old system has done and just
0:44:04 treat it as if, oh, the user would've just installed this fresh? Walk me
0:44:09 through how you thought about that, since migrating systems at such
0:44:14 a big scale is typically quite dreadful.
0:44:17 From Sync Engine Classic to Nucleus
0:44:17 Yeah, dreadful is, yeah, an
0:44:19 appropriate word.
0:44:20 I think one of the biggest challenges was that by design we had a very different
0:44:26 data model for the old sync engine.
0:44:29 We called it Sync Engine Classic,
0:44:31 affectionately.
0:44:32 And then the new one we had was Nucleus.
0:44:34 Nucleus had a very different data model, and the motivation for that was that
0:44:40 sync engine Classic just had a ton of possible states that were illegitimate.
0:44:46 If you had, like, the server update a file and the client update
0:44:50 a file, but then a shared folder gets mounted above it, things could get
0:44:54 into all of these really weird states that were legal but would cause bugs.
0:45:00 And then I think that was like one of the big guiding principles more
0:45:04 than even just like Rust or Python, was just like designing what states
0:45:09 should the system be allowed to be in and design away everything else,
0:45:14 make illegal states unrepresentable.
0:45:17 And so what that then meant is, once we had that,
0:45:21 when we needed to migrate, we had a long tail of really weird starting positions.
0:45:27 So where you basically realized: okay, this system is in this state. A, how the
0:45:33 heck did it ever get into that state?
0:45:35 And B, what are we gonna do about it now, where, basically,
0:45:40 like for a mapping function, this is invalid input?
0:45:44 So can you explain a little bit of like, how you constrained the space of, and how
0:45:49 you designed the space of, legitimate, valid states and what were some of the,
0:45:56 if you think about this as like a big matrix of combinations, what are some
0:46:00 of the more intuitive ones that were, not allowed that you saw quite a bit?
0:46:06 Yeah, so I think part of the difficulty for Dropbox, like as syncing things
0:46:13 from the file system is that file system APIs are really anemic.
0:46:17 File system aPIs don't have transactions.
0:46:19 They don't things can get reordered in all types of ways.
0:46:23 So we would just read and write to files from the local file system, and
0:46:26 we would use file system events on Mac, we would use the equivalent on
0:46:30 Windows and Linux to get, updates.
0:46:32 But everything can be reordered and racy and everything.
0:46:36 So one, like, common invariant would be that, you know,
0:46:40 files have to exist within directories.
0:46:44 If a file exists, then its parent directory exists.
0:46:48 And like simultaneously, if you delete a directory, it shouldn't
0:46:51 have any files within it.
0:46:53 And that invariant guarantees that the file system is a tree.
0:46:57 Right?
0:46:58 And then it's very easy to come up with settings, with reads from the
0:47:03 local file system, where if you just naively take them and write them into
0:47:07 your SQLite database, you will end up with data that does not form a tree.
0:47:12 And then especially even with, like, inodes being unique, right?
0:47:16 Like if I move a file from A to B, then I might observe the add for it at B
0:47:23 way before the delete at A, or I might observe it vice versa, where the file
0:47:28 is transiently gone and disappeared, and we definitely don't wanna sync that.
0:47:31 And then with directories, if I have, like, A as a directory and then B as
0:47:37 a directory, and then I move them, I could observe a state where A moves into
0:47:43 B, which then, without doing the right bookkeeping, might introduce a cycle in
0:47:48 the graph, and a cycle for directories would be really bad news, right?
0:47:52 So all of these invariants were things that the file system APIs don't
0:47:57 respect, even though the file system internally has these invariants, right?
0:48:01 You cannot create a directory cycle on any file system.
0:48:05 Definitely.
0:48:05 I mean, certainly not without root. And all of these invariants exist but
0:48:09 are not observable through the APIs.
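A sketch of enforcing those tree invariants when applying raw events to the local metadata store; the event model here is heavily simplified for illustration:

```python
class MetadataTree:
    def __init__(self):
        self.parent = {"/": None}   # node -> parent node

    def _is_ancestor(self, a, b) -> bool:
        """True if a is b or an ancestor of b."""
        while b is not None:
            if a == b:
                return True
            b = self.parent[b]
        return False

    def apply_add(self, node, parent):
        if parent not in self.parent:
            # Naively writing this row would orphan the node.
            raise ValueError("parent missing: events likely arrived reordered")
        self.parent[node] = parent

    def apply_move(self, node, new_parent):
        if new_parent not in self.parent:
            raise ValueError("parent missing: events likely arrived reordered")
        if self._is_ancestor(node, new_parent):
            # e.g. moving /a underneath /a/b would detach the whole subtree.
            raise ValueError("move would create a directory cycle")
        self.parent[node] = new_parent

t = MetadataTree()
t.apply_add("/a", "/")
t.apply_add("/a/b", "/a")
try:
    t.apply_move("/a", "/a/b")   # a reordered event stream could suggest this
    raise AssertionError("cycle should have been rejected")
except ValueError:
    pass
```

Rejected events would then be held back and retried once the missing context arrives, rather than written into the database as an illegal state.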
0:48:12 And so then Sync Engine Classic would get into a state where its
0:48:16 local SQLite file would have all types of violations like that.
0:48:20 So then, how do we read the tea leaves when, like, the database is in
0:48:24 a really weird state and we can't lose data?
0:48:26 And to go back to, I think what you had talked about at the beginning of this was
0:48:30 that we always had the nuclear option of dropping all of our local state and doing
0:48:36 a full resync from the files themselves.
0:48:39 But then the problem is that we would entirely lose user intent.
0:48:42 So if, for example, I was offline for a month and I had a bunch of files,
0:48:48 and then during that month other people in my team deleted those files.
0:48:53 If I came back online and didn't have my local database, we would have to
0:48:58 recreate those files, and people would complain about this all the time, because
0:49:03 they would delete something and wanna delete it, and then Dropbox would
0:49:05 just randomly decide to resurrect it.
0:49:07 So those types of decisions, we tried to avoid as much as possible, but
0:49:12 then that meant having to look at a potentially really confusing database and
0:49:17 read what the user intent might have been.
0:49:19 Right.
0:49:20 I wanna dig a little bit more into the topic of user intent.
0:49:24 Since with Dropbox you've built a sync engine very specifically for the use
0:49:30 case of file management, et cetera, where user intent has a particular meaning that
0:49:36 might be very different from moving a cursor around in a Google Docs document.
0:49:41 So can you explain a little bit, what are some of the, common scenarios of, and
0:49:47 maybe subtle scenarios of user intent, when it comes to the Dropbox design space?
0:49:55 User intent
0:49:55 Yeah, totally.
0:49:56 And I think, for regular things, like say editing files,
0:50:01 I think we saw that people just generally did not, maybe because
0:50:06 of the way the system was, even its capabilities,
0:50:09 edit the same files all too often.
0:50:12 So maintaining user intent when everyone is online was just kind of
0:50:17 taking last writer wins. Where I think user intent became very interesting is
0:50:21 if someone went offline, like they're on an airplane, before wifi on airplanes,
0:50:27 and they worked on their document and someone else worked on it at the same time.
0:50:31 In that case, we observed that users always wanted to see the conflicted
0:50:35 copy, and that they wanted to get the opportunity to say, like:
0:50:39 I put in a lot of effort into working on this when I was on the plane.
0:50:43 Someone else put in probably a similar amount of effort when they were online,
0:50:48 and, you know, so last-writer-wins policies
0:50:50 there violated user expectations quite a lot, because one person
0:50:55 had to win and then the person who lost would be really upset.
0:50:58 So I think those were pretty interesting.
0:51:00 I think with moves, like with more metadata operations, I think people
0:51:05 were a little bit more permissive.
0:51:06 Like if I moved something from one folder to another, another person
0:51:10 moved it to a different folder.
0:51:12 having it just converge on something, as long as it converges,
0:51:15 we observed that people didn't worry about it too much.
0:51:18 I think the place where user intent is really interesting
0:51:21 with moves is with sharing.
0:51:23 So I think thinking about this from like the distributed systems
0:51:26 perspective on causality, there would be like someone might have like,
0:51:31 I dunno, their HR folder, right?
0:51:33 And, I don't know, like, let's say that someone is transferring to the HR team, so
0:51:38 they're getting added to the HR folder.
0:51:41 But then say before they were on the team, they were on a
0:51:44 performance improvement plan.
0:51:46 So then the administrator for HR would delete that file, make sure it's
0:51:50 deleted, and then add them to the folder.
0:51:54 And so their user intent is expressed in a very specific
0:51:59 sequencing of operations, right?
0:52:01 That like this causally depended on this.
0:52:04 I would not have invited 'em to the folder unless the delete was stably synced.
0:52:08 And that making sure that gets preserved throughout the system,
0:52:12 even when people are going online and offline and everything is a very
0:52:16 hard distributed systems problem.
0:52:18 Right.
0:52:18 and it was intimately related with the details of the product.
0:52:22 Right.
0:52:23 yeah.
0:52:23 How did you capture that causality chain of events since you probably also
0:52:29 couldn't quite trust the system clock?
0:52:32 How did you go about that?
0:52:34 Yeah, this became even more difficult, right?
0:52:36 Where file system metadata was partitioned across many shards in the database.
0:52:41 So then we ended up using something like Lamport timestamps, where every single
0:52:45 operation would get assigned a timestamp.
0:52:47 And those operations were usually only reading and writing to their
0:52:50 particular shard, and with whatever timestamp the client had observed.
0:52:55 But then in these cases where there were potentially cross-shard, not
0:52:59 transactions, but, like, causal dependencies, we would be able to say,
0:53:03 like, the operation to mount this, or to add someone to the shared folder
0:53:07 and then them mounting it within their file system, has to have a higher
0:53:11 timestamp than any write within that folder,
0:53:15 writes including deletes.
0:53:16 so then that way when the client is syncing it would be able to know that when
0:53:21 I am merging operation logs across all of the different shards, I need to assemble
0:53:26 them in a causally consistent order.
0:53:29 And that would then respect all of these particular invariants.
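A rough sketch of that Lamport-clock rule, with invented shard and operation names: each shard's timestamps only ever increase, an operation that causally depends on another (even on a different shard) gets a strictly greater timestamp, and merging per-shard logs in timestamp order yields one causally consistent sequence.

```python
import itertools

class Shard:
    def __init__(self):
        self.clock = 0
        self.log = []   # list of (timestamp, op)

    def append(self, op: str, after: int = 0) -> int:
        # The new timestamp must exceed both this shard's own clock and
        # any timestamp the operation causally depends on.
        self.clock = max(self.clock, after) + 1
        self.log.append((self.clock, op))
        return self.clock

def merge_causal(*shards):
    """Merge per-shard logs into one causally consistent sequence."""
    return [op for _, op in sorted(itertools.chain(*(s.log for s in shards)))]

hr_folder = Shard()
memberships = Shard()
t = hr_folder.append("delete review.doc")
# The invite causally depends on the delete being stably synced,
# so it is stamped strictly after it.
memberships.append("add bob to /HR", after=t)

ops = merge_causal(hr_folder, memberships)
assert ops.index("delete review.doc") < ops.index("add bob to /HR")
```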
0:53:33 Right.
0:53:34 So you having thought through those different scenarios for Dropbox and
0:53:38 made very intentional design decisions that, for example, in one scenario
0:53:43 last writer wins is not desirable.
0:53:46 Since that might lead to a very sad person stepping off the plane because
0:53:51 all of their data is suddenly gone, or the other person's data is gone.
0:53:55 so you make very specific design trade-offs here when it
0:53:58 comes to somehow squaring the circle of distributed systems.
0:54:03 Which sort of advice would you have for application developers or people even
0:54:08 who are sitting inside of a company and are now thinking about, oh, maybe
0:54:12 we should have our own Dropbox style, linear style sync engine internally.
0:54:17 Which sort of advice would you give them?
0:54:23 Sujay's advice for building a Sync Engine
0:54:23 Yeah, I'll talk through kind of how we structured things at Dropbox to be able
0:54:28 to navigate these types of problems.
0:54:30 And I think the patterns here, can be quite general.
0:54:33 I think what we ended up with was thinking: like, distributed
0:54:37 systems syncing is hard, right?
0:54:40 So we would have the kind of base layer of the sync protocol and how state
0:54:45 gets moved around between the clients and the servers and all the shards.
0:54:49 We would have very strong consistency guarantees there.
0:54:52 So we would not use any of the knowledge of the product at that layer.
0:54:57 So from, like, thinking of Dropbox and the file system as a CRDT:
0:55:03 Dropbox allows, like, moves to happen concurrently.
0:55:06 It allows you to add something while another thing is happening.
0:55:10 But at the protocol level, we kept things very strict.
0:55:12 We kept them very close to being serializable, so that every view of the
0:55:17 system was identified by a very small amount of state, like a timestamp.
0:55:21 And that would fully determine the state of the system, and the
0:55:24 amount of entropy in that was very low.
0:55:26 And then whenever you are modifying it, you would say, here's what I expect
0:55:30 the data to be, and if it doesn't match exactly, it will reject the operation.
0:55:34 And then by structuring things in that way, we made it very easy
0:55:39 for product teams, and even for us working on sync, to embed all of these
0:55:45 looser, more product-focused requirements,
0:55:47 which may also change over time, into the endpoints layered on top.
0:55:51 So every time we wanted to change a policy on how, like, a delete reconciles
0:55:57 with an, you know, add for a folder or something:
0:55:59 We didn't have to solve any distributed systems problems to do that.
0:56:03 So I think that pattern of asking: is there a good abstraction?
0:56:07 Is there something that is very powerful, that could solve a large
0:56:11 class of problems? Doing that well at the lowest layer and then potentially
0:56:16 weakening the consistency above it.
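A minimal sketch of that strict base layer: a compare-and-swap-style primitive where every write carries the exact version the client expects and anything stale is rejected, with a product-level policy layered on top. This is an illustration of the pattern, not Dropbox's actual protocol:

```python
class StrictStore:
    def __init__(self):
        self.value = None
        self.version = 0   # a tiny amount of state fully identifies a view

    def write(self, expected_version: int, new_value) -> bool:
        if expected_version != self.version:
            return False           # reject: caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True

def last_writer_wins(store: StrictStore, new_value, retries: int = 10) -> bool:
    """A looser, product-level policy built on the strict primitive.
    Changing this policy touches no distributed-systems logic."""
    for _ in range(retries):
        if store.write(store.version, new_value):   # re-read, then CAS
            return True
    return False

store = StrictStore()
assert store.write(0, "v1")        # matches expectation: accepted
assert not store.write(0, "v2")    # stale expectation: rejected
assert last_writer_wins(store, "v2") and store.value == "v2"
```

Endpoints can swap `last_writer_wins` for conflicted copies, merges, or rejection without the base layer ever changing.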
0:56:19 I actually really like... the Rocicorp folks have a really great description of
0:56:24 their consistency model for Replicache as being like session-plus consistency.
0:56:29 And it's a very similar idea, where, like, when we build things on
0:56:34 a platform, we may, with our product hats on, want users to
0:56:38 not have to think about conflicts and merging and all that in a lot of cases.
0:56:42 But those decisions might be very particular to our app,
0:56:45 rather than something that holds for everything on the platform.
0:56:48 And then there's always a way to embed those decisions onto, say,
0:56:52 session consistency in Replicache, or serializability in other systems.
0:56:57 And so I think that's like that separation of concerns I
0:57:00 think is something that can apply to a lot of systems.
0:57:04 Right.
0:57:04 So maybe we use this also as a transition to talk a bit more about what you're
0:57:09 now designing and working on Convex.
0:57:12 What were some of the key insights that you've taken with you from Dropbox that
0:57:22 Convex
0:57:22 Yeah, when we first were starting Convex we were looking at how apps
0:57:27 are getting built today, right?
0:57:28 Like web apps are easier to build than ever.
0:58:32 Even in 2021, it's incredible how much, like, more productive
0:58:37 it was compared to 10 years before.
0:58:39 Right.
0:58:40 And I think we noticed that the hard part for so many discussions
0:58:45 was managing state, and like how state propagates. I think it was from
0:58:50 the Riffle paper, right, that so many issues in app development
0:58:54 are kind of database problems in disguise, and that techniques
0:58:58 from databases might be able to help.
0:58:00 So with Convex we were saying like, well if we start with the idea of designing
0:58:05 a database from first principles, can we apply some of those database solutions
0:58:10 to things across the whole stack?
0:58:12 So say, for example, when I'm reading data within my app, I have
0:58:17 all of these React components that are all reading different pieces of data.
0:58:21 It'd be really nice if all of them just executed at the same timestamp
0:58:24 and I never had to handle consistency issues where one component knows
0:58:29 about a user or the other one doesn't.
0:58:31 Similarly, why isn't it possible that I just use queries across
0:58:36 all my components and they all live update? Whenever I read anything,
0:58:40 it's automatically reactive.
0:58:42 So those were some of the initial thought
0:58:46 experiments for what led to Convex.
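To make the same-timestamp idea concrete, here's a minimal, self-contained sketch (this is a toy illustration, not Convex's implementation; all names are made up): a multi-version store where every query in one render pass reads at a single snapshot timestamp, so no component can observe a newer write than its siblings.

```typescript
// Toy multi-version store: every write gets a monotonically increasing
// timestamp, and a render pass reads all keys at one fixed snapshot.
type Version = { ts: number; value: unknown };

class SnapshotStore {
  private history = new Map<string, Version[]>();
  private clock = 0;

  write(key: string, value: unknown): void {
    this.clock += 1;
    const versions = this.history.get(key) ?? [];
    versions.push({ ts: this.clock, value });
    this.history.set(key, versions);
  }

  // Every component in one render pass shares this timestamp.
  snapshot(): number {
    return this.clock;
  }

  readAt(key: string, ts: number): unknown {
    const versions = this.history.get(key) ?? [];
    // Walk backwards to the latest version at or before the snapshot.
    for (let i = versions.length - 1; i >= 0; i--) {
      if (versions[i].ts <= ts) return versions[i].value;
    }
    return undefined;
  }
}

const store = new SnapshotStore();
store.write("user:1", "Ada");
const ts = store.snapshot();
store.write("user:1", "Grace"); // a write lands mid-render

// Both components read at the same snapshot, so they agree:
const componentA = store.readAt("user:1", ts); // "Ada"
const componentB = store.readAt("user:1", ts); // "Ada"
```

Without the shared snapshot, component A could render "Ada" and component B "Grace" in the same frame, which is exactly the inconsistency being described.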
0:58:48 I think the other one was really motivated by our time at
0:58:52 Dropbox, and I think it's kind of both a blessing and a curse.
0:58:56 It's one of the key design decisions for Convex:
0:58:59 Convex is very opinionated about there being a separation
0:59:03 between the client and the server.
0:59:05 So we saw this at Dropbox, where they were just different teams, right?
0:59:09 And you know, as we've seen with even the origin of GraphQL, right?
0:59:13 That ability to decouple development between
0:59:16 teams working on user-facing features and the way that data fetching
0:59:20 is implemented on the backend is really powerful.
0:59:23 And so the thought experiment with Convex is: can we
0:59:27 maintain a very strong separation while still getting live updating, while
0:59:32 still getting really good ergonomics for both consuming data on the client
0:59:36 and fetching it on the server?
0:59:39 Right.
0:59:39 So yeah, walk me through a little bit more of the evolution of Convex then,
0:59:44 in terms of all the other options that are out there
0:59:49 for state management. I think what most applications are using is probably
0:59:55 something that, at least to some degree, is somewhat customized and hand-rolled, and
1:00:01 comes with its own huge set of trade-offs.
1:00:05 Help me better understand the
1:00:08 opinionated nature of Convex that you mentioned.
1:00:11 What are the benefits of that?
1:00:13 What are the downsides of that and other implications?
1:00:16 Yeah, so when you write an app on Convex, we can use maybe
1:00:20 a basic to-do app, right?
1:00:22 The Linear clone everyone does.
1:00:24 You write endpoints like you might be used to, right?
1:00:26 Like, list all the to-dos in a project, or update a to-do in a project.
1:00:31 And those get pushed as your API to your Convex server.
1:00:35 The implementations of that API can then read and write to the database,
1:00:39 and Convex has kind of a Mongo- or Firebase-like API for doing so.
1:00:44 I think the main benefit of Convex relative to more traditional
1:00:48 architectures is that if you're on the client, the only thing you need to do
1:00:53 is call the useQuery hook.
1:00:56 You're saying, I am looking at a project, I just do useQuery
1:01:01 with list tasks in project, and that will then talk to the server, run that query, but
1:01:07 then also set up the subscription. And then whenever any data that that query
1:01:12 looked at changes, it will efficiently determine that and push the update.
1:01:16 So part of what is like been nice with Convex is that you are getting
1:01:21 a client that has a web socket protocol, it has a sync engine built in.
1:01:26 You're getting infrastructure for running JavaScript at scale and for
1:01:30 handling sandboxing and all of that.
1:01:32 And then you're also getting a database, which is, you know,
1:01:36 one, supporting transactions for reading and writing to it.
1:01:39 But then it also supports this efficient subscription: being able to subscribe
1:01:43 on, I ran this query, this query just ran a bunch of JavaScript,
1:01:47 it looked at different rows and it ran some queries,
1:01:51 and the system will automatically and efficiently determine if any write overlaps with that.
1:01:56 So the combination of all of those things is part of the benefit of
1:01:59 Convex: you just write TypeScript, and you write it in a way that feels
1:02:03 very natural, and everything just works.
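The read-set idea described here can be sketched in a few lines of self-contained TypeScript. This is a toy model, not Convex's actual engine, and every name is illustrative: each query runs against a tracking read function that records which keys it touched, and a write only re-runs the subscriptions whose read set overlaps the written key.

```typescript
// Toy reactive store: record which keys each query reads, and on every
// write re-run only the queries whose read set overlaps the written key.
type Read = (key: string) => unknown;
type Query<T> = (read: Read) => T;

type Subscription = { readSet: Set<string>; rerun: () => void };

class ReactiveStore {
  private data = new Map<string, unknown>();
  private subs: Subscription[] = [];

  subscribe<T>(query: Query<T>, onUpdate: (result: T) => void): T {
    const sub: Subscription = { readSet: new Set(), rerun: () => {} };
    const run = (): T => {
      const readSet = new Set<string>();
      const result = query((key) => {
        readSet.add(key); // track every key this query touched
        return this.data.get(key);
      });
      sub.readSet = readSet; // the read set can change between runs
      return result;
    };
    sub.rerun = () => onUpdate(run());
    this.subs.push(sub);
    return run();
  }

  write(key: string, value: unknown): void {
    this.data.set(key, value);
    // Only overlapping subscriptions are re-executed.
    for (const sub of this.subs) {
      if (sub.readSet.has(key)) sub.rerun();
    }
  }
}

const store = new ReactiveStore();
store.write("task:1", "buy milk");

const updates: unknown[] = [];
const first = store.subscribe((read) => read("task:1"), (r) => updates.push(r));

store.write("task:2", "unrelated");    // no overlap: subscriber not re-run
store.write("task:1", "buy oat milk"); // overlap: subscriber gets the update
```

The real system does this against transactional reads of database rows and index ranges rather than plain keys, but the shape of the mechanism is the same: track what a query read, then intersect with what a write wrote.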
1:02:07 And I think one of the downsides is that it is a different set of APIs.
1:02:13 It's not using SQL; it's doing things a little bit differently
1:02:16 than they've been done before.
1:02:18 Yeah, it's kind of interesting even today to see, you know,
1:02:23 talking about AI codegen, right?
1:02:24 Like, models have been pre-trained on this huge corpus
1:02:28 of stuff on the internet.
1:02:29 When are they good at adopting new technologies,
1:02:32 technologies that might be after their knowledge cutoff?
1:02:35 And when is it better to just stick to things that they know already?
1:02:39 Right.
1:02:39 So, what you've mentioned before, where you say Convex is rather opinionated:
1:02:45 let's say five years ago, I might've been much more of,
1:02:49 like, oh, but maybe there's a technology that's less opinionated
1:02:53 and I can use it for everything.
1:02:54 But the more experience I got, the more I realized, no, actually,
1:02:58 I want something that's very opinionated, but opinionated
1:03:02 in a way where I share those opinions,
1:03:04 and they fit exactly my use case.
1:03:06 So I think that is much better.
1:03:08 This is why we have different technologies, and they are great for different
1:03:12 scenarios. And I think the more a technology tries to say, no,
1:03:17 we're best for everything, the less it's actually good at anything.
1:03:23 And so I greatly appreciate you standing your ground and saying,
1:03:26 hey, those are the design decisions that we've made,
1:03:31 and those are the use cases where you'd be really well served building
1:03:35 on top of something like Convex.
1:03:37 And I particularly like it for now, where TypeScript is really the default
1:03:42 language to build full-stack applications.
1:03:45 And it's also increasingly becoming the default for
1:03:48 AI-based applications as well.
1:03:51 And AI-based systems speak TypeScript just as well as English.
1:03:57 And Convex makes that full stack super easy.
1:04:02 Also, I think when you build local-first apps, it can
1:04:07 sometimes get really tricky, because you empower the client so much.
1:04:11 You give the client so much responsibility, and therefore there are
1:04:15 many, many things that can go wrong.
1:04:17 And I think Convex therefore takes a more conservative approach and says,
1:04:21 hey, everything that happens on the server is highly privileged,
1:04:25 and this is your safe environment.
1:04:27 And the client will try to give you the best user experience and
1:04:31 developer experience out of the box.
1:04:33 But the client could be in a more adversarial environment.
1:04:37 And I think those are great design trade-offs.
1:04:40 So, I think that is a fantastic foundation for tons of different applications.
1:04:45 Yeah,
1:04:46 talking about some of these strong opinions being both
1:04:49 blessings and curses, right?
1:04:50 Over the past few months, one thing we've been working on is trying
1:04:54 to bridge the gap between those two points on the spectrum, right?
1:04:58 We wrote a blog post on it a few months ago, about working on what we're calling
1:05:02 our object sync engine, trying to take a lot of the principles from more of
1:05:08 a local-first type approach: having a data model that is synced to the client,
1:05:14 where the only interaction between the server and the client is through the sync.
1:05:18 And the client then can always render its UI just looking at the local
1:05:22 database, and it can be offline.
1:05:24 It also fully describes the app state, so it can be exported
1:05:28 and rehydrated or whatever.
1:05:29 It's been a very interesting design exercise to say: can
1:05:33 you structure a protocol on a sync engine in a way such that the UI
1:05:39 is still reading and writing to a local store that is authoritative,
1:05:43 but then that local store is, to use Electric SQL terminology,
1:05:47 a shape that is some mapping of a strongly separated server data model?
1:05:52 So we still have a client data model and a server data model, which might be
1:05:56 owned by different teams and evolve independently, and we also have that
1:06:01 strong separation, where the implementation of the shape is privileged, running
1:06:06 on the server with authorization rules built in, and we get the best of both worlds.
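To illustrate what such a shape could look like, here's a hypothetical, self-contained sketch. None of these names are the actual Convex API; the point is only the structure: a privileged function on the server maps the server data model into the document set one client is allowed to sync, so joins and authorization rules never leave the server.

```typescript
// Hypothetical "shape": a server-side, privileged mapping from the server
// data model to a per-client view. The join (author names) and the
// authorization check (channel membership) both stay on the server.
interface ServerMessage {
  id: string;
  channelId: string;
  authorId: string;
  body: string;
  deleted: boolean;
}

interface ClientMessage {
  id: string;
  author: string;
  body: string;
}

const userNames = new Map([["u1", "Ada"], ["u2", "Grace"]]);
const memberships = new Map<string, Set<string>>([
  ["u1", new Set(["general"])],
  ["u2", new Set(["general", "private"])],
]);

function messagesShape(viewerId: string, rows: ServerMessage[]): ClientMessage[] {
  const channels = memberships.get(viewerId) ?? new Set<string>();
  return rows
    .filter((m) => !m.deleted && channels.has(m.channelId)) // authorization
    .map((m) => ({
      id: m.id,
      author: userNames.get(m.authorId) ?? "unknown", // server-side join
      body: m.body,
    }));
}

const rows: ServerMessage[] = [
  { id: "m1", channelId: "general", authorId: "u2", body: "hi all", deleted: false },
  { id: "m2", channelId: "private", authorId: "u2", body: "secret", deleted: false },
];

// u1 only syncs messages from channels they belong to:
const viewForU1 = messagesShape("u1", rows);
```

The client then treats the output documents as its authoritative local model, while the server is free to change how the mapping is implemented, which is the separation of concerns being described.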
1:06:10 And we have a beta that we've not released publicly, though it's
1:06:16 open-sourced out there, but I think we're
1:06:19 still figuring out the DX for it.
1:06:21 And I think we have something that algorithmically works,
1:06:24 and the protocol works, but it's kind of hard.
1:06:28 Right.
1:06:28 It kind of reminds me a lot of writing GraphQL resolvers: saying, how do I
1:06:32 take the messages table from my chat app?
1:06:35 Then under the hood that might be joining stuff from many different
1:06:39 tables and filtering rows, or might even be doing a full-text search
1:06:43 query in another view or something.
1:06:45 And coming up with the right ergonomics to make that feel
1:06:48 great for a day-one experience is,
1:06:50 I think, something that we're still working on, still
1:06:53 kind of a research project,
1:06:54 right?
1:06:54 Well, when it comes to data, there is no free lunch, but I'd much rather have
1:06:58 it be done in the order and sequencing that you're going through, which is
1:07:03 having a solid foundation that I can trust and then figuring out the right
1:07:09 ergonomics afterwards, since I think there are many, many tools that start with
1:07:14 great ergonomics but later realize that they're built on an unsound foundation.
1:07:19 So when it comes to data, I want a trustworthy foundation, and I think
1:07:24 you're going about in the right order.
1:07:26 Hey, Sujay, I've been learning so much about one of my favorite
1:07:31 products of all time, Dropbox.
1:07:33 I've learned so much about how the sausage was actually made and how it evolved
1:07:39 over time, and I'm really excited that you got to share the story today, and
1:07:45 that many, me included, got to learn from it.
1:07:48 Thank you so much for taking the time and sharing all of this.
1:07:51 Thanks for having me.
1:07:52 This is super, super fun.
1:07:54 Thank you for listening to the localfirst.fm podcast.
1:07:56 If you've enjoyed this episode and haven't done so already, please
1:08:00 subscribe and leave a review.
1:08:01 Please also share this episode with your friends and colleagues.
1:08:04 Spreading the word about the podcast is a great way to support
1:08:07 it and to help me keep it going.
1:08:09 A special thanks again to Jazz for supporting this podcast.