Tuesday, May 4, 2021

Episode 111: Real-time Data Streaming with Frank McSherry

We are sponsored by audible! http://www.audibletrial.com/programmingthrowdown

We are on Patreon! https://www.patreon.com/programmingthrowdown

T-Shirts! http://www.cafepress.com/programmingthrowdown/13590693

Join us on Discord! https://discord.gg/r4V2zpC

Real-time Data Streaming with Frank McSherry

In this episode, we talk with Frank McSherry, Gödel Prize-winning data scientist, and Co-founder and Chief Scientist at Materialize, Inc. Frank shares expert viewpoints drawn from his years as an academic, as well as personal insights on helping run a company at the cutting edge of real-time data streaming.


00:01:10 Working from home

00:05:15 Frank’s career beginnings: Grad School and MSR

00:11:52 PTO 

00:14:06 Working in Europe

00:17:25 SPARK, Naiad and Dryad

00:18:45 Real-time data streaming, in context

00:23:15 Presto

00:25:18 Materialize

00:35:33 ANSI SQL

00:42:39 Origins of Materialize

00:48:30 Pitching to investors

00:55:11 Collaboration between academics and engineers

00:59:37 Materialize’s customer acquisition model

01:04:00 Kafka and Kinesis

01:07:33 “Like tail on steroids”

01:10:13 Materialize outside of Big Data

01:13:23 Ousterhout’s dichotomy

01:15:08 “Now there’s 15.”

01:16:22 What’s it like working at Materialize?

01:18:27 Company offsites

01:23:34 Career opportunities at Materialize

01:27:38 Advice for budding database engineers

01:29:47 SQL learning recommendations


Resources mentioned in this episode:


Rust 

https://www.rust-lang.org/


Apache Spark

https://spark.apache.org/


Naiad

https://www.microsoft.com/en-us/research/project/naiad/


Dryad

https://www.microsoft.com/en-us/research/project/dryad/


Presto

https://prestodb.io/


JSON

https://www.json.org/json-en.html


Avro

https://avro.apache.org/


Apache Hive

https://hive.apache.org/


Kafka

https://kafka.apache.org/


Kinesis

https://aws.amazon.com/kinesis/


PostgreSQL

https://www.postgresql.org/


SQL Training by Markus Winand

https://winand.at/


SQL Performance: 

https://use-the-index-luke.com/

 

Materialize, Inc., Frank’s company:

https://materialize.com/


You can reach out to Frank via:


GitHub: https://github.com/frankmcsherry

Twitter: @frankmcsherry


Transcript:

Episode 111 Real-time Data Streaming with Frank McSherry

Patrick Wheeler: Programming Throwdown Episode 111, Real-time Data Streaming with Frank McSherry. Take it away, Jason. 

[00:00:22] Jason Gauci: Hey everybody. So this is going to be a really, really interesting episode. A lot of folks are interested in Big Data. Big Data is still hugely, hugely important. It's still growing at like a really, really fast pace.

[00:00:37] There's so many companies that are figuring out how to manage a lot of information that is coming through and how to harness that and how to use it to make better products. 

[00:00:49] And so we're here to talk with Frank McSherry, who's the Co-founder of Materialize, who's going to really kind of walk us through, you know, really what is real-time data streaming, how that works under the hood, and you know, how everyone else can kind of get their hands on this tech. So thanks a lot for coming on the show, Frank. 

[00:01:08] Frank McSherry: Oh, it's not, not a problem at all. It's my pleasure. 

[00:01:10] Jason Gauci: Cool. So how are you handling the COVID situation, the work from home? How has it kind of changed, you know, your, your day to day? 

[00:01:18] Frank McSherry: Yeah, well, it's, it's maybe unsurprising it's changed everyone's day to day, pretty substantially. We went from being a group of people who were basically all in the same room, more or less, we had a sort of 15-person office where everyone showed up and worked, and, you know, a certain collaboration style, and that sort of pivoted a hundred, 180 degrees, to everyone is somewhere else.

[00:01:40] And it's in some ways good. Like from my point of view, we need to be a lot more thoughtful about our communication and our processes and stuff like that. You can't just yell at someone to try to figure out how a thing works. You have to ideally write it down and, and everyone can see at that point and, I don't know, but like super, super disruptive. Yeah. 

[00:01:57] Jason Gauci: Have you, how do you handle that situation where someone needs to ask for help? I find this is a real challenge with our team is, it used to be you kind of just yell and whoever, someone will just kind of jump in, but now it's like if you post to the chat, it's kind of a little bit more disruptive and there's this kind of Mexican standoff among all the folks in the chat, like, who's going to answer this? And yeah, I was wondering, how you deal with that? 

[00:02:23] Frank McSherry: You're right. It's, it's definitely a little tricky, I think at the moment, at least we're still, you know, things are changing of course, but we're, we're small enough that there's still some sort of sense of ownership for who's who is in charge of, of a certain thing.

[00:02:35] And that person might not be looking at the moment, but there isn't, there aren't a bunch of people saying, Oh, not me, I didn't do it. Generally, there are enough folks that are interested in sort of the health of the code, stuff like that, that someone will, will pipe up and say, like, I thought, I thought it was this, or here's a PR that, that looks relevant. Maybe that's where the problem is.

[00:02:51] Jason Gauci: Yeah, it makes sense. That makes sense. Yeah. I think I've been telling folks to try to do point to point, like it's, it might be easier to message one person and have that person tell you, like, it's actually this other person, than to message the group, because you end up in this situation where potentially maybe no one answers and everyone's kind of looking at each other. But yeah, I think that all of these things, and whiteboarding is another thing that I've found to be a real challenge, but all these things, I think we'll, we'll be able to make, you know, that we'll be able to sort of make progress in these areas. It's just going to take time.

[00:03:25] Frank McSherry: Yeah, no, it makes it very clear that this is a, for me, at least, for sure, this was an underappreciated aspect of how do you get work done. You know, like we, we can certainly, we're, we're faking it a little bit now in terms of picking out the right ways to communicate, but clearly, you know, people who are good at remote first, for example, it's, it's impressive that you're like, Oh wow. You know, you must have some great processes in place, or just fundamentally different ones, but, but clearly more robust to, you know, someone isn't available for a little while or, you know, someone's sick or something like that. And they're just out for a bit. Can your org resist that?

[00:04:00] That's, that's neat stuff. And I hadn't, I hadn't really thought very hard about this at all beforehand then. 

[00:04:04] Jason Gauci: Yeah. Yeah, definitely. Yeah. It's wild. How, how it just kind of happened right away. I mean, I still actually have, you know, a bunch of little tokens and picture frames and stuff at my desk. I think, I mean, I don't even know if someone cleaned them off or no, but, but you know, it's just one day we were told not to go into work.

[00:04:22] And so it's just kind of like, I wonder if it's just, if it's just a snapshot of March 2020 over there, I mean, I don't, I can't even get in the building, so I don't know. 

[00:04:31] Frank McSherry: We had that for a bit. We had someone go in, we were at a, WeWork essentially. And our lease was up at sometime in June or so, we had some folks going in and basically put things in boxes and put addresses on them and shipped them.

[00:04:44] So you got like a, essentially a care package, which was, this was the stuff that was on your desk, which is a bit like, well, it's nice to have, but it's also a bit weird, to, to think like basically just been moved out or something, but yeah, it's all very emotionally jarring. A lot of the stuff I'm sure for many people, for many different reasons, but that's also been a big problem.

[00:05:04] I would say there's the mechanical aspect of writing code and building a business, but there's also just a big emotional tax on a lot of folks who are trying to get their head around the world being different and, and stuff like that. 

[00:05:15] Jason Gauci: Yeah. Yeah, totally makes sense. So, so let's rewind from, from the, in Materialize, in the, in the WeWork and let's kind of start from the beginning and kind of, what's your backstory and what kind of led to you co-founding Materialize?

[00:05:31] Frank McSherry: Yeah. Okay. Well, there's, it goes back a ways. Tell me if, no that's too far, let's speed it up. 

[00:05:38] Jason Gauci: No, go for it. I mean, you can say, you know, first there was the womb and then...  (laugh)  

[00:05:41] Frank McSherry: Yeah, no, I think like in terms of formative moments, you went to, went to grad school, a standard computer science education. I went to grad school, which is maybe a little less standard, and did some, some great work with, I thought, great work with Anna Karlin, who's this person who works at this intersection of theory and systems work, and got a little bit of a taste for both, you know, thinking about things for long enough that they make sense, but also trying to get your head around whether the thing that you've thought about should actually be turned into something that computers do, and actually results in something meaningful and, and of consequence.

[00:06:17] From there I went, I started working at the Microsoft Research lab in Silicon Valley, and I was there for, you know, it was like 12 years or something like this. Lots of great people. This was very formative. This is a lot of, a really interesting combination of theoretical computer scientists and people working on systems, in principle distributed systems, but computer systems. And a great place to really learn a lot, like the people there are very strong, and you learn a lot, both about the actual technical bits of computer science, but also how to think about, about research, how to do things of consequence.

[00:06:51] You're relieved a bit from a bunch of the academic pressures of publishing at a very fast cadence, so you can wait until you've actually got the thing that you think is right before telling people about it.

[00:07:01] Jason Gauci: Were you there when MSR in, in Silicon Valley closed down? 

[00:07:04] Frank McSherry: I was. 

[00:07:06] Jason Gauci: Oh man, that, that broke my heart. I mean, I read about it in the news and, and, and as being in another tech company, we were constantly reaching out to folks to try to try to hire them, but that was just unbelievable. 

[00:07:16] Frank McSherry: It was, it was very surprising at the time. I mean, to be honest, I have sort of mixed emotions about it and I'm totally happy, I'd be careful framing this, but like, it was a very comfortable place.

[00:07:29] And I do kind of like the idea that folks get moved out of their comfort zone occasionally and have to go and do new and interesting and different things. And I was very happy that I was moved out of my comfort zone. Cause I feel like the time after that was, for me at least was very good. I got to do new things, think about new stuff, try different ways of the world that I wouldn't have bothered to do.

[00:07:49] If I had, if I had still been around, I would have just stayed there for another 12 years on autopilot, writing papers, doing things. So, I'm personally glad that I got shaken up a bit by that though. I have lots of colleagues who weren't nearly as glad, but... 

[00:08:02] Jason Gauci: Yeah, I mean, I was in a similar position where I was, had a job that I was, you know, very, it was a very comfortable job and I was getting kind of good ratings and, and I, I had in this case, you know, I didn't, I wasn't, I chose to take this opportunity, but I was always kind of, you're always kind of really nervous leaving your comfort zone because you feel like I can never really go back and, and you know, and it's kind of taking this big risk. But you know, there's two things that I, that I kind of learned from, from my experience.

[00:08:33] And I'd love to hear what you learn from yours. For mine, I learned one, is that no matter how, kind of much time and energy you've put in how much you're a part of the process at your current company, that when you leave, everyone else just picks up the slack. And like, you know, I, I actually thought that is, the team was going to struggle a lot more than they actually did. They were just fine. 

[00:08:55] And the other thing is going in the other direction, you know, you can always go back and people, you know, if you tell your boss you're leaving and while you're leaving, you know, provided it's on good terms and everything, they're always happy to welcome you back. And so both of those made me feel a lot better.

[00:09:13] Now obviously in the MSR case, there wasn't going back. But, but, but, you know, I know from someone who saw that from another company that there's, there's always really good opportunities for, you know, good people who work hard. 

[00:09:24] Frank McSherry: I think that's totally right. I mean, that was my experience as well, which is, mechanically, it would have been difficult to go back to Microsoft at the time, based on how things happened. But if you're in, if you're good at what you do or close enough, there, there's lots of opportunities out there for a lot of different folks. 

[00:09:40] This comes up a bunch, to be totally honest, in the startup space as well, where we at Materialize continually have these conversations with people that we're trying to, you know, trying to recruit, great people who have a cushy job somewhere and are a bit worried about the risk, right?

[00:09:54] The risk of like, well, do I want to leave my job to do something that's a little riskier? What happens in the worst case scenario? Yeah. The answer is usually, well, the worst case scenario is you just go back to your existing job or, or something similar. It's not like they're going to be at your, at your throat just because you wanted to go off and do something interesting. Generally they're super welcoming to either have you back or, you know, doing a similar thing at a, at a different company, if for any particular reason, your opportunities come.

[00:10:21] So it's, it's the thing that you don't, I certainly didn't think of, ahead of time, thought like, wow, it's really comfortable here. And indeed, if I go somewhere else, it's gonna be tricky to get this again. And it doesn't, it's not necessarily the case. 

[00:10:34] Jason Gauci: Yeah. That makes sense. I've heard that in the startup world, you know, there's yeah, it's a really good point. In the startup world, there's, in a sense, less protection. Like, for example, if you're at a startup and the startup pivots, and they just don't need your particular skill anymore, then you could, you know, be let go for that.

[00:10:51] Whereas if you're in some giant company, there's almost like, always someplace you can go or they'll give you time to retool. And so that can make people nervous, but the same thing still applies that, that, that there's always kind of a, you know, a ton of demand there. And so if you find some startup, that's doing something you're really passionate about, you know, you can, you can join it with, with confidence.

[00:11:15] Frank McSherry: I think that's right. I mean, that's certainly been what I've seen and I, that might be coming from a position of privilege for sure. But it's certainly the case, that although a startup might do something surprising and you decide you don't like them anymore, or maybe it just doesn't fit. And that's bad news.

[00:11:30] The larger world is at the moment, still really exciting for computer scientists, especially ones who are doing bold, innovative, startup-y things. Usually, companies are pretty happy to get in touch and find something for you to do.

[00:11:44] Jason Gauci: Yeah. That makes sense. So you were at MSR and then did you go from there to another big company or did you jump straight into startup world?

[00:11:52] Frank McSherry: Yeah, no. I actually, what I did is I hadn't taken any vacation in the 12 years I had been at Microsoft, no vacation of consequence.

[00:11:58] Jason Gauci: What? Wait, no way. Are you serious? 

[00:11:59] Frank McSherry: I had taken one like a year or two before, I took about three weeks off. But other than that, it was all, you know, visit folks for, for the holidays type things. No, no particular vacations of consequence. Yeah.

[00:12:11] Jason Gauci: Did your vacation accumulate? How did that work? Is there a limit?

[00:12:14] Frank McSherry: I mean, it's, it's California, so they're not allowed to it's, you know, it builds up, but it maxes out at some, you know, six weeks or something like that. Yeah. 

[00:12:22] Jason Gauci: And so you just had maxed out? You're just, you're just shedding vacation days for, for years? That's wild. I mean, you must be really passionate about what you're doing. 

[00:12:30] Frank McSherry: No, I mean, yeah, maybe, but no, it was much so it was... I, you know, I certainly hadn't at that point in my life come across, you know, like building up a good work-life balance and, you know, sustainability and stuff like that. It just.

[00:12:43] Jason Gauci: Yeah, actually, we should do a show on that. Maybe we'll invite you, invite you to talk about, about work. That's actually something... we never, we never covered that. And you know, now that you mentioned it, that's so, so important, it could easily be a whole hour...

[00:12:59] Frank McSherry: This past year made that really clear. I think like a lot of... we've had a lot of folks at the company that, you know, just getting wound up and stressed for nonstandard reasons, right?

[00:13:09] Like not things you would have anticipated, not things that you sort of put on your calendar as make sure to, to unwind. And it's been really important for us to try to pay attention, remind people that like you should absolutely think about taking time off. Don't sweat the fact that like you can't go to an island somewhere and, and drink you know, bright colored drinks. Take some time. 

[00:13:28] Jason Gauci: Yeah. You know, I get these automated emails from when people get close to their limit on, on PTO, and you know, usually, I mean, I've never even, I don't think I've ever seen them before, but now everyone is at their limit. And I think you hit the nail on the head. People say, well, you know, I don't want to go on PTO if I'm just gonna have to stay around the house, but then they're completely burnt out.

[00:13:52] And so you almost have to kind of force people and say, look, you need to spend a week just sitting around your house, doing nothing. Like you just have to, because you are just flipping out over stuff that doesn't matter. And, and it's yeah, it's, it's, this is an incredibly difficult time for that.

[00:14:06] Frank McSherry: So, so you had asked, what did you do after Microsoft? And, and that was the segue into here, but the, what happened essentially was I concluded I should take some vacation, and vanished to, to Morocco for a little bit, for some surfing, cool stuff like that, and just chilling out. It's actually, it's really pleasant. It was even with the surfing and the three meals a day and yoga, it was a rent reduction from San Francisco.

[00:14:30] So that was, yeah, that was pretty sweet. And just started doing a bit more of like low-fi living, I guess, like not wearing quite something. I had a laptop, I had one suitcase. That was everything that I that I owned, pretty much, and sort of wandered a bit around. 

[00:14:46] I had some work obligations in Europe. I had agreed to chair some workshops. And so, you know, did a little bit of work, but mostly just slumming around doing, doing some work on the side, but at my own pace. 

[00:14:59] Jason Gauci: Yeah. That's, that's super nice. I don't know if you know Richard Stallman's kind of like lifestyle, but this is, this sounds a lot like, I interviewed Richard Stallman a while ago and yeah, he, he basically you know, goes conference to conference,  you know, gives talks and he asked, you know, can I sleep on your couch?

[00:15:18] He has this, he has this, this email thread, like listserv type thing of all the couches he can sleep on. And he just goes from place to place meeting new people. And it sounds like a really, really kind of exciting, you know, kind of life, where it is exciting, but it's also chill, which is kind of hard to get 

[00:15:33] Frank McSherry: It's, it definitely was interesting. Like, so some of the, some of the time I was going, I guess there's of course some just hanging out and surfing in Morocco type things, but there's also dropping in on the people doing Apache Flink in Berlin, and dropping in at Cambridge in the UK to work with them for a few weeks.

[00:15:51] And then eventually it turns out dropping in on ETH in Zurich, which happened to be doing a bunch of related work, stuff, work on data flow processing. They were looking at building systems that would essentially take the exhaust out of data centers. So whatever's happening in your data center, not, not the actual work that's going on, but what messages are getting sent around, who's communicating with whom, what was going on between your various racks, and feed that into some sort of analysis subsystem that they're trying to build.

[00:16:21] And just as a technical segue, this is, this is a moment where they were, they were struggling to make, for example, SPARK work. They had gotten Naiad, which is what we had done at MSR, up and working, but they're on Linux and the C# support on Linux at the time was not, it was not stellar.

[00:16:39] So the guy in charge, Mothy Roscoe, basically said, look, why don't you just show up and help us sort this out, because this sounds like it's exactly what you've been claiming your stuff is good at, and we could really use it and then we can pay you, and you're in Europe. So like, and Switzerland's nice. So go for it.

[00:16:58] So that, that led to some recurring collaboration with them. So I worked there for about seven months then. 

[00:17:03] Jason Gauci: So just so I understand. So you got SPARK to work or you got this thing from Microsoft? 

[00:17:08] Frank McSherry: Well, I didn't, I didn't get anything to work. Sorry.   (laugh) 

[00:17:12] They're already trying to, they knew what they wanted to build, essentially, like sorts of analysis they wanted to do.

[00:17:17] And they had tried to do that with SPARK, and SPARK, unfortunately, at the time was just falling behind. It couldn't keep up with the data volumes.

[00:17:24] Jason Gauci: I see. 

[00:17:25] Frank McSherry: The, the work from Microsoft Research, this, the stuff that led into this real-time streaming work, timely data flow and stuff like that, was originally at Microsoft.

[00:17:33] This project called Naiad. That was a C# project, and it worked great on Windows and worked on, on, on Linux. But...

[00:17:40] Jason Gauci: How do you spell that? N Y...

[00:17:42] Frank McSherry: N A I A D. It's Greek. The naiads were these, the animating spirits of rivers and streams and flowing water.

[00:17:54] Jason Gauci: Oh, that makes sense. There was a Big Data project called Dryad...

[00:18:00] Frank McSherry: Yup, the same, same group of people. Yep. 

[00:18:02] Jason Gauci: Oh, interesting. Okay. So is Dryad like an evolution of Naiad?

[00:18:05] Frank McSherry: The other way around, though evolution is maybe a bit strong. So Dryad certainly came first. Dryad sort of came, came to be after MapReduce rose in popularity and Dryad, essentially, it's why not think about larger data flow graphs, but still use roughly the same, same principle? So dryads are the animating spirits of trees and forests.

[00:18:24] Jason Gauci: Yup. Yup. 

[00:18:25] Frank McSherry: It's where that came from. Building sort of DAGs of, of data flow graphs. But it's still very much in the spirit of, of batch computation. So this data flow graph runs by looking at its inputs, which are probably very large datasets, turning on the bits of work nearest those, those data sets, running them to completion. They produce output data. You start up the next people.

[00:18:45] Jason Gauci: Yeah. So, I mean, maybe just to give a bit of context, a bit of context here, like you know, people know about, let's say, let's see an example here. Like MP3s, where, you know, you have to kind of encode something in an MP3. And so you wouldn't necessarily, you wouldn't necessarily have some incremental MP3, like most MP3s are, you know, there's, there's some sort of bookkeeping and some stuff built into the file and, you know, it might be random access, but there's usually some kind of paging, so it's not totally random access.

[00:19:18] And so you end up with like this kind of big volume, that's effectively immutable. If you want to mutate it, then you would run it through a process that produces another big volume that's slightly different. And so this is kind of the essence of, of, of MapReduce, or let's say batch processing. So, you know, you, you can take this, you can break it up based on how the data is chunked.

[00:19:45] So let's say the data is, is, is separated into chunks such that you have a thousand chunks, and at most you can have a thousand machines. You're reading in a chunk, doing something to it and then emitting another, another set of chunks.

[00:20:00] Now those chunks get shuffled. That's the shuffle part of MapReduce. And then the shuffled chunks end up in buckets and then those buckets can be processed a second time.

[00:20:12] And the output of all of this is just another huge batch of data. You know, SPARK and other things are kind of built on the idea of MapReduce. And there's a lot of also, cosmetic things built on MapReduce. Like there's Apache Crunch, which was kind of a, you know, something that sat on top of MapReduce and just made it more accessible.

[00:20:30] But one thing that's not really clear is how do you handle like a firehose? Like if you have something that's a machine that's generating logs, incrementally, you know, one a second or something. Well, then this idea kind of doesn't fit that paradigm. And that's where the real-time streaming is really important.
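
As a rough illustration of the batch flow described above (chunks read in, a map step, a shuffle into buckets, and a second pass over each bucket), here is a minimal Python sketch. The word-count task and the names map_chunk, shuffle, reduce_bucket and run_batch are assumptions made for illustration; none of them come from MapReduce, Dryad or SPARK themselves, and everything runs in one process rather than across machines.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: turn one input chunk (a list of lines) into (key, value) pairs."""
    return [(word, 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks, num_buckets):
    """Shuffle step: route every pair to a bucket chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(num_buckets)]
    for pairs in mapped_chunks:
        for key, value in pairs:
            buckets[hash(key) % num_buckets][key].append(value)
    return buckets


def reduce_bucket(bucket):
    """Reduce step: collapse the values collected for each key in one bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def run_batch(chunks, num_buckets=2):
    """Run the whole job; nothing is emitted until every stage has finished."""
    mapped = [map_chunk(chunk) for chunk in chunks]  # conceptually one task per chunk
    buckets = shuffle(mapped, num_buckets)
    output = {}
    for bucket in buckets:  # conceptually one task per bucket
        output.update(reduce_bucket(bucket))
    return output


# Hypothetical input: two chunks of log lines.
chunks = [
    ["error disk full", "warn retry"],
    ["error timeout", "error disk full"],
]
result = run_batch(chunks)
print(result)
```

The output is one more complete data set, which is why a source that never finishes producing input does not fit this shape directly.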

[00:20:47] Frank McSherry: Actually, your example of MP3s is pretty good. I don't want to pretend to know a great deal about how MP3s are encoded, but you could totally imagine, like we, we've been talking now for almost, almost half an hour and we could record all of that and plop it down in a file and then have someone pick that up and start the encoding process.

[00:21:05] But it's just as reasonable to imagine that as we've been producing data, someone could be picking that data up and start the process of transforming it and encoding it, so that by the time we're done the computation is also pretty nearly done and ready to, ready to disseminate.

[00:21:20] Or for example, people who are listening, they could in near real-time be picking up the output of the encoding process, something more efficient than just the raw WAV file. And not have to wait for the entire session to be recorded and then processed and then uploaded and all these things. 

[00:21:35] So, a lot of the streaming is, it's one of the things that's cool about streaming, I suppose, in a lot of cases, it's the same computation as these batch computations, it's just staged slightly differently.

[00:21:46] So instead of doing all of that first work at once and waiting until you're done, you can start the first step whatever that happens to be and start producing partial results. And then whatever the second step is that can also start working at the same time. And you just sort of keep things busy where they would otherwise be waiting.

[00:22:04] So otherwise everyone's just waiting for that first hour of data to be finished before they can even start working. And rather than do that, you just get everything going all at once. It's a bit more bookkeeping for sure. But the nice thing I think is that from the, potentially from the user's point of view, they don't need to think of new idioms necessarily.

[00:22:22] The system itself can just change this behavior. And it, it just, it's suddenly a bit more responsive than it was previously. You don't need to educate the user to tell them you must write a new program. So that's potentially really powerful if you can harness people's mental model for how do I approach working with Big Data and not have to change that to be some new, totally different way of working with data.
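
The point that streaming is often the same computation as batch, just staged differently, can be sketched with Python generators. This is only an illustration under stated assumptions: stage_one and stage_two are hypothetical stand-ins for real processing steps, and live_source is a made-up input that logs when each sample arrives.

```python
def stage_one(samples):
    """First step of the computation (a placeholder transformation)."""
    for sample in samples:
        yield sample * 2


def stage_two(samples):
    """Second step of the computation (another placeholder transformation)."""
    for sample in samples:
        yield sample + 1


def run_batch(recording):
    """Batch staging: each stage finishes over all the input before the next starts."""
    first = list(stage_one(recording))
    second = list(stage_two(first))
    return second


def run_streaming(source):
    """Streaming staging: the same stages chained lazily, one sample at a time."""
    return stage_two(stage_one(source))


events = []  # records the order in which inputs arrive and outputs appear


def live_source(samples):
    """Hypothetical live input that logs each sample as it is produced."""
    for sample in samples:
        events.append(("in", sample))
        yield sample


recording = [1, 2, 3, 4]
batch_output = run_batch(recording)

streamed_output = []
for item in run_streaming(live_source(recording)):
    events.append(("out", item))
    streamed_output.append(item)

print(batch_output, streamed_output)
print(events[:4])
```

Both runs produce the same results; the event log shows that in the streaming run the first output appears before the second input has even been read, which is the "keep things busy where they would otherwise be waiting" behavior. The stage functions themselves did not have to change.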

[00:22:45] Jason Gauci: Yeah, totally makes sense. I think that, yeah, streaming from a functionality standpoint is like a superset, right? Because you can always stream in a batch of data, but then in terms of what you can do, I'm sure there's some limits, like you don't have random access to the entire data set all the time.

[00:23:02] And so, you know, there, there's some things that you can't do with streaming, or if you, if you are going to do them, you have to accumulate some kind of bookkeeping versus doing it in one shot.

[00:23:15] Yeah, I think the closest I've come to data streaming, or, or to the issues, let's say, that I could imagine coming up with data streaming, is through Presto.

[00:23:23] So Presto is this SQL engine. It's not for real-time, but it keeps everything in memory. And so because of that, it's really fast. But because of that, as soon as you run out of memory, it's, it just gives up. And so, for example, if you wanted to, if you have a giant database, it doesn't even matter the size and you want to just see how many times your name shows up in the database, Presto can do that, because it can read as little as one row at a time, look for your name and then just keep a count.

[00:23:56] But if you wanted to do something like generate a histogram of names then, or, or maybe a better example is if you wanted to join the table to itself, so that all of the rows for the same name, were grouped together.

[00:24:12] Well then Presto has to keep the entire database in memory in some way, shape or form to do that self join. And most likely Presto will just blow up. And so, you know, you kind of have the same limitations there that kind of apply here, where you don't want to be in a situation where you need the entire data set, you know, at one time.
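
The contrast between the two queries can be sketched in Python. This is not how Presto is implemented; it is a minimal illustration, with a hypothetical three-row table, of why counting one name needs only a single counter while grouping all rows by name needs state that grows with the input.

```python
from collections import defaultdict


def count_name(rows, name):
    """Constant-size state: one counter, however many rows stream past."""
    count = 0
    for row in rows:
        if row["name"] == name:
            count += 1
    return count


def group_by_name(rows):
    """Growing state: every row is held until the input ends, as a self join would need."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["name"]].append(row)
    return groups


def stream_rows():
    """Hypothetical table, yielded one row at a time."""
    yield {"name": "ada", "city": "London"}
    yield {"name": "bob", "city": "Paris"}
    yield {"name": "ada", "city": "Turin"}


ada_count = count_name(stream_rows(), "ada")
groups = group_by_name(stream_rows())
rows_held = sum(len(group) for group in groups.values())

print(ada_count, rows_held)
```

Here rows_held equals the number of input rows, so on a table larger than memory the second function is the one that fails, while the first keeps working.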

[00:24:32] Frank McSherry: Yeah. I mean, you're not wrong. One of the things I suppose that streaming does is start to expose, expose some limitations, essentially. You can, as you say, like, you can look at a batch computation just as a streaming computation, and of course the, the process reading the data off of HDFS or off of your disk or whatever is not loading it atomically in, into memory, it's looking sequentially, most likely, at the data.

[00:24:58] So it's sort of streaming in off of, so it's, it's a type of streaming computation already. But as soon as you give people streaming systems and tell them, Ooh, low latency, you know, something, something, something they, they start to believe that, they start to use it and they start to be surprised if, if their computer catches on fire, when you do something like this. (laugh)

[00:25:18] Jason Gauci: Yeah. That makes sense. So, so, so Naiad, so we're starting to see how Materialize got materialized, right? So you, you were working with ETH on, on Naiad, you know, helping them kind of with that. And did you kind of see a lot of the issues that, that led you to work on, you know, start Materialize?

[00:25:44] Frank McSherry: Um, a little bit, not, I wouldn't say directly. No, let's see. So I'm just starting to roll back the clock just a little bit so that I avoid tripping myself up in the future.  (laugh) 

[00:25:56] One of the things that happened as I, as I departed Microsoft Research is that we were no longer meant to be affiliated with Microsoft.

[00:26:04] You know, we were no longer working on the Naiad codebase. And it felt like a good time to pick up a new, new programming language. So I, I pivoted over from C# to, to Rust at the time and started essentially doing a reboot of that project. So a different version of, of Naiad that, you know, fixed some of the issues that we had the first time around, and almost certainly didn't quite get as far in all the dimensions.

[00:26:27] But it started being what is now this timely dataflow in Rust project, which is actually what, what I went to ETH with and worked with them on their...

[00:26:38] Jason Gauci: Ah cool. 

[00:26:39] Frank McSherry: I would say that at ETH, this is an academic setting and you have a lot more liberties.

[00:26:44] So one of the biggest distinctions between academia and Materialize we'll get to, but, in academia, you have a lot more liberty to just do what you want and what you need. So in a sense, with what they were working on there, they're acting as sort of the consumer of the technology. So they could just build bespoke pieces of technology that would just work because, you know, as a bunch of computer science PhD people, they're all empowered to just write a whole pile of new code and say like, great, works for us, ship it.

[00:27:15] Materialize, by contrast, is very much the opposite. It's like we, the people building Materialize, have these skills, but the goal is to target people, users who don't want to have to get an advanced education in streaming dataflow infrastructure or anything like that. The goal is very much to take the ideas, the things that were learned along the way, essentially, and try to map them to concepts and idioms that a lot more people are already familiar with. In the case of Materialize, that's SQL, which is a language that doesn't say anything about streaming or, or any of that stuff, but does have sufficient concepts, things like joins and views and indexes and reductions, that you can allow people to express queries and ideas in that programming language, and then transport them. 

[00:28:00] You know, we do the hard work to do this, but transport them to streaming infrastructure. 

[00:28:05] Jason Gauci: Oh, interesting. That makes a ton of sense. There's a lot to unpack there. So. So, yeah, I guess since we're on that topic, how do you prevent people from doing things in SQL that would just cause a lot of harm, like trying to join a table to itself, for example.

[00:28:23] Frank McSherry: Yeah. So we don't.  (laugh) 

[00:28:26] Jason Gauci: Okay. 

[00:28:26] Frank McSherry: I think that's sort of fair to say, like the, it feels a little bit like databases back from the nineties or something like that, where you could with a crappy query take down, you know, your production database, if you go and try to do something that's a cross join or something like that.

[00:28:41] Jason Gauci: Okay. That makes sense. But with, you know, with the right window size, I don't know if that's an ANSI SQL or if it's only in Presto, but, yeah, there's this, there's this whole, we have this thing where we do a self join, but, you know, the, where clause is, is such that we can do it within a window.

[00:28:58] Like if we sort of, basically, sort the database, although sorting, I think, in streaming would also be a challenge. Yes, some of these joins, I think. But I think the streaming doesn't have to do everything. It just has to do some of the things that need to be done in real-time. And so it's a good complement to something like Presto or SPARK.

[00:29:17] Frank McSherry: Yeah. So for example, you brought up like one of the main pain points, to be honest, with SQL streaming, which is window functions in SQL. And for folks not familiar, window functions in SQL are roughly a way to write, in SQL, the equivalent of a for loop. Yeah, it's just sort of, you can say put these records together and now attach to each record its ordinal position in this list, like one, two, three, four, five, six, seven, whatever.

[00:29:43] And you can write queries that are really problematic. Like you could say, I do this and then get me all of the odd records out. And that that's a query. You can write it. It's a little. A little mysterious, but you can totally write this. And it's very problematic if someone adds one record to the beginning of this list, right?

[00:30:00] Because all of the answers change, and not just slightly change, the entire set of data flip-flops each time you add one new record and, and just very problematic in terms of performance and resources for that query. 

[00:30:16] And it's a good question. Should you work hard to prevent people from writing these queries? Should you let them write them and learn that, that their performance isn't good? Some, some of the queries are fine, right? If, if, instead of saying, you know, give me odd versus even records, you say, just give me the top five. It doesn't flip-flop nearly as much, you know, like, you know, you can add one record and the worst you can do is bump someone out of the top five.
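The difference between the two queries can be counted directly. The sketch below is plain Python standing in for a SQL `row_number()` window function. The helper names and the sample values are assumptions for illustration. It measures how many output rows change when one record is added at the front of the ordering.

```python
from typing import List, Set


def odd_positions(records: List[int]) -> Set[int]:
    """Number the sorted records 1, 2, 3, ... and keep the odd positions."""
    ordered = sorted(records)
    return set(ordered[0::2])  # positions 1, 3, 5, ... in 1-based numbering


def top_five(records: List[int]) -> Set[int]:
    """Keep only the five smallest records."""
    return set(sorted(records)[:5])


def rows_changed(before: Set[int], after: Set[int]) -> int:
    """Rows that must be retracted plus rows that must be added."""
    return len(before ^ after)


records = list(range(10, 110, 10))  # 10, 20, ..., 100
with_new = records + [5]            # the new record sorts before everything

odd_churn = rows_changed(odd_positions(records), odd_positions(with_new))
top_churn = rows_changed(top_five(records), top_five(with_new))
```

With these ten records, the odd-position query retracts all five of its old rows and emits six new ones, while the top-five query retracts one row and emits one. The first number grows with the size of the data, and the second stays at two.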

[00:30:40] But it's, it's a great question. And really this is the heart of what a lot of the Big Data problems out there have been. How do I figure out how to present an API to users that... These are sort of like handles to scissors. Like how do you present handles that you can grab safely, so you don't grab the cutting part of the scissors, you only grab the safe part?

[00:31:00] So it's a tool that you can pick up and only use safely. How do you, how do you do that? Like, like we, we know how to give people access to computers from, from EC2. You can just check them out, write whatever code you want and you can cause the computers to be arbitrarily problematic.

[00:31:16] That's easy. Like we know how to give people access to computers. How do you give them gloves that they can wear to access the computers safely and effectively? And sometimes that means telling people no. And that's sort of where the essence of a lot of these Big Data design questions is, like, how do you prevent people from, you know, I guess give them enough rope to use, but not so much that they can get themselves into trouble.

[00:31:41] Jason Gauci: Yeah. Yeah. That makes sense. Yeah, I think that, you know, over time you build this kind of mental model of what kind of works well with what engine. It's like, for example, sorting is almost never a good idea in Presto, because as soon as you want to sort, then you never know if, as you said, a record's going to come that needs to belong in the first position.

[00:32:06] Like the very last record you look at could actually be the first record when it's sorted. And so the only way to sort is to hold everything in memory. Right? So, so now with, with SPARK, for example, sorting is not an issue because it spills to disk. 

[00:32:19] And so SPARK will, basically, imagine you have a huge database. You want to sort by one column, SPARK will effectively create a file for, let's say, each letter. So the A file, the B, it's kind of like what you would do if you were sorting a list of folders, is you'd have an A group, a B group, a C group, so on and so forth. And then SPARK could just sort the A group and, and you, you know, you don't have to do it by the first letter.

[00:32:46] You could even do it by the first two letters, three letters. And so you could always find a way to do it in SPARK where it will be, you know, fast and efficient. And yeah, I think you hit the nail on the head that it's like very hard to kind of, you know, kind of like encode that knowledge. It's almost kind of like, you know, you could go to Home Depot and buy a saw. You can't really buy a saw that won't cut your thumb off. Like they just, they haven't invented that yet. And they probably never will. And so it's really about how do you let people experience Materialize or Presto or SPARK in a way where they make mistakes, but they don't kind of blow up the system or cut their thumb off. Right?
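The "one file per letter" idea can be sketched in a few lines of Python. This is a simplified illustration of partition-then-sort, not SPARK's actual shuffle or spill machinery. The function name `bucketed_sort`, the `prefix_len` parameter, and the one-name-per-line file layout are all assumptions made for the example.

```python
import os
import tempfile
from typing import Dict, Iterable, List


def bucketed_sort(names: Iterable[str], prefix_len: int = 1) -> List[str]:
    """Sort names by spilling them into per-prefix files, then sorting each file.

    Only one bucket is held in memory while it is being sorted. Assumes the
    names contain no newline characters.
    """
    result: List[str] = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths: Dict[str, str] = {}
        # Pass 1: stream the input and append each name to its bucket file.
        for name in names:
            key = name[:prefix_len]
            if key not in paths:
                paths[key] = os.path.join(tmp_dir, f"bucket_{len(paths)}.txt")
            with open(paths[key], "a", encoding="utf-8") as bucket_file:
                bucket_file.write(name + "\n")
        # Pass 2: visit buckets in prefix order and sort each one on its own.
        for key in sorted(paths):
            with open(paths[key], encoding="utf-8") as bucket_file:
                bucket = [line.rstrip("\n") for line in bucket_file]
            bucket.sort()
            # A real system would stream this bucket onward instead of
            # collecting everything into one list.
            result.extend(bucket)
    return result


sample = ["bob", "alice", "carol", "al", "bea", "aaron"]
sorted_by_one_letter = bucketed_sort(sample)
sorted_by_two_letters = bucketed_sort(sample, prefix_len=2)
```

Increasing `prefix_len` gives more, smaller buckets, which is the "first two letters, three letters" knob from the conversation.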

[00:33:31] Frank McSherry: Yep. You're absolutely right. One of the things that's a bit tricky with Materialize, I suppose, is that, whereas folks have this expectation with a lot of Big Data tools that you might cut your thumb off or that, you know, you should not randomly do things on your prod cluster, for example, the database community, with their products, has gotten pretty solid about trying to bulletproof a lot of the tools there so that you can't quite as easily catch your system on fire.

[00:33:55] If some person shows up, they have quotas in place, they have, you know, ways to protect queries from interfering with each other. So the expectations are a bit higher with that, with that crowd, actually. So this is definitely one of the slightly awkward moments, is that the prospects are showing up like, well, I, you know, I expect to be able to have 20 people use this and not get in each other's way. 

[00:34:14] What's your story? And we, you know, we sort of have to come back with, well, our story's the Big Data sort of side of the story, which is that if, if you really need these people to be isolated from each other, you should probably turn on a few separate copies of Materialize. And it's, it's not as exciting an answer as, as they were hoping for, for sure. But it's realistic, at least at the moment. 

[00:34:33] Jason Gauci: Yeah, that makes sense. Yeah. I think the way we do it at my company is there's a, there's like a Presto quota, probably same thing for SPARK, there's like quotas. And so if you maximize the like, let in, in theory, the worst case scenario is where you get as close as possible, once you exceed the quota, the job dies. So that's actually not that big a deal, but it's, if you're right at the quota for a really long time, and if a bunch of people are doing that, then things can start to get, get really bogged down. But that should be super, super rare because no one... it's hard to really design something, like you can't really optimize your query so that you're just under the quota. So...

[00:35:18] Frank McSherry: Yeah, no, I mean, you're totally right there, and there's some clever things other people did, you know, you can totally randomize the quotas a little bit to make sure that...

[00:35:25] Jason Gauci: Oh yeah, just introduce white noise in the quota...

[00:35:27] Frank McSherry: You know, harmless, just plus or minus a little bit, enough that no one can actually sniff out where it is that they can safely operate. 

[00:35:33] Jason Gauci: Yeah. That's funny. Okay, cool. So, so yeah, so you kind of, actually one thing about Materialize, we'll go back to the background, is it, is it ANSI SQL or, or have you added things to SQL?

[00:35:46] Frank McSherry: Yeah. The target is ANSI SQL. We were very, very cognizant of the fact that with, with SQL, the language, there's a bit of an uncanny valley where if you are in fact SQL compliant, great, people can use you, tools can use you.

[00:36:00] So a lot of people's tooling uses SQL. If you're only 90% SQL compatible, things catch on fire pretty quick. Right? Right. Like you can demo. Here's a join and a reduction. You're like, Oh, that's great. Join and a reduction. Pretty happy with that. 

[00:36:15] But as soon as people realize that maybe you've got different semantics for NULLs in some places, or maybe you don't do a great job at multi-way joins or, you know, support prepared statements or, various things like this, you know, your tools start to fall apart.

[00:36:29] Things that used to get correct answers suddenly get mysterious glitches in them. And although you thought, Oh, 90% compatible, the actual usefulness of the SQL is, is closer to zero at that point. So we've cleaved very strongly to, to ANSI SQL. SQLite has, I don't know, a 5 million query test battery that we're in total compliance with at the moment. All sorts of... 

[00:36:56] Jason Gauci: Oh wow. Cool.

[00:36:56] Frank McSherry: Really obscene cases. Like things that you wouldn't. I, I would never have thought someone would write these queries. You know, you can write correlated subqueries in the join condition of an outer join. And it was some pain to, to get those to be correct and correct in a streaming fashion. But it makes a lot of sense. Like it's, it's very sensible to try to do SQL right, if you're going to do it. We've not really added too much. There are a few, I would say, interesting interpretations that we've done of things. They're, they're a bit technical. I'm happy to go into them, but they're sort of, there are things that don't really make quite as much sense in a standard database. But really cool interpretations in the streaming space. 
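The NULL semantics mentioned above are a good example of where "90% compatible" breaks down. The snippet below uses Python's built-in `sqlite3` module as a stand-in SQL engine, so it says nothing about Materialize's own behavior. The table `t` and the helper `scalar` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# Comparing NULL with "=" yields NULL (unknown), which Python sees as None.
null_equals_null = scalar("SELECT NULL = NULL")
# "IS NULL" is the comparison that actually yields true.
null_is_null = scalar("SELECT NULL IS NULL")
# count(*) counts rows, count(x) skips rows where x is NULL.
count_rows = scalar("SELECT count(*) FROM t")
count_values = scalar("SELECT count(x) FROM t")
# A NULL inside NOT IN makes the predicate unknown for every non-matching
# row, so no rows survive the filter.
not_in_with_null = scalar("SELECT count(*) FROM t WHERE x NOT IN (1, NULL)")
```

An engine that treated `NULL = NULL` as true, or let the row with `x = 2` pass the `NOT IN` filter, would silently return different answers to tools written against standard behavior.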

[00:37:41] Jason Gauci: Yeah. I mean, what's a, what's an example of something that, of one of those things? 

[00:37:44] Frank McSherry: So like one of the things that, so in a standard SQL database, you can use, I'm going to lie a little bit here, but you can use the now function to get the current time for when your code is being, being run, which, I don't know, you might do to print out along with your results when a thing actually happened. You know, something like that, and that's, that's fine. That's a, that's a good use of, of now. 

[00:38:09] You can do something really interesting though in Materialize, which is to put that now term in a, in a predicate, like in, in a where clause. So you can say like where my data dot timestamp greater than now. And what that does is holds back the data until the current time is equal to whatever value is seen in your data, right?

[00:38:32] So since, since we're evolving the results of this, this query over time, it's going to give essentially a temporal instruction to the system that says, here's an interesting record. Don't show it to anyone, until the time that is written down in this piece of data. 

[00:38:48] And it allows you to start programming with time and stuff like that in a way that, yeah, I, you know, you could write that query in vanilla SQL that just does one-off queries and gives you the answer, but it introduces some really interesting new behavior in a streaming system that is going to update the data over time.

[00:39:05] Jason Gauci: Yeah, that's what I mean, as soon as you introduce something like that, then you can't really throw any data out because you just don't know when it could become relevant. Right? 

[00:39:14] Frank McSherry: Yeah. Though delightfully, right, you can use this exact same query to say where my data dot timestamp, I think it's less than now, the other direction, the other inequality direction. Essentially it says like, throw my data away as soon as this time passes, right? This record will never pass this predicate again, because we know that now only goes up, and you have some piece of data that we've now passed. 

[00:39:40] So this is actually a way, in your query, to describe what data are okay to garbage collect and, and clean up, so that you can keep, you know, if you wanted to have, for example, a one hour window that you're maintaining that slides continually through time, you can totally say, you know, blah, blah, blah, select all the records where now is between my data dot timestamp and my data dot timestamp plus one hour, and that will wait until that time to introduce the data. And one hour later clean up the record, throw it away. You'll have a constant memory footprint over time and just stuff like that. 

[00:40:15] But again, if, if you're thinking about now, I'm just going to use real data, like a data warehouse, you got to plop all your data in there, and it's gotta look at all your data over and over again because, you know, who knows what's going on in there. By giving clearer instructions to the stream processor, we actually can learn a bit more about what do you really need to keep around? Like what data can we throw away and how can we, you know, more efficiently operate to keep your query up to date? 
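Here is a small Python simulation of that one-hour sliding window, to show why the predicate doubles as a cleanup instruction. It is a sketch of the idea only, not how Materialize implements temporal filters. The class name `SlidingWindow` and its methods are invented for the example, and timestamps are plain integers in seconds.

```python
import heapq
from typing import List, Tuple


class SlidingWindow:
    """Maintains the records for which: ts <= now <= ts + window."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.pending: List[Tuple[int, str]] = []  # (ts, record), not yet visible
        self.active: List[Tuple[int, str]] = []   # (ts + window, record), visible

    def insert(self, ts: int, record: str) -> None:
        """Accept a record that should become visible at time ts."""
        heapq.heappush(self.pending, (ts, record))

    def advance(self, now: int) -> List[str]:
        """Move the clock forward (now only goes up) and return visible records."""
        # Lower bound of the predicate: hold data back until its timestamp.
        while self.pending and self.pending[0][0] <= now:
            ts, record = heapq.heappop(self.pending)
            heapq.heappush(self.active, (ts + self.window, record))
        # Upper bound: once now passes ts + window the record can never match
        # again, so it is safe to throw it away for good.
        while self.active and self.active[0][0] < now:
            heapq.heappop(self.active)
        return sorted(record for _, record in self.active)


window = SlidingWindow(window=3600)
window.insert(100, "a")
window.insert(2000, "b")
history = [window.advance(t) for t in (50, 100, 2500, 3700, 3701, 5601)]
```

Because expired records are dropped rather than re-examined, the state held at any moment is bounded by how much data falls inside one window, which is the constant memory footprint described above.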

[00:40:38] Jason Gauci: Yeah, that makes sense. So what about, you know, one of the things that I was really excited to see SPARK 3.0 add is, is the, the sort of array aggregation and all of that. So like you can, for example, you know, there's an array column type, which is in, you know, in, in Hive and SQL, but not, not in ANSI SQL.

[00:41:01] And that, that array data type ends up being super, super useful. Like you might say. Take all the records with this person's name and build an array of, of all the, the ages the person said they were, and then let's analyze a distribution or something. 

[00:41:18] And so that, that seems to be the thing that I miss the most. Whenever I'm using something like SQLite, I always kind of miss, you know, array_agg and map_agg and some of these functions.  
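For readers who have not used `array_agg`, the example of collecting every age a person reported behaves roughly like the following Python. The function mirrors the SQL aggregate's name for readability, but it is an illustrative stand-in, and the sample rows are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def array_agg(rows: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group (name, age) rows by name and collect the ages into a list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for name, age in rows:
        grouped[name].append(age)
    return dict(grouped)


rows = [("ann", 31), ("bob", 40), ("ann", 32), ("ann", 31)]
ages_by_name = array_agg(rows)
```

Each group keeps its duplicates and its arrival order, so the resulting list can be fed straight into something like a distribution analysis.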

[00:41:28] Frank McSherry: Yeah. We have several of these. I don't want to, I get myself tied up a little bit when I try to distinguish them all. There's arrays and there are lists, and there's a little bit of a difference because Postgres has, we're basically following a lot of Postgres.

[00:41:41] There's some distinctions between the raggedness of arrays versus multi-dimensional arrays and it's, it hurts my head to try to keep all of these straight. But yeah. There's a, I'm thinking of what we have literally at the moment. We literally have a JSON aggregation that allows you to do these groupings and then pack them into a common JSON object.

[00:42:01] If we don't have an array aggregation, it seems like the sorta thing that's super easy to add, but yeah, the functionality, I guess we've, we've generalized it a little bit. I think we've not invented too many things of our own, I want to be careful, but, but we have, when you look at what folks come to us with, like, they show up with Avro data and Avro, you know, can represent some, some quirky things in it.

[00:42:22] And we got to figure out like, well, someone showed up with some Avro data. We need to make sure that the type system is rich enough to reveal the various things that people might've shown up with for data. And that includes various various forms of arrays and stuff like that, that aren't as commonly seen in, in ANSI SQL.

[00:42:39] Jason Gauci: Yeah, that makes sense. Okay. So we got to ETH Zurich, we went back a little bit to talk about Naiad. And so, yeah. let let's sort of continue the story there. I mean, did, was Materialize created while you were on this road trip, or was it like conceptualized while you're on this road trip?

[00:42:56] Frank McSherry: So I think the right way to frame it is that Materialize was conceptualized by my co-founder, Arjun Narayan, who was working at Cockroach, Cockroach Labs at the time. And he had, during the course of his, his PhD, been working in the same sort of area, Big Data systems, stuff like that. And had...

[00:43:15] Jason Gauci: Yeah, correct me if I'm wrong, but CockroachDB is like a key value store, right? Like a Big Data cube?

[00:43:20] Frank McSherry: Yeah, roughly. I mean, it's, it's more of a transaction processing, like OLTP-style system than an analytic processor. 

[00:43:27] And this makes a bit of sense, to be honest, like they were, I would say, not an expert here, but they were good at what they were doing, which is storing data, keeping data, you know, consistent, all of these sorts of things.

[00:43:37] And were sniffing around for like, well, what's the right way to process all this data, right. It's sort of silly to, all of a sudden, dump it out to HDFS and call into SPARK or something like that. And Arjun's take, at least, I hope I'm not misrepresenting him, was that the Naiad paper, the thing that came out of Microsoft Research, was a great answer to all of this, right?

[00:43:56] It sort of resolved a lot of the the quirks that stream processing systems had at the time. And that this would make a lot of sense for anyone who uses a transaction processor to keep the primary source of truth for their data, but wants to attach to it some analytics that will continually be able to ask questions and also keep, keep answers up to date for questions you've already asked.

[00:44:18] So I would say he, he and, and potentially collaborators at Cockroach, but he was the one who was pushing forward on the idea that this is really interesting technology. And there's actually a pain point that people have out there where you know, you can use a data warehouse for sure. And just ask questions over and over again.

[00:44:34] But there are a lot of people who have relatively fewer questions, I suppose. They want to see the answers to their queries refreshed as quickly as possible, always up to date. And ideally, this shouldn't have to mean that we have to go back to the data warehouse once a second and reissue the query from scratch.

[00:44:48] So he was the one who sort of showed up, I would say, with the, like, let's actually do something specific here. His pitch to me was roughly like, wait, wait, sorry. We knew each other from before, but, but his pitch with respect to the company was, it's super interesting. Like all the stuff in Rust that you're building, and you write a bunch of fun blog posts, but if you actually want to see if this has legs, if this can actually go anywhere.

[00:45:10] There are going to be annoying things that you don't want to have to do. Someone's got to write documentation. People are going to have to write tests. You're going to have to go and shake hands with potential customers and stuff like that. And that's not what you want to do in your day to day, and he's totally right.

[00:45:25] But that the right vehicle to do this was to put together a company, basically put together something that has some funding so that you can pay people to put together marketing information, to put together documentation websites, you know, write tests, write SQL compatibility layer, stuff like that.

[00:45:40] Jason Gauci: Yeah. That's such a really, really good point. I think there is this kind of misconception that a startup company is, you know, like Steve Jobs and Steve Wozniak in their garage, just, just writing, you know, building a bunch of systems, or just, you know, one person in their garage. But, but, but you're, you're totally right that you need to have sort of sales.

[00:46:01] You need to have people writing documentation. You need to have that whole ecosystem right at the beginning. And yeah, and I, I see a lot of people who have some really good technology, but I think they kind of miss that part of it, that you need to have that, that, that whole part of it. And we talked a little bit about this in the last episode about Docker, and I don't want to beat up on Docker again, but, but, you know, Docker has amazing technology, but on the business side, you know, there were some real challenges.

[00:46:35] And so it's really important to kind of have a person who's really plugged into that. Who can help out with all of that. 

[00:46:42] Frank McSherry: You know, I definitely found this to be the case. I imagine in most startups, these roles exist, whether you like them or not, of course, you know, someone to do community management or tech support or these things.

[00:46:51] And presumably in most very small startups everyone just wears five different hats and you probably do a little less good of a job than if you've got a, a specialist to do a thing. And so part of, I mean, part of what was compelling, I suppose, about Arjun's proposal is like, this is good enough stuff that we can get some funding and actually get people who are good at these jobs and like doing them, rather than have to slog through the unpleasantness of doing them yourself necessarily.

[00:47:17] Jason Gauci: Yeah. It makes sense. And so, so did you go straight from ETH to Materialize or was there something in between? 

[00:47:24] Frank McSherry: Oh, it, yeah. Sorry. There's, there's a bunch of time dilation that went on. I was at ETH twice actually. I was there for seven months the first time and, you know, having, having done the thing that I thought I was there to do, went off and tootled around a bit more, bit more in Europe, I was just happy to be where I was and did some more surfing and just relaxing.

[00:47:44] And eventually I ended up going back to ETH, for a little over a year, and it was a bit more, you know, formally there at the time, you know, I was working with students, advising folks, so, helping some folks see through their, their PhD dissertations, but then, you know, it became clear that, that was not for me forever.

[00:48:04] And that Materialize made sort of a lot more sense than this. The second time there, I departed, I would say it's early 2019, roughly, and landed in the US, and at that point Materialize had already been started up, essentially. Arjun had chatted about it beforehand and thrown some decks around.

[00:48:24] But, but yeah, I came back from, from Switzerland and was employee, I think, number five, I guess, at Materialize at that point. 

[00:48:30] Jason Gauci: Cool. So what was it like to, you know, talk to investors. I mean, that's, you know, so I also, you know, have an academic background and, and since going to university, I've really only worked in research labs.

[00:48:44] And so you know, you kind of share that background at least up to ETH. And so what was it like going from that to creating a pitch deck, talking to investors and, and, you know, what, what was that transition like? 

[00:48:57] Frank McSherry: It was, for me, at least, it was very surprising. The, the thing that was surprising is that I think we, we went through about a week of, of pitching stuff in, in the Valley and each meeting that we went into, I went in with some preconceptions and came out with exactly the opposite conclusion, basically, from what I had expected. 

[00:49:16] Jason Gauci: What's an example of that? 

[00:49:18] Frank McSherry: No, I mean, just like we went in, the very first thing we went into, we were like, Oh, this is great. You know, we're, we're pretty solid. The, the deck looks pretty good. And came out of that, and there's a lot more skepticism about things than we had realized, not of the type, like no one doubted the technology or anything like that. They, they were less sure about how big the market was, for example, and I hadn't even thought of that.

[00:49:36] Jason Gauci: Oh, that makes sense. 

[00:49:38] Frank McSherry: We went into the second meeting and I was pretty sure that, ahead of time, the, the person that we were going to chat with was already invested in what was essentially a competitor. And we were basically thinking like, well, I guess we're, we're in deep trouble then, like this isn't gonna work out.

[00:49:55] If we can just sort of warn them and go home. And they're like, no, no, I'm interested. I'm very interested. 

[00:49:59] Jason Gauci: Were you afraid of even giving the pitch? You know, because, you know, if they're already invested in your competitor?

[00:50:06] Frank McSherry: Not, not really. I mean, I think one of the nice things about Materialize, it's very reassuring is that nothing we're doing is secret.

[00:50:12] So it's not that there's some cutting information that if anyone got access to it, they would suddenly have a big advantage over us. The main advantage that we have is the technology that we're using is, pretty cutting edge, I would say. I mean, that's self-serving but yeah, that's the main thing that distinguishes us and it's not trivial for anyone else to say like, Oh, I see we should just use the same technology and we'll be where they are.

[00:50:37] So we weren't too worried. Or at least I wasn't too worried about showing up and saying, Hey, we're looking, we're going to do a thing. This is the thing we're going to do. Keep it a secret. Like I don't, I don't think anything that we were talking about was particularly secretive. 

[00:50:51] So, no, I wasn't, I wasn't too worried about that. Maybe I should have been, I dunno, I'm hopelessly naive when it comes to some of these things.  (laugh) 

[00:50:57] Jason Gauci: No, I mean, I think, I think what you said really resonates. I mean, I think to replicate it, they have to really replicate your whole history. It's not good enough just to take, take a snapshot of, of, you know, what you're thinking right now you have to, it's not Markovian, right?

[00:51:13] Like it's, it's kind of based on, your trajectory is kind of based on your, all of your experiences and you can't easily transfer that. 

[00:51:20] Frank McSherry: That's totally true. So for example, one of the things that we've, we've had to do and has, has been like some of the value that's been added at Materialize is trying to figure out how to take a bunch of these cases, SQL idioms, and map them down to dataflow computation.

[00:51:33] So when someone has a correlated subquery, someone's got to figure out how to turn that into dataflow computation, and that's not explained anywhere else. Right? That's not a thing that exists in the open source software that I had previously written. So it was very much a bet on the team. Like, the team will be able to figure this out

[00:51:49] was the bet that the VCs were making. Not that the software already does it, but that there will be some, some problems that we face, but these people are well-prepared to clear those hurdles, essentially. 

[00:52:00] Jason Gauci: That makes sense. And so this, this investor was super interested. And then did, did you kind of, what was that conversation like where you, at some point you had to kind of talk about the elephant in the room, right? Which is that they're kind of double-dipping.

[00:52:14] Frank McSherry: Oh yeah. They were very kind of, yeah, they were very clear and it was that, you know, their responsibility, like they have some responsibility of course, to the people that they're invested in, but they also have responsibility to the people who have invested in them and their funds.

[00:52:26] And as long as they're not in conflict, I think this person's particular take was like, as long as it's not zero sum, right? If it's, if all the money that you would make would come at the expense of these other people, that's no good, that's not a thing that they can... but if investing in two people who happen to be sharing a pie, and in the course of that, the pie is actually, let's say, 50% bigger than it initially was, then, you know, okay. Company 1 doesn't get all the money in the world, Companies 1 and 2 have to share it, but it's, it's in this case much better for the investors in the fund the VC is managing.

[00:53:04] I have to imagine also that there's different takes on this, right, across the spectrum of, of VCs. You know, some people are perhaps a bit more kind and gentle, and maybe some people are more vicious and, and you know, trying to get access to whatever money it is that they can get. I have no idea. I definitely don't want to judge there, but...

[00:53:24] Jason Gauci: Yeah, totally. I mean, this is just one, one, one meeting, but I think yeah, it's really, it's an interesting, you know, kind of dichotomy, because on one side, yeah, I think you hit the nail on the head. It really depends on is the pie growing. So if there's, if there's a thousand customers and startups you know, are only able to serve, acquire one at a time then, and you know, that there's a whole ecosystem full of startups out there, then yeah, I mean, totally makes sense to have at least two of them, maybe even more.

[00:53:55] When things get big and it becomes like an Uber-Lyft type thing. There's actually this podcast that I follow called Business Wars that talks about, you know, when companies get so big that they basically exhaust the market, and then they go to war with each other.

[00:54:09] And it's, it's a fascinating podcast, but, but yeah, I think that is maybe, you know, if you're, if you're if Materialize is like competing with, you know, the, the next biggest player, you know, and both companies are, you know, dominating the world together, then that's, that's not a bad position to be in. If you're an investor, it's like, okay, I'll take that. 

[00:54:32] Frank McSherry: Yeah. Yeah, no, you're not, you're not wrong. And each of the participants, you know, Materialize and the others would really love that if the other person would sort of, you know, not, not be there or, their lives would be a lot easier, but from the investor's point of view, presumably, no, this is great.

[00:54:47] Like both of you are gonna make better products, you know, both of you are going to compete to be price competitive. I'm sorry, I'm making up a bunch of economic stuff. I have no actual background here, but I can imagine a world where it's not inappropriate to, you know, support folks who are, yeah, again, eating more of the pie as opposed to trying to fight over the same piece of pie. 

[00:55:11] Jason Gauci: Yeah. Yeah, totally. So, so, okay. Start up, Materialize. You're employee number five, and you know, you have this sort of academic background. I'm assuming there are a mixture of people who are really into the theory and, you know, handling a lot of these edge cases, and you're doing a lot of these really complex transformations of SQL to, you know, your engine. 

[00:55:40] Then there are a lot of, I guess you know, front end engineers and there's a whole engineering area. How do those two areas kind of collaborate?

[00:55:51] Frank McSherry: It's a good question. I think the short version is that the folks who are interested in the theory, like the me-type people, needed to adapt a little bit. And this is mostly because when you look at what Materialize actually needs to do, the goal isn't specifically to advance some very cutting-edge theory and to be really smart and write dumb, obnoxious blog posts, it's actually to do a specific... (laugh) Right, and if you look at, you know, the folks, generally speaking, on the engineering side of the house of Materialize, they're a bit more eyes on the prize about like, you know, we need to actually make this work. That's the actual goal. 

[00:56:25] The goal is, you know, okay. The friends I've made along the way. That's also, it's also very good. But the reason that we're here is to try to put together a thing that looks and behaves a lot, like in this case, Postgres, you know, compliance with SQL and under the covers does it all very efficiently, hopefully things like that. And that's actually the goal. So you know, folks should in some sense get in line and do that sort of work.

[00:56:49] And I remember when I showed up, I was initially very like, Ugh, this is exhausting. SQL has so many, so many words. It's just, it's gross in a few different ways. Do we have to do this? And you know, at the time it was like, maybe, yeah, I was thinking maybe we don't, you know, we could do some funny business somewhere.

[00:57:06] And the answer was pretty clear. No, no. Like it's really important to do SQL correctly. That may suck. I'm sorry. But the thing that we're making makes sense if we do SQL correctly and not otherwise, so let's figure it out. 

[00:57:18] Jason Gauci: Yeah. I mean, speaking from the other side of the table, there is something that wasn't ANSI SQL. I'm trying to remember. I think it was maybe like Hive. Yeah. Hive. So Hive isn't ANSI SQL. And so yeah, just converting queries from Presto to Hive or from, you know, testing them locally on SQLite and converting into Hive. Like it's never a straight conversion, it's always a huge pain and you're always kind of wondering, like, why didn't they just go take the extra step now?

[00:57:47] I mean, Hive, I think, that whole Hadoop ecosystem was filling such a huge void that they had a lot of latitude in terms of the product. But ultimately, I mean, Hive was replaced by Presto and SPARK and things that were more compliant. So, so I mean, even then, I mean, it didn't last, it was just a honeymoon phase, right? 

[00:58:07] But yeah, you hit the nail on the head. I mean, if it, if, if, you know, especially if you're at a bigger company, if you have, you know, 20,000 queries that you run and you push them to Materialize and like a hundred of them fail. You know, for one person who's trying out a new product, that's insurmountable to try to fix a hundred...

[00:58:25] It's usually really ugly fixes, a hundred times. And so it kind of can't be 99% done. It has to be a hundred percent for you to really get those customers. 

[00:58:36] Frank McSherry: Totally. And again, one of the, one of the changes, I guess, coming from the academic space is, like, in the academic world, you want to... it's a bit introspective there.

[00:58:44] You're like, I, my goal is to think of a clever thing and then tell the world about it. Whereas on the business, sort of real-world side of things, your goal is to meet the potential users where they are, right? Like you want to get some technology to them that they can pick up immediately and start working with.

[00:59:00] They're more and more delighted the less they have to screw around with it or figure out, or, you know, if their life is now fixing these hundred queries as people write new ones, that's terrible. I mean, that's, you know, that's not the thing that they were hoping it was going to be. And you, you get to notice this a bit more as you show up.

[00:59:17] And I mean, I was learning this at least coming from academia, where you get rewarded for being clever and different, to a space where absolutely the goal is to try to be as not different as possible. Ideally not have to tell anyone about your cleverness. They just sort of experience that your product is, for whatever reason, much more pleasant to use than the competition.

[00:59:37] Jason Gauci: Yeah, that makes sense. And so in terms of customer acquisition, is, is your, is Materialize's style kind of like a bottom-up thing where you have a free tier and you try and get, you know, developers to convince their manager or director to jump on board? Or is it more of like an enterprise thing where you go and make a pitch to the leadership? Like what was the kind of model for Materialize? 

[01:00:05] Frank McSherry: That's a good question. I should probably go script the answer to this because there are very, like, sort of clear takes on each of these things. My experience with Materialize has been that the people that we end up trying to convince Materialize is good have so far been not strictly the bottom-up, like just random developer trying to get a thing done, but maybe like a tier up from that.

[01:00:27] So a person who's trying to think about like, how should I organize infrastructure for my group or something like that, or like, I need to support a few people, the various people writing SQL queries. How should I go about doing that? And this person has some latitude to make a good decision or bad decision, but they're sort of a decision-making type of person, rather than a person who can pull whatever they want onto their laptop and start using it.

[01:00:51] At the same time, we're not sort of going over and scheduling meetings with Coca-Cola to try to tell them like, you know, please, please stop using big competition and start using us instead, you know, business, business, business, handshakes, martinis. (laugh) 

[01:01:04] I would say the motion is a bit more bottom-up in the sense that it's technology led. Folks are meant to understand, the users are meant to understand that this is a valuable thing to do, that they like the experience, more low-latency responses to queries are better, as opposed to more top-down like your organization will be better, cheaper, whatever, if you pivot over to Materialize. That might also be true, but it's harder to put that in front of people at the moment.

[01:01:30] Jason Gauci: Yeah. I think anything having to do with, you know, data, you know, anything to do with data will require you to be a step up from the developer because it's not something you can run on the cloud. Like people aren't going to just move all their data to some kind of public cloud that Materialize has access to.

[01:01:49] And so something has to be, I'm assuming something has to be kind of done where Materialize is kind of plugged into whatever, you know, their data system. I mean, it might be on AWS, but it's obviously, I could be saying it's as opposed to the public. Yeah.

[01:02:05] Frank McSherry: Well, I should take this opportunity to throw out there that Materialize Cloud has just entered private beta. Folks, Materialize.com/cloud.

[01:02:13] And hop onto the sign-up list; you know, folks are being admitted in waves. But, you know, the intent is for sure to try to put together a thing where an organization can try this out. You know, we'll deploy inside your private cloud in AWS, but if you've got your data in Kafka or something like that, we can attach a Materialize instance to it and start reading it and give you, like, a 30 seconds or something like that interactive experience where you get to see what it's like to start using this, maybe to start to make some decisions about, is this you loving this or is it the same problems as before. 

[01:02:48] Jason Gauci: Yeah, that makes sense. But, but in so far as like, it is kind of a bigger commitment than trying out a different IDE, for example.

[01:02:57] And so in so far as that's true, like I would say from what you described Materialize is sort of bottom-up at the lowest level that you, you can reach and still get the kind of commitment that you need to set up. 

[01:03:09] Frank McSherry: Agreed that it's more sophisticated than just getting a new IDE or a new theme for VS Code or something like this, for sure.

[01:03:16] We've tried to make it not terrible from the point of view of an incremental deployment. So for, for example, you know, step one is not reformat all of your data into our native representation or something like that. Like, we'll look at you know, your Kafka topics, pull data out of there that, you know, could be CSV format.

[01:03:33] It could be Avro, JSON, you know, various things. Hopefully, ways you've already written your data down so that we're not actually introducing any new costs for you. So it's not as bad, like, you know, for sure with other systems out there, step one is, okay, we need to pivot all of your data in HDFS into a columnar representation because that's the only way we work efficiently.

[01:03:52] So like, one week later you can actually try running one of these sort of grindy, OLAP-y style analytics tools. 

[01:04:00] Jason Gauci: That makes sense. So you're kind of plugged into Kafka and I think the Amazon one is like, I want to say, it's Kinesis. 

[01:04:08] Frank McSherry: Kinesis is another one that they have. Yeah, yeah.

[01:04:11] Jason Gauci: Yeah. There's a bunch of these, pubsub type things or, you know, basically sources, we'll say sources for real-time data.

[01:04:18] And so you've written kind of adapters for a lot of these different sources. And so as long as people are using one of these, you know, kind of standard things, then they can try out Materialize. 

[01:04:29] Frank McSherry: Yeah. And the goal for sure, is to show up, from our part, show up with as many of these points of integration as we can reasonably manage with the team that we have.

[01:04:37] So, you know, if you can pull data, Kafka is the easy one at the moment. Kinesis has some interesting characteristics that make it a bit harder to show people that data and be correct and like show them the same data again the second time, if it crashes and starts up again. 

[01:04:51] But, but for example, also there's some recent work to pull data out of Postgres as a read replica, essentially.

[01:04:57] So to use the replication protocol out of just a Postgres instance and say like, if you have your data in Postgres, then Materialize can attach to that. 

[01:05:04] Jason Gauci: Oh, that makes sense. Yeah. So, so stepping back a bit, looking at, you know, someone who's in high school or college and, and you know, maybe they have some very, very limited SQL, like maybe they've written a, you know, they've, they've made some MySQL queries on a startup, you know, a small project, a hobby project. How can they get started with Materialize and, and is there sort of a free tier or what's a way for, for students and hobbyists to learn more? 

[01:05:33] Frank McSherry: Oh, absolutely. I mean, you can definitely, Materialize's source is available in and very nearly as available as we can make it.

[01:05:40] It's, it's BSL licensed. So basically anyone can go in and grab it. And as long as you aren't building a competing database-as-a-service-style product, you're free to use it for whatever you want. And you can go grab the code, build it. We have Docker images that we should push out each time we successfully build something.

[01:05:59] And you just grab this down, pull it down to your laptop. You don't need any complicated Apache infrastructure, you know, ZooKeeper up and running, any of that stuff. It's literally a single binary. You turn it on. You connect to it as if it were Postgres. So if you have a terminal and you use psql, which is sort of the standard way to shell into Postgres, you can use that to connect to Materialize. And if you don't have Kafka up and running, you can point it at, like, a file, for example, and you can, you know, append rows of, let's say text, a bunch of different formats, but append rows of text to the file and see the results continue to update there. 

[01:06:33] This is one of the sorts of interop that, you know, it's a little janky, but this is how folks have prototyped things, where you have a file on your laptop that's continually scraping some other source of data on the internet, appending stuff to the file, and then Materialize is essentially tailing that file, right? It's watching for changes to it. And anytime new data show up, it'll push them into the pipeline and update all of your queries. 

[01:06:53] And you can do all of this without public enterprise infrastructure, anything it's just on your laptop. This is how I use Materialize a lot, to be totally honest. 

[01:07:00] Jason Gauci: Oh, that makes sense. It's kind of like, you could use it as kind of like a tail on steroids. 

[01:07:05] Frank McSherry: No, absolutely. Like if you're used to using, I don't know, like awk or something like that, to do a little bit of data munging through your CSVs, and needed something more advanced than that, like awk is great at what it does. I use awk a lot, but if you're like, Jesus, I really need to take these five CSVs and find things present in here and not present in there and get the distinct things back out, yada yada, something SQL-like, yeah, you can totally use Materialize to do that and keep things up to date as data change, if that's exciting to you.
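
The "present in here and not present in there, distinct" query Frank describes can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module rather than Materialize, it runs once instead of staying up to date, and the function name `only_in_first` and the single-column layout are assumptions made for the example.

```python
import csv
import io
import sqlite3


def only_in_first(csv_a, csv_b, column):
    """Return the distinct values of `column` found in csv_a but not csv_b.

    Both arguments are CSV text with a header row. `only_in_first` is a
    hypothetical helper for this sketch, not part of any real tool.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (v TEXT)")
    conn.execute("CREATE TABLE b (v TEXT)")
    for table, text in (("a", csv_a), ("b", csv_b)):
        reader = csv.DictReader(io.StringIO(text))
        rows = [(row[column],) for row in reader]
        conn.executemany(f"INSERT INTO {table} (v) VALUES (?)", rows)
    # EXCEPT keeps rows of the left query that are absent from the right
    # query, and it also removes duplicates.
    query = "SELECT v FROM a EXCEPT SELECT v FROM b ORDER BY v"
    result = [r[0] for r in conn.execute(query)]
    conn.close()
    return result
```

In a streaming system the same SQL would be left running and its answer revised as rows arrive; here the answer is computed a single time from whatever is in the two inputs.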

[01:07:33] Jason Gauci: That is really, really cool. Yeah. That is, that was really like, let me just give a bit of tech background and feel free to kind of correct any, any records here, 'cause this is, this is just shooting from the hip here, but... 

[01:07:45] So a bit of background, so there's, you know, in Unix, there's tail, so you can have a big text file. You do tail file, you get the last 10 lines, right?

[01:07:53] Simple enough. There's also tail -f. If you do tail -f, instead of just giving you the last 10 lines, it will actually just listen to that file forever. And anytime a line is added, appended to that file, tail will print it out. So think of tail -f as like this monitor that's just listening for changes and writing them out.
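
A rough Python sketch of the "listen to the file and emit each appended line" behavior described here. It is a simplified polling loop, not how tail itself is implemented; the function name `follow` and the `max_idle_polls` stop condition are assumptions added so the example can terminate.

```python
import time


def follow(handle, poll_interval=0.1, max_idle_polls=None):
    """Yield complete lines appended to an open text file handle.

    The caller positions `handle` first (for example, seeking to the end).
    `max_idle_polls` is a hypothetical knob for this sketch: stop after
    that many consecutive polls with no new complete line. None means
    keep following forever.
    """
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        position = handle.tell()
        line = handle.readline()
        if line.endswith("\n"):
            idle = 0
            yield line.rstrip("\n")
        else:
            # Nothing new yet, or only part of a line: rewind and retry.
            handle.seek(position)
            idle += 1
            time.sleep(poll_interval)
```

A caller would open the file, seek to the end, and loop over `follow(handle)`, printing each line as it shows up.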

[01:08:16] There's also a bunch of other Unix commands like there's awk, and there's sed, and there's jq. All of these are ways of extracting data. So if your file is rows of JSON objects, so every line in your file is a JSON object, you could pipe that over to jq and you could pull out one of the entries, one of the keys in that object, right?
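
For readers who have not used jq, pulling one key out of each JSON line looks roughly like this in Python. The function name `pluck` is made up for the example, and unlike jq (which prints null for a missing key) this sketch simply skips records that lack the key.

```python
import json


def pluck(lines, key):
    """Yield record[key] for every JSON-object line that contains `key`.

    `lines` is any iterable of strings, one JSON document per line.
    Blank lines and records without the key are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict) and key in record:
            yield record[key]
```

Because it accepts any iterable of lines, it can be fed an open file, standard input, or the output of a file-following loop.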

[01:08:42] If your file is just rows of text and maybe there's a timestamp you're interested in, you can use tr and sed and awk, and these other tools to pull out, you know, that timestamp. 

[01:08:53] But then, you know, as soon as things start to get complicated, like maybe you need to keep a rolling histogram or something like that. You know, you're really kind of stuck. I mean, at that point, I mean, you could try doing something with Python. You know, at that point you're basically writing a Python program that reads from standard in, and, you know, as soon as you jump into Python, you're writing a lot of code and et cetera, et cetera.
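
To make the "rolling histogram" point concrete, here is a minimal sketch of the kind of hand-written Python Jason is describing. The class name `RollingHistogram` and the fixed-size window of the most recent values are assumptions; a real script might window by time instead.

```python
from collections import Counter, deque


class RollingHistogram:
    """Count occurrences of each value among the last `window` values."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.recent = deque()
        self.counts = Counter()

    def add(self, value):
        """Record one value and return the current histogram as a dict."""
        self.recent.append(value)
        self.counts[value] += 1
        if len(self.recent) > self.window:
            # Evict the oldest value so only `window` values are counted.
            oldest = self.recent.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        return dict(self.counts)
```

Even this small amount of bookkeeping is the sort of code that a single SQL aggregate over a window would replace.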

[01:09:15] So, so SQL would be really attractive. There's a lot of times where I've converted things to, or just loaded things into a SQLite database, just so I can run queries. And it takes a long time. You have to transform the data, especially if it's just flat text. And so, you know, Materialize running it locally is a really, really attractive alternative.

[01:09:37] You could have a Materialize that's tailing a CSV or, I think it's called JSONL, where there's a JSON object per line, a JSONL format, and do more complicated things like groups and windows and all of that, without having to, you know...

[01:09:55] Frank McSherry: It sounds absolutely correct. I realize I say tail a lot, and I say tail a lot outside Materialize, but tail -f is actually, you're right, absolutely the exact specific use of tail that we should be thinking of where... 

[01:10:06] Jason Gauci: Yeah. I mean, there's tail the verb, and then there's tail the command, right? I think, yeah, tail the command is just a one shot. 

[01:10:13] Frank McSherry: You're, you're totally right. One of the things we've seen a lot of interest from folks about is not even necessarily Big Data anything in particular. Those folks are interested of course, but there are other folks who are just putting together, like even let's just call them web apps or something like that.

[01:10:26] Things, you know, that, I suppose at the moment they'd be using something like Firebase to get told about changes to their data. But in fairly primitive, elemental ways, you know, like they, maybe they pass a filter and you get to see your records that passed the filter and that might then prompt them to redraw a webpage or do some work like that.

[01:10:45] And Materialize is pretty appealing. And then you, then you get to have the same experience except you push a more interesting query through to the server, essentially. You could say that this is wonderful, but you know, just only show me when a particular more complicated property happens. You know, show me when new distinct users show up or someone logs in five minutes later than they had ever previously logged in.

[01:11:09] Things like this, that, yeah, these people don't necessarily have terabytes of data to work on, but it's really handy to have someone save them the pain of writing, you know, the Python or the JavaScript or the, you know, whatever it is that is handwritten, bespoke code to try to put together a thing that does the not necessarily very complicated task of figuring out when should I tell someone that a new thing has happened?

[01:11:30] And Materialize is, well, sort of popular in that space, at least as an idea, like why can't we have this for other classes of programming, essentially? SQL and Big Data is great, but there's lots of other people who deal with reactive applications, essentially, where you're trying to build whatever, literally React-style webpages, where you want to express what it should look like. The data might change. Why can't the computer system take care of all of this for me? 

[01:11:58] So like the bug I think, is getting out there in terms of people expecting, even wanting, but, but eventually expecting that their system can actually take care of all of these updates for them. They don't have to hand write a whole bunch of triggers and weird callbacks and stuff like that.
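
As an illustration of the "show me when new distinct users show up" idea, here is a toy sketch of keeping a distinct set up to date and reporting only the changes. It is not Materialize's implementation; the class name `DistinctUsers` and the `(user, diff)` update format are assumptions made for the example.

```python
from collections import Counter


class DistinctUsers:
    """Maintain the set of distinct users under inserts and deletes."""

    def __init__(self):
        # How many live records currently mention each user.
        self.counts = Counter()

    def update(self, user, diff):
        """Apply `diff` (+1 for an insert, -1 for a delete).

        Returns the resulting change to the distinct set: [(user, 1)] when
        the user first appears, [(user, -1)] when their last record goes
        away, and [] when the distinct set is unchanged.
        """
        before = self.counts[user]
        after = before + diff
        if after < 0:
            raise ValueError("cannot delete a record that was never inserted")
        if after == 0:
            del self.counts[user]
        else:
            self.counts[user] = after
        if before == 0 and after > 0:
            return [(user, 1)]
        if before > 0 and after == 0:
            return [(user, -1)]
        return []
```

A web app would redraw only when `update` returns a non-empty list, which is the hand-written trigger logic the conversation suggests a system could take over.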

[01:12:13] Jason Gauci: Yeah. That makes a ton of sense. I mean, I think one of the biggest, I think challenges or biggest say like, like mistakes that people make when they're starting out, is using a programming language or maybe another way of saying is using something like Python or C++ instead of using, you know, Unix commands and SQL.

[01:12:37] I think that you know, I know when I was, when I was going to college, you know, I kind of thought it was SQL that's for you know, that's for, for you know, people with real jobs, you know, like, a PhD student. So if I needed to, you know, read one column out of the CSV, I would just start, you know, int main and writing C++, and that made me extremely unproductive, you know?

[01:13:05] And I think that that is a, it's a lesson that's super, super important. And having the ability to do a real-time. Yeah, I think there's a massive, massive tail of folks who, can make a really, really good use of something like that. That just don't know. 

[01:13:23] Frank McSherry: This is a, there's a computer science principle, actually that gives name to this, this thing called Ousterhout's dichotomy where this is, I think John Ousterhout at Stanford who proposed it. 

[01:13:32] Essentially, there's roughly two types of programming languages, right? There's sort of this productivity level language. That's a bit like, I don't know, awk would be a good example, or SQL, you know, you can use it to get your job done as quickly as possible. 

[01:13:44] And then the more systems-y programming languages, let's call them, like C++ or something, which is, let's say you want to build one of these tools, right?

[01:13:50] Like someone actually has to have built the things. And if you know one of each of these languages, that's pretty good, right? Like only knowing a productivity language or a systems language, you're going to have some limitations, either because you only know C++ and you spend all of your days trying to open files and read lines and stuff like that.

[01:14:09] Or if you only know SQL, it's a little hard to, like, invent a new thing, essentially. If SQL isn't doing what you want, you're kind of in trouble at that point and need to get someone else to help you. 

[01:14:20] But if you, if you know, one of each of these things and can move between them, that's a really good place to be.

[01:14:25] Jason Gauci: Yeah. Yeah. That makes a ton of sense. Cool. Yeah, I think yeah, this is amazing. So folks out there, you know, we should definitely, I'll give it a shot. I think folks out there should definitely grab Materialize. 

[01:14:36] So I know there's Docker, Docker is usually pretty heavyweight, but are there like just standalone, statically compiled binaries for different OSes...

[01:14:44] Frank McSherry: Yeah, hopefully I'm not screwing this up, but I think we have them. We have like an apt-get repo. I believe we have, at times, I should make sure it's up to date, but the Homebrew versions of these things. Or, you know, you just grab the code and build it from source, if you're that sort of person.

[01:14:59] I should double check all of the package managers we have though. I think there's a few that we for sure keep up to date and some that we might've either let slip or lost some traction with. 

[01:15:08] Jason Gauci: Yeah. I mean, as someone who has a package in a bunch of these package managers, it's so difficult. I mean, right now I have an issue on Ubuntu 18, but it works on all the other Ubuntus and, you know, 10 different other OSes.

[01:15:26] And so it just, it just, it never ends. There's always like something that breaks somewhere. And it's one of these. Yeah. One of these days, someone needs to write some way to automate that, but that's going to be a challenge because they're all so different.

[01:15:43] Frank McSherry: And I'm sure someone has written that thing, but it's only supported on some of the OSes, so you can only use it in some stuff.

[01:15:49] I mean, it's, it's one of these, you know, like the XKCD cartoon about like, there are 14 competing standards. We should, we should invent a new one that encapsulates all of them. Now there's 15. 

[01:15:59] Jason Gauci: And now there's 15. Yeah. It's so true. Yeah. I guess, you know, that thing is probably Snapcraft, which doesn't have enough market penetration.

[01:16:10] Like I don't think they cover Windows. And so yeah, you're right. You can't really, I mean, maybe you could lower your number of things, but you can't get it down to one.

[01:16:22] Cool. So let's, let's jump into Materialize as a company. So what is a day like for a scientist or an engineer in Materialize? Like how, you know, specifically, like, how is it, you know, everyone kind of, or I guess pre-COVID, let's say everyone drove into work, you know, had a cubicle or a bullpen. But is there something kind of unique about life at Materialize? 

[01:16:47] Frank McSherry: Well, we're in New York City, so no one, no one drove anywhere. (laugh) You know, you would, you would hop into your metal cylinder and be propelled, you know, from one end of the city to the other, but, yeah, no, it was... I mean, it's changed, I guess it's part of the problem.

[01:17:05] So I'm trying to get a snappy way to, to characterize it. But early days it was, you know, we're all basically in, within 10 feet of each other and there's a bunch of rapid prototyping and sort of turnover where like I put together some code and then handed it over to someone else and they'd come back and say like, Oh, this just doesn't, you know, this isn't correct from seeing, okay, well, you know, let's iterate on it.

[01:17:26] And there's a lot of, you know, dynamic energy where things are randomly changing and we're trying stuff out. As we've gotten bigger, this has cooled down a little bit in that people go crazy if you just randomly change what they're working on while they're working on it. 

[01:17:40] So, you know, we have, sorry, this is not unique to Materialize, but you know, a process now, like sort of clerical settings and stuff like that, trying to figure out... you know, for example, I'm turning on the cloud product, what are the steps we still need to do before we're, we're comfortable putting that in front of the people? 

[01:17:55] You know, folks have nicely carved up bits of work where we're pretty comfortable. I mean, if the work gets done, it doesn't necessarily matter how it gets done. You don't need to be sort of butt in seat for any particular hours of the day or anything like that.

[01:18:12] It depends a little, you know, the cadence changes a little bit, sometimes something new and exciting gets put out there and it's, it's worth having you sit around for a little while to see did anything catch on fire, help out people who don't understand exactly what you did, but generally speaking sense.

[01:18:27] Jason Gauci: What's the coolest offsite that you folks have done?

[01:18:31] Frank McSherry: We've, we've got a few and I'll name my favorite one, but we haven't done too many because it was just about a year, and then COVID happened. 

[01:18:39] Jason Gauci: Oh yeah, that's right. I guess that too. 

[01:18:41] Frank McSherry: And there hasn't been any offsite since then. We desperately want to do one, but we've done, we've done two, basically. We went to upstate New York and did some hiking. This was when we were about five or six people. And I don't know, I would say fairly stereotypical but super funny, like hiking during the day, and then Smash Brothers at night, and we had some name calling and some whiskey and stuff like that, but it was totally appropriate for who we were and what we wanted to do at the time.

[01:19:06] Some folks went rock climbing, like everyone's just happy to get out of the city and just sort of stretch the legs in the outdoors. And that was great. 

[01:19:14] And then come, I think it was February actually, before anything got especially weird. We actually, we went on a, essentially a skiing trip, though it was up in Vermont and it was raining rather than snowing. And, you know, it was just mostly getting some time out of the standard work environment where you still get to be social with your colleagues. 

[01:19:33] It, you know, reinforces the fact that these are actual humans, not just people who've written annoying comments on your PR or something like that. Then just chill with people, spend some time socializing that doesn't have to be in a bar drinking or something like that. It can just be pretty mellow, taking walks or, yeah, just over dinner. 

[01:19:51] Jason Gauci: Yeah. That's awesome. Yeah, I think yeah, I think with COVID, it's a challenge. I mean, most of our, I guess, quote unquote, offsites have been just playing video games.  (laugh) We just take some time out and play some games together. Play, play a bit of Counter-Strike or something. 

[01:20:06] Frank McSherry: A little complicated, cause during COVID I'd love to do this, like a virtual offsite. Though it feels like a very weird thing to require of people. Like it's one thing to say like, well, we're all getting in a car and we're going somewhere awesome.

[01:20:16] And folks are basically like, okay, fair enough. But if you tell them, like we're all taking next week and we're not going anywhere interesting, but you got to log on and play some video games or something like that, a lot of the folks are like, they'd rather do something else. And it's hard to, I mean, on the one hand, you'd love to, you know, take some time off of work to get people a bit more socially interactive, but it's a bit hard to tell them like, your time, which is a scarcity at this moment, needs to be spent screwing around with us playing Scattergories online or something like that. 

[01:20:46] Jason Gauci: Yeah, it is, it is super awkward. I think it's a, it's a real challenge. And yeah, it's a fine line you have to walk between, you know, if you, if you make it kind of, let's say to... or if you don't, if you don't hype it up and promote it, then people won't show up, but then if you make it mandatory, then it kind of feels like you're in the show The Office. Right. So, so there's some fine line there.

[01:21:10] Frank McSherry: So we had a holiday party for example, which was done virtually and, you know, straddled this line pretty, pretty well, I guess, like, you know, it wasn't strictly speaking mandatory, but everyone's definitely encouraged to come. And folks leaned into that and, you know, got dressed up and made their own fairly nice dinners and showed them off on Zoom and stuff like this.

[01:21:28] And this felt pretty good. Like, it felt good that it wasn't, you know, a mandatory sit and look into the camera and have dinner together, which is not nearly as exciting as, you know, we're all gonna go out and have some cocktails and then a nice dinner. 

[01:21:43] Jason Gauci: Yeah, one thing, my team hasn't done this, but another team did. I think it was like HelloFresh. Yeah, there's this thing called HelloFresh where they'll deliver ingredients to cook a meal, and it's just enough ingredients to make a very specific meal. And so they deliver this to everyone's house on the same day and then everyone, you know, sets up their Portal device or their phone on a stand or something like that. And we all, I mean, they all just kind of cook together. I thought that was really clever.

[01:22:14] Frank McSherry: I like the idea a lot, though I gotta say, the same sort of problems crop up, especially, you know, for the folks in New York where, for some folks that are listening, the kitchens are not the centerpiece of the apartment.

[01:22:24] And if you tell them, like, unfortunately you're gonna have to cook your own dinner tonight, you know, no ordering in. It's in the interest of the company that you cook your own dinner and eat what you make. It almost sounds like punishment. (laugh) I think it's really fun. I like cooking.

[01:22:39] Jason Gauci: I never thought about that. Yeah, I mean, I also really love cooking. It just kind of shows how, you know, we all kind of bring our own biases. Right? Like, I never would have thought that, but when you put it in that perspective, it totally makes sense. I bet there are some people who were just like, you know what, my kitchen is just this stack of boxes.

[01:23:00] Frank McSherry: Yeah. I mean, the first offsite was this hike, hiking stuff in upstate New York. And I loved it. I love being out in the woods and running around and stuff like that, but I could totally imagine there are some other folks who are like, this is not what I thought of when I thought of fun.

[01:23:12] I was thinking we're gonna sit in a chair and drink some beer or something. Different strokes, different folks, I suppose. But again, this is one of the things that there's an art to doing, and it's not necessarily a thing that's super easy to fake. So I'm impressed when people do it well. How do we bring together a bunch of people who have, you know, different goals, different ideas of fun, and nonetheless get them to connect? When you can do that, that's great.

[01:23:34] Jason Gauci: Yeah. Makes sense. So are you folks hiring, like either interns or full-time or anything like that?

[01:23:39] Frank McSherry: Totally, yeah. Anyone who's interested should reach out. I think generally the answer is yes. If you have a particular affinity for this sort of thing, you know, we're interested, for sure, in interns, you know, I think all across the spectrum of engineering background. I don't think there's any particular thing where we said, no, we just need to stop hiring this, that, or the other thing.

[01:24:07] Jason Gauci: That makes sense. And so post-COVID the offices are in New York City, and so, you know, if people are interested, that's one of the things they should expect, that they would, well...

[01:24:18] Frank McSherry: We have, we have actually several locations now, and we have, we have people, sorry, several locations is too strong. We hired remote people who are not going to be moving to New York. You know, folks who are in California or in Europe, stuff like that. So that's definitely on the table. I think, you know, we're excited by all of this. 

[01:24:34] There's a management overhead associated with it. So the engineering management, for example, for sure has the ability to say, no, we don't know how to handle someone in this time zone, and I don't want to wake up at two in the morning to do their one-on-ones. So there's a bit of a pushback if they're not in an existing time zone that we have. We'll need to figure out how to manage that growth.

[01:24:54] But I think if you're interested and excited about this sort of thing, reaching out is a hundred percent the right thing to do, and we can try to figure out, you know, if not now, when, or see what makes sense.

[01:25:05] Jason Gauci: Cool. And so for folks who are interested in, you know, grabbing a copy of Materialize, trying it out, like we said, it's super, super accessible. You can get it from, you know, an apt repo or brew or whatever, but definitely check out the website first and learn about it. You can go to Materialize.com. It's Materialize with a Z. I think that's the American version. I think Materialise with an S is the British version.

[01:25:30] Frank McSherry: The main interesting point, I guess, is that if you go to Materialise with an S, there's a company there. They're a different company, and you might have a very different experience if you apply for an internship there. (laugh)

[01:25:40] Jason Gauci: Yeah, that's right. So they're actually a fish farm?  (laugh) I have no idea, but yeah. Materialize with the Z and you know, you can, I'm sure there's a Careers page. You can check all of that out. There's a place where you can get, you know, the latest copy of Materialize and try it out. If you, if you have any data that's structured on your machine, you can try out Materialize.

[01:26:04] And actually, so one, just to be clear, you can like run Materialize over a file, right? I mean, if you're printing ...

[01:26:10] Frank McSherry: Totally. Yeah. Text files, like a CSV, is a classic thing that you can... We have a few worked examples on the webpage, and one of them literally, as long as you have an internet connection, just starts getting data from Wikipedia about what people are editing, for example, at the moment. It just starts pulling that down to your computer and has a built-in query that asks, who are the top contributors as this data set evolves?

[01:26:34] And it's just, you know, grabbing the data continually once you start the little task. There wasn't necessarily any data on your computer beforehand, but there is now, and it's just sort of looking at that as it evolves. You can do some other crazy stuff with that too, and play with it. Yeah.
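
The "top contributors as the data set evolves" idea can be sketched in a few lines of plain Python. This is only an illustration of keeping an answer current as events arrive, not Materialize's code or API: the class name `TopContributors`, the helper `batch_top`, and the sample editor names are all made up for the example.

```python
from collections import Counter


class TopContributors:
    """Keeps per-editor edit counts current as edit events arrive."""

    def __init__(self):
        self.counts = Counter()

    def on_edit(self, editor):
        # Incremental step: only the count for this one editor changes.
        self.counts[editor] += 1

    def top(self, k):
        # Rank by count (descending), then by name so ties are deterministic.
        # A real system would also maintain this ranking incrementally;
        # here it is simply re-sorted on demand.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:k]


def batch_top(edits, k):
    """Batch version: recount the entire history from scratch."""
    counts = Counter(edits)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical stream of editor names, processed one event at a time.
stream = ["ana", "bo", "ana", "cy", "bo", "ana"]
view = TopContributors()
seen = []
for editor in stream:
    view.on_edit(editor)
    seen.append(editor)
    # The continually updated answer matches a full recomputation at every step.
    assert view.top(2) == batch_top(seen, 2)

print(view.top(2))  # [('ana', 3), ('bo', 2)]
```

The point of the sketch is the last loop: the maintained view and the from-scratch batch computation agree after every event, but the maintained view only does a small amount of work per new edit.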

[01:26:49] Jason Gauci: Cool. That makes sense. And so if people want to talk to you about Materialize, they can also find you at @FrankMcSherry on Twitter. Absolutely. I will post all of that in the show notes.

[01:27:02] Frank McSherry: Yeah. Yeah, no, for sure. Definitely active on Twitter. I mean, a thing that I didn't mention, I suppose, is that if you're going to Materialize, there's also, for example, a bunch of blog posts, stuff that we've written, just, I would say, slightly more conversational content about what's interesting or different going on in here.

[01:27:16] And it's a great place to look to sort of form some questions, for example, like, this looks great, but... And then reaching out in person is totally fine. I spend a bunch of my time trying to help work people through, like, what's different here, or I don't see how you can do that, or whatnot.

[01:27:31] It's a great thing to do in public. A bunch of people learn from it who didn't necessarily know to ask or couldn't figure out how to frame their questions.

[01:27:38] Jason Gauci: Yeah. That makes sense. I think, you know, another thing is, is for folks out there who are trying to get into maybe like, you know database engineering, you know, the best thing to do is to get your feet wet, you know, using some of these tools.

[01:27:50] And at some point you might kind of be scratching your head saying, I don't really know how to do this in Materialize. And I don't really know how to do it any other way either. You know, maybe I'll write a plugin or maybe I'll fork it and make some changes. And the next thing you know, Frank comes knocking on your door saying, hey, that's some pretty cool stuff. Why don't you come work at Materialize?

[01:28:09] So, I mean, you know, jump into these projects and dive into the source. It's totally open source. So this is an amazing way to kind of learn it. It sounds like it's a very powerful tool for just about anybody.

[01:28:25] Frank McSherry: I would definitely say Materialize, I was going to say more than other things, but maybe that's not fair. But Materialize has this cool property, that I've found so far, that you can do some pretty interesting things with it, some unexpected things, stuff we hadn't planned for, for sure. So I think, you know, maybe as much as other projects, getting your feet wet and starting to use it often leads to something surprisingly cool

[01:28:44] and interesting. And maybe not generally interesting, but even just your friends are, like, posting on Hacker News or something. There's some cool things that you can do with Materialize that many of us didn't expect ahead of time and didn't know, like, oh, I didn't realize this was the main problem in sports statistics or something like that, just something we don't know anything about. And you're like, hey, I just put it together, and now it does a thing and everyone's super stoked.

[01:29:07] You can build some pretty cool and new, different things, and telling people about those is wonderful as well. But I think you're right, just to sort of loop back around: getting your feet wet, whether it's with Materialize or other data platforms, is a great way to start getting a handle on what's hard, what's easy,

[01:29:26] what do you find to be most unpleasant. A lot of the engineers who are at Materialize are there because, and I literally just asked some folks recently, this was painful in their previous lives. And if they can make this better, they find that really exciting.

[01:29:40] But getting that context for what's hard, what's easy, what would I like to make better, is invaluable.

[01:29:47] Jason Gauci: That makes sense. And so for people who have never worked with SQL before, what do you recommend to them? Should they, is there, are there some, does Materialize link to some like kind of generic SQL tutorials? Or is there your favorite tutorial that you point people to?

[01:30:00] Frank McSherry: I don't think we do link to a generic SQL tutorial. That's a really interesting point, actually. We have documentation on the SQL that we support, as if Materialize had invented SQL. Of course that's not the case, but that's the way the docs are structured. It's a really good question.

[01:30:16] Actually, I came to SQL in a very non-standard, roundabout way, having done a whole bunch of data-parallel computation first and then looped back around and tried to map SQL onto it. So I wouldn't recommend that path. I liked it a lot, but it took many years. I'm not really sure. I think, for example, Markus Winand has a fairly well-regarded introduction to SQL, and also skilling-up SQL stuff.

[01:30:39] I don't know the webpage off the top of my head, but I could try to try to track that down.

[01:30:45] I have to imagine there's good and bad SQL tutorials. Yeah. 

[01:30:49] Jason Gauci: Yeah. I mean, we can add anything to the show notes.

[01:30:52] Frank McSherry: I'll track it down, and I'll hand it out and we can make sure it's linked. 

[01:30:55] Jason Gauci: Yeah. I also, like you, kind of learned SQL through, yeah, I was basically at a place where a bunch of the code was written in SQL. And so that was kind of my way of getting thrown into it. And then I kind of realized post hoc, like, oh, I should have learned this a decade ago. And so yeah, I'm pretty sure we've done a show on SQL. It might be dated now, although, you know, the standard doesn't change very often, so it's still relevant. I'll see if I can link to that one as well. In that episode, we went through references. So yeah, definitely check that out. Learn SQL, I guess that's step one. Really, really important, super useful.

[01:31:37] You know, SQLite is very, very accessible. Materialize is very, very accessible, and they will make your life so much easier. And then after you learn SQL, you know, check out Materialize and start using it.

[01:31:50] So yeah, I think we can kind of put a bookmark here, but Frank, that was a really, really amazing, you know, an inspiring talk.

[01:31:59] I mean, I feel like I want to try it. I'm going to go and grab Materialize right now, and I have some files that I want to see kind of how it works on them. And I think the idea of having a SQL query that works on streaming, and running that same one on batch, and not having to write two of everything, you know, all of that is super, super appealing.

[01:32:19] I think people out there have learned a lot in the past hour. And so I really appreciate your time and you coming on the show.
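
For anyone taking the "learn SQL first" advice, the "one query, batch or streaming" idea can be tried in miniature with Python's built-in `sqlite3` module. This is a sketch with SQLite standing in, not Materialize: the `edits` table, its columns, and the sample rows are hypothetical, and SQLite recomputes the answer each time the query runs, whereas a streaming database would keep the result up to date as rows arrive. What carries over is that the SQL text is written once.

```python
import sqlite3

# One query text, used for both the "batch" and the simulated "streaming" case.
TOP_EDITORS = """
    SELECT editor, COUNT(*) AS edit_count
    FROM edits
    GROUP BY editor
    ORDER BY edit_count DESC, editor ASC
    LIMIT 2
"""


def top_editors(conn):
    """Run the shared query and return a list of (editor, edit_count) rows."""
    return conn.execute(TOP_EDITORS).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (editor TEXT, page TEXT)")

# "Batch": load the history that already exists, then ask once.
conn.executemany(
    "INSERT INTO edits VALUES (?, ?)",
    [("ana", "Rust"), ("bo", "Kafka"), ("ana", "SQL")],
)
batch_answer = top_editors(conn)

# "Streaming", simulated: new rows arrive and the same query is asked again.
conn.execute("INSERT INTO edits VALUES (?, ?)", ("bo", "Presto"))
conn.execute("INSERT INTO edits VALUES (?, ?)", ("bo", "Avro"))
stream_answer = top_editors(conn)

print(batch_answer)   # [('ana', 2), ('bo', 1)]
print(stream_answer)  # [('bo', 3), ('ana', 2)]
```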

[01:32:27] Frank McSherry: It's not a problem at all. I'm happy to be here. And actually the questions were great and really sort of draw out, for me at least, what's exciting and sort of stimulating about what we're doing, why we're doing it. And hopefully, you know, the listeners, some fraction of them are agreeing, like, yeah, that does sound like a thing that I either need or really want or something like that. And that sort of then resonates with us for building it.

[01:32:47] Jason Gauci: Yeah, totally. Thanks again. 

[01:32:48] You know, for folks out there, we're working on doing two shows a month, so you might be surprised to see this show, considering we already have an April show. So you might be surprised when you're seeing another April show.

[01:33:03] And so that's, that's, what's going on there. We're going to, we've been working with some really, really nice folks who've been helping us with a lot of the post-processing and that's allowed us to ultimately produce more content, which is, which is super exciting. And, and the reason why we can do that is because of, you know, your ongoing support.

[01:33:21] So, you know, thank you so much folks out there who are subscribed on Patreon, and people who found out about Audible through the show, through our shows. So thank you all so much for all of your support, your emails. 

[01:33:34] We got a whole bunch of new ideas over the past few weeks that we've added to our list. So the content is still growing faster than we can consume it, which is really, really important and great. And everyone have a great rest of the month, and we'll see you all next time.

[01:33:55] VO: Music by Eric Barndollar. 

[01:34:06] Jason Gauci: Programming Throwdown is distributed under Creative Commons, Attribution ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, and adapt the work, but you must provide attribution to Patrick and I, and sharealike in kind.

