WEBVTT

0:00:15.840 --> 0:00:22.240
<v A>Programming throwdown Episode 169 Hyper Log Log Take it away, Jason.

0:00:22.720 --> 0:00:30.700
<v B>Hey everybody. So, Patrick, do you have our. All your cars electric? Any of your cars electric? Do you have an ICE car?

0:00:32.060 --> 0:00:39.100
<v A>Actually I have one normal car and one the plug in hybrid we discussed. So it does have a battery, but also Has a regular 12 volt car battery?

0:00:39.660 --> 0:00:45.100
<v B>Oh yeah, that's what I was going to ask you. So even the hybrids have regular car batteries?

0:00:46.380 --> 0:00:54.060
<v A>The hybrids for sure. I think even electric cars still often have like 12 volt sort of normal car batteries for a variety of purposes. I don't know.

0:00:54.460 --> 0:01:03.780
<v B>Okay, Tesla does. I mean it's probably better than trying to like convert, you know, from whatever voltage the, the engine battery uses.

0:01:04.020 --> 0:01:09.220
<v A>It looks like here they do just a quick Google says that. Yeah, they have the normal lead acid 12 volt batteries.

0:01:09.460 --> 0:01:16.420
<v B>So you know, how do you know when your car battery is dead? Do you just drive until the battery stops working or what do you do?

0:01:16.820 --> 0:01:27.330
<v A>I mean, in an overall thing, I guess. Yeah, the car can't start. So I don't know how that works in electric, your accessory systems. Like when you sit in a car before you start it, it probably doesn't turn on.

0:01:27.970 --> 0:02:57.240
<v B>Yeah. So like in my experience until recently, I have never tested my car battery. And so basically what will happen is I will just drive until the car won't start and I will have to go and desperately figure out how to replace the car battery and ends up being this big thing. And it's just gotten worse as like we've gotten busier over the years. So it culminated when we were about to all go camping and the car didn't start. And it's been, we've had the same battery for a while, but part of it is, you know, you never totally know without a tester if it's, if it's the battery or if you left the lights on or if there was some kind of glitch that caused the energy to drain. Like we had the key in the car, like not literally in the ignition, but we had the key next to the car. And I feel like just having the key so close to the car, it somehow got confused and it thought we were in there or something because, you know, I'm actually, so I guess to answer, I'm, I'm ordering a car battery tester which is only 20 bucks. I thought it was going to be like hundreds of dollars because, you know, when they, anytime they do anything at the dealership or you call someone to do something, it always seems like oh, like this must be some amazing device that must be manufactured in a laboratory in the middle of the earth or something. And so no, it's like you can get a car tester, car battery tester for like 20 bucks.

0:02:57.480 --> 0:03:06.600
<v A>Do you have to disconnect the battery, not remove it, but at least like disconnect the terminals like the rest of the electrical system for it to work? Because that's a, that's an annoying thing if you do.

0:03:06.600 --> 0:03:41.140
<v B>Yeah, I don't think so. Right. Because as long as the car is not on, it's not drawing. It's not drawing any amps. So it should be an insignificant draw. So we're going to find out. I'm going to get the car battery tester on Tuesday, tomorrow, and I'm going to see what the instructions say. And yeah, I'll let you know. But I'm pretty sure just putting my engineering hat on that as long as the car is not on. Well, the car has to not be on and not even on accessory mode. Like the keys have to be far away and everything.

0:03:42.340 --> 0:03:44.860
<v A>I could see that, but I. It's not a voltage thing.

0:03:44.860 --> 0:03:45.100
<v B>Right.

0:03:45.100 --> 0:03:53.420
<v A>So the thing you're describing is can you pull enough amps temporarily to, to turn the car over? Right. So it's like measuring like rush current.

0:03:53.420 --> 0:04:16.040
<v B>That it can deliver. Yeah. So this $20 thing, what it needs to do is draw enough amps for a certain amount of time and make sure the voltage doesn't drop. Right. I think that's. But it can also tell you, it can actually tell you whether the battery, like all the cells are just old or if you have one bad cell. I don't know what you would do with that information, but it can tell the difference.

0:04 --> 0:04:22.520
<v A>This probably a, a voltage based thing, knowing that all the batteries are made from the same chemicals or whatever, so.

0:04:23.080 --> 0:04:32.240
<v B>Yeah, that's right. Yeah. I think if you have one dead cell, then you would have. You'd consistently pull a certain number of amps, but at a lower voltage.

0:04:32.240 --> 0:04:42.390
<v A>Well, you'd have one cell's voltage drop missing. So you. Like if I don't know how many cells are in a 12, but like you would go to nine, three. And that's like all your cells are good except for one.

0:04:42.550 --> 0:04:42.950
<v B>Yeah.

0:04:42.950 --> 0:04:56.230
<v A>Versus if you just had like 10 instead of 12, then. Okay, so that your, your, your mission, should you choose to accept it, is to crack open the tester to look at the circuitry and figure out what it's doing.

0:04:56.710 --> 0:05:48.940
<v B>Yeah. After the last episode where I burnt the Raspberry PI and I literally have a imprint on my finger from the SSD card. I don't know if I'm going to crack open the battery, but. But yeah, I guess, you know, it'd be interested to hear what people out there do. So if you just drive your car until it doesn't turn on, if you have some regiment that works for you and you've never had this happen to you either, end of the spectrum, let us know. You know, I just, you know, it kind of like it didn't ruin our camping trip, but it definitely set the camping trip back, but by like multiple hours. So I kind of feel like I need to improve this area of my life. So we'll see how it goes. If you have a way of doing this, let's know. Actually, Patrick, do you do anything or do you just drive the car until it doesn't turn on?

0:05:49.100 --> 0:06:12.570
<v A>Yes, that one. Just horrible. I think when I take it for its regular oil change, I think one of the things in the checkpoint is they hook up their expensive battery tester to it and then try to convince me to buy their battery. So that's been my. I do regularly get the oil changes on schedule. I don't know if that actually, actually is super critical, but I, for whatever reason, I just choose to do it that way.

0:06:12.810 --> 0:06:30.320
<v B>And so I hope I get the oil changes on schedule, but I've never, as far as I know, I've never had a mechanic tell me I need to change the battery. It's always just died. Oh, man. All right, onto news.

0:06:30.880 --> 0:06:31.920
<v A>You're first up.

0:06:32.320 --> 0:08:32.200
<v B>All right, so we never really talked about this. This is a news article about Google continuing to do layoffs. But this particular article isn't what I want to talk about. I want to talk about tech layoffs in general. My view of it, there are layoffs. There have been layoffs for what, probably 18 months now. It kind of oscillates. So there's layoffs, and all of a sudden there's even articles about companies laying people off that desperately asking them to come back. It's very, it's a very volatile situation. You know, my overall take on this is that, you know, I think that it's a really complicated situation because on one hand, I do think people should choose degrees and majors and careers that, that, that have a addressable market. Right. So if you, if you get a degree, well, depending on what your goal is, but you know, if your goal is to enter industry and you get a degree in, in American history, that's going to be really difficult. I Mean, there are, you know, you don't need to do what your degree is, but it's just making things a little bit more difficult. And so someone might say, well, with all these tech layoffs, maybe we shouldn't get a major in engineering or computer science. I think my overall take of this is engineering is still an extremely strong discipline career wise relative to the spectrum of degrees or things you could study. And so I wouldn't be really fazed by this. If you're a high school student, college student, you know, I would choose engineering if it's something you're interested in and not really be worried about layoffs. I wouldn't, I wouldn't make that a really big factor in which, you know, college major you're going to pick. What do you think? Patrick?

0:08:33.240 --> 0:12:01.270
<v A>Yeah, I agree. I mean, I would love to say that the, you know, if you look back to other periods in time, like 2001 or other, the great financial crisis, and that the layoffs were completely merit based. But I mean, I think the unfortunate truth is when a company goes to layup, when a company hires, people hire against different reasons, different performance targets. They might be over specializing, over generalizing, hiring less quality than they need, overpaying for, whatever it. There's just so many variables that go in. And I think when a company goes to do layoffs, it becomes very difficult to sort of like run some strict merit based thing because also layoffs normally come from top down, right? Normally the bottom people aren't doing the layoffs. So it's the top down that hey, there's a target to hit and it, but bottom up there's this like opsification that happens where everyone says their team is overperforming because it's sort of a game theory thing, right? Like you don't want to say, oh, I have a team full of underperformers. And that's why my, you know, that's sounds better than the manager, right? So you get this like information problem. And so I think when the layoffs happen, unfortunately it isn't always very fair. But to Jason's point, I do agree that on a whole, I think the industry is fairly robust and there are other people hiring, but times are exuberant and times are lean. And I think what often can happen is if you are in the wrong part of the hype cycle when you come out of college, you can have sort of wrong expectations like, oh, you know, I'm going to start out and make X dollars, you know, per year. And then if times get lean and you're sort of waiting for a better, better offer, you know, that matches what you heard. It's like, unfortunately the situation is dynamic and like, you know, it's not necessarily true that those offers are still there and so you got to kind of, you know, be aware of that. But, but on the whole, I think, you know, as an industry there's still growth here. I mean, people kind of say, which they've always said, oh, AI, offshoring, whatever, there's always some threat to computer science or engineering, you know, disciplines. But I don't, I don't, I'm not over worried about it myself and I would still encourage people who are interested in it to pursue it. I think there's lots of adjacent jobs you can do if you, you know, if, if things got really, really lean, there's plenty of other things you could apply, the skills you learn in getting an engineering degree too. And so also though, I, I will like give a shout out for the finance stuff we talk about sometimes, which is the importance of having savings and not, you know, living paycheck to paycheck. And it can be hard when you live somewhere expensive, but making sure that if that does happen, where you have to sort of go a few months before you can land another job and have the freedom to not have to just take the first thing that comes across your, you know, sort of desk, then I, I think that's a huge benefit and, you know, gives you a little bit more confidence to kind of seek out the, the thing you're looking for. But yeah, I agree, I, I, I'm not overly concerned about it. I think like this latest round that, that you're mentioning are a lot smaller. So the news likes to hype it up because everything's dramatic. But you know, there have been serious cuts. I think the cuts are tapering off. We're probably seeing the last of them. But you'll always see companies wind down, but new companies start up and, you know, it's kind of an ebb and flow. And so I, I, as a whole, I think the health of the industry in terms of, you know, compensation and stuff is probably steady. Probably not as exuberant as it was two, three years ago, but, but holding steady.

0:12:01.830 --> 0:14:52.120
<v B>Yeah, yeah, I agree with that. Yeah, I think you brought up a really good point which is, you know, and I know we've talked about this before, but, you know, when you start out, you know, let's say you're in college right now, you're looking for different careers, your goal should be to, to Learn as much as possible. And in fact, like most companies will, will, will have source control, will have peer reviews, we'll have like a lot of different disciplines that will be really important for you to pick up early. And so you really can't, as far as I know, you really can't go wrong. Obviously I've only had one first job as one human being, but, but you really can't go wrong with your first job. You can always find stuff to learn. There's, there's, there's just a huge expanse of things that you need to know to be a professional developer. And you will learn, you know, at least 70, 80% of that almost anywhere. I would say that maybe the place that's the most risky would be to join, you know, something extremely early. I heard a hilarious story. I don't think I've ever talked about this on a show where this was back when Java is really popular. I had a buddy who worked at a really small company. It's only just a handful of engineers. Most of them were straight out of college. So there wasn't really anybody who has experience. And you know, at Java used to have these things called JAR files, right? Which are basically zip files with some metadata that told you like which Java file to run first. And I'm probably butchering that, but it's basically what it was. And so most people don't know that JAR files under the hood are just zip files, but these folks did. And so their continuous deployment strategy was to build the code on their computer, guess@which.class files changed like based on which Java files they changed and copy those into the zip into the jar that was running on the deployment like on the production server. So there was no source control. Everyone just copied source files to each other over Windows shares. And so when it was your turn to update the code, you would open up this JAR file and start dropping Java class files in there. So that's a difficult situation to learn. Continuous deployment and source control and all of that, that might be something I'd be concerned about. But most medium to large size companies, actually all medium to large size companies and most small size businesses would be just fine.

0:14:55.400 --> 0:16:50.120
<v A>All right, well, taking a hard turn there from the seriousness of that, but I agree with what you say. My next news article is my continual stream of shader descriptions, which is in my. I've given up the belief that I'm going to build a killer game. So now I'm just like, I'm going to build a killer shader for, you know, making A cool visual demo. So this is real time dreamy cloudscapes with volumetric ray marching. Ooh, that's a mouthful. This is. Check this link out. This is by guessing by the name of the blog. I think it's by someone named Maxime Maximum Maxime Heckel is what I would guess there. But the. Yeah, I think that's right. And they have an article on their blog about doing. Well, a couple of things. First of all, rendering clouds, as the name says, in real time. So you know, doing it nice and fast. It's a beautiful thing if you check it out. But also interesting is if you sort of click around through the linked articles, they have a ton of really just interesting examples of doing shaders and how to kind of make them work. And so shader is code that roughly runs on your GPU and you know, sort of has instructions per pixel about what to render. And it's source of a lot of like, how do you call it, like computer generated art, demo scene style things which we've alluded to before. But also this discussion of ray marching has come up a few times into. I don't. I feel like those things ebb and flow. I think it's the algorithm just like honing in on me, but like whatever. So I've seen a lot of YouTube videos or whatever about ray marching recently and so it's on my like list of things to look into which is a. Of sort of thinking about ray casting and you know, understanding how close you are to something signed distance functions. Anyways, tons of good quality things. It's like a jumping off point. So I one day aspire to make beautiful like mellow screensaver like shaders. Like these articles I keep touting out for everyone.

0:16:50.680 --> 0:17:11.680
<v B>This is amazing. 1 PSA, don't look at this in Firefox, it will cause cause all sorts of heartburn. But on Chrome it works just fine. Do you do this, Patrick? I look, I use different browsers on my work computer for personal and for work things. That way it's like the most isolation, isolated environments I could think of.

0:17:11.840 --> 0:17:17.520
<v A>Oh yeah, I do do that, but it's sort of a more because it's a requirement rather than.

0:17:20.160 --> 0:17:37.819
<v B>Okay. But yeah, on Chrome it works just fine. That is so cool. I wonder how many of these things are commodities. Like if you were to use Godot or Unity or one of these things, do they just have volumetric array marching or is it too cutting edge?

0:17:38.539 --> 0:18:41.440
<v A>Oh, that's a good question. Probably. It's like you could probably find out implementations of it and pieces. But I think this is one of those things that you kind of piece together the sort of almost algorithms into building the look of your game or your. Your, you know, demo or whatever. So you may do things like compile various ways of doing water rendering and cloud rendering if you're doing, you know, this kind of thing. And so I think you might be able to pull down certain. It. It's not like shaders. I don't think that, you know, my understanding is it's not like pulling down a model, like downloading a model and like, you know, rendering the model and it's rigged and you do animations like those things. I think you have common utilities for the shaders are more like the sort of look and feel. And so you can probably have a jumping off point with them, but you sort of combine them in your own way to give your game its unique style, you know, or realistic, you know, kind of thing. But I think a lot of people, independents, are pursuing a very stylistic approach, so that kind of looks unique to them or whatever. So rather than, you know, hyper realistic is a. Is a really tough game to play in.

0:18 --> 0:18:55.670
<v B>Yeah, that totally makes sense. All right, my second news is I've been diving into. Okay, so what was your first computer, Patrick? First thing with a keyboard.

0:18:57.110 --> 0:18:59.670
<v A>It must have been one of those. Is that like an apple IIe?

0:19:01.030 --> 0:21:30.830
<v B>Okay, yep, yep. We had those in school. I had a Commodore 64 and, you know, I also had an Apple TV at school. And I've just been going down memory lane on old, like, original games, like the first few games I ever played. One of them was this game called Street Beat where you ran around with a boombox trying to hit people in the face with music notes to make them dance. And so you actually, the way it worked was first you had to find a song and the songs were like, in people's houses. So you'd. You'd see a glowing door. That was a sign that you. This person had made an awesome song. You'd go in his house, you take his song, and then you turn on your boombox and hit people in the face with music notes. And once you've hit a certain number, then the song was like, you know, I guess like proven to be good or something, baked in or whatever it is. And then you would take your baked in song to your house, your own personal house, and score a point. And so I don't remember how many points you have to get, but anyway, so that was Street Beat. I came across one that I played with My family, when I was really little, that kind of blew my mind in that it was really an interesting crossover that I don't think I've seen since called Robot Rascals. And so the way Robot Rascals worked is it was a card game and a computer game all in one. So, you know, the cards were your private information that no one else could see, and the computer was sort of the community information. And so you would take turns playing this game on the computer, but you would have your private information and you try to, Based on what other people were doing, try to guess at what their private information was so that you could exploit it. And I just. Yeah, so many good memories of this, but. But I haven't really seen anything like that. There are some, you know, companion apps. So if you're playing Monopoly, for example, there's like a companion app where you don't have to pass paper money around. You know, there's things like that, but those are more assisted tools. I haven't really seen, like a board game app trying to cross over like this. Have you, Patrick? Have you seen anything like that?

0:21:30.830 --> 0:21:50.310
<v A>Yeah, I mean, there are a few. So I think I played like a Ticket to Ride where, you know, like, everyone has their tablet and, you know, the screen is rendering the common thing, but you can also do pass and play or whatever where, like the color. Oh, I don't know how people. Anyway, the color of the cards you have, you know, shows at your turn. There's also, I think, one for Turo.

0:21:51.780 --> 0:21:53.380
<v B>Oh, wait, hang on. I don't know if maybe.

0:21:53.380 --> 0:21:54.260
<v A>I don't know.

0:21:54.900 --> 0:22:07.100
<v B>Okay, so what I'm saying is you have physical cards, so. So you're playing a physical card game, but then the computer has, like, the other half of the game, you know, so, like, the computer doesn't know.

0:22:07.100 --> 0:22:17.620
<v A>It's like, combined, like, hybrid. So it's not like you're playing the board game. I was like, okay, no, I missed this. Okay, so like, you have like a deck of cards, like, holding in your physical hand.

0:22:17.700 --> 0:22:18.580
<v B>Right, Right.

0:22 --> 0:22:23.720
<v A>Oh, wait, what? How? And so you have to enter information reliably into the computer.

0:22:23.800 --> 0:22:58.550
<v B>So, yeah, so, like, you could, like, basically, like, you could tell the computer, like, I have a right to mine this rock. And the computer has no way of knowing that because it doesn't know physically what cards you have. So it just assumes when you say something, you're right. But. But yeah, I mean, you know, you're also playing with other people, so they. They would. If you have a right to, like, bind this Rock and someone else is just looking at that car, then he knows you don't have it. So you can't, you can't really cheat, per se. But yeah, like, like you need the sort of like the physical and the virtual kind of in the same space.

0:23:00.550 --> 0:23:13.750
<v A>I, I think I have heard of something like that, but. No, I've never, I've never, I never played it where like, yeah, like the, the engine of the game can be very complex because the. All the accounting is done by the computer. But you start playing a physical game.

0:23:14.570 --> 0:23:15.850
<v B>Right, Right, exactly.

0:23:16.090 --> 0:23:17.690
<v A>Yeah. That's cool.

0:23:17.770 --> 0:23:18.250
<v B>Yeah.

0:23:18.410 --> 0:23:22.810
<v A>This was like an old game, right? So it looks like you were saying it's like robots or something.

0:23:22.970 --> 0:23:55.430
<v B>I definitely played it as a very little child. It came out 1986, so I was actually four when it came out. I mean, I played it. I was probably like 8 when I played it. Very cool. Wow, that's nice. Yeah. So folks out there, if you're in game development, this would be like a interesting area. I don't know if it's sort of like just, you know, past its usefulness, but I just, I feel like there might be something cool there. We'll. We'll see. Maybe someone can write in and say what some contemporary equivalent is.

0:23:56.710 --> 0:26:49.430
<v A>Well, I mean, alternative to what you're proposing is you could go into virtual reality and the game could just be sitting on the table in front of you. And if you wanted sort of a newly released virtual reality headset, you could use the new meta quest 3. Do you have one of these? No, I don't, I don't. But I just bring it up because I feel like it's one of those, a couple like, you know, space travel, random finance threads, virtual reality. There's like a few things I feel like we, we keep weaving through our podcast over the years, and so our podcast is old enough that, that we've seen some of these things mature and so interesting. Meta quest 3 is like a sort of. I guess it's the latest released hardware, of course. You know, I think there's the Vision Pro, which people are talking about coming from Apple next year, I think is what they're saying. But for the meta quest 3, this is out, you can buy it and the sort of like it does a lot of things right, like, you know, revs everything but the pass through where your camera's on the front that are sort of eye distance apart, passing through to you on the inside, so. So that you can do the mixed reality. This is the thing that, you know, everyone really wants to make happen is the augmented reality, the mixed reality. And so you are seeing. Which is kind of the funny thing. The what happened when Google Glass was coming out, which is like people sort of out in public using their headsets in weird places to, you know, record, record, stereo, video. Yes. But also to like be somewhere but also not be in that place. And so it's sort of this interesting, I guess, you know, zone between this being a real thing where you just put on glasses and see it because you're still wearing a headset. People are very aware of what you're doing. But yeah, I know I do play still my. I have a quest 2. I want to call it an Oculus Quest, but they deprecated that. But to be clear, it was like on the branding of my box, I feel like I still call it that. Anyways, the Quest 2 and I've really enjoyed it. I feel like they sort of figured some stuff out. The games, you know, we play, we keep playing and, and we've talked about that before on the show. Getting exercise in the game and just generally how immersive it is. I've been playing the. I guess I could have made that my tool to show I didn't. Oh well. Which is I expect you to die number three. Which I told to someone and they looked at me very funny like, what kind of horrible game are you playing? It's like, no, no, no. It's like a light hearted puzzle room. So anyways, I expect you to die number three. I feel like people have kind of figured out the formula a bit for some of these things that work and it may be your cup of tea or not, but excited to continue to see stuff coming out in this year. I don't know where it's going. I don't know if this is like the next big thing or not, but definitely excited to continue to see how it shapes up and at least, you know, for me, you know, being a casual person involved here, you know, enjoying some good games and entertainment as these things mature.

0:26:49.910 --> 0:28:41.580
<v B>Yeah, I mean I'm one of those people too that plays the Oculus. The Oculus Quest 2 at least once a week, sometimes even like multiple times a week. And but you know, something that like, it's kind of interesting, you know, I play basically the same two games that I've had almost since the beginning. And so, you know, I don't know how much of a commercial success it is but. But it creates a ton of value and basically for me I just play boxing and sword fighting and you Know, originally I got it to where I was playing boxing on the harder and harder difficulty levels. But I realized two things. Number one, you know, I'm not really very, like, fast in terms of reaction time, and so. And number two is, like, when you were sort of whipping your head around, it's just not very comfortable. So I thought, okay, I can make this more difficult by adding weight, because I know boxers do this in real life, Right. They have, like, really weighted gloves when they trained and stuff. And so I got these wrist weights, and I've been kind of just adding more and more weight and playing these games on the normal difficulty just with more weight. It is amazing, like, number one, how much harder it is. Even with one pound of weight on each wrist, you'll be amazed. Yeah. You'll be amazed at, like, how slow you punch and how slow, like, you get your hands back in a guard and all of that. And I got. This is kind of ridiculous, but I'm up to, like, five pounds of weight on each arm. So I have these. They're meant for your ankle, but I put them around my wrist, and I have another one that has a thumb loop. And so I have these, like, basically two weights on each arm. And I'm doing boxing, and it is, like, a heck of a workout. I mean, it's just. You'll be sweating bullets after you do it. It's a ton of fun.

0:28:42.710 --> 0:28:44.790
<v A>So you're ready for a street brawl is what I'm hearing.

0:28:45.270 --> 0:29:21.980
<v B>Yeah, I. Talking to a friend of mine who does. He's kind of a hobby blacksmith, and he told me that I always thought. And I always thought that swords were, like, £30, but no swords in real life. A broadsword or whatever is only, like, six pounds. And so now, granted, there's even more of a. Of a cantilever effect because the sword's weight is not even close to your body. But I'm starting to approach what it feels like to hold a real sword, and it's a ton of fun.

0:29:22.620 --> 0:29:29.100
<v A>So if you get stuck in one of those time vortexes or whatever, and you end up in medieval times, you'll be ready to go. You're ready to rock.

0:29:29.260 --> 0:29:37.250
<v B>Yeah. I think the challenge is I'm used to moving through people, not used to. Used to people actually existing physical plane.

0:29:38.050 --> 0:29:39.970
<v A>So as long as. If it's a lightsaber, you're fine.

0:29:40.130 --> 0:29:42.370
<v B>Yeah, as long as nobody hits me either.

0:29:44.850 --> 0:29:47.490
<v A>Every plan is good until you get punched. Yeah.

0:29:47.490 --> 0:29:48.530
<v B>Mike Tyson, right?

0:29:48.530 --> 0:30:44.740
<v A>Yeah. Random sidetrack. Okay. I wanted to go back to One thing you were saying, which I don't know if we talked about the show, so I'll mention it just really quick because we've talked about it before. But when Activision was on trial, that was earlier this year, I believe it came out that there are a million PlayStation users who literally only play Call of Duty, literally nothing else. They have their PlayStation and they only play one game. And something like there was 6 million who spent 70% or more of their time playing a Call of Duty. And so you're like, I only play one or two games. I'm the same. I feel bad if I just keep playing the same game, you know, Factoria. But, you know, if you just sit there and play all yard. But actually, I think, you know, this is quite normal is what is. Is what the statistics show is people just end up really liking one game and for very long periods of time, they just play a single game.

0:30:45.540 --> 0:30:55.700
<v B>Yeah, that's wild. I mean, it's very hard for that industry to make money if they have to sell those units. I think they sell the units at a loss. And if someone only buys one game, it's just really difficult.

0:30:56.360 --> 0:30:57.480
<v A>Yeah, no clue.

0:30:58.200 --> 0:31:00.280
<v B>All right, time for Book of the Show.

0:31:01.240 --> 0:32:36.280
<v A>All right, so I'm gonna go fast because mine's a cop out. Mine is the paper we're about to talk about today. So we're talking about an algorithm, hyperloglog. Stay tuned. And so my book of the show, which isn't a book, it's a paper, and it's a paper by Google Research, which is where hyperloglog was sort of written up. So check that out if you're not a fan of reading papers. I know Jason reads a lot of papers, but I don't. I find them very difficult. I don't prefer them. I don't like the way that they're presented. I think they're pretentious, but whatever. Anyways, all the PhD people are coming after me soon. All the. Anyways, but I find that they're not written for clarity, but written to sort of affect a certain style. This paper is no different. But there is some good stuff in here. There are, you know, good graphs and even. Just. I've just given up. That you don't have to understand every line of a paper or even like the notation, just. Just kind of like reading it once or twice. And if you sort of go through a couple of papers and. And people will often reinterpret a previous paper. So for papers that are sort of seminal, you can kind of read a later paper and get a better description. And so anyways, don't feel bad if you're not a paper reader. You know, I would, I would say as part of like, you know, some small percentage of your education, if you're, if you're not a sort of master student or PhD student or PhD graduate, uh, still, still don't shy away from looking at papers every so often when they come up and, and just sort of like reading, you can, you can occasionally glean really interesting tidbits or, or insights. Not always. There's a lot of junk, you know. Anyways, Hyperlog, log paper. That's my. That was a weird rant for a book of the show, but there we go.

0:32:36.440 --> 0:32:42.680
<v B>No, I think it was really important. I mean, we must be on the same wavelength because my book of the show is also a research paper and.

0:32:42.680 --> 0:32:45.240
<v A>I didn't know this. I saw it in there and I thought it was an actual book.

0:32:45.830 --> 0:35:56.520
<v B>Yeah, I went into today's our show thinking, man, I've read so many research papers but made zero progress on my Audible books. I'm just going to talk about a research paper and then here we are. I'll go pretty quick too because I want to get to the content. But basically Nvidia has this really interesting thing where they're using large language models to try to fabricate goals for this reinforcement learning robot. So for example, they'll ask ChatGPT and. But I'm just starting to get into this paper so I'm not going to do it justice right now. But they'll basically ask ChatGPT for like, what are certain things that we can do to show that we kind of have good ambulation? And then they'll kind of just keep riffing on that. So it's. I do feel like one thing that's always missing with reinforcement learning is you have to train from scratch versus, you know, for example, if you're driving a car or writing something down or whatever, you're, you're, you're starting from this base of all this common sense reasoning. So okay, here's a good example. So we have the AI to play Atari, right? And so the way it works if you want to play Pong, is you start off just moving the paddle randomly and losing, but then sometimes you just randomly hit the ball and you get a point. And so that creates a target and then you start marching towards that target. This becomes a really big problem in something like an rpg. So if you want to play a role playing game with AI, you're going to have to, like, try every spell randomly, move around randomly, because you're not drawing from a common sense reasoning bank. Like. Like, you don't. The AI doesn't know that. Well, forests usually have more dangerous enemies than roads, because roads are generally guarded. Or fire works well if the enemy looks like he's made of ice. Like, these are all things that you could just do on the first try. But the AI has to, like, trial and error its way through. And that's why you don't see, you know, AI playing Dragon Warrior, for example. Right. Because there's just too much common sense that. That has to be just learned from scratch. And so this. This interested me because I feel like maybe we can somehow draw from the common sense that's been accumulated in something like ChatGPT or GPT4. I think there's something to that. Like, imagine if you somehow just fed the visual and the textual content to some type of embedding and then trained on that. Then maybe Fire one and Fire two would go into this embedding, and you would just learn, in general, some things about fire, the fire spell, or whatever. So I feel like there's something there, but it's very early days.

0:35:58.080 --> 0:36:27.070
<v A>That's interesting. So rather than combinatorially exploring the space and finding out that if the attribute ice or whatever is attached to a enemy, then use a fire spell, I guess I could use a Pokemon, maybe, is a good example. Like, oh, visually, this thing looks like it's frozen or has an ice type associated with it. Or I've memorized that I should use a fire spell. Like you said. That's done that way. Even though it's an abstract concept, it's just rock, paper, scissors.

0:36:27.070 --> 0:36:27.430
<v B>Right.

0:36:27.430 --> 0:36:55.810
<v A>But that transitive is encoded to you as a human because it maps to something you understand, which is like, you know, if I have something made of wood, something with fire will destroy it and would have a very hard time fighting something made out of fire. And so you're sort of saying you could kind of, like, get that understanding by combining two approaches. Something that understands the context being given to the human and then the normal.

0:36:55.810 --> 0:36:56.650
<v B>You know, sort of.

0:36:56.650 --> 0:36:57.450
<v A>Yeah, interesting.

0:36:57.450 --> 0:37:34.170
<v B>Yeah, exactly. Exactly. Like, I mean, of course you could. Well, I don't know. Of course. I mean, you could theoretically cheat if somehow you could read the game's RAM and figure out, like, okay, this. This character has this, you know, ice attribute. Right. But, like, that's not how humans do it. Like, humans are just looking at the picture. And so to really play Final Fantasy or Dragon War, these Other games fairly like to show it that an AI that is really representative of solving the problem, it has to be able to just look at things and infer based off what it knows about the universe.

0:37:37.530 --> 0:38:02.090
<v A>So I guess the equivalent would be if a computer designed a game that it thought was interesting based on the rules of the game and then presented it to you as a human. You would be like, this is not fun. This is just like some abstract concepts. But the AI would be okay. You would be on a more equal footing. But because game designers are humans designing for humans. Yeah, you kind of. You got to cross that chasm before. Or you put a computer on equal footing.

0:38:02.250 --> 0:38:10.170
<v B>Yeah, I mean, imagine like a poco, actually. They have these. Where it's like they randomize the game. Have you heard of these game randomizers?

0:38:10.170 --> 0:38:11.050
<v A>Yes. Yes.

0:38:11.290 --> 0:38:32.540
<v B>Yeah. And so they, you know, imagine a game randomizer for Pokemon where it just randomized all of the strengths and weaknesses of all the Pokemon. It would be just terrible because it's like, oh, this, the turtle that shoots water is actually a fire entity. And it would just be really frustrating. That's how the AI feels.

0:38:35.580 --> 0:38:39.340
<v A>Am I supposed to empathize and feel bad for the AI? Jason, help me out here.

0:38:39.580 --> 0:39:00.200
<v B>Yeah. I mean, it only makes billions of dollars, controls the stock market. I mean, you should feel bad for this. Thank you. And if you, like, takes like that, you should subscribe to us on Patreon. If you're not an AI, you should subscribe to our. Our Patreon. We really appreciate it.

0:39:00.760 --> 0:39:07.480
<v A>I. If you are an AI, you can also subscribe to our Patreon and fund this very important source of training data.

0:39:08.120 --> 0:39:08.760
<v B>That's right.

0:39:09.400 --> 0:39:18.040
<v A>So, yeah, thank. Thanks to all the Patreons. Read. A lot of. Of people stick around for a very long time, so it's always an encouragement and, you know, thankful for the support.

0:39:18.680 --> 0:39:19.560
<v B>Yeah, definitely.

0:39:20.520 --> 0:42:06.160
<v A>All right, it's time for the tool of the show. Well, I'm returning to form and giving another game. This is. I mentioned factorial earlier. This is in the style of. I guess it's closer to satisfactory if you've played satisfactory, which is, you know, there was a. Kind of a lot of stuff routed up around the Minecraft culture, you know, sort of Terraria and several others. And I think we're seeing kind of the same thing with, I don't know what you call them, I guess factory building games. I'm not sure what the official. I think that's right, like base building plus factory building. And so Tectonica is a new one. I've been playing on my Steam deck. Actually many of these games, like satisfactory, actually don't really have very good sort of controller support and people do play it. So someone's going to write in and be like, I use the touchpads for the 37 shortcuts that you need. And you know, it's not something you would sit down and expect to play on your Xbox. Recently Factorio added that that was actually when I picked it up and started playing on my Steam deck. But I've been playing this one, Tectonica. And so, you know, all these add a twist. So this one is 3D, which is, you know, adds an interesting dimension to your ability to, you know, build your factory and how to align things because it's in a first person sort of 3D 3D thing. And then the second thing is that you're underground so you're in caves. And caves, of course, like limit how you can sprawl your factory even kind of more than normal. But you're also given, you know, drilling equipment basically so you can carve out, you know, paths to, you know, extend your factory line or add stuff. And you know, you need plants to fuel some of your, you know, combustible machines like smelters. And so you need to, you know, to hydroponic growing because, you know, it's a pain to run around and mine, mine, all the mine, gather, harvest, harvest on the walls. So anyways, Tectonica, it's still early access, so I guess in theory it's buggy. I did, I did actually hit one bug on my computer, but in general very playable. I've not hit like some nasty crashes or anything. And a lot of these games interestingly are satisfactory I think is in the same thing still technically early access, early release, whatever it's called. And so it's not formally done yet, but very, very playable with you know, beginning and end game and all of that. And so Tectonica also is nice because as a. I mean you may not like this story, it's not an RPG level story, but there's definitely like a story to it about what's going on and sort of like accomplishments to work your way through that that are sort of very specific to unlock a narrative that goes along with it. So I'm enjoying it. Tectonic. I checked out on PC, very cool. As far as I know. I think Xbox, PC. I don't know about PlayStation.

0:42:07.200 --> 0:42:22.080
<v B>Oh, very cool. I'll check it out. Yeah, I love Factorio Unsatisfactory. So this is right up My alley. I'm going to give it a shot. All right. My tool this show is the ESP32 Development Board. Have you heard about this, Patrick?

0:42:22.590 --> 0:42:22.910
<v A>Yes.

0:42:23.230 --> 0:42:33.310
<v B>You have. Okay, I'm going to say what I think it is. I literally just got one. I have actually two of them right here. I'll hold it up for the audience to not see because you don't record video.

0:42:33.710 --> 0:42:39.470
<v A>Actually that was useful because I was wondering which development board you have. So I have a tiny one.

0:42:39.470 --> 0:42:43.470
<v B>Yeah, it's like the size. How would you. Maybe the size of a stick of gum, basically.

0:42:43.550 --> 0:42:44.110
<v A>Okay.

0:42:47.070 --> 0:45:02.480
<v B>Yeah. And the thing that caught my attention was that it has wifi. So it's the Raspberry PI Pico W. I have a couple of those. Those also have WI fi. Those were significantly more expensive. This is like extremely reasonably priced. I have a link to the actual one I bought and I bought. Yeah, I bought two of these for 1274 US. So what is it, like 620 a piece or 670 a piece? So yeah, very reasonable. And I haven't tried it yet. They're still in their vacuum seal bag but. Or static bag, whatever it is. But I definitely want to give it a shot. It supports Arduino and Micropython, which is pretty cool. So you could use either of those. Yeah, the Raspberry PI and the Arduino, they only support their own thing. This one supports either. I'm generally a little leery of things that. That are just so open ended that maybe it won't do any of those. Well, the instructions on that came with it. Walk you through how to set this up for Arduino so I might just stick with that. But. But yeah, just the fact that I can get WI fi in a small form factor is really neat. I think next year what I might do is I have a. One of my euro, you know, Euro is a mesh WI fi thing. One of my Eero capsules is pretty close to the front of the house. And so I'm thinking I can actually have robots in the trees next year for Halloween and they'll get this WI FI packet and when they get the green. When they get some WI FI packet, they'll swing from the tree like some giant ghost or something. So that's my plan. Last Halloween, which was just a week ago from when we're recording this, I actually went down a step. I only did about half of it. Things have just been really crazy for me. So I didn't have a lot of time to invest in it. But I'M going to try and make up for it next Halloween I'm going to bring back all the robots I didn't do this Halloween and add some more in the trees if I can. Very cool.

0:45:03.440 --> 0:48:16.380
<v A>Yeah. So I guess the commentary I have there is the. I think it's espressif is develops a little chip that does like all the WI fi and has I believe they're ARM cores on them. And so all of that to I guess like what Jason is saying is kind of irrelevant unless you're doing embedded programming directly. But they're very, very cheap and common and so a big community has sprung up around them. So Amazon's a good source if you want them quickly. If you are willing to wait for overseas from AliExpress you can get the original version or at least the original one I came into, which is ESP8266, which is a 16 bit microcontroller I believe again, maybe not important depending on what you're doing. If you're just sort of controlling a few servos and also has the library associated with it and that one's like $2 for like $3. So you can search like they're kind of like all cloned. But wemos W E M O S is a very common like dev board you can kind of look up and they have ESP8266 versions and ESP32 versions. ESP32 is a 32 bit processor faster. So if you're needing to do more computation they have variations of it with. You can attach a camera even and do some minimal image processing on or use as webcam. Of course, if you're just going to need a webcam it's better to normally just go buy one but then you normally, you know, pulling the data down and doing processing rather than being able to handle it sort of device side itself. And they all have, you know, we talked about that last time with the Raspberry PI. It has the sort of headers for doing I Squared C or SPI and you know, a lot of example code. And like Jason mentioned, you can use the Arduino, I guess you call it like middleware but basically like the libraries and the interfaces associated with that they do need to be for like hardware stuff. They have to be ported to ESP32. So you can't run the sort of Arduino libraries directly. In some cases, if they're software it's fine. But for, you know, things like i2C handling for a specific something, you know, you may you may need some, you know, variants, but it's relatively straightforward. Much easier than sort of building your own PCB and doing it from scratch. But they also support micropython, as I mentioned, and LUA as well, are pretty common is to sort of control these. And the big win, like Jason mentioned, which you don't normally see in a, you know, bog standard Arduino, or at least not at this price for sure, is the ability to connect to your WI FI network. And so you'll see a lot of home control devices actually use these as well. And there's a lot of firmware floating out for reflashing sort of, you know, the light bulbs like that screw into your roof have. We'll have one of these chips inside and that's how they connect to your network so that you can control them from the app. And so figured out that a lot of these run very, very similar chips. And so there's various from easy to hard ways of sort of flashing your own stuff onto them. And in many cases, the other cool thing is rather than having it hooked up to your computer, there often is the ability to update firmware over the air. So you can just send a new update to your bat in the tree and change its behavior without having to bring it back inside and hook up cables.

0:48:17.320 --> 0:48:52.510
<v B>Oh, that is awesome. Yeah, I'm really excited about this. I'm going to have to. I bought a. This is like probably just really like amateur to you, but I bought a crimper. I've never crimped anything before, but I wanted to like basically have this thing control the servo. But I need to have a wire come out of this thing and go into the servo. And it can't be like just a Dupont cable because of the way this thing is set up. And so I actually bought a crimper. I'm going to try doing my first crimp at some point. Oh.

0:48:54.350 --> 0:48:59.870
<v A>Yeah, I don't do a lot of crimping. I do a lot of soldering but.

0:49:00.190 --> 0:49:01.190
<v B>Not too much crimping.

0:49:01.190 --> 0:49:08.190
<v A>And a correction, the 8266 was actually still 32 bit. Just less powerful, less peripheral devices.

0:49:09.070 --> 0:49:15.830
<v B>Would you have to buy like a thousand of them if you're buying it from AliExpress or can you still buy like five or ten?

0:49:15.990 --> 0:49:17.390
<v A>You can buy like one or two. Yeah.

0:49:17.390 --> 0:49:18.750
<v B>Oh, nice. Oh, even better.

0:49:18.750 --> 0:49:29.830
<v A>The shipping has caught up. It used to be a much better arbitrage or however you say that, but now, yeah, normally, you know, they want you to buy a few to Basically amortize the cost of shipping, but it's not like hundreds.

0:49:30.150 --> 0:49:30.630
<v B>Got it.

0:49:30.630 --> 0:49:39.830
<v A>You know, if you're buying a tape, a tape of them, you know, the actual chips, you probably gotta buy 500 or something. But most people are buying a dev board, so you're only buying a couple.

0:49:40.590 --> 0:49:42.030
<v B>Makes sense. Very cool.

0:49:43.310 --> 0:52:56.290
<v A>All right, well, it's time to talk about hyperloglog. So Jason has warned me that he has been willing to play foil to my explanation here for more broad topics. I do have some other specific algorithms in mind, so if this goes okay, maybe we'll do some others in the same style. But let me kind of set up and walk it through. And Jason, ask away. The questions represent the average person out there or the everyone else, or even me, because normally I'm the one asking you these questions. The algorithm here arrives from a difficulty that probably most people haven't really thought about. In fact, I really hadn't thought about it until it was posed, which is how do you count the number of unique items in a set? And so if you sort of imagined the classical example here, we'll just use this one. You know, everyone's pretty technically oriented here. Imagine you're a website and you don't want, you know, you see that visitor account, right? And your mom goes and visits your website and then she just like keeps refreshing and the counter just keeps going up. Yeah, I have a popular website. Most people will sort of give this number instead as, you know, unique visitors. So you know, you could say, oh, okay, I'm going to, you know, store a cookie and tell if you've been there before or whatever. Right. There's various ways, but just imagine, you know, it's sort of being written to a log, let's say the IP address or you know, a unique identifier for a user and you want to count up how many of them there are. So you know, like the first source blush, you know, sort of like the naive answer is you sort of, you look through your list of all the visitors and you keep track of which ones you've seen. And then every time you go to the next one, you look through the list and if you see that user, you throw it away and then you count the size of the list at the end and then boom. And that, that is correct, that that actually will produce the correct answer, the so called cardinality, the number of unique visitors that was in your set of consideration. And so you may say, well, well that's, that's pretty obvious. To improve if this was a programming interview or that would be a suitable first answer. And then your interviewer would normally, you know, up the challenge a bit, which is, you know, ask you, well, how efficient is that? Could you do better? And so, of course you can do better than searching through a list for unique items. And you could use a tree, right, that, that would be one option. And the other option is you could use a hash map, right? So you could have a hash map of all the unique identifiers. And then every time you, you know, get a unique ID in, you hash it, or maybe it's already, you know, you randomized enough. But normally you want a really, really rigorous hash. And so you mix up all the bits. You get essentially a random number, but a repeatable one. And you go to a set of buckets and you say, is that one present or not? If it's present, you throw it away, and if it's not, you insert it. And we've talked about hashmaps on the show before. If not, you can, you know, go listen to them if you, if you don't know what they are. And so at the end you count how many entries you have. And, you know, this will also work.

0:52:56.850 --> 0:53:37.659
<v B>Yeah, I mean, the limit of what I, off the top of my head, would do just to scale up is have a distributed hash table where basically you do the hash and then, you know, each. It's like if the hash starts with a 1, send it to computer 1. If the hash starts with a 2, send it to computer 2. And so then you have a computer for each possible value of the first digit of the hash. And, and so then now if there's, I don't know, 56 different characters for the first digit of the hash, then you have 56 computers and you get 56 way parallelism. But yeah, I guess we're going to hear about weight even better than that.

0:53:37.980 --> 0:57:08.440
<v A>No, no, no, no. So this is great. So I had the same kind of thought process in my head and I think that works great for live, right? So if these, if you wanted to sort of like what I was mentioning, you know, these, these visitors are coming in and, and you're sort of counting them. But it doesn't work like after the effect or sort of analytic purposes. So imagine you, like I was mentioning, you have, this is a log and you want to cut it by day, by month, by, you know, people that came from Europe, by people who ended up buying something by whatever, you know, these kind of filters. And so I already referenced this comes from a paper and the paper internal to Google, they have a column store and so they have a distributed system for querying the column store. And this was developed because they went and looked and people doing analytic queries or I forget what it said in there, I won't quote it because I'll get it wrong. But a very large number of count distinct. So you write this sort of SQL query and you say, hey, I want the distinct number of these, but they're over arbitrary input. So what you say will work. Jason. But the issue is sort of like, that's like if you only really want, you know, a handful of them live and you're not sort of like infinitely slicing them later. So not, not, not an off approach. We'll come back in a second. I think you're heading in a good direction. But the, the sort of first, there's many algorithms have been done to, to sort of solve this and the paper even references one and, and to be clear, they even recommend this one if you're sort of not going to have very many if you don't believe that there's a lot of unique items. And this one will work. And I want to talk about this one because it references something else and we already kind of mentioned hashing unique identifier and then, you know, trying to put it in a bucket. So this first one is linear counting. So if you hash it and you put it in a bucket, but rather than in a normal hash map, you would store the identifier and one way is store a linked list of identifiers. So the first thing in the bucket is the head node and then, you know, it points to the next thing and you have a linked list, right? Until you sort of grow the hashmap number of buckets. In this case, what we're going to do is actually only store a single bit and not store the actual id. So you just insert the value and if it goes in a bucket, you just mark that bucket as having seen something. You know, boolean equals true and all the other buckets are boolean equals false. So now what you have an issue is when you have a hash collision, you're losing some information, right? You're losing the information that there were actually potentially it was the same item, but potentially it was two different items. And so you are getting this sort of probabilistic structure, but this is going to turn into an advantage. And the rest of this is going to talk about this. And the reason why I think this is an interesting topic for the show is my background before kind of coming across a few of these was always like, you would definitely not want a probabilistic answer, you would want an exact answer. And so it would not be acceptable to be even off by one. Right? That's an off by one error. And so you exactly what you have. But if you're willing to give up an exact answer, the design space widens dramatically, not only for the number of solutions you can have, but also for things like being able to tune your trade off, because you can say, how reliable can I trade space? Or processing for more reliability. And an example that you see a lot recently with machine learning is approximate K nearest neighbors. So we've talked about K nearest neighbors before. I think maybe, if not maybe that's a good one to talk about.

0:57:08.600 --> 0:57:09.800
<v B>I think we have. Yeah.

0:57:09.800 --> 0:57:10.280
<v A>Okay.

0:57:10.280 --> 0:57:10.920
<v B>Okay, good.

0:57:11.720 --> 0:59:18.590
<v A>But what if you were willing to accept an approximate answer? Then you can kind of approach it differently. So this first one that I'm mentioning is called linear accounting. So you, you hash it, you keep track, and in some cases, when you produce this random number and you go to insert it in a bucket, you're going to collide. And like I mentioned, sometimes that you want it, you want to throw away because it's the same ID and every one of the same IDs will collide, but sometimes another ID could probabilistically collide. And that's the second bit of this, is using statistics to say when you're done, don't just count the number of buckets that have an entry in them, which would be an approximation, but you can do better, which is look how many of the buckets, like the percentage full your hashmap is, and use that to estimate what the likelihood is that you may have seen extra insertions that missed. And so this, this is like this like clever unlock, right, for me was we don't know how many of the same item and how many you had. Duplicate hashes or portions of the hashes that you used are actually different IDs. But if your, you know, hashmap is only 10% full, only 10% of the buckets had values, that's, it's pretty unlikely. Like you can just say 10 of the buckets had items. So I'm just going to guess 10. But if you know, 90% of your buckets, you had 10 buckets and nine of them had it, then the chance that the next thing being inserted being random, 90% chance that it's going to collide. Right? And so you sort of take that into account. Now there is some balance there, right? If you had 10 buckets and 10 million unique items, you're going to get like a terrible guess, right? Right. Because you're going to fill it up and then everything will collide and you can't, the information is lost. So you do need to have an understanding of kind of the approximate number for this example, the sort of approximate number of items that you would have. So this one's called linear counting. And when I was reading about this, the thing that comes up here, which I don't know if you've used these before. Jason, Bloom filter. Have you ever used a Bloom filter before?

0:59:18.670 --> 0:59:21.870
<v B>I've used it. I don't remember exactly how it works.

0:59:22.430 --> 1:00:15.970
<v A>Okay, so this is very similar to a Bloom filter. In a Bloom filter, the idea is you're trying to have us what things are members of the set, and you query the Bloom filter and say via the same mechanism, basically you hash it, you look in the bucket, and if you normally do a couple different buckets and you say, if every one of those buckets has an item in it, then the thing I'm looking for, I'm going to do a more expensive query or check or actually go to the database. But if any of the things that should be present aren't, then I know with 100% certainty that this item is not in my set. So if I'm looking up the username Jason for Jason and I do this operation and it comes back that one of the buckets that should have had a bit set don't, then I know Jason isn't in, he's not a user, so I don't actually have to query my backend database. And the advantage.

1:00:16.130 --> 1:00:56.800
<v B>So let me see if I understand this right. So. So the idea is like, I mean, it's a crude example, so there'd be sort of a bit for every letter and position pair. And so it's like, you know, if you don't have the. So Jason is J a S O N. If you don't have the a at position 2 bit, or you don't have the o at position 4 bit, if you're missing any of these, then you know that the word JSON just can't be there. But if you have J1, A2, S3,04, N5, then you're still not 100% sure, you have to do some extra step, but you're confident enough that you've eliminated so many other things, that's still a huge time save.

1:00:57.860 --> 1:02:30.280
<v A>Yes. So if you took each of those, like J in the first position, A in the second position and you, you know, basically hash them and then put them in the bucket. And the probability of all those things you said, right, the probability of false positive is controllable. If you say, I think I'm going to have this many true entries, I want to use this many bits. And there's calculators online for sort of fine tuning your parameters, and if you have lots and lots and lots of buckets, you can decrease your false positives. But if you, you know, shrink it down, the really nice thing is, say, you know, on the edge node you can't store that you have, you know, 10 million usernames, but you could store a megabyte or whatever and in that megabyte you could store, you know, this Bloom filter and give you a reduction of, you know, it's only worth it if you think there's lots of users going to try to, you know, log on who don't actually have usernames in the database or whatever you're trying to do. But if they do, then you can sort of not have to go to the back end. So if you imagine for something like, you know, a spatial index and you're trying to query for an area on your computer, you know, like your video game, you're looking into this portion of space and most of the portions of space are empty and you just have entries of where there are things, then you could filter out the expensive sort of collision detection and all of the other things by basically knowing up front which things are present or not. Now, there are other ways of handling that as well, but you're right, it doesn't give you a 100% answer, but it prevents you from doing an expensive query by filtering out many of them.

1:02:30.520 --> 1:02:34.360
<v B>Got it. And that's a Bloom filter or did. What is it called? A Bloom filter.

1:02:34.360 --> 1:02:49.600
<v A>Do you know, that's a great question, I think, because you do this multiple hashes, so you don't just do a single entry into a bucket. You normally bloom. I would. This is my imagination, because it blooms out and you normally do a couple different buckets by taking portions of it.

1:02:49.600 --> 1:03:17.450
<v B>Okay, I have a really funny true answer. So. So the. Well, that might be true, but also it's called it because it was created by Burton Howard Bloom. That's a true story in 1970. But, but yeah, I mean, that's, that's really interesting, man, I can't believe that, you know, that this, this someone in 1970 was thinking about this problem. That just blows my mind, you know.

1:03:19.050 --> 1:04:09.360
<v A>So this one was the first. This Bloom Filter was the first one I came across that was this, it's okay to be wrong. And in fact it's just a tunable parameter. This is sort of weird. Sometimes you get this if you do like DSP work or filtering or something, you know, or you talk about floating point math, but for something that was like a data structure, a data structure that's like meant to be somewhat broken, it's like intentional that it sometimes gets the wrong answer. It's by design. So Bloom filter and linear accounting, at least in my mind, pretty similar. So moving on from linear accounting to the next step, we're going to go to the tail end of hyperloglog and talk about log log. So log log does something really interesting that in if you. We were talking about the Bloom filters taking, you know, Jason was talking about taking the j and hashing that, then taking the A and hashing that.

1:04:09.360 --> 1:04:09.720
<v B>Right?

1:04:10.440 --> 1:05:20.960
<v A>So instead if you just hashed JSON, the string JSON, you have a really good hash function. You could just use the first couple bits, the first however many bits you want to determine, you know, the bucket that you're going to go in. And then instead of storing just a single bit of, you know, I had something in this bucket or not, you go to that bucket and you look at the rest of the hash. Say your hash was 32 bits and you had five buckets. So, oh, did I pick bad numbers. 27 bits left, and those remaining 27 bits, you could either use the first or the last, it doesn't really matter. Count the number of sequential zeros. Now, this part was really, really weird to me. I didn't understand this. So take the number of, let's say the. Starting from the left to the right, count the number of zeros in a row that are in the next 27 bits, and then instead of storing one or zero in the bucket, store the number of zeros. So if the first digit was a one, you would put zero. If the first digit was a zero and then a one, you would put one because there's one zero. But you could have four zeros in a row and you would store that in the bucket. Now, if you were like me, that.

1:05:20.960 --> 1:05:24.960
<v B>Number of leading zeros. Yes, in a row. Okay, got it.

1:05:24.960 --> 1:05:59.700
<v A>The number of leading is zeros in a row. And the clever probabilistic thing about this is that if you imagine that there. So your hash needs to be very good. So it needs to have a 50% chance of a 0 and a 50% chance of a 1 in every bit position or as close to that as you can, you know, manage. Then what you would say is, if you're going to look at a stream of unique IDs, you would expect that the length of sequential zeros that you would get is a function of the number of items you've looked at.

1:06:02.660 --> 1:06:03.380
<v B>The length.

1:06:04.500 --> 1:06:22.200
<v A>If, if I said you're only going to play the game once, you're going to generate one hash and look at it, and I asked you to derive, like, what, what is, what would you be the expected number of zeros? You would see, it's much more likely to see one zero or two zeros in a row than to see 16 zeros in a row on the very first time.

1:06:22.280 --> 1:06:32.880
<v B>Yeah, it's like, it's like a coin toss, right? It's like, how many times can you toss the coin and get all heads? Well, like, as the number of tosses go up, that. That probability goes down, right?

1:06:32.880 --> 1:06:47.880
<v A>That's exactly the key to this. So if you look at a given bucket and let's say you only kept one bucket, that's fine. You could just have one bucket and you keep track of this. Then at the end of seeing all of the items, you look at what, the sort of, like, longest run of zeros.

1:06:47.880 --> 1:06:48.120
<v B>You.

1:06:48.850 --> 1:07:01.170
<v A>Now you say that I, I believe that I've seen one over two to that many number of zeros. That's the probability. So then I've seen two to that many number of zeros items is my guess.

1:07:02.370 --> 1:08:10.530
<v B>So, yeah, let me, let me see if I can rephrase it, like, paraphrase it and see if I have it right. So, so you flip a coin, there's a 50, 50 chance you get heads. If you get tails, you're done. And that's a zero. You get zero points. If you get heads, that's a point. And then you try again, you keep going until you keep accumulating points or you're out. And so 50% of the time you'll get zero points, 75% of the time you'll get 0 or 1 point. And so it just keeps going up like this hyperbolic ramp. And so, you know, 99% of the time you're going to get less than 10 points or something, right? And so what that means is, what you're saying is, you know, if you did get 10 points ever, then, then there's probably like 99% of the other trials there that you just didn't record because all you did was record the best score. It's kind of like, it's like if you go to the arcade and the high score on Pac man, like maybe they just rebooted, like reflashed all the machines.

1:08:10.690 --> 1:08:11.130
<v A>Yep.

1:08:11.280 --> 1:09:06.210
<v B>Yeah. And the high score of Pacman, or actually this happened to me where, you know, I'm obviously not super Mr. Athletic or anything, but I went to Chuck E. Cheese with my kids and I always play the football game. You. Have you seen this at Chuck E. Cheese? You have to throw the football through the holes. Yeah. So I played the football game and I won the high score of the day. And I felt like for a moment like, wow, like I should have like, you know, played in the NFL or something. And then like five minutes later, like some guy behind me, like, you know, got the high score of the day. That's because they just reflashed the machines. Right. So the high score, you know, there wasn't a lot of samples and so the high score was easy to beat. Conversely, if you go to like, you know, some other arcade or if you go later on in the day, the high score will be super high. You can't even get to it. And so you're using the high score as an approximation for how many people have played the game.

1:09:06.820 --> 1:09:22.660
<v A>Yes, that's the, yes, that's the key insight. I guess we didn't talk about that part here, but you, you caught onto it, which is you want to record the maximum you've seen. So, so in this bucket you record the maximum. And as you said, as the number of attempts goes up, and unlike your example where it's skill based, every one of these should be random. So.

1:09:22.660 --> 1:09:22.900
<v B>Right.

1:09:22.900 --> 1:11:23.950
<v A>It's just like slot machine and it's the. Whatever anyways. And so, yeah, so the higher that number that you've recorded, you're going to make your guess go up. Now you could get really lucky, unlucky, however you call it. And they vary. You see only a single number and that number happens to have, you know, a hundred leading. Oh, we, we said 32 bits. It has 27 zeros in it. And so you think you've seen two to the 27 items, but you really only ever seen one. And so this is a, this is not good. Right? This, this would make you have a very, you know, a high chance of getting an error because of this one problem. And so the second thing which I started off with there is by having multiple buckets and then averaging them together. You take the first few bits, the first five, and you choose your bucket. Then you use the last 27 and you do this thing I'm mentioning that We've talked about recording the high score and then by averaging all of the buckets together, if you've only seen one, sure, you're going to have this really large number in a single bucket, but all the other buckets are going to say zero. And so you're going to pull it back down to try to correct. And we're setting up for the second thing here in a minute. But this is called, I guess like log log. I don't see a ton of evidence that this was really, you know, rolled out or used because there's some pretty easy ways to improve it here. We'll talk about in a second. But this idea of recording multiple max scores and then looking at the end and taking an average allows you to sort of approximate it in a, in a very similar way. So the, the other thing we didn't mention with, with log log and linear accounting in both cases is you, you kind of measured well in linear counting. You measure this like, you know, amount and log log. You have these coefficients as well because you just sort of know on average you're likely to, you know, kind of see these things. So you run some simulations and you kind of tune some constants to accommodate for kind of like edge effects. I guess you would, you would sort of call it. So hyperloglog improves on log log.

1:11:24.350 --> 1:12:06.910
<v B>But one thing actually, before you jump into hyperloglog, I want to double click on something you said earlier, which is I also went through this where I felt like even just hashing, I kind of felt like, how do you trust the hashing? What happens if your data set is biased in a way where. Because maybe your data set all starts with HTTP that now like your hash function isn't pure and you have more collisions or even like Mac addresses. I'm pretty sure Mac addresses are. Well, actually, I don't have no idea. But I'm assuming that there's some kind of randomness there.

1:12:07.230 --> 1:12:12.590
<v A>There's no way all of that like it's hardware vendor sets the first few bytes.

1:12:13.230 --> 1:12:25.960
<v B>Okay, got it. But then within a hardware vendor at some point there's probably a collision, right? Or are there. There can't be like a global count of Mac addresses while the factory is like churning out tons of these.

1:12:27.080 --> 1:12:32.120
<v A>Yeah, maybe there is. I think there's not supposed to be, but probably there is, but.

1:12:34.840 --> 1:14:46.040
<v B>Oh yeah. So like whenever I see hashes or all these things, I used to feel like, well, this isn't kind of pure. It's not going to work out. And how do you even know if you have a problem. And it turns out that yeah, at scale you can't do anything very precisely. And so what you do instead is you try to measure the bounds statistically of some issue. And then you're constantly correlating between other things. For example, maybe there's a 1 out of 10,000 chance that you overestimate your unique visitors by 10x. Right. And so 1 out of 10,000, that means Google's been around for 15 years or something. So it's probably like a 50, 50 chance this has happened once. Right. But what they're doing is they're constantly correlating that with other things. So if the viewer count goes up by 10x but the revenue doesn't move well, then that might be an error that day in the viewer count. And so you're constantly trying to cross correlate between different data sets, different signals, and then also autoregress correlate across time for a signal. And so it really just becomes about reducing the error and trying to reduce the bias and the variance of these signals. And so when you start thinking about it in those terms, then it becomes okay to kind of say, well yeah, all of these errors are there, we're just going to measure them all in gross and try and get that measurement to be as accurate as possible. And so as they're making changes to these filters, they're comparing the correlation to, again just using our example, to revenue. And they're seeing, oh, the new value is much more correlated to revenue, which is what I would expect. As unique visitors go up, revenue goes up and all of that. So you can actually measure all of these things and get better in an incremental way, even though it feels like a very helpless situation.

1:14:47.720 --> 1:15:29.580
<v A>That's a great call out, I will say also I think a lot of these techniques, we're talking about time and a place. So like your financials, you wouldn't want to do this. You know, you probably do a different sort of like way the expensive query, but you can't maybe do that in real time. And then as you said, not only like comparing it with other metrics, but even just also running maybe like a second method that is less reliable or different parameters and checking that as well. So maybe it's less accurate, but it gives you a coarser bound, right? And says, you know, hey, at least I know that, like if I get a 10x measurement I need to throw it out, like that's, that's going to be wrong. So it can give me an order of magnitude kind of kind of swag.

1:15:30.460 --> 1:15:31.660
<v B>Yep, makes sense.

1:15:32.300 --> 1:17:46.920
<v A>Okay, so moving from log log to hyperloglog, it's, it's not a big jump. So the first one is for any buckets. So it's basically what happens with, you know, kind of very high numbers, like when you're, you're starting to get very dense filled buckets or very low numbers where you have a lot of empty buckets. So in the case of very low numbers, we have a lot of empty buckets. You may consider instead of taking the max counts and doing the average where like I mentioned, you have five buckets, you happen to get one that was 27 zeros. You're going to way overestimate it if you have a lot of empty buckets. Instead, maybe just estimate the probability of having that many empty buckets. So what is the chance that you saw, you know, in the case your estimate there would be very high, like million or something. Rather than estimating a million, what is the chance that you saw a million numbers and had nine empty buckets? That is going to be very low. And so you use that for sort of some corrections as well as very dense. Just like in linear counting, we're talking about you. The more buckets sort of like have bigger numbers in them, the more chance that you're sort of seeing these collisions and you're sort of seeing a different behavior of the, of the numbers. And so handle those as well. And a lot of that you just end up running by either computing the statistics or running simulations for sort of handling that. Rather than, you know, kind of a single formula that you run, you kind of have a sort of set of, you know, if you meet this condition or that condition or the other condition, apply a slightly different formula, but, but roughly the same thing. And that gives you hyperlog. So this hyperlog log allows you to perform these analytics, get an approximate answer. I think the number that I was sort of seeing in the paper is like for the majority of cases with these corrections, you see like 2, 2% error. So if that is important, like in the case of maybe financials or something, you don't want to use it. But for the case of like counting how many visitors in a day, you know, if you don't want to be lying to people if it's like a court thing, so like you wouldn't run it. But if it's just, you know, keeping track from day to day or hour to hour, as an example, you could run something like this and know that you're going to be generally close most of the Time. And of course a lot of those things. Things are tunable.

1:17:48.600 --> 1:17:50.600
<v B>Yeah, very cool. It makes a ton of sense.

1:17:50.840 --> 1:20:07.150
<v A>The final trick and the sort of like second thing that kind of like, I don't know, melted my mind a little. And there's a variety of these problems and this one's going to get to Jason's example. So here we are at the very end and you foreshadowed by talking about, you know, sending it to different computers and having each of the computers handle it. So let's imagine we chop up our logs and send all of our log, you know, one tenth of each of our logs to 10 computers. And each of the 10 computers does the thing we just described with Hyperlog log. What do you do at the end? Well, you can't sum the 10 results, right, because if you sum the 10 results the same. Now if you divided them like Jason said, where I looked at their username and sent them to a computer based on their username, you could. But if I just chop the first 10% of the logs, the second 10% and I sent each of those, some people will have two different computers will have seen the same user potentially. And so you don't want to just add up, hey, each of them. So 10 times, you know, whatever number each of them, you know, just kind of add them up. That will, that won't work. That's not a good estimate. You need to dedupe them again. Well, the sort of mind blowing thing, or at least for me, is if you look at those buckets we talked about and everyone's running the same algorithm, you can just transmit the final counts of the buckets and just run the same max operation. So if you run the same max operation across all of the buckets and then do your final calculation, that actually works perfectly fine and is incredibly cheap. So if computer one says, my first bucket has eight zeros and computer two says, oh, my first bucket only has five zeros, then you final count for the first bucket is eight zeros. And it's the same answer you would have gotten if you had sent all of the queries to only a single computer because of this, you know, counting, keeping only the max. So this max for individual computers is if you take the max across all computers the same as if they had seen them because. Oh, right, yeah, this was not just like, what? No, this ain't gonna work. It was my first inclination here. But it's, it's this aggregating by max is the unlock because you're aggregating them all Together by max. You can just run it at the end across all the computers and you have it distributed as many ways as you want.

1:20:08.190 --> 1:20:17.070
<v B>But wait, what was wrong with your idea of saying. Well, just, just adding up all of the. Oh, because there's duplicates across the computer.

1:20:17.070 --> 1:20:30.270
<v A>Yeah. So imagine the entire log, 1 million entries is all username. Jason and I send each to 10 different computers. Each computer is going to estimate one, let's say, and then my final result, say 10. The answer is actually only one.

1:20:30.760 --> 1:20:31.480
<v B>Right, right.

1:20:32.200 --> 1:20:54.200
<v A>So again, for something like visits to Google, you may know on general how many times a given person visits in a day. Maybe you could do this via other means. But if you're using this for an analytical engine that doesn't know how likely repeats are or not, then this is a very difficult, you know, thing to solve a priori with sort of like specialized knowledge.

1:20:56.130 --> 1:21:26.760
<v B>Yeah, that makes sense. Yeah, that's totally mind blowing. Another thing that this benefits from is that each bit doubles the count. And so the difference between seven and eight bits is so large that if one says five bits and the other says eight bits, by throwing away the five bit one, you're not throwing away that much because. Yeah, because as I said, it's hyper logarithmic or whatever.

1:21 --> 1:21:28.920
<v A>Yeah, its contribution is irrelevant.

1:21 --> 1:21:31.720
<v B>Yeah, right. Wow, that is super cool.

1:21:32.120 --> 1:21:45.320
<v A>I guess a couple things slipped over. So one is that moving from log log to hyperloglog, you also move from taking an average to taking a harmonic mean. So you take the harmonic mean of the buckets, which helps.

1:21:45.320 --> 1:21:46.760
<v B>Now what is the harmonic mean?

1:21:47.160 --> 1:23
<v A>Harmonic mean is 1 over 2 to the N time and then you sum those up for each bucket and then you multiply by n at the end. And so that's used for rates. So I try to look it up as well. It sort of, it's like one step past my intuition about probability. So I'll keep, I'll keep noodling on that one and maybe have to get back to you. But they see improvement by moving from the sort of normal averaging that we would think about. You know, just sum them all up and divide by the count and instead moving to this harmonic mean instead, they see improvements there. And again, I mentioned these coefficient factors and you would run simulations and so they give estimates for what they see for these correction factors for high, low offset, these kinds of things. So definitely read the paper, but I would also say that day to day life. Are you going to implement this? Yeah, don't. I mean, maybe if you have, you could. But one, we mentioned a Bunch of things along the way here. Bloom filters, probabilistic counting. Knowing to go reach for one of these if you don't need an exact answer. The second thing is a lot of databases have already like redis and other things have already implemented this under the hood for these kinds of operations. So you're likely already using it. And it's just for me one of those mental like aha moments where the chance that this specific solution is what I need to reach for is very low. But understanding a bit more about this space I feel helps me sort of know when to go to the Internet to search for something I don't know.

1:23:18.760 --> 1:23:26.920
<v B>Yeah, so I mean as far as using it, there must be some way in SQL to say like approximate count distinct.

1:23:27.480 --> 1:23:34.520
<v A>Oh, interesting. I didn't look at. So every SQL is different. So let's see. Postgres. This is fascinating.

1:23:34.520 --> 1:23:39.480
<v B>I also type postgres. Oh man, same wavelength. Same wavelength.

1:23:39.480 --> 1:23:49.070
<v A>I mean it looks like there's some add ons you can kind of do to kind of distribute a distinct count with hyperlog log on Postgres. I see a lot of articles about this.

1:23:49.380 --> 1:23:54.020
<v B>So again, yeah, I agree it looks like a plugin.

1:23:55.860 --> 1:23:56.260
<v A>So.

1:23:56.660 --> 1:24:23.630
<v B>Oh yeah, this is. So you install this plugin in Postgres and then you do hll underscore cardinality and so but I remember, I think this is Sawzall. I remember there was some I do remember in my life typing select a proxy count distinct like I don't like that for some reason is like just radiating through my mind. So there's definitely some system that has it built in.

1:24:24.830 --> 1:24:41.470
<v A>Yeah, I mean I think they're like probably in Hadoop for this sort of very distributed processing of things. So unlike when you have it on a single computer but you have it across multiple computers doing large analytics column data I would imagine like Apache Arrow or something.

1:24:42.380 --> 1:24:57.020
<v B>Yeah, exactly. PI, Spark and Pyspark have a prox count distinct. That's where I've done it. That's amazing. I had no idea that that was doing that under the hood. That's like totally mind blowing.

1:24:59.900 --> 1:25:15.470
<v A>I don't know. So. So maybe right in if you like this detailed singular algorithm exploration. I mean I guess we touched on a bunch of stuff so maybe it's not that dissimilar as I thought it might be in my head. But Jason had to play along here so he's at least acting like he walked away understanding this a bit more.

1:25:15.470 --> 1:26:14.830
<v B>No, this is amazing. I definitely learned a ton. No, we should definitely do more of this. Yeah, Please write in, let us know plus or minus. But I actually learned a ton. This is really satisfying. Another thing is when people get a CS degree and they learn about quicksort and merge, sort and all these things they might say to themselves, well, like, you know, okay, like we figured out how to sort, you know, we're done. Like, why, why even teach this to me? Like, why are you teaching me a one liner in any modern programming language? And, and what you've kind of shown folks here today is that like this doesn't end. You know, like sometimes you need probabilistic things, sometimes you're working very limited environments. You know, there's all sorts of different types of counts and things you do. And so, and so computer science, you can do it all day as your job and like the, literally the science part of computer science and there's a lot of fun stuff.

1:26:19.150 --> 1:26:23.310
<v A>Well, I think that brings us to the end of another episode 169.

1:26:23.470 --> 1:27:19.950
<v B>Wow. Yeah, pretty, pretty wild. Thanks everyone for sticking around. Yeah, we should definitely do something for our long term patrons. I'm gonna do some digging on my end, figure out if I can get that like number of months subscribed and all of that. But thank you so much for supporting us all of these years. You've been able to continue to grow the community. I actually, I'm considering using some of the community money to buy Twitter ads. It does feel like our Twitter presence is growing pretty good organically. And so at the end of the day we try to get as many folks interested in programming computer science as possible. And we couldn't do that without all of your help. And so we do put every dollar that you donate towards trying to accomplish that goal. Yes.

1:27:20.030 --> 1:27:27.950
<v A>Thank you to all the people who have stuck with us for. I always forget how long it's been, but very long time. Including Jason. Thank you Jason, all your hard work.

1:27:28.510 --> 1:27:56.160
<v B>Thank you. Yeah, someone commented asking, we did two shows on bitcoin. One was episode seven or something and another show a few years later and they were asking if we bought bitcoin at those times. And I'm sad to say the answer is no, we did not buy bitcoin at, I think they said 50 cents or $500. We didn't buy bitcoin at either of those prices.

1:27:56.800 --> 1:28:05.120
<v A>I didn't. But after that show I did install. I got some bitcoin out of a faucet but unfortunately it was like 01 Bitcoin or something. So that was the only bitcoin I had left from that.

1:28 --> 1:28:56.520
<v B>So yeah, we answered the. That's definitely the most asked question. I think. I think about once a month we get asked that. Someone asked it on Discord last week, but I do get a lot of email. Hey, are you guys. Did you guys buy a bunch of bitcoin? We didn't do that. Yeah, that wound has been sufficiently salted. But thank you so much, folks. It's awesome that we have such outpouring of support. A bunch of folks asking for different languages. We will definitely get to that, but I also want us to cover algorithms. This was a really exciting kind of new branch for the podcast, and so thanks, Patrick, for doing the background homework on this. All right, all right, catch everyone later.

1:29:08.600 --> 1:29:13.400
<v A>Music by Eric Barndaler Programming throwdown is.

1:29:13.400 --> 1:29:30.290
<v B>Distributed under a Creative Commons attribution Share alike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide attribution to Patrick and I and share alike in kind.

