Episode 22

#22. From Excel to the Lakehouse: a Data Journey with Franco Patano

November 27, 2023 · 26:57


Sheikh Shuvo: Today's guest is Franco Patano. Franco is the lead product specialist and senior solution architect at Databricks. Based in the Chicago area, Franco has spent the last 17 years figuring out how to make data useful and accessible. He's an expert in all things data management, networking, and cloud. Some of the areas we'll be chatting about are Franco's career journey, challenges and opportunities in ETL for AI, the value of a data lakehouse, and more.

Hi, everyone. Welcome to Humans of AI. I'm Sheikh Shuvo. This is where we learn about all the incredible people building the magic that's changing our world. Franco, thank you so much for making time for us.

Franco Patano: My pleasure. Thanks for having me.

Sheikh Shuvo: So, Franco, the very first question I'd like to start my guests off with is how would you describe your work to a five-year-old?

Franco Patano: That's fun. Actually, I do have two kids, ten and seven. So, I have had to explain this because my son will watch me work and he says, "Dad, I see you making, you're in slides a lot. I do slides for school. Is that what you do for work? You just make slides?" Yeah, I make slides and talk to people. That's my job. Yeah, what I do, as I explained it, is there are lots of problems that people have with data around the world. And as I've grown up in my career, I've noticed that they're all the same problems. They're repeated through every industry. And we need to do more with data than what we're doing today.

And AI lets us do more, faster than what we've ever been able to do before as humans. With our capacity, we have to think through things a lot slower than if you can have a machine do it. So that's how I describe what I do: I help solve problems around the world by applying these data and AI techniques where we could have machines do a lot of the brute force labor that we as humans used to have to do. I don't know if that's good enough for a five-year-old, but that's how I try to explain it to my kids. I think they get it; we'll see.

Sheikh Shuvo: I would love to see your next presentation have your 10-year-old design all the content to see how well it goes.

Franco Patano: He's actually a much better artist than I am, I gotta say.

Sheikh Shuvo: Yeah, nice. Nice. Franco, I know you've spent your career neck-deep in the world of data and ETL, especially even before it was cool. Tell us about what your career journey has been and what were some of the inflection points along the way that led you to where you are at Databricks.

Franco Patano: Like many folks, it started off in spreadsheets a long time ago, in a business far away. I noticed that business leaders needed data to make decisions and to understand what was going on in the business, and the stock reports that they were getting out of the systems just weren't enough. So eventually, I became a spreadsheet jockey. I learned VBA, learned how to code, and then someone showed me this magical language called SQL where instead of trying to VLOOKUP spreadsheets, I could join and select and filter and do a lot more creative things.
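To illustrate the jump from VLOOKUP to SQL that Franco describes here, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and values are hypothetical and invented for this example; they do not come from the conversation.

```python
import sqlite3

# Hypothetical tables, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 250.0), (2, 11, 75.0), (3, 10, 120.0);
    INSERT INTO customers VALUES (10, 'Midwest'), (11, 'West');
""")

# One query does the lookup (JOIN), the filter (WHERE), and the rollup (GROUP BY),
# which in a spreadsheet would typically take a VLOOKUP column plus a pivot.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.amount > 100
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(rows)
```

The join plays the role of the VLOOKUP, while the filter and aggregation happen in the same statement instead of in extra helper columns.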

I was solving kind of department-level problems. And then I moved on to solving company-level problems and then into finance. And I realized that a lot of these data problems were just the same things over and over again, and I continued my journey.

I learned more about the Microsoft stack, SQL Server Integration Services, Reporting Services, and then later on Tableau, because that was all the rage. And then eventually the empire strikes back: Microsoft with Power BI. And I went more into this BI journey. And I realized as I was growing in that career that the early 2010s had some breakthrough moments in machine learning, where elites were using it. I started becoming interested in it. Didn't know about Spark yet. At the time, everyone was primarily using R to do data science, and I took some courses on it and I was like, this is the future, but the tooling sucks. The data is typically over here in the warehouse, and then I started seeing that the warehouse can't do everything. I'll never forget the conversation I had with one of the data warehouse architects I talked to.

I was like, "Hey, okay. How do you bring audio files into the warehouse?" And he's like, "Oh, that's easy. You just put it on blob storage and then you put a link to it in the warehouse." I'm like, "That's not putting the data in the warehouse. It's putting a reference to it." So I was like, the warehouse world has not figured out unstructured data, how to get it in and how to process it. And I was like, there's got to be something else. And then eventually my career brought me to this pivotal project where the business wanted to do streaming from IoT devices. They were trying to study the occupancy of office space. This was before the pandemic, and the whole concept of remote work and hoteling cubes and reducing your office space to optimize your costs was a thing that we were trying to push where I was. The stack we had could not do streaming BI. Informatica at the time just couldn't handle it.

And I went to my Microsoft rep and I was like, "Hey, what do you have that can do streaming BI?" And they were like, "There's this new thing called Databricks." And I was like, "What the heck is a data brick?" And they were like, "No, not like that." Basically, I did a quick hackathon with one of their Spark engineers.

Again, I had no idea what Spark was at that point in time. And I basically learned it in a week and a weekend. And I delivered this streaming BI project. Basically, we had the ETL, the streaming ETL from all these sensors and then show up in Tableau in a dashboard. And I delivered the project sooner than expected and under budget. And I was like, "What is this magic?" And I was like, "It's not me. It's this Databricks thing." And then I got introduced to Databricks and they reached out to me and they were like, "Hey, do you want to be an SA out of Chicago?" I was like, "Yeah, this stuff is the future. I've been dabbling in data and machine learning and all these toolings. They're not great. And you've actually got the premise of the right stuff of how to build this. Of course, I want to come here."

And that's what brought me here.
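The streaming ETL Franco describes, sensor events rolled up into something a dashboard can show, can be sketched in plain Python. This is not Spark or Databricks code and not the project's actual pipeline; the event fields, the five-minute window, and the function name are assumptions made for illustration, and a real streaming engine would process events incrementally rather than from a finished list.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # five-minute tumbling windows (an arbitrary choice)

def occupancy_by_window(events, window_seconds=WINDOW_SECONDS):
    """Count distinct occupied desks per (floor, window start)."""
    desks = defaultdict(set)
    for event in events:
        if not event["occupied"]:
            continue  # only occupied readings contribute to the count
        # Round the timestamp down to the start of its tumbling window.
        window_start = event["ts"] - event["ts"] % window_seconds
        desks[(event["floor"], window_start)].add(event["desk"])
    return {key: len(ids) for key, ids in sorted(desks.items())}

# Hypothetical sensor readings: timestamps are seconds since the start of the day.
events = [
    {"ts": 0, "floor": 1, "desk": "A1", "occupied": True},
    {"ts": 60, "floor": 1, "desk": "A1", "occupied": True},
    {"ts": 120, "floor": 1, "desk": "A2", "occupied": True},
    {"ts": 310, "floor": 1, "desk": "A1", "occupied": False},
    {"ts": 330, "floor": 2, "desk": "B7", "occupied": True},
]

summary = occupancy_by_window(events)
print(summary)
```

Each key in the result is a floor and a window start time, which is roughly the shape a BI tool would chart as occupancy over time.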

Sheikh Shuvo: Nice. Nice. I've read a ton of your articles on Medium, and you write extensively about data platforms and the overall benefit of the shift from a data warehouse to a lakehouse model. For someone not from this world, can you explain what the difference between a warehouse and a lakehouse is and what the impact has been with the increase in AI workloads?

Franco Patano: Yeah, like I said when I asked my DBA friend, "How do you put unstructured data in the warehouse?" It really was just putting a pointer to it, right? The warehouse is what I consider a specific engine. A warehouse was built and has all the rules of the things that you need to do to handle structured data and put it into a form that is quickly accessible by reports and dashboards.

The OG use case for the warehouse was fast reports and dashboards for the business in the morning. That's what they want. The business needed quick access. The only way that you could do that at the time was by overnight batch processing in the warehouse. And out of that, the world made Inmon and Kimball famous and rich, because essentially they needed to structure these things. And out of the warehouse world came these concepts: that you could take raw data, process it, and put it into a structured form so that you get really fast access for dashboards and reports. But again, they originally did not handle semi-structured data.

So XML, which was prevalent in the nineties, and then JSON, which came out in, I think it was the two thousands or started becoming more prevalent in the two thousands. And so now all this semi-structured data needed to be processed. And these warehouse engines, again, they were built for one thing, fast reports. And over time what I've seen happen is we've been using these specific purpose-built compute, the warehouse compute, for all these other use cases that it wasn't built for. And if you think about the lake house, what ended up happening was that you needed a general-purpose processing engine to process data for the warehouse. And so what ended up happening was lambda architecture.

You'd have a Hadoop stack or like a general-purpose stack. There were variants, right? But then they would have the warehouse compute too, because the data lake stack was not great at what the warehouse was really good at, like low latency and high concurrency for BI users, reports and dashboards, right? But the lake engine could do all the other things. It could process unstructured data. It could process semi-structured data. It could actually handle machine learning within the same engine because it's a general-purpose engine. You can build it to apply it to anything.

This is what Spark is: this generalized processing engine that can be purpose-built for all these different solutions. And so that's the difference between a warehouse and a lakehouse. A lakehouse basically says, I'm going to take all the best parts of a warehouse and all the best parts of a data lake, and I'm going to bring them together so that I can have the best of both worlds, right? And so in our world, in Databricks, Spark is basically the generalized compute engine, and now we've built purpose-built computes from that Spark primitive that handle all these use cases. Traditionally, everybody used Spark for machine learning and data processing, ETL essentially. And all we really needed to do was have a really good warehouse engine. So that's what we built.

That's the premise of the Lakehouse: if you could take this general-purpose processing engine and actually build the specific purpose-built engines on top, you could have a much better unified processing engine. You don't have to have best-of-breed tools and stitch them together. You could have one unified engine, built upon general-purpose processing, that can do all these things. And so that's what I think is the major difference. The Lakehouse is all-encompassing. It's basically purpose-built solutions from a general-purpose engine. And what I see them doing with the warehouse is they're taking the warehouse engine and trying to make it general-purpose, which I don't really think a lot of these vendors are having luck with.

We see a lot of them trying to do this, but I don't think you can. It's like the wrong way around. It's the wrong direction. But that's the difference, I hope that helps.
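As a small illustration of the semi-structured data problem Franco raises in this answer, here is a hedged sketch in plain Python that flattens nested JSON into the fixed columns a report-oriented table expects. The device names, fields, and the `flatten` helper are hypothetical and invented for this example; this is not how Spark or any particular warehouse does it.

```python
import json

# Hypothetical sensor payloads: the second record omits the optional "tags" field.
raw = """
[
  {"device": "sensor-1", "reading": {"temp_c": 21.5, "tags": ["lobby", "north"]}},
  {"device": "sensor-2", "reading": {"temp_c": 19.0}}
]
"""

def flatten(record):
    """Turn one nested JSON record into a flat row with a fixed set of columns."""
    reading = record.get("reading", {})
    return {
        "device": record["device"],
        "temp_c": reading.get("temp_c"),
        # Lists have no natural column, so join them into one string here.
        "tags": ",".join(reading.get("tags", [])),
    }

rows = [flatten(r) for r in json.loads(raw)]
for row in rows:
    print(row)
```

Optional and nested fields are where the friction shows up: every record has to be coerced into the same columns before a table built for fast reports can hold it.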

Sheikh Shuvo: Interesting. Interesting. And when customers first start implementing Databricks across their organization, what are the things that surprise them? Are there any common challenges that you've seen that you work with customers through?

Franco Patano: So one of the things I think is the unsung hero of Lakehouse is we're awesome at ETL. And no one really knew this; we didn't really talk about it a whole lot, and we never benchmarked it. The world has a lot of problems with processing their data and doing it efficiently. I would say the biggest thing, the shock—not shock, but like the awe—of Lakehouse is how much faster and cheaper it is at doing these complex transformation tasks that tend to be really expensive in the warehouse world, again, because the warehouse was built for the presentation layer, for the serving layer. It was meant to serve data really fast. It wasn't built to transform data. The processing tools, like Apache Spark, were built to transform data. So that's what I see is like people were using the wrong tool for the job, and then when they come to Lakehouse, they see, oh, my gosh. This can be faster and cheaper, and they're just like in awe. That's one of the biggest things is that they can't believe that they can get more for less. And it's eye-opening. I think that's the biggest one.

Sheikh Shuvo: Yeah, that's the VBA to SQL moment right there.

Franco Patano: Yeah, exactly. You go from the dark ages, single compute processing, like single-threaded processing to better mechanisms and more capabilities.

Sheikh Shuvo: Outside of the core technology, you also spend a ton of time just engaging with developer communities there. And I'm wondering, in your time over the past 5 or so years at Databricks, much of the world has changed. You have Covid, the rise of and explosion in open source technologies, and I'm wondering, from a developer community standpoint and really staying authentic and community-minded, how have your views on engaging with developer communities changed over that time? Has the stuff you've been doing changed too?

Franco Patano: It's interesting. I would say the biggest impact on engaging with the developer communities is the pandemic changed how we interact. It's funny when I first came to Databricks, I spent a lot of my time going to meetups, and so I joined in 2019, right before the pandemic. And my first year, I basically went to all of the Chicagoland meetups that were data and AI-based, just trying to get in and talk to the community, see what problems they were having. And some of them actually invited me to talk, and I would talk about what Databricks was doing. And the pandemic came around and changed everything, and we had to pivot. And actually, what I did is, if you go back in time, I did a short run video series at Databricks called Data Collab Lab because we weren't doing anything anymore, and the pandemic happened, we weren't meeting up. And I said, what if we virtually met up like on Zoom and we collaborated on these problems and we showed the world what you could do on Databricks? And so that's what I pivoted to when the pandemic hit; that was early pandemic.

And then all of a sudden, the rest of the world figured out how to work from home, and then Databricks just took off, and we no longer had time to work on that project anymore. So it fell off, which was sad. Actually, it's one of the things I regret, that I wasn't able to continue it. 'Cause I still get people asking, "You didn't do an episode."

It's like, ah, I don't have time for it anymore. I wish I could. I've been trying to get new SEs at Databricks to pick up the torch, but AI blew up the world last November and we've been trying to catch up ever since.

Sheikh Shuvo: Yeah. With that explosion in AI interest in, say, the past 18 months, have any ways that developers have been using Databricks for AI workloads surprised you at all?

Franco Patano: I was shocked when we came out with the State of AI on Databricks report a little while ago. I was shocked at how commonly the transformer was used on Databricks. GPT blew the top off this stuff, but I didn't realize that we had already been helping customers build these things for a while.

Comcast has a voice remote. You tell it what you want to watch, and then it comes up. That stuff runs on Databricks, right? So we've been in this NLP world already before LLMs took off. But I was shocked to learn we've been helping customers do this for years, right? And now the rest of the world is waking up to it. That was the most interesting thing that I've seen.

Franco Patano: I just got caught up on over the weekend. I think we are becoming like the new republic where we're like the Republic. That's how I think about it. I think these closed warehouses are the empire where it's like my way or the highway, they lead with an iron fist. I find that it's gotta be the warehouse.

Whereas we're more open, it's more open-source. It's more like the Republic. Everybody comes, and we try to work together. That was the whole spirit of the unified platform, right? Was everybody comes together, and they work on the same data. They use the same tools, and they get their job done.

And it's open. We're not locking you in. So that's how I think about it.

I hope we don't become the Empire. I hope we don't live long enough to become the villain.

Sheikh Shuvo: Yeah. Now, one of the recent announcements I saw from Databricks was the marketplace. I'm wondering if you could talk a little bit about that and how developers are using it. My understanding was it was a commercial alternative to Hugging Face. I'm wondering if that analogy is valid.

Franco Patano: Not really. I would say Hugging Face is a really solid platform, and we actually integrate really well with it. You can pull Hugging Face models into Databricks. The marketplace was more a response to "Hey, I'm on Databricks. Why don't I just have data that I can just bring in like I would from that other platform?" It started off with data assets. We had all these organizations that wanted to publish their data assets and have people access things like weather data. Why isn't weather data just a click to bring in, right? It should be everywhere. And actually, there's more than just data now.

What about notebooks to create the models? And then what if you want to produce models too? So I guess you could say a little bit might overlap on the model side. But the marketplace is more than just data models. It's notebooks. It's all of the data assets you would need to build out these projects on Databricks or using Databricks plus like any other tooling. So the marketplace is more of a way to centralize all of the different content and assets that you'd want to leverage within Databricks. And it actually also allows the market to post their things and sell or kind of share their content. So I'd say it's more of all of the different assets you'd want to use on Databricks or publish on Databricks, not just models, but it also includes models. So you might have a little bit of overlap there, but we integrate with Hugging Face really well. So I don't really see that we're competitors, but maybe a slight overlap. Databricks is so large; we overlap a little bit with everybody, right?

Sheikh Shuvo: Makes sense. The very last question I have for you, Franco, is it seems one of the themes throughout your career has been tinkering with technology and just seeing where things go and naturally evolve. I'm wondering, from where you are now, what are the types of tools and learnings that you're doing to stay on top of the next great thing?

Franco Patano: Yeah, I'm fortunate to be inside of Databricks. We have an amazing team of academic researchers. One of the things I tell new Bricksters is don't discount the access you have here. A lot of times when people are like, "Oh, should I go somewhere else? Should I do this? Should I do that?" I'm like, "Don't discount the access you have." One of our threads is an email thread called Market Info, where we share what we're seeing in the marketplace, and we're actually encouraged to share our ideas and our content and how we think about things. So internally, I have really good support systems to keep up with what's going on. I have a lot of professional people all over the world that feed into that. Podcasts is one of them. I really like a few really good ones.

I look at guests more than the podcasts themselves. Lex always has good guests, but not every guest is awesome. Joe's got good guests, but not every guest is awesome. So I pick and choose. I follow more thought leaders in the space. I used to be on Twitter. I don't know what this Musk thing is. It got really weird. They are what they are. A lot of it's LinkedIn. Actually, I feel like a lot of the communities around LinkedIn now, they got tired of Twitter. Maybe Facebook isn't really the right place. But I'm actually seeing I can keep up with a lot of the folks that are in the industry on Medium.

Actually, ever since I started writing on Medium, I started reading more on Medium, and its algorithm is actually pretty decent at surfacing things for me. So I would say LinkedIn and Medium have been my two good public sources of content for staying up with what's going on. Keeping up with these AI models and algorithms, there's a new one, a new discovery every day. I'm constantly reading a post: "We broke this record," "We broke that record," "Now we're better with this," "Now this is cheaper," "There's more parameters," "More billions of parameters," "We figured this out," "And now forget H100s, now there's H200s," right? A lot of times it's so hard to keep up with what's going on in this space. I feel like just trying to follow the right people on social, be it LinkedIn or Medium, has really helped me stay on top of what's going on. But just stay curious.

Even the Google algorithm. I'm an old Android fan; now I'm all Apple, but I was a big Android person, and Google has the news feed, and it's actually pretty decent at understanding my searches and surfacing things that are good. So I'd say those are the ways that I keep up nowadays. I wouldn't say it's one publication or one source that I track. It's actually people. I find people that I grok: I like their interpretation, I like their analysis, I like the way they think. And it's more that I follow them and I see what they produce. So I read their blogs, I listen to their podcasts, I watch their vidcasts. I've got a soft spot for Marc Andreessen and Ben Horowitz. They're big investors in Databricks, but I really like their trajectory in Silicon Valley and in tech in general. They had come out and said that software was going to eat the world, and now AI is going to eat software.

I see that shift happening right now. At Databricks, we're pivoting to a new kind of paradigm. We pioneered the lakehouse because of all the complexities that we were having with the data lake and the warehouse. And now we're finding that a lot of vendors are just slapping on an assistant and saying, "Oh, we've got AI." But what we're doing is not just putting AI into the product; it's the core of what we do. Yes, you have an assistant, but we're trying to actually figure out, like, how can we give it more context? How can we make the assistant smarter? And now we have this new concept called the data intelligence platform, because we think that AI should not just help the user, but it should actually help the system be a better system, right? There should be intelligence in the platform in and of itself. The platform should be intelligent, and then how it helps you should have AI in it as well. So that's what we're thinking about and how I keep up, if that helps.

Sheikh Shuvo: You mentioned Marc Andreessen; just looking at his recent publications, would you consider yourself a techno-humanist as well?

Franco Patano: I would say I follow people, but there are some things that I don't completely agree with. I believe that technology is good. I think that AI is going to help us push the boundaries of human knowledge like nothing we've ever experienced before. Actually, one interesting thing while I was doing my recent travels, I never thought that this would be a thing. But it turned out to be a huge debate in this meetup we had here. And it turns out they were still talking about it afterwards. I was actually catching up with someone, and I made the comment.

I heard one of them mention this in a podcast. They gave ChatGPT to their kid. And the joke was like, I went to my kid, he was like 10 years old, and I gave him ChatGPT, and I said, it's in the computer, you can ask it anything and it'll answer your question. He's like, yeah. And he doesn't understand the amount of tech that goes into this, right? It's like, you don't understand, I just brought fire down from the mountain for you, and it can really answer. And the kid's like, of course it's a computer. It's going to answer any of my questions, right? Like, duh. And I was like, would you give your child ChatGPT? And it was actually interesting. The room was split. They were like, "God, no, I would not give my child ChatGPT. They wouldn't learn anything." And I was like, "Are you sure about that?"

Like kids have a billion questions. This thing can answer every single one of their questions and they could just continue helping them be curious and learn, right? What's the hardest problem? Like you can ask it questions. What do you say? You answer one? You're like, "Alright, I've had enough," right? ChatGPT doesn't get tired of answering their questions, right? Like they can be curious and learn so much faster. I believe it has the capability of extending critical thinking and curiosity and helping humanity actually push the boundaries of human knowledge, but there was this other part of the room that was like, "Oh my gosh, no, they're just not going to learn anything anymore. They're just going to ask ChatGPT." And it's like, when you ask ChatGPT to summarize something for you, it could take like a 500-page book and boil it down to 10 pages. What does that tell you about human language? We have a lot of filler. There's a lot of filler in our content, right?

You could really boil stuff down. If you could boil the concepts down that much, what does that mean? Do kids really need the filler? Or do you just want to get to the point? So I don't know if that helps, but I find it interesting how people are thinking about how the technology is going to enhance humanity. I think how we do that is going to be interesting. You've got Elon saying that AI is going to kill us. And you've got the other side of the house saying, "Hey, AI can help us." So I just think it's interesting. Humanity itself, though. If you look at the internet, humanity has depths that are very dirty, right? And there are heights that are amazing that we can accomplish. I think AI is just going to extend all those things. And just like we did with the internet, we have to be able to put some type of controls in place. But it's going to be interesting. I don't think I have the right answer yet. I don't think a lot of people really understand what it's going to be. Will we fuse with these things at some point? I don't know. I think it does have the ability to help us.

Sheikh Shuvo: One of my main takeaways from that then is as dads, we need to work harder to impress our kids.

Franco Patano: Exactly. Personally, I think AI could be better at helping the kids than we can. I think parents need to come to the conclusion that maybe they could get a little more help. Because maybe I'm not the best. Maybe my explanation to a five-year-old of what I do wasn't good enough. What if I have AI do it instead, right? I think that's the hardest part of being a parent, is when do you step back? When does your child need more than what you can provide?

Sheikh Shuvo: Absolutely. Those are good parting words. This was awesome to connect, Franco. Thanks so much for sharing about your life and your world, and good luck on the rest of your global tour. You got some fun adventures on the horizon.

Franco Patano: Thanks, bud. Thanks, Sheikh. Thanks for having me. It was a pleasure.

This podcast is brought to you by H10. The part about advanced technology that never changes is the need for the right people to design, build, and manage it. H10 offers just that with an on-demand talent and management service that covers all aspects of engineering, program management, and AI. Trusted by over 400 companies, including half of the Fortune 10, H10 is here to help lighten your load and make you the hero.
