Presto

Abhijeet Kamble
5 min read · Nov 12, 2019

Let’s start by talking about where Presto comes from. I think it’s always important to understand the origins of something, because that helps us understand its purpose and how we can use the platform to solve the challenges we face when trying to deliver insights to our customers.

It all started in 2006, when a group of developers working on an open-source web search project called Nutch were hired by Yahoo. The project they were working on became Hadoop, which was open-sourced at the time, and the point of Hadoop was to help Yahoo index all of the data on the internet that it was ingesting and then trying to provide search for. This worked exceedingly well, because Hadoop was able to spread the data out across multiple machines, the different nodes in a cluster, and create what we now call a distributed computing platform.

The problem with Hadoop was that in order to access and work with the data, you had to understand some very low-level APIs and know Java. This is where MapReduce came in. Based on a programming model originally described by Google, MapReduce is Hadoop’s Java-based interface for processing data stored in the Hadoop Distributed File System (HDFS). MapReduce helped developers start to use Hadoop for all of their data storage needs. So far, so good. But when companies really began to realize that their data analysts and data scientists needed access to this new data store, they faced a big problem: most data people simply weren’t interested in learning Java and low-level APIs just to access their data. They wanted to use the tools they were familiar with, above all SQL. This need for SQL access to Hadoop data is what led Facebook, in 2008, to develop a SQL abstraction on top of MapReduce called Hive. In my opinion, this is when Hadoop really took off, because data scientists and data analysts now had the ability to access, process, and analyze the big data volumes that were stored in Hadoop.
So over the years, Hadoop has grown tremendously in popularity; in fact, it has become synonymous with big data. In my opinion, Hive is what really helped Hadoop grow, because now analysts could actually access, process, and analyze huge amounts of data stored in the platform. The big problem with Hive, though, was that it was still batch-oriented and really just a translator: you would write your query in Hive, and the engine would translate it into MapReduce jobs, which would then process the data. This was very useful, but it was also really, really slow. Today Hadoop is the most prevalent big data platform out there, yet for most companies the challenge remains that processing and analyzing the data takes a lot of time.

This is where the nature of Presto comes from. Many of the analytical database vendors out there use an architecture known as MPP, or Massively Parallel Processing. An MPP engine distributes parts of a query across different nodes in a cluster; each small piece of the query runs very quickly, and the engine then pulls the partial results back together in memory and returns them to the user, much faster than a single database on a single node, or a batch operation such as MapReduce. The problem with many of these MPP database vendors is that in order to gain that speed, you have to move your data from where it lives, which is often Hadoop, into their database system. So not only do you have to maintain the data flows and pipelines that capture the data and put it into Hadoop in the first place, you also have to maintain another set of pipelines to synchronize the data between Hadoop and this third-party system. At Facebook, they saw this problem, and they developed a solution that allows you to run these fast, MPP-style queries wherever the data lives. And this is what Presto gives you.
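To sketch what this looks like in practice: a Presto query over data that already sits in Hadoop is just standard SQL, addressed through a catalog. The catalog, schema, and table names below are made up for illustration; they assume a Presto cluster configured with a Hive connector.

```sql
-- Hypothetical example: querying data in place via Presto's Hive connector,
-- with no data movement into a separate MPP database.
-- "hive" is an assumed catalog name; "web.page_views" is an assumed table.
SELECT user_id, count(*) AS views
FROM hive.web.page_views
WHERE view_date >= DATE '2019-01-01'
GROUP BY user_id
ORDER BY views DESC
LIMIT 10;
```

Presto parses the query, splits the work across the nodes in the cluster, and assembles the result in memory, which is the MPP-style execution described above.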
Now, Presto is released under the Apache License. However, it is not an Apache Software Foundation top-level project. It may become one, though, as popularity seems to be growing around this idea of running fast queries on Hadoop data without having to migrate it to another system first.

There are four main reasons I think you would want to use Presto.

The first is speed, and I think this is where Presto really shines, because when you’re working with large data volumes, slow queries can really drag the analysis process down. Many times you just accept that interactive analysis isn’t going to be possible with such large volumes of data. However, this can cost your company a lot of money. If I’m a highly paid data scientist making over six figures per year, and you add up all the time I spend waiting for a query to finish so I can get the answer I’m looking for, it will cost the company a ton of money over the years that I work there. So speed is incredibly important, and it is often worth the cost to speed up those queries, especially if you are a data-centered company like a high-tech organization in Silicon Valley.

Another important factor when thinking about why you would use Presto is the open-source nature of the platform itself. Big data really works best when you have an open platform that a large community can get behind, support, and advance, without worrying about licensing restrictions or the proprietary nature of vendor software. Open source has won this battle, and Presto is one of those systems with a large community behind it, just like Hadoop, Hive, and many of the other big data platforms out there.

One of the features that is really interesting with Presto is the pluggable nature of the platform. Pluggable connectors allow you to extend Presto to query your data no matter where it lives. If your data is in a relational database such as MySQL or Oracle, you can still use Presto to query it, and in fact you can combine those data sources in a single query. With this pluggable architecture, Presto becomes a query abstraction layer, which lets you access all of your data sources regardless of the underlying data platform.
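As a sketch of what such a federated query might look like, here two catalogs are combined in a single statement. All catalog, schema, table, and column names are assumptions made for illustration; they presume a cluster with MySQL and Hive connectors already configured.

```sql
-- Hypothetical federated query: joining a table in MySQL with a table in
-- Hive in one Presto statement, without moving data between the systems.
-- "mysql.crm.customers" and "hive.sales.orders" are assumed names.
SELECT c.name, sum(o.total) AS lifetime_value
FROM mysql.crm.customers AS c
JOIN hive.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.name;
```

Each data source is exposed as a catalog, so the same `catalog.schema.table` naming works regardless of where the data actually lives.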
Now this is incredibly powerful: it will save your analysts tons of time and eliminate the need to move data between systems just to run your queries. The last reason I’ll mention to use Presto is its scalability. Because it’s a distributed engine, it allows you to scale as large as you need to. This is a very brief introduction to Presto, and we will talk about it more in the next blog.
