Window Functions

Abhijeet Kamble
4 min readNov 21, 2019

Sometimes when working with data, you’re going to need to perform calculations that change across the span of time or other discrete values. A good example is a rank. If you wanted to calculate the rank of something. And these are known as window functions in SQL terms. Again, not many or not all data platforms support these more advanced functions, but Presto does regardless of whether or not the underlying database does. So let’s dig in here and see some of these functions in action. First I want to find the highest ranked parts from my tiny database, the tpch benchmarking database. So I’ll start by saying presto — server localhost logging in. And I want to run a function against this, so let’s do select, and we’ll do part key, extended price, which is the price listed, rank, this is our function, this is our window function. And when we do this we have to say over. What are we ranking over? Now we could partition this let’s say by category or by year, but in this case I just want to find the rank over all of them. So I’ll just order by the extended price in a descending order, so the highest to lowest. We’ll call it as rnk for rank. We’ll call that from tpch.tiny.lineitem and we’ll order by rank. So there you have the part key, the extended price, and then the rank. And notice that we have a tie for our third one here, so we don’t have a fourth place. We have two that are tied at three, and then fifth. And in fact we have a tie there, a tie at seven. A lot of these make sense because they’re all similarly priced items. So this is how you can calculate a rank using the over function in a window function built into Presto. Another thing you might want to do where you’re calculating something which changes over time is a moving average. And this is all going to lead up to something called Bollinger Bands, which I’ll show you how we can graph that and what that looks like at the end. So first we need to create an aggregation table. Now this performs the rolling average, and then we’ll reference that additionally after we create it. So the way we do that is using something called with. So we start out with sales weekly, that’s the name of our table. We’re going to say as select date_trunc, which is a date function we reviewed already, week, and orderdate. We’ll call that wk for week. So that’s going to give me the rolled up date to the week level. Then we’ll do the sum of totalprice as total. This is coming from tpch.tiny.orders, and we’re grouping by the date_trunk week of orderdate. So, this is our with table, this is our temporary table, essentially, that we’re creating. Now we’re going to use a function and join to that. So, we’re going to say select wk avg(total) over, this is our rolling function, order by wk, so we’re ordering sequentially here by time, and we’re going to say rows between, so we’re getting it a range, so before, we were just ranking over all, now we’re going to create an average with just a specific period in mind. So it’s going to be rows between five preceding and current row. So what that’s going to do, that’s going to give me, essentially, a six-week rolling average that’s going to calculate this average over, and it’s ordering by week, chronologically there, the five preceding rows and the current row. So, the first five are just going to be leading up to where it actually becomes moving or rolling. So, here I say, from sales_weekly and we order by wk. So, it starts out with our first week and our second week and all that, where it’s just kind of starting to build the aggregate, and then after six weeks, we’re actually into the rolling average. So this is really key in something that you’re probably going to get asked to do a lot if you’re working with sales data. Hit “q” to exit there, and another question you may come into is a running total. So I want to find the total for every customer over time. And this is something that is really common, so to do that, let’s start out by saying select, and we’ll give it the key that we want to use for the customer, so custkey is the field, and we’ll do it by order date. Now, on a new line, I’m going to say sum of totalprice, this is going to be the value that I’m actually doing a running total of, and again, I use the over clause, and I need to partition this by customer key. What that does is that means that every time I have a new customer key, I’m going to start my running total, my sum of totalprice, over again. That way, the running total doesn’t span multiple customers, which isn’t what I want in this case. I also need to tell it what to order by, so order by orderdate as running_total from tpch.tiny.orders, and order by custkey and orderdate. So now, what you have is the customer key, the first order that they ever placed with us, in fact, every order, the date there, and a rolling total. So these numbers will only get bigger. And then, as you go down, you’re going to find the next customer key, and when that happens, you’re just going to have that new totalprice, the new order amount, instead of the rolling total. So you can see every time we have a new customer key, it sort of resets back to the first amount that that new customer had with our first order.

--

--