Pages

Tuesday, November 30, 2010

The Really Cool NTILE() Window Function

If you regularly code queries and have never been introduced to the windowing functions, then you are in for a treat. I've been meaning to write about these for over a year, and now it's time to get down to it.

Support in Major Servers

SQL Server calls these functions Ranking Functions.

PostgreSQL supports a wider range of functions than MS SQL Server, having put them in at 8.4, and PostgreSQL and calls them Window Functions.

Oracle's support is broader (by a reading of the docs) than SQL Server or PostgreSQL, and they call them Analytic Functions.

I try to stay away from MySQL, but I did a quick Google on all three terms and came up with a few forum posts asking when and if they will be supported.

The NTILE() Function

In this post we are going to look at NTILE, a cool function that allows you to segment query results into groups and put numbers onto them. The name is easy to remember because it can create any -tile, a percentile, a decile, or anything else. In short, an n-tile. But it is much easier to understand with an example, so let's go right to it.

Finding percentiles

Consider a table of completed sales, perhaps on an eCommerce site. The Sales Manager would like them divided up into quartiles, four equally divided groups, and she wants the average and maximum sale in each quartile. Let's say the company is not exactly hopping, and there are only twelve sales, which is good because we can list them all for the example. If we already had the quartiles provided then the query would be easy, so if we were lucky enough to be starting with this:

 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 DEALER   |  500.00 |   3
 RETAIL   |  738.00 |   3
 DEALER   | 1250.00 |   3
 RETAIL   | 2029.00 |   4
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

The query would be child's play if we already had the quartile:

Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM ORDERS
 GROUP BY quartile
 ORDER BY quartile

The Problem is We Do Not Have Quartile

The problem of course is that we do not usually have handy columns like QUARTILE provided, but we can generate the QUARTILE column during the query by using NTILE.

Select quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY quartile
 ORDER BY quartile

This query will give us what the Sales Manager wants.

Dissecting the Function and The OVER Clause

The NTILE() function takes a single argument, which tells the server how many groups to divide the data into. If there are not an exact number of rows in each group, the server decides which groups will be missing one row. So in an exact case all of your groups have the same count of rows, but when it does not divide evenly, one or more of them will be one row short.

If you pass 100 to NTILE(), you get a percentile. If you pass 10, you get a decile, and so forth.

The magic is in the OVER() function. This supports two clauses, and the example shows one, the ORDER BY. Quite simply, the ORDER BY clause tells the server how to line up the rows when adding the NTILE values. The clause is very flexible, and has nothing to do with your query's overall ORDER BY clause.

The Second Clause: PARTITION

Now we will pretend the Sales Manager is not satisfied, and wants separate numbers for the two Customer Types. We could do this if the NTILE() function would create two sets of quartiles, one for each Customer Type, like so:

 CUSTTYPE | AMOUNT  | QUARTILE
----------+---------+----------
 DEALER   |  249.00 |   1
 DEALER   |  278.00 |   2
 DEALER   |  500.00 |   3
 DEALER   | 1250.00 |   4
 RETAIL   |   78.00 |   1
 RETAIL   |  234.00 |   1
 RETAIL   |  392.00 |   2
 RETAIL   |  498.00 |   2
 RETAIL   |  738.00 |   3
 RETAIL   | 2029.00 |   3
 RETAIL   | 2393.00 |   4
 RETAIL   | 3933.00 |   4

We can do this by using the PARTITION BY clause, which tells the server to break the rows into groups and apply the NTILE() numbering separately within each group. The new query would be this:

Select custtype
     , quartile
     , avg(amount) as avgAmount
     , max(amount) as maxAmount
  FROM (
        -- The subquery is necessary
        -- to process all rows and add the quartile column
        SELECT amount
             , ntile(4) over (partition by custtype
                                 order by amount) as quartile
          FROM ORDERS
       ) x
 GROUP BY custtype,quartile
 ORDER BY custtype,quartile

Bonus Points: The Median

Now once again the Sales Manager, who is never satisified, comes down and says that the average is no good, she needs the max and the median sale value within each quartile. To keep it simple, she does not need this broken out by customer type, it can be applied to the entire set.

This is a case where we can use NTILE() twice. The first time we will break all sales up into four groups, to get the quartiles, and then we will break up each quartile into two groups to get the median. The code looks like this:

Select quartile
     , max(case when bitile=1 then amount else 0 end) as medAmount
     , max(amount) as maxAmount
  FROM (
        -- The second pass adds the
        -- 2-tile value we will use to find medians
        SELECT quartile
             , amount
             , ntile(2) over (partition by quartile
                                  order by amount) as bitile
          FROM (
                -- The subquery is necessary
                -- to process all rows and add the quartile column
                SELECT amount
                     , ntile(4) over (order by amount) as quartile
                  FROM ORDERS
               ) x1
       ) x2
 GROUP BY quartile
 ORDER BY quartile

The magic here is that we know we've divided the data evenly into four sets, so the median will be the maximum value half way through each set. In other words, it will be the maximum value when the value of bitile=1 for each quartile.

One More Note About Oracle

Once you get down the basics of the OVER clause, Oracle looks really good, because they support the clause over the largest range of functions, at least going by the respective doc pages for each platform.

24 comments:

  1. Ken,

    Nice post. I am an Oracle developer and frequently make use of analytic functions. They reduce a great deal of work and make writing complex queries easy.

    One minor correction : I think you need to add the " as quartile " alias to all the subqueries after the "ntile(4) over " without which it would raise an error (at least in oracle).

    Thanks!
    Rajesh
    www.etl-developer.com

    ReplyDelete
  2. Rajesh, thanks, I've corrected it. I've also realized a mistake in my median query, but that will have to wait until the evening to correct.

    ReplyDelete
  3. I found this blog doing a search for NTILE() function. It's a very interesting blog, but I think the NTILE() can not completely simulate PERCENTILE as TOP N PERCENT.
    In case the number of records is less than number in the NTILE function, we will not create 100 groups.

    Do you know about this limitation?

    ReplyDelete
  4. Naomi: You are correct, if the count of rows in a partition is less than 100, the highest value of NTILE() will be the row count in that partition.

    This is similar to the standard deviation, which returns null on only one row. The limitation is defined by the function itself.

    ReplyDelete
  5. There is also a MEDIAN function, which makes the last query something like :
    select quartile
    , median (amount)
    , max (amount)
    from (
    select amount
    , ntile (4) over (order by amount) quartile
    from orders)
    group by quartile

    ReplyDelete
  6. I just tried this out and found that ntile(100) does not, in fact, give you a percentile if you have fewer than 100 samples. It just ranks however many samples you have. In my case, I have twenty samples and expected the top ranked out of 20 to be the 95th percentile. Instead it was 20th, exactly the same answer as if I used ntile(20).

    ReplyDelete
  7. After a bit of fumbling I arrived at the following:

    100*cast(ntile(20) over(order by score) as numeric)/20 as percentile

    The only trouble is that it doesn't look like you can substitute select count(distinct ...) for a hard coded number. That's a problem since the code may not work right after an update to the data.

    ReplyDelete
  8. Hi Ken,

    Nice post, well explained. I really started loving those window functions, also for calculating running totals in an accounting/banking application, where the totals have to be calculated on the fly over a set of known transactions. I have posted an article about that on dzone:

    http://java.dzone.com/articles/sql2003-window-functions-jooq

    But I wasn't aware of the power of the NTILE() function. I'm certainly going to use that more often!

    Cheers
    Lukas

    ReplyDelete
  9. I soooo struggled with this and your post saved me.... thanks

    ReplyDelete
  10. This is exactly the format I was looking for in order to understand all this. Thanks!

    ReplyDelete
  11. This is a good post, but it should be called out that NTILE groups are *not* percentiles in the true sense for how most think of it. What you should really consider is the PERCENT_RANK() window function.

    This article does a good job explaining this.

    ReplyDelete
  12. Contact to Postgres SQL Support for Windows if PostgreSQL 9.5 Cannot Start Perfectly
    Generally new Postgres SQL customers need to defy this issue and they are not prepared to start their PostgreSQL 9.5 adequately. This is exceptional concern and it requires heaps of particular aptitudes to overcome of this issue. Everything considered, here we indicate few of the way to deal with this issue like: deleting your foundation index, changing PREFIX to the correct region, recompiling and reinstalling Postgres. However you are going up against the same the issue by then contacts to Cognegic's PostgreSQL Relational Database Service or Postgres SQL Support for Linux.
    For More Info: https://cognegicsystems.com/
    Contact Number: 1-800-450-8670
    Email Address- info@cognegicsystems.com
    Company’s Address- 507 Copper Square Drive Bethel Connecticut (USA) 06801

    ReplyDelete
  13. Thank you so much for the Latest Web Hosting Coupons i.e FastComet Coupon Code, WPX Hosting Coupon Code and all others you'll find there.
    You can find latest Popular FastComet Coupon Codes here as well.

    ReplyDelete
  14. You can learn more about Web Hosting Coupons right over Tumblr.com

    ReplyDelete
  15. You can also learn about MilesWeb Hosting which is the best hosting.

    ReplyDelete
  16. I am genuinely thankful to the holder of this web page who has shared this wonderful paragraph at at this place. ExcelR Data Analyst Course

    ReplyDelete
  17. Start Enrolling Today and avail our extremely sophisticated career path of Python Training in Hyderabad by AI Patasala.
    AI Patasala Python Courses

    ReplyDelete
  18. data analytics courses malaysiaMarch 9, 2022 at 12:37 AM

    It's late discovering this demonstration. At any rate, it's a thing to be acquainted with that there are such occasions exist. I concur with your Blog and I will have returned to investigate it more later on so please keep up your demonstration.

    ReplyDelete
  19. 360DigiTMG, the top-rated organisation among the most prestigious industries around the world, is an educational destination for those looking to pursue their dreams around the globe. The company is changing careers of many people through constant improvement, 360DigiTMG provides an outstanding learning experience and distinguishes itself from the pack. 360DigiTMG is a prominent global presence by offering world-class training. Its main office is in India and subsidiaries across Malaysia, USA, East Asia, Australia, Uk, Netherlands, and the Middle East.

    ReplyDelete
  20. Most people know that getting a good night’s sleep is important, but what many don’t realize is how interconnected sleep and stress are. Believe it or not, how well you sleep can actually affect your stress levels, and vice versa. In this article, we’ll explore the relationship between sleep and stress, and we’ll give you some tips on how to get a good night’s sleep and reduce your stress levels. One of the key ways that sleep affects stress is through cortisol levels. Cortisol is a hormone that’s released in response to stress, and high levels of cortisol can interfere with your ability to get a good night’s sleep. In fact, one study found that people who had higher cortisol levels took longer to fall asleep and spent less time in deep sleep than those with lower cortisol levels. In this blog post, we will learn everything we need to know about sleep and stress.

    Read more here

    ReplyDelete
  21. Junior developers generally have as long as two years of involvement with the field. They have solid coding and testing abilities and practice with different arrangements. A Lesser Java designer independent should be able in object-situated programming, something like one working framework, web administrations, CSS/HTML, SQL, JVM, memory the executives, and data sets. Information on other programming dialects is consistently an advantage (C++, JavaScript, or PHP). Such experts work fundamentally under management on unambiguous errands of the task>>> part time java developer remote

    ReplyDelete
  22. I like how frequently you update your website with informative articles and other content. Keep up the wonderful work!. how to connect generator to house without transfer switch

    ReplyDelete
  23. At Attract Group, their expertise in audio-video streaming app development services shines through in their comprehensive approach to creating user-friendly and cutting-edge applications. They understand the importance of seamless streaming, high-quality audio, and video, and innovative features to engage users. Their commitment to meeting specific industry needs and providing personalized solutions sets them apart in the competitive app development market. Trust Attract Group to bring your streaming app vision to life with their skills and experience.

    ReplyDelete