Select first row in each GROUP BY group?

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;

My Output:

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected Output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3

On Oracle 9.2+, SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1
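
If you want to keep all rows that tie for the top total per customer, a sketch swapping ROW_NUMBER() for the standard RANK() (tied rows share rank 1; same query shape otherwise):

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           RANK() OVER(PARTITION BY p.customer 
                           ORDER BY p.total DESC) AS rk  -- ties all get rk = 1
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1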

Supported by any database, but you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total
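
Run against the sample purchases table above (which has no ties), this returns:

 min | customer | total
-----+----------+------
   1 | Joe      | 5
   2 | Sally    | 3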

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL, add NULLS LAST (it won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

  • DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

  • List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

    Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

  • DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY have to match leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as the last item to break ties:

    "Pick the row with the smallest id from each group sharing the highest total ."

    If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated. Details:

  • PostgreSQL sort by datetime asc, null first?
  • The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

  • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

  • You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
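
    For illustration, a sketch adding a derived column that appears in neither DISTINCT ON nor ORDER BY (the expression itself is a made-up example):

    SELECT DISTINCT ON (customer)
           id, customer, total
         , total * 100 AS total_in_cent  -- any extra expression is allowed here
    FROM   purchases
    ORDER  BY customer, total DESC, id;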

  • I tested with versions 8.3 – 10. But the feature has been there at least since version 7.1, so basically always.

  • Index

    The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

    CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
    

    It may be too specialized for real-world applications. But use it if read performance is crucial. If you have DESC NULLS LAST in the query, use the same in the index so Postgres knows the sort orders match:
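
    CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC NULLS LAST, id);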

    Effectiveness / Performance optimization

    You have to weigh cost and benefit before creating a tailored index for every query. The potential of the above index largely depends on data distribution.

    The index is used because it delivers pre-sorted data, and in Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

  • For few rows per customer, this is very efficient (even more so if you need sorted output anyway). The benefit shrinks with a growing number of rows per customer.
    Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally, setting work_mem too high can have adverse effects. Consider SET LOCAL for singular queries on big sets. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more (see the sketch after the links below):

  • Configuration parameter work_mem in PostgreSQL on Linux
  • Optimize simple query using ORDER BY date and text
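
    A minimal sketch of that routine (the 64MB figure is an assumption; size it to the sort reported by EXPLAIN ANALYZE):

    BEGIN;
    SET LOCAL work_mem = '64MB';  -- effective for this transaction only
    EXPLAIN ANALYZE
    SELECT DISTINCT ON (customer)
           id, customer, total
    FROM   purchases
    ORDER  BY customer, total DESC, id;
    ROLLBACK;  -- SET LOCAL ends with the transaction either way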
  • For many rows per customer, a loose index scan would be (much) more efficient, but that's not currently implemented in Postgres (up to v10).
    There are faster query techniques to substitute for this, in particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

  • Optimize GROUP BY query to retrieve latest record per user
  • Optimize groupwise maximum query
  • Query last N related rows per row
  • Benchmark

    I had a simple benchmark here for Postgres 9.1, which was outdated by 2016. So I ran a new one with a better, reproducible setup for Postgres 9.4 and 9.5 and added the detailed results in another answer.


    Benchmark

    Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

    For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

    Setup

    Main table

    CREATE TABLE purchases (
      id          serial
    , customer_id int  -- REFERENCES customer
    , total       int  -- could be amount of money in Cent
    , some_column text -- to make the row bigger, more realistic
    );
    

    I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. I also added some_column to stand in for the additional columns a typical table has.

    Dummy data, PK, index - a typical table also has some dead tuples:

    INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
    SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
         , (random() * random() * 100000)::int AS total     
         , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
    FROM   generate_series(1,200000) g;
    
    ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);
    
    DELETE FROM purchases WHERE random() > 0.9; -- some dead rows
    
    INSERT INTO purchases (customer_id, total, some_column)
    SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
         , (random() * random() * 100000)::int AS total     
         , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
    FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k
    
    CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);
    
    VACUUM ANALYZE purchases;
    

    customer table - for superior query

    CREATE TABLE customer AS
    SELECT customer_id, 'customer_' || customer_id AS customer
    FROM   purchases
    GROUP  BY 1
    ORDER  BY 1;
    
    ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);
    
    VACUUM ANALYZE customer;
    

    In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id, to get only a few rows per customer_id.

    Object sizes for table purchases

    Generated with this query.

                   what                | bytes/ct | bytes_pretty | bytes_per_row
    -----------------------------------+----------+--------------+---------------
     core_relation_size                | 20496384 | 20 MB        |           102
     visibility_map                    |        0 | 0 bytes      |             0
     free_space_map                    |    24576 | 24 kB        |             0
     table_size_incl_toast             | 20529152 | 20 MB        |           102
     indexes_size                      | 10977280 | 10 MB        |            54
     total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
     live_rows_in_text_representation  | 13729802 | 13 MB        |            68
     ------------------------------    |          |              |
     row_count                         |   200045 |              |
     live_tuples                       |   200045 |              |
     dead_tuples                       |    19955 |              |
    

    Queries

    1. row_number() in CTE (see other answer)

    WITH cte AS (
       SELECT id, customer_id, total
            , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
       FROM   purchases
       )
    SELECT id, customer_id, total
    FROM   cte
    WHERE  rn = 1;
    

    2. row_number() in subquery (my optimization; the plain subquery avoids the CTE optimization fence in these Postgres versions)

    SELECT id, customer_id, total
    FROM   (
       SELECT id, customer_id, total
            , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
       FROM   purchases
       ) sub
    WHERE  rn = 1;
    

    3. DISTINCT ON (see other answer)

    SELECT DISTINCT ON (customer_id)
           id, customer_id, total
    FROM   purchases
    ORDER  BY customer_id, total DESC, id;
    

    4. rCTE with LATERAL subquery (see here)

    WITH RECURSIVE cte AS (
       (  -- parentheses required
       SELECT id, customer_id, total
       FROM   purchases
       ORDER  BY customer_id, total DESC
       LIMIT  1
       )
       UNION ALL
       SELECT u.*
       FROM   cte c
       ,      LATERAL (
          SELECT id, customer_id, total
          FROM   purchases
          WHERE  customer_id > c.customer_id  -- lateral reference
          ORDER  BY customer_id, total DESC
          LIMIT  1
          ) u
       )
    SELECT id, customer_id, total
    FROM   cte
    ORDER  BY customer_id;
    

    5. customer table with LATERAL (see here)

    SELECT l.*
    FROM   customer c
    ,      LATERAL (
       SELECT id, customer_id, total
       FROM   purchases
       WHERE  customer_id = c.customer_id  -- lateral reference
       ORDER  BY total DESC
       LIMIT  1
       ) l;
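
    If customers without any purchases should stay in the result, a variant with LEFT JOIN LATERAL ... ON true keeps them (a sketch, not part of the benchmark):

    SELECT c.customer_id, l.id, l.total
    FROM   customer c
    LEFT   JOIN LATERAL (
       SELECT id, total
       FROM   purchases
       WHERE  customer_id = c.customer_id
       ORDER  BY total DESC
       LIMIT  1
       ) l ON true;  -- customers without purchases get NULLs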
    

    6. array_agg() with ORDER BY (see other answer)

    SELECT (array_agg(id ORDER BY total DESC))[1] AS id
         , customer_id
         , max(total) AS total
    FROM   purchases
    GROUP  BY customer_id;
    

    Results

    Execution times for the above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

    All queries used an Index Only Scan on purchases_3c_idx (among other steps). Some of them just benefit from the smaller size of the index, others use it more effectively.

    A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

    1. 273.274 ms  
    2. 194.572 ms  
    3. 111.067 ms  
    4.  92.922 ms  
    5.  37.679 ms  -- winner
    6. 189.495 ms
    

    B. The same with Postgres 9.5

    1. 288.006 ms
    2. 223.032 ms  
    3. 107.074 ms  
    4.  78.032 ms  
    5.  33.944 ms  -- winner
    6. 211.540 ms  
    

    C. Same as B., but with ~ 2.3 rows per customer_id

    1. 381.573 ms
    2. 311.976 ms
    3. 124.074 ms  -- winner
    4. 710.631 ms
    5. 311.976 ms
    6. 421.679 ms
    

    Original (outdated) benchmark from 2011

    I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows and single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs.
    Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  • Select the whole table; results in 5958 rows in this case.

    A: 567.218 ms
    B: 386.673 ms
    
  • Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.

    A: 249.136 ms
    B:  55.111 ms
    
  • Select a single customer with WHERE customer = x.

    A:   0.143 ms
    B:   0.072 ms
    
  • Same test repeated with the index described in the other answer:

    CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
    

    1A: 277.953 ms  
    1B: 193.547 ms
    
    2A: 249.796 ms -- special index not used  
    2B:  28.679 ms
    
    3A:   0.120 ms  
    3B:   0.048 ms
    