Retrieving the last record in each group
There is a table messages
that contains data as shown below:
Id Name Other_Columns
-------------------------
1 A A_data_1
2 A A_data_2
3 A A_data_3
4 B B_data_1
5 B B_data_2
6 C C_data_1
If I run a query select * from messages group by name
, I will get the result as:
1 A A_data_1
4 B B_data_1
6 C C_data_1
What query will return the following result?
3 A A_data_3
5 B B_data_2
6 C C_data_1
That is, the last record in each group should be returned.
At present, this is the query that I use:
SELECT
*
FROM (SELECT
*
FROM messages
ORDER BY id DESC) AS x
GROUP BY name
But this looks highly inefficient. Any other ways to achieve the same result?
MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:
WITH ranked_messages AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
Below is the original answer I wrote for this question in 2009:
I write the solution this way:
SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.
For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts
table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.
I'll write a query to find the most recent post for a given user ID (mine).
First using the technique shown by @Eric with the GROUP BY
in a subquery:
SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
FROM Posts pi GROUP BY pi.owneruserid) p2
ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;
1 row in set (1 min 17.89 sec)
Even the EXPLAIN
analysis takes over 16 seconds:
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 76756 | |
| 1 | PRIMARY | p1 | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY | 8 | p2.maxpostid | 1 | Using where |
| 2 | DERIVED | pi | index | NULL | OwnerUserId | 8 | NULL | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)
Now produce the same query result using my technique with LEFT JOIN
:
SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;
1 row in set (0.28 sec)
The EXPLAIN
analysis shows that both tables are able to use their indexes:
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| 1 | SIMPLE | p1 | ref | OwnerUserId | OwnerUserId | 8 | const | 1384 | Using index |
| 1 | SIMPLE | p2 | ref | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8 | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
Here's the DDL for my Posts
table:
CREATE TABLE `posts` (
`PostId` bigint(20) unsigned NOT NULL auto_increment,
`PostTypeId` bigint(20) unsigned NOT NULL,
`AcceptedAnswerId` bigint(20) unsigned default NULL,
`ParentId` bigint(20) unsigned default NULL,
`CreationDate` datetime NOT NULL,
`Score` int(11) NOT NULL default '0',
`ViewCount` int(11) NOT NULL default '0',
`Body` text NOT NULL,
`OwnerUserId` bigint(20) unsigned NOT NULL,
`OwnerDisplayName` varchar(40) default NULL,
`LastEditorUserId` bigint(20) unsigned default NULL,
`LastEditDate` datetime default NULL,
`LastActivityDate` datetime default NULL,
`Title` varchar(250) NOT NULL default '',
`Tags` varchar(150) NOT NULL default '',
`AnswerCount` int(11) NOT NULL default '0',
`CommentCount` int(11) NOT NULL default '0',
`FavoriteCount` int(11) NOT NULL default '0',
`ClosedDate` datetime default NULL,
PRIMARY KEY (`PostId`),
UNIQUE KEY `PostId` (`PostId`),
KEY `PostTypeId` (`PostTypeId`),
KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
KEY `OwnerUserId` (`OwnerUserId`),
KEY `LastEditorUserId` (`LastEditorUserId`),
KEY `ParentId` (`ParentId`),
CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
UPD: 2017-03-31, the version 5.7.5 of MySQL made the ONLY_FULL_GROUP_BY switch enabled by default (hence, non-deterministic GROUP BY queries became disabled). Moreover, they updated the GROUP BY implementation and the solution might not work as expected anymore even with the disabled switch. One needs to check.
Bill Karwin's solution above works fine when item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2
of only IS NULL
comparisons.
I made my tests on a InnoDB table of 18684446
rows with 1182
groups. The table contains testresults for functional tests and has the (test_id, request_id)
as the primary key. Thus, test_id
is a group and I was searching for the last request_id
for each test_id
.
Bill's solution has already been running for several hours on my dell e4310 and I do not know when it is going to finish even though it operates on a coverage index (hence using index
in EXPLAIN).
I have a couple of other solutions that are based on the same ideas:
(group_id, item_value)
pair is the last value within each group_id
, that is the first for each group_id
if we walk through the index in descending order; 3 ways MySQL uses indexes is a great article to understand some details.
Solution 1
This one is incredibly fast, it takes about 0,8 secs on my 18M+ rows:
SELECT test_id, MAX(request_id), request_id
FROM testresults
GROUP BY test_id DESC;
If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns:
SELECT test_id, request_id
FROM (
SELECT test_id, MAX(request_id), request_id
FROM testresults
GROUP BY test_id DESC) as ids
ORDER BY test_id;
This one takes about 1,2 secs on my data.
Solution 2
Here is another solution that takes about 19 seconds for my table:
SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC
It returns tests in descending order as well. It is much slower since it does a full index scan but it is here to give you an idea how to output N max rows for each group.
The disadvantage of the query is that its result cannot be cached by the query cache.
Use your subquery to return the correct grouping, because you're halfway there.
Try this:
select
a.*
from
messages a
inner join
(select name, max(id) as maxid from messages group by name) as b on
a.id = b.maxid
If it's not id
you want the max of:
select
a.*
from
messages a
inner join
(select name, max(other_col) as other_col
from messages group by name) as b on
a.name = b.name
and a.other_col = b.other_col
This way, you avoid correlated subqueries and/or ordering in your subqueries, which tend to be very slow/inefficient.
链接地址: http://www.djcxy.com/p/73394.html上一篇: 在MySQL中创建新用户并授予其对一个数据库的完全访问权限
下一篇: 检索每个组中的最后一条记录