Query Optimization (WHERE, GROUP BY, LEFT JOINs)
I am using InnoDB.
QUERY, EXPLAIN & INDEXES
SELECT stories.*, count(comments.id) AS comments, GROUP_CONCAT( DISTINCT classifications2.name SEPARATOR ';' ) AS classifications_name, GROUP_CONCAT( DISTINCT images.id ORDER BY images.position, images.id SEPARATOR ';' ) AS images_id, GROUP_CONCAT( DISTINCT images.caption ORDER BY images.position, images.id SEPARATOR ';' ) AS images_caption, GROUP_CONCAT( DISTINCT images.thumbnail ORDER BY images.position, images.id SEPARATOR ';' ) AS images_thumbnail, GROUP_CONCAT( DISTINCT images.medium ORDER BY images.position, images.id SEPARATOR ';' ) AS images_medium, GROUP_CONCAT( DISTINCT images.large ORDER BY images.position, images.id SEPARATOR ';' ) AS images_large, GROUP_CONCAT( DISTINCT users.id ORDER BY users.id SEPARATOR ';' ) AS authors_id, GROUP_CONCAT( DISTINCT users.display_name ORDER BY users.id SEPARATOR ';' ) AS authors_display_name, GROUP_CONCAT( DISTINCT users.url ORDER BY users.id SEPARATOR ';' ) AS authors_url FROM stories LEFT JOIN classifications ON stories.id = classifications.story_id LEFT JOIN classifications AS classifications2 ON stories.id = classifications2.story_id LEFT JOIN comments ON stories.id = comments.story_id LEFT JOIN image_story ON stories.id = image_story.story_id LEFT JOIN images ON images.id = image_story.`image_id` LEFT JOIN author_story ON stories.id = author_story.story_id LEFT JOIN users ON users.id = author_story.author_id WHERE classifications.`name` LIKE 'Home:Top%' AND stories.status = 1 GROUP BY stories.id ORDER BY classifications.`name`, classifications.`position` +----+-------------+------------------+--------+---------------+----------+---------+------------------------+--------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------------+--------+---------------+----------+---------+------------------------+--------+----------------------------------------------+ | 1 | SIMPLE | stories | ref | status | status | 1 | const | 434792 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | classifications | ref | story_id | story_id | 4 | stories.id | 1 | Using where | | 1 | SIMPLE | classifications2 | ref | story_id | story_id | 4 | stories.id | 1 | Using where | | 1 | SIMPLE | comments | ref | story_id | story_id | 8 | stories.id | 6 | Using where; Using index | | 1 | SIMPLE | image_story | ref | story_id | story_id | 4 | stories.id | 1 | NULL | | 1 | SIMPLE | images | eq_ref | PRIMARY | PRIMARY | 4 | image_story.image_id | 1 | NULL | | 1 | SIMPLE | author_story | ref | story_id | story_id | 4 | stories.id | 1 | Using where | | 1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | author_story.author_id | 1 | Using where | +----+-------------+------------------+--------+---------------+----------+---------+------------------------+--------+----------------------------------------------+ +-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | +-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+ | stories | 0 | PRIMARY | 1 | id | A | 869584 | NULL | NULL | | BTREE | | stories | 1 | created_at | 1 | created_at | A | 434792 | NULL | NULL | | BTREE | | stories | 1 | source | 1 | source | A | 2 | NULL | NULL | YES | BTREE | | stories | 1 | source_id | 1 | source_id | A | 869584 | NULL | NULL | YES | BTREE | | stories | 1 | type | 1 | type | A | 2 | NULL | NULL | | BTREE | | stories | 1 | status | 1 | status | A | 2 | NULL | NULL | | BTREE | | stories | 1 | type_status | 1 | type | A | 2 | NULL | NULL | | BTREE | | stories | 1 | type_status | 2 | status | A | 2 | NULL | NULL | | BTREE | | classifications | 0 | PRIMARY | 1 | id | A | 207 | NULL | NULL | | BTREE | | classifications | 1 | story_id | 1 | story_id | A | 207 | NULL | NULL | | BTREE | | classifications | 1 | name | 1 | name | A | 103 | NULL | NULL | | BTREE | | classifications | 1 | name | 2 | position | A | 207 | NULL | NULL | YES | BTREE | | comments | 0 | PRIMARY | 1 | id | A | 239336 | NULL | NULL | | BTREE | | comments | 1 | status | 1 | status | A | 2 | NULL | NULL | | BTREE | | comments | 1 | date | 1 | date | A | 239336 | NULL | NULL | | BTREE | | comments | 1 | story_id | 1 | story_id | A | 39889 | NULL | NULL | | BTREE | +-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+
QUERY TIMES
It takes on average 0.035 seconds
to run.
If I remove only the GROUP BY
, the time drops to 0.007
on average.
If I remove only the stories.status=1
filter, the time drops to 0.025
on average. This one seems like it can be easily optimized.
And if I remove only the LIKE
filter and ORDER BY
clause, the time drops to 0.006
on average.
UPDATE 1: 2013-04-13
My understanding has improved manifold going through the answers.
I added indices to author_story
and images_story
which seemed improved query to 0.025
seconds but for some strange reason the EXPLAIN
plan looks a whole lot better. At this point removing ORDER BY
drops query to 0.015
seconds and dropping both ORDER BY
and GROUP BY
improves query performance to 0.006
. I am these are the two things to focus on right now? I may move ORDER BY
into app logic if needed.
Here are the revised EXPLAIN
and INDEXES
+----+-------------+------------------+--------+---------------------------------+----------+---------+--------------------------+------+--------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------------+--------+---------------------------------+----------+---------+--------------------------+------+--------------------------------------------------------+ | 1 | SIMPLE | classifications | range | story_id,name | name | 102 | NULL | 14 | Using index condition; Using temporary; Using filesort | | 1 | SIMPLE | stories | eq_ref | PRIMARY,status | PRIMARY | 4 | classifications.story_id | 1 | Using where | | 1 | SIMPLE | classifications2 | ref | story_id | story_id | 4 | stories.id | 1 | Using where | | 1 | SIMPLE | author_story | ref | author_id,story_id,author_story | story_id | 4 | stories.id | 1 | Using index condition | | 1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | author_story.author_id | 1 | Using where | | 1 | SIMPLE | comments | ref | story_id | story_id | 8 | stories.id | 8 | Using where; Using index | | 1 | SIMPLE | image_story | ref | story_id,story_id_2 | story_id | 4 | stories.id | 1 | NULL | | 1 | SIMPLE | images | eq_ref | PRIMARY,position_id | PRIMARY | 4 | image_story.image_id | 1 | NULL | +----+-------------+------------------+--------+---------------------------------+----------+---------+--------------------------+------+--------------------------------------------------------+ +-----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +-----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | author_story | 0 | PRIMARY | 1 | id | A | 220116 | NULL | NULL | | BTREE | | | | author_story | 0 | story_author | 1 | story_id | A | 220116 | NULL | NULL | | BTREE | | | | author_story | 0 | story_author | 2 | author_id | A | 220116 | NULL | NULL | | BTREE | | | | author_story | 1 | author_id | 1 | author_id | A | 2179 | NULL | NULL | | BTREE | | | | author_story | 1 | story_id | 1 | story_id | A | 220116 | NULL | NULL | | BTREE | | | | image_story | 0 | PRIMARY | 1 | id | A | 148902 | NULL | NULL | | BTREE | | | | image_story | 0 | story_image | 1 | story_id | A | 148902 | NULL | NULL | | BTREE | | | | image_story | 0 | story_image | 2 | image_id | A | 148902 | NULL | NULL | | BTREE | | | | image_story | 1 | story_id | 1 | story_id | A | 148902 | NULL | NULL | | BTREE | | | | image_story | 1 | image_id | 1 | image_id | A | 148902 | NULL | NULL | | BTREE | | | | classifications | 0 | PRIMARY | 1 | id | A | 257 | NULL | NULL | | BTREE | | | | classifications | 1 | story_id | 1 | story_id | A | 257 | NULL | NULL | | BTREE | | | | classifications | 1 | name | 1 | name | A | 128 | NULL | NULL | | BTREE | | | | classifications | 1 | name | 2 | position | A | 257 | NULL | NULL | YES | BTREE | | | | stories | 0 | PRIMARY | 1 | id | A | 962570 | NULL | NULL | | BTREE | | | | stories | 1 | created_at | 1 | created_at | A | 481285 | NULL | NULL | | BTREE | | | | stories | 1 | source | 1 | source | A | 4 | NULL | NULL | YES | BTREE | | | | stories | 1 | source_id | 1 | source_id | A | 962570 | NULL | NULL | YES | BTREE | | | | stories | 1 | type | 1 | type | A | 2 | NULL | NULL | | BTREE | | | | stories | 1 | status | 1 | status | A | 4 | NULL | NULL | | BTREE | | | | stories | 1 | type_status | 1 | type | A | 2 | NULL | NULL | | BTREE | | | | stories | 1 | type_status | 2 | status | A | 6 | NULL | NULL | | BTREE | | | | comments | 0 | PRIMARY | 1 | id | A | 232559 | NULL | NULL | | BTREE | | | | comments | 1 | status | 1 | status | A | 6 | NULL | NULL | | BTREE | | | | comments | 1 | date | 1 | date | A | 232559 | NULL | NULL | | BTREE | | | | comments | 1 | story_id | 1 | story_id | A | 29069 | NULL | NULL | | BTREE | | | | images | 0 | PRIMARY | 1 | id | A | 147206 | NULL | NULL | | BTREE | | | | images | 0 | source_id | 1 | source_id | A | 147206 | NULL | NULL | YES | BTREE | | | | images | 1 | position | 1 | position | A | 4 | NULL | NULL | | BTREE | | | | images | 1 | position_id | 1 | id | A | 147206 | NULL | NULL | | BTREE | | | | images | 1 | position_id | 2 | position | A | 147206 | NULL | NULL | | BTREE | | | | users | 0 | PRIMARY | 1 | id | A | 981 | NULL | NULL | | BTREE | | | | users | 0 | users_email_unique | 1 | email | A | 981 | NULL | NULL | | BTREE | | | +-----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ SELECT stories.*, count(comments.id) AS comments, GROUP_CONCAT(DISTINCT users.id ORDER BY users.id SEPARATOR ';') AS authors_id, GROUP_CONCAT(DISTINCT users.display_name ORDER BY users.id SEPARATOR ';') AS authors_display_name, GROUP_CONCAT(DISTINCT users.url ORDER BY users.id SEPARATOR ';') AS authors_url, GROUP_CONCAT(DISTINCT classifications2.name SEPARATOR ';') AS classifications_name, GROUP_CONCAT(DISTINCT images.id ORDER BY images.position,images.id SEPARATOR ';') AS images_id, GROUP_CONCAT(DISTINCT images.caption ORDER BY images.position,images.id SEPARATOR ';') AS images_caption, GROUP_CONCAT(DISTINCT images.thumbnail ORDER BY images.position,images.id SEPARATOR ';') AS images_thumbnail, GROUP_CONCAT(DISTINCT images.medium ORDER BY images.position,images.id SEPARATOR ';') AS images_medium, GROUP_CONCAT(DISTINCT images.large ORDER BY images.position,images.id SEPARATOR ';') AS images_large FROM classifications INNER JOIN stories ON stories.id = classifications.story_id LEFT JOIN classifications AS classifications2 ON stories.id = classifications2.story_id LEFT JOIN comments ON stories.id = comments.story_id LEFT JOIN image_story ON stories.id = image_story.story_id LEFT JOIN images ON images.id = image_story.`image_id` INNER JOIN author_story ON stories.id = author_story.story_id INNER JOIN users ON users.id = author_story.author_id WHERE classifications.`name` LIKE 'Home:Top%' AND stories.status = 1 GROUP BY stories.id ORDER BY NULL
UPDATE 2: 2013-04-14
I noticed one other thing. If I don't SELECT stories.content
(LONGTEXT) and stories.content_html
(LONGTEXT) the query drops from 0.015
seconds to 0.006
seconds. For now I am considering if I can do without content
and content_html
or replace them with something else.
I have updated the query, indexes and explain in the 2013-04-13 update above instead of re-posting in this one since they were minor and incremental. The query is still using filesort
. I can't get rid of GROUP BY
but have gotten rid of ORDER BY
.
UPDATE 3: 2013-04-16
As requested, I dropped the stories_id INDEXES from both image_story
and author_story
as they are redundant. The result was that output of explain only changed to show that the possible_keys
changed. It still didn't show Using Index
optimization unfortunately.
Also changed LONGTEXT
to TEXT
and am now fetching LEFT(stories.content, 500)
instead of stories.content
which is making a very significant difference in query execution time.
+----+-------------+------------------+--------+-----------------------------+--------------+---------+--------------------------+------+---------------------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------------+--------+-----------------------------+--------------+---------+--------------------------+------+---------------------------------------------------------------------+ | 1 | SIMPLE | classifications | ref | story_id,name,name_position | name | 102 | const | 10 | Using index condition; Using where; Using temporary; Using filesort | | 1 | SIMPLE | stories | eq_ref | PRIMARY,status | PRIMARY | 4 | classifications.story_id | 1 | Using where | | 1 | SIMPLE | classifications2 | ref | story_id | story_id | 4 | stories.id | 1 | Using where | | 1 | SIMPLE | author_story | ref | story_author | story_author | 4 | stories.id | 1 | Using where; Using index | | 1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | author_story.author_id | 1 | Using where | | 1 | SIMPLE | comments | ref | story_id | story_id | 8 | stories.id | 8 | Using where; Using index | | 1 | SIMPLE | image_story | ref | story_image | story_image | 4 | stories.id | 1 | Using index | | 1 | SIMPLE | images | eq_ref | PRIMARY,position_id | PRIMARY | 4 | image_story.image_id | 1 | NULL | +----+-------------+------------------+--------+-----------------------------+--------------+---------+--------------------------+------+---------------------------------------------------------------------+ innodb_buffer_pool_size 134217728 TABLE_NAME INDEX_LENGTH image_story 10010624 image_story 4556800 image_story 0 TABLE_NAME INDEX_NAMES SIZE dawn/image_story story_image 13921
I can see two opportunities for optimization right away:
Change an OUTER JOIN to INNER JOIN
Your query is currently scanning 434792 stories, and you should be able to narrow that down better, assuming not every story has a classification matching 'Home:Top%'. It would be better to use an index to find the classifications you're looking for, and then look up the matching stories.
But you're using LEFT OUTER JOIN
for classifications, meaning all stories will be scanned whether they have a matching classification or not. Then you're defeating that by putting a condition on classifications in the WHERE
clause, effectively making it mandatory that there be a classification matching your pattern with LIKE
. So it's no longer an outer join, it's an inner join.
If you put the classifications table first, and make it an inner join, the optimizer will use that to narrow down the search for stories just to those that have a matching classification.
. . .
FROM
classifications
INNER JOIN stories
ON stories.id = classifications.story_id
. . .
The optimizer is supposed to be able to figure out when it's advantageous to re-order tables, so you may not have to change the order in your query. But you do need to use an INNER JOIN
in this case.
Add compound indexes
Your intersection tables image_story and author_story don't have compound indexes. It's often a big advantage to add compound indexes to the intersection tables in a many-to-many relationship, so that they can perform the join and get the "Using index" optimization.
ALTER TABLE image_story ADD UNIQUE KEY imst_st_im (story_id, image_id);
ALTER TABLE author_story ADD UNIQUE KEY aust_st_au (story_id, author_id);
Re your comments and update:
I'm not sure you created the new indexes correctly. Your dump of the indexes doesn't show the columns, and according to the updated EXPLAIN, the new indexes aren't being used, which I would expect to happen. Using the new indexes would result in "Using index" in the extra field of EXPLAIN, which should help performance.
Output of SHOW CREATE TABLE
for each table would be more complete information than a dump of the indexes (without column names) as you have shown.
You may have to run ANALYZE TABLE once on each of those tables after creating the indexes. Also, run the query more than once, to make sure the indexes are in the buffer pool. Is this table InnoDB or MyISAM?
I also notice in your EXPLAIN output that the rows
column shows a lot fewer rows being touched. That's an improvement.
Do you really need the ORDER BY
? If you use ORDER BY NULL
you should be able to get rid of the "Using filesort" and that may improve performance.
Re your update:
You still aren't getting the "Using index" optimization from your image_story
and author_story
tables. One suggestion I'd have is to eliminate the redundant indexes:
ALTER TABLE image_story DROP KEY story_id;
ALTER TABLE author_story DROP KEY story_id;
The reason is that any query that could benefit from the single-column index on story_id can also benefit from the two-column index on (story_id,image_id). Eliminating the redundant index helps the optimizer make a better decision (as well as saving some space). This is the theory behind a tool like pt-duplicate-key-checker.
I'd also check to make sure your buffer pool is large enough to hold your indexes. You don't want indexes to be paging in and out of the buffer pool during a query.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size'
Check the size of indexes for your image_story table:
SELECT TABLE_NAME, INDEX_LENGTH FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'image_story';
And compare that to how much of those indexes are currently residing in the buffer pool:
SELECT TABLE_NAME, GROUP_CONCAT(DISTINCT INDEX_NAME) AS INDEX_NAMES, SUM(DATA_SIZE) AS SIZE
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE_LRU
WHERE TABLE_NAME = '`test`.`image_story`' AND INDEX_NAME <> 'PRIMARY'
Of course, change `test` above to the database name your table belongs to.
That information_schema table is new in MySQL 5.6. I assume you're using MySQL 5.6 because your EXPLAIN shows "Using index condition" which is also new in MySQL 5.6.
I don't use LONGTEXT at all unless I really need to story very long strings. Keep in mind:
As you are using MYSQL you can take advantage of Straight_join
STRAIGHT_JOIN forces the optimizer to join the tables in the order in which they are listed in the FROM clause. You can use this to speed up a query if the optimizer joins the tables in nonoptimal order
Also one scope of improvement is in filtering the data of table stories
as you only need data having status 1
So in the form clause instead of adding the whole stories
table add the only the needed records as your query plan shows that there are 434792
rows and same for the classification
table
FROM
(SELECT
*
FROM
STORIES
WHERE
STORIES.status = 1) stories
LEFT JOIN
(SELECT
*
FROM
classifications
WHERE
classifications.`name` LIKE 'Home:Top%') classifications
ON stories.id = classifications.story_id
Also one more suggestion you can increase sort_buffer_size
as you are shown as order by
and group by
, but be careful increasing your buffer size as the size of the buffer increases for each session.
Also if it is possible you can order your records in your application if possible as you have mentioned removing the order by
clause improves the takes only 1/6
part of the original time...
EDIT
Add indexes to image_story.image_id
for image_story
table and author_story.story_id
for author_story
table as these columns are used for join
Also index on images.position, images.id
has to be created as you are using it.
EDIT 16/4
I think so you almost optimized your query seeing you update...
Still one place you can improve is using appropriate data type as BillKarwin has mentioned... You can use ENUM
or TINYINT
type for columns like status and other which don'y have any scope of growth, it will help you to optimize your query performance and also storage performance of your table....
Hope this helps....
Computing
GROUP_CONCAT(DISTINCT classifications2.name SEPARATOR ';')
is probably the most time-consuming operation because classifications
is a big table and the number of rows to work with is multiplied because of all the joins.
So I would recommend using a temporary table for that information. Also, to avoid computing the LIKE condition twice (once for the temporary table and once for the "real" query), I would also create a temporary table for that.
Your original query, in a very simplified version (without the images and users table so that it's easier to read) is:
SELECT
stories.*,
count(DISTINCT comments.id) AS comments,
GROUP_CONCAT(DISTINCT classifications2.name ORDER BY 1 SEPARATOR ';' )
AS classifications_name
FROM
stories
LEFT JOIN classifications
ON stories.id = classifications.story_id
LEFT JOIN classifications AS classifications2
ON stories.id = classifications2.story_id
LEFT JOIN comments
ON stories.id = comments.story_id
WHERE
classifications.`name` LIKE 'Home:Top%'
AND stories.status = 1
GROUP BY stories.id
ORDER BY stories.id, classifications.`name`, classifications.`positions`;
I would replace it with the following queries, with temporary tables _tmp_filtered_classifications
(the ids of classifications with name LIKE Home:Top%') and _tmp_classifications_of_story
(for each story id 'contained' in _tmp_filtered_classifications
, all classification names):
DROP TABLE IF EXISTS `_tmp_filtered_classifications`;
CREATE TEMPORARY TABLE _tmp_filtered_classifications
SELECT id FROM classifications WHERE name LIKE 'Home:Top%';
DROP TABLE IF EXISTS `_tmp_classifications_of_story`;
CREATE TEMPORARY TABLE _tmp_classifications_of_story ENGINE=MEMORY
SELECT stories.id AS story_id, classifications2.name
FROM
_tmp_filtered_classifications
INNER JOIN classifications
ON _tmp_filtered_classifications.id=classifications.id
INNER JOIN stories
ON stories.id = classifications.story_id
LEFT JOIN classifications AS classifications2
ON stories.id = classifications2.story_id
GROUP BY 1,2;
SELECT
stories.*,
count(DISTINCT comments.id) AS comments,
GROUP_CONCAT(DISTINCT classifications2.name ORDER BY 1 SEPARATOR ';')
AS classifications_name
FROM
_tmp_filtered_classifications
INNER JOIN classifications
ON _tmp_filtered_classifications.id=classifications.id
INNER JOIN stories
ON stories.id = classifications.story_id
LEFT JOIN _tmp_classifications_of_story AS classifications2
ON stories.id = classifications2.story_id
LEFT JOIN comments
ON stories.id = comments.story_id
WHERE
stories.status = 1
GROUP BY stories.id
ORDER BY stories.id, classifications.`name`, classifications.`positions`;
Note that I added some more "order by" clauses to your query in order to check that both queries provide the same results (using diff). I also changed count(comments.id)
to count(DISTINCT comments.id)
otherwise the number of comments the query computes is wrong (again, because of the joins that multiply the number of rows).
上一篇: 在相同的查询中选择并插入查询记录