EXPLAIN EXTENDED

How to create fast database queries

OLAP in MySQL: four ways to filter on higher level dimensions

with 6 comments

Answering questions asked on the site.

Denis Kuzmenok asks:

I need some help with a query I'm fighting with.

I have these tables:

Product
id parent name
Site
id region name
Price
id product site value

For the products with a certain product.parent, I need to select minimal and maximal price within the given site.region, and return first 10 products ordered by the minimal price.

Each parent has 100 to 10,000 products, and there are 10,000,000 records in price table.

We have a classical OLAP task here: a fact table (price) and two dimension tables (product and site).

The task would be almost trivial if we were given the exact values of product and site. This way, we could build a composite index on these fields and value, filter on the exact values and get the first 10 values from the index.

However, we are given not the values of the dimensions themselves but those of the higher levels (parent for product and region for site).

Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.

This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.

Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.

There are three tables and, hence, 3! = 6 permutations that define all possible join orders:

product site price
product price site
site price product
site product price
price product site
price site product

However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:

product/site price
product price site
site price product
price product/site

The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.

Now, let's build the queries for all types of the join. To do this, we will create sample tables:

Table creation details

Members of the dimensions are distributed logarithmically, reducing from the lower to the higher values.

Say, 11,379 products of 250,000 (4.55% of all products) belong to the parent 1, while only 2,906 products (1.16%) belong to the parent 30:

parent total percent
1 11379 4.55
2 10724 4.29
3 10259 4.10
4 9830 3.93
5 9502 3.80
6 8880 3.55
7 8555 3.42
8 8072 3.23
9 7715 3.09
10 7522 3.01
252 1 0.00
253 1 0.00
263 1 0.00
309 1 0.00
250000 100.00

Same distribution applies to the sites:

region total percent
1 10446 20.89
2 8010 16.02
3 6602 13.20
4 5119 10.24
5 4096 8.19
6 3296 6.59
7 2604 5.21
8 2006 4.01
9 1614 3.23
10 1243 2.49
43 1 0.00
45 1 0.00
46 1 0.00
50 1 0.00
50000 100.00
47 rows fetched in 0.0014s (0.0709s)

Product/site, price

This is the most straightforward query. Since the product and site table come first, it's best for sparse products and sparse sites.

They will be filtered first and the results of the filtering cross joined (still returning relatively few values to join against the prices).

Here's the query:

SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
CROSS JOIN
site s
STRAIGHT_JOIN
price r
ON      r.product = p.id
AND r.site = s.id
WHERE   p.parent = 100
AND s.region = 30
GROUP BY
p.id
ORDER BY
min_value
LIMIT 10

id parent name min_value
195818 100 Product 195818 30.3642
54185 100 Product 54185 58.377
2 rows fetched in 0.0002s (0.0131s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE p ref PRIMARY,ix_product_parent ix_product_parent 4 const 136 100.00 Using where; Using temporary; Using filesort
1 SIMPLE s ref PRIMARY,ix_site_region ix_site_region 4 const 12 100.00 Using index
1 SIMPLE r ref ix_price_site_product_value_id,ix_price_product_value_id ix_price_site_product_value_id 8 20100930_prices.s.id,20100930_prices.p.id 1 100.00 Using index
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`product` `p` join `20100930_prices`.`site` `s` straight_join `20100930_prices`.`price` `r` where ((`20100930_prices`.`r`.`site` = `20100930_prices`.`s`.`id`) and (`20100930_prices`.`r`.`product` = `20100930_prices`.`p`.`id`) and (`20100930_prices`.`s`.`region` = 30) and (`20100930_prices`.`p`.`parent` = 100)) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10

We see that there is a filesort operation in the plan, but since there are relatively few values to sort, this operation is quite fast.

This query completes in 13 ms.

Site, product, price

This query is quite similar to the previous one, differing only in the JOIN order. It is best for sparse sites and dense products:

SELECT  p.*, MIN(r.value) AS min_value
FROM    site s
STRAIGHT_JOIN
price r
ON      r.site = s.id
STRAIGHT_JOIN
product p
ON      p.parent = 1
AND p.id = r.product
WHERE   s.region = 30
GROUP BY
p.id
ORDER BY
min_value
LIMIT 10

id parent name min_value
82678 1 Product 82678 22.7273
35757 1 Product 35757 23.1834
244754 1 Product 244754 23.5035
217731 1 Product 217731 24.3615
99861 1 Product 99861 25.5681
127003 1 Product 127003 26.8058
69848 1 Product 69848 26.9732
128191 1 Product 128191 32.3508
99234 1 Product 99234 32.9937
242035 1 Product 242035 33.9945
10 rows fetched in 0.0004s (0.0300s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE s ref PRIMARY,ix_site_region ix_site_region 4 const 12 100.00 Using index; Using temporary; Using filesort
1 SIMPLE r ref ix_price_site_product_value_id,ix_price_product_value_id ix_price_site_product_value_id 4 20100930_prices.s.id 54 100.00 Using index
1 SIMPLE p eq_ref PRIMARY,ix_product_parent PRIMARY 4 20100930_prices.r.product 1 100.00 Using where
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`site` `s` straight_join `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`r`.`site` = `20100930_prices`.`s`.`id`) and (`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`s`.`region` = 30)) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10

This query completes in 30 ms, since the conditions (in their totality) are less selective.

Product, price, site

This query is best for sparse products and dense sites:

In this query, we will replace the JOIN against the site table with an IN predicate, since we don't actually need any information from that table:

SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
STRAIGHT_JOIN
price r
ON      r.product = p.id
WHERE   p.parent = 100
AND r.site IN
(
SELECT  id
FROM    site
WHERE   region = 1
)
GROUP BY
p.id
ORDER BY
min_value
LIMIT 10
id parent name min_value
44246 100 Product 44246 20.4741
54185 100 Product 54185 20.5249
241517 100 Product 241517 20.7065
131185 100 Product 131185 21.1123
61725 100 Product 61725 21.1389
80285 100 Product 80285 21.3354
22533 100 Product 22533 21.4976
226588 100 Product 226588 21.5061
114991 100 Product 114991 21.5185
5571 100 Product 5571 21.7684
10 rows fetched in 0.0004s (0.0302s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY p ref PRIMARY,ix_product_parent ix_product_parent 4 const 136 100.00 Using where; Using temporary; Using filesort
1 PRIMARY r ref ix_price_product_value_id ix_price_product_value_id 4 20100930_prices.p.id 12 100.00 Using where
2 DEPENDENT SUBQUERY site unique_subquery PRIMARY,ix_site_region PRIMARY 4 func 1 100.00 Using where
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`product` `p` straight_join `20100930_prices`.`price` `r` where ((`20100930_prices`.`r`.`product` = `20100930_prices`.`p`.`id`) and (`20100930_prices`.`p`.`parent` = 100) and <in_optimizer>(`20100930_prices`.`r`.`site`,<exists>(<primary_index_lookup>(<cache>(`20100930_prices`.`r`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (<cache>(`20100930_prices`.`r`.`site`) = `20100930_prices`.`site`.`id`)))))) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10

Again, this query is 30 ms.

Price, product/site

This query is best for dense products and dense sites. This is the most complex query which would require some explanation.

At the first sight, it may seem that price cannot be filtered on. However, it's not true.

What we need is 10 lowest values from price satisfying certain conditions (right products and sites, and, additionally, products should be unique).

This means that we could make use of an index on price.value, scanning it until 10 records satisfying these conditions are returned. In this case, ORDER BY and LIMIT 10 would serve as filtering conditions: the scanning would cease as soon as the limit is reached.

And of course the more dense are the products and sites, the more is the probability for the conditions to be satisfied, the sooner the query completes.

Filtering on products and sites is easy, but to ensure that the products are unique, we will use the trick described in my previous article:

Basically, when filtering a price, we should not only make sure that it holds the correct product, but that it's the first (lowest) price for a given product.

This can be easily done with an additional ORDER BY / LIMIT in a subquery.

Here's the query:

SELECT  p.*, value AS min_value
FROM    price r
STRAIGHT_JOIN
product p
ON      p.id = r.product
WHERE   p.parent = 1
AND r.id =
(
SELECT  ri.id
FROM    price ri
WHERE   ri.product = p.id
AND ri.site IN
(
SELECT  id
FROM    site
WHERE   region = 1
)
ORDER BY
ri.product, ri.value, ri.id
LIMIT 1
)
ORDER BY
r.value, r.id
LIMIT 10

id parent name min_value
80226 1 Product 80226 20.0021
168980 1 Product 168980 20.0028
185520 1 Product 185520 20.0041
131416 1 Product 131416 20.0054
72154 1 Product 72154 20.0055
76637 1 Product 76637 20.0067
95946 1 Product 95946 20.0091
28136 1 Product 28136 20.0102
21426 1 Product 21426 20.0142
24976 1 Product 24976 20.0145
10 rows fetched in 0.0004s (0.0144s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY r index PRIMARY,ix_price_product_value_id ix_price_value_id 8 10 62505352.00
1 PRIMARY p eq_ref PRIMARY,ix_product_parent PRIMARY 4 20100930_prices.r.product 1 100.00 Using where
2 DEPENDENT SUBQUERY ri ref ix_price_product_value_id ix_price_product_value_id 4 20100930_prices.p.id 12 100.00 Using where
3 DEPENDENT SUBQUERY site unique_subquery PRIMARY,ix_site_region PRIMARY 4 func 1 100.00 Using where
Field or reference '20100930_prices.p.id' of SELECT #2 was resolved in SELECT #1
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,`20100930_prices`.`r`.`value` AS `min_value` from `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`r`.`id` = (select `20100930_prices`.`ri`.`id` from `20100930_prices`.`price` `ri` where ((`20100930_prices`.`ri`.`product` = `20100930_prices`.`p`.`id`) and <in_optimizer>(`20100930_prices`.`ri`.`site`,<exists>(<primary_index_lookup>(<cache>(`20100930_prices`.`ri`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (<cache>(`20100930_prices`.`ri`.`site`) = `20100930_prices`.`site`.`id`)))))) order by `20100930_prices`.`ri`.`product`,`20100930_prices`.`ri`.`value`,`20100930_prices`.`ri`.`id` limit 1))) order by `20100930_prices`.`r`.`value`,`20100930_prices`.`r`.`id` limit 10

We see that this query completes in only 14 ms despite the fact that we used the most populated parent and region. As I said before, the LIMIT 10 served as another filtering condition.

Summary

Let's run all possible queries for all possible distributions and record the query times in a table

P, S and R stand for product, site and price in the join order.

+Inf means that the query did not complete in a reasonable time and had to be terminated.

Query time, ms
Product
Sparse (parent 100) Dense (parent 1)
P/S,R S,R,P P,R,S R,P/S P/S,R S,R,P P,R,S R,P/S
Site Sparse (region 30) 13 27 30 +Inf 621 30 6,703 51,468
Dense (region 1) 7,679 4,427 30 1,051 +Inf 5,914 6,736 14

Conclusion

If limiting the results, MySQL allows filtering fact tables efficiently on higher level dimensions despite the fact these dimensions cannot be indexed. However, in each case the selectivity of the dimension level should be taken into account and the appropriate query should be used.

It should be noted that the queries involving a CROSS JOIN and index scan on the fact table perform intolerably poorly in the edge cases (too dense and too sparse dimensions, accordingly). On the other hand, the queries involving a join only differ in the join order which can be predicted by MySQL.

This means that when the dimension selectivity is unknown, a query using a plain join of all three tables (without forcing the join order) should be the query of choice:

SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
CROSS JOIN
site s
JOIN    price r
ON      r.product = p.id
AND r.site = s.id
WHERE   p.parent = 1
AND s.region = 1
GROUP BY
p.id
ORDER BY
min_value
LIMIT 10

, since MySQL can select a join order quite close to optimal and the query would complete in reasonable time.

However, when the distribution on both dimensions is known, the appropriate query from the table above can be chosen.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

September 30th, 2010 at 11:00 pm

Posted in MySQL

6 Responses to 'OLAP in MySQL: four ways to filter on higher level dimensions'

Subscribe to comments with RSS

  1. I have a related question to this post on SO, please have a look: http://stackoverflow.com/questions/3845181/does-mysql-use-existing-indexes-on-creating-new-indexes

    Pentium10

    2 Oct 10 at 13:55

  2. The first two queries run smoothly, under 0.030 seconds, but the 3rd one takes 47 seconds. Tried to run multiple times, time no more improves. Why happens this so differently for 3rd one? I’ve run the explain extended on it, and returns the same as in the post above.

    Pentium10

    2 Oct 10 at 20:16

  3. @Pentium10: well, the only thing I can say: “it works on my machine” :)

    Check that all your indexes are created and used (though they must be if you are saying the plans are identical). Especially check the UNIQUE_SUBQUERY on site.

    Quassnoi

    4 Oct 10 at 00:22

  4. It works on my system too (mysql version 5.1.48-2 on Fedora 13)

    Sanjay

    4 Oct 10 at 02:56

  5. links for 2010-10-22…

    OLAP in MySQL: four ways to filter on higher level dimensions | EXPLAIN … OLAP Work in MYSQL, Explain EXTENDED ist sowieso ein generell zu empfehlendes Blog (tags: sql mysql)…

  6. can i use data cubes in mysql like ssas in the mssql

    javad

    2 Sep 19 at 12:49

Leave a Reply