|
A Taxonomy of Query Patterns
The question is: What are the most useful aggregations to pre-compute and store, so that any query
has reasonable response time? In order to answer this question it is useful to organize the way we think about OLAP queries.
Every query is associated with three specific aggregation levels:
1. The Selection
(filtering) level, i.e. the multidimensional level that the selection of the query is made, e.g. the level of the cell
that we click on to drill down, or the level defined by the “where” clause in a SQL query. For dimensions that
don’t participate in selections, the top level of the hierarchy representing “all” or “any”
is implied. We refer to this level as the S-Level
2. The Grouping
level is the level at which we want to see the results aggregated. This is the level defined by the “group by”
clause of the query. We refer to this level as the G-Level
3. The Materialized
Table level is the actual level of the table being queried. The further apart the selection from the table level, the
higher the cost of the query per item selected. We refer to this level as the M-Level. The M-Level maybe the base level (0,..,0)
or any other level denoted by a level of the aggregation lattice at which an aggregate table already exists and can be queried
to answer the query.
|
|
| Figure 4: Query profile over a 5-dimensional aggregation hierarchy. |
Figure 4 shows a fact with five dimensions, and five levels.
The selection (dotted) and grouping (solid) lines describe the query, and the table line at the bottom shows the level at
which the data is read; in this case level (0,0,0,0,0). We now define the Query Level
as the ideal table level defined by the lowest of the selection and grouping levels along each dimension. In this case the
grouping is lower than the selection line, so the grouping defines the ideal table level. This means that if we had a table
with pre-aggregated data at the grouping level, the query could run on that table at the minimum possible cost. The area of
the graph between the selection and the grouping levels is proportional to the “intrinsic cost” of the query.
That is, the query’s cost can not be less than it would be if ran over a table aggregated at the grouping level, since
the number of rows selected and sorted is driven by the degree difference between the selection and grouping lines. The area
between the ideal table line (i.e. the query level) and the actual table line is proportional to the aggregation cost incurred
because our physical data is at a lower level than needed. The actual cost of the query is proportional to the area between
the selection level profile and the table level where the query is ran. This is because this area defines the amount of data
that need to be actually retrieved and sorted for grouping.
We can classify aggregation queries into distinct patterns based on the relative positioning of these lines:
1. Drill-Down Interactive queries are typically initiated
from a specific cell in the aggregation lattice, which already contains data. With this type of query, dimensions are pivoted
into the rows and columns of a spreadsheet. The user typically picks a cell, possibly a whole row or column, or otherwise
selects a collection of cells, and requests a drill down. The selected cells are at the “selection level” (S-Level)
and the next drill-down level defines the “grouping level” (G-Level. The drill-down is typically along one of
the dimensions, possibly two. Consequently, only a limited amount of data at the requested grouping level needs to be accessed.
Actually the most economical level of a table such a query can run is at the grouping level. The intrinsic cost of drill-down
queries is rather low and consequently they can sustain interactive response times. These queries are the best candidates
for aggregation optimization, both because of their low intrinsic cost, but also because of the need for interactive resonse
time. The diagram shows how from a selection level at (3,2,4,3,1) the drill down along two dimensions delivers grouped data
at (3,2,3,2,1). The red shaded area is proportional to the intrinsic cost of the query.
2. Breakdown Reports are characterized by a relatively
broad selection and relatively low-level grouping of data e.g. select sales at a product category in a Division, for the past
year and show sales by SKU-store by month. Such a pattern is shown at the top-right diagram of Figure 5. These queries have a high intrinsic cost because of the large distance between selection
and grouping levels. They can only be run on relatively low level detail and they require a large number of rows (or elements
in a cube). They are classified as “reports” since they tend to be rather voluminous due to the level of detail
they convey.
This type of query is often used “preemptively” in drill-down situations to cache enough
data at the client (or closer to the client) so that subsequent requests can be handled locally and not have to go to the
database. This is in most cases counter-productive since clients have a high rendering and computation cost and to predict
the users next move often results in high intrinsic cost queries, and low probability of using the cached aggregations, not
to mention the adverse load it places on the query engine. An alternative approach is to always go the database or OLAP cube,
keeping the intrinsic cost of every request as low as possible and insuring that adequate pre-aggregated materialized tables
exist to limit aggregation cost.
3. Non-conforming rollups are queries that select nodes
at a level lower that the grouping level for all or some of the dimensions. The lower diagrams of Figure 5 show such query patterns resulting from mining low level data points and rolling them
up to cluster, classify, or aggregate them along the dimensions. For example, we may select, for all stores opened in the
last 12 months, SKUs that encountered an out-of-stock condition in the past quarter, and report counts, sales and inventory
at retail by month by region by product class. This type of query needs to make its selection at a low level, “mining”
the data points based on criteria provided, and then rollup the counts and amounts to a higher grouping level.
Aggregation Optimization
To recap from the query aggregation cost discussion, the aggregation cost of a query is determined by the
level distance between the selection point and the level of the table that the data queried resides. This cost has two components:
1. The Intrinsic Aggregation Cost of the query, which
is determined by the degree difference between selection and ideal query levels (the lowest along each dimension of the S-
and G-Levels). The intrinsic aggregation cost is the lowest possible cost of a query. This cost is attained if the query runs
over a table with data pre-aggregated at the ideal aggregation level. It is a cost, however, that we cannot control because
it is based on what kind of information the user is trying to extract from the database.
2. The controlable Aggregation Cost is the part of
the total cost in excess of the intrinsic cost, and is caused by having to aggregate rows from a level lower than the query
ideal level. When no pre-aggregated tables exist, this level is the base level, (0,0,0,0,0). The aggregation cost is zero
when data is pre-aggregated (an MQT exists) at the ideal level a query can run on. In a drill-down this is the grouping level.
In the general case the query level is the lowest of the selection and grouping levels along each dimension. This is the element
of total cost that we can attempt to minimize. We will do so by identifying aggregation points to materialize, usually under
given constraints of space and materialization (batch processing) time.
|