|
Introduction
This topic would have required a more extensive introduction a few years ago, prior to the broad acceptance
of on-line analytical processing (OLAP) and data warehousing technologies. Multidimensional hierarchies are typically manifested
in dimensional schemas, whereby an event or fact is linked to each of a given number of hierarchical dimensions. Additive
metrics attached to such events or facts can be rolled-up along the multidimensional dimensions and reported, or interactively
displayed. The performance characteristics of this process of aggregating low-level metrics along the dimensions divide the
proponents of the various OLAP technologies and implementations. Our aim is to understand and quantify these performance characteristics
and architect systems that meet desirable availability, currency and performance service levels. We hope that our analysis
will help practitioners design solutions using either type of technology or effectively blend them in hybrid solutions.
Events and facts carry a foreign key that associates them to each of the relevant dimensions, typically
at the leaf level. For example, sales can be associated with a customer in a demographic dimension, a store in a location
dimension, an SKU in a product classification dimension, a promotion in a marketing dimension or a fiscal week in a time period
dimension. Let’s assume that this is the prime (bottom, atomic) fact, from which higher aggregation levels can be derived. Each dimension forms a hierarchy of nodes that define aggregation points that metrics
can be reported at. When we aggregate facts we pick a level for each associated dimension. For example, we can roll all sales
up to a level defined by demographic category by district by product class across all promotions by month. Notice that instead
of ignoring the promotion dimension to include all promotions in our query, we logically pick the top node that represents
“all” or “any” promotion activities. Our facts have a reference to some leaf of every one of the participating
dimensions. By appropriate level selection and attribute filtering we achieve the inclusion or exclusion of the facts/events
we are interested in.
In order to reason about the behavior and performance of queries over such a multidimensional hierarchical
structure, we will first assume uniform hierarchies of fixed depth, consistent fan-out and constant cardinality ratio from
level to level. We will then look at a more realistic model that removes some of these simplifying assumptions, e.g. deals
with unbalanced hierarchies of different depths and varying cardinality ratios from one level to the next.
The “Aggregation Lattice”
Our first aim is to understand the topology of the “Aggregation Lattice”, a graph that
is defined by how different aggregations can be derived from lower level aggregations, and develop our understanding about
how it scales as levels or dimensions increase. Let’s assume a simple situation with only a few dimensions as follows:
1. Each individual dimension forms a tree, i.e. each
node has exactly one parent except for the single top node representing “all” or “any”.
2. The tree is of uniform depth, i.e. the same number
of levels from the top node to any leaf.
3. We name the level of each node in a dimension as
follows: A leaf node is at level zero, its parents are at level 1, their parents at level 2, etc. The root node is assigned
level number n-1, where n is the number of levels.
4. Every node at any level in our fictitious dimension
has the same fixed number of children. Fact aggregations from any level to the next have the same cardinality ratio. For example,
if store-level sales (level 0) represent 1 million records, and the same sales aggregated at the market level (level 1) is
reduced to 100,000 distinct records, the market-to-store level cardinality ratio R
(1,0) = 1/10.
5. Aggregations are defined across all dimensions available
for a fact. The non-participation of a dimension in a roll-up is equivalent to using its top node (denoting aggregating across
“any” or “all”). We define the multi-dimensional (composite)
level as the combination of the levels of each dimension represented as an ordered tuple. For example, in the five-dimensional
sales hierarchy discussed earlier, level (0,0,0,0,0) represents the lowest level independent fact. If each of the five dimensions
has 4 levels, then (3,3,3,3,3) represents the total sales for all periods for all customers and products. There are 4^5=1024
combinations of possible composite levels including the two extremes – base and top. We will simply use the term “level”
to refer to the composite level of an aggregation.
6. We construct a directed acyclic graph whose nodes
represent all possible aggregation levels and whose arcs connect each level to lower levels from which it can be derived by
aggregation. This graph is always a lattice with one top, the highest multidimensional level of aggregation, (3,3,3,3,3) in
our previous example, and one bottom representing the base level (0,0,0,0,0). Such a lattice can either depict only the elementary
(single step) aggregations, which are easier to view, or all “transitive” aggregations that a given node can be
derived from.
7. Facts are not necessarily dense, i.e. there are
a lot of dimension key combinations (particularly in the lower levels) that have no events or facts associated with them.
Dimensions are also not necessarily independent. Each customer, for example, may have sales for only a few out of thousands
of possible products, leaving most customer-product combinations empty.
Three lattices of a two-dimensional hierarchy are depicted in Figure 1. The number of nodes in these graphs is the product of the number of levels in each dimension. In the
three cases, from left to right, 2x2 = 4, 4x4 = 16 and 3x3 = 9. Similarly, Figure 2 depicts three-dimensional lattices for 2 and three level hierarchies with 2x2x2 = 8 and 3x3x3 = 27 nodes
respectively.
These graphs are maps of elementary aggregation paths that connect any aggregation level to each of the
levels that can be used to derive it. An elementary aggregation path is an aggregation along one of the dimensions by one
level. Like roads connecting towns adjacent to each other, we can reach a city from all adjacent towns and all further away
towns that can reach these adjacent towns. For example, to construct an aggregation at a level (2,3) in Figure 1, we can sum
facts from level (2,2) along the second dimension, or those from (1,3) along the first dimension. These are the elementary
aggregation paths. We can also use transitive aggregations, like roads connect adjacent paths to form longer routes: We can
reach level (2,3) by aggregating not only the nodes immediately below, but also any node in a descendent path, i.e. (1,2),
(2,1), (0,3), (2,0), (1,1), (0,2), (0,1), (1,0), (0,0). For illustration purposes we show transitive aggregations using dashed
lines for the 3x3 lattice.
Another observation is that there is more than one way to derive an aggregation. In Figure 2, for example,
aggregations at level (2,0,2) can be derived in various ways: Aggregating by one level the third dimension of (2,0,1)
or the first dimension of (1,0,2), or by aggregating any of the levels that these have been derived from, i.e. (2,0,0), (1,0,1),
(0,0,2) or (1,0,0), (0,0,1) and (0,0,0).
|
|
| Figure 1: Two-dimensional aggregation lattices 2, 3 and 4 levels deep. |
|
|
| Figure 2 Tree-dimensional aggregation lattice 2 and 3 levels deep |
The nodes, in the diagrams, are arranged in rows to show how each node is derived from nodes of the next
row below by increasing the level of exactly one of the dimensions by one. If the average cardinality ratio from level to
level is 1:k, each node on the lattice contains 1/k of the rows of the next level it is derived from. All nodes in the same
row have the same cardinality ratio relative to the bottom. This is because each progression from a lower level aggregation
to a higher one is by raising the level of exactly one dimension, and we assumed total symmetry and uniform cardinality ratios.
To get from (0,0,0) to (0,2,0) we span two levels of aggregation, i.e. reducing the cardinality by a ratio 1/k2
regardless of which path we followed. As a matter of fact the nodes in that row happen to represent all possible aggregate
levels derived from the base level (0,0,0) by applying two single-level aggregations. Each row is thus associated with a “degree”
derived as follows: The bottom level is always of degree 0. The next row is of degree 1; the next row above is of degree 2
etc. Any node of the lattice has a degree equal to the sum of the level numbers of its dimensions, i.e. Di,j,k=i+j+k.The
two-dimensional 3-level lattice has seven degrees, 0 at (0,0,0) through 6 at (2,2,2). The three-dimensional two-level lattice
has four degrees, 0 at the bottom through 3 at (1,1,1). Two important properties of nodes of the same degree are (a) that
they can never be derived from a node of the same or higher degree, but rather they can only be derived from nodes of lower
degree and (b) they are at the same “distance” from the bottom level in terms of single-step aggregations.
In an idealized situation where all dimensions are independent and topologically similar, having the same
depth and cardinality ratio from level to level, the “degree” defines a locus of aggregation points with similar
size and performance characteristics. Given that OLAP queries typically select a single node of the multidimensional hierarchy
and group by one or two degrees of aggregation lower in the lattice, the theoretical cost of the query is proportional to
k*log k or 2k2*log k respectively. This cost represents the number of fact rows accessed and
sorted if the targeted level of aggregation is available. If a query runs over a lower level aggregation or the base level,
the cost as a function of “degree differential” d between the node selected and the level accessed is d*kd*log
k. To simplify the math we will stick to the dominant factor kd and ignore the logarithmic and constant
parts. When we talk about performance and query cost in this model, we are focused
on aggregation cost, presuming optimal access path selection by the DBMS and abstracting potential asymmetries in the queries
caused by the number of joins, indexing, partitioning, clustering etc.
The graph in Figure 3 shows the daunting relationship between cardinality ratio per level, degree (varying
from 0 to 16) and number of dimensions (varying from 2 to 10). The numbers are derived from all theoretical combinations of
independent dimensions and constant level-to-level cardinality ratio. Nevertheless, the shape of the curve illustrates how
extreme the response time problem may be if one has to always aggregate from the base facts to any level needed. In real situations
the actual combinations occurring in facts is much less. However this does not change the lack of scaleability as it only
scales the graph so that the prohibitive growth happens a few degrees higher. By looking at the surface we can visualize a
boundary of rows to be aggregated beyond which we will have to keep redundant pre-aggregated data in order to maintain interactive
response time or sustain an agreed level of service. For illustration let’s assume that we don’t want to allow
any query to have to aggregate more than a threshold of 10,000 rows for each selection. This will require us to construct
intermediate aggregates at points of the lattice that allow high degree queries to derive their answers from. Rows from these
summary tables are further aggregated to the level the “group by” clause of the query dictates. The number of
rows that need to be aggregated, however, are now measured from those pre-aggregated summary tables and up.
|
|
| Figure 3: Relationship between cardinality ratio, degree of request (0-16) & number of dims (2-12) |
For dimensions that are “tall and narrow” or, in our terminology, have low cardinality ratios
and many levels, we can afford higher degree pre-aggregations or even none of them at all. If the cardinality ratio is medium
to high, that is, as the dimensions get wider, the degree difference we will be required to pre-aggregate falls drastically.
For average cardinality ratio 2:1, we don’t reach the aggregation threshold of 1:10,000 until degree 13. For ratio 4:1
the same threshold is reached at degree 6.
|