Massively Parallel: A Story of Distributing the Processing and the Data
MPP environments achieve impressive performance on “big data” by distributing and parallelizing work across multiple processing units. This is accomplished by splitting queries into separate copies which work in parallel, each on their own slice of data. For this reason, in MPP environments, we distribute the storage as well as the processing, so that data can be crunched “locally”, minimizing the amount of data that must cross over from one processing unit to another. The preference is that only minimal result sets are passed back up to the shared memory and processing units for final packaging and delivery. This concept is simple enough, and is certainly an elegant solution for dealing with high performance expectations from users querying big data. The reliance of an MPP system on good data distribution often presents a challenge worth discussing here.
While this topic and the concepts discussed herein are relevant in most MPP database environments, the demonstrations below were performed on a Netezza Performance Server, Mustang series.
A Snowflake Data Model
For the sake of keeping this “blog-size”, let’s assume we are dealing with a typical snowflake schema with one large sales fact table (32GB, 1.5b rows) and two large dimensions, a customer household dimension (9GB, 195m rows – due to some interesting house-holding logic) and a product dimension (leaf is 727MB, 22m rows).

Distributing the Data
So, the initial challenge is deciding which of these two large dimensions we should distribute our fact table on. MPP best practices state that you should distribute your fact table on your largest dimension. Great, household it is. However, understanding a little more about the business processes and the data we are working with, it turns out that just under half of our transactions don’t actually belong to any customer household – I call this dimensional scarcity. This is pretty typical in retail data, where transactions are sourced from a combination of bricks and mortar, and online stores.
So, this makes household_key a poor choice for distribution, right? Not necessarily. If we are using surrogate keys, which we control inside the data warehouse, we can do something a little tricky and fairly harmless to the integrity of our system. We can assign a “dummy” household_key to each row in our fact table which doesn’t belong to an actual customer household. Since we control the surrogate key values, we know which values we can use, so not to collide with valid customer household_key values. And how do we do this without significantly increasing the size of our household dimension table? Well, thanks to a recent conversation with some of our friends at Netezza, we can apply some simple math to assign a value which resides within the range of the number of processing units we have (on Netezza these are called SPUs, and on Teradata they are called AMPs).
Here’s a sample of the DML code which we can use to populate our fact table.

We can now see that our fact table is well distributed on household_key, showing a skew of only 0.73. The table is also well sorted on transaction_date, showing a selectivity of 92% on this column.
Now that we have new “households” in our fact table, we need to insert these into our dimension table as well. We can identify these as household_key values < 0 (as you can see from the multiplier of -1 in the DML code above). Since we will want to know that these are “unassigned” or dummy households, we include an unassigned_flag column on the household dimension table – we could have chosen to put this flag on the fact table as well, but why waste the space and risk an update anomaly if the transaction ultimately gets assigned to a household later.

… And the Results
So, what does all of this get us? Here are some test results … the good and the bad.
|
Query Scenario
|
Date Constraint
|
Distributed on household_key
Query Time
|
Query Path
|
|
Join fact to household dimension.
|
7 days
|
2 seconds
|
Co-located join to dimension. No redistribution necessary.
|
|
Join fact to product dimension, sku level.
|
7 days
|
2 seconds
|
Redistribute fact result and join to dimension locally.
|
|
Join fact to product dimension, division level.
|
7 days
|
2 seconds
|
Redistribute fact result and join to dimension locally.
|
|
Join fact to household dimension.
|
365 days
|
5 seconds
|
Co-located join to dimension. No redistribution necessary.
|
|
Join fact to product dimension, sku level.
|
365 days
|
72 seconds
|
Broadcast dimension to all SPUs.
|
|
Join fact to product dimension, division level.
|
365 days
|
45 seconds
|
Broadcast dimension to all SPUs.
|
First, the good. Because of great Zone Mapping on transaction_date, all 3 queries of 7 days data return very quickly. This is because Netezza will filter the fact table first, hitting only the physical extents which hold the 7 days of data required. The small result set is then redistributed if necessary and joined locally to the desired dimensions.
Looking at our 365 days queries, we see a different story. Our query joining to the household dimension to perhaps “count households and sum sales grouped by household income” returns very quickly because the join is completely co-located and no broadcasting or redistribution is required.
However, joining to product dimension to perhaps “sum sales by sku or division” significantly increases the time for our query, likely pushing us beyond user expectations (we’ll get angry emails from impatient users). This is happening because the optimizer feels there’s too much fact data to redistribute on the sku_key, and so it broadcasts the large product sku dimension to all SPUs and then performs the join. This takes time. Netezza DBAs can modify system settings on how big of a dimension to broadcast, so it is possible that improvements can be made by further system tuning, but you get the idea here.
While Netezza offers Materialized Views, which allow us to materialize a system maintained view where we can restrict the number of columns and specify the sort order, this feature does not allow us to redistribute the data. Nor does it actually build any index similar to Teradata’s hash index – may we hope this gets on the Netezza product roadmap? Follow this link for a decent description of Teradata’s hash index and how it works … pay special attention to the join back concept.
One More Important Optimization
To wrap this up, I’ll offer one more piece of advice … know your user’s needs. In our scenario, we assume that since the household dimension is by far the largest, the household_key is the best key to distribute on. Well, even with our new trick, this may not be the case. If users will typically only join to the household dimension for the nearly 50% of transactions which actually belong to customer households (because all attributes will be “unassigned” for the other 50%), we can rethink our distribution strategy. Since all transactions are likely to have some product, and therefore will all be accessed when joining to the product dimension, we may choose to distribute our fact table on the sku_key.
After doing so, we see the following results.
|
Query Scenario
|
Date Constraint
|
Distributed on sku_key
Query Time
|
Query Path
|
|
Join fact to household dimension.
*Restrict the fact to only transactions belonging to households.
|
7 days
|
3 seconds
|
Restrict the fact to transactions with households, then redistribute the result, and join to dimension locally.
|
|
Join fact to product dimension, sku level.
|
7 days
|
2 seconds
|
Redistribute fact result and join to dimension locally.
|
|
Join fact to product dimension, division level.
|
7 days
|
2 seconds
|
Redistribute fact result and join to dimension locally.
|
|
Join fact to household dimension.
*Restrict the fact to only transactions belonging to households.
|
365 days
|
7 seconds
|
Restrict the fact to transactions with households, then redistribute the result, and join to dimension locally.
|
|
Join fact to product dimension, sku level.
|
365 days
|
4 seconds
|
Co-located join to dimension. No redistribution necessary.
|
|
Join fact to product dimension, division level.
|
365 days
|
4 seconds
|
Co-located join to dimension. No redistribution necessary.
|
|
Join fact to household dimension and product dimension, division level.
|
365 days
|
12 seconds
|
Restrict the fact to transactions with households, and then join locally to the product dimension. Finally redistribute the result, and join to the household dimension locally.
|
While the household joins take slightly longer to process, they are still well within the range of user expectations. The joins to the product dimension however see an incredible performance improvement down to just 4 seconds. I think we have a winner!