When Data Drives Your Business
  GO
Contact Us 888-828-8201

 
 The Aginity Blog

 

Massively Parallel: A Story of Distributing the Processing and the Data

MPP environments achieve impressive performance on “big data” by distributing and parallelizing work across multiple processing units.  This is accomplished by splitting queries into separate copies which work in parallel, each on their own slice of data.  For this reason, in MPP environments, we distribute the storage as well as the processing, so that data can be crunched “locally”, minimizing the amount of data that must cross over from one processing unit to another.  The preference is that only minimal result sets are passed back up to the shared memory and processing units for final packaging and delivery.  This concept is simple enough, and is certainly an elegant solution for dealing with high performance expectations from users querying big data.  The reliance of an MPP system on good data distribution often presents a challenge worth discussing here. 

While this topic and the concepts discussed herein are relevant in most MPP database environments, the demonstrations below were performed on a Netezza Performance Server, Mustang series.

A Snowflake Data Model

For the sake of keeping this “blog-size”, let’s assume we are dealing with a typical snowflake schema with one large sales fact table (32GB, 1.5b rows) and two large dimensions, a customer household dimension (9GB, 195m rows – due to some interesting house-holding logic) and a product dimension (leaf is 727MB, 22m rows).

image

 

Distributing the Data

So, the initial challenge is deciding which of these two large dimensions we should distribute our fact table on.  MPP best practices state that you should distribute your fact table on your largest dimension.  Great, household it is.  However, understanding a little more about the business processes and the data we are working with, it turns out that just under half of our transactions don’t actually belong to any customer household – I call this dimensional scarcity.  This is pretty typical in retail data, where transactions are sourced from a combination of bricks and mortar, and online stores. 

So, this makes household_key a poor choice for distribution, right?  Not necessarily.  If we are using surrogate keys, which we control inside the data warehouse, we can do something a little tricky and fairly harmless to the integrity of our system.  We can assign a “dummy” household_key to each row in our fact table which doesn’t belong to an actual customer household.  Since we control the surrogate key values, we know which values we can use, so not to collide with valid customer household_key values.  And how do we do this without significantly increasing the size of our household dimension table?  Well, thanks to a recent conversation with some of our friends at Netezza, we can apply some simple math to assign a value which resides within the range of the number of processing units we have (on Netezza these are called SPUs, and on Teradata they are called AMPs). 

Here’s a sample of the DML code which we can use to populate our fact table.

image

 

We can now see that our fact table is well distributed on household_key, showing a skew of only 0.73.  The table is also well sorted on transaction_date, showing a selectivity of 92% on this column.

Now that we have new “households” in our fact table, we need to insert these into our dimension table as well.  We can identify these as household_key values < 0 (as you can see from the multiplier of -1 in the DML code above).  Since we will want to know that these are “unassigned” or dummy households, we include an unassigned_flag column on the household dimension table – we could have chosen to put this flag on the fact table as well, but why waste the space and risk an update anomaly if the transaction ultimately gets assigned to a household later.

image

… And the Results

 So, what does all of this get us?  Here are some test results … the good and the bad.

Query Scenario

Date Constraint

Distributed on household_key

Query Time

Query Path

Join fact to household dimension.

7 days

2 seconds

Co-located join to dimension.  No redistribution necessary.

Join fact to product dimension, sku level.

7 days

2 seconds

Redistribute fact result and join to dimension locally.

Join fact to product dimension, division level.

7 days

2 seconds

Redistribute fact result and join to dimension locally.

Join fact to household dimension.

365 days

5  seconds

Co-located join to dimension.  No redistribution necessary.

Join fact to product dimension, sku level.

365 days

72 seconds

Broadcast dimension to all SPUs.

Join fact to product dimension, division level.

365 days

45 seconds

Broadcast dimension to all SPUs.

 

First, the good.  Because of great Zone Mapping on transaction_date, all 3 queries of 7 days data return very quickly.  This is because Netezza will filter the fact table first, hitting only the physical extents which hold the 7 days of data required.  The small result set is then redistributed if necessary and joined locally to the desired dimensions.

Looking at our 365 days queries, we see a different story.  Our query joining to the household dimension to perhaps “count households and sum sales grouped by household income” returns very quickly because the join is completely co-located and no broadcasting or redistribution is required. 

However, joining to product dimension to perhaps “sum sales by sku or division” significantly increases the time for our query, likely pushing us beyond user expectations (we’ll get angry emails from impatient users).  This is happening because the optimizer feels there’s too much fact data to redistribute on the sku_key, and so it broadcasts the large product sku dimension to all SPUs and then performs the join.  This takes time.  Netezza DBAs can modify system settings on how big of a dimension to broadcast, so it is possible that improvements can be made by further system tuning, but you get the idea here.

While Netezza offers Materialized Views, which allow us to materialize a system maintained view where we can restrict the number of columns and specify the sort order, this feature does not allow us to redistribute the data.   Nor does it actually build any index similar to Teradata’s hash index – may we hope this gets on the Netezza product roadmap?  Follow this link for a decent description of Teradata’s hash index and how it works … pay special attention to the join back concept.

One More Important Optimization

To wrap this up, I’ll offer one more piece of advice … know your user’s needs.  In our scenario, we assume that since the household dimension is by far the largest, the household_key is the best key to distribute on.  Well, even with our new trick, this may not be the case.  If users will typically only join to the household dimension for the nearly 50% of transactions which actually belong to customer households (because all attributes will be “unassigned” for the other 50%), we can rethink our distribution strategy.  Since all transactions are likely to have some product, and therefore will all be accessed when joining to the product dimension, we may choose to distribute our fact table on the sku_key.

After doing so, we see the following results.

Query Scenario

Date Constraint

Distributed on sku_key

Query Time

Query Path

Join fact to household dimension. 

*Restrict the fact to only transactions belonging to households.

7 days

3 seconds

Restrict the fact to transactions with households, then redistribute the result, and join to dimension locally.

Join fact to product dimension, sku level.

7 days

2 seconds

Redistribute fact result and join to dimension locally.

Join fact to product dimension, division level.

7 days

2 seconds

Redistribute fact result and join to dimension locally.

Join fact to household dimension.

*Restrict the fact to only transactions belonging to households.

365 days

7  seconds

Restrict the fact to transactions with households, then redistribute the result, and join to dimension locally.

Join fact to product dimension, sku level.

365 days

4 seconds

Co-located join to dimension.  No redistribution necessary.

Join fact to product dimension, division level.

365 days

4 seconds

Co-located join to dimension.  No redistribution necessary.

Join fact to household dimension and product dimension, division level.

365 days

12 seconds

Restrict the fact to transactions with households, and then join locally to the product dimension.   Finally redistribute the result, and join to the household dimension locally.

 

While the household joins take slightly longer to process, they are still well within the range of user expectations.  The joins to the product dimension however see an incredible performance improvement down to just 4 seconds.  I think we have a winner!

Posted by: Dan Kuhn - CTO on 4/8/2010 | 0 Comments

I've been wanting an excuse to work with Amazon's various cloud computing services.  In my world where building solutions that process terabytes of data is common, I have been struggling with scenarios that might be a good fit.  Today I stumbled into a need. 

I have a need to create a build box for our Aginity Netezza Workbench tool.  Until now, development has been in the hands of a single person where we did not need to centralize the build.  But now as we are expanding the development to more people and transitioning into a more formal process - we now have a need. 

I had a lot of different options I could have chosen - but in the end I decided I would do it on one of the Amazon EC2  services.  My reasoning is this:

·         A build box is needed in small bursts of time throughout a week or day - with the remainder of the time having it sit idle waiting for a build event.

·         Many of our projects crank up quickly and need a build box before our clients can set one up internally and grant access to it.  It would be great to create one at the click of a button.

·         We have source control persistently hosted already, so there is no need to continuously persist a build box.

·         I needed an excuse to try the EC2 services

 

So for at least a little while, I plan to journal my first experience with EC2 here in my blog.

I started by heading to Amazon's Web Services site and signed up for an Amazon Web Services Account.

http://aws.amazon.com/

Signing up was as easy as making my first buy of a book from Amazon (other than one of those Captcha challenges)

Next I poked around setting up a server and getting access to it.  The primary challenge is getting used to their terms and language - as well as. I created EC2 instances, key pairs, security, groups.  Then deleted them.   

I must admit, after about an hour spent on Day 1, I ended with nothing tangible other than a better grasp of the terms and order of steps to do what I want.  A brief summary of what I learned

Key Pairs: This is to setup public and private key pairs used to decrypt certain information.  In the case of setting up your first instance, you use this to decrypt your administrative password.  Be sure to allow pop-up windows before you do this.  If you fail to download the key file on the first pass, you will not get another chance (and thus need to try again on a new one).

Security Group: This defines the ports open for access.  I used the default group but needed to add RDP to in order to get access.

Instance: This is a running instance of a server.  I chose a small capacity Windows 2008 server so we could build a C# application.  This is easiest to perform after Key Pairs and Security Groups have been successfully.

Total account cost: $0.00 (never started an instance)

Day 2 still ahead...

 



  • Syndicate    
     

    Recent Posts

    Archive

    Bloggers

    Category List

    Tag Cloud

       


    MapReduce Clickstream Response Attribution
    Java AP Basket
    MapReduce Keyword Tokenization
    Interactive Reporting Patterns

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now


    Privacy Statement  |  Terms Of Use  |  Copyright 2010 by Aginity, Inc. Register   |   Login