In the last few years, MPP (Massively Parallel Processing) database vendors like Netezza, Greenplum and Aster have driven the cost of performance way down, allowing smaller organizations to consider BI and data processing capabilities they could once only dream of. As a solution architect and admitted data junkie, I now find myself giddy as I ponder how this alters what is possible and changes traditional architectures.
Although a lot of areas in the BI stack are affected by the power now available (analytics, OLAP, BI reporting, data quality, etc.), the one I have been focused on most recently is the data integration layer. The disruption has already taken its toll by messing with the acronym ETL. Now people are using the term ELT (get the data onto the MPP box fast, then do your transformations there) and even ETLT (do a little transformation work before you land the data, then do some more on the box).
In the real world of projects, I see the same struggle within data integration teams as they wonder when and where to perform their operations. "Should data cleansing or surrogate key assignments be done prior to landing data or while we move it from the source?" "If I just bulk load files and use multi-pass SQL statements to transform my data on the box, why do I need an ETL tool?" "Some of the ETL tasks have a pushdown optimization option while others do not. How does that help me in my data flow?"
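To make the "bulk load, then multi-pass SQL" question concrete, here is a minimal ELT sketch. It is only an illustration under loud assumptions: Python's built-in `sqlite3` stands in for the MPP box, and the table names (`stg_sales`, `clean_sales`, `dim_customer`, `fact_sales`), the sample rows and the `run_elt` helper are all hypothetical. On a real appliance the load step would use the vendor's bulk loader, and each pass would run in parallel on the box.

```python
import sqlite3

# Hypothetical raw extract: messy customer names, amounts as text.
RAW_ROWS = [
    (" alice ", "2009-01-05", "19.99"),
    ("BOB", "2009-01-05", "5.00"),
    ("Alice", "2009-01-06", "7.50"),
]


def run_elt(rows):
    """Land raw rows first, then transform them with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MPP database
    cur = conn.cursor()

    # Load: land the data untouched, every column as text.
    cur.execute(
        "CREATE TABLE stg_sales (customer TEXT, sale_date TEXT, amount TEXT)"
    )
    cur.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)

    # Pass 1: cleansing (trim and normalize names, cast amounts).
    cur.execute(
        """
        CREATE TABLE clean_sales AS
        SELECT UPPER(TRIM(customer)) AS customer,
               sale_date,
               CAST(amount AS REAL) AS amount
        FROM stg_sales
        """
    )

    # Pass 2: surrogate key assignment via an auto-assigned integer key.
    cur.execute(
        "CREATE TABLE dim_customer ("
        "customer_key INTEGER PRIMARY KEY, customer TEXT UNIQUE)"
    )
    cur.execute(
        "INSERT INTO dim_customer (customer) "
        "SELECT DISTINCT customer FROM clean_sales ORDER BY customer"
    )

    # Pass 3: build the fact table by swapping names for surrogate keys.
    cur.execute(
        """
        CREATE TABLE fact_sales AS
        SELECT d.customer_key, c.sale_date, c.amount
        FROM clean_sales c
        JOIN dim_customer d ON d.customer = c.customer
        """
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = run_elt(RAW_ROWS)
    for row in conn.execute("SELECT * FROM fact_sales ORDER BY sale_date"):
        print(row)
```

Notice that nothing here needed an ETL engine's own processing resources: cleansing and key assignment both happen after the data lands. In an ETL or ETLT flow, pass 1 (and possibly pass 2) would instead run in the integration tool before the insert into the staging table.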
ETL vs. ELT ... what's out there?
I think the core problem is that ETL and data integration tool vendors like Informatica, Ab Initio, DataStage, etc. have added significant value in the past by offloading work from the databases and processing it in their own resource space. When data volumes got enormous, they enabled scaling up and out on additional hardware and created leading-edge optimization strategies. Now, all of a sudden, their target databases are MPP enabled, with stacks of blades, memory and disks. With enormous processing capabilities and the ability to handle mixed workloads in the database, why should organizations also invest in additional ETL processing hardware?
I find all the conflicting aspects of this problem fun to ponder. MPP database vendors have now given us both an opportunity and a need to re-invent the BI stack. For the moment, the future I see is one where the data integration tool vendors begin to truly leverage the MPP architectures the databases are offering, either by pushing transformations down to the databases or by actually embedding themselves into the database hardware and software. Vendors like Informatica have seen this coming and are reacting in kind, but the reality is that there is still a long way to go.
It is a cool time to be in the world of data.