When Data Drives Your Business
  GO
Contact Us 888-828-8201

 
 The Aginity Blog

Posted by: Brian Weissler on 7/12/2010 | 0 Comments

Here's a short demonstration video of how to create, compile, register, and call a custom iClass Analytic Process. 

The Analytic Process is one of the most powerful features of the next Netezza release (v6 + iClass), currently in Beta.  This provides developers with extreme flexibility to create programs which execute on the NPS, taking advantage of the MPP architecture for distribution and parallelism.

 

CompileandRegister

Posted by: Oleg on 7/7/2010 | 0 Comments

This is the second part of the UDA creation article. The first part can be found here: Creating a Netezza User Defined Aggregate (UDA) – understanding the basics.

The most simple and intuitive tool for creating a UDA is Aginity Netezza Workbench. Well, it is the most simple and intuitive tool I am aware of :) The tool will be used for creating a sample UDA.

Sample UDA Description

We are going to create a user defined aggregate SUM_IGNORE_NULLS(). The UDA will behave exactly like SUM() does with the only exception that a NULL value would not cause the function to return NULL, e.g. NULLs will be considered zeroes.

Step by step instructions for creating SUM_IGNORE_NULLS aggregate

1.       Start the Workbench

2.       Click File / New UDX / Aggregate menu

3.       Specify SUM_IGNORE_NULLS as the SQL Name on the General page


 

4.       This sample function will accept a BIGING parameter and return a BIGINT value. Switch to the Arguments page and specify the function accepts single parameter called Summand and its data type is BIGINT as shown on the image below:


 

5.       The function should be able maintaining its stage between the calls (please see the first part of this article for more information.) Since we are calculating a sum of integer values, single BIGINT state value is sufficient. Switch to the State page and add an Accumulator state variable of data type BIGINT


 

6.       Specify the return value on the Returns page


 

7.       Click OK button to proceed. The tool will generate a CPP file that contains the following items:

a.       Include directives for importing required headers

b.      Using directives to start using appropriate namespaces

c.       CSUM_IGNORE_NULLS class interface that includes base virtual functions to be overridden as well as several inline helper functions for accessing function argument and state variables to make the code more readable

d.      A fully implemented instantiate() function that creates an instance of your class and returns it to the caller

e.      Empty stubs for the virtual functions that must be overridden by you

8.       Implement initializeState() function.
This function should set the state variables to initial values. Since our aggregate is going to ignore NULLs and treat them as zeroes, replace the second parameter passed to the
setStateNull() with false and set the state variable initial value to zero:

void CSUM_IGNORE_NULLS::initializeState()
{
      
setStateNull(0, false);
      
*get_state_Accumulator() = 0;
}

9.        Implement accumulate() function.
This function accepts single argument and, in accord to the logic, should add the incoming value to the state like this:

void CSUM_IGNORE_NULLS::accumulate()
{
      
// see if the argument is NULL and ignore it if so
      
if (!is_arg_null_Summand())
       {
             
// accumulate
             
*get_state_Accumulator() += get_arg_Summand();
       }
}

10.   Implement merge() function.
This function should merge two state variables – local and the one that is passed in via arguments. Our function will look like this:

void CSUM_IGNORE_NULLS::merge()
{
      
// the second state variable passed in via arguments is guaranteed to hold
       // no NULL value because the initializeState() explicitly sets it to zero,
      
// so an extra check can be avoided
      
*get_state_Accumulator() += get_arg_Summand();
}

11.   Implement the finalResult() function.
The state variable holds an aggregated sum of values from all SPUs, so simply return the value stored in the session variable. Here a strange thing happens. The session variables within the
finalResult() function have to be accessed using arguments accessor functions. E.g. state variable 0 should be accessed using int64Arg(0)as opposed to int64State(0) as you might had been expecting. So the function body is

ReturnValue CSUM_IGNORE_NULLS::finalResult()
{
      
NZ_UDX_RETURN_INT64(get_arg_Summand());
}

12.   The C++ part is ready now. Let’s define a sample SQL statement for testing. Switch to the Test Query tab and type the following query:

select category_id, sum_ignore_nulls(item_count)
 
from test_uda
group by category_id;

The script for creating the test table and populating it with test data can be found at the bottom of this article.

13.   We are done. Hit F5 for compiling the UDA, registering it in the target database and running the test SQL script. The following output is produced:

Posted by: Oleg on 7/7/2010 | 1 Comment

Exactly like a User Defined Function (UDF), a User Defined Aggregate (UDA) in Netezza is a piece of code that is compiled into a binary output and can be called from within a SQL statement. Some examples of an aggregate function are AVG(), SUM() and MIN(). UDAs are implemented in C++ at the moment, but it is planned that other languages (such as Java, Python, R and so on) can be used in the future.

The similarity between an UDF and UDA is that both return single scalar value.

The difference is that the former accepts a scalar set of arguments while the latter accepts zero-to-many rows of arguments and produce an aggregate value based on a set of rows rather than on single row of input parameters.

Terminology

A typical query that uses an aggregate function contains a group by statement, for instance

select store_id, item_id, max(price)

  from products

 group by store_id, item_id;

A term Distinct Combination (DC) hereafter is used to refer to a unique combination of values stored in a set of columns specified in group by clause. E.g. each unique combination of store_id and item_id returned by the query above is considered to be a Distinct Combination.

In other words, DC refers to single output row produced by a query that contains a GROUP BY clause. The output query contains as many DCs as there are distinct combinations of fields in the group by clause of the SQL statement.

The term DC is important and will be used extensively below to make it clear what is called under circumstances and what is not. And what is called where :)

An UDA in MPP world

Netezza box (NPS) consists of a central host and a bunch of SPU modules that can be thought of as separate computers connected via a high-speed network appliance. The central host features RedHat operating system, while SPUs use some kind of embedded Linux OS. A UDA can be executed on either central NPS host on or an SPU depending on how it is invoked. This is why it is compiled for two different target platforms and two different binaries are produced. The binaries are then copied over to every SPU as well as to a system location on the central hosts and registered with the database system to make them available for using in SQL statements.

The fact that data reside on many SPUs means that we must aggregate the data twice:

1.       Produce an aggregate value on each individual SPU for each DC

2.       Merge the aggregated values on the central host in order to produce a final aggregate result for each DC

We will see how it affects the UDA interface in the next sections.

Keeping the UDA state between the calls

A UDA is called for each row of data produced by a SQL statement. This means it must have means of storing certain pieces of information between the calls in order to produce an aggregated value for a particular DC. A UDA in Netezza uses state variables for that.

You, as a designer of the UDA, specify what state variables are required for your UDA and the data type of each state variable. Netezza creates a set of state variables for each DC produced by a SQL query and makes sure that your UDA has access to the same set of variables for a given DC.

A base class for a UDA

A UDA C++ class must be inherited from Uda class that defines several abstract functions to be overridden as follows:

static Uda *instantiate();
virtual void initializeState() = 0;
virtual void accumulate() = 0;
virtual void merge() = 0;
virtual ReturnValue finalResult() = 0;

The instantiate() function should simply create an instance of your class and return it to the caller.

initializeState() is called once for each DC and the UDA must initialize its state variables at this point. For instance, you can set the some or all state variables to NULL, zero or any other initial value that you think is appropriate.

accumulate() is called once for each row of data. This is the place where you update your state variables to take into account the values from a new set of input parameters.

merge() is called or it is not :) In the most generic case, it is called on a central host to merge two states received from two different SPUs into single set of state variables. This is done repeatedly for each individual SPU state until all states from all SPUs are merged into single one.

An important point here is that this function is not called under certain circumstances. Consider the situation when the column a table is distributed on is a part of the group by clause. In this case, every SPU holds all the data that belong to a given DC. As a result, there is no need to merge state variables produced by each SPU because their DCs never overlap.

finalResult() is the last function called by NPS. This function must return single scalar result using values stored in UDA state variables if necessary.

What’s next?

I am going to show you how to create a sample UDA in the next article. See you soon :)



 

Posted by: Brian Weissler on 6/28/2010 | 0 Comments
Posted by: Oleg on 5/25/2010 | 6 Comments

Introduction

An OnStream User Defined Function (UDF) in Netezza is a piece of code compiled into binary file and called from within a SQL statement.  UDFs are implemented in C++ now, but it is planned that other languages (such as Java, Python, R and so on) can be used in the future.
A Netezza box (NPS) consists of a central host (two hosts forming an NLB cluster, actually) and a bunch of SPU modules. The central host uses the RedHat operating system, while SPUs use an embedded Linux OS. A UDF runs on either a central NPS host or a SPU depending on how it is invoked. This is why it is compiled for two different target platforms and two different binaries are produced. The binaries are copied to every SPU as well as to a system location on the central hosts and registered with the database system to make them available for use in SQL statements.
Sample UDF description
I am going to show you how to implement a Netezza UDF using the following real world scenario.
It is a common practice to use natural and surrogate keys for referencing database entities and to provide entity traceability both internally and externally. A “natural key” is often a concatenation of values from several columns that come from an external system and used for uniquely identifying an entity.
For the sake of this exercise, consider a function that accepts 3 varchar columns to form a natural key (NK) using the following algorithm:
NK = upper(trim(column1)) || upper(trim(column2)) || upper(trim(column3))
Therefore, we are going to implement a UDF named generate_nk() that accepts three varchar arguments; part1, part2, and part3 and returns a natural key built in accordance with the formulae above.
How to create a UDF manually
A UDF in Netezza is a class that inherits from an abstract parent class. The base class for a UDF is Udf (what a coincidence of names!)
The base class defines several abstract functions that you as a UDF developer must implement. So let’s start by including the required headers and creating a declaration for our class:
// add base UDF support
#include "udxinc.h"
#include "udxhelpers.h"
// this is required because we are going to use a toupper() function declared in ctype.h
#include <ctype.h>
 
// start using right namespaces
using namespace nz::udx::dthelpers;
using namespace nz::udx;
 
// declare the class
class Cgenerate_nk : public Udf
{
public:
       static Udf *instantiate();
       virtual ReturnValue evaluate();
};
 
Now that we have the class skeleton, let’s have a look at the two functions that it contains:
1.       static Udf instantiate();
This function is required for creating an instance of your UDF.   Create a new instance of the Cgenerate_nk class and return it.

2.       virtual ReturnValue evaluate();
This function does the actual calculations and returns a value to the caller.
The instantiate() function body is a simple one-liner (add the following code snippets immediately after the function declaration within the same file:
// This static function creates a new instance of the class.
Udf *Cgenerate_nk::instantiate()
{
       return new Cgenerate_nk();
}
 
The evaluate() is a little bit more complex (it can be very complex in fact. Everything depends on what you are trying to achieve.) In our case, the evaluate() looks like this:
// Calculates and returns the function result.
ReturnValue Cgenerate_nk::evaluate()
{
       // find out input string lengths
       int len[3];
       len[0] = isArgNull(0) ? 0 : stringArg(0)->length;
       len[1] = isArgNull(1) ? 0 : stringArg(1)->length;
       len[2] = isArgNull(2) ? 0 : stringArg(2)->length;
      
       // create an array that holds input strings
       char* *parts = new char*[3];
       parts[0] = len[0] == 0 ? NULL : stringArg(0)->data;
       parts[1] = len[1] == 0 ? NULL : stringArg(1)->data;
       parts[2] = len[2] == 0 ? NULL : stringArg(2)->data;
      
       // prepare return value object
       StringReturn *pResult = stringReturnInfo();
       pResult->data[0] = 0;
       char*pCurrent = pResult->data;
      
       // process arguments one by one in a loop
       for (int i = 0; i < 3; i++)
       {
              if (len[i] > 0)
              {
// trim leading spaces
                     int j = 0;
                     while (parts[i][j] == ' ')
                      j++;
 
                     // trim trailing spaces
                     int k = len[i] - 1;
                     while (k > j && parts[i][k] == ' ')
                           k--;
 
                     // copy non-space chars to the result buffer converting to upper case
                     while (j <= k)
                     {
                           *pCurrent = toupper(parts[i][j]);
                           pCurrent++;
                      j++;
                     }
              }
       }
 
       // clean up  
       delete []parts;
      
       // specify the output string length
       pResult->size = pCurrent - pResult->data;
 
       // done, return the generated NK
       NZ_UDX_RETURN_STRING(pResult);
}
The UDF is implemented!
What is left is to compile the UDF and register it with the NPS.   Here are the steps.
1.       Save the file as Cgenerate_nk.cpp
2.       Log on to the NPS host using SSH
3.       Copy the Cgenerate_nk.cpp to the NPS. For the sake of this sample, I assume you are logged in to the Netezza as nz user so the home folder is /home/nz and the full UDF file path is /home/nz/Cgenerate_nk.cpp
4.       Compile and register the UDF in the target database using the following shell command:
/nz/kit/bin/adm/nzudxcompile "/home/nz/Cgenerate_nk.cpp" -o "/home/nz/Cgenerate_nk.o" --sig "generate_nk(VARCHAR(50), VARCHAR(50), VARCHAR(50))" --return "VARCHAR(150)" --class "Cgenerate_nk" --user "admin" --pw "adminpwd" --db "DEMO_DB" --nullcall

This command compiles the CPP file into a binary file and registers a UDF that accepts 3 varchar(50) parameters and returns a varchar(150) as result. The function is registered in DEMO_DB database, and the owner is ADMIN.  The function does not return NULL if any argument is null.
The function is created, compiled, registered and ready for use! A sample call is this:
select generate_nk('first', 'second', 'third')
which returns FIRSTSECONDTHIRD
Another example of the function usage is
select generate_nk(manufacturer, vendor, product_code) from admin..raw_products
This produces a natural key for every record in the raw_products table by concatenating product manufacturer, vendor and product code.
How to create a UDF using Aginity Netezza Workbench
Aginity Netezza Workbench simplifies the whole procedure described above by hiding the details from the user and allowing the user to concentrate on coding the UDF itself without being distracted by those tedious SSH operations and manual compilation. Let’s implement the same UDF using the Workbench tool.
1.       Launch the Workbench.
2.       Use File / New UDX / Function menu to start a New UDF Wizard.
3.       Specify generate_nk as the SQL name (this is the function name as it appears in SQL statements.)
4.       Go to the Arguments node and add part1, part2, and part3 arguments and specify varchar(50) as an argument data type for all of them.
5.       Go to the Returns node and specify varchar(150) as the return value.
6.       Click OK to close the wizard.
7.       The Workbench generates for you everything except the evaluate()body and even a little bit more :) Namely:
a.       Includes
b.      Namespaces
c.       Class declaration
d.      Argument accessors and helper methods for checking if an argument is NULL
e.      Fully implemented instantiate() function
f.        An optional calculateSize() function that is not useful for the sake of this sample. It is used with the functions that return variable length values (for instance varchar(any) or numeric(any). This function usage is out of scope for this post.
8.       What Workbench has not done for you is the logic within the evaluate() function. I would call it artificial intelligence if it did :) Let’s add the missing pieces to see the whole puzzle.
a.       Add line #include <ctype.h> to the includes section of the file.
b.      Copy and paste the body of the evaluate() function from the section above. Please note that the tool has generated inline helper functions for improving arguments handling. (is_arg_null_part1(),is_arg_null_part2() and is_arg_null_part3() can be used for checking the arguments for NULLs. get_arg_part1(),get_arg_part2() and get_arg_part3() provide access to argument values by name rather than by index. This way a line of code that reads
len[0] =isArgNull(0) ? 0 : stringArg(0)->length;
can be re-written in a more readable form like
len[0] = is_arg_null_part1() ? 0 : get_arg_part1()->length;
c.       Switch to the Test Query tab of the UDX designer window and specify a test SQL to be used for testing the UDF:
select generate_nk('   abc   ', ' 1 2 3 ', '   def   ');
9.       The function is created!  Press the F5 key (or use the UDX / Build & Create & Run menu) to create, and register the UDF and execute a test SQL statement.  It prompts for your Linux OS user credentials when the SSH connection is established.  You have an option to tell the tool to remember those credentials so that you are not prompted every time. The credentials are stored securely in Windows registry encrypted by a key specific to the currently logged in user (which is safe enough.)
10.   Congratulations! :)
What’s next?
I am going to provide a sample for creating a Netezza Aggregate (User Defined Aggregate - UDA) in the next article. So stay tuned!
1 2 3  Go to Page:  



  • Syndicate    
     

    Recent Posts

    Archive

    Bloggers

    Category List

    Tag Cloud

       


    MapReduce Clickstream Response Attribution
    Java AP Basket
    MapReduce Keyword Tokenization
    Interactive Reporting Patterns

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now


    Privacy Statement  |  Terms Of Use  |  Copyright 2010 by Aginity, Inc. Register   |   Login