Custom Aggregator

class xframes.aggregator_property_set.AggregatorPropertySet(agg_function, output_type, default_column_name, num_args)[source]

Store aggregator properties for one aggregator.

XFrames comes with an assortment of aggregators, which are used in conjunction with xframes.groupby. You can write your own aggregators if you need to.

This class is used to define custom aggregators. We will explain how by describing the builtin SUM function.

You create a function that you use in the groupby call. This function takes arguments, for instance the column name. It returns an instance of AggregatorPropertySet and also a list of the arguments passed to it. Let’s examine SUM:

def SUM(src_column):
    return AggregatorPropertySet(agg_sum, int, 'sum', 1), [src_column]

In this example, SUM takes one argument, the name of the column to sum over. This argument is placed into an array and returned.

The AggregatorPropertyset instance tells groupby how to do its work. This includes the function to use to perform the aggregation (agg_sum), the type of the column to be created (int), the default name of the column if no name is given (‘sum’) and the number of arguments expected (1).

The agg_sum function does the actual aggregation. A simplified version of agg_sum is given below as an example.

This function takes arguments:

rows : A collection of rows. Each row is a dictionary, in the form passed to xframes.apply.

cols : A list of the arguments. These are the values returned as the second member of SUM.

Then the function (agg_sum) computes and returns the aggregated value.

Here is the code for agg_sum:

def agg_sum(rows, cols):
    src_col = cols[0]
    total = 0
    for row in rows:
        val = row[src_col]
        if not _is_missing(val):
            total += val
    return total

If you call SUM(‘some-col’) in a groupby statement, then SUM returns the AggregatorPropertySet as shown above, and [‘some-col’].

Then the groupby command is executed. For every distinct value of the grouped variable, it creates a row iterator and passes it to agg_sum, along with [‘some-col’]. The number of rows may be very large, so they are not all computed and passed as a list, but are provided by an iterator.

Then agg_sum extracts the desired value from each row (row[src_col]) and sums up the values.

The function agg_sum is executed in a spark worker node, as with the function used in xframes.apply, and the same restrictions apply.

__init__(agg_function, output_type, default_column_name, num_args)[source]

Create a new instance.

Parameters:

agg_function: func(rows, cols)

The agregator function. This is given a pyspark resultIterable produced by rdd.groupByKey and containing the rows matching a single group. It’s responsibility is to compute and return the aggregate value for the group.

output_type: type or int

If a type is given, use that type as the output column type. If an integer is given, then the output type is the same as the input type of the column indexed by the integer.

default_column_name: str

The name of the aggregate column, if not supplied explicitly.

num_args : int

The number of arguments to the agg_function.