Sketch

class xframes.Sketch(array=None, sub_sketch_keys=[], impl=None)[source]

The Sketch object contains a sketch of a single XArray (a column of an XFrame). Using a sketch representation of an XArray, many approximate and exact statistics can be computed very quickly.

To construct a Sketch object, the following methods are equivalent:

>>> my_xarray = xframes.XArray([1,2,3,4,5])
>>> sketch_ctor = xframes.Sketch(my_xarray)
>>> sketch_factory = my_xarray.sketch_summary()

Typically, the XArray is a column of an XFrame:

>>> my_xframe = xframes.XFrame({'column1': [1,2,3]})
>>> sketch_ctor = xframes.Sketch(my_xframe['column1'])
>>> sketch_factory = my_xframe['column1'].sketch_summary()

The sketch computation is fast, with complexity approximately linear in the length of the XArray. After the Sketch is computed, all queryable functions are performed nearly instantly.

A sketch can compute the following information depending on the dtype of the XArray:

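For numeric columns, the following information is provided exactly: length (size), number of missing values (num_undefined), minimum (min), maximum (max), mean (mean), sum (sum), variance (var), and standard deviation (std).
And the following information is provided approximately: number of unique values (num_unique), quantiles (quantile), frequent items (frequent_items), and the frequency count of any value (frequency_count).
For non-numeric columns (str), the following information is provided exactly: length (size) and number of missing values (num_undefined).
And the following information is provided approximately: number of unique values (num_unique), frequent items (frequent_items), and the frequency count of any value (frequency_count).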

For an XArray of type list or array, there is a sub sketch for all sub elements. The sub sketch flattens all list/array values and then computes the sketch summary over the flattened values.
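As a rough illustration of the flattening step, the plain-Python sketch below mimics what the element sub sketch summarizes. It does not call xframes: the variable rows is a hypothetical stand-in for a list-typed XArray, and the statistics are computed exactly rather than with a sketch.

```python
from itertools import chain
import statistics

# Hypothetical stand-in for a list-typed XArray column (not an xframes object).
rows = [[1, 2, 3], [4, 5]]

# Flatten every row into a single sequence of element values.
flat = list(chain.from_iterable(rows))

# Summarize the flattened values with exact population statistics.
summary = {
    "length": len(flat),
    "min": min(flat),
    "max": max(flat),
    "mean": statistics.fmean(flat),
    "variance": statistics.pvariance(flat),
}
print(summary)
```

These values agree with the Length, Min, Max, Mean, and Variance rows of the element_summary() example on this page.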

The element sub sketch may be retrieved through element_summary().

For an XArray of type dict, there are sub sketches for both the dict keys and the dict values.

The sub sketches may be retrieved through dict_key_summary() and dict_value_summary().

For an XArray of type dict, the user can also pass a list of dictionary keys to the sketch_summary function; this generates one sub sketch for each key. For example:

>>> sa = xframes.XArray([{'a':1, 'b':2}, {'a':3}])
>>> sketch = sa.sketch_summary(sub_sketch_keys=["a", "b"])

Then the sub summary may be retrieved by:

>>> sketch.element_sub_sketch()

or, to get a subset of the keys:

>>> sketch.element_sub_sketch(["a"])

Similarly, for an XArray of type vector (array), the user can also pass a list of integers, which are indexes into the vector, to get one sub sketch for each index. For example:

>>> sa = xframes.XArray([[100,200,300,400,500], [100,200,300], [400,500]])
>>> sketch = sa.sketch_summary(sub_sketch_keys=[1,3,5])

Then the sub summary may be retrieved by:

>>> sketch.element_sub_sketch()

Or, for a subset of the keys:

>>> sketch.element_sub_sketch([1,3])
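To make the grouping concrete, the plain-Python sketch below collects the values that each requested key or index would contribute to its sub sketch. It is only an illustration and does not call xframes: values_for_key is a hypothetical helper, and skipping rows that lack the key is an assumption that may differ from how the library counts missing values.

```python
def values_for_key(rows, key):
    """Collect the values stored under `key` (a dict key or a list index).

    Hypothetical helper for illustration; rows without the key are skipped,
    which is an assumption rather than documented xframes behavior.
    """
    picked = []
    for row in rows:
        if isinstance(row, dict):
            if key in row:
                picked.append(row[key])
        elif 0 <= key < len(row):
            picked.append(row[key])
    return picked


dict_rows = [{'a': 1, 'b': 2}, {'a': 4, 'd': 1}]
vector_rows = [[100, 200, 300, 400, 500], [100, 200, 300], [400, 500]]

# One list of values per requested key or index.
print({key: values_for_key(dict_rows, key) for key in ['a', 'b']})
print({index: values_for_key(vector_rows, index) for index in [1, 3, 5]})
```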

Please see the individual function documentation for details about each of these statistics.

Parameters:

array : XArray

Array to generate sketch summary.


__init__(array)[source]

Construct a new Sketch from an XArray.

Parameters:

array : XArray

Array to sketch.

sub_sketch_keys : list

The list of sub sketches to calculate. For an XArray of dictionary type, each key must be a string; for an XArray of vector (array) type, each key must be a positive integer.

avg_length()[source]

Returns the average length of the values in the XArray. Returns 0 on an empty array.

The length of a value in a numeric array is 1. The length of a list or dictionary value is the length of the list or dict. The length of a string value is the string length.

Returns:

out : float

The average length of the values. Returns 0 if the XArray is empty.

Raises:

RuntimeError

If the xarray is a non-numeric type.
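The length rules described above can be expressed in a few lines of plain Python. This is a minimal sketch of the stated definition only, not the xframes implementation; value_length and avg_length are hypothetical names, and missing values are not handled.

```python
def value_length(value):
    """Length of one value: len() for str, list, and dict; 1 for numbers."""
    if isinstance(value, (str, list, dict)):
        return len(value)
    return 1


def avg_length(values):
    """Average value length, or 0 for an empty input (as documented)."""
    if not values:
        return 0.0
    return sum(value_length(v) for v in values) / len(values)


print(avg_length([1, 2.5, 3]))          # numeric values each count as 1
print(avg_length(["ab", "abcd"]))       # string lengths 2 and 4
print(avg_length([[1, 2], {"a": 1}]))   # list length 2, dict length 1
```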

dict_key_summary()[source]

Returns the sketch summary for all dictionary keys. This is only valid for a Sketch object created from an XArray of dict type. Dictionary keys are converted to strings before the sketch summary is computed.

Examples

>>> sa = xframes.XArray([{'I':1, 'love': 2}, {'nature':3, 'beauty':4}])
>>> sa.sketch_summary().dict_key_summary()
+------------------+-------+----------+
|       item       | value | is exact |
+------------------+-------+----------+
|      Length      |   4   |   Yes    |
| # Missing Values |   0   |   Yes    |
| # unique values  |   4   |    No    |
+------------------+-------+----------+
Most frequent items:
+-------+---+------+--------+--------+
| value | I | love | beauty | nature |
+-------+---+------+--------+--------+
| count | 1 |  1   |   1    |   1    |
+-------+---+------+--------+--------+
dict_value_summary()[source]

Returns the sketch summary for all dictionary values. This is only valid for a Sketch object created from an XArray of dict type.

The type of the value summary is inferred from the first set of values.

Examples

>>> sa = xframes.XArray([{'I':1, 'love': 2}, {'nature':3, 'beauty':4}])
>>> sa.sketch_summary().dict_value_summary()
+--------------------+---------------+----------+
|        item        |     value     | is exact |
+--------------------+---------------+----------+
|       Length       |       4       |   Yes    |
|        Min         |      1.0      |   Yes    |
|        Max         |      4.0      |   Yes    |
|        Mean        |      2.5      |   Yes    |
|        Sum         |      10.0     |   Yes    |
|      Variance      |      1.25     |   Yes    |
| Standard Deviation | 1.11803398875 |   Yes    |
|  # Missing Values  |       0       |   Yes    |
|  # unique values   |       4       |    No    |
+--------------------+---------------+----------+
Most frequent items:
+-------+-----+-----+-----+-----+
| value | 1.0 | 2.0 | 3.0 | 4.0 |
+-------+-----+-----+-----+-----+
| count |  1  |  1  |  1  |  1  |
+-------+-----+-----+-----+-----+
Quantiles:
+-----+-----+-----+-----+-----+-----+-----+-----+------+
|  0% |  1% |  5% | 25% | 50% | 75% | 95% | 99% | 100% |
+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0  |
+-----+-----+-----+-----+-----+-----+-----+-----+------+

element_length_summary()[source]

Returns the sketch summary for the element length. This is only valid for a sketch constructed from an XArray of type list, array, or dict; a runtime exception is raised otherwise.

Returns:

out : Sketch

A new sketch object describing the element lengths of the current XArray.

Examples

>>> sa = xframes.XArray([[j for j in range(i)] for i in range(1,1000)])
>>> sa.sketch_summary().element_length_summary()
+--------------------+---------------+----------+
|        item        |     value     | is exact |
+--------------------+---------------+----------+
|       Length       |      999      |   Yes    |
|        Min         |      1.0      |   Yes    |
|        Max         |     999.0     |   Yes    |
|        Mean        |     500.0     |   Yes    |
|        Sum         |    499500.0   |   Yes    |
|      Variance      | 83166.6666667 |   Yes    |
| Standard Deviation | 288.386314978 |   Yes    |
|  # Missing Values  |       0       |   Yes    |
|  # unique values   |      992      |    No    |
+--------------------+---------------+----------+
Most frequent items:
+-------+---+---+---+---+---+---+---+---+---+----+
| value | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+-------+---+---+---+---+---+---+---+---+---+----+
| count | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1  |
+-------+---+---+---+---+---+---+---+---+---+----+
Quantiles:
+-----+------+------+-------+-------+-------+-------+-------+-------+
|  0% |  1%  |  5%  |  25%  |  50%  |  75%  |  95%  |  99%  |  100% |
+-----+------+------+-------+-------+-------+-------+-------+-------+
| 1.0 | 10.0 | 50.0 | 250.0 | 500.0 | 750.0 | 950.0 | 990.0 | 999.0 |
+-----+------+------+-------+-------+-------+-------+-------+-------+

element_sub_sketch(keys=None)[source]

Returns the sketch summary for the given set of keys. This is only applicable to a sketch summary created from an XArray of array or dict type. For a dict XArray, the keys are the keys in the dict values. For an array XArray, the keys are indexes into the array values.

The keys must be passed to the original sketch_summary() call in order to be retrieved later.

Parameters:

keys : list of str | str | list of int | int

The list of dictionary keys or array indexes to get sub sketches for. If not given, all available sub sketches are retrieved.

Returns:

A dictionary that maps each key (index) to the actual sketch summary for that key (index).

Examples

>>> sa = xframes.XArray([{'a':1, 'b':2}, {'a':4, 'd':1}])
>>> s = sa.sketch_summary(sub_sketch_keys=['a','b'])
>>> s.element_sub_sketch(['a'])
{'a':
 +--------------------+-------+----------+
 |        item        | value | is exact |
 +--------------------+-------+----------+
 |       Length       |   2   |   Yes    |
 |        Min         |  1.0  |   Yes    |
 |        Max         |  4.0  |   Yes    |
 |        Mean        |  2.5  |   Yes    |
 |        Sum         |  5.0  |   Yes    |
 |      Variance      |  2.25 |   Yes    |
 | Standard Deviation |  1.5  |   Yes    |
 |  # Missing Values  |   0   |   Yes    |
 |  # unique values   |   2   |    No    |
 +--------------------+-------+----------+
 Most frequent items:
 +-------+-----+-----+
 | value | 1.0 | 4.0 |
 +-------+-----+-----+
 | count |  1  |  1  |
 +-------+-----+-----+
 Quantiles:
 +-----+-----+-----+-----+-----+-----+-----+-----+------+
 |  0% |  1% |  5% | 25% | 50% | 75% | 95% | 99% | 100% |
 +-----+-----+-----+-----+-----+-----+-----+-----+------+
 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0  |
 +-----+-----+-----+-----+-----+-----+-----+-----+------+}

element_summary()[source]

Returns the sketch summary for all element values. This is only valid for a Sketch object created from an XArray of list or vector (array) type. For an XArray of list type, all list values are treated as strings for the sketch summary. For an XArray of vector type, the sketch summary is on FLOAT type.

Examples

>>> sa = xframes.XArray([[1,2,3], [4,5]])
>>> sa.sketch_summary().element_summary()
+--------------------+---------------+----------+
|        item        |     value     | is exact |
+--------------------+---------------+----------+
|       Length       |       5       |   Yes    |
|        Min         |      1.0      |   Yes    |
|        Max         |      5.0      |   Yes    |
|        Mean        |      3.0      |   Yes    |
|        Sum         |      15.0     |   Yes    |
|      Variance      |      2.0      |   Yes    |
| Standard Deviation | 1.41421356237 |   Yes    |
|  # Missing Values  |       0       |   Yes    |
|  # unique values   |       5       |    No    |
+--------------------+---------------+----------+
Most frequent items:
+-------+-----+-----+-----+-----+-----+
| value | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
+-------+-----+-----+-----+-----+-----+
| count |  1  |  1  |  1  |  1  |  1  |
+-------+-----+-----+-----+-----+-----+
Quantiles:
+-----+-----+-----+-----+-----+-----+-----+-----+------+
|  0% |  1% |  5% | 25% | 50% | 75% | 95% | 99% | 100% |
+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 5.0 | 5.0  |
+-----+-----+-----+-----+-----+-----+-----+-----+------+

frequency_count(element)[source]

Returns a sketched estimate of the number of occurrences of a given element. This estimate is based on the count sketch. The element type must be of the same type as the input XArray. Throws an exception if element is of the incorrect type.

Parameters:

element : val

An element of the same type as the XArray.

Returns:

out : int

An estimate of the number of occurrences of the element.

Raises:

RuntimeError

Throws an exception if element is of the incorrect type.
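The idea behind a count sketch can be illustrated in plain Python. The class below is a minimal, self-contained sketch of the general technique, not the xframes implementation: the class name, the depth and width defaults, and the BLAKE2-based hashing are all assumptions made for this example.

```python
import hashlib
import statistics


class CountSketch:
    """Illustrative count sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256):
        # An odd depth keeps the median of the row estimates an integer.
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        # Derive an independent-looking hash per row by salting with the row index.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        bucket = (value >> 1) % self.width
        sign = 1 if value & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        # Each row adds the signed count to exactly one bucket.
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # The median over rows damps the noise from colliding items.
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            estimates.append(sign * self.table[row][bucket])
        return int(statistics.median(estimates))


sketch = CountSketch()
for word in ["a", "b", "a", "a", "b"]:
    sketch.add(word)
print(sketch.estimate("a"), sketch.estimate("b"))
```

Because colliding items can add or subtract from a counter, an estimate may land above or below the true count, which is why the result is described as an estimate.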

frequent_items()[source]

Returns a sketched estimate of the most frequent elements in the XArray based on the SpaceSaving sketch. It is only guaranteed that all elements which appear in more than 0.01% of the rows of the array will appear in the set of returned elements. However, other elements may also appear in the result. The item counts are estimated using the CountSketch.

Missing values are not taken into account when computing frequent items.

If this function returns no elements, it means that every element appears in fewer than 0.01% of the rows.

Returns:

out : dict

A dictionary mapping items to their estimated occurrence frequencies.
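The SpaceSaving technique named above can be sketched in plain Python as follows. This is an illustration of the general algorithm, not the xframes code: space_saving and its capacity parameter are hypothetical names. With capacity counters, any item that occurs more than n / capacity times in a stream of n items is guaranteed to be retained, so a 0.01% threshold would correspond to roughly 10,000 counters; whether the library uses that capacity is an assumption.

```python
def space_saving(stream, capacity):
    """Track at most `capacity` candidate frequent items (illustrative only)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer inherits
            # that count plus one, so counts can only overestimate.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


stream = list("abacabadaeaf")   # 'a' occurs 6 times out of 12
result = space_saving(stream, capacity=3)
print(result)
```

Items other than the truly frequent ones can remain in the result, which matches the note above that other elements may also appear.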

max()[source]

Returns the maximum value in the XArray. Returns nan on an empty array. Throws an exception if called on an XArray with non-numeric type.

Returns:

out : type of XArray

Maximum value of XArray. Returns nan if the XArray is empty.

Raises:

RuntimeError

Throws an exception if the XArray is a non-numeric type.

mean()[source]

Returns the mean of the values in the XArray. Returns 0 on an empty array. Throws an exception if called on an XArray with non-numeric type.

Returns:

out : float

Mean of all values in XArray. Returns 0 if the XArray is empty.

Raises:

RuntimeError

If the XArray is a non-numeric type.

min()[source]

Returns the minimum value in the XArray. Returns nan on an empty array. Throws an exception if called on an XArray with non-numeric type.

Returns:

out : type of XArray

Minimum value of XArray. Returns nan if the XArray is empty.

Raises:

RuntimeError

If the XArray is a non-numeric type.

num_undefined()[source]

Returns the number of undefined elements in the XArray. Returns 0 on an empty XArray.

Returns:

out : int

The number of missing values in the XArray.

num_unique()[source]

Returns a sketched estimate of the number of unique values in the XArray based on the HyperLogLog sketch.

Returns:

out : float

An estimate of the number of unique values in the XArray.
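For readers unfamiliar with HyperLogLog, the function below is a compact plain-Python sketch of the general technique. It is not the xframes implementation: the function name, the precision default, and the SHA-1 hashing are assumptions made for illustration. It also shows why the result is a float rather than an int.

```python
import hashlib
import math


def hyperloglog_estimate(items, precision=10):
    """Estimate the number of distinct items using 2**precision registers."""
    m = 1 << precision
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        index = x >> (64 - precision)                    # first bits pick a register
        rest = x & ((1 << (64 - precision)) - 1)         # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - precision) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                     # bias constant, m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


values = [f"user-{i % 5000}" for i in range(20000)]      # 5000 distinct values
print(hyperloglog_estimate(values))
```

Duplicates never change the registers, so the estimate depends only on the set of distinct values; the expected relative error shrinks as the number of registers grows.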

quantile(quantile_val)[source]

Returns a sketched estimate of the value at a particular quantile between 0.0 and 1.0. The quantile is guaranteed to be accurate within 1%: meaning that if you ask for the 0.55 quantile, the returned value is guaranteed to be between the true 0.54 quantile and the true 0.56 quantile. The quantiles are only defined for numeric arrays and this function will raise an exception if called on a sketch constructed for a non-numeric column.

Parameters:

quantile_val : float

A value between 0.0 and 1.0 inclusive. Values below 0.0 will be interpreted as 0.0. Values above 1.0 will be interpreted as 1.0.

Returns:

out : float | str

An estimate of the value at a quantile.

Raises:

RuntimeError

If the XArray is a non-numeric type.
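The accuracy guarantee can be checked against exact quantiles in plain Python. The helpers below are a sketch under stated assumptions, not part of xframes: exact_quantile and within_guarantee are hypothetical names, and the convention of taking the element at index floor(q * n) is an assumption, chosen because it reproduces the Quantiles tables in the examples on this page.

```python
def exact_quantile(sorted_values, q):
    """Exact quantile of pre-sorted data, clamping q to [0.0, 1.0]."""
    q = min(max(q, 0.0), 1.0)
    index = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def within_guarantee(sorted_values, q, estimate, epsilon=0.01):
    """True if `estimate` lies between the true (q - epsilon) and (q + epsilon) quantiles."""
    low = exact_quantile(sorted_values, q - epsilon)
    high = exact_quantile(sorted_values, q + epsilon)
    return low <= estimate <= high


data = list(range(1, 1000))                # the lengths 1..999 from the example above
print(exact_quantile(data, 0.5))
print(within_guarantee(data, 0.55, 550))   # inside the 0.54 to 0.56 band
print(within_guarantee(data, 0.55, 570))   # outside the band
```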

set_frequency_sketch_parms(num_items=None, epsilon=None, delta=None)[source]

Set the frequency sketch accuracy settings.

Parameters:

num_items: int, optional

The number of “most frequent” values that are tracked.

epsilon: float (0 .. 1.0), optional

The precision of the result.

delta: float (0 .. 1.0), optional

The probability that the precision specified above is not achieved.

set_quantile_accumulator_parms(num_levels=None, epsilon=None, delta=None)[source]

Set the quantile accumulator accuracy settings.

Parameters:

num_levels: int, optional

The number of levels of hash map.

epsilon: float (0 .. 1.0), optional

The precision of the result.

delta: float (0 .. 1.0), optional

The probability that the precision specified above is not achieved.

size()[source]

Returns the size of the input XArray.

Returns:

out : int

The number of elements of the input XArray.

std()[source]

Returns the standard deviation of the values in the XArray. Returns 0 on an empty array. Throws an exception if called on an XArray with non-numeric type.

Returns:

out : float

The standard deviation of all the values. Returns 0 if the XArray is empty.

Raises:

RuntimeError

If the XArray is a non-numeric type.

sum()[source]

Returns the sum of all the values in the XArray. Returns 0 on an empty array. Throws an exception if called on an XArray with non-numeric type. Will overflow without warning.

Returns:

out : type of XArray

Sum of all values in XArray. Returns 0 if the XArray is empty.

Raises:

RuntimeError

If the XArray is a non-numeric type.

tf_idf()[source]

Returns a tf-idf analysis of each document in a collection.

If the elements in the column are documents in string form, then a simple splitter is used to create a list of words.

If the elements are already in list form, then the list elements are used as the terms. These are usually strings, but could be numeric instead.

Returns:

out : XArray of dict

For each document, a dictionary mapping terms to their tf_idf score.
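The shape of the result can be illustrated with a small plain-Python tf-idf. This is a sketch under assumptions, not the xframes implementation: the page does not state the weighting formula, so the raw term count for tf, log(N / document frequency) for idf, and whitespace splitting for string documents are all choices made for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (assumed weighting, see above)."""
    # Strings are split on whitespace; lists are used as the terms directly.
    term_lists = [
        doc.split() if isinstance(doc, str) else list(doc) for doc in documents
    ]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))
    n_docs = len(term_lists)
    return [
        {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in Counter(terms).items()
        }
        for terms in term_lists
    ]


scores = tf_idf(["a b a", ["b", "c"]])
print(scores)
```

Under this weighting, a term that occurs in every document scores 0, because log(N / N) is 0.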

var()[source]

Returns the variance of the values in the XArray. Returns 0 on an empty array. Throws an exception if called on an XArray with non-numeric type.

Returns:

out : float

The variance of all the values. Returns 0 if the XArray is empty.

Raises:

RuntimeError

If the XArray is a non-numeric type.
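The example tables on this page report a Variance of 1.25 and a Standard Deviation of 1.11803398875 for the values 1, 2, 3, 4, which is consistent with the population (divide by n) definitions rather than the sample (divide by n - 1) ones. The plain-Python sketch below reproduces those numbers; var and std here are hypothetical stand-ins written with the standard library, not the xframes methods.

```python
import statistics


def var(values):
    """Population variance, or 0 for an empty input (as documented)."""
    return statistics.pvariance(values) if values else 0.0


def std(values):
    """Population standard deviation, or 0 for an empty input."""
    return statistics.pstdev(values) if values else 0.0


values = [1, 2, 3, 4]
print(var(values), std(values))
print(statistics.variance(values))   # sample variance, shown for contrast
```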