XArray

class xframes.XArray(data=None, dtype=None, ignore_cast_failure=False, impl=None)[source]

An immutable, homogeneously typed array object backed by a Spark RDD.

XArray is able to hold data that are much larger than the machine’s main memory. It fully supports missing values and random access (although random access is inefficient). The data backing an XArray is located on the cluster hosting Spark.

__init__(data=None, dtype=None, ignore_cast_failure=False, impl=None)[source]

Construct a new XArray. The data source can be a list, numpy.ndarray, pandas.Series, or a URL.

Parameters:

data : list | numpy.ndarray | pandas.Series | string

The input data. If this is a list, numpy.ndarray, or pandas.Series, its contents are converted and stored in an XArray. Alternatively, if this is a string, it is interpreted as a path (or URL) to a text file, and each line of the file is loaded as a separate row. If data is a directory where an XArray was previously saved, the XArray is loaded directly from that directory.

dtype : {int, float, str, list, array.array, dict, datetime.datetime}, optional

The data type of the XArray. If not specified, we attempt to infer it from the input. If it is a numpy array or a Pandas series, the data type of the array or series is used. If it is a list, the data type is inferred from the inner list. If it is a URL or path to a text file, we default the data type to str.

ignore_cast_failure : bool, optional

If True, ignores casting failures but warns when elements cannot be cast into the specified data type.

See also

xframes.XArray.from_const
Constructs an XArray of a given size with a const value.
xframes.XArray.from_sequence
Constructs an XArray by generating a sequence of consecutive numbers.
xframes.XArray.from_rdd
Create a new XArray from a Spark RDD or Spark DataFrame.
xframes.XArray.set_trace
Controls entry and exit tracing.
xframes.XArray.spark_context
Returns the spark context.
xframes.XArray.spark_sql_context
Returns the spark sql context.
xframes.XArray.hive_context
Returns the spark hive context.

Notes

  • If data is pandas.Series, the index will be ignored.

The following functionality is currently not implemented:

  • numpy.ndarray as row data
  • pandas.Series data
  • count_words, count_ngrams
  • sketch sub_sketch_keys

Examples

>>> xa = XArray(data=[1,2,3,4,5], dtype=int)
>>> xa = XArray('s3://testdatasets/a_to_z.txt.gz')
>>> xa = XArray([[1,2,3], [3,4,5]])
>>> xa = XArray(data=[{'a':1, 'b': 2}, {'b':2, 'c': 1}])
>>> xa = XArray(data=[datetime.datetime(2011, 10, 20, 9, 30, 10)])
all()[source]

Return True if every element of the XArray evaluates to True.

For numeric XArrays, zeros and missing values (None) evaluate to False, while all non-zero, non-missing values evaluate to True. For string, list, and dictionary XArrays, empty values (zero-length strings, lists, or dictionaries) and missing values (None) evaluate to False. All other values evaluate to True.

Returns True on an empty XArray.

Returns:bool

Examples

>>> xframes.XArray([1, None]).all()
False
>>> xframes.XArray([1, 0]).all()
False
>>> xframes.XArray([1, 2]).all()
True
>>> xframes.XArray(["hello", "world"]).all()
True
>>> xframes.XArray(["hello", ""]).all()
False
>>> xframes.XArray([]).all()
True
any()[source]

Return True if any element of the XArray evaluates to True.

For numeric XArrays any non-zero value evaluates to True. For string, list, and dictionary XArrays, any element of non-zero length evaluates to True.

Returns False on an empty XArray.

Returns:bool

Examples

>>> xframes.XArray([1, None]).any()
True
>>> xframes.XArray([1, 0]).any()
True
>>> xframes.XArray([0, 0]).any()
False
>>> xframes.XArray(["hello", "world"]).any()
True
>>> xframes.XArray(["hello", ""]).any()
True
>>> xframes.XArray(["", ""]).any()
False
>>> xframes.XArray([]).any()
False
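
The truthiness rules described for all() and any() can be sketched in plain Python. This is only an emulation of the documented behavior on ordinary lists, not xframes code; the helper names is_truthy, emulate_all, and emulate_any are hypothetical.

```python
# Plain-Python emulation of the documented truthiness rules (not xframes code).
def is_truthy(value):
    """None, zero, and empty str/list/dict count as False; all else as True."""
    if value is None:
        return False
    return bool(value)


def emulate_all(values):
    # True only if every element is truthy; True for an empty input.
    return all(is_truthy(v) for v in values)


def emulate_any(values):
    # True if at least one element is truthy; False for an empty input.
    return any(is_truthy(v) for v in values)


print(emulate_all([1, None]))            # False
print(emulate_any([1, None]))            # True
print(emulate_all([]), emulate_any([]))  # True False
```
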
append(other)[source]

Append an XArray to the current XArray. Creates a new XArray with the rows from both XArrays. Both XArrays must be of the same data type.

Parameters:

other : XArray

Another XArray whose rows are appended to current XArray.

Returns:

XArray

A new XArray that contains rows from both XArrays, with rows from the other XArray coming after all rows from the current XArray.

See also

xframes.XFrame.append
Appends XFrames

Examples

>>> xa = xframes.XArray([1, 2, 3])
>>> xa2 = xframes.XArray([4, 5, 6])
>>> xa.append(xa2)
dtype: int
Rows: 6
[1, 2, 3, 4, 5, 6]
apply(fn, dtype=None, skip_undefined=True, seed=None)[source]

Transform each element of the XArray by a given function.

The result XArray is of type dtype. fn should be a function that returns exactly one value which can be cast into the type specified by dtype. If dtype is not specified, the first 100 elements of the XArray are used to make a guess about the data type.

Parameters:

fn : function

The function to transform each element. Must return exactly one value which can be cast into the type specified by dtype.

dtype : {int, float, str, list, array.array, dict}, optional

The data type of the new XArray. If not supplied, the first 100 elements of the array are used to guess the target data type.

skip_undefined : bool, optional

If True, will not apply fn to any missing values.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XArray

The XArray transformed by fn. Each element of the XArray is of type dtype.

See also

xframes.XFrame.apply
Applies a function to a column of an XFrame. Note that the functions differ in these two cases: on an XArray the function receives one value, on an XFrame it receives a dict of the column name/value pairs.

Examples

>>> xa = xframes.XArray([1,2,3])
>>> xa.apply(lambda x: x*2)
dtype: int
Rows: 3
[2, 4, 6]
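
As a rough illustration of how skip_undefined and dtype interact, the following plain-Python sketch applies a function to an ordinary list. It assumes that skipped missing values stay None in the result; emulate_apply is a hypothetical helper, not part of xframes.

```python
# Plain-Python sketch of apply() semantics on a list (not xframes code).
def emulate_apply(values, fn, dtype=None, skip_undefined=True):
    out = []
    for v in values:
        if v is None and skip_undefined:
            # Assumption: a skipped missing value stays missing in the output.
            out.append(None)
            continue
        result = fn(v)
        if dtype is not None and result is not None:
            result = dtype(result)  # cast the single return value to dtype
        out.append(result)
    return out


print(emulate_apply([1, None, 3], lambda x: x * 2))          # [2, None, 6]
print(emulate_apply([1, 2, 3], lambda x: x / 2, dtype=int))  # [0, 1, 1]
```
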
astype(dtype, undefined_on_failure=False)[source]

Create a new XArray with all values cast to the given type. Throws an exception if the types are not castable to the given type.

Parameters:

dtype : {int, float, str, list, array.array, dict, datetime.datetime}

The type to which the elements of the XArray are cast.

undefined_on_failure: bool, optional

If set to True, runtime cast failures will be emitted as missing values rather than failing.

Returns:

XArray of dtype

The XArray converted to the type dtype.

Notes

  • The string parsing techniques used to handle conversion to dictionary and list types are quite generic and permit a variety of interesting formats to be interpreted. For instance, a JSON string can usually be interpreted as a list or a dictionary type. See the examples below.
  • For datetime-to-string and string-to-datetime conversions, use xa.datetime_to_str() and xa.str_to_datetime() functions.

Examples

>>> xa = xframes.XArray(['1','2','3','4'])
>>> xa.astype(int)
dtype: int
Rows: 4
[1, 2, 3, 4]

Given an XArray of strings that look like dicts, convert to a dictionary type:

>>> xa = xframes.XArray(['{1:2 3:4}', '{a:b c:d}'])
>>> xa.astype(dict)
dtype: dict
Rows: 2
[{1: 2, 3: 4}, {'a': 'b', 'c': 'd'}]
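
The effect of undefined_on_failure can be pictured with a small plain-Python sketch over a list. It uses Python's built-in constructors for casting, which is an assumption and much narrower than the string parsing described in the notes; emulate_astype is a hypothetical helper.

```python
# Plain-Python sketch of astype() failure handling (not xframes code).
def emulate_astype(values, dtype, undefined_on_failure=False):
    out = []
    for v in values:
        if v is None:
            out.append(None)  # missing values stay missing
            continue
        try:
            out.append(dtype(v))
        except (TypeError, ValueError):
            if not undefined_on_failure:
                raise  # default: a failed cast is an error
            out.append(None)  # otherwise emit a missing value
    return out


print(emulate_astype(["1", "x", "3"], int, undefined_on_failure=True))  # [1, None, 3]
```
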
clip(lower=None, upper=None)[source]

Create a new XArray with each value clipped to be within the given bounds.

In this case, “clipped” means that values below the lower bound will be set to the lower bound value. Values above the upper bound will be set to the upper bound value. This function can operate on XArrays of numeric type as well as array type, in which case each individual element in each array is clipped. By default lower and upper are set to None which indicates the respective bound should be ignored. The method fails if invoked on an XArray of non-numeric type.

Parameters:

lower : int, optional

The lower bound used to clip. Ignored if equal to None (the default).

upper : int, optional

The upper bound used to clip. Ignored if equal to None (the default).

Returns:

XArray

Examples

>>> xa = xframes.XArray([1,2,3])
>>> xa.clip(2,2)
dtype: int
Rows: 3
[2, 2, 2]
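
The clipping rule, including the per-element treatment of array values and the meaning of a None bound, can be sketched in plain Python as follows. The helpers clip_value and emulate_clip are hypothetical and operate on ordinary lists rather than on an XArray.

```python
# Plain-Python sketch of clip() semantics (not xframes code).
def clip_value(x, lower=None, upper=None):
    if x is None:
        return None  # missing values are left alone in this sketch
    if lower is not None and x < lower:
        return lower
    if upper is not None and x > upper:
        return upper
    return x


def emulate_clip(values, lower=None, upper=None):
    out = []
    for v in values:
        if isinstance(v, (list, tuple)):
            # Array-typed element: clip each inner value individually.
            out.append([clip_value(e, lower, upper) for e in v])
        else:
            out.append(clip_value(v, lower, upper))
    return out


print(emulate_clip([1, 2, 3], lower=2))       # [2, 2, 3]
print(emulate_clip([1, 2, 3], upper=2))       # [1, 2, 2]
print(emulate_clip([[0, 5], [3]], 1, 4))      # [[1, 4], [3]]
```
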
clip_lower(threshold)[source]

Create new XArray with all values clipped to the given lower bound. This function can operate on numeric arrays, as well as vector arrays, in which case each individual element in each vector is clipped. Throws an exception if the XArray is empty or the types are non-numeric.

Parameters:

threshold : float

The lower bound used to clip values.

Returns:

XArray

Examples

>>> xa = xframes.XArray([1,2,3])
>>> xa.clip_lower(2)
dtype: int
Rows: 3
[2, 2, 3]
clip_upper(threshold)[source]

Create new XArray with all values clipped to the given upper bound. This function can operate on numeric arrays, as well as vector arrays, in which case each individual element in each vector is clipped.

Parameters:

threshold : float

The upper bound used to clip values.

Returns:

XArray

Examples

>>> xa = xframes.XArray([1,2,3])
>>> xa.clip_upper(2)
dtype: int
Rows: 3
[1, 2, 2]
countna()[source]

Count the number of missing values in the XArray.

A missing value is represented in a float XArray as ‘NaN’ or None. A missing value in other types of XArrays is None.

Returns:

int

The count of missing values.
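
The definition of a missing value used here (None, or NaN in a float array) can be expressed in plain Python. The sketch below works on ordinary lists; is_missing, emulate_countna, and emulate_dropna are hypothetical helper names.

```python
import math


# Plain-Python sketch of the "missing value" definition (not xframes code).
def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))


def emulate_countna(values):
    return sum(1 for v in values if is_missing(v))


def emulate_dropna(values):
    return [v for v in values if not is_missing(v)]


data = [1.0, None, float("nan"), 2.5]
print(emulate_countna(data))  # 2
print(emulate_dropna(data))   # [1.0, 2.5]
```
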

datetime_to_str(str_format='%Y-%m-%dT%H:%M:%S%ZP')[source]

Create a new XArray with all the values cast to str. The string format is specified by the ‘str_format’ parameter.

Parameters:

str_format : str

The format to output the string. Default format is “%Y-%m-%dT%H:%M:%S%ZP”.

Returns:

XArray of str

The XArray converted to the type ‘str’.

Examples

>>> dt = datetime.datetime(2011, 10, 20, 9, 30, 10, tzinfo=GMT(-5))
>>> xa = xframes.XArray([dt])
>>> xa.datetime_to_str('%e %b %Y %T %ZP')
dtype: str
Rows: 1
[20 Oct 2011 09:30:10 GMT-05:00]
dict_has_all_keys(keys)[source]

Create a boolean XArray by checking the keys of an XArray of dictionaries.

An element of the output XArray is True if the corresponding input element’s dictionary has all of the given keys. Fails on XArrays whose data type is not dict.

Parameters:

keys : list

A list of key values to check each dictionary against.

Returns:

XArray

An XArray of int type, where each element indicates whether the input XArray element contains all keys in the input list.

Examples

>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7},
                         {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_has_all_keys(["is", "this"])
dtype: int
Rows: 2
[1, 0]
dict_has_any_keys(keys)[source]

Create a boolean XArray by checking the keys of an XArray of dictionaries. An element of the output XArray is True if the corresponding input element’s dictionary has any of the given keys. Fails on XArrays whose data type is not dict.

Parameters:

keys : list

A list of key values to check each dictionary against.

Returns:

XArray

An XArray of int type, where each element indicates whether the input XArray element contains any key in the input list.

Examples

>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"animal":1},
                         {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_has_any_keys(["is", "this", "are"])
dtype: int
Rows: 3
[1, 0, 1]
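
The difference between the "all keys" and "any keys" checks can be sketched in plain Python on a list of dictionaries. The helper names and the sample rows are illustrative assumptions, not xframes code.

```python
# Plain-Python sketch of the two key checks (not xframes code).
def has_all_keys(d, keys):
    return int(all(k in d for k in keys))


def has_any_keys(d, keys):
    return int(any(k in d for k in keys))


rows = [{"a": 1, "b": 2}, {"c": 3}, {"b": 4}]
print([has_all_keys(d, ["a", "b"]) for d in rows])  # [1, 0, 0]
print([has_any_keys(d, ["a", "b"]) for d in rows])  # [1, 0, 1]
```
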
dict_keys()[source]

Create an XArray that contains all the keys from each dictionary element as a list. Fails on XArrays whose data type is not dict.

Returns:

XArray

An XArray of list type, where each element is a list of keys from the input XArray element.

Examples

>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7},
                          {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_keys()
dtype: list
Rows: 2
[['this', 'is', 'dog'], ['this', 'are', 'cat']]
dict_trim_by_keys(keys, exclude=True)[source]

Filter an XArray of dictionary type by the given keys. By default, all keys that are in the provided list in keys are excluded from the returned XArray.

Parameters:

keys : list

A collection of keys to trim down the elements in the XArray.

exclude : bool, optional

If True, all keys that are in the input key list are removed. If False, only keys that are in the input key list are retained.

Returns:

XArray

An XArray of dictionary type, with each dictionary element trimmed according to the input criteria.

Examples

>>> xa = xframes.XArray([{"this":1, "is":1, "dog":2},
                          {"this": 2, "are": 2, "cat": 1}])
>>> xa.dict_trim_by_keys(["this", "is", "and", "are"], exclude=True)
dtype: dict
Rows: 2
[{'dog': 2}, {'cat': 1}]
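
To make the effect of the exclude flag concrete, here is a plain-Python sketch on a single dictionary. The helper trim_by_keys is hypothetical and only mirrors the documented behavior.

```python
# Plain-Python sketch of dict_trim_by_keys() on one dictionary (not xframes code).
def trim_by_keys(d, keys, exclude=True):
    keys = set(keys)
    if exclude:
        # Drop every key that appears in the given list.
        return {k: v for k, v in d.items() if k not in keys}
    # Keep only the keys that appear in the given list.
    return {k: v for k, v in d.items() if k in keys}


row = {"this": 1, "is": 1, "dog": 2}
print(trim_by_keys(row, ["this", "is"]))                  # {'dog': 2}
print(trim_by_keys(row, ["this", "dog"], exclude=False))  # {'this': 1, 'dog': 2}
```
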
dict_trim_by_values(lower=None, upper=None)[source]

Filter dictionary values to a given range (inclusive). Trimming is only performed on values which can be compared to the bound values. Fails on XArrays whose data type is not dict.

Parameters:

lower : int or long or float, optional

The lowest dictionary value that would be retained in the result. If not given, lower bound is not applied.

upper : int or long or float, optional

The highest dictionary value that would be retained in the result. If not given, upper bound is not applied.

Returns:

XArray

An XArray of dictionary type, with each dict element trimmed according to the input criteria.

Examples

>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7},
                          {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_trim_by_values(2,5)
dtype: dict
Rows: 2
[{'is': 5}, {'this': 2, 'cat': 5}]
>>> xa.dict_trim_by_values(upper=5)
dtype: dict
Rows: 2
[{'this': 1, 'is': 5}, {'this': 2, 'are': 1, 'cat': 5}]
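
The inclusive range test can be sketched in plain Python for a single dictionary. The sketch assumes that "values which can be compared to the bound values" means numeric values, and that any other value is kept unchanged; trim_by_values is a hypothetical helper.

```python
import numbers


# Plain-Python sketch of dict_trim_by_values() on one dictionary (not xframes code).
def trim_by_values(d, lower=None, upper=None):
    out = {}
    for k, v in d.items():
        # Assumption: only numeric values are compared with the bounds.
        if isinstance(v, numbers.Number):
            if lower is not None and v < lower:
                continue
            if upper is not None and v > upper:
                continue
        out[k] = v
    return out


row = {"a": 1, "b": 5, "c": 7, "d": "text"}
print(trim_by_values(row, 2, 5))      # {'b': 5, 'd': 'text'}
print(trim_by_values(row, upper=5))   # {'a': 1, 'b': 5, 'd': 'text'}
```
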
dict_values()[source]

Create an XArray that contains all the values from each dictionary element as a list. Fails on XArrays whose data type is not dict.

Returns:

XArray

An XArray of list type, where each element is a list of values from the input XArray element.

Examples

>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7},
                         {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_values()
dtype: list
Rows: 2
[[1, 5, 7], [2, 1, 5]]
dropna()[source]

Create new XArray containing only the non-missing values of the XArray.

A missing value is represented in a float XArray as ‘NaN’ or None. A missing value in other types of XArrays is None.

Returns:

XArray

The new XArray with missing values removed.

dtype()[source]

The data type of the XArray.

Returns:

type

The type of the XArray.

Examples

>>> xa = XArray(['The quick brown fox jumps over the lazy dog.'])
>>> xa.dtype()
str
>>> xa = XArray(range(10))
>>> xa.dtype()
int
dump_debug_info()[source]

Print information about the Spark RDD associated with this XArray.

fillna(value)[source]

Create new XArray with all missing values (None or NaN) filled in with the given value.

The size of the new XArray will be the same as the original XArray. If the given value is not the same type as the values in the XArray, fillna will attempt to convert the value to the original XArray’s type. If this fails, an error will be raised.

Parameters:

value : type convertible to XArray’s type

The value used to replace all missing values.

Returns:

XArray

A new XArray with all missing values filled.
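
The conversion step described above can be sketched in plain Python. The sketch takes the element type as an explicit argument, which is an assumption made for illustration; emulate_fillna is a hypothetical helper and not xframes code.

```python
import math


# Plain-Python sketch of fillna() on a list (not xframes code).
def emulate_fillna(values, value, dtype):
    # Convert the fill value to the array's type; a failed conversion raises.
    fill = dtype(value)

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [fill if is_missing(v) else v for v in values]


print(emulate_fillna([1, None, 3], "0", int))             # [1, 0, 3]
print(emulate_fillna([1.5, float("nan")], 0, float))      # [1.5, 0.0]
```
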

filter(fn, skip_undefined=True, seed=None)[source]

Filter this XArray by a function.

Returns a new XArray filtered by a function. If fn evaluates an element to true, this element is copied to the new XArray; otherwise it is omitted. Throws an exception if the return type of fn is not castable to a boolean value.

Parameters:

fn : function

Function that filters the XArray. Must evaluate to bool or int.

skip_undefined : bool, optional

If True, will not apply fn to any undefined values.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XArray

The XArray filtered by fn. Each element keeps the data type of the original XArray.

Examples

>>> xa = xframes.XArray([1,2,3])
>>> xa.filter(lambda x: x < 3)
dtype: int
Rows: 2
[1, 2]
flat_map(fn=None, dtype=None, skip_undefined=True, seed=None)[source]

Transform each element of the XArray by a given function, which must return a list.

Each item in the result XArray is made up of a list element. The result XArray is of type dtype. fn should be a function that returns a list of values which can be cast into the type specified by dtype. If dtype is not specified, the first 100 elements of the XArray are used to make a guess about the data type.

Parameters:

fn : function

The function to transform each element. Must return a list of values which can be cast into the type specified by dtype.

dtype : {None, int, float, str, list, array.array, dict}, optional

The data type of the new XArray. If None, the first 100 elements of the array are used to guess the target data type.

skip_undefined : bool, optional

If True, will not apply fn to any undefined values.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XArray

The XArray transformed by fn and flattened. Each element of the XArray is of type dtype.

Examples

>>> xa = xframes.XArray([[1], [1, 2], [1, 2, 3]])
>>> xa.flat_map(lambda x: [v * 2 for v in x])
dtype: int
Rows: 6
[2, 2, 4, 2, 4, 6]
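
The flattening step can be illustrated in plain Python: each call to fn returns a list, and the lists are concatenated into one flat result. This is a sketch on an ordinary list with a hypothetical helper emulate_flat_map, not xframes code.

```python
from itertools import chain


# Plain-Python sketch of flat_map() on a list (not xframes code).
def emulate_flat_map(values, fn):
    # fn returns a list for each element; chain the lists into one flat list.
    return list(chain.from_iterable(fn(v) for v in values))


print(emulate_flat_map(["a b", "c d e"], lambda s: s.split()))
# ['a', 'b', 'c', 'd', 'e']
```
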
classmethod from_const(value, size)[source]

Constructs an XArray of the given size, filled with a constant value.

Parameters:

value : [int | float | str | array.array | datetime.datetime | list | dict]

The value to fill the XArray.

size : int

The size of the XArray. Must be positive.

Examples

Construct an XArray consisting of 10 zeroes:

>>> xframes.XArray.from_const(0, 10)
classmethod from_rdd(rdd, dtype, lineage=None)[source]

Convert a Spark RDD into an XArray

Parameters:

rdd : pyspark.rdd.RDD

The Spark RDD containing the XArray values.

dtype : type

The values in rdd should have the data type dtype.

lineage: dict, optional

The lineage to apply to the rdd.

Returns:

XArray

This incorporates the given RDD.

classmethod from_sequence(start, stop=None)[source]

Constructs an XArray by generating a sequence of consecutive numbers.

Parameters:

start : int

If stop is not given, the sequence consists of numbers 0 .. start-1. Otherwise, the sequence starts with start.

stop : int, optional

If given, the sequence consists of the numbers start, start+1, ..., stop-1. The sequence does not contain stop itself.

Examples

Construct an XArray of integer values from 0 to 999:

>>> XArray.from_sequence(1000)

This is equivalent to, but more efficient than:

>>> XArray(range(1000))

Construct an XArray of integer values from 10 to 999:

>>> XArray.from_sequence(10, 1000)

This is equivalent to, but more efficient than:

>>> XArray(range(10, 1000))

head(n=10)[source]

Returns an XArray which contains the first n rows of this XArray.

Parameters:

n : int

The number of rows to fetch.

Returns:

XArray

A new XArray which contains the first n rows of the current XArray.

Examples

>>> XArray(range(10)).head(5)
dtype: int
Rows: 5
[0, 1, 2, 3, 4]
impl()[source]

Get the impl. For internal use.

item_length()[source]

Length of each element in the current XArray.

Only works on XArrays of string, dict, array, or list type. If a given element is a missing value, then the output element is also a missing value. This function is equivalent to the following, but faster:

xa_item_len = xa.apply(lambda x: len(x) if x is not None else None)
Returns:

XArray

A new XArray in which each element is the length of the corresponding item in the original XArray.

Examples

>>> xa = XArray([
...  {"is_restaurant": 1, "is_electronics": 0},
...  {"is_restaurant": 1, "is_retail": 1, "is_electronics": 0},
...  {"is_restaurant": 0, "is_retail": 1, "is_electronics": 0},
...  {"is_restaurant": 0},
...  {"is_restaurant": 1, "is_electronics": 1},
...  None])
>>> xa.item_length()
dtype: int
Rows: 6
[2, 3, 3, 1, 2, None]
lineage()[source]

The lineage: the files that went into building this array.

Returns:

dict

  • key ‘table’: set[filename]
    The files that were used to build the XArray
  • key ‘column’: dict{column_name: set[filename]}
    The set of files that were used to build each column
max()[source]

Get maximum numeric value in XArray.

Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.

Returns:

type of XArray

Maximum value of XArray

Examples

>>> xframes.XArray([14, 62, 83, 72, 77, 96, 5, 25, 69, 66]).max()
96
mean()[source]

Mean of all the values in the XArray.

Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.

Returns:

float

Mean of all values in XArray.

min()[source]

Get minimum numeric value in XArray.

Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.

Returns:

type of XArray

Minimum value of XArray

Examples

>>> xframes.XArray([14, 62, 83, 72, 77, 96, 5, 25, 69, 66]).min()
5
nnz()[source]

Number of non-zero elements in the XArray.

Returns:

int

Number of non-zero elements.

num_missing()[source]

Number of missing elements in the XArray.

Returns:

int

Number of missing values.

classmethod read_text(path, delimiter=None, nrows=None, verbose=False)[source]

Constructs an XArray from a text file or a path to multiple text files.

Parameters:

path : string

Location of the text file or directory to load. If ‘path’ is a directory or a “glob” pattern, all matching files will be loaded.

delimiter : string, optional

This describes the delimiter used for separating records. Must be a single character. Defaults to newline.

nrows : int, optional

If set, only this many rows will be read from the file.

verbose : bool, optional

If True, print the progress while reading files.

Returns:

XArray

Examples

Read a regular text file, with default options.

>>> path = 'http://s3.amazonaws.com/gl-testdata/rating_data_example.csv'
>>> xa = xframes.XArray.read_text(path)
>>> xa
[25904, 25907, 25923, 25924, 25928,  ... ]

Read only the first 100 lines of the text file:

>>> xa = xframes.XArray.read_text(path, nrows=100)
>>> xa
[25904, 25907, 25923, 25924, 25928,  ... ]
sample(fraction, max_partitions=None, seed=None)[source]

Create an XArray which contains a subsample of the current XArray.

Parameters:

fraction : float

The fraction of the rows to fetch. Must be between 0 and 1.

max_partitions : int, optional

After sampling, coalesce to this number of partitions. If not given, do not perform this step.

seed : int, optional

The random seed for the random number generator.

Returns:

XArray

The new XArray which contains the subsampled rows.

Examples

>>> xa = xframes.XArray(range(10))
>>> xa.sample(.3)
dtype: int
Rows: 3
[2, 6, 9]
save(filename, format=None)[source]

Saves the XArray to file.

The saved XArray will be in a directory named with the filename parameter.

Parameters:

filename : string

A local path or a remote URL. If format is ‘text’, it will be saved as a text file. If format is ‘binary’, a directory will be created at the location which will contain the XArray.

format : {‘binary’, ‘text’, ‘csv’}, optional

Format in which to save the XArray. Binary saved XArrays can be loaded much faster and without any format conversion losses. The values ‘text’ and ‘csv’ are synonymous: each XArray row will be written as a single line in an output text file. If not given, the format is inferred from the filename: if the file name ends with ‘csv’ or ‘txt’, the XArray is saved in ‘csv’ format; otherwise it is saved in ‘binary’ format.
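
The format inference rule can be written down as a small plain-Python function. This is a sketch of the rule as documented, with a hypothetical helper name infer_save_format; it is not the library's own code.

```python
# Plain-Python sketch of the documented format inference (not xframes code).
def infer_save_format(filename, format=None):
    if format is not None:
        if format not in ("binary", "text", "csv"):
            raise ValueError("unknown format: %r" % (format,))
        return format  # an explicit format always wins
    if filename.endswith(("csv", "txt")):
        return "csv"
    return "binary"


print(infer_save_format("ratings.csv"))    # csv
print(infer_save_format("notes.txt"))      # csv
print(infer_save_format("saved_xarray"))   # binary
```
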

size()[source]

The size of the XArray.

sketch_summary(sub_sketch_keys=None)[source]

Summary statistics that can be calculated with one pass over the XArray.

Returns a Sketch object which can be further queried for many descriptive statistics over this XArray. Many of the statistics are approximate. See the Sketch documentation for more detail.

Parameters:

sub_sketch_keys: int | str | list of int | list of str, optional

For an XArray of dict type, also constructs sketches for the given set of keys. For an XArray of array type, also constructs sketches for the given indexes. The sub-sketches may be queried using element_sub_sketch(). Defaults to None, in which case no sub-sketches are constructed.

Returns:

Sketch

Sketch object that contains descriptive statistics for this XArray. Many of the statistics are approximate.

sort(ascending=True)[source]

Sort all values in this XArray.

Sort only works for XArrays of type str, int, and float; otherwise a TypeError is raised. Creates a new, sorted XArray.

Parameters:

ascending: boolean, optional

If True, the XArray values are sorted in ascending order; otherwise, in descending order.

Returns:

XArray

The sorted XArray.

Examples

>>> xa = XArray([3,2,1])
>>> xa.sort()
dtype: int
Rows: 3
[1, 2, 3]
split_datetime(column_name_prefix='X', limit=None)[source]

Splits an XArray of datetime type into multiple columns, returning a new XFrame that contains the expanded columns. By default, an XArray of datetime is split into an XFrame of 6 columns, one for each year/month/day/hour/minute/second element.

Column naming: when splitting an XArray of datetime type, new columns are named prefix.year, prefix.month, etc. The prefix is set by the parameter “column_name_prefix” and defaults to ‘X’. If column_name_prefix is None or empty, then no prefix is used.

Parameters:

column_name_prefix: str, optional

If provided, expanded column names would start with the given prefix. Defaults to “X”.

limit: str, list[str], optional

Limits the set of datetime elements to expand. Elements may be ‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’, and ‘second’.

Returns:

XFrame

A new XFrame that contains all expanded columns

Examples

To expand only the day and year elements of a datetime XArray:

>>> xa = XArray(
   [datetime.datetime(2011, 1, 21, 7, 7, 21),
    datetime.datetime(2010, 2, 5, 7, 8, 21)])
>>> xa.split_datetime(column_name_prefix=None,limit=['day','year'])
   Columns:
       day   int
       year  int
   Rows: 2
   Data:
   +-------+--------+
   |  day  |  year  |
   +-------+--------+
   |   21  |  2011  |
   |   5   |  2010  |
   +-------+--------+
   [2 rows x 2 columns]
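
The column naming and the limit parameter can be sketched in plain Python, representing the resulting XFrame as a dictionary that maps column names to lists. This representation and the helper emulate_split_datetime are assumptions made for illustration only.

```python
import datetime

ELEMENTS = ("year", "month", "day", "hour", "minute", "second")


# Plain-Python sketch of split_datetime() (not xframes code).
def emulate_split_datetime(values, column_name_prefix="X", limit=None):
    if limit is None:
        limit = ELEMENTS
    elif isinstance(limit, str):
        limit = [limit]
    columns = {}
    for element in limit:
        # No prefix is used when column_name_prefix is None or empty.
        name = f"{column_name_prefix}.{element}" if column_name_prefix else element
        columns[name] = [None if v is None else getattr(v, element) for v in values]
    return columns


dates = [datetime.datetime(2011, 1, 21, 7, 7, 21),
         datetime.datetime(2010, 2, 5, 7, 8, 21)]
print(emulate_split_datetime(dates, None, ["day", "year"]))
# {'day': [21, 5], 'year': [2011, 2010]}
```
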
std(ddof=0)[source]

Standard deviation of all the values in the XArray.

Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type or if ddof >= length of XArray.

Parameters:

ddof : int, optional

“delta degrees of freedom” in the variance calculation.

Returns:

float

The standard deviation of all the values.

str_to_datetime(str_format=None)[source]

Create a new XArray whose column type is datetime. The string format is specified by the ‘str_format’ parameter.

Parameters:

str_format : str, optional

The string format of the input XArray. If not given, dateutil parser is used.

Returns:

XArray of datetime.datetime

The XArray converted to the type ‘datetime’.

Examples

>>> xa = xframes.XArray(['20-Oct-2011 09:30:10 GMT-05:30'])
>>> xa.str_to_datetime('%d-%b-%Y %H:%M:%S %ZP')
dtype: datetime.datetime
Rows: 1
datetime.datetime(2011, 10, 20, 9, 30, 10)
>>> xa = xframes.XArray(['Aug 23, 2015'])
>>> xa.str_to_datetime()
dtype: datetime.datetime
Rows: 1
datetime.datetime(2015, 8, 23, 0, 0, 0)
sum()[source]

Sum of all values in this XArray.

Raises an exception if called on an XArray of strings. If the XArray contains numeric arrays (list or array.array) and all the lists or arrays are the same length, the sum over all the arrays is returned. If the XArray contains dictionaries whose values are numeric, the result is the sum of the values whose keys appear in every row. Returns None on an empty XArray. For large values, this may overflow without warning.

Returns:

type of XArray

Sum of all values in XArray
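
The array and dictionary cases can be sketched in plain Python. The sketch reads "keys appear in every row" as the intersection of the key sets of all rows, which is an interpretation rather than a confirmed detail; sum_vectors and sum_dicts are hypothetical helpers.

```python
# Plain-Python sketch of sum() for array and dict rows (not xframes code).
def sum_vectors(rows):
    # Element-wise sum over equal-length numeric lists.
    return [sum(column) for column in zip(*rows)]


def sum_dicts(rows):
    # Assumption: only keys present in every row contribute to the result.
    common = set(rows[0]).intersection(*rows[1:])
    return {k: sum(r[k] for r in rows) for k in common}


print(sum_vectors([[1, 2], [3, 4], [5, 6]]))           # [9, 12]
print(sum_dicts([{"a": 1, "b": 2}, {"a": 3, "c": 4}]))  # {'a': 4}
```
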

tail(n=10)[source]

Creates an XArray that contains the last n elements in the given XArray.

Parameters:

n : int

The number of elements.

Returns:

XArray

A new XArray which contains the last n rows of the current XArray.

to_rdd(number_of_partitions=4)[source]

Convert the current XArray to the Spark RDD.

Parameters:

number_of_partitions: int, optional

The number of partitions to create in the rdd. Defaults to 4.

Returns:

out: RDD

The internal RDD used to store XArray instances.

topk_index(topk=10, reverse=False)[source]

Create an XArray indicating which elements are in the top k.

Entries are ‘1’ if the corresponding element in the current XArray is a part of the top k elements, and ‘0’ if that corresponding element is not. Order is descending by default.

Parameters:

topk : int

The number of top elements to flag.

reverse: bool

If True, return the topk elements in ascending order

Returns:

XArray of int

Notes

This is used internally by XFrame’s topk function.
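
The 0/1 indicator output can be sketched in plain Python. The sketch assumes that reverse=True flags the k smallest elements instead of the k largest, and it does not define any particular tie-breaking; emulate_topk_index is a hypothetical helper.

```python
# Plain-Python sketch of topk_index() on a list (not xframes code).
def emulate_topk_index(values, topk=10, reverse=False):
    # Rank positions by value: largest first by default, smallest first if reverse.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=not reverse)
    chosen = set(order[:topk])
    return [1 if i in chosen else 0 for i in range(len(values))]


data = [5, 1, 9, 3, 7]
print(emulate_topk_index(data, topk=2))                # [0, 0, 1, 0, 1]
print(emulate_topk_index(data, topk=2, reverse=True))  # [0, 1, 0, 1, 0]
```
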

unique()[source]

Get all unique values in the current XArray.

Will not necessarily preserve the order of the given XArray in the new XArray. Raises a TypeError if the XArray is of dictionary type.

Returns:

XArray

A new XArray that contains the unique values of the current XArray.

See also

xframes.XFrame.unique
Unique rows in XFrames.
unpack(column_name_prefix='X', column_types=None, na_value=None, limit=None)[source]

Convert an XArray of list, array, or dict type to an XFrame with multiple columns.

unpack expands an XArray using the values of each list/array/dict as elements in a new XFrame of multiple columns. For example, an XArray of lists each of length 4 will be expanded into an XFrame of 4 columns, one for each list element. An XArray of lists/tuples/arrays of varying size will be expanded to a number of columns equal to the length of the longest list/array. An XArray of dictionaries will be expanded into as many columns as there are keys.

When unpacking an XArray of list or array type, new columns are named: column_name_prefix.0, column_name_prefix.1, etc. If unpacking a column of dict type, unpacked columns are named column_name_prefix.key1, column_name_prefix.key2, etc.

When unpacking an XArray of list or dictionary types, missing values in the original element remain as missing values in the resultant columns. If the na_value parameter is specified, all values equal to this given value are also replaced with missing values. In an XArray of array.array type, NaN is interpreted as a missing value.

xframes.XFrame.pack_columns() is the reverse of unpack.

Parameters:

column_name_prefix: str, optional

If provided, unpacked column names would start with the given prefix.

column_types: list[type], optional

Column types for the unpacked columns. If not provided, column types are automatically inferred from first 100 rows. Defaults to None.

na_value: optional

Convert all values that are equal to na_value to missing value if specified.

limit: list, optional

Limits the set of list/array/dict keys to unpack. For list/array XArrays, ‘limit’ must contain integer indices. For dict XArray, ‘limit’ must contain dictionary keys.

Returns:

XFrame

A new XFrame that contains all unpacked columns

Examples

To unpack a dict XArray

>>> xa = XArray([{ 'word': 'a',     'count': 1},
...              { 'word': 'cat',   'count': 2},
...              { 'word': 'is',    'count': 3},
...              { 'word': 'coming','count': 4}])

Normal case of unpacking XArray of type dict:

>>> xa.unpack(column_name_prefix=None)
Columns:
    count   int
    word    str

Rows: 4

Data:
+-------+--------+
| count |  word  |
+-------+--------+
|   1   |   a    |
|   2   |  cat   |
|   3   |   is   |
|   4   | coming |
+-------+--------+
[4 rows x 2 columns]

Unpack only keys with ‘word’:

>>> xa.unpack(limit=['word'])
Columns:
    X.word  str

Rows: 4

Data:
+--------+
| X.word |
+--------+
|   a    |
|  cat   |
|   is   |
| coming |
+--------+
[4 rows x 1 columns]
>>> xa2 = XArray([
...               [1, 0, 1],
...               [1, 1, 1],
...               [0, 1]])

Convert all zeros to missing values:

>>> xa2.unpack(column_types=[int, int, int], na_value=0)
Columns:
    X.0     int
    X.1     int
    X.2     int

Rows: 3

Data:
+------+------+------+
| X.0  | X.1  | X.2  |
+------+------+------+
|  1   | None |  1   |
|  1   |  1   |  1   |
| None |  1   | None |
+------+------+------+
[3 rows x 3 columns]
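
For list rows, the expansion to columns and the effect of na_value can be sketched in plain Python. The resulting XFrame is represented as a dictionary of column lists, which is an assumption for illustration; emulate_unpack_lists is a hypothetical helper.

```python
# Plain-Python sketch of unpack() for list rows (not xframes code).
def emulate_unpack_lists(rows, column_name_prefix="X", na_value=None):
    # The number of columns equals the length of the longest row.
    width = max(len(r) for r in rows if r is not None)
    columns = {}
    for i in range(width):
        name = f"{column_name_prefix}.{i}" if column_name_prefix else str(i)
        column = []
        for r in rows:
            v = r[i] if r is not None and i < len(r) else None
            if na_value is not None and v == na_value:
                v = None  # values equal to na_value become missing
            column.append(v)
        columns[name] = column
    return columns


rows = [[1, 0, 1], [1, 1, 1], [0, 1]]
print(emulate_unpack_lists(rows, na_value=0))
# {'X.0': [1, 1, None], 'X.1': [None, 1, 1], 'X.2': [1, 1, None]}
```
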
var(ddof=0)[source]

Variance of all the values in the XArray.

Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type or if ddof >= length of XArray.

Parameters:

ddof : int, optional

“delta degrees of freedom” in the variance calculation.

Returns:

float

Variance of all values in XArray.
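
The role of ddof can be made concrete with a plain-Python sketch. It assumes the usual meaning of "delta degrees of freedom", namely that the sum of squared deviations is divided by n - ddof; emulate_var and emulate_std are hypothetical helpers, not xframes code.

```python
import math


# Plain-Python sketch of var() and std() with ddof (not xframes code).
def emulate_var(values, ddof=0):
    n = len(values)
    if n == 0:
        return None  # empty input yields None, as documented
    if ddof >= n:
        raise ValueError("ddof must be smaller than the number of values")
    mean = sum(values) / n
    # Assumption: the divisor is n - ddof.
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def emulate_std(values, ddof=0):
    variance = emulate_var(values, ddof)
    return None if variance is None else math.sqrt(variance)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(emulate_var(data))  # 4.0
print(emulate_std(data))  # 2.0
```
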

vector_slice(start, end=None)[source]

If this XArray contains vectors or recursive types, returns a new XArray in which each individual vector is sliced from start (inclusive) to end (exclusive).

Parameters:

start : int

The start position of the slice.

end : int, optional.

The end position of the slice. Note that the end position is NOT included in the slice. Thus g.vector_slice(1, 3) extracts the entries in positions 1 and 2.

Returns:

XArray

Each individual vector sliced according to the arguments.

Examples

If g is a vector of floats:

>>> g = XArray([[1,2,3],[2,3,4]])
>>> g
dtype: array.array
Rows: 2
[array('d', [1.0, 2.0, 3.0]), array('d', [2.0, 3.0, 4.0])]
>>> g.vector_slice(0) # extracts the first element of each vector
dtype: float
Rows: 2
[1.0, 2.0]
>>> g.vector_slice(0, 2) # extracts the first two elements of each vector
dtype: array.array
Rows: 2
[array('d', [1.0, 2.0]), array('d', [2.0, 3.0])]

If a vector cannot be sliced, the result will be None:

>>> g = XArray([[1],[1,2],[1,2,3]])
>>> g
dtype: array.array
Rows: 3
[array('d', [1.0]), array('d', [1.0, 2.0]), array('d', [1.0, 2.0, 3.0])]
>>> g.vector_slice(2)
dtype: float
Rows: 3
[None, None, 3.0]
>>> g.vector_slice(0,2)
dtype: list
Rows: 3
[None, array('d', [1.0, 2.0]), array('d', [1.0, 2.0])]

If g is a vector of mixed types (float, int, str, array, list, etc.):

>>> g = XArray([['a',1,1.0],['b',2,2.0]])
>>> g
dtype: list
Rows: 2
[['a', 1, 1.0], ['b', 2, 2.0]]
>>> g.vector_slice(0) # extracts the first element of each vector
dtype: list
Rows: 2
[['a'], ['b']]
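
The slicing rule for numeric vectors, including the None result when a vector is too short, can be sketched in plain Python. The sketch covers only the numeric-vector examples with non-negative positions; slice_vector is a hypothetical helper, not xframes code.

```python
# Plain-Python sketch of vector_slice() for numeric vectors (not xframes code).
def slice_vector(vec, start, end=None):
    if vec is None:
        return None
    if end is None:
        # Single position: return the element, or None if out of range.
        return vec[start] if start < len(vec) else None
    if end > len(vec):
        return None  # the vector is too short to be sliced
    return vec[start:end]


g = [[1.0], [1.0, 2.0], [1.0, 2.0, 3.0]]
print([slice_vector(v, 2) for v in g])     # [None, None, 3.0]
print([slice_vector(v, 0, 2) for v in g])  # [None, [1.0, 2.0], [1.0, 2.0]]
```
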