XArray
-
class xframes.XArray(data=None, dtype=None, ignore_cast_failure=False, impl=None)
An immutable, homogeneously typed array object backed by a Spark RDD.
XArray is able to hold data that are much larger than the machine’s main memory. It fully supports missing values and random access (although random access is inefficient). The data backing an XArray is located on the cluster hosting Spark.
-
__init__(data=None, dtype=None, ignore_cast_failure=False, impl=None)
Construct a new XArray. The source of data includes: list, numpy.ndarray, pandas.Series, and URLs.
Parameters: data : list | numpy.ndarray | pandas.Series | string
The input data. If this is a list, numpy.ndarray, or pandas.Series, the data in the list is converted and stored in an XArray. Alternatively if this is a string, it is interpreted as a path (or url) to a text file. Each line of the text file is loaded as a separate row. If data is a directory where an XArray was previously saved, this is loaded as an XArray read directly out of that directory.
dtype : {int, float, str, list, array.array, dict, datetime.datetime}, optional
The data type of the XArray. If not specified, we attempt to infer it from the input. If it is a numpy array or a Pandas series, the data type of the array or series is used. If it is a list, the data type is inferred from the inner list. If it is a URL or path to a text file, we default the data type to str.
ignore_cast_failure : bool, optional
If True, ignores casting failures but warns when elements cannot be cast into the specified data type.
See also
xframes.XArray.from_const
- Constructs an XArray of a given size with a const value.
xframes.XArray.from_sequence
- Constructs an XArray by generating a sequence of consecutive numbers.
xframes.XArray.from_rdd
- Create a new XArray from a Spark RDD or Spark DataFrame.
xframes.XArray.set_trace
- Controls entry and exit tracing.
xframes.XArray.spark_context
- Returns the spark context.
xframes.XArray.spark_sql_context
- Returns the spark sql context.
xframes.XArray.hive_context
- Returns the spark hive context.
Notes
- If data is pandas.Series, the index will be ignored.
- The following functionality is currently not implemented:
- numpy.ndarray as row data
- pandas.Series data
- count_words, count_ngrams
- sketch sub_sketch_keys
Examples
>>> xa = XArray(data=[1,2,3,4,5], dtype=int)
>>> xa = XArray('s3://testdatasets/a_to_z.txt.gz')
>>> xa = XArray([[1,2,3], [3,4,5]])
>>> xa = XArray(data=[{'a':1, 'b': 2}, {'b':2, 'c': 1}])
>>> xa = XArray(data=[datetime.datetime(2011, 10, 20, 9, 30, 10)])
-
all()
Return True if every element of the XArray evaluates to True.
For numeric XArrays zeros and missing values (None) evaluate to False, while all non-zero, non-missing values evaluate to True. For string, list, and dictionary XArrays, empty values (zero length strings, lists or dictionaries) or missing values (None) evaluate to False. All other values evaluate to True.
Returns True on an empty XArray.
Returns: bool
Examples
>>> xframes.XArray([1, None]).all()
False
>>> xframes.XArray([1, 0]).all()
False
>>> xframes.XArray([1, 2]).all()
True
>>> xframes.XArray(["hello", "world"]).all()
True
>>> xframes.XArray(["hello", ""]).all()
False
>>> xframes.XArray([]).all()
True
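The truthiness rules described for all() and any() can be mimicked in plain Python. This is only an illustrative sketch that does not use xframes or Spark; `element_is_true`, `xarray_all`, and `xarray_any` are hypothetical helper names, not part of the library.

```python
def element_is_true(value):
    """Documented rule: None, zero, and empty str/list/dict count as False."""
    if value is None:
        return False
    if isinstance(value, (str, list, dict)):
        return len(value) > 0
    return value != 0


def xarray_all(values):
    """True if every element is true; True for an empty input."""
    return all(element_is_true(v) for v in values)


def xarray_any(values):
    """True if at least one element is true; False for an empty input."""
    return any(element_is_true(v) for v in values)


print(xarray_all([1, None]), xarray_any([1, None]))
```

The empty-input results follow directly from Python's built-in all() and any(), which match the behaviour documented here.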
-
any()
Return True if any element of the XArray evaluates to True.
For numeric XArrays any non-zero value evaluates to True. For string, list, and dictionary XArrays, any element of non-zero length evaluates to True.
Returns False on an empty XArray.
Returns: bool
Examples
>>> xframes.XArray([1, None]).any()
True
>>> xframes.XArray([1, 0]).any()
True
>>> xframes.XArray([0, 0]).any()
False
>>> xframes.XArray(["hello", "world"]).any()
True
>>> xframes.XArray(["hello", ""]).any()
True
>>> xframes.XArray(["", ""]).any()
False
>>> xframes.XArray([]).any()
False
-
append(other)
Append an XArray to the current XArray. Creates a new XArray with the rows from both XArrays. Both XArrays must be of the same data type.
Parameters: other : XArray
Another XArray whose rows are appended to the current XArray.
Returns: A new XArray that contains rows from both XArrays, with rows from the other XArray coming after all rows from the current XArray.
See also
xframes.XFrame.append
- Appends XFrames
Examples
>>> xa = xframes.XArray([1, 2, 3])
>>> xa2 = xframes.XArray([4, 5, 6])
>>> xa.append(xa2)
dtype: int
Rows: 6
[1, 2, 3, 4, 5, 6]
-
apply(fn, dtype=None, skip_undefined=True, seed=None)
Transform each element of the XArray by a given function.
The result XArray is of type dtype. fn should be a function that returns exactly one value which can be cast into the type specified by dtype. If dtype is not specified, the first 100 elements of the XArray are used to make a guess about the data type.
Parameters: fn : function
The function to transform each element. Must return exactly one value which can be cast into the type specified by dtype.
dtype : {int, float, str, list, array.array, dict}, optional
The data type of the new XArray. If not supplied, the first 100 elements of the array are used to guess the target data type.
skip_undefined : bool, optional
If True, will not apply fn to any missing values.
seed : int, optional
Used as the seed if a random number generator is included in fn.
Returns: The XArray transformed by fn. Each element of the XArray is of type dtype.
See also
xframes.XFrame.apply
- Applies a function to a column of an XFrame. Note that the functions differ in these two cases: on an XArray the function receives one value, on an XFrame it receives a dict of the column name/value pairs.
Examples
>>> xa = xframes.XArray([1,2,3])
>>> xa.apply(lambda x: x*2)
dtype: int
Rows: 3
[2, 4, 6]
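To make the role of skip_undefined and dtype more concrete, here is a minimal pure-Python sketch. It is not the xframes implementation: `apply_sketch` is a hypothetical helper, and passing missing values through as None when they are skipped is an assumption.

```python
def apply_sketch(values, fn, dtype=None, skip_undefined=True):
    """Apply fn to each value; optionally leave missing values (None) untouched."""
    result = []
    for value in values:
        if value is None and skip_undefined:
            # Assumption: fn is never called on a missing value, which stays missing.
            result.append(None)
            continue
        out = fn(value)
        # Cast to the requested type when one is given.
        if dtype is not None and out is not None:
            out = dtype(out)
        result.append(out)
    return result


doubled = apply_sketch([1, None, 3], lambda x: x * 2, dtype=float)
print(doubled)
```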
-
astype(dtype, undefined_on_failure=False)
Create a new XArray with all values cast to the given type. Throws an exception if the types are not castable to the given type.
Parameters: dtype : {int, float, str, list, array.array, dict, datetime.datetime}
The type to cast the elements to in XArray
undefined_on_failure: bool, optional
If set to True, runtime cast failures will be emitted as missing values rather than failing.
Returns: XArray of dtype
The XArray converted to the type dtype.
Notes
- The string parsing techniques used to handle conversion to dictionary and list types are quite generic and permit a variety of interesting formats to be interpreted. For instance, a JSON string can usually be interpreted as a list or a dictionary type. See the examples below.
- For datetime-to-string and string-to-datetime conversions, use xa.datetime_to_str() and xa.str_to_datetime() functions.
Examples
>>> xa = xframes.XArray(['1','2','3','4'])
>>> xa.astype(int)
dtype: int
Rows: 4
[1, 2, 3, 4]
Given an XArray of strings that look like dicts, convert to a dictionary type:
>>> xa = xframes.XArray(['{1:2 3:4}', '{a:b c:d}'])
>>> xa.astype(dict)
dtype: dict
Rows: 2
[{1: 2, 3: 4}, {'a': 'b', 'c': 'd'}]
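The effect of undefined_on_failure can be illustrated with a small pure-Python sketch. `cast_values` is a hypothetical helper that uses Python's own constructors for the cast; the real library uses its own parsing rules (for example for dict and list strings), which are not reproduced here.

```python
def cast_values(values, dtype, undefined_on_failure=False):
    """Cast each value to dtype; failed casts either raise or become None."""
    result = []
    for value in values:
        if value is None:
            result.append(None)  # missing values stay missing (assumption)
            continue
        try:
            result.append(dtype(value))
        except (TypeError, ValueError):
            if not undefined_on_failure:
                raise
            result.append(None)  # emit a missing value instead of failing
    return result


print(cast_values(['1', 'two', '3'], int, undefined_on_failure=True))
```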
-
clip(lower=None, upper=None)
Create a new XArray with each value clipped to be within the given bounds.
In this case, “clipped” means that values below the lower bound will be set to the lower bound value. Values above the upper bound will be set to the upper bound value. This function can operate on XArrays of numeric type as well as array type, in which case each individual element in each array is clipped. By default lower and upper are set to None, which indicates the respective bound should be ignored. The method fails if invoked on an XArray of non-numeric type.
Parameters: lower : int, optional
The lower bound used to clip. Ignored if equal to None (the default).
upper : int, optional
The upper bound used to clip. Ignored if equal to None (the default).
Returns: XArray
Examples
>>> xa = xframes.XArray([1,2,3])
>>> xa.clip(2,2)
dtype: int
Rows: 3
[2, 2, 2]
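The clipping rule, including the meaning of a None bound and the element-wise treatment of array values, can be sketched in plain Python. `clip_scalar` and `clip_sketch` are hypothetical names used only for illustration; missing values are not handled in this sketch.

```python
def clip_scalar(x, lower=None, upper=None):
    """Clip one number; a bound of None is ignored."""
    if lower is not None and x < lower:
        return lower
    if upper is not None and x > upper:
        return upper
    return x


def clip_sketch(values, lower=None, upper=None):
    """Clip numbers directly and clip each element of list-like values."""
    out = []
    for v in values:
        if isinstance(v, (list, tuple)):
            out.append([clip_scalar(e, lower, upper) for e in v])
        else:
            out.append(clip_scalar(v, lower, upper))
    return out


print(clip_sketch([1, 2, 3], 2, 2))
```

Passing only lower or only upper reproduces the behaviour documented for clip_lower and clip_upper.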
-
clip_lower(threshold)
Create a new XArray with all values clipped to the given lower bound. This function can operate on numeric arrays, as well as vector arrays, in which case each individual element in each vector is clipped. Throws an exception if the XArray is empty or the types are non-numeric.
Parameters: threshold : float
The lower bound used to clip values.
Returns: XArray
Examples
>>> xa = xframes.XArray([1,2,3])
>>> xa.clip_lower(2)
dtype: int
Rows: 3
[2, 2, 3]
-
clip_upper(threshold)
Create a new XArray with all values clipped to the given upper bound. This function can operate on numeric arrays, as well as vector arrays, in which case each individual element in each vector is clipped.
Parameters: threshold : float
The upper bound used to clip values.
Returns: XArray
Examples
>>> xa = xframes.XArray([1,2,3])
>>> xa.clip_upper(2)
dtype: int
Rows: 3
[1, 2, 2]
-
countna()
Count the number of missing values in the XArray.
A missing value is represented in a float XArray as ‘NaN’ or None. A missing value in other types of XArrays is None.
Returns: int
The count of missing values.
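The definition of a missing value (NaN or None for floats, None otherwise) can be written down as a short pure-Python sketch. `is_missing`, `count_missing`, and `drop_missing` are hypothetical helpers, not xframes functions.

```python
import math


def is_missing(value):
    """None is always missing; NaN is missing for float values."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def count_missing(values):
    """Number of missing values, in the sense used by countna()."""
    return sum(1 for v in values if is_missing(v))


def drop_missing(values):
    """Only the non-missing values, in the sense used by dropna()."""
    return [v for v in values if not is_missing(v)]


print(count_missing([1.0, float('nan'), None, 2.0]))
```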
-
datetime_to_str(str_format='%Y-%m-%dT%H:%M:%S%ZP')
Create a new XArray with all the values cast to str. The string format is specified by the ‘str_format’ parameter.
Parameters: str_format : str
The format to output the string. Default format is “%Y-%m-%dT%H:%M:%S%ZP”.
Returns: XArray of str
The XArray converted to the type ‘str’.
Examples
>>> dt = datetime.datetime(2011, 10, 20, 9, 30, 10, tzinfo=GMT(-5))
>>> xa = xframes.XArray([dt])
>>> xa.datetime_to_str('%e %b %Y %T %ZP')
dtype: str
Rows: 1
[20 Oct 2011 09:30:10 GMT-05:00]
-
dict_has_all_keys(keys)
Create a boolean XArray by checking the keys of an XArray of dictionaries.
An element of the output XArray is True if the corresponding input element’s dictionary has all of the given keys. Fails on XArrays whose data type is not dict.
Parameters: keys : list
A list of key values to check each dictionary against.
Returns: An XArray of int type, where each element indicates whether the input XArray element contains all keys in the input list.
Examples
>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_has_all_keys(["is", "this"])
dtype: int
Rows: 2
[1, 0]
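The per-element key checks can be expressed in a few lines of plain Python. This sketch covers both the "all keys" and the "any key" variants; `has_all_keys` and `has_any_keys` are hypothetical names, and the 0/1 integers mirror the int-typed result described in this reference.

```python
def has_all_keys(d, keys):
    """1 if the dictionary contains every key in keys, else 0."""
    return int(all(k in d for k in keys))


def has_any_keys(d, keys):
    """1 if the dictionary contains at least one key in keys, else 0."""
    return int(any(k in d for k in keys))


rows = [{"this": 1, "is": 5, "dog": 7}, {"this": 2, "are": 1, "cat": 5}]
all_flags = [has_all_keys(d, ["is", "this"]) for d in rows]
any_flags = [has_any_keys(d, ["is", "are"]) for d in rows]
print(all_flags, any_flags)
```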
-
dict_has_any_keys(keys)
Create a boolean XArray by checking the keys of an XArray of dictionaries. An element of the output XArray is True if the corresponding input element’s dictionary has any of the given keys. Fails on XArrays whose data type is not dict.
Parameters: keys : list
A list of key values to check each dictionary against.
Returns: An XArray of int type, where each element indicates whether the input XArray element contains any key in the input list.
Examples
>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"animal":1}, {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_has_any_keys(["is", "this", "are"])
dtype: int
Rows: 3
[1, 0, 1]
-
dict_keys()
Create an XArray that contains all the keys from each dictionary element as a list. Fails on XArrays whose data type is not dict.
Returns: An XArray of list type, where each element is a list of keys from the input XArray element.
Examples
>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_keys()
dtype: list
Rows: 2
[['this', 'is', 'dog'], ['this', 'are', 'cat']]
-
dict_trim_by_keys(keys, exclude=True)
Filter an XArray of dictionary type by the given keys. By default, all keys that are in the provided list in keys are excluded from the returned XArray.
Parameters: keys : list
A collection of keys to trim down the elements in the XArray.
exclude : bool, optional
If True, all keys that are in the input key list are removed. If False, only keys that are in the input key list are retained.
Returns: An XArray of dictionary type, with each dictionary element trimmed according to the input criteria.
Examples
>>> xa = xframes.XArray([{"this":1, "is":1, "dog":2}, {"this": 2, "are": 2, "cat": 1}])
>>> xa.dict_trim_by_keys(["this", "is", "and", "are"], exclude=True)
dtype: dict
Rows: 2
[{'dog': 2}, {'cat': 1}]
-
dict_trim_by_values(lower=None, upper=None)
Filter dictionary values to a given range (inclusive). Trimming is only performed on values which can be compared to the bound values. Fails on XArrays whose data type is not dict.
Parameters: lower : int or long or float, optional
The lowest dictionary value that would be retained in the result. If not given, lower bound is not applied.
upper : int or long or float, optional
The highest dictionary value that would be retained in the result. If not given, upper bound is not applied.
Returns: An XArray of dictionary type, with each dict element trimmed according to the input criteria.
Examples
>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_trim_by_values(2,5)
dtype: dict
Rows: 2
[{'is': 5}, {'this': 2, 'cat': 5}]
>>> xa.dict_trim_by_values(upper=5)
dtype: dict
Rows: 2
[{'this': 1, 'is': 5}, {'this': 2, 'are': 1, 'cat': 5}]
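A pure-Python sketch of the inclusive range filter may help clarify the bounds. `in_range` and `trim_by_values` are hypothetical helpers; keeping values that cannot be compared to the bounds is one interpretation of the sentence about comparable values, not confirmed library behaviour.

```python
def in_range(value, lower=None, upper=None):
    """True if value lies within [lower, upper]; a None bound is not applied."""
    try:
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    except TypeError:
        # Assumption: values that cannot be compared to the bounds are left alone.
        return True
    return True


def trim_by_values(d, lower=None, upper=None):
    """Keep only the entries whose value passes the range check."""
    return {k: v for k, v in d.items() if in_range(v, lower, upper)}


rows = [{"this": 1, "is": 5, "dog": 7}, {"this": 2, "are": 1, "cat": 5}]
both_bounds = [trim_by_values(d, 2, 5) for d in rows]
upper_only = [trim_by_values(d, upper=5) for d in rows]
print(both_bounds, upper_only)
```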
-
dict_values()
Create an XArray that contains all the values from each dictionary element as a list. Fails on XArrays whose data type is not dict.
Returns: An XArray of list type, where each element is a list of values from the input XArray element.
Examples
>>> xa = xframes.XArray([{"this":1, "is":5, "dog":7}, {"this": 2, "are": 1, "cat": 5}])
>>> xa.dict_values()
dtype: list
Rows: 2
[[1, 5, 7], [2, 1, 5]]
-
dropna()
Create a new XArray containing only the non-missing values of the XArray.
A missing value is represented in a float XArray as ‘NaN’ or None. A missing value in other types of XArrays is None.
Returns: The new XArray with missing values removed.
-
dtype()
The data type of the XArray.
Returns: type
The type of the XArray.
Examples
>>> xa = XArray(['The quick brown fox jumps over the lazy dog.'])
>>> xa.dtype()
str
>>> xa = XArray(range(10))
>>> xa.dtype()
int
-
fillna(value)
Create a new XArray with all missing values (None or NaN) filled in with the given value.
The size of the new XArray will be the same as the original XArray. If the given value is not the same type as the values in the XArray, fillna will attempt to convert the value to the original XArray’s type. If this fails, an error will be raised.
Parameters: value : type convertible to XArray’s type
The value used to replace all missing values.
Returns: A new XArray with all missing values filled.
-
filter(fn, skip_undefined=True, seed=None)
Filter this XArray by a function.
Returns a new XArray filtered by a function. If fn evaluates an element to true, this element is copied to the new XArray. If not, it isn’t. Throws an exception if the return type of fn is not castable to a boolean value.
Parameters: fn : function
Function that filters the XArray. Must evaluate to bool or int.
skip_undefined : bool, optional
If True, will not apply fn to any undefined values.
seed : int, optional
Used as the seed if a random number generator is included in fn.
Returns: The XArray filtered by fn. The elements keep the data type of the original XArray.
Examples
>>> xa = xframes.XArray([1,2,3])
>>> xa.filter(lambda x: x < 3)
dtype: int
Rows: 2
[1, 2]
-
flat_map(fn=None, dtype=None, skip_undefined=True, seed=None)
Transform each element of the XArray by a given function, which must return a list.
Each item in the result XArray is made up of a list element. The result XArray is of type dtype. fn should be a function that returns a list of values which can be cast into the type specified by dtype. If dtype is not specified, the first 100 elements of the XArray are used to make a guess about the data type.
Parameters: fn : function
The function to transform each element. Must return a list of values which can be cast into the type specified by dtype.
dtype : {None, int, float, str, list, array.array, dict}, optional
The data type of the new XArray. If None, the first 100 elements of the array are used to guess the target data type.
skip_undefined : bool, optional
If True, will not apply fn to any undefined values.
seed : int, optional
Used as the seed if a random number generator is included in fn.
Returns: The XArray transformed by fn and flattened. Each element of the XArray is of type dtype.
Examples
>>> xa = xframes.XArray([[1], [1, 2], [1, 2, 3]])
>>> xa.flat_map(lambda x: [v * 2 for v in x])
dtype: int
Rows: 6
[2, 2, 4, 2, 4, 6]
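The "transform, then flatten" behaviour can be sketched in plain Python. `flat_map_sketch` is a hypothetical helper that ignores dtype inference and missing-value handling; it only shows how the lists returned by fn are concatenated into one flat result.

```python
def flat_map_sketch(values, fn):
    """Call fn on each element (fn must return a list) and concatenate the results."""
    flattened = []
    for value in values:
        flattened.extend(fn(value))
    return flattened


result = flat_map_sketch([[1], [1, 2], [1, 2, 3]], lambda x: [v * 2 for v in x])
print(result)
```

Because each input row may contribute any number of items, the result usually has a different number of rows than the input.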
-
classmethod from_const(value, size)
Constructs an XArray of the given size, filled with a constant value.
Parameters: value : [int | float | str | array.array | datetime.datetime | list | dict]
The value to fill the XArray.
size : int
The size of the XArray. Must be positive.
Examples
Construct an XArray consisting of 10 zeroes:
>>> xframes.XArray.from_const(0, 10)
-
classmethod from_rdd(rdd, dtype, lineage=None)
Convert a Spark RDD into an XArray.
Parameters: rdd : pyspark.rdd.RDD
The Spark RDD containing the XArray values.
dtype : type
The values in rdd should have the data type dtype.
lineage: dict, optional
The lineage to apply to the rdd.
Returns: XArray
This incorporates the given RDD.
-
classmethod from_sequence(start, stop=None)
Constructs an XArray by generating a sequence of consecutive numbers.
Parameters: start : int
If stop is not given, the sequence consists of numbers 0 .. start-1. Otherwise, the sequence starts with start.
stop : int, optional
If given, the sequence consists of the numbers start, start+1, ..., stop-1. The sequence will not contain this value.
Examples
Construct an XArray of integer values from 0 to 999:
>>> XArray.from_sequence(1000)
This is equivalent to, but more efficient than:
>>> XArray(range(1000))
Construct an XArray of integer values from 10 to 999:
>>> XArray.from_sequence(10, 1000)
This is equivalent to, but more efficient than:
>>> XArray(range(10, 1000))
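The way start and stop are interpreted mirrors Python's built-in range. The sketch below, with the hypothetical helper `sequence_values`, only illustrates which numbers end up in the sequence; it does not build an XArray.

```python
def sequence_values(start, stop=None):
    """Numbers produced for the given arguments: 0..start-1, or start..stop-1."""
    if stop is None:
        start, stop = 0, start
    return list(range(start, stop))


print(sequence_values(5))
print(sequence_values(10, 13))
```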
-
head(n=10)
Returns an XArray which contains the first n rows of this XArray.
Parameters: n : int
The number of rows to fetch.
Returns: A new XArray which contains the first n rows of the current XArray.
Examples
>>> XArray(range(10)).head(5)
dtype: int
Rows: 5
[0, 1, 2, 3, 4]
-
item_length()
Length of each element in the current XArray.
Only works on XArrays of string, dict, array, or list type. If a given element is a missing value, then the corresponding output element is also a missing value. This function is equivalent to the following, but more performant:
>>> xa_item_len = xa.apply(lambda x: len(x) if x is not None else None)
Returns: A new XArray in which each element is the length of the corresponding item in the original XArray.
Examples
>>> xa = XArray([
...     {"is_restaurant": 1, "is_electronics": 0},
...     {"is_restaurant": 1, "is_retail": 1, "is_electronics": 0},
...     {"is_restaurant": 0, "is_retail": 1, "is_electronics": 0},
...     {"is_restaurant": 0},
...     {"is_restaurant": 1, "is_electronics": 1},
...     None])
>>> xa.item_length()
dtype: int
Rows: 6
[2, 3, 3, 1, 2, None]
-
lineage()
The lineage: the files that went into building this array.
Returns: dict
- key ‘table’: set[filename]
- The files that were used to build the XArray
- key ‘column’: dict{column_name: set[filename]}
- The set of files that were used to build each column
-
max()
Get maximum numeric value in XArray.
Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.
Returns: type of XArray
Maximum value of XArray
Examples
>>> xframes.XArray([14, 62, 83, 72, 77, 96, 5, 25, 69, 66]).max()
96
-
mean()
Mean of all the values in the XArray.
Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.
Returns: float
Mean of all values in XArray.
-
min()
Get minimum numeric value in XArray.
Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type.
Returns: type of XArray
Minimum value of XArray
Examples
>>> xframes.XArray([14, 62, 83, 72, 77, 96, 5, 25, 69, 66]).min()
5
-
num_missing()
Number of missing elements in the XArray.
Returns: int
Number of missing values.
-
classmethod read_text(path, delimiter=None, nrows=None, verbose=False)
Constructs an XArray from a text file or a path to multiple text files.
Parameters: path : string
Location of the text file or directory to load. If ‘path’ is a directory or a “glob” pattern, all matching files will be loaded.
delimiter : string, optional
This describes the delimiter used for separating records. Must be a single character. Defaults to newline.
nrows : int, optional
If set, only this many rows will be read from the file.
verbose : bool, optional
If True, print the progress while reading files.
Returns: XArray
Examples
Read a regular text file, with default options:
>>> path = 'http://s3.amazonaws.com/gl-testdata/rating_data_example.csv'
>>> xa = xframes.XArray.read_text(path)
>>> xa
[25904, 25907, 25923, 25924, 25928, ... ]
Read only the first 100 lines of the text file:
>>> xa = xframes.XArray.read_text(path, nrows=100)
>>> xa
[25904, 25907, 25923, 25924, 25928, ... ]
-
sample(fraction, max_partitions=None, seed=None)
Create an XArray which contains a subsample of the current XArray.
Parameters: fraction : float
The fraction of the rows to fetch. Must be between 0 and 1.
max_partitions : int, optional
After sampling, coalesce to this number of partitions. If not given, this step is skipped.
seed : int
The random seed for the random number generator.
Returns: The new XArray which contains the subsampled rows.
Examples
>>> xa = xframes.XArray(range(10))
>>> xa.sample(.3)
dtype: int
Rows: 3
[2, 6, 9]
-
save(filename, format=None)
Saves the XArray to file.
The saved XArray will be in a directory named with the filename parameter.
Parameters: filename : string
A local path or a remote URL. If format is ‘text’, it will be saved as a text file. If format is ‘binary’, a directory will be created at the location which will contain the XArray.
format : {‘binary’, ‘text’, ‘csv’}, optional
Format in which to save the XArray. Binary saved XArrays can be loaded much faster and without any format conversion losses. The values ‘text’ and ‘csv’ are synonymous: each XArray row will be written as a single line in an output text file. If not given, the format is inferred from the given filename: if the file name ends with ‘csv’ or ‘txt’, the XArray is saved in ‘csv’ format; otherwise it is saved in ‘binary’ format.
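The format inference rule can be written as a small pure-Python function. `infer_save_format` is a hypothetical helper that only restates the rule described for the format parameter; it is not taken from the xframes source.

```python
def infer_save_format(filename, format=None):
    """Return the format save() would use according to the documented rule."""
    if format is not None:
        if format not in ('binary', 'text', 'csv'):
            raise ValueError('unsupported format: {}'.format(format))
        return format  # an explicit format always wins
    if filename.endswith('csv') or filename.endswith('txt'):
        return 'csv'
    return 'binary'


print(infer_save_format('ratings.csv'))
print(infer_save_format('saved_array'))
```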
-
sketch_summary(sub_sketch_keys=None)
Summary statistics that can be calculated with one pass over the XArray.
Returns a Sketch object which can be further queried for many descriptive statistics over this XArray. Many of the statistics are approximate. See the Sketch documentation for more detail.
Parameters: sub_sketch_keys: int | str | list of int | list of str, optional
For an XArray of dict type, also constructs sketches for a given set of keys. For an XArray of array type, also constructs sketches for the given indexes. The sub sketches may be queried using element_sub_sketch(). Defaults to None, in which case no subsketches will be constructed.
Returns: Sketch object that contains descriptive statistics for this XArray. Many of the statistics are approximate.
-
sort(ascending=True)
Sort all values in this XArray.
Sort only works for XArrays of type str, int, and float; otherwise a TypeError is raised. Creates a new, sorted XArray.
Parameters: ascending: boolean, optional
If True, the XArray values are sorted in ascending order; otherwise, in descending order.
Returns: The sorted XArray.
Examples
>>> xa = XArray([3,2,1])
>>> xa.sort()
dtype: int
Rows: 3
[1, 2, 3]
-
split_datetime(column_name_prefix='X', limit=None)
Splits an XArray of datetime type into multiple columns and returns a new XFrame that contains the expanded columns. By default, an XArray of datetime is split into an XFrame of 6 columns, one for each year/month/day/hour/minute/second element.
Column naming: when splitting an XArray of datetime type, new columns are named prefix.year, prefix.month, etc. The prefix is set by the parameter “column_name_prefix” and defaults to ‘X’. If column_name_prefix is None or empty, then no prefix is used.
Parameters: column_name_prefix: str, optional
If provided, expanded column names would start with the given prefix. Defaults to “X”.
limit: str, list[str], optional
Limits the set of datetime elements to expand. Elements may be ‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’, and ‘second’.
Returns: A new XFrame that contains all expanded columns
Examples
To expand only the day and year elements of a datetime XArray:
>>> xa = XArray([datetime.datetime(2011, 1, 21, 7, 7, 21),
...              datetime.datetime(2010, 2, 5, 7, 8, 21)])
>>> xa.split_datetime(column_name_prefix=None, limit=['day', 'year'])
Columns:
    day   int
    year  int
Rows: 2
Data:
+-------+--------+
|  day  |  year  |
+-------+--------+
|   21  |  2011  |
|   5   |  2010  |
+-------+--------+
[2 rows x 2 columns]
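The column naming and the effect of limit can be sketched in plain Python using a dictionary of columns in place of an XFrame. `split_datetime_sketch` and `DATETIME_ELEMENTS` are hypothetical names, and the column order shown here is an assumption.

```python
import datetime

DATETIME_ELEMENTS = ['year', 'month', 'day', 'hour', 'minute', 'second']


def split_datetime_sketch(values, column_name_prefix='X', limit=None):
    """Return {column name: list of values} for the selected datetime elements."""
    if limit is None:
        selected = DATETIME_ELEMENTS
    elif isinstance(limit, str):
        selected = [limit]
    else:
        selected = list(limit)
    columns = {}
    for element in selected:
        # With a None or empty prefix the bare element name is used.
        if column_name_prefix:
            name = '{}.{}'.format(column_name_prefix, element)
        else:
            name = element
        columns[name] = [None if v is None else getattr(v, element) for v in values]
    return columns


dates = [datetime.datetime(2011, 1, 21, 7, 7, 21),
         datetime.datetime(2010, 2, 5, 7, 8, 21)]
print(split_datetime_sketch(dates, column_name_prefix=None, limit=['day', 'year']))
```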
-
std(ddof=0)
Standard deviation of all the values in the XArray.
Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type or if ddof >= length of XArray.
Parameters: ddof : int, optional
“delta degrees of freedom” in the variance calculation.
Returns: float
The standard deviation of all the values.
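The role of ddof is the usual one: the sum of squared deviations is divided by N - ddof rather than N. The pure-Python sketch below uses the hypothetical helpers `variance` and `std_dev`; it assumes there are no missing values and raises ValueError where this reference only says that an exception is raised.

```python
import math


def variance(values, ddof=0):
    """Variance with 'delta degrees of freedom'; None for an empty input."""
    n = len(values)
    if n == 0:
        return None
    if ddof >= n:
        raise ValueError('ddof must be smaller than the number of values')
    mean = sum(values) / float(n)
    return sum((v - mean) ** 2 for v in values) / float(n - ddof)


def std_dev(values, ddof=0):
    """Standard deviation: the square root of the variance."""
    var = variance(values, ddof)
    return None if var is None else math.sqrt(var)


print(variance([1, 2, 3, 4]), variance([1, 2, 3, 4], ddof=1))
```

With ddof=0 this is the population variance; ddof=1 gives the unbiased sample estimate.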
-
str_to_datetime(str_format=None)
Create a new XArray whose column type is datetime. The string format is specified by the ‘str_format’ parameter.
Parameters: str_format : str, optional
The string format of the input XArray. If not given, dateutil parser is used.
Returns: XArray of datetime.datetime
The XArray converted to the type ‘datetime’.
Examples
>>> xa = xframes.XArray(['20-Oct-2011 09:30:10 GMT-05:30'])
>>> xa.str_to_datetime('%d-%b-%Y %H:%M:%S %ZP')
dtype: datetime.datetime
Rows: 1
datetime.datetime(2011, 10, 20, 9, 30, 10)
>>> xa = xframes.XArray(['Aug 23, 2015'])
>>> xa.str_to_datetime()
dtype: datetime.datetime
Rows: 1
datetime.datetime(2015, 8, 23, 0, 0, 0)
-
sum()
Sum of all values in this XArray.
Raises an exception if called on an XArray of strings. If the XArray contains numeric arrays (list or array.array) and all the lists or arrays are the same length, the sum over all the arrays will be returned. If the XArray contains dictionaries whose values are numeric, then the sum of values whose keys appear in every row is returned. Returns None on an empty XArray. For large values, this may overflow without warning.
Returns: type of XArray
Sum of all values in XArray
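For array-valued and dict-valued XArrays, the sum is taken position by position or key by key. The sketch below is one possible reading of that description in plain Python; `sum_vectors` and `sum_dicts` are hypothetical helpers, and restricting the dictionary sum to keys present in every row is an interpretation of the wording above.

```python
def sum_vectors(rows):
    """Element-wise sum of equal-length numeric lists; None for no rows."""
    if not rows:
        return None
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError('all vectors must have the same length')
    return [sum(column) for column in zip(*rows)]


def sum_dicts(rows):
    """Per-key sum over the keys that appear in every dictionary."""
    if not rows:
        return None
    common = set(rows[0])
    for row in rows[1:]:
        common &= set(row)
    return {key: sum(row[key] for row in rows) for key in common}


print(sum_vectors([[1, 2], [3, 4]]))
print(sum_dicts([{'a': 1, 'b': 2}, {'a': 3}]))
```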
-
tail(n=10)
Creates an XArray that contains the last n elements in the given XArray.
Parameters: n : int
The number of elements.
Returns: A new XArray which contains the last n rows of the current XArray.
-
to_rdd(number_of_partitions=4)
Convert the current XArray to a Spark RDD.
Parameters: number_of_partitions: int, optional
The number of partitions to create in the rdd. Defaults to 4.
Returns: out: RDD
The internal RDD used to store XArray instances.
-
topk_index(topk=10, reverse=False)
Create an XArray indicating which elements are in the top k.
Entries are ‘1’ if the corresponding element in the current XArray is a part of the top k elements, and ‘0’ if that corresponding element is not. Order is descending by default.
Parameters: topk : int
The number of elements that count as the ‘top’.
reverse: bool
If True, return the topk elements in ascending order
Returns: XArray of int
Notes
This is used internally by XFrame’s topk function.
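The 0/1 indicator can be illustrated with a short pure-Python sketch. `topk_index_sketch` is a hypothetical helper; how the library breaks ties or treats missing values is not stated here, so this sketch simply keeps the earliest of equal values and assumes there are no missing values.

```python
def topk_index_sketch(values, topk=10, reverse=False):
    """1 for elements among the top k (largest by default), 0 otherwise."""
    # reverse=False means descending order, so the largest values are selected.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=not reverse)
    chosen = set(order[:topk])
    return [1 if i in chosen else 0 for i in range(len(values))]


print(topk_index_sketch([5, 1, 9, 3], topk=2))
print(topk_index_sketch([5, 1, 9, 3], topk=2, reverse=True))
```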
-
unique()
Get all unique values in the current XArray.
Will not necessarily preserve the order of the given XArray in the new XArray. Raises a TypeError if the XArray is of dictionary type.
Returns: A new XArray that contains the unique values of the current XArray.
See also
xframes.XFrame.unique
- Unique rows in XFrames.
-
unpack(column_name_prefix='X', column_types=None, na_value=None, limit=None)
Convert an XArray of list, array, or dict type to an XFrame with multiple columns.
unpack expands an XArray using the values of each list/array/dict as elements in a new XFrame of multiple columns. For example, an XArray of lists each of length 4 will be expanded into an XFrame of 4 columns, one for each list element. An XArray of lists/tuples/arrays of varying size will be expanded to a number of columns equal to the length of the longest list/array. An XArray of dictionaries will be expanded into as many columns as there are keys.
When unpacking an XArray of list or array type, new columns are named: column_name_prefix.0, column_name_prefix.1, etc. If unpacking a column of dict type, unpacked columns are named column_name_prefix.key1, column_name_prefix.key2, etc.
When unpacking an XArray of list or dictionary types, missing values in the original element remain as missing values in the resultant columns. If the na_value parameter is specified, all values equal to this given value are also replaced with missing values. In an XArray of array.array type, NaN is interpreted as a missing value.
xframes.XFrame.pack_columns() is the reverse of unpack.
Parameters: column_name_prefix: str, optional
If provided, unpacked column names would start with the given prefix.
column_types: list[type], optional
Column types for the unpacked columns. If not provided, column types are automatically inferred from first 100 rows. Defaults to None.
na_value: optional
Convert all values that are equal to na_value to missing value if specified.
limit: list, optional
Limits the set of list/array/dict keys to unpack. For list/array XArrays, ‘limit’ must contain integer indices. For dict XArray, ‘limit’ must contain dictionary keys.
Returns: A new XFrame that contains all unpacked columns
Examples
To unpack a dict XArray:
>>> xa = XArray([{ 'word': 'a', 'count': 1},
...              { 'word': 'cat', 'count': 2},
...              { 'word': 'is', 'count': 3},
...              { 'word': 'coming', 'count': 4}])
Normal case of unpacking XArray of type dict:
>>> xa.unpack(column_name_prefix=None)
Columns:
    count  int
    word   str
Rows: 4
Data:
+-------+--------+
| count |  word  |
+-------+--------+
|   1   |   a    |
|   2   |  cat   |
|   3   |   is   |
|   4   | coming |
+-------+--------+
[4 rows x 2 columns]
Unpack only keys with ‘word’:
>>> xa.unpack(limit=['word'])
Columns:
    X.word  str
Rows: 4
Data:
+--------+
| X.word |
+--------+
|   a    |
|  cat   |
|   is   |
| coming |
+--------+
[4 rows x 1 columns]
>>> xa2 = XArray([
...     [1, 0, 1],
...     [1, 1, 1],
...     [0, 1]])
Convert all zeros to missing values:
>>> xa2.unpack(column_types=[int, int, int], na_value=0)
Columns:
    X.0  int
    X.1  int
    X.2  int
Rows: 3
Data:
+------+------+------+
| X.0  | X.1  | X.2  |
+------+------+------+
|  1   | None |  1   |
|  1   |  1   |  1   |
| None |  1   | None |
+------+------+------+
[3 rows x 3 columns]
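The list case of unpack, including column naming, padding of short rows, and na_value, can be sketched in plain Python with a dictionary of columns standing in for the XFrame. `unpack_lists` is a hypothetical helper that ignores column_types and limit.

```python
def unpack_lists(rows, column_name_prefix='X', na_value=None):
    """Expand list rows into {column name: values}; short rows are padded with None."""
    width = max((len(r) for r in rows if r is not None), default=0)
    columns = {}
    for i in range(width):
        if column_name_prefix:
            name = '{}.{}'.format(column_name_prefix, i)
        else:
            name = str(i)
        column = []
        for row in rows:
            value = row[i] if row is not None and i < len(row) else None
            # Values equal to na_value are turned into missing values.
            if na_value is not None and value == na_value:
                value = None
            column.append(value)
        columns[name] = column
    return columns


unpacked = unpack_lists([[1, 0, 1], [1, 1, 1], [0, 1]], na_value=0)
print(unpacked)
```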
-
var(ddof=0)
Variance of all the values in the XArray.
Returns None on an empty XArray. Raises an exception if called on an XArray with non-numeric type or if ddof >= length of XArray.
Parameters: ddof : int, optional
“delta degrees of freedom” in the variance calculation.
Returns: float
Variance of all values in XArray.
-
vector_slice(start, end=None)
If this XArray contains vectors or recursive types, this returns a new XArray containing each individual vector sliced, between start and end, exclusive.
Parameters: start : int
The start position of the slice.
end : int, optional.
The end position of the slice. Note that the end position is NOT included in the slice. Thus a g.vector_slice(1,3) will extract entries in position 1 and 2.
Returns: Each individual vector sliced according to the arguments.
Examples
If g is a vector of floats:
>>> g = XArray([[1,2,3],[2,3,4]])
>>> g
dtype: array
Rows: 2
[array('d', [1.0, 2.0, 3.0]), array('d', [2.0, 3.0, 4.0])]
>>> g.vector_slice(0) # extracts the first element of each vector
dtype: float
Rows: 2
[1.0, 2.0]
>>> g.vector_slice(0, 2) # extracts the first two elements of each vector
dtype: array.array
Rows: 2
[array('d', [1.0, 2.0]), array('d', [2.0, 3.0])]
If a vector cannot be sliced, the result will be None:
>>> g = XArray([[1],[1,2],[1,2,3]])
>>> g
dtype: array.array
Rows: 3
[array('d', [1.0]), array('d', [1.0, 2.0]), array('d', [1.0, 2.0, 3.0])]
>>> g.vector_slice(2)
dtype: float
Rows: 3
[None, None, 3.0]
>>> g.vector_slice(0,2)
dtype: list
Rows: 3
[None, array('d', [1.0, 2.0]), array('d', [1.0, 2.0])]
If g is a vector of mixed types (float, int, str, array, list, etc.):
>>> g = XArray([['a',1,1.0],['b',2,2.0]])
>>> g
dtype: list
Rows: 2
[['a', 1, 1.0], ['b', 2, 2.0]]
>>> g.vector_slice(0) # extracts the first element of each vector
dtype: list
Rows: 2
[['a'], ['b']]
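The numeric-vector examples suggest the following per-row rule, sketched here in plain Python. `slice_vector` is a hypothetical helper modelled only on the float-vector examples; it does not reproduce the list-typed case, where a single-index slice is shown as a one-element list.

```python
def slice_vector(vec, start, end=None):
    """Slice one vector; None when the vector is too short to be sliced."""
    if vec is None:
        return None
    if end is None:
        # A single position yields the element itself.
        return vec[start] if start < len(vec) else None
    if end > len(vec):
        return None
    return vec[start:end]  # end is exclusive


rows = [[1.0], [1.0, 2.0], [1.0, 2.0, 3.0]]
single = [slice_vector(r, 2) for r in rows]
ranged = [slice_vector(r, 0, 2) for r in rows]
print(single, ranged)
```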
-