XFrame

class xframes.XFrame(data=None, format='auto', impl=None, verbose=False)[source]

A tabular, column-mutable dataframe object that can scale to big data. XFrame is able to hold data that are much larger than the machine’s main memory. The data in XFrame is stored row-wise in a Spark RDD. Each row of the RDD is a list, whose elements correspond to the values in each column. The column names and types are stored in the XFrame instance, and give the mapping to the row list.
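
As a rough illustration of the row-wise layout described above, the following plain-Python sketch models rows as lists and uses the column names to find a value's position. It does not use Spark or xframes, and the names `rows` and `column_values` are invented for this example.

```python
# Illustrative model only: a real XFrame keeps its rows in a Spark RDD.
column_names = ['id', 'val']
column_types = [int, str]

# Each row is a list; position i holds the value for column_names[i].
rows = [[1, 'A'], [2, 'B'], [3, 'C']]


def column_values(rows, column_names, name):
    """Return the values of one column by looking up its position."""
    index = column_names.index(name)
    return [row[index] for row in rows]


print(column_values(rows, column_names, 'val'))  # ['A', 'B', 'C']
```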

__init__(data=None, format='auto', impl=None, verbose=False)[source]

Construct a new XFrame from a url, a pandas.DataFrame or a Spark RDD or DataFrame.

An XFrame can be constructed from the following data formats:

  • csv file (comma separated value)
  • xframe directory archive (a directory where an XFrame was saved previously)
  • a Spark RDD plus the column names and types
  • a spark.DataFrame
  • general text file (with csv parsing options; see read_csv())
  • parquet file
  • a Python dictionary
  • pandas.DataFrame
  • JSON
  • Apache Avro

and from the following sources:

  • your local file system
  • the XFrame Server’s file system
  • HDFS
  • Hive
  • Amazon S3
  • HTTP(S)

Only basic examples of construction are covered here. For more information and examples, please see the User Guide.

XFrames are immutable except for assignments to a column.

Parameters:

data : array | pandas.DataFrame | spark.rdd | spark.DataFrame | string | dict, optional

The actual interpretation of this field is dependent on the format parameter. If data is an array, Pandas DataFrame or Spark RDD, the contents are stored in the XFrame. If data is an object supporting iteritems, then it is handled like a dictionary. If data is an object supporting iteration, then the values are iterated to form the XFrame. If data is a string, it is interpreted as a file. Files can be read from the local file system or from urls (hdfs://, s3://, or other Hadoop-supported file systems). To read files from s3, you must set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, even if the file is publicly accessible.

format : string, optional

Format of the data. The default, “auto” will automatically infer the input data format. The inference rules are simple: If the data is an array or a dataframe, it is associated with ‘array’ and ‘dataframe’ respectively. If the data is a string, it is interpreted as a file, and the file extension is used to infer the file format. The explicit options are:

  • “auto”
  • “array”
  • “dict”
  • “xarray”
  • “pandas.dataframe”
  • “csv”
  • “tsv”
  • “psv”
  • “parquet”
  • “rdd”
  • “spark.dataframe”
  • “hive”
  • “xframe”

verbose : bool, optional

If True, print the progress while reading a file.

See also

xframes.XFrame.read_csv
Create a new XFrame from a csv file. Preferred for text and CSV formats, because it has a lot more options for controlling the parser.
xframes.XFrame.read_parquet
Read an XFrame from a parquet file.
xframes.XFrame.from_rdd
Create a new XFrame from a Spark RDD or Spark DataFrame. Column names and types can be specified if a spark RDD is given; otherwise they are taken from the DataFrame.
xframes.XFrame.save
Save an XFrame in a file for later use within XFrames or Spark.
xframes.XFrame.load
Load an XFrame from a file. The filename extension is used to determine the file format.
xframes.XFrame.set_trace
Controls entry and exit tracing.
xframes.XFrame.spark_context
Returns the spark context.
xframes.XFrame.spark_sql_context
Returns the spark sql context.

Notes

The following functionality is currently not implemented.
  • pack_columns data types except list, array, and dict
  • groupby quantile

Examples

Create an XFrame from a Python dictionary.

>>> from xframes import XFrame
>>> sf = XFrame({'id':[1,2,3], 'val':['A','B','C']})
>>> sf
Columns:
    id  int
    val str
Rows: 3
Data:
      id  val
   0  1   A
   1  2   B
   2  3   C

Create an XFrame from a remote CSV file.

>>> url = 'http://testdatasets.s3-website-us-west-2.amazonaws.com/users.csv.gz'
>>> xf = XFrame.read_csv(url,
...     delimiter=',', header=True, comment_char="#",
...     column_type_hints={'user_id': int})
__getitem__(key)[source]

This provides XFrame “indexing”, for example xf[‘column_name’]. The type of the index determines what the construct does: selecting a column, doing a logical filter, or returning one or more rows from the XFrame.

This method does things based on the type of key.

If key is:
  • str Calls select_column on key to return a single column as an XArray.
  • XArray Performs a logical filter. Expects given XArray to be the same length as all columns in current XFrame. Every row corresponding with an entry in the given XArray that is equivalent to False is filtered from the result.
  • int Returns a single row of the XFrame (the `key`th one) as a dictionary.
  • slice Returns an XFrame including only the sliced rows.
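
The dispatch on the type of key can be sketched in plain Python as follows. This is not the xframes implementation: `get_item` is a hypothetical helper, a dict of lists stands in for the XFrame, and a list of booleans stands in for the XArray used as a logical filter.

```python
def get_item(columns, key):
    """Dispatch on the type of key, following the cases listed above.

    columns is a dict of column name -> list of values (a stand-in
    for an XFrame, used only for this illustration).
    """
    names = list(columns)
    if isinstance(key, str):
        # str: select a single column.
        return columns[key]
    if isinstance(key, int):
        # int: return one row as a dictionary.
        return {name: columns[name][key] for name in names}
    if isinstance(key, slice):
        # slice: keep only the sliced rows.
        return {name: columns[name][key] for name in names}
    if isinstance(key, list):
        # mask: drop every row whose mask entry is falsy.
        for name in names:
            if len(columns[name]) != len(key):
                raise ValueError('mask length must match the column length')
        return {name: [value for value, keep in zip(columns[name], key) if keep]
                for name in names}
    raise TypeError('unsupported key type: %r' % type(key))


table = {'id': [4, 6, 8], 'val': ['D', 'F', 'H']}
print(get_item(table, 'val'))                 # ['D', 'F', 'H']
print(get_item(table, 1))                     # {'id': 6, 'val': 'F'}
print(get_item(table, [True, False, True]))   # {'id': [4, 8], 'val': ['D', 'H']}
```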

Examples

>>> xf = xframes.XFrame({'id': [4, 6, 8], 'val': ['D', 'F', 'H']})
>>> xf
add_column(col, name='')[source]

Add a column to this XFrame. The length of the new column must match the length of the existing XFrame. This operation returns a new XFrame with the additional column. If no name is given, a default name is chosen.

Parameters:

col : XArray

The ‘column’ of data to add.

name : string, optional

The name of the column. If no name is given, a default name is chosen.

Returns:

XFrame

A new XFrame with the new column.

See also

xframes.XFrame.add_columns
Adds multiple columns.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xa = xframes.XArray(['cat', 'dog', 'fossa'])
>>> # This line is equivalent to `xf['species'] = xa`
>>> xf2 = xf.add_column(xa, name='species')
>>> xf2
+----+-----+---------+
| id | val | species |
+----+-----+---------+
| 1  |  A  |   cat   |
| 2  |  B  |   dog   |
| 3  |  C  |  fossa  |
+----+-----+---------+
[3 rows x 3 columns]
add_columns(cols, names=None)[source]

Adds multiple columns to this XFrame. The length of the new columns must match the length of the existing XFrame. This operation returns a new XFrame with the additional columns.

Parameters:

cols : XArray or list of XArray or XFrame

The columns to add. If cols is an XFrame, all columns in it are added.

names : string or list of string, optional

If cols is an XArray, then the name of the column. If no name is given, a default name is chosen. If cols is a list of XArray, then a list of column names. All names must be specified. The names argument is ignored if cols is an XFrame. If there are columns with duplicate names, they will be made unambiguous by adding .1 to the second copy.

Returns:

XFrame

The XFrame with additional columns.

See also

xframes.XFrame.add_column
Adds one column.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xa = xframes.XArray(['cat', 'dog', 'fossa'])
>>> # This line is equivalent to `xf['species'] = xa`
>>> xf2 = xf.add_columns(xa, names='species')
>>> xf2
+----+-----+---------+
| id | val | species |
+----+-----+---------+
| 1  |  A  |   cat   |
| 2  |  B  |   dog   |
| 3  |  C  |  fossa  |
+----+-----+---------+
[3 rows x 3 columns]
>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xf2 = xframes.XFrame({'species': ['cat', 'dog', 'horse'],
...                        'age': [3, 5, 9]})
>>> xf3 = xf.add_columns(xf2)
>>> xf3
+----+-----+-----+---------+
| id | val | age | species |
+----+-----+-----+---------+
| 1  |  A  |  3  |   cat   |
| 2  |  B  |  5  |   dog   |
| 3  |  C  |  9  |  horse  |
+----+-----+-----+---------+
[3 rows x 4 columns]
add_row_number(column_name='id', start=0)[source]

Returns a new XFrame with a new column that numbers each row sequentially. By default the count starts at 0, but this can be changed to a positive or negative number. The new column will be named with the given column name. An error will be raised if the given column name already exists in the XFrame.

Parameters:

column_name : str, optional

The name of the new column that will hold the row numbers.

start : int, optional

The number used to start the row number count.

Returns:

XFrame

The new XFrame with the added row-number column.

Notes

Row numbers are stored as signed 64-bit integers, so beware of overflow if the values in the row-number column could exceed 2^63 - 1 (about 9.2 quintillion).

Examples

>>> xf = xframes.XFrame({'a': [1, None, None], 'b': ['a', 'b', None]})
>>> xf.add_row_number()
+----+------+------+
| id |  a   |  b   |
+----+------+------+
| 0  |  1   |  a   |
| 1  | None |  b   |
| 2  | None | None |
+----+------+------+
[3 rows x 3 columns]
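
The numbering behavior can be sketched in plain Python as below. This is only a model of the documented semantics, not the library code: the function operates on a dict of lists rather than an XFrame, and raising ValueError for a duplicate name is an assumption (the text only says that an error is raised).

```python
def add_row_number(columns, column_name='id', start=0):
    """Return a new dict of columns with a sequential row-number column first."""
    if column_name in columns:
        # The documentation says an error is raised; the exact type is assumed.
        raise ValueError('column already exists: %s' % column_name)
    num_rows = len(next(iter(columns.values()))) if columns else 0
    numbered = {column_name: list(range(start, start + num_rows))}
    numbered.update(columns)
    return numbered


result = add_row_number({'a': [1, None, None], 'b': ['a', 'b', None]})
print(result['id'])   # [0, 1, 2]
```

A negative start works the same way: with start=-1 and two rows, the new column holds -1 and 0.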
append(other)[source]

Add the rows of an XFrame to the end of this XFrame.

Both XFrames must have the same set of columns with the same column names and column types.

Parameters:

other : XFrame

Another XFrame whose rows are appended to the current XFrame.

Returns:

XFrame

The result XFrame from the append operation.

Examples

>>> xf = xframes.XFrame({'id': [4, 6, 8], 'val': ['D', 'F', 'H']})
>>> xf2 = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xf = xf.append(xf2)
>>> xf
+----+-----+
| id | val |
+----+-----+
| 4  |  D  |
| 6  |  F  |
| 8  |  H  |
| 1  |  A  |
| 2  |  B  |
| 3  |  C  |
+----+-----+
[6 rows x 2 columns]
apply(fn, dtype=None, use_columns=None, seed=None)[source]

Transform each row of the XFrame into a single value according to a specified function. Returns a new XArray of type dtype in which each element is fn(x), where x is a single row of the XFrame represented as a dictionary. The fn should return exactly one value, which can be cast to type dtype. If dtype is not specified, the first 100 rows of the XFrame are used to guess the target data type.

Parameters:

fn : function

The function to transform each row of the XFrame. The return type should be convertible to dtype if dtype is not None.

dtype : data type, optional

The dtype of the new XArray. If None, the first 100 elements of the array are used to guess the target data type.

use_columns : str | list[str], optional

The column or list of columns to be supplied in the row passed to the function. If not given, all columns will be used to build the row.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XArray

The XArray transformed by fn. Each element of the XArray is of type dtype

Examples

Concatenate strings from several columns:

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                      'rating': [4, 5, 1]})
>>> xf.apply(lambda x: str(x['user_id']) + str(x['movie_id']) + str(x['rating']))
dtype: str
Rows: 3
['134', '235', '361']
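
The row-to-value mapping and the dtype guess can be modeled in plain Python as follows. This is a sketch under stated assumptions, not the library's method: `apply_rows` and `sample_size` are invented names, rows are plain dictionaries, and guessing the type from the first non-missing result in the sample is an assumed rule (the documentation only says that the first 100 rows are used).

```python
def apply_rows(rows, fn, dtype=None, sample_size=100):
    """Apply fn to each row dict and cast the results to dtype."""
    results = [fn(row) for row in rows]
    if dtype is None:
        # Assumed inference rule: type of the first non-missing sampled result.
        sample = [r for r in results[:sample_size] if r is not None]
        if not sample:
            return None, results
        dtype = type(sample[0])
    converted = [None if r is None else dtype(r) for r in results]
    return dtype, converted


rows = [
    {'user_id': 1, 'movie_id': 3, 'rating': 4},
    {'user_id': 2, 'movie_id': 3, 'rating': 5},
    {'user_id': 3, 'movie_id': 6, 'rating': 1},
]
dtype, values = apply_rows(
    rows, lambda x: str(x['user_id']) + str(x['movie_id']) + str(x['rating']))
print(dtype.__name__, values)   # str ['134', '235', '361']
```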
column_names()[source]

The name of each column in the XFrame.

Returns:

list[string]

Column names of the XFrame.

See also

xframes.XFrame.rename
Renames the columns.
column_types()[source]

The type of each column in the XFrame.

Returns:

list[type]

Column types of the XFrame.

See also

xframes.XFrame.dtype
This is a synonym for column_types.
detect_type(column_name)[source]

If the column is of string type, and the values can safely be cast to int or float, then return the type to be cast to. Uses the entire column to detect the type.

Parameters:

column_name : str

The name of the column to cast.

Returns:

type

int or float: The column can be cast to this type.

str: The column cannot be cast to one of the types above.

Examples

>>> xf = xframes.XFrame({'value': ['1', '2', '3']})
>>> xf.detect_type('value')
detect_type_and_cast(column_name)[source]

If the column is of string type, and the values can all be interpreted as integer or float values, then cast the column to the numerical type. Otherwise, returns a copy of the XFrame.

Parameters:

column_name : str

The name of the column to cast.

Examples

>>> xf = xframes.XFrame({'value': ['1', '2', '3']})
>>> xf.detect_type_and_cast('value')
dropna(columns=None, how='any')[source]

Remove missing values from an XFrame. A missing value is either None or NaN. If how is ‘any’, a row will be removed if any of the columns in the columns parameter contains at least one missing value. If how is ‘all’, a row will be removed if all of the columns in the columns parameter are missing values.

If the columns parameter is not specified, the default is to consider all columns when searching for missing values.

Parameters:

columns : list or str, optional

The columns to use when looking for missing values. By default, all columns are used.

how : {‘any’, ‘all’}, optional

Specifies whether a row should be dropped if at least one column has missing values, or if all columns have missing values. ‘any’ is default.

Returns:

XFrame

XFrame with missing values removed (according to the given rules).

See also

xframes.XFrame.dropna_split
Drops missing rows from the XFrame and returns them.

Examples

Drop rows that contain any missing value.

>>> xf = xframes.XFrame({'a': [1, None, None], 'b': ['a', 'b', None]})
>>> xf.dropna()
+---+---+
| a | b |
+---+---+
| 1 | a |
+---+---+
[1 rows x 2 columns]

Drop rows where every value is missing.

>>> xf.dropna(how="all")
+------+---+
|  a   | b |
+------+---+
|  1   | a |
| None | b |
+------+---+
[2 rows x 2 columns]

Drop rows where column ‘a’ has a missing value.

>>> xf.dropna('a', how="all")
+---+---+
| a | b |
+---+---+
| 1 | a |
+---+---+
[1 rows x 2 columns]
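
The ‘any’ and ‘all’ rules can be expressed as a small plain-Python sketch. It operates on a list of row dictionaries rather than an XFrame, and `is_missing` is a helper invented for this illustration; the keyword that selects the rule is how, as in the signature above.

```python
import math


def is_missing(value):
    """A missing value is either None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def dropna(rows, columns=None, how='any'):
    """Keep the rows that are not dropped by the 'any' or 'all' rule."""
    if how not in ('any', 'all'):
        raise ValueError("how must be 'any' or 'all'")
    if isinstance(columns, str):
        columns = [columns]
    kept = []
    for row in rows:
        # By default every column is considered.
        names = list(row) if columns is None else columns
        flags = [is_missing(row[name]) for name in names]
        drop = any(flags) if how == 'any' else all(flags)
        if not drop:
            kept.append(row)
    return kept


rows = [{'a': 1, 'b': 'a'}, {'a': None, 'b': 'b'}, {'a': None, 'b': None}]
print(len(dropna(rows)))              # 1
print(len(dropna(rows, how='all')))   # 2
```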
dropna_split(columns=None, how='any')[source]

Split rows with missing values from this XFrame. This function has the same functionality as dropna(), but returns a tuple of two XFrames. The first item is the expected output from dropna(), and the second item contains all the rows filtered out by the dropna algorithm.

Parameters:

columns : list or str, optional

The columns to use when looking for missing values. By default, all columns are used.

how : {‘any’, ‘all’}, optional

Specifies whether a row should be dropped if at least one column has missing values, or if all columns have missing values. ‘any’ is default.

Returns:

(XFrame, XFrame)

(XFrame with missing values removed,

XFrame with the removed missing values)

Examples

>>> xf = xframes.XFrame({'a': [1, None, None], 'b': ['a', 'b', None]})
>>> good, bad = xf.dropna_split()
>>> good
+---+---+
| a | b |
+---+---+
| 1 | a |
+---+---+
[1 rows x 2 columns]
>>> bad
+------+------+
|  a   |  b   |
+------+------+
| None |  b   |
| None | None |
+------+------+
[2 rows x 2 columns]
dtype()[source]

The type of each column in the XFrame.

Returns:

list[type]

Column types of the XFrame.

See also

xframes.XFrame.column_types
This is a synonym for dtype.
dump_debug_info()[source]

Print information about the Spark RDD associated with this XFrame.

classmethod empty(column_names, column_types)[source]

Create an empty XFrame.

Creates an empty XFrame, with column names and column types.

Parameters:

column_names : list[str]

The column names.

column_types : list[type]

The column types.

Returns:

XFrame

An empty XFrame with the given column names and types.

fillna(column, value)[source]

Fill all missing values with a given value in a given column. If the value is not the same type as the values in column, this method attempts to convert the value to the original column’s type. If this fails, an error is raised.

Parameters:

column : str

The name of the column to modify.

value : type convertible to XArray’s type

The value used to replace all missing values.

Returns:

XFrame

A new XFrame with the specified value in place of missing values.

Examples

>>> xf = xframes.XFrame({'a':[1, None, None],
...                       'b':['13.1', '17.2', None]})
>>> xf = xf.fillna('a', 0)
>>> xf
+---+------+
| a |  b   |
+---+------+
| 1 | 13.1 |
| 0 | 17.2 |
| 0 | None |
+---+------+
[3 rows x 2 columns]
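
The conversion rule described above can be sketched in plain Python. This models a single column as a list and is not the library code; the `column_type` argument and the choice of ValueError for a failed conversion are assumptions made for this example.

```python
import math


def fillna(values, value, column_type):
    """Replace missing entries in values with value, converted to column_type."""
    try:
        fill = value if isinstance(value, column_type) else column_type(value)
    except (TypeError, ValueError):
        # The documentation says an error is raised; the exact type is assumed.
        raise ValueError('cannot convert %r to %s' % (value, column_type.__name__))

    def is_missing(item):
        return item is None or (isinstance(item, float) and math.isnan(item))

    return [fill if is_missing(item) else item for item in values]


print(fillna([1, None, None], 0, int))     # [1, 0, 0]
print(fillna([1, None, None], '7', int))   # [1, 7, 7]
```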
filterby(values, column_name, exclude=False)[source]

Filter an XFrame by values inside an iterable object. Result is an XFrame that only includes (or excludes) the rows that have a column with the given column_name which holds one of the values in the given values XArray. If values is not an XArray, we attempt to convert it to one before filtering.

Parameters:

values : XArray | list | tuple | set | iterable | numpy.ndarray | pandas.Series | str | function

The values to use to filter the XFrame. The resulting XFrame will only include rows that have one of these values in the given column. If this is a function, it is called on each row and is passed the value in the column given by ‘column_name’. The result includes rows where the function returns True.

column_name : str | None

The column of the XFrame to match with the given values. This can only be None if the values argument is a function. In this case, the function is passed the whole row.

exclude : bool

If True, the result XFrame will contain all rows EXCEPT those that have one of values in column_name.

Returns:

XFrame

The filtered XFrame.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3, 4],
...                      'animal_type': ['dog', 'cat', 'cow', 'horse'],
...                      'name': ['bob', 'jim', 'jimbob', 'bobjim']})
>>> household_pets = ['cat', 'hamster', 'dog', 'fish', 'bird', 'snake']
>>> xf.filterby(household_pets, 'animal_type')
+-------------+----+------+
| animal_type | id | name |
+-------------+----+------+
|     dog     | 1  | bob  |
|     cat     | 2  | jim  |
+-------------+----+------+
[2 rows x 3 columns]
>>> xf.filterby(household_pets, 'animal_type', exclude=True)
+-------------+----+--------+
| animal_type | id |  name  |
+-------------+----+--------+
|    horse    | 4  | bobjim |
|     cow     | 3  | jimbob |
+-------------+----+--------+
[2 rows x 3 columns]
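
A plain-Python sketch of the filtering rules follows. It works on a list of row dictionaries and is only a model of the documented behavior. Two points are assumptions rather than documented facts: a single string is treated as one value, and exclude is also applied when values is a function.

```python
def filterby(rows, values, column_name, exclude=False):
    """Keep rows whose column value is in values (or passes a function)."""
    if callable(values):
        def matches(row):
            # With column_name=None the function receives the whole row.
            arg = row if column_name is None else row[column_name]
            return bool(values(arg))
    else:
        if isinstance(values, str):
            values = [values]   # assumption: a lone string is a single value
        wanted = set(values)

        def matches(row):
            return row[column_name] in wanted
    return [row for row in rows if matches(row) != exclude]


rows = [
    {'id': 1, 'animal_type': 'dog', 'name': 'bob'},
    {'id': 2, 'animal_type': 'cat', 'name': 'jim'},
    {'id': 3, 'animal_type': 'cow', 'name': 'jimbob'},
    {'id': 4, 'animal_type': 'horse', 'name': 'bobjim'},
]
household_pets = ['cat', 'hamster', 'dog', 'fish', 'bird', 'snake']
print([r['id'] for r in filterby(rows, household_pets, 'animal_type')])   # [1, 2]
```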
flat_map(column_names, fn, column_types='auto', use_columns=None, seed=None)[source]

Map each row of the XFrame to multiple rows in a new XFrame via a function.

The output of fn must have type list[list[...]]. Each inner list will be a single row in the new output, and the collection of these rows within the outer list make up the data for the output XFrame. All rows must have the same length and the same order of types to make sure the result columns are homogeneously typed. For example, if the first element emitted into the outer list by fn is [43, 2.3, 'string'], then all other elements emitted into the outer list must be a list with three elements, where the first is an int, second is a float, and third is a string. If column_types is not specified, the first 10 rows of the XFrame are used to determine the column types of the returned XFrame.

Parameters:

column_names : list[str]

The column names for the returned XFrame.

fn : function

The function that maps each of the XFrame rows into multiple rows, returning list[list[...]]. All output rows must have the same length and order of types. The function is passed a dictionary of column name: value for each row.

column_types : list[type], optional

The column types of the output XFrame. Default value will be automatically inferred by running fn on the first 10 rows of the output.

use_columns : str | list[str], optional

The column or list of columns to be supplied in the row passed to the function. If not given, all columns will be used to build the row.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XFrame

A new XFrame containing the results of the flat_map of the original XFrame.

Examples

Repeat each row according to the value in the ‘number’ column.

>>> xf = xframes.XFrame({'letter': ['a', 'b', 'c'],
...                       'number': [1, 2, 3]})
>>> xf.flat_map(['number', 'letter'],
...             lambda x: [list(x.itervalues()) for _ in range(0, x['number'])])
+--------+--------+
| number | letter |
+--------+--------+
|   1    |   a    |
|   2    |   b    |
|   2    |   b    |
|   3    |   c    |
|   3    |   c    |
|   3    |   c    |
+--------+--------+
[6 rows x 2 columns]
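
The list[list[...]] contract can be illustrated with a plain-Python model of flat_map over row dictionaries. It is a sketch, not the library implementation. Unlike the example above, the function passed here builds each output row explicitly instead of calling itervalues(), which exists only in Python 2 and depends on dictionary ordering.

```python
def flat_map(rows, column_names, fn):
    """Expand each input row into zero or more output rows."""
    output = []
    for row in rows:
        for new_row in fn(row):
            # Every emitted row must line up with the output column names.
            if len(new_row) != len(column_names):
                raise ValueError('each output row needs %d values' % len(column_names))
            output.append(dict(zip(column_names, new_row)))
    return output


rows = [
    {'letter': 'a', 'number': 1},
    {'letter': 'b', 'number': 2},
    {'letter': 'c', 'number': 3},
]
repeated = flat_map(
    rows, ['number', 'letter'],
    lambda x: [[x['number'], x['letter']] for _ in range(x['number'])])
print(len(repeated))   # 6
```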
foreach(row_fn, init_fn=None, final_fn=None, use_columns=None, seed=None)[source]

Apply the given function to each row of a XFrame. This is intended to be used for functions with side effects.

Rows are processed in groups. Each group is processed sequentially in one execution context. An initial function, if given, is executed first for each group. Its result is passed to each row function. The row function receives the row data as a dictionary of column name: column value.

Parameters:

row_fn : function

The function to be applied to each row of the XFrame. Any value that is returned is ignored. The row_fn takes two parameters: row and init. The row is a dictionary of column-name: column_value. The init value is returned by init_fn.

init_fn : function, optional

The function to be applied before row_fn is called. The rows are processed in groups: init_fn is called once for each group. If no init_fn is supplied, the row_fn is passed None as its second parameter. Init_fn takes no parameters.

final_fn : function, optional

The function to be applied after all row_fn calls are made. Final_fn takes one parameter, the value returned by the init_fn.

use_columns : str | list[str], optional

The column or list of columns to be supplied in the row passed to the function. If not given, all columns will be used to build the row.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Examples

Send rows to an external sink.

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                      'rating': [4, 5, 1]})
>>> xf.foreach(lambda row, ini: send(row['user_id'], row['movie_id'], row['rating']))

Send rows to an external sink with modification.

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                      'rating': [4, 5, 1]})
>>> xf.foreach(lambda row, bias: send(row['user_id'], row['movie_id'], row['rating'] + bias),
...            lambda: 10)
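
The calling sequence of init_fn, row_fn, and final_fn can be sketched in plain Python. This is a model of the description above, not the library code: the groups are given explicitly as lists of row dictionaries, the `sent` and `closed` lists stand in for an external sink, and calling final_fn once per group is an assumption.

```python
def foreach(groups, row_fn, init_fn=None, final_fn=None):
    """Run row_fn on every row, group by group, for its side effects."""
    for group in groups:
        # init_fn runs once per group; without it, row_fn receives None.
        init = init_fn() if init_fn is not None else None
        for row in group:
            row_fn(row, init)
        if final_fn is not None:
            final_fn(init)   # assumed to run once per group


sent = []     # stands in for an external sink
closed = []   # records the values passed to final_fn

groups = [
    [{'user_id': 1, 'movie_id': 3, 'rating': 4},
     {'user_id': 2, 'movie_id': 3, 'rating': 5}],
    [{'user_id': 3, 'movie_id': 6, 'rating': 1}],
]
foreach(groups,
        lambda row, bias: sent.append(row['rating'] + bias),
        init_fn=lambda: 10,
        final_fn=closed.append)
print(sent)     # [14, 15, 11]
print(closed)   # [10, 10]
```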
classmethod from_rdd(rdd, column_names=None, column_types=None)[source]

Create an XFrame from a Spark RDD or Spark DataFrame. An RDD should meet the following requirements:

  • It is an RDD of tuples.
  • Each tuple is of the same length.
  • Each “column” is of a uniform type.

Parameters:

rdd: spark.RDD or spark.DataFrame

Data used to populate the XFrame

column_names : list of string, optional

The column names to use. Ignored for Spark DataFrames.

column_types : list of type, optional

The column types to use. Ignored for Spark DataFrames.

Returns:

XFrame

See also

to_rdd
Converts to a Spark RDD.
classmethod from_xarray(arry, name)[source]

Constructs a one column XFrame from an XArray and a column name.

Parameters:

arry : XArray

The XArray that will become an XFrame of one column.

name: str

The column name.

Returns:

out: XFrame

Returns an XFrame with one column, containing the values in arry and with the given name.

Examples

Create an XFrame from an XArray.

>>> print XFrame.from_xarray(XArray([1, 2, 3]), 'name')
name |
1 |
2 |
3 |
groupby(key_columns, operations=None, *args)[source]

Perform a grouping on the key_columns followed by aggregations on the columns listed in operations.

The operations parameter is a dictionary that indicates which aggregation operators to use and which columns to use them on. The available operators are SUM, MAX, MIN, COUNT, MEAN, VARIANCE, STD, CONCAT, SELECT_ONE, ARGMIN, ARGMAX, and QUANTILE. See aggregate for more detail on the aggregators.

Parameters:

key_columns : string | list[string]

Column(s) to group by. Key columns can be of any type other than dictionary.

operations : dict, list, optional

Dictionary of columns and aggregation operations. Each key is a output column name and each value is an aggregator. This can also be a list of aggregators, in which case column names will be automatically assigned.

*args

All other remaining arguments will be interpreted in the same way as the operations argument.

Returns:

out_xf : XFrame

A new XFrame, with a column for each groupby column and each aggregation operation.

Examples

Suppose we have an XFrame with movie ratings by many users.

>>> import xframes.aggregate as agg
>>> url = 'http://atg-testdata/rating.csv'
>>> xf = xframes.XFrame.read_csv(url)
>>> xf
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|  25933  |   1663   |   4    |
|  25934  |   1663   |   4    |
|  25935  |   1663   |   4    |
|  25936  |   1663   |   5    |
|  25937  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Compute the number of occurrences of each user.

>>> user_count = xf.groupby('user_id',
...                         {'count': agg.COUNT()})
>>> user_count
+---------+-------+
| user_id | count |
+---------+-------+
|  62361  |   1   |
|  30727  |   1   |
|  40111  |   1   |
|  50513  |   1   |
|  35140  |   1   |
|  42352  |   1   |
|  29667  |   1   |
|  46242  |   1   |
|  58310  |   1   |
|  64614  |   1   |
|   ...   |  ...  |
+---------+-------+
[9852 rows x 2 columns]

Compute the mean and standard deviation of ratings per user.

>>> user_rating_stats = xf.groupby('user_id',
...                                {
...                                    'mean_rating': agg.MEAN('rating'),
...                                    'std_rating': agg.STD('rating')
...                                })
>>> user_rating_stats
+---------+-------------+------------+
| user_id | mean_rating | std_rating |
+---------+-------------+------------+
|  62361  |     5.0     |    0.0     |
|  30727  |     4.0     |    0.0     |
|  40111  |     2.0     |    0.0     |
|  50513  |     4.0     |    0.0     |
|  35140  |     4.0     |    0.0     |
|  42352  |     5.0     |    0.0     |
|  29667  |     4.0     |    0.0     |
|  46242  |     5.0     |    0.0     |
|  58310  |     2.0     |    0.0     |
|  64614  |     2.0     |    0.0     |
|   ...   |     ...     |    ...     |
+---------+-------------+------------+
[9852 rows x 3 columns]

Compute the movie with the minimum rating per user.

>>> chosen_movies = xf.groupby('user_id',
...                            {
...                                'worst_movies': agg.ARGMIN('rating','movie_id')
...                            })
>>> chosen_movies
+---------+--------------+
| user_id | worst_movies |
+---------+--------------+
|  62361  |     1663     |
|  30727  |     1663     |
|  40111  |     1663     |
|  50513  |     1663     |
|  35140  |     1663     |
|  42352  |     1663     |
|  29667  |     1663     |
|  46242  |     1663     |
|  58310  |     1663     |
|  64614  |     1663     |
|   ...   |     ...      |
+---------+--------------+
[9852 rows x 2 columns]

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> xf['imdb-ranking'] = xf['rating'] * 10
>>> chosen_movies = xf.groupby('user_id',
...         {('max_rating_movie','max_imdb_ranking_movie'):
...            agg.ARGMAX(('rating','imdb-ranking'),'movie_id')})
>>> chosen_movies
+---------+------------------+------------------------+
| user_id | max_rating_movie | max_imdb_ranking_movie |
+---------+------------------+------------------------+
|  62361  |       1663       |          16630         |
|  30727  |       1663       |          16630         |
|  40111  |       1663       |          16630         |
|  50513  |       1663       |          16630         |
|  35140  |       1663       |          16630         |
|  42352  |       1663       |          16630         |
|  29667  |       1663       |          16630         |
|  46242  |       1663       |          16630         |
|  58310  |       1663       |          16630         |
|  64614  |       1663       |          16630         |
|   ...   |       ...        |          ...           |
+---------+------------------+------------------------+
[9852 rows x 3 columns]

Compute the movie with the max rating per user.

>>> chosen_movies = xf.groupby('user_id',
...         {'best_movies': agg.ARGMAX('rating','movie_id')})

Compute the movie with the max rating per user and also the movie with the maximum imdb-ranking per user.

>>> chosen_movies = xf.groupby('user_id',
...        {('max_rating_movie','max_imdb_ranking_movie'):
...                              agg.ARGMAX(('rating','imdb-ranking'),'movie_id')})

Compute the count, mean, and standard deviation of ratings per (user, time), automatically assigning output column names.

>>> xf['time'] = xf.apply(lambda x: (x['user_id'] + x['movie_id']) % 11 + 2000)
>>> user_rating_stats = xf.groupby(['user_id', 'time'],
...                                [agg.COUNT(),
...                                 agg.MEAN('rating'),
...                                 agg.STDV('rating')])
>>> user_rating_stats
+------+---------+-------+---------------+----------------+
| time | user_id | Count | Avg of rating | Stdv of rating |
+------+---------+-------+---------------+----------------+
| 2006 |  61285  |   1   |      4.0      |      0.0       |
| 2000 |  36078  |   1   |      4.0      |      0.0       |
| 2003 |  47158  |   1   |      3.0      |      0.0       |
| 2007 |  34446  |   1   |      3.0      |      0.0       |
| 2010 |  47990  |   1   |      3.0      |      0.0       |
| 2003 |  42120  |   1   |      5.0      |      0.0       |
| 2007 |  44940  |   1   |      4.0      |      0.0       |
| 2008 |  58240  |   1   |      4.0      |      0.0       |
| 2002 |   102   |   1   |      1.0      |      0.0       |
| 2009 |  52708  |   1   |      3.0      |      0.0       |
| ...  |   ...   |  ...  |      ...      |      ...       |
+------+---------+-------+---------------+----------------+
[10000 rows x 5 columns]

The groupby function can take a variable length list of aggregation specifiers so if we want the count and the 0.25 and 0.75 quantiles of ratings:

>>> user_rating_stats = xf.groupby(['user_id', 'time'], agg.COUNT(),
...                                {'rating_quantiles': agg.QUANTILE('rating',[0.25, 0.75])})
>>> user_rating_stats
+------+---------+-------+------------------------+
| time | user_id | Count |    rating_quantiles    |
+------+---------+-------+------------------------+
| 2006 |  61285  |   1   | array('d', [4.0, 4.0]) |
| 2000 |  36078  |   1   | array('d', [4.0, 4.0]) |
| 2003 |  47158  |   1   | array('d', [3.0, 3.0]) |
| 2007 |  34446  |   1   | array('d', [3.0, 3.0]) |
| 2010 |  47990  |   1   | array('d', [3.0, 3.0]) |
| 2003 |  42120  |   1   | array('d', [5.0, 5.0]) |
| 2007 |  44940  |   1   | array('d', [4.0, 4.0]) |
| 2008 |  58240  |   1   | array('d', [4.0, 4.0]) |
| 2002 |   102   |   1   | array('d', [1.0, 1.0]) |
| 2009 |  52708  |   1   | array('d', [3.0, 3.0]) |
| ...  |   ...   |  ...  |          ...           |
+------+---------+-------+------------------------+
[10000 rows x 4 columns]

To put all items a user rated into one list value by their star rating:

>>> user_rating_stats = xf.groupby(["user_id", "rating"],
...                                {"rated_movie_ids": agg.CONCAT("movie_id")})
>>> user_rating_stats
+--------+---------+----------------------+
| rating | user_id |     rated_movie_ids  |
+--------+---------+----------------------+
|   3    |  31434  | array('d', [1663.0]) |
|   5    |  25944  | array('d', [1663.0]) |
|   4    |  38827  | array('d', [1663.0]) |
|   4    |  51437  | array('d', [1663.0]) |
|   4    |  42549  | array('d', [1663.0]) |
|   4    |  49532  | array('d', [1663.0]) |
|   3    |  26124  | array('d', [1663.0]) |
|   4    |  46336  | array('d', [1663.0]) |
|   4    |  52133  | array('d', [1663.0]) |
|   5    |  62361  | array('d', [1663.0]) |
|  ...   |   ...   |         ...          |
+--------+---------+----------------------+
[9952 rows x 3 columns]

To put all items and rating of a given user together into a dictionary value:

>>> user_rating_stats = xf.groupby("user_id",
...                                {"movie_rating": agg.CONCAT("movie_id", "rating")})
>>> user_rating_stats
+---------+--------------+
| user_id | movie_rating |
+---------+--------------+
|  62361  |  {1663: 5}   |
|  30727  |  {1663: 4}   |
|  40111  |  {1663: 2}   |
|  50513  |  {1663: 4}   |
|  35140  |  {1663: 4}   |
|  42352  |  {1663: 5}   |
|  29667  |  {1663: 4}   |
|  46242  |  {1663: 5}   |
|  58310  |  {1663: 2}   |
|  64614  |  {1663: 2}   |
|   ...   |     ...      |
+---------+--------------+
[9852 rows x 2 columns]
head(n=10)[source]

The first n rows of the XFrame.

Parameters:

n : int, optional

The number of rows to fetch.

Returns:

XFrame

A new XFrame which contains the first n rows of the current XFrame

See also

xframes.XFrame.tail
Returns the last part of the XFrame.
xframes.XFrame.print_rows
Prints the XFrame.
join(right, on=None, how='inner')[source]

Merge two XFrames. Merges the current (left) XFrame with the given (right) XFrame using a SQL-style equi-join operation by columns.

Parameters:

right : XFrame

The XFrame to join.

on : str | list | dict, optional

The column name(s) representing the set of join keys. Each row that has the same value in this set of columns will be merged together.

  • If on is not given, the join keys are all columns in the left and right XFrames that have the same name.
  • If a string is given, this is interpreted as a join using one column, where both XFrames have the same column name.
  • If a list is given, this is interpreted as a join using one or more column names, where each column name given exists in both XFrames.
  • If a dict is given, each dict key is taken as a column name in the left XFrame, and each dict value is taken as the column name in the right XFrame that will be joined together, e.g. {'left_column_name': 'right_column_name'}.

how : {‘inner’, ‘left’, ‘right’, ‘outer’, ‘full’, ‘cartesian’}, optional

The type of join to perform. ‘inner’ is default.

  • inner: Equivalent to a SQL inner join. Result consists of the rows from the two frames whose join key values match exactly, merged together into one XFrame.
  • left: Equivalent to a SQL left outer join. Result is the union between the result of an inner join and the rest of the rows from the left XFrame, merged with missing values.
  • right: Equivalent to a SQL right outer join. Result is the union between the result of an inner join and the rest of the rows from the right XFrame, merged with missing values.
  • full: Equivalent to a SQL full outer join. Result is the union between the result of a left outer join and a right outer join.
  • cartesian: Cartesian product of left and right tables, with columns from each. There is no common column matching: the resulting number of rows is the product of the row counts of the left and right XFrames.
Returns:

XFrame

The joined XFrames.

Examples

>>> animals = xframes.XFrame({'id': [1, 2, 3, 4],
...                           'name': ['dog', 'cat', 'sheep', 'cow']})
>>> sounds = xframes.XFrame({'id': [1, 3, 4, 5],
...                          'sound': ['woof', 'baa', 'moo', 'oink']})
>>> animals.join(sounds, how='inner')
+----+-------+-------+
| id |  name | sound |
+----+-------+-------+
| 1  |  dog  |  woof |
| 3  | sheep |  baa  |
| 4  |  cow  |  moo  |
+----+-------+-------+
[3 rows x 3 columns]
>>> animals.join(sounds, on='id', how='left')
+----+-------+-------+
| id |  name | sound |
+----+-------+-------+
| 1  |  dog  |  woof |
| 3  | sheep |  baa  |
| 4  |  cow  |  moo  |
| 2  |  cat  |  None |
+----+-------+-------+
[4 rows x 3 columns]
>>> animals.join(sounds, on=['id'], how='right')
+----+-------+-------+
| id |  name | sound |
+----+-------+-------+
| 1  |  dog  |  woof |
| 3  | sheep |  baa  |
| 4  |  cow  |  moo  |
| 5  |  None |  oink |
+----+-------+-------+
[4 rows x 3 columns]
>>> animals.join(sounds, on={'id':'id'}, how='full')
+----+-------+-------+
| id |  name | sound |
+----+-------+-------+
| 1  |  dog  |  woof |
| 3  | sheep |  baa  |
| 4  |  cow  |  moo  |
| 5  |  None |  oink |
| 2  |  cat  |  None |
+----+-------+-------+
[5 rows x 3 columns]
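
None of the examples above exercises the cartesian option. As a rough illustration of the row-count rule described for it, the following plain-Python sketch pairs every left row with every right row. It does not call xframes: the cartesian_rows helper and the list-of-dicts row representation are assumptions made for illustration only.

```python
from itertools import product


def cartesian_rows(left, right):
    """Pair every row of `left` with every row of `right` (hypothetical helper)."""
    merged = []
    for l_row, r_row in product(left, right):
        row = dict(l_row)
        row.update(r_row)  # assumes the two sides use distinct column names
        merged.append(row)
    return merged


animals = [{'name': 'dog'}, {'name': 'cat'}, {'name': 'sheep'}]
sounds = [{'sound': 'woof'}, {'sound': 'baa'}]
pairs = cartesian_rows(animals, sounds)
# The result has len(animals) * len(sounds) rows.
```

Because no join key is compared, the result grows multiplicatively, so this option is best kept for small inputs.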
lineage()[source]

The table lineage: the files that went into building this table.

Returns:

dict

  • key ‘table’: set[filename]
    The files that were used to build the XFrame
  • key ‘column’: dict{column_name: set[filename]}
    The set of files that were used to build each column
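
The shape of the returned dictionary may be easier to see with a concrete value. The sketch below builds such a value by hand: the file names and column names are hypothetical, and nothing here is produced by xframes itself.

```python
# Hypothetical lineage value for a table built from two CSV files.
lineage = {
    'table': {'ratings.csv', 'users.csv'},
    'column': {
        'user_id': {'ratings.csv', 'users.csv'},
        'rating': {'ratings.csv'},
    },
}

# Look up the files that contributed to a single column.
files_for_rating = sorted(lineage['column']['rating'])
```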
classmethod load(filename)[source]

Load an XFrame. The filename extension is used to determine the format automatically. This function is particularly useful for XFrames previously saved in binary format. For CSV imports the read_csv() function provides greater control. If the XFrame is in binary format, filename is actually a directory, created when the XFrame is saved.

Parameters:

filename : string

Location of the file to load. Can be a local path or a remote URL.

Returns:

XFrame

See also

xframes.XFrame.save
Saves the XFrame to a file.
xframes.XFrame.read_csv
Allows more control over csv parsing.

Examples

>>> sf = xframes.XFrame({'id':[1,2,3], 'val':['A','B','C']})
>>> sf.save('my_xframe')        # 'my_xframe' is a directory
>>> sf_loaded = xframes.XFrame.load('my_xframe')
num_columns()[source]

The number of columns in this XFrame.

Returns:

int

Number of columns in the XFrame.

See also

xframes.XFrame.num_rows
Returns the number of rows.
num_rows()[source]

The number of rows in this XFrame.

Returns:

int

Number of rows in the XFrame.

See also

xframes.XFrame.num_columns
Returns the number of columns.
pack_columns(columns=None, column_prefix=None, dtype=<type 'list'>, fill_na=None, remove_prefix=True, new_column_name=None)[source]

Pack two or more columns of the current XFrame into one single column. The result is a new XFrame with the unaffected columns from the original XFrame plus the newly created column.

The list of columns that are packed is chosen through either the columns or column_prefix parameter. Only one of the parameters is allowed to be provided: columns explicitly specifies the list of columns to pack, while column_prefix specifies that all columns that have the given prefix are to be packed.

The type of the resulting column is decided by the dtype parameter. Allowed values for dtype are dict, array.array, list, and tuple:

  • dict: pack to a dictionary XArray where column name becomes dictionary key and column value becomes dictionary value
  • array.array: pack all values from the packing columns into an array
  • list: pack all values from the packing columns into a list.
  • tuple: pack all values from the packing columns into a tuple.
Parameters:

columns : list[str], optional

A list of column names to be packed. At least two columns must be given. If omitted and column_prefix is not specified, all columns from the current XFrame are packed. This parameter is mutually exclusive with the column_prefix parameter.

column_prefix : str, optional

Pack all columns with the given column_prefix. This parameter is mutually exclusive with the columns parameter.

dtype : dict | array.array | list | tuple, optional

The resulting packed column type. If not provided, dtype is list.

fill_na : value, optional

Value to fill into packed column if missing value is encountered. If packing to dictionary, fill_na is only applicable to dictionary values; missing keys are not replaced.

remove_prefix : bool, optional

If True and column_prefix is specified, the dictionary key will be constructed by removing the prefix from the column name. This option is only applicable when packing to dict type.

new_column_name : str, optional

Packed column name. If not given and column_prefix is given, then the prefix will be used as the new column name; otherwise a name is generated automatically.

Returns:

XFrame

An XFrame that contains columns that are not packed, plus the newly packed column.

Notes

  • There must be at least two columns to pack.
  • If packing to dictionary, a missing key is always dropped. Missing values are dropped if fill_na is not provided, otherwise, missing value is replaced by fill_na. If packing to list or array, missing values will be kept. If fill_na is provided, the missing value is replaced with fill_na value.

Examples

Suppose ‘xf’ is an XFrame that maintains business category information.

>>> xf = xframes.XFrame({'business': range(1, 5),
...                       'category.retail': [1, None, 1, None],
...                       'category.food': [1, 1, None, 1],
...                       'category.service': [None, 1, 1, None],
...                       'category.shop': [1, 1, None, 1]})
>>> xf
+----------+-----------------+---------------+------------------+---------------+
| business | category.retail | category.food | category.service | category.shop |
+----------+-----------------+---------------+------------------+---------------+
|    1     |        1        |       1       |       None       |       1       |
|    2     |       None      |       1       |        1         |       1       |
|    3     |        1        |      None     |        1         |      None     |
|    4     |       None      |       1       |       None       |       1       |
+----------+-----------------+---------------+------------------+---------------+
[4 rows x 5 columns]

To pack all category columns into a list:

>>> xf.pack_columns(column_prefix='category')
+----------+--------------------+
| business |         X2         |
+----------+--------------------+
|    1     |  [1, 1, None, 1]   |
|    2     |  [None, 1, 1, 1]   |
|    3     | [1, None, 1, None] |
|    4     | [None, 1, None, 1] |
+----------+--------------------+
[4 rows x 2 columns]

To pack all category columns into a dictionary, with new column name:

>>> xf.pack_columns(column_prefix='category', dtype=dict,
...                 new_column_name='category')
+----------+--------------------------------+
| business |            category            |
+----------+--------------------------------+
|    1     | {'food': 1, 'shop': 1, 're ... |
|    2     | {'food': 1, 'shop': 1, 'se ... |
|    3     |  {'retail': 1, 'service': 1}   |
|    4     |     {'food': 1, 'shop': 1}     |
+----------+--------------------------------+
[4 rows x 2 columns]

To keep column prefix in the resulting dict key:

>>> xf.pack_columns(column_prefix='category', dtype=dict,
...                 remove_prefix=False)
+----------+--------------------------------+
| business |               X2               |
+----------+--------------------------------+
|    1     | {'category.retail': 1, 'ca ... |
|    2     | {'category.food': 1, 'cate ... |
|    3     | {'category.retail': 1, 'ca ... |
|    4     | {'category.food': 1, 'cate ... |
+----------+--------------------------------+
[4 rows x 2 columns]

To explicitly pack a set of columns:

>>> xf.pack_columns(columns = ['business', 'category.retail',
...                            'category.food', 'category.service',
...                            'category.shop'])
+-----------------------+
|           X1          |
+-----------------------+
|   [1, 1, 1, None, 1]  |
|   [2, None, 1, 1, 1]  |
| [3, 1, None, 1, None] |
| [4, None, 1, None, 1] |
+-----------------------+
[4 rows x 1 columns]

To pack all columns with name starting with ‘category’ into an array type, and with missing value replaced with 0:

>>> import array
>>> xf.pack_columns(column_prefix="category", dtype=array.array,
...                 fill_na=0)
+----------+--------------------------------+
| business |               X2               |
+----------+--------------------------------+
|    1     | array('d', [1.0, 1.0, 0.0, ... |
|    2     | array('d', [0.0, 1.0, 1.0, ... |
|    3     | array('d', [1.0, 0.0, 1.0, ... |
|    4     | array('d', [0.0, 1.0, 0.0, ... |
+----------+--------------------------------+
[4 rows x 2 columns]
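
The dictionary-packing rules in the Notes can be sketched for a single row in plain Python. This is only an approximation of the documented behavior, not the xframes implementation: pack_to_dict is a hypothetical helper, and stripping the '.' that follows the prefix is an assumption inferred from the example output above.

```python
def pack_to_dict(row, prefix, fill_na=None, remove_prefix=True):
    """Pack the columns of `row` whose names start with `prefix` into a dict."""
    packed = {}
    for name, value in row.items():
        if not name.startswith(prefix):
            continue  # column is not part of the pack
        if value is None:
            if fill_na is None:
                continue  # missing value is dropped when fill_na is not given
            value = fill_na  # otherwise it is replaced by fill_na
        # Assumption: the '.' separator after the prefix is stripped as well.
        key = name[len(prefix):].lstrip('.') if remove_prefix else name
        packed[key] = value
    return packed


row = {'business': 3, 'category.retail': 1, 'category.food': None,
       'category.service': 1, 'category.shop': None}
without_fill = pack_to_dict(row, 'category')
with_fill = pack_to_dict(row, 'category', fill_na=0)
```

Without fill_na the packed value keeps only the two non-missing categories, which matches the row for business 3 in the dictionary example above.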
persist(persist_flag)[source]

Persist or unpersist the underlying data storage object.

Persisting makes a copy of the object on the disk, so that it does not have to be recomputed in times of low memory. Unpersisting frees up this space.

Parameters:

persist_flag : boolean

If True, persist the object. If False, unpersist it.

print_rows(num_rows=10, num_columns=40, max_column_width=30, max_row_width=None, wrap_text=False, max_wrap_rows=2, footer=True)[source]

Print the first rows and columns of the XFrame in human readable format.

Parameters:

num_rows : int, optional

Number of rows to print.

num_columns : int, optional

Number of columns to print.

max_column_width : int, optional

Maximum width of a column. Columns use fewer characters if possible.

max_row_width : int, optional

Maximum width of a printed row. Columns beyond this width wrap to a new line. max_row_width is automatically reset to be the larger of itself and max_column_width.

wrap_text : boolean, optional

Wrap the text within a cell. Defaults to False.

max_wrap_rows : int, optional

When wrapping is in effect, the maximum number of resulting rows for each cell before truncation takes place.

footer : bool, optional

True to print a footer.

See also

xframes.XFrame.head
Returns the first part of an XFrame.
xframes.XFrame.tail
Returns the last part of an XFrame.
random_split(fraction, seed=None)[source]

Randomly split the rows of an XFrame into two XFrames. The first XFrame contains M rows, sampled uniformly (without replacement) from the original XFrame. M is approximately the fraction times the original number of rows. The second XFrame contains the remaining rows of the original XFrame.

Parameters:

fraction : float

Approximate fraction of the rows to fetch for the first returned XFrame. Must be between 0 and 1.

seed : int, optional

Seed for the random number generator used to split.

Returns:

tuple [XFrame]

Two new XFrames.

Examples

Suppose we have an XFrame with 1,024 rows and we want to randomly split it into training and testing datasets with about a 90%/10% split.

>>> xf = xframes.XFrame({'id': range(1024)})
>>> xf_train, xf_test = xf.random_split(.9, seed=5)
>>> print len(xf_test), len(xf_train)
102 922
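
The sizes in the output are only approximately 90% and 10% of the rows. One way to see why is the following plain-Python sketch, which assigns each row independently to the first part with probability fraction. Whether xframes splits rows this way is an assumption: random_split_rows is a hypothetical helper, and its output for a given seed will not match the numbers above.

```python
import random


def random_split_rows(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


train, test = random_split_rows(list(range(1024)), 0.9, seed=5)
```

Every row lands in exactly one of the two parts, and repeating the call with the same seed reproduces the same split.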
range(key)[source]

Extracts and returns rows of the XFrame.

Parameters:

key : int or slice

If key is:
  • int Returns a single row of the XFrame (the `key`th one) as a dictionary.
  • slice Returns an XFrame including only the sliced rows.
Returns:

dict or XFrame

The specified row of the XFrame or an XFrame containing the specified rows.
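
The two kinds of key can be illustrated with a plain-Python sketch over a dict of column lists. This stand-in is an assumption made for illustration: rows_by_key is a hypothetical helper, and the slice case returns a dict of shorter columns where xframes would return an XFrame.

```python
def rows_by_key(columns, key):
    """Return one row as a dict for an int key, or sliced columns for a slice."""
    if isinstance(key, slice):
        # Slice: keep the tabular shape, with each column cut down to the slice.
        return {name: values[key] for name, values in columns.items()}
    if isinstance(key, int):
        # Int: a single row, mapping each column name to one value.
        return {name: values[key] for name, values in columns.items()}
    raise TypeError('key must be an int or a slice')


columns = {'id': [1, 2, 3], 'val': ['A', 'B', 'C']}
single_row = rows_by_key(columns, 1)
first_two = rows_by_key(columns, slice(0, 2))
```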

classmethod read_csv(url, delimiter=', ', header=True, error_bad_lines=False, comment_char='', escape_char='\\', double_quote=True, quote_char='"', skip_initial_space=True, column_type_hints=None, na_values=None, nrows=None, verbose=False)[source]

Constructs an XFrame from a CSV file or a path to multiple CSVs.

Parameters:

url : string

Location of the CSV file or directory to load. If URL is a directory or a “glob” pattern, all matching files will be loaded.

delimiter : string, optional

This describes the delimiter used for parsing csv files. Must be a single character.

header : bool, optional

If true, uses the first row as the column names. Otherwise use the default column names : ‘X1, X2, ...’.

error_bad_lines : bool

If true, will fail upon encountering a bad line. If false, will continue parsing skipping lines which fail to parse correctly. A sample of the first 10 encountered bad lines will be printed.

comment_char : string, optional

The character which denotes that the remainder of the line is a comment.

escape_char : string, optional

Character which begins a C escape sequence

double_quote : bool, optional

If True, two consecutive quotes in a string are parsed to a single quote.

quote_char : string, optional

Character sequence that indicates a quote.

skip_initial_space : bool, optional

Ignore extra spaces at the start of a field

column_type_hints : None, type, list[type], dict[string, type], optional

This provides type hints for each column. By default, this method attempts to detect the type of each column automatically.

Supported types are int, float, str, list, dict, and array.array.

  • If a single type is provided, the type will be applied to all columns. For instance, column_type_hints=float will force all columns to be parsed as float.
  • If a list of types is provided, the types apply to each column in order, e.g. [int, float, str] will parse the first column as int, second as float and third as string.
  • If a dictionary of column name to type is provided, each type value in the dictionary is applied to the key it belongs to. For instance {‘user’:int} will hint that the column called “user” should be parsed as an integer, and the rest will default to string.

na_values : str | list of str, optional

A string or list of strings to be interpreted as missing values.

nrows : int, optional

If set, only this many rows will be read from the file.

verbose : bool, optional

If True, print the progress while reading files.

Returns:

XFrame

See also

xframes.XFrame.read_csv_with_errors
Allows more control over errors.
xframes.XFrame
The constructor can read csv files, but is not configurable.

Examples

Read a regular csv file, with all default options, automatically determine types:

>>> url = 'http://s3.amazonaws.com/gl-testdata/rating_data_example.csv'
>>> xf = xframes.XFrame.read_csv(url)
>>> xf
Columns:
  user_id int
  movie_id  int
  rating  int
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Read only the first 100 lines of the csv file:

>>> xf = xframes.XFrame.read_csv(url, nrows=100)
>>> xf
Columns:
  user_id int
  movie_id  int
  rating  int
Rows: 100
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[100 rows x 3 columns]

Read all columns as str type:

>>> xf = xframes.XFrame.read_csv(url, column_type_hints=str)
>>> xf
Columns:
  user_id  str
  movie_id  str
  rating  str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Specify types for a subset of columns and leave the rest to be str.

>>> xf = xframes.XFrame.read_csv(url,
...                               column_type_hints={
...                               'user_id':int, 'rating':float
...                               })
>>> xf
Columns:
  user_id int
  movie_id  str
  rating  float
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  3.0   |
|  25907  |   1663   |  3.0   |
|  25923  |   1663   |  3.0   |
|  25924  |   1663   |  3.0   |
|  25928  |   1663   |  2.0   |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Do not treat the first line as a header:

>>> xf = xframes.XFrame.read_csv(url, header=False)
>>> xf
Columns:
  X1  str
  X2  str
  X3  str
Rows: 10001
+---------+----------+--------+
|    X1   |    X2    |   X3   |
+---------+----------+--------+
| user_id | movie_id | rating |
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10001 rows x 3 columns]

Treat ‘3’ as missing value:

>>> xf = xframes.XFrame.read_csv(url, na_values=['3'], column_type_hints=str)
>>> xf
Columns:
  user_id str
  movie_id  str
  rating  str
Rows: 10000
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |  None  |
|  25907  |   1663   |  None  |
|  25923  |   1663   |  None  |
|  25924  |   1663   |  None  |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[10000 rows x 3 columns]

Throw error on parse failure:

>>> bad_url = 'https://s3.amazonaws.com/gl-testdata/bad_csv_example.csv'
>>> xf = xframes.XFrame.read_csv(bad_url, error_bad_lines=True)
RuntimeError: Runtime Exception. Unable to parse line "x,y,z,a,b,c"
Set error_bad_lines=False to skip bad lines
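
The dictionary form of column_type_hints can be approximated with the standard csv module. The sketch below converts only the hinted columns and leaves the others as strings, as described above. It is not how xframes parses files: apply_type_hints is a hypothetical helper and the sample text is made up to resemble the rating data.

```python
import csv
import io


def apply_type_hints(text, hints):
    """Parse CSV text, converting the columns named in `hints`; others stay str."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({name: hints.get(name, str)(value)
                     for name, value in record.items()})
    return rows


sample = "user_id,movie_id,rating\n25904,1663,3\n25928,1663,2\n"
rows = apply_type_hints(sample, {'user_id': int, 'rating': float})
```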
classmethod read_csv_with_errors(url, delimiter=', ', header=True, comment_char='', escape_char='\\', double_quote=True, quote_char='"', skip_initial_space=True, column_type_hints=None, na_values=None, nrows=None, verbose=False)[source]

Constructs an XFrame from a CSV file or a path to multiple CSVs, and returns a pair containing the XFrame and a dict that maps each error type to an XArray of the lines that failed to parse with that error.

The kinds of errors that are detected are:
  • width – The row has the wrong number of columns.
  • header – The first row in the file did not parse correctly. This row is used to determine the table width, so the rest of the file is not processed. The result is an empty XFrame.
  • csv – The csv parser raised a csv.Error or a SystemError exception. This can be caused by an unacceptable character, such as a null byte, in the input, or by serious system errors. The presence of this error indicates that processing has been interrupted, so the remaining data in the input file is not processed.
Parameters:

url : string

Location of the CSV file or directory to load. If URL is a directory or a “glob” pattern, all matching files will be loaded.

delimiter : string, optional

This describes the delimiter used for parsing csv files. Must be a single character. Files with double delimiters such as “||” should specify delimiter=’|’ and should drop columns with empty heading and data.

header : bool, optional

If true, uses the first row as the column names. Otherwise use the default column names: ‘X.1, X.2, ...’.

comment_char : string, optional

The character which denotes that the remainder of the line is a comment. The line must contain valid data preceding the comment.

escape_char : string, optional

Character which begins a C escape sequence

double_quote : bool, optional

If True, two consecutive quotes in a string are parsed to a single quote.

quote_char : string, optional

Character sequence that indicates a quote.

skip_initial_space : bool, optional

Ignore extra spaces at the start of a field

column_type_hints : None, type, list[type], dict{string: type}, optional

This provides type hints for each column. By default, this method attempts to detect the type of each column automatically.

Supported types are int, float, str, list, dict, and array.array.

  • If a single type is provided, the type will be applied to all columns. For instance, column_type_hints=float will force all columns to be parsed as float.
  • If a list of types is provided, the types apply to each column in order, e.g. [int, float, str] will parse the first column as int, second as float and third as string.
  • If a dictionary of column name to type is provided, each type value in the dictionary is applied to the key it belongs to. For instance {‘user’:int} will hint that the column called “user” should be parsed as an integer, and the rest will default to string.

na_values : str | list of str, optional

A string or list of strings to be interpreted as missing values.

nrows : int, optional

If set, only this many rows will be read from the file.

verbose : bool, optional

If True, print the progress while reading files.

Returns:

tuple

The first element is the XFrame with good data. The second element is a dictionary mapping filenames to XArrays that hold, for each file, the lines that were parsed incorrectly.

See also

xframes.XFrame.read_csv
Reads csv without error controls.
xframes.XFrame
The constructor can read csv files, but is not configurable.

Examples

>>> bad_url = 'https://s3.amazonaws.com/gl-testdata/bad_csv_example.csv'
>>> (xf, bad_lines) = xframes.XFrame.read_csv_with_errors(bad_url)
>>> xf
+---------+----------+--------+
| user_id | movie_id | rating |
+---------+----------+--------+
|  25904  |   1663   |   3    |
|  25907  |   1663   |   3    |
|  25923  |   1663   |   3    |
|  25924  |   1663   |   3    |
|  25928  |   1663   |   2    |
|   ...   |   ...    |  ...   |
+---------+----------+--------+
[98 rows x 3 columns]
>>> bad_lines
{'https://s3.amazonaws.com/gl-testdata/bad_csv_example.csv': dtype: str
 Rows: 1
 ['x,y,z,a,b,c']}
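
The width error described above can be illustrated with the standard csv module: rows whose field count differs from the header are set aside instead of stopping the parse. This sketch does not use xframes and covers only the width case. split_width_errors is a hypothetical helper, and keying the error dict by error type follows the description at the top of this method.

```python
import csv
import io


def split_width_errors(text):
    """Separate rows matching the header width from rows that do not."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row fixes the table width
    good, errors = [], {'width': []}
    for fields in reader:
        if len(fields) == len(header):
            good.append(dict(zip(header, fields)))
        else:
            errors['width'].append(','.join(fields))  # keep the bad line as text
    return good, errors


sample = "user_id,movie_id,rating\n25904,1663,3\nx,y,z,a,b,c\n25928,1663,2\n"
good, errors = split_width_errors(sample)
```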
classmethod read_parquet(url)[source]

Constructs an XFrame from a parquet file.

Parameters:

url : string

Location of the parquet file to load.

Returns:

XFrame

See also

xframes.XFrame
The constructor can read parquet files.
classmethod read_text(path, delimiter=None, nrows=None, verbose=False)[source]

Constructs an XFrame from a text file or a path to multiple text files.

Parameters:

path : string

Location of the text file or directory to load. If ‘path’ is a directory or a “glob” pattern, all matching files will be loaded.

delimiter : string, optional

This describes the delimiter used for separating records. Must be a single character. Defaults to newline.

nrows : int, optional

If set, only this many rows will be read from the file.

verbose : bool, optional

If True, print the progress while reading files.

Returns:

XFrame

Examples

Read a regular text file, with default options.

>>> path = 'http://s3.amazonaws.com/gl-testdata/rating_data_example.csv'
>>> xf = xframes.XFrame.read_text(path)
>>> xf
+---------+
|   text  |
+---------+
|  25904  |
|  25907  |
|  25923  |
|  25924  |
|  25928  |
|   ...   |
+---------+
[10000 rows x 1 columns]

Read only the first 100 lines of the text file:

>>> xf = xframes.XFrame.read_text(path, nrows=100)
>>> xf
Rows: 100
+---------+
|  25904  |
|  25907  |
|  25923  |
|  25924  |
|  25928  |
|   ...   |
+---------+
[100 rows x 1 columns]

Read using a given delimiter.

>>> xf = xframes.XFrame.read_text(path, delimiter='.')
>>> xf
Rows: 250
+---------+
|  25904  |
|  25907  |
|  25923  |
|  25924  |
|  25928  |
|   ...   |
+---------+
[250 rows x 1 columns]
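
The effect of the delimiter and nrows parameters can be sketched with ordinary string splitting. This is a simplified model, not the xframes reader: split_records is a hypothetical helper, and dropping the empty record after a trailing delimiter is an assumption.

```python
def split_records(text, delimiter=None, nrows=None):
    """Split `text` into records on `delimiter` (newline when not given)."""
    separator = '\n' if delimiter is None else delimiter
    records = text.split(separator)
    if records and records[-1] == '':
        records.pop()  # assumption: ignore the empty tail after a trailing delimiter
    return records if nrows is None else records[:nrows]


by_line = split_records("first\nsecond\nthird\n")
by_dot = split_records("first.second.third", delimiter='.')
limited = split_records("first\nsecond\nthird\n", nrows=2)
```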
remove_column(name)[source]

Remove one or more columns from this XFrame. This operation returns a new XFrame with the given column or columns removed.

Parameters:

name : string or list or iterable

The name of the column to remove. If a list or iterable is given, all the named columns are removed.

Returns:

XFrame

A new XFrame with given column or columns removed.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xf2 = xf.remove_column('val')
>>> xf2
+----+
| id |
+----+
| 1  |
| 2  |
| 3  |
+----+
[3 rows x 1 columns]
>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val1': ['A', 'B', 'C'], 'val2': [10, 11, 12]})
>>> xf2 = xf.remove_column(['val1', 'val2'])
>>> xf2
+----+
| id |
+----+
| 1  |
| 2  |
| 3  |
+----+
[3 rows x 1 columns]
remove_columns(column_names)[source]

Removes one or more columns from this XFrame. This operation returns a new XFrame with the given columns removed.

Parameters:

column_names : list or iterable

A list or iterable of the column names.

Returns:

XFrame

A new XFrame with given columns removed.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val1': ['A', 'B', 'C'], 'val2': [10, 11, 12]})
>>> xf2 = xf.remove_columns(['val1', 'val2'])
>>> xf2
+----+
| id |
+----+
| 1  |
| 2  |
| 3  |
+----+
[3 rows x 1 columns]
rename(names)[source]

Rename the given columns. Names can be a dict specifying the old and new names. This changes the names of the columns given as the keys and replaces them with the names given as the values. Alternatively, names can be a list of the new column names. In this case it must be the same length as the number of columns. This operation returns a new XFrame with the given columns renamed.

Parameters:

names : dict [string, string] | list [ string ]

Dictionary of [old_name, new_name] or list of new names

Returns:

XFrame

A new XFrame with columns renamed.

Examples

>>> xf = xframes.XFrame({'X.1': ['Alice','Bob'],
...                       'X.2': ['123 Fake Street','456 Fake Street']})
>>> xf2 = xf.rename({'X.1': 'name', 'X.2':'address'})
>>> xf2
+-------+-----------------+
|  name |     address     |
+-------+-----------------+
| Alice | 123 Fake Street |
|  Bob  | 456 Fake Street |
+-------+-----------------+
[2 rows x 2 columns]
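
The example above shows only the dict form. The difference between the dict form and the list form can be sketched on a plain list of column names. The rename_columns helper below is hypothetical and only models the name mapping. It does not touch any data.

```python
def rename_columns(column_names, names):
    """Return the new column names for a dict or list `names` argument."""
    if isinstance(names, dict):
        # Dict form: only the listed columns change, the rest keep their names.
        return [names.get(name, name) for name in column_names]
    if len(names) != len(column_names):
        raise ValueError('a list of names must cover every column')
    # List form: the new names replace the old ones position by position.
    return list(names)


old_names = ['X.1', 'X.2', 'X.3']
from_dict = rename_columns(old_names, {'X.1': 'name', 'X.2': 'address'})
from_list = rename_columns(old_names, ['name', 'address', 'zipcode'])
```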
reorder_columns(column_names)[source]

Reorder the columns in the table. This operation returns a new XFrame with the given columns reordered.

Parameters:

column_names : list of string

Names of the columns in desired order.

Returns:

XFrame

A new XFrame with reordered columns.

See also

xframes.XFrame.select_columns
Returns a subset of the columns but does not change the column order.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xf2 = xf.reorder_columns(['val', 'id'])
>>> xf2
+-----+-----+
| val | id  |
+-----+-----+
|  A  |  1  |
|  B  |  2  |
|  C  |  3  |
+-----+-----+
[3 rows x 2 columns]
replace_column(name, col)[source]

Replace a column in this XFrame. The length of the new column must match the length of the existing XFrame. This operation returns a new XFrame with the replacement column.

Parameters:

name : string

The name of the column.

col : XArray

The replacement column.

Returns:

XFrame

A new XFrame with specified column replaced.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xa = xframes.XArray(['cat', 'dog', 'horse'])
>>> xf2 = xf.replace_column('val', xa)
>>> xf2
+----+---------+
| id |   val   |
+----+---------+
| 1  |   cat   |
| 2  |   dog   |
| 3  |  horse  |
+----+---------+
[3 rows x 2 columns]
sample(fraction, max_partitions=None, seed=None)[source]

Sample the current XFrame’s rows.

Parameters:

fraction : float

Approximate fraction of the rows to fetch. Must be between 0 and 1. The number of rows returned is approximately the fraction times the number of rows.

max_partitions : int, optional

After sampling, coalesce to this number of partitions. If not given, do not perform this step.

seed : int, optional

Seed for the random number generator used to sample.

Returns:

XFrame

A new XFrame containing sampled rows of the current XFrame.

Examples

Suppose we have an XFrame with 6,145 rows.

>>> xf = xframes.XFrame({'id': range(0, 6145)})

Retrieve about 30% of the XFrame rows with repeatable results by setting the random seed.

>>> len(xf.sample(.3, seed=5))
1783
save(filename, format=None)[source]

Save the XFrame to a file system for later use.

Parameters:

filename : string

The location to save the XFrame. Either a local directory or a remote URL. If the format is ‘binary’, a directory will be created at the location which will contain the XFrame.

format : {‘binary’, ‘csv’, ‘tsv’, ‘parquet’, ‘json’}, optional

Format in which to save the XFrame. Binary saved XFrames can be loaded much faster and without any format conversion losses. If not given, the format is inferred from the filename: a name ending with ‘csv’ or ‘.csv.gz’ is saved in ‘csv’ format, a name ending with ‘json’ is saved as a json file, and a name ending with ‘parquet’ is saved as a parquet file. Otherwise the XFrame is saved in ‘binary’ format.

See also

xframes.XFrame.load, xframes.XFrame.XFrame

Examples

>>> # Save the xframe into binary format
>>> xf.save('data/training_data_xframe')
>>> # Save the xframe into csv format
>>> xf.save('data/training_data.csv', format='csv')
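
The inference rule described for the format parameter can be written out as a small function. This is a sketch of the documented rule only: infer_format is a hypothetical helper, not part of xframes, and it does not cover any case the description leaves unstated.

```python
def infer_format(filename):
    """Guess the save format from the end of `filename`, per the rule above."""
    if filename.endswith('csv') or filename.endswith('.csv.gz'):
        return 'csv'
    if filename.endswith('json'):
        return 'json'
    if filename.endswith('parquet'):
        return 'parquet'
    return 'binary'  # anything else is saved as a binary directory


binary_case = infer_format('data/training_data_xframe')
csv_case = infer_format('data/training_data.csv')
```

Passing format explicitly, as in the second save call above, avoids relying on the filename.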
select_column(column_name)[source]

Return an XArray that corresponds with the given column name. Throws an exception if the column name is something other than a string or if the column name is not found.

Subscripting an XFrame by a column name is equivalent to this function.

Parameters:

column_name : str

The column name.

Returns:

XArray

The XArray referred to by column_name.

See also

xframes.XFrame.select_columns
Returns multiple columns.

Examples

>>> xf = xframes.XFrame({'user_id': [1,2,3],
...                       'user_name': ['alice', 'bob', 'charlie']})
>>> # This line is equivalent to `sa = xf['user_name']`
>>> sa = xf.select_column('user_name')
>>> sa
dtype: str
Rows: 3
['alice', 'bob', 'charlie']
select_columns(keylist)[source]

Get an XFrame composed only of the columns referred to in the given list of keys. Throws an exception if ANY of the keys are not in this XFrame or if keylist is anything other than a list of strings.

Parameters:

keylist : list[str]

The list of column names.

Returns:

XFrame

A new XFrame that is made up of the columns referred to in keylist from the current XFrame. The order of the columns is preserved.

See also

xframes.XFrame.select_column
Returns a single column.

Examples

>>> xf = xframes.XFrame({'user_id': [1,2,3],
...                       'user_name': ['alice', 'bob', 'charlie'],
...                       'zipcode': [98101, 98102, 98103]
...                      })
>>> # This line is equivalent to `xf2 = xf[['user_id', 'zipcode']]`
>>> xf2 = xf.select_columns(['user_id', 'zipcode'])
>>> xf2
+---------+---------+
| user_id | zipcode |
+---------+---------+
|    1    |  98101  |
|    2    |  98102  |
|    3    |  98103  |
+---------+---------+
[3 rows x 2 columns]
select_rows(xa)[source]

Selects rows of the XFrame where the XArray evaluates to True.

Parameters:

xa : XArray

Must be the same length as the XFrame. The filter values.

Returns:

XFrame

A new XFrame which contains the rows of the XFrame where the XArray is True. The truth test is the same as in python, so non-zero values are considered true.
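The truth test can be illustrated with a plain-Python sketch. Here rows are modeled as dictionaries and the filter as a list; `select_rows` is a hypothetical stand-in, not the library method.

```python
def select_rows(rows, mask):
    """Keep the rows whose matching mask value is truthy in the Python sense."""
    if len(rows) != len(mask):
        raise ValueError("mask must be the same length as rows")
    return [row for row, keep in zip(rows, mask) if keep]

rows = [{'id': 1, 'val': 'A'}, {'id': 2, 'val': 'B'}, {'id': 3, 'val': 'C'}]
# Non-zero values count as True, zero counts as False.
kept = select_rows(rows, [1, 0, 2])
print(kept)
```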

Set the footer printed beneath tables.

Parameters:

footer_strs : list

A list of strings. Each string is a separate line, printed beneath a table. This footer is used when the length of the table is known. To disable printing the footer, pass an empty list.

classmethod set_html_max_row_width(width)[source]

Set the maximum display width for displaying in HTML.

Parameters:

width : int

The maximum width of the table when printing in html.

Set the footer printed beneath tables when the length is unknown.

Parameters:

footer_strs : list

A list of strings. Each string is a separate line, printed beneath a table. This footer is used when the length of the table is not known because the XFrame has not been evaluated. To disable printing the footer, pass an empty list.

classmethod set_max_row_width(width)[source]

Set the maximum display width for printing.

Parameters:

width : int

The maximum width of the table when printing.

shape

The shape of the XFrame, in a tuple. The first entry is the number of rows, the second is the number of columns.

Examples

>>> xf = xframes.XFrame({'id':[1,2,3], 'val':['A','B','C']})
>>> xf.shape
(3, 2)
sort(sort_columns, ascending=True)[source]

Sort the current XFrame by the given columns, using the given sort order. Only columns of type str, int, or float can be sorted.

Parameters:

sort_columns : str | list of str | list of (str, bool) pairs

Names of columns to be sorted. The result will be sorted first by first column, followed by second column, and so on. All columns will be sorted in the same order as governed by the ascending parameter. To control the sort ordering for each column individually, sort_columns must be a list of (str, bool) pairs. Given this case, the first value is the column name and the second value is a boolean indicating whether the sort order is ascending.

ascending : bool, optional

Sort all columns in the given order.

Returns:

XFrame

A new XFrame that is sorted according to the given sort criteria.

Examples

Suppose ‘xf’ is an xframe that has three columns ‘a’, ‘b’, ‘c’. To sort by column ‘a’, ascending:

>>> xf = xframes.XFrame({'a':[1,3,2,1],
...                       'b':['a','c','b','b'],
...                       'c':['x','y','z','y']})
>>> xf
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | a | x |
| 3 | c | y |
| 2 | b | z |
| 1 | b | y |
+---+---+---+
[4 rows x 3 columns]
>>> xf.sort('a')
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | a | x |
| 1 | b | y |
| 2 | b | z |
| 3 | c | y |
+---+---+---+
[4 rows x 3 columns]

To sort by column ‘a’, descending:

>>> xf.sort('a', ascending = False)
+---+---+---+
| a | b | c |
+---+---+---+
| 3 | c | y |
| 2 | b | z |
| 1 | a | x |
| 1 | b | y |
+---+---+---+
[4 rows x 3 columns]

To sort by column ‘a’ and ‘b’, all ascending:

>>> xf.sort(['a', 'b'])
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | a | x |
| 1 | b | y |
| 2 | b | z |
| 3 | c | y |
+---+---+---+
[4 rows x 3 columns]

To sort by column ‘a’ ascending, and then by column ‘c’ descending:

>>> xf.sort([('a', True), ('c', False)])
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | b | y |
| 1 | a | x |
| 2 | b | z |
| 3 | c | y |
+---+---+---+
[4 rows x 3 columns]
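The per-column ordering can be reasoned about with a plain-Python sketch. One common way to get mixed ascending and descending keys is to apply a stable sort once per key, starting with the least significant key. The helper `sort_rows` is hypothetical and models rows as dictionaries; it is not how XFrame sorts internally.

```python
def sort_rows(rows, sort_columns, ascending=True):
    """Sort dict rows by one or more columns, each with its own direction."""
    if isinstance(sort_columns, str):
        sort_columns = [sort_columns]
    # Normalize every entry to a (column name, ascending flag) pair.
    keys = [col if isinstance(col, tuple) else (col, ascending)
            for col in sort_columns]
    result = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields the combined ordering.
    for name, asc in reversed(keys):
        result.sort(key=lambda row: row[name], reverse=not asc)
    return result

rows = [{'a': 1, 'b': 'a', 'c': 'x'}, {'a': 3, 'b': 'c', 'c': 'y'},
        {'a': 2, 'b': 'b', 'c': 'z'}, {'a': 1, 'b': 'b', 'c': 'y'}]
mixed = sort_rows(rows, [('a', True), ('c', False)])
print([(r['a'], r['b'], r['c']) for r in mixed])
```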
split_datetime(expand_column, column_name_prefix=None, limit=None)[source]

Splits a datetime column of XFrame to multiple columns, with each value in a separate column. Returns a new XFrame with the expanded column replaced with a list of new columns. The expanded column must be of datetime.datetime type.

For more details regarding name generation and other behavior, refer to xframes.XArray.expand()

Parameters:

expand_column : str

Name of the column to expand.

column_name_prefix : str, optional

If provided, expanded column names would start with the given prefix. If not provided, the default value is the name of the expanded column.

limit : list[str], optional

Limits the set of datetime elements to expand. Elements are ‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’, and ‘second’.

Returns:

XFrame

A new XFrame that contains rest of columns from original XFrame with the given column replaced with a collection of expanded columns.

Examples

>>> xf
Columns:
    id   int
    submission  datetime.datetime
Rows: 2
Data:
    +----+----------------------------------------------------------+
    | id |               submission                                 |
    +----+----------------------------------------------------------+
    | 1  | datetime.datetime(2011, 1, 21, 7, 17, 21)                |
    | 2  | datetime.datetime(2011, 1, 21, 5, 43, 21)                |
    +----+----------------------------------------------------------+
>>> xf.split_datetime('submission',limit=['hour','minute'])
Columns:
    id  int
    submission.hour int
    submission.minute int
Rows: 2
Data:
+----+-----------------+-------------------+
| id | submission.hour | submission.minute |
+----+-----------------+-------------------+
| 1  |        7        |        17         |
| 2  |        5        |        43         |
+----+-----------------+-------------------+
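The naming of the expanded columns follows the pattern visible in the example: prefix, a dot, then the element name. A minimal plain-Python sketch of that behavior for a single row is shown below; the helper `split_datetime_row` is hypothetical and models a row as a dictionary.

```python
from datetime import datetime

ELEMENTS = ['year', 'month', 'day', 'hour', 'minute', 'second']

def split_datetime_row(row, expand_column, column_name_prefix=None, limit=None):
    """Replace a datetime value with one entry per requested element."""
    prefix = expand_column if column_name_prefix is None else column_name_prefix
    wanted = ELEMENTS if limit is None else [e for e in ELEMENTS if e in limit]
    value = row[expand_column]
    # Keep every other column unchanged.
    out = {k: v for k, v in row.items() if k != expand_column}
    for element in wanted:
        out['{}.{}'.format(prefix, element)] = getattr(value, element)
    return out

row = {'id': 1, 'submission': datetime(2011, 1, 21, 7, 17, 21)}
expanded = split_datetime_row(row, 'submission', limit=['hour', 'minute'])
print(expanded)
```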
sql(sql_statement, table_name='xframe')[source]

Executes the given sql statement over the data in the table. Returns a new XFrame with the results.

Parameters:

sql_statement : str

The statement to execute.

The statement is executed by the Spark SQL query processor. See the Spark SQL documentation for details. XFrame column names and types are translated to Spark for query processing.

table_name : str, optional

The table name to create, referred to in the sql statement. Defaults to ‘xframe’.

Returns:

XFrame

The new XFrame with the results.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
>>> xf.sql("SELECT * FROM xframe WHERE id > 1")
+----+--------+
| id |  val   |
+----+--------+
| 2  |   'b'  |
| 3  |   'c'  |
+----+--------+
[2 rows x 2 columns]
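To see how the statement refers to the data through table_name, the same query can be tried against an in-memory SQLite table. SQLite is used here only as a stand-in: XFrame runs the statement through Spark SQL, whose dialect and type handling differ, and the ORDER BY clause is added just to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# The table is named 'xframe' to match the default table_name.
conn.execute('CREATE TABLE xframe (id INTEGER, val TEXT)')
conn.executemany('INSERT INTO xframe VALUES (?, ?)',
                 [(1, 'a'), (2, 'b'), (3, 'c')])
result = conn.execute(
    'SELECT * FROM xframe WHERE id > 1 ORDER BY id').fetchall()
conn.close()
print(result)
```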
stack(column_name, new_column_name=None, drop_na=False)[source]

Convert a “wide” column of an XFrame to one or two “tall” columns by stacking all values.

The stack works only for columns of dict, list, or array type. If the column is dict type, two new columns are created as a result of stacking: one column holds the key and another column holds the value. The rest of the columns are repeated for each key/value pair.

If the column is array or list type, one new column is created as a result of stacking. Each row holds one element of the array or list value, and the remaining columns from the same original row are repeated.

The new XFrame includes the newly created column and all columns other than the one that is stacked.

Parameters:

column_name : str

The column to stack. This column must be of dict, list, or array type.

new_column_name : str | list of str, optional

The new column name(s). If the original column is list/array type, new_column_name must be a string. If the original column is dict type, new_column_name must be a list of two strings. If not given, column names are generated automatically.

drop_na : boolean, optional

If True, missing values and empty list/array/dict are all dropped from the resulting column(s). If False, missing values are maintained in stacked column(s).

Returns:

XFrame

A new XFrame that contains newly stacked column(s) plus columns in original XFrame other than the stacked column.

See also

xframes.XFrame.unstack
Undo the effect of stack.

Examples

Suppose ‘xf’ is an XFrame that contains a column of dict type:

>>> xf = xframes.XFrame({'topic':[1,2,3,4],
...                       'words': [{'a':3, 'cat':2},
...                                 {'a':1, 'the':2},
...                                 {'the':1, 'dog':3},
...                                 {}]
...                      })
>>> xf
+-------+----------------------+
| topic |        words         |
+-------+----------------------+
|   1   |  {'a': 3, 'cat': 2}  |
|   2   |  {'a': 1, 'the': 2}  |
|   3   | {'the': 1, 'dog': 3} |
|   4   |          {}          |
+-------+----------------------+
[4 rows x 2 columns]

Stack would stack all keys in one column and all values in another column:

>>> xf.stack('words', new_column_name=['word', 'count'])
+-------+------+-------+
| topic | word | count |
+-------+------+-------+
|   1   |  a   |   3   |
|   1   | cat  |   2   |
|   2   |  a   |   1   |
|   2   | the  |   2   |
|   3   | the  |   1   |
|   3   | dog  |   3   |
|   4   | None |  None |
+-------+------+-------+
[7 rows x 3 columns]

Observe that since topic 4 had no words, an empty row is inserted. To drop that row, set drop_na=True in the parameters to stack.

Suppose ‘xf’ is an XFrame that contains a user and his/her friends, where the ‘friends’ column is an array type. Stacking on the ‘friends’ column creates one row for each user/friend pair:

>>> xf = xframes.XFrame({'user':[1,2,3],
...                       'friends':[[2,3,4], [5,6],
...                                  [4,5,10,None]]
...                      })
>>> xf
+------+------------------+
| user |     friends      |
+------+------------------+
|  1   |     [2, 3, 4]    |
|  2   |      [5, 6]      |
|  3   | [4, 5, 10, None] |
+------+------------------+
[3 rows x 2 columns]
>>> xf.stack('friends', new_column_name='friend')
+------+--------+
| user | friend |
+------+--------+
|  1   |  2     |
|  1   |  3     |
|  1   |  4     |
|  2   |  5     |
|  2   |  6     |
|  3   |  4     |
|  3   |  5     |
|  3   |  10    |
|  3   |  None  |
+------+--------+
[9 rows x 2 columns]
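The row expansion performed by stack can be sketched in plain Python, with rows modeled as dictionaries. The helper `stack_rows` is hypothetical; in particular, treating drop_na as "drop entries whose value is None, and emit nothing for empty containers" is an assumption made for this sketch.

```python
def stack_rows(rows, column_name, new_column_name, drop_na=False):
    """Emit one output row per element (or key/value pair) of a column."""
    names = ([new_column_name] if isinstance(new_column_name, str)
             else list(new_column_name))
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column_name}
        value = row[column_name]
        if value is None:
            entries = []
        elif isinstance(value, dict):
            entries = list(value.items())      # (key, value) pairs
        else:
            entries = [(v,) for v in value]    # one-element tuples
        if drop_na:
            # Assumption: an entry is "missing" when its value is None.
            entries = [e for e in entries if e[-1] is not None]
        elif not entries:
            entries = [(None,) * len(names)]   # keep a placeholder row
        for entry in entries:
            new_row = dict(rest)
            new_row.update(zip(names, entry))
            out.append(new_row)
    return out

topics = [{'topic': 1, 'words': {'a': 3, 'cat': 2}},
          {'topic': 4, 'words': {}}]
print(stack_rows(topics, 'words', ['word', 'count']))
print(stack_rows(topics, 'words', ['word', 'count'], drop_na=True))
```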
swap_columns(column_1, column_2)[source]

Swap the columns with the given names. This operation returns a new XFrame with the given columns swapped.

Parameters:

column_1 : string

Name of column to swap

column_2 : string

Name of other column to swap

Returns:

XFrame

A new XFrame with specified columns swapped.

Examples

>>> xf = xframes.XFrame({'id': [1, 2, 3], 'val': ['A', 'B', 'C']})
>>> xf2 = xf.swap_columns('id', 'val')
>>> xf2
+-----+-----+
| val | id  |
+-----+-----+
|  A  |  1  |
|  B  |  2  |
|  C  |  3  |
+-----+-----+
[3 rows x 2 columns]
tail(n=10)[source]

The last n rows of the XFrame.

Parameters:

n : int, optional

The number of rows to fetch.

Returns:

XFrame

A new XFrame which contains the last n rows of the current XFrame.

See also

xframes.XFrame.head
Returns the first part of the XFrame.
xframes.XFrame.print_rows
Prints the XFrame.
to_pandas_dataframe()[source]

Convert this XFrame to pandas.DataFrame.

This operation constructs a pandas.DataFrame in memory. Take care when the returned object is large, because all rows must fit in the memory of the local machine.

Returns:

pandas.DataFrame

The dataframe which contains all rows of XFrame.

to_rdd()[source]

Convert the current XFrame to a Spark RDD. The RDD consists of tuples containing the column data. No conversion is necessary: the internal RDD is returned.

Returns:

spark.RDD

The spark RDD that is used to represent the XFrame.

See also

from_rdd
Converts from a Spark RDD.
to_spark_dataframe(table_name=None, column_names=None, column_type_hints=None, number_of_partitions=None)[source]

Convert the current XFrame to a Spark DataFrame.

Parameters:

table_name : str, optional

If given, give this name to the temporary table.

column_names : list, optional

A list of the column names to assign. Defaults to the names in the table, edited to fit the Dataframe restrictions.

column_type_hints : dict, optional

Column types must be supplied when creating a DataFrame. These hints specify these types. If hints are not given, the column types are derived from the XFrame column types. The column types in DataFrames are more restricted than in XFrames.

XFrames attempts to supply the correct column types, but cannot always determine the correct settings. The caller can supply hints to ensure the desired settings, but the caller is still responsible for making sure the values in the XFrame are consistent with these settings.

  • Integers: In DataFrames integers must fit in 64 bits. In python, large integers can be larger. If an XFrame contains such integers, it will fail to store as a DataFrame. The column can be converted to strings in this case.
  • Lists must be of a uniform type in a DataFrame. The caller must convert lists to meet this requirement, and must provide a hint specifying the element type.
  • Dictionaries must have a uniform key and value type. The caller must convert dictionaries to meet this requirement and must provide a hint specifying the key and value types.

Hints are given as a dictionary of column_name: column_hint. Any column without a hint is handled using the XFrame column type. For simple types, hints are just type names (as strings): int, long, float, bool, datetime, or str. For lists, hints are “list[<type>]” where <type> is one of the simple types. For dictionaries, hints are “dict{<key_type>:<value_type>}” where <key_type> and <value_type> are each one of the simple types.

number_of_partitions : int, optional

The number of partitions to create.

Returns:

spark.DataFrame

The converted spark dataframe.
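As an illustration of the hint syntax described for column_type_hints, the sketch below builds a hints dictionary and checks the 64-bit integer restriction before conversion. The column names and the helper `fits_in_int64` are hypothetical; only the hint strings follow the documented forms.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def fits_in_int64(values):
    """Return True if every non-missing integer fits in a signed 64-bit value."""
    return all(v is None or INT64_MIN <= v <= INT64_MAX for v in values)

# Hypothetical column names; the hint strings use the documented syntax.
column_type_hints = {
    'user_id': 'int',
    'tags': 'list[str]',
    'word_counts': 'dict{str:int}',
}

big_ids = [1, 2**70]
# Integers that do not fit in 64 bits can be converted to strings first.
safe_ids = big_ids if fits_in_int64(big_ids) else [str(v) for v in big_ids]
print(safe_ids)
```

The resulting dictionary would then be passed as the column_type_hints argument.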

topk(column_name, k=10, reverse=False)[source]

Get k rows according to the largest values in the given column. Result is sorted by column_name in the given order (default is descending). When k is small, topk is more efficient than sort.

Parameters:

column_name : string

The column to sort on

k : int, optional

The number of rows to return

reverse : bool, optional

If True, return the top k rows in ascending order, otherwise, in descending order.

Returns:

XFrame

An XFrame containing the top k rows sorted by column_name.

Examples

>>> xf = xframes.XFrame({'id': range(1000)})
>>> xf['value'] = -xf['id']
>>> xf.topk('id', k=3)
+--------+--------+
|   id   |  value |
+--------+--------+
|   999  |  -999  |
|   998  |  -998  |
|   997  |  -997  |
+--------+--------+
[3 rows x 2 columns]
>>> xf.topk('value', k=3)
+--------+--------+
|   id   |  value |
+--------+--------+
|   0    |   0    |
|   1    |  -1    |
|   2    |  -2    |
+--------+--------+
[3 rows x 2 columns]
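The claim that topk beats a full sort for small k can be illustrated with a heap-based selection in plain Python. This is a sketch of the general technique using the standard heapq module, not the XFrame implementation; it assumes that reverse=True means "the k smallest values, in ascending order".

```python
import heapq

def topk_rows(rows, column_name, k=10, reverse=False):
    """Select k rows by column value without sorting the whole input."""
    pick = heapq.nsmallest if reverse else heapq.nlargest
    return pick(k, rows, key=lambda row: row[column_name])

rows = [{'id': i, 'value': -i} for i in range(1000)]
top_ids = [row['id'] for row in topk_rows(rows, 'id', k=3)]
print(top_ids)
```

A heap of size k is maintained while scanning the rows once, so the cost grows with n log k rather than n log n.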
transform_col(col, fn=None, dtype=None, use_columns=None, seed=None)[source]

Transform a single column according to a specified function. The remaining columns are not modified. The type of the transformed column becomes dtype, with the new value being the result of fn(x), where x is a single row in the XFrame represented as a dictionary. The fn should return exactly one value which can be cast into type dtype. If dtype is not specified, the first 100 rows of the XFrame are used to make a guess of the target data type.

Parameters:

col : string

The name of the column to transform.

fn : function, optional

The function to transform each row of the XFrame. The return type should be convertible to dtype if dtype is not None. If the function is not given, an identity function is used.

dtype : dtype, optional

The column data type of the new XArray. If None, the first 100 elements of the array are used to guess the target data type.

use_columns : str | list[str], optional

The column or list of columns to be supplied in the row passed to the function. If not given, all columns will be used to build the row.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XFrame

An XFrame with the given column transformed by the function and cast to the given type.

Examples

Translate values in a column:

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                       'rating': [4, 5, 1]})
>>> xf.transform_col('rating', lambda row: row['rating'] * 2)

Cast values in a column to a different type:

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                       'rating': [4, 5, 1]})
>>> xf.transform_col('user_id', dtype=str)
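The way fn, dtype, and use_columns interact can be sketched in plain Python with rows modeled as dictionaries. The helper `transform_col_rows` is hypothetical, and it leaves out type guessing and the seed parameter.

```python
def transform_col_rows(rows, col, fn=None, dtype=None, use_columns=None):
    """Replace one column with fn(row), optionally cast to dtype."""
    if isinstance(use_columns, str):
        use_columns = [use_columns]
    out = []
    for row in rows:
        # Only the requested columns are visible to fn.
        visible = row if use_columns is None else {k: row[k] for k in use_columns}
        # Without fn, the column value passes through unchanged (identity).
        value = row[col] if fn is None else fn(visible)
        if dtype is not None:
            value = dtype(value)
        new_row = dict(row)
        new_row[col] = value
        out.append(new_row)
    return out

rows = [{'user_id': 1, 'movie_id': 3, 'rating': 4},
        {'user_id': 2, 'movie_id': 3, 'rating': 5},
        {'user_id': 3, 'movie_id': 6, 'rating': 1}]
doubled = transform_col_rows(rows, 'rating', lambda row: row['rating'] * 2)
as_text = transform_col_rows(rows, 'user_id', dtype=str)
print([r['rating'] for r in doubled])
print([r['user_id'] for r in as_text])
```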
transform_cols(cols, fn=None, dtypes=None, use_columns=None, seed=None)[source]

Transform multiple columns according to a specified function. The remaining columns are not modified. The types of the transformed columns are given by dtypes, with the new values being the result of fn(x) where x is a single row in the XFrame represented as a dictionary. The fn should return a value for each element of cols, which can be cast into the corresponding dtype. If dtypes is not specified, the first 100 rows of the XFrame are used to make a guess of the target data types.

Parameters:

cols : list [str]

The names of the columns to transform.

fn : function, optional

The function to transform each row of the XFrame. The return value should be a list of values, one for each column of cols. Each value should be convertible to the corresponding dtype if dtypes is not None. If the function is not given, an identity function is used.

dtypes : list[type], optional

The data types of the new columns. There must be one data type for each column in cols. If not supplied, the first 100 elements of the array are used to guess the target data types.

use_columns : str | list[str], optional

The column or list of columns to be supplied in the row passed to the function. If not given, all columns will be used to build the row.

seed : int, optional

Used as the seed if a random number generator is included in fn.

Returns:

XFrame

An XFrame with the given columns transformed by the function and cast to the given types.

Examples

Translate values in several columns:

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                       'rating': [4, 5, 1]})
>>> xf.transform_cols(['movie_id', 'rating'], lambda row: [row['movie_id'] + 1, row['rating'] * 2])

Cast types in several columns:

>>> xf = xframes.XFrame({'user_id': [1, 2, 3], 'movie_id': [3, 3, 6],
...                       'rating': [4, 5, 1]})
>>> xf.transform_cols(['movie_id', 'rating'], dtypes=[str, str])
unique()[source]

Remove duplicate rows of the XFrame. Will not necessarily preserve the order of the given XFrame in the new XFrame.

Returns:

XFrame

A new XFrame that contains the unique rows of the current XFrame.

Raises:

TypeError

If any column in the XFrame is a dictionary type.

Examples

>>> xf = xframes.XFrame({'id':[1,2,3,3,4], 'value':[1,2,3,3,4]})
>>> xf
+----+-------+
| id | value |
+----+-------+
| 1  |   1   |
| 2  |   2   |
| 3  |   3   |
| 3  |   3   |
| 4  |   4   |
+----+-------+
[5 rows x 2 columns]
>>> xf.unique()
+----+-------+
| id | value |
+----+-------+
| 2  |   2   |
| 4  |   4   |
| 3  |   3   |
| 1  |   1   |
+----+-------+
[4 rows x 2 columns]
unpack(unpack_column, column_name_prefix=None, column_types=None, na_value=None, limit=None)[source]

Expand one column of this XFrame to multiple columns with each value in a separate column. Returns a new XFrame with the unpacked column replaced with a list of new columns. The column must be of list, tuple, array, or dict type.

For more details regarding name generation, missing value handling, and other behavior, refer to the XArray version of unpack().

Parameters:

unpack_column : str

Name of the column to unpack.

column_name_prefix : str, optional

If provided, unpacked column names would start with the given prefix. If not provided, default value is the name of the unpacked column.

column_types : [type], optional

Column types for the unpacked columns. If not provided, column types are automatically inferred from first 100 rows. For array type, default column types are float. If provided, column_types also restricts how many columns to unpack.

na_value : flexible_type, optional

If provided, convert all values that are equal to “na_value” to missing value (None).

limit : list[str] | list[int], optional

Control unpacking only a subset of list/array/dict value. For dictionary XArray, limit is a list of dictionary keys to restrict. For list/array XArray, limit is a list of integers that are indexes into the list/array value.

Returns:

XFrame

A new XFrame that contains rest of columns from original XFrame with the given column replaced with a collection of unpacked columns.

See also

xframes.XFrame.pack_columns
The opposite of unpack.

Examples

>>> xf = xframes.XFrame({'id': [1,2,3],
...                      'wc': [{'a': 1}, {'b': 2}, {'a': 1, 'b': 2}]})
>>> xf
+----+------------------+
| id |        wc        |
+----+------------------+
| 1  |     {'a': 1}     |
| 2  |     {'b': 2}     |
| 3  | {'a': 1, 'b': 2} |
+----+------------------+
[3 rows x 2 columns]
>>> xf.unpack('wc')
+----+------+------+
| id | wc.a | wc.b |
+----+------+------+
| 1  |  1   | None |
| 2  | None |  2   |
| 3  |  1   |  2   |
+----+------+------+
[3 rows x 3 columns]

To not have prefix in the generated column name:

>>> xf.unpack('wc', column_name_prefix="")
+----+------+------+
| id |  a   |  b   |
+----+------+------+
| 1  |  1   | None |
| 2  | None |  2   |
| 3  |  1   |  2   |
+----+------+------+
[3 rows x 3 columns]

To limit subset of keys to unpack:

>>> xf.unpack('wc', limit=['b'])
+----+------+
| id | wc.b |
+----+------+
| 1  | None |
| 2  |  2   |
| 3  |  2   |
+----+------+
[3 rows x 2 columns]

To unpack an array column:

>>> import array
>>> xf = xframes.XFrame({'id': [1,2,3],
...                       'friends': [array.array('d', [1.0, 2.0, 3.0]),
...                                   array.array('d', [2.0, 3.0, 4.0]),
...                                   array.array('d', [3.0, 4.0, 5.0])]})
>>> xf
+----+-----------------------------+
| id |            friends          |
+----+-----------------------------+
| 1  | array('d', [1.0, 2.0, 3.0]) |
| 2  | array('d', [2.0, 3.0, 4.0]) |
| 3  | array('d', [3.0, 4.0, 5.0]) |
+----+-----------------------------+
[3 rows x 2 columns]
>>> xf.unpack('friends')
+----+-----------+-----------+-----------+
| id | friends.0 | friends.1 | friends.2 |
+----+-----------+-----------+-----------+
| 1  |    1.0    |    2.0    |    3.0    |
| 2  |    2.0    |    3.0    |    4.0    |
| 3  |    3.0    |    4.0    |    5.0    |
+----+-----------+-----------+-----------+
[3 rows x 4 columns]
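The column naming and the effect of limit for a dict column can be sketched in plain Python, with rows modeled as dictionaries. The helper `unpack_dict_rows` is hypothetical and handles only the dict case; column types and na_value are left out.

```python
def unpack_dict_rows(rows, unpack_column, column_name_prefix=None, limit=None):
    """Spread the keys of a dict column into separate columns."""
    prefix = unpack_column if column_name_prefix is None else column_name_prefix
    # Collect the keys to unpack, in first-seen order.
    keys = []
    for row in rows:
        for key in row[unpack_column]:
            if key not in keys and (limit is None or key in limit):
                keys.append(key)
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != unpack_column}
        for key in keys:
            # An empty prefix yields the bare key as the column name.
            name = '{}.{}'.format(prefix, key) if prefix else str(key)
            new_row[name] = row[unpack_column].get(key)  # None when absent
        out.append(new_row)
    return out

rows = [{'id': 1, 'wc': {'a': 1}},
        {'id': 2, 'wc': {'b': 2}},
        {'id': 3, 'wc': {'a': 1, 'b': 2}}]
print(unpack_dict_rows(rows, 'wc'))
print(unpack_dict_rows(rows, 'wc', limit=['b']))
```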
unstack(column, new_column_name=None)[source]

Concatenate values from one or two columns into one column, grouping by all other columns. The resulting column could be of type list, array or dictionary. If column is a numeric column, the result will be of array.array type. If column is a non-numeric column, the new column will be of list type. If column is a list of two columns, the new column will be of dict type where the keys are taken from the first column in the list.

Parameters:

column : str | [str, str]

The column(s) to be concatenated. If str, then the collapsed column type is either array or list. If [str, str], then the collapsed column type is dict.

new_column_name : str, optional

New column name. If not given, a name is generated automatically.

Returns:

XFrame

A new XFrame containing the grouped columns as well as the new column.

See also

xframes.XFrame.stack
The inverse of unstack.
xframes.XFrame.groupby
Unstack is a special version of groupby that uses the CONCAT aggregator.

Notes

  • There is no guarantee the resulting XFrame maintains the same order as the original XFrame.
  • Missing values are maintained during unstack.
  • When unstacking into a dictionary, if there is more than one instance of a given key for a particular group, an arbitrary value is selected.

Examples

>>> xf = xframes.XFrame({'count':[4, 2, 1, 1, 2, None],
...                       'topic':['cat', 'cat', 'dog', 'elephant', 'elephant', 'fish'],
...                       'word':['a', 'c', 'c', 'a', 'b', None]})
>>> xf.unstack(column=['word', 'count'], new_column_name='words')
+----------+------------------+
|  topic   |      words       |
+----------+------------------+
| elephant | {'a': 1, 'b': 2} |
|   dog    |     {'c': 1}     |
|   cat    | {'a': 4, 'c': 2} |
|   fish   |       None       |
+----------+------------------+
[4 rows x 2 columns]
>>> xf = xframes.XFrame({'friend': [2, 3, 4, 5, 6, 4, 5, 2, 3],
...                      'user': [1, 1, 1, 2, 2, 2, 3, 4, 4]})
>>> xf.unstack('friend', new_column_name='friends')
+------+-----------------------------+
| user |           friends           |
+------+-----------------------------+
|  3   |      array('d', [5.0])      |
|  1   | array('d', [2.0, 4.0, 3.0]) |
|  2   | array('d', [5.0, 6.0, 4.0]) |
|  4   |    array('d', [2.0, 3.0])   |
+------+-----------------------------+
[4 rows x 2 columns]
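The grouping behavior can be sketched in plain Python, with rows modeled as dictionaries. The helper `unstack_rows` is hypothetical: it always collects single columns into plain lists (it does not build array.array values for numeric columns), and it assumes a group with no usable key/value pairs yields None.

```python
def unstack_rows(rows, column, new_column_name):
    """Group by all other columns and collect one or two columns per group."""
    as_dict = not isinstance(column, str)
    used = list(column) if as_dict else [column]
    groups = {}
    for row in rows:
        # The group key is made of every column that is not being collected.
        group = tuple((k, v) for k, v in row.items() if k not in used)
        bucket = groups.setdefault(group, {} if as_dict else [])
        if as_dict:
            key, value = row[used[0]], row[used[1]]
            if key is not None:
                bucket[key] = value   # a repeated key keeps the last value seen
        else:
            bucket.append(row[column])
    out = []
    for group, bucket in groups.items():
        new_row = dict(group)
        new_row[new_column_name] = bucket if bucket else None
        out.append(new_row)
    return out

rows = [{'count': 4, 'topic': 'cat', 'word': 'a'},
        {'count': 2, 'topic': 'cat', 'word': 'c'},
        {'count': None, 'topic': 'fish', 'word': None}]
print(unstack_rows(rows, ['word', 'count'], 'words'))
```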
width()[source]

Diagnostic: the number of elements in each tuple of the RDD.