PyBMF.datasets package

Submodules

PyBMF.datasets.BaseData module

class PyBMF.datasets.BaseData.BaseData(path=None)[source]

Bases: object

Base class for built-in datasets.

Note

Attributes of BaseData for a single-matrix dataset.

X (spmatrix)

The data matrix, which can be passed to NoSplit, RatioSplit or CrossValidation or be used for factorization directly.

factor_info (list of 2 tuples)

The list of factor info. For example, [user_info, item_info]. More specifically, the list may look like [(u_order, u_idmap, u_alias), (i_order, i_idmap, i_alias)].

Note

Attributes of BaseData for a multi-matrix dataset.

Xs (list of spmatrix)

E.g., [X_ratings, X_genres, X_cast]

factors (list of lists of 2 ints)

The list of factor id pairs. For example, [[0, 1], [2, 1], [3, 1]] if the 3 datasets are user-movie, genre-movie and cast-movie.

factor_info (list of tuples)

The list of factor info. For example, [user_info, movie_info, genre_info, cast_info].
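To make the link between Xs and factors concrete, the sketch below derives the shape each matrix should have from its factor id pair. It assumes the first id in a pair indexes the row factor and the second the column factor, which matches the user-movie, genre-movie, cast-movie example. The factor sizes and the helper expected_shapes are hypothetical and not part of PyBMF.

```python
# Hypothetical sizes for four factors: users, movies, genres, cast.
factor_sizes = [6, 4, 3, 5]

# One [row_factor, col_factor] pair per matrix:
# user-movie, genre-movie and cast-movie.
factors = [[0, 1], [2, 1], [3, 1]]


def expected_shapes(factors, factor_sizes):
    """Return the (n_rows, n_cols) shape implied by each factor id pair."""
    return [(factor_sizes[r], factor_sizes[c]) for r, c in factors]


shapes = expected_shapes(factors, factor_sizes)
print(shapes)
```

All three matrices share factor 1 (movies) as their column factor, so they have the same number of columns.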

dump_pickle(name=None)[source]

Dump pickle to cache directory.

Parameters:

name (str) – The name of pickle file.

property has_pickle

If pickle exists.

load(overwrite_cache=False)[source]

Load data.

If pickle exists, load from cache directory. If not, read from data directory. Dump to pickle when overwrite_cache is True.

Parameters:

overwrite_cache (bool, default: False) – If True, overwrite the cache.
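The caching behaviour described above can be sketched with the standard library alone. This is not the BaseData implementation: load_with_cache and fake_reader are hypothetical names, and the sketch assumes that overwrite_cache=True skips the cached copy, re-reads the raw data and rewrites the pickle.

```python
import os
import pickle
import tempfile


def load_with_cache(cache_path, read_data, overwrite_cache=False):
    """Return data from the pickle cache if present, else from read_data().

    Assumption: overwrite_cache=True ignores an existing pickle,
    re-reads the raw data and rewrites the pickle.
    """
    if os.path.exists(cache_path) and not overwrite_cache:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = read_data()
    if overwrite_cache:
        with open(cache_path, "wb") as f:
            pickle.dump(data, f)
    return data


calls = []


def fake_reader():
    """Stand-in for an expensive read from the data directory."""
    calls.append("read")
    return {"X": [[1, 0], [0, 1]]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    first = load_with_cache(path, fake_reader, overwrite_cache=True)  # reads, then dumps
    second = load_with_cache(path, fake_reader)                       # served from the pickle
```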

load_data()[source]

Load data.

read_data()[source]

Read data.

read_pickle()[source]

Read pickle from cache directory.

sample(factor_id, idx=None, n_samples=None, seed=None)[source]

Sample the whole dataset with given factor_id and idx.

Parameters:
  • factor_id (int) –

    For single-matrix dataset, factor_id is the axis to sample, i.e., 0 for rows and 1 for columns.

    For multi-matrix dataset, factor_id is the index of the factor to sample.

  • idx (np.ndarray) – The given indices to sample with.

  • n_samples (int) – Randomly down-sample to this length.

  • seed (int) – Random seed for down-sampling.
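As an illustration of how factor_id, idx, n_samples and seed could interact on a single matrix, here is a minimal sketch on a list-of-lists matrix. The helper sample_axis is hypothetical, and keeping the sampled indices in their original order is an assumption rather than documented behaviour.

```python
import random


def sample_axis(matrix, axis, idx=None, n_samples=None, seed=None):
    """Keep selected rows (axis=0) or columns (axis=1) of a list-of-lists matrix."""
    size = len(matrix) if axis == 0 else len(matrix[0])
    if idx is None:
        idx = list(range(size))
    if n_samples is not None and n_samples < len(idx):
        # Randomly down-sample the candidate indices; the seed makes it repeatable.
        rng = random.Random(seed)
        idx = sorted(rng.sample(list(idx), n_samples))
    if axis == 0:
        sampled = [matrix[i] for i in idx]
    else:
        sampled = [[row[j] for j in idx] for row in matrix]
    return idx, sampled


X = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]

idx_a, rows = sample_axis(X, axis=0, idx=[0, 2])            # explicit row indices
idx_b, cols = sample_axis(X, axis=1, n_samples=2, seed=0)   # random columns
idx_c, _ = sample_axis(X, axis=1, n_samples=2, seed=0)      # same seed, same columns
```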

show_matrix(scaling=1.0, pixels=5, colorbar=True, discrete=True, center=True, clim=[0, 1], keep_nan=True, **kwargs)[source]

The show_matrix wrapper for Boolean datasets.

to_single()[source]

Concatenate Xs to form a single X.

PyBMF.datasets.BaseSplit module

class PyBMF.datasets.BaseSplit.BaseSplit(X)[source]

Bases: object

Base class for data splitting and negative sampling methods NoSplit, RatioSplit and CrossValidation.

Note

Attributes of BaseSplit.

X_train (spmatrix)

The training data matrix.

X_val (spmatrix)

The validation data matrix.

X_test (spmatrix)

The test data matrix.

Parameters:

X (ndarray, spmatrix) – The data matrix.

check_params(**kwargs)[source]

Check parameters.

Checks the random seed.

get_neg_indices(n_negatives, type)[source]

Generate negative indices.

Used in RatioSplit.negative_sample and CrossValidation.negative_sample.

This is fast but intractable for large datasets; use trial-and-error sampling for those instead.

Parameters:
  • n_negatives (int) – Number of negative samples.

  • type (str) – Negative sampling type.
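The trade-off mentioned above can be illustrated with two hypothetical helpers. The first lists every zero cell before sampling, so its memory use grows with the full matrix size. The second draws cells by trial and error and keeps only the ones that are zero. Neither is the PyBMF implementation, and positives are represented here as a plain set of (row, col) pairs.

```python
import random


def neg_indices_enumerate(positives, shape, n_negatives, seed=None):
    """List every zero cell, then sample without replacement.

    Simple, but memory grows with n_rows * n_cols.
    """
    n_rows, n_cols = shape
    zeros = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if (i, j) not in positives]
    return random.Random(seed).sample(zeros, n_negatives)


def neg_indices_trial(positives, shape, n_negatives, seed=None):
    """Draw random cells and keep those that are zero and not yet chosen.

    Memory grows only with n_negatives.
    """
    n_rows, n_cols = shape
    if n_negatives > n_rows * n_cols - len(positives):
        raise ValueError("not enough zero cells")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_negatives:
        cell = (rng.randrange(n_rows), rng.randrange(n_cols))
        if cell not in positives:
            chosen.add(cell)
    return sorted(chosen)


positives = {(0, 0), (1, 2), (2, 1)}
neg_a = neg_indices_enumerate(positives, (3, 4), 5, seed=1)
neg_b = neg_indices_trial(positives, (3, 4), 5, seed=1)
```

Trial and error works well when the matrix is sparse, because most random draws land on a zero cell.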

load_neg_data(**kwargs)

load_pos_data(train_idx, val_idx, test_idx)[source]

Load positive data.

Used in RatioSplit and CrossValidation.

X_val or X_test is left empty if val_idx or test_idx, respectively, has length 0. This is used for negative sampling.

Parameters:
  • train_idx (ndarray) – The indices of training data.

  • val_idx (ndarray) – The indices of validation data.

  • test_idx (ndarray) – The indices of test data.

negative_sample()[source]

Negative sampling.

Note

We can only add 0’s using csr/csc_matrix, and validate negative samples using coo_matrix or triplet.

coo_matrix does not support value assignment; lil_matrix has no effect when adding 0’s onto it.

Any arithmetic operation or csr_matrix.eliminate_zeros() will cause a sparse matrix to lose the negative samples.
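The point about explicit zeros can be checked directly with SciPy. The sketch below stores one negative sample as an explicit 0 in a csr_matrix, reads it back through the coo (triplet) view, and shows that eliminate_zeros() discards it. The matrix values are made up for illustration.

```python
from scipy.sparse import csr_matrix

# Two positives (value 1) and one negative sample stored as an explicit 0.
data = [1, 1, 0]
rows = [0, 1, 1]
cols = [0, 1, 0]
X = csr_matrix((data, (rows, cols)), shape=(2, 2))

stored_before = X.nnz  # counts stored entries, including the explicit 0

# The coo (triplet) view exposes the stored zero so it can be validated.
coo = X.tocoo()
negatives = [(int(i), int(j))
             for i, j, v in zip(coo.row, coo.col, coo.data) if v == 0]

X.eliminate_zeros()    # drops the explicit 0 in place
stored_after = X.nnz
```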

PyBMF.datasets.CrossValidation module

class PyBMF.datasets.CrossValidation.CrossValidation(X, test_size=None, n_folds=None, seed=None)[source]

Bases: BaseSplit

K-fold cross-validation, used in prediction tasks.

Parameters:
  • X (ndarray, spmatrix) – The data matrix.

  • test_size (int or float) – If int, test_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • n_folds (int) – Number of folds.

  • current_fold (int) – Index of the current fold.

get_fold(current_fold)[source]

Get current fold.

Parameters:

current_fold (int) – Index of the current fold.

static get_indices(data_idx, partition, current_fold)[source]

Get indices for current fold.

Parameters:
  • data_idx (ndarray) – The indices of dataset.

  • partition (ndarray) – An array of starting indices of each fold and the test set.

  • current_fold (int) – The index of current fold.

Returns:

  • train_idx (ndarray) – The indices of training data.

  • val_idx (ndarray) – The indices of validation data.

  • test_idx (ndarray) – The indices of test data.
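A minimal sketch of how a partition array could be turned into per-fold indices follows. It assumes the fold selected by current_fold serves as the validation set and the remaining folds as training data. fold_indices is a hypothetical helper, not the library's get_indices.

```python
def fold_indices(data_idx, partition, current_fold):
    """Split data_idx into train, val and test for one fold.

    partition holds the start index of each fold followed by the start
    index of the test set; everything from there on is test data.
    """
    start, stop = partition[current_fold], partition[current_fold + 1]
    test_start = partition[-1]
    val_idx = data_idx[start:stop]
    train_idx = data_idx[:start] + data_idx[stop:test_start]
    test_idx = data_idx[test_start:]
    return train_idx, val_idx, test_idx


data_idx = list(range(10))       # would already be shuffled in a real split
partition = [0, 2, 4, 6, 8]      # 4 folds of 2 samples, test set starts at 8
train_idx, val_idx, test_idx = fold_indices(data_idx, partition, current_fold=1)
```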

static get_partition(n_folds, test_size, n_ratings, train_val_size=None)[source]

Get partition for cross-validation.

Used in CrossValidation and CrossValidation.negative_sample.

Parameters:
  • n_folds (int) – Number of folds.

  • test_size (int or float) – If int, test_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • train_val_size (int, float or None) – If None, use the data remaining outside test_size. If int, train_val_size is an absolute number of samples. If float, it is a fraction of the dataset. Note that 0.0 is not valid.

Returns:

  • partition (ndarray) – An array of starting indices of each fold and the test set.

  • test_size (int) – The size of test set.
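One way such a partition array could be computed is sketched below, for the case where train_val_size is None. The helper make_partition and its integer-division rule for uneven folds are assumptions for illustration; the library may distribute remainders differently.

```python
def make_partition(n_folds, test_size, n_ratings):
    """Return (partition, test_size) with test_size resolved to an int."""
    if isinstance(test_size, float):
        test_size = round(n_ratings * test_size)
    train_val_size = n_ratings - test_size
    # Start index of each fold, plus the start index of the test set.
    partition = [i * train_val_size // n_folds for i in range(n_folds + 1)]
    return partition, test_size


partition, test_size = make_partition(n_folds=4, test_size=0.2, n_ratings=10)
print(partition, test_size)
```

With 10 ratings and a 20% test set, the 8 remaining samples are divided into 4 folds of 2, and the last entry of partition marks where the test set begins.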

negative_sample(test_size, train_val_size, seed=None, type='uniform')[source]

Negative sampling for cross-validation.

Parameters:
  • test_size (int) – Number of test samples.

  • train_val_size (int) – Number of train and validation samples.

  • seed (int) – Random seed.

  • type (str) – Type of negative sampling.

PyBMF.datasets.MovieLensData module

class PyBMF.datasets.MovieLensData.MovieLensData(path=None, size='1m')[source]

Bases: BaseData

Load MovieLens dataset.

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'100k', '1m'}) – MovieLens dataset size.

load_data()[source]

Load data.

read_data()[source]

Read data.

sort_factor(X, dim, factor_info)[source]

Sort the matrix and factor_info by factor order.

Parameters:
  • X (csr_matrix) – The matrix to be sorted.

  • dim (int) – If dim is 0, sort rows. If dim is 1, sort columns.

  • factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].

Returns:

  • X (csr_matrix) – The sorted matrix.

  • factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].
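The idea of sorting a matrix together with its factor info can be sketched on plain lists. This sketch assumes u_order holds a sort key for each row, and that u_idmap and u_alias are parallel lists that must be permuted the same way. sort_rows_by_order is a hypothetical helper covering the dim=0 case only.

```python
def sort_rows_by_order(X, factor_info):
    """Reorder the rows of X and every field of factor_info by the order field."""
    order = factor_info[0]
    # Permutation that puts the rows in ascending order of their sort key.
    perm = sorted(range(len(order)), key=lambda i: order[i])
    X_sorted = [X[i] for i in perm]
    info_sorted = tuple([field[i] for i in perm] for field in factor_info)
    return X_sorted, info_sorted


X = [[0, 1],
     [1, 0],
     [1, 1]]
u_order = [2, 0, 1]
u_idmap = [101, 102, 103]
u_alias = ["c", "a", "b"]

X_sorted, info_sorted = sort_rows_by_order(X, (u_order, u_idmap, u_alias))
```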

PyBMF.datasets.MovieLensGenreCastData module

class PyBMF.datasets.MovieLensGenreCastData.MovieLensGenreCastData(path=None, size='1m')[source]

Bases: MovieLensData

Load MovieLens dataset with IMDB genre and cast information.

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'100k', '1m'}) – MovieLens dataset size.

get_attribute_info(attribute)[source]

Get attribute information.

Parameters:

attribute (str) – The name of columns in df_info.

load_data()[source]

Load data.

read_data()[source]

Read data.

PyBMF.datasets.MovieLensGenreCastUserData module

class PyBMF.datasets.MovieLensGenreCastUserData.MovieLensGenreCastUserData(path=None, size='1m')[source]

Bases: MovieLensData

Load MovieLens dataset with user profiles.

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'100k', '1m'}) – MovieLens dataset size.

load_data()[source]

Load data.

read_data()[source]

Read data.

PyBMF.datasets.MovieLensUserData module

class PyBMF.datasets.MovieLensUserData.MovieLensUserData(path=None, size='1m')[source]

Bases: MovieLensData

Load MovieLens dataset with user profiles.

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'100k', '1m'}) – MovieLens dataset size.

get_user_profile()[source]

Get user profile.

load_data()[source]

Load data.

read_data()[source]

Read data.

PyBMF.datasets.NetflixData module

class PyBMF.datasets.NetflixData.NetflixData(path=None, size='small')[source]

Bases: BaseData

Load Netflix dataset.

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'small', 'full'}) – Netflix dataset size. 'small': 15 MB, ~10k users, 4945 items, ~608k ratings. 'full': 2.43 GB, ~480k users, 17770 items, ~100M ratings.

load_data()[source]

Load data.

read_data()[source]

Read data.

PyBMF.datasets.NetflixGenreCastData module

class PyBMF.datasets.NetflixGenreCastData.NetflixGenreCastData(path=None, size='small', source='imdb')[source]

Bases: NetflixData

Load Netflix dataset with genre and cast information.

Genre and cast information comes from Netflix-Prize-IMDB-TMDB-Joint-Dataset on GitHub: https://github.com/felixnie/Netflix-Prize-IMDB-TMDB-Joint-Dataset

Parameters:
  • path (str) – Path to the cached dataset.

  • size (str in {'small', 'full'}) – Netflix dataset size. 'small': 15 MB, ~10k users, 4945 items, ~608k ratings. 'full': 2.43 GB, ~480k users, 17770 items, ~100M ratings.

  • source (str in {'imdb', 'tmdb'}) – Source of the genre and cast information.

get_attribute_info(attribute)[source]

Get attribute information.

Parameters:

attribute (str) – The name of columns in df_info.

load_data()[source]

Load data.

read_data()[source]

Read data.

PyBMF.datasets.NoSplit module

class PyBMF.datasets.NoSplit.NoSplit(X, seed=None)[source]

Bases: BaseSplit

No split, usually used in reconstruction tasks.

Designed for reconstruction tasks, where training, validation and testing use the same full set of samples. NoSplit supports negative sampling.

negative_sample(size, type='uniform', seed=None)[source]

Select and append negative samples onto train, val and test set.

Parameters:
  • size (int) – Number of negative samples.

  • type (str) – Type of negative sampling.

  • seed (int) – Random seed.

PyBMF.datasets.RatioSplit module

class PyBMF.datasets.RatioSplit.RatioSplit(X, test_size=None, val_size=None, seed=None)[source]

Bases: BaseSplit

Ratio split, usually used in prediction tasks.

Parameters:
  • X (ndarray, spmatrix) – The data matrix.

  • test_size (int or float) – If int, test_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • val_size (int or float) – If int, val_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • seed (int) – Random seed.

static get_indices(data_idx, train_size, test_size)[source]

Get indices for train, val and test set.

Used in RatioSplit and RatioSplit.negative_sample.

Parameters:
  • data_idx (ndarray) – The indices of dataset.

  • train_size (int) – The size of training set.

  • test_size (int) – The size of test set.
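A minimal sketch of cutting a shuffled index array into three parts is shown below. The ordering (training first, validation in the middle, test last) and the helper name split_indices are assumptions made for illustration.

```python
import random


def split_indices(data_idx, train_size, test_size):
    """Slice data_idx into train, val and test; val takes what is left over."""
    test_start = len(data_idx) - test_size
    train_idx = data_idx[:train_size]
    val_idx = data_idx[train_size:test_start]
    test_idx = data_idx[test_start:]
    return train_idx, val_idx, test_idx


data_idx = list(range(10))
random.Random(0).shuffle(data_idx)   # seeded shuffle for a repeatable split

train_idx, val_idx, test_idx = split_indices(data_idx, train_size=6, test_size=2)
```

The three parts are disjoint and together cover every index exactly once.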

static get_size(val_size, test_size, n_ratings, train_size=None)[source]

Get size of train, val and test set.

Used in both RatioSplit and RatioSplit.negative_sample.

Parameters:
  • val_size (int or float) – If int, val_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • test_size (int or float) – If int, test_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • n_ratings (int) – Number of ratings.

  • train_size (int, float or None) – If None, use the rest of data. If 0.0, empty training set. Used in negative sampling if there’s no need to append negative samples to the training set.
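The int-or-float convention used by these parameters can be sketched as follows. resolve_size and split_sizes are hypothetical helpers, and rounding the fractional case to the nearest integer is an assumption.

```python
def resolve_size(size, n_ratings):
    """Turn an int count or a float fraction into an int count."""
    if isinstance(size, float):
        return round(size * n_ratings)
    return size


def split_sizes(val_size, test_size, n_ratings, train_size=None):
    """Return (train, val, test) sizes as integers."""
    val = resolve_size(val_size, n_ratings)
    test = resolve_size(test_size, n_ratings)
    if train_size is None:
        train = n_ratings - val - test               # use the rest of the data
    else:
        train = resolve_size(train_size, n_ratings)  # 0.0 gives an empty training set
    return train, val, test


sizes_default = split_sizes(val_size=0.1, test_size=0.2, n_ratings=1000)
sizes_no_train = split_sizes(val_size=50, test_size=50, n_ratings=1000, train_size=0.0)
```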

negative_sample(train_size=None, test_size=None, val_size=None, seed=None, type='uniform')[source]

Select and append negative samples onto train, val and test set.

Parameters:
  • train_size (int or float) – If int, train_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • test_size (int or float) – If int, test_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • val_size (int or float) – If int, val_size is an absolute number of samples. If float, it is a fraction of the dataset.

  • seed (int) – Random seed.

  • type (str in {'uniform', 'popularity'}) – Type of negative sampling.
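The documentation does not define the two sampling types, so the following sketch rests on an assumed reading: 'uniform' gives every column the same chance of receiving a negative sample, while 'popularity' weights each column by its number of positives. sample_columns and col_counts are hypothetical.

```python
import random


def sample_columns(col_counts, n, type="uniform", seed=None):
    """Draw n column indices for negative samples.

    'uniform' gives every column the same chance; 'popularity' weights
    each column by its number of positives (an assumed definition).
    """
    rng = random.Random(seed)
    cols = list(range(len(col_counts)))
    if type == "uniform":
        weights = [1] * len(cols)
    elif type == "popularity":
        weights = list(col_counts)
    else:
        raise ValueError(f"unknown type: {type}")
    return rng.choices(cols, weights=weights, k=n)


col_counts = [50, 0, 5, 45]   # hypothetical number of positives per item
uniform_cols = sample_columns(col_counts, 200, type="uniform", seed=0)
popular_cols = sample_columns(col_counts, 200, type="popularity", seed=0)
```

Under this reading, an item with no positives (column 1 here) is never drawn by popularity sampling but can still be drawn by uniform sampling.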

Module contents

The following classes are also available directly from the PyBMF.datasets namespace. Their signatures, parameters and methods are identical to the submodule entries documented above.

  • PyBMF.datasets.CrossValidation(X, test_size=None, n_folds=None, seed=None)

  • PyBMF.datasets.MovieLensData(path=None, size='1m')

  • PyBMF.datasets.MovieLensGenreCastData(path=None, size='1m')

  • PyBMF.datasets.MovieLensGenreCastUserData(path=None, size='1m')

  • PyBMF.datasets.MovieLensUserData(path=None, size='1m')

  • PyBMF.datasets.NetflixData(path=None, size='small')

  • PyBMF.datasets.NetflixGenreCastData(path=None, size='small', source='imdb')

  • PyBMF.datasets.NoSplit(X, seed=None)

  • PyBMF.datasets.RatioSplit(X, test_size=None, val_size=None, seed=None)