PyBMF.datasets package¶

Submodules¶

PyBMF.datasets.BaseData module¶

class PyBMF.datasets.BaseData.BaseData(path=None)[source]¶

Bases: object

Base class for built-in datasets.

Note

Attributes of BaseData for a single-matrix dataset.

X (spmatrix) – The data matrix, which can be passed to NoSplit, RatioSplit or CrossValidation, or used for factorization directly.
factor_info (list of 2 tuples) – The list of factor info. For example, [user_info, item_info]. More specifically, the list may look like [(u_order, u_idmap, u_alias), (i_order, i_idmap, i_alias)].

Note

Attributes of BaseData for a multi-matrix dataset.

Xs (list of spmatrix) – The data matrices. For example, [X_ratings, X_genres, X_cast].
factors (list of lists of 2 ints) – The list of factor id pairs. For example, [[0, 1], [2, 1], [3, 1]] if the 3 datasets are user-movie, genre-movie and cast-movie.
factor_info (list of tuples) – The list of factor info. For example, [user_info, movie_info, genre_info, cast_info].
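
The relation between Xs, factors and factor_info can be illustrated with a minimal sketch in plain Python. The factor names and nested-list matrices below are placeholders invented for illustration (a real dataset holds spmatrix objects and info tuples), and reading each pair as [row factor, column factor] is an assumption based on the user-movie example above.

```python
# Hypothetical stand-ins: plain names replace the real info tuples,
# and nested lists replace the spmatrix objects.
factor_info = ["user", "movie", "genre", "cast"]
factors = [[0, 1], [2, 1], [3, 1]]  # assumed [row_factor, col_factor] per matrix
Xs = [
    [[1, 0, 1], [0, 1, 0]],             # 2 users x 3 movies
    [[1, 1, 0], [0, 0, 1], [1, 0, 0]],  # 3 genres x 3 movies
    [[0, 1, 1]],                        # 1 cast member x 3 movies
]


def describe(Xs, factors, factor_info):
    """Return (row_factor, col_factor, shape) for every matrix."""
    out = []
    for X, (r, c) in zip(Xs, factors):
        out.append((factor_info[r], factor_info[c], (len(X), len(X[0]))))
    return out


layout = describe(Xs, factors, factor_info)
for row_name, col_name, shape in layout:
    print(f"{row_name}-{col_name}: {shape}")
```

All three matrices share factor 1 (movie) on their columns, which is what allows them to be aligned along that factor.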

dump_pickle(name=None)[source]¶

Dump pickle to cache directory.

Parameters: name (str) – The name of the pickle file.

property has_pickle¶: Whether the pickle exists.

load(overwrite_cache=False)[source]¶

Load data.

If pickle exists, load from cache directory. If not, read from data directory. Dump to pickle when overwrite_cache is True.

Parameters: overwrite_cache (bool, default: False) – If True, overwrite the cache.
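
The caching flow described for load can be sketched as follows. CachedLoader, its cache_path argument and the raw_reads counter are hypothetical names, not the BaseData implementation. The sketch follows the text literally: it writes the pickle only when overwrite_cache is True. Whether the library also writes a missing cache on a plain load is not stated, so the sketch does not.

```python
import os
import pickle
import tempfile


class CachedLoader:
    """Hypothetical loader used only to illustrate the caching flow."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.raw_reads = 0  # how many times the slow path ran

    @property
    def has_pickle(self):
        return os.path.exists(self.cache_path)

    def read_data(self):
        # Placeholder for parsing the raw files in the data directory.
        self.raw_reads += 1
        return {"X": [[1, 0], [0, 1]]}

    def load(self, overwrite_cache=False):
        # Fast path: reuse the pickle unless the caller asks to overwrite it.
        if self.has_pickle and not overwrite_cache:
            with open(self.cache_path, "rb") as f:
                return pickle.load(f)
        data = self.read_data()
        if overwrite_cache:
            with open(self.cache_path, "wb") as f:
                pickle.dump(data, f)
        return data


with tempfile.TemporaryDirectory() as tmp:
    loader = CachedLoader(os.path.join(tmp, "data.pickle"))
    first = loader.load(overwrite_cache=True)  # reads raw data, writes the cache
    second = loader.load()                     # served from the pickle
    raw_reads = loader.raw_reads
```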

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

read_pickle()[source]¶: Read pickle from cache directory.

sample(factor_id, idx=None, n_samples=None, seed=None)[source]¶

Sample the whole dataset with given factor_id and idx.

Parameters:

factor_id (int) – For a single-matrix dataset, factor_id is the axis to sample: 0 for rows and 1 for columns. For a multi-matrix dataset, factor_id is the index of the factor to sample.
idx (np.ndarray) – The given indices to sample with.
n_samples (int) – Randomly down-sample to this length.
seed (int) – Random seed for down-sampling.
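
As a rough illustration of these parameters for the single-matrix case, the sketch below samples rows or columns of a nested-list matrix. The helper names sample_indices and sample_matrix are hypothetical. The library works on np.ndarray indices and sparse matrices, and letting idx take precedence over n_samples is an assumption of this sketch.

```python
import random


def sample_indices(length, idx=None, n_samples=None, seed=None):
    """Pick indices along one axis: explicit idx wins, else random down-sampling."""
    if idx is not None:
        return list(idx)
    if n_samples is None or n_samples >= length:
        return list(range(length))
    rng = random.Random(seed)  # seeded generator makes the draw reproducible
    return sorted(rng.sample(range(length), n_samples))


def sample_matrix(X, factor_id, idx=None, n_samples=None, seed=None):
    """Keep only the chosen rows (factor_id=0) or columns (factor_id=1)."""
    length = len(X) if factor_id == 0 else len(X[0])
    keep = sample_indices(length, idx, n_samples, seed)
    if factor_id == 0:
        return [X[i] for i in keep]
    return [[row[j] for j in keep] for row in X]


X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
rows = sample_matrix(X, factor_id=0, idx=[2, 0])           # explicit row indices
cols = sample_matrix(X, factor_id=1, n_samples=2, seed=0)  # random down-sampling
```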

show_matrix(scaling=1.0, pixels=5, colorbar=True, discrete=True, center=True, clim=[0, 1], keep_nan=True, **kwargs)[source]¶: The show_matrix wrapper for Boolean datasets.

to_single()[source]¶: Concatenate Xs to form a single X.

PyBMF.datasets.BaseSplit module¶

class PyBMF.datasets.BaseSplit.BaseSplit(X)[source]¶

Bases: object

Base class for data splitting and negative sampling methods NoSplit, RatioSplit and CrossValidation.

Note

Attributes of BaseSplit.

X_trainspmatrix: The training data matrix.
X_valspmatrix: The validation data matrix.
X_testspmatrix: The test data matrix.

Parameters: X (ndarray, spmatrix) – The data matrix.

check_params(**kwargs)[source]¶

Check parameters.

Checks the random seed.

get_neg_indices(n_negatives, type)[source]¶

Generate negative indices.

Used in RatioSplit.negative_sample and CrossValidation.negative_sample.

This method is fast but becomes intractable for large datasets. For large datasets, use trial-and-error sampling instead.

Parameters:

n_negatives (int) – Number of negative samples.
type (str) – Negative sampling type.
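
The trade-off mentioned above can be illustrated with two uniform samplers in plain Python. Both functions are hypothetical sketches, not the library's code: the first enumerates every zero cell, which needs memory proportional to the full matrix, while the second draws random cells and rejects positives, which avoids that enumeration.

```python
import random


def neg_indices_enumerate(positives, shape, n_negatives, seed=None):
    """Enumerate every zero cell, then draw n_negatives of them uniformly."""
    n_rows, n_cols = shape
    zeros = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if (i, j) not in positives]
    return random.Random(seed).sample(zeros, n_negatives)


def neg_indices_trial(positives, shape, n_negatives, seed=None):
    """Draw random cells and keep those that are zero and not yet chosen."""
    n_rows, n_cols = shape
    if n_negatives > n_rows * n_cols - len(positives):
        raise ValueError("not enough zero cells to sample from")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_negatives:
        cell = (rng.randrange(n_rows), rng.randrange(n_cols))
        if cell not in positives:
            chosen.add(cell)  # the set silently ignores repeated draws
    return sorted(chosen)


positives = {(0, 0), (1, 2), (2, 1)}  # (row, col) positions of the 1's
a = neg_indices_enumerate(positives, (3, 4), 5, seed=1)
b = neg_indices_trial(positives, (3, 4), 5, seed=1)
```

Trial-and-error works well when the matrix is sparse, since most random draws land on a zero cell.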

load_neg_data(**kwargs)¶

load_pos_data(train_idx, val_idx, test_idx)[source]¶

Load positive data.

Used in RatioSplit and CrossValidation.

X_val and X_test are left empty if val_idx or test_idx has length 0; this is used in negative sampling.

Parameters:

train_idx (ndarray) – The indices of training data.
val_idx (ndarray) – The indices of validation data.
test_idx (ndarray) – The indices of test data.

negative_sample()[source]¶: Negative sampling.

Note

Explicit 0’s can only be added to a csr_matrix or csc_matrix; negative samples are validated using a coo_matrix or triplets.

coo_matrix does not support value assignment, and assigning 0’s to a lil_matrix has no effect.

Arithmetic operations on the matrix, as well as csr_matrix.eliminate_zeros(), can cause a sparse matrix to lose the negative samples, because they may drop explicitly stored 0’s.

PyBMF.datasets.CrossValidation module¶

class PyBMF.datasets.CrossValidation.CrossValidation(X, test_size=None, n_folds=None, seed=None)[source]¶

Bases: BaseSplit

K-fold cross-validation, usually used in prediction tasks.

Parameters:

X (ndarray, spmatrix) – The data matrix.
test_size (int or float) – If int, the absolute size of the test set. If float, the size of the test set as a fraction of the dataset.
n_folds (int) – Number of folds.
seed (int) – Random seed.

get_fold(current_fold)[source]¶

Get current fold.

Parameters: current_fold (int) – Index of the current fold.

static get_indices(data_idx, partition, current_fold)[source]¶

Get indices for current fold.

Parameters:

data_idx (ndarray) – The indices of dataset.
partition (ndarray) – An array of starting indices of each fold and the test set.
current_fold (int) – The index of current fold.

Returns:

train_idx (ndarray) – The indices of training data.
val_idx (ndarray) – The indices of validation data.
test_idx (ndarray) – The indices of test data.

static get_partition(n_folds, test_size, n_ratings, train_val_size=None)[source]¶

Get partition for cross-validation.

Used in CrossValidation and CrossValidation.negative_sample.

Parameters:

n_folds (int) – Number of folds.
test_size (int or float) – If int, the absolute size of the test set. If float, the size of the test set as a fraction of the dataset.
n_ratings (int) – Number of ratings.
train_val_size (int, float or None) – If None, use the remaining data outside test_size. If int, the absolute size of the combined training and validation set. If float, its size as a fraction of the dataset. Note that 0.0 is not valid.

Returns:

partition (ndarray) – An array of starting indices of each fold and the test set.
test_size (int) – The size of test set.
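
One plausible reading of how partition and the fold indices fit together is sketched below. The functions sketch_partition and sketch_indices are hypothetical. The equal-width fold boundaries and the placement of the test block at the tail are assumptions, and the train_val_size argument is omitted for brevity.

```python
def sketch_partition(n_folds, test_size, n_ratings):
    """Return fold start offsets plus the start of the test block."""
    if isinstance(test_size, float):
        test_size = int(n_ratings * test_size)  # fraction -> sample count
    train_val_size = n_ratings - test_size
    partition = [k * train_val_size // n_folds for k in range(n_folds)]
    partition.append(train_val_size)  # last entry: where the test set starts
    return partition, test_size


def sketch_indices(data_idx, partition, current_fold):
    """Fold current_fold validates, the other folds train, the tail tests."""
    start, stop = partition[current_fold], partition[current_fold + 1]
    val_idx = data_idx[start:stop]
    train_idx = data_idx[:start] + data_idx[stop:partition[-1]]
    test_idx = data_idx[partition[-1]:]
    return train_idx, val_idx, test_idx


data_idx = list(range(10))  # in practice this would be a shuffled index array
partition, test_size = sketch_partition(n_folds=4, test_size=0.2, n_ratings=10)
train_idx, val_idx, test_idx = sketch_indices(data_idx, partition, current_fold=1)
```

Under these assumptions the test indices stay fixed across folds, while the validation block moves with current_fold.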

negative_sample(test_size, train_val_size, seed=None, type='uniform')[source]¶

Negative sampling for cross-validation.

Parameters:

test_size (int) – Number of test samples.
train_val_size (int) – Number of train and validation samples.
seed (int) – Random seed.
type (str) – Type of negative sampling.

PyBMF.datasets.MovieLensData module¶

class PyBMF.datasets.MovieLensData.MovieLensData(path=None, size='1m')[source]¶

Bases: BaseData

Load MovieLens dataset.

Parameters:

path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

sort_factor(X, dim, factor_info)[source]¶

Sort the matrix and factor_info by factor order.

Parameters:

X (csr_matrix) – The matrix to be sorted.
dim (int) – If dim is 0, sort rows. If dim is 1, sort columns.
factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].

Returns:

X (csr_matrix) – The sorted matrix.
factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].

PyBMF.datasets.MovieLensGenreCastData module¶

class PyBMF.datasets.MovieLensGenreCastData.MovieLensGenreCastData(path=None, size='1m')[source]¶

Bases: MovieLensData

Load MovieLens dataset with IMDB genre and cast information.

Parameters:

path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.

get_attribute_info(attribute)[source]¶

Get attribute information.

Parameters: attribute (str) – The name of a column in df_info.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

PyBMF.datasets.MovieLensGenreCastUserData module¶

class PyBMF.datasets.MovieLensGenreCastUserData.MovieLensGenreCastUserData(path=None, size='1m')[source]¶

Bases: MovieLensData

Load MovieLens dataset with genre, cast and user profile information.

Parameters:

path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

PyBMF.datasets.MovieLensUserData module¶

class PyBMF.datasets.MovieLensUserData.MovieLensUserData(path=None, size='1m')[source]¶

Bases: MovieLensData

Load MovieLens dataset with user profiles.

Parameters:

path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.

get_user_profile()[source]¶: Get user profile.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

PyBMF.datasets.NetflixData module¶

class PyBMF.datasets.NetflixData.NetflixData(path=None, size='small')[source]¶

Bases: BaseData

Load Netflix dataset.

Parameters:

path (str) – Path to the cached dataset.
size (str in {'small', 'full'}) – Netflix dataset size. The ‘small’ version is 15 MB with ~10k users, 4945 items and ~608k ratings. The ‘full’ version is 2.43 GB with ~480k users, 17770 items and ~100M ratings.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

PyBMF.datasets.NetflixGenreCastData module¶

class PyBMF.datasets.NetflixGenreCastData.NetflixGenreCastData(path=None, size='small', source='imdb')[source]¶

Bases: NetflixData

Load Netflix dataset with genre and cast information.

Genre and cast information comes from Netflix-Prize-IMDB-TMDB-Joint-Dataset on GitHub: https://github.com/felixnie/Netflix-Prize-IMDB-TMDB-Joint-Dataset

Parameters:

path (str) – Path to the cached dataset.
size (str in {'small', 'full'}) – Netflix dataset size. The ‘small’ version is 15 MB with ~10k users, 4945 items and ~608k ratings. The ‘full’ version is 2.43 GB with ~480k users, 17770 items and ~100M ratings.
source (str in {'imdb', 'tmdb'}) – Source of the genre and cast information.

get_attribute_info(attribute)[source]¶

Get attribute information.

Parameters: attribute (str) – The name of a column in df_info.

load_data()[source]¶: Load data.

read_data()[source]¶: Read data.

PyBMF.datasets.NoSplit module¶

class PyBMF.datasets.NoSplit.NoSplit(X, seed=None)[source]¶

Bases: BaseSplit

No split, usually used in reconstruction tasks, where training, validation and testing use the same full set of samples. NoSplit supports negative sampling.

negative_sample(size, type='uniform', seed=None)[source]¶

Select and append negative samples onto train, val and test set.

Parameters:

size (int) – Number of negative samples.
type (str) – Type of negative sampling.
seed (int) – Random seed.

PyBMF.datasets.RatioSplit module¶

class PyBMF.datasets.RatioSplit.RatioSplit(X, test_size=None, val_size=None, seed=None)[source]¶

Bases: BaseSplit

Ratio split, usually used in prediction tasks.

Parameters:

X (ndarray, spmatrix) – The data matrix.
test_size (int or float) – If int, the absolute size of the test set. If float, the size of the test set as a fraction of the dataset.
val_size (int or float) – If int, the absolute size of the validation set. If float, the size of the validation set as a fraction of the dataset.
seed (int) – Random seed.

static get_indices(data_idx, train_size, test_size)[source]¶

Get indices for train, val and test set.

Used in RatioSplit and RatioSplit.negative_sample.

Parameters:

data_idx (ndarray) – The indices of dataset.
train_size (int) – The size of training set.
test_size (int) – The size of test set.

static get_size(val_size, test_size, n_ratings, train_size=None)[source]¶

Get size of train, val and test set.

Used in both RatioSplit and RatioSplit.negative_sample.

Parameters:

val_size (int or float) – If int, the absolute size of the validation set. If float, the size of the validation set as a fraction of the dataset.
test_size (int or float) – If int, the absolute size of the test set. If float, the size of the test set as a fraction of the dataset.
n_ratings (int) – Number of ratings.
train_size (int, float or None) – If None, use the rest of the data. If 0.0, the training set is empty; this is used in negative sampling when no negative samples need to be appended to the training set.
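
The int-or-float convention and the resulting split can be sketched in plain Python. The helpers resolve, split_sizes and split_indices are hypothetical. Truncating fractions with int() and laying out the data as train first, test last, and validation in between are assumptions of this sketch.

```python
def resolve(size, n_ratings):
    """Turn an int count or a float fraction into a sample count."""
    if size is None:
        return 0
    if isinstance(size, float):
        return int(n_ratings * size)
    return size


def split_sizes(val_size, test_size, n_ratings, train_size=None):
    """Return (train, val, test) sample counts."""
    val_n = resolve(val_size, n_ratings)
    test_n = resolve(test_size, n_ratings)
    if train_size is None:
        train_n = n_ratings - val_n - test_n  # use the rest of the data
    else:
        train_n = resolve(train_size, n_ratings)  # 0.0 gives an empty set
    return train_n, val_n, test_n


def split_indices(data_idx, train_n, val_n, test_n):
    """Slice an index list into train, val and test blocks."""
    train_idx = data_idx[:train_n]
    val_idx = data_idx[train_n:train_n + val_n]
    test_idx = data_idx[train_n + val_n:train_n + val_n + test_n]
    return train_idx, val_idx, test_idx


sizes = split_sizes(val_size=0.25, test_size=5, n_ratings=20)
train_idx, val_idx, test_idx = split_indices(list(range(20)), *sizes)
no_train = split_sizes(val_size=0.25, test_size=5, n_ratings=20, train_size=0.0)
```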

negative_sample(train_size=None, test_size=None, val_size=None, seed=None, type='uniform')[source]¶

Select and append negative samples onto train, val and test set.

Parameters:

train_size (int or float) – If int, the absolute size of the training set. If float, the size of the training set as a fraction of the dataset.
test_size (int or float) – If int, the absolute size of the test set. If float, the size of the test set as a fraction of the dataset.
val_size (int or float) – If int, the absolute size of the validation set. If float, the size of the validation set as a fraction of the dataset.
seed (int) – Random seed.
type (str in {'uniform', 'popularity'}) – Type of negative sampling.

Module contents¶

The package namespace re-exports CrossValidation, MovieLensData, MovieLensGenreCastData, MovieLensGenreCastUserData, MovieLensUserData, NetflixData, NetflixGenreCastData, NoSplit and RatioSplit. Each class is documented under its submodule above.