PyBMF.datasets package¶
Submodules¶
PyBMF.datasets.BaseData module¶
- class PyBMF.datasets.BaseData.BaseData(path=None)[source]¶
Bases: object
Base class for built-in datasets.
Note
Attributes of BaseData for a single-matrix dataset.
- X (spmatrix)
The data matrix, which can be passed to NoSplit, RatioSplit or CrossValidation, or be used for factorization directly.
- factor_info (list of 2 tuples)
The list of factor info. For example, [user_info, item_info]. More specifically, the list may look like [(u_order, u_idmap, u_alias), (i_order, i_idmap, i_alias)].
Note
Attributes of BaseData for a multi-matrix dataset.
- Xs (list of spmatrix)
E.g., [X_ratings, X_genres, X_cast].
- factors (list of lists of 2 ints)
The list of factor id pairs. For example, [[0, 1], [2, 1], [3, 1]] if the 3 datasets are user-movie, genre-movie and cast-movie.
- factor_info (list of tuples)
The list of factor info. For example, [user_info, movie_info, genre_info, cast_info].
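To make the multi-matrix layout concrete, here is a minimal sketch built with NumPy and SciPy only. The factor sizes, the random contents, and the orientation convention (the first id of each pair indexes rows, the second indexes columns) are assumptions chosen for illustration; they are not taken from PyBMF itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical factor sizes: 0 = user, 1 = movie, 2 = genre, 3 = cast.
factor_sizes = {0: 4, 1: 5, 2: 3, 3: 2}

# One factor id pair per matrix: user-movie, genre-movie, cast-movie.
factors = [[0, 1], [2, 1], [3, 1]]

rng = np.random.default_rng(0)
Xs = []
for row_factor, col_factor in factors:
    # Assumed convention: first id indexes rows, second id indexes columns.
    shape = (factor_sizes[row_factor], factor_sizes[col_factor])
    dense = (rng.random(shape) < 0.4).astype(np.int8)
    Xs.append(csr_matrix(dense))

# Every matrix shares the movie factor (id 1) along its columns.
shapes = [X.shape for X in Xs]
print(shapes)
```

Under this convention, the shared factor id 1 is what ties the three matrices together: all of them have the same number of columns.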
- dump_pickle(name=None)[source]¶
Dump pickle to cache directory.
- Parameters:
name (str) – The name of pickle file.
- property has_pickle¶
Whether the pickle exists.
- load(overwrite_cache=False)[source]¶
Load data.
If pickle exists, load from cache directory. If not, read from data directory. Dump to pickle when overwrite_cache is True.
- Parameters:
overwrite_cache (bool, default: False) – If True, overwrite the cache.
- sample(factor_id, idx=None, n_samples=None, seed=None)[source]¶
Sample the whole dataset with given factor_id and idx.
- Parameters:
factor_id (int) – For single-matrix dataset, factor_id is the axis to sample, i.e., 0 and 1 for rows and columns. For multi-matrix dataset, factor_id is the index of the factor to sample.
idx (np.ndarray) – The given indices to sample with.
n_samples (int) – Randomly down-sample to this length.
seed (int) – Random seed for down-sampling.
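The single-matrix case can be sketched as follows. The helper `sample_axis` is hypothetical, and the way `idx` and `n_samples` interact (down-sampling the given indices without replacement) is an assumption rather than documented PyBMF behaviour.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sample_axis(X, axis, idx=None, n_samples=None, seed=None):
    """Keep selected rows (axis=0) or columns (axis=1) of a sparse matrix.

    Hypothetical helper: returns the sampled matrix and the kept indices.
    """
    if idx is None:
        idx = np.arange(X.shape[axis])
    idx = np.asarray(idx)
    if n_samples is not None and n_samples < len(idx):
        # Assumed behaviour: randomly down-sample idx without replacement.
        rng = np.random.default_rng(seed)
        idx = np.sort(rng.choice(idx, size=n_samples, replace=False))
    X = X.tocsr()
    sampled = X[idx, :] if axis == 0 else X[:, idx]
    return sampled, idx

X = csr_matrix(np.arange(20).reshape(4, 5) % 2)

# Keep three given rows.
X_rows, kept_rows = sample_axis(X, axis=0, idx=np.array([0, 2, 3]))

# Randomly keep two of the five columns, reproducibly.
X_cols, kept_cols = sample_axis(X, axis=1, n_samples=2, seed=0)
```

For a real dataset the corresponding entry of factor_info would have to be sampled with the same indices so that ids and aliases stay aligned with the matrix.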
PyBMF.datasets.BaseSplit module¶
- class PyBMF.datasets.BaseSplit.BaseSplit(X)[source]¶
Bases: object
Base class for data splitting and negative sampling methods NoSplit, RatioSplit and CrossValidation.
Note
Attributes of BaseSplit.
- X_train (spmatrix)
The training data matrix.
- X_val (spmatrix)
The validation data matrix.
- X_test (spmatrix)
The test data matrix.
- Parameters:
X (ndarray, spmatrix) – The data matrix.
- get_neg_indices(n_negatives, type)[source]¶
Generate negative indices.
Used in RatioSplit.negative_sample and CrossValidation.negative_sample.
This is fast but intractable for large datasets. For large datasets, use trial-and-error instead.
- Parameters:
n_negatives (int) – Number of negative samples.
type (str) – Negative sampling type.
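The trade-off mentioned above can be illustrated with two hypothetical helpers. Both draw uniformly; the function names, signatures and return format are assumptions for this sketch and do not reproduce the PyBMF implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def neg_indices_exhaustive(X, n_negatives, seed=None):
    """Enumerate every empty cell, then draw without replacement."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    X = X.tocoo()
    occupied = X.row.astype(np.int64) * n + X.col   # flat index of stored cells
    candidates = np.setdiff1d(np.arange(m * n), occupied)  # O(m * n) memory
    chosen = rng.choice(candidates, size=n_negatives, replace=False)
    return np.divmod(chosen, n)                     # (rows, cols)

def neg_indices_trial_and_error(X, n_negatives, seed=None):
    """Draw random cells and reject occupied or repeated ones."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    X = X.tocoo()
    taken = set((X.row.astype(np.int64) * n + X.col).tolist())
    if n_negatives > m * n - len(taken):
        raise ValueError("not enough empty cells")
    chosen = []
    while len(chosen) < n_negatives:
        flat = int(rng.integers(m * n))
        if flat not in taken:
            taken.add(flat)                         # also blocks repeats
            chosen.append(flat)
    return np.divmod(np.array(chosen, dtype=np.int64), n)

X = csr_matrix(np.eye(4, dtype=np.int8))            # positives on the diagonal
r1, c1 = neg_indices_exhaustive(X, 5, seed=0)
r2, c2 = neg_indices_trial_and_error(X, 5, seed=0)
```

The exhaustive version materialises all m * n cell indices, which is why it stops being practical for large matrices; the rejection loop only stores the cells it has seen, and works well as long as the matrix is sparse.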
- load_neg_data(**kwargs)¶
- load_pos_data(train_idx, val_idx, test_idx)[source]¶
Load positive data.
Used in RatioSplit and CrossValidation.
Leave X_val, X_test empty if val_idx/test_idx length is 0 for negative sampling.
- Parameters:
train_idx (ndarray) – The indices of training data.
val_idx (ndarray) – The indices of validation data.
test_idx (ndarray) – The indices of test data.
- negative_sample()[source]¶
Negative sampling.
Note
We can only add 0’s using csr/csc_matrix, and validate negative samples using coo_matrix or triplet. coo_matrix does not support value assignment; lil_matrix has no effect when adding 0’s onto it.
Any arithmetic operation or csr_matrix.eliminate_zeros() will cause a sparse matrix to lose the negative samples.
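The role of explicitly stored zeros can be seen in a small SciPy example. The matrix contents here are made up for illustration; only standard `scipy.sparse` calls are used.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three positive entries (1) plus two negative samples stored as explicit zeros.
rows = np.array([0, 1, 2, 0, 2])
cols = np.array([0, 1, 2, 2, 0])
vals = np.array([1, 1, 1, 0, 0])
X = csr_matrix((vals, (rows, cols)), shape=(3, 3))

# nnz counts stored entries, explicit zeros included.
n_stored = X.nnz

# The COO (triplet) view exposes the stored zeros for validation.
coo = X.tocoo()
is_neg = coo.data == 0
neg_cells = sorted(zip(coo.row[is_neg].tolist(), coo.col[is_neg].tolist()))

# eliminate_zeros() drops the explicit zeros, i.e. the negative samples.
X_clean = X.copy()
X_clean.eliminate_zeros()
n_after = X_clean.nnz
```

After `eliminate_zeros()` the matrix holds the same values when densified, but the information about which empty cells were chosen as negatives is gone.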
PyBMF.datasets.CrossValidation module¶
- class PyBMF.datasets.CrossValidation.CrossValidation(X, test_size=None, n_folds=None, seed=None)[source]¶
Bases: BaseSplit
K-fold cross-validation, used in prediction tasks.
- Parameters:
X (ndarray, spmatrix) – The data matrix.
test_size (int or float) – If it is int, test_size is the absolute number of samples. If it is float, test_size is the fraction of the dataset.
n_folds (int) – Number of folds.
current_fold (int) – Index of the current fold.
- get_fold(current_fold)[source]¶
Get current fold.
- Parameters:
current_fold (int) – Index of the current fold.
- static get_indices(data_idx, partition, current_fold)[source]¶
Get indices for current fold.
- Parameters:
data_idx (ndarray) – The indices of dataset.
partition (ndarray) – An array of starting indices of each fold and the test set.
current_fold (int) – The index of current fold.
- Returns:
train_idx (ndarray) – The indices of training data.
val_idx (ndarray) – The indices of validation data.
test_idx (ndarray) – The indices of test data.
- static get_partition(n_folds, test_size, n_ratings, train_val_size=None)[source]¶
Get partition for cross-validation.
Used in CrossValidation and CrossValidation.cv_negative_sample.
- Parameters:
n_folds (int) – Number of folds.
test_size (int or float) – If it is int, test_size is the absolute number of samples. If it is float, test_size is the fraction of the dataset.
train_val_size (int, float or None) – If it is None, use the remaining data outside test_size. If it is int, train_val_size is the absolute number of samples. If it is float, train_val_size is the fraction of the dataset. Note that 0.0 is not valid.
- Returns:
partition (ndarray) – An array of starting indices of each fold and the test set.
test_size (int) – The size of test set.
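One way to read the partition array is sketched below. The helpers `sketch_partition` and `sketch_fold_indices` are hypothetical; the even spacing of folds and the placement of the test set at the end of the shuffled indices are assumptions, not the confirmed behaviour of get_partition or get_indices.

```python
import numpy as np

def sketch_partition(n_folds, test_size, n_ratings):
    """Starting index of each fold, followed by the start of the test set."""
    if isinstance(test_size, float):
        test_size = int(round(n_ratings * test_size))
    train_val_size = n_ratings - test_size
    # n_folds + 1 boundaries; the last one is where the test set begins.
    partition = np.linspace(0, train_val_size, n_folds + 1).astype(int)
    return partition, test_size

def sketch_fold_indices(data_idx, partition, current_fold):
    """Split shuffled indices into train, validation and test for one fold."""
    start, stop = partition[current_fold], partition[current_fold + 1]
    val_idx = data_idx[start:stop]
    train_idx = np.concatenate([data_idx[:start], data_idx[stop:partition[-1]]])
    test_idx = data_idx[partition[-1]:]
    return train_idx, val_idx, test_idx

rng = np.random.default_rng(0)
data_idx = rng.permutation(100)
partition, test_size = sketch_partition(n_folds=4, test_size=0.2, n_ratings=100)
train_idx, val_idx, test_idx = sketch_fold_indices(data_idx, partition, current_fold=1)
```

In this reading the test set stays fixed across folds, while the validation block moves through the remaining data as current_fold changes.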
- negative_sample(test_size, train_val_size, seed=None, type='uniform')[source]¶
Negative sampling for cross-validation.
- Parameters:
test_size (int) – Number of test samples.
train_val_size (int) – Number of train and validation samples.
seed (int) – Random seed.
type (str) – Type of negative sampling.
PyBMF.datasets.MovieLensData module¶
- class PyBMF.datasets.MovieLensData.MovieLensData(path=None, size='1m')[source]¶
Bases: BaseData
Load MovieLens dataset.
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.
- sort_factor(X, dim, factor_info)[source]¶
Sort the matrix and factor_info by factor order.
- Parameters:
X (csr_matrix) – The matrix to be sorted.
dim (int) – If dim is 0, sort rows. If dim is 1, sort columns.
factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].
- Returns:
X (csr_matrix) – The sorted matrix.
factor_info (list of 2 tuples) – The list of factor info. For example, [u_order, u_idmap, u_alias].
PyBMF.datasets.MovieLensGenreCastData module¶
- class PyBMF.datasets.MovieLensGenreCastData.MovieLensGenreCastData(path=None, size='1m')[source]¶
Bases: MovieLensData
Load MovieLens dataset with IMDB genre and cast information.
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.
PyBMF.datasets.MovieLensGenreCastUserData module¶
- class PyBMF.datasets.MovieLensGenreCastUserData.MovieLensGenreCastUserData(path=None, size='1m')[source]¶
Bases: MovieLensData
Load MovieLens dataset with user profiles.
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.
PyBMF.datasets.MovieLensUserData module¶
- class PyBMF.datasets.MovieLensUserData.MovieLensUserData(path=None, size='1m')[source]¶
Bases: MovieLensData
Load MovieLens dataset with user profiles.
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'100k', '1m'}) – MovieLens dataset size.
PyBMF.datasets.NetflixData module¶
- class PyBMF.datasets.NetflixData.NetflixData(path=None, size='small')[source]¶
Bases: BaseData
Load Netflix dataset.
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'small', 'full'}) – Netflix data ‘small’ version, size 15MB, users ~10k, items 4945, ratings ~608k. Netflix data ‘full’ version, size 2.43GB, users ~480k, items 17770, ratings ~100M.
PyBMF.datasets.NetflixGenreCastData module¶
- class PyBMF.datasets.NetflixGenreCastData.NetflixGenreCastData(path=None, size='small', source='imdb')[source]¶
Bases: NetflixData
Load Netflix dataset with genre and cast information.
Genre and cast information comes from Netflix-Prize-IMDB-TMDB-Joint-Dataset on GitHub: https://github.com/felixnie/Netflix-Prize-IMDB-TMDB-Joint-Dataset
- Parameters:
path (str) – Path to the cached dataset.
size (str in {'small', 'full'}) – Netflix data ‘small’ version, size 15MB, users ~10k, items 4945, ratings ~608k. Netflix data ‘full’ version, size 2.43GB, users ~480k, items 17770, ratings ~100M.
source (str in {'imdb', 'tmdb'}) – Source should be ‘imdb’ or ‘tmdb’.
PyBMF.datasets.NoSplit module¶
PyBMF.datasets.RatioSplit module¶
- class PyBMF.datasets.RatioSplit.RatioSplit(X, test_size=None, val_size=None, seed=None)[source]¶
Bases: BaseSplit
Ratio split, usually used in prediction tasks.
- Parameters:
X (ndarray, spmatrix) – The data matrix.
test_size (int or float) – If it is int, test_size is the absolute number of samples. If it is float, test_size is the fraction of the dataset.
val_size (int or float) – If it is int, val_size is the absolute number of samples. If it is float, val_size is the fraction of the dataset.
seed (int) – Random seed.
- static get_indices(data_idx, train_size, test_size)[source]¶
Get indices for train, val and test set.
Used in RatioSplit and RatioSplit.negative_sample.
- Parameters:
data_idx (ndarray) – The indices of dataset.
train_size (int) – The size of training set.
test_size (int) – The size of test set.
- static get_size(val_size, test_size, n_ratings, train_size=None)[source]¶
Get size of train, val and test set.
Used in both RatioSplit and RatioSplit.negative_sample.
- Parameters:
val_size (int or float) – If it is int, val_size is the absolute number of samples. If it is float, val_size is the fraction of the dataset.
test_size (int or float) – If it is int, test_size is the absolute number of samples. If it is float, test_size is the fraction of the dataset.
n_ratings (int) – Number of ratings.
train_size (int, float or None) – If None, use the rest of the data. If 0.0, the training set is empty; this is used in negative sampling when there is no need to append negative samples to the training set.
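The int-or-float convention for sizes can be sketched with two hypothetical helpers. The rounding rule and the order of the returned tuple are assumptions made for this example, not the confirmed behaviour of get_size.

```python
def resolve_size(size, n_ratings):
    """Turn an int-or-float size into an absolute number of samples."""
    if isinstance(size, float):
        # Assumed rule: a float is a fraction of n_ratings, rounded to nearest.
        return int(round(size * n_ratings))
    return int(size)

def sketch_sizes(val_size, test_size, n_ratings, train_size=None):
    """Return (train_size, val_size, test_size) as absolute counts."""
    val_size = resolve_size(val_size, n_ratings)
    test_size = resolve_size(test_size, n_ratings)
    if train_size is None:
        # None means: whatever is left after validation and test.
        train_size = n_ratings - val_size - test_size
    else:
        # 0.0 resolves to an empty training set.
        train_size = resolve_size(train_size, n_ratings)
    return train_size, val_size, test_size

# Fractions of 1000 ratings, training set takes the remainder.
sizes_default = sketch_sizes(val_size=0.1, test_size=0.2, n_ratings=1000)

# Absolute sizes with an explicitly empty training set.
sizes_no_train = sketch_sizes(val_size=50, test_size=100, n_ratings=1000, train_size=0.0)
```

Note that `0.0` and `None` are deliberately different here: `None` fills the training set with the remainder, while `0.0` leaves it empty.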
- negative_sample(train_size=None, test_size=None, val_size=None, seed=None, type='uniform')[source]¶
Select and append negative samples onto train, val and test set.
- Parameters:
train_size (int or float) – If it is int, train_size is the absolute number of samples. If it is float, train_size is the fraction of the dataset.
test_size (int or float) – If it is int, test_size is the absolute number of samples. If it is float, test_size is the fraction of the dataset.
val_size (int or float) – If it is int, val_size is the absolute number of samples. If it is float, val_size is the fraction of the dataset.
seed (int) – Random seed.
type (str in {'uniform', 'popularity'}) – Type of negative sampling.
Module contents¶
- class PyBMF.datasets.NoSplit(X, seed=None)[source]¶
Bases: BaseSplit
No split, usually used in reconstruction tasks.
Designed for reconstruction tasks, where training, validation and testing use the same full set of samples. NoSplit supports negative sampling.