

oml.utils.dataframe_format.check_retrieval_dataframe_format(df: Union[Path, str, DataFrame], dataset_root: Optional[Path] = None, sep: str = ',', verbose: bool = True) -> None

Checks whether the dataset is in the correct format.

  • df – Path to .csv file or pandas DataFrame

  • dataset_root – Path to the dataset root, set None if you used absolute paths in your dataframe

  • sep – Separator used in .csv

  • verbose – Set True if you want to see warnings


oml.utils.download_mock_dataset.download_mock_dataset(dataset_root: Union[str, Path], check_md5: bool = True, df_name: str = 'df.csv') -> Tuple[DataFrame, DataFrame]

Downloads a mock dataset that is already prepared in the required format.

  • dataset_root – Path to save the dataset

  • check_md5 – Set True to check md5sum

  • df_name – Name of csv file for which output DataFrames will be returned

Returns: Dataframes for the training and validation stages


class oml.utils.misc_torch.PCA(embeddings: Tensor)

Bases: object

Principal component analysis (PCA).

Estimates principal axes and performs vector transformations.


The code is almost the same as the one from sklearn, but we had two reasons to have our own implementation. First, we need to work with Torch tensors instead of NumPy arrays. Second, we wanted to avoid one more external dependency.


Matrix of shape [embeddings_dim, embeddings_dim]. Principal axes in embeddings space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by explained_variance.




Array of size embeddings_dim. The amount of variance explained by each of the selected components. The variance estimation uses n_embeddings - 1 degrees of freedom. Equal to eigenvalues of the covariance matrix of embeddings.




Array of size embeddings_dim. Percentage of variance explained by each of the components.




Array of size embeddings_dim. The singular values corresponding to each of the selected components.




Array of size embeddings_dim. Per-feature empirical mean, estimated from the training set. Equal to embeddings.mean(dim=0).
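
The relationship between the quantities described above can be checked numerically. The snippet below is a minimal sketch in plain Torch and does not call the PCA class itself; the variable names are illustrative assumptions. It shows that squared singular values of the centered data, divided by n_embeddings - 1, match the eigenvalues of the covariance matrix.

```python
import torch

torch.manual_seed(0)
X = torch.rand(50, 4, dtype=torch.float64)  # [n_embeddings, embeddings_dim]
n = X.shape[0]

# Per-feature empirical mean, then centering.
mean = X.mean(dim=0)
X_centered = X - mean

# Singular values of the centered data, in descending order.
singular_values = torch.linalg.svdvals(X_centered)

# Variance per principal axis, using n - 1 degrees of freedom.
explained_variance = singular_values ** 2 / (n - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Eigenvalues of the covariance matrix (torch.cov expects variables in rows).
# eigvalsh returns them in ascending order, so flip to descending.
eigenvalues = torch.linalg.eigvalsh(torch.cov(X.T)).flip(0)
```

Here explained_variance and eigenvalues agree up to floating-point error, and explained_variance_ratio sums to 1.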



For an embeddings matrix \(X\) of shape \(n\times d\) the principal axes can be found by performing Singular Value Decomposition

\[X = U\Sigma V^T\]

where \(U\) is an \(n\times n\) orthogonal matrix, \(\Sigma\) is an \(n\times d\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V\) is a \(d\times d\) orthogonal matrix.

Columns of \(V\) (that is, rows of \(V^T\)) form an orthonormal basis and can be used to project embeddings to a new space, possibly of lower dimension:

\[X' = X\cdot V\]

The inverse transform is done by

\[X = X'\cdot V^T\]
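
The projection and its inverse can be sketched directly with torch.linalg.svd. This is an illustration under stated assumptions rather than the class's actual code: the data is centered before the decomposition, and the variable names are hypothetical. Note that torch.linalg.svd returns \(V^T\) (named Vh below), whose rows are the principal axes.

```python
import torch

torch.manual_seed(0)
X = torch.rand(100, 5, dtype=torch.float64)

# Center the data; the mean is added back after the inverse transform.
mean = X.mean(dim=0)
X_centered = X - mean

# Reduced SVD: U is [100, 5], S is [5], Vh is [5, 5] and holds V^T.
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

# Project onto all principal axes, then map back to the original space.
X_proj = X_centered @ Vh.T
X_back = X_proj @ Vh + mean

# Keeping only the first k axes gives a lower-dimensional representation.
k = 2
X_low = X_centered @ Vh[:k].T
```

With all axes kept, X_back reproduces X up to floating-point error. With k < 5 the reconstruction is only approximate.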



>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> embeddings_transformed = pca.transform(embeddings)
>>> embeddings_recovered = pca.inverse_transform(embeddings_transformed)
>>> torch.all(torch.isclose(embeddings, embeddings_recovered, atol=1.e-6))

__init__(embeddings: Tensor)

embeddings – Embeddings matrix with the shape of [n_embeddings, embeddings_dim].

transform(embeddings: Tensor, n_components: Optional[int] = None) -> Tensor

Apply fitted PCA to transform embeddings.

  • embeddings – Matrix of shape [n_embeddings, embeddings_dim].

  • n_components – The desired dimension of the output.


Returns: Transformed embeddings.

inverse_transform(embeddings: Tensor) -> Tensor

Apply inverse transform to embeddings.


embeddings – Matrix of shape [n_embeddings, N] where N <= embeddings_dim is the dimension of embeddings.


Returns: Embeddings projected into the original embeddings space.

calc_principal_axes_number(pcf_variance: Tuple[float, ...]) -> Tensor

Estimates the number of principal axes required to explain the fraction of variance given by pcf_variance.


pcf_variance – Values in range [0, 1]. Find the number of components such that the amount of variance that needs to be explained is greater than the fraction specified by pcf_variance.


Returns: Number of principal axes for each value in pcf_variance.

Let \(X\) be a set of \(d\)-dimensional embeddings. Let \(\lambda_1, \ldots, \lambda_d\in\mathbb{R}\) be the eigenvalues of the covariance matrix of \(X\) sorted in descending order. Then for a given value of desired explained variance \(r\), the number of principal components that explains \(r\cdot 100\%\) of the variance is the largest integer \(n\) such that

\[\frac{\sum\limits_{i = 1}^{n - 1}\lambda_i}{\sum\limits_{i = 1}^{d}\lambda_i} \leq r\]


In the example below there are 4 vectors of length 10, and only the first 4 dimensions have non-zero values. Their covariance matrix has only 4 eigenvalues greater than 0, i.e. there are only 4 principal axes. So, to keep at least 50% of the information from the set, we need to keep 2 principal axes, and to keep all the information we need to keep 5 principal axes (one additional axis appears because the returned number is the largest \(n\) for which the first \(n - 1\) axes explain at most the requested fraction of variance).

>>> embeddings = torch.eye(4, 10, dtype=torch.float)
>>> pca = PCA(embeddings)
>>> pca.calc_principal_axes_number(pcf_variance=(0.5, 1))
tensor([2, 5])
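
The selection rule from the inequality above can also be traced by hand. The helper below is a hypothetical illustration in plain Python, not the library's implementation; it assumes the eigenvalues are non-negative and already sorted in descending order.

```python
from itertools import accumulate


def principal_axes_number(eigenvalues, r):
    """Largest n such that the first n - 1 eigenvalues explain at most a fraction r."""
    total = sum(eigenvalues)
    # Cumulative explained-variance ratios for the first 1, 2, ..., d axes.
    ratios = [partial / total for partial in accumulate(eigenvalues)]
    # The ratios are non-decreasing, so counting those <= r gives n - 1.
    return sum(ratio <= r for ratio in ratios) + 1


eigenvalues = [4.0, 3.0, 2.0, 1.0]  # cumulative ratios: 0.4, 0.7, 0.9, 1.0
n_half = principal_axes_number(eigenvalues, 0.5)   # 2
n_most = principal_axes_number(eigenvalues, 0.95)  # 4
```

For r = 0.5 only the first axis stays at or below the threshold (0.4), so n = 2. For r = 0.95 the first three axes do (0.9), so n = 4.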


oml.utils.misc_torch.take_2d(x: Tensor, indices: Tensor) -> Tensor
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P]. Note that rows in indices may contain duplicated values, which means the same element of x can be taken several times.


Returns: Tensor of the items picked from x with the shape of [N, P]
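
The row-wise picking described here behaves like torch.gather along dimension 1. The sketch below uses torch.gather as a stand-in and makes no claim about how take_2d is implemented internally; the tensors are made-up sample data.

```python
import torch

x = torch.tensor([[10, 11, 12],
                  [20, 21, 22]])          # shape [N=2, M=3]

# Indices may repeat within a row, and P may exceed M.
indices = torch.tensor([[2, 0, 2, 1],
                        [1, 1, 0, 0]])    # shape [N=2, P=4]

# picked[i, j] == x[i, indices[i, j]]
picked = torch.gather(x, dim=1, index=indices)
# picked is [[12, 10, 12, 11],
#            [21, 21, 20, 20]]
```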


oml.utils.misc_torch.assign_2d(x: Tensor, indices: Tensor, new_values: Tensor) -> Tensor
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P], where P <= M

  • new_values – Tensor with the shape of [N, P]


Returns: Tensor with the shape of [N, M] constructed by the following rule

x[i, indices[i, j]] = new_values[i, j]
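
The rule x[i, indices[i, j]] = new_values[i, j] matches what Tensor.scatter does along dimension 1. The sketch below uses scatter on a copy as a stand-in. Whether assign_2d itself modifies x in place is not stated here, so working on a clone is an assumption of this example.

```python
import torch

x = torch.zeros(2, 4, dtype=torch.long)   # shape [N=2, M=4]
indices = torch.tensor([[0, 3],
                        [1, 2]])          # shape [N=2, P=2], P <= M
new_values = torch.tensor([[5, 6],
                           [7, 8]])       # shape [N=2, P=2]

# result[i, indices[i, j]] = new_values[i, j]; other entries keep x's values.
result = x.clone().scatter(1, indices, new_values)
# result is [[5, 0, 0, 6],
#            [0, 7, 8, 0]]
```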