Utils

check_retrieval_dataframe_format

oml.utils.dataframe_format.check_retrieval_dataframe_format(df: Union[Path, str, DataFrame], dataset_root: Optional[Path] = None, sep: str = ',', verbose: bool = True) → None

Function checks whether the dataset is in the correct format.

Parameters
  • df – Path to .csv file or pandas DataFrame

  • dataset_root – Path to the dataset root, set None if you used absolute paths in your dataframe

  • sep – Separator used in .csv

  • verbose – Set True if you want to see warnings
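
Example

A minimal usage sketch (the csv path and dataset root below are hypothetical):

>>> from pathlib import Path
>>> from oml.utils.dataframe_format import check_retrieval_dataframe_format
>>> check_retrieval_dataframe_format(df="data/df.csv", dataset_root=Path("data"))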

download_mock_dataset

oml.utils.download_mock_dataset.download_mock_dataset(dataset_root: Union[str, Path], check_md5: bool = True, df_name: str = 'df.csv') → Tuple[DataFrame, DataFrame]

Function to download a mock dataset which is already prepared in the required format.

Parameters
  • dataset_root – Path to save the dataset

  • check_md5 – Set True to check md5sum

  • df_name – Name of the csv file for which the output DataFrames will be returned

Returns: Dataframes for the training and validation stages
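
Example

A minimal usage sketch (the target directory is hypothetical):

>>> from oml.utils.download_mock_dataset import download_mock_dataset
>>> df_train, df_val = download_mock_dataset(dataset_root="data/mock_dataset")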

PCA

class oml.utils.misc_torch.PCA(embeddings: Tensor)

Bases: object

Principal component analysis (PCA).

Estimates principal axes and transforms vectors accordingly.

Note

The code is almost the same as the one in sklearn, but we had two reasons to keep our own implementation. First, we need to work with Torch tensors instead of NumPy arrays. Second, we wanted to avoid one more external dependency.

components

Matrix of shape [embeddings_dim, embeddings_dim]. Principal axes in embeddings space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by explained_variance.

Type

torch.Tensor

explained_variance

Array of size embeddings_dim. The amount of variance explained by each of the selected components. The variance estimation uses n_embeddings - 1 degrees of freedom. Equal to the eigenvalues of the covariance matrix of embeddings.

Type

torch.Tensor

explained_variance_ratio

Array of size embeddings_dim. Percentage of variance explained by each of the components.

Type

torch.Tensor

singular_values

Array of size embeddings_dim. The singular values corresponding to each of the selected components.

Type

torch.Tensor

mean

Array of size embeddings_dim. Per-feature empirical mean, estimated from the training set. Equal to embeddings.mean(dim=0).

Type

torch.Tensor

For an embeddings matrix \(X\) of shape \(n\times d\), the principal axes can be found by performing Singular Value Decomposition:

\[X = U\Sigma V^T\]

where \(U\) is an \(n\times n\) orthogonal matrix, \(\Sigma\) is an \(n\times d\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V\) is a \(d\times d\) orthogonal matrix.

Rows of \(V\) form an orthonormal basis and can be used to project embeddings into a new space, possibly of lower dimension:

\[X' = X\cdot V^T\]

The inverse transform is done by

\[X = X'\cdot V\]
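
This decomposition can be reproduced directly with torch; a minimal sketch of the math above (the variable names are illustrative and not part of the class API; here Vh plays the role of \(V\)):

>>> import torch
>>> X = torch.rand(100, 5)
>>> Xc = X - X.mean(dim=0)                       # center the data, as PCA does
>>> U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)
>>> X_proj = Xc @ Vh.T                           # project onto the principal axes
>>> torch.allclose(Xc, X_proj @ Vh, atol=1e-5)   # inverse projection recovers Xc
True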


Example

>>> import torch
>>> from oml.utils.misc_torch import PCA
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> embeddings_transformed = pca.transform(embeddings)
>>> embeddings_recovered = pca.inverse_transform(embeddings_transformed)
>>> torch.all(torch.isclose(embeddings, embeddings_recovered, atol=1.e-6))
tensor(True)
__init__(embeddings: Tensor)
Parameters

embeddings – Embeddings matrix with the shape of [n_embeddings, embeddings_dim].

transform(embeddings: Tensor, n_components: Optional[int] = None) → Tensor

Apply fitted PCA to transform embeddings.

Parameters
  • embeddings – Matrix of shape [n_embeddings, embeddings_dim].

  • n_components – The desired dimension of the output.

Returns

Transformed embeddings.
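
For instance, passing n_components reduces the output dimension (a short sketch based on the signature above):

>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> pca.transform(embeddings, n_components=2).shape
torch.Size([100, 2])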

inverse_transform(embeddings: Tensor) → Tensor

Apply inverse transform to embeddings.

Parameters

embeddings – Matrix of shape [n_embeddings, N], where N <= embeddings_dim is the dimension of the transformed embeddings.

Returns

Embeddings projected into original embeddings space.

calc_principal_axes_number(pcf_variance: Tuple[float, ...]) → Tensor

Function estimates the number of principal axes that are required to explain each fraction of variance given in pcf_variance.

Parameters

pcf_variance – Values in the range [0, 1]. For each value, find the number of components such that the fraction of variance they explain is greater than that value.

Returns

Tensor with the number of principal axes for each value of pcf_variance.

Let \(X\) be a set of \(d\)-dimensional embeddings, and let \(\lambda_1, \ldots, \lambda_d\in\mathbb{R}\) be the eigenvalues of the covariance matrix of \(X\), sorted in descending order. Then, for a given desired explained variance \(r\), the number of principal components that explains \(r\cdot 100\%\) of the variance is the largest integer \(n\) such that

\[\frac{\sum\limits_{i = 1}^{n - 1}\lambda_i}{\sum\limits_{i = 1}^{d}\lambda_i} \leq r\]

Example

In the example below there are 4 vectors of length 10, and only the first 4 dimensions have non-zero values. The covariance matrix of this set has only 4 eigenvalues greater than 0, i.e. there are only 4 principal axes. So, in order to keep at least 50% of the information from the set, we need to keep 2 principal axes, and in order to keep all of the information we need to keep 5 principal axes (one additional axis appears because, by the rule above, the number of axes is chosen so that the explained variance exceeds the desired threshold).

>>> embeddings = torch.eye(4, 10, dtype=torch.float)
>>> pca = PCA(embeddings)
>>> pca.calc_principal_axes_number(pcf_variance=(0.5, 1))
tensor([2, 5])

take_2d

oml.utils.misc_torch.take_2d(x: Tensor, indices: Tensor) → Tensor
Parameters
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P]. Note that rows of indices may contain duplicated values, which means the same element of x can be taken several times.

Returns

Tensor of the items picked from x with the shape of [N, P]
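
Example

Assuming take_2d follows the same row-wise rule as assign_2d below, i.e. out[i, j] = x[i, indices[i, j]], its behavior can be illustrated with torch.gather:

>>> import torch
>>> x = torch.tensor([[10, 20, 30], [40, 50, 60]])
>>> indices = torch.tensor([[2, 0, 0], [1, 1, 2]])
>>> torch.gather(x, 1, indices)  # out[i, j] = x[i, indices[i, j]]
tensor([[30, 10, 10],
        [50, 50, 60]])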

assign_2d

oml.utils.misc_torch.assign_2d(x: Tensor, indices: Tensor, new_values: Tensor) → Tensor
Parameters
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P], where P <= M

  • new_values – Tensor with the shape of [N, P]

Returns

Tensor with the shape of [N, M] constructed by the following rule:

x[i, indices[i, j]] = new_values[i, j]

Return type

Tensor
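
Example

The assignment rule above matches what torch's scatter does along dim 1; a minimal sketch of the expected behavior (not necessarily the actual implementation):

>>> import torch
>>> x = torch.zeros(2, 4)
>>> indices = torch.tensor([[0, 2], [1, 3]])
>>> new_values = torch.tensor([[1., 2.], [3., 4.]])
>>> x.scatter(1, indices, new_values)  # out[i, indices[i, j]] = new_values[i, j]
tensor([[1., 0., 2., 0.],
        [0., 3., 0., 4.]])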