Utils

check_retrieval_dataframe_format

oml.utils.dataframe_format.check_retrieval_dataframe_format(df: Union[Path, str, DataFrame], dataset_root: Optional[Path] = None, sep: str = ',', verbose: bool = True) → None

Function checks whether the dataset is in the correct format.

Parameters
  • df – Path to .csv file or pandas DataFrame

  • dataset_root – Path to the dataset root, set None if you used absolute paths in your dataframe

  • sep – Separator used in .csv

  • verbose – Set True if you want to see warnings
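
Example

A minimal usage sketch (the csv path and dataset root below are hypothetical):

>>> from pathlib import Path
>>> from oml.utils.dataframe_format import check_retrieval_dataframe_format
>>> check_retrieval_dataframe_format(df="data/df.csv", dataset_root=Path("data"))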

download_mock_dataset

oml.utils.download_mock_dataset.download_mock_dataset(dataset_root: Union[str, Path], check_md5: bool = True, df_name: str = 'df.csv') → Tuple[DataFrame, DataFrame]

Function to download a mock dataset which is already prepared in the required format.

Parameters
  • dataset_root – Path to save the dataset

  • check_md5 – Set True to check md5sum

  • df_name – Name of the csv file for which the output DataFrames will be returned

Returns: Dataframes for the training and validation stages
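
Example

A minimal usage sketch (the target directory is hypothetical):

>>> from oml.utils.download_mock_dataset import download_mock_dataset
>>> df_train, df_val = download_mock_dataset(dataset_root="data/mock_dataset")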

PCA

class oml.utils.misc_torch.PCA(embeddings: Tensor)

Bases: object

Principal component analysis (PCA).

Estimates principal axes and transforms vectors accordingly.

Note

The code is almost the same as the one in sklearn, but we had two reasons to keep our own implementation. First, we need to work with Torch tensors instead of NumPy arrays. Second, we wanted to avoid one more external dependency.

components

Matrix of shape [embeddings_dim, embeddings_dim]. Principal axes in embeddings space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by explained_variance.

Type

torch.Tensor

explained_variance

Array of size embeddings_dim. The amount of variance explained by each of the selected components. The variance estimation uses n_embeddings - 1 degrees of freedom. Equal to the eigenvalues of the covariance matrix of embeddings.

Type

torch.Tensor

explained_variance_ratio

Array of size embeddings_dim. Percentage of variance explained by each of the components.

Type

torch.Tensor

singular_values

Array of size embeddings_dim. The singular values corresponding to each of the selected components.

Type

torch.Tensor

mean

Array of size embeddings_dim. Per-feature empirical mean, estimated from the training set. Equal to embeddings.mean(dim=0).

Type

torch.Tensor

For an embeddings matrix \(X\) of shape \(n\times d\), the principal axes can be found by performing Singular Value Decomposition:

\[X = U\Sigma V^T\]

where \(U\) is an \(n\times n\) orthogonal matrix, \(\Sigma\) is an \(n\times d\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V\) is a \(d\times d\) orthogonal matrix.

Rows of \(V\) form an orthonormal basis and can be used to project embeddings into a new space, possibly of lower dimension:

\[X' = X\cdot V^T\]

The inverse transform is done by

\[X = X'\cdot V\]
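
This decomposition can be reproduced directly with torch; a minimal sketch of the math above (the variable names are illustrative and not part of the class API; here Vh plays the role of \(V\)):

>>> import torch
>>> X = torch.rand(100, 5)
>>> Xc = X - X.mean(dim=0)                       # center the data, as PCA does
>>> U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)
>>> X_proj = Xc @ Vh.T                           # project onto the principal axes
>>> torch.allclose(Xc, X_proj @ Vh, atol=1e-5)   # inverse projection recovers Xc
True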


Example

>>> import torch
>>> from oml.utils.misc_torch import PCA
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> embeddings_transformed = pca.transform(embeddings)
>>> embeddings_recovered = pca.inverse_transform(embeddings_transformed)
>>> torch.all(torch.isclose(embeddings, embeddings_recovered, atol=1.e-6))
tensor(True)
__init__(embeddings: Tensor)
Parameters

embeddings – Embeddings matrix with the shape of [n_embeddings, embeddings_dim].

transform(embeddings: Tensor, n_components: Optional[int] = None) → Tensor

Apply fitted PCA to transform embeddings.

Parameters
  • embeddings – Matrix of shape [n_embeddings, embeddings_dim].

  • n_components – The desired dimension of the output.

Returns

Transformed embeddings.
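
For instance, passing n_components reduces the output dimension (a short sketch based on the signature above):

>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> pca.transform(embeddings, n_components=2).shape
torch.Size([100, 2])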

inverse_transform(embeddings: Tensor) → Tensor

Apply inverse transform to embeddings.

Parameters

embeddings – Matrix of shape [n_embeddings, N], where N <= embeddings_dim is the dimension of the transformed embeddings.

Returns

Embeddings projected into original embeddings space.

calc_principal_axes_number(pcf_variance: Tuple[float, ...]) → Tensor

Function estimates the number of principal axes that are required to explain each fraction of variance given in pcf_variance.

Parameters

pcf_variance – Values in the range [0, 1]. For each value, find the number of components such that the fraction of variance they explain is greater than that value.

Returns

Tensor with the number of principal axes for each value of pcf_variance.

Let \(X\) be a set of \(d\)-dimensional embeddings, and let \(\lambda_1, \ldots, \lambda_d\in\mathbb{R}\) be the eigenvalues of the covariance matrix of \(X\), sorted in descending order. Then, for a given desired explained variance \(r\), the number of principal components that explains \(r\cdot 100\%\) of the variance is the largest integer \(n\) such that

\[\frac{\sum\limits_{i = 1}^{n - 1}\lambda_i}{\sum\limits_{i = 1}^{d}\lambda_i} \leq r\]

Example

In the example below there are 4 vectors of length 10, and only the first 4 dimensions have non-zero values. The covariance matrix of this set has only 4 eigenvalues greater than 0, i.e. there are only 4 principal axes. So, in order to keep at least 50% of the information from the set, we need to keep 2 principal axes, and in order to keep all of the information we need to keep 5 principal axes (one additional axis appears because, by the rule above, the number of axes is chosen so that the explained variance exceeds the desired threshold).

>>> embeddings = torch.eye(4, 10, dtype=torch.float)
>>> pca = PCA(embeddings)
>>> pca.calc_principal_axes_number(pcf_variance=(0.5, 1))
tensor([2, 5])

take_2d

oml.utils.misc_torch.take_2d(x: Tensor, indices: Tensor) → Tensor
Parameters
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P]. Note that rows of indices may contain duplicated values, which means the same element of x can be taken several times.

Returns

Tensor of the items picked from x with the shape of [N, P]
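
Example

Assuming take_2d follows the same row-wise rule as assign_2d below, i.e. out[i, j] = x[i, indices[i, j]], its behavior can be illustrated with torch.gather:

>>> import torch
>>> x = torch.tensor([[10, 20, 30], [40, 50, 60]])
>>> indices = torch.tensor([[2, 0, 0], [1, 1, 2]])
>>> torch.gather(x, 1, indices)  # out[i, j] = x[i, indices[i, j]]
tensor([[30, 10, 10],
        [50, 50, 60]])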

assign_2d

oml.utils.misc_torch.assign_2d(x: Tensor, indices: Tensor, new_values: Tensor) → Tensor
Parameters
  • x – Tensor with the shape of [N, M]

  • indices – Tensor of integers with the shape of [N, P], where P <= M

  • new_values – Tensor with the shape of [N, P]

Returns

Tensor with the shape of [N, M] constructed by the following rule:

x[i, indices[i, j]] = new_values[i, j]

Return type

Tensor
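
Example

The assignment rule above matches what torch's scatter does along dim 1; a minimal sketch of the expected behavior (not necessarily the actual implementation):

>>> import torch
>>> x = torch.zeros(2, 4)
>>> indices = torch.tensor([[0, 2], [1, 3]])
>>> new_values = torch.tensor([[1., 2.], [3., 4.]])
>>> x.scatter(1, indices, new_values)  # out[i, indices[i, j]] = new_values[i, j]
tensor([[1., 0., 2., 0.],
        [0., 3., 0., 4.]])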