Utils
check_retrieval_dataframe_format
- oml.utils.dataframe_format.check_retrieval_dataframe_format(df: Union[Path, str, DataFrame], dataset_root: Optional[Path] = None, sep: str = ',', verbose: bool = True) -> None
Function checks if the dataset is in the correct format.
- Parameters
df – Path to .csv file or pandas DataFrame
dataset_root – Path to the dataset root, set None if you used absolute paths in your dataframe
sep – Separator used in .csv
verbose – Set True if you want to see warnings
PCA
- class oml.utils.misc_torch.PCA(embeddings: Tensor)
Bases: object
Principal component analysis (PCA).
Estimates the principal axes and transforms vectors.
Note
The code is almost the same as the one from sklearn, but we had two reasons to have our own implementation. First, we need to work with Torch tensors instead of NumPy arrays. Second, we wanted to avoid one more external dependency.
- components
Matrix of shape [embeddings_dim, embeddings_dim]. Principal axes in embeddings space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by explained_variance.
- Type
torch.Tensor
- explained_variance
Array of size embeddings_dim. The amount of variance explained by each of the selected components. The variance estimation uses n_embeddings - 1 degrees of freedom. Equal to eigenvalues of the covariance matrix of embeddings.
- Type
torch.Tensor
- explained_variance_ratio
Array of size embeddings_dim. Percentage of variance explained by each of the components.
- Type
torch.Tensor
- singular_values
Array of size embeddings_dim. The singular values corresponding to each of the selected components.
- Type
torch.Tensor
- mean
Array of size embeddings_dim. Per-feature empirical mean, estimated from the training set. Equal to embeddings.mean(dim=0).
- Type
torch.Tensor
For an embeddings matrix \(X\) of shape \(n\times d\), the principal axes can be found by performing Singular Value Decomposition
\[X = U\Sigma V^T\]
where \(U\) is an \(n\times n\) orthogonal matrix, \(\Sigma\) is an \(n\times d\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V\) is a \(d\times d\) orthogonal matrix.
Rows of \(V\) form an orthonormal basis and can be used to project embeddings to a new space, possibly of lower dimension:
\[X' = X\cdot V^T\]
The inverse transform is done by
\[X = X'\cdot V\]
Example
>>> import torch
>>> from oml.utils.misc_torch import PCA
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> embeddings_transformed = pca.transform(embeddings)
>>> embeddings_recovered = pca.inverse_transform(embeddings_transformed)
>>> torch.all(torch.isclose(embeddings, embeddings_recovered, atol=1.e-6))
tensor(True)
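The relations between the attributes listed above can be checked with plain PyTorch. The following is a minimal sketch, not the library's implementation: it assumes the data is centered before the decomposition, and all variable names (`X`, `Vh`, `X_projected` and so on) are illustrative. Here `Vh` is the matrix returned by `torch.linalg.svd`, whose rows play the role of the principal axes.

```python
import torch

torch.manual_seed(0)
X = torch.rand(100, 5)                 # [n_embeddings, embeddings_dim]
n = X.shape[0]

# Center the data, then decompose: X_centered = U @ diag(S) @ Vh.
mean = X.mean(dim=0)
X_centered = X - mean
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each axis, using n - 1 degrees of freedom.
explained_variance = S**2 / (n - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the principal axes (rows of Vh) and map back.
X_projected = X_centered @ Vh.T
X_recovered = X_projected @ Vh + mean

# Eigenvalues of the covariance matrix, sorted in descending order.
covariance = X_centered.T @ X_centered / (n - 1)
cov_eigenvalues = torch.linalg.eigvalsh(covariance).flip(0)
```

Under these assumptions, `explained_variance` agrees with `cov_eigenvalues` up to floating point error, and `X_recovered` reproduces `X` because all five axes are kept.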
- __init__(embeddings: Tensor)
- Parameters
embeddings – Embeddings matrix with the shape of [n_embeddings, embeddings_dim].
- transform(embeddings: Tensor, n_components: Optional[int] = None) -> Tensor
Apply fitted PCA to transform embeddings.
- Parameters
embeddings – Matrix of shape [n_embeddings, embeddings_dim].
n_components – The desired dimension of the output.
- Returns
Transformed embeddings.
- inverse_transform(embeddings: Tensor) -> Tensor
Apply inverse transform to embeddings.
- Parameters
embeddings – Matrix of shape [n_embeddings, N], where N <= embeddings_dim is the dimension of embeddings.
- Returns
Embeddings projected into original embeddings space.
- calc_principal_axes_number(pcf_variance: Tuple[float, ...]) -> Tensor
Estimates the number of principal axes required to explain the fraction of variance given by pcf_variance.
- Parameters
pcf_variance – Values in range [0, 1]. For each value, find the number of components such that the fraction of variance they explain is greater than that value.
- Returns
Tensor with the number of principal axes for each value in pcf_variance.
Let \(X\) be a set of \(d\)-dimensional embeddings. Let \(\lambda_1, \ldots, \lambda_d\in\mathbb{R}\) be the eigenvalues of the covariance matrix of \(X\), sorted in descending order. Then, for a given value of desired explained variance \(r\), the number of principal components that explains \(r\cdot 100\%\) of the variance is the largest integer \(n\) such that
\[\frac{\sum\limits_{i = 1}^{n - 1}\lambda_i}{\sum\limits_{i = 1}^{d}\lambda_i} \leq r\]
Example
In the example below there are 4 vectors of length 10, and only the first 4 dimensions have non-zero values. Their covariance matrix will have only 4 eigenvalues that are greater than 0, i.e. there are only 4 principal axes. So, in order to keep at least 50% of the information from the set, we need to keep 2 principal axes, and in order to keep all the information we need to keep 5 principal axes (one additional axis appears because, by the inequality above, the returned number is the largest \(n\) whose first \(n - 1\) axes explain at most the requested fraction of variance).
>>> embeddings = torch.eye(4, 10, dtype=torch.float)
>>> pca = PCA(embeddings)
>>> pca.calc_principal_axes_number(pcf_variance=(0.5, 1))
tensor([2, 5])
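The counting rule in the inequality above can be traced by hand on a small case. The sketch below uses only the Python standard library; the helper name `principal_axes_number` and the eigenvalues are hypothetical and chosen for easy arithmetic, and the code illustrates the stated rule rather than the library's implementation.

```python
from bisect import bisect_right
from itertools import accumulate


def principal_axes_number(eigenvalues, thresholds):
    """For each r, return the largest n whose first n - 1 eigenvalues explain at most r."""
    total = sum(eigenvalues)
    # cumulative_ratio[k - 1] is the fraction explained by the first k eigenvalues.
    cumulative_ratio = [s / total for s in accumulate(eigenvalues)]
    # bisect_right gives the largest k with cumulative ratio <= r; n is k + 1.
    return [bisect_right(cumulative_ratio, r) + 1 for r in thresholds]


# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = [4.0, 2.0, 1.0, 1.0]
counts = principal_axes_number(eigenvalues, (0.5, 0.8, 1.0))
print(counts)  # [2, 3, 5]
```

The cumulative ratios here are 0.5, 0.75, 0.875 and 1.0, so a threshold of 1.0 yields 5: even all four eigenvalues together do not exceed the threshold, which is the "one additional axis" effect described in the example above.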
take_2d
- oml.utils.misc_torch.take_2d(x: Tensor, indices: Tensor) -> Tensor
- Parameters
x – Tensor with the shape of [N, M]
indices – Tensor of integers with the shape of [N, P]. Note, rows in indices may contain duplicated values. It means that we can take the same element from x several times.
- Returns
Tensor of the items picked from x with the shape of [N, P]
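A row-wise pick of this kind can be reproduced with plain PyTorch. The sketch below assumes the semantics `out[i, j] = x[i, indices[i, j]]`, which the description suggests but does not state explicitly, and it uses `torch.gather` rather than the library function itself.

```python
import torch

x = torch.tensor([[10, 11, 12],
                  [20, 21, 22]])        # shape [N, M] = [2, 3]
indices = torch.tensor([[2, 2],
                        [0, 1]])        # shape [N, P] = [2, 2], first row has a duplicate

# Assumed rule: picked[i, j] == x[i, indices[i, j]]
picked = torch.gather(x, dim=1, index=indices)
print(picked)  # rows: [12, 12] and [20, 21]
```

The duplicated index in the first row takes the same element of `x` twice, as the note on `indices` allows.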
assign_2d
- oml.utils.misc_torch.assign_2d(x: Tensor, indices: Tensor, new_values: Tensor) -> Tensor
- Parameters
x – Tensor with the shape of [N, M]
indices – Tensor of integers with the shape of [N, P], where P <= M
new_values – Tensor with the shape of [N, P]
- Returns
Tensor with the shape of [N, M] constructed by the following rule: x[i, indices[i, j]] = new_values[i, j]
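The assignment rule can be illustrated with PyTorch advanced indexing. This is a sketch under assumptions: it works on a copy of `x`, since the documentation above does not say whether the library modifies `x` in place, and the `rows` helper is introduced only for this example.

```python
import torch

x = torch.zeros(2, 4)                            # shape [N, M] = [2, 4]
indices = torch.tensor([[0, 3],
                        [1, 2]])                 # shape [N, P] = [2, 2]
new_values = torch.tensor([[1.0, 2.0],
                           [3.0, 4.0]])          # shape [N, P]

# Row numbers with shape [N, 1]; they broadcast against indices of shape [N, P].
rows = torch.arange(x.shape[0]).unsqueeze(1)

# Apply result[i, indices[i, j]] = new_values[i, j] on a copy (an assumption).
result = x.clone()
result[rows, indices] = new_values
print(result)  # rows: [1, 0, 0, 2] and [0, 3, 4, 0]
```

Positions of `x` that are not listed in `indices` keep their original values.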