Utils
check_retrieval_dataframe_format
- oml.utils.dataframe_format.check_retrieval_dataframe_format(df: Union[Path, str, DataFrame], dataset_root: Optional[Path] = None, sep: str = ',', verbose: bool = True) None [source]
Function checks if the dataset is in the correct format.
- Parameters
- df – Path to a .csv file or a pandas DataFrame
- dataset_root – Path to the dataset root; set None if you used absolute paths in your dataframe
- sep – Separator used in the .csv
- verbose – Set True if you want to see warnings
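A minimal usage sketch (the .csv name and dataset location below are hypothetical):
>>> from pathlib import Path
>>> from oml.utils.dataframe_format import check_retrieval_dataframe_format
>>> check_retrieval_dataframe_format(df="df.csv", dataset_root=Path("data/my_dataset"), sep=",", verbose=True)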
PCA
- class oml.utils.misc_torch.PCA(embeddings: Tensor)[source]
Bases: object
Principal component analysis (PCA).
Estimates principal axes and performs vector transformations.
Note
The code is almost the same as the one from sklearn, but we had two reasons to have our own implementation. First, we need to work with Torch tensors instead of NumPy arrays. Second, we wanted to avoid one more external dependency.
- components
Matrix of shape [embeddings_dim, embeddings_dim]. Principal axes in the embeddings space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by explained_variance.
- Type
torch.Tensor
- explained_variance
Array of size embeddings_dim. The amount of variance explained by each of the selected components. The variance estimation uses n_embeddings - 1 degrees of freedom. Equal to the eigenvalues of the covariance matrix of embeddings.
- Type
torch.Tensor
- explained_variance_ratio
Array of size embeddings_dim. Percentage of variance explained by each of the components.
- Type
torch.Tensor
- singular_values
Array of size embeddings_dim. The singular values corresponding to each of the selected components.
- Type
torch.Tensor
- mean
Array of size embeddings_dim. Per-feature empirical mean, estimated from the training set. Equal to embeddings.mean(dim=0).
- Type
torch.Tensor
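For instance, fitting on 100 random 5-dimensional embeddings gives attributes of the shapes below (a small sanity-check sketch based on the descriptions above):
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> pca.components.shape
torch.Size([5, 5])
>>> pca.mean.shape
torch.Size([5])
>>> bool(torch.isclose(pca.explained_variance_ratio.sum(), torch.tensor(1.)))
True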
For an embeddings matrix \(X\) of shape \(n\times d\) the principal axes can be found by performing Singular Value Decomposition
\[X = U\Sigma V^T\]
where \(U\) is an \(n\times n\) orthogonal matrix, \(\Sigma\) is an \(n\times d\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V\) is a \(d\times d\) orthogonal matrix.
Rows of \(V^T\) (the right singular vectors of the centered data) form an orthonormal basis and can be used to project embeddings into a new space, possibly of lower dimension:
\[X' = X\cdot V\]
The inverse transform is done by
\[X = X'\cdot V^T\]
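These relations can be reproduced directly with torch.linalg.svd; the sketch below is for illustration only and is not the internals of this class:
>>> X = torch.rand(100, 5)
>>> Xc = X - X.mean(dim=0)  # center the embeddings
>>> U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)  # rows of Vh are the principal axes
>>> X_proj = Xc @ Vh.T  # forward projection X' = X V
>>> X_back = X_proj @ Vh  # inverse projection X = X' V^T
>>> torch.allclose(Xc, X_back, atol=1e-6)
True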
Example
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> embeddings_transformed = pca.transform(embeddings)
>>> embeddings_recovered = pca.inverse_transform(embeddings_transformed)
>>> torch.all(torch.isclose(embeddings, embeddings_recovered, atol=1.e-6))
tensor(True)
- __init__(embeddings: Tensor)[source]
- Parameters
- embeddings – Embeddings matrix with the shape of [n_embeddings, embeddings_dim].
- transform(embeddings: Tensor, n_components: Optional[int] = None) Tensor [source]
Apply fitted PCA to transform embeddings.
- Parameters
- embeddings – Matrix of shape [n_embeddings, embeddings_dim].
- n_components – The desired dimension of the output.
- Returns
Transformed embeddings.
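For example, reducing 5-dimensional embeddings to 2 dimensions (assuming n_components keeps the leading principal axes):
>>> embeddings = torch.rand(100, 5)
>>> pca = PCA(embeddings)
>>> reduced = pca.transform(embeddings, n_components=2)
>>> reduced.shape
torch.Size([100, 2])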
- inverse_transform(embeddings: Tensor) Tensor [source]
Apply inverse transform to embeddings.
- Parameters
- embeddings – Matrix of shape [n_embeddings, N], where N <= embeddings_dim is the dimension of the embeddings.
- Returns
Embeddings projected into the original embeddings space.
- calc_principal_axes_number(pcf_variance: Tuple[float, ...]) Tensor [source]
Function estimates the number of principal axes required to explain the fraction of variance given by pcf_variance.
- Parameters
- pcf_variance – Values in the range [0, 1]. For each value, find the number of components such that the fraction of explained variance is at least that value.
- Returns
Tensor with the number of principal axes for each value of pcf_variance.
Let \(X\) be a set of \(d\)-dimensional embeddings. Let \(\lambda_1, \ldots, \lambda_d\in\mathbb{R}\) be the eigenvalues of the covariance matrix of \(X\), sorted in descending order. Then, for a given desired explained variance \(r\), the number of principal components that explains \(r\cdot 100\%\) of the variance is the smallest integer \(n\) such that
\[\frac{\sum\limits_{i = 1}^{n}\lambda_i}{\sum\limits_{i = 1}^{d}\lambda_i} \geq r\]
Example
In the example below there are 4 vectors of length 10, and only the first 4 dimensions have non-zero values. The covariance matrix has only 4 eigenvalues that are greater than 0, i.e. there are only 4 principal axes. So, in order to keep at least 50% of the information from the set, we need to keep 2 principal axes, and in order to keep all the information we need to keep 5 principal axes (one additional axis appears because the cumulative explained variance reaches the 100% threshold only after including an extra axis).
>>> embeddings = torch.eye(4, 10, dtype=torch.float)
>>> pca = PCA(embeddings)
>>> pca.calc_principal_axes_number(pcf_variance=(0.5, 1))
tensor([2, 5])
take_2d
- oml.utils.misc_torch.take_2d(x: Tensor, indices: Tensor) Tensor [source]
- Parameters
- x – Tensor with the shape of [N, M]
- indices – Tensor of integers with the shape of [N, P]. Note, rows in indices may contain duplicated values, which means the same element of x can be taken several times.
- Returns
Tensor of items picked from x, with the shape of [N, P].
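A small illustration of the expected behavior, assuming the picking rule output[i, j] = x[i, indices[i, j]] (the duplicated index in the second row shows that the same element may be taken twice):
>>> x = torch.tensor([[10, 20, 30], [40, 50, 60]])
>>> indices = torch.tensor([[2, 0], [1, 1]])
>>> take_2d(x, indices)
tensor([[30, 10],
        [50, 50]])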
assign_2d
- oml.utils.misc_torch.assign_2d(x: Tensor, indices: Tensor, new_values: Tensor) Tensor [source]
- Parameters
- x – Tensor with the shape of [N, M]
- indices – Tensor of integers with the shape of [N, P], where P <= M
- new_values – Tensor with the shape of [N, P]
- Returns
Tensor with the shape of [N, M], constructed by the following rule: x[i, indices[i, j]] = new_values[i, j]
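A small illustration of the rule above with hypothetical values (elements of x not referenced by indices are assumed to stay unchanged):
>>> x = torch.zeros(2, 3)
>>> indices = torch.tensor([[0, 2], [1, 0]])
>>> new_values = torch.tensor([[1., 2.], [3., 4.]])
>>> assign_2d(x, indices, new_values)
tensor([[1., 0., 2.],
        [4., 3., 0.]])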