Datasets
Check the dataframe format for the datasets below.

ImageBaseDataset
- class oml.datasets.images.ImageBaseDataset(paths: List[Path], dataset_root: Optional[Union[Path, str]] = None, bboxes: Optional[Sequence[Optional[Tuple[int, int, int, int]]]] = None, extra_data: Optional[Dict[str, Any]] = None, transform: Optional[Union[Compose, Compose]] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
Bases: IBaseDataset, IVisualizableDataset
The base class that handles image specific logic.
- __init__(paths: List[Path], dataset_root: Optional[Union[Path, str]] = None, bboxes: Optional[Sequence[Optional[Tuple[int, int, int, int]]]] = None, extra_data: Optional[Dict[str, Any]] = None, transform: Optional[Union[Compose, Compose]] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
- Parameters
paths – Paths to images. Will be concatenated with dataset_root if provided.
dataset_root – Path to the images’ dir; set None if you provided the absolute paths in your dataframe.
bboxes – Bounding boxes of images. Some of the images may not have bounding boxes.
extra_data – Dictionary containing records of some additional information.
transform – Augmentations for the images; set None to perform only normalisation and casting to tensor.
f_imread – Function to read the images; pass None to pick it automatically based on the provided transforms.
cache_size – Size of the dataset’s cache.
input_tensors_key – Key to put tensors into the batches.
index_key – Key to put samples’ ids into the batches.
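For illustration, a minimal sketch of constructing the base dataset from a list of paths; the file names here are hypothetical placeholders, and the default keys come from the signature above:

```python
from pathlib import Path

from oml.datasets.images import ImageBaseDataset

# Hypothetical files: relative paths get concatenated with dataset_root.
paths = [Path("images/cat_0.jpg"), Path("images/cat_1.jpg")]

# transform=None means only normalisation and casting to tensor.
dataset = ImageBaseDataset(paths=paths, dataset_root=Path("/data"))

item = dataset[0]
print(item["input_tensors"].shape)  # default input_tensors_key
print(item["idx"])                  # default index_key
```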
ImageLabeledDataset
- class oml.datasets.images.ImageLabeledDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
Bases: DFLabeledDataset, IVisualizableDataset
The dataset of images having their ground truth labels.
- __init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_labels() ndarray
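A minimal sketch of building a labeled image dataset from the mock dataframe provided by get_mock_images_dataset (documented at the bottom of this page); default keys are assumed:

```python
from oml.datasets.images import ImageLabeledDataset
from oml.utils.download_mock_dataset import get_mock_images_dataset

# The mock dataframes already follow the required format.
df_train, _ = get_mock_images_dataset(global_paths=True)

dataset = ImageLabeledDataset(df=df_train)

item = dataset[0]
print(item["labels"])              # ground truth label, stored under labels_key
print(dataset.get_labels().shape)  # labels of all samples as an ndarray
```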
ImageQueryGalleryLabeledDataset
- class oml.datasets.images.ImageQueryGalleryLabeledDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels')[source]
Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset
The annotated dataset of images having query/gallery split.
Note that some datasets used as benchmarks in Metric Learning explicitly provide the splitting information (for example, the DeepFashion InShop dataset), but some of them don’t (for example, CARS196 or CUB200). The validation idea for the latter is to perform 1 vs rest validation, where every query is evaluated versus the whole validation dataset (except for this exact query). So, if you want an item to participate in validation as both query and gallery, you should mark this item as is_query == True and is_gallery == True, as it’s done in the CARS196 or CUB200 datasets.
- __init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
- get_labels() ndarray
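A sketch of the 1 vs rest setup described above, using the validation part of the mock dataset, where items are expected to be marked as both query and gallery:

```python
from oml.datasets.images import ImageQueryGalleryLabeledDataset
from oml.utils.download_mock_dataset import get_mock_images_dataset

_, df_val = get_mock_images_dataset(global_paths=True)

# The dataframe carries is_query / is_gallery flags; for CARS196-style
# validation both flags are set to True for every item.
dataset = ImageQueryGalleryLabeledDataset(df=df_val)

print(dataset.get_query_ids())    # LongTensor of query indices
print(dataset.get_gallery_ids())  # LongTensor of gallery indices
print(dataset.get_labels())       # ndarray of ground truth labels
```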
ImageQueryGalleryDataset
- class oml.datasets.images.ImageQueryGalleryDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors')[source]
Bases: DFQueryGalleryDataset, IVisualizableDataset
The non-annotated dataset of images having query/gallery split.
- __init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Returns
self.input_tensors_key
self.index_key: int = item
- Return type
Dictionary including the following keys
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
TextBaseDataset
- class oml.datasets.texts.TextBaseDataset(texts: List[str], tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
Bases: IBaseDataset, IVisualizableDataset
The base class that handles text specific logic.
- __init__(texts: List[str], tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
TextLabeledDataset
- class oml.datasets.texts.TextLabeledDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
Bases: DFLabeledDataset, IVisualizableDataset
The dataset of texts having their ground truth labels.
- __init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_labels() ndarray
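A sketch for texts, assuming a HuggingFace tokenizer and a small hand-made dataframe; the text and label column names below follow the dataframe format referenced at the top of this page, so treat them as assumptions and check that page for the exact schema:

```python
import pandas as pd
from transformers import AutoTokenizer

from oml.datasets.texts import TextLabeledDataset

# Assumed minimal dataframe: raw strings plus integer class ids.
df = pd.DataFrame({"text": ["a photo of a cat", "a photo of a dog"], "label": [0, 1]})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataset = TextLabeledDataset(df=df, tokenizer=tokenizer, max_length=128)

item = dataset[0]
print(item["input_tensors"])  # tokenized text under the default input_tensors_key
print(item["labels"])         # label under the default labels_key
```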
TextQueryGalleryLabeledDataset
- class oml.datasets.texts.TextQueryGalleryLabeledDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, labels_key: str = 'labels', input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset
The annotated dataset of texts having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.
- __init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, labels_key: str = 'labels', input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
- get_labels() ndarray
TextQueryGalleryDataset
- class oml.datasets.texts.TextQueryGalleryDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
Bases: DFQueryGalleryDataset, IVisualizableDataset
The non-annotated dataset of texts having query/gallery split.
- __init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Returns
self.input_tensors_key
self.index_key: int = item
- Return type
Dictionary including the following keys
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
AudioBaseDataset
- class oml.datasets.audios.AudioBaseDataset(paths: List[Union[str, Path]], dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, start_times: Optional[List[float]] = None, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Bases: IBaseDataset, IVisualizableDataset, IHTMLVisualizableDataset
The base class that handles audio specific logic.
- __init__(paths: List[Union[str, Path]], dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, start_times: Optional[List[float]] = None, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
- Parameters
paths – List of audio file paths.
dataset_root – Base path for audio files.
extra_data – Extra data to include in dataset items.
input_tensors_key – Key under which audio tensors are stored.
index_key – Key for indexing dataset items.
sample_rate – Sampling rate of audio files.
max_num_seconds – Duration to use from each audio file.
convert_to_mono – Whether to downmix the audio to one channel or keep the original channels.
start_times – List of start time offsets in seconds for each audio.
spec_repr_func – Spectral representation extraction function used for visualization.
- __getitem__(item: int) Dict[str, Union[FloatTensor, int]] [source]
- Parameters
item – Idx of the sample
- Returns
self.input_tensors_key
self.index_key: int = item
- Return type
Dictionary including the following keys
- visualize(item: int, color: Tuple[int, int, int] = (0, 0, 0)) ndarray [source]
Visualize an audio file.
- Parameters
item – Dataset item index.
color – Color of the plot.
- Returns
Array representing the image of the plot.
- visualize_as_html(item: int, title: str, color: Tuple[int, int, int] = (0, 0, 0)) str [source]
Visualize an audio file in HTML markup.
- Parameters
item – Dataset item index.
color – Color of the plot.
title – The title of the HTML block.
- Returns
HTML markup with spectral representation image and audio player.
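A sketch of the base audio dataset; the file names are placeholders, and the defaults (16 kHz sampling, 3-second crops, mono) are taken from the signature above:

```python
from oml.datasets.audios import AudioBaseDataset

# Hypothetical audio files; replace with your own.
paths = ["audios/voice_0.wav", "audios/voice_1.wav"]

dataset = AudioBaseDataset(
    paths=paths,
    sample_rate=16000,     # resample everything to 16 kHz
    max_num_seconds=3.0,   # use at most 3 seconds of each file
    convert_to_mono=True,  # downmix to a single channel
)

item = dataset[0]
print(item["input_tensors"].shape)  # FloatTensor with the waveform

# Render the spectral representation of the first item as an image array.
img = dataset.visualize(item=0, color=(0, 0, 0))
```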
AudioLabeledDataset
- class oml.datasets.audios.AudioLabeledDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Bases: DFLabeledDataset, IVisualizableDataset, IHTMLVisualizableDataset
The dataset of audios having their ground truth labels.
- __init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
- Parameters
df – DataFrame with input data.
dataset_root – Base path for audio files.
extra_data – Extra data to include in dataset items.
input_tensors_key – Key under which audio tensors are stored.
index_key – Key for indexing dataset items.
labels_key – Key under which labels are stored.
sample_rate – Sampling rate of audio files.
max_num_seconds – Duration to use from each audio file.
convert_to_mono – Whether to downmix the audio to one channel or keep the original channels.
spec_repr_func – Spectral representation extraction function used for visualization.
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_labels() ndarray
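Analogously to images, the labeled audio dataset can be built from the mock dataframe of get_mock_audios_dataset (documented below); a sketch with default arguments:

```python
from oml.datasets.audios import AudioLabeledDataset
from oml.utils.download_mock_dataset import get_mock_audios_dataset

df_train, _ = get_mock_audios_dataset(global_paths=True)

dataset = AudioLabeledDataset(df=df_train)

print(dataset.get_labels())  # one label per audio sample
print(dataset[0]["labels"])  # label of the first sample
```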
AudioQueryGalleryDataset
- class oml.datasets.audios.AudioQueryGalleryDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Bases: DFQueryGalleryDataset, IVisualizableDataset, IHTMLVisualizableDataset
The non-annotated dataset of audios having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.
- __init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
- Parameters
df – DataFrame with input data.
dataset_root – Base path for audio files.
extra_data – Extra data to include in dataset items.
input_tensors_key – Key under which audio tensors are stored.
index_key – Key for indexing dataset items.
sample_rate – Sampling rate of audio files.
max_num_seconds – Duration to use from each audio file.
convert_to_mono – Whether to downmix the audio to one channel or keep the original channels.
spec_repr_func – Spectral representation extraction function used for visualization.
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Returns
self.input_tensors_key
self.index_key: int = item
- Return type
Dictionary including the following keys
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
AudioQueryGalleryLabeledDataset
- class oml.datasets.audios.AudioQueryGalleryLabeledDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset, IHTMLVisualizableDataset
The annotated dataset of audios having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.
- __init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
- Parameters
df – DataFrame with input data.
dataset_root – Base path for audio files.
extra_data – Extra data to include in dataset items.
input_tensors_key – Key under which audio tensors are stored.
index_key – Key for indexing dataset items.
labels_key – Key under which labels are stored.
sample_rate – Sampling rate of audio files.
max_num_seconds – Duration to use from each audio file.
convert_to_mono – Whether to downmix the audio to one channel or keep the original channels.
spec_repr_func – Spectral representation extraction function used for visualization.
- __getitem__(item: int) Dict[str, Any]
- Parameters
item – Idx of the sample
- Return type
Dictionary including the following keys
self.input_tensors_key
self.index_key: int = item
self.labels_key
- get_query_ids() LongTensor
- get_gallery_ids() LongTensor
- get_labels() ndarray
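The same validation pattern as for images, sketched for audios on the mock validation dataframe:

```python
from oml.datasets.audios import AudioQueryGalleryLabeledDataset
from oml.utils.download_mock_dataset import get_mock_audios_dataset

_, df_val = get_mock_audios_dataset(global_paths=True)

dataset = AudioQueryGalleryLabeledDataset(df=df_val)

print(dataset.get_query_ids())
print(dataset.get_gallery_ids())
print(dataset.get_labels())
```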
PairDataset
- class oml.datasets.pairs.PairDataset(base_dataset: IBaseDataset, pair_ids: List[Tuple[int, int]], input_tensors_key_1: str = 'input_tensors_1', input_tensors_key_2: str = 'input_tensors_2', index_key: str = 'idx')[source]
Bases: IPairDataset
Dataset to iterate over pairs of items of any modality.
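A sketch of wrapping an existing dataset into pairs, for example to score query/gallery candidates pairwise; the index pairs below are arbitrary:

```python
from oml.datasets.images import ImageQueryGalleryDataset
from oml.datasets.pairs import PairDataset
from oml.utils.download_mock_dataset import get_mock_images_dataset

_, df_val = get_mock_images_dataset(global_paths=True)
base = ImageQueryGalleryDataset(df=df_val)

# Arbitrary (i, j) pairs of indices into the base dataset.
pairs = PairDataset(base_dataset=base, pair_ids=[(0, 1), (0, 2)])

item = pairs[0]
print(item["input_tensors_1"].shape)  # first element of the pair
print(item["input_tensors_2"].shape)  # second element of the pair
```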
get_mock_images_dataset
- oml.utils.download_mock_dataset.get_mock_images_dataset(dataset_root: Union[str, Path] = PosixPath('/home/docs/.cache/oml/mock_dataset'), df_name: str = 'df.csv', check_md5: bool = True, global_paths: bool = False) Tuple[DataFrame, DataFrame] [source]
Function to download the mock images dataset, which is already prepared in the required format.
- Parameters
dataset_root – The directory where the dataset will be saved.
df_name – The name of the CSV file from which the output DataFrames will be generated.
check_md5 – If True, validates the dataset using an MD5 checksum.
global_paths – If True, concatenates the paths in the dataset with the dataset_local_folder.
- Returns
The first DataFrame is for the training stage.
The second DataFrame is for the validation stage.
- Return type
A tuple containing two DataFrames
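Calling the helper is also a quick way to inspect the required dataframe format; a short sketch (the column names in the comment, such as label, path, split, is_query, is_gallery, reflect the standard OML format and should be verified against the dataframe format docs):

```python
from oml.utils.download_mock_dataset import get_mock_images_dataset

df_train, df_val = get_mock_images_dataset(global_paths=True)

# Inspect the required format: expect columns like
# label, path, split, is_query, is_gallery.
print(df_train.head())
print(df_val.head())
```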
get_mock_texts_dataset
get_mock_audios_dataset
- oml.utils.download_mock_dataset.get_mock_audios_dataset(dataset_root: Union[str, Path] = PosixPath('/home/docs/.cache/oml/mock_audio_dataset'), df_name: str = 'df.csv', check_md5: bool = True, global_paths: bool = False) Tuple[DataFrame, DataFrame] [source]
Function to download the mock audios dataset, which is already prepared in the required format.
- Parameters
dataset_root – The directory where the dataset will be saved.
df_name – The name of the CSV file from which the output DataFrames will be generated.
check_md5 – If True, validates the dataset using an MD5 checksum.
global_paths – If True, concatenates the paths in the dataset with the dataset_local_folder.
- Returns
The first DataFrame is for the training stage.
The second DataFrame is for the validation stage.
- Return type
A tuple containing two DataFrames