Datasets

Check the required dataframe format for the datasets below.

ImageBaseDataset

class oml.datasets.images.ImageBaseDataset(paths: List[Path], dataset_root: Optional[Union[Path, str]] = None, bboxes: Optional[Sequence[Optional[Tuple[int, int, int, int]]]] = None, extra_data: Optional[Dict[str, Any]] = None, transform: Optional[Union[Compose, Compose]] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]

Bases: IBaseDataset, IVisualizableDataset

The base class that handles image specific logic.

__init__(paths: List[Path], dataset_root: Optional[Union[Path, str]] = None, bboxes: Optional[Sequence[Optional[Tuple[int, int, int, int]]]] = None, extra_data: Optional[Dict[str, Any]] = None, transform: Optional[Union[Compose, Compose]] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
Parameters
  • paths – Paths to images. Will be concatenated with dataset_root if provided.

  • dataset_root – Path to the images’ directory; set None if the paths you provided are absolute

  • bboxes – Bounding boxes of images. Some of the images may not have bounding boxes.

  • extra_data – Dictionary containing records of some additional information.

  • transform – Augmentations for the images, set None to perform only normalisation and casting to tensor

  • f_imread – Function to read the images, pass None to pick it automatically based on provided transforms

  • cache_size – Size of the dataset’s cache

  • input_tensors_key – Key to put tensors into the batches

  • index_key – Key to put samples’ ids into the batches

__getitem__(item: int) → Dict[str, Union[FloatTensor, int]][source]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

visualize(item: int, color: Tuple[int, int, int] = (0, 0, 0)) → ndarray[source]
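
Example (a minimal sketch, not from the library docs; the image paths are placeholders for any images on disk):

    from pathlib import Path

    from oml.datasets.images import ImageBaseDataset

    dataset = ImageBaseDataset(paths=[Path("images/cat.jpg"), Path("images/dog.jpg")])

    sample = dataset[0]
    print(sample["input_tensors"].shape)  # image tensor, under input_tensors_key
    print(sample["idx"])                  # == 0, under index_key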

ImageLabeledDataset

class oml.datasets.images.ImageLabeledDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]

Bases: DFLabeledDataset, IVisualizableDataset

The dataset of images having their ground truth labels.

__init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
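
Example (a sketch built on the mock dataset described at the bottom of this page):

    from oml.datasets.images import ImageLabeledDataset
    from oml.utils.download_mock_dataset import get_mock_images_dataset

    df_train, _ = get_mock_images_dataset(global_paths=True)

    dataset = ImageLabeledDataset(df_train)
    print(dataset.get_labels())  # ndarray with ground truth labels
    print(dataset[0]["labels"])  # label of the first sample, under labels_key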

ImageQueryGalleryLabeledDataset

class oml.datasets.images.ImageQueryGalleryLabeledDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels')[source]

Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset

The annotated dataset of images having query/gallery split.

Note that some datasets used as benchmarks in Metric Learning explicitly provide the splitting information (for example, the DeepFashion InShop dataset), while others don’t (for example, CARS196 or CUB200). The validation idea for the latter is to perform 1 vs rest validation, where every query is evaluated versus the whole validation dataset (except for this exact query).

So, if you want an item to participate in validation as both query and gallery, mark it with is_query == True and is_gallery == True, as it’s done for the CARS196 and CUB200 datasets.
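
For example, a sketch of such a dataframe (the column names follow OML’s dataframe format referenced at the top of this page; the values are illustrative):

    import pandas as pd

    # Every validation item acts as both query and gallery, so each query
    # is evaluated against the rest of the validation set.
    df_val = pd.DataFrame(
        {
            "label": [0, 0, 1, 1],
            "path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
            "split": ["validation"] * 4,
            "is_query": [True] * 4,
            "is_gallery": [True] * 4,
        }
    )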

__init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_query_ids() → LongTensor
get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
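
Example (a sketch using the mock dataset, whose validation split already carries the is_query / is_gallery flags):

    from oml.datasets.images import ImageQueryGalleryLabeledDataset
    from oml.utils.download_mock_dataset import get_mock_images_dataset

    _, df_val = get_mock_images_dataset(global_paths=True)

    dataset = ImageQueryGalleryLabeledDataset(df_val)
    print(dataset.get_query_ids())  # LongTensor with indices of the queries
    print(dataset.get_labels())     # ndarray with ground truth labels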

ImageQueryGalleryDataset

class oml.datasets.images.ImageQueryGalleryDataset(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors')[source]

Bases: DFQueryGalleryDataset, IVisualizableDataset

The non-annotated dataset of images having query/gallery split.

__init__(df: DataFrame, extra_data: Optional[Dict[str, Any]] = None, dataset_root: Optional[Union[Path, str]] = None, transform: Optional[Compose] = None, f_imread: Optional[Callable[[Union[Path, str, bytes]], Union[Image, ndarray]]] = None, cache_size: Optional[int] = 0, input_tensors_key: str = 'input_tensors')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

get_query_ids() → LongTensor
visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
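
Example (a sketch for unlabeled inference data; the paths and the hand-built dataframe are placeholders assumed to follow the format above):

    import pandas as pd

    from oml.datasets.images import ImageQueryGalleryDataset

    df = pd.DataFrame(
        {
            "path": ["query.jpg", "gallery_1.jpg", "gallery_2.jpg"],
            "is_query": [True, False, False],
            "is_gallery": [False, True, True],
        }
    )

    dataset = ImageQueryGalleryDataset(df)
    print(dataset.get_query_ids())  # indices of the queries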

TextBaseDataset

class oml.datasets.texts.TextBaseDataset(texts: List[str], tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]

Bases: IBaseDataset, IVisualizableDataset

The base class that handles text specific logic.

__init__(texts: List[str], tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Any][source]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
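
Example (a sketch assuming a HuggingFace transformers tokenizer and the bert-base-uncased checkpoint; any tokenizer with a compatible interface should work):

    from transformers import AutoTokenizer

    from oml.datasets.texts import TextBaseDataset

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    dataset = TextBaseDataset(
        texts=["first short text", "second short text"],
        tokenizer=tokenizer,
        max_length=128,
    )
    sample = dataset[0]
    print(sample["idx"])  # == 0, under index_key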

TextLabeledDataset

class oml.datasets.texts.TextLabeledDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]

Bases: DFLabeledDataset, IVisualizableDataset

The dataset of texts having their ground truth labels.

__init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', labels_key: str = 'labels', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
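
Example (a sketch built on the mock texts dataset; the tokenizer choice is an assumption, as above):

    from transformers import AutoTokenizer

    from oml.datasets.texts import TextLabeledDataset
    from oml.utils.download_mock_dataset import get_mock_texts_dataset

    df_train, _ = get_mock_texts_dataset()

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    dataset = TextLabeledDataset(df_train, tokenizer=tokenizer)
    print(dataset.get_labels())  # ndarray with ground truth labels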

TextQueryGalleryLabeledDataset

class oml.datasets.texts.TextQueryGalleryLabeledDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, labels_key: str = 'labels', input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]

Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset

The annotated dataset of texts having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.

__init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, labels_key: str = 'labels', input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_query_ids() → LongTensor
get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]

TextQueryGalleryDataset

class oml.datasets.texts.TextQueryGalleryDataset(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]

Bases: DFQueryGalleryDataset, IVisualizableDataset

The non-annotated dataset of texts having query/gallery split.

__init__(df: DataFrame, tokenizer: Any, max_length: int = 128, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

get_query_ids() → LongTensor
visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]

AudioBaseDataset

class oml.datasets.audios.AudioBaseDataset(paths: List[Union[str, Path]], dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, start_times: Optional[List[float]] = None, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]

Bases: IBaseDataset, IVisualizableDataset, IHTMLVisualizableDataset

The base class that handles audio specific logic.

__init__(paths: List[Union[str, Path]], dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, start_times: Optional[List[float]] = None, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Parameters
  • paths – List of audio file paths.

  • dataset_root – Base path for audio files.

  • extra_data – Extra data to include in dataset items.

  • input_tensors_key – Key under which audio tensors are stored.

  • index_key – Key for indexing dataset items.

  • sample_rate – Sampling rate of audio files.

  • max_num_seconds – Duration to use for each audio file.

  • convert_to_mono – Whether to downmix the audio to a single channel or keep the original channels.

  • start_times – List of start time offsets in seconds for each audio.

  • spec_repr_func – Spectral representation extraction function used for visualization.

__getitem__(item: int) → Dict[str, Union[FloatTensor, int]][source]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

visualize(item: int, color: Tuple[int, int, int] = (0, 0, 0)) → ndarray[source]

Visualize an audio file.

Parameters
  • item – Dataset item index.

  • color – Color of the plot.

Returns

Array representing the image of the plot.

visualize_as_html(item: int, title: str, color: Tuple[int, int, int] = (0, 0, 0)) → str[source]

Visualize an audio file in HTML markup.

Parameters
  • item – Dataset item index.

  • color – Color of the plot.

  • title – The title of html block.

Returns

HTML markup with spectral representation image and audio player.
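
Example (a minimal sketch; the paths are placeholders, and the keyword values shown simply restate the defaults from the signature above):

    from oml.datasets.audios import AudioBaseDataset

    dataset = AudioBaseDataset(
        paths=["audios/first.wav", "audios/second.wav"],  # placeholder files
        sample_rate=16000,        # target sampling rate
        max_num_seconds=3.0,      # crop length per file
        convert_to_mono=True,
    )
    sample = dataset[0]
    print(sample["input_tensors"].shape)  # waveform of at most 3 s * 16000 samples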

AudioLabeledDataset

class oml.datasets.audios.AudioLabeledDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]

Bases: DFLabeledDataset, IVisualizableDataset, IHTMLVisualizableDataset

The dataset of audios having their ground truth labels.

__init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Parameters
  • df – DataFrame with input data.

  • dataset_root – Base path for audio files.

  • extra_data – Extra data to include in dataset items.

  • input_tensors_key – Key under which audio tensors are stored.

  • index_key – Key for indexing dataset items.

  • labels_key – Key under which labels are stored.

  • sample_rate – Sampling rate of audio files.

  • max_num_seconds – Duration to use from each audio file.

  • convert_to_mono – Whether to downmix the audio to a single channel or keep the original channels.

  • spec_repr_func – Spectral representation extraction function used for visualization.

__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
visualize_as_html(item: int, title: str, color: Tuple[int, int, int]) → str[source]
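
Example (a sketch built on the mock audios dataset described at the bottom of this page):

    from oml.datasets.audios import AudioLabeledDataset
    from oml.utils.download_mock_dataset import get_mock_audios_dataset

    df_train, _ = get_mock_audios_dataset(global_paths=True)

    dataset = AudioLabeledDataset(df_train)
    print(dataset.get_labels())  # ndarray with ground truth labels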

AudioQueryGalleryDataset

class oml.datasets.audios.AudioQueryGalleryDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]

Bases: DFQueryGalleryDataset, IVisualizableDataset, IHTMLVisualizableDataset

The non-annotated dataset of audios having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.

__init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Parameters
  • df – DataFrame with input data.

  • dataset_root – Base path for audio files.

  • extra_data – Extra data to include in dataset items.

  • input_tensors_key – Key under which audio tensors are stored.

  • index_key – Key for indexing dataset items.

  • sample_rate – Sampling rate of audio files.

  • max_num_seconds – Duration to use from each audio file.

  • convert_to_mono – Whether to downmix the audio to a single channel or keep the original channels.

  • spec_repr_func – Spectral representation extraction function used for visualization.

__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

get_query_ids() → LongTensor
visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
visualize_as_html(item: int, title: str, color: Tuple[int, int, int]) → str[source]

AudioQueryGalleryLabeledDataset

class oml.datasets.audios.AudioQueryGalleryLabeledDataset(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]

Bases: DFQueryGalleryLabeledDataset, IVisualizableDataset, IHTMLVisualizableDataset

The annotated dataset of audios having query/gallery split. To perform 1 vs rest validation, where a query is evaluated versus the whole validation dataset (except for this exact query), you should mark the item as is_query == True and is_gallery == True.

__init__(df: DataFrame, dataset_root: Optional[Union[Path, str]] = None, extra_data: Optional[Dict[str, Any]] = None, input_tensors_key: str = 'input_tensors', index_key: str = 'idx', labels_key: str = 'labels', sample_rate: int = 16000, max_num_seconds: Optional[float] = 3.0, convert_to_mono: bool = True, spec_repr_func: Callable[[FloatTensor], FloatTensor] = <function default_spec_repr_func>)[source]
Parameters
  • df – DataFrame with input data.

  • dataset_root – Base path for audio files.

  • extra_data – Extra data to include in dataset items.

  • input_tensors_key – Key under which audio tensors are stored.

  • index_key – Key for indexing dataset items.

  • labels_key – Key under which labels are stored.

  • sample_rate – Sampling rate of audio files.

  • max_num_seconds – Duration to use from each audio file.

  • convert_to_mono – Whether to downmix the audio to a single channel or keep the original channels.

  • spec_repr_func – Spectral representation extraction function used for visualization.

__getitem__(item: int) → Dict[str, Any]
Parameters

item – Idx of the sample

Returns

Dictionary including the following keys:

  • self.input_tensors_key

  • self.index_key: int = item

  • self.labels_key

get_query_ids() → LongTensor
get_labels() → ndarray
get_label2category() → Optional[Dict[int, Union[str, int]]]
Returns

Mapping from label to category if known.

visualize(item: int, color: Tuple[int, int, int]) → ndarray[source]
visualize_as_html(item: int, title: str, color: Tuple[int, int, int]) → str[source]

PairDataset

class oml.datasets.pairs.PairDataset(base_dataset: IBaseDataset, pair_ids: List[Tuple[int, int]], input_tensors_key_1: str = 'input_tensors_1', input_tensors_key_2: str = 'input_tensors_2', index_key: str = 'idx')[source]

Bases: IPairDataset

Dataset to iterate over pairs of items of any modality.

__init__(base_dataset: IBaseDataset, pair_ids: List[Tuple[int, int]], input_tensors_key_1: str = 'input_tensors_1', input_tensors_key_2: str = 'input_tensors_2', index_key: str = 'idx')[source]
__getitem__(item: int) → Dict[str, Union[Tensor, int]][source]
Parameters

item – Idx of the sample

Returns

Dictionary with the following keys:

  • self.input_tensors_key_1

  • self.input_tensors_key_2

  • self.index_key
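
Example (a sketch of wrapping an existing dataset into pairs, e.g. to score (query, gallery) candidates with a pairwise model; the image paths are placeholders):

    from pathlib import Path

    from oml.datasets.images import ImageBaseDataset
    from oml.datasets.pairs import PairDataset

    base = ImageBaseDataset(paths=[Path("a.jpg"), Path("b.jpg"), Path("c.jpg")])
    pairs = PairDataset(base_dataset=base, pair_ids=[(0, 1), (0, 2)])

    batch = pairs[0]  # keys: input_tensors_1, input_tensors_2, idx
    print(batch["idx"])  # == 0, the index of the pair itself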

get_mock_images_dataset

oml.utils.download_mock_dataset.get_mock_images_dataset(dataset_root: Union[str, Path] = PosixPath('/home/docs/.cache/oml/mock_dataset'), df_name: str = 'df.csv', check_md5: bool = True, global_paths: bool = False) → Tuple[DataFrame, DataFrame][source]

Function to download a mock images dataset that is already prepared in the required format.

Parameters
  • dataset_root – The directory where the dataset will be saved.

  • df_name – The name of the CSV file from which the output DataFrames will be generated.

  • check_md5 – If True, validates the dataset using an MD5 checksum.

  • global_paths – If True, concatenates the paths in the dataset with the dataset_local_folder.

Returns

  • The first DataFrame is for the training stage.

  • The second DataFrame is for the validation stage.

Return type

A tuple containing two DataFrames
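
Example (a sketch of downloading the mock data and inspecting the dataframe format used across this page):

    from oml.utils.download_mock_dataset import get_mock_images_dataset

    df_train, df_val = get_mock_images_dataset(global_paths=True)
    print(df_train.columns.tolist())  # the required dataframe columns
    print(len(df_train), len(df_val))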

get_mock_texts_dataset

oml.utils.download_mock_dataset.get_mock_texts_dataset() → Tuple[DataFrame, DataFrame][source]

Function to get a mock texts dataset, useful for prototyping pipelines and understanding the dataset structure.

get_mock_audios_dataset

oml.utils.download_mock_dataset.get_mock_audios_dataset(dataset_root: Union[str, Path] = PosixPath('/home/docs/.cache/oml/mock_audio_dataset'), df_name: str = 'df.csv', check_md5: bool = True, global_paths: bool = False) → Tuple[DataFrame, DataFrame][source]

Function to download a mock audios dataset that is already prepared in the required format.

Parameters
  • dataset_root – The directory where the dataset will be saved.

  • df_name – The name of the CSV file from which the output DataFrames will be generated.

  • check_md5 – If True, validates the dataset using an MD5 checksum.

  • global_paths – If True, concatenates the paths in the dataset with the dataset_local_folder.

Returns

  • The first DataFrame is for the training stage.

  • The second DataFrame is for the validation stage.

Return type

A tuple containing two DataFrames