Dataset format
To reuse as much of OML as possible, prepare a .csv file in the required format. This is not obligatory, especially if you implement your own Dataset, but the format is required if you use the built-in datasets or Pipelines. You can check out the mock images, texts, or audios datasets.
Required columns:
label - an integer that indicates the label of the item.
split - must be one of 2 values: train or validation.
is_query, is_gallery - have to be None where split == train, and True (or 1) or False (or 0) where split == validation. Note that both values can be True at the same time. In that case we validate every item in the validation set using the “1 vs rest” approach (datasets of this kind are SOP, CARS196 or CUB).
[Images, Audios] path - path to an image/audio file. It may be a global or a relative path (in the latter case you need to pass dataset_root to the built-in Datasets).
[Texts] text - text describing an item.
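To illustrate the required columns, here is a minimal sketch that builds the contents of a tiny image dataset file with Python's standard csv module. The labels and paths are invented, rows_to_csv is a hypothetical helper rather than part of OML, and the sketch assumes that an empty cell is how None is stored in a .csv file.

```python
import csv
import io

# Hypothetical rows for an image dataset; labels and paths are invented.
rows = [
    # Train items: is_query and is_gallery stay empty (None).
    {"label": 0, "path": "images/cat_1.jpg", "split": "train", "is_query": None, "is_gallery": None},
    {"label": 1, "path": "images/dog_1.jpg", "split": "train", "is_query": None, "is_gallery": None},
    # Validation items: both flags are True, i.e. the "1 vs rest" setup.
    {"label": 2, "path": "images/fox_1.jpg", "split": "validation", "is_query": True, "is_gallery": True},
    {"label": 2, "path": "images/fox_2.jpg", "split": "validation", "is_query": True, "is_gallery": True},
]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with the required columns as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["label", "path", "split", "is_query", "is_gallery"]
    )
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

For a text dataset the path column would be replaced by a text column. Saving csv_text to a file gives you something you can load as a dataframe and pass to the format check.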
Optional columns:
category - a category that groups sets of similar labels (like dresses or furniture).
sequence - ids of sequences of photos, which may be useful in Re-id tasks. Must be strings or integers. Take a look at the detailed example.
[Images] x_1, x_2, y_1, y_2 - integers describing a bounding box, in the format left, right, top, bot (x_1 and y_1 must be less than x_2 and y_2, respectively). If only some of your images have bounding boxes, leave these columns empty in the rows without boxes.
[Audios] start_time - a float representing the time offset from which the audio should start being read.
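The bounding box rule above can be sketched as a small check. check_bbox is a hypothetical helper, not an OML function, and rejecting a partially filled box is an assumption of this sketch rather than documented behaviour.

```python
def check_bbox(x_1, x_2, y_1, y_2):
    """Return True if the row has no box at all or a well-formed box."""
    values = (x_1, x_2, y_1, y_2)
    # A row without a bounding box leaves all four cells empty.
    if all(v is None for v in values):
        return True
    # Assumption of this sketch: a partially filled box is an error.
    if any(v is None for v in values):
        return False
    # left must be less than right, and top must be less than bottom.
    return x_1 < x_2 and y_1 < y_2


print(check_bbox(10, 50, 20, 80))          # well-formed box
print(check_bbox(None, None, None, None))  # row without a box
print(check_bbox(50, 10, 20, 80))          # left is not less than right
```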
Check out the examples of dataframes. You can also use a helper to check whether your dataset is in the right format:
from oml.utils import (
    get_mock_audios_dataset,
    get_mock_images_dataset,
    get_mock_texts_dataset,
)
from oml.utils.dataframe_format import check_retrieval_dataframe_format

# IMAGES
df_train, df_val = get_mock_images_dataset(global_paths=True)
check_retrieval_dataframe_format(df=df_train)
check_retrieval_dataframe_format(df=df_val)

# TEXTS
df_train, df_val = get_mock_texts_dataset()
check_retrieval_dataframe_format(df=df_train)
check_retrieval_dataframe_format(df=df_val)

# AUDIO
df_train, df_val = get_mock_audios_dataset(global_paths=True)
check_retrieval_dataframe_format(df=df_train)
check_retrieval_dataframe_format(df=df_val)