Dataset format
To reuse as much of OML as possible, you need to prepare a `.csv` file in the required format. This is not obligatory if you implement your own Dataset, but the format is required when using the built-in datasets or Pipelines. You can check out the tiny dataset as an example.
Required columns:
- `label` - an integer indicating the label of the item.
- `path` - the path to the image. It may be an absolute or a relative path (in the latter case you need to pass `dataset_root` to the built-in Datasets).
- `split` - must be one of 2 values: `train` or `validation`.
- `is_query`, `is_gallery` - have to be `None` where `split == train`, and `True` (or `1`) or `False` (or `0`) where `split == validation`. Note that both values can be `True` at the same time; in that case we validate every item in the validation set using the "1 vs rest" approach (datasets of this kind are `SOP`, `CARS196` or `CUB`).
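The required columns described above can be sketched as a small dataframe. This is an illustrative example with hypothetical labels and image paths, not an official sample from the library:

```python
import pandas as pd

# Train rows leave is_query / is_gallery as None; validation rows set them
# to True/False (both may be True at once for "1 vs rest" validation).
df = pd.DataFrame(
    {
        "label": [0, 0, 1, 1],
        "path": ["images/a.jpg", "images/b.jpg", "images/c.jpg", "images/d.jpg"],
        "split": ["train", "train", "validation", "validation"],
        "is_query": [None, None, True, True],
        "is_gallery": [None, None, True, True],
    }
)
df.to_csv("dataset.csv", index=False)
```

Relative paths like `images/a.jpg` work only if you later pass the matching `dataset_root`; otherwise use absolute paths.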
Optional columns:
- `category` - a category which groups sets of similar labels (like `dresses` or `furniture`).
- `x_1`, `x_2`, `y_1`, `y_2` - integers in the format `left`, `right`, `top`, `bot` (`x_1` and `y_1` must be less than `x_2` and `y_2`, respectively). If only part of your images have bounding boxes, just leave the corresponding cells empty.
- `sequence` - ids of sequences of photos, which may be useful to handle in Re-id tasks. Must be strings or integers. Take a look at the detailed example.
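To illustrate partially filled bounding boxes, here is a hypothetical two-row dataframe where only the first image has a box (the values and paths are made up for the example):

```python
import pandas as pd

# Only the first image has a bounding box; the second row's
# x_1 / x_2 / y_1 / y_2 cells stay empty (NaN once read back from CSV).
df = pd.DataFrame(
    {
        "label": [0, 1],
        "path": ["images/a.jpg", "images/b.jpg"],
        "split": ["validation", "validation"],
        "is_query": [True, True],
        "is_gallery": [True, True],
        "x_1": [10, None], "x_2": [90, None],  # left < right
        "y_1": [20, None], "y_2": [80, None],  # top < bottom
    }
)

# Sanity check mirroring the rule: x_1 < x_2 and y_1 < y_2 where boxes exist.
boxes = df.dropna(subset=["x_1", "x_2", "y_1", "y_2"])
assert ((boxes["x_1"] < boxes["x_2"]) & (boxes["y_1"] < boxes["y_2"])).all()
```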
Check out the examples of dataframes. You can also use a helper to check whether your dataset is in the right format:

```python
import pandas as pd
from oml.utils.dataframe_format import check_retrieval_dataframe_format

check_retrieval_dataframe_format(df=pd.read_csv("/path/to/table.csv"), dataset_root="/path/to/dataset/root/")
```
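As a rough idea of what such a check does, here is a simplified stand-in that enforces the rules described above. This is not OML's implementation, just a sketch (the function name `check_format` is made up):

```python
import pandas as pd

REQUIRED = ["label", "path", "split", "is_query", "is_gallery"]

def check_format(df: pd.DataFrame) -> None:
    # All required columns must be present.
    missing = [c for c in REQUIRED if c not in df.columns]
    assert not missing, f"Missing columns: {missing}"
    # split is restricted to exactly two values.
    assert set(df["split"]).issubset({"train", "validation"}), "bad split value"
    # Train rows must leave the query/gallery flags empty...
    train = df[df["split"] == "train"]
    assert train["is_query"].isna().all() and train["is_gallery"].isna().all()
    # ...while validation rows must set them.
    val = df[df["split"] == "validation"]
    assert val["is_query"].notna().all() and val["is_gallery"].notna().all()

df = pd.DataFrame(
    {
        "label": [0, 1],
        "path": ["a.jpg", "b.jpg"],
        "split": ["train", "validation"],
        "is_query": [None, True],
        "is_gallery": [None, True],
    }
)
check_format(df)  # passes silently on a well-formed dataframe
```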