util.data package¶
Submodules¶
util.data.dataset_analytics module¶
This script performs some analysis on the dataset provided. In particular, it computes the std and mean (to be used to center your dataset).
Structure of the dataset expected:
Split folders¶
‘args.dataset_folder’ has to point to the parent of the train folder. Example:
~/../../data/svhn
where the dataset_folder contains the train sub-folder as follows:
args.dataset_folder/train
Classes folders¶
The train split should keep the different classes in separate folders named after the class. The file names can be arbitrary (e.g. they do not have to be 0-* for class 0 of MNIST). Example:
train/dog/whatever.png train/dog/you.png train/dog/like.png
train/cat/123.png train/cat/nsdf3.png train/cat/asd932_.png
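For instance, a minimal sketch of enumerating such a layout in Python (the helper name is hypothetical, not part of util.data):

import os
from glob import glob

def list_train_files(dataset_folder):
    # Collect all image paths under dataset_folder/train/<class>/*
    return glob(os.path.join(dataset_folder, 'train', '*', '*'))

# e.g. list_train_files(os.path.expanduser('~/data/svhn'))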
util.data.dataset_analytics.cms_inmem(file_names)[source]¶
Computes mean and standard deviation in an offline fashion. This is possible only when the dataset can be allocated in memory.
- Parameters
file_names (List of String) – List of file names of the dataset
- Returns
mean (double)
std (double)
util.data.dataset_analytics.cms_online(file_names, workers)[source]¶
Computes mean and standard deviation in an online fashion. This is useful when the dataset is too big to be allocated in memory.
- Parameters
file_names (List of String) – List of file names of the dataset
workers (int) – Number of workers to use for the mean/std computation
- Returns
mean (double)
std (double)
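For intuition, here is a minimal single-pass sketch of the online idea (a Chan/Welford-style merge of per-image statistics). This is illustrative only, not the actual DeepDIVA implementation, and it assumes grayscale images scaled to [0, 1]:

import numpy as np
from PIL import Image

def online_mean_std(file_names):
    # Running count, mean, and sum of squared deviations (M2)
    count, mean, m2 = 0, 0.0, 0.0
    for fn in file_names:
        x = np.asarray(Image.open(fn), dtype=np.float64).ravel() / 255.0
        n, m, s = x.size, x.mean(), x.var() * x.size  # per-image stats; s is M2
        delta = m - mean
        total = count + n
        mean += delta * n / total
        m2 += s + delta ** 2 * count * n / total      # Chan et al. merge rule
        count = total
    return mean, np.sqrt(m2 / count)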
util.data.dataset_analytics.compute_mean_std(dataset_folder, inmem, workers)[source]¶
Computes mean and std of a dataset. Saves the results as a CSV file in the dataset folder.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
inmem (Boolean) – Specifies whether it should be computed in an online or offline fashion.
workers (int) – Number of workers to use for the mean/std computation
- Returns
- Return type
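A hypothetical call with a placeholder path:

from util.data.dataset_analytics import compute_mean_std

# Writes the resulting mean/std as a CSV file inside the dataset folder.
compute_mean_std(dataset_folder='/path/to/dataset', inmem=False, workers=4)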
util.data.dataset_analytics.compute_mean_std_segmentation(dataset_folder, inmem, workers, filter_boundaries)[source]¶
Computes mean and std of a dataset for semantic segmentation. Saves the results as a CSV file in the dataset folder.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
inmem (Boolean) – Specifies whether it should be computed in an online or offline fashion.
workers (int) – Number of workers to use for the mean/std computation
filter_boundaries (bool) – Specifies whether the boundary pixels should be removed or not
- Returns
- Return type
util.data.dataset_bidimensional module¶
This script allows for the creation of a bidimensional (2D) dataset.
util.data.dataset_bidimensional.circle(size)[source]¶
Samples are generated on a grid (np.linspace) and labeled by the circle x*x + y*y > 0.5. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
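As an illustration of the grid-plus-threshold idea, a hedged sketch (the [0, 1] grid range is an assumption and the split into train/val/test is omitted; only the labeling condition comes from the docstring):

import numpy as np

def circle_sketch(size):
    n = int(np.sqrt(size))
    xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    x, y = xs.ravel(), ys.ravel()
    label = (x * x + y * y > 0.5).astype(float)  # condition from the docstring
    return np.column_stack((x, y, label))        # rows are (x, y, label)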
util.data.dataset_bidimensional.diagonal(size)[source]¶
Generates a dataset where points on a diagonal line are one class, and points surrounding it are a different class.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.donut(size)[source]¶
Samples are generated on a grid (np.linspace) and labeled by a donut shape. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.flag(size)[source]¶
A multi-class variant of the XOR problem: each corner is a different class. 4 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.spiral(size)[source]¶
Samples are generated along two spirals starting from the center. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
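A hedged sketch of two-spiral generation; the angle range and radius growth are assumptions, not the library's parameters:

import numpy as np

def spiral_sketch(size):
    n = size // 2
    t = np.linspace(0.0, 3 * np.pi, n)       # angle grows from the center
    r = t / (3 * np.pi)                      # radius grows with the angle
    x0, y0 = r * np.cos(t), r * np.sin(t)    # class 0
    x1, y1 = -x0, -y0                        # class 1: same spiral rotated by pi
    return np.column_stack((np.concatenate((x0, x1)),
                            np.concatenate((y0, y1)),
                            np.repeat([0.0, 1.0], n)))  # rows are (x, y, label)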
util.data.dataset_bidimensional.spiral_multi(size)[source]¶
Samples are generated along two spirals starting from the center. 4 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.stripes(size)[source]¶
Samples are generated in vertical stripes, like a TV color test screen. Each stripe is a different class. 5 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_integrity module¶
This script generates the integrity footprint on the dataset provided. Such a footprint can be used to verify that the data has not been modified, altered or manipulated. The integrity of the dataset can be verified in two ways: quick and deep. The former is very fast and uses high-level checks such as recently modified files and file counts. The latter re-computes the footprint and verifies that it matches the existing one. This is slow and should be used only when the integrity of the dataset is a critical matter.
Structure of the dataset expected can be found at: https://diva-dia.github.io/DeepDIVAweb/articles/prepare-dataset/
util.data.dataset_integrity.dict_compare(d1, d2)[source]¶
- Parameters
d1 (Dictionary)
d2 (Dictionary) – Dictionaries to compare
- Returns
Sets with the elements which have been respectively added, removed, modified, or stayed the same
- Return type
added, removed, modified, same
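A common set-based recipe for this kind of comparison, shown as a hedged sketch (the direction of ‘added’ vs ‘removed’ is an assumption):

def dict_compare_sketch(d1, d2):
    d1_keys, d2_keys = set(d1), set(d2)
    shared = d1_keys & d2_keys
    added = d1_keys - d2_keys                         # keys only in d1
    removed = d2_keys - d1_keys                       # keys only in d2
    modified = {k for k in shared if d1[k] != d2[k]}  # same key, new value
    same = {k for k in shared if d1[k] == d2[k]}
    return added, removed, modified, same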
util.data.dataset_integrity.generate_integrity_footprint(dataset_folder)[source]¶
This function generates the integrity footprint on the dataset provided. Such a footprint can be used to verify that the data has not been modified, altered or manipulated.
The footprint file will contain the following information in a JSON format:
{
    path : <string>            // Path to this folder, where the last step is the name of the folder
    last_modified : <date>     // This corresponds to the most recent ‘last modified’ in the dataset
    files : [                  // For each file
        {
            file_name : <string>   // The filename as string
            file_hash : <hash>     // This is the hash of the content
        }
    ]
    folders : [                // For each folder, recursion
        // Recursion, but NO last_modified (not needed anymore)
    ]
}
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
- Return type
A dictionary of the format explained in generate_integrity_footprint() above.
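A hedged sketch of how such a footprint can be built with hashlib; the hash function, the traversal order, and the omission of the top-level last_modified field are simplifications:

import hashlib
import os

def footprint_sketch(folder):
    entry = {'path': folder, 'files': [], 'folders': []}
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full):
            with open(full, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            entry['files'].append({'file_name': name, 'file_hash': digest})
        else:
            entry['folders'].append(footprint_sketch(full))  # recursion
    return entry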
util.data.dataset_integrity.get_last_modified(dataset_folder)[source]¶
Determines the most recent ‘last_modified’ timestamp by scanning all files in the root folder and its sub-folders.
This routine excludes the ‘footprint.json’ file which, if taken into account, would prevent the verification process from succeeding (as writing it modifies the last-modified time of the root itself).
- Parameters
dataset_folder (String (path)) – Path to the dataset folder
- Returns
last_modified – A string representing the last modified of the entire folder
- Return type
String
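A hedged sketch of the scan described above; note that the real routine returns a formatted string rather than a raw timestamp:

import os

def last_modified_sketch(dataset_folder):
    latest = 0.0
    for root, _, files in os.walk(dataset_folder):
        for name in files:
            if name == 'footprint.json':  # excluded, see above
                continue
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest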
util.data.dataset_integrity.save_footprint(dataset_folder, filename, data)[source]¶
Saves the footprint to the file system.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
filename (String) – Name of the file where the data will be saved
data (dictionary) – The actual data in JSON compliant format
- Returns
- Return type
util.data.dataset_integrity.verify_integrity_deep(dataset_folder)[source]¶
This function re-computes the footprint and verifies that it matches the existing one. This is slow and should be used only when the integrity of the dataset is a critical matter.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
Whether the dataset footprint still matches the data
- Return type
Boolean
util.data.dataset_integrity.verify_integrity_quick(dataset_folder)[source]¶
This function verifies that the ‘last_modified’ field still corresponds to the one contained in the footprint. This check is very fast, but it comes at a price: the OS updates this timestamp when files are added to or removed from the folder, but NOT when a file is modified. Because of this, it is not 100% safe and especially does NOT protect you against malicious attacks! For a reliable check of whether the data is unchanged, rely on the slower verify_integrity_deep() function.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
Whether the ‘last_modified’ field is still up to date
- Return type
Boolean
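A hypothetical usage pattern combining the two checks, with a placeholder path:

from util.data.dataset_integrity import verify_integrity_deep, verify_integrity_quick

if not verify_integrity_quick('/path/to/dataset'):
    # Fall back to the slow but trustworthy check.
    assert verify_integrity_deep('/path/to/dataset'), 'dataset was modified!'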
util.data.dataset_splitter module¶
This script allows for creation of a validation set from the training set.
util.data.dataset_splitter.split_dataset(dataset_folder, split, symbolic, debug=False)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
debug (bool) – Prints additional debug statements
- Returns
- Return type
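A hypothetical call with a placeholder path; split=0.2 would move 20% of the training set into the validation set:

from util.data.dataset_splitter import split_dataset

split_dataset(dataset_folder='/path/to/dataset', split=0.2, symbolic=True)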
util.data.dataset_splitter.split_dataset_segmentation(dataset_folder, split, symbolic, test=False)[source]¶
Partition a dataset into train/val(/test) splits on the filesystem for segmentation datasets organized as dataset/data for the images and dataset/gt for the ground truth. Corresponding images must have the same name.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
test (bool) – If true, the validation set is split again (1:1) into a val and test set. Default false.
- Returns
- Return type
util.data.dataset_splitter.split_dataset_writerIdentification(dataset_folder, split)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
- Returns
- Return type
util.data.get_a_dataset module¶
util.data.get_a_dataset.cifar10(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the CIFAR-10 dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
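A hypothetical usage; whether args is a plain dict or an argparse Namespace depends on the caller, so a Namespace is assumed here:

from argparse import Namespace
from util.data.get_a_dataset import cifar10

# output_folder: the path where the dataset will be downloaded and prepared
cifar10(Namespace(output_folder='/path/to/output'))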
util.data.get_a_dataset.diva_hisdb(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the DIVA HisDB-all dataset for semantic segmentation to the location specified on the file system
See also: https://diuf.unifr.ch/main/hisdoc/diva-hisdb
Output folder structure:
    ../HisDB/CB55/train
    ../HisDB/CB55/val
    ../HisDB/CB55/test
    ../HisDB/CB55/test/data -> images
    ../HisDB/CB55/test/gt -> pixel-wise annotated ground truth
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.fashion_mnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the Fashion-MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.glas(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the tubule dataset (from the GlaS challenge) for semantic segmentation to the location specified on the file system
See also: https://github.com/choosehappy/public/tree/master/DL%20tutorial%20Code/3-tubule
Output folder structure:
    ../HisDB/GlaS/train
    ../HisDB/GlaS/val
    ../HisDB/GlaS/test
    ../HisDB/GlaS/test/data -> images
    ../HisDB/GlaS/test/gt -> pixel-wise annotated ground truth
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.kmnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the K-MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.miml(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the Multi-Instance Multi-Label Image Dataset on the file system. Dataset available at: http://lamda.nju.edu.cn/data_MIMLimage.ashx
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.mnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.svhn(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the SVHN dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.remove_whitespace module¶
util.data.shuffle_labels module¶
This script creates a symlink directory with all labels shuffled.
util.data.shuffle_labels.split_dataset(dataset_folder, output_folder, symbolic)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
output_folder (str) – Path to the output folder (see datasets.image_folder_dataset.load_dataset for details).
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
- Returns
- Return type