util.data package¶
Submodules¶
util.data.dataset_analytics module¶
This script performs some analysis on the dataset provided. In particular, it computes the std and mean (to be used to center your dataset).
Structure of the dataset expected:
Split folders¶
‘args.dataset_folder’ has to point to the parent of the train folder. Example:
~/../../data/svhn
where the dataset_folder contains the train sub-folder as follows:
args.dataset_folder/train
Classes folders¶
The train split should keep the different classes in separate folders named after the class. The file names can be arbitrary (e.g. they do not have to be 0-* for class 0 of MNIST). Example:
train/dog/whatever.png train/dog/you.png train/dog/like.png
train/cat/123.png train/cat/nsdf3.png train/cat/asd932_.png
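For instance, a minimal sketch of enumerating such a layout in Python (the helper name is hypothetical, not part of util.data):

import os
from glob import glob

def list_train_files(dataset_folder):
    # Collect all image paths under dataset_folder/train/<class>/*
    return glob(os.path.join(dataset_folder, 'train', '*', '*'))

# e.g. list_train_files(os.path.expanduser('~/data/svhn'))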
util.data.dataset_analytics.cms_inmem(file_names)[source]¶
Computes mean and standard deviation in an offline fashion. This is possible only when the dataset can be allocated in memory.
- Parameters
file_names (List of String) – List of file names of the dataset
- Returns
mean (double)
std (double)
util.data.dataset_analytics.cms_online(file_names, workers)[source]¶
Computes mean and standard deviation in an online fashion. This is useful when the dataset is too big to be allocated in memory.
- Parameters
file_names (List of String) – List of file names of the dataset
workers (int) – Number of workers to use for the mean/std computation
- Returns
mean (double)
std (double)
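For intuition, here is a minimal single-pass sketch of the online idea (a Chan/Welford-style merge of per-image statistics). This is illustrative only, not the actual DeepDIVA implementation, and it assumes grayscale images scaled to [0, 1]:

import numpy as np
from PIL import Image

def online_mean_std(file_names):
    # Running count, mean, and sum of squared deviations (M2)
    count, mean, m2 = 0, 0.0, 0.0
    for fn in file_names:
        x = np.asarray(Image.open(fn), dtype=np.float64).ravel() / 255.0
        n, m, s = x.size, x.mean(), x.var() * x.size  # per-image stats; s is M2
        delta = m - mean
        total = count + n
        mean += delta * n / total
        m2 += s + delta ** 2 * count * n / total      # Chan et al. merge rule
        count = total
    return mean, np.sqrt(m2 / count)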
util.data.dataset_analytics.compute_mean_std(dataset_folder, inmem, workers)[source]¶
Computes mean and std of a dataset. Saves the results as a CSV file in the dataset folder.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
inmem (Boolean) – Specifies whether it should be computed in an online or offline fashion.
workers (int) – Number of workers to use for the mean/std computation
- Returns
- Return type
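A hypothetical call with a placeholder path:

from util.data.dataset_analytics import compute_mean_std

# Writes the resulting mean/std as a CSV file inside the dataset folder.
compute_mean_std(dataset_folder='/path/to/dataset', inmem=False, workers=4)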
util.data.dataset_analytics.compute_mean_std_segmentation(dataset_folder, inmem, workers, filter_boundaries)[source]¶
Computes mean and std of a dataset for semantic segmentation. Saves the results as a CSV file in the dataset folder.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
inmem (Boolean) – Specifies whether it should be computed in an online or offline fashion.
workers (int) – Number of workers to use for the mean/std computation
filter_boundaries (bool) – Specifies whether the boundary pixels should be removed or not
- Returns
- Return type
util.data.dataset_bidimensional module¶
This script allows for the creation of a bidimensional (2D) dataset.
util.data.dataset_bidimensional.circle(size)[source]¶
Samples are generated on a grid (np.linspace) and labeled by the circle x*x + y*y > 0.5. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
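As an illustration of the grid-plus-threshold idea, a hedged sketch (the [0, 1] grid range is an assumption and the split into train/val/test is omitted; only the labeling condition comes from the docstring):

import numpy as np

def circle_sketch(size):
    n = int(np.sqrt(size))
    xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    x, y = xs.ravel(), ys.ravel()
    label = (x * x + y * y > 0.5).astype(float)  # condition from the docstring
    return np.column_stack((x, y, label))        # rows are (x, y, label)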
util.data.dataset_bidimensional.diagonal(size)[source]¶
Generates a dataset where points on a diagonal line are one class, and points surrounding it are a different class.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.donut(size)[source]¶
Samples are generated on a grid (np.linspace) and labeled by a donut shape. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.flag(size)[source]¶
A multi-class variant of the XOR problem: each corner is a different class. 4 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.spiral(size)[source]¶
Samples are generated along two spirals starting from the center. 2 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
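A hedged sketch of two-spiral generation; the angle range and radius growth are assumptions, not the library's parameters:

import numpy as np

def spiral_sketch(size):
    n = size // 2
    t = np.linspace(0.0, 3 * np.pi, n)       # angle grows from the center
    r = t / (3 * np.pi)                      # radius grows with the angle
    x0, y0 = r * np.cos(t), r * np.sin(t)    # class 0
    x1, y1 = -x0, -y0                        # class 1: same spiral rotated by pi
    return np.column_stack((np.concatenate((x0, x1)),
                            np.concatenate((y0, y1)),
                            np.repeat([0.0, 1.0], n)))  # rows are (x, y, label)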
util.data.dataset_bidimensional.spiral_multi(size)[source]¶
Samples are generated along two spirals starting from the center. 4 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_bidimensional.stripes(size)[source]¶
Samples are generated in vertical stripes, like a TV color test screen. Each stripe is a different class. 5 classes.
- Parameters
size (int) – The total number of points in the dataset.
- Returns
train, val, test – The three splits. Each row is (x,y,label)
- Return type
ndarray[float] of size (n,3)
util.data.dataset_integrity module¶
This script generates the integrity footprint on the dataset provided. Such a footprint can be used to verify that the data has not been modified, altered or manipulated. The integrity of the dataset can be verified in two ways: quick and deep. The former is very fast and uses high-level checks such as recently modified files and file counts. The latter re-computes the footprint and verifies that it matches the existing one. This is slow and should be used only when the integrity of the dataset is a critical matter.
Structure of the dataset expected can be found at: https://diva-dia.github.io/DeepDIVAweb/articles/prepare-dataset/
util.data.dataset_integrity.dict_compare(d1, d2)[source]¶
- Parameters
d1 (Dictionary)
d2 (Dictionary) – Dictionaries to compare
- Returns
Sets with the elements which have been respectively added, removed, modified, or stayed the same
- Return type
added, removed, modified, same
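A common set-based recipe for this kind of comparison, shown as a hedged sketch (the direction of ‘added’ vs ‘removed’ is an assumption):

def dict_compare_sketch(d1, d2):
    d1_keys, d2_keys = set(d1), set(d2)
    shared = d1_keys & d2_keys
    added = d1_keys - d2_keys                         # keys only in d1
    removed = d2_keys - d1_keys                       # keys only in d2
    modified = {k for k in shared if d1[k] != d2[k]}  # same key, new value
    same = {k for k in shared if d1[k] == d2[k]}
    return added, removed, modified, same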
util.data.dataset_integrity.generate_integrity_footprint(dataset_folder)[source]¶
This function generates the integrity footprint on the dataset provided. Such a footprint can be used to verify that the data has not been modified, altered or manipulated.
The footprint file will contain the following information in a JSON format:
{
    path : <string>            // Path to this folder, where the last step is the name of the folder
    last_modified : <date>     // This corresponds to the most recent ‘last modified’ in the dataset
    files : [                  // For each file
        {
            file_name : <string>   // The filename as string
            file_hash : <hash>     // This is the hash of the content
        }
    ]
    folders : [                // For each folder, recursion
        // Recursion, but NO last_modified (not needed anymore)
    ]
}
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
- Return type
A dictionary of the format explained in generate_integrity_footprint() above.
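A hedged sketch of how such a footprint can be built with hashlib; the hash function, the traversal order, and the omission of the top-level last_modified field are simplifications:

import hashlib
import os

def footprint_sketch(folder):
    entry = {'path': folder, 'files': [], 'folders': []}
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full):
            with open(full, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            entry['files'].append({'file_name': name, 'file_hash': digest})
        else:
            entry['folders'].append(footprint_sketch(full))  # recursion
    return entry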
util.data.dataset_integrity.get_last_modified(dataset_folder)[source]¶
Determines the most recent ‘last_modified’ timestamp by scanning all files in the root folder and its sub-folders.
This routine excludes the ‘footprint.json’ file which, if taken into account, would prevent the verification process from succeeding (as writing it modifies the last-modified time of the root itself).
- Parameters
dataset_folder (String (path)) – Path to the dataset folder
- Returns
last_modified – A string representing the last modified of the entire folder
- Return type
String
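A hedged sketch of the scan described above; note that the real routine returns a formatted string rather than a raw timestamp:

import os

def last_modified_sketch(dataset_folder):
    latest = 0.0
    for root, _, files in os.walk(dataset_folder):
        for name in files:
            if name == 'footprint.json':  # excluded, see above
                continue
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest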
util.data.dataset_integrity.save_footprint(dataset_folder, filename, data)[source]¶
Saves the footprint to the file system.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
filename (String) – Name of the file where the data will be saved
data (dictionary) – The actual data in JSON compliant format
- Returns
- Return type
util.data.dataset_integrity.verify_integrity_deep(dataset_folder)[source]¶
This function re-computes the footprint and verifies that it matches the existing one. This is slow and should be used only when the integrity of the dataset is a critical matter.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
Whether the dataset footprint still matches the data
- Return type
Boolean
util.data.dataset_integrity.verify_integrity_quick(dataset_folder)[source]¶
This function verifies that the ‘last_modified’ field still corresponds to the one contained in the footprint. This check is very fast, but it comes at a price: the OS updates this timestamp when files are added to or removed from the folder, but NOT when a file is modified. Because of this, it is not 100% safe and especially does NOT protect you against malicious attacks! For a reliable check of whether the data is unchanged, rely on the slower verify_integrity_deep() function.
- Parameters
dataset_folder (String (path)) – Path to the dataset folder (see above for details)
- Returns
Whether the ‘last_modified’ field is still up to date
- Return type
Boolean
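A hypothetical usage pattern combining the two checks, with a placeholder path:

from util.data.dataset_integrity import verify_integrity_deep, verify_integrity_quick

if not verify_integrity_quick('/path/to/dataset'):
    # Fall back to the slow but trustworthy check.
    assert verify_integrity_deep('/path/to/dataset'), 'dataset was modified!'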
util.data.dataset_splitter module¶
This script allows for creation of a validation set from the training set.
util.data.dataset_splitter.split_dataset(dataset_folder, split, symbolic, debug=False)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
debug (bool) – Prints additional debug statements
- Returns
- Return type
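A hypothetical call with a placeholder path; split=0.2 would move 20% of the training set into the validation set:

from util.data.dataset_splitter import split_dataset

split_dataset(dataset_folder='/path/to/dataset', split=0.2, symbolic=True)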
util.data.dataset_splitter.split_dataset_segmentation(dataset_folder, split, symbolic, test=False)[source]¶
Partition a dataset into train/val(/test) splits on the filesystem for segmentation datasets organized as dataset/data for the images and dataset/gt for the ground truth. Corresponding images must have the same name.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
test (bool) – If true, the validation set is split again (1:1) into a val and test set. Default false.
- Returns
- Return type
util.data.dataset_splitter.split_dataset_writerIdentification(dataset_folder, split)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
split (float) – Specifies how much of the training set should be converted into the validation set.
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
- Returns
- Return type
util.data.get_a_dataset module¶
util.data.get_a_dataset.cifar10(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the CIFAR-10 dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
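A hypothetical usage; whether args is a plain dict or an argparse Namespace depends on the caller, so a Namespace is assumed here:

from argparse import Namespace
from util.data.get_a_dataset import cifar10

# output_folder: the path where the dataset will be downloaded and prepared
cifar10(Namespace(output_folder='/path/to/output'))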
util.data.get_a_dataset.diva_hisdb(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the DIVA HisDB-all dataset for semantic segmentation to the location specified on the file system
See also: https://diuf.unifr.ch/main/hisdoc/diva-hisdb
Output folder structure:
    ../HisDB/CB55/train
    ../HisDB/CB55/val
    ../HisDB/CB55/test
    ../HisDB/CB55/test/data -> images
    ../HisDB/CB55/test/gt -> pixel-wise annotated ground truth
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.fashion_mnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the Fashion-MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.glas(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the tubule dataset (from the GlaS challenge) for semantic segmentation to the location specified on the file system
See also: https://github.com/choosehappy/public/tree/master/DL%20tutorial%20Code/3-tubule
Output folder structure:
    ../HisDB/GlaS/train
    ../HisDB/GlaS/val
    ../HisDB/GlaS/test
    ../HisDB/GlaS/test/data -> images
    ../HisDB/GlaS/test/gt -> pixel-wise annotated ground truth
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.kmnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the K-MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.miml(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the Multi-Instance Multi-Label Image Dataset on the file system. Dataset available at: http://lamda.nju.edu.cn/data_MIMLimage.ashx
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.mnist(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the MNIST dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.get_a_dataset.svhn(args)[source]¶
Fetches and prepares (in a DeepDIVA friendly format) the SVHN dataset to the location specified on the file system
- Parameters
args (dict) – List of arguments necessary to run this routine. In particular, it is necessary to provide output_folder as a String containing the path where the dataset will be downloaded
- Returns
- Return type
util.data.remove_whitespace module¶
util.data.shuffle_labels module¶
This script creates a symlink directory with all labels shuffled.
util.data.shuffle_labels.split_dataset(dataset_folder, output_folder, symbolic)[source]¶
Partition a dataset into train/val splits on the filesystem.
- Parameters
dataset_folder (str) – Path to the dataset folder (see datasets.image_folder_dataset.load_dataset for details).
output_folder (str) – Path to the output folder (see datasets.image_folder_dataset.load_dataset for details).
symbolic (bool) – If True, does not copy the data but creates symbolic links to the original data
- Returns
- Return type