Finding duplicates in an image collection is a recurring task in many image-related machine learning use cases. For example, the presence of duplicates can severely bias the evaluation of an ML model. imagededup is a Python package that simplifies the task of finding exact and near duplicates.
The package provides hashing algorithms, which are particularly good at finding exact duplicates, as well as convolutional neural networks, which are also adept at finding near duplicates. It also includes an evaluation framework for judging the quality of deduplication on a given dataset.
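To make the hashing idea concrete, here is a minimal pure-Python sketch of hash-based duplicate detection. It is not imagededup's implementation or API: the names `average_hash`, `hamming`, and `find_duplicates` are hypothetical, the "images" are tiny 2x2 grayscale grids standing in for real files, and a simple average hash is assumed as the hashing scheme. Each image is reduced to a bit string, and two images count as duplicates when their bit strings differ in at most `max_distance` positions.

```python
from itertools import combinations


def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def find_duplicates(images, max_distance=0):
    """Map each image name to the names whose hashes lie within max_distance."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    duplicates = {name: [] for name in images}
    for first, second in combinations(hashes, 2):
        if hamming(hashes[first], hashes[second]) <= max_distance:
            duplicates[first].append(second)
            duplicates[second].append(first)
    return duplicates


# Toy data (illustrative only): 2x2 grayscale grids in place of real images.
images = {
    "a.png": [[10, 200], [220, 30]],
    "a_copy.png": [[10, 200], [220, 30]],    # identical pixels
    "a_edited.png": [[10, 200], [220, 160]], # one pixel changed
    "b.png": [[200, 10], [30, 220]],         # unrelated pattern
}

exact = find_duplicates(images, max_distance=0)
near = find_duplicates(images, max_distance=1)
```

With a threshold of 0 only the identical copy is matched to `a.png`; raising the threshold to 1 also picks up the edited version, while `b.png` stays unmatched in both cases. Real perceptual hashes work on much larger downscaled images, so the threshold is tuned per dataset.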
The talk walks the audience through the package's functionality.