Benetech Video Deduplication Project
Near-duplicate, object, and metadata detection for video files.
To find out more about the project, installation, and running the tool, you may review our e-Learning module: https://benetech.github.io/VideoDeduplication/
Installation (Ubuntu with Docker)
Install and configure Docker
The easiest, most consistent method for installing Docker on Ubuntu can be found at: https://get.docker.com/
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
To allow Docker to be used by non-root users:
Create the docker group.
sudo groupadd docker
Add your user to the docker group.
sudo usermod -aG docker $USER
Log out and log back in so that your group membership is re-evaluated.
If testing on a virtual machine, it may be necessary to restart the virtual machine for changes to take effect.
On a desktop Linux environment such as X Windows, log out of your session completely and then log back in.
On Linux, you can also run the following command to activate the changes to groups:
newgrp docker
Once the above has been completed, open a command prompt window and type the ‘docker’ command to confirm that the Docker service is available and returns the help guide.
Enable GPU support for Docker
Assuming Docker has been installed, install the NVIDIA Docker runtime using the install_nvidia_docker.sh script in the main project folder [GPU LINUX ONLY].
To install docker-compose, download the release binary:
sudo curl -L "https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Then make it executable:
sudo chmod +x /usr/local/bin/docker-compose
Clone the repository:
git clone https://github.com/benetech/VideoDeduplication.git
Building and Running Application
The default approach to building and running the application is to use the docker-compose utility.
Shortcut commands to run the application are:
make run - build and run the application
make stop - stop the application
make run will ask you the following questions:
- Location of your source video files
- Availability of Nvidia GPU support for Docker (see Enable GPU support for Docker)
- Whether you want to use pre-built images
The docker-compose.yml configuration relies on various environment variables. The only required variable is BENETECH_DATA_LOCATION, the path to the root folder containing your video files.
You can set environment variables in the .env file at the repository root folder.
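As a rough illustration of this configuration step, the sketch below parses a `.env`-style file and checks that `BENETECH_DATA_LOCATION` points to an existing directory before the application is started. The helper names `parse_env_text` and `check_data_location` are hypothetical and not part of the project, and the parser only understands simple `KEY=VALUE` lines.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def check_data_location(env):
    """Return the data folder path, or raise if it is missing or invalid."""
    path = env.get("BENETECH_DATA_LOCATION")
    if not path:
        raise ValueError("BENETECH_DATA_LOCATION is not set")
    if not os.path.isdir(path):
        raise ValueError(f"{path!r} is not a directory")
    return path


# In this sketch, variables set in the shell override entries from .env.
file_env = {}
if os.path.exists(".env"):
    with open(".env") as handle:
        file_env = parse_env_text(handle.read())
merged = {**file_env, **os.environ}
try:
    print("Video folder:", check_data_location(merged))
except ValueError as error:
    print("Configuration problem:", error)
```

Running a check like this first can surface a missing or mistyped path before any containers are built.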
By default, docker-compose will build all required containers and assume Nvidia GPU support is available. You can also use the predefined configuration extensions placed in the ./docker-compose directory (see ./docker-compose/README.md).
The make run shortcut is a tiny wrapper around the docker-compose command that chooses the appropriate configuration extensions. If you have specified the BENETECH_DATA_LOCATION environment variable (either in your shell or in the .env file), you can simply run the default configuration with:
sudo docker-compose up -d
The command above might throw an error if you already have a postgres server running. If that's the case, run systemctl stop postgresql (Linux) before using docker-compose, or choose an alternative postgres port by setting the BENETECH_PG_PORT environment variable.
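One way to detect this conflict ahead of time is to test whether the port is already accepting connections. The following is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, and 5432 is PostgreSQL's default port (whether docker-compose.yml maps that exact host port is an assumption here).

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# 5432 is the PostgreSQL default; adjust if your setup differs.
if port_in_use(5432):
    print("Port 5432 is busy: stop postgresql or set BENETECH_PG_PORT")
else:
    print("Port 5432 looks free")
```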
Once docker-compose is running, you will be able to access the following:
- User interface on http://localhost:5000
- Project notebooks on http://localhost:8888
- pgAdmin on http://localhost:16543
You can check your running instances using this command:
sudo docker ps
Take note of the following names:
- Deduplication App ->
- User Interface ->
- Postgres Server ->
- PgAdmin ->
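If you prefer to collect the container names from a script, the Docker CLI can print one name per line with `docker ps --format "{{.Names}}"`. The sketch below wraps that call; the function names are hypothetical, and the sample names in the comment are the ones mentioned elsewhere in this README.

```python
import subprocess


def parse_container_names(output):
    """Split `docker ps --format '{{.Names}}'` output into a list of names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def running_containers():
    """Ask the Docker CLI for the names of the running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_container_names(result.stdout)


# Parsing a captured sample, e.g. names such as
# videodeduplication_dedup-app_1 and videodeduplication_postgres_1.
sample = "videodeduplication_dedup-app_1\nvideodeduplication_postgres_1\n"
print(parse_container_names(sample))
```

Calling `running_containers()` requires Docker to be installed and the current user to have permission to run it.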
In order to use pgAdmin, follow these instructions:
- Go to http://localhost:16543 and use the credentials as defined on the
- Click create new server
- Choose a reference name for the server
- Go to the connection tab and set the host name to postgres, maintenance database to "videodeduplicationdb" and user / password as
In order to run the main scripts, simply enter the app's docker container by running the following command:
docker exec -it videodeduplication_dedup-app_1 /bin/bash
Once inside the container, run one of the main scripts as described in the "running" section of this documentation.
If you don't want to build Docker images locally, you can use prebuilt images hosted on Docker Hub.
If you use the make run command, you can set BENETECH_PREBUILT=YES in the
If you use docker-compose explicitly, you can run:
sudo docker-compose -f docker-compose.yml -f docker-compose/prebuilt.yml up -d
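To illustrate how a wrapper might choose between the default and prebuilt configurations, here is a small sketch that assembles the docker-compose arguments. `compose_command` is a hypothetical helper, not the project's actual `make run` logic, and it does not model the GPU-related options.

```python
def compose_command(prebuilt=False, detached=True):
    """Build a docker-compose argument list for the chosen configuration."""
    command = ["docker-compose", "-f", "docker-compose.yml"]
    if prebuilt:
        # Extension file taken from the command shown above.
        command += ["-f", "docker-compose/prebuilt.yml"]
    command.append("up")
    if detached:
        command.append("-d")
    return command


print(" ".join(compose_command(prebuilt=True)))
```

The resulting list can be passed to `subprocess.run` if you want to start the containers from Python.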
To pull images run:
docker pull johnhbenetech/videodeduplication:gpu
Build Images Manually
You can build and run containers manually:
sudo docker build -f docker/Dockerfile.dedup-gpu -t benetech-dedup:gpu .
sudo docker build -f docker/Dockerfile.server -t benetech-server .
This repo contains four main scripts that perform the following tasks:
1. extract_features.py: Signature extraction pipeline
2. generate_matches.py: Signature to matches (saved as CSV)
3. template_matching.py: Uses source templates to query the extracted embeddings and generates a report containing potential matches
4. audio_processing.py: Audio processing pipeline developed in collaboration with Microsoft as described on our [wiki](https://github.com/benetech/VideoDeduplication/wiki/Audio-Processing)
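To give a feel for the "signature to matches" step, the sketch below compares toy signature vectors pairwise and writes the pairs that fall under a distance threshold as CSV. This is only an illustration under stated assumptions: the use of cosine distance, the helper names, and the CSV column names are all hypothetical, and the project's real matching code may work differently.

```python
import csv
import io
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def find_matches(signatures, match_distance):
    """Return (file_a, file_b, distance) for pairs within the threshold."""
    matches = []
    pairs = combinations(sorted(signatures.items()), 2)
    for (name_a, sig_a), (name_b, sig_b) in pairs:
        distance = cosine_distance(sig_a, sig_b)
        if distance <= match_distance:
            matches.append((name_a, name_b, round(distance, 4)))
    return matches


def matches_to_csv(matches):
    """Serialize matches as CSV text (column names are illustrative)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["query_video", "match_video", "distance"])
    writer.writerows(matches)
    return buffer.getvalue()


# Toy signatures: a.mp4 and b.mp4 point in nearly the same direction.
signatures = {"a.mp4": [1.0, 0.0], "b.mp4": [0.99, 0.05], "c.mp4": [0.0, 1.0]}
print(matches_to_csv(find_matches(signatures, match_distance=0.1)))
```

Lowering the threshold makes matching stricter, while raising it admits more distant pairs.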
Important notebooks include (located inside the notebooks folder):
1. Visualization and Annotation Tool.ipynb: Allows the output of the generate_matches script to be reviewed and annotated.
2. Template Matching Demo.ipynb: Allows the output of the extract_features script to be queried against known videos / images [as defined in custom templates built by the user]
These scripts use the 'config.yaml' file to define where to collect data from, hyperparameters (...)
video_source_folder: Directory where the source video files are located
destination_folder: Destination of the output files generated from the scripts
root_folder_intermediate: Folder name used for the intermediate representations (make sure it's compatible with the next parameter)
match_distance: Distance threshold that determines whether two videos are a match [FLOAT - 0.0 to 1.0]
video_list_filename: Name of the file that contains the list of processed video files (to be saved by the extraction script)
filter_dark_videos: [true / false] Whether to remove dark videos from final output files.
filter_dark_videos_thr: [1-10 int range] Ideally a number between 1 and 10. Higher numbers mean we will be less strict when filtering out dark videos.
min_video_duration_seconds: Minimum video duration in seconds
detect_scenes: [true / false] Whether to run scene detection or not.
minimum_scene_duration: [1-5 int range] Ideally a number between 1 and 10. Higher numbers mean smaller scenes will be merged into larger ones.
use_pretrained_model_local_path: [true / false] Whether to use the pretrained model from your local file system
pretrained_model_local_path: Absolute path to the pretrained model in case the user doesn't want to download it from S3
use_db: [true / false] Whether to save results in the DB
conninfo: Connection string (e.g. postgres://[USER]:[PASSWORD]@[URL]:[PORT]/[DBNAME]). When using our Docker workflow, URL should default to "videodeduplication_postgres_1" instead of localhost
keep_fileoutput: [true / false]. Whether to keep regular output even with results being saved in DB
templates_source_path: Directory where templates of interest are located. This should be a directory in which each sub-folder contains the images for one template (e.g. if set to datadrive/templates/, this folder could contain sub-folders such as plane, smoke or bomb, each holding its respective images)
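The documented ranges can be checked before a long run starts. The sketch below validates a few of the settings listed above; it assumes the settings have already been loaded into a flat Python dict (the real config file may nest them in sections), and both `validate_config` and the example values are hypothetical.

```python
def validate_config(config):
    """Return a list of problems found in a flat dict of settings."""
    problems = []

    distance = config.get("match_distance")
    if not isinstance(distance, (int, float)) or not 0.0 <= distance <= 1.0:
        problems.append("match_distance must be a number between 0.0 and 1.0")

    threshold = config.get("filter_dark_videos_thr")
    if config.get("filter_dark_videos") and (
        not isinstance(threshold, int) or not 1 <= threshold <= 10
    ):
        problems.append("filter_dark_videos_thr must be an integer from 1 to 10")

    if config.get("use_pretrained_model_local_path") and not config.get(
        "pretrained_model_local_path"
    ):
        problems.append("pretrained_model_local_path is required")

    conninfo = str(config.get("conninfo", ""))
    if config.get("use_db") and not conninfo.startswith("postgres://"):
        problems.append("conninfo must be a postgres:// connection string")

    return problems


# Illustrative values only; user, password and port are placeholders.
example = {
    "match_distance": 0.75,
    "filter_dark_videos": True,
    "filter_dark_videos_thr": 2,
    "use_db": True,
    "conninfo": "postgres://user:password@videodeduplication_postgres_1:5432/videodeduplicationdb",
}
print(validate_config(example))
```

An empty list means none of the checked settings is out of range.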
Within the docker command line
Extract video signatures
'--config', '-cp': Path to the project config file [default: 'config.yml']
'--list-of-files', '-lof': Path to a txt file with a list of files for processing - overrides source folder from the config file [default: '']
'--frame-sampling', '-fs': Sets the sampling strategy (values from 1 to 10 - e.g. sample one frame every X seconds) - overrides frame sampling from the config file [default: 1]
'--save-frames', '-sf': Whether to save the frames sampled from the videos - overrides save_frames on the config file [default: False]
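To make the option semantics concrete, the sketch below mirrors the documented extraction flags with the standard library's argparse. This is an illustration only: the project's own scripts may use a different parsing library, and treating `--save-frames` as an on/off switch is an assumption.

```python
import argparse


def build_parser():
    """Mirror the documented extraction options with argparse."""
    parser = argparse.ArgumentParser(description="Extract video signatures")
    parser.add_argument("--config", "-cp", default="config.yml",
                        help="Path to the project config file")
    parser.add_argument("--list-of-files", "-lof", default="",
                        help="Text file listing the videos to process")
    parser.add_argument("--frame-sampling", "-fs", type=int, default=1,
                        choices=range(1, 11),
                        help="Sample one frame every N seconds (1 to 10)")
    # Assumption: --save-frames is modeled as a boolean switch.
    parser.add_argument("--save-frames", "-sf", action="store_true",
                        help="Save the frames sampled from the videos")
    return parser


args = build_parser().parse_args(["-cp", "config.yml", "-fs", "2", "-sf"])
print(args.config, args.frame_sampling, args.save_frames)
```

Values given on the command line take priority over the defaults, which matches the "overrides the config file" behavior described for each flag.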
'--config', '-cp': Path to the project config file [default: 'config.yml']
'--list-of-files', '-lof': Path to a txt file with a list of files for processing and generating matches / scene detection / metadata extraction - overrides loading all signatures available from the file system
'--config', '-cp': Path to the project config file [default: 'config.yml']
'--list-of-files', '-lof': Path to a txt file with a list of files for processing and generating matches / scene detection / metadata extraction - overrides loading all signatures available from the file system
'--cores', '-': Number of cores to be used on parallel processing routines [default: 5]
'--model', '-m': Path to the audio processing model [default: 'data/audio_model.h5']
Template Object Matching
'--override', '-ovr': Overrides the previous template matches saved on the DB [default: False]
'--template-dir', '-td': Path to a directory containing templates - overrides source folder from the config file [default: '']
We have created a few benchmarking scripts to allow performance testing for a few features of the project.
In order to evaluate video deduplication, please run the script below:
python benchmarks/evaluate.py --benchmark augmented_dataset
This script will download our testing dataset and run our pipeline on it. Results are stress tested using random sampling to create random query/answer pairs at different levels of positive/negative examples (e.g. what's the performance of our model when 10% of the content is duplicated? What about at 15%?). The results of the benchmarking script are saved at the root of the data folder.
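The idea of sampling query sets at different duplicate levels can be sketched as follows. This is not the benchmark's actual code: `sample_query_set`, the file names, and the set sizes are hypothetical and only show how a fixed share of duplicates might be drawn reproducibly.

```python
import random


def sample_query_set(duplicates, distinct, size, duplicate_ratio, seed=0):
    """Draw `size` items so that about `duplicate_ratio` are duplicates."""
    rng = random.Random(seed)
    n_duplicates = round(size * duplicate_ratio)
    chosen = rng.sample(duplicates, n_duplicates)
    chosen += rng.sample(distinct, size - n_duplicates)
    rng.shuffle(chosen)
    return chosen


# Hypothetical file names standing in for a labeled benchmark dataset.
duplicates = [f"dup_{i}.mp4" for i in range(50)]
distinct = [f"orig_{i}.mp4" for i in range(200)]

for ratio in (0.10, 0.15):
    queries = sample_query_set(duplicates, distinct, size=100, duplicate_ratio=ratio)
    share = sum(name.startswith("dup_") for name in queries) / len(queries)
    print(f"target {ratio:.0%}, sampled {share:.0%}")
```

Fixing the seed keeps each sampled set reproducible, so runs at different duplicate levels can be compared.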
For more details about our evaluation metric please refer to our wiki
In order to evaluate template matching, please run the script below:
python benchmarks/evaluate.py --benchmark landmarks
This script will download our subset of the Google landmark dataset. Our script uses samples of landmarks to create query templates and runs those templates against random subsets of landmarks.
The results of the benchmarking script are saved at the root of the data folder.
In order to evaluate scene detection, please run the script below:
python benchmarks/evaluate.py --benchmark scene_detection
This script will download our subset of the Planet Earth.
The results of the benchmarking script are saved at the root of the data folder.