AI Roto from Image Engine at SIGGRAPH

There’s been a lot of AI/ML work at SIGGRAPH this year, but one thing that stood out is the Image Engine paper on generating alpha mattes for people and objects using machine learning.

It’s called “Automated Video Segmentation Machine Learning Pipeline” by Johannes Merz and Lucien Fostier.

The artist just has to load the plate, type a prompt like “person” or “car”, and optionally give positive/negative clicks. The result is a set of reusable WIP mattes for comps and downstream work.

The interesting thing is that this matte extraction tool has already been deployed in the studio and used on 12 different shows, for a total of 1241 shots to date.

In their SIGGRAPH paper, they mention they can pull a rough matte for a person across 100 frames in about 10 minutes on their GPU servers.

The hardware is NVIDIA A4000 GPUs, and to fit memory limits, they scale frames so the longest side is 1024 pixels, keeping the aspect ratio.
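
The resize rule is easy to reason about with a small helper. This is only a sketch of the arithmetic, not the studio's code: the function name is made up, and leaving frames that are already small untouched is my assumption (the paper summary only says the longest side is scaled to 1024).

```python
def fit_longest_side(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (new_width, new_height) with the longest side capped at max_side.

    The aspect ratio is preserved up to rounding to whole pixels.
    """
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: frames that already fit are not upscaled.
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_longest_side(4096, 2160))  # (1024, 540)
print(fit_longest_side(1920, 1080))  # (1024, 576)
```

So a 4K plate is processed at roughly a quarter of its linear resolution, which is worth keeping in mind when judging edge detail.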

So how does this work?

They built an automatic 3-stage system: promptable detection, per-frame segmentation, and mask-based video tracking.

1) Stage 1 — Object detection (per frame)

They use GroundingDINO, an open-set detector that you prompt with natural language (e.g., “person” or “person, car”). That gives bounding boxes for whatever you asked for on every frame.

2) Stage 2 — Image segmentation & cleanup (per frame)

For each detection box, they ask SAM2 (Segment Anything v2) to produce a proper roto mask for the object inside the box.

SAM2 replaced their earlier SAM v1 experiments because SAM2 adds benefits for both image and video work.

It tracks in short chunks (for example, ~20 frames at a time) and then refreshes every few frames (say, every 5 frames). This keeps the mask accurate even as poses or occlusions change.
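
One possible reading of the chunk-and-refresh schedule above is that a new tracking window is seeded every few frames and each window runs for a fixed number of frames, so the windows overlap. The sketch below only illustrates that reading; the function name, the defaults of 20 and 5, and the overlapping-window interpretation are assumptions, and the actual scheduling in the paper may differ.

```python
def tracking_windows(num_frames: int, chunk: int = 20, refresh: int = 5) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    A window is seeded every `refresh` frames and tracked for up to
    `chunk` frames, clipped to the end of the shot.
    """
    if chunk <= 0 or refresh <= 0:
        raise ValueError("chunk and refresh must be positive")
    return [
        (start, min(start + chunk, num_frames))
        for start in range(0, num_frames, refresh)
    ]


for window in tracking_windows(30):
    print(window)
```

With these numbers, most frames are covered by several windows, which is what would let a fresh detection correct a mask that has started to drift.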

3) Stage 3 — Video tracking (temporal consistency)

They track masks with SAM2’s video tracking mode (mask prompts are more stable than boxes).

After the forward pass, they run the process backwards so people who appear later still get assigned consistent layers.

If automatic results miss small or tricky parts (background people, fine dress details), artists can open a browser tool and give positive/negative clicks (points) to SAM2 or draw masks on a frame.
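
For context, SAM-style point prompts are normally passed as a list of (x, y) coordinates plus a parallel list of labels, with 1 for a positive (foreground) click and 0 for a negative (background) click. A minimal sketch of packing artist clicks into that shape follows. The Click class, the function name, and the scale parameter are hypothetical and say nothing about how Image Engine's browser tool is built; the scale is there only because clicks made on the full-resolution plate would need mapping to the resized frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """One artist click in plate pixel coordinates (hypothetical type)."""
    x: float
    y: float
    positive: bool


def to_point_prompts(clicks: list[Click], scale: float = 1.0) -> tuple[list[list[float]], list[int]]:
    """Convert clicks to (coords, labels) in the SAM-style convention.

    coords: [[x, y], ...] scaled into the working resolution.
    labels: 1 for positive clicks, 0 for negative clicks.
    """
    coords = [[c.x * scale, c.y * scale] for c in clicks]
    labels = [1 if c.positive else 0 for c in clicks]
    return coords, labels


clicks = [Click(100, 200, True), Click(300, 50, False)]
coords, labels = to_point_prompts(clicks, scale=0.5)
print(coords)  # [[50.0, 100.0], [150.0, 25.0]]
print(labels)  # [1, 0]
```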

There are limitations, though:

These are meant as fast, reusable WIP mattes for comps and downstream work, not final-quality roto.
The paper doesn’t mention how it handles defocused shots, motion blur, or shots with heavy camera movement.
To limit memory, shots are resized so the maximum dimension is ≤ 1024 px, which may impact very high-resolution use cases.
The pipeline runs on dedicated GPU servers (NVIDIA A4000 with 16 GB). Memory can be a bottleneck, especially for SAM2 video tracking.
Thresholds and heuristics (Sim/IoU/ε) are set to balance false positives vs. losing faint layers; this can sometimes remove usable layers unless manually corrected.
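
As a reminder of what the IoU part of those thresholds measures: intersection-over-union of two binary masks is the number of pixels set in both, divided by the number set in either. The sketch below is plain Python on nested lists to stay dependency-free. The 0.5 cutoff in the usage lines is purely illustrative and not a value from the paper, and how the pipeline combines IoU with the Sim and ε terms is not covered here.

```python
def mask_iou(a: list[list[int]], b: list[list[int]]) -> float:
    """Intersection-over-union of two equally sized binary masks (0/1 values)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                intersection += 1
            if pa or pb:
                union += 1
    # Two empty masks have no union; treat that as no overlap.
    return intersection / union if union else 0.0


mask_a = [[1, 1, 0],
          [1, 1, 0]]
mask_b = [[0, 1, 1],
          [0, 1, 1]]

iou = mask_iou(mask_a, mask_b)  # 2 shared pixels out of 6 covered pixels
same_layer = iou >= 0.5         # illustrative threshold, not from the paper
print(round(iou, 3), same_layer)
```

A strict cutoff like this is exactly where a faint but valid layer can get dropped, which is the trade-off described above.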

What’s next (their roadmap)

Spline tracking for roto: move beyond per-pixel masks to predict/track roto splines (knot positions, curve properties) to speed up final-quality roto workflows.

Check out the paper here: https://dl.acm.org/doi/full/10.1145/3744199.3744635