GuideMe: Multi-Domain Task Guidance and Intervention in Streaming Video

Fang Liu1,*,   Jinpeng Chen2,*,   Ke Xu3,1,   Yuhao Liu1,   Huankang Guan2,   Xudong Lu4,  
Bo Yang2,   Gerhard Hancke1,   Rui Liu2,†,   Rynson W.H. Lau1,5,†

1City University of Hong Kong    2Huawei Research   
3University of Science and Technology of China   
4Chinese University of Hong Kong    5City University of Hong Kong (Dongguan)

ECCV 2026

*Joint first authors    †Corresponding authors

While multimodal Large Language Models (MLLMs) excel at offline video understanding, how far they are from serving as real-time procedural coaches remains an open question. Such a role requires an MLLM to continuously monitor execution, detect mistakes, and provide corrective guidance in a closed-loop interaction. In this paper, we construct GuideMe, the first multi-domain benchmark for streaming video that supports training and evaluation of MLLMs for closed-loop interactive task guidance. It comprises 2,458 videos spanning 223.7 hours across diverse domains, including cooking, object manipulation, daily-life guidance, and fitness, with 47,775 interaction samples covering next-step instructions, completion feedback, error detection, and corrective guidance. To evaluate representative MLLMs on GuideMe, we design a three-facet assessment framework consisting of temporal-semantic bipartite matching for sequence-level alignment, behavioral classification for intervention timing, and LLM-as-a-Judge for content quality. Extensive experiments highlight a critical performance asymmetry: despite excelling at providing instructions, existing MLLMs consistently fail to identify execution errors and respond with corrective feedback.

GuideMe Dataset

The final dataset contains 2,458 video instances with a combined duration of 223.7 hours. Video lengths range from 0.5 to 41.2 minutes (average 5.5 min, median 3.6 min), yielding 47,775 streaming interaction samples across cooking, daily tasks, fitness, and embodied assistance, of which 9,876 form the held-out test set used for all evaluations. GuideMe is split into GuideMe-Train (1,985 videos, 177.0 hours) and GuideMe-Test (473 videos, 46.7 hours); all reported evaluations use GuideMe-Test, while GuideMe-Train supports task-specific adaptation and is used only for the fine-tuned Qwen3-VL-8B result in the main results table.
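The split figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this section; the variable names are illustrative and not part of any released GuideMe tooling.

```python
# Figures copied from the dataset description above.
TRAIN = {"videos": 1985, "hours": 177.0}
TEST = {"videos": 473, "hours": 46.7}
TOTAL_VIDEOS, TOTAL_HOURS, TOTAL_SAMPLES = 2458, 223.7, 47775
TEST_SAMPLES = 9876

# The two splits should add up to the full dataset.
videos_match = TRAIN["videos"] + TEST["videos"] == TOTAL_VIDEOS
hours_match = abs(TRAIN["hours"] + TEST["hours"] - TOTAL_HOURS) < 1e-6

# Derived quantities (not stated in the source, only implied by it).
mean_minutes = TOTAL_HOURS * 60 / TOTAL_VIDEOS        # average video length in minutes
samples_per_video = TOTAL_SAMPLES / TOTAL_VIDEOS      # interaction samples per video
test_sample_share = 100 * TEST_SAMPLES / TOTAL_SAMPLES  # percent of samples held out

print(videos_match, hours_match)
print(round(mean_minutes, 1), round(samples_per_video, 1), round(test_sample_share, 1))
```

The derived mean length agrees with the reported 5.5-minute average. The per-video sample count and the held-out share are implied values rather than figures reported by the authors.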

2,458 Videos
223.7 Hours
47,775 Interaction Samples
4 Task Domains
Dataset Domain #Videos Hours Step-level Instruction Timed Feedback Error Detection Action-specific Correction Closed-loop Interaction
Epic-Kitchen-100 Cooking 700 100 × × × ×
COIN Diverse 11,827 476 × × × ×
HoloAssist Object Manip. 2,221 166 × × ×
CaptainCook4D Cooking 384 94.5 × × × ×
QEVD-Fit-Coach Fitness 223 13.5 ×
Assembly101 Assembly 4,321 513 × × ×
EgoPER Cooking 386 28 × × ×
GuideMe Diverse 2,458 223.7

Comparison of procedural video datasets. GuideMe supports closed-loop interactive task guidance across all five evaluation dimensions.

Domain Distribution

GuideMe bridges everyday routines and more complex assistance scenarios. Home Life and Cooking form the largest portion of the benchmark, accounting for 38.4% and 23.4% of the videos respectively, while technical domains such as Field Tech and IT Support contribute the remaining 38.2%.

Duration Distribution

The videos range from 0.5 to 41.2 minutes, with an average duration of 5.5 minutes and a median of 3.6 minutes. This distribution covers concise step-level tasks while retaining long-form sequences for evaluating long-range temporal reasoning.

GuideMe video distribution

Evaluation Framework

Temporal-Semantic Matching

Measures sequence-level alignment by matching predicted guidance with ground-truth interventions in time and meaning.
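A minimal sketch of what such a matching could look like is given below. It is not the authors' implementation. The `Turn` type, the token-overlap similarity, and the `max_gap_s` tolerance are all assumptions made for illustration, and the exhaustive search stands in for whatever assignment solver the benchmark actually uses.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Turn:
    time_s: float  # when the guidance is issued in the stream
    text: str      # what is said


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real system would likely use embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_turns(preds, refs, max_gap_s=5.0, sim=token_jaccard):
    """Return (pred_idx, ref_idx, similarity) triples of a one-to-one matching
    that maximizes total similarity.

    Pairs further apart in time than max_gap_s (a hypothetical tolerance) are
    never matched. The search is exhaustive, so it only suits short examples.
    """
    weights = [
        [sim(p.text, r.text) if abs(p.time_s - r.time_s) <= max_gap_s else 0.0
         for r in refs]
        for p in preds
    ]

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (total, pairs) for predictions i.. given a bitmask of used references.
        if i == len(preds):
            return 0.0, ()
        top = best(i + 1, used)  # option: leave prediction i unmatched
        for j in range(len(refs)):
            w = weights[i][j]
            if w > 0.0 and not used & (1 << j):
                total, pairs = best(i + 1, used | (1 << j))
                if total + w > top[0]:
                    top = (total + w, ((i, j, w),) + pairs)
        return top

    return list(best(0, 0)[1])


preds = [Turn(10, "add oats"), Turn(11, "add the oats now"), Turn(40, "stir the pot")]
refs = [Turn(12, "add oats"), Turn(90, "stir the pot")]
pairs = match_turns(preds, refs)
print(pairs)
```

In this toy example two predictions compete for the same reference and only the more similar one is kept, while the late "stir the pot" reference stays unmatched because no prediction falls inside the time tolerance.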

Behavioral Classification

Evaluates intervention timing and response behavior, including correct silence, false alarms, missing responses, and partly correct interventions.
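One plausible reading of the four categories is a two-by-two table over whether the model spoke and whether the ground truth required a response. The sketch below encodes that reading. The mapping is an assumption inferred from the category names, not a definition taken from the paper, and `classify` and `behavior_rates` are hypothetical helper names.

```python
from collections import Counter
from enum import Enum


class Behavior(Enum):
    CS = "Correct Silent"
    NR = "No Response"
    FA = "False Alarm"
    PC = "Partly Correct"


def classify(model_spoke: bool, response_required: bool) -> Behavior:
    """Assumed mapping: the category depends only on who spoke and who should have."""
    if response_required:
        return Behavior.PC if model_spoke else Behavior.NR
    return Behavior.FA if model_spoke else Behavior.CS


def behavior_rates(instances):
    """Percentage of instances per category.

    instances is a list of (model_spoke, response_required) pairs.
    """
    counts = Counter(classify(spoke, required) for spoke, required in instances)
    total = len(instances)
    return {b.name: 100.0 * counts[b] / total for b in Behavior} if total else {}


rates = behavior_rates([(False, False), (True, False), (False, True), (True, True)])
print(rates)
```

Under this reading the four rates sum to 100 for any model, which is how most rows of the results table behave.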

LLM-as-a-Judge

Assesses the content quality of open-ended guidance responses, focusing on semantic correctness and actionable feedback.

Experimental Results

Column groups: Temporal Alignment (sPrecision, sRecall, sF1, Scorem); Response Behavior (CS, NR, FA, PC, Score).

Model | Param. | sPrecision ↑ | sRecall ↑ | sF1 ↑ | Scorem | CS ↑ | NR ↓ | FA ↓ | PC ↑ | Score ↑

Proprietary Multimodal Models
Doubao-Seed-1.8 | - | 30.6 | 47.5 | 36.8 | 63.1 | 8.8 | 6.1 | 34.8 | 44.2 | 38.8
GPT-5.2 | - | 30.2 | 36.8 | 32.5 | 64.4 | 13.4 | 16.8 | 30.6 | 39.2 | 38.9
Gemini 3 Pro | - | 39.3 | 36.9 | 36.5 | 66.6 | 17.5 | 22.0 | 19.0 | 41.5 | 44.8
Gemini 3.1 Pro | - | 29.3 | 46.9 | 35.7 | 61.5 | 7.4 | 6.3 | 36.4 | 49.9 | 37.8

Open-source Multimodal Models
Qwen2.5-7B | 7B | 30.1 | 30.9 | 29.3 | 57.5 | 19.7 | 20.4 | 25.4 | 34.5 | 37.5
Qwen3-VL-8B | 8B | 29.7 | 46.8 | 35.7 | 62.0 | 5.2 | 6.7 | 37.7 | 50.5 | 31.9
Qwen3-VL-30B-A3B | 30B | 29.5 | 46.9 | 35.6 | 61.6 | 6.7 | 6.0 | 36.5 | 50.9 | 33.3
Qwen3.5-397B-A17B | 397B | 38.4 | 20.2 | 21.7 | 64.6 | 35.2 | 37.9 | 7.9 | 19.0 | 45.7

Open-source Multimodal Streaming Models
VideoLLM-online | 8B | 22.3 | 41.1 | 28.7 | 40.8 | 0.0 | 0.0 | 43.6 | 56.4 | 22.6
Dispider | 7B | 1.0 | 0.1 | 0.1 | 46.7 | 43.6 | 56.3 | 0.0 | 0.1 | 43.6
LiveStar | 8B | 20.7 | 19.0 | 19.1 | 14.4 | 21.2 | 24.4 | 22.2 | 32.3 | 25.0
MMDuet2 | 3B | 3.3 | 0.2 | 0.4 | 43.3 | 41.7 | 57.9 | 0.2 | 0.2 | 41.8

Fine-tuned on GuideMe Train Set
Qwen3-VL-8B | 8B | 40.3 | 43.7 | 39.7 | 58.1 | 19.5 | 20.6 | 24.5 | 35.3 | 42.1

Main evaluation results on GuideMe. sPrecision is the sum of matched semantic similarities divided by the number of active (non-silent) predictions, measuring intervention relevance; sRecall is the same sum divided by the number of non-silent ground-truth responses, measuring coverage of required interventions; sF1 is their harmonic mean.
Scorem is the LLM-as-a-Judge quality score averaged over successfully matched prediction-reference intervention pairs, isolating response quality after temporal-semantic matching.
CS: Correct Silent, NR: No Response, FA: False Alarm, PC: Partly Correct. Score is the average instance-level response-behavior score: CS receives 100, FA and NR receive 0, and PC receives an LLM-as-a-Judge quality score scaled to 0-100. The final row reports Qwen3-VL-8B after fine-tuning on the GuideMe training split.
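The metric definitions in this caption translate fairly directly into code. The following is a sketch under stated assumptions rather than the official scorer: the function names are hypothetical, the inputs are assumed to come from an upstream matching step and an LLM judge, and the handling of empty inputs is a guess.

```python
def semantic_prf(matched_sims, n_active_preds, n_active_refs):
    """sPrecision, sRecall, and sF1 from the similarities of matched pairs."""
    total = sum(matched_sims)
    s_precision = total / n_active_preds if n_active_preds else 0.0
    s_recall = total / n_active_refs if n_active_refs else 0.0
    denom = s_precision + s_recall
    s_f1 = 2 * s_precision * s_recall / denom if denom else 0.0
    return s_precision, s_recall, s_f1


def matched_quality(judge_scores):
    """Scorem: mean judge score over successfully matched pairs."""
    return sum(judge_scores) / len(judge_scores) if judge_scores else 0.0


def behavior_score(instances):
    """Score: CS counts 100, NR and FA count 0, PC counts its judge score (0-100).

    instances is a list of (label, judge_score) pairs; judge_score is only
    read for "PC" and may be None otherwise.
    """
    points = {"CS": 100.0, "NR": 0.0, "FA": 0.0}
    scores = [judge if label == "PC" else points[label] for label, judge in instances]
    return sum(scores) / len(scores) if scores else 0.0


p, r, f1 = semantic_prf([0.8, 0.6], n_active_preds=4, n_active_refs=2)
quality = matched_quality([70.0, 50.0])
score = behavior_score([("CS", None), ("FA", None), ("NR", None), ("PC", 60.0)])
print(round(p, 3), round(r, 3), round(f1, 3), quality, score)
```

The tabulated sF1 values are not the harmonic mean of the tabulated sPrecision and sRecall (for example 38.4 and 20.2 against 21.7), so the table presumably averages per-video values. That reading is an inference from the numbers, not something the caption states.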

Aggressive vs. Conservative Behaviors

No baseline provides reliable streaming coaching. Aggressive models such as Gemini 3.1 Pro and Qwen3-VL-8B/30B-A3B respond frequently but trigger many false alarms, while conservative models such as Qwen3.5-397B-A17B speak much less often and miss many required interventions.
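The contrast can be made concrete from the response-behavior columns of the main results table. The sketch below assumes that FA and PC are the two categories in which the model speaks, so their sum approximates how often a model responds. That interpretation is inferred from the category names and is not stated by the authors.

```python
# (CS, NR, FA, PC) percentages copied from the main results table.
BEHAVIOR = {
    "Gemini 3.1 Pro": (7.4, 6.3, 36.4, 49.9),
    "Qwen3-VL-8B": (5.2, 6.7, 37.7, 50.5),
    "Qwen3-VL-30B-A3B": (6.7, 6.0, 36.5, 50.9),
    "Qwen3.5-397B-A17B": (35.2, 37.9, 7.9, 19.0),
}


def speak_rate(cs, nr, fa, pc):
    """Share of instances in which the model says something (assumed: FA + PC)."""
    return round(fa + pc, 1)


rates = {name: speak_rate(*row) for name, row in BEHAVIOR.items()}
for name, rate in rates.items():
    print(f"{name}: speaks on {rate}% of instances")
```

Read this way, the three aggressive models respond on well over 80% of instances, while Qwen3.5-397B-A17B responds on roughly a quarter of them and has correspondingly high CS and NR rates.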

Knowing When to Speak Remains Hard

Scaling and streaming-oriented pretraining do not solve the task. Larger models do not consistently improve temporal alignment, and streaming models also struggle, with some over-responding and others collapsing toward near-silence.

Fine-tuning Helps but Is Not Sufficient

Task-specific fine-tuning improves temporal alignment and response quality, but it also shifts the model toward more conservative behavior. Robust closed-loop coaching still requires jointly improving timing, intervention decisions, and content quality.

Qualitative Results

User Query: I'm going to make some oatmeal. Can you walk me through the steps and let me know if I make any mistakes along the way?
Timeline: 0s to 150s
Ground Truth: Silent
Doubao-Seed-1.8: Silent
GPT-5.2: Silent
Gemini 3.1 Pro: Silent
Qwen2.5-7B: Silent
MMDuet2: Silent
Qwen3-VL-8B (SFT): Silent

BibTeX

Coming soon...

If you have any questions, please contact me at fawnliu2333@gmail.com.