Poltergeist
Poltergeist is a hackathon concept for testing vision-language models in high-stakes document workflows. Instead of synthetic noise, it focuses on production-like issues such as stains, creases, blur, compression, and occlusion.
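To make those perturbations concrete, here is a minimal sketch of three of them implemented with Pillow. The function names and parameters are illustrative assumptions, not code from the project.

```python
# A minimal sketch of three perturbation styles, using Pillow.
# Function names and parameters are illustrative, not from the project.
import io
from PIL import Image, ImageDraw, ImageFilter

def blur(doc: Image.Image, radius: float = 2.0) -> Image.Image:
    """Simulate an out-of-focus capture."""
    return doc.filter(ImageFilter.GaussianBlur(radius))

def jpeg_compress(doc: Image.Image, quality: int = 20) -> Image.Image:
    """Simulate heavy compression artifacts via a JPEG round-trip."""
    buf = io.BytesIO()
    doc.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

def occlude(doc: Image.Image, box=(50, 50, 200, 120)) -> Image.Image:
    """Simulate a sticker or thumb covering part of the page."""
    out = doc.copy()
    ImageDraw.Draw(out).rectangle(box, fill="black")
    return out
```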
The product story is intentionally simple for cross-functional teams: set up an experiment, run realistic perturbations, inspect failure patterns, and feed those cases back into model improvement. The screens below walk through that end-to-end loop.
Open the live demo (best viewed on desktop). I designed and built this concept in a focused 10-hour sprint.
Landing screen
The landing page sets context quickly: why vision models break in the real world and why realistic adversarial testing matters. The call-to-action and hero preview help non-ML stakeholders understand the product at a glance.
Projects list
This is the workspace overview where teams launch new runs and revisit previous experiments. Status tags make each project's progression visible, from baseline benchmarking to fine-tuning, so everyone can track where a given model iteration stands.
New project setup
Setup captures only the inputs needed to define a useful run: project
identity, model, dataset, task type, and duration. The structure keeps
setup fast while still making experimental scope explicit.
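For flavor, a run definition this lean could fit in a single config object. The field names below are assumptions that mirror the setup screen, not the product's actual schema.

```python
# Illustrative run configuration; field names are assumptions that
# mirror the inputs the setup screen captures.
from dataclasses import dataclass

@dataclass
class RunConfig:
    project_name: str          # project identity
    model: str                 # model under test, e.g. a VLM checkpoint id
    dataset: str               # document corpus to perturb
    task_type: str             # e.g. "extraction" or "classification"
    max_duration_minutes: int  # time budget for the run

config = RunConfig(
    project_name="invoice-extraction-robustness",
    model="vlm-baseline-v1",
    dataset="invoices-sample",
    task_type="extraction",
    max_duration_minutes=60,
)
```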
Test suite selection
Scenario categories are grouped by failure mode, including environment
changes, sensor degradation, semantic manipulations, and document-specific
damage. This organization encourages deliberate test design rather than
one-click black-box evaluation.
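One plausible way to model that grouping is a simple mapping from failure mode to scenarios, from which a team composes a deliberate selection. The specific scenario names here are illustrative.

```python
# One plausible shape for the grouped suites; scenario names are illustrative.
TEST_SUITES: dict[str, list[str]] = {
    "environment": ["low_light", "glare", "shadow"],
    "sensor_degradation": ["motion_blur", "jpeg_compression", "sensor_noise"],
    "semantic_manipulation": ["swapped_digits", "altered_dates"],
    "document_damage": ["coffee_stain", "crease", "torn_corner", "occlusion"],
}

# A deliberate selection rather than "run everything":
selected = TEST_SUITES["sensor_degradation"] + TEST_SUITES["document_damage"]
```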
Run in progress
During execution, the UI keeps the team oriented with clear progress state and an estimated completion time. The tone is intentionally calm and informative, so users understand that the system is actively generating and evaluating scenarios.
Run summary
The summary surfaces key outcomes first: baseline versus attacked performance, degradation signals, and attack categories ranked by impact. The failed-scenario gallery then turns those metrics into concrete examples the team can use for triage and retraining decisions.
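The underlying arithmetic is simple: per-category degradation is the drop from baseline accuracy, and categories are ranked by that drop. A sketch with made-up numbers:

```python
# Sketch of the summary arithmetic with made-up numbers.
baseline_accuracy = 0.94
attacked_accuracy = {
    "document_damage": 0.71,
    "sensor_degradation": 0.78,
    "environment": 0.88,
}

# Per-category degradation is the drop from baseline accuracy.
degradation = {cat: baseline_accuracy - acc for cat, acc in attacked_accuracy.items()}

# Rank categories worst-first for the summary view.
for cat, drop in sorted(degradation.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: -{drop:.0%}")
# document_damage: -23%
# sensor_degradation: -16%
# environment: -6%
```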
Scenario preview
The detail view supports human-in-the-loop review for each case: attack
type, prompt, expected response, and model output are shown together.
This makes validation actionable and helps teams separate meaningful vulnerabilities
from noisy or invalid samples.
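A hypothetical record for one reviewable case might look like the sketch below, mirroring the fields the detail view shows side by side; the verdict field is my assumption for how a reviewer would mark real failures versus invalid samples.

```python
# Hypothetical record for one reviewable case; the verdict field is an
# assumption for how a reviewer separates real failures from bad samples.
from dataclasses import dataclass

@dataclass
class ScenarioCase:
    attack_type: str           # e.g. "coffee_stain"
    prompt: str                # instruction given to the model
    expected_response: str     # ground-truth answer
    model_output: str          # what the model actually returned
    verdict: str = "pending"   # "valid_failure" | "invalid_sample" | "pending"
```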
Sparsh Paliwal · 2026