Trio Retina¶
Turn any perception model's output into one standard, queryable world-state โ symbolic events, with a latent-vector channel built in. The model-agnostic state layer for world models.
Retina turns raw signals โ video, RTSP, files โ into a queryable world-state: readable events (zone.enter, dwell, line.cross) plus a standardized latent vec channel on the same records, on one small model-agnostic standard. The latent channel is a real, serializable interface today (attach your own embedding โ see examples/latent_vec.py); the automatic producers (V-JEPA scene + per-object ReID) are on the roadmap. Bring any model (YOLO, V-JEPA, DINO, a VLM, or none); Retina assembles its output into state a dynamics model, rule engine, or LLM can consume.
Think OpenTelemetry for perception โ it doesn't build the sensors, it normalizes any of them into one state.
Install¶
pip install trio-retina # core: numpy only
pip install 'trio-retina[yolo]' # + Ultralytics YOLO adapter
pip install 'trio-retina[video]' # + OpenCV frame source (files / RTSP / webcam)
pip install 'trio-retina[all]' # everything
Quickstart¶
Runs on a bare pip install trio-retina (numpy only) โ no model, no GPU, no video file. A stand-in detector walks one "person" across a dock zone; Retina emits the real retina.event stream:
import numpy as np
from retina import CountRule, IoUTracker, Retina, Zone, ZoneRule
from retina.detect import Detection
class ScriptedDetector:
"""A stand-in model: one 'person' walking across a dock zone."""
def __init__(self):
self._xs = list(range(0, 102, 6))
def __call__(self, frame):
x = self._xs.pop(0) if self._xs else 100
return [Detection(label="person", bbox=(x - 10, 40, x + 10, 60), confidence=0.9)]
dock = Zone("dock", [(40, 0), (60, 0), (60, 100), (40, 100)])
cam = Retina(
source_id="cam_01",
detector=ScriptedDetector(),
tracker=IoUTracker(min_hits=2),
rules=[
ZoneRule(dock, classes={"person"}, dwell_s=2.0),
CountRule(1, classes={"person"}),
],
)
frames = [(np.zeros((100, 100, 3), dtype=np.uint8), float(i)) for i in range(18)]
for event in cam.run(frames):
print(event.to_json())
# {"type":"count.threshold","t":1.0,"src":"cam_01","n":1,"frame":1,...}
# {"type":"zone.enter","t":7.0,"src":"cam_01","id":1,"label":"person",...}
# {"type":"zone.dwell","t":7.0,...,"zone":"dock","dur":2.0,...}
# {"type":"zone.exit","t":7.0,...,"zone":"dock","dur":3.0,...}
With a real model + video¶
pip install 'trio-retina[yolo]' (add [video] for the frame source), then point it at your clip:
from retina import Retina, Zone, ZoneRule, YoloDetector
from retina.sources import video_frames
dock = Zone("dock", [(0.3, 0.2), (0.7, 0.2), (0.7, 0.9), (0.3, 0.9)], normalized=True)
cam = Retina(
source_id="cam_01",
detector=YoloDetector("yolo11n.pt", classes={"person"}),
rules=[ZoneRule(dock, classes={"person"}, dwell_s=30)],
)
for event in cam.run(video_frames("your.mp4")):
print(event.to_json())
Run it in your browser โ no install¶
| notebook | what it shows |
|---|---|
| quickstart | detector โ zone / line / count / dwell events + validate() |
| camera โ webhook | a restricted-zone alert pushed to your endpoint |
| from Supervision | pipe your existing sv.Detections straight in |
More no-model examples ship with the source (not the wheel) โ git clone the repo and run python examples/quickstart.py. See also rtsp_to_webhook.py, from_supervision.py, and latent_vec.py.
๐ The world-model stack¶
Retina is the encoder (s = Enc(x)) of a world model. The whole front-to-back
seam is demonstrable end to end โ on a synthetic scene, as a small, honest proof of
concept (examples/world_model/):

Any perception model on top, any dynamics model underneath, meeting on one standard WorldState โ Retina is the constant in the middle.
1 ยท Swap the encoder, the state is constant. The same pipeline run three ways โ
symbolic only, + DinoV2Embedder (per-object entity.vec), + VJepa2Embedder
(scene-level ws.scene) โ yields the identical WorldState schema; only which model
filled the latent changes.
2 ยท A dynamics model imagines the future off that state. A small transformer
trained offline on recorded WorldState sequences predicts where each entity is headed.
The honest ablation โ does Retina's appearance latent actually help? โ on held-out
data with real DINOv2 vecs (mean 7-step position error, px, lower is better):
| dynamics input | 7-step error |
|---|---|
| constant-velocity baseline | 7.68 px |
| learned, pos-only | 1.45 px |
| learned, pos + appearance latent | 1.33 px |
The latent channel measurably improves prediction โ +83% over constant-velocity,
+8% over pos-only, widening with the horizon. Full grid in
BENCHMARK.md.

Raw video โ one standardized Retina WorldState โ predicted player runs. Left is a real broadcast clip (Roboflow's MIT-licensed sports sample, originally DFL Bundesliga); it runs through a real YOLO detector + tracker and a frozen DINOv2-small appearance encoder, out as one model-agnostic WorldState, rendered right as a top-down tactical radar. The dynamics transformer draws each player's predicted next run ahead in indigo (gray = past). Teams are coloured by clustering the DINOv2 appearance vectors into two groups โ the latent knows who's who. Honest by design โ player motion is stochastic, so at this short horizon the learned model roughly ties a constant-velocity baseline on held-out error; the appearance latent's measurable win lives in the cleaner synthetic ablation above. Real pipeline end to end.
3 ยท Front + back compose through one standard โ any encoder in front, any dynamics
behind, meeting on one serializable state. See
end_to_end.py.
Where to go next¶
- Concepts โ the mental model in one read:
FrameโDetectionโTrackโEvent, the dual symbolic + latent state, and where Retina sits in the stack. - Cookbook โ runnable task recipes (zone intrusion โ webhook, counting / line-crossing, Supervision interop, latent
vec, validation, the CLI). - CLI โ
retina demo/run/validate/bench. - Extend โ add your own detector / tracker / rule / sink behind the tiny
Protocols. - FAQ โ which extra to install, "no events?", RTSP reconnect, CPU vs GPU, where examples live.
- Event spec โ the tiny, JWT-style
retina.eventinterchange format. - Design notes โ why Retina is the encoder of a world model, and what it deliberately does not do.
- API reference โ the public Python API, generated from docstrings.
No footage to test the video path? retina.sample_video() returns a tiny generated clip; retina.sample_events() returns a bundled retina.event sample for validate and the CLI โ both work offline.
For the full landing page (demos, supported models, comparisons, roadmap), see the README on GitHub.