API reference

Generated from the docstrings of the public retina package.

retina

Retina — turn camera streams into event streams.

A small, model-agnostic, hardware-neutral library for the Signal -> Event layer: one level above object detection (Supervision gives you boxes; Retina gives you "person entered the dock and dwelled 31s"), and one level below domain judgment.

Quickstart (3 lines, any model):

from retina import Retina, Zone, ZoneRule, YoloDetector
from retina.sources import video_frames

dock = Zone("dock", [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9)], normalized=True)
cam = Retina(
    source_id="cam_01",
    detector=YoloDetector("yolo11n.pt", classes={"person"}),
    rules=[ZoneRule(dock, classes={"person"}, dwell_s=30)],
)
for event in cam.run(video_frames("dock.mp4")):
    print(event.to_json())

Compose models like n8n / LCEL (no GUI):

from retina import IoUTracker, JsonlSink  # YoloDetector and dock as in the quickstart
pipe = YoloDetector("yolo11n.pt") | IoUTracker() | ZoneRule(dock) | JsonlSink("e.jsonl")

CallableDetector

Bases: Pipeable

Wrap a plain function as a Detector, optionally filtering classes / confidence. Lets you plug any model in one line.

CountRule

Bases: _RuleBase

Emits count.threshold when the number of tracked objects (optionally restricted to a zone and/or to given classes) crosses threshold. Edge-triggered: fires once when the predicate flips true, and re-arms when it goes false.
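
A minimal sketch of the edge-triggered behaviour described above, not Retina's implementation. The class name EdgeTrigger and the use of >= for "crosses" are assumptions made for illustration.

```python
class EdgeTrigger:
    """Fire once when a predicate flips true; re-arm when it goes false."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._armed = True  # ready to fire on the next rising edge

    def update(self, count: int) -> bool:
        above = count >= self.threshold  # assumption: ">=" counts as crossing
        if above and self._armed:
            self._armed = False  # fire once, then stay quiet while above
            return True
        if not above:
            self._armed = True  # predicate went false: re-arm
        return False


trigger = EdgeTrigger(threshold=3)
# Per-frame object counts; only the rising edges should fire.
fired = [trigger.update(n) for n in [1, 3, 4, 2, 3]]
```

Here fired is True only on the second and last counts: the run 3, 4 fires once, and the dip to 2 re-arms the trigger.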

Detection dataclass

One object found in one frame.

from_supervision(detections, class_names=None) classmethod

Ingest a Roboflow Supervision sv.Detections → list[Detection].

Supervision is the de facto interop format that 20+ CV libraries convert into, so anyone already using it can pipe straight into Retina's event layer. Retina never imports supervision; the object is read by duck typing: .xyxy (Nx4 [x1,y1,x2,y2]), .confidence (N or None), .class_id (N or None), .data (dict, may hold "class_name").

Label resolution, per row i: prefer data["class_name"][i]; else map class_id[i] through class_names (dict or list); else str(class_id[i]); else "". Missing confidence falls back to the Detection default.
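
The fallback chain can be sketched as a standalone function. This is an illustration of the order described above, not the library's code; resolve_label and its signature are hypothetical, and plain lists stand in for the array attributes of sv.Detections.

```python
def resolve_label(i, data=None, class_id=None, class_names=None):
    """Resolve the label for row i using the documented fallback order."""
    names = (data or {}).get("class_name")
    if names is not None:
        return str(names[i])  # 1. explicit per-row class name
    if class_id is None:
        return ""  # 4. nothing to go on
    cid = int(class_id[i])
    if isinstance(class_names, dict) and cid in class_names:
        return str(class_names[cid])  # 2a. dict lookup
    if isinstance(class_names, (list, tuple)) and 0 <= cid < len(class_names):
        return str(class_names[cid])  # 2b. list index
    return str(cid)  # 3. fall back to the numeric id as text
```

For example, resolve_label(0, class_id=[7]) yields "7", while passing class_names={7: "truck"} yields "truck".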

Detector

Bases: Protocol

Any object/callable that turns a frame into detections.

A frame is an HxWx3 uint8 numpy array (or whatever your detector accepts — Retina just passes it through).

DetectorNode

Bases: Node

Run a detector on the frame image; fill frame.detections.

DinoV2Embedder

Bases: Pipeable

Frozen DINOv2 per-object embedder — the first real vec producer.

Callable enricher: for each track it crops frame.image[y1:y2, x1:x2], runs DINOv2 over all crops in one batched forward pass, and attaches the L2-normalized embedding as track.user["vec"] = Vec(...).to_dict(). From there WorldState.from_frame lifts it onto entity.vec.

size picks the backbone: small (dim 384, default), base (768), large (1024). device="auto" selects mps → cuda → cpu. Set bgr=True for OpenCV frames (cv2 is BGR); synthetic / RGB frames keep the default False. Empty or out-of-bounds crops are skipped (clamped to image bounds).
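
One way to read "clamped to image bounds, skipped when empty" is sketched below. clamp_crop is a hypothetical helper, and the integer truncation of box coordinates is an assumption; the embedder's actual rounding is not documented here.

```python
def clamp_crop(bbox, width, height):
    """Clamp [x1, y1, x2, y2] to the image; return None if nothing is left."""
    x1, y1, x2, y2 = bbox
    x1, x2 = max(0, int(x1)), min(width, int(x2))
    y1, y2 = max(0, int(y1)), min(height, int(y2))
    if x2 <= x1 or y2 <= y1:
        return None  # empty or fully out-of-bounds crop: skip this track
    return x1, y1, x2, y2
```

The returned tuple can be used directly as image[y1:y2, x1:x2] on an HxWx3 array.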

EnricherNode

Bases: Node

Run a function on the frame and merge its result into frame.user.

The seam for a VLM describe, a classifier, or a V-JEPA novelty score. fn takes the Frame and returns a dict (merged into frame.user) or any value (stored under key).

Entity dataclass

One thing present in the scene: a symbolic core (+ optional latent vec).

Event dataclass

One thing that happened. Serializes to the minimal JWT-style form.

to_dict()

Flat dict, null/empty fields omitted, custom ext merged in.
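
A rough sketch of that flattening, assuming "empty" means None, "", [] or {} and that ext keys are merged last. Both points are assumptions, and to_flat_dict is a hypothetical name rather than the method itself.

```python
def to_flat_dict(fields, ext=None):
    """Drop null/empty values, then merge custom ext keys into the same level."""
    empty = (None, "", [], {})
    out = {k: v for k, v in fields.items() if not any(v is e or v == e for e in empty)}
    out.update(ext or {})  # assumption: ext is merged after the core fields
    return out


event = to_flat_dict(
    {"type": "zone.dwell", "zone": "dock", "dir": None, "dur": 31.0, "tags": []},
    ext={"site": "A"},
)
```

The dir and tags keys are omitted from event because they are null and empty respectively; numeric zeros would be kept.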

EventType

The closed primitive vocabulary for 0.1 (see SPEC.md).

Frame dataclass

Append-only enrichment unit flowing through the pipeline.

Stages attach to it: the detector fills detections, the tracker fills tracks, the rules fill events. user is an open extension slot.

GateNode

Bases: Node

Drop the frame (skip everything downstream) when the gate says don't look.

GroundingDinoDetector

Bases: Pipeable

Open-vocabulary detection from a text prompt via Grounding DINO (HF transformers). pip install 'trio-retina[grounding]'. Detects any classes you name — no training. Heavy (torch); not imported unless instantiated.

IoUTracker

Bases: Pipeable

Greedy IoU association — small, deterministic, zero extra deps.

A detection matches the highest-IoU live track of the same class above iou_threshold. Tracks survive max_missed frames of occlusion and become confirmed after min_hits hits (so transient noise never fires events).
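
The association step can be sketched as follows. This is an illustrative reduction, not IoUTracker's code: greedy_match, the 0.3 default threshold, and processing detections in list order are assumptions, and track lifecycle (max_missed, min_hits) is left out.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(detections, tracks, iou_threshold=0.3):
    """detections: list of (label, box); tracks: {track_id: (label, box)}.

    Returns {detection_index: track_id} for every detection that found a track.
    """
    matches, used = {}, set()
    for di, (label, box) in enumerate(detections):
        best_id, best_iou = None, iou_threshold
        for tid, (tlabel, tbox) in tracks.items():
            if tid in used or tlabel != label:
                continue  # each track matches once, and only within its class
            score = iou(box, tbox)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is not None:
            matches[di] = best_id
            used.add(best_id)
    return matches


tracks = {1: ("person", [0, 0, 10, 10]), 2: ("person", [20, 20, 30, 30])}
detections = [
    ("person", [1, 1, 11, 11]),
    ("car", [20, 20, 30, 30]),
    ("person", [21, 21, 31, 31]),
]
matches = greedy_match(detections, tracks)
```

The car detection overlaps track 2 perfectly but stays unmatched because the classes differ; in a full tracker it would start a new track.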

JsonlSink

Bases: _SinkPipeable

Append events to a JSONL file as they arrive (streaming).

Line dataclass

A directed tripwire a->b. Crossing direction is reported relative to it.

scaled(size)

The (a, b) endpoints in pixel coords; scale once per frame and reuse.

LineRule

Bases: _RuleBase

Emits line.cross when a track's centroid crosses the tripwire. dir is a_to_b or b_to_a, according to the side the track moved toward.

Requires tracked input (each track carries an id and prev_centroid), per the standard — line.cross is meaningless without object identity.

min_frames (default 1) is a jitter debounce, like Supervision's LineZone.minimum_crossing_threshold. With min_frames=1 the rule is stateless and emits the instant the prev→curr centroid segment intersects the line (the original behavior). With min_frames > 1, a crossing is pending once the segment intersects, and is confirmed and emitted only after the track has stayed continuously on the new side for min_frames frames (including the crossing frame). If the track returns to the original side before then, the crossing is discarded as jitter and nothing is emitted. The event fires on the frame the crossing is confirmed, carrying the direction of the original crossing (and that frame's t / box).
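
The debounce can be sketched on a simplified input: one track's per-frame side of the line, encoded as +1 or -1. That encoding, the function name confirmed_crossings, and the omission of the actual segment-intersection test are all assumptions made to keep the example short; it is not the rule's implementation.

```python
def confirmed_crossings(sides, min_frames=1):
    """Return [(frame_index, new_side)] for crossings that survive the debounce."""
    if not sides:
        return []
    events = []
    current = sides[0]  # side the track is currently considered to be on
    pending, run = None, 0
    for i in range(1, len(sides)):
        s = sides[i]
        if pending is None:
            if s != current:
                pending, run = s, 1  # the crossing frame itself counts
        elif s == pending:
            run += 1
        else:
            pending, run = None, 0  # returned to the original side: jitter
        if pending is not None and run >= min_frames:
            events.append((i, pending))  # fires on the confirming frame
            current, pending, run = pending, None, 0
    return events


jittery = [-1, -1, 1, -1, 1, 1, 1]
instant = confirmed_crossings(jittery, min_frames=1)
debounced = confirmed_crossings(jittery, min_frames=3)
```

With min_frames=1 every flip is reported, including the one-frame excursion; with min_frames=3 only the final crossing is reported, on the third consecutive frame spent on the new side.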

MotionGate

Look only when the frame changed from the previous one (mean abs diff).
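
A small sketch of a mean-absolute-difference gate over flat, non-empty pixel lists. SimpleMotionGate, its threshold value, and the choice to always look at the first frame are assumptions; a real gate would typically operate on numpy arrays.

```python
def mean_abs_diff(prev, curr):
    """Mean absolute per-pixel difference between two equal-length sequences."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


class SimpleMotionGate:
    def __init__(self, threshold=2.0):
        self.threshold = threshold
        self._prev = None

    def should_look(self, pixels):
        prev, self._prev = self._prev, list(pixels)
        if prev is None:
            return True  # assumption: no reference yet, so look at the first frame
        return mean_abs_diff(prev, pixels) > self.threshold


gate = SimpleMotionGate(threshold=2.0)
decisions = [
    gate.should_look([10, 10, 10, 10]),  # first frame
    gate.should_look([10, 11, 10, 10]),  # mean diff 0.25
    gate.should_look([50, 50, 10, 10]),  # mean diff 19.75
]
```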

Node

Bases: Pipeable

A pipeline step: Frame -> Frame (or None to drop the frame).

NorfairTracker

Bases: Pipeable

Norfair adapter — pure-Python Kalman tracking with re-association, better ID stability through occlusion than IoUTracker. pip install 'trio-retina[norfair]'.

Surfaces only tracks detected this frame (coasting/occluded ones are kept internally for re-association but not returned, so occupancy/dwell stay honest).

Pipeable

Mixin giving | composition. Subclasses implement to_node().

Pipeline

A linear chain of nodes. Each frame flows through every node in order; a node returning None drops the frame (the rest of the chain is skipped).
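
The chain semantics, together with | composition, can be illustrated with two stand-in classes. Step and Chain are hypothetical names used only for this sketch; they are not Retina's Node and Pipeline, and frames are plain dicts here.

```python
class Step:
    """A pipeline step wrapping a function frame -> frame (or None to drop)."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        return Chain([self, other])


class Chain:
    """A linear chain; a step returning None short-circuits the rest."""

    def __init__(self, steps):
        self.steps = list(steps)

    def __or__(self, other):
        return Chain(self.steps + [other])

    def process(self, frame):
        for step in self.steps:
            frame = step.fn(frame)
            if frame is None:
                return None  # dropped: downstream steps never run
        return frame


drop_odd = Step(lambda f: f if f["x"] % 2 == 0 else None)
double = Step(lambda f: {**f, "x": f["x"] * 2})
tag = Step(lambda f: {**f, "seen": True})

chain = drop_odd | double | tag
kept = chain.process({"x": 2})
dropped = chain.process({"x": 3})
```

kept carries the results of every step, while dropped is None because the first step rejected the frame.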

process(image, t, *, frame_num=None)

Run one (image, timestamp) through the chain; return the enriched Frame.

frame_num defaults to an internal monotonic counter; pass the true source frame index if you have it (a Pipeline is single-stream/stateful).

run(frames)

Stream events from an iterable of (image, timestamp) pairs.

run_states(frames)

Stream a WorldState snapshot per frame — the assembled-state channel (entities + relations + scene), alongside run()'s event stream.

Relation dataclass

A typed, directed relation between two entities (subj -predicate-> obj).

family is an optional coarse grouping (spatial / social / functional …) above the specific predicate.

Retina

Sugar over Pipeline for the common detector -> tracker -> rules case.

RuleNode

Bases: Node

Run an event rule over the tracks; append to frame.events.

SinkNode

Bases: Node

Emit each event on the frame to a sink (jsonl/webhook/kafka/...).

Track dataclass

A detected object followed across frames.

bbox is the tracker's current box; det_bbox preserves the raw detector box (they differ once a Kalman/DCF tracker predicts). user is an open extension slot for downstream code.

TrackerNode

Bases: Node

Give detections identity over time; fill frame.tracks.

VJepa2Embedder

Bases: Pipeable

Frozen V-JEPA 2 scene-level embedder — the first real scene producer.

V-JEPA 2 is a self-supervised video encoder, so this is not a per-frame op: it keeps a rolling buffer of the last clip_len frame images and, once full, runs V-JEPA 2 over the whole clip, mean-pools the patch/temporal tokens to a single vector, and attaches it as frame.user["scene"] = Vec(...).to_dict(). WorldState.from_frame then lifts it onto ws.scene — symmetric with how DinoV2Embedder fills entity.vec. Before the buffer fills, the frame passes through untouched (no scene yet). The buffer slides by one frame thereafter, so every frame from clip_len on carries a fresh scene latent.
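
The buffering behaviour can be sketched with a bounded deque. RollingClip is a hypothetical stand-in that returns the clip instead of running a model, so it shows only when a scene latent would be produced.

```python
from collections import deque


class RollingClip:
    """Keep the last clip_len frames; yield a full clip once the buffer fills."""

    def __init__(self, clip_len=16):
        self.buffer = deque(maxlen=clip_len)

    def push(self, image):
        self.buffer.append(image)  # oldest frame falls off automatically
        if len(self.buffer) < self.buffer.maxlen:
            return None  # still warming up: no scene for this frame
        return list(self.buffer)


clips = RollingClip(clip_len=3)
out = [clips.push(i) for i in range(5)]  # integers stand in for frame images
```

The first clip_len - 1 pushes return None; every push after that returns a window that has slid forward by one frame.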

clip_len is the number of frames per clip (default 16). device="auto" selects mps → cuda → cpu. normalize=True L2-normalizes the pooled vector. Set bgr=True for OpenCV frames (cv2 is BGR); synthetic / RGB frames keep the default False.

Needs the extra (pulls torch + transformers + pillow, downloads V-JEPA 2 weights): pip install 'trio-retina[vjepa]'.

Vec dataclass

A model-tagged latent. Small vectors ride values inline; large or re-embeddable ones ride ref by reference. Always tagged {model, dim}.

VlmDetector

Bases: Pipeable

Use ANY vision-language model as a detector.

You pass a client(image, prompt) -> iterable of dicts, where each dict has label, box = [x1, y1, x2, y2] (pixels), and optional score. VlmDetector just maps that into Detections — so Qwen-VL, Gemini, GPT-4o, Claude, or a local VLM all plug in behind the same seam. The client is yours (an OpenAI-compatible call, an HTTP request, etc.); keep grounding/JSON parsing there. A VLM can also be used as an EnricherNode/event source — see docs.

WebhookSink

Bases: _SinkPipeable

POST each event as JSON to a URL. Uses urllib (stdlib) — no requests dep.

Only http/https URLs are accepted. The URL may come from a workflow.json (a trusted operator input — see SECURITY.md), but we still reject other schemes (file://, ftp://, …) so a stray config can't make urllib read a local file or hit an unexpected protocol.
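
A scheme allow-list of this kind can be written with the standard library alone. check_webhook_url is a hypothetical helper shown for illustration; the sink's actual error type and message may differ.

```python
from urllib.parse import urlparse


def check_webhook_url(url: str) -> str:
    """Return url unchanged if it is http/https, else raise ValueError."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme {scheme!r}; use http or https")
    return url
```

Checking before the request is made means a file:// or ftp:// URL never reaches urllib's opener.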

WorldState dataclass

The assembled snapshot: entities present, their relations, scene latent.

from_frame(frame) classmethod

Assemble a WorldState from a Frame: each track becomes an entity.

Maps the symbolic core (id/type/bbox/conf) straight off the track; if a per-object latent was attached upstream (in track.user["vec"] as a dict), it rides along as the entity's vec. A scene-level latent (e.g. a frozen V-JEPA scene encoder) attaches symmetrically: if frame.user["scene"] is a dict, it lifts onto ws.scene. Relations default empty — filled by a higher stage (a relation extractor).

to_dict()

Minimal dict, null/empty fields omitted — the smallest is {src, t}.

WorldStateNode

Bases: Node

Assemble a WorldState snapshot from the frame's tracks and store it on frame.user[key], so the state channel flows through the same composable pipeline as events. Read it off frame.user or via Pipeline.run_states().

YoloDetector

Bases: Pipeable

Optional Ultralytics YOLO adapter. pip install 'trio-retina[yolo]'.

Loads any Ultralytics weights — YOLOv5/8/9/10/11/12, YOLO-World, RT-DETR — so swapping models is just a different weights string. Not imported unless you instantiate it, so the base package stays light.

Zone dataclass

A polygonal region of interest.

scaled(size)

The polygon in pixel coords. For a normalized zone, multiply by the frame size; compute this once per frame and reuse across objects.
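
A sketch of scaling a normalized polygon and testing a point against it. It assumes size is (width, height); scale_polygon and point_in_polygon are hypothetical helpers, and the ray-casting test is a common choice rather than a documented detail of Zone.

```python
def scale_polygon(points, size):
    """Scale normalized (x, y) vertices to pixels. Assumes size = (width, height)."""
    w, h = size
    return [(x * w, y * h) for x, y in points]


def point_in_polygon(p, polygon):
    """Ray-casting test: count edge crossings of a ray going right from p."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


normalized = [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]
pixels = scale_polygon(normalized, (640, 480))  # compute once per frame
hits = [point_in_polygon(p, pixels) for p in [(320, 240), (10, 10)]]
```

Scaling once and reusing pixels for every object in the frame avoids repeating the multiplication per track.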

ZoneRule

Bases: _RuleBase

Emits zone.enter on entry, zone.exit on departure, and zone.dwell once a track has stayed dwell_s seconds inside (fires once per visit).

exit_grace_s keeps a track logically inside until it has been out-of-zone or absent for that long (rides out detection blips / id flicker without a spurious exit; the exit dur is measured to the last frame seen inside). anchor picks the body-point tested against the polygon: center (default, the centroid), feet (bottom-center of bbox), or head (top-center).
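
The three anchors reduce to simple arithmetic on the box. anchor_point below is a hypothetical helper that assumes image coordinates with y growing downward, so the bottom edge is y2.

```python
def anchor_point(bbox, anchor="center"):
    """Pick the body-point of [x1, y1, x2, y2] to test against a zone."""
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2
    if anchor == "center":
        return (cx, (y1 + y2) / 2)
    if anchor == "feet":
        return (cx, y2)  # bottom-center (y grows downward)
    if anchor == "head":
        return (cx, y1)  # top-center
    raise ValueError(f"unknown anchor {anchor!r}")
```

feet is usually the better choice for floor-plane zones seen from an angled camera, since the centroid of a tall box can sit well outside the area the person is standing in.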

event_f1(pred, ref, **kw)

Precision / recall / F1 between predicted and reference events.

load_schema()

The formal JSON Schema (draft 2020-12) for retina.event.

match_events(pred, ref, *, time_tol=2.0, keys=('type', 'zone', 'dir'))

Greedy nearest-in-time matching. Returns (tp, fp, fn).

A predicted event matches an unused reference event with identical keys and the smallest |t_pred - t_ref| within time_tol.
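
The matching rule can be sketched over plain dicts. This is an illustration under assumptions, not the library function: events are dicts with a numeric t, "within time_tol" is read as <=, predictions are processed in list order, and greedy_match_events and prf are hypothetical names.

```python
def greedy_match_events(pred, ref, *, time_tol=2.0, keys=("type", "zone", "dir")):
    """Return (tp, fp, fn) using greedy nearest-in-time matching."""
    used = set()
    tp = 0
    for p in pred:
        best, best_dt = None, None
        for j, r in enumerate(ref):
            if j in used or any(p.get(k) != r.get(k) for k in keys):
                continue  # reference already taken, or keys differ
            dt = abs(p["t"] - r["t"])
            if dt <= time_tol and (best_dt is None or dt < best_dt):
                best, best_dt = j, dt
        if best is not None:
            used.add(best)
            tp += 1
    return tp, len(pred) - tp, len(ref) - tp


def prf(tp, fp, fn):
    """Precision, recall and F1 from the counts above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pred = [
    {"type": "zone.enter", "zone": "dock", "t": 10.0},
    {"type": "zone.exit", "zone": "dock", "t": 50.0},
    {"type": "line.cross", "dir": "a_to_b", "t": 5.0},
]
ref = [
    {"type": "zone.enter", "zone": "dock", "t": 11.0},
    {"type": "zone.exit", "zone": "dock", "t": 40.0},
]
counts = greedy_match_events(pred, ref)
scores = prf(*counts)
```

Only the zone.enter pair matches: the zone.exit events are 10 s apart, beyond the 2 s tolerance, and the line.cross has no counterpart in ref.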

register_node(type_name, builder)

Register a custom node type for declarative from_json workflows.

The registry is a module global, so a registration is process-wide.

sample_events()

Return a filesystem path to the bundled sample retina.event JSONL.

The file ships inside the package (retina/_assets/sample_events.jsonl), so this works offline the instant the wheel is installed: no network, no licensing risk. It is the five-event synthetic dock scene from retina demo (count threshold → zone enter → dwell → line cross → exit), handy for trying the event format, validate(), or the CLI:

retina validate "$(python -c 'import retina; print(retina.sample_events())')"

Returns the path as a str. The path is stable for the life of the process; treat the file as read-only (it lives inside the install).

sample_video(*, force=False)

Return a path to a small sample video clip, cached per-user.

What this is. A synthetic clip — deterministic moving shapes (a couple of coloured rectangles drifting across a dark background) — generated once with OpenCV and cached under ~/.cache/trio-retina/. It exists to exercise the video-source plumbing end to end with zero network and zero third-party-footage licensing risk: video_frames(retina.sample_video()), retina run workflow.json "$(... sample_video ...)", frame striding, EOF handling, and so on.

What this is NOT. It is not real-world footage, so a real object detector (YoloDetector) will find no people/vehicles in it — there are none. For the YOLO-on-real-footage path, point Retina at your own clip (video_frames("your.mp4")); the synthetic clip only verifies the wiring.

Writing the clip needs OpenCV (the [video] extra). The first call writes and caches it; later calls return the cached path immediately. Pass force=True to regenerate.

Raises RuntimeError with a clear [video] hint if OpenCV is missing and the clip is not already cached.

to_jsonl(events, path)

Write events to a JSONL file. Returns the count written.

to_node(x)

Coerce a step into a Node: pass Nodes through, auto-wrap pipeable domain objects (detector/tracker/rule/sink) via their to_node().

validate(event)

Return a list of problems (empty = valid). Accepts an Event or a dict.