AI agent for home security camera monitoring: two different shapes of that problem.
Most pages on this query describe the same product category: a self-hosted NVR that ingests RTSP and runs a detector on a TPU or a GPU. That is the right answer for one shape of the problem, the one where you own the cameras. There is a separate, smaller shape: cloud cameras whose vendor will not hand you a stream, watched on a Mac, that need a soft, scheduled second pair of eyes. Different problem, different tool.
Direct answer (verified 2026-05-03)
For continuous frame-level detection on RTSP feeds from cameras you own, run a self-hosted NVR with built-in AI: Frigate, Scrypted with detect plugins, or BlueIris with CodeProject.AI. For a soft, scheduled second pair of eyes on a vendor cloud camera dashboard you already watch on your Mac (Ring, Nest, Wyze, Eufy, Arlo), a general-purpose computer-use agent with a recurring routine fits the job. It does not replace an NVR. The two cover different shapes of the same word "monitor" and the choice between them is a property of your cameras, not of the software.
NVR side reference: Frigate documentation. Mac-side mechanism described below at file paths inside the open-source Fazm repository.
The split that every other guide misses
Read the top results for this topic and they all answer one shape of the question. The shape is: I own one or more cameras that speak RTSP or ONVIF, I want to point a piece of software at them that detects events on every frame, and I want it on my LAN. That is real, it is solved, and Frigate plus a Coral TPU is the community-default kit for it.
The shape that those guides do not address is the one that, in practice, more households actually have. Ring, Nest, Wyze, Eufy and Arlo all encrypt the stream to the vendor cloud. The customer does not have a usable RTSP URL. There are bridges (go2rtc with a Wyze plug-in, scrypted-nest, wyze-bridge) but they break on firmware updates and they generally violate the vendor terms of service, which the typical small-business or family customer is not going to take on. The realistic loop those people have today is: open a browser tab on a Mac, look at the dashboard, glance at it again later. They do not have a frame to feed into a detector. They have a window to read.
That second shape is what this page is about. Not because it is better than an NVR, but because it is a different problem and the tool that fits it is different.
Two shapes, two tools
The choice is determined by what your cameras let you have. If you have RTSP, the left column is the right answer. If you have a vendor dashboard, the right column is the closer fit.
| Feature | Self-hosted NVR with built-in AI (Frigate, Scrypted, BlueIris) | Computer-use agent on Mac (e.g. Fazm) with a routine |
|---|---|---|
| Input | RTSP / ONVIF stream from a camera you own | A vendor web dashboard rendered in a browser tab |
| Detection cadence | Every frame, 5-15 fps, sub-second event latency | Every routine fire, typical interval 5-30 minutes |
| Compute | Coral TPU, small NVIDIA GPU, or Apple Neural Engine on a separate box | Your Mac CPU plus a hosted LLM call per fire |
| Cost shape | One-time hardware, then near-zero marginal cost | Per-fire LLM tokens, runs while your Mac is awake |
| What it can detect | Person, vehicle, package, animal, license plate, with bounding boxes | Whatever the dashboard already shows or implies in text plus optional vision |
| Setup | A Linux box, a Coral or GPU, RTSP URLs from each camera, container config | Open the dashboard tab, ask the agent in plain language |
| Survives Mac sleep | Yes, runs on a separate always-on device | No, stops while the Mac is asleep or logged out |
| Privacy line | Local by default, cloud only if you opt in | Whatever the dashboard already sends to the cloud, plus the LLM call |
The two are not competing for the same job. A household with one Ring doorbell and a Nest cam wants the right column. A household with five Reolink RLC cameras on a PoE switch wants the left column. A household with both runs both.
What "watch on a schedule" looks like in practice
Most existing guides on AI agents for cameras describe a continuous ingestion pipeline. The schedule-based shape is structurally different: there is no stream, there is a periodic glance, and the output is a written summary, not a bounding box.
The two pipelines, side by side
Camera publishes an RTSP stream. NVR pulls the stream continuously, decodes it, samples a frame at 5-15 fps, runs a YOLO-class detector on a Coral TPU or a small GPU, gets bounding boxes for person / vehicle / package, writes events to local disk, fires a webhook (Home Assistant, MQTT, ntfy) when an event matches a rule. Latency from real-world event to your phone vibrating is under a second.
- Requires RTSP / ONVIF (vendor cloud cameras don't give you this)
- Runs on a separate always-on box, not your Mac
- Bounding boxes, not natural-language summaries
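For the left column, the continuous pipeline above maps onto a short Frigate config. The sketch below is illustrative, not copy-paste ready: the camera name, broker address, RTSP URL and credentials are placeholders, and config keys shift between Frigate versions, so check the docs for yours.

```yaml
mqtt:
  host: 192.168.1.10            # placeholder: your MQTT broker (Home Assistant's works)

cameras:
  front_door:                   # placeholder camera name
    ffmpeg:
      inputs:
        - path: rtsp://user:pass@192.168.1.50:554/h264   # placeholder RTSP URL
          roles:
            - detect
    detect:
      width: 1280
      height: 720
      fps: 5                    # sample rate for the detector, not the recording
    objects:
      track:
        - person
        - car
        - dog
```

This is the whole point of the left column: one file, every frame sampled at the `fps` you set, events out over MQTT, no model tokens spent.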
The mechanism on the Mac side, with file paths
Calling something "AI agent monitoring" is meaningless without the mechanism. Here is what actually happens, in the open-source Fazm codebase, when a routine fires. None of the file paths below are invented; they are in github.com/m13v/fazm.
One fire of a routine
1. launchd polls: `com.m13v.fazm-routines`, every 60s
2. Find due rows: `cron_jobs WHERE next_run_at <= now`
3. Spawn ACP runner: `acp-bridge/src/cron-runner.mjs`
4. Read the screen: AX tree, optional vision
5. Write back: `cron_runs` row + `chat_messages`
The schedule string format is parsed in acp-bridge/src/schedule.mjs at the function computeNextRun (line 14). It accepts three shapes: cron:0 9 * * 1-5 for 5-field cron, every:1800 for an interval in seconds, and at:2026-04-30T18:00:00Z for a one-shot ISO timestamp. Storage is the per-user SQLite database at ~/Library/Application Support/Fazm/users/<UUID>/fazm.db in the cron_jobs and cron_runs tables (added in the fazmV7 migration).
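To make the three shapes concrete, here is a rough Python re-implementation of that parser. The real one is `computeNextRun` in `acp-bridge/src/schedule.mjs`; this sketch is not its code, and the cron branch is simplified (no `*/n` steps), but it shows the semantics of the three prefixes and the 60-second poll floor:

```python
from datetime import datetime, timedelta, timezone

POLL_INTERVAL = 60  # the launchd poller wakes once a minute

def _field_matches(field: str, value: int) -> bool:
    # supports "*", single numbers, ranges like "1-5", and comma lists;
    # step syntax ("*/2") is omitted from this sketch
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def compute_next_run(schedule: str, now: datetime):
    """Next fire time for the three schedule shapes, or None for an
    expired one-shot. Sketch only, not the shipping parser."""
    if schedule.startswith("every:"):
        # interval in seconds, clamped to the poll floor: an every:30
        # routine fires every 60 seconds in practice
        secs = max(int(schedule[6:]), POLL_INTERVAL)
        return now + timedelta(seconds=secs)
    if schedule.startswith("at:"):
        ts = datetime.fromisoformat(schedule[3:].replace("Z", "+00:00"))
        return ts if ts > now else None          # one-shot: never refires
    if schedule.startswith("cron:"):
        minute, hour, dom, month, dow = schedule[5:].split()
        t = now.replace(second=0, microsecond=0) + timedelta(minutes=1)
        for _ in range(366 * 24 * 60):           # scan forward at most a year
            if (_field_matches(minute, t.minute) and _field_matches(hour, t.hour)
                    and _field_matches(dom, t.day) and _field_matches(month, t.month)
                    and _field_matches(dow, t.isoweekday() % 7)):  # cron: Sunday = 0
                return t
            t += timedelta(minutes=1)
    return None
```

The interval branch is one line of arithmetic; the cron branch is where all the complexity lives, which is part of why the interval form is the default for watchers.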
The job that runs each fire is acp-bridge/src/cron-runner.mjs. It spins up the same ACP bridge that the floating bar uses, sends the routine prompt, captures the model response and the cost, writes the result into cron_runs and into a normal chat_messages row keyed by taskId="routine-<id>" so the conversation appears in chat history. There is no separate event log to read; the chat thread is the log.
What the user actually writes
The agent translates plain language to a row. The user types a request in chat; the row below is what gets written.
Routine creation
```sql
-- cron_jobs row written by routines_create
INSERT INTO cron_jobs (
  id, name, schedule, prompt, enabled,
  last_status, run_count, next_run_at
)
VALUES (
  'r_8f3a2',
  'ring-front-door-watcher',
  'every:600',
  'Look at the Ring tab in my browser. If there has been a motion event on Front Door in the last 30 minutes, summarize it. If you can see a person on the live view right now, send me a chat message that says so. Otherwise, stay quiet.',
  1,
  null,
  0,
  strftime('%s', 'now')
);
```

When this fits, and when it does not
The routine-on-Mac shape fits a narrow band of households. Outside it, an NVR is the right tool, and the honest answer is to send the reader to Frigate and not pretend otherwise.
This is the right tool when…
- Your cameras are vendor cloud cameras (Ring, Nest, Wyze, Eufy, Arlo) and you do not have a usable RTSP URL.
- Your Mac is your primary computer and is plugged in for the day, not a laptop you close at 6pm.
- You want a written summary of activity, not bounding-box-level event data.
- Polling at 5-30 minute intervals is fast enough; you accept it is not real time.
- You already pay for the vendor cloud and you do not want to add another always-on device.
- You want to express the alert rule in plain language instead of a config file.
This is the wrong tool when…
- You own the cameras and they expose RTSP or ONVIF. Run Frigate, Scrypted or BlueIris instead.
- You need sub-second event latency. Polling cadences add minutes.
- You need 24/7 coverage and your Mac sleeps. A Pi or NUC running an NVR does not sleep.
- You need bounding-box accurate object detection across thousands of frames per hour.
- You are running cameras that you specifically chose because the vendor was local-only and you do not want any cloud touch on the feed.
- You are deploying for a small business with multiple locations; an NVR per site is the right architecture.
The honest limits
Some of these are fixable on the Mac side, some of them are structural and you should know going in.
- Latency floor. The poller wakes once a minute, so an `every:30` schedule fires at 60 seconds in practice. The minimum useful cadence is around two to five minutes.
- Sleep is fatal. A laptop that closes at 6pm gives you no overnight coverage. If overnight matters, you want an NVR on a separate always-on box; this is not the trade you want to make on a MacBook.
- Vendor dashboard fragility. If Ring redesigns the dashboard layout, the agent is reading a different tree. That is a real maintenance tax compared to RTSP, which has been the same wire format since 1998.
- Per-fire model cost. Each fire is one model turn. At `every:600` that is 144 turns a day. With a vision fallback on every fire, that adds up. Without one, the agent often only needs the AX tree, which is cheap.
- No bounding boxes. The output of a turn is prose. If you need "person at coordinates 312, 188 with confidence 0.94" you need a real detector. This pattern gives you "looks like there is a delivery driver at the front door right now."
- One privacy line you do add. Whatever is on the dashboard tab gets sent to the model provider. For cloud cameras that already send video to the vendor, this is a small marginal line. For anything more sensitive, run an NVR locally and skip this pattern entirely.
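The cost arithmetic in that list is worth running for your own cadence. The per-turn prices below are placeholders for illustration, not quoted rates from any provider:

```python
def fires_per_day(interval_s: int, poll_floor_s: int = 60) -> int:
    # the 60 s poller clamps anything tighter, so every:30 bills like every:60
    return 86_400 // max(interval_s, poll_floor_s)

# placeholder per-turn costs; substitute your model's real pricing
AX_ONLY_TURN_USD = 0.002   # hypothetical cheap text-only turn
VISION_TURN_USD  = 0.02    # hypothetical vision-pass turn

fires = fires_per_day(600)                   # every:600
print(fires)                                 # 144 fires a day
print(round(fires * AX_ONLY_TURN_USD, 2))    # 0.29 a day, AX tree only
print(round(fires * VISION_TURN_USD, 2))     # 2.88 a day with vision every fire
```

The order-of-magnitude gap between the two lines is why the AX-tree-first, vision-as-fallback design matters at this cadence.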
Want help wiring this for your specific cameras?
If you are on cloud cameras and want a working scheduled watcher set up against your Mac, book a 30-minute call and we will walk through it on yours.
Frequently asked questions
If I just type 'AI agent for home security camera monitoring' into Google, what is the actual answer?
There are two answers and they are for two different problems. If you have your own cameras (Reolink, Hikvision, Amcrest, Dahua, an old Axis) and you want continuous, frame-by-frame detection of people, packages, vehicles, animals, the answer is a self-hosted NVR that ingests RTSP and runs a detector on a Coral TPU or a small GPU: Frigate, Scrypted with detect plugins, BlueIris with the CodeProject.AI plugin. If you have a cloud camera (Ring, Nest, Wyze, Eufy, Arlo) whose vendor will not give you raw RTSP, you cannot run an NVR against it; you watch a vendor dashboard. For the second shape, a general-purpose computer-use agent on the Mac you already use to watch the dashboard, scheduled to peek every few minutes and summarize, is the closer fit. It is a different product category from an NVR.
So is Fazm a security camera AI?
No. Fazm is a desktop AI agent for macOS. It controls the apps and windows on your Mac, including any browser tab, via the macOS accessibility APIs. The thing that makes it relevant to this query is that it has a recurring-task primitive (it calls them routines) and a way to read what is on screen at the moment the task fires. So if a vendor's web dashboard happens to be the only way to see your cameras, you can ask Fazm to look at it on a schedule. Frame-level detection on raw video is not its job and it would be the wrong tool for that.
What does the routine system actually do, technically?
There is a launchd agent called com.m13v.fazm-routines that polls a SQLite table named cron_jobs every 60 seconds. Rows whose next_run_at is in the past get fired. Each fire spawns acp-bridge/src/cron-runner.mjs, which spins up the same ACP bridge the floating bar uses, sends the routine prompt as a query, captures the result and the cost, and writes it back to cron_jobs and cron_runs plus a chat_messages row keyed under taskId='routine-<job-id>'. Schedule strings come in three shapes: cron:0 9 * * 1-5 for cron syntax, every:1800 for an interval in seconds, at:2026-04-30T18:00:00Z for a one-shot. The parser is computeNextRun in acp-bridge/src/schedule.mjs at line 14. The DB lives in your per-user folder at ~/Library/Application Support/Fazm/users/<UUID>/fazm.db.
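Stripped to its skeleton, that poll-and-fire loop is small enough to sketch against an in-memory SQLite table. This is not the shipping code: column names follow the article, `run_routine` stands in for spawning the real ACP runner, and the cron reschedule branch is omitted for brevity:

```python
import sqlite3
import time

def fire_due_jobs(db: sqlite3.Connection, run_routine, now=None) -> int:
    """One poller wake-up: fire every enabled job whose next_run_at is
    due, record the result, and reschedule interval jobs."""
    now = now if now is not None else int(time.time())
    due = db.execute(
        "SELECT id, schedule, prompt FROM cron_jobs "
        "WHERE enabled = 1 AND next_run_at <= ?", (now,)
    ).fetchall()
    for job_id, schedule, prompt in due:
        output = run_routine(prompt)           # stand-in for cron-runner.mjs
        db.execute(
            "INSERT INTO cron_runs (job_id, started_at, output_text) "
            "VALUES (?, ?, ?)", (job_id, now, output),
        )
        if schedule.startswith("every:"):      # interval shape: push forward
            interval = max(int(schedule[6:]), 60)   # 60 s poll floor
            db.execute(
                "UPDATE cron_jobs SET next_run_at = ?, run_count = run_count + 1 "
                "WHERE id = ?", (now + interval, job_id),
            )
        else:                                  # at: one-shots disable after firing
            db.execute("UPDATE cron_jobs SET enabled = 0 WHERE id = ?", (job_id,))
        db.commit()
    return len(due)
```

The important property is that a fire that finishes late just delays the next `next_run_at`; nothing queues up behind a slow model turn.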
Why every:1800 instead of cron syntax?
Cron is exact (this minute, this hour) and interval is relative (every N seconds from the last fire). For a watcher, the interval form is what you usually want, because the question is 'how stale am I willing for my last look to be?' not 'fire at exactly 9:03'. The shortest interval that the design tolerates is constrained by the 60-second poll: an every:30 routine is treated as every:60 in practice because the poller only wakes up once a minute. You also do not want to run the watcher tighter than the model can finish a turn, which on Claude Sonnet for a one-shot 'describe this dashboard and tell me if anything looks off' is usually 5 to 15 seconds.
What can the agent actually see when it looks at, say, the Ring web dashboard?
It reads the macOS accessibility tree of the browser window first. For Ring, Nest, Wyze and similar dashboards, the AX tree gives you the page chrome (camera labels, motion event timestamps, button names) but not the raw video frames. The agent gets a flattened text dump of the window, with one element per line, scoped to the active window. If a dashboard renders activity as text (e.g. 'Front Door: motion 2 minutes ago, person detected') the agent can read that directly. If the dashboard only shows a thumbnail with no text annotations, the agent has to fall back to a vision pass, which on macOS uses CGWindowListCreateImage to grab the window pixels and sends them to the model. Vision is slower and costs more per turn, but it works.
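The "flattened text dump, one element per line" shape is easy to picture with a toy tree. The nested-dict format, role names and labels below are invented for the example, not Fazm's actual AX serialization:

```python
def flatten_ax(node: dict, depth: int = 0) -> list[str]:
    """Depth-first walk producing one indented line per element, the
    way a dashboard's chrome reads once the AX tree is textualized."""
    label = node.get("label", "")
    line = f"{'  ' * depth}{node['role']}: {label}".rstrip(": ")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_ax(child, depth + 1))
    return lines

window = {  # toy stand-in for a cloud-camera dashboard tab
    "role": "AXWindow", "label": "Ring - Dashboard",
    "children": [
        {"role": "AXStaticText", "label": "Front Door: motion 2 minutes ago"},
        {"role": "AXStaticText", "label": "person detected"},
        {"role": "AXButton", "label": "Live View"},
    ],
}
print("\n".join(flatten_ax(window)))
```

If the motion summary shows up as an `AXStaticText` line like the ones here, the agent never needs the vision pass for that fire.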
Why is this not just 'set up Frigate'?
Because not every camera speaks RTSP. Ring, Nest, and Wyze all encrypt their streams to the vendor cloud and do not expose a documented RTSP endpoint to the LAN. There are projects that try to bridge this (go2rtc with a Wyze plugin, wyze-bridge, scrypted-nest) and they work for some users some of the time, but they break on firmware updates, they violate vendor TOS, and they require running a Linux box. For the user who already has those cameras, already pays the vendor cloud, and is on a Mac, the calculus is different: do not bridge an RTSP feed that the vendor does not want to give you, just look at the same dashboard you would look at yourself, on a schedule. Different shape, different tool.
Is this in any sense a replacement for an NVR with object detection?
No, and you should not pretend it is. NVRs running on a Coral TPU or a small NVIDIA GPU process every frame at 5-15 fps, get sub-second detection latency, store events on local disk, and run for years without pinging a cloud. A computer-use agent watching a vendor dashboard on a schedule has minutes of latency, costs API tokens per fire, depends on the vendor's web UI not changing, and stops the moment your Mac sleeps. The two coexist. People with their own RTSP cameras self-host an NVR; people on cloud cameras use a scheduled watcher; people with both run both.
Doesn't the Mac sleeping kill any of this?
Yes. By default macOS sleeps the user session after the lid closes or the inactivity timeout fires, and a launchd agent in the user domain stops being scheduled while the user is logged out. The pattern that works on a Mac you treat as 'always on' is to run it as a kiosk: power adapter connected, lid open, system sleep disabled (via `sudo pmset -a sleep 0` or the energy settings in System Settings, letting the display sleep but not the machine), and automatic login enabled so the session and keychain come back after a reboot. If your Mac sleeps every night this is not the right tool; an NVR running on a Raspberry Pi or NUC is.
What about privacy? The vendor dashboards are already in the cloud.
Right, that's the trade. If the camera is a Ring or a Nest, the vendor already has the video. The agent reading the dashboard does not change that. Where it does add a privacy line is the LLM call: every routine fire sends whatever the agent saw on screen (text plus an optional vision pass) to the model provider you configured. Fazm supports a custom API endpoint, so if you have an Anthropic-compatible local gateway or a corporate proxy you can route through it; if you point at api.anthropic.com directly, the screen content goes to Anthropic. For a security camera dashboard that is already cloud-hosted this is usually not the line that matters. For a feed you specifically chose because the vendor was local-only, this is the line that matters and you should self-host an NVR instead.
What is the simplest first routine to try?
Open the dashboard you already use in a browser tab on your Mac. Pin the tab so it survives a restart. Then ask Fazm in plain language: 'every 10 minutes, look at my Ring dashboard tab, tell me if there has been a motion event in the last 30 minutes, and if there is a person on the front door camera right now, send me a chat message about it'. The agent translates that to a routine row with schedule='every:600' and stores the prompt verbatim. You see the result in your chat history under taskId='routine-<id>'. If the result is wrong you tell the agent and it edits the routine. There is no DSL to learn; the storage format is just a row.
How do I see what each fire actually did?
Two places. The chat thread under taskId='routine-<job-id>' shows the model's response per fire, in normal conversation history. The cron_runs table in fazm.db stores per-fire status, started_at, duration_ms, cost_usd, output_text, and error_message; you can SELECT against it. There is also a per-fire log file at ~/fazm/inbox/skill/logs/routine-run-<short-id>-<timestamp>.log when the routines pipeline is installed in dev mode, which captures the full ACP bridge stderr and the runner stderr, useful when a routine starts failing silently after a vendor changes their dashboard layout.
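Since `cron_runs` is plain SQLite, the failure-triage query is one function. A sketch, with the caveat that the `job_id` column and the `'success'` status value are assumptions about the schema, not documented by the repo:

```python
import sqlite3

def recent_failures(db: sqlite3.Connection, limit: int = 10):
    """Last few non-success fires with error and cost: the first query
    to run when a routine goes quiet after a dashboard redesign.
    job_id and the 'success' status literal are assumed, not documented."""
    return db.execute(
        "SELECT job_id, started_at, duration_ms, cost_usd, error_message "
        "FROM cron_runs WHERE status != 'success' "
        "ORDER BY started_at DESC LIMIT ?", (limit,)
    ).fetchall()
```

Point it at `~/Library/Application Support/Fazm/users/<UUID>/fazm.db` (opened read-only) and the `error_message` column usually names the broken selector before the log file does.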
The mechanism behind the routine pattern, in more detail.
Adjacent reading
Accessibility tree limits beyond the browser
Why the AX tree on a desktop window is not the same shape as the one Playwright gives you, and the four boundaries you cross at once leaving Chrome.
Accessibility API vs screenshot agents
Where each shape wins on the desktop, and the apps where AX is thin enough that screenshots are the only honest fallback.
MacOS desktop agent autonomy
How much of the loop a desktop agent can run on its own before a human needs to confirm, and where the line should sit for tasks that touch sensitive screens.