Tags: Local LLM · Python · Apple Calendar · Agent · Privacy
Hardware: M1 MacBook Pro 32GB
Model: qwen3.5:9b via Ollama
I spent the last few days building SkyAgenda — a local automation agent that reads my Apple Calendar, fetches weather forecasts for upcoming event locations, and uses a locally-running LLM to generate a structured Daily Briefing, delivered to Slack every morning.
The defining constraint: all sensitive data stays on-device. No calendar events, no addresses, no personal schedule patterns ever leave the machine.
The obvious alternative would have been to wire up an external inference API. They're easier to integrate, have higher capability ceilings, and require zero infrastructure to maintain. So why not?
Because this specific use case has a hard requirement on data boundaries — and cloud APIs can't satisfy it.
Calendar data is among the most sensitive personal data you generate. It encodes not just where you're going, but when, how often, and with whom. Raw event data, location strings, and address fields don't belong outside the machine running this agent — regardless of how trustworthy a given cloud provider might be.
By running inference locally, the entire pipeline closes: calendar data local, address normalization local, text generation local. No third-party data retention policy, no audit trail you don't control, no surprise terms-of-service update that changes what leaves your machine.
The challenge for automation projects isn't running once — it's running stably for weeks. Depending on an external inference API introduces a class of failures that are difficult to predict or control: network blips, rate limits, model version silently upgraded on the provider's end, quota exhausted mid-run.
Moving inference local makes the dependency graph legible: calendar depends on Apple CalDAV, weather depends on WeatherAPI, text generation depends on local Ollama. Every failure mode is observable and bounded.
For high-frequency automation with short inputs and outputs, API token costs create a persistent low-level anxiety that subtly distorts engineering decisions. Running local means no per-token billing, which means you can write verbose debug prompts, log everything, and iterate freely — without calculating whether an extra test run is "worth it."
| | Local LLM | Cloud API |
|---|---|---|
| Data leaves device | No | Yes |
| Token cost | None | Per request |
| Latency (this task) | 4–12s | ~1–3s |
| Availability dependency | Local only | Network + provider |
| Model version control | Full | Provider-controlled |
The system is six serial steps. Each step's output is the next step's input; errors can be caught and logged at any point.
1. Load config (.env + config/*.yaml)
2. Prefetch weather (current location + watched locations, 3-day forecast)
3. Fetch calendar (nearest 3 future events via CalDAV)
4. Normalize locations (LLM: raw address → WeatherAPI-queryable string)
5. Enrich events (attach weather forecast to each event)
6. Generate briefing (LLM: structured JSON → Slack message)
The core orchestration in main.py:
# src/skyagenda/main.py
upcoming_events = await apple_calendar_client.get_nearest_future_events(limit=3)
for event in upcoming_events:
    enriched_event = await enrich_event(event)  # location analysis + weather
    output_payload["calendar"]["upcoming_3_days_enriched"].append(enriched_event)

briefing_generator = DailyBriefingGenerator(debug_log_path=briefing_debug_log)
briefing_text = await briefing_generator.generate(
    output_payload["calendar"]["upcoming_3_days_enriched"],
    followed_locations=output_payload["weather"]["followed_locations_forecast_3days"],
)
The calendar layer uses aiocaldav, but CalDAV implementations vary significantly across clients and servers — function signatures for search queries aren't standardized. To handle this, the client attempts multiple signatures in sequence until one succeeds:
# src/skyagenda/apple_calendar_client.py
attempts = [
    ((), {"start": start_date, "end": end_date, "event": True, "expand": True}),
    ((), {"start": start_date, "end": end_date, "event": True}),
    ((), {"start": start_date, "end": end_date}),
    ((), {"begin": start_date, "end": end_date}),
    ((), {"date_start": start_date, "date_end": end_date}),
    ((start_date, end_date), {}),
]
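Consuming this list is a first-success loop. A minimal sketch (the method name `date_search` and the retry-on-`TypeError` policy are assumptions; the real client may probe differently):

```python
# Hypothetical sketch of how the attempts list might be consumed.
# `calendar.date_search` stands in for whichever aiocaldav search
# method is being probed; TypeError (signature mismatch) moves on
# to the next candidate, anything else propagates.
async def search_events(calendar, attempts):
    last_error = None
    for args, kwargs in attempts:
        try:
            return await calendar.date_search(*args, **kwargs)
        except TypeError as exc:  # wrong signature -> try the next form
            last_error = exc
    raise RuntimeError("no known date_search signature worked") from last_error
```

The upside of this shape is that adding support for a new server quirk is one more tuple in the list, not a new code path.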
Event objects are also often incomplete. When fields are missing, the client falls back to parsing the raw ICS text directly, extracting SUMMARY, DTSTART, LOCATION, and DESCRIPTION manually.
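A minimal version of that ICS fallback might look like the following (a sketch: real ICS lines can be folded across physical lines, which this version does not handle):

```python
# Sketch of a raw-ICS fallback parser for the four fields the client needs.
# Property parameters (e.g. DTSTART;TZID=Asia/Tokyo:...) are stripped by
# splitting the name on ";" before matching.
def parse_ics_fields(ics_text: str) -> dict:
    wanted = {"SUMMARY", "DTSTART", "LOCATION", "DESCRIPTION"}
    fields = {}
    for line in ics_text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.split(";", 1)[0]  # drop parameters like ;TZID=...
        if name in wanted and name not in fields:
            fields[name] = value.strip()
    return fields
```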
One design decision worth noting: the query target is "nearest 3 future events", not "events in the next 3 days." This matters — calendars can be sparse. A 3-day window sometimes returns nothing. The system automatically widens the search window from 90 days → 1 year → 10 years until it finds 3 events.
async def get_nearest_future_events(self, limit: int = 3, search_days: int = 90):
    # expands the window automatically: 90d → 1yr → 10yr
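The widening logic can be sketched as follows (hypothetical synchronous helper; the real method issues CalDAV queries per window):

```python
import datetime

# Sketch of window widening: try 90 days, then 1 year, then 10 years,
# stopping as soon as at least `limit` events are found.
# `search` is any callable taking (start, end) and returning event dicts.
def nearest_future_events(search, limit=3, now=None):
    now = now or datetime.datetime.now()
    events = []
    for days in (90, 365, 3650):
        events = search(now, now + datetime.timedelta(days=days))
        if len(events) >= limit:
            break
    return sorted(events, key=lambda e: e["start"])[:limit]
```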

Calendar location fields are written for humans, not machines. Real examples from my own calendar include multi-line addresses mixing Chinese text, Japanese neighborhood names, postal codes, and building names — none of which the WeatherAPI can parse directly.
EventLocationAnalyzer handles this in two passes.
Pass 1: Send the raw event to the LLM, ask for structured JSON:
{
  "normalized_query": "Yokohama, Japan",
  "confidence": 0.92,
  "resolution_level": "city"
}
Pass 2: If the normalized query still contains CJK characters, trigger a second "English rewrite" prompt:
# src/skyagenda/event_location_analyzer.py
parsed = await self._ollama_generate_json(self._prompt(event))
if self._contains_cjk(merged["normalized_query"]):
    rewritten = await self._ollama_generate_json(
        self._english_rewrite_prompt(event, merged)
    )
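A `_contains_cjk` check can be as small as a code-point range test. A minimal sketch (the real implementation may cover more blocks, e.g. Hangul or the CJK extensions):

```python
def contains_cjk(text: str) -> bool:
    # CJK Unified Ideographs plus Hiragana/Katakana; other blocks
    # (Hangul, CJK extensions) are omitted in this minimal sketch.
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
            return True
    return False
```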
Results are cached by input text, so repeated runs with the same calendar data skip all LLM calls for this step.
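A text-keyed cache of this kind is a few lines. A sketch, assuming a JSON file on disk (the project may persist it differently):

```python
import hashlib
import json
from pathlib import Path

# Sketch of a tiny on-disk cache keyed by the raw location text, so
# repeated runs with an unchanged calendar skip the LLM entirely.
class LocationCache:
    def __init__(self, path: Path):
        self.path = path
        self.data = json.loads(path.read_text()) if path.exists() else {}

    @staticmethod
    def key(raw_location: str) -> str:
        return hashlib.sha256(raw_location.encode("utf-8")).hexdigest()

    def get(self, raw_location: str):
        return self.data.get(self.key(raw_location))

    def put(self, raw_location: str, normalized: dict) -> None:
        self.data[self.key(raw_location)] = normalized
        self.path.write_text(json.dumps(self.data, ensure_ascii=False))
```

Hashing the raw text sidesteps filesystem-unfriendly characters in multi-line, multi-script addresses.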
The weather layer is intentionally thin. WeatherClient wraps api.weatherapi.com, primarily calling the forecast endpoint:
# src/skyagenda/weather_client.py
async def get_forecast_weather(self, location, days=3):
    return await self._get_weather_data("forecast", {"q": location, "days": days})
Raw API responses are normalized to a minimal structure before being passed downstream:
{
  "condition": "Partly cloudy",
  "avg_temp_c": 12.4,
  "min_temp_c": 8.1,
  "max_temp_c": 16.7
}
Keeping this normalization layer decoupled from the LLM prompt templates means upstream API changes don't cascade into prompt rewrites.
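The normalization itself is a small projection. A sketch, assuming WeatherAPI's forecast shape (`forecast.forecastday[].day` with `avgtemp_c` / `mintemp_c` / `maxtemp_c` and `condition.text`):

```python
# Sketch of the normalization step: keep only the fields downstream
# prompt templates need, under names this project controls.
def normalize_forecast_day(day: dict) -> dict:
    return {
        "condition": day["condition"]["text"],
        "avg_temp_c": day["avgtemp_c"],
        "min_temp_c": day["mintemp_c"],
        "max_temp_c": day["maxtemp_c"],
    }
```

If the provider renames or nests a field, the fix lands here, in one function, instead of in every prompt template.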
DailyBriefingGenerator turns enriched event data into a readable daily summary. Two design decisions shaped how this works.

Early versions made two LLM calls per day — one for Highlight, one for Advice. The refactored version makes a single call that returns both in one JSON object:
payload = {
    "model": self.model,
    "prompt": self._day_brief_json_prompt(label, target_date, rows),
    "format": "json",
    "think": False,
    "stream": False,
}
This halves the number of LLM requests and, more importantly, generates Highlight and Advice within the same context window — so they're semantically consistent with each other.
JSON parsing can fail. When it does, the system doesn't raise — it falls back to regex extraction, looking for Highlight: and Advice: markers in plain-text output. Either way, every LLM request is written to daily_briefing_llm_debug.jsonl:
{
  "timestamp": "2026-03-07T18:01:05",
  "raw_response": "...",
  "done_reason": "stop",
  "eval_count": 48,
  "fallback_triggered": false
}
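The parse-then-fallback path can be sketched as a strict JSON attempt followed by marker scraping (hypothetical helper; the JSON key names `highlight` / `advice` are assumptions, and the real parser may be stricter):

```python
import json
import re

# Sketch: try strict JSON first; on failure, scrape "Highlight:" /
# "Advice:" markers from plain text. Returns (result, fallback_triggered)
# so the two paths stay distinguishable in the logs.
def parse_brief(raw: str):
    try:
        data = json.loads(raw)
        return {"highlight": data["highlight"], "advice": data["advice"]}, False
    except (json.JSONDecodeError, KeyError, TypeError):
        pass

    def grab(marker: str) -> str:
        m = re.search(rf"{marker}:\s*(.+)", raw)
        return m.group(1).strip() if m else ""

    return {"highlight": grab("Highlight"), "advice": grab("Advice")}, True
```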
This logging is what makes model upgrades safe. When output format drifts after an Ollama update, you can diagnose from logs rather than reverse-engineering from Slack messages.
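Appending one record per request is a few lines. A sketch matching the fields shown above:

```python
import datetime
import json

# Sketch of per-request JSONL debug logging. One line per LLM call,
# including whether the regex fallback fired, so the fallback path and
# the normal path never look identical in the data.
def log_llm_request(path, raw_response, done_reason, eval_count, fallback_triggered):
    record = {
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        "raw_response": raw_response,
        "done_reason": done_reason,
        "eval_count": eval_count,
        "fallback_triggered": fallback_triggered,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Append-only JSONL also means a crashed run never corrupts earlier records.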
Choosing qwen3.5:9b was a concrete tradeoff, not a vague "this model seems good."
The Ollama startup log tells the story:
offloaded 33/33 layers to GPU
total memory size="7.9 GiB"
llama runner started in 4.41 seconds
FlashAttention: Enabled
With all 33 layers offloaded to the Metal GPU, inference runs entirely on the GPU — no CPU fallback dragging things down. 7.9 GiB of unified memory leaves plenty of headroom on a 32GB machine, and cold start is under 5 seconds.
Actual request latency from production logs:
200 | 12.09s | POST /api/generate
200 | 5.04s | POST /api/generate
200 | 4.34s | POST /api/generate
200 | 4.94s | POST /api/generate
For a Daily Briefing task, 4–12 seconds is fine. The inference parameters are tuned for stability over creativity:
# src/skyagenda/daily_briefing.py
temperature = 0.5
top_p = 0.92
num_ctx = 1024
num_predict = 64
think = False
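In an Ollama `/api/generate` request, these knobs sit under the `options` key, alongside the top-level fields already shown in the briefing payload. A sketch (the prompt is a placeholder):

```python
# Sketch of how the tuning parameters map onto an Ollama /api/generate
# request body: sampling and context limits go under "options", while
# "format", "think", and "stream" are top-level fields.
payload = {
    "model": "qwen3.5:9b",
    "prompt": "...",
    "format": "json",
    "think": False,
    "stream": False,
    "options": {
        "temperature": 0.5,
        "top_p": 0.92,
        "num_ctx": 1024,
        "num_predict": 64,
    },
}
```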
Why 9B specifically? 7B models are noticeably less consistent on structured JSON output and multi-language address handling — exactly the two tasks this system depends on most. Larger models (14B+) would push memory usage into uncomfortable territory on this machine and slow things down beyond the acceptable range for a background agent. 9B sits at the right inflection point for this hardware and this task profile.
Lesson learned: A 9B model is not plug-and-play. Stable production behavior requires tuning all four levers together: model, prompt, parser, and timeout strategy. Get any one of them wrong and you'll see intermittent failures that are hard to attribute.

Early on I hit a case where Ollama returned 200 but the response field was an empty string, silently triggering fallback. The fix was disabling the thinking channel (think: false) and logging done_reason and eval_count to trace protocol-level behavior.
Without detailed logging, a fallback-path response and a normal-path response can look identical in Slack. You'll attribute quality degradation to the model when the real cause is a parse failure. The solution is to always log fallback_triggered explicitly — make the two paths distinguishable in the data.
Two small changes had an outsized effect on briefing readability: showing abbreviated location names (Yokohama instead of the full multi-line address block), and suppressing the weather row entirely for events with no location rather than showing N/A. Neither change touches core logic — both meaningfully improve the daily reading experience.
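Both tweaks fit in a few lines of the row formatter. A hypothetical sketch (field names like `location_short` are assumptions, not the project's actual schema):

```python
# Sketch of the two readability tweaks: a short location name instead of
# the full address block, and no weather row at all for location-less
# events (instead of printing "N/A").
def format_event_rows(event: dict) -> list[str]:
    rows = [f"• {event['summary']} ({event['start']})"]
    location = event.get("location_short")  # e.g. "Yokohama", not the full address
    if location and event.get("weather"):
        w = event["weather"]
        rows.append(f"  {location}: {w['condition']}, {w['min_temp_c']}–{w['max_temp_c']}°C")
    return rows
```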
SkyAgenda's value isn't in any single piece — not the LLM, not the calendar integration, not the weather API. It's in the reliability of the full chain: structured data in, structured data out, observable at every step.
If you're building something similar, my suggestion is to invest in logging and parsing stability first, before optimizing prompts or exploring model alternatives. A system that runs correctly every day for two weeks is more useful than one that occasionally produces impressive output but fails unpredictably.
The architecture is intentionally unsexy. That's the point.