Tags: Local LLM · Python · RSS · Automation · Agent
Hardware: M1 MacBook Pro 32GB
Models: qwen2.5:7b-instruct (Ollama) · gpt-5-nano (OpenAI API)
The first post covered SkyAgenda — a morning briefing agent that wires together Apple Calendar, weather forecasts, and a locally running LLM. This one is about something different: SkyRss, a pipeline that fetches RSS feeds daily, filters them with a local model, generates a final report via an online API, and pushes it to Slack automatically.
The two projects start from different constraints. SkyAgenda's hard requirement was data sovereignty — nothing sensitive leaves the machine. SkyRss is more about cost control and pipeline stability. But after building both, they converged on the same core engineering judgment: put high-frequency, fault-tolerant steps local; save the expensive online call for the final expression layer.
The most common mistake in RSS automation is dumping all raw feed content directly into a cloud LLM and asking it to "summarize today's news." The problems are predictable: token costs scale with feed volume, long noisy contexts dilute summary quality, and a single malformed or oversized feed can fail the entire run.
SkyRss takes a different approach: use the cheapest possible inference for the bulk of the data, and spend the expensive call only on the last step.
The pipeline runs in five stages:

1. Fetch: typed workers pull every enabled RSS source concurrently
2. Store: raw results land in per-day JSON files under data/raw/
3. Filter: qwen2.5:7b-instruct selects top items per source
4. Report: a single online API call produces the final digest
5. Deliver: chat.postMessage pushes it to Slack

scheduled_tasks.toml triggers runs at 10:00 and 20:00.

RSS sources aren't hardcoded — they live in config/sources.yaml. Each entry has an id, url, worker_type, and enabled flag. At runtime, the master worker reads the config, dispatches to typed workers, and fetches concurrently via aiohttp.
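A minimal sketch of that dispatch loop, assuming a config shape matching the fields above (the function names and sample sources are illustrative; the real workers use aiohttp, stubbed here so the sketch runs without network access):

```python
import asyncio

# Illustrative stand-in for config/sources.yaml after parsing.
SOURCES = [
    {"id": "hn", "url": "https://news.ycombinator.com/rss",
     "worker_type": "generic", "enabled": True},
    {"id": "muted", "url": "https://example.com/feed.xml",
     "worker_type": "generic", "enabled": False},
]

async def fetch_source(source: dict) -> dict:
    # The real worker would fetch source["url"] with aiohttp and parse
    # the feed; stubbed to keep the example self-contained.
    await asyncio.sleep(0)
    return {"source_id": source["id"], "items": []}

async def run_all(sources: list[dict]) -> list[dict]:
    # Skip disabled entries, then fetch every enabled source concurrently.
    enabled = [s for s in sources if s["enabled"]]
    return list(await asyncio.gather(*(fetch_source(s) for s in enabled)))

results = asyncio.run(run_all(SOURCES))  # only the enabled source is fetched
```

Because disabled sources never reach the fetch stage, muting a flaky feed is a one-line config change rather than a code change.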
Raw output is written to data/raw/<yyyy-mm-dd>/rss_<source_id>.json. Storing by date is a small decision with a big practical upside: briefings can be replayed against the same data without re-fetching. When iterating on prompts or swapping models, you just re-run — no network calls needed.
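The layout is trivial to reproduce. A sketch using only the path pattern described above (function names are illustrative):

```python
import json
from datetime import date
from pathlib import Path

def raw_path(root: Path, source_id: str, day: date) -> Path:
    # data/raw/<yyyy-mm-dd>/rss_<source_id>.json
    return root / "raw" / day.isoformat() / f"rss_{source_id}.json"

def store_raw(root: Path, source_id: str, items: list[dict], day: date) -> Path:
    path = raw_path(root, source_id, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2))
    return path

# Replaying a past day is then just reading the same file back -- no re-fetch.
```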
No LLM involved at this stage, intentionally. RSS fetching is an I/O problem, not a reasoning problem. Getting "stable ingestion, consistent storage, raw data preserved" right first matters more than wiring AI in early.

Raw feeds don't go to the online API directly; they're compressed locally first.
The local model is qwen2.5:7b-instruct running via Ollama, with conservative inference settings (temperature: 0.1, num_ctx: 16384). The task is highly structured: given a numbered list of headlines, pick the most important ones and return their indices.
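The request shape might look like this — a sketch targeting Ollama's /api/generate endpoint; the prompt wording is hypothetical, while the model name and options come from the settings above:

```python
def build_selection_request(headlines: list[str], top_k: int = 5) -> dict:
    # Number the headlines so the model can answer with indices alone.
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(headlines, start=1))
    prompt = (
        f"Below is a numbered list of headlines.\n{numbered}\n"
        f"Return the numbers of the {top_k} most important items, "
        "comma-separated, nothing else."
    )
    return {
        "model": "qwen2.5:7b-instruct",
        "prompt": prompt,
        "stream": False,
        # Conservative settings: near-deterministic output, generous context.
        "options": {"temperature": 0.1, "num_ctx": 16384},
    }

req = build_selection_request(["A", "B", "C"], top_k=2)
# This dict would be POSTed to http://localhost:11434/api/generate.
```

Asking for indices rather than rewritten headlines keeps the response tiny and makes validation a pure parsing problem.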
Why 7B specifically? Because this task isn't open-ended generation — it's constrained selection. A 7B model handles it well, runs fast on local hardware, costs nothing per call, and can fall back to rule-based logic if the output format drifts. That profile belongs at the front of a daily pipeline; a larger model would take longer to load and cost more to keep warm for no gain on this task.
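Format drift is the main failure mode, so the parser needs a rule-based escape hatch. A sketch — the specific fallback policy here (keep the first top_k items in feed order) is an assumption, not necessarily what SkyRss does:

```python
import re

def parse_selection(raw: str, n_items: int, top_k: int) -> list[int]:
    """Extract selected indices from model output, falling back to rules."""
    # Pull every integer the model emitted and keep only in-range values.
    picked = [int(m) for m in re.findall(r"\d+", raw) if 1 <= int(m) <= n_items]
    deduped = list(dict.fromkeys(picked))[:top_k]
    if deduped:
        return deduped
    # Rule-based fallback: if the output is unusable, keep the first top_k
    # items so the pipeline still produces a briefing instead of failing.
    return list(range(1, min(top_k, n_items) + 1))

print(parse_selection("3, 1, 7", n_items=10, top_k=3))        # [3, 1, 7]
print(parse_selection("sorry, I can't", n_items=10, top_k=3))  # [1, 2, 3]
```

Note the parser tolerates chatty output like "The most important items are 3, 1 and 7" — it only cares about the digits.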

Once every source has a briefing list, SkyRss makes a single online API call to produce the final report. Two separate commands handle language variants: english_digest.py and chinese_digest.py. Both read the compressed briefing file and call the API with a clean, minimal payload.
The token count reaching the API at this point is small. The online model isn't processing raw feeds — it's acting as a last-mile editor, organizing already-filtered content into a readable Slack report.
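As a rough illustration of what a clean, minimal payload can look like at this stage (the function name, message wording, and briefing shape are hypothetical, not the actual english_digest.py code):

```python
def build_digest_messages(briefings: dict[str, list[str]],
                          language: str = "English") -> list[dict]:
    # briefings maps source id -> already-filtered items, so the online
    # model only ever sees compressed content, never raw feeds.
    lines = []
    for source_id, items in briefings.items():
        lines.append(f"## {source_id}")
        lines.extend(f"- {item}" for item in items)
    return [
        {"role": "system",
         "content": (f"You are an editor. Organize the briefing below into "
                     f"a readable {language} Slack report. "
                     "Do not add new facts.")},
        {"role": "user", "content": "\n".join(lines)},
    ]

msgs = build_digest_messages({"hn": ["Item A", "Item B"]})
```

The language variant is just a parameter to the same payload builder, which is why the two digest commands can stay thin wrappers.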
With this compression strategy, daily API cost runs well under $0.01. The online model goes from being a full-text processor to being the final polish step.

The generated report goes out via chat.postMessage as plain mrkdwn. No Block Kit, no interactive components — just text.
The reasoning is practical: plain text is easier to debug, easier for the model to generate directly, and failure modes show up immediately in logs. SkyRss optimizes for reliable delivery, not impressive UI.
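Delivery then reduces to one authenticated POST. A standard-library sketch (token and channel are placeholders; chat.postMessage is Slack's real Web API method, and the request is built but not sent here):

```python
import json
import urllib.request

def build_slack_request(token: str, channel: str,
                        text: str) -> urllib.request.Request:
    # chat.postMessage with plain text; Slack renders mrkdwn by default.
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )

req = build_slack_request("xoxb-placeholder", "#briefings",
                          "*Daily digest*\n...")
# urllib.request.urlopen(req) would send it; the JSON reply carries an
# "ok" field, which is the thing to check and log on every run.
```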
SkyRss uses a unified task scheduler that reads scheduled_tasks.toml and triggers Python modules by time:
```toml
[[tasks]]
name = "skyrss_run_all"
cwd = "apps/skyrss"
module = "skyrss.run_all"
times = ["10:00", "20:00"]
enabled = true
```
The dispatch logic is just subprocess.Popen on the configured module. No framework dependencies. Every task is a standard Python module that can be reproduced directly from the terminal when something goes wrong — which, for a personal project, matters more than scheduling features you'll never use.
SkyRss's model split converged on the same answer as SkyAgenda:
| Stage | Model | Reason |
|---|---|---|
| RSS filtering | Local 7B (Ollama) | High-frequency, fault-tolerant, no generation quality needed |
| Final report | Online API | Low-frequency, output goes directly to users |
Could the final report be generated locally with a 20B or 70B model? Yes. But running a large model for a daily pipeline creates real overhead: higher memory pressure, longer cold starts, and a machine that's noticeably less available for other work. The current split is the better tradeoff.
The principle underneath: don't put the most expensive model at every step — figure out what each step actually needs, and use just enough to do it well.
Raw data retention is more useful than it sounds. Per-day storage gives the pipeline replayability. Swap a model, rewrite a prompt, re-run — no re-fetching needed. This makes iteration loops much shorter.
Design the failure path before you need it. A 7B model will occasionally produce malformed output. Rule-based fallback needs to exist before the first production run, not as a hotfix after the first late-night pipeline failure.
Simple schedulers stay simple. TOML config + subprocess is more maintainable than any framework for a project at this scale. When something breaks, one terminal command reproduces it.
The next project in this series is going to be more complex than either SkyAgenda or SkyRss. The plan is a financial analysis agent — one that processes real financial statements (balance sheets, income statements, cash flow reports) using a combination of local LLM inference and online API calls to produce structured, auditable financial analysis.
The challenge is different from news summarization. Financial documents require precise numerical reasoning, cross-referencing across multiple statements, and output that can be traced back to source data. The same hybrid inference principle applies — local models for document parsing and initial extraction, online API for synthesis and narrative — but the accuracy bar is higher and the failure cost is steeper.
More on that in the next post.