Tags: Local LLM · Python · RSS · Automation · Agent
Hardware: M1 MacBook Pro 32GB
Models: qwen2.5:7b-instruct (Ollama) · gpt-5.4-nano (OpenAI API)
Sky Series #3 — After SkyAgenda automated my daily briefings and SkyRss built a cost-efficient news pipeline, the natural next step was financial monitoring. Higher stakes. Same principles.
I hold positions across two markets — US equities (NVDA, CRWD, PLTR, COIN) and Tokyo-listed stocks (7011.T Mitsubishi Heavy, 6758.T Sony, 6367.T Daikin). Every morning I was doing the same routine manually: check prices, scan headlines, remind myself why I own what I own. Repetitive, error-prone, and boring.
The tools available are either too generic ("your portfolio is down 2%") or too expensive to run with cloud AI against 20 tickers every single day. A frontier model call per ticker per run adds up fast on a personal project with no revenue.
So I built SkyFinance: a 6-step pipeline that runs twice every trading day and once on Sundays — automatically. Two Slack messages per run, roughly $0.50/month in cloud API costs.
The architecture is intentionally unsexy. That's the point.
Every design decision flows from one constraint: local inference is free, cloud inference is not.
The question is always where to draw the line. News filtering and per-ticker classification are tasks a 7B model handles reliably. Cross-holding portfolio synthesis and polished report generation are tasks where frontier models earn their cost. The mistake is using a cloud API for everything, or trying to push local models beyond their capability floor.
I've been burned by both directions. SkyFinance is built around that lesson.
┌─────────────────────────────────────────────────────────────┐
│ SkyFinance — Daily Pipeline │
│ │
│ STAGE 1: DATA ACQUISITION (free) │
│ ┌─────────────────────────────────────────────┐ │
│ │ 01 Prices & fundamentals yfinance │ │
│ │ 02 News headlines Google RSS │ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ STAGE 2: LOCAL ANALYSIS (zero cost) │
│ ┌─────────────────────────────────────────────┐ │
│ │ 03 Filter + brief Ollama 7B │ │
│ │ 04 Send brief → Slack #daily │ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ STAGE 3: CLOUD SYNTHESIS (~$0.01/run) │
│ ┌─────────────────────────────────────────────┐ │
│ │ 05 Full report (EN) gpt-5.4-nano │ │
│ │ 06 Send report → Slack #daily-cn │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Steps 1–3 are critical — the pipeline aborts if they fail. Steps 4–6 are non-critical — a Slack outage or API hiccup logs a warning and continues.
Every step reads and writes structured JSON to dated directories. This means every run is fully auditable after the fact.
data/
base/YYYYMMDD/{TICKER}.json ← price, PE, targets, PnL
news/YYYYMMDD/{TICKER}.json ← headlines per query term
analysis/YYYYMMDD/
{TICKER}.json ← per-ticker LLM verdict
_daily_brief.json ← aggregated signal summary
report/YYYYMMDD/
full_report.json / .md ← full cross-holding report (English)
full_report_en.json / .md ← same, second channel
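Every step goes through the same thin persistence layer over those dated directories. A minimal sketch of what that might look like — `save_json`/`load_json` are my names for illustration, not necessarily the project's:

```python
import json
from datetime import date
from pathlib import Path

DATA_ROOT = Path("data")  # assumed root; real layout is data/{stage}/YYYYMMDD/

def day_dir(stage: str, d: date, root: Path = DATA_ROOT) -> Path:
    """Return (and create) the dated directory for one pipeline stage."""
    p = root / stage / d.strftime("%Y%m%d")
    p.mkdir(parents=True, exist_ok=True)
    return p

def save_json(stage: str, name: str, payload: dict, d: date,
              root: Path = DATA_ROOT) -> Path:
    """Write one step's structured output, e.g. data/base/20260407/NVDA.json."""
    path = day_dir(stage, d, root) / f"{name}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return path

def load_json(stage: str, name: str, d: date,
              root: Path = DATA_ROOT) -> dict:
    """Read back any step's output — this is what makes runs auditable."""
    return json.loads((day_dir(stage, d, root) / f"{name}.json").read_text())
```

Because every stage goes through the same two functions, replaying or debugging a past run is just a matter of pointing `load_json` at an older date.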
Nothing novel here. For each of the 20 tickers I store a structured snapshot: current price, daily change, P/E, market cap, 52-week range position, analyst target prices (T1/T2/T3), and P&L against my cost basis.
{
"ticker": "NVDA",
"current_price": 875.4,
"change_pct": -1.23,
"pe_ratio": 42.1,
"market_cap_fmt": "2.15T",
"52w_high": 974.0, "52w_low": 461.0,
"pnl_str": "+34.2% vs cost",
"target_prices": [950.0, 1050.0, 1200.0]
}
This structured snapshot feeds directly into the final AI report prompt alongside news signals. The model sees price context, not just headlines.
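The derived fields in that snapshot are simple arithmetic over raw yfinance values. A sketch of the kind of helpers behind them — the function names are mine, not the project's:

```python
def fmt_market_cap(cap: float) -> str:
    """Format a raw market cap into the compact form stored in the snapshot."""
    for threshold, suffix in ((1e12, "T"), (1e9, "B"), (1e6, "M")):
        if cap >= threshold:
            return f"{cap / threshold:.2f}{suffix}"
    return f"{cap:.0f}"

def range_position(price: float, low_52w: float, high_52w: float) -> float:
    """Where the current price sits in the 52-week range: 0.0 = low, 1.0 = high."""
    return (price - low_52w) / (high_52w - low_52w)

def pnl_str(price: float, cost_basis: float) -> str:
    """P&L against cost basis, formatted the way the snapshot stores it."""
    return f"{(price / cost_basis - 1) * 100:+.1f}% vs cost"
```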
Rather than paying for a news API, I use Google News RSS feeds with an async fetcher — aiohttp with asyncio.Semaphore(4) for concurrency. Same approach as SkyRss, applied per-holding.
import asyncio
from urllib.parse import quote

import aiohttp

# parse_rss() and deduplicate() are small helpers defined elsewhere in the project.

async def fetch_ticker_news(session, ticker, keywords, semaphore):
    async with semaphore:
        articles = []
        for kw in keywords:  # L1 + L2 queries combined
            url = f"https://news.google.com/rss/search?q={quote(kw)}&hl=en-US"
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                articles.extend(parse_rss(await resp.text()))
        return ticker, deduplicate(articles)

async def fetch_all(holdings):
    semaphore = asyncio.Semaphore(4)  # max 4 concurrent RSS requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_ticker_news(session, h.ticker, h.L1 + h.L2, semaphore)
                 for h in holdings]
        return dict(await asyncio.gather(*tasks))  # {ticker: articles}
The non-obvious part is the keyword architecture. Each holding carries three tiers:
┌──────────────────────────────────────────────────────────┐
│ Keyword Tier System (per holding) │
│ │
│ L1 — Direct identifiers ← queried, always │
│ "NVIDIA", "NVDA", "Jensen Huang" │
│ │
│ L2 — Thematic keywords ← queried, always │
│ "AI chip", "CUDA", "data center GPU", "Blackwell" │
│ │
│ L3 — Macro context ← NOT queried │
│ "semiconductor supply chain", "US export controls" │
│ (passed to LLM as framing only) │
│ │
│ signal_chain ← causal thesis string │
│ "AI capex → GPU demand → NVDA revenue beat" │
└──────────────────────────────────────────────────────────┘
L3 keywords are deliberately not queried. Searching "semiconductor supply chain" across 20 tickers would flood results with undifferentiated macro noise. Instead, L3 terms are embedded in the LLM prompt as contextual framing — the model knows why certain L2 signals matter without drowning in macro articles.
The signal_chain field is the key idea: a one-line causal logic from macro event to stock impact. For NVDA it's AI capex → GPU demand → NVDA revenue beat. For 7011.T Mitsubishi Heavy it's defense budget increase → backlog growth → order intake. This chain gets passed to both the local LLM and the cloud API so analysis is always relative to the specific investment thesis, not generic sentiment.
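A holding's configuration could be modeled like this — the `L1`/`L2`/`L3`/`signal_chain` fields mirror the tiers above, but the dataclass itself is my illustration, not the project's exact schema:

```python
from dataclasses import dataclass

@dataclass
class Holding:
    """One portfolio position with its three keyword tiers and causal thesis."""
    ticker: str
    L1: list[str]       # direct identifiers — always queried
    L2: list[str]       # thematic keywords — always queried
    L3: list[str]       # macro context — prompt framing only, never queried
    signal_chain: str   # one-line causal logic from macro event to stock impact

    def query_terms(self) -> list[str]:
        # Only L1 + L2 hit the RSS endpoint; L3 stays out of the queries.
        return self.L1 + self.L2

nvda = Holding(
    ticker="NVDA",
    L1=["NVIDIA", "NVDA", "Jensen Huang"],
    L2=["AI chip", "CUDA", "data center GPU", "Blackwell"],
    L3=["semiconductor supply chain", "US export controls"],
    signal_chain="AI capex → GPU demand → NVDA revenue beat",
)
```

The split is enforced structurally: `query_terms()` can never leak an L3 term into a search, while the prompt builder is free to read `L3` and `signal_chain` directly.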
This is the load-bearing cost reduction in the system. The model choice here matters more than it looks.
Filtering 20 tickers' worth of headlines is a classification task, not a reasoning task. For each ticker I need: relevance score, sentiment direction, 2–3 sentence brief. Running this locally means zero API cost for a stage that fires twice a day, every trading day.
Hardware: M1 MacBook Pro, 32GB unified memory. All models run via Ollama on Metal. I benchmarked three sizes from the Qwen2.5-instruct family against the same task — filter and brief generation on real news articles:
┌──────────────────────────┬──────────┬──────────────┬──────────────────────────────────┐
│ Model │ Load │ Per-request │ Quality on short JP/EN news │
├──────────────────────────┼──────────┼──────────────┼──────────────────────────────────┤
│ qwen2.5:3b-instruct │ 1.6s │ 1.4 – 2.5s │ ⚠️ Thin extraction — misses │
│ KV cache: 1,152 MB │ │ avg ~2.0s │ context in short JP articles, │
│ │ │ │ JSON occasionally incomplete │
├──────────────────────────┼──────────┼──────────────┼──────────────────────────────────┤
│ qwen2.5:7b-instruct │ 3.1s │ 2.6 – 4.1s │ ✅ Solid signal extraction, │
│ KV cache: 1,792 MB │ │ avg ~3.4s │ handles JP/EN mix reliably, │
│ │ │ │ consistent structured JSON │
├──────────────────────────┼──────────┼──────────────┼──────────────────────────────────┤
│ qwen2.5:14b-instruct │ 6.2s │ 6 – 13s │ ⚠️ Marginally deeper reasoning, │
│ KV cache: 6,144 MB │ │ avg ~9s │ but the gap over 7B is small │
│ │ │ │ for this classification task │
└──────────────────────────┴──────────┴──────────────┴──────────────────────────────────┘
Full batch (20 tickers × 2 stages = 40 calls):
3B → ~1.5 min speed is fine, but signal quality too shallow
7B → ~2–3 min sweet spot ✅
14B → ~7–8 min too slow, quality gain doesn't justify it
The speed gap between 3B and 7B is actually small — about 1.5 seconds per request. Where 7B earns its keep is comprehension of short Japanese news text. Financial headlines are compact and context-heavy: a 40-character TSE headline with no subject can refer to a policy change, an earnings revision, or a supply chain update. The 3B model regularly produces generic or incomplete signals on these; 7B handles them correctly the large majority of the time.
The 14B model is noticeably more capable in terms of raw reasoning, but for a filtering and classification task the quality ceiling doesn't change meaningfully compared to 7B. The cost is a 3× slowdown and 6GB of KV cache — memory that competes with everything else running on the machine. The numbers make it an easy cut.
qwen2.5:7b-instruct with temperature: 0.1 is what stayed in production. The Ollama call is a plain HTTP POST — no SDK, no streaming:
import requests

def ollama_chat(messages: list[dict]) -> str:
    payload = {
        "model": "qwen2.5:7b-instruct",
        "messages": messages,
        "stream": False,  # wait for the full response
        "options": {"temperature": 0.1, "num_predict": 512},
    }
    resp = requests.post("http://localhost:11434/api/chat",
                         json=payload, timeout=90)
    resp.raise_for_status()
    return resp.json()["message"]["content"]
The critical failure mode I hit before getting this right:
❌ Naive — interleaved stages:
Ticker A → Stage 1 (filter) → Stage 2 (brief) ← model loaded
Ticker B → Stage 1 (filter) ← 90s cold start (model evicted)
Ticker B → Stage 2 (brief) ← fine
Ticker C → Stage 1 (filter) ← 90s cold start again
...
Ollama evicts a model from VRAM after ~5 minutes of inactivity. With 20 tickers and two stages, interleaving caused model eviction between every pair. Each eviction costs a full cold-start penalty on top of the actual inference time — turning a 2-minute batch into something that takes 10× longer.
✅ Correct — batched by stage:
ALL 20 tickers → Stage 1 (filter) ← model loads once, stays loaded
ALL 20 tickers → Stage 2 (brief) ← model stays loaded throughout
The model loads exactly once per stage across all 20 tickers. Total local inference time: ~2–3 minutes for the full 40-call batch. The fix in code is just a loop restructure:
# ❌ Interleaved — triggers model eviction between tickers
for ticker in tickers:
filter_results[ticker] = ollama_chat(filter_prompt(ticker))
brief_results[ticker] = ollama_chat(brief_prompt(ticker)) # model may be evicted here
# ✅ Batched by stage — model loads once per stage
for ticker in tickers:
filter_results[ticker] = ollama_chat(filter_prompt(ticker))
for ticker in tickers:
brief_results[ticker] = ollama_chat(brief_prompt(ticker))
┌────────────────────────────────────────────────────────────┐
│ Two-Stage Local Pipeline │
│ │
│ STAGE 1 — Filter (all tickers) │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │NVDA │ │CRWD │ │PLTR │ · · · │7011.T │ │
│ │filter │ │filter │ │filter │ │filter │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
│ ↓ model stays loaded in VRAM ↓ │
│ STAGE 2 — Brief (all tickers) │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │NVDA │ │CRWD │ │PLTR │ · · · │7011.T │ │
│ │brief │ │brief │ │brief │ │brief │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
└────────────────────────────────────────────────────────────┘
Each ticker produces a structured JSON verdict:
{
"ticker": "NVDA",
"signal_strength": 4,
"sentiment": "bullish",
"recommended_action": "buy_watch",
"key_catalysts": ["Blackwell shipment beat", "data centre demand upgrade"],
  "summary": "Jensen Huang's confirmation of accelerated Blackwell delivery aligns directly with the AI capex thesis. Analyst upgrades add near-term momentum."
}
The brief output language is whatever the source news is — Japanese for Tokyo-listed stocks, English for US equities. No translation is applied at this stage, and that's intentional.
Asking a local 7B model to translate while also analysing is a context mixing problem. The model ends up hedging between languages, producing output that's partially translated and partially not, with degraded analysis quality in both. The local model's job is signal extraction from the source material — keeping it in the original language means it can do that one task well.
This pre-filtered JSON — not raw headlines, not translated text — is what gets passed to the cloud API. That's what keeps Stage 3 token costs manageable.
After Stage 2, I have 20 per-ticker verdicts. What I need next is what a 7B model genuinely cannot do well: synthesis across the whole portfolio, a read on the overall market regime, and a polished English report.
These require broad world knowledge, sustained multi-context reasoning across 20 positions, and cross-lingual synthesis. A frontier API model handles this in one call. A 7B model hallucinates or loses coherence by the tenth ticker.
┌─────────────────────────────────────────────────────────────┐
│ Cloud API Prompt Structure │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ PRICE SNAPSHOT TABLE │ ← from Step 01 │
│ │ NVDA | 875.4 USD | cap 2.15T │ │
│ │ CRWD | 310.2 USD | PE 89.4 │ │
│ │ 7011.T| 2847 JPY | PnL +89.4% │ │
│ │ ... | ... | ... │ │
│ └─────────────────────────────────────┘ │
│ + │
│ ┌─────────────────────────────────────┐ │
│ │ LOCAL LLM SIGNALS (pre-filtered) │ ← from Step 03 │
│ │ [★★★★☆] NVDA BUY_WATCH │ │
│ │ chain: AI capex → GPU → revenue │ │
│ │ brief: Blackwell beat confirmed │ │
│ │ [★☆☆☆☆] MSFT HOLD │ │
│ │ brief: No material news. │ │
│ │ ... │ │
│ └─────────────────────────────────────┘ │
│ ↓ │
│ ONE API CALL → full_report.json │
└─────────────────────────────────────────────────────────────┘
The output is a structured JSON report — portfolio sentiment, market regime, top 5 actions, risk alerts, sector summary, and a one-liner for every holding. JSON in, JSON out. No free-form parsing required.
The brief coming out of Stage 2 is a mix — Japanese summaries for Tokyo stocks, English for US equities. The cloud API prompt explicitly instructs the model to output all free-text fields in English.
At this stage, there's no cost reason to avoid it. The model is already receiving the full context in a single call; adding a language constraint costs zero extra tokens compared to what the reasoning requires anyway. The API can read mixed Japanese/English input and produce clean English output reliably — this is exactly where frontier models earn their keep over local inference.
Step 05/06 → prompt: "All free-text fields in English"
→ full_report.json (cross-holding synthesis, English throughout)
The result is a report that's consistent in language regardless of which market a signal originated from. A Tokyo defense stock signal and a US semiconductor signal end up in the same readable English paragraph.
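A sketch of how that single-call prompt might be assembled from the price snapshots and the pre-filtered local signals — `build_report_prompt` is my illustration of the structure in the diagram, not the project's exact code:

```python
def build_report_prompt(snapshots: dict[str, dict],
                        verdicts: dict[str, dict]) -> str:
    """Assemble the one cloud-API prompt: price table + pre-filtered signals."""
    lines = ["PRICE SNAPSHOT", "ticker | price | pnl"]
    for t, s in snapshots.items():
        lines.append(f"{t} | {s['current_price']} | {s['pnl_str']}")
    lines += ["", "LOCAL SIGNALS (pre-filtered)"]
    for t, v in verdicts.items():
        stars = "★" * v["signal_strength"] + "☆" * (5 - v["signal_strength"])
        lines.append(f"[{stars}] {t} {v['recommended_action'].upper()}"
                     f" — {v['summary']}")
    # The language constraint costs nothing extra at this stage.
    lines += ["", "Output a JSON report. All free-text fields in English."]
    return "\n".join(lines)
```

Because the verdicts are already filtered and summarized, this prompt stays around ~5,000 tokens for 20 holdings instead of carrying every raw headline.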
Newer models take max_completion_tokens instead of max_tokens, and sending the old parameter produces a 400 error whose cause isn't immediately obvious:
400: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.4-nano",
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": prompt}],
    max_completion_tokens=4096,  # ← not max_tokens
    temperature=0.2,
)
report = json.loads(response.choices[0].message.content)
One call, ~5,000 tokens in, structured JSON out. The second issue to watch: always test token budget with the full holding count. 2,048 tokens truncated the JSON mid-object on 20 tickers; 4,096 is enough with room to spare.
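A cheap guard turns the truncation failure into a loud, actionable error instead of a mysterious parse exception — the helper below is hypothetical:

```python
import json

def parse_report(raw: str) -> dict:
    """Parse the API report; raise a clear error when the JSON was cut off
    by a too-small max_completion_tokens budget."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        if raw.rstrip() and not raw.rstrip().endswith("}"):
            # Valid syntax up to the cut, broken after — classic truncation.
            raise ValueError(
                "Report JSON appears truncated — raise max_completion_tokens"
            ) from None
        raise
```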
Two messages per pipeline run:
┌─────────────────────────────────────────────────────────────┐
│ Slack Message Flow │
│ │
│ Step 04 │
│ ┌────────────────────────────────────────────────────┐ │
│ │ 🔶 Local Brief (fast, free, JP+EN as-is) │ │
│ │ Per-ticker signals from Ollama │ │
│ │ Language mirrors the source news — no translation │ │
│ │ Sent ~10 min after pipeline starts │ │
│ └────────────────────────────┬───────────────────────┘ │
│ ↓ │
│ #daily-brief │
│ │
│ Step 06 │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 🏦 Full Report (API) │ │
│ │ English (API) │ │
│ │ Top actions │ │
│ │ Risk alerts │ │
│ │ Sector summary │ │
│ │ Holdings grid │ │
│ └────────────────────────────┬──────────────────────────┘ │
│ ↓ │
│ #report-daily │
└─────────────────────────────────────────────────────────────┘
All senders share a single _post() function in slack_sender.py, with an optional channel override so the two full-report variants can target different channels from the same codebase. Messages use Slack Block Kit — headers, sections, and two-column fields arrays for the holdings grid.
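The holdings grid maps naturally onto Block Kit's fields arrays. A sketch of a builder — Block Kit caps a section's fields array at 10 items, so 20 holdings need chunking (the function name is mine):

```python
def holdings_grid_blocks(holdings: list[tuple[str, str]]) -> list[dict]:
    """Render (ticker, one-liner) pairs as Slack Block Kit section blocks.
    A section's `fields` array holds at most 10 items, two per visual row."""
    fields = [
        {"type": "mrkdwn", "text": f"*{ticker}*\n{line}"}
        for ticker, line in holdings
    ]
    return [
        {"type": "section", "fields": fields[i:i + 10]}
        for i in range(0, len(fields), 10)
    ]
```

For 20 holdings this yields two section blocks of 10 fields each, which Slack lays out as a two-column grid.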
scheduler.py is a persistent Python process built on APScheduler — three jobs, no crontab required.
┌─────────────────────────────────────────────────────────────┐
│ Scheduled Runs (Tokyo timezone) │
│ │
│ Mon–Fri 16:30 JST After Tokyo close (TSE closes 15:30)│
│ Mon–Fri 17:00 ET After NYSE close (NYSE closes 16:00)│
│ = 06:00 JST (summer EDT) │
│ 07:00 JST (winter EST) │
│ Sunday 16:00 JST Weekly portfolio review │
└─────────────────────────────────────────────────────────────┘
The NYSE job is registered in America/New_York timezone. APScheduler resolves EDT/EST automatically — no manual DST logic.
Between fire times the main thread is suspended via threading.Event.wait(timeout=seconds_until_next_job). It's not a polling loop. CPU usage during standby is 0%.
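The standby mechanics can be sketched with the stdlib alone — the real scheduler delegates the timezone math to APScheduler, but the wait pattern looks like this (`seconds_until` is my illustration):

```python
import threading
from datetime import datetime, timedelta, time as dtime
from zoneinfo import ZoneInfo

TOKYO = ZoneInfo("Asia/Tokyo")

def seconds_until(target: dtime, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of wall-clock `target`."""
    candidate = now.replace(hour=target.hour, minute=target.minute,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # already passed today — fire tomorrow
    return (candidate - now).total_seconds()

# Standby: block on an Event rather than polling in a loop. CPU usage is
# zero between fire times, and stop.set() wakes the thread for shutdown.
stop = threading.Event()
# wait_s = seconds_until(dtime(16, 30), datetime.now(TOKYO))
# stop.wait(timeout=wait_s)  # returns early only if stop.set() is called
```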
# Preview next scheduled run times without starting
$ python scheduler.py --dry-run
Sky Finance Scheduler — next runs from 2026-04-07 09:14 JST
────────────────────────────────────────────────────────────
Tokyo-close (TSE+1h) → 2026-04-07 16:30 JST (04/07 03:30 EDT)
NYSE-close (NYSE+1h) → 2026-04-07 17:00 EDT (04/08 06:00 JST)
Sunday summary → 2026-04-12 16:00 JST (04/12 03:00 EDT)
| Component | Cost |
|---|---|
| yfinance — price data | Free |
| Google RSS — news | Free |
| Ollama — local LLM inference | Free (local GPU/CPU) |
| OpenAI API — 2 calls/day × ~5,000 tokens | ~$0.01–0.02/day |
| Slack Bot API | Free tier |
| Monthly total | ~$0.30–0.60 |
The local LLM pre-filtering is what makes that number possible. Without it, you'd pass raw headlines for 20 tickers directly to the cloud API — roughly 10× more tokens per run, plus the noise of unfiltered news mixed with actual signals.
The local LLM's job isn't to be smart. It's to be cheap and good enough at filtering so the cloud API can be smart about what matters.
Ollama timeout on first call. Model cold-start on Apple Silicon takes longer than the default 120-second request timeout. Fix: add a warmup_model() call at startup with a 300-second timeout. Don't discover this on the first real run.
JSON truncated mid-object. max_completion_tokens: 2048 wasn't enough for a 20-ticker portfolio report. The model would cut off mid-JSON — valid syntax up to the truncation point, broken after. Fix: increase to 4096. Always test with your full holding count, not a subset.
Slack channel_not_found. The Bot Token needs to be explicitly invited to each channel with /invite @BotName. The API returns channel_not_found even if the channel exists and the bot has the right OAuth scopes — it just hasn't been invited. Took longer to debug than it should have.
SkyFinance Phase 1 delivers automated portfolio monitoring for 20 holdings across US and Japanese markets, running twice per trading day and once on Sundays, for roughly $0.50/month.
Three things that matter:
1. The local/cloud line is drawn by task type — filtering and classification stay local; cross-holding synthesis goes to the API.
2. Batching by stage, not by ticker, keeps the local model resident and the full 40-call batch under 3 minutes.
3. Local pre-filtering is the load-bearing cost reduction — the cloud call sees signals, not raw headlines.
Phase 1 handles price data and news signals. What it doesn't have:
Financial statement analysis. Earnings reports, revenue trends, margin trajectories, balance sheet health. Parsing SEC filings and IR PDFs is a different problem from parsing RSS feeds — it requires structured extraction from unstructured documents, cross-quarter comparison, and reasoning about forward guidance language. A 7B model isn't reliable here.
Macroeconomic overlay. Fed policy, yield curve dynamics, JPY/USD exposure (material for my Tokyo positions), geopolitical risk. These operate on longer timescales than daily news and need separate ingestion and synthesis before being merged into the portfolio view.
Phase 2 will add earnings report ingestion, a macro regime model, and a top-tier model for the final synthesis layer — because cross-statement analysis with macro overlay is a genuinely hard reasoning task that earns a higher inference cost.
The architecture stays the same. The intelligence layer gets upgraded.
Sky Series: SkyAgenda → SkyRss → SkyFinance (this post) → SkyFinance Phase 2 (coming)

