5G AI Lab — System Architecture

Model roles, responsibilities & example prompts

🧠

gemma4:e4b

Main brain · orchestrator · ~4.5B params

Primary reasoning engine. Receives all specialist results and synthesizes a final coherent answer. Handles complex multi-step questions, interprets 5G lab context, and decides when more tool calls are needed.

"Why is UE registration failing based on these logs?"

"Explain what these Prometheus metrics mean for our setup"

ollama pull gemma4:e4b

Already installed ✓

~4GB disk · ~5GB VRAM

Role: orchestrator
Temp: 0.7 · Ctx: 8192

⚡

phi3:mini

Intelligent router · ~3.8B params · ~200ms

Reads every incoming question and classifies it in about 200 ms: code task, vision task, web search, log parse, knowledge-base lookup, or complex reasoning question. It then routes the question to the matching specialist, which keeps the system responsive.

"this is a code question → codellama"

"this has an image → llava"

ollama pull phi3:mini

~2GB disk · ~2.5GB VRAM

Role: router
Latency target: <200ms

🔍

mistral:7b

Web search specialist · 7B params

Activated when Gemma needs current information from the web. Sends a precise search query, fetches results, and returns a clean 3-5 sentence factual summary. Used for hardware specs, 3GPP standards, software changelogs, and known issues.

"What 3GPP release does Open5GS 2.7.6 support?"

"Ericsson BB 6630 maximum UE capacity"

ollama pull mistral:7b

~4GB disk · ~6GB VRAM

Role: web search
Temp: 0.1 · concise output

💻

codellama:13b

Code & config specialist · 13B params

Handles all code generation, script writing, config generation, and log parsing with regex. Generates Ericsson router CLI commands, Python scripts, Ansible playbooks, and PromQL queries. Far more reliable than general models for exact CLI syntax.

"Write CLI to configure VLAN 2140 on Ericsson router"

"Parse AMF logs and extract all S1AP errors as JSON"

ollama pull codellama:13b

~8GB disk · ~10GB VRAM

Role: code & CLI specialist
Needs GPU for best speed

👁️

llava:13b

Vision specialist · multimodal · 13B params

Reads and analyzes images. Point it at Grafana dashboard screenshots, Ericsson Element Manager UI screenshots, alarm lists, or network diagram photos. Identifies anomalies, reads values, and describes what it sees in technical language.

"Here is a Grafana screenshot — what looks wrong?"

"Analyze this baseband alarm screenshot"

ollama pull llava:13b

~8GB disk · ~12GB VRAM

Role: vision & image analysis
Needs 12GB+ VRAM

📋

mistral:7b-instruct

Structured output specialist · 7B params

Instruct-tuned for strict JSON and table output. Used specifically for log analysis, alarm report generation, and any task where the output must be machine-readable or precisely formatted. Ideal for feeding results into dashboards or other tools.

"Parse today's MME logs → JSON list of failed registrations"

"Summarize Prometheus alerts as structured report"

ollama pull mistral:7b-instruct

~4GB disk · ~6GB VRAM

Role: structured JSON output
Log parsing · alarm reports

📚

nomic-embed-text

Embeddings · local knowledge base · 270MB

Converts text to vectors for semantic search. Feeds the local ChromaDB vector database. Index your lab documentation, Ericsson manuals, Open5GS configs, router configs, and historical logs. Then search them by meaning, not just keywords.

"What does our router config say about VLAN 2120?"

"Find all handover failures from last week's logs"

ollama pull nomic-embed-text

~270MB disk · ~0.5GB VRAM

Role: embeddings · ChromaDB
Tiny size · huge value

Recommended hardware for the AI server

Every query runs 3 models in sequence — peak VRAM is their sum

phi3:mini: 2.5 GB · router · always first
specialist: 6–12 GB · varies by query type
gemma4:e4b: 5 GB · synthesis · always last
peak demand: 13.5–19.5 GB · 19.5 GB with llava:13b

Worst-case path: phi3:mini (2.5) + llava:13b (12) + gemma4:e4b (5) = 19.5 GB
Typical path (search/code): phi3:mini (2.5) + specialist (6–10) + gemma4:e4b (5) = 13.5–17.5 GB
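
The path sums can be reproduced with a few lines of Python. This is only a sketch: the VRAM figures are copied from the model table in this document, it assumes all three models stay resident at once, and the helper name peak_vram is hypothetical.

```python
# VRAM figures in GB, copied from the model table in this document.
VRAM_GB = {
    "phi3:mini": 2.5,
    "gemma4:e4b": 5.0,
    "mistral:7b": 6.0,
    "mistral:7b-instruct-v0.3-q8_0": 6.0,
    "codellama:13b": 10.0,
    "llava:13b": 12.0,
}


def peak_vram(specialist: str) -> float:
    """Sum router + specialist + synthesis, assuming all three stay resident."""
    return VRAM_GB["phi3:mini"] + VRAM_GB[specialist] + VRAM_GB["gemma4:e4b"]


for name in ("mistral:7b", "codellama:13b", "llava:13b"):
    print(f"{name:<16} {peak_vram(name):.1f} GB")
```

The three printed values correspond to the search, code, and vision paths.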
    

Component	Minimum	Recommended	Why	Your risk without it
GPU	RTX 3080 Ti (12GB)	RTX 4090 (24GB) or A10 (24GB)	The pipeline keeps 3 models in VRAM at once. llava:13b alone needs 12GB and codellama:13b needs 10GB. Without a GPU, all 7 models fall back to CPU.	60–180s response times on 13B models. llava:13b is effectively unusable on CPU: vision queries time out.
VRAM	16GB	24GB+	The pipeline runs phi3:mini + specialist + gemma4:e4b in sequence, and all three stay resident in VRAM through synthesis. Peak demand is 19.5GB (llava path). 24GB covers every path through the 7-model stack with no offloading.	16GB handles most queries, but llava:13b will offload ~3.5GB to RAM (~3–4s extra per vision query). 10GB is too small: codellama:13b alone won't fit cleanly.
RAM (system)	32GB	64GB	Models that exceed VRAM spill into system RAM. SearXNG + Playwright headless Chromium + ChromaDB + Ollama + WSL2 overhead together consume ~3–4GB for the support stack. The remaining RAM absorbs VRAM spillover.	System thrashes when multiple specialists are called in sequence. Playwright browser fetch crashes. ChromaDB indexing slows to a crawl.
CPU	8 core	16 core (Ryzen 9 / i9 / Xeon)	phi3:mini routing runs on CPU even with a GPU present. Playwright headless Chromium, SearXNG Flask server, ChromaDB embeddings, and SSH tool calls all run concurrently on CPU alongside model inference.	phi3:mini routing adds 400–800ms instead of <200ms. Parallel search fetch competes with inference. Tool calls feel sluggish.
Storage	512GB NVMe	1TB+ NVMe (PCIe 4.0)	The current 7 models total ~30GB. The planned upgrade to qwen2.5-coder:32b adds ~20GB. ChromaDB grows as you index lab docs. SearXNG, the Playwright browser (~130MB), logs, and the WSL2 image add up fast.	Running out of space when pulling the code model upgrade. Slow model load from a SATA SSD adds 5–15s cold-start latency per model.
Network	100Mbps Ethernet	1Gbps Ethernet to lab network	SSH to core (172.29.10.26), Prometheus queries (:9099), NMS health (:8888), AMF/MME log streaming — all concurrent when multiple tools are invoked. SearXNG also fetches from upstream search engines.	Prometheus queries and log fetches add 1–3s latency. SSH timeout risks during heavy concurrent tool use.
OS	WSL2 Ubuntu 22.04 ✓ (current)	Native Ubuntu 22.04 (dual boot or dedicated server)	Native Linux removes WSL2 memory overhead, fixes GPU passthrough latency, eliminates port-forwarding complexity, and lets Playwright and SearXNG bind ports directly without Windows NAT.	WSL2 works but adds ~2GB RAM overhead, occasional GPU driver friction, and Playwright requires --no-sandbox workaround (already applied).

Current model stack · 7 models · ~30 GB total on disk

Model	Role	Disk	VRAM	Active when	Priority
phi3:mini	Router	~2 GB	~2.5 GB	Every single query — always first	★★★ pull first
gemma4:e4b	Brain · synthesis	~4 GB	~5 GB	Every single query — always last	★★★ pull second
nomic-embed-text	Embeddings · KB	~270 MB	~0.5 GB	KNOWLEDGE queries · knowledge base ingest	★★★ tiny · huge value
mistral:7b	Web search	~4 GB	~6 GB	SEARCH queries · SearXNG result synthesis	★★★ pull third
mistral:7b-instruct-v0.3-q8_0	Log parser	~4 GB	~6 GB	LOGS queries · structured JSON output	★★☆ pull fourth
codellama:13b	Code · CLI	~8 GB	~10 GB	CODE queries · scripts · configs · PromQL	★★☆ pull fifth
llava:13b	Vision · images	~8 GB	~12 GB	VISION queries only · needs image path from user	★☆☆ full pipeline needs 24GB VRAM
Total — all 7 models		~30.3 GB	peak 19.5 GB	not all active simultaneously — Ollama loads on demand

Planned model upgrade — requires RTX 4090 (24GB VRAM)

Current model	Planned replacement	Disk	VRAM	Quality gain	Needs
codellama:13b	qwen2.5-coder:32b	~20 GB	~20 GB	HumanEval: 38% → 80% · better CLI syntax · better PromQL	24GB VRAM minimum
After upgrade: total disk ~42 GB · peak VRAM 27.5 GB (phi3:mini 2.5 + qwen2.5-coder:32b 20 + gemma4:e4b 5). That exceeds a 24GB GPU by ~3.5 GB, so some gemma4 layers offload to system RAM. Prompt documented in CODING_IMPROVEMENTS.md PROMPT 6.

Support stack — RAM & CPU overhead (no VRAM)

Component	RAM usage	CPU threads	Notes
SearXNG (Flask server)	~150 MB	1–2	Must be started manually after WSL reboot · logs to ai-lab/logs/searxng.log
Playwright headless Chromium	~300–500 MB	2–4	Launched per JS-fetch request · auto-closes after · requires --no-sandbox on WSL2
ChromaDB (vector store)	~100 MB + index size	1	Grows as you ingest lab docs · stored at knowledge/chroma_db/
Ollama server	~200 MB + model	2–8	Keeps last-used model warm in VRAM · auto-evicts after idle timeout
WSL2 + Ubuntu overhead	~2–3 GB	2	Fixed OS overhead · more on Windows host with other apps running
Support stack total	~3–4 GB	~8–16	on top of whatever model VRAM is in use

VRAM management — how Ollama handles the 7-model stack

Ollama loads models on demand and keeps them warm until idle timeout (~5 min). It does not pre-load all 7 models.
In practice, only 2–3 models are in VRAM at any moment: phi3:mini (router) and gemma4:e4b (brain) stay warm between queries because every query uses them. The specialist is evicted after its query completes.

Priority order to keep warm manually (if you want to pre-load):
phi3:mini (2.5GB · always needed) → gemma4:e4b (5GB · always needed) → nomic-embed-text (0.5GB · tiny, keep warm) → mistral:7b (6GB · most common specialist) → load others on demand
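
A minimal sketch of manual pre-loading, assuming Ollama listens on its default port 11434. A request to /api/generate with no prompt asks Ollama to load the model, and keep_alive controls how long it stays warm. The "30m" value and the helper names are illustrative assumptions, and the sketch covers the generative models only, not nomic-embed-text.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

# Warm-up order from the priority list above (generative models only).
WARM_ORDER = ["phi3:mini", "gemma4:e4b", "mistral:7b"]


def preload_request(model: str, keep_alive: str = "30m") -> urllib.request.Request:
    """Build a request with no prompt, which asks Ollama to load the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )


def preload_all(models=WARM_ORDER) -> None:
    """Send the warm-up requests in priority order (needs a running server)."""
    for model in models:
        with urllib.request.urlopen(preload_request(model), timeout=120) as resp:
            resp.read()
```

Calling preload_all() after a reboot would warm the router, the brain, and the most common specialist before the first query arrives.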

With 16GB VRAM: phi3 + gemma4 + mistral:7b fit (13.5GB). llava:13b offloads ~3.5GB to RAM, adding ~4s to vision queries.
With 24GB VRAM: all specialists including llava:13b run fully in VRAM. Zero RAM offloading at current model sizes.
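
The offload figures follow from simple subtraction. The sketch below assumes the whole path stays resident and that anything beyond the GPU's capacity spills to RAM. The function name offload_gb is hypothetical.

```python
def offload_gb(gpu_vram_gb: float, resident_gb: list) -> float:
    """GB that will not fit in VRAM when all listed models are resident."""
    return max(0.0, sum(resident_gb) - gpu_vram_gb)


LLAVA_PATH = [2.5, 12.0, 5.0]   # phi3:mini + llava:13b + gemma4:e4b
SEARCH_PATH = [2.5, 6.0, 5.0]   # phi3:mini + mistral:7b + gemma4:e4b

for gpu in (16, 24):
    print(f"{gpu} GB GPU: search path offloads {offload_gb(gpu, SEARCH_PATH)} GB, "
          f"vision path offloads {offload_gb(gpu, LLAVA_PATH)} GB")
```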

Routing Pipeline

📥 Input received
⚡ phi3:mini · Router · few-shot classify · temp 0 · 5 tokens
🎯 Route decision
🔧 Specialist model · processes the request
🧠 gemma4:e4b · Synthesis · final answer generation
✅ Answer delivered · session log updated
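
The router and route-decision steps might look roughly like the sketch below. The route labels come from the "Active when" column of the model table. The few-shot text, the helper names, and the choice to skip straight to synthesis when no label is recognised are assumptions for illustration, not the lab's actual router code.

```python
from typing import Optional

# Route labels from the model table; the mapping target is the specialist model.
SPECIALISTS = {
    "CODE": "codellama:13b",
    "VISION": "llava:13b",
    "SEARCH": "mistral:7b",
    "LOGS": "mistral:7b-instruct-v0.3-q8_0",
    "KNOWLEDGE": "nomic-embed-text",
}

# Hypothetical few-shot examples; the real prompt is not shown in this document.
FEW_SHOT = (
    "Classify the question with one label: "
    "CODE, VISION, SEARCH, LOGS, KNOWLEDGE.\n"
    "Q: Write CLI to configure VLAN 2140 on Ericsson router\nA: CODE\n"
    "Q: What 3GPP release does Open5GS 2.7.6 support?\nA: SEARCH\n"
)


def router_payload(question: str) -> dict:
    """Build an Ollama /api/generate request body for the router."""
    return {
        "model": "phi3:mini",
        "prompt": f"{FEW_SHOT}Q: {question}\nA:",
        "stream": False,
        "options": {"temperature": 0, "num_predict": 5},
    }


def parse_label(raw: str) -> Optional[str]:
    """Return the first known label found in the router's reply, if any."""
    words = raw.upper().replace(".", " ").replace(":", " ").split()
    for word in words:
        if word in SPECIALISTS:
            return word
    return None


def pick_specialist(raw: str) -> Optional[str]:
    """Map router output to a specialist; None means go straight to synthesis."""
    label = parse_label(raw)
    return SPECIALISTS.get(label) if label else None
```

Keeping the reply to 5 tokens at temperature 0 makes the label cheap to generate and easy to parse, which is what lets the router stay near its latency target.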

🔍

mistral:7b — Search Engine Report

search_agent.py · full architecture & roadmap

At a glance

3 search backends · 2 fetch layers · 4 SearXNG engines · 0 external API keys

Search stack — priority order

SearXNG — self-hosted metasearch

Runs locally on http://localhost:8080 · no rate limits · no API key
Aggregates: Google · DuckDuckGo · Wikipedia · GitHub in one query
Results: English-language only · Bing disabled (noise)
Config: /home/ericsson/searxng/searxng-lab.yml
Start: bash /home/ericsson/ai-lab/start_searxng.sh

LIVE

OpenClaw → DuckDuckGo

Subprocess call to openclaw CLI · triggers only if SearXNG is down
Provider: duckduckgo · limit 6 results · 30s timeout
Configured in ~/.openclaw/openclaw.json

FALLBACK 1

ddgs — DuckDuckGo Python library

Direct Python call · no subprocess · last resort if openclaw also fails
pip package: ddgs · same results as DuckDuckGo

FALLBACK 2
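
The priority order above amounts to a fallback chain. The sketch below takes the backends as plain callables so it runs without SearXNG, openclaw, or ddgs installed. The function name and the choice to also fall through on empty results are assumptions; the document only says the fallbacks trigger when the previous backend is down.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Backend = Callable[[str], list]


def search_with_fallback(
    query: str, backends: Iterable[Tuple[str, Backend]]
) -> Tuple[Optional[str], List]:
    """Try each backend in order; return (name, results) from the first that works."""
    for name, backend in backends:
        try:
            results = backend(query)
        except Exception:
            continue  # backend down or timed out: fall through to the next one
        if results:
            return name, results
    return None, []


# Stub backends standing in for SearXNG, the openclaw CLI, and the ddgs library.
def searxng_down(query: str) -> list:
    raise ConnectionError("SearXNG not running")


def openclaw_stub(query: str) -> list:
    return [{"title": "stub result", "url": "https://example.org"}]


used, hits = search_with_fallback(
    "Open5GS release notes",
    [("searxng", searxng_down), ("openclaw", openclaw_stub)],
)
print(used, len(hits))
```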

Page fetch stack — enriches thin snippets (<120 chars)

Layer 1 — requests + trafilatura (static)

Fast plain HTTP fetch · ~1–2s · works for any static HTML page
trafilatura extracts main content, strips nav/ads/boilerplate
If extracted text ≥ 150 chars → accepted, no Layer 2 needed

LIVE

Layer 2 — Playwright headless Chromium (JS)

Full browser render · ~4–6s · fires only when Layer 1 returns <150 chars
Handles React/Vue/Angular SPAs, lazy-loaded content, dynamic docs
Waits 1.5s after domcontentloaded for JS to settle
Confirmed working: releasealert.dev, deepwiki-style pages
Browser binary: ~/.cache/ms-playwright/chromium-1223/

LIVE
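
The two-layer decision can be sketched as follows, using the 120- and 150-character thresholds stated above. The fetchers are passed in as callables, so no requests, trafilatura, or Playwright call signatures are assumed. The function name and the final fallback order are illustrative.

```python
from typing import Callable

SNIPPET_MIN = 120   # snippets shorter than this get enriched
EXTRACT_MIN = 150   # Layer 1 output shorter than this triggers Layer 2


def enrich(
    snippet: str,
    url: str,
    fetch_static: Callable[[str], str],
    fetch_rendered: Callable[[str], str],
) -> str:
    """Return the best available text for one search result."""
    if len(snippet) >= SNIPPET_MIN:
        return snippet                    # snippet is already rich enough
    text = fetch_static(url)              # Layer 1: plain HTTP + extraction
    if len(text) >= EXTRACT_MIN:
        return text
    rendered = fetch_rendered(url)        # Layer 2: headless browser render
    return rendered or text or snippet    # keep whatever is non-empty
```

Because the slower browser layer runs only when the static fetch comes back thin, most pages cost 1–2s rather than 4–6s.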

ALWAYS SKIPPED (bot-protected / unfetchable)

Domains: ericsson.com · scribd.com · youtube.com · twitter.com / x.com · linkedin.com · facebook.com · springer.com · jstor.org · ieee.org
File types: .pdf · .doc · .ppt · .zip
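
A skip check along these lines needs only the standard library. The domain and extension lists are copied from the skip list above. Matching subdomains (for example www.ericsson.com) is an assumption about the intended behaviour, and the function name is hypothetical.

```python
from urllib.parse import urlparse

SKIP_DOMAINS = {
    "ericsson.com", "scribd.com", "youtube.com", "twitter.com", "x.com",
    "linkedin.com", "facebook.com", "springer.com", "jstor.org", "ieee.org",
}
SKIP_EXTENSIONS = (".pdf", ".doc", ".ppt", ".zip")


def should_skip(url: str) -> bool:
    """True when the URL is on a skipped domain or points at a skipped file type."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS):
        return True
    return parsed.path.lower().endswith(SKIP_EXTENSIONS)
```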

Synthesis — after results are collected

mistral:7b · temperature 0.1 · max 800 tokens
System prompt: research assistant · use only provided search results · cite source URL for every key fact
Input: question + up to 6 search results (title · URL · body/extracted content)
Output: cited factual answer in plain prose · streamed token by token to the terminal
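
A request body matching these settings could be assembled as sketched below. The system prompt text is a paraphrase of the description above, not the prompt in search_agent.py, and the result keys title, url, and body as well as the helper names are assumptions.

```python
# Paraphrased from the description above; not the exact production prompt.
SYSTEM_PROMPT = (
    "You are a research assistant. Use only the provided search results. "
    "Cite the source URL for every key fact."
)


def build_prompt(question: str, results: list, limit: int = 6) -> str:
    """Format up to `limit` results (title, URL, body) ahead of the question."""
    blocks = [
        f"[{i}] {r['title']}\nURL: {r['url']}\n{r['body']}"
        for i, r in enumerate(results[:limit], start=1)
    ]
    return "Search results:\n\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"


def synthesis_payload(question: str, results: list) -> dict:
    """Request body for Ollama's /api/generate with the settings listed above."""
    return {
        "model": "mistral:7b",
        "system": SYSTEM_PROMPT,
        "prompt": build_prompt(question, results),
        "stream": True,  # tokens are streamed to the terminal
        "options": {"temperature": 0.1, "num_predict": 800},
    }
```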

Future additions — roadmap

Semantic reranking HIGH

Use nomic-embed-text (already running for ChromaDB) to score each search result by semantic similarity to the query. Reorder results so mistral:7b always sees the most relevant content first, not just what SearXNG ranked first.
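
A rough sketch of the reranking idea follows. The embed argument stands in for whatever function returns a nomic-embed-text vector, and here it is replaced by a toy lookup table so the example runs offline. The names cosine and rerank and the body key are assumptions.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query: str, results: List[dict],
           embed: Callable[[str], Sequence[float]]) -> List[dict]:
    """Order results by similarity between the query and each result body."""
    query_vec = embed(query)
    return sorted(
        results,
        key=lambda r: cosine(query_vec, embed(r["body"])),
        reverse=True,
    )


# Toy embedding table used only for this demonstration.
TOY_VECTORS = {
    "gps sync": [1.0, 0.0],
    "ptp timing requirements": [0.9, 0.1],
    "pizza recipes": [0.0, 1.0],
}
ranked = rerank(
    "gps sync",
    [{"body": "pizza recipes"}, {"body": "ptp timing requirements"}],
    TOY_VECTORS.__getitem__,
)
print([r["body"] for r in ranked])
```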

PDF extraction HIGH

Install pymupdf (pip install pymupdf) to extract text from PDF URLs — unlocks Ericsson datasheets, 3GPP specs, and archive.org technical documents currently skipped by the fetch stack.

Parallel URL fetch MED

Use concurrent.futures to fetch the top-N URLs simultaneously instead of sequentially. Cuts total fetch time from roughly N×8s to ~8s (the slowest single page), as long as the worker pool is large enough to cover all N URLs.
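
One way this could look with a thread pool. The fetch argument stands in for the existing single-URL fetcher. The name fetch_all, the default pool size, and returning an empty string for a failed URL are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(urls: List[str], fetch: Callable[[str], str],
              max_workers: int = 6) -> List[str]:
    """Fetch URLs concurrently; results keep the order of the input list."""
    if not urls:
        return []

    def safe_fetch(url: str) -> str:
        try:
            return fetch(url)
        except Exception:
            return ""  # one failed page should not abort the others

    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

Threads suit this workload because each fetch spends most of its time waiting on the network rather than computing.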

Query expansion MED

Use phi3:mini (already running for routing) to rephrase the query before searching. Turns "BB 6630 GPS sync" into "Ericsson Baseband 6630 GPS synchronization IEEE 1588 PTP requirements" — better query, better results.

More SearXNG engines MED

Stack Overflow (technical Q&A) and arXiv (5G research papers) are built into SearXNG and are one-line additions to searxng-lab.yml. Also consider enabling the Brave Search engine for broader coverage.

Better mistral prompt MED

Add lab-specific instructions: prefer open5gs.org and 3gpp.org over generic blogs, always cite version numbers, format commands as code blocks, downrank sources older than 2 years. Direct improvement to answer quality.

📋

mistral:7b-instruct — Log Parser

tools/log_agent.py · mistral:7b-instruct-v0.3-q8_0 · triggered by LOGS label

At a glance

7B params · Q8_0 quant
~4GB disk · ~6GB VRAM
temperature 0 (deterministic)
1024 max tokens

Strict JSON output schema

The agent always hands valid JSON to the caller · parseable by json.loads() with no further post-processing

    {
      "summary":  "one sentence: what happened overall",
      "errors":   ["error message 1", "error message 2", ...],
      "warnings": ["warning 1", "warning 2", ...],
      "events":   ["notable event 1", "event 2", ...]
    }

System prompt enforces: no markdown · no prose · no code fences · raw JSON only
The agent auto-strips accidental ``` fences if the model adds them · returns a graceful error JSON on failure
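
The fence stripping and graceful failure could be implemented roughly as below. The shape of the error JSON is an assumption (the document does not show it), and the function name is hypothetical.

```python
import json
import re

# Matches an opening fence (with optional language tag) or a closing fence.
FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*\n?|\n?```\s*$")

SCHEMA_LISTS = ("errors", "warnings", "events")


def parse_model_json(raw: str) -> dict:
    """Strip accidental code fences, parse, and always return the schema keys."""
    text = FENCE_RE.sub("", raw.strip())
    try:
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        # Assumed error shape: same keys, with the problem in the summary.
        return {"summary": f"parse error: {exc}",
                "errors": [], "warnings": [], "events": []}
    data.setdefault("summary", "")
    for key in SCHEMA_LISTS:
        data.setdefault(key, [])
    return data
```

Returning the same four keys on failure means downstream code, such as the terminal renderer, never has to special-case a bad model reply.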

Input sources accepted

📄

Raw log text — paste directly into CLI

Paste journalctl output · syslog lines · Open5GS log lines directly at the CLI prompt

📁

File path — agent reads the file automatically

Enter a path like /var/log/open5gs/amf.log — agent reads and parses the full file

What it parses best

AMF / MME logs

Registration failures · NAS rejections · S1AP errors · UE attach/detach events · NGAP issues

UPF logs

Packet routing errors · PFCP session failures · data plane drops · GTP tunnel problems

PTP4L / chrony logs

Sync loss events · clock offset violations · master clock changes · GPS lock/unlock events

systemd journal

Any journalctl output · service crashes · kernel messages · OOM kills · segfaults

Terminal display

Renders a colored box: ┌── Log Analysis ──────────────────────────────────────────┐
Shows: summary · up to 5 errors (✗) · up to 5 warnings (⚠) · up to 5 events (·)
JSON result also passed to gemma4:e4b for final natural-language synthesis

📚

nomic-embed-text — Knowledge Base

knowledge/search.py · knowledge/ingest.py · ChromaDB at knowledge/chroma_db/

At a glance

270 MB model size · ~0.5 GB VRAM · top 3 chunks returned · 100% offline · no API

Ingest & search pipeline

Ingest: ingest <path>

Reads any file or directory · splits documents into overlapping text chunks for better recall

Embed: each chunk → float vector via Ollama

POST /api/embed with nomic-embed-text · returns a 768-dimensional float vector per chunk

Store: ChromaDB at knowledge/chroma_db/

Persistent local vector database · survives restarts · stored on disk in the project folder

Search: query → embed → cosine similarity → top 3

Query embedded with same model · ChromaDB finds closest vectors · top 3 chunks returned with relevance scores shown as ████░░ bars

Synthesize: top chunks → gemma4:e4b → answer

The 3 matching chunks are passed to gemma4:e4b, which answers citing the source document and chunk number
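
The chunking and top-3 steps can be illustrated without ChromaDB or a running Ollama server. In this sketch the chunk size and overlap are assumed values (the document does not state them), the vectors are toy data, and every function name is hypothetical. In the real pipeline the vectors would come from nomic-embed-text and the nearest-neighbour search from ChromaDB.

```python
import math
from typing import List, Sequence, Tuple


def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into chunks of `size` chars that overlap by `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, chunk) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def bar(score: float, width: int = 6) -> str:
    """Render a relevance score in [0, 1] as a block bar."""
    filled = round(max(0.0, min(1.0, score)) * width)
    return "█" * filled + "░" * (width - filled)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the "better recall" the ingest step refers to.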

What you should index

Equipment manuals

Ericsson BB 6630 admin guide · Router 6000 CLI reference · Element Manager documentation

Open5GS configs

Your YAML configs for AMF/MME/UPF/SMF/NRF · subscriber exports · PLMN and slice settings

Lab runbooks

Step-by-step procedures · maintenance guides · troubleshooting checklists · incident reports

Router CLI exports

Router 6000 running configs · VLAN assignments · context configurations · route tables

CLI commands

    ingest /path/to/docs/            # index a whole folder
    ingest router_config.txt         # index a single file
    ingest /mnt/e/ericsson_manuals/  # index from a Windows drive

Then ask naturally: "What does our router config say about VLAN 2120?"
Semantic search matches on meaning, so no exact keyword match is needed