Network Topology
AI Subsystem
Model Roles
Hardware Requirements
Demo
AI server (WSL2) · Open5GS core · Ericsson Router 6000 · BB 6630
supptel Ericsson Team
Question routing & model delegation flow
Model roles, responsibilities & example prompts
Recommended hardware for the AI server
| Component | Minimum | Recommended | Why | Your risk without it |
|---|---|---|---|---|
| GPU | RTX 3080 Ti (12GB) | RTX 4090 (24GB) or A10 (24GB) | Pipeline runs 3 models simultaneously. llava:13b alone needs 12GB. codellama:13b needs 10GB. Without GPU, all 7 models fall back to CPU. | 60–180s response times on 13B models. llava:13b is completely unusable on CPU. Vision queries time out. |
| VRAM | 16GB | 24GB+ | The pipeline runs phi3:mini + specialist + gemma4:e4b in sequence — all three overlap in VRAM during synthesis. Peak demand is 19.5GB (llava path). 24GB covers all 7 specialists with no offloading. | 16GB handles most queries but llava:13b will partially offload to RAM (~3–4s extra per vision query). 10GB is too small — codellama alone won't fit cleanly. |
| RAM (system) | 32GB | 64GB | Models overflow VRAM to RAM. SearXNG + Playwright headless Chromium + ChromaDB + WSL2 overhead together consume ~8–12GB just for the support stack. Remaining RAM absorbs VRAM spillover. | System thrashes when multiple specialists are called in sequence. Playwright browser fetch crashes. ChromaDB indexing slows to a crawl. |
| CPU | 8 core | 16 core (Ryzen 9 / i9 / Xeon) | phi3:mini routing runs on CPU even with a GPU present. Playwright headless Chromium, SearXNG Flask server, ChromaDB embeddings, and SSH tool calls all run concurrently on CPU alongside model inference. | phi3:mini routing adds 400–800ms instead of <200ms. Parallel search fetch competes with inference. Tool calls feel sluggish. |
| Storage | 512GB NVMe | 1TB+ NVMe (PCIe 4.0) | Current 7 models total ~31GB. Planned upgrade to qwen2.5-coder:32b adds ~20GB. ChromaDB grows as you index lab docs. SearXNG, Playwright browser (~130MB), logs, and WSL2 image add up fast. | Running out of space when pulling the code model upgrade. Slow model load from a SATA SSD adds 5–15s cold-start latency per model. |
| Network | 100Mbps Ethernet | 1Gbps Ethernet to lab network | SSH to core (172.29.10.26), Prometheus queries (:9099), NMS health (:8888), AMF/MME log streaming — all concurrent when multiple tools are invoked. SearXNG also fetches from upstream search engines. | Prometheus queries and log fetches add 1–3s latency. SSH timeout risks during heavy concurrent tool use. |
| OS | WSL2 Ubuntu 22.04 ✓ (current) | Native Ubuntu 22.04 (dual boot or dedicated server) | Native Linux removes WSL2 memory overhead, fixes GPU passthrough latency, eliminates port-forwarding complexity, and lets Playwright and SearXNG bind ports directly without Windows NAT. | WSL2 works but adds ~2GB RAM overhead, occasional GPU driver friction, and Playwright requires --no-sandbox workaround (already applied). |
Current model stack — 7 models · ~31 GB total on disk
| Model | Role | Disk | VRAM | Active when | Priority |
|---|---|---|---|---|---|
| phi3:mini | Router | ~2 GB | ~2.5 GB | Every single query — always first | ★★★ pull first |
| gemma4:e4b | Brain · synthesis | ~4 GB | ~5 GB | Every single query — always last | ★★★ pull second |
| nomic-embed-text | Embeddings · KB | ~270 MB | ~0.5 GB | KNOWLEDGE queries · knowledge base ingest | ★★★ tiny · huge value |
| mistral:7b | Web search | ~4 GB | ~6 GB | SEARCH queries · SearXNG result synthesis | ★★★ pull third |
| mistral:7b-instruct-v0.3-q8_0 | Log parser | ~4 GB | ~6 GB | LOGS queries · structured JSON output | ★★☆ pull fourth |
| codellama:13b | Code · CLI | ~8 GB | ~10 GB | CODE queries · scripts · configs · PromQL | ★★☆ pull fifth |
| llava:13b | Vision · images | ~8 GB | ~12 GB | VISION queries only · needs image path from user | ★☆☆ needs 24GB VRAM |
| Total — all 7 models | ~30.3 GB | peak 19.5 GB | not all active simultaneously — Ollama loads on demand | ||
Planned model upgrade — requires RTX 4090 (24GB VRAM)
| Current model | Planned replacement | Disk | VRAM | Quality gain | Needs |
|---|---|---|---|---|---|
| codellama:13b | qwen2.5-coder:32b | ~20 GB | ~20 GB | HumanEval: 38% → 80% · better CLI syntax · better PromQL | 24GB VRAM minimum |
| After upgrade: total disk ~42 GB · peak VRAM 27 GB (phi3 + qwen2.5-coder:32b + gemma4) → requires 24GB GPU with partial RAM offload for gemma4 layers. Prompt documented in CODING_IMPROVEMENTS.md PROMPT 6. | |||||
Support stack — RAM & CPU overhead (no VRAM)
| Component | RAM usage | CPU threads | Notes |
|---|---|---|---|
| SearXNG (Flask server) | ~150 MB | 1–2 | Must be started manually after WSL reboot · logs to ai-lab/logs/searxng.log |
| Playwright headless Chromium | ~300–500 MB | 2–4 | Launched per JS-fetch request · auto-closes after · requires --no-sandbox on WSL2 |
| ChromaDB (vector store) | ~100 MB + index size | 1 | Grows as you ingest lab docs · stored at knowledge/chroma_db/ |
| Ollama server | ~200 MB + model | 2–8 | Keeps last-used model warm in VRAM · auto-evicts after idle timeout |
| WSL2 + Ubuntu overhead | ~2–3 GB | 2 | Fixed OS overhead · more on Windows host with other apps running |
| Support stack total | ~3–4 GB | ~8–16 | on top of whatever model VRAM is in use |