Attribution: This article was based on content by @whamp on GitHub.
Original: https://github.com/Whamp/chirp

Local, private dictation: why it matters

Speech-to-text (STT) is transforming how we interact with computers — from hands-free note taking to accessibility and real-time productivity boosts. But mainstream dictation solutions often mean sending audio to cloud services (privacy risk), require GPUs (costly), or demand installers and admin rights that locked-down corporate machines won’t allow. For many users — journalists, clinicians, developers working on secure endpoints — the sweet spot is a local, low-friction STT stack that runs on CPU or modest GPUs and integrates with the OS without dropping new .exe installers.

Projects like Chirp aim to fill that gap: a local Windows dictation tool that uses modern models (NVIDIA ParakeetV3 in this case), runs via Python-only deployment (no extra executables), and is designed to be usable in locked-down environments. This guide expands that idea into a practical taxonomy of local STT approaches, representative tools, integration patterns, evaluation criteria, and a hands-on getting-started path.

Key Takeaways

  • Local STT is feasible and practical: modern models + runtime stacks (ONNX Runtime, optimized runtimes) let you run high-quality transcription on CPU or modest GPU hardware.
  • Choose between turnkey local apps (fast to deploy) and modular stacks (more flexible) based on control, privacy, and integration needs.
  • ONNX Runtime, edge-optimized models (ParakeetV3/Whisper variants), and lightweight service layers (FastAPI/Uvicorn) form a robust, portable architecture for self-hosted dictation.
  • Evaluate on accuracy, latency, offline capability, model size/footprint, and deployment constraints (no-exe, admin-free, Windows compatibility).

Background: STT = speech-to-text, i.e., converting spoken audio into written text in real time or batch.

A taxonomy for privacy-first STT tooling

I group local/edge STT tools into four categories that reflect the typical trade-offs teams face.

  1. Local turnkey dictation apps (end-user oriented)
  2. Lightweight open-source STT toolkits (offline-first)
  3. Model runtimes and accelerators (inference engines)
  4. Integration / orchestration components (APIs, containers, service runners)

Each category targets a particular problem: user-focused single-machine dictation, building blocks for embedded/offline systems, raw inference performance, and ways to integrate STT into workflows.

1) Local turnkey dictation apps

These are desktop-focused projects that package capture, inference, and OS integration for individual users. They prioritize minimal setup and immediate usability.

  • Chirp
    Chirp is explicitly designed for locked-down Windows environments where installing .exe files is not permitted. It runs with Python only, uses NVIDIA’s ParakeetV3 model to provide accurate local dictation, and aims to expose a simple user interface and hotkey-driven capture (a capture sketch follows this category). Trade-offs: very convenient for Windows users but depends on available model weights and runtime configuration; best for single-machine personal use where you can run Python.

  • Windows Speech Recognition (built-in)
    The built-in Windows speech stack requires no installs and integrates directly with the OS. Pros: zero setup and deep OS integration. Cons: often lower accuracy than modern neural models and may rely on cloud services in certain editions or be disabled in locked environments.

When to choose: pick Chirp (or similar Python-first apps) when you need high-quality, privacy-preserving dictation on a locked-down Windows machine and can run Python. Use built-in OS features when simplicity and zero-dependency are paramount.
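
To make the hotkey-driven capture pattern concrete, here is a minimal sketch using the third-party keyboard and sounddevice packages. This is not Chirp's actual code; the hotkey, sample rate, fixed recording length, and the transcribe() placeholder are all illustrative assumptions.

# Minimal hotkey-to-audio sketch (illustrative only; not Chirp's implementation).
# Assumes: pip install keyboard sounddevice numpy
import keyboard
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000   # most STT models expect 16 kHz mono
RECORD_SECONDS = 5    # fixed-length capture keeps the example simple

def transcribe(audio: np.ndarray) -> str:
    # Placeholder: hand the samples to whatever runtime you use
    return "<transcription goes here>"

def capture_and_transcribe():
    # Record a short mono clip from the default microphone
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    print(transcribe(np.squeeze(audio)))

keyboard.add_hotkey("ctrl+alt+d", capture_and_transcribe)  # hypothetical hotkey
keyboard.wait()  # keep the script alive, listening for the hotkey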

2) Lightweight open-source STT toolkits (offline-first)

These projects provide models and inference pipelines designed to run locally without cloud dependency.

  • Vosk
    Vosk is an offline speech recognition toolkit with bindings for many languages and platforms (including Windows). It includes compact models that run on CPU with low latency and supports streaming recognition (a minimal streaming sketch follows this subsection). Trade-offs: models are smaller (fast, lower compute) but can lag behind the latest large neural transcription models in raw accuracy. Great for embedded devices and real-time applications.

  • Kaldi
    Kaldi is a mature and widely used speech toolkit used in research and production. It’s highly configurable and powerful for building custom pipelines. Trade-offs: steeper learning curve and more infrastructure overhead than turnkey toolkits. Choose Kaldi for highly customized production pipelines or research tasks.

When to choose: use Vosk for quick local deployments on desktop or low-power devices. Choose Kaldi when you need advanced customization, model training pipelines, or integration with ASR research.
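
To make the Vosk option concrete, here is the minimal file-based streaming pattern referenced in the Vosk entry above. The model directory ("model") and the sample file name are placeholders; the WAV is assumed to be 16-bit mono at the recognizer's sample rate.

# Minimal Vosk streaming sketch: transcribe a 16-bit mono WAV file chunk by chunk.
# Assumes: pip install vosk, and a Vosk model unpacked at ./model (placeholder path).
import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("sample.wav", "rb")
model = Model("model")                      # directory containing the downloaded Vosk model
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)              # feed the audio in small chunks (streaming style)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])       # finalized segment
print(json.loads(rec.FinalResult())["text"])           # flush the last partial segment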

3) Model runtimes and accelerators

These are the engines that execute neural models efficiently, often bringing GPU/CPU optimizations and cross-platform portability.

  • ONNX Runtime
    ONNX (Open Neural Network Exchange) Runtime is an inference engine that runs models converted to the ONNX format. It supports CPU, GPU, and hardware accelerators and is widely used to run optimized models on edge and server environments (a minimal loading sketch follows this category). Trade-offs: requires model conversion to ONNX in some cases and careful operator compatibility checks. Choose ONNX Runtime when portability and performance across hardware types matter.

Background: ONNX is an open format for neural models to enable portability across frameworks and runtimes.

  • NVIDIA NeMo (https://github.com/NVIDIA/NeMo) / TensorRT (https://developer.nvidia.com/tensorrt)
    NVIDIA provides model toolkits (NeMo) and runtime acceleration (TensorRT) to minimize latency and maximize throughput on NVIDIA GPUs. Trade-offs: superior performance on NVIDIA hardware but vendor-specific; also requires GPU access. Choose these when low-latency, GPU-accelerated inference is available and necessary.

When to choose: prefer ONNX Runtime for broad compatibility (CPU-first or GPU), and NVIDIA stacks when you control the GPU environment and demand top-end performance.
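
The ONNX Runtime loading sketch referenced above: the model path, the dummy input, and the assumption that the first graph input takes raw audio samples are placeholders that depend on the specific ONNX export you use.

# Minimal ONNX Runtime inference sketch (input/output names vary per model export).
# Assumes: pip install onnxruntime numpy, and a local file at models/stt.onnx (placeholder path).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "models/stt.onnx",
    providers=["CPUExecutionProvider"],   # swap in "CUDAExecutionProvider" with onnxruntime-gpu
)

# Inspect the graph to find the real input/output names for your export
print([i.name for i in session.get_inputs()], [o.name for o in session.get_outputs()])

audio = np.zeros((1, 16000), dtype=np.float32)   # placeholder: 1 second of silence at 16 kHz
outputs = session.run(None, {session.get_inputs()[0].name: audio})
print(outputs[0].shape)   # raw logits/tokens; decoding to text is model-specific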

4) Integration / orchestration components

These components make STT available to applications: web APIs, containers, and simple orchestration.

  • FastAPI + Uvicorn (https://www.uvicorn.org/)
    FastAPI is a modern Python web framework ideal for creating a local STT service; Uvicorn is the ASGI server that runs it. Trade-offs: requires running a small service but enables programmatic access, hotkeys, browser clients, and multi-client access (a minimal client call is sketched after this category). Best when you want a reusable API rather than a single desktop app.

  • Docker / Docker Compose (https://www.docker.com/)
    Containerizing STT runtimes makes them portable and reproducible, simplifies dependency management, and allows you to run the same stack across machines. Trade-offs: on Windows in locked-down environments you may not be able to run Docker; containers add overhead.

When to choose: use FastAPI/Uvicorn when you need an HTTP API for integration; use containers when managing reproducible deployments across machines or servers.
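
Once such a FastAPI service is running locally, any client on the machine can reach it over HTTP. The sketch below assumes a hypothetical /transcribe endpoint that accepts a multipart file upload and returns JSON with a text field (the same shape used in the app.py sketch later in this guide).

# Minimal client for a local STT API (endpoint name and response shape are assumptions).
# Assumes: pip install requests
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/transcribe",           # local service started with uvicorn
        files={"audio": ("sample.wav", f, "audio/wav")},
        timeout=60,
    )
resp.raise_for_status()
print(resp.json()["text"])                             # hypothetical response field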

Example Stacks

Here are three realistic stack patterns that combine components above.

  1. Single-user Windows dictation (no .exe install)

    • Chirp (frontend hotkeys + small UI)
    • ONNX Runtime (CPU or GPU)
    • ParakeetV3 ONNX model weights (local file)
      Pattern: audio capture -> local preprocessing -> ONNX inference -> text insertion into the focused window (see the clipboard-paste sketch after this list).
  2. Self-hosted team STT server (privacy-first)

    • FastAPI service + Uvicorn
    • ONNX Runtime (server-side GPU) or NeMo/TensorRT for acceleration
    • ParakeetV3 or optimized Whisper model in ONNX format
    • Auth proxy + storage for transcripts
      Pattern: client(s) send audio chunks -> API preprocessing -> model inference -> textual response; transcripts stored and available via authenticated API.
  3. Embedded/edge transcription pipeline

    • Vosk on-device for streaming low-latency needs
    • Kafka or local message bus to pipe transcripts to downstream analytics
    • Lightweight dashboard or sink to storage
      Pattern: device captures audio -> Vosk transcribes -> events forwarded for analytics.
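
For the "text insertion into the focused window" step in stack 1, one portable approach is to put the transcript on the clipboard and simulate a paste keystroke. The sketch below uses the pyperclip and keyboard packages; it is an illustrative assumption, not Chirp's actual mechanism.

# Insert transcribed text into the currently focused window via the clipboard.
# Assumes: pip install pyperclip keyboard (illustrative approach only).
import time
import pyperclip
import keyboard

def insert_text(text: str) -> None:
    pyperclip.copy(text)          # place the transcript on the clipboard
    time.sleep(0.05)              # give the clipboard a moment to settle
    keyboard.send("ctrl+v")       # simulate paste into whatever window has focus

insert_text("hello from local dictation")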

Integration Architecture (ASCII diagram)

Example: Chirp-style local pipeline (single-machine)

Client (hotkey/UI)
        |
        v
Audio Capture (microphone driver / WASAPI)
        |
        v
Preprocessor (resample, normalize) --> Optional VAD (voice activity detection)
        |
        v
Inference Runtime (ONNX Runtime or TensorRT) -> Model weights (ParakeetV3 or Whisper)
        |
        v
Postprocess (punctuation, timestamping)
        |
        v
OS Integration (paste into focused window / clipboard / local API)

Integration points and data flow:

  • Audio capture hands off PCM frames to the preprocessor.
  • Preprocessor buffers and optionally performs VAD (a simple energy-based check is sketched below); then batches frames to the inference runtime.
  • The runtime runs the model, produces raw tokens or logits, which postprocess converts to readable text.
  • Final text is returned to the UI or an HTTP endpoint; no audio leaves the host.
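
The VAD step can be as simple as an energy threshold on each PCM frame. The sketch below is deliberately naive (the threshold and frame handling are assumptions you would tune); a dedicated VAD library is the more robust choice in production.

# Naive energy-based voice activity detection over float32 PCM frames in [-1, 1].
# Threshold is an assumption; tune it (or use a dedicated VAD library) in practice.
import numpy as np

def is_speech(frame: np.ndarray, threshold: float = 0.01) -> bool:
    rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))   # root-mean-square energy
    return rms > threshold

def speech_frames(frames):
    # Keep only frames that look like speech before batching them to the inference runtime
    return [f for f in frames if is_speech(f)]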

Practical evaluation criteria

When choosing tools/stacks, compare along these axes:

  • Accuracy: word error rate (WER) on your target language/accents (see the measurement snippet after this list).
  • Latency: end-to-end time from spoken word to text (important for dictation).
  • Resource use: CPU/threads, RAM, and GPU needs; model size on disk.
  • Offline capability & privacy: whether models run fully locally without network traffic.
  • Ease of deployment: can you run with Python only (no admin/exe) on Windows?
  • Integration points: does it provide an API or easy OS integration (hotkeys, clipboard hooks)?
  • Maintenance / support and model updates: availability of pre-trained weights and community.
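
To compare candidates on accuracy, measure WER on a handful of your own recordings rather than relying only on published benchmarks. The jiwer package (one option among several WER implementations) reduces the computation to one call; the strings below are placeholders.

# Word error rate on a reference/hypothesis pair (strings are placeholders).
# Assumes: pip install jiwer
import jiwer

reference = "insert the human-verified transcript here"
hypothesis = "insert the model output here"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")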

Getting started — local Windows dictation with a model runtime

Below is a practical starting path you can adapt to Chirp-style workflows.

  1. Prereqs
  • Python 3.10+ installed (user-level install is usually allowed).
  • Git (optional) to clone projects.
  • For GPU acceleration, appropriate drivers + CUDA (if using CUDA-enabled runtimes).
  2. Clone and set up (conceptual steps; follow the project README for exact commands)
  • git clone https://github.com/Whamp/chirp
  • python -m venv .venv
  • .venv\Scripts\activate (PowerShell/cmd) or source .venv/Scripts/activate (Git Bash)
  • pip install -r requirements.txt
  • Install an inference runtime:
    • CPU: pip install onnxruntime
    • GPU: pip install onnxruntime-gpu (requires CUDA)
  3. Obtain ParakeetV3 (or another model) and place the ONNX file in a models/ directory. Many projects publish ONNX exports on Hugging Face or vendor portals; follow their license and distribution rules.
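
If the weights are published on Hugging Face, the huggingface_hub client can fetch them into the models/ directory; the repository ID and filename below are placeholders, not a real ParakeetV3 export.

# Download an ONNX export from Hugging Face into ./models (repo ID and filename are placeholders).
# Assumes: pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="some-org/parakeet-v3-onnx",   # placeholder: use the actual model repository
    filename="model.onnx",                 # placeholder: use the actual file name
    local_dir="models",
)
print(path)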

  4. Run locally (conceptual):

  • If the project provides a small CLI runner (some projects use a lightweight wrapper named uv), run that to start the dictation server or UI.
  • Alternatively start a Python service directly:
    • uvicorn chirp.app:app --host 127.0.0.1 --port 8000
  5. Docker Compose example (containerized inference API)
  • docker-compose.yml
version: "3.8"
services:
  stt:
    build: ./stt-api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
  • Dockerfile (./stt-api)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Inside app.py you would load ONNX Runtime, load the model from /app/models, and expose an endpoint that accepts audio and returns text.
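
A minimal app.py along those lines might look like the sketch below. The endpoint name, audio decoding with soundfile, and the decode() placeholder are assumptions; real token decoding is model-specific, so treat this as a skeleton rather than a working transcriber.

# Skeleton app.py: FastAPI endpoint backed by ONNX Runtime (decoding is model-specific).
# Assumes: pip install fastapi uvicorn onnxruntime numpy soundfile
import io
import numpy as np
import onnxruntime as ort
import soundfile as sf
from fastapi import FastAPI, UploadFile, File

app = FastAPI()
session = ort.InferenceSession("/app/models/model.onnx",
                               providers=["CPUExecutionProvider"])  # path matches the compose volume

def decode(outputs) -> str:
    # Placeholder: token/CTC decoding depends on the model you export
    return "<decoded text>"

@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    raw = await audio.read()
    samples, sr = sf.read(io.BytesIO(raw), dtype="float32")   # in practice, resample to the model's rate
    if samples.ndim > 1:
        samples = samples.mean(axis=1)                         # downmix to mono
    feed = {session.get_inputs()[0].name: samples[np.newaxis, :]}
    outputs = session.run(None, feed)
    return {"text": decode(outputs)}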

Real-world usage patterns

  • Journalist on a locked-down corporate laptop: use a Python-based dictation app (Chirp) + CPU ONNX Runtime and a small, optimized ParakeetV3 model; this gives privacy and avoids .exe installs.
  • Research team building a custom pipeline: use Kaldi or NeMo for model training and deploy inference via ONNX Runtime or TensorRT in containers behind FastAPI.
  • IoT / edge device: select a compact Vosk model to run continuously on-device, stream transcripts to local analytics.

Further Resources

This guide was inspired by the Show HN post "Chirp – Local Windows dictation with ParakeetV3, no executable required" by @whamp.

Check the original project page for exact install steps, model download links, and project-specific configuration options.
