About the Role
We are hiring an AI Engineer specializing in Speech & Voice Systems to own the design and optimization of speech recognition, text-to-speech, and wake-word detection pipelines for a next-generation consumer AI product. The role focuses on delivering natural, low-latency voice interactions that run reliably on constrained edge hardware and scale seamlessly to cloud infrastructure.
Responsibilities
- Develop and optimize speech-to-text (STT) systems (e.g., Whisper, Vosk) for low-latency recognition.
- Implement and enhance text-to-speech (TTS) systems (Coqui, Piper, VITS), including multi-voice support and style variations.
- Prototype and refine wake-word detection, and integrate noise suppression, voice activity detection (VAD), automatic gain control (AGC), and audio normalization for robust performance.
- Apply model optimization techniques (quantization, ONNX, CTranslate2, GGUF/ggml) for offline/CPU-first inference.
- Design caching, streaming, and batching strategies to meet real-time performance targets (under 2 s end-to-end response).
- Collaborate with backend engineers to expose APIs (/stt, /tts, /wakeword) for integration into apps.
- Monitor performance via observability dashboards (Prometheus/Grafana, OpenTelemetry).
- Ensure privacy-first design: local-first processing with optional cloud fallback.
Qualifications
- 5+ years of professional experience in applied AI, with at least 3 years focused on speech technologies.
- Proven experience building production-grade STT/TTS systems.
- Strong knowledge of audio/DSP fundamentals: resampling, denoising, VAD, loudness normalization.
- Proficiency in Python (PyTorch); experience with FastAPI/Docker for model serving.
- Familiarity with wake-word frameworks (Porcupine, Snowboy) and streaming audio integration.
- Track record of delivering low-latency speech systems optimized for edge devices.
Nice to Have
- Experience with multilingual voice systems.
- Familiarity with real-time streaming architectures (WebRTC, gRPC).
- Exposure to IoT/edge deployment.