This is a fully remote job; the offer is available from: Ukraine
About Us
We are a stealth-mode startup building next-generation infrastructure for the AI industry. Our team has decades of experience in software, systems, and deep tech. We are working on a new kind of AI runtime that pushes the boundaries of performance and flexibility, making advanced models portable, efficient, and customizable for real-world deployment.
If you want to be part of a small, fast-moving team shaping the future of applied AI systems, this is your opportunity.
Role
We are looking for a C++ Engineer, based in Ukraine, with a strong systems and GPU programming background to help extend and optimize an open-source AI inference runtime. You will work on the low-level internals of large language model serving, focusing on:
- Dynamic adapter integration (e.g., LoRA/QLoRA)
- Incremental model update mechanisms
- Multi-session inference caching and scheduling
- GPU performance improvements (Tensor Cores, CUDA/ROCm)
This is a hands-on role: you will be designing, coding, profiling, and iterating on high-performance inference code that runs directly on CPUs and GPUs.
Responsibilities
- Implement support for runtime adapter loading (LoRA), enabling models to be customized on the fly without retraining or model merges.
- Design and implement mechanisms for incremental model deltas, allowing models to be extended and updated efficiently.
- Extend the runtime to handle multi-session execution, with isolation and caching strategies for concurrent users.
- Optimize core math kernels and memory layouts to improve inference performance on CPU and GPU backends.
- Collaborate with backend and infrastructure engineers to integrate your work into APIs and orchestration layers.
- Write benchmarks, unit tests, and profiling tools to ensure correctness and measure performance gains.
- Contribute to system architecture discussions and help define the roadmap for future runtime features.
Requirements
- Strong proficiency in modern C++ (C++14/17/20) and systems programming.
- Solid understanding of low-level performance optimization: memory management, multithreading, SIMD, cache efficiency.
- Experience with CUDA and/or ROCm/HIP GPU programming.
- Familiarity with linear algebra kernels (matrix multiply, attention) and how they map to hardware acceleration (Tensor Cores, BLAS libraries, etc.).
- Exposure to machine learning inference frameworks (e.g., llama.cpp, TensorRT, ONNX Runtime, TVM, PyTorch internals) is a plus.
- Comfortable working in a Unix/Linux environment; experience with build systems (CMake, Bazel) and CI pipelines.
- Strong problem-solving and debugging skills; ability to dive deep into both code and performance traces.
- Self-motivated and able to thrive in a fast-moving startup environment.
Nice to Have
- Experience implementing LoRA or adapter-based fine-tuning in inference runtimes.
- Knowledge of quantization methods and deploying quantized models efficiently.
- Background in distributed systems or multi-GPU orchestration.
- Contributions to open-source ML/AI systems.
Why Join
- Build core IP at the intersection of AI and systems engineering.
- Work with a highly technical founding team on problems that are both intellectually challenging and commercially impactful.
- Opportunity to shape the direction of a new AI platform from the ground up.
- Competitive compensation (contract or full-time), equity potential, and flexible remote work.
Please use this link to apply to this job: https://www.baasi.com/career/apply/3136319