Senior AI Inference Engineer (llama.cpp specialist): Tether
Jan 18, 2026 | Location: Mexico | Deadline: Not specified | Experience: Senior | Continent: North America
This is a Systems Engineering role disguised as an AI role. You will not be building chatbots in Python or fine-tuning models in PyTorch for cloud deployment.
Instead, you will be working "close to the metal" in C++. Your mission is to take massive Large Language Models (LLMs) and optimize them so they run efficiently on Edge Devices (consumer laptops, phones, and desktops) rather than powerful cloud servers. You are the engineer who makes AI "private" by ensuring it processes data locally on the user's hardware.
Key Responsibilities
Engine Optimization: You will port, maintain, and enhance the llama.cpp library. This involves writing low-level C++ code to make matrix multiplications faster on consumer CPUs and GPUs.
Runtime Performance: Your KPIs (Key Performance Indicators) are Inference Speed (tokens per second) and Memory Footprint (RAM usage). You must ensure models load instantly and don't crash the user's device.
Hardware Agnosticism: You must ensure the AI runs smoothly regardless of whether the user has an NVIDIA GPU, an Apple Silicon chip, or just a standard Intel CPU.
Research-to-Production: You act as the bridge for the Research team. When they discover a new model architecture, you write the C++ implementation to make it runnable for the end user.
Technical Stack
Primary Language: C++ (expert level required).
Core Libraries: llama.cpp, ggml (the tensor library underlying llama.cpp), ONNX.
Bonus Skills: JavaScript (likely for interfacing with the upper application layer).
Concepts: Model Quantization (4-bit, 8-bit), SIMD instructions (AVX, NEON), Memory Management.
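Model quantization is the core trick that lets a multi-gigabyte model fit in consumer RAM: weights are stored in 4 or 8 bits instead of 16 or 32, with a per-block scale to recover approximate float values. The sketch below is a toy version loosely in the spirit of ggml's Q4_0 layout; real ggml packs two 4-bit values per byte and uses fp16 scales, so treat the struct and function names here as illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Toy 4-bit block quantization: each block of 32 floats stores one
// fp32 scale plus 32 signed 4-bit values in [-8, 7]. (Kept as one
// int8 per value for clarity; real formats pack two per byte.)
struct BlockQ4 {
    float scale;
    int8_t q[32];
};

BlockQ4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b;
    b.scale = amax / 7.0f;  // map the largest magnitude to +/-7
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int v = (int)std::lround(x[i] * inv);
        b.q[i] = (int8_t)std::min(7, std::max(-8, v));
    }
    return b;
}

void dequantize_block(const BlockQ4& b, float* out) {
    for (int i = 0; i < 32; ++i) out[i] = b.q[i] * b.scale;
}
```

The round trip loses at most half a quantization step per weight, which is why 4-bit models stay usable while shrinking memory roughly 4x versus fp16.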
Strategic Context: Tether & Mexico
The Mission: Tether is building Keet, a Peer-to-Peer (P2P) chat and video application. To add AI to a P2P app without breaking privacy, the AI must run on the user's device. You are building the engine for that feature.
Location Strategy: While Tether is globally distributed, hiring this role in Mexico suggests a desire for time-zone alignment with North American engineering pods while tapping into Latin America's growing systems engineering talent pool.
Remote Culture: Tether is "hardcore remote." They expect autonomy and high output without a physical office environment.
Candidate Profile
You are likely the type of engineer who:
Reads the llama.cpp GitHub issues for fun.
Understands why mmap is crucial for loading large models.
Knows the difference between Q4_K_M and Q5_K_S quantization formats.
Prefers debugging a segfault in C++ over writing a REST API in Python.