Skills

Python, Go, Rust, Incident Response, CI/CD, Kubernetes, Monitoring, Research, Training, AWS, Cost Management, GCP, Kafka, Terraform

Job Specifications

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

At Nscale, our software engineers form the backbone of our product offering. We build state-of-the-art AI products that allow our clients to move quickly in an increasingly competitive digital landscape.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.

About The Role (Job Purpose)

Nscale is looking for a Senior Software Engineer to build and scale the control/data plane systems and application services that power our GenAI cloud. You'll work alongside domain experts and experienced engineers across our infrastructure, platform, and product teams to build the foundational systems that enable thousands of AI workloads to run reliably at scale. This is a high-impact role where you'll have significant ownership and the opportunity to shape how we build and operate critical platform services.

What You'll be Doing

Build and own the control plane and data plane services that power our cloud platform. You'll contribute to APIs and SDKs for platform consumption, implement reliable distributed state management and storage systems, and create services that coordinate workload scheduling and orchestration across multiple regions.
Engineer the infrastructure for processing and managing high-throughput workloads and distributed data flows. You'll solve complex challenges around data capture, storage, and accessibility for AI/ML training and inference.
Drive technical decisions for your systems and champion engineering best practices across the team. You will uphold high standards for reliability, testing, monitoring, and CI/CD in a fast-paced, research-driven environment, and provide technical mentorship to engineers on your team.
Own the operational health of your systems in production. You'll implement observability, respond to incidents, optimise performance, and continuously improve reliability based on production feedback and metrics.
You will have the opportunity to develop entirely new platform services and methods, leveraging cloud-native technologies and AI to create novel platform and product capabilities.

About You (Skills / Qualifications)

You have extensive hands-on experience designing, building, and operating scalable production systems on or for a major cloud provider (e.g., AWS, GCP), including data-intensive distributed workflows, backend services, and APIs.
You use AI tools like Claude, Cursor, or similar as a core part of your development workflow - not as a novelty, but as a fundamental multiplier of what you can build. Whether you're already using AI to rapidly prototype complex distributed systems, explore unfamiliar codebases, and architect solutions across new domains, or you're excited to push your AI-assisted development skills to that level, you understand the potential and are committed to mastering how to effectively collaborate with AI while maintaining high code quality and architectural coherence.
You believe in using the right tool for the job and have strong proficiency with typed languages. Our primary stack is built with Go, with some services in Rust and Python. You're comfortable working across different languages and applying various technical approaches to find the best solution.
You have taken multi-service distributed systems from ambiguous requirements to widely adopted production systems, with hands-on experience in day-2 operations including monitoring, alerting, incident response, and performance optimisation at scale.
You thrive in an ambiguous, fast-paced environment where you are given high levels of agency and ownership. You are a pragmatic problem-solver who is biased towards action and impact.

Nice to Have:

Experience designing developer-friendly APIs, SDKs, or platform services that customers and other teams depend on
Experience with Kubernetes, infrastructure-as-code (Terraform, Pulumi), event-driven architectures, and message queues (NATS, Kafka, RabbitMQ)
Experience with GPU orchestration and workload scheduling for AI/ML inference and training workloads
Contributions to open-source projects in the cloud-native ecosystem
Comfort contributing across the stack, including frontend, to accelerate delivery when needed

What We Can Offer You

At Nscale, you'll find a

About the Company

Nscale is the Hyperscaler engineered for AI, offering high-performance compute optimised for training, fine-tuning, and intensive workloads. From our data centres to our software stack, we are vertically integrated in Europe to provide unparalleled performance, efficiency and sustainability.