Job Specifications
Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.
Your Role
You will be central to building our high-performance storage platform that enables research teams to work with massive datasets for AI training, large-scale simulations, and data-intensive computational workloads. Working closely with product, fellow platform engineers, and infrastructure specialists, you'll design and implement storage systems that deliver high-throughput data access, manage multi-TB to PB-scale datasets, and provide the data protection and lifecycle management required by enterprise and research organizations.
Job Responsibilities
Storage Platform Architecture: Design and build scalable storage orchestration systems supporting block, object, and file storage optimized for AI training datasets, model checkpoints, simulation data, and large-scale data processing pipelines.
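As a rough illustration of the kind of unified abstraction this work involves, here is a minimal Python sketch; the interface and method names are hypothetical, not our actual API.

```python
from abc import ABC, abstractmethod
from typing import Iterator

class StorageBackend(ABC):
    """Hypothetical interface a block/object/file orchestrator might unify."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def stream(self, key: str, chunk_size: int = 8 * 2**20) -> Iterator[bytes]:
        """Stream large objects (e.g., model checkpoints) in 8 MiB chunks."""
```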
Research Cluster Storage: Design and implement storage systems for research computing environments including Kubernetes and SLURM clusters, enabling shared datasets, persistent storage for distributed training, model checkpoints, and high-throughput data access for batch processing workloads.
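For example, provisioning a shared dataset volume for a distributed training job might look like the following sketch, using the official Kubernetes Python client; the storage class, namespace, and sizes here are illustrative, not fixed parts of our stack.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "train-dataset"},
    "spec": {
        "accessModes": ["ReadWriteMany"],      # shared across training pods
        "storageClassName": "hp-nvme",         # hypothetical CSI-backed class
        "resources": {"requests": {"storage": "2Ti"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="research", body=pvc,            # namespace is illustrative
)
```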
High-Performance Data Access: Implement storage solutions that deliver consistent high-throughput and low-latency performance for demanding research workloads, including distributed training, large-scale simulations, and real-time data processing.
Data Pipeline Engineering: Build robust data ingestion, processing, and movement systems that handle massive datasets, support efficient data loading for training pipelines, and enable seamless data access across compute infrastructure.
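A common building block here is bounded prefetching, which keeps a few reads in flight so I/O overlaps with downstream compute. A minimal sketch, with `fetch` standing in for whatever object-store or filesystem read the pipeline actually uses:

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator

def prefetch(keys: Iterable[str], fetch: Callable[[str], bytes],
             depth: int = 4) -> Iterator[bytes]:
    """Yield fetch(key) results in order, keeping up to `depth`
    reads in flight so I/O overlaps downstream processing."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        window: deque[Future] = deque()
        for key in keys:
            window.append(pool.submit(fetch, key))
            if len(window) >= depth:
                yield window.popleft().result()
        while window:                 # drain the remaining in-flight reads
            yield window.popleft().result()
```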
Multi-Tiered Storage Orchestration: Build systems that coordinate across NVMe, SSD, and high-capacity storage tiers, placing and moving data based on access patterns and workload requirements.
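In spirit, tier placement reduces to a policy over access statistics; a toy sketch, with made-up thresholds:

```python
import time
from dataclasses import dataclass

@dataclass
class ObjectStats:
    last_access: float        # epoch seconds
    accesses_per_day: float

def choose_tier(stats: ObjectStats, now: float | None = None) -> str:
    """Toy placement policy; a real system also weighs size, cost,
    and SLOs, and these thresholds are invented for illustration."""
    now = time.time() if now is None else now
    idle_days = (now - stats.last_access) / 86400
    if idle_days < 1 and stats.accesses_per_day > 10:
        return "nvme"         # hot: serve from the fastest tier
    if idle_days < 14:
        return "ssd"          # warm
    return "capacity"         # cold: migrate to high-capacity storage
```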
Enterprise Data Protection: Implement comprehensive backup, snapshot, replication, and disaster recovery systems that meet enterprise data protection requirements and support zero-data loss policies.
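Retention logic is a typical piece of this work; a simplified grandfather-father-son pruning sketch, with illustrative budgets:

```python
from datetime import datetime

def snapshots_to_keep(snaps: list[datetime], daily: int = 7,
                      weekly: int = 4, monthly: int = 12) -> set[datetime]:
    """Keep the newest snapshot in each of the last `daily` days,
    `weekly` ISO weeks, and `monthly` months; anything not returned
    is a pruning candidate."""
    keep: set[datetime] = set()
    buckets = (
        (lambda d: d.date(), daily),
        (lambda d: d.isocalendar()[:2], weekly),    # (year, week)
        (lambda d: (d.year, d.month), monthly),
    )
    for key_fn, budget in buckets:
        newest: dict = {}
        for snap in sorted(snaps, reverse=True):    # newest first
            k = key_fn(snap)
            if k not in newest and len(newest) < budget:
                newest[k] = snap
        keep.update(newest.values())
    return keep
```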
Storage APIs & Integration: Create storage management APIs and SDKs that integrate seamlessly with compute platforms, research workloads, and financial data feeds.
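To give a flavor of the control-plane surface, here is a minimal volume-management API sketched with FastAPI; the endpoints, fields, and in-memory store are hypothetical, not our actual SDK.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="volume-api (illustrative)")

class VolumeRequest(BaseModel):
    name: str
    size_gib: int
    tier: str = "ssd"                      # hypothetical default tier

VOLUMES: dict[str, VolumeRequest] = {}     # a real system uses a durable store

@app.post("/v1/volumes", status_code=201)
def create_volume(req: VolumeRequest) -> dict:
    if req.name in VOLUMES:
        raise HTTPException(status_code=409, detail="volume already exists")
    VOLUMES[req.name] = req
    return {"name": req.name, "status": "provisioning"}

@app.get("/v1/volumes/{name}")
def get_volume(name: str) -> VolumeRequest:
    if name not in VOLUMES:
        raise HTTPException(status_code=404, detail="volume not found")
    return VOLUMES[name]
```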
Observability and Performance: Build monitoring and optimization systems that ensure consistent storage performance, track capacity utilization, and provide visibility into data access patterns.
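Instrumentation along these lines might be sketched with the prometheus_client library; metric names and values below are illustrative:

```python
from prometheus_client import Gauge, Histogram, start_http_server

READ_LATENCY = Histogram("storage_read_latency_seconds",
                         "End-to-end object read latency")
TIER_USED = Gauge("storage_tier_used_bytes",
                  "Bytes used per storage tier", ["tier"])

def read_object(key: str) -> bytes:
    with READ_LATENCY.time():        # histogram records duration on exit
        ...                          # the actual read path would go here
        return b""

start_http_server(9100)              # expose /metrics for Prometheus to scrape
TIER_USED.labels(tier="nvme").set(42 * 2**40)   # illustrative capacity sample
```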
Requirements
Experience: 5+ years in software engineering with proven experience building storage platforms, distributed storage systems, or data infrastructure for production environments.
Kubernetes Storage & Container Orchestration: Strong familiarity with Kubernetes storage architecture, persistent volumes, storage classes, and CSI drivers. Understanding of pods, deployments, stateful sets, and how Kubernetes manages storage resources.
Storage Systems Expertise: Deep understanding of storage architectures including block storage, object storage, file systems, distributed storage, NVMe, caching strategies, and performance optimization for both throughput- and latency-sensitive workloads.
Programming Skills: Expert-level Python proficiency. Experience with C/C++, Rust, or Go for performance-critical storage components is highly valued.
Linux & Systems Programming: Strong experience with Linux in production environments, including file systems, storage subsystems, and kernel-level storage interfaces.
Data Systems Engineering: Strong background in building data pipelines, ETL processes, and large-scale data processing systems, and in managing the data lifecycle at scale.
Platform & API Design: Proven experience building storage platforms with multi-tenancy, data isolation, and enterprise-grade reliability features.
Problem Solving & Architecture: Demonstrated ability to solve complex performance and scalability challenges while balancing pragmatic shipping with sound long-term architecture.