**Company Name**
Entrust
**Job Title**
Senior MLOps Engineer
**Job Description**
**Role Summary**
Build and operate end‑to‑end production infrastructure for machine learning, enabling data scientists and engineers to deploy and scale models across multi‑tenant workloads on Kubernetes/EKS. Provide the platform abstractions, observability, and developer experience that let ML research move rapidly into production.
**Expectations**
- Lead the design, implementation, and maintenance of a cloud‑native MLOps platform.
- Serve as the technical point of contact for data scientists and ML engineers, continuously improving user experience.
**Key Responsibilities**
- Run and evolve the ML compute layer on Kubernetes/EKS (CPU/GPU), ensuring portability across regions.
- Operate Argo Workflows, Dask Gateway, and other scheduling tools for data preparation, training, evaluation, and batch compute.
- Develop GitOps‑native delivery pipelines (GitLab CI, Helm, FluxCD) for ML jobs and platform components, supporting fast rollouts and safe rollbacks.
- Design and maintain a lakehouse data platform (lakeFS, Apache Iceberg, Snowflake) for experiment reproducibility and data lineage tracking.
- Create clear APIs/CLIs, templates, and documentation to enhance the developer experience.
- Implement observability with metrics (Prometheus/Grafana), logging (Loki/Promtail, Datadog), error tracking (Sentry), and alerting.
- Ensure compliance with networking, security, and Linux administration best practices.
**Required Skills**
- Production experience with AWS (EKS, S3, EC2, IAM, Lambda, RDS) and container orchestration.
- Strong Python proficiency (FastAPI/Django, Pydantic, boto3).
- Familiarity with CI/CD and GitOps tools (GitLab CI, Helm, FluxCD, Terraform).
- Experience building and operating data pipelines with idempotency, retries, and reproducibility.
- Knowledge of observability stacks: Prometheus, Grafana, Loki, Datadog, or similar.
- Understanding of networking, security concepts, and Linux system administration.
**Nice to Have**
- Distributed compute frameworks (Dask, Spark, Ray).
- Inference servers (NVIDIA Triton).
- FinOps cost attribution and multi‑region deployment strategies.
**Required Education & Certifications**
- Bachelor’s degree or higher in Computer Science, Engineering, or a related field.
- Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) preferred.