- Company Name: bigspark
- Job Title: Data Engineer
- Job Description:
Job title: Data Engineer
Role Summary: Build and maintain enterprise‑grade data platforms and pipelines that support analytics, AI and business decisions, focusing on high availability, scalability, and performance across hybrid/multi‑cloud environments.
Expectations: Design and implement robust batch and streaming data solutions; ensure data quality, security, and governance; collaborate with cross‑functional teams; continuously optimize and automate data workflows; maintain documentation and lineage.
Key Responsibilities:
- Develop scalable ETL/ELT pipelines using Apache Spark, Flink, or Beam with orchestrators (Airflow, Dagster, Prefect, or equivalents); a minimal batch example is sketched after this list.
- Integrate large, heterogeneous datasets in hybrid and multi‑cloud setups (AWS, Azure, GCP).
- Manage data lakes/lakehouses (Delta Lake, Iceberg, Hudi) and storage formats (Parquet, ORC, Avro).
- Build and maintain streaming architectures with Kafka (Schema Registry, Streams), Pulsar, Kinesis, or Event Hubs.
- Apply dimensional, Data Vault, and semantic modelling; use virtualization tools such as Denodo or Starburst/Trino.
- Implement data quality, observability and lineage using OpenLineage, Marquez, Great Expectations, Monte Carlo, or Soda.
- Deliver CI/CD pipelines (GitHub/GitLab, Jenkins, Terraform, Docker, Kubernetes).
- Enforce security, GDPR compliance and IAM policies; manage encryption and tokenization.
- Perform Linux administration and manage infrastructure as code across AWS Glue, EMR, Athena, S3, Lambda, Step Functions, and analogous Azure/GCP services.
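
Purely as an illustration of the batch pipeline work described in the first responsibility, the sketch below shows a minimal PySpark job that reads raw Parquet, applies light cleansing, and appends to a partitioned Delta Lake table. The bucket paths, table layout, and column names (order_id, order_total, order_ts) are placeholder assumptions, not details taken from the role.

```python
# Minimal, illustrative PySpark batch ETL: raw Parquet -> cleansed Delta table.
# Paths and column names are placeholders; delta-spark is assumed to be installed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_daily_etl")
    # Enable Delta Lake support (assumes the delta-spark package is on the classpath).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read raw Parquet files from object storage (placeholder bucket and prefix).
raw = spark.read.parquet("s3://example-bucket/raw/orders/")

# Basic cleansing and enrichment: deduplicate, cast types, derive a partition date.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_total", F.col("order_total").cast("decimal(18,2)"))
       .withColumn("ingested_at", F.current_timestamp())
       .withColumn("dt", F.to_date("order_ts"))
)

# Append into a date-partitioned Delta table for downstream analytics.
(
    cleaned.write.format("delta")
           .mode("append")
           .partitionBy("dt")
           .save("s3://example-bucket/curated/orders/")
)
```

In practice a job like this would be scheduled and parameterized by an orchestrator such as Airflow, Dagster, or Prefect rather than run ad hoc.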
Required Skills:
- 3+ years data engineering experience in commercial settings.
- Strong programming in Python, Scala, or Java with clean coding/testing habits.
- Proficiency with Spark (core, SQL, streaming), Databricks, Snowflake, Flink, Beam.
- Expertise in Delta Lake, Iceberg, Hudi; file formats Parquet, ORC, Avro.
- Streaming & messaging – Kafka (including Schema Registry & Streams), Pulsar, Kinesis, Event Hubs.
- Data modelling (dimensional, Data Vault, semantic); virtualization tools Denodo, Starburst/Trino.
- Cloud platforms – AWS (Glue, EMR, Athena, S3, Lambda, Step Functions), plus Azure Synapse and GCP BigQuery awareness.
- SQL and NoSQL databases: PostgreSQL, MySQL, DynamoDB, MongoDB, Cassandra.
- Orchestration: Autosys/CA7/Control‑M, Airflow, Dagster, Prefect, or managed services.
- Observability & lineage tools: OpenLineage, Marquez, Great Expectations, Monte Carlo, Soda.
- DevOps: Git, Jenkins, Terraform, Docker, Kubernetes (EKS/AKS/GKE, OpenShift).
- Security & governance: encryption, tokenization, IAM, GDPR compliance, and Linux administration.
Required Education & Certifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- Relevant certifications (e.g., AWS Certified Data Analytics, Databricks Certified Professional Data Engineer) preferred but not mandatory.