- Company Name: Mistral AI
- Job Title: Software Engineer, Data Acquisition
- Job Description:
**Job Title:** Software Engineer, Data Acquisition
**Role Summary:**
Design, develop, and maintain scalable web crawling and data acquisition pipelines. Leverage Python-based scraping frameworks, headless browsers, and distributed queue systems to extract high‑quality data from diverse web sources and APIs. Collaborate with cross‑functional teams to ensure data volume, velocity, and accuracy meet business objectives.
**Expectations:**
- Build robust crawlers and indexers using BeautifulSoup, Scrapy, Selenium/Playwright, and regular-expression/XPath extraction techniques (see the crawler sketch after this list).
- Manage distributed job queues (Redis, Kubernetes) and store crawled data in relational/NoSQL databases (PostgreSQL, MongoDB).
- Monitor data quality and implement automated validation workflows.
- Optimize crawling performance for efficiency, scalability, and resilience.
- Continuously improve infrastructure: refactor old pipelines, adopt new libraries, and incorporate ML for smarter crawling where applicable.
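For illustration only, here is a minimal sketch of the kind of crawler this work involves, built on requests and BeautifulSoup with CSS selectors. The target URL and selectors are hypothetical placeholders; a production crawler would add politeness delays, retries, robots.txt handling, and error reporting.

```python
# Minimal illustrative crawler: fetch a listing page, extract fields with
# CSS selectors, and follow pagination. URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/articles"  # hypothetical target site

def crawl(url: str, max_pages: int = 3):
    session = requests.Session()
    results = []
    for _ in range(max_pages):
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # One record per article card (selector is a placeholder).
        for card in soup.select("article.card"):
            title = card.select_one("h2")
            link = card.select_one("a")
            if title and link:
                results.append({
                    "title": title.get_text(strip=True),
                    "url": link["href"],
                })
        # Follow the "next page" link if present; stop otherwise.
        next_link = soup.select_one("a.next")
        if next_link is None:
            break
        url = next_link["href"]
    return results

if __name__ == "__main__":
    for row in crawl(BASE_URL):
        print(row)
```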
**Key Responsibilities:**
1. Develop and maintain Python crawlers for target websites.
2. Implement headless browsing with the Chrome DevTools Protocol, Selenium, or Playwright.
3. Design data extraction patterns using regex, XPath, CSS selectors.
4. Build and manage distributed job queues (Redis, Kubernetes); see the queue sketch after this list.
5. Store and index data in PostgreSQL, MongoDB, or similar solutions.
6. Implement monitoring, logging, and alerting for data pipeline health.
7. Collaborate with API teams to integrate and consume external data sources.
8. Analyze and visualize crawled data using Pandas, NumPy, Matplotlib.
9. Document architecture, code, and best practices for the engineering team.
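As a sketch of the distributed-queue pattern referenced in item 4, the snippet below uses a plain Redis list as the crawl frontier via redis-py's `lpush`/`brpop`. The queue key and seed URLs are placeholders; a real deployment would add URL deduplication, retries, and result storage, with workers typically run as Kubernetes pods.

```python
# Minimal Redis-backed crawl queue: a producer enqueues seed URLs and a
# worker pops them for fetching. Queue key and URLs are placeholders.
import redis

QUEUE = "crawl:frontier"  # hypothetical queue key

def enqueue_seeds(client: redis.Redis, urls):
    # Push seeds onto the left of the list; workers pop from the right (FIFO).
    for url in urls:
        client.lpush(QUEUE, url)

def worker(client: redis.Redis):
    while True:
        # Block for up to 5 seconds waiting for the next job.
        item = client.brpop(QUEUE, timeout=5)
        if item is None:
            break  # queue drained; a long-lived worker would keep polling
        _, url = item
        print("fetching", url.decode())
        # ... fetch the page (e.g. with Playwright), parse it, persist results ...

if __name__ == "__main__":
    client = redis.Redis(host="localhost", port=6379)
    enqueue_seeds(client, ["https://example.com/a", "https://example.com/b"])
    worker(client)
```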
**Required Skills:**
- **Programming:** Advanced Python (core requirement); Java or C++ optional.
- **Web Scraping:** BeautifulSoup, Scrapy, Selenium, Playwright, or equivalent.
- **Web Technologies:** HTML, CSS, JavaScript, HTTP/HTTPS protocols.
- **Data Structures & Algorithms:** Comfortable with queues, stacks, and hash maps, and with optimizing them for large‑scale crawling.
- **Databases:** PostgreSQL, MongoDB, or similar; experience with SQL/NoSQL schema design.
- **Distributed Systems:** Experience with Redis, Kubernetes, and scalable job queue patterns.
- **Data Engineering:** Proficiency with Pandas, NumPy, and Matplotlib for analysis and visualization (a short example follows this list).
- **Optional/Bonus:** ML for crawling efficiency, cloud platforms (AWS/GCP), Docker, Hadoop/Spark.
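To illustrate the analysis side of the role, here is a small pandas/Matplotlib example that summarizes hypothetical crawl records per domain; the in-memory DataFrame is a stand-in for rows pulled from PostgreSQL or MongoDB.

```python
# Illustrative crawl-log analysis with pandas and matplotlib.
# The DataFrame below is hypothetical stand-in data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "domain": ["a.com", "a.com", "b.org", "c.net", "b.org"],
    "status": [200, 200, 404, 200, 500],
    "bytes":  [10_240, 8_192, 512, 20_480, 256],
})

# Per-domain page counts, error counts, and total payload size.
summary = df.groupby("domain").agg(
    pages=("status", "size"),
    errors=("status", lambda s: int((s >= 400).sum())),
    total_bytes=("bytes", "sum"),
)
print(summary)

# Quick bar chart of pages crawled per domain.
summary["pages"].plot(kind="bar", title="Pages crawled per domain")
plt.tight_layout()
plt.show()
```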
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Software Engineering, Data Engineering, or related field.
- Proven work experience (3+ years) building web scraping and data ingestion pipelines.
- No certifications are mandatory, but data‑engineering or cloud certifications are advantageous.