Job Specifications
We are building an internal platform for collecting feedback from users and turning it into structured, rich signals. These signals will directly support evaluation, analysis, and preference-learning workflows.
This role is for engineers who move fast, like owning real systems end-to-end, and are excited to turn a strong prototype into something reliable, scalable, and widely used.
We expect this engineer to independently own major system components, but not to set org-wide technical direction.
What you’ll do
You’ll take a working prototype and turn it into a durable internal product by:
Owning major components end-to-end. Design, build, and ship core parts of the system (session state, rubric tooling, inference-backed features, exports) with minimal hand-holding.
Shipping quickly and iterating in tight loops. Deliver improvements weekly, respond to real user feedback (researchers, raters), and refine UX and APIs without over-engineering.
Building inference-backed product features. Integrate server-side LLM calls for rubric suggestions, guided generation, and scoring, with proper versioning, logging, and safety boundaries (a sketch follows this list).
Making data trustworthy by default. Implement event logging, immutable snapshots, and stable export schemas so downstream users can rely on the data without guesswork.
Keeping the system operational. Add basic observability, guardrails, retries, and sane failure modes so the platform works reliably under real use.
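To give a flavor of the inference-backed work: below is a minimal sketch of a server-side call wrapper with prompt versioning and structured event logging. All names here (the `call_model` stub, the version tag, the event fields) are illustrative assumptions, not the platform's actual code.

```python
import json
import logging
import time
import uuid
from dataclasses import dataclass, asdict

log = logging.getLogger("inference")

# Hypothetical version tag; bumped whenever the prompt template changes,
# so logged events trace back to the exact prompt in use.
PROMPT_VERSION = "rubric-suggest-v1"

@dataclass
class InferenceEvent:
    request_id: str
    prompt_version: str
    model: str
    latency_ms: int
    ok: bool

def call_model(prompt: str, model: str) -> str:
    """Placeholder for the real server-side LLM client."""
    raise NotImplementedError

def suggest_rubric(context: str, model: str = "internal-model") -> dict:
    """Run one logged, versioned inference call for rubric suggestions."""
    request_id = str(uuid.uuid4())
    prompt = f"Suggest rubric criteria for the following task:\n{context}"
    start = time.monotonic()
    ok = False
    try:
        raw = call_model(prompt, model=model)
        ok = True
    finally:
        # Every call emits a structured event, on success or failure.
        event = InferenceEvent(
            request_id=request_id,
            prompt_version=PROMPT_VERSION,
            model=model,
            latency_ms=int((time.monotonic() - start) * 1000),
            ok=ok,
        )
        log.info("inference_event %s", json.dumps(asdict(event)))
    return {"request_id": request_id, "prompt_version": PROMPT_VERSION, "suggestions": raw}
```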
What you’ll build
A multi-stage elicitation workflow (context → A/B preference → rubric → validation → calibration)
Rubric tooling (edit, weight, version, lint)
Session persistence (autosave, resume, auditability)
Inference-powered features (rubric suggestions, guided generation, scoring)
Export pipelines that produce clean, versioned “signal bundles” for eval and training consumers (sketched below, after this list)
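For concreteness, here is a minimal sketch of the data model this implies: the ordered workflow stages plus a versioned signal bundle. The stage names, field names, and schema-versioning scheme are illustrative assumptions, not the real export format.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum

class Stage(str, Enum):
    """Ordered stages of the elicitation workflow."""
    CONTEXT = "context"
    AB_PREFERENCE = "ab_preference"
    RUBRIC = "rubric"
    VALIDATION = "validation"
    CALIBRATION = "calibration"

# Bumped on any breaking field change, so downstream consumers can pin a version.
SCHEMA_VERSION = "1.0.0"

@dataclass(frozen=True)
class PreferenceSignal:
    item_a_id: str
    item_b_id: str
    choice: str          # "a", "b", or "tie"
    rubric_scores: dict  # criterion name -> weighted score

@dataclass(frozen=True)
class SignalBundle:
    schema_version: str
    session_id: str
    rubric_version: str
    signals: list

def export_bundle(session_id: str, rubric_version: str, signals: list) -> str:
    """Serialize a completed session as a stable, versioned JSON bundle."""
    bundle = SignalBundle(
        schema_version=SCHEMA_VERSION,
        session_id=session_id,
        rubric_version=rubric_version,
        signals=[asdict(s) for s in signals],
    )
    return json.dumps(asdict(bundle), indent=2)
```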
What success looks like (first ~90 days)
Authentication and security plans are validated and implemented.
Sessions autosave and resume reliably; no data loss on refresh.
Core inference calls run server-side with logging, versioning, and access control.
Completed sessions consistently produce valid, usable exports.
The system ships meaningful UX or infra improvements every 1–2 weeks.
Researchers and raters actively use the tool and give actionable feedback.
Required Qualifications
Strong software engineering fundamentals: you can design, build, debug, and operate real systems.
Experience owning non-trivial product or platform components end-to-end.
Comfort moving quickly in ambiguous problem spaces.
Hands-on experience with:
a modern frontend framework (React/TypeScript or similar),
backend APIs (Python/FastAPI, Go, Node, etc.),
relational databases and basic schema design.
Good engineering judgment around tradeoffs, failure modes, and scope.
Nice to have (not required)
Familiarity with human-computer interaction principles, experiment design, bias mitigation, or data quality tooling.
Experience with evaluation systems, annotation tooling, or preference data.
Experience integrating LLMs into production systems.
Experience building data exports consumed by downstream pipelines.
Why this role matters
Preference elicitation is a leverage point: better signals make evaluation better, training more reliable, and alignment work more concrete.
If you like moving fast, owning systems, and turning prototypes into products that matter, this role could be for you.
Pay Rate: $130–$140/hour