Role Summary
Data Engineer II within the Onyx Research Data Platform at GSK, focused on designing, delivering, and maintaining automated end-to-end data services and pipelines. Work across structured, unstructured, and scientific data domains to deliver reliable, scalable, and governed data products, with opportunities to contribute to GenAI-enabled data capabilities.
Responsibilities
- Build modular code, libraries, and services using modern data engineering tools (Python/Spark, Kafka, Storm) and orchestration tools (e.g., Google Workflow, Airflow Composer)
- Produce well-engineered software, including automated test suites and technical documentation
- Develop, measure, and monitor key metrics for all tools and services, and iterate to improve them
- Ensure consistent application of platform abstractions for logging and data lineage
- Participate in code reviews and uphold coding best practices, improving the team's standards
- Adhere to QMS framework and CI/CD best practices
- Provide L3 support for existing tools, pipelines, and services
Qualifications
- Required: Bachelor's degree in Data Engineering, Computer Science, Software Engineering, or a related discipline
- Required: 4+ years of data engineering experience
- Required: Software engineering experience
- Required: Familiarity with orchestration tooling
- Required: Cloud experience (GCP, Azure or AWS)
- Required: Experience in automated testing and design
- Preferred: Newly completed PhD, or a Master's degree with 2+ years of experience
- Preferred: Experience overcoming high-volume, high-compute challenges
- Preferred: Knowledge and use of programming languages such as Python, Scala, Java, including toolchains for documentation, testing, and operations/observability
- Preferred: Strong experience with modern software development tools and practices (Git/GitHub, DevOps tools, metrics/monitoring)
- Preferred: Cloud experience (AWS, Google Cloud, Azure, Kubernetes)
- Preferred: Experience with CI/CD implementations using Git and common stacks (e.g., Jenkins, CircleCI, GitLab, Azure DevOps)
- Preferred: Experience with agile software development environments using Jira and Confluence
- Preferred: Demonstrated experience with data engineering tools (e.g., Spark, Kafka, Storm)
- Preferred: Knowledge of data modeling, database concepts, and SQL
- Preferred: Exposure to GenAI or ML data workflows (vector stores, embeddings, feature pipelines)
Skills
- Proficiency with Spark, Kafka, Storm, and related data processing frameworks
- Experience with data orchestration and workflow tools (Airflow, Google Workflow, etc.)
- Strong software engineering practices: testing, documentation, version control, and observability
- Knowledge of data governance, logging, and data lineage
- Familiarity with GenAI-enabled data concepts and vectorized data flows