Role Summary
The Principal Data Engineer will architect and deliver cloud-based data pipelines using Python, Spark, and Airflow to automate ETL/ELT processes across data lakes and warehouses. They will design and implement AI/ML and GenAI-driven solutions using supervised/unsupervised learning, statistical modeling, and NLP to enhance data quality, automate workflows, detect similarities, and support evidence-based clinical decision-making. The role involves developing data integration workflows for structured and unstructured data, creating interactive dashboards and real-time visualization platforms to deliver actionable insights, and mentoring junior engineers while guiding enterprise-wide adoption of scalable, AI-powered data engineering solutions.
Location: Cambridge, MA; 100% remote work allowed anywhere in the U.S.
Responsibilities
- Engineer cloud-based data pipelines using Python, Spark, and Airflow to automate ETL/ELT processes, enabling efficient data ingestion, transformation, and storage across data lakes and warehouses.
- Design and implement AI/ML and GenAI-driven solutions using supervised/unsupervised learning, statistical modeling, and NLP to enhance data quality, automate workflows, detect similarities, and support evidence-based clinical decision-making.
- Develop robust data integration workflows for structured and unstructured data, ensuring adherence to Good Clinical Practices (GCP), FDA regulations, and SOPs through SQL-based data validation frameworks.
- Create interactive dashboards and real-time visualization platforms to deliver actionable insights from clinical and operational data, enabling stakeholders to monitor performance and drive data-informed strategies.
- Develop custom automation tools using Python, R, and APIs to streamline data entry, reduce manual processing, and enhance operational efficiency across clinical research systems.
- Drive strategic alignment by partnering with cross-functional teams, mentoring junior engineers, and advising leadership on AI/ML adoption, automation strategies, and emerging data technologies.
- Influence industry practices by presenting technical innovations at leading conferences and guiding enterprise-wide adoption of scalable, AI-powered data engineering solutions.
Qualifications
- 30 months of related experience designing, developing, testing, and deploying software applications and features based on client and project requirements.
- Experience implementing automated testing and regression testing using Selenium and Python to improve test coverage, reduce manual effort, and ensure application stability.
- Experience collaborating with cross-functional teams, including developers, business analysts, and QA leads, and participating in Agile/Scrum ceremonies to plan, deliver, and communicate software progress iteratively.
- Experience performing data wrangling, transformation, and management to create structured datasets stored in databases, supporting data analyses.
Education
- Master's degree in Computer Science, Data Science, Engineering, or related field.
Skills
- Python
- Spark
- Airflow
- SQL-based data validation frameworks
- AI/ML, GenAI, NLP
- Data integration for structured and unstructured data
- Dashboard and real-time visualization
- APIs and automation using Python and R
- Collaboration, mentoring, and leadership communication