Role Summary
The Regeneron Genetics Center uses genetics and health data on millions of people to advance our understanding of human disease and guide Regeneron's therapeutic programs. You will help us to organize, analyze and interpret health information aggregated from electronic health records, surveys, digital devices, and laboratory assays from multiple collaborators. This will require developing and implementing standards for health data as well as code repositories, pipelines and analysis tools that facilitate interacting with data. You will need expertise in organizing and structuring large and complex datasets as well as top-notch software engineering and orchestration skills. You will help design data structures that store health data. You will implement the code and processes that populate, validate, and analyze these data. You will participate in the downstream analysis using machine learning, genomics, and epidemiology.
Responsibilities
- Developing and maintaining a toolkit with key functions, APIs, and summaries that help users understand, interpret, and interact with phenotype data. In addition to Python, knowledge of R, SQL and/or C++ is a definite plus.
- Developing the tools and code that will transform electronic health records, surveys, laboratory assays, or digital device data into a harmonized tall-and-narrow format compatible with RGC analytical tools, applications and processes. You will most likely write and update code in Python, using associated data science libraries such as pandas, Polars, NumPy, scikit-learn, and others.
- Reviewing the structure, content, and quality of phenotype data extracted from electronic health records, surveys, digital devices, or laboratory assays. Each of these datasets may include data on hundreds of thousands of people and require coordination and input from multiple stakeholders with varied expertise.
- Discussing the challenges and opportunities of using health data from electronic health records, surveys, digital devices, and laboratory assays to characterize human health and disease. Expertise in cutting-edge statistical methods for epidemiology is a definite plus.
- Working with modern cloud environments and platforms. A knowledge of AWS and related toolkits will be useful for your day-to-day work. You will be using your computational skills to execute analysis and data processing at scale and to facilitate automation and repeatability of all key processes.
- Presenting results and summaries of these datasets and data processing plans to a variety of technical audiences, ranging from experts in statistics, epidemiology, genetics, and computation to experts in biology, drug design, and medicine. You will need outstanding communication skills and the ability to tailor summaries to each of these audiences.
- Working in a highly interactive environment with a team of colleagues. We highly value the ability to interact, learn, and teach so that you and other skilled individuals consistently achieve high levels of motivation, enthusiasm, and performance.
Qualifications
- Required: PhD (or MS with additional years in lieu) in Computer Science, Health Informatics, Clinical Informatics, Biostatistics, or a related field with at least 3 years of relevant experience organizing large datasets in a research setting.
- Required: Demonstrated knowledge of Python and key data science libraries.
- Preferred: Knowledge of R, SQL and/or C/C++.
- Preferred: Experience mapping structured and unstructured data to ontologies such as ICD-10, RxNorm and LOINC.
- Preferred: Experience applying best practices for data quality control, summarization and visualization.
Skills
- Data modeling and data engineering for large health data cohorts
- Experience with cloud platforms (AWS) and scalable data processing
- Proficiency with Python and data science libraries (pandas, Polars, NumPy, scikit-learn, etc.)
- Strong communication and ability to present technical concepts to diverse audiences
- Collaborative, team-oriented mindset with ability to work across disciplines
Education
- PhD in Computer Science, Health Informatics, Clinical Informatics, Biostatistics, or related field; or MS with additional years of relevant experience in lieu