Description
**Role Summary**
+ Build and operate large-scale healthcare data pipelines across batch workflows, metadata-driven ingestion, and data service publishing.
+ Own end-to-end engineering from source ingestion to conformed data products, with a strong focus on reliability, data quality, and operational observability.
+ Partner with analytics, business, and platform teams to deliver trusted datasets for sales, claims, activity, patient, and rare disease use cases.
**Key Responsibilities**
+ Design and maintain PySpark/SQL pipelines in Databricks for landing, unified, unstitched, and published data layers.
+ Build and support Airflow DAGs for scheduling, dependencies, retries, and production operations (an illustrative DAG sketch follows this list).
+ Implement metadata/config-driven frameworks for ingestion, transformation, and rule-based processing.
+ Develop robust data quality controls, DQ summaries, failure handling, and alerting workflows (a minimal PySpark example follows this list).
+ Manage batch/process audit logs, run status tracking, release flags, and operational reporting.
+ Integrate multi-source data (files, APIs, cloud storage, and relational systems) into governed Delta/Spark tables.
+ Optimize pipeline performance using partitioning, parallelization, and query tuning.
+ Collaborate on schema evolution, business-rule onboarding, and production support.
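
As a flavor of the orchestration work, the sketch below shows a minimal Airflow DAG with retries, an explicit dependency chain, and a failure callback. It is a sketch under assumed names: the DAG id, task callables, and alerting hook are hypothetical, not a prescribed framework.

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+; older versions use
# schedule_interval). Task bodies are placeholders for real pipeline steps.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Placeholder alerting hook; in practice this might page or post to a channel.
    print(f"Task {context['task_instance'].task_id} failed")


def ingest_source(**_):
    print("landing raw files")          # e.g. copy source extracts to the landing layer


def build_unified_layer(**_):
    print("building unified layer")     # e.g. trigger a Databricks/Spark job


def publish_data_product(**_):
    print("publishing conformed data")  # e.g. write published Delta tables


default_args = {
    "owner": "data-engineering",
    "retries": 2,                              # automatic retries on task failure
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,  # alerting workflow hook
}

with DAG(
    dag_id="healthcare_batch_pipeline",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_source", python_callable=ingest_source)
    unify = PythonOperator(task_id="build_unified_layer", python_callable=build_unified_layer)
    publish = PythonOperator(task_id="publish_data_product", python_callable=publish_data_product)

    ingest >> unify >> publish  # dependency chain: ingest, then unify, then publish
```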
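
In the same spirit, a minimal PySpark sketch of a rule-based data quality gate followed by a MERGE upsert into a governed Delta table; the paths, key column, and null-rate threshold are assumptions chosen for illustration.

```python
# Minimal PySpark sketch: a rule-based DQ gate, then a Delta MERGE upsert.
# Table paths, the claim_id key, and the 1% null threshold are illustrative.
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

incoming = spark.read.format("delta").load("/lake/landing/claims_batch")

# Simple DQ rule: fail the batch if too many claim IDs are null.
total = incoming.count()
null_ids = incoming.filter(F.col("claim_id").isNull()).count()
null_rate = null_ids / total if total else 1.0
if null_rate > 0.01:
    raise ValueError(f"DQ check failed: {null_rate:.2%} null claim_id values")

# Upsert the validated batch into the governed unified-layer Delta table.
target = DeltaTable.forPath(spark, "/lake/unified/claims")
(
    target.alias("t")
    .merge(incoming.alias("s"), "t.claim_id = s.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```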
**Required Skills**
+ Bachelor’s degree in Computer Science, Information Technology, or a related field, with 2–6 years of experience.
+ Advanced Python, PySpark, and SQL (window functions, complex joins, MERGE patterns, optimization); a short illustrative snippet follows this list.
+ Hands-on Databricks and Airflow experience in enterprise environments.
+ Experience with cloud data platforms (AWS), object storage, and secure secrets management.
+ Strong data quality engineering, monitoring, and troubleshooting in regulated data contexts.
+ Solid understanding of ETL orchestration, dependency management, and SLA-driven delivery.
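
To illustrate the level of SQL implied above, a short snippet run through PySpark: a window-function dedup that keeps the latest record per claim, followed by a MERGE upsert. The unified/published claims tables, their columns, and the ordering timestamp are hypothetical, and the MERGE target is assumed to be a Delta table.

```python
# Window-function dedup + MERGE upsert, executed via spark.sql.
# Schema (claim_id, patient_id, claim_amount, load_ts) is assumed for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep only the most recent record per claim_id.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW latest_claims AS
    SELECT claim_id, patient_id, claim_amount, load_ts
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY claim_id
                   ORDER BY load_ts DESC
               ) AS rn
        FROM unified.claims
    ) ranked
    WHERE rn = 1
""")

# Upsert the deduplicated set into the published (Delta) table.
spark.sql("""
    MERGE INTO published.claims AS t
    USING latest_claims AS s
    ON t.claim_id = s.claim_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```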