Data Engineer II
Bungee Tech
Office
Chennai, Tamil Nadu, India
Full Time
Job Summary:
Building on the foundation of the SDE-I role, the DE-II position takes on a greater level of responsibility and leadership. You'll play a crucial role in driving the evolution and efficiency of our data collection and analytics platform, which handles terabyte-scale data and billions of data points.
Key Responsibilities
- Lead the design, development, and optimization of large-scale data pipelines and infrastructure using technologies like Apache Airflow, Spark, and Kafka (see the Airflow sketch after this list).
- Architect and implement distributed data processing solutions to handle terabyte-scale datasets and billions of records efficiently across multi-region cloud infrastructure (AWS, GCP, DigitalOcean).
- Develop and maintain real-time data processing solutions for high-volume data collection operations using technologies like Spark Streaming and Kafka (a streaming sketch follows this list).
- Optimize data storage strategies using technologies such as Amazon S3, HDFS, and Parquet/Avro file formats for efficient querying and cost management.
- Build and maintain high-quality ETL pipelines, ensuring robust data collection and transformation processes with a focus on scalability and fault tolerance.
- Collaborate with data analysts, researchers, and cross-functional teams to define and maintain data quality metrics, implement robust data validation, and enforce security best practices.
- Mentor junior engineers (SDE-I) and foster a collaborative, growth-oriented environment.
- Participate in technical discussions, contribute to architectural decisions, and proactively identify improvements for scalability, performance, and cost-efficiency.
- Ensure application performance monitoring (APM) is in place, using tools like Datadog, New Relic, or similar to detect bottlenecks, optimize performance, and keep systems healthy.
- Implement effective data partitioning strategies and indexing for performance optimization in distributed databases such as DynamoDB, Cassandra, or HBase (see the key-design sketch after this list).
- Stay current with advancements in data engineering, orchestration tools, and emerging cloud technologies, continually enhancing the platform’s capabilities.
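
To make the orchestration expectation concrete, here is a minimal sketch of the kind of Airflow DAG this role would own. The DAG id, schedule, and task bodies are illustrative assumptions, not Bungee Tech's actual pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull one day of raw collection data.
    print("extracting for", context["ds"])


def transform(**context):
    # Placeholder: validate and reshape the extracted records.
    print("transforming for", context["ds"])


with DAG(
    dag_id="daily_collection_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```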
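
The real-time and storage bullets above pair naturally: a common pattern is a Spark Structured Streaming job that reads from Kafka and lands date-partitioned Parquet on S3. This is a minimal sketch under assumed topic, schema, broker, and bucket names.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("collection-stream").getOrCreate()

# Assumed shape of the collected records.
schema = StructType([
    StructField("item_id", StringType()),
    StructField("payload", StringType()),
    StructField("collected_at", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "collection-events")          # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("dt", to_date("collected_at"))
)

# Partitioning by date keeps S3 listings and downstream Athena scans cheap.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/collection/")  # hypothetical bucket
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/collection/")
    .partitionBy("dt")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```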
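
For the partitioning point, the core idea in DynamoDB is a high-cardinality partition key to spread load, plus a sort key that matches the dominant query pattern. A minimal boto3 sketch with assumed table, key, and region names:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # assumed region

dynamodb.create_table(
    TableName="product_observations",  # hypothetical table
    # product_id spreads writes across partitions; observed_at makes
    # per-product time-range queries cheap.
    KeySchema=[
        {"AttributeName": "product_id", "KeyType": "HASH"},
        {"AttributeName": "observed_at", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "product_id", "AttributeType": "S"},
        {"AttributeName": "observed_at", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```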
Qualifications & Experience:
- 4-5+ years of hands-on experience with Apache Airflow and other orchestration tools for managing large-scale workflows and data pipelines.
- Expertise in AWS technologies such as Athena, AWS Glue, and DynamoDB, as well as Apache Spark, PySpark, SQL, and NoSQL databases.
- Experience designing and managing distributed data processing systems that scale to terabyte-sized datasets with billions of records on cloud platforms like AWS, GCP, or DigitalOcean.
- Proficiency in large-scale web crawling and data extraction: Node.js, HTTP protocols, and browser automation with Puppeteer, Playwright, and Chromium (see the sketch after this list).
- Experience with monitoring and observability tools such as Grafana, Prometheus, Elasticsearch, and familiarity with monitoring and optimizing resource utilization in distributed systems.
- Strong understanding of infrastructure as code using Terraform, automated CI/CD pipelines with Jenkins, and event-driven architecture with Kafka.
- Experience with data lake architectures and optimizing storage using formats such as Parquet, Avro, or ORC.
- Strong background in optimizing query performance and data processing frameworks (Spark, Flink, or Hadoop) for efficient data processing at scale.
- Knowledge of containerization (Docker, Kubernetes) and orchestration for distributed system deployments.
- Deep experience in designing resilient data systems with a focus on fault tolerance, data replication, and disaster recovery strategies in distributed environments.
- Strong data engineering skills, including ETL pipeline development, stream processing, and distributed systems.
- Excellent problem-solving abilities, with a collaborative mindset and strong communication skills.
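
As a reference point for the crawling requirement, here is a minimal Playwright sketch (using the Python bindings, since the rest of the stack is Python-heavy); the target URL and selector are illustrative assumptions.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")  # hypothetical target
    page.wait_for_selector("h1")
    # Collect the text of every matching element on the page.
    titles = page.locator("h1").all_text_contents()
    print(titles)
    browser.close()
```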
