Lead Assistant Manager

EXL.com

Office

India

Full Time

Role: Data Engineer - PySpark

Basic Qualifications

2+ years of experience in data engineering or software development with a focus on PySpark and distributed data processing
Strong proficiency in PySpark for big data processing, transformations, and data analysis
Solid experience with Apache Spark architecture, performance tuning, and optimization techniques for large-scale data processing
Expertise in Python programming for building robust, scalable data pipelines using PySpark
Strong understanding of Hadoop and other big data technologies (e.g., Hive, HBase, Kafka) in a cloud or on-premise environment
Hands-on experience with SQL, including querying and optimizing large datasets in distributed environments
Familiarity with data lake and data warehouse architectures and working with structured and unstructured data
Proven ability to design, implement, and automate ETL processes for large volumes of data
Experience working in an Agile environment, collaborating closely with cross-functional teams to deliver high-quality solutions
Strong problem-solving and debugging skills, with a focus on performance optimization and cost-efficient data processing

Preferred Qualifications

Degree in Computer Science, Engineering, Data Science, or a related field, or equivalent work experience
Extensive experience with PySpark on Apache Spark clusters, both on-premise and in cloud environments such as AWS, GCP, or Azure
Hands-on experience with data orchestration tools like Apache Airflow or Cloud Composer for managing data workflows
Knowledge of Hadoop ecosystem tools, including Hive, HBase, and Kafka, and experience integrating them with PySpark
Experience working with NoSQL databases (e.g., Cassandra, MongoDB) and data stores commonly used in big data environments
Proficiency in distributed computing concepts, such as shuffling, partitioning, and fault tolerance in Spark clusters
Familiarity with data serialization formats (e.g., Parquet, Avro, ORC) for efficient data storage and processing
Experience with data quality, data governance, and data security best practices for big data systems
Knowledge of containerization tools (e.g., Docker) and orchestration platforms (e.g., Kubernetes) for deploying PySpark workloads
Strong experience with cloud storage solutions (e.g., S3, Google Cloud Storage, Azure Blob Storage) for big data processing
Ability to optimize data pipelines for performance and scalability in distributed systems
Experience with DevOps practices for continuous integration, testing, and deployment of data pipelines
Strong communication skills, with the ability to clearly convey complex technical concepts to both technical and non-technical audiences
Ability to work effectively in a fast-paced, collaborative, and dynamic team environment