1. Work with Apache Spark, HDFS, Amazon EMR, Spark Streaming, GraphX, MLlib, Cassandra, Elasticsearch, YARN, Hadoop, Hive, AWS cloud services, and SQL.
2. Work with machine learning / deep learning libraries (MLlib, TensorFlow, PyTorch) to implement solutions that solve or automate real-world tasks.
3. Build lightweight models that can run on edge devices such as IoT hardware, performing edge computing and serving predictions locally.
4. Design, implement, and automate the deployment of distributed systems for collecting and processing data from large data sources.
5. Write ETL/ELT jobs and Spark/Hadoop jobs to perform computation on large-scale datasets.
6. Design streaming applications using Apache Spark and Apache Kafka for real-time computation.
7. Design complex data models and schemas for structured and semi-structured datasets in SQL and NoSQL environments.
8. Deploy and test solutions on cloud platforms such as Amazon EMR, Google Cloud Dataproc, and Google Cloud Dataflow.
9. Explore and analyze data using visualization tools such as Tableau and Qlik.
Candidate Profile :
Experience : Minimum of 1-2 years of experience with Apache Spark.
Technical Skills :
1. Experience working with streaming environments (Spark Streaming / Flink).
2. Experience with the Hadoop ecosystem (Hadoop MapReduce, HDFS, Pig, Sqoop, Impala, Hive, Presto).
3. Solid experience using the Spark and Hadoop frameworks on Amazon EMR.
4. Strong knowledge of data modelling and design principles in SQL and NoSQL environments.
5. Strong experience building and working with ETL and ELT pipelines and their components.
6. Experience or familiarity with visualization tools such as Tableau, Qlik, or Grafana.
7. Strong experience developing REST APIs and consuming data from external web APIs.
8. Comfortable with source control systems (GitHub) and Linux environments.