94693 Big Data Engineering
Warning: The information on this page is indicative. The subject outline for a particular session, location and mode of offering is the authoritative source of all information about the subject for that offering. Required texts, recommended texts and references in particular are likely to change. Students will be provided with a subject outline once they enrol in the subject.
Subject handbook information prior to 2022 is available in the Archives.
Credit points: 8 cp
Result type: Grade, no marks
Requisite(s): 36106 Machine Learning Algorithms and Applications AND 94692 Data Science Practice
The modern data scientist can no longer rely on skills for data science work that runs only on local devices. The true power of a data scientist comes from applying those skills at scale to Big Data: data whose Volume, Variety and Velocity are such that traditional data analytics processes will not work. With the rise of distributed Big Data systems, the extended Hadoop ecosystem, stream processing and computation at scale, it is now possible for the modern data scientist to convert Big Data into value at scale. A data scientist’s capability to convert data into value is largely correlated with the stage of their company’s data infrastructure. A data scientist who understands Big Data Engineering is therefore able to demonstrate and apply their data science skills in a way that is aligned with the stage and needs of the company.
This elective gives MDSI students a strong edge over data scientists who have no exposure to big data engineering, and positions them to meet the increasing demand for data scientists with Big Data Engineering skills.
Subject learning objectives (SLOs)
Upon successful completion of this subject students should be able to:
1. Participate in the development of data engineering projects using popular tools (PostgreSQL, Spark, Airflow, Kafka, etc.) and languages including Python and SQL.
2. Build simple data pipelines or ETLs in order to load, transform and prepare raw data into modern data warehousing solutions.
3. Interact with and query data from modern data warehousing and data lake technologies.
4. Understand basic data engineering concepts including data modelling, big data, distributed data processing and data streaming.
Contribution to the development of graduate attributes
Your experiences as a student in this subject support you to develop the following graduate attributes (GA):
GA 1 Sociotechnical systems thinking
GA 2 Creative, analytical and rigorous sense making
GA 3 Create value in problem solving and inquiry
GA 4 Persuasive and robust communication
GA 5 Ethical citizenship
Teaching and learning strategies
This subject is delivered in face-to-face sessions twice a week, with exercises assigned between classes.
Classes are run as workshops that mix lecture components with practical exercises.
Each session runs for 3 hours, on Monday and Thursday evenings, as scheduled in the timetable.
The practical components involve two types of activities:
- ‘Code together’ sessions in which the instructor and students build understanding through collaboratively coding solutions to problems or implementing theoretical concepts.
- Practical coding tasks for students to complete themselves or in small groups.
Assignments are a mix of practical coding exercises, solution design and implementation tasks. Through these, students gain exposure to historical and current industry trends and challenges, while developing tangible skills to implement these technologies in a work context.
Because this field advances rapidly, it is important for students to develop skills in quickly absorbing, dissecting and understanding new technologies and their value to business problems. The assignments in this subject are designed to help students develop these critical skills.
1. Database Fundamentals:
- Data Modelling
- Database Properties
2. Data Warehouse Concepts:
- Data Warehouse Architecture
- SQL Database
- NoSQL Database
3. Big Data Concepts:
- Big Data Definitions
- Data Lake
4. Data Pipelines:
- ETL & ELT
- Data Pipelines
5. Data Streaming:
- Data Streaming Definitions
- Stream Processing
- Spark Streaming
6. Cloud Computing Concepts:
- Cloud Storage
- Cloud Computing
Assessment task 1: Data processing on a big dataset with Spark
Be able to load, explore and process a dataset whose size is larger than your available memory by using Spark.
Groupwork: Group, individually assessed
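The idea of processing a dataset larger than available memory can be illustrated without Spark. The sketch below is not the method used in the assessment: it uses only the Python standard library, and the function name `mean_by_key`, the column names and the sample data are hypothetical. It reads rows one at a time and keeps only a running sum and count per key, so memory use depends on the number of distinct keys rather than the number of rows. Spark applies a broadly similar partial-aggregation idea across partitions and machines.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(rows, key_field, value_field):
    """Stream rows once, keeping only a running [sum, count] per key."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:  # rows is an iterator, so only one row is held at a time
        acc = totals[row[key_field]]
        acc[0] += float(row[value_field])
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample standing in for a file too large to load at once.
SAMPLE_CSV = "city,fare\nSydney,10\nMelbourne,20\nSydney,30\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
result = mean_by_key(reader, "city", "fare")
print(result)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and the rest of the code would stay the same.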
Assessment task 2: Building data pipelines with Airflow
Be able to build data pipelines with Airflow capable of collecting raw data and loading cleaned data into a data mart.
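The collect, clean and load pattern described here can be sketched in plain Python. This is an illustration only and does not use Airflow: the functions `extract`, `transform` and `load`, the `sales` table and the sample records are all hypothetical, and an in-memory SQLite database stands in for the data mart. In an orchestrator such as Airflow, each of these steps would typically be defined as a separate task with dependencies between them.

```python
import sqlite3


def extract():
    # Hypothetical raw records; a real pipeline would pull these from a source system.
    return [
        {"id": "1", "name": " Alice ", "amount": "10.5"},
        {"id": "2", "name": "Bob", "amount": ""},  # missing amount
        {"id": "3", "name": "carol", "amount": "7"},
    ]


def transform(raw_rows):
    # Drop rows with no amount, then normalise types and name formatting.
    cleaned = []
    for row in raw_rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    # INSERT OR REPLACE keeps the load idempotent if the pipeline is re-run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data mart
load(conn, transform(extract()))
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes each one testable on its own and easy to re-run after a failure.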
Assessment task 3: Real Time Analytics with Spark Streaming
Be able to build a data streaming service capable of analysing real-time data.
Groupwork: Group, group assessed
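One common building block of real-time analytics is the tumbling window, which groups a stream of events into fixed, non-overlapping time intervals. The sketch below illustrates the idea in plain Python rather than Spark Streaming. The function `tumbling_window_counts` and the sample events are hypothetical, and the code assumes events arrive ordered by timestamp. Windows that contain no events are skipped.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) each time a window closes.

    `events` is an iterable of (timestamp, key) pairs, assumed time-ordered.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts  # flush the final, still-open window


# Hypothetical event stream: (timestamp in seconds, event type).
events = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (25, "view")]
windows = [(start, dict(c)) for start, c in tumbling_window_counts(events, 10)]
print(windows)
```

Because the function is a generator, it can consume an unbounded stream and emit each window's result as soon as the first event of a later window arrives.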