University of Technology Sydney

94693 Big Data Engineering

Warning: The information on this page is indicative. The subject outline for a particular session, location and mode of offering is the authoritative source of all information about the subject for that offering. Required texts, recommended texts and references in particular are likely to change. Students will be provided with a subject outline once they enrol in the subject.

Subject handbook information prior to 2024 is available in the Archives.

UTS: Transdisciplinary Innovation
Credit points: 8 cp
Result type: Grade, no marks

Requisite(s): 36106 Machine Learning Algorithms and Applications

Description

The modern data scientist can no longer rely on skills that work only on a local device. The true power of a data scientist comes from applying those skills at scale to Big Data: data of such Volume, Variety and Velocity that traditional data analytics processes no longer work. With the rise of innovations such as Big Data distributed systems, along with the extended Hadoop ecosystem, stream processing, and computation at scale, the modern data scientist can now convert Big Data into value at scale. A data scientist's capacity to convert data into value is closely tied to the maturity of their company's data infrastructure, so a data scientist who understands Big Data Engineering can prove and apply their Data Science skills in a way that matches the stage and needs of the company.

This elective gives MDSI students a strong edge over data scientists with no exposure to big data engineering, positioning them to meet the growing demand for data scientists with Big Data Engineering skills.

Subject learning objectives (SLOs)

Upon successful completion of this subject students should be able to:

1. Participate in the development of data engineering projects using popular tools (Snowflake, Databricks, PostgreSQL, Spark, Airflow, Kafka, etc.) and languages including Python and SQL.
2. Build simple data pipelines or ETL processes to load, transform and prepare raw data for modern data warehousing solutions (see the sketch after this list).
3. Interact with and query data from modern data warehousing and data lake technologies.
4. Understand basic data engineering concepts including data modelling, big data, distributed data processing and data streaming.
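
A minimal sketch of the kind of pipeline described in objective 2, using Python's built-in sqlite3 module as a stand-in for a real warehouse; the file name, table and columns are hypothetical:

    import csv
    import sqlite3

    # Hypothetical raw input and target database; stand-ins for the real
    # sources and warehouses (e.g. PostgreSQL, Snowflake) used in class.
    RAW_FILE = "raw_sales.csv"
    WAREHOUSE = "warehouse.db"

    def extract(path):
        """Read raw CSV rows into dictionaries."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Clean and type-cast raw records before loading."""
        return [
            (row["order_id"], row["product"].strip().lower(), float(row["amount"]))
            for row in rows
            if row["amount"]  # drop records with a missing amount
        ]

    def load(records, db_path):
        """Load the transformed records into a warehouse table."""
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales (order_id TEXT, product TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load(transform(extract(RAW_FILE)), WAREHOUSE)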

Contribution to the development of graduate attributes

Your experiences as a student in this subject support you to develop the following graduate attributes (GA):

GA 1 Sociotechnical systems thinking
GA 2 Creative, analytical and rigorous sense making
GA 3 Create value in problem solving and inquiry
GA 4 Persuasive and robust communication
GA 5 Ethical citizenship

Teaching and learning strategies

This subject is conducted over 8 sessions, with exercises assigned between classes. Classes run as workshops that mix lecture components with practical exercises. Each session runs for 3 hours on Tuesday evenings, as scheduled in the timetable.


The practical components involve two types of activities:

  • ‘Code together’ sessions, in which the instructor and students build understanding by collaboratively coding solutions to problems or implementing theoretical concepts.
  • Practical coding tasks for students to complete individually or in small groups.

Assignments are a mix of practical coding exercises, solution design and implementation tasks. Through these, students gain exposure to historical and current industry trends and challenges while developing tangible skills to apply these technologies in a work context.


Because this field advances rapidly, it is important for students to develop skills in quickly absorbing, dissecting and understanding new technologies and their value to business problems. The assignments in this subject are designed to help students develop these critical skills.

Content (topics)

1. Database Fundamentals:

  • Data Modelling
  • Database Properties
  • SQL


2. Data Storage:

  • Data Warehouse
  • SQL & NoSQL Databases
  • Data Lake
  • Snowflake
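
As a taste of querying a modern warehouse from code, a minimal sketch using the snowflake-connector-python package; the credentials, table and query are all placeholders, and real credentials would come from a secrets manager rather than source code:

    import snowflake.connector

    # Placeholder connection details; every value here is hypothetical.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
        warehouse="COMPUTE_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    cur = conn.cursor()
    # Aggregate a (hypothetical) table previously loaded from the data lake.
    cur.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
    for product, total in cur:
        print(product, total)
    cur.close()
    conn.close()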


3. Data Processing:

  • Big Data Definitions
  • Hadoop
  • Spark
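
A minimal local PySpark sketch of distributed processing; the input file and column names are hypothetical, and on a cluster the master URL would point at the cluster rather than local[*]:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Local session for experimentation; a cluster deployment replaces local[*].
    spark = SparkSession.builder.appName("sketch").master("local[*]").getOrCreate()

    # Hypothetical input; Spark infers the schema from the CSV header and data.
    df = spark.read.csv("raw_sales.csv", header=True, inferSchema=True)

    # A distributed aggregation: total amount per product.
    totals = df.groupBy("product").agg(F.sum("amount").alias("total_amount"))
    totals.show()

    spark.stop()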


4. Data Pipelines:

  • ETL & ELT
  • Data Pipelines
  • Airflow
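
A minimal sketch of a three-task Airflow DAG, assuming a recent Airflow 2.x install; the dag_id, schedule and task bodies are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder task bodies; real tasks would call out to extract,
    # transform and load logic.
    def extract():
        print("extract raw data")

    def transform():
        print("transform raw data")

    def load():
        print("load into the warehouse")

    with DAG(
        dag_id="example_etl",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3  # the >> operator declares the dependency order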


5. Data Streaming:

  • Data Streaming Definitions
  • Kafka
  • Stream Processing
  • Spark Streaming
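
A minimal produce-and-consume sketch using the kafka-python client; the topic name and the local broker address are assumptions:

    import json

    from kafka import KafkaConsumer, KafkaProducer

    BROKER = "localhost:9092"  # assumes a broker running locally
    TOPIC = "sales_events"     # hypothetical topic name

    # Publish one JSON-encoded event to the topic.
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"order_id": "42", "amount": 19.99})
    producer.flush()

    # Read events back from the beginning of the topic as they arrive.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)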

Assessment

Assessment task 1: Data lake with Snowflake

Intent:

Be able to store, load, explore and process a dataset in a data lake using Snowflake.

Type: Report
Groupwork: Individual
Weight: 30%
Criteria:
  1. Quality of code
  2. Justification of any data processing (transformation, formats, storage, etc.)
  3. Accuracy of results with evidence supporting claims
  4. Quality of findings and recommendations for business questions
  5. Clarity and quality of written report

Assessment task 2: Data processing with Spark

Intent:

Be able to load, explore and process a dataset using Spark.

Type: Report
Groupwork: Individual
Weight: 35%
Criteria:
  1. Quality of code
  2. Justification of any data processing (transformation, formats, storage, etc.)
  3. Accuracy of results with evidence supporting claims
  4. Quality of findings and recommendations for business questions
  5. Clarity and quality of written report

Assessment task 3: Data pipelines with Airflow

Intent:

Be able to build data pipelines with Airflow that collect raw data and process it for an end user.

Type: Report
Groupwork: Individual
Weight: 35%
Criteria:
  1. Quality of code
  2. Justification of any data processing (transformation, formats, storage, etc.) and DAGs structures
  3. Accuracy of results with evidence supporting claims
  4. Quality of findings and recommendations for business questions
  5. Clarity and quality of written report