University of Technology Sydney

36103 Statistical Thinking for Data Science

Warning: The information on this page is indicative. The subject outline for a particular session, location and mode of offering is the authoritative source of all information about the subject for that offering. Required texts, recommended texts and references in particular are likely to change. Students will be provided with a subject outline once they enrol in the subject.

Subject handbook information prior to 2023 is available in the Archives.

UTS: Analytics and Data Science: TD School
Credit points: 8 cp

Subject level: Postgraduate

Result type: Grade, no marks

There are course requisites for this subject. See access conditions.

Description

This subject teaches students the key skills and concepts to apply statistical thinking within an applied data science setting. Students begin by learning basic statistical concepts, developing programming skills in R, and working with real-world data. They then learn a family of linear regression models and apply what they have learned across a full data science research cycle. Working in teams, students learn how to formulate research questions, answer them using formal statistics and real-world datasets, and communicate their findings both verbally and in report format. Students are then given the opportunity to extend their team projects as individuals, using advanced methods to formulate and answer new research questions and submitting their findings in a technical scientific report.

The progression of this subject starts with more teaching-intensive methods such as workshops and lectures to give students the technical and conceptual know-how to work as practicing data scientists. However, as the term progresses, students increasingly move towards an individually-driven learning mode, providing first teams, and then individuals, the freedom to develop their statistical thinking and skills further.

By the end of the term students have a strong technical, conceptual, and practical foundation to continue their development as Data Scientists.

Subject learning objectives (SLOs)

Upon successful completion of this subject students should be able to:

1. Manage the complexity of real data science projects and their inevitable compromises
2. Formulate authentic data science questions precise enough to be answered by valid statistical techniques
3. Justify the use of different statistical concepts and tools to audiences from a wide range of backgrounds
4. Find, clean, and merge datasets from a range of sources to answer real world data science problems
5. Apply statistical methods that are appropriate to a dataset and stakeholder requirements
6. Interpret the results of a statistical analysis correctly, visualising and reporting on them in ways that create value for, and are sensitive to the needs of, a wide range of stakeholders
7. Collaborate with and contribute to the professional community of data scientists, both local and global

Course intended learning outcomes (CILOs)

This subject also contributes specifically to the development of the following course outcomes:

  • Exploring and testing models and describing behaviours of complex systems
    Explore and test models and generalisations for describing the behaviour of sociotechnical systems and selecting data sources, taking into account the needs and values of different contexts and stakeholders (1.2)
  • Making the invisible visible
    Use transdisciplinary approaches to seeing and doing to uncover underrepresented, or misrepresented, elements of a system (1.4)
  • Exploring, interpreting and visualising data
    Explore, analyse, manipulate, interpret and visualise data using data science techniques, software and technologies to make sense of data rich environments (2.2)
  • Designing and managing data investigations
    Apply and assess data science concepts, theories, practices and tools for designing and managing data discovery investigations in professional environments that draw upon diverse data sources, including efforts to shed light on underrepresented components (2.4)
  • Developing strategies for innovation
    Explore, interrogate, generate, apply, test and evaluate problem-solving strategies to extract economic, business, social, strategic or other value from data (3.1)
  • Working together
    Develop a collaborative and team-oriented mindset to harness value for stakeholders to produce innovative solutions to challenges (3.3)
  • Engaging audiences
    Explore and craft interpretative narratives that engage key audiences with data analytics and potential significance for action, at a societal, industrial, organisational, group or individual levels (4.2)
  • Informing decision making
    Develop, test, justify and deliver data project propositions, methodologies, analytics outcomes and recommendations for informing decision-making, both to specialist and non-specialist audiences (4.3)

Contribution to the development of graduate attributes

Your experiences as a student in this subject support you to develop the following graduate attributes (GA):

GA 1 Sociotechnical systems thinking
GA 2 Creative, analytical and rigorous sense making
GA 3 Create value in problem solving and inquiry
GA 4 Persuasive and robust communication

Teaching and learning strategies

Authentic problem-based learning: This subject relies heavily upon the principle that students learn best by doing. It offers a range of authentic data science problems to solve that help to develop students’ statistical thinking about complex problems. Students learn how to find and understand the many different forms of R documentation, and to write useful documentation for others. They work on real-world data analysis problems using datasets that they create using modern data harvesting techniques. These are used to answer realistic data science questions in broad areas of topical interest. This exposes them to the true ambiguities, constraints, and complexities of working as a data scientist for a variety of different stakeholders.

Blend of online and face-to-face activities: This subject is offered through a series of block sessions blending online with face-to-face learning. Students interact face-to-face with each other and the teaching team in three intensive modules that require the completion of both preparation and after-class activities. They concurrently use a range of complementary online resources to develop their statistical thinking according to identified weaknesses in their background knowledge. They are expected to engage in online discussion and to actively participate in other blended activities.

Collaborative work: We place a strong emphasis on group activities and collaboration in diverse teams. As a data science professional you need to approach professional projects and challenges by working with people from different backgrounds, expectations, and expertise. This course simulates that environment by requiring students to work with a team of peers who come from many different backgrounds. Group assessments help students to develop effective strategies for working as a part of a data science team, as well as an appreciation that there are diverse perspectives on many different topics in data science and innovation.

Self-paced evaluation and improvement: This subject takes students from an exceptionally wide range of backgrounds, some of whom are better versed in statistical methods, and R, than others. We help all students to self-diagnose their weaknesses and strengths, and to work to improve in areas that they identify as a priority for the professional niche that they would like to occupy as a practicing data scientist. Students choose their own path through a wide variety of curated resources as needed. A series of advanced topics help them to develop further expertise in areas of interest and to prepare for iLabs.

Content (topics)

Module 0: Preparing for Statistical Thinking in Data Science

  • Introduction to key concepts and terminology
  • Using the R programming language and RStudio
  • Working with data using modern data science libraries
  • Obtaining data from different sources

Module 1: First Principles

  • Introduction to statistics
  • Exploratory data analysis
  • Data visualisation
  • Linear regression
  • Model assumptions and diagnostics
  • Writing documentation using R Markdown

Module 2: Regression Models

  • Generalized linear models
  • Logistic regression
  • Model selection
  • Collaboration and version control with Git
  • Pitching, executing, and communicating a data science research project

Module 3: Advanced Topics

  • Time series analysis
  • Marginal/Hierarchical models
  • Bayesian methods
  • Introduction to multinomial models and fitting nonlinear relationships

Assessment

Assessment task 1: Exploration of data skills and issues

Objective(s): 3 and 7
Type: Report
Groupwork: Individual
Weight: 10%
Length: 500-700 words R "vignette" (not including code samples)

Assessment task 2: Data analysis project - Part A

Objective(s): 1, 2, 3, 4, 5, 6 and 7

Type: Project
Groupwork: Group, group and individually assessed
Weight: 50%
Length:

Part A: 750-1000 words

Part B: 5000 words (report) and 6-8 minutes (group presentation)

Part C: 700-1000 words

Assessment task 3: Individual project exploration

Objective(s): 2, 3 and 6
Type: Project
Groupwork: Individual
Weight: 40%
Length: 2000-word Canvas submission

Minimum requirements

Students must participate in all online and face-to-face requirements, as well as complete assessment tasks.

Recommended texts

There are four core resources used by this subject. Each one covers a different aspect of what we teach. Depending on your background and what you are planning to learn you will find at least one useful. You are not expected to read all of these resources cover-to-cover. Use them to help you solve specific problems.

  • To learn R and the tidyverse: Hadley Wickham and Garrett Grolemund (2017) R for Data Science, O’Reilly
    Free online version: http://r4ds.had.co.nz/. We will refer to it as R4DS in this subject.
  • To learn statistical concepts: Bruce, P., & Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media, Inc. You can get it here. We will refer to it as PSDS in this subject.
  • To learn linear regression modelling: Brian Caffo, Regression Models for Data Science in R, Leanpub. You can get a free copy here: leanpub.com/regmods/read. It is written as a companion book to the Coursera Regression Models class, and also has a series of YouTube videos accompanying it. We will refer to it as RM throughout this subject.
  • To run a good data science project: Godsey, B. (2017). Think Like a Data Scientist: Tackle the data science process step-by-step. Manning Publications Co. You can get it here.

Other resources

Additional general and module-specific resources will be available on Canvas.