36103 Statistical Thinking for Data Science
Warning: The information on this page is indicative. The subject outline for a
particular session, location and mode of offering is the authoritative source
of all information about the subject for that offering. Required texts, recommended texts and references in particular are likely to change. Students will be provided with a subject outline once they enrol in the subject.
Subject handbook information prior to 2021 is available in the Archives.
Credit points: 8 cp
Postgraduate
Result type: Grade, no marks
There are course requisites for this subject. See access conditions.
Knowledge of concepts taught in 35513 Statistical Methods is assumed.
This subject helps students to advance their thinking about statistics and how it can be used, or abused, in data science. Starting from the assumed knowledge of basic statistics that students bring into the subject (including concepts such as probability, distributions, hypothesis testing, significance, power and confidence), students quickly develop their ability to create modern statistical models in real-world data science contexts. Learning to use the powerful language R, students work their way through the entire data science cycle: data collection, cleaning and merging datasets, exploratory analysis, modelling and reporting. This process provides rapid exposure to the wide range of modern packages (for example, the tidyverse) that facilitate rapid statistical analyses for data science questions. Students also learn to make the invisible trends in datasets visible, to make predictions from complex datasets and to reproducibly document their statistical procedures for different audiences. Working with a team of data science professionals from a variety of backgrounds, students learn how to appropriately communicate their newfound statistical insights and engage a variety of audiences and stakeholders in order to inform decision making. A selection of advanced topics helps each student to concurrently follow their own personalised learning journey: evaluating and addressing any gaps in their knowledge and skills, and preparing for future electives and iLab projects.
Subject learning objectives (SLOs)
Upon successful completion of this subject students should be able to:
1. Manage the complexity of real data science projects and their inevitable compromises
2. Formulate authentic data science questions precise enough to be answered by valid statistical techniques
3. Justify the use of different statistical concepts and tools to audiences from a wide range of backgrounds
4. Find, clean, and merge datasets from a range of sources to answer real world data science problems
5. Apply statistical methods that are appropriate to a dataset and stakeholder requirements
6. Interpret the results of a statistical analysis correctly, visualizing and reporting upon them in ways that create value for, and are sensitive to the needs of, a wide range of stakeholders
7. Collaborate with and contribute to the professional community of data scientists, both local and global
Course intended learning outcomes (CILOs)
This subject also contributes specifically to the development of the following course outcomes:
- Exploring and testing models and describing behaviours of complex systems
Explore and test models and generalisations for describing the behaviour of sociotechnical systems and selecting data sources, taking into account the needs and values of different contexts and stakeholders (1.2)
- Making the invisible visible
Use transdisciplinary approaches to seeing and doing to uncover underrepresented, or misrepresented, elements of a system (1.4)
- Exploring, interpreting and visualising data
Explore, analyse, manipulate, interpret and visualise data using data science techniques, software and technologies to make sense of data rich environments (2.2)
- Designing and managing data investigations
Apply and assess data science concepts, theories, practices and tools for designing and managing data discovery investigations in professional environments that draw upon diverse data sources, including efforts to shed light on underrepresented components (2.4)
- Developing strategies for innovation
Explore, interrogate, generate, apply, test and evaluate problem-solving strategies to extract economic, business, social, strategic or other value from data (3.1)
- Working together
Develop a collaborative and team-oriented mindset to harness value for stakeholders to produce innovative solutions to challenges (3.3)
- Engaging audiences
Explore and craft interpretative narratives that engage key audiences with data analytics and potential significance for action, at a societal, industrial, organisational, group or individual levels (4.2)
- Informing decision making
Develop, test, justify and deliver data project propositions, methodologies, analytics outcomes and recommendations for informing decision-making, both to specialist and non-specialist audiences (4.3)
Contribution to the development of graduate attributes
Your experiences as a student in this subject support you to develop the following graduate attributes (GA):
GA 1 Sociotechnical systems thinking
GA 2 Creative, analytical and rigorous sense making
GA 3 Create value in problem solving and inquiry
GA 4 Persuasive and robust communication
Teaching and learning strategies
Authentic problem-based learning: This subject relies heavily upon the principle that students learn best by doing. It offers a range of authentic data science problems to solve that help to develop students’ statistical thinking about complex problems. Students learn how to find and understand the many different forms of R documentation, and to write useful documentation for others. They work on real-world data analysis problems using datasets that they create using modern data harvesting techniques. These are used to answer realistic data science questions in broad areas of topical interest. This exposes them to the true ambiguities, constraints, and complexities of working as a data scientist for a variety of different stakeholders.
Blend of online and face-to-face activities: This subject is offered through a series of block sessions blending online with face-to-face learning. Students interact face-to-face with each other and the teaching team in three intensive modules that require the completion of both preparation and after-class activities. They concurrently use a range of complementary online resources to develop their statistical thinking according to identified weaknesses in their background knowledge. They are expected to engage in online discussion and to actively participate in other blended activities.
Collaborative work: We place a strong emphasis on group activities and collaboration in diverse teams. As a data science professional you need to approach professional projects and challenges by working with people who have different backgrounds, expectations, and expertise. This subject simulates that environment by requiring students to work with a team of peers who come from many different backgrounds. Group assessments help students to develop effective strategies for working as part of a data science team, as well as an appreciation that there are diverse perspectives on many different topics in data science and innovation.
Self-paced evaluation and improvement: This subject takes students from an exceptionally wide range of backgrounds, some of whom are better versed in statistical methods, and R, than others. We help all students to self-diagnose their weaknesses and strengths, and to work to improve in areas that they identify as a priority for the professional niche that they would like to occupy as a practising data scientist. Students choose their own path through a wide variety of curated resources as needed. A series of advanced topics helps them to develop further expertise in areas of interest and to prepare for iLabs. NOTE: There is still an expectation that each of you will have completed a basic course in statistics (see prerequisite information).
It is assumed that participants will have some familiarity with foundational statistical concepts such as randomness and properties of random variables, the normal and binomial distributions, and will be able to interpret descriptive statistics presented in table or graphical form. Please talk to the subject coordinator if you have any concerns about your foundational knowledge and skills.
Students will learn to apply the R programming language to statistical problems in data science throughout the subject.
Module 1: First Principles
- Using R in the Data Science Cycle
- Getting Data for Data Science
- Linear Regression
- Factor Analysis
Module 2: Regression Models
- Generalized Linear Models
- Logistic Regression
- Multinomial models
- Marginal/Hierarchical Models
- Model selection
Module 3: Advanced Topics
- Principal Components Regression
- Time Series Analysis
- Geo-Spatial Models
- Structural Equation Modelling
- other methods as appropriate
Assessment task 1: Exploration of data skills and issues
Objective(s): 3 and 7
500-700 word R "vignette" (not including code samples)
Assessment task 2: Data analysis project - Part A
Objective(s): 1, 2, 3, 4, 5, 6 and 7
Groupwork: Group, group and individually assessed
Part A: 750-1000 words
Part B: 5000 words (report) and 10-15 minutes (presentation)
Part C: 700-1000 words
Assessment task 3: Individual project exploration
Objective(s): 2, 3 and 6
2000-word blog post.
Students must participate in all online and face-to-face requirements, as well as complete assessment tasks.
Principal texts recommended for this subject
Each one covers a different aspect of what we teach. Depending on your background and what you are planning to learn you will find at least one useful. NB: We do not expect you to read all three from cover to cover! Use them to help you solve specific problems!
- To Learn R: Hadley Wickham and Garrett Grolemund (2017) R for Data Science, O’Reilly
Free online version here: http://r4ds.had.co.nz/. We will refer to it as R4DS in this subject.
- To Learn Statistical Concepts: Bruce, P., & Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media, Inc. You can get it here: https://search.lib.uts.edu.au/permalink/61UTS_INST/19joism/alma991006773665405671. We will refer to it as PSDS in this subject.
- To run a good Data Science project: Godsey, B. (2017). Think Like a Data Scientist: Tackle the data science process step-by-step. Manning Publications Co. You can get it here: https://search.lib.uts.edu.au/permalink/61UTS_INST/19joism/alma991004346589705671
Other potentially useful texts and resources
- Caffo, B. Regression models for data science in R. Leanpub. You can get a free copy here: https://leanpub.com/regmods. We will refer to it as RM throughout this subject. This book provides an accessible introduction to linear and generalized linear regression models. It is written as a companion book to the Coursera Regression Models class, and also has a series of YouTube videos accompanying it. A great set of resources!
- Caffo, B. Statistical inference for data science. Leanpub. You can get a free copy here: https://leanpub.com/LittleInferenceBook
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York: Springer.
Free online version here: http://www-bcf.usc.edu/%7Egareth/ISL/.
- The EdX Course on Statistical Thinking for Data Science: https://courses.edx.org/courses/course-v1:ColumbiaX+DS101X+1T2016/course/
- The Penn State online statistics courses are an invaluable resource - they cover everything we will, and more besides! Access them here: https://onlinecourses.science.psu.edu/statprogram/programs
- Harrell, F. (2015). Regression Modeling Strategies. Springer. Copies are available in the library (including one on reserve) and some chapters are available on UTS Online.
- Baumer, Kaplan, & Horton (2017). Modern Data Science with R, CRC Press. (Available at the library: http://find.lib.uts.edu.au/?R=OPAC_b3246164 or some sample chapters can be found at this site: http://mdsr-book.github.io/)
Massive Open Online courses (MOOCs)
- DataCamp has a number of courses that you can access for free (or you can pay for full access)
(A number of those links go to specialisations - there are courses on statistics, R, machine learning and many other things there!)