Measuring the Teaching and Learning Journey

Jun 20, 2019

This is an excerpt from Education Cargo Cults Must Die by John Hattie and Arran Hamilton. To download the full white paper that is part of the Corwin Australia Educator Series, click here.

High-stakes assessment has been an important rite of passage throughout much of human history. Many ancient cultures and tribal societies required their young to undertake risky and painful quests to make the transition to adulthood. We continue to deploy this rite of passage in the form of national school leaver examinations today. Modern educational assessments are high stakes but without the physical risk of the tribal tests (although they can induce high levels of stress). Different times, different measures. SATs, A-Levels, the International Baccalaureate, and other assessments signal to employers and training providers that school leavers have acquired the required skills for the next stage of their journey.

These assessments can tell us, often with relatively high levels of accuracy, a student’s level of competence in mathematics, literacy, foreign languages, and science, and the depth and breadth of knowledge they have acquired across a range of curriculum areas. From this, we can also make inferences about a student’s readiness for university studies and life beyond school, albeit with less precision.

Navigating by the Light of the Stars 
The outcomes of high-stakes summative assessments are also often used to make inferences about the quality of schools (e.g., school league tables), school systems (e.g., PISA, TIMSS, and PIRLS), and individual teachers, and about whether certain education products and programs are more effective than others. In other words, they are often used in the quest to distinguish educational gold from education cargo cults and to validate the former over the latter.

In this context, high-stakes assessments are blunt instruments: more like piloting your boat by the stars on a cloudy night than navigating with GPS. We can infer something about which schools are higher and lower performers, but we need to carefully tease out background variables, such as the starting points and circumstances of the learners and the multiple other important outcomes, so that we measure the distance travelled rather than the absolute end point in one set of competencies. Indeed, all too often, we find that the greatest variability in learning outcomes is not between different schools but between different teachers within the same school (McGaw 2008). The key unit of analysis should be the teacher rather than the school, and the results of many high-stakes assessments may not be attributable to a particular school.

In the context of individual teachers (provided there is a direct link between the teacher and the particular content assessed), the outcomes of high-stakes assessments can tell us quite a lot about which teachers are more or less effective – particularly where the pattern of performance holds over several years. 

Again, care is needed. It is not only the final outcomes of the assessments that should be considered, but also the growth from the beginning to the end of the course. Otherwise, teachers whose students start out knowing much but grow little look great, while teachers whose students start out knowing less but grow remarkably look poor, when it should be the other way around.
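To make the arithmetic concrete, here is a minimal sketch with purely illustrative numbers (the teachers, scores, and 100-point scale are invented for this example, not taken from the white paper):

    # Purely illustrative numbers: two hypothetical classes scored out of 100
    # at the start and end of a course.
    classes = {
        "Teacher A": {"start": 80, "end": 85},   # starts high, grows little
        "Teacher B": {"start": 45, "end": 70},   # starts low, grows a lot
    }

    for teacher, scores in classes.items():
        growth = scores["end"] - scores["start"]
        print(f"{teacher}: end-of-course score {scores['end']}, growth {growth}")

    # Ranked on the end-of-course score alone, Teacher A (85) looks stronger;
    # ranked on growth, Teacher B (+25 versus +5) does.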

But unless the outcomes of high-stakes assessments are reported back to schools at the item level (i.e., how well students did and grew on each component of the assessment, rather than just the overall grade), teachers are left in the dark about which elements of their practice (or of third-party products and programs) are more or less effective. They just know that, overall, by the light of the stars, they are navigating in the right or wrong direction. And even where they are navigating in the wrong direction, there are likely some elements of their tradecraft or product kitbag that are truly outstanding but go unnoticed.

Even where teachers are able to access item-level data from high-stakes assessments, the inferential jump required to map this systematically back to specific elements of their tradecraft, or to the impact of specific training programs or pieces of educational technology, is too great to make with any meaningful fidelity.

Navigating With a GPS System
The only way we can use student achievement data with any sense of rigour to tease out the educational gold is twofold: collect it (formatively) at the beginning and middle of the journey and (summatively) at the end, so that we can systematically measure the distance travelled by students; and experimentally vary very narrow elements of teacher practice to see whether this results in an upward or downward spike in student performance.
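As a rough sketch of what that measurement could look like, assume simple per-student scores at the three points and a crude growth index (the function name, the scores, and the gain-over-spread measure are assumptions for illustration, not a method prescribed in the paper):

    from statistics import mean, stdev

    def distance_travelled(before, after):
        # Crude growth index: average gain divided by the spread of the
        # earlier scores. A simplification for illustration only.
        gains = [b - a for a, b in zip(before, after)]
        return mean(gains) / stdev(before)

    # Hypothetical scores for one class at the beginning, middle, and end of
    # a course, spanning a deliberate change to one narrow element of practice.
    pre  = [42, 55, 48, 61, 50, 47]
    mid  = [50, 60, 55, 66, 57, 53]
    post = [58, 67, 63, 72, 64, 60]

    print("beginning to middle:", round(distance_travelled(pre, mid), 2))
    print("beginning to end:   ", round(distance_travelled(pre, post), 2))

Run before and after a deliberate change to practice, the same calculation gives the upward or downward spike the paragraph describes.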

It is as important to know about the efficiency and effectiveness of the journey as it is to reach your destination. This is one of the benefits of GPS systems.

Within the context of the individual teacher in the individual classroom, we know that formative evaluation is educational gold in and of itself (Hattie & Timperley 2007). The most effective approach to formative evaluation contains three components: 

  • Feed-up: Where am I going? 
  • Feed-back: How am I doing? 
  • Feed-forward: What is my next step?

What is important is not the testing itself, but the way that it is incorporated into the cycle of challenging goals to support learners in unlocking the skill, will, and thrill to learn.

The challenge, of course, is that “everything seems to work somewhere and nothing everywhere” (Wiliam 2014). So, even where this analysis is conducted systematically, we cannot be completely certain that the educational approach, training program, or technology intervention that resulted in education gold in one context will not end up looking like a cargo cult in quite another.

We need repeated evaluation projects that investigate the same approaches across many different contexts to give us much greater confidence in the fidelity of our findings. And once we have this data, we face the challenge of vacuuming it up from disparate sources and drawing out the common threads to build a compelling narrative about what’s a “cargo cult” and what’s gold. We can then ask not only about overall effects, but also under what conditions and for which students programs work best.

Written by John Hattie and Arran Hamilton
Dr John Hattie has been Professor of Education and Director of the Melbourne Education Research Institute at the University of Melbourne, Australia, since March 2011. He was previously Professor of Education at the University of Auckland. His research interests are based on applying measurement models to education problems. He is president of the International Test Commission, has served as an advisor to various Ministers, chaired the NZ performance-based research fund, and in the most recent Queen’s Birthday Honours was appointed to the New Zealand Order of Merit for services to education. He is a cricket umpire and coach, enjoys being a Dad to his young men, is besotted with his dogs, and moved with his wife as she attained a promotion to Melbourne. Learn more about his research at www.corwin.com/visiblelearning, and view his Corwin titles here.

Dr Arran Hamilton is Group Director of Strategy at Cognition Education. His early career included teaching and research at Warwick University and a stint in adult and community education. Arran transitioned into educational consultancy more than 15 years ago and has held senior positions at Cambridge Assessment, Nord Anglia Education, Education Development Trust (formerly CfBT) and the British Council. Much of this work was international and focused on supporting Ministries of Education and corporate funders to improve learner outcomes.