Randomness, sample size, imagination and metacognition: making judgments about differences in data sets.
Stack, Sue ; Watson, Jane
Consider this:
An experimental treatment for mildly to moderately depressed
patients is tested by taking two groups of 15 to the Caribbean. One
group swims 4 hours per day for 4 weeks (the control group) and the
other group swims for 4 hours per day for 4 weeks with dolphins (the
treatment). Ten out of 15 of the dolphin group improve, whereas 3 out of
15 of the swimming only group improve (Rossman, 2008).
What could you claim from this experiment and for what population?
How certain would you be? Why? What types of things are you thinking
about to inform your decision? What would you like to know more about?
What are the big statistical ideas in this problem?
This was one of the problems that Grade 10 mathematics students
were challenged to consider as a part of an action research project,
where two teachers trialled a statistics and probability unit specially
developed by the second author. A key intent of the five week unit was
to help students develop conceptual and technical tools to be able to
make informal inferences (statistical judgments) about whether claims
based on one sample could be generalised to larger populations, and
about whether differences in data sets were significant or whether the
differences might have occurred by chance (a random event). Through
using the TinkerPlots program (Konold & Miller, 2011) students could
collect samples and create simulations that would enable them to develop
visual and intuitive understandings alongside statistical
interpretations.
The lead teacher was keen to expose her students to technical tools
that might expand their notion of what mathematical inquiry is, and she
liked the way the unit combined ideas of probability and statistics,
being concerned that the usual way of teaching probability revolved
around gaming contexts.
A key challenge of the project was building an understanding of the
concept of randomness within the activities in such a way that students
could draw meaningful conclusions about populations from small samples.
The conversations that the two teachers and the two researchers had with
students problematised existing understandings of what randomness is,
highlighting the range of different meanings and contexts, and how and
when they are used. As a result they became active inquirers into
concepts around randomness, realising that, in building student capacity
for inference, it was crucial to develop and distinguish explicitly the
different ideas around randomness.
Based on the activities used with the students, one aim of this
article is to tease out some of the aspects of randomness that emerged
as a result of this project that might be useful for other teachers. A
metacognitive frame for thinking about informal inference is introduced,
which takes into account the thinking that students were actually
engaged in during the unit of work. The project revealed how important
imaginative thinking is in helping students to engage with the context,
imagine alternative hypotheses, imagine the probability space, imagine
their data in new ways and imagine whole populations.
Swimming with dolphins
Take a moment to imagine the issues in the dolphin problem. Does it
help to imagine being there in the Caribbean swimming with dolphins?
What might be the reasons that swimming with dolphins is a positive
experience (special experience, bonding, dolphins are friendly) and why
might it be negative (scary, not in control)? Does it help to imagine
how the scientists may have designed the experiment? Did they select the
30 people randomly from an entire population of depressed people, or did
they invite people to participate? Would a randomly selected sample be
representative? How would we know it was representative? Would inviting
people to participate tend to select people who like dolphins and
therefore the experiment would be difficult to generalise to a larger
population? How was the control group of 15 people selected out of the
30 people? Was it done randomly and why might this give a fairer test?
What constitutes an improvement? How long was the improvement sustained
after the tests?
Chance devices versus measurement contexts

Single random outcome (chance device)
Expect an unpredictable outcome.
* Toss of a coin: could be either head or tail.
* Roll of a die: could be 1, 2, 3, 4, 5 or 6.

Random sample of size 1 from a population (measurement context)
Expect an unpredictable outcome, with unknown likelihood of being
representative of the population.

Small random sample of outcomes (chance device)
Expect variation in outcomes from the probability model; repeating these
samples will show high variation between the samples.
* Toss a coin 10 times: might give 7 heads and 3 tails. Repeating this
five times might give 6/10, 4/10, 7/10, 5/10 and 4/10 heads.

Small random sample (measurement context)
Expect that small samples are likely to vary from the population.
Comparing different small samples of the same sample size, each randomly
selected from a population, will show variation among them.

Large random sample of outcomes (chance device)
Expect a pattern to appear reflecting the underlying probability model.
* 1000 coin tosses: expect the percentage of heads to be close to 50%
(theoretical probability).
* 1000 die throws: expect the relative frequency of each number to be
close to 1/6 (theoretical probability).

Large sample size randomly selected (measurement context)
Expect close agreement with the population. Comparing different large
random samples will show little variation among them.
Now look at the data: Sixty-seven per cent of people improved when
swimming with dolphins and 20% of people improved just swimming. This
seems to indicate that swimming with dolphins does make a difference.
But could this result have happened by chance? Perhaps it was an extreme
random event. The sample is very small: only 15 people in each group. In
small samples would we expect more extreme events? How could we test
whether this is the case for these data?
How many different ways have we considered randomness in this
problem? What student experience and understanding of randomness are we
assuming?
It is useful to consider randomness in two key mathematical
contexts: randomness as used in chance devices (where it is
theoretically possible to generate infinite populations from single
trials), and randomness in measurement contexts (where there is a finite
population of fixed yet undetermined values that sampling aims to
discern). How does randomness behave in these different contexts, and
can understanding randomness in one context help to understand it in the
other?
Randomness and the long term
A key idea for chance devices is that in the long term (for large
samples) we will see an emerging pattern of outcomes that reflects the
underlying theoretical probability model. This is a model that can be
deduced through imagining all of the possibilities in the probability
space. Research suggests that as a result of playing games few students
are likely to believe that in the long term random generation ensures
fairness (Watson & Moritz, 2003). In the short term (or for small
samples) there is an unpredictability and it is possible that extreme
values occur. The result may not appear fair or representative at all.
Compare 30 rolls of a die versus 3000 based on random generation of data
in TinkerPlots in Figure 1.
[FIGURE 1 OMITTED]
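Readers without TinkerPlots can reproduce the contrast of Figure 1 with a short simulation. In this Python sketch (the function name and seed are our own, not part of the unit) a fair die is rolled 30 and 3000 times and the relative frequency of each face is compared with the theoretical 1/6.

```python
import random
from collections import Counter

def roll_frequencies(n_rolls, seed=1):
    """Roll a fair six-sided die n_rolls times and count each face."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) for _ in range(n_rolls))

small = roll_frequencies(30)
large = roll_frequencies(3000)

# The small sample can stray far from 1/6 (about 0.167), possibly looking
# "loaded"; the large sample settles close to the theoretical probability.
for face in range(1, 7):
    print(face, round(small[face] / 30, 2), round(large[face] / 3000, 3))
```

Running the sketch a few times with different seeds makes the point vividly: the 30-roll frequencies jump around from run to run, while the 3000-roll frequencies barely move.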
The unusual outcome on the left of Figure 1 might lead us to
believe it was caused as a result of a specific intervention (i.e.,
using a loaded die), when in fact it might just be a "usual"
characteristic of randomness--the generation of an extreme event.
Research suggests that students have difficulty in discerning between
caused and random extreme events generated by chance devices (Chiesi
& Primi, 2009). Young students, for example, seeing five heads in a
row might predict another head, thinking it is a caused pattern, whereas
older students might predict a tail because they expect the results to
even out, not recognising that each toss is an independent event. A
typical assumption by students is that a large
sample of random events will be replicated in a small sample (Watson
sample of random events will be replicated in a small sample (Watson
& Moritz, 2000). It is difficult for many to imagine how the sample
size might affect the number of extreme events that we see.
Comparison of sample sizes in chance devices
Using simulation we can compare a range of sample sizes to generate
random data. We toss a coin 10 times to represent a sample size of 10,
and compare that with 30, 50 and 100 tosses, to represent larger sample
sizes (see Figure 2). For each one we do this for 1000 trials to get a
sense of what the pattern will look like in the long term. The plots
show the frequency of outcomes for heads. The students were interested
in seeing how often heads occurred more than 60% of the time in each set
of samples--these outcomes represented an 'extreme' event. As
the sample size got larger the percentage of extreme events reduced.
[FIGURE 2 OMITTED]
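The students' question, how often does a fair coin give heads more than 60% of the time, can be estimated the same way. This Python sketch (the function name, threshold parameter and seed are our own) repeats each sample size 1000 times, mirroring the simulation behind Figure 2.

```python
import random

def extreme_event_rate(sample_size, trials=1000, threshold=0.6, seed=0):
    """Estimate how often a fair coin gives more than `threshold` heads
    as a proportion of `sample_size` tosses."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(sample_size))
        if heads / sample_size > threshold:
            extreme += 1
    return extreme / trials

# As the sample size grows, the proportion of 'extreme' samples shrinks.
for n in (10, 30, 50, 100):
    print(n, extreme_event_rate(n))
```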
Comparison of sample size for fixed populations
The previous simulation is a very visual demonstration about how
sample size in the context of chance devices affects the generation of
extreme events. We would expect a similar trend comparing a range of
sample sizes of large finite populations--the smaller the sample size
the greater the chance of extreme events.
In the following situation we have a known population of 677 data
points. Each data point represents a person who came on the First Fleet
and the value of the crime for which he or she was transported (see
Figure 3; data from Watson et al., 2011). We can put these data into a
random sampler and then select samples of 10 data values and compare
their variation and medians with the actual population measures. For a
small sample size taken over several trials we see much variation in the
medians of the individual samples (Figure 4). Considered together, 100
of the sample medians have their own median of 26.2 (Figure 5). As the
sample size increases to 30 data values, the cluster of medians has less
variation (Figure 6) and the median of the medians (29.5) is closer to
the population median of 29 (Figure 3).
Discussion questions: To what extent might notions of random
sampling as fair and representative hold for samples of 10 versus 30?
On what are you basing your judgments? What are the implications for
sampling unknown populations?
[FIGURE 3 OMITTED]
[FIGURE 4 OMITTED]
[FIGURE 5 OMITTED]
[FIGURE 6 OMITTED]
Random re-allocation of data
Can we use the above ideas to help us think about the dolphin data
taken from a small sample of an unknown population? We want to be able
to discern whether the results mean that the improvement in depression
was caused by swimming with dolphins for our small sample, and then be
able to generalise to a larger population.
To help us do that we need to imagine that the reverse is true.
Despite the logical contextual reasons that swimming with dolphins
causes improvement and our expectation that it might be true, we are
going to imagine that the dolphin experimental results are merely due to
a random extreme event.
Consider each data value as representing one person with
information about what group he or she is in (dolphin or control) and
his or her result (improved or not improved). This can be represented by
the two-way table in Figure 7. We have 13 people who improved (10 from
the dolphin group) and 17 people who did not (5 people from the dolphin
group). Now what if the treatment had no effect? We would expect the
results of improved/not improved to be randomly spread across the two
groups. We can estimate this possibility using a TinkerPlots simulation
where we put our results into a random generator and then re-assign the
results to the two groups (dolphin or control). Details for carrying out
the re-allocation with TinkerPlots are found in Watson (2013). (1)
We might find that when randomly re-assigned there are only 6
people who 'improve' by swimming with dolphins, then 4 for the
next trial, then 8. When we do this for 100 trials we can look at the
randomly generated frequency distribution of people who
'improved' by swimming with dolphins (Figure 8).
[FIGURE 7 OMITTED]
[FIGURE 8 OMITTED]
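Readers without TinkerPlots can approximate the re-allocation with a short permutation simulation. In this Python sketch (the function name and seed are our own) the 30 pooled outcomes are repeatedly shuffled, 15 are dealt back to a notional 'dolphin' group, and we count how often that group receives 10 or more improvers by chance alone.

```python
import random

def reallocation_rate(improved_total, group_size, observed,
                      trials=1000, seed=0):
    """Pool all outcomes (1 = improved, 0 = not), shuffle them, deal
    group_size of them to the 'dolphin' group, and count how often that
    group gets `observed` or more improvers purely by chance."""
    rng = random.Random(seed)
    outcomes = [1] * improved_total + [0] * (2 * group_size - improved_total)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(outcomes)
        if sum(outcomes[:group_size]) >= observed:
            extreme += 1
    return extreme / trials

# 13 improvers among 30 people; observed result: 10 in the dolphin group of 15.
rate = reallocation_rate(improved_total=13, group_size=15, observed=10)
print(rate)  # close to the roughly 1% found with TinkerPlots in the article
```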
The random simulation created a spread in values for improvement
after swimming with dolphins, and only 1% showed the result that 10 or
more people 'improved' when they swam with dolphins. We
therefore get a numerical value that suggests the likelihood that the
result of the actual experiment was random. We can say, "If we
randomly re-allocate the data again and again only about 1% of the time
is the randomly generated result the same or more extreme than our
experimental result." This gives us high confidence in the actual
experimental results for this sample. From this we can take the next
step to generalise to a larger population.
In the unit, students also used this process when comparing two
different frequency distributions, e.g., the number of meaningful words
memorised versus the number of meaningless words memorised--two data
sets that they had experimentally generated themselves. This activity
was based on suggestions of Shaughnessy, Chance and Kranendonk (2009).
Generalisation
The aim of the random re-allocation process is to help students
have more confidence and discernment in generalising from a sample to a
larger population due to having a numerical likelihood of it being a
random occurrence for a given number of trials. In this step, however,
students also need to draw in aspects about the experimental design that
they have considered earlier: how certain is it that experimental bias
has been mitigated, how was the small sample selected from the
population, what population might this be reasonably generalised to?
When these factors are considered then students should be able to
provide an overall level of confidence in being able to generalise the
outcome to a larger population.
Metacognition, imagination and making explicit the thinking
processes
The action research project highlighted how important it is to help
students create concept maps for the thinking they are doing. It would
help if the class developed a glossary and map of the terminology that
they were using, particularly notions about randomness, samples and
populations. It would be helpful during the course of the unit to spell
out the different slang meanings versus statistical meanings, the
problems with the meanings as they come up in new conceptual contexts,
and questions around these.
It is important to make explicit the sort of thinking that is
involved in making effective inferences and discerning judgments. When
teachers are drawing on rich contexts in mathematics they may find
themselves participating in more general discussions where understanding
the context is as important as understanding the mathematics and for the
students the lines between the two contexts become blurred. Initially
some students drew on other forms of thinking such as scientific
thinking in critiquing the experimental design and general reasoning in
looking for why an experiment might work, with more focus on the context
than on a statistical evaluation of the data. They often felt one
dimension of evaluation was enough and did not bring the different types
of evaluation together.
In reflecting on the student responses for the dolphin problem it
was clear that many of them drew on valuable forms of thinking that need
to be made explicit as part of an overall process of making informal
inferences. Figure 9 attempts to capture the type of thinking different
students were engaged in during the unit and within this context provide
a framework for making informal inferences. At every stage of the
inference process imagination is required in thinking about the
questions posed. Without the questions and imagination to think about
alternatives, the process becomes a textbook exercise.
It is important to have metacognitive conversations with students
to encourage them to reflect on their thinking and what they are drawing
on in coming up with judgments. As they describe the "sources"
of their thinking these can be captured, valued, discussed, refined and
mapped in a similar way to what is shown in Figure 9 in creating a
metacognitive frame for the unit. This then creates visible thinking
tools that students can use along with specific statistical thinking
such as using the concepts of variation, distribution, expectation,
randomness, and informal inference. A key part is helping students
overcome limitations in thinking about mathematics being only associated
with formulae and calculations, rather, that it requires imagination and
flexibility in thinking.
Conclusion
There is considerable research on the difficulties students have in
conceptualising individual concepts of probability and statistics (see
for example, Bryant & Nunes, 2012; Jones, 2005). The unit of work
developed for this action research project was specifically designed to
address some of these in order to help students create visual and
intuitive understandings of the issues of sampling, randomness and
populations. The interweaving of concepts, combined with the technical
skills, was challenging for the teachers, students and researchers.
To create meaningful and deep learning experiences for students
requires considerable teacher knowledge and skill, being able to weave
together concepts with context and metacognitive thinking. In
particular, it takes skill in helping students to draw out from diverse
activities the central and connecting themes. At the end of the program
the lead teacher said she had previously never thought of designing a
unit around a key idea such as randomness or inference. Being able to
use such big conceptual ideas as a theme for a unit provided a very
valuable and interesting experience, certainly deepening her own
experience of the nuances within the ideas and how to build more
connective conceptual experiences for the students.
Figure 9. Metacognitive frame for making informal inferences
for the dolphin context.
Making statistical judgements
about differences in data sets
CONTEXT
1. Hypothesis
Prediction of the generalisation for a wider population based
on an experiment to discern different treatments.
2. What are logical reasons that the hypothesis may or
may not be true?
3. Experimental validity
* What are the likely biases that could distort the results?
* What strategies can mitigate bias?
(e.g., random selection, control group)
* How representative is this sample of the more general
population?
* How would you describe this sample size and its adequacy
for generalisation?
STATISTICAL INFERENCE
4. Description of the evidence
* Distribution, central tendency, differences.
* Is there an indication that the difference is significant
for these data?
5. Could this have happened by chance?
* How frequently would a random generation of these numbers
create this experimental result if repeated 100 times in a
simulation using random re-allocation of data?
* Based on this, how certain are you that this is a caused
result, rather than a random one?
6. Generalisability of the experiment
* What is your claim?
* What is the population that you could generalise this
claim to?
* What is the certainty of your claim based on the
experimental validity and the random re-allocation of
data simulation?
7. Implications and recommendations
* Is this result significant enough to recommend any
action?
* How might these data be used?
* What questions are still remaining?
For students who go on to study formal statistics, the hope is that
the experiences with informal inference in memorable contexts such as
the dolphin problem will provide a foundation for appreciating and
understanding the formal statistics associated with t-tests and
p-values. For those who do not go on to study formal statistics, it is
hoped that they have gained an appreciation of randomness and its
usefulness in decision-making, moving beyond seeing random only as
haphazard.
Informal inference
* based on evidence
* can be generalised to a larger population
* gives an estimate of certainty.
Random phenomenon
A phenomenon is called random if the outcome of a single repetition
is uncertain but there is nonetheless a regular distribution of relative
frequencies in a large number of repetitions.
Simple random sample
A simple random sample of size n consists of n units from the
population chosen in such a way that every set of n units has an equal
chance to be the sample actually selected.
(Definitions from Moore and McCabe, Introduction to the Practice of
Statistics, 1993)
In the action research project students initially used
'random' as a slang word meaning 'unexpected'.
'It is random' meant it could be anything.
Students expected to see the frequency distribution of the die to
even out in the long term, but were surprised by how many rolls of the
die it took.
Some of the students had problems in comparing different samples
with the population and making a judgment on how similar or different
they were. Part of the reason for this is that outliers changed the X
axis for each plot so students had to manipulate the graphs to get
comparative scales.
Most students initially found it difficult to imagine that the
results of the dolphin treatment could be due to chance, rather than
caused by the treatment. They were captured by the conviction that the
treatment was logically effective and therefore true.
At the beginning many had difficulty in understanding what random
reallocation was, how it worked and why they were doing it. Some would
get the idea, only to lose it later, then regain it. It was conceptually
difficult and needed considerable teacher support. With several
examples, however, most students developed a visual/intuitive sense of
the likelihood that a result could have been a caused versus a random
event. It was important to give some examples where the difference was
less obvious.
References
Bryant, P. & Nunes, T. (2012). Children's understanding of
probability: A literature review. London: Nuffield Foundation. Retrieved
from www.nuffieldfoundation.org
Chiesi, F. & Primi, C. (2009). Recency effects in primary-age
children and college students. International Electronic Journal of
Mathematics Education, 4(3), 259-274.
Cobb, G. (2007). The introductory statistics course: A Ptolemaic
curriculum? Technology Innovations in Statistics Education, 1(1).
Retrieved from http://escholarship.org/uc/item/6hb3k0nz
Jones, G.A. (Ed.). (2005). Exploring probability in school:
Challenges for teaching and learning. New York: Springer.
Konold, C. & Miller, C.D. (2011). TinkerPlots: Dynamic data
exploration [computer software, Version 2.0]. Emeryville, CA: Key
Curriculum Press.
Moore, D. S. & McCabe, G. P. (1993). Introduction to the
practice of statistics (2nd ed.). New York: W. H. Freeman.
Shaughnessy, J. M., Chance, B. & Kranendonk, H. (2009). Focus
in high school mathematics reasoning and sense making: Statistics and
probability. Reston, VA: NCTM.
Watson, J. M. (2013). Resampling with TinkerPlots. Teaching
Statistics, 35(1), 32-36.
Watson, J., Beswick, K., Brown, N., Callingham, R., Muir, T. &
Wright, S. (2011). Digging into Australian data with TinkerPlots.
Melbourne: Objective Learning Materials.
Watson, J. M. & Moritz, J. B. (2000). Developing concepts of
sampling. Journal for Research in Mathematics Education, 31, 44-70.
Watson, J. M. & Moritz, J. B. (2003). Fairness of dice: A
longitudinal study of students' beliefs and strategies for making
judgments. Journal for Research in Mathematics Education, 34(4),
270-304.
Sue Stack
University of Tasmania
<[email protected]>

Jane Watson
University of Tasmania
<[email protected]>
(1) This process of random re-allocation is a visually intuitive
yet numerically based alternative to a t-test approach that some
students will meet in later years. It is the technique recommended by
Cobb (2007).