
Laura Tully-Gustafson

Term Project part 6


For this research project, each member of the class counted the different colors found in a bag of
Skittles and recorded them. We compiled the results and computed a variety of statistics from
them. It is interesting to compare the overall proportions of colors with the proportions in each
individual bag.
The first thing we did was look at the class totals of different colors and how they compared.
Here is a table of frequencies.

The graphs do reflect what I expected to see. The class data shows almost equal amounts of
each color of candy, so on the pie chart all of the pieces of pie are about the same, and in the
Pareto chart, all of the bars are about the same size. We have a fairly large sample size with
the entire class, and I would guess that the Skittles factory produces about the same number
of each color of Skittles, so it makes sense that we have a relatively uniform distribution of
colors once the data is compiled.

The distribution of colors in the total class does not match my own data. The colors of Skittles
in my individual bag were not distributed uniformly. This makes sense because one bag of
Skittles is a small sample, and it is unlikely to be an accurate representation of the entire
population.
Group Portion:
Since we each had different proportions, we had different predictions. One person had the largest
number of orange candies, so she predicted that the class would have the highest percentage of
orange. One person expected there to be equal proportions of all of the colors. That would be 20%
each of red, orange, yellow, green, and purple.
The actual percentages of the class candies are (rounded to the nearest hundredth of a percent)
19.87% red, 19.87% orange, 20.64% yellow, 19.94% green, and 19.69% purple. This is close to
20% of each color. Orange was not the highest, but it was about equal to the other colors.
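
As a quick check of how close the class percentages are to the 20% prediction, the comparison can be sketched in a few lines of Python. This is only an illustration: it uses the rounded percentages quoted above, and the variable names are my own.

```python
# Class percentages as reported above (already rounded to two decimals).
reported = {
    "red": 19.87,
    "orange": 19.87,
    "yellow": 20.64,
    "green": 19.94,
    "purple": 19.69,
}

# Under the "equal proportions" prediction every color gets the same share.
expected = 100 / len(reported)

# Difference between each reported percentage and the predicted 20%.
deviations = {color: round(pct - expected, 2) for color, pct in reported.items()}

# Color whose share is farthest from the prediction, in either direction.
largest = max(deviations, key=lambda color: abs(deviations[color]))

# The rounded percentages do not have to sum to exactly 100.
total = round(sum(reported.values()), 2)

print(deviations)
print(largest, total)
```

Every color is within one percentage point of 20%, and the rounded values sum to slightly more than 100 because each was rounded separately.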

Total number of candies of each color in class data set:

Within our specific group we had a member who had bought their Skittles bags out of state, so our
overall population has to be all 2.17-ounce bags of Skittles within the United States. As a result,
the sampling method that we used is not entirely random. Our sample has a lot of 2.17-ounce bags
from Utah and a few anomaly bags from out of state; therefore I would say it is not random, as we
certainly did not get a representation of all of the states that make up the population.
In order to best represent the population, we would recommend that a stratified sampling method
be used, with each state treated as a stratum. By stratified sampling we could ensure that there is
a sample from every state, so that the sample is not so biased (since most entries came from Utah).
In order to accomplish this, we would need to expand the sample size to at least 50 bags, so that
there would be at least one bag from each state.
We calculated some more metrics about our data set:

(a) mean number of candies per bag
59.1
(b) standard deviation of the number of candies per bag
6.4
(c) 5-number summary for the number of candies per bag
34, 58, 60, 62, 71

The shape of the distribution is skewed left; the bars trail off farther to the left of the
median than to the right. The graph shows more variation than I would expect in bags of
Skittles. There were 48 bags in the class, and the range was 37. There are 8 outliers in the
class: three values in the distribution lie above the upper fence (66), and five values lie
below the lower fence (54). I had 59 candies, which was very close to the median of 60.
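
The range and fences can be recomputed from the five-number summary with the usual 1.5 × IQR rule. The sketch below is illustrative only, and the function name is my own. Applied to the quartiles as listed (58 and 62), the rule gives fences of 52 and 68. The fences quoted above (54 and 66) are what the same rule gives for quartiles of 58.5 and 61.5, so the listed quartiles may have been rounded; that is an assumption, since the raw data are not shown here.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summary as listed above (possibly rounded).
minimum, q1, median, q3, maximum = 34, 58, 60, 62, 71

# Range is the maximum minus the minimum.
data_range = maximum - minimum

# Fences from the quartiles exactly as listed.
fences_listed_quartiles = tukey_fences(q1, q3)

# ASSUMPTION: hypothetical unrounded quartiles, shown only for comparison.
fences_half_quartiles = tukey_fences(58.5, 61.5)

print(data_range)
print(fences_listed_quartiles)
print(fences_half_quartiles)
```

Any value below the lower fence or above the upper fence is counted as an outlier, so the number of outliers depends on which quartiles are used.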

For categorical data, you can use bar graphs or pie charts. For both of these, the different
categories do not have to be numerical; they can just be compared using the sizes of the bars
or the sizes of the slices of pie (if the values add up to 1).
For quantitative data, you can use a stem and leaf plot, a histogram, a scatter plot or a
boxplot. These depend on having numerical data that can be compared. You could also use a
bar graph or pie chart, but it would not illustrate that one category was bigger than another
numerically. For quantitative data, you can calculate almost anything: mean, median, range,
standard deviation, quartiles, upper fence, lower fence, and variance.
For categorical data, you can find the mode and calculate a count or percentage for each
category. It doesn't make sense to calculate most of the other numbers because we don't have a
numerical value for how far apart two categories are. Therefore it wouldn't make sense to try
to calculate the dispersion of categorical answers.
The next step of the project was calculating confidence intervals. A confidence interval is a
range of values, computed from a sample, that is likely to contain a population parameter.
It is a powerful tool because we can put a number, or a level of
confidence, on how sure we are that the population parameter is within this range. If we have
a sample, and determine a mean, or a standard deviation, or any other statistic for that
sample, that single number does not tell us how close it is to the value for the total
population. By combining our samples and creating confidence intervals, we can use our
statistics to gather information about the total population.
The 99% confidence interval for the proportion of yellow candies is (.1864, .2256). This means that
we are 99% confident that the proportion of yellow candies in all Skittles is between 18.64% and
22.56%.
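
An interval like this can be sketched with the normal-approximation formula for a proportion, p ± z·sqrt(p(1 − p)/n). The code below is an illustration under stated assumptions, not the exact calculation used for the project. The total number of candies is not given in this section, so it is estimated as 48 bags times the mean of 59.1 candies per bag, and the rounded proportion 0.2064 is used. The result should land close to the reported interval but need not match it to four decimals.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence):
    """Normal-approximation (Wald) interval for a population proportion."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


p_hat_yellow = 0.2064         # class proportion of yellow, as reported
n_candies = round(48 * 59.1)  # ASSUMPTION: total candies = bags x mean per bag

low, high = proportion_ci(p_hat_yellow, n_candies, 0.99)
print(f"({low:.4f}, {high:.4f})")
```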
The 95% confidence interval for the mean number of candies per bag is (57.239, 60.961). This
means that we are 95% confident that the mean number of candies in the total population of
Skittles bags falls within that interval.
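
The interval for the mean follows the pattern mean ± t·s/sqrt(n). The sketch below uses the rounded summary values from earlier in the report (mean 59.1, standard deviation 6.4, 48 bags). The critical value is an approximate value from a t table for 47 degrees of freedom, typed in by hand rather than computed, so treat it as an assumption. Because the inputs are rounded, the endpoints agree with the reported interval only to about two decimals.

```python
from math import sqrt


def mean_ci(mean, sd, n, t_crit):
    """t-based confidence interval for a population mean."""
    margin = t_crit * sd / sqrt(n)
    return mean - margin, mean + margin


# ASSUMPTION: approximate two-sided 95% critical value for df = 47,
# taken from a t table instead of being computed here.
T_CRIT_95_DF47 = 2.012

low, high = mean_ci(mean=59.1, sd=6.4, n=48, t_crit=T_CRIT_95_DF47)
print(f"({low:.3f}, {high:.3f})")
```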
The 98% confidence interval for the population standard deviation of the number of candies per bag
is (5.01, 8.02). This means that we are 98% confident that the population standard deviation for the
number of candies per bag is between about 5 and 8 Skittles.
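
An interval for a standard deviation is usually built from the chi-square distribution: sqrt((n − 1)s²/χ²) evaluated at the upper and lower critical values. The sketch below is an illustration only. To stay within the Python standard library it uses the Wilson-Hilferty approximation for the chi-square critical values, which is my substitution for a table or calculator. With the rounded inputs s = 6.4 and n = 48 it gives roughly (5.2, 8.4), which is somewhat wider than the interval reported above, so the reported interval may have been computed from different inputs or settings.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    c = 2 / (9 * df)
    return df * (1 - c + z * sqrt(c)) ** 3


def sd_ci(sd, n, confidence):
    """Chi-square based interval for a population standard deviation."""
    df = n - 1
    alpha = 1 - confidence
    chi2_upper = chi2_quantile_wh(1 - alpha / 2, df)
    chi2_lower = chi2_quantile_wh(alpha / 2, df)
    # The larger critical value gives the lower endpoint, and vice versa.
    return sqrt(df * sd**2 / chi2_upper), sqrt(df * sd**2 / chi2_lower)


low, high = sd_ci(sd=6.4, n=48, confidence=0.98)
print(f"({low:.2f}, {high:.2f})")
```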
In the statistics term project, I learned how to apply the statistical
concepts as we were learning them in class. I think it is helpful to do this
because in the homework assignments, there is a definite structure to each
problem and you don't always have to think about the big picture of what the
statistics are accomplishing. Statistics has a lot of practical applications in
social sciences and there is not always an obvious correct method of dealing
with the data. By doing a project where we have one large set of data and have
to determine the results of several different methods, we learned a way to
approach a set of data about anything. For example, if we have numerical
values, we can find the mean and standard deviation right away. Graphing data
is a good way to visualize the information and see whether there is an obvious
pattern in the values.
I think that this project changed the way I think about real-world statistics
applications because before this project, I did not think about them very much.
It's easy to see statistics in advertisements, magazine articles, or news and not
think about where they came from. By experimenting with our data, I saw that
you can get different results, or a different perspective on numbers, depending
on what methods you use. Now when I see a statistic or a parameter like "80%
of Americans prefer X," I wish I had access to the entire data set, or at least
had the answers to questions about it. How big is the sample? Are there any
possible biases? Was the data collected over a long period of time or was it all
surveyed at once? I suspect that the statistic might not be what it seems,
and maybe if the data were processed differently, we would find obvious biases.
