Chapter 3 – Part 1



TOPICS
 
3.1 Visual Description

Visual representations of statistical data can provide insight into the data without using mathematics.

Table 3.1 shows the Characteristics of Univariate Data.

Characteristic | Interpretation
Measurement | What are the units of measurement? Continuous or integer (discrete)? Missing observations? Accuracy concerns?
Central Tendency | Where are the data values concentrated?
Dispersion | How are the data spread out? Are there outliers?
Shape | Is the distribution symmetrical? Skewed? Bimodal?

 
Measurement
Before drawing any graphic representation, it's best to look at the data and consider how it was collected.  In the example of S&P P/E ratio data (Table 3.2), the data was obtained from publicly accessible records, so accuracy is not in question. The data as reported in this table is cross-sectional (many units observed at the same point in time).  It's already called a "ratio", and the data confirms it: a true zero is possible and meaningful, which makes it ratio-level data.
 
Sorting
It is always helpful to sort the data numerically before working with it.  With a small sample, sorting makes it easy to spot the mode (a measure of central tendency), the range (a measure of dispersion), and even the shape (whether there are outliers, and in which direction).
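As a rough illustration of what sorting reveals, here is a minimal Python sketch. The sample values are hypothetical (they are not the Table 3.2 data), and the helper name quick_summary is made up for this example.

```python
from collections import Counter

# Hypothetical small sample, NOT the Table 3.2 values.
sample = [14, 9, 22, 14, 31, 11, 14, 18, 9, 16]

def quick_summary(values):
    """Sort the values, then read off the extremes, the range, and the mode(s)."""
    ordered = sorted(values)
    counts = Counter(ordered)
    top = max(counts.values())
    # Every value that occurs the maximum number of times is a mode.
    modes = [value for value, count in counts.items() if count == top]
    return {
        "sorted": ordered,
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "modes": modes,
    }

summary = quick_summary(sample)
print(summary["sorted"])
print("range:", summary["range"], "modes:", summary["modes"])
```

In the sorted list, the largest value sits well away from the rest, which is the kind of possible high-side outlier that sorting makes easy to see.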
 


3.2 Dot Plots

A dot plot is a simple graphic display of n individual values of numerical data.   Basic steps to creating a dot plot are:

  1. Sort the data values to obtain the range;
  2. Mark the X axis (a straight horizontal line) with value demarcations (e.g. vertical hash marks) and label them (e.g. in multiples of 5);
  3. Plot each data value as a dot over the corresponding demarcation;
  4. If more than one value is to be placed at a demarcation line, they should be stacked vertically to show the increased frequency.

See Figure 3.1 for an example of a Dot Plot of the P/E Ratio values in Table 3.2.
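The four steps above can be sketched as a small text-only dot plot in Python. This is an illustration under assumptions: the data values are hypothetical integers (not the Table 3.2 values), the function name dot_plot is invented here, and every integer between the minimum and maximum gets its own column.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column per integer from min to max,
    with repeated values stacked vertically."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    # Build the plot from the top row down so stacks grow upward.
    for level in range(tallest, 0, -1):
        row = "".join("  *" if counts[x] >= level else "   "
                      for x in range(lo, hi + 1))
        rows.append(row.rstrip())
    # Axis labels, one three-character column per value.
    rows.append("".join(f"{x:3d}" for x in range(lo, hi + 1)))
    return "\n".join(rows)

# Hypothetical data for illustration only.
plot = dot_plot([8, 10, 10, 11, 13, 13, 13, 15])
print(plot)
```

The value 13 occurs three times in this sample, so its column is three dots tall, which is the stacking described in step 4.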
 
3.3 Frequency Distributions and Histograms
A frequency distribution is a table formed by classifying n data values into k classes called bins. (Terminology is adopted from Excel.)  The bin limits define the values to be included in each bin, and they cannot overlap.  Frequencies can be expressed as absolute frequencies (the raw number of data values in each bin) or as relative frequencies (each bin's count as a fraction or percentage of all values in the data collection).

Constructing a Frequency Distribution:
 
  1. Sort the data and identify the smallest and largest data values in the data collection. In the example of the P/E Ratio values, we find these are xmin = 8 and xmax = 29. (Where "x" represents the individual data value.)


  2. Choose the number of bins:  The number of bins you choose is important for the legibility of your graph.  Use too many and you'll wind up with empty bins, giving a multimodal appearance (see Shape below) that isn't meaningful. Use too few and you won't be able to tell what the bunched-up data actually means. Fortunately, there is a rule to help you determine how many bins to use based on the size of the data collection.  


Sturges' Rule: k = 1 + 3.3 log(n), where log is the base-10 logarithm

That scary looking formula is a good opportunity for you to test out your ability to handle a scientific calculator. Get it out and follow along.

In the case of the P/E ratio data in Table 3.2, n = 57.  So, enter the calculation like this:  1 + 3.3 log (57).  (For those unfamiliar, "log" should be a key on your calculator. In the formula display box, it will be followed by an open parenthesis, after which you should insert the value of n, which is 57 in this example.)  

Did you get 6.79?  Then you input it right and you're likely ready for the calculations we'll have to do for the Chapter 4 part of the test.  It means that for a data collection of 57 observations, we should use either 6 or 7 bins. Table 3.4 also lays this out for random data set sizes.


  3. Set Bin Limits: Setting bin limits also requires some judgment. Bin width can be estimated by dividing the data range by the number of bins:  Bin Width = (xmax - xmin)/k.  Round the bin width up to an appropriate value. (If dealing with discrete data, round it up to a whole number. If dealing with continuous data, round it up to a convenient fractional value.) Then set the lower limit of the first bin as a multiple of the bin width. (e.g. If we were dealing with discrete data and the bin width were 3.5, you would i) round the width up to 4 and ii) set the lower limit as a multiple of 4.)


  4. Put Data Values in Appropriate Bins: In general, the lower limit is included in the bin while the upper limit is excluded. Excel's histogram option does the opposite: it includes the upper limit and excludes the lower limit.  (This applies when using continuous data, because the values can be fractional.)  There are benefits to each method, but what is important is that the bins do not overlap. Each data value must be counted in only one bin.


  5. Create Table:  You can choose to show only absolute frequencies (raw number of observations) or show relative frequencies as well (fraction of the data collection).   You can also include cumulative frequencies (running total of absolute frequencies) and cumulative relative frequencies (running total of relative frequencies).   See Table 3.5 for an example (using the P/E ratio data from Table 3.2):


Bin Limits | Frequency | Relative Frequency | Cumulative Frequency | Cumulative Relative Frequency
8 < 12 | 10 | .1754 | 10 | .1754
12 < 16 | 14 | .2456 | 24 | .4210
16 < 20 | 17 | .2982 | 41 | .7193
20 < 24 | 8 | .1404 | 49 | .8596
24 < 28 | 6 | .1053 | 55 | .9649
28 < 32 | 2 | .0351 | 57 | 1.0000
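Steps 2 and 3 (choosing the number of bins and setting the bin limits) can be checked with a short Python sketch. It uses only the figures quoted in these notes (n = 57, xmin = 8, xmax = 29). The function names are invented for this example, and rounding the width up to a whole number assumes discrete data.

```python
import math

def sturges_bins(n):
    """Sturges' Rule as written in these notes: k = 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

def bin_width(x_min, x_max, k):
    """Raw width (x_max - x_min) / k, rounded up to a whole number
    (the discrete-data case)."""
    return math.ceil((x_max - x_min) / k)

def bin_limits(x_min, x_max, width):
    """Bin limits starting at the largest multiple of the width that is
    <= x_min, extended until the last limit exceeds x_max."""
    limits = [(x_min // width) * width]
    while limits[-1] <= x_max:
        limits.append(limits[-1] + width)
    return limits

k_raw = sturges_bins(57)            # about 6.79, so use 6 or 7 bins
width = bin_width(8, 29, 6)         # 21 / 6 = 3.5, rounded up to 4
limits = bin_limits(8, 29, width)   # 8, 12, 16, 20, 24, 28, 32

print(round(k_raw, 2), width, limits)
```

With 6 bins the sketch reproduces the bin limits used in Table 3.5. Other choices of k or of the rounding would give a different but equally valid set of limits.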


Cumulative and Relative Frequencies

Look at the 12 < 16 row in the table above.   There are 14 values between 12 and 15.99 out of the whole set of 57 (the absolute frequency), which is a relative frequency of .2456.  In other words, 14 is approximately 24.56% of 57.

The cumulative columns keep a running total, so the count for this bin is added to the count for the bin below it (from 8 to 11.99).  If there are 10 values between 8 and 11.99 and 14 values from 12 to 15.99, then there are 24 values less than 16. This running total, the raw number of values less than 16 out of the whole set of 57, is called the "cumulative frequency".  Its percentage is also taken out of the whole set of 57: 24 is approximately 42.1% of 57, so the cumulative relative frequency is about .421 (shown as .4210 in the table).
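The four frequency columns can be rebuilt from the absolute frequencies alone. The Python sketch below takes the bin limits and counts from Table 3.5; the variable names are just for illustration.

```python
from itertools import accumulate

lower_limits = [8, 12, 16, 20, 24, 28]
counts = [10, 14, 17, 8, 6, 2]      # absolute frequencies from Table 3.5
bin_size = 4

n = sum(counts)                                     # 57 observations in total
relative = [c / n for c in counts]                  # share of all values in each bin
cumulative = list(accumulate(counts))               # running total of counts
cumulative_relative = [c / n for c in cumulative]   # running total as a share of n

for lo, c, r, cf, crf in zip(lower_limits, counts, relative,
                             cumulative, cumulative_relative):
    print(f"{lo:2d} < {lo + bin_size:2d}  {c:3d}  {r:.4f}  {cf:3d}  {crf:.4f}")
```

Dividing the running total by n gives .4211 for the second bin when rounded to four places, where the table shows .4210. That is a small rounding difference, not a different method.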


Histograms
A histogram is a graphic representation of a frequency distribution. It's a bar chart whose Y-axis (vertical border) shows the number of data values (or percentage of them) within each bin of a frequency distribution.  The X-axis (horizontal border) contains ticks to show the end points of each bin. There should be no gaps between bars of a histogram.

Figure 3.5 shows that a histogram of absolute frequencies has the same shape as a histogram of relative frequencies. The only difference is the scale of the Y-axis.

A change in the number of bins, however, will alter the appearance of the histogram.  A large number of bins will result in many thin bars that won't describe much. A small number of bins will result in a few bars that will crowd the data together. See Figure 3.6 for an example comparing the same data shown in a different number of bins.
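To see how the bin count changes the picture, here is a hedged Python sketch that counts the same values into 2 bins and then 6 bins. The data are hypothetical (not the Table 3.2 values), and the function bin_counts and its include_upper flag are invented for this example. The flag switches between the two limit conventions described in step 4.

```python
def bin_counts(values, start, width, k, include_upper=False):
    """Count values into k equal-width bins beginning at start.

    include_upper=False: lower <= x < upper (lower limit included).
    include_upper=True:  lower < x <= upper (upper limit included, the
    convention these notes attribute to Excel's histogram option).
    """
    counts = [0] * k
    for x in values:
        for i in range(k):
            lower = start + i * width
            upper = lower + width
            if include_upper:
                inside = lower < x <= upper
            else:
                inside = lower <= x < upper
            if inside:
                counts[i] += 1
                break   # each value is counted in at most one bin
    return counts

# Hypothetical data for illustration only.
data = [8, 9, 11, 12, 12, 14, 15, 16, 19, 21, 23, 27]

few = bin_counts(data, start=8, width=12, k=2)
many = bin_counts(data, start=8, width=4, k=6)

for label, counts in (("2 bins", few), ("6 bins", many)):
    print(label)
    for c in counts:
        print("  " + "#" * c)
```

With 2 bins nearly everything lands in the first bar, while 6 bins show where the values actually concentrate. A value sitting exactly on a limit, such as 12, moves to the neighbouring bin when include_upper is switched on.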

Making an Excel Histogram
[I'm skipping this for now only because I see no way he can possibly test this in a multiple choice fashion.  I'll revisit this tomorrow if I have time left over.]

Shape
A histogram suggests the shape of the population we are sampling, but unless the sample is large, we have to be cautious about making inferences. The following terminology is helpful in understanding the implications of shape.

Modal class: a histogram bar that is higher than those on either side.  A unimodal histogram has only one such taller bar. A bimodal histogram has two such taller bars. And one with more than two taller bars is a multimodal histogram.
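The definition of a modal class can be applied mechanically to a list of bin counts. The Python sketch below is one simple reading of it, with two assumptions of mine: a bar at either end is compared against an imaginary neighbour of height 0, and flat plateaus (equal adjacent bars) are not counted as peaks. The function names are invented for this example.

```python
def modal_classes(counts):
    """Return the positions of bars strictly higher than the bars on
    either side. Missing neighbours at the ends are treated as height 0
    (an assumption made for this sketch)."""
    padded = [0] + list(counts) + [0]
    return [i - 1 for i in range(1, len(padded) - 1)
            if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]]

def describe(counts):
    """Label a histogram by its number of modal classes."""
    peaks = len(modal_classes(counts))
    if peaks == 1:
        return "unimodal"
    if peaks == 2:
        return "bimodal"
    return "multimodal" if peaks > 2 else "no clear mode"

table_3_5 = [10, 14, 17, 8, 6, 2]    # absolute frequencies from Table 3.5
two_humps = [3, 9, 4, 2, 8, 3]       # hypothetical counts

print(modal_classes(table_3_5), describe(table_3_5))
print(modal_classes(two_humps), describe(two_humps))
```

By this reading the Table 3.5 counts have a single modal class (the 16 < 20 bin), so that histogram would be called unimodal.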

Skew:  A histogram's skew is indicated by the direction of its longer tail. If neither tail is longer, the histogram is symmetric. A right-skewed (also called positively skewed) histogram has a longer right tail, which means that most of the data values are clustered on the left side of the graph.  A left-skewed (or negatively skewed) histogram has a longer left tail, which means that most of the data values are clustered on the right side of the graph.

Figure 3.8 (p. 71) shows several histogram templates depicting skew and modal classes.

An outlier is an extreme value that is far enough from the majority of the data that it probably arose from a different cause or is due to measurement error.

Frequency Polygon and Ogive:
 
A frequency polygon is a line graph that connects the midpoints of the histogram intervals, plus extra intervals at the beginning and end so that the line will touch the X-axis. It serves the same purpose as a histogram, but is attractive when you need to compare two data sets.

An ogive (pronounced "oh-jive") is a line graph of the cumulative frequencies. It is useful for finding percentiles or for comparing the shape of the sample with a known benchmark such as the normal distribution.
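The points that each line graph connects can be worked out from the bin limits and counts in Table 3.5. This Python sketch only computes the coordinates and does not draw anything. Plotting each cumulative frequency at the upper limit of its bin is a common convention that I am assuming here, since the notes do not say where the ogive points sit.

```python
from itertools import accumulate

limits = [8, 12, 16, 20, 24, 28, 32]   # bin limits from Table 3.5
counts = [10, 14, 17, 8, 6, 2]         # absolute frequencies from Table 3.5
width = limits[1] - limits[0]

# Frequency polygon: one point at the midpoint of each bin, plus an extra
# zero-height point one bin width beyond each end so the line touches the X-axis.
midpoints = [(lo + hi) / 2 for lo, hi in zip(limits, limits[1:])]
polygon = ([(midpoints[0] - width, 0)]
           + list(zip(midpoints, counts))
           + [(midpoints[-1] + width, 0)])

# Ogive: running total of the counts, starting from 0 at the first lower
# limit and plotted at each bin's upper limit (assumed convention).
ogive = list(zip(limits, [0] + list(accumulate(counts))))

print(polygon)
print(ogive)
```

Reading the ogive points, 24 of the 57 values lie below 16, which matches the cumulative frequency column of Table 3.5.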


