Statistics Chapter 2




Chapter 2 – Data Collection

Symbols:
 
Σ = sum
 
Population - All of the items we are interested in. (Can be finite, such as passengers on a plane or infinite, such as all Cokes bottled in an ongoing process.)  Population symbols are commonly Greek letters. (But not always)
 

N = size
µ = mean
σ² = variance
σ = standard deviation

Sample - A subset of the population that we will actually be analyzing. Sample symbols are commonly Roman letters. (But not always)

n = size
x̄  = mean
s² = variance
s = standard deviation
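
As a rough illustration of the two sets of symbols, here is a minimal Python sketch using the standard `statistics` module. The pay figures are made up for the example. Note that the population formulas divide by N while the sample formulas divide by n - 1, which is why the module has separate functions for each.

```python
import statistics

# Hypothetical data: weekly pay (in dollars) for a tiny population of N = 5 employees.
population = [400, 450, 500, 550, 600]

N = len(population)
mu = statistics.mean(population)             # population mean (µ)
sigma_sq = statistics.pvariance(population)  # population variance (σ²), divides by N
sigma = statistics.pstdev(population)        # population standard deviation (σ)

# A sample of n = 3 items taken from that population.
sample = [400, 500, 600]

n = len(sample)
x_bar = statistics.mean(sample)              # sample mean (x̄)
s_sq = statistics.variance(sample)           # sample variance (s²), divides by n - 1
s = statistics.stdev(sample)                 # sample standard deviation (s)

print(N, mu, sigma_sq, sigma)
print(n, x_bar, s_sq, s)
```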

Definitions:

Subject - single member of a data collection we want to study, such as persons,
     firms or regions
 
Variable - a characteristic of the subject, such as an employee's income or an
      invoice amount

Versions of Data:
Univariate - a data set that consists of one variable (Ex. Income)
Bivariate - a data set that consists of two variables (Ex. Income, Age)
Multivariate - a data set that consists of three or more variables (Ex. Income, Age, Gender)
 
Types of Data
Categorical Variables (also called "qualitative")
Verbal (e.g. vehicle type: car, truck, SUV)
A binary variable is limited to only two choices (e.g. gender)
Coded (e.g. vehicle type: 1, 2, 3)
Numerical Variables (also called "quantitative")
Discrete - Measured in whole units (e.g. Broken eggs in a carton; x = 1, 2, 3)
Continuous - Measured in fractional units (e.g. Patient waiting time; x = 14.27 minutes)


Levels of Measurement
Nominal Data - Categories only. (Nominal = "Names"). Ex. Eye Color
Ordinal Data - Data that can be ranked. Ex. "Rarely, Never…" (as choices)
Interval Data - Ranked data in which the distance between units has meaning. Ex. Temperature
Ratio Data - A meaningful zero exists in the scale.  Ex. Accts payable $21.7 million

Likert scale - An ordinal scale ranking the respondents' agreement with or opinion of a certain statement.  Ex. "How would you rate this class?" "Very difficult", "Difficult", "About Standard"...

Time series data - each observation in the data collection represents a different point in time.  Ex. The M1 money supply over 20 quarters.
 
Cross-sectional - each observation represents a different individual unit at the same point in time.  Ex. Traffic fatalities in the 50 states for a given year.
 
Sample - Analysis of only some items selected from a population
 
Census - Analysis of the entire population
 
Random Sample -  Items for a sample are selected by randomization or by chance.
Simple Random Sample - Use of random numbers to select items from a list (e.g. VISA card holders)
Systematic Samples - Selection of every kth item from a list or sequence. (ex. Restaurant customers)
Stratified Samples - Select randomly within defined strata (ex. age, occupation, gender)
Cluster Samples - Like stratified sampling except that the strata used are geographical areas (e.g. ZIP codes)
 
Non-Random Sample - A selection process that involves more input from the researcher.  (ex. The sample is dependent on selection criteria crafted by the researcher, like students wearing red that day or a "typical" employee.)
Judgment Sample - Use expert knowledge to choose "typical" items (e.g. which employees to interview)
Convenience Sample - Use a sample that happens to be available (e.g. ask co-worker opinions at lunch)
Focus Group – In-depth dialogue with a representative panel of individuals. (e.g. iPod users)

Topics
 
2.2 Levels of Measurement
The four levels of measurement (nominal, ordinal, interval and ratio) are distinguished not only by their respective types of data but also by the mathematical operations that can be performed on the data.  This affects how the data collection can be analyzed.  For example, a mean would be impossible to obtain from a data set of students' eye colors, but a mode could be obtained from such a set.

Table 2.3 shows these distinctions in table format.

Beyond the definitions provided above, those distinctions are:

Nominal - Only counting can be done (e.g. frequency tallies, finding the mode)
Ordinal - Counting and order statistics can be done (mode, median, rank tests can be performed)
Interval - Statistics that use sums or differences can be performed (e.g. mean, standard deviation)
Ratio - All Statistical operations can be performed including ratios of numbers (important for functions we'll learn about later, like proportions)
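
To make these distinctions concrete, here is a small Python sketch with one data set per level and a statistic that is meaningful at that level. All of the data values are invented for illustration.

```python
import statistics

# Nominal: only counting makes sense, so we can find a mode but not a mean.
eye_colors = ["brown", "blue", "brown", "green", "brown"]
most_common_color = statistics.mode(eye_colors)

# Ordinal: categories can be ranked, so a median is also available.
scale = ["poor", "OK", "good", "great"]           # lowest to highest
ratings = ["good", "poor", "great", "good", "OK"]
ranks = [scale.index(r) for r in ratings]         # position of each rating on the scale
median_rating = scale[statistics.median_low(ranks)]

# Interval: differences have meaning, so sums and means are available.
temperatures_f = [70.0, 72.0, 74.0]
mean_temp = statistics.mean(temperatures_f)

# Ratio: a meaningful zero exists, so ratios of values are also meaningful.
pay_a, pay_b = 900.00, 450.00
pay_ratio = pay_a / pay_b                         # pay_a is twice pay_b

print(most_common_color, median_rating, mean_temp, pay_ratio)
```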

This all makes a difference in terms of what sort of survey you'll be crafting, based on what sort of data you expect to be collecting.  The closer to ratio your data gets, the more in-depth your analysis can be.
 
Further examples of the types of data in each measurement level can be seen in Figure 2.2.   Some of those examples are:
Nominal - Vehicle type: X = SUV, car, truck
Ordinal - Rate a new song:  X = poor, OK, good, great  (The distance between the units has little to no measurable meaning.)
Interval - What is the temperature?  X = 72.3 F (in this case, the distance between the units has real meaning.)
Ratio - Weekly pay: X = $457.14

Likert Scale:  An ordinal ranking commonly used in surveys.   It asks the respondent to rank a certain product, person, etc. in terms of scaled categories, typically with 5 to 7 response levels.  Ex. How likely are you to take another Econ Statistics class?  Not Likely at All, Somewhat Unlikely, Likely, Very Likely, Definitely.
 
A Likert scale can be Interval ONLY if the distances between responses have meaning.  Otherwise, it's considered an "ordinal" scale.
 
Changing Data by Recoding:
Data can be changed downward on the levels of measurement, but not upward.  For example, a blood pressure of under 130 can be deemed "normal", between 130 and 140 deemed "slightly elevated", and over 140 deemed "high".  This is an example of taking a Ratio measurement and "downgrading" it to an Ordinal scale.
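
The blood pressure recoding can be sketched in a few lines of Python. The function name is made up, and how to treat readings of exactly 130 or 140 is an assumption, since the notes do not say which category the boundaries fall in.

```python
def recode_blood_pressure(reading):
    """Recode a numerical blood pressure reading into an ordinal category."""
    if reading < 130:
        return "normal"
    elif reading <= 140:   # assumption: both 130 and 140 count as "slightly elevated"
        return "slightly elevated"
    else:
        return "high"

readings = [118, 135, 152]
categories = [recode_blood_pressure(r) for r in readings]
print(categories)
```

Going the other way is not possible: knowing only that a reading was "high" does not tell you whether it was 141 or 180.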
 
2.3 Time Series vs. Cross-Sectional
Time Series Data - If each observation in the sample represents an equally spaced point in time, then we are dealing with time series data.  The periodicity is the time between observations (days, weeks, months, quarters, years, etc.).  Macroeconomic examples of time series data would be measures like GDP (national income), the unemployment rate (economic indicators) or the M1 money supply (monetary data).   A microeconomic example would be a firm's sales over several quarters or employee absenteeism over a month.

We are interested in tracking patterns and trends over time.  (e.g. annual growth in consumer debit card use)

Cross-Sectional Data -  If each observation in the sample represents a different individual unit at the same point in time, we have cross-sectional data.  (e.g. VISA balances of new mortgage applicants or GPAs of students in statistics courses).  

We are interested in variation among observations or in relationships. (Do correlations exist? If so, what are they?)

Some variables, like unemployment, can be either time series (monthly data over the past 60 months) or cross-sectional (January unemployment rates of 50 US cities).

Sampling Concepts
It's unrealistic to study every member of every target population.  Many populations are far too large for that to be feasible, so it will often be necessary to select a sample from the target population.

Sample or Census?
Sometimes it is possible (or necessary) to analyze an entire population, even a large one.  Such an analysis is called a "census".  The US conducts one of these on its own population every 10 years. The limitations are many, including the vast expense of training census takers, safeguarding data, and tracking down non-responses or incomplete responses.  A census is also often inaccurate.  (The 1990 census is believed to have missed some 8 million people, leading to a shortchange of as many as 16 House Representatives.)

Target Population - The population in which we are interested. (e.g. likely voters in the Republican nomination)

Sampling Frame - The group from which we derive a sample. If the frame differs from the target population, then our estimates will be of little use.  (ex. Phone directories, voter registration lists.)

Finite or Infinite? - A population is finite if it has a definite size, N, even if the size is unknown.  Ex. The number of cars in a McDonald's parking lot will be a finite number.  A population is considered infinite if it is of arbitrarily large size, such as the number of M&Ms being produced on assembly lines.   Quality control is carried out on a sample of n items because such populations are effectively infinite.

Rule of Thumb:  A population may be treated as infinite when N (population size) is at least 20 times n (sample size).  (In other words, when N/n ≥ 20)
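
The rule of thumb is simple enough to express as a one-line check. This is only a sketch, and the function name is made up for illustration.

```python
def can_treat_as_infinite(N, n):
    """Rule of thumb from the notes: treat the population as infinite when N/n >= 20."""
    return N / n >= 20

print(can_treat_as_infinite(N=10_000, n=100))  # N/n = 100
print(can_treat_as_infinite(N=1_000, n=100))   # N/n = 10
```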

Parameters or Statistics
From a sample of n items, chosen from a population, we can compute statistics that can be used to estimate parameters found in the population.  Different symbols are used to distinguish these.  For example, the mean of a population is symbolized by the Greek letter µ (pronounced "myu"). The mean of a sample is symbolized by x̄ (stated as "x-bar").

Situations in Which a Sample May be Preferred:
Infinite Population - As defined above.
Destructive Testing - When the test will destroy or devalue the item (e.g. crash tests, battery life)
Timely Results -  Sampling is faster, which matters when timely results are needed (e.g. testing jars of peanut butter for contamination)
Accuracy -  Sampling can provide higher accuracy when resources would be spread too thin on a census.
Cost - The cost of sampling is typically much lower.
Sensitive Information - Confidentiality and interviewer training can be increased in a sample when dealing with sensitive information.

Situations in which a Census may be preferred:
Small population - When the population is small, there is little reason to sample. It becomes worth the cost of the population census.
Large Sample Size -  When the required sample size approaches the size of the population, it's worth it to just take the census.
Database Exists -  When the data are on disk, we can examine 100% of cases. Validating data against the physical records may raise the cost.
Legal Requirements - Banking regulations require banks to count ALL cash in tellers' drawers at the end of the business day.  The US Congress outlawed sampling in the 2000 decennial population census.

Sampling Methods
The two main categories of sampling methods are random sampling and non-random sampling.  Table 2.6 defines the typical sampling methods under each of these categories.   I added these definitions in the Definitions section above.  What follows is more detail on the distinctions between methods.
 
Random Sampling

Simple Random Sampling - Each item of a population has an equal chance of being selected for the sample.  This is easiest to do from a numbered list with the help of a random number generator, such as Excel's =RANDBETWEEN(x, y).

With or Without Replacement - When selecting with random numbers, the same number can occur more than once.  If we allow duplicates, we are sampling with replacement.  If we do not, and instead select again whenever a duplicate comes up, we are sampling without replacement, though this can introduce a bias.  (Take care not to switch the two terms.)
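
For those who would rather not use Excel, here is a hedged Python sketch of simple random sampling, both with and without replacement, using the standard `random` module. The list of card holders is a made-up sampling frame. `random.randint(1, 1000)` plays roughly the same role as =RANDBETWEEN(1, 1000).

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable; leave it out for real sampling

# Hypothetical sampling frame: a numbered list of 1,000 card holders.
card_holders = [f"holder_{i}" for i in range(1, 1001)]

# With replacement: each draw is independent, so the same number can come up
# more than once, just as with repeated RANDBETWEEN calls.
positions = [random.randint(1, len(card_holders)) for _ in range(10)]
with_replacement = [card_holders[p - 1] for p in positions]

# Without replacement: random.sample never picks the same item twice.
without_replacement = random.sample(card_holders, k=10)

print(with_replacement)
print(without_replacement)
```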

Systematic Sampling - Selecting every kth item from a list or sequence, starting from a randomly chosen item.
Beneficial to use in an infinite population (such as testing every 5,000th light bulb produced). Also well-suited to linearly organized data, such as pulling every 10th file from a file drawer of alphabetized patient files.
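
A systematic sample might be sketched in Python as follows. The function name and the list of patient files are hypothetical, and picking a random starting point among the first k items is an assumption about how the "random" part is handled.

```python
import random

def systematic_sample(items, k):
    """Take every k-th item, starting from a random position among the first k."""
    start = random.randrange(k)  # random start: 0, 1, ..., k - 1
    return items[start::k]

# Hypothetical file drawer of 100 alphabetized patient files.
patient_files = [f"file_{i:03d}" for i in range(1, 101)]

sample = systematic_sample(patient_files, k=10)
print(sample)
```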

Stratified Sample - Utilizing prior information about the population to sample efficiently.  This method selects simple random samples from pre-existing strata in the population, which allows the sample to keep the population's percentages intact (e.g. for a population of employees that is 55% male, the sample can be kept at the same proportion).
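
Here is one way the 55% male example could look in Python. It is a sketch only: the function name and the employee list are invented, and sizing each stratum's sample in proportion to its share of the population is just one possible allocation rule.

```python
import random

def stratified_sample(population, stratum_of, n):
    """Draw a simple random sample within each stratum, sized by the stratum's share."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ slightly from n.
        k = round(n * len(members) / len(population))
        sample.extend(random.sample(members, k))
    return sample

# Hypothetical population of 100 employees, 55% male.
employees = [("male", i) for i in range(55)] + [("female", i) for i in range(45)]

sample = stratified_sample(employees, stratum_of=lambda e: e[0], n=20)
males = sum(1 for gender, _ in sample if gender == "male")
print(males, len(sample) - males)
```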

Cluster Samples - Simple random sampling from strata that are geographical in nature. Ex. Dividing a city into sub-regions like blocks and selecting a simple random sample from each sub-region.  This is most useful when the population frame and strata characteristics are not readily available, when a stratified sample is too costly to obtain, when the cost of obtaining data increases with distance, or when some loss of reliability is acceptable. (Useful in studies of, say, crime victimization or gasoline prices.)
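
As a rough sketch, a cluster sample could be drawn like this in Python. Note one assumption that goes beyond the notes: this is the two-stage variant, which first picks a few sub-regions at random and then takes a simple random sample inside each chosen one, rather than sampling from every sub-region. The region labels and gas stations are made up.

```python
import random

def two_stage_cluster_sample(clusters, num_clusters, per_cluster):
    """clusters maps a region label to the list of units located in that region."""
    chosen_regions = random.sample(sorted(clusters), num_clusters)  # stage 1: pick regions
    return {
        region: random.sample(clusters[region], per_cluster)        # stage 2: sample within
        for region in chosen_regions
    }

# Hypothetical city divided into four sub-regions, each with five gas stations.
clusters = {
    f"region_{label}": [f"station_{label}{i}" for i in range(1, 6)]
    for label in "ABCD"
}

sample = two_stage_cluster_sample(clusters, num_clusters=2, per_cluster=3)
print(sample)
```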


 
Non-Random Sampling

Judgment Sampling - Relies on the expertise of the sampler to choose the items that will be representative of the population. (Ex. In order to estimate industry cost of R&D in the medical equipment industry, we might ask an industry expert to select several "typical" firms.)  Risk of subconscious bias increases with this method.

Convenience Sampling - Grabbing whatever sample is handy.  A typical example is a newspaper reporter attempting to solicit public opinion on a specific issue and grabbing subjects to interview from a pre-selected area.  The downside is that your sample may not be truly representative of the population.

Focus Group - Small panel of individuals pre-selected for a group discussion about the issue or product. Subjects are prescreened for diverse characteristics while holding compatibility with the study. (ex. Panel of 20 women between the ages of 25 and 40 selected to discuss a women's magazine.)
 
Sample Size - Will depend on the nature of the product or issue studied. The larger the sample, the more accurate the results, but also the more costly.  Ex. The amount of caffeine in a can of Mountain Dew will be fairly consistent from can to can.  The amount of caffeine in a cup of tea, however, will vary because different consumers will steep the tea bag for different lengths of time. In the latter case, a larger sample is needed for better accuracy.
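
The soda versus tea point can be illustrated with a small simulation. This is only a sketch: the caffeine figures are made up, and the measurements are assumed to follow a normal distribution. It repeatedly draws samples and checks how much the sample means bounce around.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def spread_of_sample_means(mean, sd, n, trials=500):
    """Simulate many samples of size n and return how much their means vary."""
    means = []
    for _ in range(trials):
        sample = [random.gauss(mean, sd) for _ in range(n)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

# Made-up caffeine figures (mg): the canned drink barely varies, the tea varies a lot.
soda_n10 = spread_of_sample_means(mean=50, sd=1, n=10)
tea_n10 = spread_of_sample_means(mean=50, sd=15, n=10)
tea_n100 = spread_of_sample_means(mean=50, sd=15, n=100)

print(soda_n10, tea_n10, tea_n100)
```

With the same sample size, the tea means vary far more than the soda means, and a larger tea sample brings that variation back down.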

Sources of Error:
Table 2.10 defines potential survey errors.

Source of Error - Characteristics
Nonresponse bias - Respondents differ from nonrespondents (e.g. cell phone users are often missed in telephone polls)
Selection bias - Self-selected respondents are atypical (e.g. a talk show host invites viewers to answer an online poll about their sex lives. Those who are willing to respond likely differ in substantive ways from those who aren't.)
Response bias - Respondents give false information (hoax respondents, or responses deliberately distorted out of embarrassment, privacy concerns, etc.)
Coverage error - Incorrect specification of frame or population (e.g. including grads of only private universities and missing public universities or students who attended but didn't graduate)
Measurement error - Survey instrument wording is biased or unclear
Interviewer error - Responses influenced by the interviewer. (Facial expressions or tone may lead the subject to respond differently.)
Sampling error - Random and unavoidable


Data Sources
It's important that you know where to find comprehensive data.  Many excellent public sources are available.
 
US general data - Statistical Abstract of the United States
US economic data - Economic Report of the President
Almanacs - World Almanac, Time Almanac
Periodicals - Economist, BusinessWeek, Fortune
Indices - The New York Times, The Wall Street Journal
Databases - Compustat, Citibase, US Census
World data - CIA World Factbook
Web - Google, Yahoo, MSN

Survey Research
Most survey research follows the same basic steps:

  1. State the goals of the research
  2. Develop your budget (time, money, staff)
  3. Create a research design (target population, frame, sample size)
  4. Choose a survey type and method of administration
  5. Design a data collection instrument (e.g. questionnaire)
  6. Pretest the survey instrument as needed
  7. Administer the survey (follow up if needed)
  8. Code the data and analyze it


Survey Types
Each survey type has its benefits and liabilities.  Each is subject to particular types of errors.
 
See Table 2.12 for detail on strengths and weaknesses of each type

Mail - Needs a well-targeted and current mailing list. Response rates are low, which leaves the survey open to nonresponse bias. ZIP code lists are often costly, but can be used for stratified sample grouping by income, education and attitudes. The cover letter should state the goals of the research clearly.
Telephone - Random dialing yields a very low response rate and is poorly targeted. Purchased phone lists will reach the target population, but will still have a low response rate (due to voicemail, work hours, no-call lists). Language barriers will contribute to nonresponse bias.
Interviews - Expensive and time-consuming, but trading a smaller sample size for higher-quality results may be worth it. Training interviewers adds cost, but interviewers can obtain information on sensitive subjects (e.g. birth control practices, gender discrimination in corporations).
Web - Subject to nonresponse bias because web surveys miss those without computers or who don't trust the survey's intent.  Works best when targeted to a well-defined interest group on a question of self-interest (e.g. opinions of CPAs on Sarbanes-Oxley accounting laws).
Direct Observation - Can be done in a controlled setting (e.g. a psychology lab) but requires informed consent, which can change behavior.  Some unobtrusive observation is possible in certain settings (e.g. how many airline passengers carry on more than two bags).


Survey Guidelines

Planning - What is the purpose of the survey? How is your budget best spent?
Design - To ensure the most useful responses, you must invest time and resources into the design of your survey.  Use advice resources (like design books) to avoid costly mistakes.
Quality - Care in preparation is needed. The public now has higher expectations of quality due to the availability of glossy printing.
Pilot Test - Questions that are clear to you may not be clear to others. Pretest the questionnaire on friends, family and coworkers. Using a panel of strangers is better still.
Buy-In - Response rates can be improved by offering a token in exchange for participation, like a coupon or small gift.
Expertise - Consider hiring an outside consultant in the early stages, even if you conduct the bulk of the research on your own.


Questionnaire Design
The American Statistical Association offers booklets and brochures on planning and designing survey research.   Don't use a crowded layout. Begin with a concise statement of purpose, an assurance of anonymity and directions for what to do with the completed questionnaire. Number the questions. Divide the questionnaire into naturally occurring topic subdivisions. Allow respondents to skip portions that don't pertain to them (e.g. "If you responded 'No' to Question 7, then skip to Question 15."). Include an "escape option" (e.g. "don't know" or "doesn't apply"). Use wording and response scales that match the reading ability and knowledge level of the target population.

Question Wording
Much of your response accuracy will depend on how you word the questions you ask.   Consider the following example:

Version 1: Should taxes be cut?
Version 2: Should taxes be cut if it means reducing highway maintenance?
Version 3: Should taxes be cut if it means firing police officers and teachers?

The unconstrained choice (Version 1) will appear to be a "free lunch" (savings without consequences).  Adding in the consequences of cutting taxes will make a difference in the responses you get.
 
Another problem is failing to cover ALL possibilities in your selections (here, there is no option for independents or other parties):
 
What is your party preference?
Democrat
Republican

Overlapping classes are another problem (here, a 21-year-old or a 25-year-old could check two boxes):
What is your age?
17-21
21-25
25-30
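
Overlaps like these are easy to catch mechanically before a questionnaire goes out. Below is a minimal Python sketch. The function name is made up, and it assumes whole-number classes with both endpoints included.

```python
def check_classes(classes):
    """Report overlaps and gaps in a list of (low, high) whole-number classes."""
    problems = []
    ordered = sorted(classes)
    for (low1, high1), (low2, high2) in zip(ordered, ordered[1:]):
        if low2 <= high1:
            problems.append(f"{low1}-{high1} overlaps {low2}-{high2}")
        elif low2 > high1 + 1:
            problems.append(f"gap between {high1} and {low2}")
    return problems

bad_classes = [(17, 21), (21, 25), (25, 30)]   # the age question above
good_classes = [(17, 20), (21, 25), (26, 30)]  # one possible repair

print(check_classes(bad_classes))
print(check_classes(good_classes))
```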

I think you can see how these sorts of poorly worded questions will lead to faulty responses and, more importantly, wasted time and money.

This concludes my review of Chapter 2.  I am skipping the parts that cover particulars about using statistical computer packages such as MINITAB or SPSS.   We don't have access to these types of software, and he said that Excel would be all the software you need for this class.


