Statistics Chapter 2

Chapter 2 – Data Collection

Symbols:

∑ = sum

Population - All of the items we are interested in. (Can be finite, such as passengers on a plane or infinite, such as all Cokes bottled in an ongoing process.) Population symbols are commonly Greek letters. (But not always)

N = size

µ = mean

σ² = variance

σ = standard deviation

Sample - A subset of the population that we will actually be analyzing. Sample symbols are commonly Roman letters. (But not always)

n = size

x̄ = mean

s² = variance

s = standard deviation
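
As a rough illustration of how the two sets of symbols are used, here is a minimal Python sketch. The data values are made up, and it assumes the usual convention (not stated in these notes) that the population variance divides by N while the sample variance divides by n - 1.

```python
import statistics

# Hypothetical data: treat these five values as the entire population.
population = [2, 4, 4, 6, 9]

N = len(population)                         # population size
mu = statistics.mean(population)            # population mean (µ)
sigma2 = statistics.pvariance(population)   # population variance (σ²), divides by N
sigma = statistics.pstdev(population)       # population standard deviation (σ)

# Hypothetical sample: three items taken from the population above.
sample = [2, 4, 9]

n = len(sample)                             # sample size
x_bar = statistics.mean(sample)             # sample mean (x̄)
s2 = statistics.variance(sample)            # sample variance (s²), divides by n - 1
s = statistics.stdev(sample)                # sample standard deviation (s)
```

The only difference between the two halves is which functions are called: the "p" versions are for a whole population, the plain versions are for a sample.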

Definitions:

Subject - single member of a data collection we want to study, such as persons,
     firms or regions

Variable - a characteristic of the subject, such as an employee's income or an
      invoice amount

Versions of Data:
Univariate - a data set that consists of one variable (Ex. Income)

Bivariate - a data set that consists of two variables (Ex. Income, Age)

Multivariate - a data set that consists of three or more variables (Ex. Income, Age, Gender)

Types of Data
Categorical Variables (also called "qualitative")

Verbal (e.g. vehicle type: car, truck, SUV)

A binary variable is limited to only two choices (e.g. gender)

Coded (e.g. vehicle type: 1, 2, 3)

Numerical Variables (also called "quantitative")

Discrete - Measured in whole units (e.g. Broken eggs in a carton; x = 1, 2, 3)

Continuous - Measured in fractional units (e.g. Patient waiting time; x = 14.27 minutes)

Levels of Measurement
Nominal Data - Categories only. (Nominal = "Names"). Ex. Eye Color

Ordinal Data - Data that can be ranked. Ex. "Rarely, Never…" (as choices)

Interval Data - Ranked data in which the distance between units has meaning. Ex. Temperature

Ratio Data - A meaningful zero exists in the scale. Ex. Accts payable $21.7 million

(Likert scale) - An ordinal scale ranking the respondents' agreement with or opinion of a certain statement. Ex. "How would you rate this class?" "Very difficult", "Difficult", "About Standard"...

Time series data - each observation in the data collection represents a different point in time Ex. The M1 money supply over 20 quarters.

Cross-sectional - each observation represents a different individual unit at the same point in time Ex. Traffic fatalities in the 50 states for a given year.

Sample - Analysis of only some items selected from a population

Census - Analysis of the entire population

Random Sample - Items for a sample are selected by randomization or by chance.
Simple Random Sample - Use of random numbers to select items from a list (e.g. VISA card holders)

Systematic Samples - Selection of every kth item from a list or sequence. (ex. Restaurant customers)

Stratified Samples - Select randomly within defined strata (ex. age, occupation, gender)

Cluster Samples - Like stratified sampling except that the strata used are geographical areas (e.g. ZIP codes)

Non-Random Sample - A selection process that involves more input from the researcher. (ex. The sample is dependent on selection criteria crafted by the researcher, like students wearing red that day or a "typical" employee.)
Judgment Sample - Use expert knowledge to choose "typical" items (e.g. which employees to interview)

Convenience Sample - Use a sample that happens to be available (e.g. ask co-worker opinions at lunch)

Focus Group – In-depth dialogue with a representative panel of individuals. (e.g. iPod users)

Topics

2.2 Levels of Measurement
The four levels of measurement (nominal, ordinal, interval and ratio) are distinguished not only by their respective types of data but also by the mathematical operations that can be performed with the data. This affects how the data collection can be analyzed. For example, a mean would be impossible to obtain from a data set of students' eye colors, but a mode could be obtained from such a set.

Table 2.3 shows these distinctions in format.

Beyond the definitions provided above, those distinctions are:

Nominal - Only counting can be done (e.g. frequency tallies, finding the mode)

Ordinal - Counting and order statistics can be done (mode, median, rank tests can be performed)

Interval - Statistics that use sums or differences can be performed (e.g. mean, standard deviation)

Ratio - All Statistical operations can be performed including ratios of numbers (important for functions we'll learn about later, like proportions)

This all makes a difference in terms of what sort of survey you'll be crafting, based on what sort of data you expect to be collecting. The closer to ratio your data gets, the more in-depth your analysis can be.
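
To make the distinctions concrete, here is a small Python sketch of which summary fits which level. The responses are invented for illustration, and the ranking of the song ratings (poor < OK < good < great) is an assumption taken from the example categories in these notes.

```python
import statistics

# Hypothetical survey responses, one list per level of measurement.
eye_colors = ["brown", "blue", "brown", "green", "brown"]   # nominal
song_ratings = ["poor", "OK", "good", "good", "great"]      # ordinal
temperatures_f = [68.0, 72.5, 70.0, 75.5]                   # interval

# Nominal: only counting makes sense, so the mode is the available summary.
modal_eye_color = statistics.mode(eye_colors)

# Ordinal: categories can be ranked, so a median is meaningful.
rank_order = ["poor", "OK", "good", "great"]
ranks = sorted(rank_order.index(r) for r in song_ratings)
median_rating = rank_order[ranks[len(ranks) // 2]]   # middle of an odd-sized list

# Interval: differences are meaningful, so a mean can be computed.
mean_temperature = statistics.mean(temperatures_f)
```

Trying to take a mean of eye_colors would simply fail, which is the point: the level of measurement limits the statistics available.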

Further examples of the types of data in each measurement level can be seen in Figure 2.2.   Some of those examples are:
Nominal - Vehicle type: X = SUV, car, truck

Ordinal - Rate a new song: X = poor, OK, good, great* (Distance between the units has little to no measurable meaning)

Interval - What is the temperature? X = 72.3 F (in this case, the distance between the units has real meaning.)

Ratio - Weekly pay: X = $457.14

Likert Scale: An ordinal ranking commonly used in surveys. It asks the respondent to rate a certain product, person, etc. on a scale of categories, typically with 5-7 choices. Ex. How likely are you to take another Econ Statistics class? Not Likely at All, Somewhat Unlikely, Likely, Very Likely, Definitely.

A Likert scale can be Interval ONLY if the distances between responses have meaning. Otherwise, it's considered an "ordinal" scale.

Changing Data by Recoding:
Data can be changed downward on the levels of measurement, but not upward. For example, a blood pressure of under 130 can be deemed "normal", between 130 and 140 deemed "slightly elevated" and over 140 deemed "high". This is an example of taking a Ratio measurement (blood pressure has a meaningful zero) and "downgrading" it to an Ordinal scale.
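
The blood pressure recoding can be sketched in a few lines of Python. The function name is hypothetical, and treating exactly 130 and exactly 140 as "slightly elevated" is an assumption, since the notes only say "between 130 and 140".

```python
def recode_blood_pressure(reading):
    """Recode a numerical reading into the ordinal categories from the notes."""
    if reading < 130:
        return "normal"
    elif reading <= 140:          # assumption: 130 and 140 are both included here
        return "slightly elevated"
    else:
        return "high"

readings = [118, 135, 152]        # hypothetical readings
categories = [recode_blood_pressure(r) for r in readings]
```

Going the other way is not possible: knowing only that a reading is "high" does not tell you whether it was 141 or 190.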

2.3 Time Series vs. Cross-Sectional
Time Series Data - If each observation in the sample represents an equally spaced point in time, then we are dealing with time series data. The periodicity is the time between observations (days, weeks, months, quarters, years, etc.). Macroeconomic examples of time series data would be measures like GDP (national income), the unemployment rate (economic indicators) or the M1 money supply (monetary data). A microeconomic example would be a firm's sales over several quarters or employee absenteeism over a month.

We are interested in tracking patterns and trends over time. (e.g. annual growth in consumer debit card use)

Cross-Sectional Data - If each observation in the sample represents a different individual unit at the same point in time, we have cross-sectional data. (e.g. VISA balances of new mortgage applicants or GPAs of students in statistics courses).

We are interested in variation among observations or in relationships. (Do correlations exist? If so, what are they?)

Some variables, like unemployment, can be either time series (monthly data over the past 60 months) or cross-sectional (January unemployment rates of 50 US cities).
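
Here is a small Python sketch of the same variable organized both ways. Every number, date and city name is made up for illustration.

```python
# Hypothetical unemployment rates (percent); all numbers are made up.

# Time series: one unit observed at equally spaced points in time.
time_series = {
    "2023-01": 3.9,
    "2023-02": 4.0,
    "2023-03": 4.2,
}

# Cross-sectional: different units observed at the same point in time.
cross_section = {
    "City A": 3.1,
    "City B": 4.8,
    "City C": 5.6,
}

# Time series question: how did the rate change over the period?
periods = sorted(time_series)
change_over_time = time_series[periods[-1]] - time_series[periods[0]]

# Cross-sectional question: how much do the units vary among themselves?
spread_across_cities = max(cross_section.values()) - min(cross_section.values())
```

The keys tell you which kind of data you have: dates for a time series, individual units for a cross-section.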

Sampling Concepts

It's often unrealistic to study every member of a target population. Many populations are far too large for that to be a feasible project, so it will often be important to select a sample from the target population.

Sample or Census?

Sometimes it is possible (or necessary) to conduct an analysis of an entire population. Such an analysis is called a "census". The US conducts one of these on its own population every 10 years. The limitations are many, including the vast expense of training census takers, safeguarding data, and tracking down non-responses or incomplete responses. A census is also often inaccurate. (The 1990 census is believed to have missed some 8 million people, leading to a shortchange of as many as 16 House Representatives.)

Target Population - The population in which we are interested. (e.g. likely voters in the Republican nomination)

Sampling Frame - The group from which we derive a sample. If the frame differs from the target population, then our estimates will be of little use. (ex. Phone directories, voter registration lists.)

Finite or Infinite? - A population is finite if it has a definite size, N, even if the size is unknown. Ex. The number of cars in a McDonald's parking lot will be a finite number. A population is considered infinite if it is of arbitrarily large size, such as the number of M&Ms being produced on assembly lines. Quality control is done on a sample of n items because such populations are effectively infinite.

Rule of Thumb: A population may be treated as infinite when N (population size) is at least 20 times n (sample size). (In other words, when N/n ≥ 20)
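
The rule of thumb is simple enough to write as a one-line check. This is only a sketch; the function name is hypothetical.

```python
def can_treat_as_infinite(N, n):
    """Return True when N/n >= 20, the rule of thumb from the notes."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return N / n >= 20

# Hypothetical cases:
large_population = can_treat_as_infinite(N=10_000, n=100)   # 10,000/100 = 100
small_population = can_treat_as_infinite(N=500, n=50)       # 500/50 = 10
```

In the first case the population is 100 times the sample, so it may be treated as infinite. In the second it is only 10 times the sample, so it may not.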

Parameters or Statistics

From a sample of n items, chosen from a population, we can compute statistics that can be used to estimate parameters found in the population. Different symbols are used to distinguish these. For example, the mean of a population is symbolized by the Greek letter µ (pronounced "myu"). The mean of a sample is symbolized by x̄ (stated as "x-bar").

Situations in Which a Sample May be Preferred:

Infinite Population - As defined above.

Destructive Testing - When the test will destroy or devalue the item (e.g. crash tests, battery life)

Timely Results - Sampling is faster when timely results are needed (e.g. testing jars of peanut butter for contamination)

Accuracy - Sampling can provide higher accuracy when resources would be spread too thin on a census.

Cost - The cost of sampling is typically much lower.

Sensitive Information - Confidentiality and interviewer training can be increased in a sample when dealing with sensitive information.

Situations in which a Census may be preferred:

Small Population - When the population is small, there is little reason to sample. A census of the whole population becomes worth the cost.

Large Sample Size - When the required sample size approaches the size of the population, it's worth it to just take the census.

Database Exists - When the data are on disk, we can examine 100% of cases. Validating data against the physical records may raise the cost.

Legal Requirements - Banking regulations require banks to count ALL cash in tellers' drawers at the end of the business day. The US Congress outlawed sampling in the 2000 decennial population census.

Sampling Methods
The two main categories of sampling methods are random sampling and non-random sampling. Table 2.6 defines the typical sampling methods under each of these categories.   I added these definitions in the Definitions section above. What follows is more detail on the distinctions between methods.

Random Sampling

Simple Random Sampling - Each item of a population has an equal chance of being selected for the sample. It is easiest to select from a list with the help of a random number generator, such as the one found in Excel: =RANDBETWEEN(x, y)

With or Without Replacement - When drawing random numbers this way, the same number can occur more than once. If we allow duplicates, we are sampling with replacement. If we discard duplicates and select again, we are sampling without replacement, which can introduce a bias unless the population is large relative to the sample (see the N/n ≥ 20 rule of thumb above). (Make sure not to switch the two terms: "with replacement" means a selected item goes back into the pool and can be picked again.)
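
The notes use Excel for this, but the same idea can be sketched in Python with the standard random module. The list of 20 numbered card holders and the sample size of 5 are made up for the example.

```python
import random

rng = random.Random(42)          # fixed seed so the example is repeatable

# Hypothetical numbered list of 20 card holders (IDs 1 through 20).
card_holders = list(range(1, 21))
n = 5

# With replacement: each draw is independent, so an ID may repeat.
# rng.randint(1, 20) includes both endpoints, much like =RANDBETWEEN(1, 20).
with_replacement = [rng.randint(1, 20) for _ in range(n)]

# Without replacement: sample() never returns the same item twice.
without_replacement = rng.sample(card_holders, n)
```

The first list may or may not contain a repeated ID on any given run, while the second list never will.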

Systematic Sampling - Selecting every kth item from a list, starting from a randomly chosen point among the first k items.

Beneficial to use in an infinite population (such as testing every 5,000th light bulb produced). Also well-suited to linearly organized data, such as pulling every 10th file from a file drawer of alphabetized patient files.
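
A minimal Python sketch of systematic sampling follows. It assumes the common approach of picking a random starting point among the first k items and then taking every kth item after it. The function name and the list of 100 files are hypothetical.

```python
import random

def systematic_sample(items, k, rng=None):
    """Take every kth item, starting at a random position among the first k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    rng = rng or random.Random()
    start = rng.randrange(k)      # random start: 0, 1, ..., k - 1
    return items[start::k]

# Hypothetical drawer of 100 alphabetized patient files.
files = [f"file_{i:03d}" for i in range(1, 101)]
sample = systematic_sample(files, k=10, rng=random.Random(1))
```

With 100 files and k = 10, the sample always has 10 files spaced exactly 10 apart. Only the starting file changes from run to run.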

Stratified Sample - Utilizing prior information about the population to sample more efficiently. This method selects simple random samples from pre-existing strata in the population. It allows the sample to keep the population's percentages intact (e.g. in a population of employees that is 55% male, the sample can be kept at the same proportion.)
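
Here is a rough Python sketch of a stratified sample that keeps the population's proportions. The function name, the 200 employees and the sample size of 20 are all made up, and rounding each stratum's share to a whole number is a simplifying assumption.

```python
import random

def stratified_sample(strata, n, rng=None):
    """Draw a simple random sample from each stratum, sized in proportion
    to that stratum's share of the population."""
    rng = rng or random.Random()
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        size = round(n * len(members) / total)   # proportional allocation
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population of 200 employees, 55% male.
employees = {
    "male": [f"M{i}" for i in range(110)],
    "female": [f"F{i}" for i in range(90)],
}
chosen = stratified_sample(employees, n=20, rng=random.Random(7))
```

Here the sample of 20 contains 11 men and 9 women, the same 55/45 split as the population. With less tidy numbers the rounding could make the total differ slightly from n.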

Cluster Samples - Simple random sampling from strata that are geographical in nature. Ex. Dividing a city into sub-regions like blocks and selecting a simple random sample from each sub-region. This is most useful when the population frame and strata characteristics are not readily available, when it is too costly to obtain a stratified sample, when the cost of obtaining data increases with distance, or when some loss of reliability is acceptable. (Useful in studies of, say, crime victimization or gasoline prices.)
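
Below is a sketch of one common variant, two-stage cluster sampling. Note the assumption that goes beyond these notes: only some sub-regions are randomly chosen first, and a simple random sample is then drawn within each chosen one. The function name, block names and sizes are hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, num_clusters, per_cluster, rng=None):
    """Stage 1: randomly pick some clusters. Stage 2: sample within each."""
    rng = rng or random.Random()
    chosen_names = rng.sample(sorted(clusters), num_clusters)   # stage 1
    return {
        name: rng.sample(clusters[name], min(per_cluster, len(clusters[name])))
        for name in chosen_names                                # stage 2
    }

# Hypothetical city with 6 blocks of 10 households each.
city_blocks = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 11)]
    for b in range(1, 7)
}
sample = two_stage_cluster_sample(
    city_blocks, num_clusters=3, per_cluster=4, rng=random.Random(3)
)
```

Only 3 of the 6 blocks have to be visited, which is where the cost savings come from when travel between areas is expensive.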

Non-Random Sampling

Judgment Sampling - Relies on the expertise of the sampler to choose the items that will be representative of the population. (Ex. In order to estimate industry cost of R&D in the medical equipment industry, we might ask an industry expert to select several "typical" firms.) Risk of subconscious bias increases with this method.

Convenience Sampling - Grabbing whatever sample is handy. A typical example is a newspaper reporter soliciting public opinion on a specific issue by grabbing subjects to interview from a pre-selected area. The downside is that your sample may not be truly representative of the population.

Focus Group: Small panel of individuals pre-selected for a group discussion about the issue or product. Subjects are prescreened for diverse characteristics while holding compatibility with the study. (ex. Panel of 20 women between the ages of 25 and 40 selected to discuss a women's magazine.)

Sample Size - Will depend on the nature of the product or issue studied. The larger the sample, the more accurate the results, but also the more costly. Ex. The amount of caffeine in a can of Mountain Dew will be fairly consistent from can to can. The amount of caffeine in a cup of tea, however, will vary because different consumers will steep the tea bag for different lengths of time. In the latter case, a larger sample is needed for better accuracy.

Sources of Error:
Table 2.10 defines potential survey errors.

Source of Error - Characteristics

Nonresponse bias - Respondents differ from nonrespondents (e.g. cell phone users are often missed in telephone polls)

Selection bias - Self-selected respondents are atypical (e.g. a talk show host invites viewers to answer an online poll about their sex lives. Those who are willing to respond likely differ in substantive ways from those who aren't.)

Response bias - Respondents give false information (hoax respondents, or respondents deliberately distorting answers out of embarrassment, privacy or similar concerns)

Coverage error - Incorrect specification of frame or population (e.g. including grads of only private universities and missing public universities or students who attended but didn't graduate)

Measurement error - Survey instrument wording is biased or unclear

Interviewer error - Responses influenced by the interviewer. (Facial expressions or tone may lead subject to respond differently.)

Sampling error - Random and unavoidable

Data Sources
It's important that you know where to find comprehensive data. Many excellent public sources are available.

US general data - Statistical Abstract of the United States

US economic data - Economic Report of the President

Almanacs - World Almanac, Time Almanac

Periodicals - Economist, BusinessWeek, Fortune

Indices - The New York Times, The Wall Street Journal

Databases - Compustat, Citibase, US Census

World data - CIA World Factbook

Web - Google, Yahoo, MSN

Survey Research
Most survey research follows the same basic steps:

State the goals of the research

Develop your budget (time, money, staff)

Create a research design (target population, frame, sample size)

Choose a survey type and method of administration

Design a data collection instrument (e.g. questionnaire)

Pretest the survey instrument as needed

Administer the survey (follow up if needed)

Code the data and analyze it

Survey Types
Each survey type has its benefits and liabilities. Each is subject to particular types of errors.

See Table 2.12 for detail on strengths and weaknesses of each type

Mail - Need a well-targeted and current mailing list.
Low response rates and open to nonresponse bias.
ZIP code lists are often costly, but can be used for stratified sample grouping by income, education and attitudes.
The cover letter should state the goals of the research clearly.

Telephone - Random dialing yields a very low response rate and is poorly targeted.
Purchased phone lists will reach the target population, but will still have a low response rate (due to voicemail, work hours, no-call lists).
Language barriers will contribute to nonresponse bias.

Interviews - Expensive and time consuming, but trading a smaller sample size for higher-quality results may be worth it.
Training interviewers adds cost, but interviewers can obtain information on sensitive subjects (e.g. birth control practices, gender discrimination in corporations).

Web - Subject to nonresponse bias because web surveys miss those without computers or who don't trust the surveyor's intentions. Works best when targeted to a well-defined interest group on a question of self-interest (e.g. opinions of CPAs on Sarbanes-Oxley accounting laws).

Direct Observation - Can be done in a controlled setting (e.g. a psychology lab) but requires informed consent, which can change behavior. Some unobtrusive observation is possible in certain settings (e.g. how many airline passengers carry on more than two bags).

Survey Guidelines

Planning - What is the purpose of the survey? How is your budget best spent?

Design - To ensure the most useful responses, you must invest time and resources into the design of your survey. Use advice resources (like design books) to avoid costly mistakes.

Quality - Care in preparation is needed. The availability of glossy printing has raised the public's expectations of quality.

Pilot Test - Questions that are clear to you may not be clear to others. Pretest the questionnaire on friends, family and coworkers. Using a panel of strangers is even better.

Buy-In - Response rates can be improved by offering a token in exchange for participation, like a coupon or small gift.

Expertise - Consider hiring an outside consultant in the early stages even if you conduct the bulk of the research on your own.

Questionnaire Design
The American Statistical Association offers booklets and brochures on planning and designing survey research. Don't use a crowded layout. Begin with a concise statement of purpose, an assurance of anonymity and directions for what to do with the completed questionnaire. Number the questions. Divide the questionnaire into naturally occurring topic subdivisions. Allow respondents to skip portions that don't pertain to them. (e.g. "If you responded "No" to Question 7, then skip to Question 15.") Include an "escape option" (e.g. "don't know" or "doesn't apply"). Use wording and response scales that match the reading ability and knowledge level of the target population.

Question Wording
Much of your response accuracy will depend on how you word the questions you ask.   Consider the following example:

Version 1: Should taxes be cut?

Version 2: Should taxes be cut if it means reducing highway maintenance?

Version 3: Should taxes be cut if it means firing police officers and teachers?

The unconstrained choice (Version 1) will appear to be a "free lunch" (savings without consequences). Adding in the consequences of cutting taxes (Versions 2 and 3) will make a difference in the responses you get.

Another problem is failing to cover ALL possibilities in your selections. (The choices below leave nothing for independents or members of other parties.)

What is your party preference?
Democrat

Republican

Overlapping classes are another problem (a 21-year-old fits two of the choices below):
What is your age?

17-21

21-25

25-30

I think you can see how these sorts of poorly worded questions will lead to faulty responses and, more importantly, wasted time and money.
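
The overlapping age classes above are the kind of mistake that is easy to catch with a quick check before the survey goes out. Here is a minimal Python sketch. The function name is hypothetical, and it assumes each class is a pair of whole-number ages with both ends included.

```python
def find_bracket_problems(brackets):
    """Report overlaps and gaps between consecutive (low, high) age classes."""
    problems = []
    ordered = sorted(brackets)
    for (low1, high1), (low2, high2) in zip(ordered, ordered[1:]):
        if low2 <= high1:
            problems.append(f"overlap: {low1}-{high1} and {low2}-{high2}")
        elif low2 > high1 + 1:
            problems.append(f"gap: {low1}-{high1} and {low2}-{high2}")
    return problems

bad_classes = [(17, 21), (21, 25), (25, 30)]     # the classes from the example
fixed_classes = [(17, 20), (21, 25), (26, 30)]   # one possible repair

bad_report = find_bracket_problems(bad_classes)
fixed_report = find_bracket_problems(fixed_classes)
```

The check flags both overlaps in the original classes and nothing in the repaired ones. It does not check whether the classes cover everyone (for instance, respondents over 30), which still has to be reviewed by hand.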

This concludes my review of Chapter 2. I am skipping the parts that cover particulars about using statistical computer packages such as MINITAB or SPSS. We don't have access to these types of software, and he said that Excel would be all the software you needed for this class.

