Sampling Issues

Quantitative Methods

LOS K requires us to:

describe the issues regarding the selection of the appropriate sample size, data mining bias, sample selection bias, survivorship bias, look-ahead bias, and time period bias.

1. Sample Size Selection

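a. As discussed, the confidence interval for the population mean is calculated using the following formula:

$$\bar{X} \pm z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}$$

(with $t_{\alpha/2}$ and $s$ in place of $z_{\alpha/2}$ and $\sigma$ when the population standard deviation is unknown and is estimated by s). Its width is affected by:

i. the reliability factor, z or t,

ii. the significance level, α, and

iii. the standard error, i.e. $\sigma/\sqrt{n}$ or $s/\sqrt{n}$.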

The larger the standard error, the wider the confidence interval. Because a larger n reduces the standard error, a larger sample narrows the confidence interval.

b. Selecting a sample size therefore involves a trade-off between statistical and economic factors:

i. A larger sample may increase precision, both because z-statistics can be used rather than t-statistics and because the estimate of the standard error is smaller.

ii. A larger sample may also cost more to collect than the added precision is worth.
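The effect of n on precision can be illustrated with a minimal sketch. It assumes a known population standard deviation (so the z-based interval applies) and a 95% confidence level; the function name `ci_half_width` and the value σ = 20 are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def ci_half_width(sigma, n, confidence=0.95):
    """Half-width of a z-based confidence interval for the mean."""
    alpha = 1.0 - confidence
    # Reliability factor z_{alpha/2} from the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Reliability factor times the standard error sigma / sqrt(n).
    return z * sigma / math.sqrt(n)

# Assumed sigma = 20; each step quadruples the sample size.
for n in (25, 100, 400):
    print(n, round(ci_half_width(sigma=20.0, n=n), 2))
# 25 7.84
# 100 3.92
# 400 1.96
```

Because the standard error falls with the square root of n, quadrupling the sample only halves the interval width, which is why the extra cost of data eventually outweighs the gain in precision.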

2. Data Mining Bias

a. Data mining is the practice of searching the same data set over and over again until a statistically significant pattern turns up.

Given enough attempts, random chance alone may produce a "significant" relationship that does not actually exist in any other data set.

b. This data mining is typically not motivated by a theory or hypothesis. Significant results can also be obtained through data narrowing, i.e. dropping the outlier cases, or "torturing the data until it confesses".

c. To verify the relationship and/or discover data-mining biases, we can conduct out-of-sample tests, i.e. check whether the relationship still holds in data that was not used to find it.
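The following sketch simulates this, under the assumption that both the target series and every candidate predictor are pure random noise, so no real relationship exists. The function names, the 200 candidates, and the 60-observation split are hypothetical choices for illustration.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mine_and_validate(n_obs=60, n_candidates=200, seed=0):
    """Pick the best in-sample predictor, then re-test it out of sample."""
    rng = random.Random(seed)
    # Everything is independent noise: there is nothing genuine to find.
    target = [rng.gauss(0, 1) for _ in range(2 * n_obs)]
    candidates = [[rng.gauss(0, 1) for _ in range(2 * n_obs)]
                  for _ in range(n_candidates)]
    # "Mine" the first half of the data for the strongest correlation.
    in_corrs = [pearson(c[:n_obs], target[:n_obs]) for c in candidates]
    best = max(range(n_candidates), key=lambda i: abs(in_corrs[i]))
    # Out-of-sample test: same predictor, second half of the data.
    out_corr = pearson(candidates[best][n_obs:], target[n_obs:])
    return in_corrs[best], out_corr

in_corr, out_corr = mine_and_validate()
print("best in-sample correlation:", round(in_corr, 3))
print("same predictor out of sample:", round(out_corr, 3))
```

The best of many candidates will usually look impressive in the sample it was selected from, and there is no reason to expect that strength to carry over to the held-out half.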

3. Sample Selection Bias

a. Sample selection bias arises when certain data or variables are systematically excluded from the analysis, typically because they are unavailable.

b. Survivorship bias is a particular kind of selection bias wherein we only observe those firms that have succeeded and, therefore, survived; firms that failed drop out of the data.
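A small sketch of survivorship bias, using made-up numbers: the eight fund returns and the -10% closure cutoff below are hypothetical and serve only to show the direction of the bias.

```python
# Hypothetical annual returns for eight funds (illustrative numbers only).
fund_returns = {
    "A": 0.12, "B": 0.08, "C": -0.15, "D": 0.05,
    "E": -0.30, "F": 0.10, "G": -0.02, "H": 0.07,
}

# Assumption: funds returning less than -10% are closed and vanish
# from the database that a later researcher sees.
CLOSURE_CUTOFF = -0.10

def mean(values):
    values = list(values)
    return sum(values) / len(values)

all_funds_mean = mean(fund_returns.values())
survivors = {name: r for name, r in fund_returns.items()
             if r >= CLOSURE_CUTOFF}
survivor_mean = mean(survivors.values())

print("all funds:     ", round(all_funds_mean, 4))   # -0.0063
print("survivors only:", round(survivor_mean, 4))    # 0.0667
```

Averaging only the surviving funds turns a slightly negative average return into a clearly positive one, even though no individual return changed.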

4. Look-Ahead Bias

a. Look-ahead bias occurs when researchers test a model using data that was not yet available on the test date, and then rely on that model for predictions.

b. It may be particularly pronounced when using accounting data, which is typically reported with a time lag.
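One way to guard against this is to filter on the date a figure was released rather than the period it describes. The sketch below assumes each record carries both dates; the records, EPS values, and function names are hypothetical.

```python
from datetime import date

# Hypothetical earnings records: (fiscal_period_end, release_date, eps).
reports = [
    (date(2019, 12, 31), date(2020, 2, 20), 2.10),
    (date(2020, 12, 31), date(2021, 2, 18), 2.45),
]

def latest_known_eps(reports, test_date):
    """Most recent EPS that had actually been released by test_date."""
    known = [r for r in reports if r[1] <= test_date]
    if not known:
        return None
    return max(known, key=lambda r: r[0])[2]

def latest_eps_with_look_ahead(reports, test_date):
    """Biased version: filters on period end and ignores the reporting lag."""
    ended = [r for r in reports if r[0] <= test_date]
    if not ended:
        return None
    return max(ended, key=lambda r: r[0])[2]

test_date = date(2021, 1, 15)
print(latest_known_eps(reports, test_date))            # 2.1
print(latest_eps_with_look_ahead(reports, test_date))  # 2.45
```

On the test date the fiscal year has ended but its results have not been published, so a backtest using 2.45 would be trading on information no investor had at the time.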

5. Time Period Bias

a. Time-period bias occurs when the model uses data from a time period that is not representative of the values the data can take across time.

b. Too short a time period increases the likelihood of period-specific results.

c. Too long a time period increases the chance that it spans a regime change.
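A simple check is to compare an estimate across sub-periods with the full-sample value. The sketch below uses a hypothetical return series whose behaviour shifts halfway through; the numbers and the helper `subperiod_means` are illustrative assumptions.

```python
# Hypothetical periodic returns with a shift after the fourth observation.
returns = [0.04, 0.05, 0.03, 0.04, -0.01, -0.02, 0.00, -0.01]

def subperiod_means(returns, n_periods):
    """Mean return in each of n_periods equal-length, consecutive blocks.

    Observations left over when the length is not divisible by
    n_periods are ignored.
    """
    size = len(returns) // n_periods
    return [sum(returns[i * size:(i + 1) * size]) / size
            for i in range(n_periods)]

full_mean = sum(returns) / len(returns)
first_half, second_half = subperiod_means(returns, 2)

print("full sample:", round(full_mean, 4))    # 0.015
print("first half: ", round(first_half, 4))   # 0.04
print("second half:", round(second_half, 4))  # -0.01
```

Either half on its own gives a period-specific answer, while the full-sample mean blends two different regimes and describes neither.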