What happens if I use OLS in a multiple regression but the sample is not random?

I know that, to use OLS estimators in linear regressions, there are a few assumptions that must be satisfied. However, it is not clear to me what would happen if I used OLS in a multiple regression without having a random sample, so that $(X_i, Y_i)$ would not be i.i.d. What sort of problems might I face?

asked Oct 31, 2016 at 16:25

$\begingroup$ A good place to look is this paper, from Section 6 to the end (the first five sections are not that relevant to your question). $\endgroup$

Commented Oct 31, 2016 at 16:34

$\begingroup$ Your model will be biased $\endgroup$

Commented Oct 31, 2016 at 16:37

$\begingroup$ @user603 I think my econometrics background is too limited to understand the contents of that paper. Is there anything simpler? $\endgroup$

Commented Oct 31, 2016 at 17:02

$\begingroup$ @jchaykow, how would you define model bias? I know what a biased parameter estimator is, but not quite sure about a biased model. $\endgroup$

Commented Oct 31, 2016 at 18:11

$\begingroup$ @RichardHardy a model that is trained on biased data will overfit to a non-representative subset of the overall population. $\endgroup$

Commented Oct 31, 2016 at 20:02

2 Answers

$\begingroup$

First, OLS is nothing more than an algorithm for fitting a linear model of the form $$ y = \mathbf{X}\boldsymbol{\beta} + \epsilon $$ In other words, you are positing that the phenomenon $y$ is a linear function of the variables $\mathbf{X}$, plus some additively separable disturbance term.

If this is a good assumption, then there is some true, constant $\boldsymbol{\beta}$, and you apply some estimator -- such as OLS -- to estimate what it is.

If your sample is non-random -- there is some correlation between your $\mathbf{X}$'s and your error term -- then the OLS estimates $\hat{\boldsymbol{\beta}}$ will not be equal in expectation to the true $\boldsymbol{\beta}$. This is to say that they are biased.

In other words, if you were to take many, many samples from the population of $\mathbf{X}$ and $y$, your average $\hat{\boldsymbol{\beta}}$ would not equal $\boldsymbol{\beta}$.
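A quick way to see this is a small Monte Carlo experiment. The following is a minimal sketch, not part of the original answer, assuming Python with NumPy: the true slope is fixed at 2, the disturbance is built to be correlated with the regressor, and the average of the OLS slope estimates across simulated samples drifts away from 2.

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = 2.0
n, n_sims = 200, 2000
estimates = []

for _ in range(n_sims):
    x = rng.normal(size=n)
    # Disturbance correlated with x (e.g. through an omitted variable):
    # this violates E[eps | x] = 0.
    eps = 0.5 * x + rng.normal(size=n)
    y = true_beta * x + eps
    # OLS slope of y on x (with intercept), in closed form
    estimates.append(np.cov(x, y, bias=True)[0, 1] / np.var(x))

print("true beta:        ", true_beta)
print("mean OLS estimate:", np.mean(estimates))  # roughly 2.5, not 2 -> biased
```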

answered Oct 31, 2016 at 17:04 generic_user

$\begingroup$ I think non i.i.d. does not imply dependence (let alone correlation) between $X$ and $\epsilon$, although it does not prohibit it either. E.g. you could have serially correlated errors that are independent of $X$ and still call it non-i.i.d., no? $\endgroup$

Commented Oct 31, 2016 at 17:23

$\begingroup$ Yeah, that's right. But the title of the question says "non-random sample." It is possible that OP has longitudinal data, and just needs to correct the standard errors. The question is somewhat unclear. $\endgroup$

Commented Oct 31, 2016 at 17:56

$\begingroup$ I will try to explain better. I would like to know, theoretically, what would happen if I tried to estimate by OLS with a non-random sample, violating the i.i.d. assumption. I think that if the i.i.d. assumption is satisfied, then the expected value of $\hat\beta$ is equal to the true $\beta$, and otherwise it is not. But I am not sure. $\endgroup$

Commented Oct 31, 2016 at 18:04

$\begingroup$ @bobo55, take a look at the OLS assumptions to make sure you got those correctly. See e.g. this thread, especially the first answer. $\endgroup$

Commented Oct 31, 2016 at 18:07

$\begingroup$ Clarify what you mean by nonrandom. If you mean that you've got repeated observations of individuals, and those individuals are autocorrelated in time, your estimate of beta will still be unbiased, though inference is complicated. If your sample is nonrandom because some unobserved factor is both correlated with your X's and y, beta will be biased. $\endgroup$

Commented Oct 31, 2016 at 20:17
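To make the distinction in the comment above concrete, here is a hedged sketch in Python/NumPy (not from the original thread; the AR(1) coefficient of 0.8 and the confounder setup are illustrative assumptions): with serially correlated errors that are independent of the regressor, the average slope estimate stays near the true value, while with an unobserved factor driving both the regressor and the outcome, it does not.

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta, n, n_sims = 2.0, 200, 2000

def ols_slope(x, y):
    """OLS slope of y on x (with intercept), in closed form."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

autocorr, confounded = [], []
for _ in range(n_sims):
    x = rng.normal(size=n)

    # Scenario 1: AR(1) errors, independent of x.
    # The sample is not i.i.d., but the slope estimate is still unbiased;
    # only the usual standard errors are wrong.
    e = np.zeros(n)
    shocks = rng.normal(size=n)
    for t in range(1, n):
        e[t] = 0.8 * e[t - 1] + shocks[t]
    autocorr.append(ols_slope(x, true_beta * x + e))

    # Scenario 2: an unobserved factor u moves both the regressor and the outcome.
    u = rng.normal(size=n)
    x2 = x + u
    y2 = true_beta * x2 + u + rng.normal(size=n)
    confounded.append(ols_slope(x2, y2))

print("AR(1) errors,       mean slope:", np.mean(autocorr))    # close to 2
print("omitted confounder, mean slope:", np.mean(confounded))  # clearly above 2
```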

$\begingroup$

When the sample is not random, you have to consider whether the way you got the sample introduced bias. That is, the way the data were gathered in real life can affect the extent to which the sample is representative of the population.

For example, say you want to predict who someone is going to vote for based on their media habits, and you get the data by asking your friends. The problem is that your friends are probably not going to be representative of the population at large.

Why? One reason could be that we tend to become friends with people who share similar media preferences (maybe you became friends partly because you both love the same YouTube channel). Another could be that friends tend to have the same socioeconomic status, and socioeconomic status affects which types of media are consumed.

In this case, when you do your OLS, your regression coefficient will reflect your friends, but it's very hard to say whether it reflects the population at large. If you're only interested in your friends, that's fine. If you want to generalize, you're in trouble.
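As a concrete illustration, here is a minimal sketch (Python/NumPy; the variables, the effect sizes, and the "friends are mostly high socioeconomic status" rule are made-up assumptions, not from the answer): the slope of vote score on media hours estimated on a friends-like sample differs from the slope in the population, while a random sample recovers roughly the population value.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

# Population: socioeconomic status (ses) affects both media habits and vote choice.
ses = rng.normal(size=N)
media_hours = 2.0 + 1.5 * ses + rng.normal(size=N)
vote_score = 0.5 * media_hours + 2.0 * ses + rng.normal(size=N)

def ols_slope(x, y):
    """OLS slope of y on x (with intercept), in closed form."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# A random sample of the population.
idx_random = rng.choice(N, size=1_000, replace=False)

# A "friends" sample: drawn only from high-ses individuals.
high_ses = np.flatnonzero(ses > 1.0)
idx_friends = rng.choice(high_ses, size=1_000, replace=False)

print("population slope:    ", ols_slope(media_hours, vote_score))
print("random-sample slope: ", ols_slope(media_hours[idx_random], vote_score[idx_random]))
print("friends-sample slope:", ols_slope(media_hours[idx_friends], vote_score[idx_friends]))
# The friends-sample slope is noticeably different from the population slope.
```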

According to Mercer et al. (2017), for non-random samples you basically have to consider whether your non-random sample reflects the population in terms of confounders.

For example, if all your friends have the same gender as you, sampling your friends is going to be a problem because gender likely affects media habits.

But an all-male sample might not be a problem. E.g., if you're testing a new pill for erectile dysfunction, you're probably fine going with an all-male sample. Basically, it depends on the theoretical knowledge of your field.

When we start to appeal to the theoretical knowledge of our field, we are moving out of statistics and into the world of causal inference (see e.g. Pearl's Book of Why).

Random sampling from the population is a way to avoid having to deal with any of this: you can say, "I know the way I gathered the data didn't introduce bias, because it was gathered at random."

Randomization protects you from the known, the unknown, and the unknown unknown sources of bias.