We cannot collect all of the data that we might want, in order to find out whatever it is that we want to know. A botanist cannot collect data from every rose plant on the planet, an ornithologist cannot collect data on every hummingbird on the planet, and a wine researcher cannot collect data on every wine on the planet. So, instead, we collect data on a sub-sample, and we then generalize from that sample to the population that we are actually interested in.
Many people seem to think that the size of the sample is important, and they are right. However, size is not the most important thing, not by a long chalk. The most important thing is that the sample must not be biased. Even a small unbiased sample is much, much better than a large biased sample.
Bias refers to whether the sample accurately represents the population we are taking the sample from. If the sample does represent the population then it is unbiased, and if it does not represent the population then it is biased. Bias is bad. In fact, it is often fatal to the work, because we will end up making claims about the population that are probably untrue.
Let's take the first example that I worked out for myself, when I was first learning about science. In 1976, Shere Hite published The Hite Report on Female Sexuality in the U.S.A. She had distributed questionnaires in many different ways, including direct mailouts and enclosures in magazines. She described the final sample of females as follows: "All in all, one hundred thousand questionnaires were distributed, and slightly over three thousand returned (more or less the standard rate of return for this type of questionnaire distribution)." She also emphasized that her sample size was much larger than had ever before been used for studies of human sexual behavior (e.g. by Kinsey, or Masters and Johnson).
Here, the intended population from which the sample was taken is not the same as the actual sampled population — the questionnaires may well have been distributed to a group of females who were representative of women in the U.S.A., but there is no reason to expect that the respondents were. The respondents chose to respond, while other women chose not to.
It should be obvious that there are only two reasonable conclusions about females in the U.S.A. that can be drawn from this study: (1) it seems that c. 3% of the females will discuss their sex lives, and (2) it is likely that c. 97% of the females do not voluntarily discuss their sex lives. There is no necessary reason to expect that the sexual activities of these two groups were the same, at least in the 1970s. Indeed, our general knowledge of people probably leads us to expect just the opposite. Hite’s report is thus solely about the smaller of these two groups (i.e. those who will reveal their sex lives), and no justifiable conclusions can be reached about the larger group.
Note that the problem here is not the sample size of 3,000 — it is solely the non-representativeness of this sample that is at issue, since a sample of this size could easily be representative even of a population as large as that of the U.S.A. At one extreme, if I want to work out the ratio of males:females on this planet, then I will actually get the right answer even with a sample of two people, provided one is male and the other is female!
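To put a rough number on the claim that 3,000 people can be enough, here is a minimal sketch of the usual margin-of-error calculation for an estimated proportion. It assumes a simple random (i.e. unbiased) sample, which is exactly what Hite did not have, and the three population sizes are illustrative figures of my own, not numbers from her report.

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """Approximate 95% margin of error for an estimated proportion.

    Assumes a simple random sample of size n drawn without replacement
    from a finite population; p=0.5 is the worst case.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: only matters when n is a noticeable
    # fraction of the population.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

# Illustrative population sizes (assumptions, not data from the study).
for population in (100_000, 10_000_000, 200_000_000):
    moe = margin_of_error(3_000, population)
    print(f"population {population:>11,}: +/- {100 * moe:.2f} percentage points")
```

Under these assumptions the margin stays a little under two percentage points whether the population is a hundred thousand or two hundred million. The size of the population hardly enters into it; what matters is that the sample is unbiased.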
It is important to note that all samples are an unbiased representation of some population, whether large or small. The trick is that we need to work out what that population is. If it is not the same as the population that we intended, then we are in trouble, if we try to generalize our conclusions beyond the actual population. This was Shere Hite's problem, because she drew general conclusions about women in the U.S.A. (her intended population) rather than just those women who will discuss their sex lives (her sampled population).
It is for this reason that government censuses try to sample all (or almost all) of the relevant people. This is the best way to avoid biases — if you can get data from nearly everyone, then there cannot be much bias in your sample!
Professional survey organizations (e.g. Nielsen, Gallup) usually try to address this issue by defining specific sub-samples of their intended population, and then pooling those sub-samples to get their final sample (this is called stratified sampling). For example, they will explicitly sample people from different ages, different professions, different ethnic backgrounds, and so on, defining sub-groups using any criteria that they feel might be relevant to the question at hand. This greatly increases their chances of getting an unbiased sample of the general populace.
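The idea of stratified sampling can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the age groups and their proportions are hypothetical, and I have used proportional allocation (each sub-group contributes in proportion to its share of the population), which is one common choice and not necessarily what any particular polling organization does.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(people, stratum_of, total_size, seed=0):
    """Draw a proportionally allocated stratified sample.

    people      -- list of records
    stratum_of  -- function mapping a record to its sub-group label
    total_size  -- approximate overall sample size
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in people:
        strata[stratum_of(person)].append(person)

    sample = []
    for members in strata.values():
        # Each sub-group contributes in proportion to its share of the
        # population, with at least one member so no group is left out.
        share = len(members) / len(people)
        k = max(1, round(total_size * share))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical population: 60% aged 18-39, 30% aged 40-64, 10% aged 65+.
population = (
    [{"age_group": "18-39"}] * 600
    + [{"age_group": "40-64"}] * 300
    + [{"age_group": "65+"}] * 100
)
sample = stratified_sample(population, lambda p: p["age_group"], total_size=50)
counts = Counter(p["age_group"] for p in sample)
print(counts)
```

The sample mirrors the population only on the criteria used to define the sub-groups. Anything that was not used as a criterion is left to chance.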
But even this approach does not guarantee that they will succeed. The example that I used to give my students involved predictions for nine consecutive Australian federal elections (1972-1990) from seven different survey organizations. These polling groups mostly forecast the winning political party correctly, although the winning percentages were sometimes quite inaccurately estimated. However, there was one year (1980) when they all got it wrong; that is, they all predicted that the Labor party would win, by margins of 2-9%, whereas the Liberal/NCP coalition actually won by 1% of the vote. In this case their stratified sampling failed to account for the geographical distribution of voters in the various electoral regions.
Note, also, that these types of survey organizations do not focus as much on sample size as they do on bias, as I emphasized above. For example, in 2014, the Nielsen survey organization announced an addition of 6,200 metered homes to its sample used for assessing television markets in the USA, in terms of which channels/shows are being watched (see "Nielsen announces significant expansion to sample sizes in local television markets"); this represented "an almost 50% increase in sample size across the set meter market." That is, even after the increase, c. 20,000 homes are currently being used to sample an estimated population of nearly 120,000,000 US homes with TVs (see "Nielsen estimates 118.4 million TV homes in the U.S. for the 2016-17 TV season").
The points that I have made here also apply to the modern phenomenon of collecting and analyzing what is called "Big Data". This has become a buzz expression in the modern world, appearing, for example, in biology with the study of genomics and the business world with the study of social media. Apparently, the idea is that the sheer size of the samples will cure all data analysis ills.
However, data are data, and an enormous biased dataset is of no more use than is a small biased one. In fact, mathematically, all Big Data may do is make you much more confident of the wrong answer. To put it technically, large sample sizes will address errors due to stochastic variation (i.e. random variability), but they cannot address errors due to bias.
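A small simulation makes the point concrete. It is a sketch under made-up assumptions: the true rate of 30% and the rate of 45% among those who end up in the biased sample are hypothetical numbers chosen purely for illustration, as are the two sample sizes.

```python
import math
import random

def estimate_with_ci(sample, z=1.96):
    """Return (estimated proportion, half-width of an approximate 95% interval)."""
    n = len(sample)
    p_hat = sum(sample) / n
    return p_hat, z * math.sqrt(p_hat * (1 - p_hat) / n)

rng = random.Random(42)
TRUE_RATE = 0.30    # hypothetical true proportion in the population
BIASED_RATE = 0.45  # hypothetical proportion among those who get sampled

def draw(rate, n):
    """Simulate n yes/no observations with the given probability of yes."""
    return [1 if rng.random() < rate else 0 for _ in range(n)]

small_unbiased = draw(TRUE_RATE, 500)
huge_biased = draw(BIASED_RATE, 1_000_000)

for name, sample in [("small unbiased", small_unbiased), ("huge biased", huge_biased)]:
    p_hat, half_width = estimate_with_ci(sample)
    print(f"{name:>14}: {p_hat:.3f} +/- {half_width:.3f} (true value {TRUE_RATE})")
```

The huge sample produces a very narrow interval, but it is narrow around the biased value, not the true one. The small unbiased sample is much less precise, yet its interval is centered near the right answer.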
So, Big Data can lead to big mistakes, unless we think about possible biases before we reach our conclusions.