Evidence-based Stratification Methodology for Non-probabilistic Sampling Surveys
Non-probability sampling methods are increasingly used in large-scale surveys because of the cost of ensuring that the chosen sample is representative of the population, as probability sampling requires. Conventionally, it has been believed that non-probability sampling does not permit precise estimates of how the statistical properties of the sample differ from those of the population, because of possible biases in the non-probability sample. However, the growing volume of big survey data collected through non-probability sampling may give researchers an opportunity to use novel methods for quantifying the bias that may exist in different strata, so that within each stratum respondents can be selected through probability (random) sampling to create pseudo-controlled samples for estimating population parameters. In this thesis, we use one of the largest survey databases ever collected in healthcare through convenience sampling (the Improving Practice Questionnaire, IPQ, for patients visiting their doctor in the UK) to show that different stratification strategies can be adopted in conjunction with machine learning techniques to help researchers decide on the most appropriate stratification method for estimating population parameters from the chosen strata. Such strategies can inform an evidence-based stratification methodology that reveals similarities and differences in feedback experience among smaller sub-populations. This research combines standard statistical and machine learning techniques into a systematic stratification methodology for analysing survey data collected through non-probability sampling.
In summary, this thesis shows that the traditional statistical problem of estimating population parameters from a study that does not use probability sampling can be addressed through the use of big data and the appropriate use of measures and metrics from machine learning alongside standard statistical methods for analysing population parameters. The implications of this thesis are that it will be possible, in the age of big data, to overcome traditional statistical concerns about the quality of data not obtained through traditional probabilistic techniques, and that outcomes of statistical analysis using non-probability sampling methods can be as reliable as those from probability sampling, provided that a clear methodology is used to quantify bias at various stratification levels.
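The general idea of estimating a population parameter stratum by stratum can be illustrated with a minimal sketch of a textbook post-stratified mean. This is not the methodology developed in the thesis; it only shows how per-stratum estimates from a convenience sample might be reweighted by known population shares. The function names, the stratum labels, the scores, and the population shares are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def naive_mean(responses):
    """Unweighted mean of all responses, ignoring strata."""
    values = [value for _, value in responses]
    return sum(values) / len(values)


def post_stratified_mean(responses, population_shares):
    """Weight each stratum's sample mean by its (assumed known) population share.

    responses: iterable of (stratum_label, numeric_value) pairs.
    population_shares: dict mapping stratum_label to its share of the
        population; the shares are assumed to sum to 1.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in responses:
        totals[stratum] += value
        counts[stratum] += 1

    # Every population stratum needs at least one respondent.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no respondents in strata: {sorted(missing)}")

    return sum(
        share * totals[stratum] / counts[stratum]
        for stratum, share in population_shares.items()
    )


# Hypothetical convenience sample in which one stratum is over-represented.
sample = [
    ("under_40", 3.0), ("under_40", 4.0),
    ("over_40", 5.0), ("over_40", 5.0), ("over_40", 4.0), ("over_40", 5.0),
]
# Hypothetical population shares (illustrative values only).
shares = {"under_40": 0.5, "over_40": 0.5}

naive = naive_mean(sample)
adjusted = post_stratified_mean(sample, shares)
print(f"naive mean: {naive:.3f}, post-stratified mean: {adjusted:.3f}")
```

In this toy sample the over-represented stratum pulls the unweighted mean upward, and reweighting by the assumed population shares lowers the estimate. Whether such a correction removes bias in practice depends on how well the chosen strata capture the selection mechanism, which is the question the stratification methodology is meant to address.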