A universal concern with all information systems must be the quality of the data contained within them; hence the well-known adage of the computer age: garbage in, garbage out. Nevertheless, it should be recognized that “errors and uncertainty are facts of life in all information systems” (Openshaw, 1989). The process of describing aspects of reality as a file structure on storage media requires a high level of abstraction, as was illustrated in Chapter 2. Any attempt to represent reality completely in GIS, while no doubt resulting in robust and flexible data sets, would also result in large, complex, and costly data sets that would require a higher order of technology to handle them.
Historically, a detailed consideration of data quality issues in GIS lagged considerably behind the mainstream of GIS development and application. This is evident from the growth of the relevant literature, which underscores a sudden vogue in spatial data quality research from 1987 onward, some 25 years after the introduction of GIS (Figure 8.1). This lag in concern for spatial data quality may be attributed to:
The inherent trust most users have in computer output, particularly after some complex analysis.
The possible lack of awareness among operators and managers from nonspatial disciplines of the sources of uncertainty in spatial data sets and the consequences of propagating them through analyses, other than the need to correct blunders.
The growing desire in the late 1980s to integrate remote sensing (RS) and GIS data; a body of research on the accuracy assessment of RS data already existed.
The growth of GIS through stages of inventory, analysis, and management (Crain and Macdonald, 1984), such that the need to consider the consequences of uncertainty in outcomes on decision making may only become apparent after some years of system development.
Data are usually collected within a specific context, and the design for any primary data collection is usually specified within that context. Surveyor and user may be the same individual, part of the same team, or linked by contract. Thus, the chances for misinterpretation of outcomes or misconceptions concerning accuracy of the data should, in theory, be quite small. But data are likely to have a life span (shelf life) well beyond the original context and may well be used as secondary data on other projects. Those who collected the data may be unaware of subsequent uses (or misuses) to which their data are put.
Most of the early literature on GIS data quality was concerned with the accuracy of data sets or, more specifically, the recognition and avoidance of error. We will be taking a wider view of this issue by considering the level of uncertainty that exists in the use of spatial data and the fitness-for-use of GIS outputs.
Error is the deviation of observations and computations from the truth or what is perceived as the truth. This assumes that an objective truth can be known and measured. Statistically, errors may be identified as gross (outliers, blunders), systematic (uniform shift, bias), or random (normally distributed about the true value).
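The three statistical classes of error can be made concrete with a short sketch. The Python below is illustrative only and is not a method prescribed in this chapter: it assumes a set of check measurements with matching reference ("true") values, flags gross errors with a median-based outlier rule (the threshold `gross_k` and the function name `summarize_errors` are hypothetical choices), and then reports the mean of the remaining errors as the systematic component and their standard deviation as the random component.

```python
import statistics

def summarize_errors(observed, reference, gross_k=3.0):
    """Split observation errors into gross, systematic, and random parts.

    Assumption: reference values are treated as the truth, and a robust
    (median-based) rule with threshold gross_k is used to flag blunders.
    """
    errors = [o - r for o, r in zip(observed, reference)]

    # Robust center and spread, so that blunders do not mask themselves.
    center = statistics.median(errors)
    mad = statistics.median([abs(e - center) for e in errors])
    robust_sd = 1.4826 * mad  # MAD scaled to match a normal standard deviation

    # Gross errors: observations far from the bulk of the errors.
    gross = [i for i, e in enumerate(errors)
             if abs(e - center) > gross_k * robust_sd]
    kept = [e for i, e in enumerate(errors) if i not in gross]

    # Systematic error: the uniform shift (bias) of the remaining errors.
    bias = statistics.mean(kept)
    # Random error: scatter of the remaining errors about that bias.
    spread = statistics.stdev(kept) if len(kept) > 1 else 0.0

    return {"gross_indices": gross, "systematic_bias": bias, "random_sd": spread}


# Hypothetical check survey: eight measurements of a 100.0 m baseline.
reference = [100.0] * 8
observed = [100.4, 100.6, 100.5, 100.3, 100.7, 100.5, 100.5, 112.0]
summary = summarize_errors(observed, reference)
print(summary)
```

In this invented data set the last measurement is a blunder, the remaining measurements share a shift of about half a meter, and the residual scatter about that shift is the random component.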
Reliability concerns the trust or confidence given to a set of input data on the basis of available metadata (data about the data: its lineage, consistency, completeness, and purported accuracy) and upon inspection of the data by the user. It refers to the assessed quality of the data on receipt. The user can then judge its appropriateness for use in a particular context. Fitness-for-use, however, refers to the assessed quality of the products of analyses used in decision making. Such evaluations and judgments must necessarily be the responsibility of the user (Chrisman, 1982).
Where only a single theme is used, fitness-for-use can be judged directly from the data’s reliability. However, where data sets are integrated and themes are combined or transformed, the analytical outputs are characterized by the combination and propagation of the data reliabilities of the individual themes. Whereas research has focused on quality measures for data reliability, progress in deriving quality measures that are meaningful in the evaluation of fitness-for-use has been slower.
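A simple probabilistic sketch shows how the reliabilities of individual themes combine in an overlay. It assumes each theme's reliability can be summarized as a single proportion of correctly labeled locations, and that a location in the composite is correct only if it is correct in every input theme. Under those assumptions the composite accuracy equals the product of the theme accuracies if errors are independent, and in any case lies between two bounds. The function name `combined_accuracy_bounds` is hypothetical.

```python
from math import prod

def combined_accuracy_bounds(accuracies):
    """Return (lower, independent, upper) accuracy of an overlay.

    Assumption: a composite location is correct only where every input
    theme is correct, and each theme has one overall accuracy in [0, 1].
    """
    # Worst case: the errors of the themes never coincide.
    lower = max(0.0, 1.0 - sum(1.0 - a for a in accuracies))
    # Errors in different themes occur independently of one another.
    independent = prod(accuracies)
    # Best case: errors coincide, so the least accurate theme dominates.
    upper = min(accuracies)
    return lower, independent, upper


# Hypothetical accuracies for three themes, e.g. soils, land cover, slope class.
lower, independent, upper = combined_accuracy_bounds([0.90, 0.80, 0.95])
print(f"lower={lower:.3f} independent={independent:.3f} upper={upper:.3f}")
```

Even with individually reliable themes, the composite can be no better than its weakest input and is usually worse, which is why the reliability of a single theme says little about the fitness-for-use of an integrated product.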
Uncertainty in its broadest sense can be used as a global term to encompass any facet of the data, its collection, its storage, its manipulation, or its presentation as information that may raise concern, doubt, or skepticism in the mind of the user as to the nature or validity of the results or intended message. Theoretically, this definition would also include mishandling of the data through improper analysis, inappropriate or erroneous use of GIS functions, poor cartographic technique, and so on. Thus, in the context of environmental modeling, Burrough et al. (1996) consider the quality of GIS informational output to be a function of both model quality and data quality.
Modeling issues will be discussed separately in Chapter 9. The term uncertainty, therefore, will be used to refer to the inevitable inaccuracies, inexactness, or inadequacies that exist in most spatial data sets and their resultant propagation through analyses to adversely affect the usefulness of results and certainty in decision making. Four broad categories of uncertainty are given in Figure 8.2. Intrinsic and inherited uncertainties are those associated with primary and secondary methods of data collection (Thapa and Burtch, 1990), respectively. Secondary data (e.g., existing maps) will also have an element of intrinsic uncertainty. Once the data are used within GIS, intrinsic and inherited uncertainty will be propagated, and additional uncertainty may be derived due to the nature of hardware and software. This is operational uncertainty. The resulting levels of uncertainty, if not quantified or in some way known, may lead to overconfident, uncertain, or erroneous decision making. Uncertainty or error in use may also derive from different perceptions or misinterpretation of the output information on the part of the user.
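One common way to quantify how uncertainty in the input data propagates into a derived result is Monte Carlo simulation. This is a technique swapped in here for illustration; the text above does not prescribe it. The sketch assumes that the positional uncertainty of each polygon vertex can be modeled as independent, zero-mean Gaussian noise with the same standard deviation `sigma` in x and y, which is a simplification. The function names and the parcel coordinates are hypothetical.

```python
import random
import statistics

def polygon_area(vertices):
    """Planar area of a simple polygon by the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def simulate_area_uncertainty(vertices, sigma, runs=2000, seed=42):
    """Propagate vertex uncertainty into area by repeated perturbation.

    Assumption: each coordinate carries independent Gaussian error with
    standard deviation sigma (same units as the coordinates).
    """
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    areas = []
    for _ in range(runs):
        # Draw one plausible realization of the polygon.
        perturbed = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
                     for x, y in vertices]
        areas.append(polygon_area(perturbed))
    # Mean and spread of the derived quantity across all realizations.
    return statistics.mean(areas), statistics.stdev(areas)


# Hypothetical 100 m x 100 m parcel with 1 m positional uncertainty per coordinate.
parcel = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
mean_area, sd_area = simulate_area_uncertainty(parcel, sigma=1.0)
print(f"nominal={polygon_area(parcel):.1f} mean={mean_area:.1f} sd={sd_area:.1f}")
```

For this parcel, first-order error propagation suggests a standard deviation in area of roughly 140 square meters, so the simulated spread should fall close to that figure. Reporting such a spread alongside the nominal area is one way of making the level of uncertainty known to the decision maker rather than leaving it implicit.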