Data Mining applied to Air Pollution
by Brian J Read
An understanding of the behaviour of air pollution is needed to predict
it and then to guide action to ameliorate it. Calculations with dynamical
models are based on the relevant physics and chemistry. To help with the
design and validation of such models, a complementary approach is described
here. It examines air quality data empirically by data mining, using
machine learning techniques in particular, with the aim of a better understanding
of the phenomenon and a more direct interpretation of the data.
The work of the Database Group at CLRC has long concentrated on the
practical application of data management technology. The emphasis is on
helping users at the laboratory and externally to exploit the value in
data. Implementing databases and providing easy access is the basis for
this, supplemented by data exploration and decision support tools. More
recently, interest has extended to data mining, or more fully Knowledge
Discovery in Databases (KDD). This may be defined as "the non-trivial
extraction of implicit, previously unknown and potentially useful knowledge
from data". Strictly, data mining is just the discovery stage of the whole KDD
process. Indeed, most of the work in practice lies in the preparatory stages
of data selection and data cleaning. Extensive data exploration is essential
if the data mining is to yield intelligent results.
Data mining is multi-disciplinary: it covers expert systems, database
technology, statistics, machine learning ("AI") and data visualisation.
It goes beyond directed querying of a database (e.g. using SQL): rather
than seeking detailed answers, it looks for hypotheses or questions. Most
interest is in mining commercial data, for example credit profiling or
market basket analysis. However, it is starting to be used in scientific
applications too. CLRC, as a leading research laboratory, holds large volumes
of data. Thus there is the motivation to see how data mining techniques might
supplement the more traditional scientific analysis in formulating and testing
hypotheses.
Of specific interest are the induction of rules and neural net models.
Among environmental data, the measurement and possible control of air
pollution are increasingly topical. In applying the KDD process, our objectives
are two-fold:
- to improve our understanding of the relevant factors and their relationships,
including the possible discovery of non-obvious features in the data that
may suggest better formulations of the physical models
- to induce models solely from the data, so that dynamical simulations
can be compared with them and so that the models themselves may be useful
in offering (short-term) predictive power.
The investigation uses urban air quality measurements taken hourly in
the City of Cambridge (UK). These are especially useful since simultaneous
weather data from the same location are also available. The aim is,
for example, to look for and interpret possible correlations between
each pollutant (NO, NO2, NOx, CO, O3 and PM10) and a) the other pollutants
and b) the weather (wind strength and direction, temperature, relative
humidity and radiance). Of particular interest are lags, that is, one
attribute seeming to affect another with a delay of perhaps hours or days.
Other factors are possible. For example, NOx is noticeably lower on Sundays
because there is less traffic.
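The search for lagged correlations can be illustrated with a minimal sketch
in Python. It is not the method used in the study. The function names
(pearson, lagged_correlations) and the hourly values are invented for
illustration, and the synthetic NOx series is constructed to follow wind
speed with a delay of two hours.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)


def lagged_correlations(driver, response, max_lag):
    """Correlate response[t] with driver[t - lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        xs = driver[: len(driver) - lag]  # driver values, 'lag' steps earlier
        ys = response[lag:]               # response values they are paired with
        result[lag] = pearson(xs, ys)
    return result


# Synthetic hourly data (assumption, for illustration only).
wind = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
# NOx falls as wind rises, but two hours later: nox[t] = 100 - 10 * wind[t - 2].
nox = [50.0, 50.0] + [100.0 - 10.0 * w for w in wind[:-2]]

corr = lagged_correlations(wind, nox, max_lag=4)
best_lag = min(corr, key=corr.get)  # lag with the strongest negative correlation
print(best_lag, round(corr[best_lag], 3))
```

With real measurements the lag range would extend to days as well as hours,
and each pollutant would be tested against every weather attribute, which is
what makes the search space large.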
The initial analysis concentrated on the daily maxima of the pollutants.
This simplifies the problem, and the results provide a guide for a later
full analysis. The peak values were also expressed as bands (e.g.
low, medium and high). The bands relate directly to standards or targets,
recommended by the Expert Panel on Air Quality Standards (EPAQS), that the
public can appreciate. The two principal machine learning techniques used
are neural networks and the induction of rules and decision trees. Expressing
their predictions as band values makes the results of such rules and models
easier to understand.
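Reducing hourly readings to banded daily maxima might look like the
following sketch. The band edges in BAND_EDGES are hypothetical placeholders,
not the EPAQS values, and the readings, dates and function names are likewise
invented for illustration.

```python
from datetime import datetime

# HYPOTHETICAL band edges (upper bound, label); NOT the actual EPAQS thresholds.
# Any value at or above the last upper bound is labelled "high".
BAND_EDGES = [(50.0, "low"), (100.0, "medium")]


def daily_maxima(readings):
    """Map each calendar day to its largest reading, skipping missing values."""
    maxima = {}
    for timestamp, value in readings:
        if value is None:  # missing measurement
            continue
        day = timestamp.date()
        if day not in maxima or value > maxima[day]:
            maxima[day] = value
    return maxima


def to_band(value, edges=BAND_EDGES):
    """Return the label of the first band whose upper bound exceeds value."""
    for upper, label in edges:
        if value < upper:
            return label
    return "high"


# Synthetic hourly readings for one pollutant (assumption, illustration only).
readings = [
    (datetime(2000, 1, 1, 8), 42.0),
    (datetime(2000, 1, 1, 9), 61.0),
    (datetime(2000, 1, 1, 17), None),
    (datetime(2000, 1, 2, 8), 30.0),
    (datetime(2000, 1, 2, 18), 48.5),
    (datetime(2000, 1, 3, 7), 95.0),
    (datetime(2000, 1, 3, 8), 130.0),
]

maxima = daily_maxima(readings)
bands = {day: to_band(value) for day, value in maxima.items()}
for day in sorted(bands):
    print(day, maxima[day], bands[day])
```

The banded daily values, rather than the raw hourly numbers, would then be
the targets for a rule inducer or neural network.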
Work so far supports the common experience in data mining that most
of the effort is in data preparation and exploration. The data must be
cleaned to allow for missing and bad measurements. Detailed examination
leads to transforming the data into more effective forms. The modelling
process is highly iterative, using statistics and visualisation to guide
strategy. The temporal dimension with its lagged correlations adds significantly
to the search space for the most relevant parameters.
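One possible cleaning step is sketched below, purely as an assumption about
what such preparation could involve: readings outside a plausible range are
treated as missing, and only short gaps are filled by linear interpolation.
The function name clean_series, the range limits and the max_gap setting are
all hypothetical.

```python
def clean_series(values, lo, hi, max_gap=2):
    """Mark out-of-range readings as missing, then fill short interior gaps.

    values  : list of floats, with None for missing readings
    lo, hi  : plausible physical range; readings outside it count as bad
    max_gap : longest run of missing values to fill by linear interpolation
    """
    out = [v if v is not None and lo <= v <= hi else None for v in values]
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # index of the first valid value after the gap, or n
        gap = end - start
        # Fill only gaps that are short and have a valid neighbour on each side.
        if start > 0 and end < n and gap <= max_gap:
            left, right = out[start - 1], out[end]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out


# Synthetic series: -999.0 stands in for an instrument error code (assumption).
raw = [10.0, None, 14.0, -999.0, -999.0, -999.0, 20.0, 5000.0]
cleaned = clean_series(raw, lo=0.0, hi=1000.0, max_gap=2)
print(cleaned)
```

Longer gaps are deliberately left as missing, since interpolating across
several hours could manufacture the very correlations being sought.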
More extensive investigation is needed to establish under what circumstances
data mining might be as effective as dynamical modelling. (For instance,
urban air quality varies greatly from street to street depending on buildings
and traffic.) A feature of data mining is that it can short-circuit the
post-interpretation of numerical simulation output by directly
predicting the probability of exceeding pollution thresholds. More generally,
data mining analysis might offer a reference model in the validation of
simulation calculations.
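As a much simpler stand-in for the neural networks and induced rules
discussed above, the idea of directly predicting the probability of exceeding
a threshold can be sketched as an empirical frequency per condition. The
condition labels, values, threshold and function name are invented for
illustration.

```python
from collections import Counter


def exceedance_probability(records, threshold):
    """Estimate P(daily maximum > threshold) separately for each condition.

    records : iterable of (condition_label, daily_maximum) pairs
    Returns a dict mapping each label to the observed fraction of exceedances.
    """
    totals = Counter()
    exceeded = Counter()
    for label, value in records:
        totals[label] += 1
        if value > threshold:
            exceeded[label] += 1
    return {label: exceeded[label] / totals[label] for label in totals}


# Synthetic daily maxima grouped by a coarse weather label (assumption).
records = [
    ("calm", 120.0), ("calm", 95.0), ("calm", 130.0), ("calm", 110.0),
    ("windy", 60.0), ("windy", 105.0), ("windy", 40.0), ("windy", 55.0),
]

probs = exceedance_probability(records, threshold=100.0)
print(probs)
```

A simulation would instead produce concentration fields that then have to be
interpreted against the threshold, which is the step such a direct estimate
avoids.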
Please contact:
Brian J Read - CLRC
Tel: +44 1235 44 6492
E-mail: b.j.read@rl.ac.uk