... X_j ... X_k ...    (1)
where X denotes the cumulative percentages (or fractions) at their
respective values (j and k), and N is the number of elements (observations).
A minimum node size of 10 was applied in this study, since
a simpler tree is easier to understand and faster to use and, more
importantly, smaller trees provide greater predictive accuracy for
unseen data. This number has also been used in other studies
(Murphy et al., 1994; Zhang and Burton, 1999).
The maximum tree elevation was not specified in this study, as
was the case for earlier programs such as AID (Automatic Interaction
Detection). An overly large tree needs to be pruned back to its optimal
size (i.e. backward pruning) on the basis of V-fold cross-validation
(Berk, 2003), with several assumptions such as random-row-holdback
validation, a fixed number of terminal nodes, and smooth
minimum spikes. V-fold cross-validation does not require a separate,
independent dataset, which would reduce the data used to build the tree.
Instead, it partitions the dataset used to build the reference unpruned
tree into a number of groups (i.e. folds). A value of 10 folds was adopted
in this study, since a larger value increases computation time and may
not result in a better tree.
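The partitioning step can be sketched as follows. This is a minimal illustration of splitting rows into 10 folds, not the software used in the study; the function names, the random seed and the row count of 140 are assumptions chosen for the example.

```python
import random


def v_fold_indices(n_rows, v=10, seed=0):
    """Split row indices 0..n_rows-1 into v disjoint folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::v] for i in range(v)]


def train_test_pairs(folds):
    """Yield (training rows, held-out rows) once per fold."""
    for k, held_out in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, held_out


# Hypothetical learning dataset of 140 rows, split into 10 folds.
folds = v_fold_indices(n_rows=140, v=10)
pairs = list(train_test_pairs(folds))
```

Each row is held out exactly once, so every observation contributes to both tree building and validation, which is why no separate dataset is needed for pruning.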
3.5. Construction of the zinc concentration map
Using the resulting decision-tree model, a predictive concen-
tration map of zinc distribution in soils of the study area was
produced under a GIS environment through conditional parametric
R. Bou Kheir et al. / Environmental Pollution xxx (2009) 19 5
ARTICLE IN PRESS
Please cite this article in press as: Bou Kheir, R., et al., Spatial soil zinc content distribution from terrain parameters: A GIS-based decision-tree
model in Lebanon, Environ. Pollut. (2009), doi:10.1016/j.envpol.2009.08.009
Fig. 2. Decision-tree model for predicting zinc concentration based on terrain parameters (pH = pH; proximity to roads = Roads; land cover/use = LUC; soil type = Soils; distance to drainage line = Drain; slope
gradient = SG; soil depth = Sd; organic matter = om; hydraulic conductivity = cond; lithology = litho; surroundings of waste areas = WA; nearness to cities = C).
weight application. Where different end results (concentrations in
mg kg⁻¹) characterized field sites within a given unit of the
intersection layer (between different predictor parameters), new
sub-polygons were delineated. Where results were similar, unit
polygons were joined (merged). This map was validated based on field
surveys. An independent dataset was randomly chosen in all
landscape units, consisting of 30% (60 sites) of the total number of
field sites. The accuracy assessment was based on analysis of the
error matrix, which is a square array of dimension n × n (n is the
number of classes). The total accuracy is the ratio of the number of
correctly inferred Zn classes to the total number of test samples (60).
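The random selection of validation sites can be illustrated with a short sketch. It assumes a simple random draw over all sites; the text does not specify whether sampling was stratified per landscape unit. The total of 200 sites is inferred from "30% (60 sites)", and the site identifiers, function name and seed are hypothetical.

```python
import random


def holdout_split(site_ids, test_fraction=0.30, seed=42):
    """Randomly reserve a fraction of the sites for independent validation."""
    rng = random.Random(seed)
    n_test = round(len(site_ids) * test_fraction)
    test = set(rng.sample(site_ids, n_test))
    train = [s for s in site_ids if s not in test]
    return train, sorted(test)


# 200 hypothetical site identifiers (assumption: 60 sites = 30% of the total).
site_ids = list(range(1, 201))
train_sites, test_sites = holdout_split(site_ids)
```

A stratified variant would apply the same draw separately within each landscape unit so that every unit is represented in the test set.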
4. Results and discussion
4.1. Model performance evaluation
The regression-tree model created with 25 terminal nodes
(Fig. 2) correctly classified 88% of the training data, but V-fold
cross-validation indicated that this model would correctly classify 76% of
an independent dataset. The number of observations (N) per
terminal node ranged from 1 to 8, and the node does not split if N is
less than 10 observations.
The mean value of the target variable (zinc concentration in
mg kg⁻¹) of the rows that are in a terminal node of the tree is the
estimated value for that node. From the tree (Fig. 2), one can see that if
the pH is less than 8 (N = 110 rows), then the estimated (average)
value of the target variable is 200 mg kg⁻¹, whereas if the pH
exceeds 8 (N = 30 rows), then the average value of the target
variable is 110 mg kg⁻¹. In fact, zinc, like other heavy metals,
becomes more water-soluble under acid conditions and can move
downwards with water through the soils.
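The first split can be read as a simple rule, and the two node means can be pooled to recover the average over all rows reaching the root. The sketch below uses only the N and mean values quoted above; the function names are illustrative, and the handling of pH exactly equal to 8 is an assumption because the text only distinguishes "less than" and "exceeds".

```python
def predict_first_split(ph):
    """Node mean (mg kg^-1) after the root split on pH only, as read from Fig. 2.

    Assumption: a pH of exactly 8 is sent to the high-pH branch.
    """
    return 200.0 if ph < 8 else 110.0


def pooled_mean(groups):
    """Size-weighted mean of several (n_rows, node_mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


# 110 rows averaging 200 mg kg^-1 and 30 rows averaging 110 mg kg^-1.
root_mean = pooled_mean([(110, 200.0), (30, 110.0)])
```

Under these figures the pooled value is 25300 / 140, roughly 181 mg kg⁻¹, which is the prediction the tree would give before any split is made.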
The minimum validation relative error (or validation cost) occurs
with five nodes, at a value of 0.26 and a validation standard error of
0.02. This validation error value is the cost relative to the cost for
a tree with one node. It is the best measure of how well the tree will
fit an independent dataset different from the learning dataset.
4.2. Evaluation of the influence of independent parameters on soil Zn content
The relative importance of the predictor parameters is as follows:
pH (100% in relative importance), surroundings
of waste areas (90%), proximity to roads (80%), nearness to cities
(50%), distance to the drainage (25%), lithology (24%), land cover/use
(14%), slope gradient (10%), conductivity (7%), soil type (7%),
organic matter (5%), and soil depth (5%). pH appears as the master
parameter (generating the split from the parent node) in the whole
process of zinc pollution, since it is strongly correlated with
soil types and parent materials influencing cation mobility, and it
regulates the solubility of zinc in soils. The spatial variability of zinc
in topsoils is affected by extrinsic aspects (roads, waste areas and cities)
and intrinsic aspects (drainage lines and soil parent materials). In
addition, there is a good correspondence between zinc distribution
and land cover/use, especially land use related to long-standing anthropic
activity (e.g., cultivation resulting in chemical anomalies). The slope
gradient has demonstrated a certain impact by influencing the
accumulation of zinc in mosses. Similarly, hydraulic conductivity
and soil type (texture) control zinc distribution, but to a lesser
extent (7%). In clay-rich, low-permeability soils, heavy metals are
bound to soil particles, and thus there are only a few toxicity risks.
Organic matter and soil depth have similar effects on the zinc
concentration. The former influences heavy metal absorption in
soils; this effect is probably due to the cation exchange capacity of
organic material. In addition, since heavy metals come from parent
materials, their concentration will typically increase with soil depth
due to two processes. One is soil formation from rock parent
materials, which have higher concentrations of heavy metals; the
other is deposition of materials at the soil surface, such as plant
organic matter, which typically contains low heavy metal
concentrations. As the soil materials from these two processes mix,
a gradient of increasing concentration is created through the soil
profile. The other parameters (i.e. slope length and stoniness ratio)
play no part in building the regression-tree, and their effect is
totally masked.
4.3. Production of the zinc concentration map
The predictive concentration map of zinc (Fig. 3), at a 1:50.000
cartographic scale, was produced using the results of the built
decision-tree model. This map was divided into five concentration
classes having an equal range of distribution (Table 2). This division
seems necessary to prioritize the measures needed to reduce
the level of zinc concentration in contaminated areas. In this map,
class 3 (with a zinc concentration ranging between 100 and
150 mg kg⁻¹) covers the largest area (39%), being dispersed in the
Fig. 3. Predicted zinc concentration map of the study area.
studied region (Table 2). This indicates a widespread higher risk of
contamination if no remedial measures are applied. Classes 4 (high
contamination) and 5 (very high contamination) occur in 31.5% of
the area, but they have a far larger impact since they are distributed
as patches in densely populated areas.
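The class assignment and the area shares can be sketched as follows, using the class limits and areas listed in Table 2. The names are illustrative, and the choice to place a boundary value (e.g. exactly 100 mg kg⁻¹) in the lower class is an assumption, since the table does not state how ties are handled.

```python
from bisect import bisect_left

# Upper limits (mg kg^-1) of classes 1 to 4; class 5 is everything above 200.
UPPER_BOUNDS = [50, 100, 150, 200]
LABELS = ["very low", "low", "medium", "high", "very high"]

# Area (km^2) per class as listed in Table 2.
AREAS_KM2 = {1: 5, 2: 53, 3: 75, 4: 46, 5: 16}


def zinc_class(conc_mg_kg):
    """Return (class number, label); boundary values fall in the lower class."""
    idx = bisect_left(UPPER_BOUNDS, conc_mg_kg)
    return idx + 1, LABELS[idx]


def area_shares(areas):
    """Percentage of the total mapped area covered by each class."""
    total = sum(areas.values())
    return {cls: 100 * area / total for cls, area in areas.items()}


shares = area_shares(AREAS_KM2)
high_share = shares[4] + shares[5]
```

Recomputing the shares from the listed areas gives values within about half a percentage point of those in Table 2, presumably because the published areas are rounded to whole square kilometres.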
4.4. Accuracy of the zinc concentration map
The predicted Zn values were verified against test data. The
confusion matrix between the measured zinc concentration classes
and the modeled ones indicates a good overall accuracy of 78%
(Table 3). This accuracy value denotes the number of correctly
modeled sites divided by the total number of sites. It differs
from the explained variance of the built decision-tree model (88%),
since it validates the whole approach, which combines
the integration of the landscape unit map, the quantification of terrain
parameters, and decision-tree modeling. In contrast, the explained
variance reflects only the field/laboratory measurements of zinc
concentration chosen for model training. The user's accuracy (Au),
i.e. the percentage of sites belonging to a model class that correctly
correspond to the reference data, read along columns, ranges
from 50% to 87.5% (50% being the result of dividing 4 by 8 and
multiplying by 100; 87.5% is derived from dividing 7 by 8 and
multiplying by 100). Omission errors (deficits) (Eo), computed
along columns, correspond to the distribution of sites of a class,
derived from modeling, in the various classes of reference data.
They are equal (in %) to 100% − user's accuracy. They vary between
12.5% and 50%. The producer's accuracy (Ap), the percentage of sites
belonging to a reference class correctly classified by the model, read
along the rows, is 67–88%. Commission errors (excesses) (Ec),
computed along rows, correspond to the distribution of a reference
class among the various classes derived from modeling. Being equal to
100% − producer's accuracy, they range between 12% and 33%.
Modeling often overestimates zinc concentration levels, while
underestimation is rare; this can be considered a positive point for
management planning, because the possibility of
overlooking actual zinc pollution risks decreases.
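These accuracy measures can be recomputed from the counts in Table 3. The sketch below is a plain recomputation with illustrative names, with rows taken as field/laboratory classes and columns as modeled classes, following the table layout.

```python
# Error matrix from Table 3: rows = field/laboratory class, columns = modeled class,
# both ordered very low, low, medium, high, very high.
MATRIX = [
    [7, 2, 1, 0, 0],
    [1, 8, 2, 0, 0],
    [0, 0, 23, 3, 0],
    [0, 0, 0, 4, 2],
    [0, 0, 1, 1, 5],
]


def accuracy_metrics(matrix):
    """Overall, user's (per column) and producer's (per row) accuracy in percent."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    diagonal = [matrix[i][i] for i in range(n)]
    return {
        "overall": 100 * sum(diagonal) / sum(row_totals),
        "users": [100 * d / c for d, c in zip(diagonal, col_totals)],
        "producers": [100 * d / r for d, r in zip(diagonal, row_totals)],
    }


metrics = accuracy_metrics(MATRIX)

# Sites modeled in a higher class than observed (above the diagonal) versus lower.
overestimated = sum(MATRIX[i][j] for i in range(5) for j in range(5) if j > i)
underestimated = sum(MATRIX[i][j] for i in range(5) for j in range(5) if j < i)
```

The diagonal sums to 47 of 60 sites, and the counts above the diagonal (10 sites) outnumber those below it (3 sites), which is consistent with the statement that the model tends to overestimate rather than underestimate.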
4.5. Advantages and problems of the constructed model
Our regression-tree model (88% explained variance) has defined a map of
zinc concentration with five classes for a region situated in the
northern part of Lebanon (195 km²). Such a map was not available
in Lebanon, nor in many other countries. It represents the result of
modeling from geo-environmental characteristics and can meet the
scientific needs of researchers and decision-makers for exploring
land-related problems.
The explained model variance could be improved by adding further
detail to the predictor parameters, such as soil structure, soil
compactness, slope aspect, slope curvature, etc. This is an important
future research topic, since the contribution of such variables to
explaining additional variance can be tested. This model can be
extrapolated to other areas in the country if the functional capacities
of GIS are used, because they allow the integration of several
parametric maps for producing landscape unit maps, on which zinc
concentration measurements can be determined.
The concept of decision-tree modeling can also be tested for
other heavy metals (e.g., copper, lead, cadmium, etc.). However,
a major difficulty arises from the coarse scale of some of the
parametric maps used for producing the zinc concentration map.
5. Conclusion
The constructed decision-tree model enabled, for the first time,
the mapping of predicted zinc concentrations in a region of
Lebanon at a scale of 1:50.000, based on geo-environmental
characteristics (e.g., topography, geology, soil and land cover/use).
The modeling approach was easily implemented with available GIS
software and is suitable for data exploration and predictive zinc
concentration mapping. It is explicit and can be critically evaluated
and revised when necessary. It can also easily be extrapolated to
other Mediterranean countries undergoing socio-economic change.
This decision-tree approach did not produce concentration
maps of soil zinc pollution with accuracies substantially higher
than those reported in the literature for other methods and/or
models (Facchinelli et al., 2001; Li et al., 2001; Norra et al., 2001;
Zhang, 2005; Franco et al., 2006). However, it can be considered
a quick, simple, realistic, and informative method for combining
geo-environmental variables in order to generate a map describing
the potential zinc concentration (predictive map). This map can be
used to prioritize the choice of study areas for further measurement
and modeling and may in the short term help with the selection
and adoption of measures to reduce the contamination of densely
populated areas and arable lands. Although the chosen scale of this
map (1:50.000) seems to be sufficient for estimating the zinc
concentration when considering strategies for land protection, the map
can be improved for more localized contamination assessment if
more detailed data sets are available [higher resolution DTMs and
more detailed GIS parametric maps (predictors)].
References
Adachia, K., Tainoshob, Y., 2004. Characterization of heavy metal particles
embedded in tire dust. Environmental International 30, 1009–1017.
Berk, R.A., 2003. An Introduction to Ensemble Methods for Data Analysis. UCLA
Department of Statistics. Technical Report.
Bocca, B., Alimonti, A., Petrucci, F., Violante, N., Sancessario, G., Forte, G., 2004.
Quantification of trace elements by sector field inductively coupled plasma
spectrometry in urine, serum, blood and cerebrospinal fluid of patients with
Parkinson's disease. Spectrochimica Acta B59, 559–566.
Table 3
Error matrix for measuring the accuracy of the map predicting zinc concentration.
Rows: field observation/laboratory analysis. Columns: modeled map.

Zinc concentration   Very low   Low    Medium   High   Very high   Total    Ap (%)   Ec (%)
Very low (a)         7          2      1        0      0           10       70       30
Low                  1          8      2        0      0           11       73       27
Medium               0          0      23       3      0           26       88       12
High                 0          0      0        4      2           6        67       33
Very high            0          0      1        1      5           7        71       29
Total                8          10     27       8      7           60
Au (%)               87.5       80     85       50     71          Σ = 47
Eo (%)               12.5       20     15       50     29          OA = 78%

Au = user's accuracy, Eo = omission errors (deficits), Ap = producer's accuracy,
Ec = commission errors (excesses), OA = overall accuracy.
Values on the diagonal (bold italics in the original) indicate sites classified correctly; Σ = total number of correctly
modeled sites.
(a) Very low zinc concentration: ≤50 mg kg⁻¹; Low: 50–100 mg kg⁻¹; Medium:
100–150 mg kg⁻¹; High: 150–200 mg kg⁻¹; Very high: >200 mg kg⁻¹.
Table 2
Classes of zinc concentration and corresponding areas for the studied region.

Classes         Zinc concentration   Area (km²)   % of the study area
1 (very low)    ≤50 mg kg⁻¹          5            2.5
2 (low)         50–100 mg kg⁻¹       53           27
3 (medium)      100–150 mg kg⁻¹      75           39
4 (high)        150–200 mg kg⁻¹      46           23.5
5 (very high)   >200 mg kg⁻¹         16           8
Bou Kheir, R., Girard, M.-C., Khawlie, M., 2004. Use of a structural classification
OASIS for the mapping of landscape units in a representative region of Lebanon.
Canadian Journal of Remote Sensing 30 (4), 617–630 (in French).
Bou Kheir, R., Wilson, J., Deng, Y., 2007. Use of terrain variables for predictive gully
erosion mapping in Lebanon. Earth Surface Processes and Landforms 32,
1770–1782.
Bou Kheir, R., Chorowicz, J., Abdallah, C., Dhont, D., 2008. Soil and bedrocks
distribution estimated from gully form and frequency: a GIS-based decision-tree
model for Lebanon. Geomorphology 93, 482–492.
Breiman, L., 2001. Decision-tree forests. Machine Learning 45 (1), 5–32.
Chow, C.W.K., Davey, D.E., Mulcahy, D.E., 1997. A neural network approach to zinc
and copper interferences in potentiometric stripping analysis. Journal of
Intelligent Material Systems and Structures 8 (2), 177–183.
De Temmerman, L., Vanongeval, L.B., Hoenig, M., 2003. Heavy metal content of
arable soil in Northern Belgium. Water, Air and Soil Pollution 148, 61–76.
DGA, 1963. Topographic Maps at a Scale 1:20.000. Direction of Geographic Affairs,
Republic of Lebanon.
Dubertret, L., 1945. Geological Maps of Lebanon at 1:50.000 Scale. Ministry of Public
Affairs, Republic of Lebanon.
Facchinelli, A., Sacchi, E., Mallen, L., 2001. Multivariate statistical and GIS-based
approach to identify heavy metal sources in soils. Environmental Pollution 114,
313–324.
Franco, C., Soares, A., Delgado, J., 2006. Geostatistical modelling of heavy metal
contamination in the topsoil of Guadiamar river margins (S Spain) using
a stochastic simulation technique. Geoderma 136, 852–864.
Franklin, J., 1998. Predicting the distributions of shrub species in California chaparral
and coastal stage communities from climate and terrain-derived variables.
Journal of Vegetation Science 9, 733–748.
Friedl, M.A., Brodley, C.E., Strahler, A.H., 1999. Maximizing land cover classification
accuracies produced by decision trees at continental to global scales. IEEE
Transactions on Geoscience Remote Sensing 37, 969–977.
Fuge, R., 2005. Anthropogenic sources. In: Selinum, O. (Ed.), Essentials of Medical
Geology: Impacts of the Natural Environment on Public Health. Academic Press,
Amsterdam, pp. 43–60.
Gèze, B., 1956. Soil Map of Lebanon at a Scale of 1:200.000, Explicative Note, 1956.
Republic of Lebanon. Ministry of Agriculture, Beirut, Lebanon.
Henderson, B.L., Bui, E.N., Moran, C.J., Simon, D.A.P., 2005. Australia-wide predictions
of soil properties using decision trees. Geoderma 124, 383–398.
Huang, X., Jensen, J.R., 1997. A machine-learning approach to automated knowledge-base
building for remote sensing image analysis with GIS data. Photogrammetric
Engineering & Remote Sensing 63, 1185–1194.
Kandrika, S., 2008. Land use/land cover classification of Orissa using multi-temporal
IRS-P6 awifs data: a decision tree approach. International Journal of Applied
Earth Observation and Geoinformation 10 (2), 186–193.
Li, X.D., Poon, C.S., Liu, P.S., 2001. Heavy metal contamination of urban soils and
street dusts in Hong Kong. Applied Geochemistry 16, 1361–1368.
Lin, Y.P., Teng, T.P., Chang, T.K., 2002. Multivariate analysis of soil heavy metal
pollution and landscape pattern in Changhua County in Taiwan. Landscape and
Urban Planning 62, 19–35.
LNCSR-LMoA, 2002. Land Cover/Use Map of Lebanon at a Scale of 1:20.000. Lebanese
National Council for Scientific Research and Lebanese Ministry of
Agriculture.
Loh, W.Y., Shih, Y.S., 1997. Split selection methods for classification trees. Statistica
Sinica 7, 815–840.
Manta, D.S., Angelone, M., Bellanca, A., Neri, R., Sprovieria, M., 2002. Heavy metals
in urban soils: a case study from the city of Palermo (Sicily), Italy. Science of the
Total Environment 300 (1–3), 229–243.
Manzoor, S., Munir, H.S., Shaheen, N., Khalique, A., Jaffar, M., 2006. Multivariate
analysis of trace metals in textile effluents in relation to soil and groundwater.
Journal of Hazardous Materials A137, 31–37.
McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using environmental
correlation. Geoderma 89, 67–94.
Michaelsen, J., Schimel, D.S., Friedl, M.A., Davis, F.W., Dubayah, R.C., 1994. Regression
tree analysis of satellite and terrain data to guide vegetation sampling and
surveys. Journal of Vegetation Science 5, 673–686.
Murphy, P., Patrick, M., Michael, J.P., 1994. Exploring the decision forest: an
empirical investigation of Occam's Razor in decision-tree induction. Journal of
Artificial Intelligence Research 1, 257–275.
Navas, A., Machin, J., 2002. Spatial distribution of heavy metals and arsenic in soils
of Aragon (Northeast Spain): controlling factors and environmental implications.
Applied Geochemistry 17, 961–973.
Norra, S., Weber, A., Kramer, U., Stuben, D., 2001. Mapping of trace metals in urban
soils. Journal of Soil Sediment 1, 77–97.
Reimann, C., de Caritat, P., 2005. Distinguishing between natural and anthropogenic
sources for elements in the environment: regional geochemical surveys
versus enrichment factors. The Science of the Total Environment 337 (1–3),
91–107.
Saadia, R.T., Munir, H.S., Shaheen, N., Khalique, A., Manzoor, S., Jaffar, M., 2006.
Multivariate analysis of trace metal levels in tannery effluents in relation to soil
and water: a case study from Peshawar, Pakistan. Journal of Environmental
Management 79, 20–29.
Saha, A.K., Gupta, R.P., Arora, M.K., 2002. GIS-based landslide hazard zonation in the
Bhagirathi (Ganga) valley, Himalayas. International Journal of Remote Sensing
23 (2), 357–369.
Schroder, W., 2006. GIS, geostatistics, metadata banking, and tree-based models for
data analysis and mapping in environmental monitoring and epidemiology.
International Journal of Medical Microbiology 296 (S1), 23–36.
Smolders, E., Degryse, F., 2002. Fate and effect of zinc from tire debris in soil.
Environmental Science and Technology 36 (17), 3706–3710.
Venables, W.M., Ripley, B.D., 1994. Modern Applied Statistics with S-Plus. Springer,
New York.
Waisberg, M., Joseph, P., Hale, B., Beyersmann, D., 2003. Molecular and cellular
mechanisms of cadmium carcinogenesis. Toxicology 192, 95–117.
Zhang, C., 2005. Using multivariate analyses and GIS to identify pollutants and their
spatial patterns in urban soils in Galway, Ireland. Environmental Pollution 142,
501–511.
Zhang, H., Burton, S., 1999. Recursive Partitioning in the Health Sciences. Springer,
New York, 226 pp.