
Spatial soil zinc content distribution from terrain parameters: A GIS-based decision-tree model in Lebanon


Rania Bou Kheir a,b,*, Mogens H. Greve b, Chadi Abdallah c, Tommy Dalgaard b
a Lebanese University, Faculty of Letters and Human Sciences, Department of Geography, GIS Research Laboratory, P.O. Box 90-1065, Fanar, Lebanon
b Department of Agroecology and Environment, Faculty of Agricultural Sciences (DJF), Aarhus University, Blichers Allé 20, P.O. Box 50, DK-8830 Tjele, Denmark
c National Council for Scientific Research, Remote Sensing Center, P.O. Box 11-8281, Beirut, Lebanon
GIS regression-tree analysis explained 88% of the variability in field/laboratory zinc concentrations.
Article info
Article history:
Received 22 February 2009
Received in revised form
3 July 2009
Accepted 26 August 2009
Keywords:
Zn concentration
Soil pollution
Decision-tree model
GIS
Terrain analysis
Abstract
Heavy metal contamination has been and continues to be a worldwide phenomenon that has attracted
a great deal of attention from governments and regulatory bodies. In this context, our study proposes
a regression-tree model to predict the concentration level of zinc in the soils of northern Lebanon (as a case
study of Mediterranean landscapes) under a GIS environment. The developed tree-model explained 88% of
variance in zinc concentration using pH (100% in relative importance), surroundings of waste areas (90%),
proximity to roads (80%), nearness to cities (50%), distance to drainage line (25%), lithology (24%), land
cover/use (14%), slope gradient (10%), conductivity (7%), soil type (7%), organic matter (5%), and soil depth
(5%). The overall accuracy of the quantitative zinc map produced (at 1:50.000 scale) was estimated to be
78%. The proposed tree model is relatively simple and may also be applied to other areas.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Soil contamination by metals has become a widespread and serious
problem in many parts of the world, including Mediterranean
environments. The reasons are manifold: an increase in irrigation
using waste water; the uncontrolled application of sewage sludge,
industrial effluents, pesticides and fertilizers; rapid urbanization;
the atmospheric deposition of dust and aerosols; and vehicular
emissions, to mention but a few. Due to the non-biodegradability of
heavy metals and their long biological half-lives for elimination,
their accumulation in the food chain will have a significant effect on
human health in the long term (Fuge, 2005). Past studies have
revealed that human exposure to high concentrations of heavy
metals leads to their accumulation in the fatty tissues of the
human body and affects the central nervous system, or the heavy
metals may be deposited in the circulatory system and disrupt the
normal functioning of the internal organs (Waisberg et al., 2003;
Bocca et al., 2004). In order to prevent further environmental
deterioration and adverse health effects and to examine possible
methods of remediation, an identification of the spatial distribution
of the contaminated areas, especially those affected by zinc (Zn),
is urgently needed.
The spatial variability of Zn topsoil contents may be affected by
soil parent materials and anthropogenic sources. The natural
background concentration of zinc in dry soils varies between
0 mg kg⁻¹ and 18 mg kg⁻¹ (De Temmerman et al., 2003). The
problems associated with the characterization of Zn in the majority
of sites are often due to multiple sources including agricultural
practices, enhanced urbanization, industrialization, and mining
activities. Zinc is used as an additive in the vulcanization process to
strengthen crude rubber in tyre manufacturing (Adachi and
Tainosho, 2004). Tyre wear, motor oil, grease, brake emissions,
and corrosion of galvanized parts may contribute to the high Zn
content in roadside soils (Smolders and Degryse, 2002; Reimann
and de Caritat, 2005). Crash barriers and roofs are also considered
a major source of pollutants, charging running water with heavy
metals, especially zinc (1045 mg/m/year) (Manta et al., 2002).
In this context, our study proposes a statistical regression
decision-tree model that predicts, in a simple, realistic and practical
way, the spatial distribution and concentration of soil zinc, based on
the analysis of terrain parameters (predictors) likely to affect soil
zinc concentration and the quantification of their weights using
Geographic Information Systems (GIS) in a study area in Lebanon. It
comprises a set of rules to classify (predict) a dependent target
variable (zinc concentration in mg kg⁻¹) using the values of
independent terrain parameters. These predictors are
both ordinal (e.g., slope gradient) and nominal (e.g., lithology)
features; they cover natural (e.g., soil type) and anthropic (e.g.,
proximity to roads) parameters. The machine-learning, probabilistic,
non-parametric decision-tree method has been extensively
exploited for vegetation mapping (Kandrika, 2008), ecological
modeling (Michaelsen et al., 1994), gully erosion modeling (Bou
Kheir et al., 2007), soil mapping (Henderson et al., 2005; Bou Kheir
et al., 2008), epidemiological monitoring (Schroder, 2006), and
remote sensing studies such as land-use classification based on
threshold values of various band data (Huang and Jensen, 1997;
Friedl et al., 1999). It is essentially an exploratory method (Venables
and Ripley, 1994), and it has been used predictively (Franklin, 1998)
without the need to specify the form of a function fitted to the data.
The zinc concentration map resulting from the conversion of the
decision-tree model, at 1:50.000 cartographic scale, serves as a useful
inventory for land-use management and environmental
decision-making.
* Corresponding author at: Lebanese University, Faculty of Letters and Human
Sciences, Department of Geography, GIS Research Laboratory, P.O. Box 90-1065,
Fanar, Lebanon. Tel.: 961 3 982 848.
E-mail addresses: rania.boukheir@agrsci.dk, boukheir@gmail.com (R. Bou Kheir).
Contents lists available at ScienceDirect
Environmental Pollution
journal homepage: www.elsevier.com/locate/envpol
ARTICLE IN PRESS
0269-7491/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.envpol.2009.08.009
Environmental Pollution xxx (2009) 1–9
Please cite this article in press as: Bou Kheir, R., et al., Spatial soil zinc content distribution from terrain parameters: A GIS-based decision-tree
model in Lebanon, Environ. Pollut. (2009), doi:10.1016/j.envpol.2009.08.009
2. Study area
The study area (Fig. 1), located in the northern part of Lebanon,
extends over 195 km² and exhibits a variety of soils developed on
diverse lithologies, various land cover/use modes and under a wide
range of climatic conditions. It is also susceptible to Zn pollution
from various human activities. For this reason, it has been selected
for developing and testing the decision-tree methodology.
The physiography of the region is highly varied. Contrasting
landscapes are found from the mountain environment in
the eastern and central parts of the area to the western coastal
plains. This diversity is paralleled by the distribution of vegetation
cover from forests (mainly composed of Pinus pinea, 14.5%, and
Quercus calliprinos, 6.5%) to sparse shrubby (20%) and herbaceous
(7.5%) vegetation. The agricultural land (27%) is less widespread
than natural vegetation, but pesticides are commonly used in high
quantities in order to increase yields. In addition, the study area has
a well-established network of highways and roads, and many
residential, commercial and industrial buildings (17%) are erected
alongside the roads.
The outcropping stratigraphic sequence exposes rock formations
spanning from the Middle Jurassic to more recent times.
According to the 1:50.000 geological maps of Lebanon (Dubertret,
1945), fourteen rock units are present in the study area (Fig. 1). The
highly fractured and jointed limestone and dolomitic limestone
outcrops of the Cenomanian (C4) form around 32% of the area,
followed in importance by the dolomite, limestone and dolomitic
limestone of the Kimmeridgian (J6, 25%). One perennial coastal
river (Nahr El-Jawz) crosses the study area, flowing seawards with
an E–W orientation. The climate is typically Mediterranean with
a mean annual precipitation between 900 and 1100 mm, and
a significant amount of rain (75–80%) falling in the winter period
(November–March). The terrains are often unprotected during
winter, which is the critical period for water runoff and the leaching
of pollutants from soils.
The considerable diversity in parent materials, along with
geomorphic features, land cover/use and climate, has conditioned
the development of a wide variety of soils. A total of 18 different soil
units were identified on the available soil map (Gèze, 1956), of
which 15 are soil series and three are soil associations. Three of
these cover 71% of the soils in the study area (discontinuous red
mountainous soils, 55%; mixed soils on alternating marl and
limestone, 8%; and sandy soils, 8%).
Fig. 1. Location of the study area within Lebanon. [Map legend: faults; basalts; Quaternary: Quaternary deposits (Q); Paleogene: Miocene (m); Cretaceous: Senonian (C6), Turonian (C5), Cenomanian (C4), Albian (C3), Aptian (C2), Barremian (C1); Jurassic: Kimmeridgian (J).]
3. Modeling approach
The quantitative soil mapping of zinc distribution was generated
in several steps. Base maps were used to produce the landscape
unit map, which was combined with field surveys and laboratory
analyses for allocating site locations and determining their zinc
concentration (mg kg⁻¹). The obtained point location layer was
then intersected with maps of predictor terrain parameters.
A decision-tree model was then explored on the result of this
intersection combining, for field locations, zinc concentration and
the corresponding parameters. This model was used for quantitative
mapping of zinc (mg kg⁻¹) within the soils of the study area in
a GIS environment.
3.1. Model choice
The decision-tree model is a logical model (deductive reasoning)
represented as a binary (two-way split) tree that shows how the
value of a variable, named the target variable, can be predicted by
using the values of a set of predictor variables. It has a number of
advantages when compared to numerically oriented techniques
such as linear and nonlinear regression (function fitting), logistic
regression, artificial neural networks (ANNs), and genetic algorithms
(McKenzie and Ryan, 1999; Breiman, 2001). Decision-trees are easy
to build and interpret, and can automatically handle interactions
between both continuous (ordinal, interval) and categorical
(nominal) variables. Linear regression, for instance, is appropriate
only if the data can be modeled by a straight-line function, which is
often not the case. Also, linear regression cannot easily handle
categorical variables, nor is it easy to look for interactions between
variables. As with linear regression, nonlinear and logistic
regressions are not well suited for categorical variables.
Decision-trees can identify the most decisive variables, which
are those that are used for creating the splits near the top of the
tree. In addition, they do not require the specification of the form of
a function to be fitted to the data as is necessary for other
competing procedures (e.g., nonlinear regression).
Artificial Neural Networks (ANNs) are often compared to
decision-tree models because both methods can model data that have
nonlinear relationships between variables, and both can handle
interactions between variables. They have also been used for
assessing soil Zn content (Chow et al., 1997). However, neural
networks have a number of drawbacks. They do not present an
easily understandable model allowing researchers to get the full
explanation of the underlying nature of the data being analyzed. In
addition, they handle only binary categorical input data, and not
those with multiple classes. It is also difficult to incorporate
a neural network model into a computer system without using
a dedicated interpreter for the model. In contrast, once a
decision-tree model has been built, it can be converted to statements
that are implemented easily in most computer languages without
requiring a separate interpreter.
Moreover, decision-trees can identify a target (dependent)
variable. This is not possible with unsupervised machine learning
methods like cluster analysis, factor analysis (principal component
analysis) and statistical measures (Lin et al., 2002; Saadia et al.,
2006; Manzoor et al., 2006), which treat all variables equally
without predicting the value of a variable. These methods rather
look for patterns, groupings or other ways to characterize the data
that may lead to the understanding of the way the data interrelate.
In addition, decision-trees can indicate the relative weight of each
predictor variable in explaining the training data, while bivariate
analysis demonstrates only the implication of a couple of predictor
variables against the target variable. Therefore, if the goal of an
analysis is to predict the value of some variable, then decision-tree
modeling is a recommended approach. But predicting the future
remains a hard task, even with decision-trees.
3.2. Allocation of field sites and zinc distribution
Field surveys coupled with laboratory analyses were conducted
for detailed measurements of zinc concentration at selected sites.
These sites were chosen by a stratified random sampling method to
provide a representative sampling for the whole study area. In this
sampling program, soils from urban, suburban, county park areas
and arable lands were collected based on the different site
conditions. This sampling method thus covers all landscape units, which
differ by at least one of the following variables: geological substrate,
soil type and land cover/use (Bou Kheir et al., 2004). These
landscape units were determined by combining the corresponding GIS
layers (explained below). A sampling density of one sample per km²
was used.
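The stratified random selection described above can be illustrated with a minimal sketch. It assumes that each candidate site is already tagged with its landscape unit (the combination of geological substrate, soil type and land cover/use); the function name, the unit labels and the number of sites per unit are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, per_stratum, seed=0):
    """Draw up to `per_stratum` sites at random from every landscape unit.

    candidates: list of (site_id, landscape_unit) pairs (hypothetical input).
    Returns the list of selected site ids.
    """
    rng = random.Random(seed)
    by_unit = defaultdict(list)
    for site_id, unit in candidates:
        by_unit[unit].append(site_id)

    chosen = []
    for unit in sorted(by_unit):  # fixed order keeps the draw reproducible
        sites = by_unit[unit]
        chosen.extend(rng.sample(sites, min(per_stratum, len(sites))))
    return chosen


# Synthetic example: 20 candidate sites spread over two made-up landscape units.
unit_a = ("C4", "terra rossa", "forest")
unit_b = ("J6", "sandy soil", "arable land")
candidates = [(i, unit_a if i % 2 else unit_b) for i in range(20)]
picked = stratified_sample(candidates, per_stratum=3)
```

Every landscape unit contributes sites to the sample, which is the property the sampling design relies on.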
Geographic locations of all sampling points (200) were determined
using a global positioning system (GPS) with 10-m precision.
At each site, soil samples were taken from the top 20-cm soil layer.
The spatial distribution of Zn content in this layer is greatly
influenced by various anthropic sources such as urbanization,
industrialization, agricultural land uses, vehicular emissions and
atmospheric deposition of dust and aerosols. The collected soil
samples were then stored in polyethylene bags for transport and
storage. These samples were dried in an oven at 50 °C for three
days and subsequently sieved through a 2.0-mm polyethylene sieve
to remove stones, coarse materials and other debris. Soil subsamples
(≈20 g) were placed in a mechanical agate grinder and finely
ground (<200 μm). The prepared soil samples were then stored in
polyethylene bags in desiccators for chemical analysis.
The ground soil samples were analyzed for zinc using a strong
acid digestion method. Approximately 0.5 g of the soil samples was
weighed and placed in pre-cleaned Pyrex test tubes. Concentrated
nitric acid (8 ml) and 3 ml of concentrated perchloric acid were
added. The mixtures were heated in an aluminium block at 200 °C
for 3 h until they were completely dry. After the test tubes were
cooled, 10.0 ml of 5% HNO3 was added and heated at 70 °C for 1 h
with occasional mixing. Upon cooling, the mixtures were decanted
into polyethylene tubes and centrifuged at 3500 rpm for 10 min.
Zinc concentrations of the solutions were measured using
inductively coupled plasma-atomic emission spectrometry (ICP-AES;
Perkin-Elmer 3300 DV).
3.3. Collection of predictor terrain parameters of zinc distribution
Soil Zn accumulation is driven by terrain characteristics, industrial
and agricultural land uses, soil properties and enhanced
urbanization (Facchinelli et al., 2001; Li et al., 2001; Navas and
Machin, 2002). For this reason, in the regression-tree analysis, fourteen
terrain parameters were selected as the independent variables,
whereas the class of Zn pollution was the dependent variable. The
generated parameters (Table 1), i.e. lithology, slope gradient, slope
length, distance to the drainage line, soil type, pH, conductivity,
organic matter, stoniness ratio, soil depth, land cover/use,
proximity to roads, surroundings of waste areas, and nearness to cities,
were extracted from satellite imagery, digital terrain models
(DTMs) and/or ancillary maps. The rules for quantifying and
understanding the impact of these terrain parameters on Zn content
could be acquired from the training data through the built
regression-tree model. The pollution class for Zn at unsampled
locations could be predicted based on these rules. A synoptic
classification of five equal-interval classes was plotted for each of
the considered parameters (with the exception of the lithologies and
soil types) (Table 1) to match the pollution classes of Zn in the
produced quantitative map.
3.3.1. Parent material
The natural concentration of Zn in soils depends primarily on the
geochemistry of the parent material (De Temmerman et al., 2003).
This can exhibit high spatial variability over heterogeneous
lithologies. Information on lithology was extracted from the scanned and
registered geological map of Lebanon at 1:50.000 scale (Dubertret,
1945).
3.3.2. Slope gradient and length
Topography reflects soil types and the likely transport processes of
heavy metal pollution across the landscape. Slope gradient and
slope length were derived from a mosaic of digital terrain models
(DTMs) with a planimetric resolution of 50 m, generated for the
study area from topographic maps at a 1:20.000 scale with an
elevation contour interval of 20 m (DGA, 1963) using ArcGIS
software. Five classes of slope were distinguished: <8%, 8–15%,
15–30%, 30–60%, and >60%. These classes were determined using
a clustering method based on the frequency distribution of slopes
and the breaks in slope that influence the formation of various soils.
Five classes of slope length were also distinguished, with an equal
interval of 50 m between each class (Table 1).

Table 1
The different terrain parameters (predictors) likely to impact on the zinc concentration and their corresponding classes.

Class no.   Signification

(a) Lithology
1           J6 (dolomite and dolomitic limestone)
2           bJ6 (basalt)
3           C1 (calcareous sandstone and intercalations of siltstone and clays)
4           C2 (clastic limestone, limestone and dolomitic limestone)
5           C3 (shaley limestone and marl)
6           C4 (limestone, marly limestone and dolomitic limestone)
7           C5 (marly limestone and limestone)
8           C6 (chalky marl and marly limestone)
9           bC (basalt)
10          m2 (marly limestone and marl)
11          qc (colluvial deposits)
12          qd (fixed dunes)
13          qta (alluvium)
14          bq (basalt)

(b) Land cover/use
1..20       Natural vegetation cover
21..41      Agricultural lands
42..45      Bare lands
46..47      Water bodies
47..58      Human practices

(c) Slope gradient
1           <8%
2           8–15%
3           15–30%
4           30–60%
5           >60%

(d) Slope length
1           <50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(e) Distance to drainage line
1           <50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(f) Proximity to roads
1           <50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(g) Surroundings of waste areas
1           <50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(h) Soil type
1           Coastal sand
2           Consolidated dunes
3           Gravel and massive landslide deposits
4           Decalcified and rubified coastal sand
5           Discontinuous red mountainous soils/terra rossa
6           Continuous red mountainous soils/terra rossa
7           Brown soils
8           Yellowish mountain soils
9           Sandy soils
10          Greyey soils
11          Recent fluvial alluvium
12          Mixed soils on marl with bedded limestone
13          Mixed soils on alternating marl or limestone and sandstone
14          Mixed soils on alternating marl, limestone, sandstone and basalt
15          Black or greyey soils
16          Dolomitic sand associated to terra rossa
17          Yellowish soils associated to greyey soils
18          Sandy soils associated to greyey soils

(i) Soil pH
Extremely to moderately acid      <5.8
Slightly acid                     5.8–6.5
Neutral                           6.5–7.2
Slightly alkaline                 7.2–7.9
Moderately to strongly alkaline   >7.9

(j) Hydraulic conductivity
Rapid                             >42.34
Moderately rapid                  14.11–42.34
Moderate                          4.23–14.11
Moderately slow                   1.41–4.23
Slow                              <1.41

(k) Nearness to cities
1           <50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(l) Organic matter content
1           <0.5%
2           0.5–1%
3           1–1.5%
4           1.5–2%
5           >2%

(m) Stoniness ratio
1           <5%
2           5–35%
3           35–65%
4           65–95%
5           >95%

(n) Soil depth
Very shallow                      <0.25 m
Shallow                           0.25–<0.5 m
Moderate                          0.5–<1.0 m
Deep                              1.0–<1.5 m
Very deep                         1.5–5 m
3.3.3. Distance to drainage line
Drainage networks, which can be considered a major source of
pollution, were extracted from topographic maps at a scale of
1:20.000 (DGA, 1963). A topology was built for the drainage systems
and then buffered. The influence of drainage was given to the buffer
zone up to a distance of 50 m from the closest drainage line (Saha et al.,
2002). Thus, five classes were determined in the study area (Table 1).
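The 50-m buffer classes used for the distance-based parameters can be expressed as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and the treatment of values lying exactly on a class boundary (assigned here to the lower class) is an assumption, since the class limits listed in Table 1 do not settle it.

```python
from bisect import bisect_left

# Upper limits (in metres) of classes 1 to 4; anything beyond 200 m is class 5.
BREAKS_M = [50, 100, 150, 200]


def distance_class(distance_m):
    """Return the buffer class (1 to 5) for a distance in metres.

    Assumption: a value exactly on a break (e.g. 100 m) falls in the lower class.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return bisect_left(BREAKS_M, distance_m) + 1


classes = [distance_class(d) for d in (10, 75, 120, 180, 350)]
```

The same function would apply to the distance to drainage lines, roads, waste areas and cities, since these parameters share the same class limits.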
3.3.4. Soil properties
Soils in this study area are derived from different parent materials,
and this is a key factor determining their natural content of
heavy metals. Soil types were represented by a digital registered
form of the soil map of Lebanon established by Gèze (1956) at
a 1:200.000 scale. Soil pH, hydraulic conductivity and organic
matter content were analyzed, respectively, by soil suspension in
water and potassium chloride (ratio 1:2), the double-ring method, and
oxidation with potassium dichromate (K2Cr2O7). Soil depth was
measured through a sounding by an auger at each site. Stoniness
ratio (which is the relative proportion of stones on the soil surface)
was determined by visual observations with five classes: <5%,
5–35%, 35–65%, 65–95%, and >95%.
3.3.5. Land cover/use
Agricultural practices can add Zn to soils through application of
manure or inorganic fertilizers. Repeated use of Zn-based chemicals
in orchard and arable lands can also result in soil Zn
contamination. The land cover/use parameter was estimated from
a recent land cover/use map at 1:20.000 scale. This map was
produced through visual interpretation of high-resolution Indian
satellite images IRS (5 m) acquired in October 1998 and based on
CORINE (Coordinated Information on the European Environment)
Land Cover methodology (level 4) (LNCSR-LMoA, 2002). Fifty-eight
classes were plotted belonging to five major categories: 1) natural
vegetation cover, 2) agricultural lands, 3) bare lands, 4) water
bodies and 5) human practices.
3.3.6. Proximity to roads
Roads are usually sites of anthropogenically induced pollution
(e.g., vehicular emissions). For this reason, roads were included in
this study. Thus, a buffer zone of 50 m above the road (maximum
height of a talus cut created by the construction of a road) was
created, and five classes were considered, ranging from less than
50 m to more than 200 m from the road.
3.3.7. Surroundings of waste areas
The concentration of zinc may also result from burning of waste
in the environment. The influence of waste areas on Zn accumulation
can reach several meters to several kilometres from the point
source depending on the industry involved and it is not easy to
determine the distance affected. Here, 50 m, 100 m, 150 m, 200 m,
and more than 200 m were selected as the buffer distance around
the waste areas.
3.3.8. Nearness to cities
Recent enhanced urbanization and industrialization have
contributed to increased heavy metal contamination, including Zn,
in ecosystems. In this study, a buffer zone of 50 m
around the city or town boundary was selected.
3.4. Statistical analysis
An ASCII point data file for the field sample locations was
constructed. This file comprises several columns: geographic site
coordinates (x, y), the zinc concentration in mg kg⁻¹ (target
variable), and various predictor parameters being both ordinal (pH,
conductivity, organic matter, stoniness ratio, soil depth, slope
gradient, slope length, distance to the drainage line, proximity to
roads, surroundings of waste areas, and nearness to cities) and
nominal (lithology, soil type, land cover/use). A decision-tree model
was explored using this file with several steps: (1) find the best
possible split through the examination of each predictor, (2) create
two child nodes, (3) determine into which child node each field
location goes, and (4) continue the process until the criterion of the
minimum node size is fulfilled.
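The four steps listed above can be sketched as a small recursive routine. This is an illustration rather than the software used in the study: it handles numeric predictors only, and it scores candidate splits by the summed squared error of the two child nodes (a common regression-tree criterion, swapped in here as an assumption). The function names, the row format and the data are all synthetic.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target, predictors):
    """Step 1: examine each predictor and return (name, threshold, score)."""
    best = None
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2
            left = [r[target] for r in rows if r[name] <= threshold]
            right = [r[target] for r in rows if r[name] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


def grow(rows, target, predictors, min_node_size=10):
    """Steps 2 to 4: create child nodes, route rows, recurse until nodes are small."""
    values = [r[target] for r in rows]
    node = {"n": len(rows), "mean": sum(values) / len(values)}
    if len(rows) < min_node_size or sse(values) == 0.0:
        return node  # terminal node
    split = best_split(rows, target, predictors)
    if split is None:  # every predictor is constant in this node
        return node
    name, threshold, _ = split
    node["split"] = (name, threshold)
    node["left"] = grow([r for r in rows if r[name] <= threshold],
                        target, predictors, min_node_size)
    node["right"] = grow([r for r in rows if r[name] > threshold],
                         target, predictors, min_node_size)
    return node


# Synthetic rows: zinc is higher at the lower pH values (made-up numbers).
rows = [{"pH": (70 + i) / 10, "road_m": 10 * i, "zn": 200.0 if i < 6 else 110.0}
        for i in range(12)]
tree = grow(rows, "zn", ["pH", "road_m"], min_node_size=10)
```

With a minimum node size of 10, the 12 synthetic rows are split once and both children become terminal nodes.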
The number of splits to be evaluated is equal to 2^(k−1) − 1, where
k is the number of categorical classes of a predictor variable
(Breiman, 2001). For example, if the land cover/use with five classes
is considered, fifteen splits need to be tried. With the increase of the
number of classes, there is an exponential growth of splits and
computation time. We considered differences in the value of
a continuous variable up to 1% of the whole range, which is
equivalent to ten thousand classes (Loh and Shih, 1997).
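The split count can be checked by enumeration. The sketch below lists every way of dividing k classes into two non-empty groups and compares the count with 2^(k−1) − 1; the function names and the class labels are illustrative only.

```python
from itertools import combinations


def n_binary_splits(k):
    """Number of two-way splits of k categorical classes: 2^(k-1) - 1."""
    return 0 if k < 2 else 2 ** (k - 1) - 1


def binary_partitions(classes):
    """Enumerate each (left, right) split once by pinning the first class to the left."""
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    partitions = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            right = set(classes) - left
            if right:  # skip the case where every class ends up on the left
                partitions.append((left, right))
    return partitions


# Five major land cover/use categories, as in the example in the text.
categories = ["natural vegetation", "agricultural", "bare", "water", "human"]
splits = binary_partitions(categories)
```

For the five categories the enumeration yields fifteen splits, in agreement with the formula.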
The algorithm used for evaluating the quality of the constructed
tree is the Gini splitting method, which is considered as the default
method (Breiman, 2001). Each split is chosen in order to minimize
the heterogeneity (node impurity) of the classes of the target variable
in the child nodes. The Gini method has been considered slightly better
than the Entropy tree-fitting algorithm (Loh and Shih, 1997;
Breiman, 2001). The Gini coefficient is used to measure the degree
of inequality of a variable in terms of frequency distribution. It
ranges between 0 (perfect equality) and 1 (perfect inequality). The
Gini mean difference (GMD) is defined as the mean of the absolute
difference between each observation and every other observation
(Breiman, 2001):
\[
\mathrm{GMD} = \frac{1}{N^{2}} \sum_{j=1}^{N} \sum_{k=1}^{N} \left| X_j - X_k \right| \tag{1}
\]
where X_j and X_k are the cumulative percentages (or fractions) for
observations j and k, and N is the number of elements (observations).
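As a worked illustration of Eq. (1), the double sum can be evaluated directly. The function name and the input values below are made up for the example.

```python
def gini_mean_difference(x):
    """Eq. (1): mean absolute difference over all ordered pairs (j, k), divided by N^2."""
    n = len(x)
    if n == 0:
        raise ValueError("at least one observation is required")
    total = sum(abs(xj - xk) for xj in x for xk in x)
    return total / n ** 2


equal = gini_mean_difference([0.5, 0.5, 0.5])  # identical values give no difference
mixed = gini_mean_difference([0.0, 1.0])       # (0 + 1 + 1 + 0) / 4
```

Identical observations give a GMD of 0, and the value grows as the observations become more dissimilar.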
A minimum node size of 10 was applied in this study, since
a simpler tree is easier to understand and faster to use and, more
importantly, smaller trees provide greater predictive accuracy for
unseen data. This number has also been used in other studies
(Murphy et al., 1994; Zhang and Burton, 1999).
The maximum tree depth was not specified in this study, as
was the case for earlier programs such as AID (Automatic Interaction
Detection). A tree that is too large needs to be pruned back to its optimal
size (i.e. backward pruning) on the basis of a V-fold cross-validation
(Berk, 2003) with several assumptions such as random-row-holdback
validation, fixed number of terminal nodes, and smooth
minimum spikes. It does not require a separate, independent
dataset, which would reduce the data used to build the tree. It
partitions the dataset used to build the reference unpruned tree into
a number of groups (i.e. folds). A value of 10 V-folds was adopted in
this study, since a larger value increases computation time and may
not result in a more optimal tree.
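The partitioning step of V-fold cross-validation can be sketched as follows. The code only builds the folds and the train/test index lists; fitting and pruning a tree on each training part is left out. The function names are hypothetical, and the figure of 140 rows is an assumption (the 200 sampled sites minus the 60 sites reserved for map validation).

```python
import random


def v_fold_indices(n_rows, v=10, seed=0):
    """Shuffle the row indices and deal them into v folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]


def train_test_splits(n_rows, v=10, seed=0):
    """Yield (train, test) index lists; each fold serves once as the test part."""
    folds = v_fold_indices(n_rows, v, seed)
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


folds = v_fold_indices(140, v=10)
splits = list(train_test_splits(140, v=10))
```

Each row is used for testing exactly once and for training nine times, so no separate dataset has to be set aside for choosing the tree size.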
3.5. Construction of the zinc concentration map
Using the resulting decision-tree model, a predictive concentration
map of zinc distribution in soils of the study area was
produced under a GIS environment through conditional parametric
weight application. If different end results (concentrations in
mg kg⁻¹) characterized field sites within a given unit of the
intersection layer (between different predictor parameters), new
sub-polygons were delineated. In the case of similar results, unit
polygons were joined (merged). This map was validated based on field
surveys. An independent dataset was randomly chosen in all
landscape units, consisting of 30% (60 sites) of the total number of
field sites. The accuracy assessment was based on analysis of the
error matrix, which is a square array of dimension n × n (n is the
number of classes). The total accuracy is the number of correctly
inferred Zn classes divided by the total number
of test samples (60).
Fig. 2. Decision-tree model for predicting zinc concentration based on terrain parameters (pH = pH; proximity to roads = Roads; Land cover/use = LUC; Soil type = Soils; Distance to drainage line = Drain; Slope gradient = SG; Soil depth = Sd; Organic matter = om; Hydraulic conductivity = cond; lithology = litho; Surroundings of waste areas = WA; Nearness to cities = C).
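The total accuracy derived from an error matrix is the sum of the diagonal divided by the sum of all cells. The sketch below uses a hypothetical 3 × 3 matrix with 60 test sites; the counts are invented for illustration and are not the results of this study.

```python
def overall_accuracy(matrix):
    """Total accuracy of a square error matrix (rows: observed, columns: predicted)."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("the error matrix must be square (n x n)")
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("the error matrix contains no samples")
    correct = sum(matrix[i][i] for i in range(n))
    return correct / total


# Hypothetical counts for three Zn classes and 60 test sites.
error_matrix = [
    [18, 2, 0],
    [3, 15, 2],
    [0, 3, 17],
]
accuracy = overall_accuracy(error_matrix)  # 50 correct out of 60
```

Off-diagonal cells show which classes are confused with each other, which the single accuracy figure does not reveal.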
4. Results and discussion
4.1. Model performance evaluation
The regression-tree model created with 25 terminal nodes
(Fig. 2) correctly classified 88% of the training data, but V-fold
cross-validation indicated that this model would correctly classify 76% of
an independent dataset. The number of observations (N) per
terminal node ranged from 1 to 8, and a node is not split if N is
less than 10 observations.
The value estimated at a terminal node of the tree is the mean of
the target variable (zinc concentration in mg kg⁻¹) over the rows
that fall in that node. From the tree (Fig. 2), one can see that if
the pH is less than 8 (N = 110 rows), then the estimated (average)
value of the target variable is 200 mg kg⁻¹, whereas if the pH
exceeds 8 (N = 30 rows), then the average value of the target
variable is 110 mg kg⁻¹. In fact, zinc, like other heavy metals,
becomes more water-soluble under acid conditions and can move
downwards with water through the soils.
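The first split of the tree can be written as a single conditional statement, which illustrates how a fitted tree converts into ordinary program code. The sketch covers only the root split reported above, not the 25 terminal nodes of Fig. 2; the function name is hypothetical, and assigning a pH of exactly 8 to the second branch is an assumption.

```python
def zn_estimate_root_split(ph):
    """Average zinc (mg/kg) on either side of the root split on pH.

    Assumption: pH values of exactly 8 are routed to the higher-pH branch.
    """
    if ph < 8.0:
        return 200.0  # average reported for the 110 rows with pH below 8
    return 110.0      # average reported for the 30 rows with pH above 8


estimates = [zn_estimate_root_split(ph) for ph in (6.5, 7.9, 8.3)]
```

Deeper levels of the tree would add nested conditions on the other predictors in the same way.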
The minimum validation relative error (or validation cost) occurs
with five nodes, at a value of 0.26 and a validation standard error of
0.02. This validation error value is the cost relative to the cost for
a tree with one node. It is the best measure of how well the tree will
fit an independent dataset different from the learning dataset.
4.2. Evaluation of the influence of independent parameters
on soil Zn content
The relative importance of the predictor parameters is as
follows: pH (100% in relative importance), surroundings
of waste areas (90%), proximity to roads (80%), nearness to cities
(50%), distance to the drainage line (25%), lithology (24%), land
cover/use (14%), slope gradient (10%), conductivity (7%), soil type (7%),
organic matter (5%), and soil depth (5%). pH appears as the master
parameter (generating the split from the parent node) in the whole
process of zinc pollution, since it is strongly correlated with
soil types and parent materials influencing cation mobility and
regulates the solubility of zinc in soils. The spatial variability of zinc
in topsoils is affected by extrinsic (roads, waste areas and cities) and
intrinsic aspects (drainage lines and soil parent materials). In
addition, there is a good correspondence between zinc distribution
and land cover/use, especially that related to long-standing human
activity (e.g., cultivation resulting in chemical anomalies). The slope
gradient has demonstrated a certain impact by influencing the
accumulation of zinc in mosses. Similarly, hydraulic conductivity
and soil type (texture) control zinc distribution, but to a lesser
extent (7%). In clay-rich, low-permeability soils, heavy metals are
bound to soil particles, and thus there are few toxicity risks.
Organic matter and soil depth have similar effects on the zinc
concentration. The former influences heavy metal absorption in
soils; this effect is probably due to the cation exchange capacity of
organic material. In addition, since heavy metals come from parent
materials, their concentration will typically increase with soil depth
due to two processes. One is soil formation from rock parent
materials, which have higher concentrations of heavy metals; the
other is deposition of materials at the soil surface, such as plant
organic matter, which typically contains low heavy metal
concentrations. As the soil materials from these two processes mix,
a gradient of increasing concentration is created through the soil
profile. The other parameters (i.e. slope length and stoniness ratio)
do not contribute to building the regression tree, and their effect is
totally masked.
4.3. Production of the zinc concentration map
The predictive concentration map of zinc (Fig. 3), at a 1:50.000
cartographic scale, was produced using the results of the built
decision-tree model. This map was divided into five concentration
classes of equal range (Table 2). This division
is needed to prioritize the measures required to reduce
the level of zinc concentration in contaminated areas. In this map,
class 3 (with a zinc concentration ranging between 100 and
150 mg kg⁻¹) covers the largest area (39%) and is dispersed throughout the
studied region (Table 2). This indicates a widespread higher risk of
contamination if no remedial measures are applied. Classes 4 (high
contamination) and 5 (very high contamination) occur in 31.5% of
the area, but they have a far larger impact since they are distributed
as patches in densely populated areas.
Fig. 3. Predicted zinc concentration map of the study area.
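The class assignment and the area shares discussed in this section can be reproduced with a short sketch based on the figures of Table 2. How a value lying exactly on a class boundary is assigned is an assumption here (it goes to the higher class), since the text does not specify it. The names `zinc_class` and `area_share` are illustrative.

```python
# Upper class limits in mg/kg, following Table 2:
# <50, 50-100, 100-150, 150-200, >200.
UPPER_BOUNDS = [50, 100, 150, 200]

# Area per class in km^2, taken from Table 2.
AREA_KM2 = {1: 5, 2: 53, 3: 75, 4: 46, 5: 16}


def zinc_class(concentration):
    """Return the class number (1 to 5) for a zinc concentration in mg/kg.

    Assumption: a value exactly on a boundary falls in the higher class.
    """
    for class_number, upper in enumerate(UPPER_BOUNDS, start=1):
        if concentration < upper:
            return class_number
    return 5


def area_share(classes):
    """Percentage of the mapped area covered by the given classes."""
    total = sum(AREA_KM2.values())
    return 100 * sum(AREA_KM2[c] for c in classes) / total


total_area = sum(AREA_KM2.values())
high_share = area_share([4, 5])
print(zinc_class(120), total_area, round(high_share, 1))
```

The class areas sum to the 195 km² of the study region. Computed from the unrounded areas, classes 4 and 5 together cover roughly 32% of it, which is close to the 31.5% obtained by adding the rounded percentages in Table 2.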
4.4. Accuracy of the zinc concentration map
The predicted Zn values were verified against test data. The
confusion matrix between the measured zinc concentration classes
and the modeled ones indicates a good overall accuracy of 78%
(Table 3). This accuracy value denotes the number of correctly
modeled sites divided by the total number of sites. It differs
from the explained variance of the built decision-tree model (88%),
since it validates the whole approach, which combines
the integration of the landscape unit map, the quantification of terrain
parameters and decision-tree modeling. In contrast, the explained
variance reflects only some field/laboratory measurements of zinc
concentration chosen for model training. The user's accuracy (Au),
i.e. the percentage of sites belonging to a model class correctly
corresponding to the reference data, read along columns, ranges
from 50% to 87.5% (50% being the result of dividing 4 by 8 and
multiplying by 100; 87.5% is derived from dividing 7 by 8 and
multiplying by 100). Omission errors (deficits) (Eo), computed
along columns, correspond to the distribution of sites of a class,
derived from modeling, in the various classes of reference data.
They are equal (in %) to 100% − user's accuracy. They vary between
12.5% and 50%. The producer's accuracy (Ap), the percentage of sites
belonging to a reference class correctly classified by the model, read
along the rows, is 67–88%. Commission errors (excesses) (Ec),
computed along rows, correspond to the distribution of a reference
class among various classes derived from modeling. Being equal to
100% − producer's accuracy, they range between 12% and 33%.
Modeling often overestimates zinc concentration levels, while
underestimation is rare; this can be considered a positive point for
management planning considerations, because the possibility of
overlooking actual zinc pollution risks decreases.
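The accuracy measures of this section can be recomputed directly from the error matrix of Table 3. The sketch below uses the counts given in that table, with rows taken as field/laboratory classes and columns as modeled classes; the function names are illustrative. The pairing of error types with accuracy types follows the labels used in this paper.

```python
# Error matrix from Table 3: rows = field/laboratory classes,
# columns = modeled classes, ordered from very low to very high.
MATRIX = [
    [7, 2, 1, 0, 0],   # very low
    [1, 8, 2, 0, 0],   # low
    [0, 0, 23, 3, 0],  # medium
    [0, 0, 0, 4, 2],   # high
    [0, 0, 1, 1, 5],   # very high
]


def overall_accuracy(m):
    """Correctly modeled sites (diagonal) over all sites, in percent."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return 100 * correct / total


def producers_accuracy(m):
    """Diagonal over row totals (Ap in Table 3), in percent."""
    return [100 * m[i][i] / sum(m[i]) for i in range(len(m))]


def users_accuracy(m):
    """Diagonal over column totals (Au in Table 3), in percent."""
    return [100 * m[j][j] / sum(row[j] for row in m) for j in range(len(m))]


def over_and_under(m):
    """Count sites modeled in a higher / lower class than observed."""
    n = len(m)
    over = sum(m[i][j] for i in range(n) for j in range(n) if j > i)
    under = sum(m[i][j] for i in range(n) for j in range(n) if j < i)
    return over, under


oa = overall_accuracy(MATRIX)
ap = producers_accuracy(MATRIX)
au = users_accuracy(MATRIX)
eo = [100 - a for a in au]  # Eo = 100% - Au, as defined in the text
ec = [100 - a for a in ap]  # Ec = 100% - Ap, as defined in the text
over, under = over_and_under(MATRIX)
print(round(oa), [round(a) for a in ap], [round(a, 1) for a in au], over, under)
```

The counts above the diagonal (10 sites) exceed those below it (3 sites), which is the basis for the statement that the model overestimates more often than it underestimates.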
4.5. Advantages and problems of the constructed model
Our regression-tree model (88% explained variance) produced a map of
zinc concentration with five classes for a region situated in the
northern part of Lebanon (195 km²). Such a map was not previously available
in Lebanon, nor in many other countries. It represents the result of
modeling from geo-environmental characteristics and can meet the
scientific needs of researchers and decision-makers for exploring
land-related problems.
The explained model variance can be improved by adding further
details to the predictor parameters, such as soil structure, soil
compactness, slope aspect, slope curvature, etc. This is an important
future research topic since the contribution of such variables to
explaining additional variance can be tested. This model can be
extrapolated to other areas in the country if the functional
capacities of GIS are used, because they allow the integration of several
parametric maps for producing landscape unit maps, on which zinc
concentration measurements can be determined.
The concept of decision-tree modeling can also be tested for
other heavy metals (e.g., copper, lead, cadmium, etc.). However,
a major difficulty is the coarse scale of some of the
parametric maps used for producing the zinc concentration map.
5. Conclusion
The constructed decision-tree model enabled, for the first time,
the mapping of predicted zinc concentrations in a region of
Lebanon at a scale of 1:50.000, based on geo-environmental
characteristics (e.g., topography, geology, soil and land cover/use).
The modeling approach was easily implemented with available GIS
software and is suitable for data exploration and predictive zinc
concentration mapping. It is explicit and can be critically evaluated
and revised when necessary. It can also easily be extrapolated to
other Mediterranean countries undergoing socio-economic change.
This decision-tree approach did not produce concentration
maps of soil zinc pollution with accuracies substantially higher
than those reported in the literature for other methods and/or
models (Facchinelli et al., 2001; Li et al., 2001; Norra et al., 2001;
Zhang, 2005; Franco et al., 2006). However, it can be considered
a quick, simple, realistic, and informative method for combining
geo-environmental variables in order to generate a map describing
the potential zinc concentration (predictive map). This map can be
used to prioritize the choice of study areas for further measurement
and modeling and may, in the short term, help with the selection
and adoption of measures to reduce the contamination of densely
populated areas and arable lands. Although the chosen scale of this
map (1:50.000) seems to be sufficient for estimating the zinc
concentration to consider strategies for land protection, the map
can be improved for more localized contamination assessment if
more detailed data sets are available [higher resolution DTMs and
more detailed GIS parametric maps (predictors)].
References
Adachi, K., Tainosho, Y., 2004. Characterization of heavy metal particles
embedded in tire dust. Environment International 30, 1009–1017.
Berk, R.A., 2003. An Introduction to Ensemble Methods for Data Analysis. UCLA
Department of Statistics. Technical Report.
Bocca, B., Alimonti, A., Petrucci, F., Violante, N., Sancessario, G., Forte, G., 2004.
Quantification of trace elements by sector field inductively coupled plasma
spectrometry in urine, serum, blood and cerebrospinal fluid of patients with
Parkinson's disease. Spectrochimica Acta B59, 559–566.
Table 3
Error matrix for measuring the accuracy of the map predicting zinc concentration.
Rows: field observation/laboratory analysis. Columns: modeled map.
Zinc concentration   Very low   Low   Medium   High   Very high   Total   Ap (%)   Ec (%)
Very low (a)                7     2        1      0           0      10       70       30
Low                         1     8        2      0           0      11       73       27
Medium                      0     0       23      3           0      26       88       12
High                        0     0        0      4           2       6       67       33
Very high                   0     0        1      1           5       7       71       29
Total                       8    10       27      8           7      60
Au (%)                   87.5    80       85     50          71   Σ = 47
Eo (%)                   12.5    20       15     50          29   OA = 78%
Au = user's accuracy; Eo = omission errors (deficits); Ap = producer's accuracy;
Ec = commission errors (excesses); OA = overall accuracy.
Values on the diagonal indicate sites classified correctly; Σ = total number of correctly
modeled sites.
(a) Very low zinc concentration: <50 mg kg⁻¹; Low: 50–100 mg kg⁻¹; Medium:
100–150 mg kg⁻¹; High: 150–200 mg kg⁻¹; Very high: >200 mg kg⁻¹.
Table 2
Classes of zinc concentration and corresponding areas for the studied region.
Classes         Zinc concentration    Area (km²)   % of the study area
1 (very low)    <50 mg kg⁻¹                    5                   2.5
2 (low)         50–100 mg kg⁻¹                53                    27
3 (medium)      100–150 mg kg⁻¹               75                    39
4 (high)        150–200 mg kg⁻¹               46                  23.5
5 (very high)   >200 mg kg⁻¹                  16                     8
Bou Kheir, R., Girard, M.-C., Khawlie, M., 2004. Use of a structural classification
OASIS for the mapping of landscape units in a representative region of Lebanon.
Canadian Journal of Remote Sensing 30 (4), 617–630 (in French).
Bou Kheir, R., Wilson, J., Deng, Y., 2007. Use of terrain variables for predictive gully
erosion mapping in Lebanon. Earth Surface Processes and Landforms 32,
1770–1782.
Bou Kheir, R., Chorowicz, J., Abdallah, C., Dhont, D., 2008. Soil and bedrocks
distribution estimated from gully form and frequency: a GIS-based
decision-tree model for Lebanon. Geomorphology 93, 482–492.
Breiman, L., 2001. Decision-tree forests. Machine Learning 45 (1), 5–32.
Chow, C.W.K., Davey, D.E., Mulcahy, D.E., 1997. A neural network approach to zinc
and copper interferences in potentiometric stripping analysis. Journal of
Intelligent Material Systems and Structures 8 (2), 177–183.
De Temmerman, L., Vanongeval, L.B., Hoenig, M., 2003. Heavy metal content of
arable soil in Northern Belgium. Water, Air and Soil Pollution 148, 61–76.
DGA, 1963. Topographic Maps at a Scale 1:20.000. Direction of Geographic Affairs,
Republic of Lebanon.
Dubertret, L., 1945. Geological Maps of Lebanon at 1:50.000 Scale. Ministry of Public
Affairs, Republic of Lebanon.
Facchinelli, A., Sacchi, E., Mallen, L., 2001. Multivariate statistical and GIS-based
approach to identify heavy metal sources in soils. Environmental Pollution 114,
313–324.
Franco, C., Soares, A., Delgado, J., 2006. Geostatistical modelling of heavy metal
contamination in the topsoil of Guadiamar river margins (S Spain) using
a stochastic simulation technique. Geoderma 136, 852–864.
Franklin, J., 1998. Predicting the distributions of shrub species in California
chaparral and coastal stage communities from climate and terrain-derived variables.
Journal of Vegetation Science 9, 733–748.
Friedl, M.A., Brodley, C.E., Strahler, A.H., 1999. Maximizing land cover classification
accuracies produced by decision trees at continental to global scales. IEEE
Transactions on Geoscience Remote Sensing 37, 969–977.
Fuge, R., 2005. Anthropogenic sources. In: Selinum, O. (Ed.), Essentials of Medical
Geology: Impacts of the Natural Environment on Public Health. Academic Press,
Amsterdam, pp. 43–60.
Gèze, B., 1956. Soil Map of Lebanon at a Scale of 1:200.000, Explicative Note, 1956.
Republic of Lebanon. Ministry of Agriculture, Beirut, Lebanon.
Henderson, B.L., Bui, E.N., Moran, C.J., Simon, D.A.P., 2005. Australia-wide
predictions of soil properties using decision trees. Geoderma 124, 383–398.
Huang, X., Jensen, J.R., 1997. A machine-learning approach to automated
knowledge-base building for remote sensing image analysis with GIS data.
Photogrammetric Engineering & Remote Sensing 63, 1185–1194.
Kandrika, S., 2008. Land use/land cover classification of Orissa using multi-temporal
IRS-P6 awifs data: a decision tree approach. International Journal of Applied
Earth Observation and Geoinformation 10 (2), 186–193.
Li, X.D., Poon, C.S., Liu, P.S., 2001. Heavy metal contamination of urban soils and
street dusts in Hong Kong. Applied Geochemistry 16, 1361–1368.
Lin, Y.P., Teng, T.P., Chang, T.K., 2002. Multivariate analysis of soil heavy metal
pollution and landscape pattern in Changhua County in Taiwan. Landscape and
Urban Planning 62, 19–35.
LNCSR-LMoA, 2002. Land Cover/Use Map of Lebanon at a Scale of 1:20.000.
Lebanese National Council for Scientific Research and Lebanese Ministry of
Agriculture.
Loh, W.Y., Shih, Y.S., 1997. Split selection methods for classification trees. Statistica
Sinica 7, 815–840.
Manta, D.S., Angelone, M., Bellanca, A., Neri, R., Sprovieri, M., 2002. Heavy metals
in urban soils: a case study from the city of Palermo (Sicily), Italy. Science of the
Total Environment 300 (1–3), 229–243.
Manzoor, S., Munir, H.S., Shaheen, N., Khalique, A., Jaffar, M., 2006. Multivariate
analysis of trace metals in textile effluents in relation to soil and groundwater.
Journal of Hazardous Materials A137, 31–37.
McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using
environmental correlation. Geoderma 89, 67–94.
Michaelsen, J., Schimel, D.S., Friedl, M.A., Davis, F.W., Dubayah, R.C., 1994. Regression
tree analysis of satellite and terrain data to guide vegetation sampling and
surveys. Journal of Vegetation Science 5, 673–686.
Murphy, P., Patrick, M., Michael, J.P., 1994. Exploring the decision forest: an
empirical investigation of Occam's Razor in decision-tree induction. Journal of
Artificial Intelligence Research 1, 257–275.
Navas, A., Machin, J., 2002. Spatial distribution of heavy metals and arsenic in soils
of Aragon (Northeast Spain): controlling factors and environmental
implications. Applied Geochemistry 17, 961–973.
Norra, S., Weber, A., Kramer, U., Stuben, D., 2001. Mapping of trace metals in urban
soils. Journal of Soil Sediment 1, 77–97.
Reimann, C., de Caritat, P., 2005. Distinguishing between natural and
anthropogenic sources for elements in the environment: regional geochemical surveys
versus enrichment factors. The Science of the Total Environment 337 (1–3),
91–107.
Saadia, R.T., Munir, H.S., Shaheen, N., Khalique, A., Manzoor, S., Jaffar, M., 2006.
Multivariate analysis of trace metal levels in tannery effluents in relation to soil
and water: a case study from Peshawar, Pakistan. Journal of Environmental
Management 79, 20–29.
Saha, A.K., Gupta, R.P., Arora, M.K., 2002. GIS-based landslide hazard zonation in the
Bhagirathi (Ganga) valley, Himalayas. International Journal of Remote Sensing
23 (2), 357–369.
Schroder, W., 2006. GIS, geostatistics, metadata banking, and tree-based models for
data analysis and mapping in environmental monitoring and epidemiology.
International Journal of Medical Microbiology 296 (S1), 23–36.
Smolders, E., Degryse, F., 2002. Fate and effect of zinc from tire debris in soil.
Environmental Science and Technology 36 (17), 3706–3710.
Venables, W.M., Ripley, B.D., 1994. Modern Applied Statistics with S-Plus. Springer,
New York.
Waisberg, M., Joseph, P., Hale, B., Beyersmann, D., 2003. Molecular and cellular
mechanisms of cadmium carcinogenesis. Toxicology 192, 95–117.
Zhang, C., 2005. Using multivariate analyses and GIS to identify pollutants and their
spatial patterns in urban soils in Galway, Ireland. Environmental Pollution 142,
501–511.
Zhang, H., Burton, S., 1999. Recursive Partitioning in the Health Sciences. Springer,
New York, 226 pp.