## A Bayesian Network Model of Two-Car Accidents

**MARJAN SIMONCIC***

### ABSTRACT

This paper describes the Bayesian network method for modeling traffic accident data and illustrates its use. Bayesian networks employ techniques from probability and graph theory to model complex systems with interrelated components. The model is built using two-car accident data for 1998 from Slovenia, and inferences are made from the model about how knowledge of the values of certain variables influences the probabilities for values of other variables or outcomes (e.g., how seat-belt use affects injury severity). An advantage of the Bayesian network method presented here is its integrated approach, in which system variables are treated as interdependent and no split into dependent and independent variables is needed.

KEYWORDS: Road accidents, modeling, Bayesian networks, machine learning.

### INTRODUCTION

This paper presents a Bayesian network model of two-car accidents based on different factors that influence accident outcomes. The outcomes examined are "fatality or serious injury" and "other outcomes." Influencing factors include:

- road characteristics (e.g., roadway, pavement),
- traffic flow characteristics,
- time/season factors (e.g., weather, season, weekday, daytime, rush hour),
- characteristics of the people involved in an accident (e.g., age, sex, driving experience, health status, intoxication),
- use of protective devices (seat belt, air bag),
- types of vehicles (especially their crash resistance design), and
- speed of the vehicles involved.

Besides these factors, other stochastic influences affect the likelihood of an accident and its outcome. The factors presented above are highly interrelated. For instance, road conditions are influenced by the weather. Traffic flow depends on the time of the day, whether it is a weekday or weekend, and weather conditions. The characteristics of people involved (e.g., age, sex, experience) can often be related to the speed of the vehicles in an accident and the use or non-use of seat belts. The outcome of an accident is, by and large, dependent on the speed of the vehicles involved.

A large road accident dataset was used to model the
interdependence among the variables related to accidents ("knowledge
of the subject") and the dependence of the outcome on the relevant
variables. Bayesian networks^{1}
seem particularly useful for representing knowledge in domains where
large sets of interrelated (and relevant) data are available. They
are based on a combination of probability theory, which deals with
uncertainty, and graph theory, which deals with complexity
(interrelatedness). These networks are an important tool in the
design and analysis of machine learning algorithms and are based on
the idea of modularity whereby a complex system is built by
combining simpler parts. Probability theory connects parts and
ensures the consistency of the system as a whole while providing the
possibility of interfacing the models with the data (see Jordan
1999). This paper aims to show that Bayesian networks can also prove
their potential in modeling road accidents.

### BAYESIAN NETWORKS

#### A Simple Example of a Bayesian Network

This section presents a simple Bayesian network for road accidents. The example is merely for illustrative purposes and is not intended to present a valid model. The aim is to introduce the concept of Bayesian networks by example.

Using a given geographic area, the number of road accident
casualties per day can be schematically explained. Many factors are
interrelated: the number of road casualties depends on how many
trips car drivers took in the area and the danger level; the number
of trips is related to weather conditions and the season (e.g.,
summer means more vacation travel); season and weather are also
related; the level of danger is influenced by the average speed of
vehicles on the roads and on road conditions (e.g., a slippery
road); and road conditions depend on the weather and season and
influence the average speed and level of danger. Figure
1 presents these relationships in a directed acyclic graph where
the nodes correspond to different variables that are characteristic
of the given domain under consideration. Links^{2}
in the graph represent dependence between variables, and acyclic
means that there is no node from which it is possible to follow a
sequence of (directed) links and return to the same node.
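The acyclicity condition can be checked mechanically. The following minimal sketch encodes the relationships described above as parent lists (the node names are illustrative labels chosen here, not taken from figure 1) and tests for cycles with a topological sort.

```python
from collections import deque

# Parent lists read off the verbal description of the example network.
# Node names are illustrative labels, not the labels used in figure 1.
PARENTS = {
    "season": [],
    "weather": ["season"],
    "road_conditions": ["season", "weather"],
    "trips": ["season", "weather"],
    "avg_speed": ["road_conditions"],
    "danger": ["road_conditions", "avg_speed"],
    "casualties": ["trips", "danger"],
}

def topological_order(parents):
    """Return a topological order of the nodes, or None if the graph has a cycle."""
    indegree = {node: len(pa) for node, pa in parents.items()}
    children = {node: [] for node in parents}
    for node, pa in parents.items():
        for p in pa:
            children[p].append(node)
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Every node is placed only if no directed cycle exists.
    return order if len(order) == len(parents) else None

order = topological_order(PARENTS)
print(order)

# A hypothetical feedback link casualties -> season would create a cycle.
cyclic = dict(PARENTS, season=["casualties"])
print(topological_order(cyclic))
```

A graph for which the function returns an order is a valid structure for a Bayesian network; the second call returns `None` because the added link closes a directed loop.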

Let us suppose that all variables can only take on a finite number of discrete values. We are interested in identifying the probabilities of different events expressed in given values for all variables. This can be expressed with a joint probability distribution over all possible events in the given domain. The number of possible events grows exponentially with the number of relevant variables and, therefore, the joint probability function approach quickly becomes unmanageable. Bayesian networks can streamline the process, because they are a compact way of factoring the joint probability distribution into local, conditional distributions that reduce the number of multiplications necessary to obtain the probability of specific events.
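The gain in compactness can be illustrated by counting free parameters, a measure closely related to the computational savings mentioned above. This sketch assumes, purely for illustration, that every variable in the example is binary; the node names are labels chosen here.

```python
from math import prod

# Parent sets following the verbal description of the example network.
PARENTS = {
    "season": [],
    "weather": ["season"],
    "road_conditions": ["season", "weather"],
    "trips": ["season", "weather"],
    "avg_speed": ["road_conditions"],
    "danger": ["road_conditions", "avg_speed"],
    "casualties": ["trips", "danger"],
}

# Assumption: two possible values per variable.
CARD = {name: 2 for name in PARENTS}

def full_joint_parameters(card):
    """Free parameters of an unrestricted joint distribution."""
    return prod(card.values()) - 1

def factored_parameters(card, parents):
    """Free parameters when each node stores P(node | parents)."""
    total = 0
    for node, pa in parents.items():
        parent_configurations = prod(card[p] for p in pa)  # empty product is 1
        total += parent_configurations * (card[node] - 1)
    return total

print(full_joint_parameters(CARD))         # 127
print(factored_parameters(CARD, PARENTS))  # 21
```

With more variables or more values per variable, the gap between the two counts widens quickly, because the unrestricted joint grows exponentially while the factored form grows with the size of the largest parent set.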

If we interpret the Bayesian network in probabilistic terms, the
related joint distribution function over a given domain (described
by *n* variables) can be written as the product^{3}:

$$
P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid pa(X_i)\bigr) \qquad (1)
$$

where *X*_{i} is the variable and *x*_{i} is its value; *Pa*(*X*_{i}) is the set of variables that represents *X*_{i}'s parents^{4}; and *pa*(*X*_{i}) is a vector of actual values for all parents of *X*_{i}.

Let us note here the general validity of the chain rule formula:

$$
P(x_1, x_2, \ldots, x_n) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, x_2, \ldots, x_{n-1})
$$

From our example in figure 1, we have:

$$
\begin{aligned}
P(x_1, x_2, \ldots, x_7) = {} & P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2)\, P(x_4 \mid x_1, x_2) \\
& \cdot P(x_5 \mid x_3)\, P(x_6 \mid x_3, x_5)\, P(x_7 \mid x_4, x_6)
\end{aligned}
$$

Aside from the global semantics reflected in equation (1), there is also a local meaning related to a Bayesian network. From figure 1, we see:

$$
P(x_4 \mid x_1, x_2, x_3) = P(x_4 \mid x_1, x_2)
$$

where *X*_{4} is independent of the variable
*X*_{3} given *X*_{1} and
*X*_{2} (reflecting the fact that *X*_{3}
is not among the parents of *X*_{4} ). These local
semantics are very useful for constructing a Bayesian network. Here,
only direct causes (or predispositions) are selected as the parents
of a given variable, which leads to the automatic fulfillment of
local independence conditions.
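The local independence statement for *X*_{4} can be derived from the factorization for figure 1. The short derivation below is a sketch that uses only the fact that every conditional distribution sums to one, summing out *x*_{7}, *x*_{6}, and *x*_{5} in that order:

$$
\begin{aligned}
P(x_1, x_2, x_3, x_4) &= \sum_{x_5, x_6, x_7} P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2)\, P(x_4 \mid x_1, x_2)\, P(x_5 \mid x_3)\, P(x_6 \mid x_3, x_5)\, P(x_7 \mid x_4, x_6) \\
&= P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2)\, P(x_4 \mid x_1, x_2), \\
P(x_1, x_2, x_3) &= \sum_{x_4} P(x_1, x_2, x_3, x_4) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2), \\
P(x_4 \mid x_1, x_2, x_3) &= \frac{P(x_1, x_2, x_3, x_4)}{P(x_1, x_2, x_3)} = P(x_4 \mid x_1, x_2).
\end{aligned}
$$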

Links in Bayesian networks may have different meanings. If we have a link from node A to node B, this could mean:

1. A causes B,

2. A partially causes or predisposes B,

3. B is an imperfect observation of A,

4. A and B are functionally related, or

5. A and B are statistically correlated.

This paper employs the second meaning of a link.

Bayesian networks for a certain domain can be used for inference purposes. With the network in figure 1, we will illustrate the meaning of inference and also show the difference between a Bayesian network model and better known classical models, such as logistic regression. After a product specification (equation (1)) of a joint probability distribution is obtained, the probability of any event in the domain can be expressed. Conditional events where certain variables have known values are especially interesting. This type of probabilistic inference is called a belief update. An example for the domain represented in figure 1 is the following:

For illustrative purposes, we have assumed that one possible
value of the variable *X* _{3} (road conditions) is
"slippery." This variable can also take on other values. A similar
description holds for variable *X* _{7} . This
expression can be further simplified, but this is unnecessary here.
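A belief update of this kind can be sketched by brute-force enumeration over the factorized joint distribution. The code below assumes binary variables and an encoding in which *X*_{1} is season, *X*_{2} weather, *X*_{3} road conditions, *X*_{4} number of trips, *X*_{5} average speed, *X*_{6} danger level, and *X*_{7} number of casualties. That reading of figure 1, and every probability in the tables, is invented for illustration only.

```python
import itertools

# Parent sets matching the factorization for figure 1.
PARENTS = {1: (), 2: (1,), 3: (1, 2), 4: (1, 2), 5: (3,), 6: (3, 5), 7: (4, 6)}

# CPT[i] maps a tuple of parent values to P(X_i = 1 | parents).
# Value 1 stands for the first-named state; all numbers are hypothetical.
CPT = {
    1: {(): 0.25},                                              # season = summer
    2: {(0,): 0.4, (1,): 0.2},                                  # weather = rain
    3: {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.02, (1, 1): 0.5},   # road = slippery
    4: {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.8, (1, 1): 0.6},    # trips = many
    5: {(0,): 0.6, (1,): 0.2},                                  # speed = high
    6: {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},    # danger = high
    7: {(0, 0): 0.05, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.6},   # casualties = high
}

def joint(x):
    """P(x_1, ..., x_7) as a product of local conditional probabilities."""
    p = 1.0
    for i, pa in PARENTS.items():
        p1 = CPT[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

def prob(evidence):
    """Probability of a partial assignment, summing the joint over free variables."""
    free = [i for i in PARENTS if i not in evidence]
    return sum(
        joint({**evidence, **dict(zip(free, values))})
        for values in itertools.product((0, 1), repeat=len(free))
    )

def belief(query, evidence):
    """Posterior P(query | evidence) by the definition of conditional probability."""
    return prob({**query, **evidence}) / prob(evidence)

prior = prob({7: 1})                 # P(X7 = high)
posterior = belief({7: 1}, {3: 1})   # P(X7 = high | X3 = slippery)
print(round(prior, 4), round(posterior, 4))
```

Enumeration is only practical for very small networks; its cost grows exponentially with the number of unobserved variables, which is why dedicated inference algorithms are used in practice.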

Let us now illustrate the difference between the Bayesian network
model and the classical logistic regression (for logistic regression
see Agresti (1990) or Hosmer and Lemeshow (2000)). The most
significant difference is that with logistic regression the model's
dependent and independent variables must be chosen; while, with the
Bayesian network model, all variables are treated equally. The
logistic regression has a response (or dependent) variable *Y*
that is a categorical variable with *J* ( *J *≥ 2) classes
and a vector *X* (with *p* components) of explanatory (or
independent) variables that are also categorical^{5}
variables. Here, *Y* could be the number of casualties (with
*Y* = 1 for "high" and *Y* = 0 for "other"). The
components of vector *X* could be the six other variables from
figure 1. The generalized logit model can be put in the following
way:

$$
\log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \alpha + \sum_{k=1}^{p} \beta_k x_k \qquad (2)
$$

If the attributes *X* are also 0/1 variables, then the
following formula is valid^{6} (with the other components of *X* held fixed):

$$
\exp(\beta_k) = \frac{P(Y = 1 \mid x_k = 1) \,/\, P(Y = 0 \mid x_k = 1)}{P(Y = 1 \mid x_k = 0) \,/\, P(Y = 0 \mid x_k = 0)}, \qquad k = 1, 2, \ldots, p
$$

The expression is called the odds ratio and allows an easy
interpretation of the estimated parameters^{7}.
In the logit model for figure 1, exp(*β*_{k}) is the
odds that the number of casualties will be high in the circumstances
given by *x*_{k} = 1 relative to the odds that the number of casualties
will be high in the circumstances given by *x*_{k} = 0.

Clearly, the model shown in equation (2) does not
explicitly take into account possible interdependence between the
components of *X*, nor does it allow for an estimation of other
probabilities that could be of interest (e.g., the belief update
given as an example for the network in figure 1). Interdependences
among variables in a Bayesian network are explicit and represent a
distinguishing feature of the method.
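The odds-ratio interpretation can be verified numerically. The sketch below assumes a binary logit model in which the log-odds of *Y* = 1 equal α plus the sum of β_k x_k; the coefficient values are hypothetical and serve only to show that switching one 0/1 attribute multiplies the odds by exp(β_k).

```python
import math

def logistic(z):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, chosen only for illustration.
alpha = -2.0
beta = [0.8, -0.3]

def p_high(x):
    """P(Y = 1 | x) under the assumed binary logit model."""
    return logistic(alpha + sum(b * xi for b, xi in zip(beta, x)))

# Switching x_1 from 0 to 1, with x_2 held fixed, multiplies the odds by exp(beta_1).
for x2 in (0, 1):
    ratio = odds(p_high((1, x2))) / odds(p_high((0, x2)))
    print(round(ratio, 6), round(math.exp(beta[0]), 6))
```

The two printed numbers agree for either value of the second attribute, which is exactly the sense in which the model treats the attributes as acting separately on the log-odds.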

The general problem of computing posterior probabilities (or of a belief update) for large and structurally more complex Bayesian networks is computationally very demanding (more precisely: NP-hard). The computational burden was the reason that the inference in Bayesian networks was initially limited only to special types of structures, namely tree-structured networks. Later, efficient algorithms were proposed for more general types of network structures (Lauritzen and Spiegelhalter 1988; Zhang and Poole 1996).

#### Formal Definition of Bayesian Networks

Bayesian networks contain qualitative (structural) and quantitative (probabilistic) parts. The qualitative part is based on statistical independence statements and can be represented by a directed acyclic graph. The nodes are related to random variables of interest for a given domain, while the links correspond to a direct influence among the variables. The quantitative part is captured by local probability models, given by a set of conditional probability distributions. Both the qualitative and quantitative parts of the Bayesian network uniquely represent the joint probability distribution over a domain. The definitions follow.

**Definition 1.** A Bayesian network *B* is a triplet
(*X, A, P*) where:

- *X* is a set of nodes,
- *A* is a set of links that, together with *X*, represent a directed acyclic graph *G* = (*X, A*), and
- *P* = {*P*(*x* | *pa*(*x*)) : *x* ∈ *X*},

where *Pa*(*X*) is the set of parents of *X*, and
*pa*(*x*) is its instantiation^{8}. *P*
stands for probability.

It is clear that *P* is the set of conditional probabilities
for all variables, given their parents. From definition 1, the
conclusion can be drawn that nodes and variables are used
interchangeably. Variables in a Bayesian network are called nodes
when we speak about the graph.

Graph *G* corresponding to a Bayesian network has to be acyclic. If cycles were allowed, feedback influences between variables would be possible. Feedback cycles are difficult to model quantitatively, and no calculus has been developed for Bayesian networks to cope with them.

The notion of conditional independency is a basic concept of
Bayesian networks. We say that (random) variables *A* and
*B* are independent given the variable *C* if the
following is true:

*P*(*A|B,C*) = *P*(*A|C*)

This means that if the value of variable *C* is known, then
knowledge of *B* does not alter the probability of
*A*.
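Conditional independence can be checked numerically on a small joint distribution. In this sketch, *C* is a common parent of *A* and *B*; the three probability tables are invented for illustration.

```python
import itertools

# Assumed toy distributions: C is a common parent of A and B.
P_C = {0: 0.7, 1: 0.3}
P_A_GIVEN_C = {0: 0.2, 1: 0.9}   # P(A = 1 | C = c)
P_B_GIVEN_C = {0: 0.5, 1: 0.1}   # P(B = 1 | C = c)

def bern(p1, value):
    """Probability of a 0/1 value when P(value = 1) is p1."""
    return p1 if value == 1 else 1.0 - p1

# Joint table built from the factorization P(c) P(a | c) P(b | c).
joint = {
    (a, b, c): P_C[c] * bern(P_A_GIVEN_C[c], a) * bern(P_B_GIVEN_C[c], b)
    for a, b, c in itertools.product((0, 1), repeat=3)
}

def p(a=None, b=None, c=None):
    """Marginal probability of the specified values; None means summed out."""
    return sum(
        pr for (ja, jb, jc), pr in joint.items()
        if (a is None or ja == a) and (b is None or jb == b) and (c is None or jc == c)
    )

# P(A | B, C) equals P(A | C) for every combination of b and c.
for b, c in itertools.product((0, 1), repeat=2):
    lhs = p(a=1, b=b, c=c) / p(b=b, c=c)
    rhs = p(a=1, c=c) / p(c=c)
    assert abs(lhs - rhs) < 1e-9

# Without conditioning on C, knowledge of B does change the probability of A.
print(round(p(a=1, b=1) / p(b=1), 4), round(p(a=1), 4))
```

The final line shows that the same two variables are dependent marginally, so conditional independence given *C* does not imply independence when *C* is unknown.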

The Bayesian network provides a graphic representation of many independency relationships that are embedded in the underlying probability model. No formal definitions are provided here, but it should be understood that the mathematical conception of d-separation is fundamental relative to independence (Jensen 2001).

The next definition gives the global interpretation of Bayesian networks.

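**Definition 2.** The prior joint probability
*P*_{B} of a Bayesian network *B* is defined by
the following expression:

$$
P_B(x) = P_B(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P\bigl(X_i = x_i \mid Pa(X_i) = pa(X_i)\bigr)
$$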

The factorization in definition 2 rests on a set of local
independence assumptions, asserting that each variable is
independent of its predecessors^{9}
in the network, given its parents. The opposite is also true. We can
use the interdependence in constructing Bayesian networks from
expert opinion, because selecting as parents all the direct causes
of a given variable satisfies the local conditional independence
conditions (Pearl 2000).

For the Bayesian network from figure 1, the prior joint probability is equal to:

$$
\begin{aligned}
P_B(x) = {} & P_B(X_1 = x_1, X_2 = x_2, \ldots, X_7 = x_7) \\
= {} & P(X_1 = x_1)\, P(X_2 = x_2 \mid X_1 = x_1) \\
& \cdot P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) \\
& \cdot P(X_4 = x_4 \mid X_1 = x_1, X_2 = x_2) \\
& \cdot P(X_5 = x_5 \mid X_3 = x_3) \\
& \cdot P(X_6 = x_6 \mid X_3 = x_3, X_5 = x_5) \\
& \cdot P(X_7 = x_7 \mid X_4 = x_4, X_6 = x_6)
\end{aligned}
$$

When we have a joint probability distribution defined on a set of
variables *X*, we can calculate the probability distribution of
any subset *S* of *X*. This calculation is called
marginalization and is very useful in inference exercises on
Bayesian networks.

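**Definition 3.** Let *S* be a subset of the set of
variables *X*. The marginal probability
*P*_{B}(*S*) is defined by

$$
P_B(S) = \sum_{X - S} P_B(X)
$$

where the sum runs over all combinations of values of the variables that are not in *S*.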

Let us now suppose that some variables have specific values. In
our example from figure 1, variables *X*_{7} and
*X*_{3} may be observed to have values "high"
(*X*_{7}) and "slippery" (*X*_{3}). If
*Y* ⊆ *X* is the set of variables with actual (observed)
values, *Y*_{0} is the corresponding vector of values
and *X*_{1} ⊆ *X* is the set of variables of
interest (*X*_{1} ⊆ *X − Y*), then the following
definition of posterior probability is useful.

**Definition 4.** The posterior probability
*P*_{B}(*X*_{1} | *Y* = *Y*_{0}) of *X*_{1}
is defined by the expression

$$
P_B(X_1 \mid Y = Y_0) = \frac{P_B(X_1, Y = Y_0)}{P_B(Y = Y_0)}
$$

### THE MODEL

#### Data

This paper focuses on road accidents in which two car drivers were involved. The empirical part is based on data from the road accidents database assembled by the Slovenian Ministry of the Interior from police reports. For the model, 1998 data containing 36,704 Slovenian police accident reports were used. From this total, 17,558 (48%) were of the selected type. To illustrate the risk of Slovenian drivers being involved in a two-car accident, some basic data show that, in 1998, 797,855 cars were registered in Slovenia (the country has 2 million inhabitants). Because we are looking at accidents involving two cars, we know that approximately 4% of the Slovenian car fleet was involved in accidents of this type that year.

Table
1 presents data on two-car accidents for selected variables.
Variables from Accident_type to Cause (the first column of table 1)
are related to the accident, while variables from Age to Injury are
related to the drivers^{10}.
The share of accidents that resulted in a fatality or serious injury
of at least one person is 1.9%. Over 70% of accidents occur in
built-up areas and more than half happen in good weather and under
normal traffic conditions. Among participants, the lion's share
corresponds to drivers 25 to 64 years old, yet the share of drivers
under 25 years of age is also relatively high (23%). For drivers
involved in accidents, a significant proportion has less than one
year of driving experience (12.9%). Only a small share of drivers
involved in accidents was intoxicated (4.3%).

#### Bayesian Network Estimation

A Bayesian network for a given domain can be estimated using different approaches. This paper uses a template model that should not vary from one problem to another. Our purpose here is to estimate a fixed Bayesian network over a given set of variables, obtained by a combination of expert judgment and empirical data. Specifications for some alternative possibilities for estimating a Bayesian network are presented below.

A difficult part of building a Bayesian network is quantifying probabilities, which can be derived from various sources:

- from domain experts (subjective probabilities),
- from published statistical studies,
- derived analytically, or
- learned directly from raw data.

This paper uses the last option, mainly because of the availability of a relatively large database.
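With complete data, one common way to learn conditional probabilities from raw data is to use relative frequencies. The sketch below assumes that estimator (the exact procedure used for this model is not spelled out here) and works on a handful of hypothetical records; the field names follow the variables in table 1, but the records themselves are invented.

```python
from collections import Counter

# Hypothetical driver records; values are invented for illustration.
records = [
    {"Sex": "Male", "Settlement": "Yes", "Seat_belt": "Yes"},
    {"Sex": "Male", "Settlement": "Yes", "Seat_belt": "No"},
    {"Sex": "Male", "Settlement": "No", "Seat_belt": "Yes"},
    {"Sex": "Female", "Settlement": "Yes", "Seat_belt": "Yes"},
    {"Sex": "Female", "Settlement": "Yes", "Seat_belt": "Yes"},
    {"Sex": "Female", "Settlement": "No", "Seat_belt": "Yes"},
]

def estimate_cpt(records, child, parents):
    """Relative-frequency estimate of P(child | parents) from complete data."""
    joint_counts = Counter()
    parent_counts = Counter()
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint_counts[(pa, r[child])] += 1
        parent_counts[pa] += 1
    return {
        (pa, value): n / parent_counts[pa]
        for (pa, value), n in joint_counts.items()
    }

cpt = estimate_cpt(records, "Seat_belt", ["Sex", "Settlement"])
print(cpt[(("Male", "Yes"), "Yes")])  # 0.5
```

Parent configurations that never occur in the data receive no entry at all, which is one reason a large database is helpful for this approach.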

Sometimes the process of learning the structure of a Bayesian network (if necessary) may be even more difficult than quantifying probabilities. According to the structure, models can be classified as those with a known structure or those with an unknown structure. We experimented with both options.

There are basically two different approaches to learning the
structure of a Bayesian network from data: 1) search and scoring
methods and 2) dependency analysis methods. In the first approach,
different scoring criteria are used for evaluating competing
structures. Two of the well-known methods of this type are the
Bayesian scoring method (Cooper and Herskovits 1992) and the minimum
description length method (Lam and Bacchus 1994). Because learning a
Bayesian network structure by a search and score approach is
NP-hard, different heuristic searches have been proposed. Algorithms
from the second group try to discover the dependences among
variables from data and then use them to infer the structure. During
this process, a conditional independence test, usually based on the
concept of mutual information of two nodes (variables), *X* and
*Y*, is used:

$$
I(X, Y) = \sum_{x, y} P_e(x, y) \log \frac{P_e(x, y)}{P_e(x)\, P_e(y)}
$$

In this expression, *P*_{e} denotes the observed
relative frequencies in the dataset. Conditional mutual information
is defined analogously:

$$
I(X, Y \mid Z) = \sum_{x, y, z} P_e(x, y, z) \log \frac{P_e(x, y \mid z)}{P_e(x \mid z)\, P_e(y \mid z)}
$$

*Z* can be a single node or a set of nodes. Mutual
information *I* is non-negative and equal to 0 when *X*
and *Y* are independent (conditionally independent given *Z* in
the case of *I*(*X,Y|Z*)). The higher the mutual
information, the stronger the dependence between *X* and
*Y*. In heuristic algorithms a certain threshold *ε* is
usually used: if *I*(*X,Y*) is smaller than *ε*, then
*X* and *Y* are taken as marginally independent.
Similarly, if *I*(*X,Y|Z*) is smaller than *ε*, we
consider *X* and *Y* as conditionally independent given
*Z*.
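The threshold test can be sketched directly from observed relative frequencies. The function names, the toy observations, and the value of the threshold below are assumptions made for illustration; they are not taken from PowerConstructor or any other program.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(X, Y), in nats, from (x, y) observations."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def conditional_mutual_information(triples):
    """Empirical I(X, Y | Z): the average over z of I(X, Y) within each stratum,
    weighted by the relative frequency of z."""
    n = len(triples)
    by_z = {}
    for x, y, z in triples:
        by_z.setdefault(z, []).append((x, y))
    return sum(len(group) / n * mutual_information(group) for group in by_z.values())

EPSILON = 0.01  # assumed threshold

independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
dependent = [(0, 0), (1, 1)] * 50
print(mutual_information(independent) < EPSILON)  # True: treated as independent
print(mutual_information(dependent) < EPSILON)    # False: treated as dependent

# X and Y both copy Z, so they are dependent marginally but not given Z.
common_cause = [(z, z, z) for z in (0, 1)] * 50
print(conditional_mutual_information(common_cause) < EPSILON)  # True
```

The last case mirrors the situation the structure-learning algorithms exploit: a strong marginal dependence that disappears once the right conditioning set is found indicates that no direct link is needed.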

All these methods can be expected to find the correct structure only when the probability distribution of the data satisfies certain assumptions. But generally both types of methods find only approximations for the true structure.

According to the available data, models for learning Bayesian networks can be classified into those with complete data available or those with incomplete data available. In the first case, all variables are observed for all instances in the database while, in the second case, values for some variables may be missing or some variables may not even be observed (hidden variables). Because the available database used for this paper contains complete data, the first possibility is relevant.

*Variables Considered in the Model *

Some conditions of an accident may be called exogenous. They are tied to the accident and happen without the volition or action of the drivers involved. Variables from table 1 in this category are:

- weather condition,
- weekday,
- settlement (whether an accident occurs in a built-up area or not), and
- daytime (whether an accident occurs during the night or day).

These external conditions influence some internal and objective
conditions also tied to the accident, such as traffic and the
roadway. For each accident, these conditions are also exogenous^{11}.

Besides these internal and objective conditions, there are also internal subjective (and not volitional) conditions that relate to the drivers involved:

- age and sex,
- driving experience,
- intoxication (alcohol), and
- use of a seat belt.

Objective and subjective internal conditions influence the cause
of an accident. The particular cause further influences the outcome
of the accident. Here, only two types of accident outcomes are
considered: a fatality or serious injury, and other^{12}.
Subjective internal conditions and the cause of an accident
influence the type of driver injury.

Different network structures can reflect these conditions. In the
process of finding a suitable network structure, we experimented
with PowerConstructor. PowerConstructor (Cheng et al. 2001) is a
computer program that can estimate the Bayesian network structure if
a database of cases is available. The method (Cheng et al. 1997)
used in PowerConstructor for comparing competing structures is of
the dependency analysis type and requires
*O*(*n*^{4}) conditional independence tests
(*n* being the number of variables). The program is able to
take into account additional restrictions on variables (e.g.,
partial ordering, forbidden links, roots, leaves, and causes and
effects).

For this research, external variables and the variables related to the driver (e.g., age, sex, and experience) were among the root nodes (links can only point out of such nodes). Variables relating to the type of accident and the drivers' injuries were put among the leaf nodes (links can only point into such nodes). The variable related to the fault of the two drivers involved was also put among the leaves. PowerConstructor produced results largely as anticipated, except for some links that were missing.

Our anticipation was also based on some relevant findings from the literature. Kim (1996) analyzed the differences between male and female involvement in motor vehicle collisions in Hawaii and found that male drivers are:

- 4 times more likely than female drivers to not wear a seat belt,
- 3.6 times more likely than female drivers to be involved in alcohol-related collisions,
- 2 times more likely than female drivers to be involved in speed-related collisions, and
- 1.3 times more likely than female drivers to be involved in head-on collisions.

For the relationship between road accident severity and recorded weather, Edwards (1998) based her conclusions on data from police reports and found that:

- accident severity decreases in rain as compared with good weather,
- accident severity in fog shows geographical variation, and
- evidence for accident severity in high winds is inconclusive.

It is also well known that older drivers are more likely to be killed if involved in a fatal crash than younger drivers. Based on these results and common sense, additional restrictions for PowerConstructor included the following links:

- Age → Injury (older drivers are expected to be more prone to serious injuries than younger drivers)
- Seat belt → Injury (drivers not wearing a seat belt are likely to be more vulnerable)
- Experience → At-fault driver (drivers with little driving experience are more likely to be at fault)
- Sex → Seat belt use
- Sex → Alcohol
- Alcohol → At-fault driver

The resulting network is presented in figure 2. It is evident that only a small number of all theoretically possible interdependences was found to be important.

Weekday, daytime, and weather conditions influence traffic. An
assumption was made that the share of intoxicated drivers is greater
for accidents that happen at night than during the day. Only weather
influences road conditions^{13}. The type of accident and the use of a seat belt also depend on
whether an accident happens in a built-up area or not (settlement
variable). A smaller share of drivers wearing a seat belt in
built-up areas was expected.

Figure 2 also takes into account the different characteristics of drivers. Drivers with little driving experience are more likely to be at fault in an accident than more experienced ones. There are also significant differences between men and women, with women being more likely to use seat belts than men. On average, older drivers are more prone to serious injuries than younger ones.

The central variable in figure 2 is the cause of an
accident^{14},
which is influenced by road, weather, and traffic conditions and by
the variable related to driver intoxication. Finally, the outcome of
an accident (defined as the most serious injury to participants in
an accident) is largely conditioned by the cause of the
accident.

The estimated structure seems plausible, but a different one may also be acceptable. The scoring functions used in the optimizing approach could shed some light on the quality of the estimated Bayesian network. Furthermore, the Kullback-Leibler measure of divergence could be used. Its value could be computed for the structure at hand but would only be of interest when comparing two or more specific structures. By presenting the most probable explanation (MPE), the corresponding probability, and the relative frequency obtained from the database, the statistical quality of the given network can be seen. MPE is given by the most probable configuration of values for all variables in the Bayesian network. For the estimated structure, the MPE is given by the following values for variables:

*Night = No; Weekday = Wrk (working day); Weather = Bright;
Settlement = Yes; Experience = E11-Inf (driver's experience of 11
years or more); Sex = Male; Age = A25-64; Seat_belt = Yes; Alcohol =
No; Alco12 = No; Roadway = Dry; Traffic = Norm (normal); Cause = PV
(car maneuvers); At-fault_driver = No; Injury = Oth (other than
fatality or serious injury); Accident_type = Oth*
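For a very small network, the most probable explanation can be found by enumerating every configuration and keeping the one with the highest joint probability. The sketch below uses a three-node chain Weather → Roadway → Cause that is much simpler than the structure in figure 2; the value labels are borrowed from the model, and all probabilities are invented.

```python
import itertools

# Invented probabilities for a simplified three-node chain.
P_WEATHER = {"Bright": 0.7, "Rain": 0.3}
P_ROADWAY = {  # P(Roadway | Weather)
    "Bright": {"Dry": 0.9, "Wet": 0.1},
    "Rain": {"Dry": 0.2, "Wet": 0.8},
}
P_CAUSE = {  # P(Cause | Roadway)
    "Dry": {"PV": 0.6, "HI": 0.4},
    "Wet": {"PV": 0.3, "HI": 0.7},
}

def joint(weather, roadway, cause):
    """Joint probability as the product of the local conditional probabilities."""
    return P_WEATHER[weather] * P_ROADWAY[weather][roadway] * P_CAUSE[roadway][cause]

# Enumerate all configurations and keep the most probable one.
configurations = itertools.product(P_WEATHER, ("Dry", "Wet"), ("PV", "HI"))
mpe = max(configurations, key=lambda cfg: joint(*cfg))
print(mpe, round(joint(*mpe), 3))  # ('Bright', 'Dry', 'PV') 0.378
```

Enumeration does not scale to networks with many variables, which is why inference software relies on more efficient algorithms for this task.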

Given the estimated structure of the Bayesian network and the conditional probabilities for each node, the probability of the MPE can be computed as shown below.

$$
\begin{aligned}
P(\mathit{MPE}) = {} & P(\text{Night} = \text{No}) \cdot P(\text{Weekday} = \text{Wrk}) \cdot P(\text{Weather} = \text{Bright}) \cdot P(\text{Settlement} = \text{Yes}) \\
& \cdot P(\text{Experience} = \text{E11-Inf}) \cdot P(\text{Sex} = \text{Male}) \cdot P(\text{Age} = \text{A25-64}) \\
& \cdot P(\text{Roadway} = \text{Dry} \mid \text{Weather} = \text{Bright}) \\
& \cdot P(\text{Traffic} = \text{Norm} \mid \text{Weather} = \text{Bright}, \text{Weekday} = \text{Wrk}, \text{Night} = \text{No}) \\
& \cdot P(\text{Belt\_use} = \text{Yes} \mid \text{Sex} = \text{Male}, \text{Settlement} = \text{Yes}) \\
& \cdot P(\text{Alcohol} = \text{No} \mid \text{Night} = \text{No}, \text{Weekday} = \text{Wrk}, \text{Sex} = \text{Male}) \\
& \cdot P(\text{Alco12} = \text{No} \mid \text{Night} = \text{No}, \text{Weekday} = \text{Wrk}) \\
& \cdot P(\text{At-fault\_driver} = \text{No} \mid \text{Experience} = \text{E11-Inf}, \text{Alcohol} = \text{No}) \\
& \cdot P(\text{Cause} = \text{PV} \mid \text{Roadway} = \text{Dry}, \text{Traffic} = \text{Norm}, \text{Weather} = \text{Bright}, \text{Alco12} = \text{No}) \\
& \cdot P(\text{Injury} = \text{Oth} \mid \text{Age} = \text{A25-64}, \text{Belt\_use} = \text{Yes}) \\
& \cdot P(\text{Accident\_type} = \text{Oth} \mid \text{Settlement} = \text{Yes}, \text{Cause} = \text{PV}) \\
= {} & 0.0018
\end{aligned}
$$

An examination of databases for 1998 and 1999 produced the following relative frequencies for MPE:

*P*_{e}(1998) = 94 / 35116 = 0.00268

*P*_{e}(1999) = 103 / 39950 = 0.00258

Even the most probable explanation therefore has a small
probability of occurring. A comparison of *P*(*MPE*)
and *P*_{e}(*MPE*) can serve as an indication of
the quality of the estimated Bayesian network.
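The comparison can be made concrete with the figures quoted above. This short sketch recomputes the two relative frequencies and expresses the model probability as a fraction of each; the variable names are chosen here for illustration.

```python
# Figures quoted in the text: model probability of the MPE and
# (count, total) pairs observed in the 1998 and 1999 databases.
p_model = 0.0018
observed = {1998: (94, 35116), 1999: (103, 39950)}

relative_frequency = {year: count / total for year, (count, total) in observed.items()}

for year, p_emp in relative_frequency.items():
    # Ratio of model probability to observed relative frequency.
    print(year, round(p_emp, 5), round(p_model / p_emp, 2))
```

In both years the model assigns the MPE roughly two-thirds of its observed relative frequency, so the estimated network somewhat understates how often this configuration occurs.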

Figure
3 presents probabilities (also called beliefs) estimated from
the database of accidents for 1998 and based on the assumption of
the network structure given in figure 2. Values of variables related
to the different nodes are self-explanatory. Let us recall the
abbreviation used for accident type and injury: 1) *Fos* means
a fatality or serious injury, and 2) *Oth* means other (less
serious) outcomes. (Abbreviations for values related to the variable
Cause are explained in table 1.) Figure 3 shows only the
unconditional probabilities that correspond to each node (and not
the conditional probabilities discussed earlier).

### INFERENCE IN THE BAYESIAN NETWORK

The discussion here focuses on only three tables with specific inference results. For the inference process, Netica software (Norsys 1997) was used, and it proved to be very convenient and effective. Results are presented in tables 2 to 4 where predetermined values for a selected categorical variable (or variables) are given in the first column and probabilities for variables of interest are seen in other columns.

Table 2 shows inference results based on evidence for the variable related to the type of accident. Inference results are presented only for variables Cause, Settlement, Night, and Alco12. The probability that the cause of the accident is inappropriate speed (HI) is 0.279 in the case of accident type "Fos" (fatality or serious injury) and 0.134 for the accident type "Oth" (less severe injury). The ratio of these two probabilities, reported here as the odds ratio, is therefore 2.1. An only slightly smaller odds ratio is found for cause SV (wrong side/direction), a similar one for the Settlement variable (2.2), smaller odds ratios for variables Night and Alco12, and odds ratios smaller than 1 for causes PV (car maneuvers), OS (other), and VR (safety distance).

Table 3 reports the inference results based on the evidence for the intoxication variables (Alcohol and Alco12). The probability of an accident taking place at night is 0.752 if drivers are intoxicated and 0.206 if they are not. The ratio of these two probabilities, reported here as the odds ratio, is therefore 3.7. Odds ratios are also high for variables Sex, At_fault, and Cause (for the values related to inappropriate speed and driving on the wrong side of the road).
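The figures 2.1 and 3.7 quoted above are obtained by dividing one conditional probability by the other. The sketch below reproduces them and, for comparison, also computes the odds ratio in the stricter sense used for logistic regression; the function names are chosen here for illustration.

```python
def probability_ratio(p1, p0):
    """Ratio of two probabilities."""
    return p1 / p0

def odds_ratio(p1, p0):
    """Ratio of the corresponding odds p / (1 - p)."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Conditional probabilities quoted in the text for tables 2 and 3.
pairs = {
    "Cause = HI, Fos vs. Oth": (0.279, 0.134),
    "Night, intoxicated vs. not": (0.752, 0.206),
}

for label, (p1, p0) in pairs.items():
    print(label, round(probability_ratio(p1, p0), 1), round(odds_ratio(p1, p0), 1))
```

The two measures are close when both probabilities are small and diverge as they grow, which is why the difference is much larger for the night-time figures than for the speed-related ones.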

Inference results based on the evidence for some exogenous variables are presented in table 4. The results shown correspond to a risky situation (driving at night, outside built-up areas, on the weekend, and in rainy weather) and to risky demographic variables (young and inexperienced drivers, i.e., males less than 25 years of age and less than 1 year of driving experience). Nonrisky values were defined with the opposite values for binary variables. For other (non-binary) variables, the following values were used: age between 24 and 65, driving experience more than 11 years, and for the weekday the working day. Odds ratios are especially high for the type of accident and intoxication variables.

While more inference results and a complete picture of the influence on all variables are available, this paper presents only the more interesting variables because the primary aim is to illustrate the capabilities of Bayesian networks in this domain of knowledge. A more in-depth analysis of inference results could be used for detecting any weaknesses in the Bayesian network and for improving its structure. Using data for more than one year would make the results more reliable. New variables could also be added, for example, actual data on traffic flows on the road sections on which accidents occur or other specific characteristics of roads and regions.

### CONCLUSIONS

This paper deals with road accidents involving two car drivers. A model of such accidents is presented to capture the interrelations between different relevant variables. To this end, Bayesian networks that have proved their modeling capabilities in different knowledge domains were used. The paper first introduces Bayesian networks on a small example and then formally defines them. After presenting data on two-car accidents for Slovenia in 1998, a structure is proposed based on knowledge of the domain and on computer experiments. For this structure the corresponding probabilities were estimated from the available database. We then demonstrate how the estimated Bayesian network can be used for drawing inferences. Inference results are consistent with expectations as far as the direction of influence is concerned.

The estimated Bayesian network can be regarded as a compact and structured representation of the given database of two-car accidents. This representation relates to specific types of accidents in a given country and year. It also enables different inferences, but other methods, such as logistic regression, should also be used.

Based on the research presented here, we feel that Bayesian networks can be fruitfully applied in the domain of road-accident modeling. Compared with other well-known statistical methods, the main advantage of the Bayesian network method seems to be its complex approach, in which system variables are interdependent and no distinction between dependent and independent variables is needed. The method's chief weakness is the somewhat arbitrary search for an appropriate network structure. Nevertheless, the results shown here are encouraging and point to possible directions for improvement, such as including more variables and larger datasets that cover more years. Extending the Bayesian network (with good performance results) into a decision network is another possibility.

### ACKNOWLEDGMENTS

The Ministry of Science and Education of the Republic of Slovenia supported this research. Thanks go to anonymous referees for suggestions on improving this paper and to Jie Cheng for providing his PowerConstructor software for use with the data. Any errors, however, remain ours alone.

### REFERENCES

Agresti, A. 1990. *Categorical Data Analysis.*
New York, NY: Wiley & Sons.

Cheng, J., D.A. Bell, and W. Liu. 1997. Learning Belief Networks from Data: An Information Theory Based Approach. Proceedings of the Sixth ACM International Conference on Information and Knowledge Management.

____. 2001. Learning Belief Networks from Data: An Efficient Approach Based on Information Theory. Available at http://www.cs.ualberta.ca/%7Ejcheng/bnpc.htm, as of January 24, 2005.

Cooper, G.F. and E. Herskovits. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. *Machine Learning* 9:309-347.

Edwards, J.B. 1998. The Relationship Between Road
Accident Severity and Recorded Weather. *Journal of Safety
Research* 29(4):249-262.

Hosmer, D.W. and S. Lemeshow. 2000. *Applied
Logistic Regression.* New York, NY: Wiley & Sons.

Jensen, F.V. 2001. *Bayesian Networks and Decision
Graphs.* New York, NY: Springer-Verlag.

Jordan, M.I., ed. 1999. *Learning in Graphical
Models.* Cambridge, MA: The MIT Press.

Kim, K.E. 1996. Differences Between Male and Female Involvement in Motor Vehicle Collisions in Hawaii, 1986-1993. Proceedings from the Second National Conference. Available at http://www.durp.hawaii.edu.

Lam, W. and F. Bacchus. 1994. Learning Bayesian Belief Networks: An Approach Based on the MDL Principle. *Computational Intelligence* 10:269-293.

Lauritzen, S.L. and D.J. Spiegelhalter. 1988. Local
Computations with Probabilities on Graphical Structures and Their
Application to Expert Systems. *Journal of the Royal Statistical
Society B* 50(2):157-194.

Norsys Software Corp. 1997. *Netica Application
User's Guide.* Vancouver, Canada.

Pearl, J. 2000. *Causality.* Cambridge, UK:
Cambridge University Press.

Zhang, N.L. and D. Poole. 1996. Exploiting Causal Independence in Bayesian Network Inference. *Journal of Artificial Intelligence Research* 5:301-328.

### END NOTES

^{1} Some similar or synonymous
concepts are graphic models, belief networks, probabilistic
networks, independence networks, causal networks, and Markov
fields.

^{2} In Bayesian network
literature, the terms *vertex* and *edge* are sometimes
applied instead of *node* and *link*.

^{3} The probability of the event *A* is denoted by *P*(*A*).

^{4} Node A is the parent of
node B if there is a link from A toward B in the graph.

^{5} In a general logistic
regression, they are not limited to only these types of
variables.

^{6} A similar interpretation
is possible if we have categorical variables with more than two
values.

^{7} Explanatory variables can
be interdependent. Their interdependence plays a role in the
estimation of these parameters (see chapter 2 in Hosmer and Lemeshow
2000).

^{8} When the state of a
variable is known, we say that it is instantiated. We have an
instantiation of a set of variables if each variable is instantiated
(Jensen 2001).

^{9} *A* is a predecessor
of *B* if a directed path (a sequence of links) exists from
*A* to *B*.

^{10} Passengers are taken
into account only indirectly. A fatal accident may mean that both
drivers were only injured, but at least one passenger was
killed.

^{11} It is assumed that an
individual driver does not significantly influence traffic
conditions.

^{12} The variable
Accident_type is related to the accident, while the variable Injury
is related to the driver. This presents no problem for an analysis
with Bayesian networks.

^{13} New variables could have
been added here but were not in order to maintain a more manageable
total number of variables.

^{14} This is partly
conditioned by the large number of possible states (seven) and by
the method used in PowerConstructor.

### ADDRESS FOR CORRESPONDENCE

*M. Simoncic, Institute for Economic Research, Kardeljeva pl. 17, Ljubljana, Slovenia. Email: simoncicm@ier.si