**DIMITRIS X. KOKOTOS ^{1}**

Ship accidents frequently result in total ship loss, an outcome with severe economic and human life consequences. Predicting the total loss of a ship when an accident occurs can provide vital information for ship owners, ship managers, classification societies, underwriters, brokers, and national authorities in terms of risk assessment. This paper investigates the use of classification trees to predict this type of loss. It uses a set of predictor variables that correspond to a number of factors identified as the most relevant to the total loss of a ship and sample data generated from a large database of recorded ship accidents worldwide. Through extensive tests of induction algorithms, Exhaustive CHAID was found to be the most effective at classifying total loss accident cases. The predictive ability of the resulting classification tree structure can be utilized for risk assessment reporting.

KEYWORDS: Classification trees, ship accidents, total ship loss.

The analysis of ship accident cases is of great importance because of the economic costs (Goulielmos and Giziakis 1995; Bureau of Transport and Communications Economics 1994), the environmental impacts (Commission des Communautés Européennes 1992), and the loss of human lives. Causes of accidents include ships running aground; touching the sea bottom; striking wharves, drilling rigs, platforms, or other external substances; colliding with other ships; catching on fire; or suffering an explosion or other serious hull or machinery damage. The worst possible outcome of an accident is the total loss of the ship. We define total loss here as occurring when a ship is irretrievably damaged or sunk so that it cannot be salvaged (*actual total loss*) or is so damaged that its recovery and repair would exceed its insured value (*constructive total loss*) (Hudson 1996). Several factors determine the total loss of a ship: the quality of the vessel's construction, restoration, and resistance; the violence and severity of the accident; and the weather and sea conditions at the time of the accident.

Previous studies that look at the problem of predicting ship accidents and possible total ship loss use various datasets and data analysis techniques: discriminant analysis, logistic regression, stochastic models, and neural networks (Psaraftis et al. 1998). Giziakis et al. (1996) used logistic regression on accident data from the Greek Ministry of Mercantile Marine to predict ship failures based on several factors such as the age and type of the ship, its gross tonnage, registration, etc. Le Blanc and Rucks (1996) proposed discriminant analysis to model ship accident classification in the Mississippi River region. Otay and Özkan (2003) proposed a stochastic prediction model to study the possibility of vessel accidents (collision, ramming, and grounding) in the Strait of Istanbul. Hashemi et al. (1995) developed a neural network structure to predict ship accidents under different conditions on the lower Mississippi River. Le Blanc et al. (2001) compared statistical analysis and neural network computing techniques (Kohonen networks) on a dataset of 900 ship accident cases in the same region of the Mississippi River. This comparison concluded that neural networks are significantly superior to earlier statistical methods for classifying and predicting ship accidents.

The above-mentioned research work and applications examine a relatively small number of accident cases restricted to a particular geographic area or a controlled region (rivers, ports, straits, canals, etc.). Traditional statistical methods and neural networks are the basic data analysis tools used to date for the development of applications. In this paper, we present a classification tree application for predicting total ship loss based on a dataset extracted from an existing large dataset of ship accident cases worldwide. We tested different algorithms for expanding classification trees and a number of values for initial parameters to conclude that Exhaustive CHAID ^{1} is the most effective algorithm that provides the best classification rates for accidents in which a ship is a total loss. This particular approach of using classification trees has not been investigated in depth in previous research efforts and applications for modeling ship accidents.

In the next section, we present a short overview of classification tree theory and the comparison tests of four classification tree expansion algorithms. The following section presents the predictive variables of the model, describes the data preparation procedures, and provides basic descriptive statistics. We then cover the application of classification tree theory to our test dataset. The last section presents conclusions and discussion of issues concerning potential applications based on the proposed classification tree.

The classification tree is a data mining technique for predicting the membership of cases in classes defined by a dependent variable usually of the categorical type. Each case is measured along a number of predictor variables. The implementation of a classification tree is achieved through a training process (*induction*) in which a specific algorithm is applied to a sample dataset (*a training set*) composed of the predictor variables.

A typical induction algorithm works in two phases: the splitting phase and the pruning phase. The splitting phase is an iterative top-down process that expands the tree by defining *nodes* connected by *branches*. The nodes at the end of branches are called *leaves*. The first node at the top of the tree is the *root node*. At every node, the splitting algorithm creates new nodes by selecting a predictor variable so that the resulting nodes are as far as possible from each other. The distance measurement used for the splitting depends primarily on the specific splitting algorithm and is determined by such statistics as gini, entropy, chi-squared, gain ratio, etc. One important characteristic of most splitting algorithms is that they are *greedy*: at each node the algorithm chooses the locally best split and does not look forward in the tree to examine whether another combination of splits could produce better overall classification results.
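These impurity statistics are easy to state concretely. The following minimal Python sketch (our illustration; the function names are not from the paper) computes the gini and entropy measures for a node's class counts and the impurity reduction that a greedy splitting algorithm would evaluate for one candidate split:

```python
from math import log2

def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def split_gain(parent, children, impurity=gini):
    """Impurity reduction achieved by splitting `parent` into `children`."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

# A node with 40 "no total loss" and 10 "total loss" cases,
# split into one pure child and one mixed child:
gain = split_gain([40, 10], [[30, 0], [10, 10]])
```

A greedy splitter simply computes `split_gain` for every candidate split at the node and keeps the largest value; it never revisits that choice at deeper levels.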

An alternative representation of the classification tree can be given by using a set of nested IF-THEN rules. Each IF-THEN rule identifies a unique path from the root to a leaf and describes a certain class of cases. This alternative representation of the tree is better for analysis, particularly when the tree is greatly expanded. The nodes at the lowest part of a branch that cannot be split further into other nodes because they contain cases with only one outcome are called *pure leaves*. The splitting phase terminates when a stopping rule, initially selected by the user, is satisfied. Stopping rules may include the maximum number of nodes, the number of variables in a node considered for splitting, a minimum number of cases per node, and so forth. Once the structure of the tree is developed, pruning may be required to make the tree more applicable to other similar datasets or to exclude nodes that seem inappropriate for the specific dataset or application.

The prediction accuracy of the classification tree is highly related to the misclassification *cost* (Fawcett 2001). The term cost describes the situation when some prediction errors either occur more frequently than others or have more important consequences. The misclassification rate, the percentage of cases that are incorrectly classified, is frequently used as a typical measure of prediction accuracy. For a given class, the misclassification cost is set to a specific value to denote the severity of a wrong prediction for that class. A related issue is the *priors*, or a priori probabilities, which denote how likely a case is to fall into each of the classes. Unequal priors are used in problems with specific knowledge about the sizes of the classes. Defining misclassification costs and priors jointly can become intricate in complex problems (Ripley 1994).
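The distinction between a plain misclassification rate and a cost-weighted one can be made concrete with a small sketch (illustrative Python; the confusion counts and cost matrix shown are hypothetical):

```python
def expected_cost(confusion, costs):
    """Average misclassification cost per case.

    confusion[i][j] = number of cases of true class i predicted as class j;
    costs[i][j]     = cost of predicting class j when the truth is class i
                      (zero on the diagonal).
    """
    total = sum(sum(row) for row in confusion)
    penalty = sum(confusion[i][j] * costs[i][j]
                  for i in range(len(confusion))
                  for j in range(len(confusion)))
    return penalty / total

# Classes: 0 = "no total loss", 1 = "total loss". Missing a total loss
# is weighted 12 times as heavily as a false alarm.
confusion = [[900, 30],    # 30 false alarms
             [10, 60]]     # 10 missed total losses
costs = [[0, 1],
         [12, 0]]
cost = expected_cost(confusion, costs)   # average cost per case
```

With unit costs the same function reduces to the ordinary misclassification rate, which is why the two terms are often conflated.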

To ensure that the tree will perform as well on new data as on the training sample, a validation procedure can be applied. The preferred type of validation is testing with a sample held out from the original dataset, especially when this dataset is large enough. The sample size can be approximately one-third to one-half of the learning dataset (Breiman et al. 1984). When no separate sample is available, validation can be done on subsets of the original training set. In all cases, the misclassification costs in the validation procedure must be close enough to those obtained in the learning procedure; this verifies that the tree will perform equally well with other datasets. When the misclassification costs are not close enough to those of the learning sample, the size and the splitting of the tree must be reconsidered.
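As a minimal sketch of the holdout procedure (a hypothetical helper using only Python's standard library, not code from the study):

```python
import random

def holdout_split(cases, test_fraction=1/3, seed=42):
    """Randomly partition a dataset into a learning set and a held-out
    validation set; one-third is within the range suggested by
    Breiman et al. (1984)."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

cases = list(range(300))        # stand-in identifiers for accident cases
train, test = holdout_split(cases)
```

The tree is then grown on `train` only, and its misclassification costs are recomputed on `test`; a large gap between the two signals that the tree's size or splits should be reconsidered.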

A number of induction algorithms and software tools to implement classification trees appear in the literature. The various algorithms differ mainly in the statistical criteria used for splitting the nodes, in the types of dependent variables they support (scale, ordinal, nominal), in the number of nodes they allow for splitting, and in the elimination of redundancy during the generation of the rules. Classification and Regression Trees (known as CART or C&RT) (Breiman et al. 1984; Lee et al. 1997), CHAID (Kass 1980) and its extension Exhaustive CHAID (Biggs et al. 1991), and QUEST (quick unbiased efficient statistical tree) (Loh and Shih 1997) are among the most recently developed and most popular induction algorithms. A short description of these algorithms follows:

**CART** generates only binary trees. It constructs the tree by examining all possible splits at each node for each predictor variable and uses a goodness-of-fit criterion to find the best split. It accepts scale, ordinal, and nominal types for both the predictor and dependent variables.

**CHAID** determines the best split at each node by merging pairs of categories of the predictor variable according to their distance from the dependent variable, measured by the chi-square test. It produces nonbinary trees and accepts scale, ordinal, and nominal predictor variables.

**Exhaustive CHAID** is an improvement over CHAID: it finds the optimal split by continuously testing all possible category subsets, merging related pairs until only a single pair remains.

**QUEST** constructs the tree by examining the association of each predictor variable with the dependent variable and selecting the predictor with the highest association for splitting. Quadratic discriminant analysis (QDA) is then used to find the best split point for the selected predictor. The association of a predictor with the dependent variable is measured by the ANOVA F-test, Levene's test, or Pearson's chi-square test when the predictor is of the ordinal, continuous, or nominal type, respectively. QUEST, like CART, yields binary trees.
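The category-merging step that distinguishes the CHAID family can be sketched as follows (illustrative Python; a real implementation would also test merged subsets for statistical significance, which Exhaustive CHAID does exhaustively):

```python
from itertools import combinations

def chi2_stat(row_a, row_b):
    """Pearson chi-square statistic for a 2xK contingency table
    formed by two predictor categories and their outcome counts."""
    cols = [a + b for a, b in zip(row_a, row_b)]
    total = sum(cols)
    stat = 0.0
    for row in (row_a, row_b):
        rsum = sum(row)
        for obs, csum in zip(row, cols):
            exp = rsum * csum / total
            stat += (obs - exp) ** 2 / exp
    return stat

def most_similar_pair(table):
    """The pair of predictor categories a CHAID-style algorithm would
    merge first: the pair whose outcome distributions are closest
    (smallest chi-square statistic)."""
    return min(combinations(range(len(table)), 2),
               key=lambda ij: chi2_stat(table[ij[0]], table[ij[1]]))

# Rows: three categories of a predictor (e.g., three geographic areas);
# columns: counts of (no total loss, total loss) outcomes.
table = [[90, 10],
         [88, 12],
         [40, 30]]
pair = most_similar_pair(table)
```

Here the first two categories have nearly identical loss proportions, so they would be merged before the split on this predictor is evaluated.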

QUEST is generally faster than the other techniques, but cannot be applied to regression type problems, that is, when the dependent variable is continuous. CHAID produces, at each split, a greater number of nodes than the other two algorithms, thus forming wider trees. To date, the literature does not give a recommendation for which algorithm to use to maximize the predictive accuracy of the tree. The practice usually followed is to test the different algorithms in order to find which one minimizes the misclassification costs and at the same time satisfies the restrictions of the dataset, such as the existence of missing values and the handling of ordinal or nominal variables (Witten and Frank 2000). The approach we take in this study is to identify the algorithm that minimizes the misclassification rate for accidents involving total ship loss.

We also directly compare classification trees to traditional statistical methods such as logistic regression (Dillon and Goldstein 1984), because these methods likewise classify cases into classes defined by a dependent variable. Logistic regression is similar to other explanatory and classification techniques such as linear regression, ordinary least squares, and discriminant analysis, but it has less stringent requirements: it assumes neither linear relationships between the dependent and independent variables nor normally distributed variables. As with classification trees, the effectiveness of the statistical method is measured by the misclassification rate, that is, the percentage of cases not correctly classified out of the total number of cases.

In order to build an explanatory model to predict total ship loss, a preparatory phase of this study identified a number of factors that were conceptually grouped with those directly related to the vessel and with those that describe the particular conditions at the time of the accident. We initially identified these factors using accident reports (*Lloyd's Casualty Week* 1992–1999) and subsequently verified them from other references (Psaraftis et al. 1998; Giziakis 1996). The factors chosen include the type, size, age, and condition of the vessel at the time of the accident; its previous record of accidents; the weather and sea conditions; and the place and location of the ship when the accident occurred.

This study is based on an existing database of accident cases that was created for other projects (Giziakis and Kokotos 1996; Kokotos 2003). This database contains 27,664 records of shipping accidents worldwide between 1992 and 1999. The data were compiled mainly from textual ship accident references taken from *Lloyd's Casualty Week* (1992–1999) and validated against annual editions of *Lloyd's Register of Ships* (1992–1999), annual editions of *Lloyd's World Casualty Statistics* (1992–1999), and *Lloyd's Maritime Atlas* (1993). This reference database was further organized into predictive variables properly chosen to relate closely to the factors previously identified as the most relevant for explaining total ship loss, and it was prepared to conform to Lloyd's Casualty Information System (1980).

Through a data cleaning effort, a dataset of 4,619 shipping accident cases was generated by eliminating cases with identical accident information, missing values, and unreliable data. The final dataset used in this study contained only 352 accident cases (7.6%) where total ship loss was reported, while the rest of the cases (4,267 or 92.4% of the total) were related to accidents with no total ship loss. In this dataset, a small number of ships were involved in more than one accident with one resulting in total loss.

The remainder of this section presents the predictive variables and the most important descriptive statistics for a better understanding of the problem.

The **year when the ship was constructed** was used as an indicator of the general condition of the vessel. Most of the ships in the sample were built between 1967 and 1990; only about 20% were built before 1967 or after 1991. For the class of accidents with total ship loss, the average value for this variable was 1974.2; while for the class of no total ship loss, the corresponding value was 1977.8.

The **age of the ship** was calculated as the difference between the year of the accident and the year when the ship was built. Ships with ages between 15 and 20 years were more frequently involved in accidents.

The **gross tonnage of the ship** was used as a typical measure for the size of the ship, which strongly depends on the type of the ship (see below). The distribution of the values for the gross tonnage of the ships in the dataset was 13.6% below 1,000 tons; 67.4% between 1,000 and 24,500 tons; and 20% over 24,500 tons. A simple comparison of the average values of gross tonnage in the classes of accidents with or without total ship loss (12,084 and 18,234 tons, respectively) indicated that smaller ships were more frequently lost than bigger ships.

The **types of the ships** recorded in the dataset were tanker, general cargo, ferry, container, and bulk carrier. Containers appear to have the lowest accident rate where the ship is a total loss (about 3%), while for all the other types this rate was not significantly different from the average (between about 7% and 9% of the total number of ships).

The **number of previous ship owners** reflects the general condition of the vessel, because the practice followed by many ship owners is to sell the ship if its condition is declining and involvement in serious accidents is expected to become more frequent. For simplicity, this variable was categorized into four groups corresponding to one, two, three, or four or more owners. The percentages of total accident cases included in these groups are 35.9%, 23.8%, 29.4%, and 10.9%, respectively.

The **number of previous ship accidents,** regardless of their type and their severity, served as an extra indication of the general condition of the ship. The maximum value in the dataset was seven. Total ship loss was not significantly correlated with this variable. Particularly for the class of accidents with total loss, 83% of these cases had only one previous accident and the rest (17%) had more than one.

The **registration society of the ship** at the time of the accident ^{2} was another variable. Registration societies certify the condition of ships by adopting different survey standards. Sixteen distinct registration societies were recorded in the database and coded from R1 to R16 in random order. The average percentage of accidents with total ship loss per society was 7.6%, and the minimum and maximum were 0.5% and 24%, respectively.

The **type of accident** variable describes what occurred, independently of the outcome of the accident (total ship loss or no total ship loss). The accident type was coded according to Lloyd's Casualty Information System and accident classification standards. Table 1 shows the percentage of accidents with total ship loss by accident type. It shows that fire/explosion and contact with external substances are the accident types with the highest and lowest frequency of total ship loss, respectively.

The accident type is also closely related to the type of the ship (figure 1). This dimensionless plot positions near each other the categories of the two variables that are strongly related. The figure shows that tankers have frequent collisions because of their size and their lack of flexibility in maneuvering, while container ships, due to their cargo, suffer from fires. Grounding accidents are more frequent for general cargo ships, and contacts are independent of the type of the ship.

The variable for the **year and the month of the accident** covered 1992 to 1999. No significant differences were found between the number of accidents in different years and months. The average percentage of accidents with total loss per month was 7.6%.

The particular **geographic area of the accident** was coded into 12 major areas according to the standard classification of *Lloyd's Maritime Atlas* areas. Figure 2 shows the 12 areas defined and table 2 presents the distribution of the number of accidents with total ship loss in the 12 areas. The greatest number of accidents with total ship loss occurred in the Indian Ocean and the fewest in the Gulf of Mexico-West Indies-Newfoundland.

For the specific **location of the ship** at the time of the accident, values are "port" for accidents that occurred within the region of a port, "overseas" for accidents that occurred at sea far from the coast, and "controlled seaways" for straits and canals. The number of accident cases and the associated percentages are 2,188 (47.4%) for ports, 1,548 (33.5%) for overseas, and 883 (19.1%) for controlled seaways. The percentage of accidents with total ship loss was relatively evenly distributed across locations: ports, 6%; overseas, 9.8%; and controlled seaways, 7.8%. The most frequent accidents in ports were grounding and collisions and, in the overseas category, hull/machinery damage.

The variable for the **reported weather conditions** when the accident happened included: calm weather, poor visibility, storm, freezing conditions, and typhoon. Most of the accidents with total ship loss occurred during typhoons (8.8%), storms (8.3%), or in poor visibility (7.3%). In calm weather or freezing conditions, as expected, the percentage of accidents with total ship loss was significantly lower (5.0% and 4.3%, respectively). In relation to the accident type, hull/machinery damage accounted for 43.8%, 49.2%, and 56.9% of the accidents that occurred in calm weather, during storms, and in freezing conditions, respectively. During typhoons, the most frequent accident was grounding (50.1%), and in poor visibility, collisions (47.2%) were the most frequent accidents.

**Total loss of a ship** is the dependent variable for the analysis and is defined as a dichotomous variable accepting values of yes or no. Statistical tests performed (one-way ANOVA) to compare the average values of the above-mentioned variables for the classes of accidents distinguished by total loss showed no significant differentiation among them.

In this section, we test different classification tree induction algorithms and logistic regression in order to identify the best-performing tree structure to predict total ship loss. We used the 12 variables described earlier as predictors with total loss as the dependent variable. Total loss is a dichotomous variable (it accepts values of yes or no), while the predictors are of various types: scale (e.g., gross tonnage, year ship was built), ordinal (e.g., number of previous ship owners), and nominal (e.g., location of the accident, weather conditions).

In a preliminary stage of the analysis, CART, CHAID, Exhaustive CHAID, QUEST, and logistic regression were applied to the dataset with equal misclassification costs and priors, assuming no previous knowledge of the problem. Although this effort produced overall correct classification rates of about 93%, a consequence of the unbalanced training set (only 7.6% of the cases belonged to the total ship loss class), it resulted in very poor classification rates for the class of Total Loss = "yes." For that particular class, Exhaustive CHAID showed the best performance (a classification rate of 55.3%), logistic regression showed the worst (only 9.99%), and the remaining algorithms showed approximately 22%.

To resolve the problem of poor classification in the small class of accident cases with total ship loss, we empirically adjusted the misclassification costs in a second stage of the analysis. Logistic regression was excluded from this stage because it offers no comparable mechanism for such adjustment. The misclassification cost for Total Loss = "yes" was set to a ratio of 12 to 1 to indirectly reflect the importance and severity of the total loss outcome compared with the damages and consequences of a simple accident, a practice proposed for similar cost-sensitive classification problems (Hollmen and Skubacz 2000).
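The effect of such a cost ratio on the prediction at a leaf can be sketched as follows (our illustration, not code from the study; the counts are hypothetical):

```python
def leaf_prediction(n_no, n_yes, cost_ratio=12):
    """Choose the cost-minimizing class label at a leaf.

    Predicting "no" misclassifies the n_yes total-loss cases, each
    weighted by `cost_ratio`; predicting "yes" misclassifies the n_no
    safe cases, each weighted 1. The 12-to-1 default echoes the ratio
    used in this study, but the function itself is illustrative.
    """
    cost_if_no = cost_ratio * n_yes   # cost of labeling the leaf "no"
    cost_if_yes = n_no                # cost of labeling the leaf "yes"
    return "yes" if cost_if_yes < cost_if_no else "no"

# A leaf with 48 "no" and 31 "yes" cases: under equal costs the
# majority class wins, but under the 12-to-1 ratio the same leaf
# predicts total loss.
equal = leaf_prediction(48, 31, cost_ratio=1)
weighted = leaf_prediction(48, 31)
```

This is why raising the cost of missing a total loss recovers the minority class: nodes with even a modest share of total-loss cases flip to predicting "yes".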

Table 3 presents the rates of correctly classified cases obtained from the four induction algorithms. From the table, it can be seen that Exhaustive CHAID retains its superiority over the other algorithms and achieves the best rates for both classes.

In all the tests carried out in this analysis, the classification tree algorithms were applied to a training set of 3,079 cases (two-thirds of the total number of cases). The remaining 1,026 cases formed the test dataset used for validation. To ensure a uniform distribution of cases in every split, child nodes were constrained to include no more than half the cases of the parent node. We used SPSS AnswerTree software to implement the classification algorithms.

To confirm that the results of the Exhaustive CHAID were not dependent on the particular dataset and that this algorithm will perform well on other similar datasets, a validation procedure comprising three different tests was applied. First, Exhaustive CHAID was tested on the dataset of 1,026 cases (the one-third of the initial dataset not included in the training set) and produced classification rates of 87.8% and 84.3% for the total number of test cases and for the cases of Total Loss = "yes," respectively. A second test used 10 subsets randomly selected from the initial dataset and gave classification rates of 84.1% and 80.5%, respectively. A third test was a manual check of the classification tree structure against a small number of new accident cases not included in the initial dataset; again, the classification rates were similar to the outcomes of the other two tests. This validation procedure verified that the tree structure produced by Exhaustive CHAID provides reliable predictions of total ship loss.

Figure 3 presents the final classification tree structure produced using Exhaustive CHAID after the adjustments in misclassification costs during the training phase. For economy of presentation, only the first three levels of the tree are shown. Each node is identified by the node number and the number of cases included in the node for the classes of Total Loss = "no" and "yes," followed by the percentages and the totals. Table 4 gives, for every node presented in the tree, the condition applied for the expansion of the tree.

The hierarchy of the classification tree shows that the first split defined four nodes using the "year ship was built" predictor. Depending on the node of the first level, the next splits used the predictors "geographic area," "location," and "gross tonnage" to define nodes 5 through 13. Other predictors were included in subsequent levels, except for "weather condition" and "number of previous accidents," which were not used anywhere in the tree structure.

The graphical representation of a classification tree, as in figure 3, may not be very convenient for analysts or decisionmakers, particularly when the tree is wide and contains a large number of nodes. An alternative, more suitable presentation of the tree can be given by describing each node by IF-THEN rules of the form:

"IF *condition* THEN *prediction*"

in which the *condition* part is a composite condition including the AND logical operator, and the *prediction* part is given in terms of a probability value for the *condition* to be true.
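Such a rule is straightforward to encode and test mechanically. The sketch below (a hypothetical condition, not an actual node of the tree in figure 3) expresses the IF-part of one rule as a Python predicate:

```python
def rule_matches(case):
    """IF-part of one illustrative rule: a conjunction of simple tests
    along a single root-to-leaf path (joined by the AND operator)."""
    return (case["year_built"] <= 1973
            and case["area"] not in ("Pacific Ocean", "Atlantic Ocean")
            and case["accident"] in ("Collision", "Fire/Explosion", "Grounding"))

# A case that satisfies every condition on the path receives the
# leaf's class prediction together with its associated probability.
case = {"year_built": 1968, "area": "Indian Ocean", "accident": "Grounding"}
matched = rule_matches(case)
```

Scanning a new accident case against the full rule set in this way is equivalent to traversing the tree from the root to the matching leaf.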

By using the alternative IF-THEN presentation of the classification tree produced in this study, different types of nodes can be located: those that contain cases in which total loss has a significant probability of occurring, those in which total loss of a ship is unlikely to occur, and those that present no clear conclusion. The most important nodes for this study are those with significant probabilities of total loss. These nodes, although few, reveal certain conditions of accidents in which total loss of the ship is a strongly possible outcome. The following examples of selected nodes demonstrate the three types of nodes of the tree. The symbol "!=", which appears in the IF conditions, is the "not equal" operator.

Example 1 describes a typical node for which the group of accident cases has a very limited probability of total loss. Example 2 uncovers a group of accident cases with a significant probability of total loss (39.3%); this is considered a valuable output of the analysis. Example 3 refers to a node where no clear distinction between cases with and without total loss can be seen, because the probabilities do not differ significantly from those obtained from the whole dataset. The condition associated with this node is very simple and not very specific.

**Example 1**

**Rule.** IF *(Year ship was built > 1981) AND (Geographic Area != "Mediterranean-Black Sea" AND Geographic Area != "Australia" AND Geographic Area != "Alaska-Bering-USSR Arctic-Iceland") AND (Accident != "Grounding" AND Accident != "Fire/Explosion") AND (Number of previous ship owners > 2)* THEN *Prediction = NO, Probability = 0.9783.*

**Number of cases.** Total number of cases = 94. Cases of total loss YES = 2 (2.17%), NO = 92 (97.83%). Probability for NO total loss = 0.9783.

**Description.** Any ship built after 1981 with more than two previous owners, involved in the accident types "contact," "collision," or "hull/machinery damage" (i.e., not "grounding" or "fire/explosion"), in areas other than "Mediterranean-Black Sea," "Australia," and "Alaska-Bering-USSR Arctic-Iceland."

**Example 2**

**Rule.** IF *(Year ship was built <= 1973) AND (Geographic Area != "Pacific Ocean" AND Geographic Area != "Australia" AND Geographic Area != "Atlantic Ocean" AND Geographic Area != "Red Sea W & E African Coast") AND (Accident != "Contact" AND Accident != "Hull/Machinery Damage") AND (Registration society != "R11" AND Registration society != "R2" AND Registration society != "R16" AND Registration society != "R12")* THEN *Prediction = YES, Probability = 0.393.*

**Number of cases.** Total number of cases = 79. Cases of total loss YES = 31 (39.3%), NO = 48 (60.7%). Probability for total loss YES = 0.393.

**Description.** Any ship built in or before 1973, registered with societies R1, R3 to R10, or R13 to R15, involved in accident types "collision," "fire/explosion," or "grounding" in areas other than "Pacific Ocean," "Australia," "Atlantic Ocean," and "Red Sea-West and East African Coast."

**Example 3**

**Rule.** IF *(Year ship was built > 1977 AND Year ship was built <= 1981) AND (Gross ship tonnage <= 12603)* THEN *Prediction = YES, Probability = 0.0994.*

**Number of cases.** Total number of cases = 352. Cases of total loss YES = 35 (9.94%), NO = 317 (90.06%). Probability for total loss YES = 0.0994.

**Description.** Any type of ship built after 1977 and up to 1981 with gross tonnage of at most 12,603 tons.

The examples above demonstrate the use of the classification tree to identify groups of accident cases with significant or negligible probabilities of total ship loss.

In this paper, we presented a classification tree application to predict total loss of a ship as a consequence of an accident. The application was based on a large dataset of accident cases occurring in locations worldwide. Extensive tests indicated that the Exhaustive CHAID induction algorithm minimized misclassification costs, the criterion we defined as the most important for this particular application. Due to the unbalanced training set, the initial choice of equal costs resulted in poor classification rates for the class of Total Loss = "yes." To resolve this problem, the misclassification costs were adjusted to reflect the importance of the total ship loss outcome in this particular application. The empirical comparison between the different induction algorithms and logistic regression verified the superiority in classification of data mining techniques, and especially of classification trees, over traditional statistical methods. Compared with statistical methods, classification trees are also easily understood by both experts and non-experts and can provide a good illustration of the classification.

The prediction of total loss is of great importance for ship owners, ship managers, classification societies, underwriters, brokers, and national authorities, because it can provide valuable information for issuing risk assessment reports. In the case of a ship accident, by considering parameters such as the characteristics of the vessel, the geographic area and particular location, the type of the accident, etc., and by traversing the tree or testing the IF-THEN rules, estimates can be made of the probability of total ship loss for that accident. In many accident cases, the prediction is accurate and clear enough to activate appropriate rescue plans, reducing the cost of damage to the vessel and, above all, saving lives. The classification tree for predicting total ship loss may be utilized in the context of a potential decision support system and a risk management information system that will record, evaluate, and process data for ship accidents.

Biggs, D., B. DeVille, and E. Suen. 1991. A Method of Choosing Multiway Partitions for Classification and Decision Trees. *Journal of Applied Statistics* 18(1):49–62.

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. *Classification and Regression Trees.* Belmont, CA: Wadsworth International Group.

Bureau of Transport and Communications Economics. 1994. *Structural Failure of Large Bulk Ships,* Report 85. Commonwealth of Australia.

Commission des Communautés Européennes. 1992. L'Impact des Transports sur L'Environnement: Une Stratégie Communautaire pour un Développement des Transports Respectueux de L'Environnement. *Livre Vert Relatif* COM(92) 46 final/29/4/92. Brussels, Belgium.

Dillon, W.R. and M. Goldstein. 1984. *Multivariate Analysis: Methods and Applications.* New York, NY: John Wiley.

Fawcett, T. 2001. Using Rule Sets to Maximize ROC Performance, Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November–2 December 2001, San Jose, CA, pp. 131–138.

Giziakis, K. 1996. *Criticism of the Content of Variables that are Used in the Analysis of Accidents in the Shipping Industry,* volume in honor of Professor Stavropoulos. Piraeus, Greece: University of Piraeus.

Giziakis, K., E. Giziaki, A. Pardali-Lainou, V. Michalopoulos, and D. Kokotos. 1996. Minimising the Risk of Failure for an Effective and Reliable European Shipping Network. *Proceedings of the 3rd European Research Roundtable Conference on Shortsea Shipping: Building European ShortSea Networks.* Bergen, Norway: Delft University Press.

Giziakis, K. and D.X. Kokotos. 1996. Needs and Benefits from the Development of Shipping Accident Databases, Proceedings of the Conference on Greek Coasts and Seas, Feb. 28–29, 1996, Piraeus, Greece.

Goulielmos, A.M. and K. Giziakis. 1995. Treatment of Uncompensated Cost of Marine Accidents in a Model of Welfare Economics, Proceedings of IAME Conference, Boston, MA.

Hashemi, R.R., L.A. Le Blanc, C.T. Rucks, and A. Shearry. 1995. A Neural Network for Transportation Safety Modeling. *Expert Systems with Applications* 9(3):247–256.

Haviland, E.K. 1970. Classification Society Registers from the Point of View of a Marine Historian. *American Neptune* 30:9–39.

Hollmen, J. and M. Skubacz. 2000. Input Dependent Misclassification Costs for Cost-Sensitive Classifiers. *Proceedings of the 2nd International Conference on Data Mining.* Edited by M. Taniguchi. Billerica, MA: WIT Press.

Hudson, N.G. 1996. *Marine Claims Handbook, 5th Edition.* London, England: Witherby's Publishing.

Kass, G.V. 1980. An Exploratory Technique for Investigating Large Quantities of Categorical Data. *Applied Statistics* 29:119–131.

Kokotos, D. 2003. Data Mining: Decision Tree Analysis Upon Vessel Accidents, Ph.D. thesis. Department of Maritime Studies, University of Piraeus, Piraeus, Greece.

Le Blanc, L.A. and C.T. Rucks. 1996. A Multiple Discriminant Analysis of Vessel Accidents. *Accident Analysis and Prevention* 28(4):501–510.

Le Blanc, L.A., R.R. Hashemi, and C.T. Rucks. 2001. Pattern Development for Vessel Accidents: A Comparison of Statistical and Neural Computing Techniques. *Expert Systems with Applications* 20(3):163–171.

Lee, Y., B.V. Roy, C.D. Reed, R.P. Lippman, and K. Wadsworth. 1997. *Solving Data Mining Problems Through Pattern Recognition.* Upper Saddle River, NJ: Prentice Hall.

Lloyd's Casualty Information System. 1993. London, England: Lloyd's of London Press, Ltd.

*Lloyd's Casualty Week.* 1992–1999. London, England: Lloyd's of London Press, Ltd.

*Lloyd's Maritime Atlas, 10th ed.* London, England: Lloyd's of London Press, Ltd.

*Lloyd's Register of Ships.* 1992–1999. London, England: Lloyd's Register of Shipping.

*Lloyd's World Casualty Statistics.* 1992–1999. London, England: Lloyd's Register of Shipping.

Loh, W.Y. and Y.S. Shih. 1997. Split Selection Methods for Classification Trees. *Statistica Sinica* 7:815–840.

Otay, N. and S. Özkan. 2003. Stochastic Prediction of Maritime Accidents in the Strait of Istanbul, Proceedings of the 3rd International Conference on Oil Spills in the Mediterranean and Black Sea Regions, Istanbul, Turkey, September, pp. 92–104.

Psaraftis, C., G. Panagakos, N. Desipris, and N. Ventikos. 1998. An Analysis of Maritime Transportation Risk Factors, Proceedings of the 8th Conference ISOPE, Montreal, Canada, Vol. IV, pp. 484–492.

Ripley, B.D. 1994. Neural Networks and Related Methods for Classification (with discussion). *Journal of the Royal Statistical Society B* 56:409–456.

Witten, I.H. and E. Frank. 2000. *Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.* San Francisco, CA: Morgan Kaufmann Publishers.

1. CHAID stands for Chi-Squared Automatic Interaction Detection.

2. See *Lloyd's Register of Shipping* (www.mariners-l.co.uk/ResLloydsRegister.htm) for additional information on classification societies and their published registers. (Also see Haviland 1970.)

D. Kokotos, Department of Maritime Studies, University of Piraeus, 80 Karaoli and Dimitriou Street, 18534 Piraeus, Greece. E-mail: dkokotos@unipi.gr

Corresponding author: J. Smirlis, Department of Statistics and Actuarial Science, University of Piraeus, 80 Karaoli and Dimitriou Street, 18534 Piraeus, Greece. E-mail: jsmirlis@unipi.gr