
## AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service

The AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service is organized within the framework of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html) and is an integral part of the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/2014/ceim.html), which is devoted to fire protection engineering. The task concerns extracting useful knowledge from incident reports obtained from the State Fire Service of Poland. Prizes worth over 3,000 USD will be awarded to the most successful teams. The contest is sponsored by Dituel Sp. z o.o. (http://www.dituel.com.pl/) and F&K Consulting Engineers (http://www.fkce.pl/), with support from the University of Warsaw (http://www.mimuw.edu.pl/) and the ICRA project.

Introduction

Incident Data Reporting Systems (IDRS) are used by public safety services across the globe to gather information about incidents that required their intervention. This information serves not only to document the events; it can also be incorporated into the training of new officers. Moreover, the knowledge extracted from such reports can help to identify threats more accurately and to plan more effective procedures. An example of such a reporting system is EWID, which is used by the State Fire Service of Poland. Each report from this system consists of two parts. The first contains a summary of the resources used during the action, in the form of structured and quantified characteristics. The second contains a natural language description of the reported events, entered by the officer coordinating the action. In this data mining competition, we raise the problem of extracting useful knowledge from the reports generated in the EWID system. In particular, we ask the participants to identify key factors influencing the risk of serious injuries among firefighters and people involved in various incidents.

Special session at CEIM'14: A special session devoted to the competition will be held at the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/ceim), which is a part of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html). We will invite authors of selected reports to extend them for publication in the conference proceedings (after review by Organizing Committee members) and for presentation at the conference. The invited teams will be chosen based on their final rank, the innovativeness of their approach and the quality of the submitted report.

Our AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service has come to an end. Thank you very much for your hard work! You managed to improve the baseline result by over 5%.

The competition attracted a total of 116 teams, of which 57 were active and submitted at least one solution to the leaderboard. The total number of submissions was nearly 1300. Of the active teams, 46 provided a brief report describing their approach.

The official Winners:

1. Adam Zagórecki, Cranfield University, United Kingdom (team zagorecki)

2. Dymitr Ruta, EBTIC, Khalifa University, United Arab Emirates  (team dymitrruta)

3. Stefan Nikolić, Vladimir Ivančević, Marko Knežević, and Ivan Luković, University of Novi Sad, Serbia (team stefan.nikolic)

We would also like to distinguish six teams: zagorecki, dymitrruta, stefan.nikolic, piotr, marcb and nitekna, and invite them to contribute extended versions of their reports to the CEIM'14 workshop (https://fedcsis.org/ceim), which will host a session devoted to our contest. The organizers of the workshop will send separate invitation letters shortly.

• The competition is open for all interested researchers, specialists and students. Only members of the Contest Organizing Committee cannot participate.
• Participants may submit solutions as teams made up of one or more persons.
• Each team needs to designate a leader responsible for communication with the Organizers. A single person can be a leader of only one team.
• One person may be a member of at most 3 teams.
• Each team needs to be composed of a different set of persons.
• The total number of submissions for any single team is limited to 100 solutions.
• Each team is obliged to provide a short report describing their final solution. Reports must contain information such as the name of a team, names of all team members, the last preliminary evaluation score and a brief overview of the used approach. Their length should not exceed 2000 words and they should be submitted in the pdf format using our submission system by May 19, 2014. Only submissions made by teams that provided the reports will qualify for the final evaluation.

Data format: The training data set is provided in two formats. The first is a traditional tabular representation of the data as a comma-separated values file, namely trainingData.csv. Each row of this file represents a single EWID report and contains the values of its characteristics in consecutive columns. The attributes in this table can be divided into two groups. The first group contains the features extracted from the quantitative part of the report, and the second corresponds to a document-term matrix obtained from the natural language description sections. In total, the training data available to participants store information about 50,000 incident reports, which are described by 11,852 attributes. All the conditional attributes are discrete and only a few have more than two possible values.

For the convenience of participants, the same data set is available in a sparse matrix format as an EAV file, namely trainingData.eav. Every row of this file contains exactly three integers: an identifier of an object, an identifier of an attribute and the corresponding value.

Each report is also assigned values of three binary decision attributes. Those values for the training data are stored in the file decisionLabels.csv, which is available to all participants. The first decision attribute indicates incidents in which a firefighter or a member of the rescue team was seriously injured or killed. The second decision attribute indicates cases in which there were children among the injured people, and the third identifies situations in which civilians were hurt.

It is worth noting that the nature of the considered problem implies that the provided data set is high-dimensional, since the total number of conditional attributes corresponds to the number of distinct words in the textual part of the reports (after lemmatization) plus several hundred attributes from the quantitative part of the reports. The data is also sparse, since only a small fraction of the attributes have a non-zero value for a particular report. In addition, all three decision attributes are highly imbalanced, since the positive classes correspond to relatively rare events. There is also a separate test data set, which will be used for the evaluation of submissions. It has similar characteristics to the training data, but it will not be made available to participants of the competition.
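As an illustration of the EAV representation described above, the following minimal sketch reads object/attribute/value triples into a nested dictionary that stores only the listed entries. The separator used in trainingData.eav is not stated here, so the sketch assumes commas or whitespace; the function name `parse_eav` and the sample triples are hypothetical.

```python
import re
from collections import defaultdict


def parse_eav(lines):
    """Build {object_id: {attribute_id: value}} from EAV triples.

    Assumption: the three integers on each line are separated by
    commas and/or whitespace (the actual separator is not specified).
    """
    rows = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj, attr, value = (int(tok) for tok in re.split(r"[,\s]+", line))
        rows[obj][attr] = value
    return dict(rows)


# Hypothetical sample: two reports, three stored entries.
sample = ["1,5,1", "1,17,2", "2,5,1"]
reports = parse_eav(sample)
```

With the real file, the same function could be applied to an open file handle, for example `parse_eav(open("trainingData.eav"))`. Attributes that do not appear for a report are simply absent from its inner dictionary, which suits the sparsity of the data.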

Format of submissions: Participants are asked to indicate sets of attributes that allow accurate classification of the incidents and to send their solutions through the submission system. Each solution should be sent as a single text file containing exactly ten lines. Each line should contain at least three integers indicating attributes from the training data set, separated by commas and without any spaces. There is no upper limit on the number of attributes in a single line; however, the evaluation system penalizes solutions that use a large number of features.
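The format rules above can be checked locally before uploading. The sketch below is only an illustration: the function name `parse_submission` is hypothetical, and the real submission system may apply additional checks (for example, on the range of attribute identifiers) that are not described here.

```python
import re

# One line: at least three non-negative integers, comma-separated, no spaces.
LINE_PATTERN = re.compile(r"\d+(,\d+){2,}")


def parse_submission(text):
    """Return the ten attribute lists, or raise ValueError on a format error."""
    lines = text.splitlines()
    if len(lines) != 10:
        raise ValueError(f"expected exactly 10 lines, got {len(lines)}")
    attribute_sets = []
    for number, line in enumerate(lines, start=1):
        if not LINE_PATTERN.fullmatch(line):
            raise ValueError(
                f"line {number}: need at least 3 comma-separated integers, no spaces"
            )
        attribute_sets.append([int(tok) for tok in line.split(",")])
    return attribute_sets


# Hypothetical minimal submission: ten lines with three attributes each.
example = "\n".join(["1,2,3"] * 10)
sets = parse_submission(example)
total_attributes = sum(len(s) for s in sets)  # attributes counted with repetitions
```

The total count with repetitions is the quantity that the penalty in the evaluation is based on; a file of ten lines with three attributes each uses 30 attributes in total.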

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, which corresponds to approximately 10% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during the CEIM'14 workshop (https://fedcsis.org/ceim) at the FedCSIS'14 conference.

Quality of the submissions will be assessed by measuring performance of a classifier ensemble composed of Naive Bayes models. Those models will be constructed using attribute sets indicated in the submitted solution, separately for each decision attribute. An output of the ensemble will be computed by averaging probabilities of the positive classes returned by individual Naive Bayes models. All training data will be used for the construction of the models and the test will be performed on a separate data set which is not available for participants. The performance of the ensemble will be measured by taking an average Area Under the ROC Curve (AUC) over the probability predictions for each decision attribute, decreased by a penalty for using a large number of conditional attributes. Namely, if we denote by:

$$\begin{array}{ccl} s & - & \textrm{a submitted solution}, \\ |s| & - & \textrm{a total number of attributes used in the solution (with repetitions)}, \\ AUC_i(s) & - & \textrm{Area Under the ROC Curve (AUC) of a classifier ensemble for the i-th decision attribute}, \end{array}$$

then the quality measure used for the assessment of submissions can be expressed as:

$score(s) = F \left(\frac{1}{3}\sum\limits_{i = 1}^3 AUC_i(s) - penalty(s)\right)$

where the penalty is equal to:

$penalty(s) = \left(\frac{|s| - 30}{1000}\right)^2$

and the function $F$ is defined as:

$$F(x) = \begin{cases} x & \textrm{for } x > 0, \\ 0 & \textrm{otherwise}. \end{cases}$$
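The scoring rule can be reproduced in a few lines. The following is a sketch under stated assumptions, not the organizers' evaluation code: the AUC is computed by direct pairwise comparison (ties counted as one half), the ensemble output is a plain average of per-model positive-class probabilities, and all function names are hypothetical. Training the Naive Bayes models themselves is outside the scope of the sketch.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_output(model_probs):
    """Average the positive-class probabilities returned by the individual models.

    model_probs holds one list of probabilities per model, all of equal length.
    """
    return [sum(column) / len(column) for column in zip(*model_probs)]


def penalty(num_attributes):
    """penalty(s) = ((|s| - 30) / 1000)^2, with |s| counted with repetitions."""
    return ((num_attributes - 30) / 1000) ** 2


def score(aucs, num_attributes):
    """score(s) = F(mean of the three AUC_i(s) minus penalty(s))."""
    if len(aucs) != 3:
        raise ValueError("expected one AUC per decision attribute (three in total)")
    raw = sum(aucs) / 3 - penalty(num_attributes)
    return raw if raw > 0 else 0.0


# Hypothetical example: 130 attributes in total and an AUC of 0.9 on each decision.
example_score = score([0.9, 0.9, 0.9], 130)
```

In this example the penalty is $(100/1000)^2 = 0.01$, so the score is about 0.89. A solution with exactly 30 attributes (the minimum of ten lines with three attributes each) incurs no penalty.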

An exemplary solution: We prepared a simple solution to give an example of a correctly formatted submission file. It is available here. The attributes in the example were selected based on their correlation with the decisions. The preliminary evaluation score of this solution is 0.9119; it is displayed on the leaderboard as the baseline_solution score. In case of any questions, please post on the forum or write us an email: AAIA14Contest@mimuw.edu.pl
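A correlation-based selection of this kind could look as follows. This is only a sketch: the exact correlation measure and procedure used for the baseline are not stated, so the Pearson coefficient, the ranking by absolute value and the names `pearson` and `top_attributes` are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_attributes(columns, decision, k):
    """Return the k attribute ids with the largest absolute correlation.

    columns maps an attribute id to its list of values, one per report;
    decision is the list of binary labels for the same reports.
    """
    ranked = sorted(
        columns,
        key=lambda attr: abs(pearson(columns[attr], decision)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: four reports, three attributes, one decision attribute.
toy_columns = {1: [0, 0, 1, 1], 2: [0, 1, 0, 1], 3: [1, 1, 0, 0]}
toy_decision = [0, 0, 1, 1]
selected = top_attributes(toy_columns, toy_decision, 2)
```

On the toy data, attributes 1 and 3 are perfectly (positively and negatively) correlated with the decision, while attribute 2 is uncorrelated, so the first two are selected.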

| Rank | Team Name | Report Submitted | Preliminary Score | Final Score | Submissions |
|---|---|---|---|---|---|
| 1 | zagorecki | True | 0.9583 | 0.962325 | 2 |
| 2 | dymitrruta | True | 0.9520 | 0.960775 | 2 |
| 3 | stefan.nikolic | True | 0.9444 | 0.959698 | 2 |
| 4 | bongod | True | 0.9411 | 0.954148 | 2 |
| 5 | piotr | True | 0.9376 | 0.954015 | 2 |
| 6 | marcb | True | 0.9491 | 0.953572 | 2 |
| 7 | jz | True | 0.9359 | 0.950105 | 2 |
| 8 | lp319499 | True | 0.9370 | 0.949651 | 2 |
| 9 | ts277592 | True | 0.9370 | 0.949426 | 2 |
| 10 | korzenek | True | 0.9252 | 0.945968 | 2 |
| 11 | mswizdor | True | 0.9346 | 0.944222 | 2 |
| 12 | nitekna | True | 0.9321 | 0.941363 | 2 |
| 13 | jotek7 | True | 0.9192 | 0.940932 | 2 |
| 14 | gszpak | True | 0.9346 | 0.940160 | 2 |
| 15 | superhp | True | 0.9297 | 0.939981 | 2 |
| 16 | rudimichal | True | 0.9277 | 0.939740 | 2 |
| 17 | bartek | True | 0.9268 | 0.939511 | 2 |
| 18 | apersona | True | 0.9287 | 0.938455 | 2 |
| 19 | wm320825 | True | 0.9264 | 0.938122 | 2 |
| 20 | sir51307 | True | 0.9231 | 0.937291 | 2 |
| 21 | ab290668 | True | 0.9325 | 0.937120 | 2 |
| 22 | mg320637 | True | 0.9307 | 0.936966 | 2 |
| 23 | pgj | True | 0.9333 | 0.935773 | 2 |
| 24 | hm306317 | True | 0.9358 | 0.934463 | 2 |
| 25 | jk320790 | True | 0.9224 | 0.934194 | 2 |
| 26 | filipborowiec | True | 0.9252 | 0.933498 | 2 |
| 27 | szefo617 | True | 0.9124 | 0.933342 | 2 |
| 28 | ksloniewski | True | 0.9285 | 0.933070 | 2 |
| 29 | witek | True | 0.9240 | 0.932803 | 2 |
| 30 | kp321139 | True | 0.9267 | 0.931472 | 2 |
| 31 | mduczi | True | 0.9222 | 0.927492 | 2 |
| 32 | mw291426 | True | 0.9145 | 0.926558 | 2 |
| 33 | ai292615 | True | 0.9214 | 0.925945 | 2 |
| 34 | ps306453 | True | 0.9131 | 0.925856 | 2 |
| 35 | insejniasty | True | 0.9208 | 0.924352 | 2 |
| 36 | pk320686 | True | 0.9177 | 0.921753 | 2 |
| 37 | pm355765 | True | 0.9125 | 0.921677 | 2 |
| 38 | jaszko | True | 0.9227 | 0.919855 | 2 |
| 39 | tobuchowski | True | 0.8844 | 0.910186 | 2 |
| 40 | dc305192 | True | 0.8661 | 0.892328 | 2 |
| 41 | chmielu | True | 0.8760 | 0.889521 | 2 |
| 42 | lukaszp | True | 0.8931 | 0.888914 | 2 |
| 43 | makarewicz | True | 0.8807 | 0.884934 | 2 |
| 44 | bartosz | True | 0.8852 | 0.871873 | 2 |
| 45 | masterofu | True | 0.8023 | 0.825476 | 2 |
| 46 | makier | True | 0.8080 | 0.821643 | 2 |
| 47 | engoe | False | 0.9334 | No report file found or report rejected. | 2 |
| 48 | giuseppe.fatiguso | False | 0.9271 | No report file found or report rejected. | 2 |
| 49 | lukasz | False | 0.9241 | No report file found or report rejected. | 2 |
| 50 | reksio | False | 0.9120 | No report file found or report rejected. | 2 |
| 51 | baseline_solution | False | 0.9119 | No report file found or report rejected. | 2 |
| 52 | akarwan | False | 0.9189 | No report file found or report rejected. | 2 |
| 53 | tf248762 | False | 0.9119 | No report file found or report rejected. | 2 |
| 54 | mm319482 | False | 0.9112 | No report file found or report rejected. | 2 |
| 55 | a.ruta | False | 0.8784 | No report file found or report rejected. | 2 |
| 56 | tobuchowski2 | False | 0.8770 | No report file found or report rejected. | 2 |
| 57 | ls306462 | False | 0.7412 | No report file found or report rejected. | 2 |
| 58 | kcwong | False | 0.6904 | No report file found or report rejected. | 2 |
| 59 | maniek | False | 0.0000 | No report file found or report rejected. | 2 |

• Feb. 3, 2014: start of the competition, data sets become available,
• May 5, 2014: deadline for submitting the predictions,
• May 7, 2014: deadline for sending the reports, end of the challenge,
• May 12, 2014: on-line publication of final results, sending invitations for submitting short papers for the special session,
• June 2, 2014: deadline for submissions of papers describing the selected solutions,
• June 16, 2014: deadline for submissions of camera-ready papers selected for presentation at the CEIM'14 workshop.

Authors of the top-ranked solutions (based on the final evaluation scores) will be awarded prizes:

• First Prize: computer hardware worth 2,000 USD + one free FedCSIS'14 conference registration,
• Second Prize: computer hardware worth 1,000 USD + one free FedCSIS'14 conference registration,
• Third Prize: one free FedCSIS'14 conference registration.

The award ceremony will take place during the FedCSIS'14 conference (September 7-10, Warsaw). Additionally, invited authors who decide to attend the conference will receive a diploma and a competition T-shirt.

Andrzej Janusz (Chairman), University of Warsaw

Adam Krasuski, Main School of Fire Service & University of Warsaw

Dominik Ślęzak, University of Warsaw & Infobright Inc.

Hung Son Nguyen, University of Warsaw

Sebastian Stawicki, University of Warsaw

Guillermo Rein, Imperial College London

Stanisław Łazowy, Main School of Fire Service
