
AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service

AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service is organized within the framework of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html), and is an integral part of the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/2014/ceim.html) devoted to fire protection engineering. The task is related to the problem of extracting useful knowledge from incident reports obtained from the State Fire Service of Poland. Prizes worth over 3,000 USD will be awarded to the most successful teams. The contest is sponsored by Dituel Sp. z o.o. (http://www.dituel.com.pl/) and F&K Consulting Engineers (http://www.fkce.pl/), with support from the University of Warsaw (http://www.mimuw.edu.pl/) and the ICRA project.

Introduction

Incident Data Reporting Systems (IDRS) are used by public safety services across the globe to gather information about the incidents which required their actions. This information is used not only to document the events but can also be incorporated into the training of new officers. Moreover, the knowledge extracted from such reports can help in better identification of threats and in planning more effective procedures. An example of such a reporting system is EWID, which is used by the State Fire Service of Poland. Each report from this system consists of two parts. The first contains a summary of resources utilized during the action in the form of structured and quantified characteristics. The second contains a natural language description of the reported events, which is entered by the officer coordinating the action. In the proposed data mining competition, we would like to raise the problem of extracting useful knowledge from the reports generated in the EWID system. In particular, we would like to ask the participants to identify key factors influencing the risk of serious injuries among firefighters and people involved in various incidents.

Special session at CEIM'14: A special session devoted to the competition will be held at the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/ceim), which is part of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html). We will invite authors of selected reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The invited teams will be chosen based on their final rank, the innovativeness of their approach and the quality of the submitted report.


Our AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service has come to an end. Thank you very much for your hard work! You managed to improve the baseline result by over 5%.

The competition attracted a total of 116 teams, of which 57 were active and submitted at least one solution to the leaderboard. The total number of submissions was nearly 1,300. Of the active teams, 46 provided us with a brief report describing their approach.

The official Winners:

  1. Adam Zagórecki, Cranfield University, United Kingdom (team zagorecki)

  2. Dymitr Ruta, EBTIC, Khalifa University, United Arab Emirates (team dymitrruta)

  3. Stefan Nikolić, Vladimir Ivančević, Marko Knežević, and Ivan Luković, University of Novi Sad, Serbia (team stefan.nikolic)

Congratulations on your excellent results!

We would also like to distinguish six teams: zagorecki, dymitrruta, stefan.nikolic, piotr, marcb and nitekna, and invite them to contribute extended versions of their reports to the CEIM'14 workshop (https://fedcsis.org/ceim), which will host a session devoted to our contest. Organizers of the workshop will be sending separate invitation letters shortly.

Data format: The training data set is provided in two formats. The first is a traditional tabular representation, a comma-separated values file named trainingData.csv. Each row of this file represents a single EWID report, and the consecutive columns contain the values of its characteristics. The attributes in this table fall into two groups: features extracted from the quantitative part of the report, and a document-term matrix obtained from the natural language description sections. In total, the training data available to participants describe 50,000 incident reports by 11,852 attributes. All the conditional attributes are discrete and only a few have more than two possible values. For the convenience of participants, the same data set is available in a sparse matrix format as an EAV file named trainingData.eav. Every row of this file contains exactly three integers: an identifier of an object, an identifier of an attribute and the corresponding value.

Each report is also assigned values of three binary decision attributes. These values for the training data are stored in the file decisionLabels.csv, which is available to all participants. The first decision attribute indicates incidents in which a firefighter or a member of the rescue team was seriously injured or killed. The second indicates cases in which there were children among the injured, and the third identifies situations where civilians were hurt.

The nature of the problem implies that the data set is high-dimensional, since the total number of conditional attributes corresponds to the number of distinct words in the textual part of the reports (after lemmatization) plus several hundred attributes from the quantitative part. The data is also sparse, since only a small fraction of the attributes have a non-zero value for a particular report. In addition, all three decision attributes are highly imbalanced, since the positive classes correspond to relatively rare events.

There is also a separate test data set which will be used for the evaluation of submissions. It has similar characteristics to the training data but will not be made available to participants of the competition.
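As an illustration of the EAV layout described above, the following minimal Python sketch collects (object, attribute, value) triples into a sparse per-report mapping. The field separator is an assumption (commas and/or whitespace are both accepted), the function names parse_eav and density are hypothetical, and the sample triples are made up rather than taken from the competition data.

```python
import re
from collections import defaultdict


def parse_eav(lines):
    """Collect EAV triples into a mapping: object id -> {attribute id: value}."""
    rows = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the three integers are separated by commas and/or whitespace.
        obj_id, attr_id, value = (int(tok) for tok in re.split(r"[,\s]+", line))
        rows[obj_id][attr_id] = value
    return dict(rows)


def density(rows, n_attributes):
    """Fraction of (object, attribute) cells that are explicitly stored."""
    stored = sum(len(attrs) for attrs in rows.values())
    return stored / (len(rows) * n_attributes)


# Hypothetical toy triples, not taken from the competition data.
sample = ["1,5,1", "1,17,2", "2,5,1"]
reports = parse_eav(sample)
```

For the real file, passing an open file handle (for example `parse_eav(open("trainingData.eav"))`) would work the same way, provided the file has no header row, which is also an assumption here.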

Format of submissions: Participants are asked to indicate sets of attributes that allow accurate classification of the incidents and to send their solutions through the submission system. Each solution should be a single text file containing exactly ten lines. Each line should contain at least three integers indicating attributes from the training data set, separated by commas and without any spaces. There is no upper limit on the number of attributes indicated in a single line; however, the evaluation system penalizes solutions that use a large number of features.
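The format rules above can be checked locally before uploading. The sketch below is a hypothetical validator, not the organizers' submission system: it enforces only the stated rules (exactly ten lines, at least three comma-separated integers per line, no whitespace) and does not check whether the attribute indices exist in the training data.

```python
def parse_submission(text):
    """Parse a submission into ten lists of attribute indices, or raise ValueError."""
    lines = text.splitlines()
    if len(lines) != 10:
        raise ValueError(f"expected exactly 10 lines, got {len(lines)}")
    subsets = []
    for number, line in enumerate(lines, start=1):
        if any(ch.isspace() for ch in line):
            raise ValueError(f"line {number}: whitespace is not allowed")
        tokens = line.split(",")
        if not all(tok.isascii() and tok.isdigit() for tok in tokens):
            raise ValueError(f"line {number}: expected comma-separated integers")
        if len(tokens) < 3:
            raise ValueError(f"line {number}: at least three attributes required")
        subsets.append([int(tok) for tok in tokens])
    return subsets


def total_attributes(subsets):
    """Total number of attributes used, counted with repetitions."""
    return sum(len(subset) for subset in subsets)


# Hypothetical example: ten identical lines of three attribute indices.
example = "\n".join(["1,2,3"] * 10)
subsets = parse_submission(example)
```

Since every one of the ten lines needs at least three attributes, a valid submission always uses at least 30 attributes in total, counted with repetitions.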

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, which corresponds to approximately 10% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during the CEIM'14 workshop (https://fedcsis.org/ceim) at the FedCSIS'14 conference.

The quality of the submissions will be assessed by measuring the performance of a classifier ensemble composed of Naive Bayes models. Those models will be constructed using the attribute sets indicated in the submitted solution, separately for each decision attribute. The output of the ensemble will be computed by averaging the probabilities of the positive classes returned by the individual Naive Bayes models. All training data will be used for the construction of the models and the test will be performed on a separate data set which is not available to participants. The performance of the ensemble will be measured by taking the average Area Under the ROC Curve (AUC) over the probability predictions for each decision attribute, decreased by a penalty for using a large number of conditional attributes. Namely, if we denote by:

$$ \begin{array}{ccl} s & - & \textrm{a submitted solution}, \\ |s| & - & \textrm{a total number of attributes used in the solution (with repetitions)}, \\ AUC_i(s) & - & \textrm{Area Under the ROC Curve (AUC) of a classifier ensemble for the i-th decision attribute}, \end{array} $$

then the quality measure used for the assessment of submissions can be expressed as:

\[score(s) = F \left(\frac{1}{3}\sum\limits_{i = 1}^3 AUC_i(s) - penalty(s)\right)\]

where the penalty is equal to:

\[penalty(s) = \left(\frac{|s| - 30}{1000}\right)^2\]

and the function F is: $$F(x) = \begin{cases} x & \textrm{for } x > 0,\\ 0 & \textrm{otherwise}. \end{cases}$$
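The scoring formula can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the organizers' evaluation code: the Naive Bayes models themselves are omitted, the helper names are hypothetical, and the AUC is computed with the standard pairwise-ranking formulation (ties counted as one half), which is quadratic in the number of test objects.

```python
def ensemble_average(model_probs):
    """Average positive-class probabilities over models, one value per object.

    model_probs holds one list of probabilities per model (per attribute subset).
    """
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def penalty(n_attributes):
    """penalty(s) = ((|s| - 30) / 1000)^2, with |s| counted with repetitions."""
    return ((n_attributes - 30) / 1000) ** 2


def score(aucs, n_attributes):
    """score(s) = F(mean of the three AUC values - penalty(s))."""
    if len(aucs) != 3:
        raise ValueError("expected one AUC per decision attribute")
    value = sum(aucs) / 3 - penalty(n_attributes)
    return value if value > 0 else 0.0


# Hypothetical numbers: mean AUC 0.95 with 130 attributes, so the penalty is
# about 0.01 and the score is roughly 0.94.
example_score = score([0.95, 0.93, 0.97], n_attributes=130)
```

Because a valid submission contains at least 30 attributes, the penalty is zero only for the smallest possible solutions and grows quadratically with every attribute beyond that.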

An exemplary solution: We prepared a simple solution to give an example of a correctly formatted submission file. It is available here. The attributes in the example were selected based on their correlation with the decisions. The preliminary evaluation score of this solution is 0.9119; it is displayed on the leaderboard as the baseline_solution score. In case of any questions, please post on the forum or write us an email: AAIA14Contest@mimuw.edu.pl

Rank  Team Name          Is Report  (unlabeled)  Preliminary Score  Final Score                               Submissions
1     zagorecki          True       True         0.9583             0.962325                                  2
2     dymitrruta         True       True         0.9520             0.960775                                  2
3     stefan.nikolic     True       True         0.9444             0.959698                                  2
4     bongod             True       True         0.9411             0.954148                                  2
5     piotr              True       True         0.9376             0.954015                                  2
6     marcb              True       True         0.9491             0.953572                                  2
7     jz                 True       True         0.9359             0.950105                                  2
8     lp319499           True       True         0.9370             0.949651                                  2
9     ts277592           True       True         0.9370             0.949426                                  2
10    korzenek           True       True         0.9252             0.945968                                  2
11    mswizdor           True       True         0.9346             0.944222                                  2
12    nitekna            True       True         0.9321             0.941363                                  2
13    jotek7             True       True         0.9192             0.940932                                  2
14    gszpak             True       True         0.9346             0.940160                                  2
15    superhp            True       True         0.9297             0.939981                                  2
16    rudimichal         True       True         0.9277             0.939740                                  2
17    bartek             True       True         0.9268             0.939511                                  2
18    apersona           True       True         0.9287             0.938455                                  2
19    wm320825           True       True         0.9264             0.938122                                  2
20    sir51307           True       True         0.9231             0.937291                                  2
21    ab290668           True       True         0.9325             0.937120                                  2
22    mg320637           True       True         0.9307             0.936966                                  2
23    pgj                True       True         0.9333             0.935773                                  2
24    hm306317           True       True         0.9358             0.934463                                  2
25    jk320790           True       True         0.9224             0.934194                                  2
26    filipborowiec      True       True         0.9252             0.933498                                  2
27    szefo617           True       True         0.9124             0.933342                                  2
28    ksloniewski        True       True         0.9285             0.933070                                  2
29    witek              True       True         0.9240             0.932803                                  2
30    kp321139           True       True         0.9267             0.931472                                  2
31    mduczi             True       True         0.9222             0.927492                                  2
32    mw291426           True       True         0.9145             0.926558                                  2
33    ai292615           True       True         0.9214             0.925945                                  2
34    ps306453           True       True         0.9131             0.925856                                  2
35    insejniasty        True       True         0.9208             0.924352                                  2
36    pk320686           True       True         0.9177             0.921753                                  2
37    pm355765           True       True         0.9125             0.921677                                  2
38    jaszko             True       True         0.9227             0.919855                                  2
39    tobuchowski        True       True         0.8844             0.910186                                  2
40    dc305192           True       True         0.8661             0.892328                                  2
41    chmielu            True       True         0.8760             0.889521                                  2
42    lukaszp            True       True         0.8931             0.888914                                  2
43    makarewicz         True       True         0.8807             0.884934                                  2
44    bartosz            True       True         0.8852             0.871873                                  2
45    masterofu          True       True         0.8023             0.825476                                  2
46    makier             True       True         0.8080             0.821643                                  2
47    engoe              False      True         0.9334             No report file found or report rejected.  2
48    giuseppe.fatiguso  False      True         0.9271             No report file found or report rejected.  2
49    lukasz             False      True         0.9241             No report file found or report rejected.  2
50    reksio             False      True         0.9120             No report file found or report rejected.  2
51    baseline_solution  False      True         0.9119             No report file found or report rejected.  2
52    akarwan            False      True         0.9189             No report file found or report rejected.  2
53    tf248762           False      True         0.9119             No report file found or report rejected.  2
54    mm319482           False      True         0.9112             No report file found or report rejected.  2
55    a.ruta             False      True         0.8784             No report file found or report rejected.  2
56    tobuchowski2       False      True         0.8770             No report file found or report rejected.  2
57    ls306462           False      True         0.7412             No report file found or report rejected.  2
58    kcwong             False      True         0.6904             No report file found or report rejected.  2
59    maniek             False      True         0.0000             No report file found or report rejected.  2

  • Feb. 3, 2014: start of the competition, data sets become available,
  • May 5, 2014: deadline for submitting the predictions,
  • May 7, 2014: deadline for sending the reports, end of the challenge,
  • May 12, 2014: on-line publication of final results, sending invitations for submitting short papers for the special session,
  • June 2, 2014: deadline for submissions of papers describing the selected solutions,
  • June 16, 2014: deadline for submissions of camera-ready papers selected for presentation at the CEIM'14 workshop.

Authors of the top-ranked solutions (based on the final evaluation scores) will be awarded prizes:

  • First Prize: computer hardware worth 2,000 USD + one free FedCSIS'14 conference registration,
  • Second Prize: computer hardware worth 1,000 USD + one free FedCSIS'14 conference registration,
  • Third Prize: one free FedCSIS'14 conference registration.

The award ceremony will take place during the FedCSIS'14 conference (September 7-10, Warsaw). Additionally, invited authors who decide to attend the conference will receive a diploma and a competition T-shirt.

Andrzej Janusz (Chairman), University of Warsaw

Adam Krasuski, Main School of Fire Service & University of Warsaw

Dominik Ślęzak, University of Warsaw & Infobright Inc.

Hung Son Nguyen, University of Warsaw

Sebastian Stawicki, University of Warsaw

Guillermo Rein, Imperial College London

Stanisław Łazowy, Main School of Fire Service

Discussion                                     Author   Replies  Last post
Test set availability                          Dymitr   0        by Dymitr, Monday, May 12, 2014, 19:02:30
A delay in the final evaluation process        Andrzej  0        by Andrzej, Saturday, May 10, 2014, 19:18:47
Leaderboard changes                            Adam     1        by Andrzej, Friday, May 09, 2014, 08:51:27
Deadlines                                      Adam     3        by Andrzej, Monday, May 05, 2014, 04:45:06
End of the competition                         Andrzej  0        by Andrzej, Friday, May 02, 2014, 15:32:10
Number of selected reports to for publication  Eftim    1        by Eftim, Tuesday, April 29, 2014, 14:11:09
100 solutions limit                            Adam     2        by Adam, Wednesday, April 23, 2014, 12:49:34
N/A for preliminary score                      Piotr    1        by Piotr, Sunday, April 06, 2014, 10:33:42
description of evaluation metric               piotr    5        by Andrzej, Tuesday, April 01, 2014, 10:45:57
Welcome to AAIA'14 Data Mining Competition!    Andrzej  0        by Andrzej, Sunday, February 02, 2014, 22:28:19