
AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service

AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service is organized within the framework of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/aaia), and is an integral part of the 1st Complex Events and Information Modelling workshop (CEIM'14 https://fedcsis.org/ceim) devoted to fire protection engineering. The task is related to the problem of extracting useful knowledge from incident reports obtained from the State Fire Service of Poland. Prizes worth over 3,000 USD will be awarded to the most successful teams. The contest is sponsored by Dituel Sp. z o.o. (http://www.dituel.com.pl/) and F&K Consulting Engineers (http://www.fkce.pl/), with support from the University of Warsaw (http://www.mimuw.edu.pl/) and the ICRA project (http://icra-project.org/).

Introduction

Incident Data Reporting Systems (IDRS) are used by public safety services across the globe to gather information about the incidents which required their actions. This information is used not only to document the events; it can also be incorporated into the training of new officers. Moreover, the knowledge extracted from such reports can help in better identification of threats and in planning more effective procedures. An example of such a reporting system is EWID, which is used by the State Fire Service of Poland. Each report from this system consists of two parts. The first one contains a summary of resources utilized during the action in the form of structured and quantified characteristics. The second part contains a natural language description of the reported events, which is entered by the officer coordinating the action. In the proposed data mining competition, we would like to raise the problem of extracting useful knowledge from the reports generated in the EWID system. In particular, we would like to ask the participants to identify key factors influencing the risk of serious injuries among firefighters and people involved in various incidents.

Special session at CEIM'14: A special session devoted to the competition will be held at the 1st Complex Events and Information Modelling workshop (CEIM'14 https://fedcsis.org/ceim), which is part of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/aaia). We will invite authors of selected reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The invited teams will be chosen based on their final rank, the innovativeness of their approach and the quality of the submitted report.

Terms & Conditions
 
 
  • The competition is open to all interested researchers, specialists and students. Only members of the Contest Organizing Committee cannot participate.
  • Participants may submit solutions as teams made up of one or more persons.
  • Each team needs to designate a leader responsible for communication with the Organizers. A single person can be a leader of only one team.
  • One person may be a member of at most 3 teams.
  • Each team needs to be composed of a different set of persons.
  • The total number of submissions for any single team is limited to 100 solutions.
  • Each team is obliged to provide a short report describing their final solution. Reports must contain information such as the name of the team, the names of all team members, the last preliminary evaluation score and a brief overview of the approach used. Their length should not exceed 2000 words and they should be submitted in PDF format using our submission system by May 19, 2014. Only submissions made by teams that provided the reports will qualify for the final evaluation.

In case of questions related to the competition please contact us via email: AAIA14Contest@mimuw.edu.pl.

Data format: The training data set is provided in two formats. The first is a traditional tabular representation stored as a comma-separated values file, trainingData.csv. Each row of this file represents a single EWID report, and the consecutive columns contain the values of its characteristics. The attributes in this table fall into two groups: features extracted from the quantitative part of the report, and a document-term matrix obtained from the natural language description sections. In total, the training data available to participants describe 50,000 incident reports using 11,852 attributes. All the conditional attributes are discrete and only a few have more than two possible values. For the convenience of participants, the same data set is also available in a sparse matrix format as an EAV file, trainingData.eav. Every row of this file contains exactly three integers: an identifier of an object, an identifier of an attribute and the corresponding value.

Each report is also assigned values of three binary decision attributes. These values for the training data are stored in the file decisionLabels.csv, which is available to all participants. The first decision attribute indicates incidents in which a firefighter or a member of the rescue team was seriously injured or killed. The second indicates cases in which there were children among the injured people, and the third identifies situations in which civilians were hurt.

The nature of the problem implies that the data set is high-dimensional, since the total number of conditional attributes corresponds to the number of distinct words in the textual part of the reports (after lemmatization) plus several hundred attributes from the quantitative part. The data is also sparse, since only a small fraction of the attributes have a non-zero value for any particular report. In addition, all three decision attributes are highly imbalanced, since the positive classes correspond to relatively rare events.

There is also a separate test data set which will be used for the evaluation of submissions. It has similar characteristics to the training data, but it will not be made available to participants of the competition.
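As an illustration of the EAV layout (object identifier, attribute identifier, value), the following minimal Python sketch reads such triples into a nested dictionary. It rests on assumptions that the description does not confirm: that the three integers on a row are separated by commas and/or whitespace, and that an attribute missing from the EAV file has the value 0 for that report. The function names parse_eav and get_value are hypothetical helpers, not part of the competition materials.

```python
import re
from collections import defaultdict


def parse_eav(lines):
    """Build {object_id: {attribute_id: value}} from an iterable of EAV rows."""
    reports = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        # Assumption: the three integers are separated by commas and/or whitespace.
        tokens = re.split(r"[,\s]+", line)
        if len(tokens) != 3:
            raise ValueError(f"expected three integers, got: {line!r}")
        obj, attr, value = (int(tok) for tok in tokens)
        reports[obj][attr] = value
    return dict(reports)


def get_value(reports, obj, attr):
    """Return the stored value, or 0 when the pair is absent (assumed sparse default)."""
    return reports.get(obj, {}).get(attr, 0)


# Hypothetical rows, not taken from trainingData.eav.
sample_rows = ["1 5 1", "1 17 2", "2 5 1"]
sample_reports = parse_eav(sample_rows)
```

For the real file, the same function could be applied to an open file handle, for example parse_eav(open("trainingData.eav")), since a file object iterates over its lines.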

Format of submissions: Participants are asked to indicate sets of attributes that allow the incidents to be classified accurately, and to send us their solutions using the submission system. Each solution should be sent as a single text file containing exactly ten lines. Each line should contain at least three integers indicating attributes from the training data set, separated by commas and without any spaces. There is no upper limit on the number of attributes in a single line; however, the evaluation system will penalize solutions that use a large number of features.
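The formatting rules above (exactly ten lines, at least three integers per line, commas without spaces) can be checked before uploading. The sketch below is only an illustration: format_submission is a hypothetical helper, the attribute numbers are made up, and whether the submission system accepts a trailing newline after the tenth line is an assumption.

```python
def format_submission(attribute_sets):
    """Return the text of a submission file for ten attribute sets."""
    if len(attribute_sets) != 10:
        raise ValueError("a solution must contain exactly ten lines")
    lines = []
    for number, attrs in enumerate(attribute_sets, start=1):
        if len(attrs) < 3:
            raise ValueError(f"line {number} needs at least three attributes")
        for attr in attrs:
            # bool is a subclass of int, so it is rejected explicitly
            if not isinstance(attr, int) or isinstance(attr, bool):
                raise ValueError(f"line {number}: {attr!r} is not an integer")
        # comma-separated, no spaces
        lines.append(",".join(str(attr) for attr in attrs))
    # Assumption: a trailing newline after the tenth line is acceptable.
    return "\n".join(lines) + "\n"


# Made-up attribute identifiers, for illustration only.
example_sets = [[10 * i + 1, 10 * i + 2, 10 * i + 3] for i in range(10)]
example_text = format_submission(example_sets)
```

Writing example_text to a file with open("solution.txt", "w") would then produce a file in the described layout; the file name is arbitrary.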

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, corresponding to approximately 10% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during the CEIM'14 workshop (https://fedcsis.org/ceim) at the FedCSIS'14 conference.

The quality of the submissions will be assessed by measuring the performance of a classifier ensemble composed of Naive Bayes models. Those models will be constructed using the attribute sets indicated in the submitted solution, separately for each decision attribute. The output of the ensemble will be computed by averaging the probabilities of the positive classes returned by the individual Naive Bayes models. All training data will be used for the construction of the models, and the test will be performed on a separate data set which is not available to participants. The performance of the ensemble will be measured as the average Area Under the ROC Curve (AUC) over the probability predictions for each decision attribute, decreased by a penalty for using a large number of conditional attributes. Namely, if we denote by:

$$ \begin{array}{ccl} s & - & \textrm{a submitted solution}, \\ |s| & - & \textrm{a total number of attributes used in the solution (with repetitions)}, \\ AUC_i(s) & - & \textrm{Area Under the ROC Curve (AUC) of a classifier ensemble for the i-th decision attribute}, \end{array} $$

then the quality measure used for the assessment of submissions can be expressed as:

\[score(s) = F \left(\frac{1}{3}\sum\limits_{i = 1}^3 AUC_i(s) - penalty(s)\right)\]

where the penalty is equal to:

\[penalty(s) = \left(\frac{|s| - 30}{1000}\right)^2\]

and the function F is defined as:

$$F(x) = \begin{cases} x & \textrm{for } x > 0, \\ 0 & \textrm{otherwise}. \end{cases}$$
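To make the arithmetic of the quality measure concrete, here is a short Python sketch of the penalty and the final score, written directly from the formulas above. It takes the three AUC values as given and does not reproduce the organizers' Naive Bayes ensemble; the function names are illustrative. Since a valid solution has ten lines with at least three attributes each, |s| is at least 30, so the penalty is zero only for the smallest possible solution and grows quadratically from there.

```python
def penalty(n_attributes):
    """penalty(s) = ((|s| - 30) / 1000)^2, where |s| counts attributes with repetitions."""
    return ((n_attributes - 30) / 1000) ** 2


def score(aucs, n_attributes):
    """score(s) = F(mean of the three AUC values - penalty(s)), with F(x) = max(x, 0)."""
    if len(aucs) != 3:
        raise ValueError("one AUC value is expected per decision attribute")
    mean_auc = sum(aucs) / 3
    return max(mean_auc - penalty(n_attributes), 0.0)


def count_attributes(attribute_sets):
    """|s|: total number of attributes over all lines, repetitions included."""
    return sum(len(attrs) for attrs in attribute_sets)
```

For example, with hypothetical AUC values of 0.90, 0.95 and 1.00 and |s| = 130, the mean AUC is 0.95 and the penalty is (100/1000)^2 = 0.01, which gives a score of about 0.94.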

An exemplary solution: We prepared a simple solution to give an example of a correctly formatted submission file. It is available here. The attributes in the example were selected based on their correlation with the decisions. The preliminary evaluation score of this solution is 0.9119; it is displayed on the leaderboard as the baseline_solution score. In case of any questions, please post on the forum or write us an email: AAIA14Contest@mimuw.edu.pl

Rank Team Name Score Submission Date
1 zagorecki 0.962325 Tuesday, May 6, 2014, 00:17:35
2 dymitrruta 0.960775 Monday, April 21, 2014, 22:42:44
3 stefan.nikolic 0.959698 Sunday, May 4, 2014, 23:25:32
4 bongod 0.954148 Monday, May 5, 2014, 22:58:27
5 piotr 0.954015 Sunday, April 27, 2014, 20:17:35
6 marcb 0.953572 Thursday, April 24, 2014, 12:02:28
7 jz 0.950105 Saturday, May 3, 2014, 15:01:16
8 lp319499 0.949651 Thursday, May 1, 2014, 16:07:43
9 ts277592 0.949426 Saturday, April 26, 2014, 19:50:24
10 korzenek 0.945968 Friday, May 2, 2014, 17:02:54
11 mswizdor 0.944222 Saturday, April 26, 2014, 10:51:01
12 nitekna 0.941363 Wednesday, April 30, 2014, 01:54:46
13 jotek7 0.940932 Thursday, May 1, 2014, 18:35:01
14 gszpak 0.940160 Friday, April 25, 2014, 23:33:22
15 superhp 0.939981 Friday, April 25, 2014, 00:17:08
16 rudimichal 0.939740 Friday, April 25, 2014, 23:56:53
17 bartek 0.939511 Saturday, April 26, 2014, 00:05:13
18 apersona 0.938455 Friday, April 25, 2014, 23:08:23
19 wm320825 0.938122 Thursday, April 24, 2014, 12:56:19
20 sir51307 0.937291 Monday, May 5, 2014, 00:19:22
21 ab290668 0.937120 Saturday, May 3, 2014, 16:25:57
22 mg320637 0.936966 Friday, April 25, 2014, 21:43:35
23 pgj 0.935773 Thursday, April 24, 2014, 20:08:41
24 hm306317 0.934463 Tuesday, April 22, 2014, 21:13:22
25 jk320790 0.934194 Sunday, May 4, 2014, 01:58:16
26 filipborowiec 0.933498 Sunday, April 20, 2014, 21:43:40
27 szefo617 0.933342 Sunday, May 4, 2014, 15:30:32
28 ksloniewski 0.933070 Tuesday, May 6, 2014, 01:59:08
29 witek 0.932803 Tuesday, April 29, 2014, 19:23:05
30 kp321139 0.931472 Monday, May 5, 2014, 23:33:09
31 mduczi 0.927492 Friday, April 25, 2014, 21:06:31
32 mw291426 0.926558 Thursday, May 1, 2014, 19:40:43
33 ai292615 0.925945 Tuesday, April 29, 2014, 17:35:00
34 ps306453 0.925856 Saturday, April 26, 2014, 00:33:07
35 insejniasty 0.924352 Wednesday, April 30, 2014, 14:15:28
36 pk320686 0.921753 Saturday, April 26, 2014, 20:55:52
37 pm355765 0.921677 Friday, May 2, 2014, 12:06:48
38 jaszko 0.919855 Friday, April 25, 2014, 11:48:23
39 tobuchowski 0.910186 Wednesday, April 30, 2014, 14:24:29
40 dc305192 0.892328 Saturday, April 26, 2014, 17:29:16
41 chmielu 0.889521 Sunday, May 4, 2014, 04:18:14
42 lukaszp 0.888914 Sunday, May 4, 2014, 00:08:02
43 makarewicz 0.884934 Friday, April 25, 2014, 14:18:33
44 bartosz 0.871873 Tuesday, May 6, 2014, 01:57:35
45 masterofu 0.825476 Saturday, April 26, 2014, 00:14:06
46 makier 0.821643 Thursday, April 24, 2014, 21:46:29
47 engoe No report file found! Friday, April 25, 2014, 08:55:41
48 lukasz No report file found! Saturday, April 12, 2014, 00:17:13
49 giuseppe.fatiguso No report file found! Saturday, May 3, 2014, 13:38:00
50 reksio No report file found! Tuesday, March 4, 2014, 14:43:18
51 baseline_solution No report file found! Monday, February 3, 2014, 08:58:38
52 akarwan No report file found! Monday, April 28, 2014, 12:13:57
53 tf248762 No report file found! Monday, May 5, 2014, 17:47:57
54 mm319482 No report file found! Monday, May 5, 2014, 17:34:40
55 a.ruta No report file found! Monday, May 5, 2014, 23:49:38
56 tobuchowski2 No report file found! Wednesday, April 30, 2014, 19:19:15
57 ls306462 No report file found! Friday, March 14, 2014, 11:26:23
58 kcwong No report file found! Friday, March 7, 2014, 04:10:11
59 maniek No report file found! Monday, February 3, 2014, 22:47:59

Our AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service has come to an end. Thank you very much for your hard work! You managed to improve the baseline result by over 5%.

The competition attracted a total of 116 teams, of which 57 were active and submitted at least one solution to the leaderboard. The total number of submissions was nearly 1300. Of the active teams, 46 provided us with a brief report describing their approach.

The official winners:

  1. Adam Zagórecki, Cranfield University, United Kingdom (team zagorecki)

  2. Dymitr Ruta, EBTIC, Khalifa University, United Arab Emirates (team dymitrruta)

  3. Stefan Nikolić, Vladimir Ivančević, Marko Knežević, and Ivan Luković, University of Novi Sad, Serbia (team stefan.nikolic)

Congratulations on your excellent results!

We would also like to distinguish six teams: zagorecki, dymitrruta, stefan.nikolic, piotr, marcb and nitekna, and invite them to contribute extended versions of their reports to the CEIM'14 workshop (https://fedcsis.org/ceim), which will host a session devoted to our contest. Organizers of the workshop will be sending separate invitation letters shortly.

  • Feb. 3, 2014: start of the competition, data sets become available,
  • May 5, 2014: deadline for submitting the predictions,
  • May 7, 2014: deadline for sending the reports, end of the challenge,
  • May 12, 2014: on-line publication of final results, sending invitations for submitting short papers for the special session,
  • June 2, 2014: deadline for submissions of papers describing the selected solutions,
  • June 16, 2014: deadline for submissions of camera-ready papers selected for presentation at the CEIM'14 workshop.

Authors of the top-ranked solutions (based on the final evaluation scores) will be awarded prizes:

  • First Prize: computer hardware worth 2,000 USD + one free FedCSIS'14 conference registration,
  • Second Prize: computer hardware worth 1,000 USD + one free FedCSIS'14 conference registration,
  • Third Prize: one free FedCSIS'14 conference registration.

The award ceremony will take place during the FedCSIS'14 conference (September 7-10, Warsaw). Additionally, invited authors who decide to attend the conference will receive a diploma and a competition T-shirt.

Andrzej Janusz (Chairman), University of Warsaw

Adam Krasuski, Main School of Fire Service & University of Warsaw

Dominik Ślęzak, University of Warsaw & Infobright Inc.

Hung Son Nguyen, University of Warsaw

Sebastian Stawicki, University of Warsaw

Guillermo Rein, Imperial College London

Stanisław Łazowy, Main School of Fire Service

Discussion Author Replies Last post
Test set availability Dymitr 0 by Dymitr, Monday, May 12, 2014, 19:02:30
A delay in the final evaluation process Andrzej 0 by Andrzej, Saturday, May 10, 2014, 19:18:47
Leaderboard changes Adam 1 by Andrzej, Friday, May 09, 2014, 08:51:27
Deadlines Adam 3 by Andrzej, Monday, May 05, 2014, 04:45:06
End of the competition Andrzej 0 by Andrzej, Friday, May 02, 2014, 15:32:10
Number of selected reports to for publication Eftim 1 by Eftim, Tuesday, April 29, 2014, 14:11:09
100 solutions limit Adam 2 by Adam, Wednesday, April 23, 2014, 12:49:34
description of evaluation metric piotr 5 by Andrzej, Tuesday, April 01, 2014, 10:45:57
N/A for preliminary score Piotr 1 by Piotr, Sunday, April 06, 2014, 10:33:42
Welcome to AAIA'14 Data Mining Competition! Andrzej 0 by Andrzej, Sunday, February 02, 2014, 22:28:19