AAIA'14 DM Competition

11 years, 1 month ago

AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service

AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service is organized within the framework of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html), and is an integral part of the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/2014/ceim.html) devoted to fire protection engineering. The task is related to the problem of extracting useful knowledge from incident reports obtained from the State Fire Service of Poland. Prizes worth over 3,000 USD will be awarded to the most successful teams. The contest is sponsored by Dituel Sp. z o.o. (http://www.dituel.com.pl/) and F&K Consulting Engineers (http://www.fkce.pl/), with support from the University of Warsaw (http://www.mimuw.edu.pl/) and the ICRA project.

Introduction

Incident Data Reporting Systems (IDRS) are used by public safety services across the globe to gather information about the incidents which required their actions. This information is used not only to simply document the events but it can also be incorporated into the training of new officers. Moreover, the knowledge extracted from such reports can help in better identification of threats and in planning of more effective procedures. An example of such a reporting system is EWID which is used by the State Fire Service of Poland. Each report from this system consists of two parts. The first one contains a summary of resources utilized during the action in a form of structured and quantified characteristics. The second part contains a natural language description of the reported events, which is entered by the officer coordinating the action. In the proposed data mining competition, we would like to raise the problem of extracting useful knowledge from the reports generated in the EWID system. In particular, we would like to ask the participants to identify key factors influencing the risk of serious injuries among firefighters and people involved in various incidents.

Special session at CEIM'14: A special session devoted to the competition will be held at the 1st Complex Events and Information Modelling workshop (CEIM'14, https://fedcsis.org/ceim), which is a part of the 9th International Symposium on Advances in Artificial Intelligence and Applications (AAIA'14, https://fedcsis.org/2014/aaia.html). We will invite authors of selected reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The invited teams will be chosen based on their final rank, innovativeness of their approach and quality of the submitted report.

Summation

Our AAIA'14 Data Mining Competition: Key risk factors for Polish State Fire Service has come to an end. Thank you very much for your hard work! You managed to improve the baseline result by over 5%.

The competition attracted a total of 116 teams, of which 57 were active and submitted at least one solution to the leaderboard. The total number of submissions was nearly 1300. Of the active teams, 46 provided us with a brief report describing their approach.

The official Winners:

Adam Zagórecki, Cranfield University, United Kingdom (team zagorecki)
Dymitr Ruta, EBTIC, Khalifa University, United Arab Emirates (team dymitrruta)
Stefan Nikolić, Vladimir Ivančević, Marko Knežević, and Ivan Luković, University of Novi Sad, Serbia (team stefan.nikolic)

Congratulations on your excellent results!

We would also like to distinguish six teams (zagorecki, dymitrruta, stefan.nikolic, piotr, marcb and nitekna) and invite them to contribute extended versions of their reports to the CEIM'14 workshop (https://fedcsis.org/ceim), which will host a session devoted to our contest. The organizers of the workshop will send separate invitation letters shortly.

Task description

Data format: The training data set is provided in two formats. The first one is a traditional tabular representation of the data as a comma-separated values file, namely trainingData.csv. Each row of this file represents a single EWID report and its consecutive columns contain the values of the report's characteristics. The attributes in this table can be divided into two groups. The first group contains the features extracted from the quantitative part of the report and the second group corresponds to a document-term matrix obtained from the natural language description sections. In total, the training data available to participants stores information about 50,000 incident reports, which are described by 11,852 attributes. All the conditional attributes are discrete and only a few have more than two possible values. For the convenience of participants, the same data set is available in a sparse matrix format as an EAV (entity-attribute-value) file, namely trainingData.eav. Every row of this file contains exactly three integers: an identifier of an object, an identifier of an attribute and the corresponding value. Each report is also assigned values of three binary decision attributes. Those values for the training data are stored in the file decisionLabels.csv, which is available to all participants. The first decision attribute indicates incidents in which one of the firefighters or members of the rescue team was seriously injured or killed. The second decision attribute indicates cases in which there were children among the injured people and the third attribute identifies situations where civilians were hurt. It is worth noting that the nature of the considered problem makes the provided data set high-dimensional, since the total number of conditional attributes corresponds to the number of distinct words in the textual part of the reports (after lemmatization) plus several hundred attributes from the quantitative part of the reports.
The data is also sparse, since only a small fraction of the attributes have a non-zero value for a particular report. In addition, all three decision attributes are highly imbalanced, since the positive classes correspond to relatively rare events. There is also a separate test data set which will be used for the evaluation of submissions. It has similar characteristics to the training data, but it will not be made available to participants of the competition.
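
To illustrate the EAV layout described above, the following minimal Python sketch reads (object, attribute, value) triples into a nested dictionary. It is not the organizers' code. The field separator (whitespace or comma) and the convention that an attribute missing from the file has value 0 are assumptions, and the names parse_eav and attribute_value are hypothetical.

```python
import re
from collections import defaultdict


def parse_eav(lines):
    """Turn EAV rows into {object_id: {attribute_id: value}}.

    Assumption: the three integers in a row are separated by
    whitespace and/or commas; the source does not state the separator.
    """
    reports = defaultdict(dict)
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = re.split(r"[,\s]+", line)
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 integers, got {len(fields)} fields"
            )
        object_id, attribute_id, value = map(int, fields)
        reports[object_id][attribute_id] = value
    return dict(reports)


def attribute_value(reports, object_id, attribute_id):
    """Look up one cell; cells absent from the sparse file are read as 0 (assumption)."""
    return reports.get(object_id, {}).get(attribute_id, 0)


# Made-up rows for illustration only, not taken from trainingData.eav.
sample_rows = ["1 5 1", "1 17 2", "2 5 1"]
sample_reports = parse_eav(sample_rows)
```

With a real file, the same function could be called as parse_eav(open("trainingData.eav")), since it accepts any iterable of text lines.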

Format of submissions: Participants are asked to indicate sets of attributes that allow the incidents to be classified accurately, and to send their solutions through the submission system. Each solution should be sent as a single text file containing exactly ten lines. Each line should contain at least three integers indicating attributes from the training data set, separated by commas and without any spaces. There is no upper limit on the number of attributes in a single line; however, the evaluation system penalizes solutions that use a large number of features.
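
The file format above can be checked locally before uploading. The sketch below is only an illustration under stated assumptions: the function names format_submission and parse_submission are hypothetical, attribute identifiers are assumed to be non-negative integers, and the file is assumed to end with a newline after the tenth line.

```python
import re

# One line: at least three non-negative integers, comma-separated, no spaces.
LINE_PATTERN = re.compile(r"\d+(,\d+){2,}")


def format_submission(attribute_sets):
    """Render ten attribute sets as the text of a submission file."""
    if len(attribute_sets) != 10:
        raise ValueError("a solution must contain exactly ten attribute sets")
    lines = []
    for index, attributes in enumerate(attribute_sets, start=1):
        if len(attributes) < 3:
            raise ValueError(f"line {index}: at least three attributes are required")
        lines.append(",".join(str(int(a)) for a in attributes))
    return "\n".join(lines) + "\n"


def parse_submission(text):
    """Validate submission text and return the ten attribute lists."""
    lines = text.splitlines()
    if len(lines) != 10:
        raise ValueError(f"expected exactly 10 lines, got {len(lines)}")
    parsed = []
    for index, line in enumerate(lines, start=1):
        if not LINE_PATTERN.fullmatch(line):
            raise ValueError(f"line {index}: malformed attribute list {line!r}")
        parsed.append([int(field) for field in line.split(",")])
    return parsed


# Made-up attribute identifiers, for illustration only.
example_sets = [[1, 2, 3]] * 10
example_text = format_submission(example_sets)
```

Writing example_text to a file and reading it back through parse_submission gives a quick round-trip check that the line count and separators match the rules.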

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, which corresponds to approximately 10% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during the CEIM'14 workshop (https://fedcsis.org/ceim) at the FedCSIS'14 conference.

Quality of the submissions will be assessed by measuring performance of a classifier ensemble composed of Naive Bayes models. Those models will be constructed using attribute sets indicated in the submitted solution, separately for each decision attribute. An output of the ensemble will be computed by averaging probabilities of the positive classes returned by individual Naive Bayes models. All training data will be used for the construction of the models and the test will be performed on a separate data set which is not available for participants. The performance of the ensemble will be measured by taking an average Area Under the ROC Curve (AUC) over the probability predictions for each decision attribute, decreased by a penalty for using a large number of conditional attributes. Namely, if we denote by:

$$ \begin{array}{ccl} s & - & \textrm{a submitted solution}, \\ |s| & - & \textrm{a total number of attributes used in the solution (with repetitions)}, \\ AUC_i(s) & - & \textrm{Area Under the ROC Curve (AUC) of a classifier ensemble for the i-th decision attribute}, \end{array} $$

then the quality measure used for the assessment of submissions can be expressed as:

\[score(s) = F \left(\frac{1}{3}\sum\limits_{i = 1}^3 AUC_i(s) - penalty(s)\right)\]

where the penalty is equal to:

\[penalty(s) = \left(\frac{|s| - 30}{1000}\right)^2\]

and the function F is defined as: $$F(x) = \begin{cases} x & \textrm{for } x > 0\\ 0 & \textrm{otherwise}. \end{cases}$$
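
The formulas above translate directly into a few lines of Python. The sketch below is an illustration, not the evaluation system itself: it takes the positive-class probabilities of the individual models as given (training the Naive Bayes models is out of scope here), and the function names are hypothetical. The AUC is computed by the pairwise definition, which is simple but quadratic in the number of reports.

```python
def ensemble_output(model_probabilities):
    """Average positive-class probabilities over models, report by report.

    model_probabilities: one list of probabilities per model, all of equal length.
    """
    n_models = len(model_probabilities)
    return [sum(column) / n_models for column in zip(*model_probabilities)]


def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly; ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def penalty(num_attributes):
    """penalty(s) = ((|s| - 30) / 1000)^2, with |s| counted with repetitions."""
    return ((num_attributes - 30) / 1000) ** 2


def score(aucs, num_attributes):
    """score(s) = F(mean of the three AUC_i(s) minus penalty(s))."""
    if len(aucs) != 3:
        raise ValueError("one AUC per decision attribute is expected")
    x = sum(aucs) / 3 - penalty(num_attributes)
    return x if x > 0 else 0.0


# Made-up numbers for illustration: 130 attributes cost (100/1000)^2 = 0.01.
example_score = score([0.95, 0.93, 0.94], 130)
```

Since a valid file has ten lines with at least three attributes each, |s| is at least 30, so the penalty vanishes only for the smallest admissible solution and grows quadratically from there: 130 attributes cost 0.01, while 530 attributes cost 0.25.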

An exemplary solution: We prepared a simple solution to give an example of a correctly formatted submission file. It is available here. The attributes in the example were selected based on their correlation with the decisions. The preliminary evaluation score of this solution is 0.9119; it is displayed on the leaderboard as the baseline_solution score. In case of any questions, please post on the forum or write us an email: AAIA14Contest@mimuw.edu.pl
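
The exact procedure behind the baseline is not described beyond "correlation with the decisions". One plausible reading, shown here purely as an assumption, is to rank attributes by the absolute Pearson correlation with a decision attribute and keep the top k. The names pearson and top_correlated are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant, so that constant
    attributes simply rank last instead of raising an error.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / sqrt(var_x * var_y)


def top_correlated(columns, decision, k):
    """Return the k attribute ids with the largest |correlation| to the decision.

    columns: {attribute_id: list of values, one per report}
    decision: list of 0/1 labels, one per report
    """
    ranked = sorted(
        columns,
        key=lambda attribute_id: abs(pearson(columns[attribute_id], decision)),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy data for illustration only.
toy_columns = {1: [0, 0, 1, 1], 2: [1, 0, 1, 0], 3: [0, 0, 0, 0]}
toy_decision = [0, 0, 1, 1]
best = top_correlated(toy_columns, toy_decision, 1)
```

Running such a ranking once per decision attribute would yield candidate lines for a submission file; whether the baseline was built this way is not stated in the description.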

Final results

Rank	Team Name	Is Report		Preliminary Score	Final Score	Submissions
1	zagorecki	True	True	0.9583	0.962325	2
2	dymitrruta	True	True	0.9520	0.960775	2
3	stefan.nikolic	True	True	0.9444	0.959698	2
4	bongod	True	True	0.9411	0.954148	2
5	piotr	True	True	0.9376	0.954015	2
6	marcb	True	True	0.9491	0.953572	2
7	jz	True	True	0.9359	0.950105	2
8	lp319499	True	True	0.9370	0.949651	2
9	ts277592	True	True	0.9370	0.949426	2
10	korzenek	True	True	0.9252	0.945968	2
11	mswizdor	True	True	0.9346	0.944222	2
12	nitekna	True	True	0.9321	0.941363	2
13	jotek7	True	True	0.9192	0.940932	2
14	gszpak	True	True	0.9346	0.940160	2
15	superhp	True	True	0.9297	0.939981	2
16	rudimichal	True	True	0.9277	0.939740	2
17	bartek	True	True	0.9268	0.939511	2
18	apersona	True	True	0.9287	0.938455	2
19	wm320825	True	True	0.9264	0.938122	2
20	sir51307	True	True	0.9231	0.937291	2
21	ab290668	True	True	0.9325	0.937120	2
22	mg320637	True	True	0.9307	0.936966	2
23	pgj	True	True	0.9333	0.935773	2
24	hm306317	True	True	0.9358	0.934463	2
25	jk320790	True	True	0.9224	0.934194	2
26	filipborowiec	True	True	0.9252	0.933498	2
27	szefo617	True	True	0.9124	0.933342	2
28	ksloniewski	True	True	0.9285	0.933070	2
29	witek	True	True	0.9240	0.932803	2
30	kp321139	True	True	0.9267	0.931472	2
31	mduczi	True	True	0.9222	0.927492	2
32	mw291426	True	True	0.9145	0.926558	2
33	ai292615	True	True	0.9214	0.925945	2
34	ps306453	True	True	0.9131	0.925856	2
35	insejniasty	True	True	0.9208	0.924352	2
36	pk320686	True	True	0.9177	0.921753	2
37	pm355765	True	True	0.9125	0.921677	2
38	jaszko	True	True	0.9227	0.919855	2
39	tobuchowski	True	True	0.8844	0.910186	2
40	dc305192	True	True	0.8661	0.892328	2
41	chmielu	True	True	0.8760	0.889521	2
42	lukaszp	True	True	0.8931	0.888914	2
43	makarewicz	True	True	0.8807	0.884934	2
44	bartosz	True	True	0.8852	0.871873	2
45	masterofu	True	True	0.8023	0.825476	2
46	makier	True	True	0.8080	0.821643	2
47	engoe	False	True	0.9334	No report file found or report rejected.	2
48	giuseppe.fatiguso	False	True	0.9271	No report file found or report rejected.	2
49	lukasz	False	True	0.9241	No report file found or report rejected.	2
50	reksio	False	True	0.9120	No report file found or report rejected.	2
51	baseline_solution	False	True	0.9119	No report file found or report rejected.	2
52	akarwan	False	True	0.9189	No report file found or report rejected.	2
53	tf248762	False	True	0.9119	No report file found or report rejected.	2
54	mm319482	False	True	0.9112	No report file found or report rejected.	2
55	a.ruta	False	True	0.8784	No report file found or report rejected.	2
56	tobuchowski2	False	True	0.8770	No report file found or report rejected.	2
57	ls306462	False	True	0.7412	No report file found or report rejected.	2
58	kcwong	False	True	0.6904	No report file found or report rejected.	2
59	maniek	False	True	0.0000	No report file found or report rejected.	2

Schedule

Feb. 3, 2014: start of the competition, data sets become available,
May 5, 2014: deadline for submitting the predictions,
May 7, 2014: deadline for sending the reports, end of the challenge,
May 12, 2014: on-line publication of final results, sending invitations for submitting short papers for the special session,
June 2, 2014: deadline for submissions of papers describing the selected solutions,
June 16, 2014: deadline for submissions of camera-ready papers selected for presentation at the CEIM'14 workshop.

Awards

Authors of the top-ranked solutions (based on the final evaluation scores) will be awarded prizes:

First Prize: computer hardware worth 2,000 USD + one free FedCSIS'14 conference registration,
Second Prize: computer hardware worth 1,000 USD + one free FedCSIS'14 conference registration,
Third Prize: one free FedCSIS'14 conference registration.

The award ceremony will take place during the FedCSIS'14 conference (September 7-10, Warsaw). Additionally, invited authors who decide to attend the conference will receive a diploma and a competition T-shirt.

Contest organizing committee

Andrzej Janusz (Chairman), University of Warsaw

Adam Krasuski, Main School of Fire Service & University of Warsaw

Dominik Ślęzak, University of Warsaw & Infobright Inc.

Hung Son Nguyen, University of Warsaw

Sebastian Stawicki, University of Warsaw

Guillermo Rein, Imperial College London

Stanisław Łazowy, Main School of Fire Service

Forum

Discussion	Author	Replies	Last post
Test set availability	Dymitr	0	by Dymitr Monday, May 12, 2014, 19:02:30
A delay in the final evaluation process	Andrzej	0	by Andrzej Saturday, May 10, 2014, 19:18:47
Leaderboard changes	Adam	1	by Andrzej Friday, May 09, 2014, 08:51:27
Deadlines	Adam	3	by Andrzej Monday, May 05, 2014, 04:45:06
End of the competition	Andrzej	0	by Andrzej Friday, May 02, 2014, 15:32:10
Number of selected reports to for publication	Eftim	1	by Eftim Tuesday, April 29, 2014, 14:11:09
100 solutions limit	Adam	2	by Adam Wednesday, April 23, 2014, 12:49:34
N/A for preliminary score	Piotr	1	by Piotr Sunday, April 06, 2014, 10:33:42
description of evaluation metric	piotr	5	by Andrzej Tuesday, April 01, 2014, 10:45:57
Welcome to AAIA'14 Data Mining Competition!	Andrzej	0	by Andrzej Sunday, February 02, 2014, 22:28:19