FedCSIS'20 Challenge

Big news for all of you who would like to continue research related to this competition and evaluate your new results!

Ground truth test target values from the Network Device Workload Prediction challenge are now available for download. Responding to numerous requests, we decided to give you access to all evaluation files (check them out in the Data files section). Scroll down for a detailed description of the competition task.

If you are planning to use the data from this challenge in your publications, please add a reference to the paper describing the competition:
Andrzej Janusz, Mateusz Przyborowski, Piotr Biczyk, Dominik Ślęzak: Network Device Workload Prediction: A Data Mining Challenge at Knowledge Pit. FedCSIS 2020: 77-80

EMCA Software is the vendor of Energy Logserver - a system that collects data from various log sources to provide in-depth data analysis and alerting to its end users. EMCA is based in Poland but also operates in the Nordics, APAC, and the USA through partner channels. The company focuses on cybersecurity and IT infrastructure monitoring use cases, aiming to deliver a ready-to-use system that offers built-in correlations and predictions on monitored data.

With this challenge, we want to help EMCA answer the question of whether it is possible to reliably predict workload-related characteristics of monitored devices based on historical data gathered from those devices. This task is of great importance to IT and technical teams, who would gain a tool for managing the capacity of their infrastructure.

An additional difficulty within this challenge, and also the reason why it might be especially interesting for the data science community, arises from the fact that devices considered in the data are not uniform. In essence, logs cover readings from various types of hardware. Some of them are cross-dependent, as they are a part of the same IT system. Moreover, some devices have multiple interfaces for which the data is aggregated.

More details regarding the task and a description of the challenge data can be found in the Task description section.

Special session at FedCSIS 2020: As in previous years, a special session devoted to the competition was held at the conference. We invited authors of selected challenge reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The papers were indexed by the IEEE Digital Library and Web of Science. The invited teams were chosen based on their final rank, innovativeness of their approach, and quality of the submitted report.

All published papers are available in the conference proceedings: https://annals-csis.org/Volume_21/

Terms & Conditions

Contest Participation Rules:

The competition organizers are QED Software sp. z o.o. and EMCA Software sp. z o.o.
The competition is open to all interested researchers, specialists, and students. Only members of the Contest Organizing Committee and employees of EMCA Software and QED Software cannot participate.
Participants may submit solutions as teams made up of one or more persons.
The deadline for submitting the solutions is June 8, 2020 (23:59 GMT).
Each team needs to designate a leader responsible for communication with the Organizers. A single person can be a leader of only one team.
One KnowledgePit account can only be associated with a single team at a time. It is not possible to withdraw from a team, but teams can be merged.
Each team needs to be composed of a different set of persons.
A single person may enroll in the challenge with only one KnowledgePit account.
Each team is obliged to provide a short report describing their final solution. The report must contain information such as the name of the team, the names of all team members, and a brief overview of the approach used. The description should explain all data preprocessing steps and model construction steps. It should be submitted in PDF format using our submission system by June 10, 2020 (23:59 GMT). Only submissions made by teams that provided the reports will qualify for the final evaluation.
After the final evaluation, the three top-ranked teams will be asked to provide the source code that can be used to reproduce their final solutions, together with documentation that allows the code to be run. If the code has to be run within a complex environment (e.g., a distributed Hadoop cluster), a detailed setup explanation should be provided as well. The source code will be used to verify the legitimacy of the solutions. Winners of the challenge are chosen from the top-ranked teams that provide reports and legitimate source code of their solutions for verification. Only such teams are eligible for the awards in this challenge.
Additionally, in this challenge, the winners are eligible for money prizes only if their final solution improves the baseline score by at least 10%.
The fact of accepting the award is equivalent to granting the organizers a worldwide, non-exclusive, sub-licensable, transferable, royalty-free, perpetual and irrevocable right to use, reproduce, distribute, create derivative works of, publicly perform, publicly display, digitally perform, make, have made, sell, offer for sale and/or import the winning submission and the source code used to generate it, in any media now known or hereafter developed, for any purpose whatsoever, commercial or otherwise, without further approval or any payment to the participant. By accepting the award the participants also acknowledge that they have full and unrestricted rights to grant aforementioned rights.
The fact of accepting the award is equivalent to allowing for usage of participant's name, affiliation and/or prize information by competition organizers for promotional purposes in any medium without additional compensation.
Organizers hold the right to extend the deadlines for submitting solutions and/or reports. In such a case, they will inform participants about the change using the competition forum.
Organizers are not responsible for any consequences of technical issues related to the evaluation system or the competition platform.
The final ranking of the competing teams will be based on the final evaluation results. In case of ties in the evaluation scores, the time of the submission will be taken into account.
Each report, paper, or other publication based on research that uses data from this competition should credit KnowledgePit, QED Software, and EMCA Software as the institutions that provided data for the study.
Organizers may, without providing any additional explanation, reject any submission if they suspect that it was produced in an unfair way (e.g., by exploiting unintended data leaks) or was submitted by a team that has broken the competition rules.
By enrolling in this competition, you grant the organizers the right to process your submissions and reports for the purpose of evaluation and post-competition research. Your data is administered by eSensei Sp. z o.o.

FedCSIS 2020 Challenge: Network Device Workload Prediction has ended. In total, the competition attracted 150 teams which submitted over 700 solutions. We would like to thank all participants for this great contribution!

The considered task was indeed challenging - the final solutions from all teams were ranked lower than the competition baseline. However, solutions submitted by several teams are very promising, and we will be further investigating their possible practical applications.

Selected teams submitted extended versions of their reports to the special session of FedCSIS 2020. The resulting papers, which describe the teams' solutions to the challenge task, were published in the conference proceedings.

Rank	Team Name	Is Report		Preliminary Score	Final Score	Submissions
1	baseline solution	True	True	0.2267	0.229530	3
2	Les Trois Mousquetaires	True	True	0.1888	0.162979	19
3	papiez69	True	True	0.1841	0.151499	13
4	Wrong Team Name	True	True	0.1836	0.143708	6
5	Stanisław Kaźmierowski	True	True	0.1464	0.098542	15
6	kajetan	True	True	0.1512	0.077224	5
7	datafreaks	True	True	0.0731	0.070106	4
8	sienkiewicz	True	True	0.0225	0.014939	8
9	-_-	True	True	0.0109	0.012374	21
10	Piotr Grabowski	True	True	-0.0005	-0.000089	5
11	pacman	True	True	-0.0013	-0.000972	1
12	Fni 2	True	True	-0.0013	-0.000972	4
13	cdata	True	True	0.3059	-0.059837	90
14	amy	True	True	0.3130	-0.138349	100
15	SELECT name FROM competition.losers	True	True	0.0146	-0.475198	22
16	Funny Team Name	True	True	-0.9526	-0.583656	5
17	Dymitr	True	True	0.3223	-0.779576	146
18	MultiPandas	True	True	-0.1923	-0.840216	33
19	pszulc	True	True	0.0096	-1.129627	10
20	NJJ	True	True	-0.4814	-1.868633	5
21	Andrey	True	True	-2.8140	-2.400179	6
22	Karol Waszczuk	True	True	0.1955	-2.430554	38
23	The Sherpas	True	True	-2.0595	-2.480422	4
24	Piotr Szulc	True	True	0.2030	-2.600249	11
25	kaambal	True	True	-1.7577	-8.128287	2
26	RandomGenerator	True	True	0.2575	-61.164407	35
27	pkuczko	True	True	-999.0000	-999.000000	9
28	little_skynet	False	True	0.1113	No report file found or report rejected.	7
29	Climber	False	True	0.2066	No report file found or report rejected.	21
30	berlin	False	True	0.1049	No report file found or report rejected.	18
31	Alex	False	True	0.0185	No report file found or report rejected.	3
32	noidea	False	True	0.0004	No report file found or report rejected.	7
33	dataloader	False	True	-0.0013	No report file found or report rejected.	1
34	mathurin	False	True	-0.0013	No report file found or report rejected.	8
35	go	False	True	-0.0013	No report file found or report rejected.	11
36	vbhargav875	False	True	-0.0013	No report file found or report rejected.	3
37	IME	False	True	-0.0013	No report file found or report rejected.	1
38	joe	False	True	-0.0013	No report file found or report rejected.	1
39	TRN	False	True	-0.0013	No report file found or report rejected.	1
40	heheteam	False	True	-0.0013	No report file found or report rejected.	1
41	Kirov reporting	False	True	-0.0474	No report file found or report rejected.	6
42	makak	False	True	-0.0570	No report file found or report rejected.	5
43	pesto	False	True	-0.1022	No report file found or report rejected.	2
44	Michal	False	True	-0.1399	No report file found or report rejected.	17
45	TeamName	False	True	-0.0013	No report file found or report rejected.	9
46	ahihi_ahaha	False	True	-0.6925	No report file found or report rejected.	3
47	M	False	True	-1.4472	No report file found or report rejected.	4
48	One_n_Only	False	True	-6.9848	No report file found or report rejected.	10
49	DenisVorotyntsev	False	True	-318.4680	No report file found or report rejected.	2
50	pauli	False	True	-327.1493	No report file found or report rejected.	1
51	Franciszek Budrowski	False	True	-488.8955	No report file found or report rejected.	2
52	onemanarmy	False	True	-999.0000	No report file found or report rejected.	1
53	Azul	False	True	-999.0000	No report file found or report rejected.	1
54	Niko	False	True	-999.0000	No report file found or report rejected.	8


Training data in this challenge are hourly aggregated values of various workload characteristics extracted from device logs. They were made available in the form of a CSV table containing ten columns. The first three of these columns are identifiers. They are followed by the mean, standard deviation, and a candlestick aggregation of the corresponding values. In particular, the meanings of the columns in the data set are:

hostname: an ID of the device
series: a name of the considered characteristic
time_window: a timestamp of the aggregation window; the row aggregates values from an hour starting at the indicated timestamp
Mean: the mean of the values
SD: the standard deviation of the values
Open: a value of the first reading during the corresponding hour
High: the maximum of values
Low: the minimum of values
Close: a value of the last reading during the corresponding hour
Volume: the number of values

For each hostname-series pair in the data, the values can be arranged into a time series spanning over 80 days. Note, however, that some values can be missing for some pairs. Moreover, hostnames correspond to heterogeneous types of devices for which different sets of characteristics are monitored. Some of these devices are part of the same system, so their workloads are likely to be highly correlated.
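As a rough illustration of how one hostname-series pair can be turned into an hourly time series, here is a minimal pandas sketch. The sample rows, the device and series names (`host_a`, `cpu_load`, `mem_used`), and the helper `hourly_series` are made up for this example; only the ten column names come from the description above. Missing hours are exposed as NaN by reindexing onto a complete hourly grid.

```python
import io

import pandas as pd

# Made-up sample rows in the ten-column layout described above (illustrative values only).
SAMPLE_CSV = """hostname,series,time_window,Mean,SD,Open,High,Low,Close,Volume
host_a,cpu_load,2020-02-20 08:00:00,0.50,0.10,0.40,0.70,0.30,0.60,60
host_a,cpu_load,2020-02-20 09:00:00,0.55,0.12,0.60,0.80,0.35,0.50,60
host_a,cpu_load,2020-02-20 11:00:00,0.45,0.08,0.50,0.60,0.30,0.40,58
host_b,mem_used,2020-02-20 08:00:00,0.80,0.02,0.79,0.83,0.77,0.81,60
"""


def hourly_series(df, hostname, series):
    """Return hourly Mean values for one hostname-series pair, NaN where an hour is missing."""
    pair = df[(df["hostname"] == hostname) & (df["series"] == series)]
    values = pair.set_index("time_window")["Mean"].sort_index()
    # Build a gap-free hourly index between the first and last observed window.
    full_index = pd.date_range(values.index.min(), values.index.max(),
                               freq=pd.Timedelta(hours=1))
    return values.reindex(full_index)


df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["time_window"])
cpu = hourly_series(df, "host_a", "cpu_load")
print(cpu)
print("missing hours:", int(cpu.isna().sum()))
```

In this sample the 10:00 window is absent for `host_a`/`cpu_load`, so it appears as a single NaN. How to fill such gaps (interpolation, forward fill, or leaving them out) is a modelling choice that the task description does not prescribe.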

The task and the format of submissions: the task in this challenge is to predict future workload characteristic values of a number of devices from the training data. IDs of the devices (hostname) and their characteristics for which the predictions are to be made (series) are indicated in the solution_template.csv file. This file was made available in the Data files section. Participants of the challenge are asked to predict 168 consecutive values of each indicated time series (one full week) and upload the predictions through the submission system.

The format of submissions should be the same as in the solution_template.csv file. Solutions should be submitted as CSV files containing 170 columns. The first two columns should contain device ID (hostname) and characteristic ID (series), respectively. They should be followed by 168 numeric columns containing predictions – mean values of the corresponding characteristics for the next 168 hours (one week starting at 2020-02-20 12:00:00) after the end of the training data. The file exemplary_solution.csv contains an example of a correctly formatted submission file.
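The 170-column layout can be sketched as follows. This is only an assumed illustration: the prediction column names (`h1` ... `h168`), the template rows, and the constant "repeat the last known mean" forecast are hypothetical, and the actual header conventions should be taken from solution_template.csv and exemplary_solution.csv.

```python
import pandas as pd

HORIZON = 168  # one week of hourly predictions


def naive_submission(template, last_means):
    """Fill a (hostname, series) template with a constant forecast per row.

    `last_means` maps (hostname, series) to the value repeated over the horizon.
    """
    rows = []
    for hostname, series in zip(template["hostname"], template["series"]):
        value = last_means[(hostname, series)]
        rows.append([hostname, series] + [value] * HORIZON)
    # Hypothetical names for the 168 prediction columns.
    columns = ["hostname", "series"] + [f"h{i + 1}" for i in range(HORIZON)]
    return pd.DataFrame(rows, columns=columns)


# Made-up template rows and last observed means, for illustration only.
template = pd.DataFrame({
    "hostname": ["host_a", "host_b"],
    "series": ["cpu_load", "mem_used"],
})
last_means = {("host_a", "cpu_load"): 0.45, ("host_b", "mem_used"): 0.80}

submission = naive_submission(template, last_means)
csv_text = submission.to_csv(index=False)
print(submission.shape)
```

The result has one row per requested hostname-series pair and 2 + 168 = 170 columns, matching the described format.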

Evaluation: the quality of submissions will be evaluated using the $R^2$ measure, i.e., for each time series, the forecasts will be compared to ground truth values, and their quality will be assessed using the formula:

$$R^2(f, y) = 1 - \frac{RSS(f, y)}{TSS(y)},$$ where $RSS(f, y)$ is the residual sum of squares of forecasts: $$RSS(f, y) = \sum_i (y_i - f_i)^2,$$ and $TSS(y)$ is the total sum of squares: $$TSS(y) = \sum_i (y_i - \bar{y})^2,$$ and $\bar{y}$ is the mean value of time series $y$ estimated using available training data. The submission score is the average $R^2$ value over all time series from the test set.
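A minimal NumPy sketch of this measure is given below. The function names and the toy numbers are illustrative assumptions, not the organizers' evaluation code. The point it highlights is that $\bar{y}$ comes from the training data rather than from the test week, so a forecast that is worse than the constant training mean yields a negative score.

```python
import numpy as np


def r2_train_mean(forecast, truth, train_mean):
    """R^2 of one series, with TSS computed around the training-data mean."""
    f = np.asarray(forecast, dtype=float)
    y = np.asarray(truth, dtype=float)
    rss = np.sum((y - f) ** 2)           # residual sum of squares
    tss = np.sum((y - train_mean) ** 2)  # total sum of squares around the training mean
    return 1.0 - rss / tss


def submission_score(forecasts, truths, train_means):
    """Average R^2 over all series."""
    scores = [r2_train_mean(f, y, m)
              for f, y, m in zip(forecasts, truths, train_means)]
    return float(np.mean(scores))


# Toy example: three series sharing the same ground truth and training mean.
truth = [1.0, 2.0, 3.0]
perfect = r2_train_mean(truth, truth, train_mean=2.0)             # RSS = 0
constant = r2_train_mean([2.0, 2.0, 2.0], truth, train_mean=2.0)  # RSS = TSS
poor = r2_train_mean([3.0, 3.0, 3.0], truth, train_mean=2.0)      # RSS = 5, TSS = 2
score = submission_score([truth, [2.0] * 3, [3.0] * 3], [truth] * 3, [2.0] * 3)
print(perfect, constant, poor, score)
```

Here the perfect forecast scores 1, the constant training-mean forecast scores 0, and the poor forecast scores -1.5. The sketch does not handle the case $TSS(y) = 0$; how the official evaluation treats it is not stated in the description.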

Solutions will be evaluated online, and the preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test time series (10%), fixed for all participants. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published online. It is important to note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Moreover, to be eligible for the awards, the winning teams must exceed the score of the baseline solution by at least 10%.

In case of any questions, please post on the competition forum or write an email to contact {at} knowledgepit.ml

March 23, 2020: start of the challenge, the data set becomes available
March 25, 2020: submission system opens
June 8, 2020 (23:59:59 GMT): submission system closes
June 10, 2020 (23:59:59 GMT): sending reports due
June 17, 2020: online publication of the final results, sending invitations for submitting papers
July 1, 2020: deadline for submitting invited papers
July 8, 2020: notification of paper acceptance
July 15, 2020: camera-ready of accepted papers, and registration to the conference due

Authors of the top-ranked solutions (based on the final evaluation scores) were awarded prizes funded by the sponsors:

First Prize: 1500 USD + one free FedCSIS'20 conference registration,
Second Prize: 1000 USD + one free FedCSIS'20 conference registration,
Third Prize: 500 USD + one free FedCSIS'20 conference registration.

The award ceremony took place during the FedCSIS'20 conference. Please note that the winners were eligible for the monetary prizes only if their final score exceeded the baseline solution score by at least 10%.

Andrzej Janusz, QED Software & University of Warsaw
Piotr Biczyk, QED Software
Artur Bicki, EMCA Software
Mateusz Przyborowski, QED Software & University of Warsaw

In case of any questions, please post on the competition forum or write an email to contact {at} knowledgepit.ml

This forum is for all users to discuss matters related to the competition. Good manners apply!

Discussion	Author	Replies	Last post
Online publication of the final results	Kacper	1	by Andrzej Thursday, June 18, 2020, 20:34:13
re-opening of the submission system	Andrzej	5	by Dymitr Wednesday, June 10, 2020, 18:41:18
broken submission system	Jan	3	by Andrzej Tuesday, June 09, 2020, 17:58:48
the end of the challenge	Andrzej	0	by Andrzej Tuesday, June 09, 2020, 11:51:51
What is the Baseline R2 value?	Ashwini kumar	2	by Andrzej Monday, June 08, 2020, 08:14:30
Timer inconsistent with schedule	Jan	1	by Andrzej Thursday, June 04, 2020, 08:58:25
Target variable	IOANNIS	1	by Andrzej Friday, May 29, 2020, 18:35:30
Submission deadline approaching	Piotr	3	by Andrzej Saturday, May 23, 2020, 11:11:08
Maintenance break	Andrzej	0	by Andrzej Thursday, April 16, 2020, 14:52:25
The competition is officially open!	Piotr	2	by Andrzej Monday, March 30, 2020, 15:53:53