
## FedCSIS 2020 Challenge: Network Device Workload Prediction

The FedCSIS 2020 Data Mining Challenge: Network Device Workload Prediction is the seventh data mining competition organized in association with the Conference on Computer Science and Information Systems (https://fedcsis.org/). This time, the considered task is related to the monitoring of large IT infrastructures and the estimation of their resource allocation. The challenge is sponsored by EMCA Software and the Polish Information Processing Society (PTI).

Big news for all of you who would like to continue research related to this competition and evaluate your new results!

Ground truth test target values from the Network Device Workload Prediction challenge are now available for download. Responding to numerous requests, we decided to give you access to all evaluation files (check them out in the Data files section). Scroll down for a detailed description of the competition task.

If you are planning to use the data from this challenge in your publications, please add a reference to the paper describing the competition:
Andrzej Janusz, Mateusz Przyborowski, Piotr Biczyk, Dominik Ślęzak: Network Device Workload Prediction: A Data Mining Challenge at Knowledge Pit. FedCSIS 2020: 77-80

EMCA Software is a Polish vendor of Energy Logserver - a system capable of collecting data from various log sources to provide in-depth data analysis and alerting to its end-users. EMCA is based in Poland but also operates in the Nordics, APAC, and the USA through partner channels. The company focuses on cybersecurity and IT infrastructure monitoring use cases, aiming to deliver a system that is ready to use and offers built-in correlations and predictions on monitored data.

With this challenge, we want to help EMCA answer the question of whether it is possible to reliably predict workload-related characteristics of monitored devices based on historical data gathered from those devices. This task is of paramount importance for IT and technical teams, who would gain a tool for managing the capacity of their infrastructure.

An additional difficulty within this challenge, and also the reason why it might be especially interesting for the data science community, arises from the fact that devices considered in the data are not uniform. In essence, logs cover readings from various types of hardware. Some of them are cross-dependent, as they are a part of the same IT system. Moreover, some devices have multiple interfaces for which the data is aggregated.

More details regarding the task and a description of the challenge data can be found in the Task description section.

Special session at FedCSIS 2020: As in previous years, a special session devoted to the competition was held at the conference. We invited authors of selected challenge reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The papers were indexed by the IEEE Digital Library and Web of Science. The invited teams were chosen based on their final rank, innovativeness of their approach, and quality of the submitted report.

All published papers are available in the conference proceedings: https://annals-csis.org/Volume_21/

FedCSIS 2020 Challenge: Network Device Workload Prediction has ended. In total, the competition attracted 150 teams, which submitted over 700 solutions. We would like to thank all participants for this great contribution!

The considered task was indeed challenging: the final solutions from all teams ranked lower than the competition baseline. However, solutions submitted by several teams are very promising, and we will be further investigating their possible practical applications.

Selected teams were invited to submit extended versions of their reports to the special session of FedCSIS 2020, where they were published as responses to our challenge and solutions to the diagnosed problem.

| Rank | Team Name | Is Report | Preliminary Score | Final Score | Submissions |
| --- | --- | --- | --- | --- | --- |
| 1 | baseline solution | True | 0.2267 | 0.229530 | 3 |
| 2 | Les Trois Mousquetaires | True | 0.1888 | 0.162979 | 19 |
| 3 | papiez69 | True | 0.1841 | 0.151499 | 13 |
| 4 | Wrong Team Name | True | 0.1836 | 0.143708 | 6 |
| 5 | Stanisław Kaźmierowski | True | 0.1464 | 0.098542 | 15 |
| 6 | kajetan | True | 0.1512 | 0.077224 | 5 |
| 7 | datafreaks | True | 0.0731 | 0.070106 | 4 |
| 8 | sienkiewicz | True | 0.0225 | 0.014939 | 8 |
| 9 | -_- | True | 0.0109 | 0.012374 | 21 |
| 10 | Piotr Grabowski | True | -0.0005 | -0.000089 | 5 |
| 11 | pacman | True | -0.0013 | -0.000972 | 1 |
| 12 | Fni 2 | True | -0.0013 | -0.000972 | 4 |
| 13 | cdata | True | 0.3059 | -0.059837 | 90 |
| 14 | amy | True | 0.3130 | -0.138349 | 100 |
| 15 | SELECT name FROM competition.losers | True | 0.0146 | -0.475198 | 22 |
| 16 | Funny Team Name | True | -0.9526 | -0.583656 | 5 |
| 17 | Dymitr | True | 0.3223 | -0.779576 | 146 |
| 18 | MultiPandas | True | -0.1923 | -0.840216 | 33 |
| 19 | pszulc | True | 0.0096 | -1.129627 | 10 |
| 20 | NJJ | True | -0.4814 | -1.868633 | 5 |
| 21 | Andrey | True | -2.8140 | -2.400179 | 6 |
| 22 | Karol Waszczuk | True | 0.1955 | -2.430554 | 38 |
| 23 | The Sherpas | True | -2.0595 | -2.480422 | 4 |
| 24 | Piotr Szulc | True | 0.2030 | -2.600249 | 11 |
| 25 | kaambal | True | -1.7577 | -8.128287 | 2 |
| 26 | RandomGenerator | True | 0.2575 | -61.164407 | 35 |
| 27 | pkuczko | True | -999.0000 | -999.000000 | 9 |
| 28 | little_skynet | False | 0.1113 | No report file found or report rejected. | 7 |
| 29 | Climber | False | 0.2066 | No report file found or report rejected. | 21 |
| 30 | berlin | False | 0.1049 | No report file found or report rejected. | 18 |
| 31 | Alex | False | 0.0185 | No report file found or report rejected. | 3 |
| 32 | noidea | False | 0.0004 | No report file found or report rejected. | 7 |
| 33 | | False | -0.0013 | No report file found or report rejected. | 1 |
| 34 | mathurin | False | -0.0013 | No report file found or report rejected. | 8 |
| 35 | go | False | -0.0013 | No report file found or report rejected. | 11 |
| 36 | vbhargav875 | False | -0.0013 | No report file found or report rejected. | 3 |
| 37 | IME | False | -0.0013 | No report file found or report rejected. | 1 |
| 38 | joe | False | -0.0013 | No report file found or report rejected. | 1 |
| 39 | TRN | False | -0.0013 | No report file found or report rejected. | 1 |
| 40 | heheteam | False | -0.0013 | No report file found or report rejected. | 1 |
| 41 | Kirov reporting | False | -0.0474 | No report file found or report rejected. | 6 |
| 42 | makak | False | -0.0570 | No report file found or report rejected. | 5 |
| 43 | pesto | False | -0.1022 | No report file found or report rejected. | 2 |
| 44 | Michal | False | -0.1399 | No report file found or report rejected. | 17 |
| 45 | TeamName | False | -0.0013 | No report file found or report rejected. | 9 |
| 46 | ahihi_ahaha | False | -0.6925 | No report file found or report rejected. | 3 |
| 47 | M | False | -1.4472 | No report file found or report rejected. | 4 |
| 48 | One_n_Only | False | -6.9848 | No report file found or report rejected. | 10 |
| 49 | DenisVorotyntsev | False | -318.4680 | No report file found or report rejected. | 2 |
| 50 | pauli | False | -327.1493 | No report file found or report rejected. | 1 |
| 51 | Franciszek Budrowski | False | -488.8955 | No report file found or report rejected. | 2 |
| 52 | onemanarmy | False | -999.0000 | No report file found or report rejected. | 1 |
| 53 | Azul | False | -999.0000 | No report file found or report rejected. | 1 |
| 54 | Niko | False | -999.0000 | No report file found or report rejected. | 8 |

Training data in this challenge are hourly aggregated values of various workload characteristics extracted from device logs. They were made available in the form of a CSV table containing ten columns. The first three of these columns are identifiers. They are followed by the mean, standard deviation, and a candlestick aggregation of the corresponding values. In particular, the meanings of the columns in the data set are:

• hostname: an ID of the device
• series: a name of the considered characteristic
• time_window: a timestamp of the aggregation window; the row aggregates values from an hour starting at the indicated timestamp
• Mean: the mean of the values
• SD: the standard deviation of the values
• Open: a value of the first reading during the corresponding hour
• High: the maximum of values
• Low: the minimum of values
• Close: a value of the last reading during the corresponding hour
• Volume: the number of values

For each hostname-series pair in the data, values can be arranged into a time series spanning over 80 days. Note, however, that some values can be missing for some pairs. Moreover, hostnames correspond to heterogeneous types of devices for which different sets of characteristics are monitored. Some of these devices are part of the same system, and it is likely that their workloads are highly correlated.
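To illustrate the layout described above, the table can be regrouped into one time series per hostname-series pair with pandas. The values below are synthetic and only mimic the described schema; the column names are taken from the description above:

```python
import pandas as pd

# A tiny synthetic sample in the described format (hypothetical values).
df = pd.DataFrame({
    "hostname": ["host_a", "host_a", "host_b"],
    "series": ["cpu_usage", "cpu_usage", "cpu_usage"],
    "time_window": pd.to_datetime(
        ["2019-12-01 00:00", "2019-12-01 01:00", "2019-12-01 00:00"]),
    "Mean": [10.5, 12.0, 3.2],
    "SD": [1.1, 0.9, 0.4],
    "Open": [10.0, 11.5, 3.0],
    "High": [12.0, 13.0, 4.0],
    "Low": [9.0, 11.0, 3.0],
    "Close": [11.0, 12.5, 3.5],
    "Volume": [60, 58, 60],
})

# One hourly time series of Mean values per (hostname, series) pair;
# hours missing from the logs simply stay absent from the index.
ts = {
    key: grp.set_index("time_window")["Mean"].sort_index()
    for key, grp in df.groupby(["hostname", "series"])
}
```

The same regrouping applies to any of the candlestick columns (Open, High, Low, Close) in place of Mean.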

The task and the format of submissions: the task in this challenge is to predict future workload characteristic values of a number of devices from the training data. IDs of the devices (hostname) and their characteristics for which the predictions are to be made (series) are indicated in the solution_template.csv file. This file was made available in the Data files section. Participants of the challenge are asked to predict 168 consecutive values of each indicated time series (one full week) and upload the predictions through the submission system.

The format of submissions should be the same as in the solution_template.csv file. Solutions should be submitted as CSV files containing 170 columns. The first two columns should contain device ID (hostname) and characteristic ID (series), respectively. They should be followed by 168 numeric columns containing predictions – mean values of the corresponding characteristics for the next 168 hours (one week starting at 2020-02-20 12:00:00) after the end of the training data. The file exemplary_solution.csv contains an example of a correctly formatted submission file.
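As a sketch, a correctly shaped submission file could be assembled as follows. The hostname-series pairs and the constant forecasts here are hypothetical; the real pairs come from solution_template.csv, and the exact header convention should be checked against exemplary_solution.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical pairs -- in practice, read them from solution_template.csv.
template = pd.DataFrame({
    "hostname": ["host_a", "host_b"],
    "series": ["cpu_usage", "memory_usage"],
})

# A naive baseline: repeat one constant per pair for all 168 hours.
preds = np.tile([[10.5], [3.2]], (1, 168))  # shape: (n_pairs, 168)

# Two ID columns followed by 168 forecast columns -> 170 columns in total.
submission = pd.concat(
    [template, pd.DataFrame(preds, columns=[f"h{i + 1}" for i in range(168)])],
    axis=1,
)
assert submission.shape[1] == 170
submission.to_csv("submission.csv", index=False)
```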

Evaluation: the quality of submissions will be evaluated using the $R^2$ measure, i.e., for each time series, the forecasts will be compared to ground truth values, and their quality will be assessed using the formula:

$$R^2(f, y) = 1 - \frac{RSS(f, y)}{TSS(y)},$$

where $RSS(f, y)$ is the residual sum of squares of the forecasts:

$$RSS(f, y) = \sum_i (y_i - f_i)^2,$$

$TSS(y)$ is the total sum of squares:

$$TSS(y) = \sum_i (y_i - \bar{y})^2,$$

and $\bar{y}$ is the mean value of time series $y$ estimated using the available training data. The submission score is the average $R^2$ value over all time series from the test set.
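A minimal sketch of this scoring rule follows. Note that $\bar{y}$ is estimated from the training data rather than the test window, so a forecast that simply repeats the training mean scores exactly 0:

```python
import numpy as np

def r2_vs_train_mean(y_true, y_pred, train_mean):
    """R^2 where TSS is computed around the TRAINING mean of the series,
    not the mean of the test window, as specified in the task description."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    tss = np.sum((y_true - train_mean) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Example: predicting the training mean for every hour yields a score of 0.
y = np.array([1.0, 2.0, 4.0])
print(r2_vs_train_mean(y, np.full(3, 2.5), train_mean=2.5))  # prints 0.0
```

The overall submission score is then the mean of this value across all test time series.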

Solutions will be evaluated online, and the preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test time series (10%), fixed for all participants. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published online. It is important to note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Moreover, to be eligible for the awards, the winning teams must exceed the score of the baseline solution by at least 10%.

In case of any questions, please post on the competition forum or write an email to contact {at} knowledgepit.ml

• March 23, 2020: start of the challenge, the data set becomes available
• March 25, 2020: submission system opens
• June 8, 2020 (23:59:59 GMT): submission system closes
• June 10, 2020 (23:59:59 GMT): sending reports due
• June 17, 2020: online publication of the final results, sending invitations for submitting papers
• July 1, 2020: deadline for submitting invited papers
• July 8, 2020: notification of paper acceptance
• July 15, 2020: camera-ready of accepted papers, and registration to the conference due

Authors of the top-ranked solutions (based on the final evaluation scores) were awarded prizes funded by the sponsors:

• First Prize: 1500 USD + one free FedCSIS'20 conference registration,
• Second Prize: 1000 USD + one free FedCSIS'20 conference registration,
• Third Prize: 500 USD + one free FedCSIS'20 conference registration.

The award ceremony took place during the FedCSIS'20 conference. Please note that the winners were eligible for the money prizes only if their final score exceeded the baseline solution score by at least 10%.

• Andrzej Janusz, QED Software & University of Warsaw
• Piotr Biczyk, QED Software
• Artur Bicki, EMCA Software
• Mateusz Przyborowski, QED Software & University of Warsaw


This forum is for all users to discuss matters related to the competition. Good manners apply!
| Discussion | Author | Replies | Last post |
| --- | --- | --- | --- |
| Online publication of the final results | Kacper | 1 | by Andrzej, Thursday, June 18, 2020, 22:34:13 |
| re-opening of the submission system | Andrzej | 5 | by Dymitr, Wednesday, June 10, 2020, 20:41:18 |
| broken submission system | Jan | 3 | by Andrzej, Tuesday, June 09, 2020, 19:58:48 |
| the end of the challenge | Andrzej | 0 | by Andrzej, Tuesday, June 09, 2020, 13:51:51 |
| What is the Baseline R2 value? | Ashwini kumar | 2 | by Andrzej, Monday, June 08, 2020, 10:14:30 |
| Timer inconsistent with schedule | Jan | 1 | by Andrzej, Thursday, June 04, 2020, 10:58:25 |
| Target variable | IOANNIS | 1 | by Andrzej, Friday, May 29, 2020, 20:35:30 |
| Submission deadline approaching | Piotr | 3 | by Andrzej, Saturday, May 23, 2020, 13:11:08 |
| Maintenance break | Andrzej | 0 | by Andrzej, Thursday, April 16, 2020, 16:52:25 |
| The competition is officially open! | Piotr | 2 | by Andrzej, Monday, March 30, 2020, 17:53:53 |