FedCSIS 2020 Challenge: Network Device Workload Prediction
FedCSIS 2020 Data Mining Challenge: Network Device Workload Prediction is the seventh data mining competition organized in association with the Conference on Computer Science and Information Systems (https://fedcsis.org/). This time, the task concerns the monitoring of large IT infrastructures and the prediction of their resource allocation. The challenge is sponsored by EMCA Software and the Polish Information Processing Society (PTI).
Big news for all of you who would like to continue research related to this competition and evaluate your new results!
Ground truth test target values from the Network Device Workload Prediction challenge are now available for download. Responding to numerous requests, we decided to give you access to all evaluation files (check them out in the Data files section). Scroll down for a detailed description of the competition task.
If you are planning to use the data from this challenge in your publications, please add a reference to the paper describing the competition:
Andrzej Janusz, Mateusz Przyborowski, Piotr Biczyk, Dominik Ślęzak: Network Device Workload Prediction: A Data Mining Challenge at Knowledge Pit. FedCSIS 2020: 77-80
EMCA Software is a Polish vendor of Energy Logserver, a system capable of collecting data from various log sources to provide in-depth data analysis and alerting to its end users. EMCA is based in Poland but also operates in the Nordics, APAC, and the USA through partner channels. The company focuses on cybersecurity and IT infrastructure monitoring use cases, aiming to deliver a system that is ready to use and offers built-in correlations and predictions on monitored data.
With this challenge, we want to help EMCA answer the question of whether it is possible to reliably predict workload-related characteristics of monitored devices based on historical data gathered from those devices. This task is of paramount importance for IT and technical teams, who would gain a tool that allows them to manage the capacity of their infrastructure.
An additional difficulty within this challenge, and also the reason why it might be especially interesting for the data science community, arises from the fact that devices considered in the data are not uniform. In essence, logs cover readings from various types of hardware. Some of them are cross-dependent, as they are a part of the same IT system. Moreover, some devices have multiple interfaces for which the data is aggregated.
More details regarding the task and a description of the challenge data can be found in the Task description section.
Special session at FedCSIS 2020: As in previous years, a special session devoted to the competition was held at the conference. We invited authors of selected challenge reports to extend them for publication in the conference proceedings (after review by Organizing Committee members) and for presentation at the conference. The papers were indexed by the IEEE Digital Library and Web of Science. The invited teams were chosen based on their final rank, the innovativeness of their approach, and the quality of the submitted report.
All published papers are available in the conference proceedings: https://annals-csis.org/Volume_21/
FedCSIS 2020 Challenge: Network Device Workload Prediction has ended. In total, the competition attracted 150 teams, which submitted over 700 solutions. We would like to thank all participants for this great contribution!
The considered task was indeed challenging: the final solutions of all teams ranked below the competition baseline. Nevertheless, the solutions submitted by several teams are very promising, and we will be further investigating their possible practical applications.
Selected teams submitted extended versions of their reports to the special session of FedCSIS 2020, where they were published; these papers describe the teams' approaches to the problem posed in the challenge.
Rank | Team Name | Is Report | | Preliminary Score | Final Score | Submissions
---|---|---|---|---|---|---
1 | baseline solution | True | True | 0.2267 | 0.229530 | 3
2 | Les Trois Mousquetaires | True | True | 0.1888 | 0.162979 | 19
3 | papiez69 | True | True | 0.1841 | 0.151499 | 13
4 | Wrong Team Name | True | True | 0.1836 | 0.143708 | 6
5 | Stanisław Kaźmierowski | True | True | 0.1464 | 0.098542 | 15
6 | kajetan | True | True | 0.1512 | 0.077224 | 5
7 | datafreaks | True | True | 0.0731 | 0.070106 | 4
8 | sienkiewicz | True | True | 0.0225 | 0.014939 | 8
9 | -_- | True | True | 0.0109 | 0.012374 | 21
10 | Piotr Grabowski | True | True | -0.0005 | -0.000089 | 5
11 | pacman | True | True | -0.0013 | -0.000972 | 1
12 | Fni 2 | True | True | -0.0013 | -0.000972 | 4
13 | cdata | True | True | 0.3059 | -0.059837 | 90
14 | amy | True | True | 0.3130 | -0.138349 | 100
15 | SELECT name FROM competition.losers | True | True | 0.0146 | -0.475198 | 22
16 | Funny Team Name | True | True | -0.9526 | -0.583656 | 5
17 | Dymitr | True | True | 0.3223 | -0.779576 | 146
18 | MultiPandas | True | True | -0.1923 | -0.840216 | 33
19 | pszulc | True | True | 0.0096 | -1.129627 | 10
20 | NJJ | True | True | -0.4814 | -1.868633 | 5
21 | Andrey | True | True | -2.8140 | -2.400179 | 6
22 | Karol Waszczuk | True | True | 0.1955 | -2.430554 | 38
23 | The Sherpas | True | True | -2.0595 | -2.480422 | 4
24 | Piotr Szulc | True | True | 0.2030 | -2.600249 | 11
25 | kaambal | True | True | -1.7577 | -8.128287 | 2
26 | RandomGenerator | True | True | 0.2575 | -61.164407 | 35
27 | pkuczko | True | True | -999.0000 | -999.000000 | 9
28 | little_skynet | False | True | 0.1113 | No report file found or report rejected. | 7
29 | Climber | False | True | 0.2066 | No report file found or report rejected. | 21
30 | berlin | False | True | 0.1049 | No report file found or report rejected. | 18
31 | Alex | False | True | 0.0185 | No report file found or report rejected. | 3
32 | noidea | False | True | 0.0004 | No report file found or report rejected. | 7
33 | dataloader | False | True | -0.0013 | No report file found or report rejected. | 1
34 | mathurin | False | True | -0.0013 | No report file found or report rejected. | 8
35 | go | False | True | -0.0013 | No report file found or report rejected. | 11
36 | vbhargav875 | False | True | -0.0013 | No report file found or report rejected. | 3
37 | IME | False | True | -0.0013 | No report file found or report rejected. | 1
38 | joe | False | True | -0.0013 | No report file found or report rejected. | 1
39 | TRN | False | True | -0.0013 | No report file found or report rejected. | 1
40 | heheteam | False | True | -0.0013 | No report file found or report rejected. | 1
41 | Kirov reporting | False | True | -0.0474 | No report file found or report rejected. | 6
42 | makak | False | True | -0.0570 | No report file found or report rejected. | 5
43 | pesto | False | True | -0.1022 | No report file found or report rejected. | 2
44 | Michal | False | True | -0.1399 | No report file found or report rejected. | 17
45 | TeamName | False | True | -0.0013 | No report file found or report rejected. | 9
46 | ahihi_ahaha | False | True | -0.6925 | No report file found or report rejected. | 3
47 | M | False | True | -1.4472 | No report file found or report rejected. | 4
48 | One_n_Only | False | True | -6.9848 | No report file found or report rejected. | 10
49 | DenisVorotyntsev | False | True | -318.4680 | No report file found or report rejected. | 2
50 | pauli | False | True | -327.1493 | No report file found or report rejected. | 1
51 | Franciszek Budrowski | False | True | -488.8955 | No report file found or report rejected. | 2
52 | onemanarmy | False | True | -999.0000 | No report file found or report rejected. | 1
53 | Azul | False | True | -999.0000 | No report file found or report rejected. | 1
54 | Niko | False | True | -999.0000 | No report file found or report rejected. | 8
Training data in this challenge are hourly aggregated values of various workload characteristics extracted from device logs. They were made available in the form of a CSV table containing ten columns. The first three of these columns are identifiers. They are followed by the mean, standard deviation, and a candlestick aggregation of the corresponding values. In particular, the meanings of the columns in the data set are:
- hostname: an ID of the device
- series: a name of the considered characteristic
- time_window: a timestamp of the aggregation window; the row aggregates values from an hour starting at the indicated timestamp
- Mean: the mean of the values
- SD: the standard deviation of the values
- Open: a value of the first reading during the corresponding hour
- High: the maximum of values
- Low: the minimum of values
- Close: a value of the last reading during the corresponding hour
- Volume: the number of values
For each hostname-series pair in the data, the values can be arranged into a time series spanning over 80 days. Note, however, that some values may be missing for some pairs. Moreover, hostnames correspond to heterogeneous types of devices for which different sets of characteristics are monitored. Some of these devices are part of the same system, and it is likely that their workloads are highly correlated.
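The per-pair arrangement described above can be sketched in pandas. This is a minimal sketch under our own assumptions: the function name is ours, the column names are those listed in the description, and the hourly reindexing step is one way to make the missing values explicit.

```python
import pandas as pd

def extract_series(df, hostname, series, column="Mean"):
    """Arrange one hostname-series pair into an hourly time series.

    Reindexing to an hourly frequency makes the gaps mentioned in the
    data description explicit: hours with no readings become NaN.
    """
    pair = df[(df["hostname"] == hostname) & (df["series"] == series)]
    ts = pair.set_index("time_window")[column].sort_index()
    return ts.asfreq("h")  # hourly grid; missing hours appear as NaN
```

Cross-dependent devices can then be inspected by aligning several such series on the shared hourly index.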
The task and the format of submissions: the task in this challenge is to predict future workload characteristic values of a number of devices from the training data. IDs of the devices (hostname) and their characteristics for which the predictions are to be made (series) are indicated in the solution_template.csv file. This file was made available in the Data files section. Participants of the challenge are asked to predict 168 consecutive values of each indicated time series (one full week) and upload the predictions through the submission system.
The format of submissions should be the same as in the solution_template.csv file. Solutions should be submitted as CSV files containing 170 columns. The first two columns should contain device ID (hostname) and characteristic ID (series), respectively. They should be followed by 168 numeric columns containing predictions – mean values of the corresponding characteristics for the next 168 hours (one week starting at 2020-02-20 12:00:00) after the end of the training data. The file exemplary_solution.csv contains an example of a correctly formatted submission file.
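Producing a correctly shaped file from per-series forecasts can be sketched as follows; the helper name and the `forecasts` dictionary are our assumptions, while the template path refers to the solution_template.csv file described above.

```python
import pandas as pd

def make_submission(template_path, forecasts, out_path):
    """Write a 170-column submission CSV in the template's row order.

    `forecasts` maps (hostname, series) pairs to length-168 sequences
    of hourly predictions. The header and row order are taken from the
    template so the output matches the required format exactly.
    """
    template = pd.read_csv(template_path)
    pred_cols = template.columns[2:]  # the 168 hourly prediction columns
    assert len(pred_cols) == 168
    for i, row in template.iterrows():
        template.loc[i, pred_cols] = forecasts[(row["hostname"], row["series"])]
    template.to_csv(out_path, index=False)
```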
Evaluation: the quality of submissions will be evaluated using the $R^2$ measure, i.e., for each time series, the forecasts will be compared to ground truth values, and their quality will be assessed using the formula:
$$R^2(f, y) = 1 - \frac{RSS(f, y)}{TSS(y)},$$ where $RSS(f, y)$ is the residual sum of squares of forecasts: $$RSS(f, y) = \sum_i (y_i - f_i)^2,$$ and $TSS(y)$ is the total sum of squares: $$TSS(y) = \sum_i (y_i - \bar{y})^2,$$ and $\bar{y}$ is the mean value of time series $y$ estimated using available training data. The submission score is the average $R^2$ value over all time series from the test set.
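The scoring formula can be implemented directly; note that $\bar{y}$ is estimated from the training data, so a constant forecast equal to the training mean scores exactly 0, and poor forecasts can go arbitrarily negative. Function names below are ours, not part of the official evaluation code.

```python
import numpy as np

def r2_vs_train_mean(forecast, truth, train_mean):
    """R^2 with TSS computed around the training-set mean of the series."""
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rss = np.sum((truth - forecast) ** 2)   # residual sum of squares
    tss = np.sum((truth - train_mean) ** 2) # total sum of squares vs. train mean
    return 1.0 - rss / tss

def challenge_score(per_series):
    """Average R^2 over (forecast, truth, train_mean) triples."""
    return float(np.mean([r2_vs_train_mean(f, y, m) for f, y, m in per_series]))
```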
Solutions will be evaluated on-line and the preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test time series (10%), fixed for all participants. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published online. It is important to note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Moreover, to be eligible for the awards, the winning teams must exceed the score of the baseline solution by at least 10%.
In case of any questions, please post on the competition forum or write an email to contact {at} knowledgepit.ml
- March 23, 2020: start of the challenge, the data set becomes available
- March 25, 2020: submission system opens
- June 8, 2020 (23:59:59 GMT): submission system closes
- June 10, 2020 (23:59:59 GMT): sending reports due
- June 17, 2020: online publication of the final results, sending invitations for submitting papers
- July 1, 2020: deadline for submitting invited papers
- July 8, 2020: notification of paper acceptance
- July 15, 2020: camera-ready of accepted papers, and registration to the conference due
Authors of the top-ranked solutions (based on the final evaluation scores) were awarded prizes funded by the sponsors:
- First Prize: 1500 USD + one free FedCSIS'20 conference registration,
- Second Prize: 1000 USD + one free FedCSIS'20 conference registration,
- Third Prize: 500 USD + one free FedCSIS'20 conference registration.
The award ceremony took place during the FedCSIS'20 conference. Please note that the winners were eligible for the money prizes only if their final score exceeded the baseline solution score by at least 10%.
- Andrzej Janusz, QED Software & University of Warsaw
- Piotr Biczyk, QED Software
- Artur Bicki, EMCA Software
- Mateusz Przyborowski, QED Software & University of Warsaw
Discussion | Author | Replies | Last post
---|---|---|---
Online publication of the final results | Kacper | 1 | by Andrzej, Thursday, June 18, 2020, 20:34:13
re-opening of the submission system | Andrzej | 5 | by Dymitr, Wednesday, June 10, 2020, 18:41:18
broken submission system | Jan | 3 | by Andrzej, Tuesday, June 09, 2020, 17:58:48
the end of the challenge | Andrzej | 0 | by Andrzej, Tuesday, June 09, 2020, 11:51:51
What is the Baseline R2 value? | Ashwini kumar | 2 | by Andrzej, Monday, June 08, 2020, 08:14:30
Timer inconsistent with schedule | Jan | 1 | by Andrzej, Thursday, June 04, 2020, 08:58:25
Target variable | IOANNIS | 1 | by Andrzej, Friday, May 29, 2020, 18:35:30
Submission deadline approaching | Piotr | 3 | by Andrzej, Saturday, May 23, 2020, 11:11:08
Maintenance break | Andrzej | 0 | by Andrzej, Thursday, April 16, 2020, 14:52:25
The competition is officially open! | Piotr | 2 | by Andrzej, Monday, March 30, 2020, 15:53:53