Semester Project for Data Mining 2019/2020 Course
This is the second project for students enrolled in the Data Mining course 2019/2020 at the Faculty of Mathematics, Informatics and Mechanics at the University of Warsaw.
The task in this challenge is to classify abstracts of scientific articles from ACM Digital Library into topics from the ACM Computing Classification System (the old version from 1998). This problem can be regarded as a multi-label classification of textual data.
More details regarding the available data, submission format, and evaluation can be found in the Task description section.
Data for this project consists of two tables in a tab-separated format. Each row in those files corresponds to an abstract of a scientific article from the ACM Digital Library that was assigned one or more topics from the ACM Computing Classification System.
The training data (DM2020_training_docs_and_labels.csv) has three columns: the first one is an identifier of a document, the second one stores the text of the abstract, and the third one contains a list of comma-separated topic labels.
The test data (DM2020_test_docs.csv) has a similar format, but the labels in the third column are missing.
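As an illustration of the three-column layout described above, the following minimal sketch parses tab-separated rows with Python's standard `csv` module. It assumes the files have no header row and no quoting (hence `csv.QUOTE_NONE`); neither is stated in the task description. The function name `read_documents` and the sample rows, including the label strings, are made up for the example.

```python
import csv
import io


def read_documents(handle, has_labels=True):
    """Parse tab-separated rows into (doc_id, abstract, labels) tuples.

    Assumption: no header row and no quoted fields in the file.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    documents = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        doc_id, abstract = row[0], row[1]
        labels = []
        # The third column holds comma-separated topic labels (training data only).
        if has_labels and len(row) > 2 and row[2].strip():
            labels = [label.strip() for label in row[2].split(",") if label.strip()]
        documents.append((doc_id, abstract, labels))
    return documents


# Hypothetical sample rows standing in for DM2020_training_docs_and_labels.csv.
sample = io.StringIO(
    "doc1\tAn abstract about sorting.\tF.2.2,E.1\n"
    "doc2\tAnother abstract.\tH.3.3\n"
)
train_docs = read_documents(sample)
```

For the real files, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, passing `has_labels=False` for the test data; the encoding is also an assumption.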
The task and the format of submissions: your task is to predict the labels of the documents in the test data and submit them to the evaluation system. A correctly formatted submission is a text file with exactly 100000 lines. Each line should correspond to one document from the test data set (the order matters!) and contain a list of one or more predicted labels, separated by commas.
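The submission format can be sketched as follows. This is a minimal example, not an official tool: the helper name `format_submission` is hypothetical, and the checks simply mirror the two stated requirements (exactly 100000 lines, at least one label per line). Predictions must already be in the same order as the test documents.

```python
def format_submission(predictions, expected_lines=100000):
    """Render one comma-separated label list per line, in test-set order."""
    if len(predictions) != expected_lines:
        raise ValueError(
            f"expected {expected_lines} lines, got {len(predictions)}"
        )
    lines = []
    for index, labels in enumerate(predictions):
        if not labels:
            # Each line must contain one or more predicted labels.
            raise ValueError(f"document at position {index} has no labels")
        lines.append(",".join(labels))
    return "\n".join(lines) + "\n"


# Tiny demonstration with two hypothetical documents and made-up labels.
demo = format_submission([["H.3.3"], ["F.2.2", "E.1"]], expected_lines=2)
```

The returned string can then be written to a text file, for instance with `open("submission.txt", "w").write(...)`; the file name is arbitrary here.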
Evaluation: the quality of submissions will be evaluated using the average F1-score measure, i.e., for each test document, the F1-score between the predicted and true labels will be computed, and the values obtained for all test cases will be averaged.
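To make the measure concrete, here is a minimal sketch of a per-document F1-score averaged over all documents. It treats the predicted and true labels of each document as sets. How the official evaluator handles edge cases (for example, duplicate labels) is not specified, so this code is only an approximation for local validation; the function names are chosen for the example.

```python
def document_f1(predicted, actual):
    """F1-score between the predicted and true label sets of one document."""
    predicted, actual = set(predicted), set(actual)
    overlap = len(predicted & actual)
    if overlap == 0:
        return 0.0  # no correct label: precision and recall are both zero
    precision = overlap / len(predicted)
    recall = overlap / len(actual)
    return 2 * precision * recall / (precision + recall)


def average_f1(all_predicted, all_actual):
    """Mean of the per-document F1-scores."""
    if len(all_predicted) != len(all_actual) or not all_actual:
        raise ValueError("need two non-empty lists of equal length")
    scores = [document_f1(p, a) for p, a in zip(all_predicted, all_actual)]
    return sum(scores) / len(scores)


# Document 1: precision 1/2, recall 1/1, so F1 = 2/3.
# Document 2: no overlap, so F1 = 0. The average is 1/3.
score = average_f1([["A", "B"], ["C"]], [["A"], ["D"]])
```

Because the score is averaged per document, predicting too many labels lowers precision on every document, while predicting too few lowers recall.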
Solutions will be evaluated online and the preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test documents (10%), fixed for all participants. The final evaluation will be performed after the competition ends, using the remaining part of the test data. Those results will also be published online. Note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Participants can submit many solutions, but before the competition ends each team needs to indicate up to two final solutions that will undergo the final evaluation (on the remaining part of the test data).
In case of additional questions, please post them on the competition forum.
Here you can find data for this challenge. To get the data, you need to be enrolled and logged in.
Rank | Team Name | Is Report | | Preliminary Score | Final Score | Submissions |
---|---|---|---|---|---|---|
1 | python to make it harder | True | True | 0.4429 | 0.442684 | 18 |
2 | Andrzej Janusz | True | True | 0.4339 | 0.433442 | 1 |
3 | Not so standard deviation | True | True | 0.4296 | 0.429108 | 26 |
4 | 0x444d5f50726f6a6563747844 | True | True | 0.4226 | 0.421489 | 8 |
5 | hackermen | True | True | 0.4077 | 0.410506 | 1 |
6 | lraszkiewicz | True | True | 0.4112 | 0.409127 | 6 |
7 | ps371816 | True | True | 0.4079 | 0.406427 | 6 |
8 | Krety w Krainie Danych | True | True | 0.4023 | 0.403361 | 11 |
9 | ABC | True | True | 0.3155 | 0.310596 | 11 |
10 | mim_team | True | True | 0.2558 | 0.256685 | 6 |
11 | mikra | False | True | 0.3832 | No report file found or report rejected. | 5 |
- May 15, 2020: start of the challenge; the data sets become available and the submission system opens
- June 21, 2020 (23:59:59 GMT): the submission system closes
- June 21, 2020 (23:59:59 GMT): deadline for sending reports
This forum is for all users to discuss matters related to the competition. Good manners apply!
Discussion | Author | Replies | Last post | |
---|---|---|---|---|
Can we use languages other than R? | Jakub | 1 | by Andrzej Thursday, May 21, 2020, 15:51:48 |