I would like to ask a question regarding the frequency of warning signals in the training and test data. The proportion of "ones" in the training data is about 0.02. Is it similar for the test data? By similar I mean a value within about ±0.01 of that. Could the organizers give us a hint about it?
All the best,
The frequencies of 'warning' signals can differ significantly between longwalls (the main working sites). As you might have noticed, part of the test data corresponds to readings from longwalls that are not present in the available training sets. Moreover, the test part corresponds to a different period of time.
I cannot reveal any more information before the competition ends :-)
Best regards and good luck!
Thanks for your reply. Indeed, I formulated my question in an unfortunate way, as answering it would unveil properties of the test data before the competition's end. Let me put it another way: what is the sampling procedure for the test data? I understand that the test dataset covers a different (future) time period with respect to the training data. However, the observations in the test data are not aligned in time, unlike the observations in the training data (as Dymitr noted in his previous question). How were the observations chosen to form the test dataset?
Thank you in advance for any clarifications.
For each of the main working sites, the data from the test period were sampled from a uniform distribution, and then all overlapping time series were removed. After that, all data were shuffled to hide relations between consecutive series.
Andrzej, could you please elaborate on this? As far as I understood, for a main working site with id = x you are given a number of consecutive observations (time series). Then you sample observations at random (uniformly) and include them in the test set. What does it mean that time series are overlapping (for removal)?
The final step is to shuffle the data to hide temporal relations.
Moreover, I would like to ask if there is any particular procedure for choosing main working sites for test data (apart from the fact that there should be a few new sites as compared to the training data).
We used all main working sites for which we had data corresponding to the test period. Some of the main working site IDs present in the training data do not appear in the test set simply because those sites were not being mined during that period (they were already depleted).
Regarding the overlapping time series: each data record is associated with a 32-hour-long period of time (24 hourly readings plus the next 8 hours corresponding to the target value). By non-overlapping time series I meant records whose periods do not overlap.
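To make sure I understand the described procedure, here is a minimal sketch of how such a test set could be drawn. All function and parameter names are my own illustration, not the organizers' actual code; I only assume what was stated above: each record spans 32 hours (24 readings + 8 target hours), records are sampled uniformly per site, overlapping records are dropped, and the result is shuffled.

```python
from random import Random

HOURS_BEFORE = 24   # hourly sensor readings in each record
HOURS_AFTER = 8     # horizon used to derive the target value
WINDOW = HOURS_BEFORE + HOURS_AFTER  # 32-hour span per record

def overlaps(start_a, start_b, window=WINDOW):
    """True if the 32-hour periods starting at the two hours intersect."""
    return abs(start_a - start_b) < window

def sample_non_overlapping(start_hours, k, seed=0):
    """Uniformly sample k record start hours for one working site,
    drop any record overlapping an already-kept one, then shuffle
    to hide temporal relations between consecutive records."""
    rng = Random(seed)
    candidates = rng.sample(sorted(start_hours), min(k, len(start_hours)))
    kept = []
    for t in sorted(candidates):  # greedy pass in time order
        if all(not overlaps(t, u) for u in kept):
            kept.append(t)
    rng.shuffle(kept)
    return kept
```

For example, records starting at hours 0 and 31 would overlap (|0 - 31| < 32) and one of them would be removed, whereas records starting at hours 0 and 32 would both survive.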
I hope that clarifies everything.