Do we have the rough relative locations and orientations of the main_working_sites (longwalls)? We have the depth difference from the column metadata description, but the horizontal distance between the sites might yield additional useful evidence. I also wanted to share that the additional data are unlikely to yield any improvement in testing performance, as they appear simply to show warnings from different sites from both the training and testing periods and, as already mentioned, the testing set also appears to contain very different, previously unseen sites. Lots of warnings tend to fire at once for a particular longwall as a one-off rare event, hence the representativeness of such a small testing set (compared to the training set) is doomed to be very poor. My view is also that, for consistency with the training set, the testing set should be provided ordered in time :).
Unfortunately, we currently do not have any data on the exact distances between the main working sites. However, the region name id and the seam name id, which are available in the meta-data table, can be used to roughly group the working sites by proximity.
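As a rough illustration of the grouping idea, here is a minimal sketch that buckets working sites by their (region name id, seam name id) pair. The row layout, the function name `group_sites_by_proximity` and all values below are assumptions made up for the example; the real meta-data table may be organised differently.

```python
from collections import defaultdict

# Hypothetical meta-data rows: (main_working_id, region_name_id, seam_name_id).
# The region and seam values are invented for illustration only.
metadata = [
    (264, 1, 10),
    (373, 1, 10),
    (437, 2, 11),
    (500, 2, 11),
    (600, 3, 12),
]


def group_sites_by_proximity(rows):
    """Group site ids that share the same region id and seam id."""
    groups = defaultdict(list)
    for site_id, region_id, seam_id in rows:
        groups[(region_id, seam_id)].append(site_id)
    return dict(groups)


site_groups = group_sites_by_proximity(metadata)
for key, sites in sorted(site_groups.items()):
    print(key, sites)
```

Sites that land in the same bucket share a region and a seam, which is only a coarse proxy for proximity and says nothing about the actual horizontal distance between them.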
Regarding the second part of your post, data for 13 (out of 21) of the main working sites which appear in the test data are also present in the training data. This corresponds to approximately 70% of the test data. Of those 13 sites, 9 appear only in the additional training data sets. Moreover, the additional data can be used not only to better train your models, but also to more accurately evaluate their performance. Such an evaluation could be far more useful than the one provided by the preliminary score.
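Figures of this kind (how many test sites are also in training, and what share of test rows they cover) can be checked with a few lines. This is a minimal sketch on toy data; the function name `site_overlap` and the per-row site id lists are assumptions for the example, not the competition's actual files.

```python
from collections import Counter


def site_overlap(train_sites, test_sites):
    """Compare per-row site ids of the training and test sets.

    Returns the test sites seen in training, the unseen ones, and the
    fraction of test rows that belong to seen sites.
    """
    train_set = set(train_sites)
    test_counts = Counter(test_sites)
    seen = {site for site in test_counts if site in train_set}
    unseen = set(test_counts) - seen
    total_rows = sum(test_counts.values())
    covered_rows = sum(test_counts[site] for site in seen)
    return {
        "seen_sites": sorted(seen),
        "unseen_sites": sorted(unseen),
        "row_coverage": covered_rows / total_rows if total_rows else 0.0,
    }


# Toy per-row site ids, invented for illustration only.
train_sites = [264, 264, 373, 437, 500]
test_sites = [500, 500, 500, 700]

overlap = site_overlap(train_sites, test_sites)
print(overlap)
```

The same seen/unseen split can also guide local evaluation: holding out whole sites from the training data mimics the situation where the test set contains longwalls the model has never seen.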
Best regards and good luck!
Many thanks for the explanation, I fully agree. However, the most eventful sites, 264, 373 and 437, which account for 53% of all training data seen so far, have no instances in the preliminary testing set, which more or less means that we are being tested on different sites than the ones the training data cover :). Again, this does not have to be a problem, but it may explain the big discrepancies between different parts of the data in this competition. I guess such is the reality of this data.
Once again many thanks, Dymitr