Neither the training nor the test data table is ordered by time. We don't want to reveal those orderings for now.
Is the test set from a period after the training set?
Not knowing the order makes it impossible to construct a sound validation strategy. I am stuck with standard K-fold cross-validation, which inevitably means training on samples generated after those in the validation set. In real life we of course always train on samples from the past and predict the future. If we don't know the order of the training set, our validation strategy will inevitably overfit.
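For illustration, the walk-forward scheme the poster has in mind could be sketched as follows, assuming timestamps (or at least the row ordering) were available. This is a minimal pure-Python sketch, not the competition's procedure; the sample and split counts are arbitrary:

```python
def walk_forward_splits(n_samples, n_splits=3):
    """Yield (train, val) index lists where every training sample
    precedes every validation sample, mimicking real-life forecasting."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold_size))              # the past
        val = list(range(k * fold_size, (k + 1) * fold_size))  # the future
        yield train, val

for train, val in walk_forward_splits(8, n_splits=3):
    print(train, "->", val)
```

With random K-fold on unordered data, nothing guarantees the `train` indices come before the `val` indices, which is exactly the leakage being described.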
Yes, the test data is from a period after the training set.
I understand the issue. I will ask the rest of the organizing committee what we can do about it.
Thank you for your feedback,
I understand your point perfectly, but unfortunately we can't reveal the ordering of the training data, as it could potentially undermine the anonymization process. We cannot risk that at this point.
Regarding the validation procedure, please note that splitting the data based on time is not the only viable option. Since data from each of SoD's clients can be assumed to be independent, you may validate your models using splits by client code, so that no client's samples appear in both the training and validation folds. I realize that it is not a perfect solution, but it is better than using random validation folds.
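The suggested client-code split could look something like the following pure-Python sketch. The per-row `client_codes` labels are an assumption about how the data is annotated; the key property is that all rows of a given client land in exactly one validation fold:

```python
from collections import defaultdict

def group_folds(client_codes, n_folds=3):
    """Assign each client (and therefore all of its rows) to one fold,
    so no client appears in both training and validation."""
    clients = sorted(set(client_codes))
    fold_of_client = {c: i % n_folds for i, c in enumerate(clients)}
    folds = defaultdict(list)
    for row_idx, code in enumerate(client_codes):
        folds[fold_of_client[code]].append(row_idx)
    return [folds[i] for i in range(n_folds)]

# Toy example: 8 rows belonging to 6 hypothetical clients.
codes = ["A", "A", "B", "C", "C", "D", "E", "F"]
for k, val_idx in enumerate(group_folds(codes)):
    train_idx = [i for i in range(len(codes)) if i not in val_idx]
    print(f"fold {k}: train={train_idx} val={val_idx}")
```

This does not eliminate temporal leakage, but it at least prevents the model from memorizing client-specific patterns, which is the rationale behind the organizer's suggestion.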