Phew. It's time to breathe a little more freely. I haven't been writing in my diary for three weeks, haven't been learning English, haven't been reading my textbook on mathematical statistics, haven't been reading psychology books, stopped doing exercises at home, stopped training my memory and attention - because I had decided to participate in the International Data Analysis Olympiad (IDAO) by Yandex, Higher School of Economics, Harbour.Space University and Sberbank.
Of course, I wasn't counting on passing even the online qualifying round. To me it simply looked like an interesting training opportunity, and my aim was just to make a successful submission.
The task of the competition was to predict users' future interests on an online marketing platform. Logs of users' views across different categories were provided, and using this history the participants had to select a subset of users and, for each of them, five categories that the user had not viewed in the last three weeks but would view in the following seven days. There were two tracks: in the first, you only had to upload a file with the concrete predictions; in the second, you had to provide a program that would run on the server under time and memory limits. I chose the first one, and although after submitting to Track 1 I would have had time to work on Track 2, I decided not to. I would only have managed a baseline solution that would not considerably raise my position on the leaderboard; it would merely satisfy my curiosity about how much machine learning actually improves the prediction. I stopped at the 54th line of the "public" leaderboard, and the final results will be evaluated on the hidden "private" part of the data.
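Just to make the task concrete, here is a toy popularity baseline in the spirit of what I mean - purely my own sketch, not the official benchmark, and the column names (user_id, category, timestamp) are placeholders rather than the real IDAO schema:

```python
import pandas as pd

# Toy baseline: for each user, recommend the five globally most popular
# categories that the user has NOT viewed during the last three weeks.
# Assumed placeholder schema: columns user_id, category, timestamp.
def baseline_predictions(views: pd.DataFrame, cutoff) -> dict:
    # Categories each user viewed during the last three weeks (timestamp >= cutoff).
    recent = views[views['timestamp'] >= cutoff]
    recently_seen = recent.groupby('user_id')['category'].apply(set)

    # Categories ranked by overall popularity across the whole history.
    popularity = views['category'].value_counts().index

    predictions = {}
    for user in views['user_id'].unique():
        seen = recently_seen.get(user, set())
        # Five most popular categories the user has not touched recently.
        predictions[user] = [c for c in popularity if c not in seen][:5]
    return predictions
```

Something this simple obviously ignores each user's personal tastes, which is exactly where a real machine learning model would be expected to help.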
Anyway, a preliminary 54th place out of 107 teams with at least one submission (of 900 registered) doesn't look like such a bad result, especially for a first participation in an "international" competition where teams could have more than one member. (But I was alone.)
The main problem for me was that Yandex released 900 MB of logs with views from more than a million users, and processing such a huge amount of data turned out to be very laborious on my machine. Even their sample benchmark gave me a MemoryError. My models also tended to eat an enormous amount of memory and an impermissible amount of time. Occasionally my computer hung and I had to reboot. So I had to constantly save intermediate results to disk. The Pandas library would have helped greatly, as I now see, but unfortunately I haven't studied it yet and couldn't grasp its principles quickly. Pandas, I need to make friends with Pandas...
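For future me, a minimal sketch of the kind of chunked processing I have in mind, assuming the log is a CSV; the file name and column names here are my own placeholders, not the actual competition format:

```python
import pandas as pd
from collections import defaultdict

# Aggregate view counts per (user, category) without ever holding the full
# 900 MB log in memory: read the CSV in chunks and accumulate partial counts.
counts = defaultdict(int)
for chunk in pd.read_csv('train_views.csv',
                         usecols=['user_id', 'category'],
                         chunksize=1000000):
    for key, n in chunk.groupby(['user_id', 'category']).size().items():
        counts[key] += n

# Persist the intermediate result so the raw log never has to be re-read.
pd.Series(counts, name='views').to_csv('user_category_counts.csv')
```

Saving such aggregates to disk once and reloading them later is exactly the "intermediate results" workflow I ended up doing by hand, only much clumsier.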
Just to complain a little: I would not say the contest was thoroughly organized. After registering online at the end of December I was left wondering what to expect next, until on the 25th of January I was notified that the contest had started. I guess the 900 MB of data was a problem not only for me, or there would have been more active participants. The provided benchmark didn't work on my average computer; the metric formula and its implementation differed slightly; I got several 500 Internal Server Errors while submitting my predictions. One of the emails stated that the private part would be opened on the 9th of February, but it was opened only on the 10th. The 11th of February was declared the end of the online round, but today is the 12th and submissions are still being accepted. (No, I don't want to try any more; I'm tired and consider the competition over for me.) And they wrote that they had encountered technical problems with Track 2... All these small unpleasant things really spoil the impression, though of course I consider the opportunity to look at practical Yandex tasks notably valuable.
Well, I've finished participating for now, but I can't just draw a line under it, as I need to bring my code into order and push it to my GitHub account - it can be useful as a portfolio sample. And I have to do that in parallel with the homework for a free Data Science course that started recently (Data Mining in Action). And I have unfinished .NET code for one of my ideas... Fuck. And the textbook on mathematical statistics, and classical algorithms, and Pandas, and Git principles, and another free course (OpenDataScience). And Lingualeo lessons, and attempts to improve my memory and attention, and sports training... And all the information I need comes in forms that are so hard to digest - English, formulas, higher mathematics, Python code in two different versions, unknown libraries, unstudied tools... Fuck! Fuck!!! FUCK!!! Why is it so hard to become a good machine learning specialist?