Data Science: Kaggle GRANDMASTER in half a year? | Pavel Pleskov, Data Nerds

Captions

If you think that a Kaggle Grandmaster looks like this, you are wrong. A Kaggle Grandmaster looks like this. And today we are going to interview him.
Pavel, thanks for coming today. My first question to you: could you tell us about your background?
I have an exclusively technical background. I finished Math School 57, graduated from Moscow State University majoring in maths, and then from NES (the New Economic School in Moscow). After that I worked in consulting for a year.
Where in particular?
Oliver Wyman. It is not one of the Big Three, rather Big Three and a half. I was there for 9 months. One of the worst things there was the futile work. Of course, investment banking would be worse than that, but still. Then I decided to follow my passion, which is algorithmic trading. For four years I worked in high-frequency trading; we were making trading robots for the stock market. Then I spent two years working in one of the well-known funds here in Moscow, and for two years I even had my own fund, but over time I stopped liking it there as much, and I realized that I wanted to do more data analysis, since in high-frequency trading all the data you have is time series: time series of financial instruments, and a lot of talented people solving the same problem, predicting these time series over some time interval. Physicists, mathematicians, economists, International Mathematical Olympiad champions: all of them do the same task, the same routine. I realized that this was a very inefficient way of using their brain power. Those people could go into academia, making way less money. That's why many of them go into the trading industry. There you can earn a lot, but there is very little data; I mean, the type of data you analyze is very limited. Now we see a really tremendous hype around AI, and neural networks analyze images, texts, well, any data actually. So I understood that I wanted to go into data science.
How did you get there?
After all, it doesn't happen that fast, does it?
Yes, but I was lucky that I had accumulated some capital and could afford not to work for some time. The first thing I did was watch all the standard courses on this topic on Coursera. I watched them all and did all the home assignments. However, it turned out to be a completely inefficient way to get to grips with data science, because all the theoretical things were forgotten too fast, and all data analysis is based on manual work: you have to deal with data. For example, in consulting you deal with Excel, but when you work in real data science you deal with Pandas, Python, etc. And the best way to dive into this job is to participate in competitions. I began to participate in competitions on the Kaggle platform, and there I gained all my experience.
But, look, you started to participate in this kind of activity in September '17. So it's less than a year ago.
Yes, it's less than a year ago. In November I won my first prize and earned five thousand dollars. There wasn't a single neural net in it; it was a competition on adapting text in Russian. There was a big bias, I mean, in favor of participants who knew the language, although there were foreign participants among the winners too. But for me every new competition is a kind of new experience, so it was possible to grow exponentially: every day you learn new methods, new types of data, new things about software and hardware, etc.
Six months after you started on Kaggle you are a grandmaster in competitions. How did you manage to reach that success? I know for a fact that people spend years on this.
Yes, I'm a grandmaster. I got the Grandmaster title in seven months, and I was given it right before my 30th birthday. It was absolutely the best gift, the one I was waiting for.
Happy birthday, Pasha, right?
I worked my fingers to the bone for six months, 10-12 working hours per day. I was doing only competitions and contests.
I was optimizing hardware for that, bought some new hardware and built a devbox, because I had only my laptop. I spent all the time I had, absolutely. And I became top-20 in the world and top-3 in Russia. But in fact the answer to how I managed to do it is simple: I got lucky. In other words, my background is a perfect fit for competitions and data science. Clearly there is the mathematical background, the logic of mathematical methods; this is very important for forming hypotheses and then verifying them. Then there is my programming background, which I acquired in trading. I started coding in C# and C++, then I absolutely fell in love with Python, because it is not C++. It is much easier. It is called "the language of housewives". But I liked it so much that I didn't use anything else to code anymore. So I enjoyed all the tools I applied. And I have a background in trading. One of the key objectives in trading is making predictions on time series. But there is also a side task, which is to search for imbalances and inefficiencies in the market and things like that.
Arbitrage?
Yes, arbitrage is one of the ideas. But they can vary hugely, and they may have no connection to time series prediction or anything like that. They might relate to infrastructure or something of that sort. These things are really useful in competitions, when you need to come up with something "outside the box". In my opinion, all these things together allowed me to advance so effectively: I had the opportunity to spend a lot of time, and I had the perfect background. Moreover, the key thing is that I got so much pleasure from it.
How many competitions did you go through?
In total, about 30 contests and a couple of hackathons.
Where we actually met.
Yeah, right. And 15 of them were successful, and I got medals or gifts, money and so on.
Tell me about your mathematical background.
In my view, practical data science doesn't require deep knowledge of math. You need to know the various tools and how you can apply them, right?
Yeah, in fact there are two approaches. There are people who really want to dive into something like neural networks and understand how they are built. For this you need some fairly hard linear algebra and an understanding of how matrices are constructed, operations on them and so on. But there is also a more practical, business approach, where you basically just import the packages that are already available, you don't look deep inside, and you solve only the practical task of optimizing some loss with a certain algorithm, and that's it. Those who absolutely love mathematics will find it hard, because there is little theory on how neural networks work, and algorithms in general, and ML of course. There is really very little mathematics there, because it is optimization, it is not deterministic, and therefore they will be disappointed. But the good news for those who do not know math well is that it is not necessary to know it in the first place.
How specifically did your background help you?
My math background helped me mostly in structuring the process logically, or in research generally. I would say that in terms of research, ML doesn't differ from any other academic area or scientific activity. You form hypotheses and test them on the data; most of your hypotheses, 99 out of 100, will turn out to be invalid. But once you guess right, it advances you further: you create a better model and improve your results. This method is an absolutely strict logical process. It's hard to rethink it or find a way of deviating from it. There is the task of choosing the optimal architecture and optimal parameters; if you didn't test some hypotheses, it is likely that your competitors already tested them, and that the golden grain was in fact there.
So, we might call it a technical mindset?
Or even a consulting mindset,
when you lay out the possible options in a structured way. Or we might call it a business approach, like people who run a startup. The first thing they do when they have an idea is test it on a small sample of their friends, or make a small product. The same applies in data science. You have some kind of hypothesis. You put together a small pipeline, set up a plan and test it. If it works, then you can make your model more sophisticated.
Could you tell us about the hardware you put together to participate in competitions?
Yeah, a great question. I originally had a desktop with something old inside, with an Intel i7 and 32 GB of RAM. The first thing I did was install SSD drives. They have a read speed of 500 MB/s. However, sometimes that was not enough for competitions. There were several competitions where the whole dataset consisted of a million very small pictures. When the neural network starts reading them in several threads, because the pictures are very small, the drive can get overloaded with too many requests. For these purposes there are even faster drives, with speeds of three and a half gigabytes per second. You have to plug them directly into a PCI slot on the motherboard. Sorry for the technical details, but these are things every data scientist has to take into consideration, because you have to deal with them. There are few options: either assemble your own desktop or constantly pay for some cloud solution. But for me and for many people hourly payments are uncomfortable and not an option. Plus it can be really expensive. Therefore people usually tend to pay a few thousand dollars to build their own workstation. When upgrading my devbox last year, I bought a video card. I got it on Avito; it was used, but that was absolutely not a problem. After I won a few more competitions and received the prize money, I built a somewhat more expensive and serious workstation. Much more serious.
It already has 4 cards: 32 GB of RAM and 4 GPUs. But it may still not be enough for some tasks.
Wow, what kind of computer do you need for these tasks then?
There are stories that well-known GMs use devboxes with 500 GB of RAM. But that is true for ML: everything involving forests and the various XGBoost-type models. With those you can scale up as much as you want, really a lot. It is difficult to intervene in how the standard models operate, but model quality can in principle be improved if you create a lot of features and interactions between them. So it takes a lot of memory by nature.
So one will never become a grandmaster working just on a MacBook?
That is not entirely true. There are cases when people form a team and each member brings his or her own expertise, so even people without video cards have won team competitions. Sometimes you can find some cunning leak in the data, and there are cases when one can use some kind of tricks or shortcuts: think about the problem from a different angle, not build standard models but see something that the rest miss. You don't need a super powerful computer for that. On average, of course, hardware quite often decides what place you will get.
Have you ever participated in team competitions?
Most of the competitions on Kaggle I solved as part of a team. I have two solo medals. Working in a team helps a lot. You see what techniques and methods other people use, what kinds of tools, what repositories, what packages, how they write code and how they handle data. In fact there are some very small tricks.
For example, you keep all your preprocessed data on your hard drive so you don't have to preprocess it again, or you keep all your Jupyter notebooks with the code in all their versions so that you can reproduce the solution later. There is a big problem at the end of competitions: if you claim a prize, you must fully reproduce your solution. And very often, when people have been doing research for two months and have analysed thousands of hypotheses, they forget what they have already done and in what order. They have mixed a million decisions, and reproducing all of them is absolutely impossible. There are some ways of "managing" yourself, and you can pick up this knowledge from your team members. In a team it is also easier to improve yourself. When you are alone, you can stop and get lazy at any moment. But when you feel responsibility to your team and you see that your colleague came home at 2 a.m. because he was working on competition tasks, or spent the whole weekend on them, you know that you must work hard too. And it's clear that the synergy of different skills helps you reach great heights and fight for the first places in world competitions.
Do you have the same team or different teams?
It is an advantage of Kaggle that teams change. I had the same guys on my team only twice; more often you communicate with different people. The best resource for this is the Open Data Science community. It is a Slack team called ODS, and that community already has more than thirty thousand Russian-speaking data scientists. And certainly a lot of them get involved in competitions. Many people are aware of the need to be in a team. Someone can help with hardware, someone with expertise in stacking, and someone with neural networks. When all these skills are combined in one team, it brings good results. So it is very easy to find like-minded people there. You say: "I have these results and I used several methods.
Let's unite our solutions and together it will be better." So you can combine solutions, and the biggest plus here is the diversification of solutions and approaches. Usually people start a competition solo. Some start early, some later, and closer to the deadline people unite in teams. On Kaggle it is possible to join a team up until a week before the end of the competition. The most important moment is when you choose your teammates or one of the teams chooses you.
Do you choose a week before the end? Is it possible to do it earlier?
Yes, there are such cases, but very often they don't work out very well, because people in fact develop one idea together and there is no diversification. Very often people can discuss things for two months, but sometimes they just start getting lazy, just as if there were only one of them.
Listen, let's talk about the practical side. Competitions are cool, but what do they mean for work? How quickly do headhunters start hunting you?
It starts with any prize place at any hackathon, because normally the goal of any hackathon is headhunting or finding teams with interesting projects. And basically after the hackathon where we met, recruiters wrote to me. But at that time I had left Russia for six months, and I could not take up their job offers. But there was a lot of activity in Russia when I entered the top 100.
When did you enter the top 100?
At the end of winter.
So in fact six months after you started competing on Kaggle?
Yes. But actually, for those who have a lot of time, entering the top 100 in six months is a feasible task, given an adequate background and a big love for the process. But if you do it for two hours after work and not on weekends, it will take a lot of time. Do not expect top-100 results. It is the top 100 people. In total there are 80,000 people on Kaggle. So, the top 100 of 80,000. You have to understand that there are 80,000 people on Kaggle with a rating.
But in fact there are one million registered users. Certainly there are a lot of fake accounts and so on. But basically it is a very popular platform and a lot of people know about it, especially in the world of data science. So recruiters began to write to me closer to spring. Moreover, as I advanced in competitions, they offered completely different positions, from a simple data scientist to head of data science. Clearly the salaries varied a lot; in fact the range was from 100 to almost 500 thousand rubles, and there are many disputes about it. It very often happens in the ODS community that a company opens a vacancy with a very low salary range, while everyone understands that the market is very hot and there are few such specialists on the market. Even worse is when there is no salary range at all. Then everyone hits dislike. There are some Moscow companies that believe 100K rubles is a good salary for an experienced data scientist. Sberbank, for example, offers salaries from 400K rubles. They can afford it, so they are "vacuuming" the labour market very aggressively. So our labour market is very hot, and actually it is new and there are not many good experts in it. And these salary ranges are absolutely striking. Therefore it's a good idea to enter the labour market now, especially if you are ready to spend half a year on hard work. So you work hard and move forward, and there is a chance to work somewhere for a maximum of 40 K. But I would like to note that most companies do not need a great data scientist. Such a company needs to solve very simple problems such as data cleaning, data gathering and business analytics. It is a bit like data science, but a slightly easier version. And there are a lot of such specialists on the labour market. Certainly they do not cost that kind of money, but there are many more of them.
There are people who just took some courses on Coursera, did something hands-on and took part in a couple of competitions, and that's enough for most companies.
Tell me, what is the difference between Kaggle data science and the business data science that is used in companies?
Yes, it's a very cool question, because for me it was shocking, though quite predictable. When I finished competitions and moved to consulting, the biggest difference was, of course, in the quality of data. Usually the company sends you very "dirty" data. Or even worse, they have no data at all, and then your first task is to design an engineering system that will collect it in the proper quality. These steps are absent in competitions, because there the company prepares the data sets for you. They usually spend a lot of money to collect and clean them. And they give you the metric you need to optimize to get your place in the ranking, one that is already aligned with their business objective. That is itself a task: aligning the business metric with the optimization metric. All this is given to you in competitions but not in real life, and you spend a lot of time on these steps. The second thing is that you do not need state-of-the-art models in business. The main thing is that the model is easy to explain and then easy to run in production on not very powerful computers, mobile devices and so on. That's a very big difference from competitions. In business it should be simpler, whereas the usual practice in competitions is stacking a very large number of models. It can be 100 models, each of which can train for whole days, simultaneously on many GPUs. It can last quite a long time. Some types of business do use the same approach, where it is vital to fight for every decimal point.
What are you doing now?
Now I'm working as chief data scientist in the consulting company Data Nerds. It is a young company. We work with large corporations.
There is no narrow specialization in what we do. We position ourselves as a company that can work with any data and any algorithms. We have very bright young graduates of CMC and NES on our team. We are actively searching for customers and are actively involved in several projects. We already have experience in retail, agriculture, banking and a bit in telecom. That means we already cover the main industries. We have had some projects in these areas.
Are there any examples of projects? What are they?
For example, there is one in telecom. The most standard task is forecasting customer churn, where you have a lot of different information about customers and how many of them have left. If you learn to predict which customers might leave, you can target them and motivate them with discounts or emails and try to change their mind. The usual tasks in retail are predicting demand for different goods or finding the best way to set prices. It is quite a dual problem. But again, these industries are good because they usually have a lot of data. The IT systems were built a long time ago and have great data collection, so you can work with them and bring value to the business right now. There is no need to wait for years to collect the data. There are lots of projects in agriculture related to computer vision, when you need to detect, let's say, a sown field in pictures taken from an airplane or from space. Or there are routine operations that don't come with any data, but you can always install a camera to collect that data, label it, and build some kind of model which tracks, for example, whether or not the employees wash their hands, or steal fertilizer, or something else.
Do you deal with NLP?
There is the standard translation task. It seems to be solved, for example by Google Translate. But first, this system is not publicly available.
Second, there are some specific topics that are very difficult to translate, for example some specific languages or themes. But it is still possible to work with them if you can accumulate a sufficient number of documents in both languages. Plus there are a lot of tasks in the legal industry, because these people work with a large number of contracts; sometimes they are scanned, sometimes they contain the original text. They have to search them for a large amount of information, for example dates, addresses and so on. And machine learning can deal with all of that easily. It is now a big market in Russia.
What makes you different from other companies, e.g. subsidiaries of the Big Three? Any similarities?
The situation there is the following: they work in the same direction, but in a niche. Every Big Three company and their subsidiaries specialize deeply in data analysis and machine learning, not in Excel. Not so long ago McKinsey & Company bought QuantumBlack, which did analytics for F1. All of them work with terabytes of data, and of course they know tabular data inside out. But the Big Three teams have the problem that they mostly work with tabular data and don't work enough with texts and pictures. Therefore small companies like ours can be much more flexible and quickly implement models or products for different companies, because the number of things that can be done with more data tools, without using traditional tabular data or time series, is amazing. Of course everyone saw great apps like Prisma, which were popular two years ago. And that was only the tip of the iceberg compared to what you can do with pictures and text.
It would be really interesting to have a look at the rest. In addition, Kaggle is the biggest competition platform, containing the largest collection of free, quality data sets. Companies have invested a lot of money to collect and label these data sets, and moreover, people then looked at them, tested, checked and cleaned them, and some models began to be used. And all of this can be found online absolutely free of charge. Only a few people understand how big the potential is.
Pasha, I want to ask about career opportunities. It seems that with your professional background and experience you could easily find something now in the United States, where everyone is trying to get now, especially Silicon Valley. But you stay here. Why?
One of the problems is that it is now quite difficult to get a visa there. Especially with the current president, laws have changed, and even such large companies as Google and Facebook constantly tell staff they have hired that there are no quotas this year: you can't come. Sometimes people are sent for additional verification. For example, a man came to work in the US, and according to the law he must undergo additional verification and must leave the country, though he is already employed. This is one of the reasons and one of the problems. Second, Kaggle is known throughout the world, and people know who the grandmasters are. All of them are in the top 100 on Kaggle. But if you move to the USA you get a decrease not only in social status but in your working status too. That is, no one in the USA will make you head of data science in a company, although in Russia I can get such a position, and I do not like this situation. So if you come to the USA you'll be just a data scientist from Russia. It does not matter that you are a grandmaster. And we think that wages in the USA are high. But you have to count what is left after taxes. For example, a studio apartment in San Francisco costs 4,000 dollars a month.
Clearly San Francisco is the most expensive city in the world for renting an apartment. But if you count what remains of the salary as plain disposable income, in Russia in the current market you can get very comparable numbers. Of course the level of public goods you receive in the USA and in Moscow is different. There are some disadvantages in Moscow, but as places to work or live they are quite comparable.
Tell us, where do you see yourself in the next five years? Do you plan to develop professionally in data science, or do you see something related?
I can say that I have found my vocation. I get absolute pleasure from my job. When I participated in competitions, my girlfriend often noticed that I just woke up at 5 a.m., went to the computer and started running some network or some code, be it Monday, Saturday or Sunday. I have never felt that in my whole life in other jobs. So of course I will continue to develop, regardless of the hype around this topic. I just like it. Plus my background fits perfectly. And although I have already participated in many competitions, I understand that little time has passed. I've been actively doing this for a year, and I have quite a lot of gaps in knowledge that I want to cover. So my goal is to become a world-class expert in all data science methods, and this is what I actively work on: I continue to participate in competitions, discover new methods, communicate with people, give talks, etc. As for five years from now, I would say that the goal of our consulting company is to make a product in the end. For this we want to build up expertise in a certain industry, get our team working well together, etc. I think it is quite possible now.
It is so cool that you have already found your calling, one that you love so much. Few can do it. Pasha, please give us five tips on how to find yourself. And if it is data science, how to find yourself in data science.
Well, five is too many. That I would do for money.
But let's start with a couple of them. There are two very important and very simple pieces of advice that a lot of people know but few follow. First, try to do what you like and do not think about money. Every morning you can get up and think: "What would I do today if I did not need to earn money?" I was in that situation for a year, and every morning I got up to work on Kaggle competitions. I had no question about what I would do today. Every day I grew and found out something new. I could improve my skills and my use of tools, and I really enjoyed it. It is clear that not everybody has the resources for this, but you can at least think about it. Plus you need to understand yourself well. You must understand your strengths. Knowing your strengths and understanding what you would like to do not for money: these are the two main secrets of how to find your calling. It was not difficult for me to understand my strengths. I was interested in technical skills and everything related to mathematics, programming and analysis. However, at first I went into trading. As it turned out, it was not the best application of my skills. I satisfied some of my own needs for data analysis, problem solving, etc., but in the end this industry was not suitable for me, because everything there was secret and, for example, it was not easy for me to perform and be a public person. Now, in data science, that is almost an integral attribute. When you often participate in competitions, you get invited to sit on juries at conferences, hackathons and so on. And I get great pleasure from it. That has also become a kind of secret of success.
If someone who is watching now would like to repeat your path in data science, how should they start moving in this direction?
First you need to answer the question of why you need this. That is, suppose your goal is: "I'm a programmer and now I will switch to a new industry where the pay is a little higher."
Usually it ends badly. But if a person really understands that he takes pleasure in data science, in analysis, and he has had some projects which he did just for fun, then the first step is definitely to take a couple of courses on Coursera: the specializations from Yandex and MIPT, which are available in Russian. They are very good courses. I can vouch for them; I took them myself. The entry threshold is very low, considering that they are in Russian and not in English. It is clear that there is more information in English nevertheless. And certainly, participating in competitions is absolutely the best way to level up as quickly and easily as possible. It looks like "Just do it": just take half a year, work hard, use your brain to the maximum, and have a little luck. Then you will become a grandmaster, and if not, then a master.
Thank you very much, Pasha. Well, what to do is clear. Then just do it. Thanks to all.
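
Two of the small tricks mentioned in the interview, keeping preprocessed data on disk so it is not recomputed and combining teammates' predictions for diversification, could be sketched roughly as follows. This is a minimal illustration in plain Python and not Pavel's actual code: the helper names `cached`, `blend` and `mean_abs_error`, the file name `features.pkl` and all the numbers are assumptions made up for the example, and the toy predictions are deliberately chosen so that the two models' errors partly cancel, which is not guaranteed in practice.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Load a pickled result from `path`, or compute and save it on first use."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


def blend(predictions, weights=None):
    """Weighted average of several models' predictions for the same rows."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# --- Trick 1: preprocess once, reuse from disk afterwards -------------------
calls = []  # records how many times the (hypothetical) slow step really runs


def slow_preprocess():
    calls.append(1)
    return [x * 2 for x in range(5)]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "features.pkl"  # hypothetical cache location
    first = cached(cache_file, slow_preprocess)   # computed and written to disk
    second = cached(cache_file, slow_preprocess)  # read back, not recomputed

print(len(calls))  # 1

# --- Trick 2: blend two teammates' predictions -------------------------------
y_true = [10, 20, 30, 40]     # toy targets
model_a = [12, 18, 33, 37]    # toy predictions from teammate A
model_b = [9, 23, 28, 42]     # toy predictions from teammate B
blended = blend([model_a, model_b])

print(mean_abs_error(y_true, model_a))  # 2.5
print(mean_abs_error(y_true, model_b))  # 2.0
print(mean_abs_error(y_true, blended))  # 0.5
```

A plain average is only the simplest way to combine solutions; the stacking described in the interview trains a further model on top of the base models' predictions, which this sketch does not attempt.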

Info

Channel: Флесс

Views: 79,878

Rating: 4.9241195 out of 5

Keywords: fless, flessibilita, машинное обучение, data science, KAGGLE, kaggle, grandmaster, GRANDMASTER, программирование, software engineer, programming, machine learning, flessguest, flesstalks, consulting++, ConsultingPlusPlus, fless.pro, рогуленко, fless шад

Id: 5wMAPUrd0ag

Length: 36min 8sec (2168 seconds)

Published: Thu Aug 09 2018