Caution:
- This video introduces my workflow on Kaggle.
- I do not explain the data-analysis side, so-called EDA.
- Because the competition is still active, I am careful about what information I share.
- Nothing here goes beyond the features already shown in public kernels.

Hello, I am TKM. Today I would like to talk about Kaggle competitions. I am going to participate in an active competition and walk through everything up to making a submission. The reason I decided to make this video is that I am enjoying a period of unemployment, so I wanted to try something new. Besides, I am interested in learning video making. Also, looking at my Twitter timeline, I see many people who want to try Kaggle but do not know how to join or how to approach a Kaggle competition. So I figured that if I made an end-to-end Kaggle tutorial video, from environment setup to making a submission, I could help those people. That is why I decided to make this tutorial. I like the NicoNico video where someone builds Tetris from scratch in an hour. I am hoping I can do the same kind of thing with Kaggle. Well, let's try it.

I am showing the Kaggle top page. It does not look like much here because it is opened in a private tab, but once we log in, we see this page. In the left column there are Discussions and Kernels, which show data and graphs. This is the timeline. We can vote on them. If you get many votes, the Kaggle team may award you a prize of about $500. This chance comes once a month; they are looking for good kernels, so it is worth trying to write one.

I participate in Kaggle as tkm2261. I am a Competition Master. Let's look at my profile. The competitions I participated in are listed here. I have been on Kaggle for two years, but I have only been serious since the end of 2016, so I have been active for about a year. This looks like the GitHub heat map, although the colors are different. I participated in the Instacart competition and the Bosch competition, which was about detecting defective items. One of my teammates, hskskk-san, shared materials about that competition. If you are interested, please search for them; I will share the link below this video. Second, the Quora Question Pairs competition. In that competition we built a model to judge whether two questions have the same meaning or not. You earn a gold medal if you finish in the top 1%; if I remember right, a silver medal is top 10%. If you get one gold and two silver medals, you become a Kaggle Master. Becoming a Kaggle Master is one of the goals of many Kagglers, and I was happy when I became one.

In this tutorial... well... in this tutorial, I want to participate in a competition. Click the Competitions button and there is a list of competitions. This competition's prize is very high, but only U.S. residents are eligible to receive it, so the number of participants is not large. This prize is high too, so many teams are involved: 3,800 teams. This one is an image recognition competition. In an image recognition competition we need a lot of computational resources. Right now smly, a Japanese Kaggler, is in the top group. He is a Grandmaster. Image recognition competitions have fewer participants, but they take a lot of effort and computation. It is good to try one if you are interested. Recently there is a lot of demand for people with image processing skills; if you can show solid experience in image processing, you can get a high-salary job fairly easily. That is another reason to try one if you are interested.

In this tutorial, I am going to try this competition: Porto Seguro's Safe Driver Prediction. Porto Seguro is an insurance company in Brazil. We will build a model that predicts, from the given data, whether a driver drives safely or not.
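[Note: as a reference for what "making a model and a submission" looks like in this competition, here is a rough baseline sketch in Python. It is not the code built in this video, and the file and column names (input/train.csv, "id", "target") are assumptions based on the usual Kaggle layout, so check the Data tab. The competition is scored with the normalized Gini coefficient, which for a binary target equals 2 * AUC - 1, so plain AUC is used for validation here.]

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# pandas can also read .csv.gz files directly if you keep them compressed
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

X = train.drop(['id', 'target'], axis=1)
y = train['target']

# 5-fold cross-validation with AUC (normalized Gini = 2 * AUC - 1)
clf = GradientBoostingClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)
print('CV AUC:', scores.mean())

# fit on all training data and write a submission file
clf.fit(X, y)
pred = clf.predict_proba(test.drop('id', axis=1))[:, 1]
pd.DataFrame({'id': test['id'], 'target': pred}).to_csv('submission.csv', index=False)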
The number of participating teams is very large: 2,526 teams. Setting insurance premiums based on driving data is something we see in many countries. In Japan, several companies have run demonstration tests; in America, the insurance company Progressive is a famous example. This area is getting hot. Maybe it is already hot, I think.

The reason I chose this competition as teaching material is that the data is very clean. I would even say it is too clean. In the forum, some people claim that this is not good for the competition or for Kagglers, because data handling is part of what a competition teaches and Kagglers cannot learn it from this data. That may be true, but it makes the competition a good vehicle for introducing my Kaggle workflow, so I will go with it.

Let's start the competition. Move to the Data tab. Actually, I have already started; my rank is currently around 130. In my opinion, ability at Mahjong and at Kaggle are similar, because both need some luck. Of course, you need real skill to get good results consistently, but sometimes a beginner can win if they find nice features, so-called golden features. And different people are suited to different competitions. So I think it is like strength in Mahjong: you win some, you lose some. Don't be too hard on yourself; we should enjoy Kaggle casually and simply. That said, I admit I sometimes feel frustrated when I lose to someone. In particular, I cannot help checking where the other Japanese Kagglers are. Well, let's put that aside. In this tutorial I am focusing on introducing my Kaggle workflow rather than aiming for a high score. Showing my way of processing data is the main topic. How to improve your model depends on each competition; ideas are the key there, and I cannot give generic advice about that. So I am focusing on workflow and techniques.

You can see the data page here, and you can download the data from this page. Recently the 7z format is common; its compression rate is good. Personally, I like the gzip format because it is easy to decompress. I am not sure, but there may be reasons for using 7z. In one image recognition competition the data was shared via torrent, and I heard a story that someone used torrent on his company Wi-Fi and got scolded. Let's download this data. When you download the data for the first time, you have to agree to the rules; select Yes and download the files. Today I have already downloaded them. They are here, so I am going to use them.

If you have a question about this tutorial, please ask me on Twitter. Ask me freely. Also, I made a Slack channel with smly and threecource. It is the Japanese Kaggler Slack. There are about 400 people now and it is very active, so please join us if you are interested. In this channel we have a beginners-help room, which is the most active one. Many people are asking and discussing all kinds of issues, and Kaggle Masters actively reply to questions. This person is a Master, and Grandmasters reply as well. The channel is very insightful. Please join us if you are interested. In beginners-help, as you can see here, you can ask anything. Acting superior toward beginners is forbidden; we k○ll such nasty people. Oops! Please feel free to join.

Well, let's start this competition. First, I will set up a computation environment. I could do this on my Mac, but GCP is more convenient, so I will show the GCP way. First, I make a Google account. But I forget how to make a Google account... OK, I fill in this form. The name is Tokugawa Ieyasu.
(The name is a famous Japanese shogun from the 1600s.) The user name is... kaggle.porto is okay. The password is secret. Next is the birthday. I don't know Tokugawa Ieyasu's birthday, so I just fill in something quickly: 1985, and today is Oct. 17th. Sex is male. The questions below do not need to be filled in. Next, I register my credit card; otherwise I cannot get the $300 coupon. Google does not suddenly charge you a large amount of money, so it is okay to register your credit card. I don't remember this step exactly. Now, let's move on to the GCP website. Sorry, there is a fly in my room. Here is GCP. I am looking for the Console button.
I found it. No... yes. Just quickly. I am eligible for a $300 coupon, valid for 12 months.
It's a nice deal. I am not a Google salesman, haha. I will use Compute Engine and Storage, and sometimes I use BigQuery. BigQuery is very nice for Kaggle. AWS Redshift is quite expensive for personal use; BigQuery, on the other hand, is not. In my case it comes to a few hundred dollars a month on most Kaggle competitions.

First, I set up Compute Engine. What is this? We need to make a project; "My Project" is okay. I close the header. Well, I registered my credit card and got the $300 coupon. As you can see, Google never charges you without your agreement, even if you use up the coupon. If you are a student without a credit card, ask your parents to set it up; if you show them this video, maybe they will help you.

It takes several minutes to set up GCE. While waiting, I upload all the data to Google Storage. Click, choose "My Project", okay. Now I make a bucket. Any name is fine: kaggle-porto. Oh, I can use this name. In this case speed is important, so I choose Regional. The location is US, us-west or us-central. The reason I select the US is that compute instances are cheaper there than in the Asia regions, so I put everything in the US. Of course, an Asia region is fine too; there is no difference in function. The important thing is to make sure the data and the instance are in the same region. [In the video I typed "intput"; it is a typo of "input", and I fix it later.] The folder name "input" is a good choice because in a Kaggle kernel the data is placed in an "input" directory at the same level as the notebook, and it is convenient to match that layout. Now I upload the files. I am uploading in the browser; the command-line tool is also convenient and sometimes faster, but this data is small, so there is no real difference. OK, done.

It is not bad to upload the data directly to a machine with the scp command, but uploading to Storage is faster, and during a competition I often set up several machines for parameter tuning. In that case Storage is convenient because all the machines can access it. Also, you can, for example, make a folder named after the date and upload your code to it as a snapshot. That helps you keep your code reproducible. As I wrote on my blog, of course we should use Git, but in Kaggle the data matters too, and these snapshots work well in Kaggle competitions. For now I have made only the "input" folder here. It has taken a few minutes. Okay.

While waiting, I will make a Git repository. If you have a paid GitHub account, that is fine, but it does need to be a paid account, because the repository must be private. In Kaggle, so-called private sharing, that is, sharing information outside your team without mentioning it in the forum, is forbidden, so managing your code in a public GitHub repository might be seen as private sharing. Bitbucket is convenient here because you can make private repositories for free. To avoid private sharing, I will post about this tutorial in the forum, and I will add English subtitles. I usually name a repository "kaggle" + "_" + something. I already have a repository with that exact name. Okay, any name is fine: kaggle youtube porto. I will make this repository public on my GitHub later; as long as I share it in the forum, sharing the information is okay. For now, I keep it private. Done.
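[Note: a sketch of the Storage workflow described above, done with the google-cloud-storage Python client instead of the browser (pip install google-cloud-storage). The bucket name kaggle-porto is the one created in the video; the snapshots/<date>/ layout is just an illustration of the dated-snapshot idea, not a fixed convention.]

import datetime
import glob
import os
from google.cloud import storage

client = storage.Client()               # application default credentials (automatic on a GCE instance)
bucket = client.bucket('kaggle-porto')

# upload the competition files into the bucket's "input" folder
for path in glob.glob('input/*'):
    bucket.blob('input/' + os.path.basename(path)).upload_from_filename(path)

# snapshot today's code into a dated folder so runs stay reproducible
stamp = datetime.date.today().strftime('%Y%m%d')
for path in glob.glob('*.py'):
    bucket.blob('snapshots/' + stamp + '/' + os.path.basename(path)).upload_from_filename(path)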
Is the setup finished? Oh, finished! Click Create. Well, let's make an instance. Any instance name is okay; if we make many instances, we should name them properly. The zone is us-central. The instance spec is... I am sometimes asked what type of instance is good for Kaggle. In the early stage of a competition any instance is OK, but I want at least 8 cores and 32 GB of memory. That is about the top spec for a desktop machine without going to a Xeon. Memory is the important part: a lot of memory lets you handle the data freely. 60 GB of memory or more is nice, and I usually use machines like that in the middle of a competition. Or, since recently we can compute many things on the GPU, another pattern is 4 CPU cores with a GPU attached and 32 to 60 GB of memory or more. In this competition, though, we do not need much memory; 15 GB is probably sufficient. But I have the $300 coupon, so I choose 8 cores and 32 GB of memory. A big machine gets me excited.

Next, the boot disk. The default is Debian, but Ubuntu is always a good choice. I am not familiar with Ubuntu 17, so I select 16.04. There is probably no problem with Ubuntu 17, but if you want to use TensorFlow or other GPU software, you need to pay attention to the version. Today I use only the CPU, and I will set up Anaconda automatically; I will introduce that later. The disk size... 100 GB is enough, even luxurious. I grant all access scopes. Well, I am wondering whether to open port 80 or not. OK, I open it. But we would also have to add a firewall rule in another setting before we could reach notebooks through that port. Ah... no, I will not open port 80, because it is risky. I will open port 8888 later instead.

I attach my SSH key here. You can also use a project-wide key, but I attach it here. Then I launch a terminal. pbcopy... there is some error, but okay. .ssh/id_rsa.pub, I copy it to my clipboard. Actually, I have hardly used this Mac since I quit my job, so errors like this happen. Is the file there? OK, I use the cat command. Don't look at my key ><. OK, finished. I succeeded in creating an instance. GCP instances launch quickly, although AWS has become fast recently too. And you can connect to the machine without an SSH client: just click this SSH button. The next part is the Ubuntu setup!!