MBA to IBM Data Scientist: Exclusive Interview with Greg Rafferty

Captions
But one thing that everyone at IBM has in common is that they're consultants, so they need to be able to work with the client. They need to be in meetings with executives and be able to talk intelligently about solutions.

Hi everybody, my name is Haebichan Jung. I'm a data scientist here at Recurly and an editorial associate at Towards Data Science, and I have the very distinct pleasure of introducing Mr. Greg Rafferty, who is a lead data scientist at IBM at the San Francisco office. He has an incredible resume and I'm very excited to have him here today. So thanks for coming, Greg.

Thanks, Haebichan.

How are you?

Good, good.

How's everything going in San Francisco?

It's really busy right now, but it's enjoyable. It's good to be busy.

Great. OK, so we'll be diving into all the great stuff, but before going into the IBM aspect I actually wanted to take a step back and talk about your previous work profile, where you worked in the international field. So can you tell us more about your work abroad?

Yeah, sure. I was mostly working with the US government doing aid work. I was living in Armenia for several years and in Russia for several years. It was not really data science initially. I was in project management, working on clean energy projects. I have a background in mechanical engineering, so I did a little bit of mechanical engineering on these projects, but it was mostly management. After several years in Armenia I went to Russia and continued that work, but then later I worked at a startup as a business analyst, and that's how I got into tech. After that I came to San Francisco and got into data science.

Oh wow. So what led you to Armenia and Russia in the first place? Are these regions that you were interested in?

Yeah, it was a mix of personal and professional. I mentioned I was a mechanical engineer before. I was working in the mining industry and we had several clients in Russia, so I was working with them. I started learning the language in order to better
communicate, and while I was getting my MBA we had a sister school, Saint Petersburg State University, so I had this really great opportunity to take a class there. I spent several months in St. Petersburg and I loved it. So my career started building up around Russia and the former Soviet states, and when the mining economy crashed in 2008 I had this opportunity to go to Armenia and work with the government on these aid projects. I'd really wanted to try that out and it was good timing, so I did that and I really loved it. Had a great time.

That's kind of interesting, because you started out as a mechanical engineer. So how did you end up falling into the business side, getting your MBA, doing project management? Was that a rough transition, and how did that happen in the first place?

It was actually very smooth. I started as a mechanical engineer and I'd been moving towards project roles and managing people. My idea at the time was that I would continue with this company and lead some joint ventures abroad. In this company they highly respected MBAs, so I got my MBA through the company, and they actually supported me on it. I continued to work while I was getting my MBA, which was rewarding but really intense, having a full-time job while being at school full time.

Oh my gosh.

So yeah, it was a two-year program and it was really great. I learned a lot and I really had a great time. Then right when I finished, the economy was crashing and the opportunities to be a manager in this company were really drying up, at least for the next couple of years, so that's why it was just the time to try something new.

I see. So is that when data and analytics became part of your professional life?

Well, as an engineer I used a lot of data and analytics. I would build simple regression models, but nothing beyond Excel. I started coding while I was in Moscow working with the startup, and that's when I built my first sort of deep models
and I don't mean deep learning, but models that couldn't be handled in Excel, and I really, really enjoyed it a lot. When I got back to the US my job was very Tableau-heavy, and that was my first introduction to SQL. At that point I really started looking into all the different machine learning models available and realizing there's a lot that I could learn here, and I was really enjoying it. So then I did a bootcamp, Galvanize, in order to push myself up over the edge and really get into data science.

I see. What exactly is Galvanize?

So Galvanize is a bootcamp. What I did there was a three-month immersive program. It was full-time, about 8 to 12 hours a day and 5 to 6 days a week, so very, very intense, but only three months, so you can really kind of get it out of the way. If you have a good base, it really helps you just push yourself up over the edge and get into data science. A lot of these bootcamps are like that, but it really helps if you do have the foundational base, because it's not nearly as intensive as a master's program, for instance. So you do need to know what you're getting into and be able to fill in the gaps on your own.

Wow. So you started with mechanical engineering, then you switched to business, and then you found yourself at Galvanize doing data science. So how did you then find yourself at IBM?

I'd been working a lot with the Coursera platform, studying data science on my own and doing a couple of projects, and at Galvanize I did another really big project looking at Trump's Twitter stream. That's how I got a little bit of a reputation with NLP. IBM was looking for an NLP data scientist, and through some connections I was introduced to the hiring manager. It seemed like a really great fit right off the bat, so I moved to IBM directly from Galvanize. I'm about 18 months into my time at IBM now, and I've been doing a lot of AI work, a lot of NLP work, and then also just some basic client
based analytics work. I'm on the consulting team, so I do travel a lot and I work with clients exclusively, so every project I do is with a client on one of their use cases.

Mm-hmm. There's a lot to unpack there, so let's get started with the NLP aspect. By that of course you mean natural language processing, but can you go a little more in depth about what kind of projects you get to do with NLP at IBM?

Yeah. One of the coolest projects I did, we called it annotation, and I've actually applied for a patent on that, so I'm in the process. I'm really hoping that comes through. What it does is take a corpus of tens of thousands of documents, identify what those documents are about, cluster them, and then apply annotations to those documents so that you can build a knowledge graph around them. IBM has a tool called Knowledge Studio, and that is a manual annotation process. It takes roughly three weeks of manual annotation to build a model, and that's very labor-intensive. It's not interesting work, and it takes a domain expert to do it, so you have to have, say, a lawyer annotate these documents for three weeks, which is really not a good use of time. So what this tool does is use word2vec, then clustering, and then some feature extraction tools (an API from IBM Watson), and through this pipeline it creates these annotations, takes the documents, and creates this knowledge graph in Knowledge Studio.

Oh wow, that's incredible.

Yeah, so that was the biggest project I worked on at IBM.

So this is internal software that you're building for IBM, right? It's not consultancy work that you're building for other companies?

Correct, this project is an internal one. We're fishing around for clients that may want to use it, and if we can find one then we'll of course implement it into one of our broader Watson products and then make that available to anyone. But for
the time being, it's still in the proof-of-concept stage.

Is that a normal workflow at IBM consultancy, where you build an internal product and try to see if there are external clients who would be interested in using it? Or are there other workflows that you see happening as a consultant?

Yeah, that's actually a very rare workflow. Only a few teams take things to market using that method. In most cases the client comes up with a use case, IBM determines a solution to that and builds it directly for that client, and then that use case can be broadened to other clients. Every contract is different. Sometimes the client owns the IP, sometimes IBM does, sometimes there's some sharing. But if IBM maintains control of the IP, then we'll build it for one client and then sell it around to other clients if it's applicable to other use cases.

Are there certain time periods for these consulting projects? Is there a specific time, or do these timescales change depending on the project that you're working on?

Some projects I've been on are just one week and some can be several years. The longest project I've done has been about six months. On the project I'm on right now we have a two-year contract. I don't know if I'll be on the project for that full time, because there are a lot of different work streams, and depending on your skill set different consultants hop on and hop off to fill in the gaps. But I know some consultants who have done the same project for 12 years.

Oh.

Yeah, there are all sorts of different types of projects and different arrangements, and it just depends on your skill set.

I see. Is there a team that you steadily work with, or does it just change depending on who's available?

It's a mixture of both. My team, the applied AI team, we do work together a lot, but sometimes we'll work completely independently on different projects. On the project I'm on
right now, I'm actually leading the team, so it was up to me to hire. I have three offshore data scientists in India and then two onshore who are based locally. For the onshore ones I wanted to hire people whose reputation and skill set I already knew, so I took two people from my team.

I see. In these specific positions, what are some of the technical traits or personality traits that you seek, or that you think are important for executing the task at hand?

For data scientists there's a broad range of skills, and the specific data science skills really depend on the project. That can be anything from NLP to deep learning to just basic analytics. But one thing that everyone at IBM has in common is that they're consultants, so they need to be able to work with the client. They need to be in meetings with executives and be able to talk intelligently about solutions, and they need to sell products. That's not pushing solutions onto the client; it's understanding the client's needs and understanding how we can help the client better. So when I say sell, it's not that we are coming to them and saying you need to buy this, but that we deeply understand what they need and how we can improve them. It's not really selling aggressively; it's more passively selling, in that we show them the value and they say they want that. That's a skill that is very valuable to a consultant.

So it seems like as a consultant it's important to have both the soft skills and the hard skills to execute these tasks. It's not like you can just possess one and be fine.

It's interesting you brought that up. Just last night I was having a conversation with one of our partners about this, and he was saying that if you are in the top 1% in just the technical skill set, you'll be a rock star at IBM; if you're in
the top 1% in the soft skills, the client-focused skills, you'll also be a rock star. But if you're below the top 1%, then you really need to have both of those skill sets. So most people do need to have a very strong blend of both the client-based and the technical skill sets.

That's great. Now I want to switch gears a little bit and talk about your involvement with Towards Data Science, because you've written many articles for TDS, and especially one that comes to mind is what you already mentioned, the Twitter bot. I'm very curious about this Twitter bot, so can you tell us more about this piece that you wrote?

Yeah. The idea for that came when Trump had fired James Comey, and he tweeted out that it was a pity he'd had to fire Mike Flynn, because Flynn had lied to the FBI. Everyone had sort of come out and said, well, now that's obstruction of justice: if you knew he lied to the FBI and you asked Comey not to investigate, that's obstruction of justice. And Trump's rebuttal to that was, well, I didn't write the tweet; my lawyer wrote the tweet and sent it. So what I decided to do was analyze his Twitter stream and try to determine who was writing these tweets: was it Trump, or was it one of his aides? The way I could do that was that, prior to his presidency, Trump had always tweeted from an Android device while his staff had always tweeted from iPhones. In the metadata of the tweets you can see the source, so I used that source as a label and built a model to determine who was tweeting. After that I built a Twitter bot which listened to Trump's Twitter stream, and whenever he tweeted it would capture that tweet and send it through the model, and then I would send out another tweet saying, for example, "Trump just tweeted this, and I have 90% confidence that it was Trump that actually wrote it."

Mm-hmm.

Yeah, and so after I did
that, I wrote a long blog post on Towards Data Science about how I built the model, about the features I had built, about the results, and then about how I built a Twitter bot to actually take it live.

Oh, that's very interesting. It seems like a very classic supervised learning problem, and I wanted to talk a little bit more about your model design process. So let's start with the data. Did you manually collect the data, or did you write some programs to do this? How did you go about gathering the data?

Initially I built a scraper which just went through Trump's Twitter stream and collected everything. But later I noticed that a lot of tweets were being deleted, and I was not getting those because I was only getting data that was currently available on Twitter. So I found a repository from a live scraper that acted in real time and had been operating for a long time, so it had collected all these tweets that had since been deleted. I used that data source instead, so I could have these deleted tweets as well.

What is the machine learning model that you designed?

It was an ensemble. I built nine models: several different types of trees (random forests, gradient boosting, AdaBoost) as well as basic logistic regression, and then I built an ensemble that combined all of these. The feature engineering was the big part I did, and in the end I had about nine hundred features. A lot of those were just tf-idf word vectors, but I did engineer several. I looked at the count of all-caps words. I looked at punctuation, such as long strings of exclamation points. I looked at the times of the day and the days of the week when tweets were sent. I also looked at some sentiment analysis: I had access to a lexicon built by a Canadian researcher that ranked words on, I think, 13 different emotions, and so I used this to
rank the emotional content of each tweet as well.

I see. How did you decide to merge nine models together to create this ensemble? Was it an idea of your own, or did you see some kind of research paper that was doing similar things?

Yeah, I was just talking with a few people, and one person had gotten his PhD and written his dissertation on the effectiveness of ensemble methods. That's where I really got the idea, and it just came straight out of that conversation, just trying it. I noticed that random forests and gradient boosting had the highest accuracy among my models. I had a balanced dataset, so accuracy was the criterion I was using to tune these models. But when I did the ensemble, I had a one or two percent improvement over just straight random forests or gradient boosting, so that's why I wanted the ensemble.

I see. What was the result of this model that you built?

Well, I had the Twitter bot live for several months. I was hosting it on AWS and paying out of pocket. At one point The Atlantic wrote a big article about how Trump's aides were really mimicking his style and how it was really difficult to determine who was actually tweeting. At that point, due to the cost of AWS and the fact that I really wasn't sure who was tweeting anymore, because the aides were so good at mimicking his style, I took it offline.

I see. Oh, cool. And as a final question: many of our TDS readers are aspiring data scientists or data scientists already. Being a lead data scientist yourself, are there any words of wisdom or advice that you can share with the community, especially for those who are, say, transitioning from backgrounds like yours in business analytics or management?

I would say the most important thing, if you're looking for your first job, is to get a GitHub and populate it with some really interesting projects, projects that may not be
directly relevant to a company but that you are so excited about you can't wait to tell people about them. That enthusiasm really comes through in interviews, and it's one of the key things I look for when I'm interviewing people: enthusiasm. Not that you're willing to put in late hours because that's what the job requires, but that you want to put in late hours because this is such an interesting problem to you. So I think that's really good. Networking and then blogging about your projects is also really helpful, because it shows that enthusiasm. So that would be my advice: really nurture this enthusiasm and make sure that it comes through in everything you do.

That's great advice, and I think that's all the time that we have for today. So thank you again for coming here, Greg.

Thank you very much, it's been great chatting.

Thanks, Greg.
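
The labeling trick and the hand-engineered features described in the interview (device source as the label, all-caps word counts, exclamation-point runs, time of day and day of week) can be sketched roughly as below. This is only an illustrative sketch, not the code from the blog post: the function names, the exact feature definitions, and the simple probability averaging used to stand in for the ensemble are all assumptions, and the real model used about nine hundred features and nine models.

```python
import re
from datetime import datetime
from typing import Sequence


def label_from_source(source: str) -> int:
    """Hypothetical labeling rule: 1 if the tweet came from an Android
    client (assumed to be Trump), 0 otherwise (assumed to be staff)."""
    return 1 if "android" in source.lower() else 0


def tweet_features(text: str, created_at: datetime) -> dict:
    """A few hand-engineered features of the kind mentioned in the interview."""
    words = re.findall(r"[A-Za-z']+", text)
    # Count words written entirely in capitals, ignoring one-letter words like "I".
    all_caps = [w for w in words if len(w) > 1 and w.isupper()]
    # Runs of two or more consecutive exclamation points.
    runs = re.findall(r"!{2,}", text)
    return {
        "all_caps_words": len(all_caps),
        "exclamation_marks": text.count("!"),
        "longest_exclamation_run": max((len(r) for r in runs), default=0),
        "hour_of_day": created_at.hour,
        "day_of_week": created_at.weekday(),  # Monday is 0
    }


def soft_vote(probabilities: Sequence[float]) -> float:
    """Average the per-model probabilities that Trump wrote the tweet.
    Plain averaging is an assumption; the interview does not say how
    the nine models were combined."""
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    example = tweet_features("This is a TOTAL witch hunt!!! SAD!",
                             datetime(2018, 1, 1, 7, 30))
    print(example)
    print(label_from_source("Twitter for Android"))
    print(soft_vote([0.9, 0.8, 0.7]))
```

In practice these numeric features would be concatenated with tf-idf vectors and passed to the individual classifiers, whose predicted probabilities are then combined by the ensemble.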
Info
Channel: Towards Data Science
Views: 4,321
Rating: 5 out of 5
Keywords: Data Science, data science, IBM, ibm, machine learning, data scientist, lead data scientist, natural language processing, nlp, word2vec, watson, AI, artificial intelligence, towardsdatascience, clustering, tds, mba, greg rafferty, haebichan, haebichan jung, recurly, feature engineering, data gathering, big data, supervised learning, unsupervised learning, random forest, decision trees, knn, ibm data science, data science summer, lambda school, data science conference, algorithm
Id: 6kFS-A1FNS4
Length: 19min 14sec (1154 seconds)
Published: Thu Jun 27 2019