Monday, November 2, 2015

Summary of Intro to scikit-learn at Open Data Science Conference

Lukas Biewald, CEO of CrowdFlower, gave a three-hour workshop on what is arguably the most vital module for machine learning: scikit-learn. He gave the audience in Microsoft’s Hack Reactor Space open-source code to follow along with, available from:


He highlighted the value of participating in Kaggle competitions and defined regression as using one variable to predict another. He then gave the audience the main project for the workshop: judging the emotion expressed about brands and products. For this project, we looked at Tweets about several Apple brands and products.

99% of the time was spent on questions like “should we drop I?” or “should we drop my?” We also asked “should we remove really rare words?”, “should we remove really common words?”, “should we remove stop words?”, and “what is a word?”
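These questions map fairly directly onto options of scikit-learn's CountVectorizer. The sketch below is not the workshop's code; the example tweets and the threshold values are made up for illustration. It compares the default settings with a stricter configuration that drops stop words and rare or very common words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example tweets; not the workshop's dataset.
tweets = [
    "I love my new iPhone",
    "I hate my iPad battery",
    "My iPhone screen is great",
    "The new iPad is great",
]

# Default settings: text is lowercased and a "word" is any run of two or
# more word characters, so the one-letter "I" is dropped but "my" is kept.
default_vec = CountVectorizer()
default_vec.fit(tweets)
default_vocab = set(default_vec.vocabulary_)

# Stricter settings (threshold values are arbitrary choices for this sketch):
#   stop_words="english" removes built-in stop words such as "my"
#   min_df=2 removes rare words that appear in fewer than 2 tweets
#   max_df=0.75 removes common words that appear in more than 75% of tweets
strict_vec = CountVectorizer(stop_words="english", min_df=2, max_df=0.75)
strict_vec.fit(tweets)
strict_vocab = set(strict_vec.vocabulary_)

print(sorted(default_vocab))
print(sorted(strict_vocab))
```

Even “what is a word?” is a setting here: the token_pattern argument controls what counts as a token, which is why “I” disappears under the defaults.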

In the midst of our learning, Mr. Biewald suggested that anyone who did not want to earn a degree in machine learning should instead memorize the map at this website:


To test how well an algorithm was working, Lukas recommended that we use the “Test_algorithm_1.py” program. He also stressed that we should not test on the data we train on. For example, we should generally train on 80% of the data and test on the remaining 20%.
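A minimal sketch of an 80/20 split, assuming scikit-learn's train_test_split and a naive Bayes classifier. This is not the contents of “Test_algorithm_1.py”; the tweets, labels, and choice of classifier are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled tweets (1 = positive, 0 = negative).
tweets = [
    "love my iphone", "great ipad screen", "amazing new macbook",
    "love the camera", "best phone ever",
    "hate my iphone", "terrible ipad battery", "awful macbook keyboard",
    "hate the charger", "worst phone ever",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Hold out 20% of the data for testing; stratify keeps both classes
# represented in each part, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=0, stratify=labels
)

# Learn the vocabulary from the training tweets only, then reuse it
# unchanged on the held-out tweets.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(X_train)
test_counts = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(train_counts, y_train)

# Accuracy on data the model never saw during training.
accuracy = model.score(test_counts, y_test)
print(len(X_train), len(X_test), accuracy)
```

With a dataset this small the accuracy number means little; the point is only that the score comes from tweets the model was not fitted on.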

In the “feature_selection.py” program, he used the chi-squared (chi2) statistical test to rank the columns in our datasets from most to least informative. However, he claimed that chi2 feature selection was the least important of all the topics we discussed today.
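As a rough illustration of chi2 ranking, the sketch below scores each word column against the labels using sklearn.feature_selection.chi2. It is not the workshop's “feature_selection.py”; the four tweets and their labels are invented so that the scores are easy to check by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Hypothetical labelled tweets (1 = positive, 0 = negative).
tweets = [
    "love the iphone",
    "love love the ipad",
    "hate the iphone",
    "hate the ipad",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(tweets)

# Column names in column order, recovered from the fitted vocabulary.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# chi2 returns one score and one p-value per column; a higher score means
# the word's counts differ more between the two classes.
scores, p_values = chi2(counts, labels)

# Rank the columns from most to least informative.
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, round(float(score), 2))
```

Words that appear equally often in both classes, such as “the”, score zero, while “love” and “hate” rise to the top.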

Lukas closed with a call for us to make our own features, such as a count of exclamation points or emojis, the length of a Tweet, or the language of a Tweet.
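A small sketch of what such handmade features might look like in plain Python. The function names are hypothetical, and the emoji check is a rough assumption based on a few Unicode ranges rather than a complete emoji detector. Language detection is left out because it would need an extra library.

```python
def is_emoji(ch):
    """Rough emoji check based on a few Unicode ranges (an approximation)."""
    code = ord(ch)
    return 0x1F300 <= code <= 0x1FAFF or 0x2600 <= code <= 0x27BF


def handmade_features(tweet):
    """Return a dict of simple handcrafted features for one tweet."""
    return {
        "n_exclamations": tweet.count("!"),                   # exclamation points
        "n_emojis": sum(1 for ch in tweet if is_emoji(ch)),   # approximate count
        "n_chars": len(tweet),                                # length in characters
        "n_words": len(tweet.split()),                        # whitespace-separated
    }


# Hypothetical example tweet; \U0001F60D is a smiling face with heart eyes.
example = handmade_features("Love my new iPhone!!! \U0001F60D")
print(example)
```

Dicts like these can be turned into numeric columns with scikit-learn's DictVectorizer and placed alongside the word counts.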

