Lukas Biewald, CEO of
CrowdFlower, gave a three-hour workshop on arguably the most vital library in
machine learning: scikit-learn. Lukas gave the audience in Microsoft’s Hack
Reactor Space open source code to follow along from:
He highlighted the value
of participating in Kaggle competitions. He defined regression as using one
variable to predict another. He then gave the students in the audience the
main project for the workshop: judging the emotion expressed about brands and
products. In this project, we looked at Tweets about several Apple brands and products.
99% of the time was spent
on questions like “should we drop I?” or “should we drop my?” We also asked “should
we remove really rare words?”, “should we remove really common words?”, “should
we remove stop words?”, and “what is a word?”
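
To make those questions concrete, here is a minimal sketch in plain Python of how each one turns into a setting. The stop-word list, the two thresholds, the tokenizing rule, and the sample Tweets are all assumptions made up for illustration; they are not the workshop's code or data.

```python
import re
from collections import Counter

# Hypothetical choices for illustration only.
STOP_WORDS = {"i", "my", "the", "a", "is"}   # "should we drop I / my?"
MIN_DOCS = 2          # drop words seen in fewer than 2 Tweets (really rare)
MAX_DOC_SHARE = 0.6   # drop words seen in more than 60% of Tweets (really common)


def tokenize(text):
    """One possible answer to 'what is a word?': lowercase runs of
    letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(tweets):
    """Return the sorted words that survive the stop-word, rare, and common filters."""
    doc_freq = Counter()
    for tweet in tweets:
        # Count each word once per Tweet (document frequency).
        doc_freq.update(set(tokenize(tweet)))
    total = len(tweets)
    return sorted(
        word
        for word, df in doc_freq.items()
        if word not in STOP_WORDS
        and df >= MIN_DOCS
        and df / total <= MAX_DOC_SHARE
    )


tweets = [
    "I love my iPhone",
    "I love my iPad",
    "My iPhone battery is slow",
    "The iPhone camera is great",
    "iPhone battery, my least favorite thing",
]
vocabulary = build_vocabulary(tweets)
print(vocabulary)
```

Changing any one of the three settings changes which words the model ever gets to see, which is why these questions ate up so much of the session.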
In the midst of our
learning, Mr. Biewald suggested that anyone who does not want to earn a degree in
machine learning should memorize the map at this website:
When testing how well an
algorithm was working, Lukas recommended we use the “Test_algorithm_1.py”
program. He also stressed that we should not test on the data we train
on. For example, we should generally train on 80% of the data and test on the
remaining 20%.
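
As a rough sketch of the 80/20 idea (not the contents of “Test_algorithm_1.py”, which are not reproduced here), scikit-learn's `train_test_split` can hold out a fifth of the rows. The Tweets, the labels, and the choice of a Naive Bayes classifier are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up placeholder data, not the workshop's dataset.
tweets = [
    "I love my new iPhone",
    "The iPad screen is gorgeous",
    "Great battery life on this iPhone",
    "Loving the new iPad apps",
    "Best phone I have owned",
    "I hate the iPhone battery",
    "My iPad keeps crashing",
    "Terrible update, the phone is slow",
    "The new iPhone is a disappointment",
    "Worst tablet I have owned",
]
labels = ["positive"] * 5 + ["negative"] * 5

# Hold out 20% of the rows; the model never sees them while fitting.
train_text, test_text, train_y, test_y = train_test_split(
    tweets, labels, test_size=0.2, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)  # learn the vocabulary from training data only
X_test = vectorizer.transform(test_text)        # reuse that vocabulary for the held-out data

model = MultinomialNB().fit(X_train, train_y)
accuracy = model.score(X_test, test_y)
print(f"held-out accuracy: {accuracy:.2f}")
```

With only ten rows the accuracy number means very little; the point is that the score comes from Tweets the model was not trained on.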
In the “feature_selection.py”
program, he used the chi2 (chi-squared)
statistical test to rank the columns in our datasets and find the best ones.
However, he claimed that chi2 feature selection
was the least important of all the topics we discussed today.
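
For readers who want to see what such a ranking looks like, here is a small sketch using scikit-learn's `chi2` function on word counts. It is not the workshop's “feature_selection.py”; the Tweets and labels are made up, and each column here is simply one word.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Made-up placeholder data, not the workshop's dataset.
tweets = [
    "love the iphone",
    "love the ipad",
    "love the screen",
    "hate the iphone",
    "hate the ipad",
    "hate the screen",
]
labels = ["positive"] * 3 + ["negative"] * 3

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)  # one column per word, values are counts

# chi2 returns one score and one p-value per column; higher scores mean the
# column's counts depend more strongly on the label.
scores, p_values = chi2(X, labels)

# vocabulary_ maps each word to its column index, so sorting by index
# lines the words up with the scores.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)

for term, score in ranked:
    print(f"{term:8s} {score:.2f}")
```

Words that show up only under one label (“love”, “hate”) rise to the top, while words spread evenly across both labels (“the”, “ipad”) score zero.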
Lukas closed with a call
for us to make our own features, such as a count of exclamation points or
emojis, the length of a Tweet, or the language of a Tweet.
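
A few of those hand-made features can be sketched in plain Python as below. The function name, the feature names, and the emoji check (a rough Unicode code-point range) are my own assumptions; language detection is left out because it needs an extra library or model.

```python
def handcrafted_features(tweet):
    """Return a few simple hand-made features for one Tweet."""
    return {
        "exclamation_count": tweet.count("!"),
        "length": len(tweet),
        "word_count": len(tweet.split()),
        # Rough approximation: count characters in the main emoji blocks
        # (U+1F300 to U+1FAFF). This misses some emoji and is only a sketch.
        "emoji_count": sum(1 for ch in tweet if 0x1F300 <= ord(ch) <= 0x1FAFF),
    }


# "\U0001F60D" is the smiling-face-with-heart-eyes emoji.
features = handcrafted_features("Love my new iPhone!!! \U0001F60D")
print(features)
```

Each dictionary could then become extra columns next to the word counts before training.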