Monday, November 2, 2015

Summary of Intro to SciKit Learn at Open Data Science Conference

Lukas Biewald, CEO of CrowdFlower, gave a three-hour workshop on scikit-learn, arguably the most vital Python module for machine learning. In Microsoft’s Hack Reactor Space, Lukas gave the audience open-source code to follow along with:


He highlighted the value of participating in Kaggle competitions. He defined regression as using one variable to predict another. He then gave the students in the audience the main project for the workshop – judging emotion about brands and products. In this project, we looked at Tweets about several Apple brands and products.

99% of the time was spent on questions like “should we drop ‘I’?” or “should we drop ‘my’?” We also asked: “should we remove really rare words?”, “should we remove really common words?”, “should we remove stop words?”, and “what is a word?”

In the midst of our learning, Mr. Biewald suggested that if one did not want to earn a degree in machine learning, one should memorize the algorithm-selection map at this website:


When testing how well an algorithm was working, Lukas recommended we use the “Test_algorithm_1.py” program. Moreover, we should not test on the data we train on. For example, we should generally train on 80% of the data and test on the remaining 20%.
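The 80/20 split can be sketched with scikit-learn’s `train_test_split` (the data below is synthetic, just to show the shape of the call):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # 100 fake examples
y = np.arange(100) % 2             # fake binary labels

# Hold out 20% for testing; never evaluate on the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 80 20
```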

In the “feature_selection.py” program, he used the chi-squared (chi2) statistic to rank which column was the most informative in our datasets. However, he claimed that chi2 feature selection was the least important of all of the topics we discussed today.
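A minimal sketch of chi2 column ranking with `SelectKBest` (the feature matrix here is invented; chi2 requires non-negative features such as word counts):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Rows = documents, columns = word counts for three words.
X = np.array([
    [3, 0, 1],
    [4, 0, 0],
    [0, 5, 1],
    [0, 4, 0],
])
y = np.array([1, 1, 0, 0])  # e.g. positive vs. negative sentiment

selector = SelectKBest(chi2, k=1).fit(X, y)
scores = selector.scores_          # higher = stronger association with y
best_column = int(np.argmax(scores))
print(best_column)
```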

Lukas closed with a call for us to build our own features, such as a count of exclamation points or emojis, the length of a Tweet, or the language of a Tweet.
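Hand-built features of this kind are just plain Python; a small sketch (the example tweet is invented):

```python
def tweet_features(tweet):
    """Return a dict of simple hand-crafted features for one tweet."""
    return {
        "num_exclamations": tweet.count("!"),  # exclamation-point count
        "length": len(tweet),                  # tweet length in characters
        "num_words": len(tweet.split()),       # crude word count
    }

features = tweet_features("Love the new iPad!!!")
print(features)
```

Dictionaries like this can be turned into a feature matrix with scikit-learn’s `DictVectorizer` and combined with the word counts above.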


Predictive Analytics Summary from Open Data Science Conference

Don Dini (AT&T, USA) opened his talk on predictive analytics with a comment on how machine learning is the second-best solution to every problem. He made a self-contained pun, noting that the word TERRIFYING is equivalent to the phoneme sequence:

[‘T’, ‘EH1’, ‘R’, ‘AH0’, ‘F’, ‘AY2’, ‘IH0’, ‘NG’]

He continued with a generalization: if the answer to a question was complicated, data science was generally the answer.

If you believe your servers are being attacked, perform hypothesis testing using a null-distribution kernel density estimate (KDE) formed from the last month of data. Then propose a model capable of addressing the inference problem. Finally, apply a method such as k-nearest neighbors, linear regression, or support vector machines.
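The KDE hypothesis-testing step can be sketched with SciPy’s `gaussian_kde`. All the numbers below are synthetic assumptions (a month of invented daily request counts), not the speaker’s data:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# A month of "normal" daily request counts for the null distribution.
last_month = rng.normal(loc=1000, scale=50, size=30)

kde = gaussian_kde(last_month)  # null distribution fitted from history
today = 1400                    # suspiciously high traffic today

# One-sided p-value: probability mass at or above today's count
# under the null distribution.
p_value = kde.integrate_box_1d(today, np.inf)
print(p_value < 0.01)  # a tiny p-value is consistent with an attack
```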

As an example, Mr. Dini gave us a simple “predict the next number” problem. To solve this, suppose
xi ~ N(µ, σ²)
a.     Then µ̂ = (x1 + x2 + … + xn) / n
b.     Then the coder needs to evaluate how well the model did: how sure is the model about what it recommends?
c.     Then determine computational confidence intervals
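Step (a) is just the sample mean: under the assumption xi ~ N(µ, σ²), the maximum-likelihood estimate of µ is the average of the observations. A minimal check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic draws from N(mu=5, sigma=2).
samples = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = samples.sum() / len(samples)  # (x1 + x2 + ... + xn) / n
print(mu_hat)
```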

To compute such confidence intervals, “bootstrapping” was a helpful method. Consider having 3 similar data sets:
a.     x1, x2, …, xn → r(x)
b.     x1, x2, …, xn → r(x)
c.     x1, x2, …, xn → r(x)
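The bootstrapping idea above can be sketched in a few lines: resample the data with replacement many times, recompute the statistic r(x) (here, the mean) on each resample, and read a confidence interval off the spread of the results. The data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=200)  # synthetic observations

# Recompute the statistic on 2000 resamples drawn with replacement.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# 95% computational confidence interval for the mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo < data.mean() < hi)
```

The width of the interval shrinks as more data is collected, which is exactly the “amount of data causes uncertainty” point below.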
Ideally, if a coder had a perfect model and perfect data, the outcome should be the same for all three. Moreover, the amount of data present was a direct (and non-removable) cause of uncertainty in the prediction, so in any prediction made, it was vitally important to communicate that uncertainty. As an example, Don showed us how he answered the question “What are the things that influence communication on social media?” If thirty seconds had elapsed since a user’s Tweet, how much longer until that same user would Tweet again? Don’s point was that the uncertainty here made a point prediction less meaningful.

Continuing on to the next point, Don highlighted the next relevant problem: “How do we know if two variables have anything to do with each other?”

Consider: Y = x²
x: x1, x2, ..., xn
y: y1, y2, …, yn

In the above instance, covariance weakened as odd-degree polynomials increased in degree, and for even-degree polynomials like Y = x² the covariance (for data symmetric about zero) vanished entirely, despite a perfect functional relationship between the two variables.
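A small check of this point: for x symmetric around zero, the covariance between x and y = x² is essentially zero even though y is completely determined by x, so covariance misses the relationship.

```python
import numpy as np

x = np.linspace(-1, 1, 101)  # symmetric grid around zero
y = x ** 2                   # perfectly determined by x

cov = np.cov(x, y)[0, 1]     # off-diagonal entry = cov(x, y)
print(abs(cov) < 1e-9)       # covariance is (numerically) zero
```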

The next topic Don shared with the Microsoft Hack Reactor Office was Entropy (a principle that explained how decision trees worked).
X took on a series of values
                  x: x1, x2, …, xk
                  with probabilities Pr(x1), Pr(x2), …, Pr(xk)
                  H(X) = −[Pr(x1) * log(Pr(x1)) + Pr(x2) * log(Pr(x2)) + … + Pr(xk) * log(Pr(xk))]
H(X) took on its minimum value of 0 when the distribution was most skewed:
[0, 0, …, 0, 1, 0, …, 0, 0]
and its maximum of log(k) when the distribution was least skewed (uniform):
[1/k, 1/k, …, 1/k]
Don explained that we could sort variables by how much they could cause X’s entropy to decrease. The goal was to find the variable that was going to make him maximally certain. This feat required defining conditional entropy as:
H(X|Y) = Pr(Y = y1) * H(X|Y = y1) + Pr(Y = y2) * H(X|Y = y2) + …

If Don had a collection of variables A, B, C, …, he could sort them by how much they caused entropy to drop (the information gain):
a.     H(X) – H(X|A)
b.     H(X) – H(X|B)
c.     H(X) – H(X|C)
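The entropy bookkeeping above fits in a few lines of Python: H(X), the conditional entropy H(X|Y), and the information gain H(X) − H(X|Y) used to rank variables in decision trees. The distributions below are invented for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p * log2(p), skipping zero terms."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A fully skewed distribution has minimum entropy 0 ...
print(entropy([0, 0, 1, 0]))   # 0.0
# ... and a uniform distribution has maximum entropy log2(k).
print(entropy([0.25] * 4))     # 2.0

# Information gain: H(X) - H(X|Y), where H(X|Y) is a probability-weighted
# average of the entropies within each value of Y.
h_x = entropy([0.5, 0.5])
h_x_given_y = 0.5 * entropy([1.0]) + 0.5 * entropy([1.0])  # Y determines X
print(h_x - h_x_given_y)       # 1.0
```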

This brought Don to his next principle, “mention distance”: the number of seconds that had elapsed since someone had Tweeted. With mention distance, Don was trying to answer the following question: “If I knew how long it had been since someone in this friend network had mentioned them, did that influence whether that person would respond?”


Don closed with a group activity to have us practice the bootstrapping method.