Stranger Stats - Entering the Machine Learning World

1 Everything Starts with Why!

I really like Simon Sinek’s great book “Start With WHY”. It captures something I had always thought about but couldn’t put into words. When I read it, everything just clicked. (If you are interested, you can check out the author’s TED talk.)

The same question applies to you: you have already landed here, and you are beginning a journey, with or without your full passion or interest.

In this blog, I will try to give all the information using this framework: Why – How – What.

Perhaps this is not exactly what Sinek proposed, but this is how I’m going to structure my articles:

Why: The main reason that makes us take action and gives us enough fuel to keep going. If you read the why and it doesn’t interest you, don’t even go further.

How: The concept in a simple and plain way, without getting lost in technical complexity. As a quote often attributed to Einstein goes, “If you can’t explain it simply, you don’t understand it well enough.” I will try to keep the “how” as simple as possible.

What: The practical stuff. The tool itself, to get the job done! The hammer to hit, the glue to fix, the wrench to tighten.

So, let’s start, commençons, başlayalım!

2 1 - Why? (the need)

Short story: To be able to make sophisticated predictions or classifications on some data.

Longer story: Let’s say you are a movie fan and you love suggesting movies to people. As you know from her previous choices, your wife doesn’t like horror movies at all! She is also not a fan of gun action or violent stuff. She has told you before that she loved the animated movie “Coco”, and that she likes watching “non-intense” stuff just to relax after a stressful work day and have some joy and laughter. She asks for a suggestion. What would you tell her? A romantic comedy or a sci-fi action movie?

We have our previous data about the people around us, and we make our “guess” based on this “historical data”. We have our “clusters”, some categories or tags, and our brain is constantly categorizing people based on our observations. With enough observations, it is not that hard for us to guess what they would like to do next.

Now let’s say you are on the content suggestion team at Netflix. You have data from millions of users, every piece of data gives hints about their character, and you need to make educated guesses about what they might want to watch next. Since you cannot assign people to analyze each and every person’s interests one by one, you need machines to learn what needs to be done and do it on your behalf.

In summary, you have some data, and by using this data you want to make predictions¹ or classifications². With these, you can suggest new movies to watch or new products to buy, predict election results, or forecast the weather or the number of passengers on a flight.

After this point, do I really need to explain why it’s important to guess what is going to happen next? Don’t think so… 😉

3 2 - How? (the way)

To make these predictions or classifications, we apply mathematical and statistical approaches to our previous data. We simply create the “formula” or “recipe” for the desired output. If our desired output is how many people are going to visit a zoo next Saturday, we can hypothetically come up with a formula like this:

\(\text{Visitors} = A + B \times \text{Weather Situation} + C \times \text{Number of Visitors on Last Year's Same Day}\)

Wouldn’t it be cool if, just by knowing a few things such as the weather, whether it’s school vacation time, and the number of visitors last year, you could predict next Saturday’s visitor numbers, more or less?
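To make the formula above a bit more concrete, here is a minimal sketch of it as a Python function. The numbers for A, B and C and the 0-to-3 weather score are made up purely for illustration; they are not coming from any real zoo data.

```python
def predict_visitors(weather_score, visitors_last_year, a=200.0, b=150.0, c=0.5):
    """Visitors = A + B * weather situation + C * last year's visitors.

    a, b and c are hypothetical values chosen only for this example.
    weather_score is an assumed scale: 0 (terrible) to 3 (sunny).
    """
    return a + b * weather_score + c * visitors_last_year


# A sunny Saturday (score 3) with 1000 visitors on the same day last year
print(predict_visitors(weather_score=3, visitors_last_year=1000))  # 1150.0
```

The whole job of machine learning is to find good values for `a`, `b` and `c` instead of guessing them like I did here.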

Imagine that you had formulas for everything you want to predict. Then you would start seeing the future! (Check out this movie if you haven’t seen it before.)

Machine learning is all about creating this formula. Whatever fancy name you hear along the way, it’s all about arriving at this formula. Of course, some questions appear when you think about it more deeply:

1- What is the best way to estimate the numbers A, B and C in this formula?

2- Are there any other factors we need to take into account to make the prediction more accurate? For example, “Is it a school vacation period or not?”, “Is a new and very popular animated movie coming to theaters this weekend or not?” etc.

3- How accurate can we make this prediction? Is it plus or minus 2000 people, or plus or minus 300?
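For the first question, one common answer (not the only one) is ordinary least squares: pick the A, B and C that make the squared errors on the past data as small as possible. Below is a rough sketch in plain Python. The function name `fit_least_squares` and the data are my own inventions; I generated the visitor numbers from A = 100, B = 200, C = 0.5 so that we can check whether the method finds them again.

```python
def fit_least_squares(rows, targets):
    """Fit targets ~ A + B*x1 + C*x2 by ordinary least squares.

    Solves the normal equations (X^T X) w = X^T y with Gauss-Jordan
    elimination. Assumes the data is not degenerate (a singular system
    would cause a division by zero here).
    """
    X = [[1.0, x1, x2] for x1, x2 in rows]  # leading 1.0 is for the intercept A
    n = 3
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(X, targets)) for i in range(n)]
    aug = [xtx[i] + [xty[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the row with the largest entry into place
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Made-up history: (weather score, visitors on the same day last year)
rows = [(0, 800), (1, 1000), (2, 900), (3, 1200), (1, 700), (2, 1100)]
# Made-up visitor counts, generated from A=100, B=200, C=0.5 without noise
visitors = [500, 800, 950, 1300, 650, 1050]

a, b, c = fit_least_squares(rows, visitors)
print(f"A = {a:.2f}, B = {b:.2f}, C = {c:.2f}")
```

Real data is noisy, so in practice the fitted numbers will never match some "true" values this neatly. Statistical tools do this same job for you, with many more checks around it.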

Eventually, it turns into a “meal recipe”. Some ingredients are more important than others. Or sometimes the cooking time matters most.

In machine learning, using several mathematical methods and previous data, we create formulas (which we call “models”). We set aside some of the data to test how well the formula works, and we use other methods to understand which factors are really needed and which are statistically irrelevant.
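Here is a small sketch of the "set some data aside for testing" idea, again in plain Python. Everything in it is an assumption for illustration: the history is made up, the 70/30 split is just a common habit, and I use a simple one-factor line (this year's visitors from last year's visitors) to keep it short.

```python
import random


def fit_line(xs, ys):
    """Least-squares straight line: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    """Average size of the miss, in the same unit as the data (visitors)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up history: (visitors on the same day last year, visitors this year)
history = [(800, 870), (1000, 1040), (900, 990), (1200, 1310), (700, 760),
           (1100, 1150), (950, 1020), (1300, 1380), (850, 930), (1050, 1120)]

random.seed(42)            # fixed seed so the split is repeatable
shuffled = history[:]
random.shuffle(shuffled)
split = int(len(shuffled) * 0.7)
train, test = shuffled[:split], shuffled[split:]  # 7 rows to learn, 3 to check

intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [intercept + slope * x for x, _ in test]
error = mean_absolute_error([y for _, y in test], predictions)
print(f"Typical miss on unseen days: about {error:.0f} visitors")
```

The key point is that the model never sees the test rows while it is being built, so the error on them is an honest hint of the "plus or minus how many people" question.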

Sounds simple, right? It is simple when you go step by step and understand the fundamental logic of each step. It also gets complicated at some point, because people have been improving this process for many years, since even before PCs were invented. Along the way they added new terms, abbreviations, new mathematical models, etc.

In summary: We create a formula (a mathematical model) with a known logical or statistical method. While doing that, we use the previous data, use domain expertise to decide which factors to include in our formula, and test it with a portion of our data set.

4 3 - What (the tool)

Since these mathematical models have been around for many years, many software packages³ and programming languages⁴ have joined the field. It’s quite interesting, and at some point also a little confusing, that so many tools do nearly the same thing, especially when it comes to programming languages. Of course, each side has its arguments when you go deeper into the conversation, but the logic is pretty much the same. Also, up to a point, it doesn’t even matter which tool you use.

Whatever tool you choose, it will expect you to know the theory behind what you did. When you add a “trend line” in Excel and put an \(R^2\) on the graph, it will not explain what this value is or how it was calculated, unless you specifically search for that.
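As a taste of what sits behind that number, here is a minimal sketch of the usual definition of \(R^2\): one minus the ratio of the model's squared errors to the squared errors you would get by always guessing the average. The function name and the tiny data set are made up for illustration, and I am not claiming this is the exact routine any particular tool runs internally.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot (the common textbook definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model's errors
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # "always guess the mean" errors
    return 1 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]
perfect = [2.0, 4.0, 6.0, 8.0]    # a model that hits every point
mean_only = [5.0, 5.0, 5.0, 5.0]  # a model that always says "the average"

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

So a value near 1 means the line explains most of the ups and downs in the data, and a value near 0 means it does no better than guessing the average every time.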

Therefore, if you work with these tools, you first need to understand the statistics behind them. If you don’t, you can follow some templates and even reach some solutions, but since you will not understand what you are doing, you will not be able to troubleshoot when something goes wrong with the numbers.

So, if you want to learn this, I highly recommend learning the basic principles first, with just a couple of videos. I personally like these channels: StatQuest, Brandon Foltz.

In summary: The “what” part is totally up to you. Going with software like Minitab, SPSS, or Alteryx can make your life easier and requires no coding, but it can be harder when it comes to automating things. Programming with Python or R opens up many possibilities but can take longer to learn and to get things running. I believe there is no single right way here.

Whatever your way, if you have a solid “why” and “how”, it will not matter so much. You will be able to change your way anytime without big trouble.

If you have already set out on the way, enjoy the ride! 😉

Footnotes

¹ Prediction: Predicting the weather, toilet paper demand for next month, how many passengers will fly… Did you notice something? They are all “numbers”! (They are not “high, low, true, false, passed, failed”, etc.)
² Classification: Which students will be successful in maths, what kind of people like Android phones, which photos in my album contain me, which series is the best to suggest next to a person who has just finished all the episodes of “Stranger Things”… Is there any number in the answers? Nope!
³ Including Minitab, SPSS, MS Excel, Stata, SAS, etc. (Full List)
⁴ The most popular statistical programming languages today: Python, R, Scala, Julia, JavaScript (Reference)