The Kernel Trick in Neural Networks
Data science is the science of finding patterns in data. Given lots and lots of points on a 2D plane, they could all lie on a line, or form an organized pattern (maybe they all form a circle), or cluster in clumps here and there, or some combination of these (maybe they all sit on two circles).
Machine learning algorithms are specialized to do one or more of these tasks: finding the line(s) of best fit [such as linear regression], finding the line(s) of best separation [such as decision trees and logistic regression], or both [such as SVMs and neural networks].
Even more basically, data science is about finding lines, and linear regression is the easiest to explain: find the coefficients of the line that has maximum overlap with the points (in regression), or of the line that best divides the points (in classification). But there is an obvious assumption baked into the method: that the data is linearly separable. This is rarely true. Almost never, in fact. There will always be some curve that can separate the clouds more cleanly, and hence the need for non-linear models.
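To make the "finding the line" part concrete, here is a minimal sketch of ordinary least squares in NumPy. The data, the true coefficients (slope 2, intercept 1), and the noise level are all made up for illustration:

```python
import numpy as np

# Noisy points generated around y = 2x + 1 (made-up coefficients).
rng = np.random.default_rng(42)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

# "Find the coefficients of that line": solve for the slope and
# intercept that minimise the squared error to the points.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
```

With this little noise, the recovered `slope` and `intercept` land very close to the true 2 and 1.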
The kernel trick is one of the coolest things I've learned in ML. We have built methods that are fundamentally based on drawing lines (linear regression, SVMs, and neural networks can ultimately only draw lines). The genius of the kernel trick is to transform the data so that the clouds simply drift apart and a line can then be drawn between them.
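Here is a sketch of that transformation on the "two circles" data from earlier. The specific feature I add, the squared radius x² + y², is my own choice of lift for this example, not something prescribed by the method: a linear classifier fails on the raw 2D points but succeeds once the extra feature pulls the two rings apart in 3D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concentric "clouds": an inner disc (label 0) and an outer ring (label 1).
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
r = np.concatenate([rng.uniform(0.0, 1.0, n // 2),    # inner disc
                    rng.uniform(2.0, 3.0, n // 2)])   # outer ring
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

def linear_fit_accuracy(X, y):
    # Least-squares linear classifier: fit w to minimise ||Xb w - y||^2,
    # then threshold the prediction at 0.5.
    Xb = np.column_stack([X, np.ones(len(X))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return np.mean((Xb @ w > 0.5) == y)

acc_raw = linear_fit_accuracy(X, y)  # no straight line can split rings

# The "stretch": add the feature x^2 + y^2 (squared radius).
# In the lifted space the two classes sit at different heights,
# so a flat cut separates them easily.
X_lift = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_lift = linear_fit_accuracy(X_lift, y)
```

On the raw coordinates the accuracy hovers near chance; on the lifted data it is nearly perfect, even though the classifier itself is still strictly linear.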
Think of it like this. A river separates the land into two. We can think of the two lands as separated by a crooked line; or, if the land itself were stretched like a blanket, then with the right kind and amount of stretching the river would look straight and the two lands on either side merely warped.
That's the simple trick. SVMs and (for this particular topic) neural networks are basically about that. While an SVM carries an explicit bias about the type of kernel that must be used, neural networks take it a step further and learn their own kernel.
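One more sketch, because the word "trick" refers to something quite specific: a kernel function computes inner products in the stretched space without ever building that space. Below, the degree-2 polynomial kernel k(a, b) = (a·b)² is shown to equal the dot product of an explicit quadratic feature map; both the kernel and the map are standard textbook examples, chosen here for illustration:

```python
import numpy as np

def phi(v):
    # Explicit quadratic feature map for a 2D input (x, y):
    # (x^2, y^2, sqrt(2)*x*y). This is the "stretched" space.
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def k(a, b):
    # Degree-2 polynomial kernel: the same inner product, computed
    # entirely in the original 2D space -- phi is never materialised.
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lifted = np.dot(phi(a), phi(b))   # inner product in the lifted space
tricked = k(a, b)                 # same number, no lifting needed
```

The two numbers agree exactly, which is why an SVM can behave as if it had stretched the data into a higher-dimensional space while only ever evaluating k on the original points.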