I'm learning the ropes of neural networks. Recently, I read about the curse of dimensionality and how it can lead to overfitting.
If I understand correctly, the number of features (dimensions) d of a given dataset with n data points is very important when considering the size t of the training set.
QUESTIONS
(...not sure if all my questions are really connected to the curse of dimensionality)
One of the rules of thumb is to have at least 10x more data points than dimensions. Using some intelligent prior information (e.g. a good kernel in SVMs), you might even learn a good machine with fewer data points than dimensions.
The lecture about VC dimension by Yaser Abu-Mostafa motivates this 10x rule with some nice charts. If you are not familiar with the VC dimension concept, it is about the capacity of a learning model. The higher the VC dimension, the more complex the hypotheses the model can express. For example, the classical perceptron on d-dimensional inputs has VC dimension d+1. Some hypothesis classes have infinite VC dimension; such problems are impossible to learn in this framework.
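The shattering idea behind VC dimension is easy to see directly: a linear model with d features plus a bias has d+1 free parameters and can fit arbitrary labels on up to d+1 points in general position, but not on many more. A minimal numpy sketch (the sizes and names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20  # number of features

def fit_residual(n):
    # n random points in general position, with arbitrary +/-1 labels
    X = np.column_stack([np.ones(n), rng.standard_normal((n, d))])
    y = rng.choice([-1.0, 1.0], size=n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(X @ coef - y)

# With n <= d+1 points, the d+1 parameters fit ANY labeling exactly
print(fit_residual(d + 1))        # essentially zero: the sample is shattered
# With n = 10*(d+1) points (the 10x rule), random labels can no longer be fit
print(fit_residual(10 * (d + 1)))  # clearly nonzero
```

The 10x rule says to train with roughly an order of magnitude more points than the model can shatter, so that a good fit reflects structure rather than capacity.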
A neural net is a linear model in derived variables. Take the regression case, because it is a little bit simpler:

ŷ = β′ σ(Γ₂ σ(Γ₁X + γ₁) + γ₂)

where X is your data (i.e. your features), the Γ are matrices of weights, the γ are "biases", σ is an elementwise nonlinearity, and β are your weights connecting the topmost hidden layer to the output. You see that it is nothing more than a linear model, but in nonlinear functions of X.
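A tiny numpy sketch of that view (two layers collapsed to one hidden layer for brevity; all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 3, 4                      # samples, input features, hidden units

X = rng.standard_normal((n, d))
Gamma = rng.standard_normal((d, h))    # hidden-layer weight matrix
gamma = rng.standard_normal(h)         # hidden-layer biases
beta = rng.standard_normal(h)          # weights from top hidden layer to output

# The derived variables are nonlinear functions of X ...
Z = np.tanh(X @ Gamma + gamma)
# ... but the prediction is LINEAR in those derived variables:
y_hat = Z @ beta
print(y_hat.shape)  # (5,)
```

Holding Γ and γ fixed, the map from Z to ŷ is exactly an ordinary linear regression, which is why linear-model intuitions about parameter count and overfitting carry over.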
Just like in a linear model, you can overfit when you have too many parameters. A typical strategy for avoiding overfitting is regularization. Rather than solving

min_β ‖y − Xβ‖²

you solve

min_β ‖y − Xβ‖² + λ‖β‖²

in ridge regression, for example. Selecting λ by cross-validation, you're effectively letting the data tell you how much to use your many dimensions.
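A short numpy sketch of that procedure, using the closed-form ridge solution and a simple holdout split as a stand-in for full cross-validation (the data, λ grid, and split are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 30                         # many dimensions relative to n
X = rng.standard_normal((n, d))
true_beta = np.zeros(d)
true_beta[:3] = [2.0, -1.0, 0.5]      # only 3 of the 30 features matter
y = X @ true_beta + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# choose lambda on held-out data
X_tr, y_tr, X_va, y_va = X[:30], y[:30], X[30:], y[30:]
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
errs = [np.mean((X_va @ ridge(X_tr, y_tr, lam) - y_va) ** 2)
        for lam in lambdas]
print("chosen lambda:", lambdas[int(np.argmin(errs))])
```

The held-out error, not the analyst, decides how strongly to shrink the 30 coefficients, which is the sense in which the data tell you how much to use your dimensions.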
This generalizes directly to neural nets, except that there is no closed-form solution to the minimization problem, as there is in ridge regression. You'll overfit if you do

min_θ ‖y − f(X; θ)‖²

so instead you minimize

min_θ ‖y − f(X; θ)‖² + λ‖θ‖²

where θ is a concatenated vector of all of your weights.
Note that the quadratic penalty here isn't the only form of regularization. You could also do L1 regularization, or dropout regularization.
But the idea is the same: build a model rich enough to overfit the data, and then find a regularization parameter (by some variant of cross-validation) that constrains the variability so that you don't overfit.
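Since there is no closed form, the penalized objective is minimized by gradient descent; the λ‖θ‖² term simply adds 2λθ to each gradient (this is "weight decay"). A self-contained numpy sketch for a one-hidden-layer net (the target function, sizes, learning rate, and λ are all illustrative choices, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, h = 50, 10, 20
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                      # a simple single-signal target

Gamma = 0.1 * rng.standard_normal((d, h))
gamma = np.zeros(h)
beta = 0.1 * rng.standard_normal(h)

lam, lr = 1e-2, 1e-2
for _ in range(2000):
    Z = np.tanh(X @ Gamma + gamma)       # derived variables
    r = Z @ beta - y                     # residuals
    # gradients of  (1/n)*||y - f(X;theta)||^2 + lam*||theta||^2
    g_beta = 2 * (Z.T @ r) / n + 2 * lam * beta
    dZ = np.outer(r, beta) * (1 - Z**2)  # backprop through tanh
    g_Gamma = 2 * (X.T @ dZ) / n + 2 * lam * Gamma
    g_gamma = 2 * dZ.sum(axis=0) / n + 2 * lam * gamma
    beta -= lr * g_beta
    Gamma -= lr * g_Gamma
    gamma -= lr * g_gamma

mse = np.mean((np.tanh(X @ Gamma + gamma) @ beta - y) ** 2)
print(mse)  # training MSE, well below the variance of y
```

In practice you would wrap this training loop in the cross-validation loop above, refitting for each candidate λ and keeping the value with the best held-out error.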