Just read a post from http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-two/
What is Feature Engineering?
We’ll start out this time with a topic that is so important that it deserves an instructive example. In fact, Domingos calls it “easily the most important factor” in determining the success of a machine learning project, and I agree with him.
Suppose you have a dataset in which you have pairs of cities, coupled with a prediction of whether most people would consider the two cities to be comfortably drivable within a single day. You’ve got a nice database with the longitudes and latitudes for all of your cities, and so these are your input fields. Your dataset might look something like this (note that the values don’t correspond to actual city locations – they are random):
| CITY 1 LAT. | CITY 1 LNG. | CITY 2 LAT. | CITY 2 LNG. | DRIVABLE? |
|---|---|---|---|---|
| … | … | … | … | … |
And you expect to construct a model that can predict for any two cities whether the distance is drivable or not.
Probably not going to happen.
The problem here is that no single input field, or even any pair of fields, is closely correlated with the objective. What is correlated with the objective is a combination of all four fields (the distance from one pair of geo-coordinates to the other), and the combination follows a fairly complex formula. Machine learning algorithms are limited in the ways they can combine input fields; if they weren't, the space of combinations they would have to search would be intractably large.
But all is not lost! Even if the machine doesn't have any knowledge of how longitudes and latitudes work, you do. So why not apply that knowledge yourself? Compute the distance for each instance and you get a dataset like this (again, with random values):

| DISTANCE | DRIVABLE? |
|---|---|
| … | … |
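A minimal sketch of that transformation: collapsing the four raw coordinate fields into a single great-circle distance via the haversine formula. The column names here are illustrative assumptions, not from the post.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def engineer(row):
    """Replace four weakly informative inputs with one engineered feature.

    `row` is assumed to be a dict with hypothetical keys city1_lat,
    city1_lng, city2_lat, city2_lng, and drivable.
    """
    d = haversine_km(row["city1_lat"], row["city1_lng"],
                     row["city2_lat"], row["city2_lng"])
    return {"distance_km": d, "drivable": row["drivable"]}
```

The learning algorithm never had a chance of discovering this formula on its own; applied by hand, it turns four near-useless columns into one highly predictive one.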
Ah. Much more manageable. This is what we mean by feature engineering. It’s when you use your knowledge about the data to create fields that make machine learning algorithms work better.
Domingos mentions that feature engineering is where “most of the effort in a machine learning project goes”. I couldn’t agree more. In my career, I would say an average of 70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.
How does one engineer a good feature? One good rule of thumb is to design features where the likelihood of a certain class goes up monotonically with the value of the field. In our example above, "drivable = no" becomes more likely as distance increases, but the same is not true of any raw longitude or latitude. You probably won't be able to engineer a feature where this holds strictly, but a feature is still good if it comes reasonably close to that ideal.
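A rough, synthetic check of that rule of thumb (not from the post): bin an engineered distance feature and verify that the rate of "drivable = no" never decreases as distance grows. The 800 km cutoff is an arbitrary assumption used only to generate toy labels.

```python
import random

random.seed(0)
# Toy data: a distance in km, and a label that is "not drivable" beyond 800 km.
data = [(d, d >= 800) for d in (random.uniform(0, 2000) for _ in range(1000))]

edges = [0, 500, 1000, 1500, 2000]
rates = []
for lo, hi in zip(edges, edges[1:]):
    flags = [not_drivable for d, not_drivable in data if lo <= d < hi]
    rates.append(sum(flags) / len(flags))

# For a feature close to the monotone ideal, the per-bin class rate
# should be non-decreasing from left to right.
assert all(a <= b for a, b in zip(rates, rates[1:]))
```

Running the same check on raw longitude instead of distance would show the rates bouncing around with no trend, which is exactly why the raw coordinates make poor features.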
Typically, there isn’t a single data transformation that makes learning immediately easy (as there was in the above example), but just as typically there are one or more things you can do to the data to make machine learning easier. There’s no formula for this, and much of it happens by intuition and trial and error. BigML attempts to do some of the easy ones for you (automated date parsing is an example), but far more interesting transformations can happen with detailed knowledge of your specific data. Great things happen in machine learning when human and machine work together, combining a person’s knowledge of how to create relevant features from the data with the machine’s talent for optimization.
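To make the date-parsing example concrete, here is one sketch of what such an "easy" transformation does: expanding a raw timestamp string into fields a model can actually split on. The field names and input format are illustrative assumptions, not BigML's actual implementation.

```python
from datetime import datetime

def date_features(ts):
    """Turn an ISO date string into simple categorical/ordinal features."""
    dt = datetime.strptime(ts, "%Y-%m-%d")
    return {
        "year": dt.year,        # captures long-term trends
        "month": dt.month,      # captures seasonality
        "weekday": dt.weekday(),  # Monday = 0 ... Sunday = 6
    }

date_features("2013-02-21")  # → {'year': 2013, 'month': 2, 'weekday': 3}
```

A raw string like "2013-02-21" is nearly opaque to a tree or linear model, but a weekday or month field lets it pick up weekend effects or seasonal patterns directly.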