Class: ENG Created: Apr 19, 2020 10:29 PM Materials: https://www.coursera.org/learn/machine-learning-projects/home/welcome Reviewed: No Source: Coursera

Notes from the Coursera course: Structuring Machine Learning Projects

Orthogonalization

Definition: orthogonalization is the idea of designing the tuning process so that each tuning knob does only one thing.

In an ML project, we expect our algorithm to do well on 4 main goals:

  1. low cost on the training set;
  2. low cost on the dev set;
  3. low cost on the test set;
  4. performs well in the real world

These 4 goals form a hierarchy and are addressed one after another. In practice, when tuning an ML algorithm, we have techniques that work orthogonally, each targeting one of these four goals independently: e.g., a bigger/deeper network for goal #1, regularization for goal #2, etc.

In Andrew’s view, early stopping is a less orthogonal knob for tuning because it affects both goal #1 and goal #2 at the same time.

Single real number evaluation metric

Trade-off between precision and recall: Precision is the percentage of examples classified as positive that are truly positive. Recall is the percentage of truly positive examples that are successfully classified as positive.

To avoid struggling with the trade-off, we should focus on a single metric where possible. For example, use the F1 score (the harmonic mean of P and R) instead of looking at both P and R simultaneously.
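As a quick illustration of why F1 works as a single number, here is a minimal sketch with hypothetical confusion-matrix counts (the function name and numbers are made up for this note):

```python
# Minimal sketch: precision, recall, and their harmonic mean (F1).
# The counts below are hypothetical, purely to illustrate the formulas.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)        # % of predicted positives that are truly positive
    recall = tp / (tp + fn)           # % of true positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Classifier A: precise but misses many positives; Classifier B: the opposite.
print(precision_recall_f1(tp=90, fp=10, fn=60))
print(precision_recall_f1(tp=120, fp=60, fn=30))
```

Because F1 is a harmonic mean, a classifier that is poor on either P or R cannot score well on it.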

Satisficing and optimizing metric

This means we can combine several metrics into a single evaluation criterion. For example, we can set up the optimization problem as:

$$\max\ \text{accuracy} \quad \text{s.t.} \quad \text{running time} \le 100\,\text{ms}$$

So if we have N metrics, we can pick the one we care about most as the optimizing metric and treat the remaining N-1 as satisficing metrics (each just has to meet a threshold).
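A minimal sketch of this selection rule, with hypothetical candidate models and numbers (the dictionary fields are assumptions for illustration):

```python
# Minimal sketch: pick a model by one optimizing metric (accuracy)
# subject to one satisficing constraint (running time <= 100 ms).
candidates = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},  # best accuracy, but too slow
]

feasible = [c for c in candidates if c["running_time_ms"] <= 100]  # satisficing filter
best = max(feasible, key=lambda c: c["accuracy"])                  # optimize the rest
print(best["name"])  # -> "B"
```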

Train/dev/test distributions

We should find a way to make sure the dev and test sets come from the same distribution. Having a well-defined dev set and a single real-number metric is like setting a target for the whole team: the team will focus on hitting that target. If the dev and test distributions are as close as possible, progress made on the dev set transfers to the test set, which saves a lot of time and makes the effort spent on the dev set more worthwhile.

So, the guideline is: choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

We used to split the data 60:20:20 for train:dev:test. However, in the era of big data, a much larger fraction of the data can go to the training set, as long as the dev/test sets still contain enough samples in absolute terms.

The test set needs to be big enough to give high confidence in (an unbiased estimate of) the overall performance of the final system.
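As a sketch of what such a split might look like (the 98/1/1 ratio and dataset size are assumptions; the point is that dev/test keep enough examples in absolute terms):

```python
# Minimal sketch: a big-data style 98/1/1 split instead of the classic 60/20/20.
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
indices = rng.permutation(n)

n_dev = n_test = n // 100                 # 1% each is still 10,000 examples
test_idx = indices[:n_test]
dev_idx = indices[n_test:n_test + n_dev]
train_idx = indices[n_test + n_dev:]      # remaining 98% goes to training

print(len(train_idx), len(dev_idx), len(test_idx))
```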

When to change dev/test sets and metrics

We can use different weights in the evaluation metric to punish especially unwanted misclassifications more heavily. If the evaluation metric no longer gives the correct rank-order preference for which algorithm is actually better, it is time to define a new evaluation metric, or even new dev/test sets. Besides setting up the target correctly, there is a separate step, which is finding a way to aim at and hit that target well. Changing the metric only addresses the former goal; for the latter, we can apply the same technique (per-example weighting) to the training loss function.
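A minimal sketch of such a weighted error metric, with hypothetical labels and weights (the 10x weight on one especially unwanted false positive is just an example):

```python
# Minimal sketch: a weighted classification error in which especially
# unwanted mistakes count more than ordinary ones.
import numpy as np

def weighted_error(y_true, y_pred, weights):
    """Weighted fraction of misclassified examples."""
    mistakes = (y_true != y_pred).astype(float)
    return np.sum(weights * mistakes) / np.sum(weights)

y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1])
weights = np.array([1, 10, 1, 1, 1])   # 10x penalty on the unwanted false positive

print(weighted_error(y_true, y_pred, weights))
```

The same per-example weights can be plugged into the training loss so that optimization also aims at the corrected target.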

So even if we cannot define the perfect evaluation metric at the very beginning, we should set something reasonable up quickly and use it to drive the speed of the team's iteration. If later down the line it turns out not to be a good metric, we can simply change it at that point.

Avoidable bias

We can use human-level performance as a reference for choosing our next step: whether to focus on bias reduction (avoidable bias is the gap between training-set performance and human-level performance) or on variance reduction (the gap between training-set and dev-set performance). This works because human-level performance can usually be taken as a good approximation of the Bayes error.
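A minimal sketch of that decision, using hypothetical error values:

```python
# Minimal sketch: use human-level error as a proxy for Bayes error to decide
# whether bias or variance reduction should come next. Numbers are hypothetical.
human_error = 0.01          # proxy for Bayes error
train_error = 0.08
dev_error = 0.10

avoidable_bias = train_error - human_error   # 0.07
variance = dev_error - train_error           # 0.02

if avoidable_bias > variance:
    print("Focus on bias: bigger model, train longer, better architecture/optimizer.")
else:
    print("Focus on variance: more data, regularization, dropout, data augmentation.")
```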

Error Analysis

When we get a result we are not satisfied with, we can look at misclassified dev examples to get insight into what to do next and to evaluate improvement ideas. For example, take ~100 mislabeled dev-set examples and look for the common characteristics among them and each category's proportion. This simple counting procedure, error analysis, can save a lot of time when deciding what is most important or most promising to focus on next. Several improvement ideas can be evaluated in parallel this way.
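A minimal sketch of the counting step, with hypothetical error categories (in practice the tags come from manually inspecting each sampled example, often in a spreadsheet):

```python
# Minimal sketch: error analysis as a counting exercise over misclassified dev examples.
from collections import Counter

tags_per_example = [
    ["blurry"], ["wrong_breed"], ["blurry", "filter"], ["mislabeled"],
    ["wrong_breed"], ["blurry"],   # ... one entry per inspected example (~100 in practice)
]

counts = Counter(tag for tags in tags_per_example for tag in tags)
total = len(tags_per_example)
for tag, count in counts.most_common():
    # Each fraction is a ceiling on how much fixing that category could help.
    print(f"{tag}: {count}/{total} = {count / total:.0%}")
```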

Besides improving the algorithm, we also need to make sure our labeling is correct. DL algorithms are quite robust to random errors in the training set, but less robust to systematic errors.

Also, when correcting labels we need to apply the same process to the dev and test sets so they continue to come from the same distribution; for instance, consider examining examples the algorithm got right as well as ones it got wrong.

Even after the labels are correct, the training set and the dev/test sets may still come from different distributions.

Get first system quickly and iterate

The guideline is essentially the heading: set up the dev/test sets and metric, build an initial system quickly, then use bias/variance analysis and error analysis to prioritize the next steps.

Training and testing on different distributions

The dev/test sets should approximate the cases we want the model to handle in the real task (the production environment) as closely as possible.

When we find a big gap between performance on the training set and on the dev/test set, but are not sure whether their distributions actually differ, what should we do? We can carve out part of the training set as a “training-dev” set, which has the same distribution as the training set but is not used for training. We can then measure the model on this training-dev set as well as on the dev set to see whether we have a data-mismatch problem or a variance/bias problem to work on.
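A minimal sketch of carving out the training-dev set and reading off the gaps (error values are hypothetical):

```python
# Minimal sketch: hold out a "training-dev" slice (same distribution as training,
# never used for training), then interpret the gaps between the error values.
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(200_000)
train_dev_idx, train_idx = idx[:10_000], idx[10_000:]   # held-out slice of training data

train_error = 0.010
train_dev_error = 0.015   # same distribution as training -> this gap is variance
dev_error = 0.100         # different distribution -> the extra gap is data mismatch

variance = train_dev_error - train_error       # small: not a variance problem
data_mismatch = dev_error - train_dev_error    # large: data mismatch is the issue
print(f"variance ~ {variance:.3f}, data mismatch ~ {data_mismatch:.3f}")
```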

When it turns out that we do have a data mismatch problem, what should we do, short of collecting a whole new dataset?

  1. Carry out manual error analysis to try to understand the difference between the training and dev/test sets.
  2. Make the training data more similar to the dev/test sets, or collect more data similar to the dev/test sets ⇒ artificial data synthesis (see the sketch after this list). Its risk is sampling from only a small subset of the possible input space and overfitting to it.
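A minimal sketch of the synthesis idea, using random arrays as stand-ins for real data (e.g. mixing clean speech with background noise); the shapes and mixing weight are assumptions:

```python
# Minimal sketch of artificial data synthesis: mix clean examples with background
# noise so the training data looks more like the dev/test distribution.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal((1000, 16_000))   # stand-in for 1000 one-second clips
noise_clip = rng.standard_normal(16_000)      # a single background-noise clip

# Risk: reusing one short noise clip samples only a tiny subset of all possible
# noise, so the model can overfit to that clip. Prefer many diverse noise sources.
synthesized = clean + 0.3 * noise_clip
print(synthesized.shape)
```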

Transfer Learning

When transfer learning from task A ⇒ task B makes sense:

  1. Task A and B have the same input
  2. There is a lot more data for task A than for task B
  3. Low-level features from A could be helpful for learning B (see the sketch after this list)
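A minimal sketch in PyTorch (assumed available) of the usual recipe: keep the early layers trained on task A, swap the output head for task B, and optionally freeze the reused layers when task-B data is small. The architecture and sizes are made up for illustration:

```python
# Minimal sketch of transfer learning A => B: reuse early layers, retrain a new head.
import torch
import torch.nn as nn

# Pretend this network was already trained on task A (which has lots of data).
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # early layers: low-level features
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),               # task-A output head (e.g. 10 classes)
)

# Transfer to task B: replace the head and freeze the reused layers.
model[-1] = nn.Linear(32, 2)         # task-B output head (e.g. 2 classes)
for layer in model[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
x, y = torch.randn(16, 128), torch.randint(0, 2, (16,))   # fake task-B batch
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```

With enough task-B data, the reused layers can instead be left unfrozen and fine-tuned with a small learning rate.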

Multi-task Learning

Instead of assigning a single label to an example, we go through the different classes and ask, for each class, whether it appears in the example. To deal with this kind of problem, instead of building several networks, one per label, we can build a single network that looks at each example and solves all the labeling problems at once, since some of the earlier features in the neural network can be shared among these different labeling tasks.
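A minimal sketch in PyTorch (assumed available) of such a shared network: one trunk, one sigmoid output per label, and a binary cross-entropy term per label. The label set and sizes are made up for illustration:

```python
# Minimal sketch of multi-task (multi-label) learning: one shared network,
# one independent binary prediction per label.
import torch
import torch.nn as nn

num_labels = 4   # e.g. pedestrian, car, stop sign, traffic light (assumed tasks)
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # shared early layers across all tasks
    nn.Linear(64, num_labels),       # one logit per label
)

x = torch.randn(16, 128)
y = torch.randint(0, 2, (16, num_labels)).float()   # multi-hot labels, not one-hot

# BCEWithLogitsLoss applies a sigmoid per output, so each label is its own task.
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()
```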

Why does multi-task learning make sense?

  • Training on a set of tasks that could benefit from having shared lower-level features;
  • Usually: the amount of data you have for each task is quite similar (so multi-task learning can be seen as a kind of transfer learning when the number of tasks is large enough: for any one specific task, the rest of the tasks together (a larger dataset) act as transfer learning toward that one task (a smaller dataset));
  • You can train a big enough neural network to do well on all the tasks (this holds only when the network is big enough).

End to End Learning