How to use LBFGS instead of stochastic gradient descent for neural network training in PyTorch
If you have ever trained a zero-hidden-layer model for testing, you may have seen that it typically performs worse than a linear (logistic) regression model. But wait, aren't these the same thing? Why would the zero-hidden-layer network be worse? Exactly. There is, of course, a good explanation, and it is model estimation. We typically train neural networks using variants of stochastic gradient descent, whereas we typically train regression models using optimization methods that are not stochastic and make use of second derivatives.
If you have ever trained a one-hidden-layer network in scikit-learn, you might have seen that one option for the optimizer there is the same as for logistic regression: the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm. Using second-order derivative information to guide optimization should make convergence faster, although the time and memory requirements might make it infeasible for very deep networks, and mini-batch training is not available in PyTorch out-of-the-box.
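As a quick reminder, this is roughly what that looks like in scikit-learn; the dataset and settings below are only a stand-in for illustration, not part of the experiment later in this post:

```python
# Minimal sketch: the same 'lbfgs' solver is available for both model types in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=123)

logit = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=1000).fit(X, y)
```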
I work on tabular datasets where good convergence is a larger concern than deep architectures, so I’d be happy to get more stable convergence and regularize my network more to make sure I’m not overfitting. So let’s check out how to use LBFGS in PyTorch!
The PyTorch documentation says:

> Some optimization algorithms such as Conjugate Gradient and LBFGS need to reevaluate the function multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.
It also provides an example:
```python
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
```
Note how the function `closure()` contains the same steps we typically run before taking a step with SGD or Adam. In other words, if the optimizer needs the gradient only once per step, like SGD or Adam, it's simple to calculate the gradient with `.backward()` and pass it to the optimizer. If the optimizer needs to evaluate the loss and gradient itself, possibly several times per step like LBFGS, we instead pass a function that wraps the steps we typically do once for other optimizers.
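For contrast, this is the loop we would normally write for an optimizer like SGD or Adam, where the gradient is computed exactly once per step (a sketch using the same placeholder names `model`, `loss_fn`, and `dataset` as the documentation example):

```python
for input, target in dataset:
    optimizer.zero_grad()           # clear old gradients
    output = model(input)           # forward pass
    loss = loss_fn(output, target)  # compute the loss
    loss.backward()                 # compute the gradient once
    optimizer.step()                # no closure needed
```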
Let's see how the closure-based version looks in a simple feed-forward network:
```python
import torch
from torch import nn
from torch.autograd import Variable
from torch.optim import Adam, LBFGS
from torch.utils.data import Dataset, DataLoader
```
I define a somewhat flexible feed-forward network below. The action happens in the method `train()`. We replace the gradient calculation with the `closure` function that does the same thing, plus two checks suggested here in case `closure` is called only to calculate the loss.
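The hidden layers below are built by pairing consecutive layer sizes with `zip`; here is a minimal illustration of that pattern (the list `sizes` is just an example):

```python
sizes = [100, 50, 25]        # e.g. input_dim followed by two hidden layer sizes
list(zip(sizes, sizes[1:]))  # [(100, 50), (50, 25)] -> one nn.Linear(in, out) per pair
```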
```python
class NNet(nn.Module):
    def __init__(self, input_dim, hidden_layer_sizes, loss, sigmoid=False):
        super().__init__()
        self.input_dim = input_dim
        self.layer_sizes = hidden_layer_sizes
        self.iter = 0
        # The loss function could be MSE or BCELoss depending on the problem
        self.lossFct = loss
        # We leave the optimizer empty for now to assign flexibly
        self.optim = None

        hidden_layer_sizes = [input_dim] + hidden_layer_sizes
        last_layer = nn.Linear(hidden_layer_sizes[-1], 1)
        # One Linear+ReLU block per pair of consecutive layer sizes
        self.layers = \
            [nn.Sequential(nn.Linear(input_, output_), nn.ReLU())
             for input_, output_ in
             zip(hidden_layer_sizes, hidden_layer_sizes[1:])] + \
            [last_layer]
        # The output activation depends on the problem
        if sigmoid:
            self.layers = self.layers + [nn.Sigmoid()]
        self.layers = nn.Sequential(*self.layers)

    def forward(self, x):
        x = self.layers(x)
        return x

    # Note: this overrides nn.Module.train(); that is fine here because we never
    # switch the module between training and eval mode.
    def train(self, data_loader, epochs, validation_data=None):
        for epoch in range(epochs):
            running_loss = self._train_iteration(data_loader)
            val_loss = None
            if validation_data is not None:
                y_hat = self(validation_data['X'])
                val_loss = self.lossFct(input=y_hat, target=validation_data['y']).detach().numpy()
                print('[%d] loss: %.3f | validation loss: %.3f' %
                      (epoch + 1, running_loss, val_loss))
            else:
                print('[%d] loss: %.3f' %
                      (epoch + 1, running_loss))

    def _train_iteration(self, data_loader):
        running_loss = 0.0
        for i, (X, y) in enumerate(data_loader):
            X = X.float()
            y = y.unsqueeze(1).float()
            X_ = Variable(X, requires_grad=True)
            y_ = Variable(y)

            ### Comment out the typical gradient calculation
            # pred = self(X)
            # loss = self.lossFct(pred, y)
            # self.optim.zero_grad()
            # loss.backward()

            ### Add the closure function to calculate the gradient.
            def closure():
                # Only clear and recompute gradients when gradient tracking is on;
                # the closure may also be called just to evaluate the loss.
                if torch.is_grad_enabled():
                    self.optim.zero_grad()
                output = self(X_)
                loss = self.lossFct(output, y_)
                if loss.requires_grad:
                    loss.backward()
                return loss

            self.optim.step(closure)

            # Calculate the loss again for monitoring, without building a new graph
            with torch.no_grad():
                loss = closure()
            running_loss += loss.item()
        return running_loss

    # I like to include a sklearn-like predict method for convenience
    def predict(self, X):
        X = torch.Tensor(X)
        return self(X).detach().numpy().squeeze()


class ExperimentData(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx, :], self.y[idx]
```
Let's test the whole thing on some simulated data from sklearn's `make_classification`.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20000, n_features=100, n_informative=80, n_redundant=0,
                           n_clusters_per_class=20, class_sep=1, random_state=123)
X, X_val, y, y_val = train_test_split(X, y, test_size=0.5, random_state=123)

# Don't forget to prepare the data for the DataLoader
data = ExperimentData(X, y)

INPUT_SIZE = X.shape[1]
EPOCHS = 5  # Few epochs to avoid overfitting
pred_val = {}
```
We'll check first whether we can match the AUC of a logistic regression model trained with sklearn and its (almost deprecated) liblinear solver.
```python
HIDDEN_LAYER_SIZE = []
```
Let's try Adam as the optimizer first. We'd typically use it with mini-batches, and I keep the default parameters.
```python
data_loader = DataLoader(data, batch_size=128)

net = NNet(INPUT_SIZE, HIDDEN_LAYER_SIZE, loss=nn.BCELoss(), sigmoid=True)
net.optim = Adam(net.parameters())

net.train(data_loader, EPOCHS, validation_data={"X": torch.Tensor(X_val), "y": torch.Tensor(y_val).unsqueeze(1)})
```

```
[1] loss: 84.258 | validation loss: 0.794
[2] loss: 56.030 | validation loss: 0.666
[3] loss: 52.088 | validation loss: 0.658
[4] loss: 51.850 | validation loss: 0.658
[5] loss: 51.821 | validation loss: 0.658
```

```python
pred_val["adam"] = net.predict(X_val)
```
LBFGS is next. The implementation in PyTorch doesn't support mini-batches, so we feed in the full dataset at once. Better hope your dataset is reasonably sized!
```python
data_loader = DataLoader(data, batch_size=X.shape[0])

net = NNet(INPUT_SIZE, HIDDEN_LAYER_SIZE, loss=nn.BCELoss(), sigmoid=True)
net.optim = LBFGS(net.parameters(), history_size=10, max_iter=4)
```
A batch version of LBFGS is available at https://github.com/nlesc-dirac/pytorch/blob/master/torch/optim/lbfgs.py
```python
net.train(data_loader, EPOCHS, validation_data={"X": torch.Tensor(X_val), "y": torch.Tensor(y_val).unsqueeze(1)})
```

```
[1] loss: 1.236 | validation loss: 1.228
[2] loss: 0.657 | validation loss: 0.661
[3] loss: 0.652 | validation loss: 0.657
[4] loss: 0.652 | validation loss: 0.657
[5] loss: 0.652 | validation loss: 0.657
```

```python
pred_val["LBFGS"] = net.predict(X_val)
```
```python
logit = LogisticRegression(solver="liblinear")
logit.fit(X, y)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
                   tol=0.0001, verbose=0, warm_start=False)
```

```python
pred_val["logit"] = logit.predict_proba(X_val)[:, 1]

{key: roc_auc_score(y_val, pred) for key, pred in pred_val.items()}
```

```
{'adam': 0.6484174933626715,
 'LBFGS': 0.6502406286585709,
 'logit': 0.6502353085555737}
```
Nice! We can see that Adam gets us close, while LBFGS matches liblinear almost exactly and here even edges slightly ahead.
But how about real neural networks? Or at least one with two hidden layers?
```python
HIDDEN_LAYER_SIZE = [X.shape[1]//2, X.shape[1]//2]

data_loader = DataLoader(data, batch_size=128)

net = NNet(INPUT_SIZE, HIDDEN_LAYER_SIZE, loss=nn.BCELoss(), sigmoid=True)
net.optim = Adam(net.parameters())

net.train(data_loader, EPOCHS, validation_data={"X": torch.Tensor(X_val), "y": torch.Tensor(y_val).unsqueeze(1)})
```

```
[1] loss: 52.640 | validation loss: 0.650
[2] loss: 48.396 | validation loss: 0.627
[3] loss: 44.462 | validation loss: 0.603
[4] loss: 39.803 | validation loss: 0.583
[5] loss: 35.332 | validation loss: 0.576
```

```python
pred_val["adam_deep"] = net.predict(X_val)
```
```python
data_loader = DataLoader(data, batch_size=X.shape[0])

net = NNet(INPUT_SIZE, HIDDEN_LAYER_SIZE, loss=nn.BCELoss(), sigmoid=True)
net.optim = LBFGS(net.parameters(), history_size=10, max_iter=4)

net.train(data_loader, EPOCHS, validation_data={"X": torch.Tensor(X_val), "y": torch.Tensor(y_val).unsqueeze(1)})
```

```
[1] loss: 0.684 | validation loss: 0.688
[2] loss: 0.635 | validation loss: 0.654
[3] loss: 0.545 | validation loss: 0.631
[4] loss: 0.476 | validation loss: 0.620
[5] loss: 0.411 | validation loss: 0.608
```

```python
pred_val["LBFGS_deep"] = net.predict(X_val)

{key: roc_auc_score(y_val, pred) for key, pred in pred_val.items()}
```

```
{'adam': 0.6484174933626715,
 'LBFGS': 0.6502406286585709,
 'logit': 0.6502353085555737,
 'adam_deep': 0.7757943593787976,
 'LBFGS_deep': 0.7770462436152764}
```
To be honest, the results are very close, and no firm conclusion survives repeated runs. I'm afraid we'll need a real benchmark at some point to figure this out. Until then, here's a great Reddit thread where you can explore this further: https://www.reddit.com/r/MachineLearning/comments/4bys6n/lbfgs_and_neural_nets/
Written on October 3rd, 2019 by Johannes Haupt