Implementing Adagrad in Python
I'm trying to implement Adagrad in Python. For learning purposes, I'm using matrix factorisation as an example, with Autograd computing the gradients.
My main question is if the implementation is fine.
Problem description
Given a matrix A (M x N) with some missing entries, decompose it into W and H of sizes (M x k) and (k x N) respectively. The goal is to learn W and H using Adagrad. I'm following this guide for the Autograd implementation.
NB: I know very well that ALS-based implementations are better suited to this problem; I'm using Adagrad purely for learning purposes.
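For reference, the Adagrad update I'm trying to reproduce keeps a per-parameter running sum of squared gradients and scales each step by it. A sketch of the textbook formulation (with the stability term inside the root, as in my code below; placing it outside the root is also common):

$$G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

where $g_t$ is the gradient at step $t$, $\eta$ is the base learning rate, $\odot$ is elementwise multiplication, and $\epsilon$ is a small stability constant.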
Customary imports
import autograd.numpy as np
import pandas as pd
Creating the matrix to be decomposed
A = np.array([[3, 4, 5, 2],
              [4, 4, 3, 3],
              [5, 5, 4, 3]], dtype=np.float32).T
Masking one entry
A[0, 0] = np.nan
Defining the cost function
def cost(W, H):
    pred = np.dot(W, H)
    mask = ~np.isnan(A)  # score only the observed entries
    return np.sqrt(((pred - A)[mask].flatten() ** 2).mean(axis=None))
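As a quick sanity check (my own addition, not from the guide): the mask keeps only the observed entries, so the cost stays finite despite the NaN in A.

W0 = np.abs(np.random.randn(4, 2))  # A is 4 x 3 after the transpose
H0 = np.abs(np.random.randn(2, 3))
print(cost(W0, H0))  # finite RMSE over the 11 observed entries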
Decomposition params
rank = 2
learning_rate = 0.01
n_steps = 10000
Gradient of cost wrt params W and H
from autograd import grad, multigrad

# grad_cost(W, H) returns the tuple (dcost/dW, dcost/dH)
grad_cost = multigrad(cost, argnums=[0, 1])
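A quick wiring check (again my own addition): with argnums=[0, 1], the returned function yields one gradient per argument, each shaped like the corresponding parameter. Reusing the random factors from the sanity check above:

dW0, dH0 = grad_cost(W0, H0)
assert dW0.shape == (4, 2)  # same shape as W0
assert dH0.shape == (2, 3)  # same shape as H0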
Main Adagrad routine (this needs to be checked)
shape = A.shape

# Initialising W and H with non-negative random values
H = np.abs(np.random.randn(rank, shape[1]))
W = np.abs(np.random.randn(shape[0], rank))

# gt_w and gt_h accumulate the sum of squared gradients
gt_w = np.zeros_like(W)
gt_h = np.zeros_like(H)

# stability factor to avoid division by zero
eps = 1e-8

print("Iteration, Cost")
for i in range(n_steps):
    if i % 1000 == 0:
        print("*" * 20)
        print(i, ",", cost(W, H))
    # computing gradients w.r.t. W and H
    del_W, del_H = grad_cost(W, H)
    # accumulating the squared gradients
    gt_w += np.square(del_W)
    gt_h += np.square(del_H)
    # per-parameter adaptive learning rates
    mod_learning_rate_W = np.divide(learning_rate, np.sqrt(gt_w + eps))
    mod_learning_rate_H = np.divide(learning_rate, np.sqrt(gt_h + eps))
    W = W - del_W * mod_learning_rate_W
    H = H - del_H * mod_learning_rate_H
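Once the loop finishes, a quick look at the reconstruction (my own check, not part of the routine) shows how the masked entry gets imputed:

pred = np.dot(W, H)
print("Reconstruction:")
print(np.round(pred, 2))
print("Original, with the masked entry as NaN:")
print(A)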
While the optimisation converges and I get a reasonable solution, I'd like to know whether the implementation is correct. Specifically, is my understanding of accumulating the squared gradients and then computing the adaptive learning rate correct?
At a cursory glance, your code closely matches that at https://github.com/benbo/adagrad/blob/master/adagrad.py
del_W, del_H = grad_cost(W, H)
matches
grad=f_grad(w,sd,*args)
gt_w+= np.square(del_W)
gt_h+= np.square(del_H)
matches
gti+=grad**2
mod_learning_rate_W = np.divide(learning_rate, np.sqrt(gt_w+eps))
mod_learning_rate_H = np.divide(learning_rate, np.sqrt(gt_h+eps))
matches
adjusted_grad = grad / (fudge_factor + np.sqrt(gti))
W = W-del_W*mod_learning_rate_W
H = H-del_H*mod_learning_rate_H
matches
w = w - stepsize*adjusted_grad
So, assuming that adagrad.py is correct and that the translation is faithful, your code should be correct as well. (Consensus does not prove your code right, but it is a useful hint.)
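One cosmetic difference worth flagging: the placement of the stability term. Your code adds eps inside the square root, while adagrad.py adds its fudge_factor outside. Both variants appear in practice and behave almost identically for small constants; a sketch in your variable names:

W = W - learning_rate * del_W / np.sqrt(gt_w + eps)    # your variant
W = W - learning_rate * del_W / (eps + np.sqrt(gt_w))  # adagrad.py's variant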