Improving Q-learning with policy gradients