Question on the Conditional Probability of RL Loss #21

Open

crystina-z opened this issue Oct 30, 2018 · 1 comment

crystina-z commented Oct 30, 2018

RLSeq2Seq/src/model.py

Lines 372 to 381 in 515a4cb

indices = tf.stack( (batch_nums, targets), axis=1) # shape (batch_size, 2)
gold_probs = tf.gather_nd(dist, indices) # shape (batch_size). prob of correct words on this step
losses = -tf.log(gold_probs)
loss_per_step.append(losses)
# Equation 15 in https://arxiv.org/pdf/1705.04304.pdf
# Equal reward for all tokens
if FLAGS.use_discounted_rewards or FLAGS.use_intermediate_rewards:
    rl_losses = -tf.log(gold_probs) * self._reward_diff[_k][dec_step, :] # positive values
else:
    rl_losses = -tf.log(gold_probs) * self._reward_diff[_k] # positive values

I'm sorry if I'm misunderstanding the paper or the code, but as I read it, the conditional probability in Equation 15 is the probability of the sampled sequence, not the probability of the target sequence. In other words, shouldn't the code 'gather' the probabilities using the indices of the words in the sampled sequence, rather than the indices of the words in the target sequence?
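For reference, my reading of Equation 15 in the paper (paraphrased, not a verbatim quote) is roughly

L_rl = (r(ŷ) - r(y^s)) * sum_t log p(y^s_t | y^s_1, ..., y^s_{t-1}, x)

where y^s is the sampled sequence, ŷ is the greedy baseline, and r(.) is the ROUGE reward, so the per-token term is the probability of the sampled token rather than the target token. To make the point concrete, here is a minimal TF1-style sketch of the distinction I mean; `dist`, `targets`, `samples`, etc. are dummy stand-ins for a single decoder step, not the actual tensors in model.py:

import tensorflow as tf

batch_size, vocab_size = 4, 10
# Dummy stand-ins for the tensors at a single decoder step.
dist = tf.nn.softmax(tf.random_normal([batch_size, vocab_size]))  # decoder output distribution
targets = tf.constant([1, 2, 3, 4])                               # ground-truth token ids
samples = tf.multinomial(tf.log(dist), 1)                         # ids sampled from the decoder's own distribution
samples = tf.squeeze(tf.to_int32(samples), axis=1)
batch_nums = tf.range(0, limit=batch_size)

# What the snippet above does: probability of the ground-truth target token.
gold_probs = tf.gather_nd(dist, tf.stack((batch_nums, targets), axis=1))
ml_losses = -tf.log(gold_probs)  # maximum-likelihood term

# What I read Equation 15 as asking for: probability of the *sampled* token,
# later scaled by the reward difference (r(y_hat) - r(y^s)).
sample_probs = tf.gather_nd(dist, tf.stack((batch_nums, samples), axis=1))
rl_losses = -tf.log(sample_probs)

with tf.Session() as sess:
    print(sess.run([ml_losses, rl_losses]))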

crystina-z commented Oct 30, 2018

A follow-up question; again, I apologize if I'm misunderstanding something.
As far as I can see:

  1. The inputs to Loss_ml and Loss_rl seem to be different in the original paper.
    The Loss_ml part is basically traditional teacher forcing: it uses the ground truth as the input to the next timestep, which is also what the current implementation does.
    However, for the reinforcement learning part, considering that the paper is trying to address exposure bias, which 'come[s] from the fact that the network has knowledge of the ground truth sequence up to the next token' (bottom of page 4), and that the baseline ŷ is obtained essentially by greedy search (top of page 5), I feel the ground truth should not be fed in during RL training, i.e. the decoder input should come from the previous prediction rather than from the batch.

  2. Based on the above (the RL input coming from the previous timestep), I'm thinking the procedures for generating ŷ and y^s should maybe also be separated (currently they both depend on the ground-truth input); see the sketch after this list.
    At timestep t, ŷ(t) is obtained by maximizing p(ŷ(t) | ŷ(t-1), ..., ŷ(1)), while y^s(t) is sampled from p(y^s(t) | y^s(t-1), ..., y^s(1)). As you can see, these two distributions are different, so I'm thinking we are supposed to have two generative procedures here: one always takes ŷ(t-1) as the next timestep's input and generates ŷ(t), and the other always takes y^s(t-1) as input and generates y^s(t). In this way we ultimately obtain two sequences, ŷ and y^s, together with their corresponding ROUGE values.
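To illustrate what I mean in points 1 and 2, here is a rough, purely conceptual sketch (plain Python/NumPy, not the repo's graph code; `decoder_step` is a hypothetical stand-in for one step of the attention decoder): two separate rollouts, one greedy and one sampled, each feeding back its own previous prediction instead of the ground truth.

import numpy as np

def decoder_step(prev_token, state, vocab_size=10):
    """Hypothetical single decoder step: returns (token distribution, new state)."""
    rng = np.random.default_rng(abs(hash((prev_token, state))) % (2 ** 32))
    logits = rng.normal(size=vocab_size)
    dist = np.exp(logits) / np.exp(logits).sum()
    return dist, state + 1

def rollout(start_token, num_steps, sample):
    """Decode num_steps tokens, feeding each step's own output back as the next input."""
    tokens, log_probs, state, prev = [], [], 0, start_token
    for _ in range(num_steps):
        dist, state = decoder_step(prev, state)
        if sample:
            tok = int(np.random.choice(len(dist), p=dist))  # y^s(t) ~ p(. | y^s(t-1), ...)
        else:
            tok = int(np.argmax(dist))                      # y_hat(t) = argmax p(. | y_hat(t-1), ...)
        tokens.append(tok)
        log_probs.append(np.log(dist[tok]))
        prev = tok                                          # feed back the model's own prediction
    return tokens, sum(log_probs)

y_hat, _ = rollout(start_token=0, num_steps=5, sample=False)      # greedy baseline sequence
y_s, logp_y_s = rollout(start_token=0, num_steps=5, sample=True)  # sampled sequence

# Equation 15 (as I read it): L_rl = (r(y_hat) - r(y_s)) * logp_y_s,
# where r(.) is the ROUGE score of each full sequence against the reference.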

Please correct me if you see it differently. I'm still working on understanding RL for summarization. Thanks!!
