Question on the Conditional Probability of RL Loss #21

Open

crystina-z opened this issue Oct 30, 2018 · 1 comment

crystina-z commented Oct 30, 2018

RLSeq2Seq/src/model.py

Lines 372 to 381 in 515a4cb

indices = tf.stack( (batch_nums, targets), axis=1) # shape (batch_size, 2)
gold_probs = tf.gather_nd(dist, indices) # shape (batch_size). prob of correct words on this step
losses = -tf.log(gold_probs)
loss_per_step.append(losses)
# Equation 15 in https://arxiv.org/pdf/1705.04304.pdf
# Equal reward for all tokens
if FLAGS.use_discounted_rewards or FLAGS.use_intermediate_rewards:
    rl_losses = -tf.log(gold_probs) * self._reward_diff[_k][dec_step, :] # positive values
else:
    rl_losses = -tf.log(gold_probs) * self._reward_diff[_k] # positive values

I'm sorry if I'm misunderstanding the paper or the code, but as I read it, the conditional probability in Equation 15 is the probability of the sampled sequence, not the probability of the target sequence. In other words, shouldn't the code 'gather' the probabilities using the indices of the words in the sampled sequence, rather than the indices of the words in the target sequence?
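For reference, my reading of Equation 15 in the paper (paraphrased, not a verbatim quote) is roughly

L_rl = (r(ŷ) - r(y^s)) * sum_t log p(y^s_t | y^s_1, ..., y^s_{t-1}, x)

where y^s is the sampled sequence, ŷ is the greedy baseline, and r(.) is the ROUGE reward, so the per-token term is the probability of the sampled token rather than the target token. To make the point concrete, here is a minimal TF1-style sketch of the distinction I mean; `dist`, `targets`, `samples`, etc. are dummy stand-ins for a single decoder step, not the actual tensors in model.py:

import tensorflow as tf

batch_size, vocab_size = 4, 10
# Dummy stand-ins for the tensors at a single decoder step.
dist = tf.nn.softmax(tf.random_normal([batch_size, vocab_size]))  # decoder output distribution
targets = tf.constant([1, 2, 3, 4])                               # ground-truth token ids
samples = tf.multinomial(tf.log(dist), 1)                         # ids sampled from the decoder's own distribution
samples = tf.squeeze(tf.to_int32(samples), axis=1)
batch_nums = tf.range(0, limit=batch_size)

# What the snippet above does: probability of the ground-truth target token.
gold_probs = tf.gather_nd(dist, tf.stack((batch_nums, targets), axis=1))
ml_losses = -tf.log(gold_probs)  # maximum-likelihood term

# What I read Equation 15 as asking for: probability of the *sampled* token,
# later scaled by the reward difference (r(y_hat) - r(y^s)).
sample_probs = tf.gather_nd(dist, tf.stack((batch_nums, samples), axis=1))
rl_losses = -tf.log(sample_probs)

with tf.Session() as sess:
    print(sess.run([ml_losses, rl_losses]))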

crystina-z commented Oct 30, 2018

A follow-up question; again, I apologize if I'm misunderstanding something.
As far as I can see:

  1. The inputs to Loss_ml and Loss_rl seem to be different in the original paper.
    The Loss_ml part is basically traditional teacher forcing: it uses the ground truth as the input to the next timestep, which is also what the current implementation does.
    However, for the reinforcement learning part, considering that the paper is trying to address exposure bias, which 'come[s] from the fact that the network has knowledge of the ground truth sequence up to the next token' (bottom of page 4), and that the baseline ŷ is obtained essentially by greedy search (top of page 5), I feel the ground truth should not be fed in during RL training, i.e. the decoder input should come from the previous prediction rather than from the batch.

  2. Based on the above (the RL input coming from the previous timestep), I'm thinking the procedures for generating ŷ and y^s should maybe also be separated (currently they both depend on the ground-truth input); see the sketch after this list.
    At timestep t, ŷ(t) is obtained by maximizing p(ŷ(t) | ŷ(t-1), ..., ŷ(1)), while y^s(t) is sampled from p(y^s(t) | y^s(t-1), ..., y^s(1)). As you can see, these two distributions are different, so I'm thinking we are supposed to have two generative procedures here: one always takes ŷ(t-1) as the next timestep's input and generates ŷ(t), and the other always takes y^s(t-1) as input and generates y^s(t). In this way we ultimately obtain two sequences, ŷ and y^s, together with their corresponding ROUGE values.
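To illustrate what I mean in points 1 and 2, here is a rough, purely conceptual sketch (plain Python/NumPy, not the repo's graph code; `decoder_step` is a hypothetical stand-in for one step of the attention decoder): two separate rollouts, one greedy and one sampled, each feeding back its own previous prediction instead of the ground truth.

import numpy as np

def decoder_step(prev_token, state, vocab_size=10):
    """Hypothetical single decoder step: returns (token distribution, new state)."""
    rng = np.random.default_rng(abs(hash((prev_token, state))) % (2 ** 32))
    logits = rng.normal(size=vocab_size)
    dist = np.exp(logits) / np.exp(logits).sum()
    return dist, state + 1

def rollout(start_token, num_steps, sample):
    """Decode num_steps tokens, feeding each step's own output back as the next input."""
    tokens, log_probs, state, prev = [], [], 0, start_token
    for _ in range(num_steps):
        dist, state = decoder_step(prev, state)
        if sample:
            tok = int(np.random.choice(len(dist), p=dist))  # y^s(t) ~ p(. | y^s(t-1), ...)
        else:
            tok = int(np.argmax(dist))                      # y_hat(t) = argmax p(. | y_hat(t-1), ...)
        tokens.append(tok)
        log_probs.append(np.log(dist[tok]))
        prev = tok                                          # feed back the model's own prediction
    return tokens, sum(log_probs)

y_hat, _ = rollout(start_token=0, num_steps=5, sample=False)      # greedy baseline sequence
y_s, logp_y_s = rollout(start_token=0, num_steps=5, sample=True)  # sampled sequence

# Equation 15 (as I read it): L_rl = (r(y_hat) - r(y_s)) * logp_y_s,
# where r(.) is the ROUGE score of each full sequence against the reference.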

Please correct me if you see it differently. I'm still working on understanding RL for summarization. Thanks!!
