RNN (Recurrent Neural Networks): Derivation and Implementation


http://x-algo.cn/index.php/2016/04/25/rnn-recurrent-neural-networks-derivation-and-implementation/ · 2016-04-25 · Categories: Deep Learning / NLP / RNN

This article is based mainly on the wildml blog; all of the code is implemented in Python, without any deep learning framework. The derivations are a bit tedious, but working through them once gives a much deeper understanding of RNNs. Before reading, you should already be familiar with the basics of traditional neural networks; if not, see the earlier post "Neural Network implementation" (『神经网络(Neural Network)实现』).

All runnable code: Code to follow along is on Github.


Language model

Anyone familiar with NLP will recognize this: a language model assigns a probability to a natural-language sentence. Concretely, if a sentence has m words, the probability of generating the sentence is:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

In other words, we assume that the probability of generating the next word depends only on the words that precede it in the sentence. For example, the probability of the sentence "He went to buy some chocolate" can be factored as P(He went to buy some chocolate) = P(He went to buy some) × P(chocolate | He went to buy some).
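Expanding the product fully for that example (just the chain rule from the formula above, applied term by term):

$$P(\text{He went to buy some chocolate}) = P(\text{He}) \cdot P(\text{went} \mid \text{He}) \cdot P(\text{to} \mid \text{He went}) \cdot P(\text{buy} \mid \text{He went to}) \cdot P(\text{some} \mid \text{He went to buy}) \cdot P(\text{chocolate} \mid \text{He went to buy some})$$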

Data preprocessing

Training a model requires a corpus; here the corpus is reddit comment data from Google BigQuery. Preprocessing removes low-frequency words to keep the vocabulary size under control, replacing each of them with a single placeholder token (here UNKNOWN_TOKEN). After preprocessing, every word is replaced by a unique integer index. To let the model learn which words tend to start and end sentences, two special tokens, SENTENCE_START and SENTENCE_END, are added. The code sketch below shows the main steps:

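The collapsed code block from the original post is not reproduced here; the following is a minimal sketch of the preprocessing steps just described, assuming NLTK for tokenization and a hypothetical CSV file of reddit comments (the file name, header row, and vocabulary size of 8000 are assumptions for illustration):

```python
# A minimal preprocessing sketch. Assumes NLTK (with the 'punkt' tokenizer data)
# and a CSV file with one comment per row; file name and sizes are assumptions.
import csv
import itertools
import nltk
import numpy as np

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

with open("data/reddit-comments.csv", "r") as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header row (assumed to exist)
    # Split comments into sentences and wrap each with the start/end tokens
    sentences = itertools.chain(*[nltk.sent_tokenize(row[0].lower()) for row in reader])
    sentences = ["%s %s %s" % (sentence_start_token, s, sentence_end_token) for s in sentences]

# Tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(s) for s in sentences]

# Count word frequencies and keep the (vocabulary_size - 1) most frequent words
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
vocab = word_freq.most_common(vocabulary_size - 1)
index_to_word = [w for w, _ in vocab] + [unknown_token]
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Replace all words not in the vocabulary with UNKNOWN_TOKEN
tokenized_sentences = [[w if w in word_to_index else unknown_token for w in s]
                       for s in tokenized_sentences]

# Training data: x is the sentence, y is the same sentence shifted left by one word
X_train = np.asarray([[word_to_index[w] for w in s[:-1]] for s in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in s[1:]] for s in tokenized_sentences], dtype=object)
```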

 

Network structure

The structure of an RNN differs from a traditional NN, but it is still easy to understand; the network looks like this:

[Figure] A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

The difference is that an RNN is a "recurrent" network and has a notion of "state".

As shown in the figure above, $t$ denotes the time step, $x_t$ is the input at step $t$, $s_t$ is the hidden-layer output at step $t$, and $o_t$ is the output. The special part is that the hidden layer has two sources of input: the current input $x_t$ and the previous hidden state $s_{t-1}$; $U$, $V$, $W$ are the parameters. The structure above can be written as:

$$s_t = \tanh(U x_t + W s_{t-1}) \qquad \hat{y}_t = \mathrm{softmax}(V s_t)$$

If the hidden layer has 100 units and the vocabulary size is C = 8000, the dimensions of the variables and parameters are:

$$x_t \in \mathbb{R}^{8000},\quad o_t \in \mathbb{R}^{8000},\quad s_t \in \mathbb{R}^{100},\quad U \in \mathbb{R}^{100 \times 8000},\quad V \in \mathbb{R}^{8000 \times 100},\quad W \in \mathbb{R}^{100 \times 100}$$
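As a quick sanity check of these shapes (a small illustrative sketch, not part of the original post), one recurrence step with randomly initialized parameters:

```python
import numpy as np

word_dim, hidden_dim = 8000, 100
U = np.random.randn(hidden_dim, word_dim)    # 100 x 8000
V = np.random.randn(word_dim, hidden_dim)    # 8000 x 100
W = np.random.randn(hidden_dim, hidden_dim)  # 100 x 100

x_t = np.zeros(word_dim)       # one-hot encoded input word
x_t[42] = 1.0                  # arbitrary word index
s_prev = np.zeros(hidden_dim)  # previous hidden state

s_t = np.tanh(U.dot(x_t) + W.dot(s_prev))
z_t = V.dot(s_t)
o_t = np.exp(z_t) / np.sum(np.exp(z_t))  # softmax over the vocabulary
print(s_t.shape, o_t.shape)              # (100,) (8000,)
```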

Initialization

There are many ways to initialize the parameters. Initializing them all to 0 leads to "symmetric calculations" (every hidden unit ends up computing the same thing). The right initialization depends on the activation function; since we use tanh here, a recommended scheme is to initialize in the range $[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$, where $n$ is the number of incoming connections from the previous layer. See the references for more details.

```python
import numpy as np

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
```
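For example, instantiating the model with the vocabulary size assumed throughout the post (the seed is only for reproducibility):

```python
np.random.seed(10)
vocabulary_size = 8000
model = RNNNumpy(vocabulary_size)
```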

 

Forward propagation

As in a traditional NN, forward propagation amounts to a few matrix multiplications:

```python
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]
```
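The softmax helper used above is not defined in the post; a minimal, numerically stable version (my assumption, not necessarily the original utility function) could look like:

```python
def softmax(z):
    # Shift by the maximum for numerical stability, then normalize to probabilities
    e = np.exp(z - np.max(z))
    return e / np.sum(e)
```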

The prediction function can be written as:

```python
def predict(self, x):
    # Perform forward propagation and return the index of the highest score at each step
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)
```

 

Loss function

As in a standard NN, we use cross-entropy as the loss function. With N training examples, the loss can be written as:

$$L(y, o) = -\frac{1}{N} \sum_{n \in N} y_n \log o_n$$

The following two functions compute the loss:

```python
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how far off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples (words)
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x, y) / N
```
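A useful sanity check (not shown in the post, but it follows directly from the loss definition): an untrained model predicts each of the C = 8000 words roughly uniformly, so its loss should be close to ln(C) ≈ 8.99. Assuming the model instance and the X_train/y_train arrays from the sketches above:

```python
# Expected loss for completely random predictions is -ln(1/C) = ln(C)
print("Expected loss for random predictions: %f" % np.log(vocabulary_size))
# Loss of the untrained model on a small subset of the data (the full set is slow)
print("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))
```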

 

Learning the parameters with BPTT

BPTT (Backpropagation Through Time) is a very intuitive method, similar to ordinary BP, except that the propagation path contains a "loop" and the parameters along the path are shared.

With a cross-entropy loss, the error can be written as:

$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t \qquad E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t) = -\sum_t y_t \log \hat{y}_t$$

where $y_t$ is the true value and $\hat{y}_t$ is the prediction. Unrolling the error over time can be pictured as:

[Figure: rnn-bptt1]

So the partial derivative of the total error with respect to $W$ is the sum over time steps:

$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$

Taking $E_3$ as an example (and writing $z_3 = V s_3$), its gradient with respect to $V$ is:

$$\frac{\partial E_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial z_3}\frac{\partial z_3}{\partial V} = (\hat{y}_3 - y_3) \otimes s_3$$

Using the chain rule and the fact that $W$ is shared across all time steps of the RNN, we get:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}$$
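The factor $\frac{\partial s_3}{\partial s_k}$ above is itself a product of Jacobians over the intermediate time steps, which is the repeated chain-rule application that later causes the vanishing-gradient problem:

$$\frac{\partial s_3}{\partial s_k} = \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}$$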

The following figure illustrates this process quite vividly:

[Figure: rnn-bptt-with-gradients]

The BPTT code for computing the gradients:

```python
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print("Backpropagation step t=%d bptt step=%d" % (t, bptt_step))
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step: dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
```

 

The vanishing gradient problem

The figure below shows the tanh and sigmoid functions and their derivatives. The derivatives take values in [0, 1], so a few applications of the chain rule shrink the gradient exponentially; after only a few propagation steps the gradient becomes extremely weak. LSTM, which overcomes this problem, is a popular recent solution.

[Figure: tanh and sigmoid and their derivatives]
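A tiny numerical illustration of this shrinkage (a sketch, not from the original post): the derivative of tanh is 1 − tanh²(x), which is at most 1 and much smaller away from zero, so multiplying it across time steps drives the gradient toward zero exponentially fast.

```python
import numpy as np

x = 1.0
d = 1 - np.tanh(x) ** 2   # about 0.42 at x = 1
for steps in (5, 10, 20):
    # The gradient contribution after `steps` chain-rule applications
    print(steps, d ** steps)   # roughly 1.3e-2, 1.7e-4, 2.9e-8
```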

Gradient Checking

Gradient checking is very useful. The idea is that the "gradient" at a point should equal the "slope" at that point, and the slope can be estimated numerically via a limit:

$$\frac{\partial L}{\partial \theta} \approx \lim_{h \to 0} \frac{J(\theta + h) - J(\theta - h)}{2h}$$

By comparing this numerical "slope" with the analytically computed "gradient", we can tell whether the gradient computation is correct. Note that this check is quite expensive, because the model has on the order of a million parameters.

The gradient-checking code:

```python
import operator

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h)) / (2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x], [y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x], [y])
            estimated_gradient = (gradplus - gradminus) / (2 * h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y| / (|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient) / (np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large, fail the gradient check
            if relative_error > error_threshold:
                print("Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print("+h Loss: %f" % gradplus)
                print("-h Loss: %f" % gradminus)
                print("Estimated gradient: %f" % estimated_gradient)
                print("Backpropagation gradient: %f" % backprop_gradient)
                print("Relative Error: %f" % relative_error)
                return
            it.iternext()
        print("Gradient check for parameter %s passed." % (pname))
```
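Because each numerical estimate requires two full forward passes per parameter, it makes sense to run the check on a model with a much smaller vocabulary; the sizes and toy input below are illustrative assumptions:

```python
# A tiny vocabulary keeps the check (which loops over every parameter) affordable
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0, 1, 2, 3], [1, 2, 3, 4])
```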

 

SGD implementation

This update formula should look very familiar:

$$W = W - \lambda \Delta W$$

where $\Delta W$ is the gradient. The code:

```python
# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW
```
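The outer training loop is not shown in the post; a minimal sketch of one (the function name train_with_sgd and the hyperparameter values are assumptions) could be:

```python
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # Sweep over the training sentences repeatedly, taking one SGD step per sentence
    for epoch in range(nepoch):
        # Periodically evaluate the loss so training progress is visible
        if epoch % evaluate_loss_after == 0:
            loss = model.calculate_loss(X_train, y_train)
            print("Loss after epoch %d: %f" % (epoch, loss))
        for i in range(len(y_train)):
            model.numpy_sdg_step(X_train[i], y_train[i], learning_rate)
```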

 

Generating text

Text generation is simply applying the model: repeatedly run the forward pass and sample the next word:

```python
def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        # forward_propagation returns [o, s]; we only need the output probabilities
        next_word_probs, _ = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print(" ".join(sent))
```

 

References

Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano

Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients

Reposted from: https://www.cnblogs.com/DjangoBlog/p/7447441.html

