Speech Recognition with Deep Learning

A trained model can be reused, so it pays to rent a cloud environment for training and save time, unless your local server is really powerful. This tutorial is experiment-grade and mainly meant for reading through the code. My Xiaomi laptop couldn't handle it, so I wrote the code locally and ran it on a Huawei Cloud environment. At ¥3.5 an hour you can get through five or six training runs (experiment only; with the finished code one hour reaches twenty or thirty runs or so, which was too expensive for me to try). Still cheaper than an internet café at ¥4 an hour. Also note: this walkthrough follows the official Huawei Cloud tutorial.

Prerequisites

  • A Huawei Cloud account
  • A student-plan ECS server (about ¥10)
  • The Huawei Cloud OBS service (about ¥5)
  • The ModelArts service (about ¥3.5 per hour)

I won't go over registering the account or buying the student server; the last two items deserve a few words. That OBS service is not the streaming software OBS — it works on the same principle as Alibaba Cloud's OSS object storage. All told, you can train the model for under ¥20.

Creating an OBS Bucket

A quick word on the setup: an ordinary PC doesn't have the compute, so we connect to a remote server, hand the Python code to the ModelArts service to run for us, and keep the trained model in OBS.

Click Console -- More -- and choose the Object Storage Service (OBS).


Click Create and fill in the parameters as described below.


That is: name it whatever you like; storage class: Standard; bucket policy: Private; no encryption; direct archive read off; the tag at the end is optional. Then click Create, and make sure your account carries a balance. The pricing is quite cheap — read the description yourself. Once the experiment is done and you have pulled the model down to your local machine, be sure to delete the cloud copy, otherwise it keeps billing. With that, the bucket is created.
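
If you'd rather script this step than click through the console, here is a minimal sketch using Huawei's Python OBS SDK (package esdk-obs-python); the bucket name my-speech-bucket is a placeholder, the AK/SK come from the credentials file prepared in the next section, and the exact signatures are assumptions to verify against the SDK docs:

# pip install esdk-obs-python
from obs import ObsClient

client = ObsClient(access_key_id='YOUR_AK',
                   secret_access_key='YOUR_SK',
                   server='https://obs.cn-north-4.myhuaweicloud.com')

# a private, standard-class bucket in the cn-north-4 region (hypothetical name)
resp = client.createBucket('my-speech-bucket', location='cn-north-4')
print(resp.status)  # a 2xx status means the bucket was created
client.close()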

Preparing the Access Keys

Click "Console" to switch to the console view, then pick "My Credentials" from the drop-down under your account name to reach the page for creating and managing access keys (AK/SK).


Then choose Access Keys and Create Access Key. The description can be anything — it only costs you a verification code — and once done you can save the file it offers for download. That file, "credentials.csv", is the key pair: the front half is the AK, the back half the SK. Keep both handy; they are used later to authenticate when hooking the ECS server up to the bucket.

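If you'd rather read the pair out programmatically than copy it by hand, a minimal sketch like this works; it assumes the usual credentials.csv layout — a header row, then one data row with the AK and SK in the last two columns — so verify against your actual file:

import csv

with open('credentials.csv', newline='') as f:
    rows = list(csv.reader(f))

# assumed layout: header row, then a data row ending in AK, SK
ak, sk = rows[1][-2], rows[1][-1]
print('AK:', ak)
print('SK:', sk[:4] + '...')  # don't print the whole secret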

Installing the OBS Client on the Cloud Server

Connect to your cloud server and follow the commands below; work out the details yourself. This simply downloads the client that talks to OBS, much like Alibaba Cloud's ossfs tool:

mkdir /home/user/Desktop/data; cd /home/user/Desktop/data; wget https://obs-community.obs.cn-north-1.myhuaweicloud.com/obsutil/current/obsutil_linux_amd64.tar.gz

Enter the extraction command and list the directory:

tar -zxf obsutil_linux_amd64.tar.gz; ls -l    

Then configure the OBS tool, substituting in the AK and SK from before. Make absolutely sure you substitute: the literal AK/SK placeholders in the command must become the values you just saved.

cd ./obsutil_linux_amd64_*; ./obsutil config -i=AK -k=SK -e=obs.cn-north-4.myhuaweicloud.com

Do replace them! Finally, run the object-listing command and check whether the bucket you just created shows up:

./obsutil ls

You should see output reading back the name of the bucket you just created.

Uploading the Speech Data

Speech recognition needs speech data — the more and the bigger, the longer the training and the better the result. Here I use the corpus Huawei provides; it is fairly large, so downloading may take a while.

cd ../; wget https://sandbox-experiment-resource-north-4.obs.cn-north-4.myhuaweicloud.com/speech-recognition/data.zip; wget https://sandbox-experiment-resource-north-4.obs.cn-north-4.myhuaweicloud.com/speech-recognition/data_thchs30.tar

Run this directly on the cloud server and wait for the downloads to finish.

Once downloaded, upload the files into OBS:

./obsutil_linux_amd64_5.*/obsutil cp ./data.zip obs://OBS; ./obsutil_linux_amd64_5.*/obsutil cp ./data_thchs30.tar obs://OBS

Watch out! The obs:// part is the path, and the uppercase OBS after it must be written as your own OBS bucket name — for example, obs://my-speech-bucket if you named the bucket as in the earlier sketch.

After the copy into OBS, this stage is done.

Once the transfer completes, open the OBS page and the files will be sitting in your bucket.

The ModelArts Service

Enable the ModelArts service. It requires real-name verification and access authorization; the AK and SK it asks for are in the credentials file downloaded earlier, as mentioned above.

What is ModelArts?

ModelArts is a one-stop development platform for AI developers, offering large-scale data preprocessing and semi-automated labeling, distributed training, automated model building, and on-demand deployment of models across device, edge, and cloud, so users can create and deploy models quickly and manage the full AI workflow.

Importing Packages

Anyone who has studied Python knows a notebook is a fine tool for writing it — back when I was learning I used notebooks precisely to avoid depending on PyCharm's autocompletion.

Choose Development Environment -- Notebook on the left. When creating it, make sure to enable auto-stop and set it to 12 hours for now; delete the notebook as soon as the model is trained, because at ¥3.5 an hour it drains your balance fast. Pick Python3 as the environment and the top CPU flavor, 8 vCPUs with 32 GiB of memory — even on that spec, five training runs take over an hour. Increase the disk below to 30 GB; that part is cheap.

Then submit it and wait while the environment is provisioned.

Once it is up and running, click Open on the right, then create a new TensorFlow notebook.

Once created, type the following code into the notebook's input cell:

import moxing as mox
import numpy as np
import scipy.io.wavfile as wav
from scipy.fftpack import fft
import matplotlib.pyplot as plt
%matplotlib inline
import keras
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D
from keras.layers import Reshape, Dense, Lambda
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model
from keras.utils import multi_gpu_model
import os
import pickle

After typing it in, click Run. When a number appears to the left of the cell, the code has finished executing.

Data Preparation

Then, in the next empty cell, enter the commands below to copy the files over from the OBS bucket. Wait until the number appears before taking the next step:

current_path = os.getcwd()
mox.file.copy('s3://OBS/data.zip', current_path+'/data.zip')
mox.file.copy('s3://OBS/data_thchs30.tar', current_path+'/data_thchs30.tar')

Note: replace both uppercase OBS occurrences above with your own OBS bucket name, and wait for the number before doing what follows. My cell numbers may differ from yours because I fumbled a few earlier attempts; the number only records execution order and doesn't matter.

Keep going in the next empty cell with the commands below to unpack the data:

!unzip data.zip
!tar -xvf data_thchs30.tar

Then click Run and watch the effect — it is just extraction, so no screenshot here; if it errors, recheck the previous step.

Data Processing

Now we start processing the data. Before the code, a note on the shapes involved:

Note: think about what the network consumes during training. All samples within a batch must share one shape, [batch_size, time_step, feature_dim], but each sample we read has a different length along the time axis. So the time axis has to be handled: take the longest sample in the batch as the reference and pad the others up to it. Then everything in the batch lines up and can be trained in parallel (the actual helper, wav_padding, appears a bit later).
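
As a standalone illustration of that padding idea (pure numpy, hypothetical shapes):

import numpy as np

# a hypothetical batch: three feature matrices with different time lengths
batch = [np.ones((70, 200)), np.ones((95, 200)), np.ones((88, 200))]

max_len = max(f.shape[0] for f in batch)        # longest time axis: 95
padded = np.zeros((len(batch), max_len, 200))   # [batch_size, time_step, feature_dim]
for i, f in enumerate(batch):
    padded[i, :f.shape[0], :] = f               # zero-pad the shorter samples

print(padded.shape)  # (3, 95, 200)

With that in mind, enter the following code in the next empty cell to generate the lists of audio files and label files: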

source_file = 'data/thchs_train.txt'
def source_get(source_file):
    train_file = source_file
    label_data = []
    wav_lst = []
    with open(train_file, "r", encoding="utf-8") as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip('\n')      # drop the trailing newline
            datas = line.split('\t')     # each line: wav path <TAB> pinyin label
            wav_lst.append(datas[0])
            label_data.append(datas[1])
    return label_data, wav_lst
label_data, wav_lst = source_get(source_file)
print(label_data[:10])
print(wav_lst[:10])

Click Run; it prints the first ten labels and the first ten wav paths.

Continue in the next empty cell with the following code to process the label data (building a pinyin-to-id mapping, i.e. the vocabulary, for the labels):

def mk_vocab(label_data):
    vocab = []
    for line in label_data:
        line = line.split(' ')
        for pny in line:
            if pny not in vocab:
                vocab.append(pny)
    vocab.append('_')  # the CTC blank token
    return vocab

vocab = mk_vocab(label_data)

def word2id(line, vocab):
    return [vocab.index(pny) for pny in line.split(' ')]

label_id = word2id(label_data[0], vocab)
print(label_data[0])
print(label_id)

Then click Run. Both prints have output — the first transcript's pinyin sequence and its id sequence; since the vocabulary is built in order of first appearance, the first sentence maps to ids like [0, 1, 2, ...]. No screenshot needed.

Continue in the next empty cell with the following code to process the audio:

def compute_fbank(file):
    x = np.linspace(0, 400 - 1, 400, dtype=np.int64)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))  # Hamming window
    fs, wavsignal = wav.read(file)
    # slide a time window over the waveform with a 10 ms hop
    time_window = 25  # in ms
    window_length = fs / 1000 * time_window  # window length in samples; fixed at 400 here
    wav_arr = np.array(wavsignal)
    wav_length = len(wavsignal)
    range0_end = int(len(wavsignal)/fs*1000 - time_window) // 10  # number of windows produced
    data_input = np.zeros((range0_end, 200), dtype=np.float)  # holds the final frequency features
    data_line = np.zeros((1, 400), dtype=np.float)
    for i in range(0, range0_end):
        p_start = i * 160
        p_end = p_start + 400
        data_line = wav_arr[p_start:p_end]
        data_line = data_line * w  # apply the window
        data_line = np.abs(fft(data_line))
        data_input[i] = data_line[0:200]  # keep half of the 400 bins; the spectrum is symmetric

    data_input = np.log(data_input + 1)
    #data_input = data_input[::]
    return data_input
fbank = compute_fbank(wav_lst[0])
print(fbank.shape)

This one also has output: (777, 200) — 777 frames, one per 25 ms window hopped every 10 ms, by 200 frequency bins, half of the 400-point FFT.
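
If you want to sanity-check that frame count, here is the same arithmetic in isolation (the sample count is hypothetical, picked to give a ~7.8 s clip at THCHS-30's 16 kHz rate):

fs = 16000          # THCHS-30 audio is 16 kHz
n_samples = 124800  # hypothetical clip length (~7.8 s)
n_frames = int(n_samples / fs * 1000 - 25) // 10  # same formula as compute_fbank
print(n_frames)     # 777

Next, enter the following code to build the batch generator: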

total_nums = 10000
batch_size = 4
batch_num = total_nums // batch_size
from random import shuffle
shuffle_list = list(range(10000))
shuffle(shuffle_list)
def get_batch(batch_size, shuffle_list, wav_lst, label_data, vocab):
    for i in range(10000//batch_size):
        wav_data_lst = []
        label_data_lst = []
        begin = i * batch_size
        end = begin + batch_size
        sub_list = shuffle_list[begin:end]
        for index in sub_list:
            fbank = compute_fbank(wav_lst[index])
            fbank = fbank[:fbank.shape[0] // 8 * 8, :]  # trim the time axis to a multiple of 8 (the CNN pools 2x three times)
            label = word2id(label_data[index], vocab)
            wav_data_lst.append(fbank)
            label_data_lst.append(label)
        yield wav_data_lst, label_data_lst
batch = get_batch(4, shuffle_list, wav_lst, label_data, vocab)
wav_data_lst, label_data_lst = next(batch)
for wav_data in wav_data_lst:
    print(wav_data.shape)
for label_data in label_data_lst:
    print(label_data)
lens = [len(wav) for wav in wav_data_lst]
print(max(lens))
print(lens)
def wav_padding(wav_data_lst):
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    wav_lens = np.array([leng//8 for leng in wav_lens])  # lengths after 8x downsampling, used as the CTC input_length
    new_wav_data_lst = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i in range(len(wav_data_lst)):
        new_wav_data_lst[i, :wav_data_lst[i].shape[0], :, 0] = wav_data_lst[i]
    return new_wav_data_lst, wav_lens

pad_wav_data_lst, wav_lens = wav_padding(wav_data_lst)
print(pad_wav_data_lst.shape)
print(wav_lens)
def label_padding(label_data_lst):
    label_lens = np.array([len(label) for label in label_data_lst])
    max_label_len = max(label_lens)
    new_label_data_lst = np.zeros((len(label_data_lst), max_label_len))
    for i in range(len(label_data_lst)):
        new_label_data_lst[i][:len(label_data_lst[i])] = label_data_lst[i]
    return new_label_data_lst, label_lens

pad_label_data_lst, label_lens = label_padding(label_data_lst)
print(pad_label_data_lst.shape)
print(label_lens)

The prints show each sample's feature shape and label, then the padded batch shapes and the length arrays.

Then enter the code below and click Run to build the generator that yields data in training format (this cell has no output). The dict keys match the named Input layers of the model built in the next section, and the 'ctc' outputs are dummy zeros because the CTC loss is computed inside the network by a Lambda layer.

def data_generator(batch_size, shuffle_list, wav_lst, label_data, vocab):
    for i in range(len(wav_lst)//batch_size):
        wav_data_lst = []
        label_data_lst = []
        begin = i * batch_size
        end = begin + batch_size
        sub_list = shuffle_list[begin:end]
        for index in sub_list:
            fbank = compute_fbank(wav_lst[index])
            pad_fbank = np.zeros((fbank.shape[0]//8*8+8, fbank.shape[1]))
            pad_fbank[:fbank.shape[0], :] = fbank
            label = word2id(label_data[index], vocab)
            wav_data_lst.append(pad_fbank)
            label_data_lst.append(label)
        pad_wav_data, input_length = wav_padding(wav_data_lst)
        pad_label_data, label_length = label_padding(label_data_lst)
        inputs = {'the_inputs': pad_wav_data,
                  'the_labels': pad_label_data,
                  'input_length': input_length,
                  'label_length': label_length,
                 }
        outputs = {'ctc': np.zeros(pad_wav_data.shape[0],)} 
        yield inputs, outputs

Building the Model

Continue with the code below. The training input is the time-frequency map and the labels are the matching pinyin sequences; the recognition model uses a CNN + CTC structure:

def conv2d(size):
    return Conv2D(size, (3,3), use_bias=True, activation='relu',
        padding='same', kernel_initializer='he_normal')
def norm(x):
    return BatchNormalization(axis=-1)(x)
def maxpool(x):
    return MaxPooling2D(pool_size=(2,2), strides=None, padding="valid")(x)
def dense(units, activation="relu"):
    return Dense(units, activation=activation, use_bias=True, kernel_initializer='he_normal')
def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x

def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    y_pred = y_pred[:, :, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

class Amodel():
    """docstring for Amodel."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        self._model_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h4)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)
    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')\
            ([self.labels, self.outputs, self.input_length, self.label_length])
        self.ctc_model = Model(inputs=[self.labels, self.inputs,
            self.input_length, self.label_length], outputs=self.loss_out)

    def opt_init(self):
        opt = Adam(lr = 0.0008, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01, epsilon = 10e-8)
        #self.ctc_model=multi_gpu_model(self.ctc_model,gpus=2)
        self.ctc_model.compile(loss={'ctc': lambda y_true, output: output}, optimizer=opt)
am = Amodel(len(vocab))
am.ctc_model.summary()

Then click Run; it prints the model summary.

Training the Model

Continue with the code below to create the speech recognition model and run the training loop:

total_nums = 100
batch_size = 20
batch_num = total_nums // batch_size
epochs = 8
source_file = 'data/thchs_train.txt'
label_data,wav_lst = source_get(source_file)
vocab = mk_vocab(label_data)
vocab_size = len(vocab)
print(vocab_size)
shuffle_list = list(range(100))

am = Amodel(vocab_size)

for k in range(epochs):
    print('this is the', k+1, 'th epoch of training !!!')
    #shuffle(shuffle_list)
    batch = data_generator(batch_size, shuffle_list, wav_lst, label_data, vocab)
    am.ctc_model.fit_generator(batch, steps_per_epoch=batch_num, epochs=1)

This one runs slowly — it is the actual training and takes ten-odd minutes. If your own machine is well specced, once you've learned the flow you can train locally and stop worrying about the ¥3.5 an hour; you can also raise epochs and let it grind through, say, 50 rounds for a finer, better result. When the * beside the cell turns into a number, it's done.

Saving the Model

Finally, save the trained model into the OBS bucket. Enter the code below (no output), and again swap the uppercase OBS for your own bucket name:

am.model.save("asr-model.h5")   
with open("vocab","wb") as fw:
    pickle.dump(vocab,fw)
mox.file.copy("asr-model.h5", 's3://OBS/asr-model.h5')
mox.file.copy("vocab", 's3://OBS/vocab')

Testing the Model

Last comes the fun part: testing the model. Enter the code below to import packages and load the model and data (no output):

# imports
import pickle
from keras.models import load_model
import os
import tensorflow as tf
from keras import backend as K
import numpy as np
import scipy.io.wavfile as wav
from scipy.fftpack import fft
# load the model and the vocabulary
bm = load_model("asr-model.h5")
with open("vocab","rb") as fr:
    vocab_for_test = pickle.load(fr)

Continue with the code below (no output):

def wav_padding(wav_data_lst):
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    wav_lens = np.array([leng//8 for leng in wav_lens])
    new_wav_data_lst = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i in range(len(wav_data_lst)):
        new_wav_data_lst[i, :wav_data_lst[i].shape[0], :, 0] = wav_data_lst[i]
    return new_wav_data_lst, wav_lens
# compute the time-frequency map (spectrogram) of a signal
def compute_fbank(file):
    x = np.linspace(0, 400 - 1, 400, dtype=np.int64)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))  # Hamming window
    fs, wavsignal = wav.read(file)
    # slide a time window over the waveform with a 10 ms hop
    time_window = 25  # in ms
    window_length = fs / 1000 * time_window  # window length in samples; fixed at 400 here
    wav_arr = np.array(wavsignal)
    wav_length = len(wavsignal)
    range0_end = int(len(wavsignal)/fs*1000 - time_window) // 10  # loop end, i.e. the number of windows produced
    data_input = np.zeros((range0_end, 200), dtype=np.float)  # holds the final frequency features
    data_line = np.zeros((1, 400), dtype=np.float)
    for i in range(0, range0_end):
        p_start = i * 160
        p_end = p_start + 400
        data_line = wav_arr[p_start:p_end]
        data_line = data_line * w  # apply the window
        data_line = np.abs(fft(data_line))
        data_input[i] = data_line[0:200]  # keep 200 of the 400 bins (half), since the spectrum is symmetric
    data_input = np.log(data_input + 1)
    #data_input = data_input[::]
    return data_input
def test_data_generator(test_path):
    test_file_list = []
    for root, dirs, files in os.walk(test_path):
        for file in files:
            if file.endswith(".wav"):
                test_file = os.sep.join([root, file])
                test_file_list.append(test_file)
    print(len(test_file_list))
    for file in test_file_list:
        fbank = compute_fbank(file)
        pad_fbank = np.zeros((fbank.shape[0]//8*8+8, fbank.shape[1]))
        pad_fbank[:fbank.shape[0], :] = fbank
        test_data_list = []
        test_data_list.append(pad_fbank)
        pad_wav_data, input_length = wav_padding(test_data_list)
        yield pad_wav_data

test_path ="data_thchs30/test"
test_data = test_data_generator(test_path)

Continue with the code below to run the test:

def decode_ctc(num_result, num2word):
    result = num_result[:, :, :]
    in_len = np.zeros((1), dtype=np.int32)
    in_len[0] = result.shape[1]
    r = K.ctc_decode(result, in_len, greedy=True, beam_width=10, top_paths=1)
    r1 = K.get_value(r[0][0])
    r1 = r1[0]
    text = []
    for i in r1:
        text.append(num2word[i])
    return r1, text
for i in range(10):
    # fetch a test sample
    x = next(test_data)
    # run recognition with the trained model
    result = bm.predict(x, steps=1)
    # turn the numeric result into pinyin
    _, text = decode_ctc(result, vocab_for_test)
    print('Text result:', text)

Then click Run to execute it.

Moving On to the Language Model

Importing Packages

Enter the code below to import the packages (no output):

from tqdm import tqdm
import tensorflow as tf
import moxing as mox
import numpy as np

Data Processing

Enter the code below and click Run:

with open("data/zh.tsv", 'r', encoding='utf-8') as fout:
    data = fout.readlines()[:10000]
inputs = []
labels = []
for i in tqdm(range(len(data))):
    key, pny, hanzi = data[i].split('\t')
    inputs.append(pny.split(' '))
    labels.append(hanzi.strip('\n').split(' '))
print(inputs[:5])
print()
print(labels[:5])

def get_vocab(data):
    vocab = ['<PAD>']
    for line in tqdm(data):
        for char in line:
            if char not in vocab:
                vocab.append(char)
    return vocab
pny2id = get_vocab(inputs)
han2id = get_vocab(labels)
print(pny2id[:10])
print(han2id[:10])

input_num = [[pny2id.index(pny) for pny in line] for line in tqdm(inputs)]
label_num = [[han2id.index(han) for han in line] for line in tqdm(labels)]

# fetch batches of data, padded to equal length
def get_batch(input_data, label_data, batch_size):
    batch_num = len(input_data) // batch_size
    for k in range(batch_num):
        begin = k * batch_size
        end = begin + batch_size
        input_batch = input_data[begin:end]
        label_batch = label_data[begin:end]
        max_len = max([len(line) for line in input_batch])
        input_batch = np.array([line + [0] * (max_len - len(line)) for line in input_batch])
        label_batch = np.array([line + [0] * (max_len - len(line)) for line in label_batch])
        yield input_batch, label_batch

batch = get_batch(input_num, label_num, 4)
input_batch, label_batch = next(batch)
print(input_batch)
print(label_batch)

Building the Model

The model uses the encoder (the left half) of a self-attention Transformer. Enter the code below to implement its layer-norm layer, then click Run (no output):

# layer norm
def normalize(inputs,
              epsilon = 1e-8,
              scope="ln",
              reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta
    return outputs

Continue with the code below and click "run" to implement the embedding layer (no output):

def embedding(inputs,
              vocab_size,
              num_units,
              zero_pad=True,
              scale=True,
              scope="embedding",
              reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.contrib.layers.xavier_initializer())
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)

        if scale:
            outputs = outputs * (num_units ** 0.5) 

    return outputs

Continue with the code below and click "run" to implement the multi-head attention layer (no output):

def multihead_attention(emb,
                        queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]

        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)

        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h) 
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h) 
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h) 

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

        paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Activation
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

        # Query Masking
        query_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))  # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
        outputs *= query_masks  # broadcasting. (N, T_q, C)

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # ( h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize
        outputs = normalize(outputs)  # (N, T_q, C)

    return outputs

Continue with the code below and click "run" to implement the feed-forward layer, plus label smoothing (no output):

def feedforward(inputs,
                num_units=[2048, 512],
                scope="multihead_attention",
                reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Residual connection
        outputs += inputs

        # Normalize
        outputs = normalize(outputs)

    return outputs


def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See https://arxiv.org/abs/1512.00567.

    Args:
      inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary.
      epsilon: Smoothing rate.    

    '''
    K = inputs.get_shape().as_list()[-1]  # number of channels
    return ((1 - epsilon) * inputs) + (epsilon / K)
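
For intuition, label smoothing just pulls the one-hot targets slightly toward a uniform distribution, which regularizes the softmax. A quick standalone numeric check (plain numpy):

import numpy as np

one_hot = np.array([0., 1., 0.])                # V = 3 classes
epsilon = 0.1
smoothed = (1 - epsilon) * one_hot + epsilon / 3
print(smoothed)                                 # [0.0333 0.9333 0.0333]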

Continue with the code below and click "run" to assemble the model (no output):

class Graph():
    def __init__(self, is_training=True):
        tf.reset_default_graph()
        self.is_training = arg.is_training
        self.hidden_units = arg.hidden_units
        self.input_vocab_size = arg.input_vocab_size
        self.label_vocab_size = arg.label_vocab_size
        self.num_heads = arg.num_heads
        self.num_blocks = arg.num_blocks
        self.max_length = arg.max_length
        self.lr = arg.lr
        self.dropout_rate = arg.dropout_rate

        # input
        self.x = tf.placeholder(tf.int32, shape=(None, None))
        self.y = tf.placeholder(tf.int32, shape=(None, None))
        # embedding
        self.emb = embedding(self.x, vocab_size=self.input_vocab_size, num_units=self.hidden_units, scale=True,
                             scope="enc_embed")
        self.enc = self.emb + embedding(
            tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
            vocab_size=self.max_length, num_units=self.hidden_units, zero_pad=False, scale=False, scope="enc_pe")
        ## Dropout
        self.enc = tf.layers.dropout(self.enc,
                                     rate=self.dropout_rate,
                                     training=tf.convert_to_tensor(self.is_training))

        ## Blocks
        for i in range(self.num_blocks):
            with tf.variable_scope("num_blocks_{}".format(i)):
                ### Multihead Attention
                self.enc = multihead_attention(emb=self.emb,
                                               queries=self.enc,
                                               keys=self.enc,
                                               num_units=self.hidden_units,
                                               num_heads=self.num_heads,
                                               dropout_rate=self.dropout_rate,
                                               is_training=self.is_training,
                                               causality=False)

        ### Feed Forward
        self.outputs = feedforward(self.enc, num_units=[4 * self.hidden_units, self.hidden_units])

        # Final linear projection
        self.logits = tf.layers.dense(self.outputs, self.label_vocab_size)
        self.preds = tf.to_int32(tf.argmax(self.logits, axis=-1))
        self.istarget = tf.to_float(tf.not_equal(self.y, 0))
        self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y)) * self.istarget) / (
            tf.reduce_sum(self.istarget))
        tf.summary.scalar('acc', self.acc)

        if is_training:
            # Loss
            self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=self.label_vocab_size))
            self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
            self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))

            # Training Scheme
            self.global_step = tf.Variable(0, name='global_step', trainable=False)
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
            self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)

            # Summary 
            tf.summary.scalar('mean_loss', self.mean_loss)
            self.merged = tf.summary.merge_all()

Training the Model

Continue with the code below and click "run" to set the hyperparameters (no output):

def create_hparams():
    params = tf.contrib.training.HParams(
        num_heads=8,
        num_blocks=6,
        # vocab
        input_vocab_size=50,
        label_vocab_size=50,
        # embedding size
        max_length=100,
        hidden_units=512,
        dropout_rate=0.2,
        lr=0.0003,
        is_training=True)
    return params


arg = create_hparams()
arg.input_vocab_size = len(pny2id)
arg.label_vocab_size = len(han2id)

Continue with the code below and click "run" to train the model:

import os
epochs = 3
batch_size = 4

g = Graph(arg.is_training)  # pass the flag itself, not the whole HParams object

saver = tf.train.Saver()
with tf.Session() as sess:
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    if os.path.exists('logs/model.meta'):
        saver.restore(sess, 'logs/model')
    writer = tf.summary.FileWriter('tensorboard/lm', tf.get_default_graph())
    for k in range(epochs):
        total_loss = 0
        batch_num = len(input_num) // batch_size
        batch = get_batch(input_num, label_num, batch_size)
        for i in range(batch_num):
            input_batch, label_batch = next(batch)
            feed = {g.x: input_batch, g.y: label_batch}
            cost,_ = sess.run([g.mean_loss,g.train_op], feed_dict=feed)
            total_loss += cost
            if (k * batch_num + i) % 10 == 0:
                rs=sess.run(merged, feed_dict=feed)
                writer.add_summary(rs, k * batch_num + i)
        print('epochs', k+1, ': average loss = ', total_loss/batch_num)
    saver.save(sess, 'logs/model')
    writer.close()

It takes ten-odd minutes to finish; bump the epochs parameter if you want better results.

Testing the Model

Continue with the code below to test it with pinyin input:

arg.is_training = False

g = Graph(arg.is_training)  # build the inference graph without the training ops

saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, 'logs/model')
    while True:
        line = input('Input Test Content: ')
        if line == 'exit': break
        line = line.strip('\n').split(' ')
        x = np.array([pny2id.index(pny) for pny in line])
        x = x.reshape(1, -1)
        preds = sess.run(g.preds, {g.x: x})
        got = ''.join(han2id[idx] for idx in preds[0])
        print(got)

Click Run and an input prompt appears; paste in the line below to test the pinyin:

nian2 de bu4 bu4 ge4 de shang4 shi2 qu1 pei4 wai4 gu4 de nian2 ming2 de zi4 ren2 na4 ren2 bu4 zuo4 de jia1 zhong4 shi2 wei4 yu4 you3 ta1 yang2 mu4 yu4 ci3

Press Enter and the model returns the Chinese text. Type exit in the box and press Enter to stop the loop.

Saving the Model

Finally, save the trained language model into the OBS bucket so it can be reused later:

!zip -r languageModel.zip logs  # recursively zip the folder holding the model
mox.file.copy("languageModel.zip", 's3://OBS/languageModel.zip')

As usual, swap OBS for your bucket name. Check the OBS bucket afterwards and the freshly trained model is there. That wraps up the experiment.
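
One last chore, per the warning back in the bucket-creation step: pull the artifacts down to your machine and delete the cloud copies so the bucket stops billing. A minimal sketch with the same Python OBS SDK as before (esdk-obs-python; the bucket name is a placeholder and the exact call signatures are assumptions worth checking against the SDK docs):

from obs import ObsClient

client = ObsClient(access_key_id='YOUR_AK',
                   secret_access_key='YOUR_SK',
                   server='https://obs.cn-north-4.myhuaweicloud.com')

artifacts = ['asr-model.h5', 'vocab', 'languageModel.zip']
for key in artifacts:
    client.getObject('my-speech-bucket', key, downloadPath=key)  # download a copy
for key in artifacts:
    client.deleteObject('my-speech-bucket', key)                 # then stop the billing
client.close()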

Looking back at the language-model part, we implemented quite a few layers — the embedding layer, the multi-head attention layer, the feed-forward layer, and so on — all of them making up the left-side (encoder) half of the self-attention Transformer.

That's a lot of code — call it medium difficulty. If you find a mistake in the code, leave a comment or message me; same goes if anything is unclear.
