Language model

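A language model is a probability distribution over sequences of words in a natural language^[1]^[2]. Given a word sequence of length $m$, $w_{1},w_{2},\ldots ,w_{m}$, it assigns a probability $P(w_{1},\ldots ,w_{m})$ to the whole sequence. A language model can be used to determine which words are more likely to occur, or to predict the most likely next word from a number of preceding context words.^[3]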

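Language models are used in many natural language processing applications, such as speech recognition^[4], machine translation^[5], part-of-speech tagging, parsing^[6], handwriting recognition^[7], and information retrieval. Because phrases and sentences can be arbitrarily long, many word strings never appear in the data used to train a language model (the data sparsity problem), which makes it hard to estimate the probability of a string from a corpus. This is why approximate, smoothed n-gram models are used.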

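In speech recognition and data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence.

In speech recognition, sounds are matched with word sequences. Ambiguities are easier to resolve when evidence from the language model is combined with a pronunciation model and an acoustic model.

In information retrieval, a separate language model is associated with each document in a collection. Given a query $Q$ as input, documents are ranked by the probability $P(Q|M_{d})$ that the document's language model $M_{d}$ would generate the query.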

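Model types

Unigram

A unigram model can be viewed as a combination of several one-state finite automata^[8]. It treats the probabilities of the terms in a context as independent, so that $P(t_{1}t_{2}t_{3})=P(t_{1})P(t_{2}\mid t_{1})P(t_{3}\mid t_{1}t_{2})$ is approximated by $P_{\text{uni}}(t_{1}t_{2}t_{3})=P(t_{1})P(t_{2})P(t_{3})$.

In this model, the probability of each word depends only on that word's own probability in the document, so the units are one-state finite automata. Each automaton has a probability distribution over the entire vocabulary of the model, summing to 1. The following is a unigram model of a document.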

Term	Probability in doc
a	0.1
world	0.2
likes	0.05
we	0.05
share	0.3
...	...

$\sum _{\text{term in doc}}P({\text{term}})=1$

The probability generated for a specific query is calculated as

$P({\text{query}})=\prod _{\text{term in query}}P({\text{term}})$
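
As a worked instance of this product, assume a hypothetical two-word query "we share" (the query is an illustration, not part of the source) and take the probabilities from the table above:

$P({\text{we share}})=P({\text{we}})\,P({\text{share}})=0.05\times 0.3=0.015$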

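Different documents have different unigram models, with different hit probabilities for the words in them. The probability distributions from different documents are used to generate a hit probability for each query, and documents can be ranked for a query according to these probabilities. An example of unigram models of two documents:

Term	Probability in Doc1	Probability in Doc2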
a	0.1	0.3
world	0.2	0.1
likes	0.05	0.03
we	0.05	0.02
share	0.3	0.2
...	...	...
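
The ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the probabilities are copied from the two-document table, while the query "a world" and the function names `query_likelihood` and `rank_documents` are assumptions made for the example. No smoothing is applied, so a term missing from a model gives the whole query probability 0.

```python
from math import prod

# Unigram probabilities copied from the two-document table above.
doc_models = {
    "Doc1": {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3},
    "Doc2": {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2},
}


def query_likelihood(query_terms, model):
    """P(query) under a unigram model: the product of per-term probabilities."""
    # Terms missing from the model get probability 0 (no smoothing here).
    return prod(model.get(term, 0.0) for term in query_terms)


def rank_documents(query_terms, models):
    """Return (document, score) pairs sorted from most to least likely."""
    scores = {name: query_likelihood(query_terms, m) for name, m in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical query, chosen only for illustration.
ranking = rank_documents(["a", "world"], doc_models)
print(ranking)
```

For this query Doc2 scores about 0.3 × 0.1 = 0.03 and Doc1 about 0.1 × 0.2 = 0.02, so Doc2 is ranked first.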

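In information retrieval contexts, unigram language models are often smoothed to avoid cases where $P({\text{term}})=0$. A common approach is to build a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document.^[9]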
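
A minimal sketch of this linear interpolation follows. The interpolation weight `lam`, the toy documents, and the function names are assumptions introduced for the example; the source does not specify a weight or a corpus.

```python
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def interpolated_model(doc_tokens, collection_tokens, lam=0.5):
    """Mix the document model with the collection model.

    P(term) = lam * P_doc(term) + (1 - lam) * P_collection(term)
    `lam` is a hypothetical weight; in practice it would be tuned.
    """
    p_doc = mle(doc_tokens)
    p_coll = mle(collection_tokens)
    return {
        term: lam * p_doc.get(term, 0.0) + (1 - lam) * p_coll[term]
        for term in p_coll
    }


# Hypothetical two-document collection, for illustration only.
docs = [
    ["we", "share", "a", "world"],
    ["a", "world", "likes", "a", "world"],
]
collection = [term for doc in docs for term in doc]

smoothed = interpolated_model(docs[0], collection, lam=0.5)
print(smoothed["likes"])  # nonzero although "likes" is absent from docs[0]
```

Because every term in the collection has a nonzero collection probability, no term in the smoothed document model is left at zero, and the smoothed probabilities still sum to 1.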

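n-gram

In an n-gram model, the probability $P(w_{1},\ldots ,w_{m})$ of observing the sequence $w_{1},\ldots ,w_{m}$ is approximated as

$P(w_{1},\ldots ,w_{m})=\prod _{i=1}^{m}P(w_{i}\mid w_{1},\ldots ,w_{i-1})\approx \prod _{i=1}^{m}P(w_{i}\mid w_{i-(n-1)},\ldots ,w_{i-1})$

Here we make the Markov assumption: the occurrence of a word does not depend on all the preceding words in the sentence, only on the $n-1$ words immediately before it (an $(n-1)$th-order Markov property). That is, the probability of observing the $i$-th word $w_{i}$ given the preceding $i-1$ words is approximated by the probability of observing it given only the preceding $n-1$ words (word $i-(n-1)$ through word $i-1$). These $n-1$ context words together with $w_{i}$ form an n-gram.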

The conditional probability can be calculated from n-gram frequency counts:

$P(w_{i}\mid w_{i-(n-1)},\ldots ,w_{i-1})={\frac {\mathrm {count} (w_{i-(n-1)},\ldots ,w_{i-1},w_{i})}{\mathrm {count} (w_{i-(n-1)},\ldots ,w_{i-1})}}$
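
The count ratio can be sketched for the bigram case ($n=2$) as follows. The two-sentence corpus and the function name `train_bigram_mle` are assumptions made for illustration, and the boundary markers `<s>` and `</s>` are added to every sentence.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Return a function prob(word, prev) estimated from raw counts."""
    bigram_counts = Counter()   # count(prev, word)
    context_counts = Counter()  # count(prev), counted as a bigram prefix
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never seen; the ratio is undefined
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Hypothetical two-sentence corpus, for illustration only.
corpus = [
    ["I", "saw", "the", "red", "house"],
    ["I", "saw", "the", "dog"],
]
prob = train_bigram_mle(corpus)
print(prob("red", "the"))    # count(the, red) / count(the) = 1 / 2
print(prob("house", "the"))  # unseen bigram, so the estimate is 0
```

The zero estimate for the unseen bigram ("the", "house") is the data sparsity problem in miniature: any sentence containing that bigram would receive probability 0 under this unsmoothed model.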

The terms bigram and trigram language model denote n-gram models with $n=2$ and $n=3$, respectively^[10].

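Typically, n-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any n-gram that has not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good-Turing discounting or back-off models.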
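
A minimal sketch of add-one smoothing for bigrams is shown below, under the common formulation in which 1 is added to every bigram count and the vocabulary size is added to the context count, so that an unseen bigram ends up with a count of 1. The counts, the vocabulary, and the function name `add_one_prob` are hypothetical.

```python
from collections import Counter


def add_one_prob(word, prev, bigram_counts, context_counts, vocab_size):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


# Hypothetical counts for the single context "the".
vocab = ["red", "dog", "house", "</s>"]
bigram_counts = Counter({("the", "red"): 1, ("the", "dog"): 1})
context_counts = Counter({"the": 2})

probs = {
    word: add_one_prob(word, "the", bigram_counts, context_counts, len(vocab))
    for word in vocab
}
print(probs)
```

With these counts the seen bigrams get (1 + 1) / (2 + 4) each and the unseen ones get (0 + 1) / (2 + 4) each, so the distribution over the vocabulary still sums to 1 while no continuation has probability 0.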

Example

In a bigram ($n=2$) language model, the probability of the sentence I saw the red house is approximated as

${\begin{aligned}&P({\text{I, saw, the, red, house}})\\\approx {}&P({\text{I}}\mid \langle s\rangle )P({\text{saw}}\mid {\text{I}})P({\text{the}}\mid {\text{saw}})P({\text{red}}\mid {\text{the}})P({\text{house}}\mid {\text{red}})P(\langle /s\rangle \mid {\text{house}})\end{aligned}}$

whereas in a trigram ($n=3$) language model, the approximation is

${\begin{aligned}&P({\text{I, saw, the, red, house}})\\\approx {}&P({\text{I}}\mid \langle s\rangle ,\langle s\rangle )P({\text{saw}}\mid \langle s\rangle ,{\text{I}})P({\text{the}}\mid {\text{I, saw}})P({\text{red}}\mid {\text{saw, the}})P({\text{house}}\mid {\text{the, red}})P(\langle /s\rangle \mid {\text{red, house}})\end{aligned}}$

Note that the context of the first $n-1$ n-grams is filled with start-of-sentence markers, denoted <s>.
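
The padding scheme can be made concrete with a short sketch that lists the (word, context) pair behind each factor of the two products above. The function name `ngram_factors` is an assumption made for this example.

```python
def ngram_factors(words, n):
    """List (word, context) pairs, one per factor P(word | context).

    The sentence is padded with n-1 start markers and one end marker,
    so every word, including the first, has a context of n-1 tokens.
    """
    padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
    factors = []
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - (n - 1):i])
        factors.append((padded[i], context))
    return factors


sentence = ["I", "saw", "the", "red", "house"]

for n in (2, 3):
    print(f"n = {n}")
    for word, context in ngram_factors(sentence, n):
        print(f"  P({word} | {', '.join(context)})")
```

For both values of $n$ the sentence yields six factors: one per word plus one for the end-of-sentence marker.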

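Exponential

Maximum entropy language models encode the relationship between a word and its n-gram history using feature functions:

$P(w_{m}|w_{1},\ldots ,w_{m-1})={\frac {1}{Z(w_{1},\ldots ,w_{m-1})}}\exp(a^{T}f(w_{1},\ldots ,w_{m}))$

where $Z(w_{1},\ldots ,w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_{1},\ldots ,w_{m})$ is the feature function.

In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.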
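
The formula can be illustrated with a small sketch that uses indicator features. The vocabulary, the two features, and the weights are all hypothetical and hand-picked; a real model would learn the weights from data, typically with regularization.

```python
import math


def maxent_next_word_probs(history, vocab, features, weights):
    """P(w | history) = exp(a . f(history, w)) / Z(history)."""
    scores = {
        w: sum(a_k * f_k(history, w) for a_k, f_k in zip(weights, features))
        for w in vocab
    }
    # Partition function: normalizes over all candidate next words.
    z = sum(math.exp(score) for score in scores.values())
    return {w: math.exp(score) / z for w, score in scores.items()}


# Hypothetical indicator features: each returns 1 if a given n-gram is present.
features = [
    lambda history, w: 1.0 if history[-1:] == ["the"] and w == "red" else 0.0,
    lambda history, w: 1.0 if w == "house" else 0.0,
]
weights = [1.0, 0.5]  # hypothetical parameter vector a
vocab = ["red", "house", "dog"]

probs = maxent_next_word_probs(["saw", "the"], vocab, features, weights)
print(probs)
```

With these weights the bigram feature for ("the", "red") fires, so "red" receives the highest probability, followed by "house" and then "dog".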

The log-bilinear model is another example of an exponential language model.

References

[1] Jurafsky, Dan; Martin, James H. "N-gram Language Models". Speech and Language Processing (3rd ed.). 2021. Retrieved 24 May 2022. Archived from the original on 22 May 2022.
[2] Rosenfeld, Ronald. "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE. 2000, 88 (8).
[3] 王亚珅, 黄河燕. 短文本表示建模及应用. Beijing Institute of Technology Press, May 2021, p. 24.
[4] Kuhn, Roland, and Renato De Mori. "A cache-based natural language model for speech recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 12.6 (1990): 570-583.
[5] Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.
[6] Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.
[7] Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014.
[8] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval, pages 237–240. Cambridge University Press, 2009.
[9] Buttcher, Clarke, and Cormack. Information Retrieval: Implementing and Evaluating Search Engines, pages 289–291. MIT Press.
[10] Craig Trim. "What is Language Modeling?". April 26th, 2013.

External links

LMSharp - an open-source statistical language modeling toolkit that supports n-gram models (with Kneser-Ney smoothing) and recurrent neural network models

