Research on Automatic Part-of-Speech Tagging of Pre-Qin Chinese and English Classics in the Digital Humanities (Attachment)

2021-04-05 11:00 Editor: www.jxszl.com 景先生毕设

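Part-of-speech (POS) tagging is an important step in building corpora for information processing and a foundational task in natural language processing. Automatic POS tagging has achieved a great deal for modern Chinese, but research on ancient classics remains scarce and many problems are still unsolved, even though the pre-Qin classics hold a very important place in ancient China. This thesis therefore studies POS tagging of pre-Qin Chinese and English classics in the context of the digital humanities. It mainly uses a statistics-based approach: it examines the pre-Qin Chinese and English classics in detail, compiles statistics on POS usage at different positions, word length, pronunciation, and similar properties to determine a combined feature template, and pairs that template with the conditional random field (CRF) model to obtain automatic POS tagging models for the pre-Qin Chinese and English classics. The Chinese and English tagging models reach F values (harmonic means) of 69.66% and 81.3%, respectively, which gives them considerable reference and application value. The model with the highest F value is finally used to tag the Zhanguoce (战国策) parallel corpus automatically.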
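As a rough illustration of what a combined feature template of this kind might look like, the sketch below builds features for each word of a segmented sentence from the word itself, its length, and its neighbours. The function name `token_features`, the feature names, the window size, and the sample sentence are all assumptions made for illustration. They are not the thesis's actual template, and pronunciation features are left out because no pronunciation data is given here.

```python
from typing import Dict, List


def token_features(words: List[str], i: int, window: int = 1) -> Dict[str, str]:
    """Return context features for the word at index i (hypothetical template)."""

    def word_at(j: int) -> str:
        # Positions outside the sentence map to boundary markers.
        if j < 0:
            return "<BOS>"
        if j >= len(words):
            return "<EOS>"
        return words[j]

    feats = {
        "w[0]": words[i],           # the word itself
        "len": str(len(words[i])),  # word length in characters
    }
    # Unigram features for neighbouring positions, e.g. w[-1] and w[1].
    for offset in range(-window, window + 1):
        if offset != 0:
            feats[f"w[{offset}]"] = word_at(i + offset)
    # One combined feature: the previous word joined with the current word.
    feats["w[-1]|w[0]"] = word_at(i - 1) + "|" + words[i]
    return feats


# Illustrative segmented sentence, not taken from the thesis corpus.
sentence = ["学", "而", "时", "习", "之"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Feature dictionaries of this shape, one per word, are the usual input to a linear-chain CRF trainer, which then learns a weight for each feature and tag combination.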
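Contents
Abstract (Chinese)
Keywords
Abstract (English)
Introduction
I. Review of Related Research
(1) Basic Theory of Part-of-Speech Tagging
1. The Meaning of Natural Language Processing
2. The Meaning of Part-of-Speech Tagging
(2) Basic Research Methods for Part-of-Speech Tagging
1. Rule-Based Methods
2. Statistics-Based Methods
3. Methods Combining Rules and Statistics
(3) Current State and Applications of Part-of-Speech Tagging Research
(4) Open Problems in Part-of-Speech Tagging
II. Overview of the Pre-Qin Corpora and Part-of-Speech Tags
(1) Source of the Pre-Qin Corpus and the Part-of-Speech Tags for Pre-Qin Chinese
(2) Source of the English Corpus of Pre-Qin Classics and the English Part-of-Speech Tags
III. Training and Experiments for the Automatic Part-of-Speech Tagging Models for Pre-Qin Classics
(1) The Conditional Random Field Model
(2) Determining the Feature Templates
(3) Automatic Part-of-Speech Tagging Experiments on Pre-Qin Classics
1. Automatic Part-of-Speech Tagging Experiment on Pre-Qin Chinese
2. Automatic Part-of-Speech Tagging Experiment on the English Text of Pre-Qin Classics
(4) Model-Building Workflow and Evaluation Metrics
(5) Tagging Results and Analysis of the Tagged Corpus
(6) Application of the Automatic Part-of-Speech Tagging Model
IV. Conclusion
Acknowledgements
References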
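Figure 1 Decomposition diagram of natural language processing
Figure 2 English part-of-speech tags
Figure 3 Directed graph of a first-order hidden Markov model (A) and undirected graph of a linear-chain conditional random field (B)
Figure 4 Sample of the annotated English training corpus for pre-Qin classics
Figure 5 Workflow for building the automatic part-of-speech tagging model
Figure 6 Example of automatic part-of-speech tagging on the Zhanguoce (战国策) parallel corpus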
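Table 1 Part-of-speech tags for pre-Qin Chinese
Table 2 Description of the feature templates
Table 3 Word-length distribution in pre-Qin classics
Table 4 Sample of corpus annotation for the conditional random field
Table 5 Sample of the automatically tagged Chinese corpus of pre-Qin classics
Table 6 Sample of the automatically tagged English corpus of pre-Qin classics
Table 7 Test performance of the automatic part-of-speech tagging model for pre-Qin classics
Table 8 Test performance of the automatic part-of-speech tagging model for the English text of pre-Qin classics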
Research on Automatic Part-of-Speech Tagging of Pre-Qin Chinese-English Classics under the Digital Humanities
Student majoring in Information Management and Information System ZHUANG Shimeng
Tutor WANG Dongbo
Abstract: Part-of-speech tagging is an important step in building corpora for information processing, and it is also foundational work in natural language processing. Automatic part-of-speech tagging has achieved a great deal for modern Chinese, but ancient classics have received far less study and many problems remain to be solved. The pre-Qin classics hold an extremely important place in ancient China. This thesis therefore focuses on part-of-speech tagging of pre-Qin Chinese and English classics under the digital humanities. It mainly adopts a statistics-based method: it investigates the pre-Qin Chinese-English classics in detail and compiles statistics on part-of-speech usage at different positions, word length, and pronunciation to determine a combined feature template. This template is combined with the conditional random field (CRF) model to obtain automatic POS tagging models for the pre-Qin Chinese-English classics. The F values of the Chinese and English tagging models reach 69.66% and 81.3%, respectively, giving the models considerable reference and application value.
Key words: POS Tagging; Pre-Qin Classics; Automatic Word Segmentation; Conditional Random Field Model
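The F value reported in the abstract is the harmonic mean of precision and recall. A minimal sketch of that calculation follows. The function names and the counts in the usage line are made up for illustration, and how the thesis counts correct, predicted, and gold items is not stated in this excerpt, so that part is an assumption.

```python
from typing import Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(correct: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F from raw counts.

    correct:   items the tagger labelled correctly
    predicted: items the tagger produced
    gold:      items in the reference annotation
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical counts, used only to exercise the functions.
p, r, f = evaluate(correct=8, predicted=10, gold=16)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F value by doing well on precision alone or on recall alone.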
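Introduction
With the continuous progress of information and network technology, computers have become an indispensable tool in human life. Natural language processing (NLP), the technology by which computers intelligently process human language, emerged in response. Part-of-speech (POS) tagging, one of its more basic preprocessing steps, plays a vital role in subsequent work and research.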

Original link: http://www.jxszl.com/jsj/xxaq/57787.html
