Corpus

Date: 24-11-18, posted by a site user

Background Information

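The Concept of a Corpus

A corpus is a collection of representative language material gathered by random sampling according to established linguistic principles; it is a sample of the language.

The term usually refers to a large electronic collection of texts, of a defined size, compiled for linguistic research. It is made up of samples of spoken and written language that stand for a particular language or language variety, or it is a set of texts that have been processed and annotated with linguistic information.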

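Classification of Corpora

By the languages involved, corpora can be divided into monolingual corpora, bilingual parallel corpora, and multilingual corpora.

By subject matter, corpora can be divided into general corpora and specialized corpora.

By the source of the material, corpora can be divided into spoken corpora and written corpora.

By whether they have been annotated, corpora can be divided into raw corpora and annotated corpora.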

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts, now usually stored and processed electronically. Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific universe of texts.
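As a small illustration of "checking occurrences", the sketch below counts word frequencies over a toy corpus. The three sentences and the function names `tokenize` and `word_frequencies` are assumptions made up for this example; a real study would use a far larger sample and a tokenizer suited to the language.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequencies(texts: list[str]) -> Counter:
    """Count how often each token occurs across all texts in the corpus."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Hypothetical three-sentence corpus, for illustration only.
toy_corpus = [
    "The corpus is a sample of language.",
    "A corpus may be annotated.",
    "The sample is small.",
]

freqs = word_frequencies(toy_corpus)
print(freqs["corpus"], freqs["annotated"])
```

Counts like these are the raw material for the statistical analysis mentioned above; hypothesis testing would compare them across corpora or against an expected distribution.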

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
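To make the idea of annotation concrete, here is a minimal sketch that attaches a part-of-speech tag and a lemma to each word. The mini-lexicon, the tag names, and the `word/TAG/lemma` layout are all assumptions chosen for illustration; real taggers are trained on annotated data rather than driven by a hand-written table.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> (POS tag, lemma).
LEXICON = {
    "the": ("DET", "the"),
    "researchers": ("NOUN", "researcher"),
    "annotated": ("VERB", "annotate"),
    "two": ("NUM", "two"),
    "corpora": ("NOUN", "corpus"),
}


@dataclass
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag
    lemma: str  # base form of the word


def annotate(sentence: str) -> list[Token]:
    """Attach a POS tag and a lemma to each whitespace-separated word."""
    tokens = []
    for form in sentence.split():
        # Unknown words get the placeholder tag "X" and keep their form as lemma.
        pos, lemma = LEXICON.get(form.lower(), ("X", form.lower()))
        tokens.append(Token(form, pos, lemma))
    return tokens


def to_tagged_text(tokens: list[Token]) -> str:
    """Render tokens in a simple word/TAG/lemma layout."""
    return " ".join(f"{t.form}/{t.pos}/{t.lemma}" for t in tokens)


tagged = annotate("The researchers annotated two corpora")
print(to_tagged_text(tagged))
# The/DET/the researchers/NOUN/researcher annotated/VERB/annotate two/NUM/two corpora/NOUN/corpus
```

A raw corpus would store only the sentence itself; the annotated version stores the extra fields alongside each word.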

Terminology:

双语或多语语料库 Bilingual or multilingual corpus

机器翻译技术 machine translation technology

双语词典编纂技术 bilingual lexicography technique

跟踪研究工作 follow-up study

设计、采集、编码和管理 design, collection, encoding and management

Translation Version:

关于双语或多语语料库的研究目前大致可分为三类:

Current research on bilingual or multilingual corpora falls broadly into three categories:

一是研究双语语料的对齐技术(Alignment),国内外学者就此提出多种策略和方法,现在已经出现了许多对齐双语或多语语料的程序或工具;

The first is research on alignment techniques for bilingual corpora. Scholars in China and abroad have proposed a variety of strategies and methods, and many programs and tools for aligning bilingual or multilingual corpora are now available.
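One family of alignment strategies relies on sentence length: a long source sentence tends to correspond to a long translation. The sketch below is a simplified length-based aligner using dynamic programming. It is not any particular published tool; the bead shapes, the penalty values in `BEADS`, and the mismatch formula are assumptions picked for illustration.

```python
import math

# Allowed bead shapes (source sentences, target sentences) with a fixed
# penalty per shape. The penalty values are illustrative assumptions.
BEADS = {(1, 1): 0.0, (1, 2): 2.0, (2, 1): 2.0, (1, 0): 4.0, (0, 1): 4.0}


def align(src: list[str], tgt: list[str]) -> list[tuple[list[int], list[int]]]:
    """Align two sentence lists by character length with dynamic programming."""
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    # Expected number of target characters per source character.
    ratio = sum(tgt_len) / max(sum(src_len), 1)

    n, m = len(src), len(tgt)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == math.inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                expected = sum(src_len[i:ni]) * ratio
                actual = sum(tgt_len[j:nj])
                # Length mismatch, scaled so long and short beads are comparable.
                mismatch = abs(expected - actual) / math.sqrt(expected + actual + 1)
                total = cost[i][j] + mismatch + penalty
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


# Toy example: the second source sentence was translated as two sentences.
zh = ["语料库是语言材料的样本。", "国内外学者提出了多种对齐策略和方法,现在已经出现了许多对齐工具。"]
en = [
    "A corpus is a sample of language material.",
    "Scholars have proposed many alignment strategies.",
    "Many alignment tools now exist.",
]
for src_ids, tgt_ids in align(zh, en):
    print(src_ids, "<->", tgt_ids)
```

Each result pairs a group of source sentence indices with a group of target sentence indices. Length alone is a weak signal, so practical tools usually combine it with lexical cues such as a bilingual dictionary.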

二是研究双语语料的各种应用,如在基于统计的机器翻译技术、基于实例的机器翻译技术,双语词典编纂技术中,双语语料库都发挥着十分重要的作用;

The second is research on the various applications of bilingual corpora. For example, bilingual corpora play a very important role in statistical machine translation, example-based machine translation, and bilingual lexicography.

三是双语语料库的设计、采集、编码和管理问题。目前比较著名的语料库编码方案有TEI 文本编码标准以及CES标准,两者均基于SGML标记语言。

The third concerns the design, collection, encoding, and management of bilingual corpora. The better-known corpus encoding schemes at present are the TEI text encoding standard and the CES standard, both of which are based on the SGML markup language.
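To give a feel for what encoding an aligned pair might look like, the sketch below builds a small TEI-style fragment with Python's standard library. It assumes the XML syntax that later TEI releases adopted rather than SGML. The element names `ab`, `s`, `linkGrp`, and `link` follow TEI naming, but the fragment is simplified and has not been validated against the TEI schema; the identifier pattern `zh.1` / `en.1` and the function name `encode_aligned_pair` are made up for this example.

```python
import xml.etree.ElementTree as ET


def encode_aligned_pair(pair_id: int, zh: str, en: str) -> ET.Element:
    """Build a TEI-style fragment holding one aligned sentence pair."""
    block = ET.Element("ab")
    for lang, text in (("zh", zh), ("en", en)):
        # Each sentence gets an identifier and a language code.
        sentence = ET.SubElement(
            block, "s", {"xml:id": f"{lang}.{pair_id}", "xml:lang": lang}
        )
        sentence.text = text
    # The link records which sentence identifiers are translations of each other.
    link_grp = ET.SubElement(block, "linkGrp", {"type": "alignment"})
    ET.SubElement(link_grp, "link", {"target": f"#zh.{pair_id} #en.{pair_id}"})
    return block


fragment = encode_aligned_pair(
    1, "语料库是语言材料的样本。", "A corpus is a sample of language material."
)
xml_text = ET.tostring(fragment, encoding="unicode")
print(xml_text)
```

Keeping the alignment in separate link elements, instead of interleaving the two texts, lets each language version stay readable on its own while the correspondence remains machine-checkable.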

就前两类研究来说,中国国内目前做了较多的跟踪研究工作,而对于第三类研究,即双语语料库尤其是涉及汉语的双语语料库的建设、编码和管理研究,探索工作似乎做的相对较少。

For the first two categories, a good deal of follow-up research has been done in China. For the third category, namely the construction, encoding, and management of bilingual corpora, especially those involving Chinese, relatively little exploratory work seems to have been done.
