Florian Keßler, Diane Donner, Shuyi Li (Erlangen): multiplyitbyxreadingconcordancesinalanguagewithoutwordorsentenceboundaries

Date: 21 March 2025
Time: 11:30 – 14:00
Location: Kollegienhaus, Universitätsstraße 15, 91054 Erlangen

Join us for the RC21 Project Symposium, where invited speakers, project team members, and poster presenters will present their work on the methodology and applications of concordance analysis!

 

Abstract:

Reading concordance lines is one of the most popular methods in corpus-based research. However, most of this research is conducted on corpora in Western languages, which share orthographic features such as spaces between words and punctuation separating clauses. In contrast, Literary Chinese, the written language of imperial China, was written in scriptura continua, without spaces or punctuation marks. How does this affect the quality of concordance reading, and can large language models (LLMs) make the process more efficient by decomposing the text into words and sentences? To answer these questions, in this ongoing project we read concordances extracted from five ancient Chinese mathematical texts preprocessed with different segmentation strategies, with an emphasis on discovering valency patterns of common operators.

For example, in the texts, an operand used in an operation is often introduced with the particle “yi 以”, giving rise to constructions such as “multiply it by the 23 people” (以二十三人乘之). But this is not the only way of stating that operation, as we also find constructions such as “multiply x and y with each other” (x y xiang cheng 相乘). How were such patterns distributed, and did they change over time? We use these questions as the background to our exploration of different preprocessing strategies for concordance reading. In particular, we read concordances of common operators such as “cheng 乘” (to multiply) in four different versions of the same corpus of ancient Chinese mathematical works: 1) with neither punctuation nor word segmentation, i.e. true to the original form of the corpus; 2) with crowd-sourced punctuation (obtained from Wikisource) and no word segmentation; 3) with automatic punctuation and no word segmentation; 4) with crowd-sourced punctuation and automatic word segmentation. To derive the four versions, we have developed a pipeline around an LLM fine-tuned for Literary Chinese, allowing us to systematically evaluate whether additional segmentation is beneficial for concordancing in Literary Chinese and, if so, whether automatic segmentation is on par with human efforts.
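
To make the difference between the corpus versions concrete, the following is a minimal sketch of keyword-in-context (KWIC) extraction, once over raw characters (as in an unsegmented text) and once over word tokens (as in a segmented text). It is not the project's pipeline: the function names `kwic` and `kwic_tokens`, the window sizes, and the word segmentation of the example phrase are assumptions made purely for illustration.

```python
def kwic(text, keyword, width=4):
    """Character-based KWIC: return (left, keyword, right) for each hit.

    Suitable for text without word boundaries, where the context
    window can only be measured in characters.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        # Continue searching one character further so overlapping hits are found
        start = text.find(keyword, start + 1)
    return hits


def kwic_tokens(tokens, keyword, width=2):
    """Token-based KWIC: the context window is measured in words."""
    return [
        (tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width])
        for i, tok in enumerate(tokens)
        if tok == keyword
    ]


# The phrase quoted in the abstract, without any segmentation
raw = "以二十三人乘之"
# Hypothetical word segmentation of the same phrase (an assumption for illustration)
segmented = "以 二十三 人 乘 之".split()

for left, kw, right in kwic(raw, "乘"):
    print(f"{left} [{kw}] {right}")

for left, kw, right in kwic_tokens(segmented, "乘"):
    print(f"{' '.join(left)} [{kw}] {' '.join(right)}")
```

In the character-based version a fixed window may cut through a multi-character word such as a number, whereas the token-based version keeps words whole but depends on the quality of the segmentation.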

Literature

Bin Li, Yiguo Yuan, Jingya Lu, Minxuan Feng, Chao Xu, Weiguang Qu, and Dongbo Wang. 2022. The First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff: Overview of the EvaHan 2022 Evaluation Campaign. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 135–140, Marseille, France. European Language Resources Association.

Bin Li, Bolin Chang, Zhixing Xu, Minxuan Feng, Chao Xu, Weiguang Qu, Si Shen, and Dongbo Wang. 2024. Overview of EvaHan2024: The First International Evaluation on Ancient Chinese Sentence Segmentation and Punctuation. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 229–236, Torino, Italia. ELRA and ICCL.

Event categories:
RC21