Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers

Date

2025

Journal Title

Journal ISSN

Volume Title

Publisher

Institute of Electrical and Electronics Engineers Inc.

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

This paper presents a practical, end-to-end implementation of a multimodal Retrieval-Augmented Generation (RAG) pipeline that integrates two traditionally underutilized but information-rich modalities in the news domain: transcribed television broadcasts and digitized newspapers. While RAG has shown significant success in domains with clean digital text, its application to traditional media remains limited due to the inherent challenges of processing unstructured, noisy, and layout-complex content. To address these challenges, we propose a layout-aware document ingestion pipeline for digitized newspapers, powered by a trained semantic segmentation model. For television broadcasts, we integrate an optimized automatic speech recognition (ASR) and chunking framework. Both modalities are indexed under a unified hybrid retrieval architecture that combines dense and sparse representations to support accurate and semantically rich document retrieval. The system is deployed entirely on-premise and evaluated on a proprietary Turkish news corpus using standard retrieval metrics across both modalities.

Description

10th International Conference on Computer Science and Engineering, UBMK, Istanbul, 17-21 September 2025

Keywords

Automatic Speech Recognition, Large Language Model, Multimodal Retrieval, Retrieval Augmented Generation

Source

10th International Conference on Computer Science and Engineering, UBMK 2025

WoS Q Value

Scopus Q Value

N/A

Volume

Issue

2025

Citation