Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers

Authors: Uçar, Bilal Emir; Buluz Kömeçoǧlu, Başak; Güven, Ramazan; Coşkun, Ali Kemal; Kömeçoǧlu, Yavuz
Record date: 2026-03-05
Year: 2025
ISSN: 2521-1641
DOI: 10.1109/UBMK67458.2025.11206784
Scopus ID: 2-s2.0-105030855810
URL: https://doi.org/10.1109/UBMK67458.2025.11206784
Handle: https://hdl.handle.net/11501/2658
Conference: 10th International Conference on Computer Science and Engineering, UBMK, Istanbul, 17-21 September 2025
Type: Conference Object
Pages: 457-462
Language: en
Rights: info:eu-repo/semantics/closedAccess
Keywords: Automatic Speech Recognition; Large Language Model; Multimodal Retrieval; Retrieval Augmented Generation

Abstract: This paper presents a practical, end-to-end implementation of a multimodal Retrieval-Augmented Generation (RAG) pipeline that integrates two traditionally underutilized but information-rich modalities in the news domain: transcribed television broadcasts and digitized newspapers. While RAG has shown significant success in domains with clean digital text, its application to traditional media remains limited because unstructured, noisy, and layout-complex content is inherently hard to process. To address these challenges, we propose a layout-aware document ingestion pipeline for digitized newspapers, powered by a trained semantic segmentation model. For television broadcasts, we integrate an optimized automatic speech recognition (ASR) and chunking framework. Both modalities are indexed under a unified hybrid retrieval architecture that combines dense and sparse representations to support accurate and semantically rich document retrieval. The system is deployed entirely on-premise and evaluated on a proprietary Turkish news corpus using standard retrieval metrics across both modalities.
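
The abstract describes a hybrid retrieval architecture that combines dense and sparse representations, but this record does not say how the two result lists are merged. As an illustration only, the sketch below assumes reciprocal rank fusion (RRF), a common way to merge rankings; it is not necessarily the authors' method. The function name, the constant `k=60`, and the document IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k is a smoothing constant (assumed value, not taken from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results: a dense (embedding) retriever and a sparse
# (keyword) retriever, each returning its top three chunks.
dense_hits = ["tv_017", "paper_203", "tv_004"]
sparse_hits = ["paper_203", "paper_088", "tv_017"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only rank positions, it needs no calibration between dense similarity scores and sparse keyword scores, which is one reason it is often chosen when TV transcripts and newspaper segments share a single index.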