Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers
| dc.contributor.author | Uçar, Bilal Emir | |
| dc.contributor.author | Buluz Kömeçoǧlu, Başak | |
| dc.contributor.author | Güven, Ramazan | |
| dc.contributor.author | Coşkun, Ali Kemal | |
| dc.contributor.author | Kömeçoǧlu, Yavuz | |
| dc.date.accessioned | 2026-03-05T07:20:34Z | |
| dc.date.available | 2026-03-05T07:20:34Z | |
| dc.date.issued | 2025 | |
| dc.department | Faculties, Faculty of Engineering, Department of Computer Engineering | |
| dc.description | 10th International Conference on Computer Science and Engineering, UBMK, Istanbul, 17-21 September 2025 | |
| dc.description.abstract | This paper presents a practical, end-to-end implementation of a multimodal Retrieval-Augmented Generation (RAG) pipeline that integrates two traditionally underutilized but information-rich modalities in the news domain: transcribed television broadcasts and digitized newspapers. While RAG has shown significant success in domains with clean digital text, its application to traditional media remains limited due to the inherent challenges of processing unstructured, noisy, and layout-complex content. To address these challenges, we propose a layout-aware document ingestion pipeline for digitized newspapers, powered by a semantic segmentation model trained. For television broadcasts, we integrate an optimized automatic speech recognition (ASR) and chunking framework. Both modalities are indexed under a unified hybrid retrieval architecture that combines dense and sparse representations to support accurate and semantically rich document retrieval. The system is deployed entirely on-premise and evaluated on a proprietary Turkish news corpus using standard retrieval metrics across both modalities. | |
| dc.identifier.doi | 10.1109/UBMK67458.2025.11206784 | |
| dc.identifier.endpage | 462 | |
| dc.identifier.issn | 2521-1641 | |
| dc.identifier.issue | 2025 | |
| dc.identifier.scopus | 2-s2.0-105030855810 | |
| dc.identifier.scopusquality | N/A | |
| dc.identifier.startpage | 457 | |
| dc.identifier.uri | https://doi.org/10.1109/UBMK67458.2025.11206784 | |
| dc.identifier.uri | https://hdl.handle.net/11501/2658 | |
| dc.indekslendigikaynak | Scopus | |
| dc.institutionauthor | Buluz Kömeçoǧlu, Başak | |
| dc.language.iso | en | |
| dc.publisher | Institute of Electrical and Electronics Engineers Inc. | |
| dc.relation.ispartof | 10th International Conference on Computer Science and Engineering, UBMK 2025 | |
| dc.relation.publicationcategory | Conference Item - International - Institutional Faculty Member | |
| dc.rights | info:eu-repo/semantics/closedAccess | |
| dc.subject | Automatic Speech Recognition | |
| dc.subject | Large Language Model | |
| dc.subject | Multimodal Retrieval | |
| dc.subject | Retrieval Augmented Generation | |
| dc.title | Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers | |
| dc.type | Conference Object |











