Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers

dc.contributor.author: Uçar, Bilal Emir
dc.contributor.author: Buluz Kömeçoğlu, Başak
dc.contributor.author: Güven, Ramazan
dc.contributor.author: Coşkun, Ali Kemal
dc.contributor.author: Kömeçoğlu, Yavuz
dc.date.accessioned: 2026-03-05T07:20:34Z
dc.date.available: 2026-03-05T07:20:34Z
dc.date.issued: 2025
dc.department: Faculties, Faculty of Engineering, Department of Computer Engineering
dc.description: 10th International Conference on Computer Science and Engineering, UBMK, Istanbul, 17-21 September 2025
dc.description.abstract: This paper presents a practical, end-to-end implementation of a multimodal Retrieval-Augmented Generation (RAG) pipeline that integrates two traditionally underutilized but information-rich modalities in the news domain: transcribed television broadcasts and digitized newspapers. While RAG has shown significant success in domains with clean digital text, its application to traditional media remains limited due to the inherent challenges in processing unstructured, noisy, and layout-complex content. To address these challenges, we propose a layout-aware document ingestion pipeline for digitized newspapers, powered by a trained semantic segmentation model. For television broadcasts, we integrate an optimized automatic speech recognition (ASR) and chunking framework. Both modalities are indexed under a unified hybrid retrieval architecture that combines dense and sparse representations to support accurate and semantically rich document retrieval. The system is deployed entirely on-premise and evaluated on a proprietary Turkish news corpus using standard retrieval metrics across both modalities.
dc.identifier.doi: 10.1109/UBMK67458.2025.11206784
dc.identifier.endpage: 462
dc.identifier.issn: 2521-1641
dc.identifier.issue: 2025
dc.identifier.scopus: 2-s2.0-105030855810
dc.identifier.scopusquality: N/A
dc.identifier.startpage: 457
dc.identifier.uri: https://doi.org/10.1109/UBMK67458.2025.11206784
dc.identifier.uri: https://hdl.handle.net/11501/2658
dc.indekslendigikaynak: Scopus
dc.institutionauthor: Buluz Kömeçoğlu, Başak
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers Inc.
dc.relation.ispartof: 10th International Conference on Computer Science and Engineering, UBMK 2025
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.subject: Automatic Speech Recognition
dc.subject: Large Language Model
dc.subject: Multimodal Retrieval
dc.subject: Retrieval Augmented Generation
dc.title: Retrieval-augmented generation over low-structure media: news from TV and digitized newspapers
dc.type: Conference Object

Files

Original bundle
Showing 1 - 1 of 1
Closed Access
Name: Tam Metin / Full Text
Size: 1.43 MB
Format: Adobe Portable Document Format