The exponential growth of unstructured data in PDF format presents challenges for efficient extraction and comprehension. This paper introduces a Retrieval-Augmented Generation (RAG)-driven PDF data extraction framework that leverages Large Language Models (LLMs) to enhance accuracy and contextual understanding. Our approach integrates retrieval-based augmentation with generative AI to ensure precise, contextually aware extraction from complex documents. The framework consists of three key components: document preprocessing, retrieval-augmented generation, and post-processing refinement. First, PDF documents are parsed using Optical Character Recognition (OCR) for scanned content and structured parsing for digital text. A vector database (e.g., FAISS, Pinecone) then indexes document segments using embedding models (e.g., OpenAI's Ada, SBERT). At query time, relevant document chunks are retrieved via semantic search and passed as augmented context to an LLM (e.g., GPT-4, LLaMA, Mistral) for accurate extraction and summarization. The output undergoes post-processing with Named Entity Recognition (NER) and rule-based validation to refine the extracted information. This hybrid approach improves information retrieval efficiency, mitigates hallucinations, and enables domain-specific adaptation. Our method outperforms conventional PDF parsers on complex layouts, multilingual content, and domain-specific terminology, making it well suited to legal, financial, and scientific document processing. Experimental results demonstrate significant improvements in precision, recall, and response coherence over baseline methods.
Keywords: RAG, LLM, PDF Data Extraction, Semantic Search, Vector Embeddings, NLP
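
To make the pipeline described in the abstract concrete, the following is a minimal sketch of the parse → embed → index → retrieve → generate flow. It assumes pypdf for digital-text parsing, a sentence-transformers (SBERT-family) model for embeddings, a flat FAISS index for semantic search, and the OpenAI chat API for generation; the chunk size, model names, file name, and query are illustrative placeholders, and the OCR and NER post-processing stages are omitted for brevity. It is not the paper's reference implementation.

```python
"""Minimal sketch of a RAG-driven PDF extraction pipeline.
Library choices (pypdf, sentence-transformers, faiss, openai) and
all parameter values below are illustrative assumptions."""

import faiss                                   # vector index (FAISS, as named in the abstract)
import numpy as np
from pypdf import PdfReader                    # assumed parser for digital-text PDFs
from sentence_transformers import SentenceTransformer  # SBERT-style embeddings

CHUNK_SIZE = 500   # characters per chunk -- illustrative value
TOP_K = 4          # retrieved chunks per query -- illustrative value


def parse_pdf(path: str) -> list[str]:
    """Extract digital text page by page and split it into fixed-size chunks.
    (Scanned pages would instead be routed through an OCR engine.)"""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]


def build_index(chunks: list[str], model: SentenceTransformer) -> faiss.Index:
    """Embed every chunk and store the vectors in a flat FAISS index."""
    vectors = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    return index


def retrieve(query: str, index: faiss.Index, chunks: list[str],
             model: SentenceTransformer) -> list[str]:
    """Semantic search: return the TOP_K chunks closest to the query."""
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), TOP_K)
    return [chunks[i] for i in ids[0]]


def answer(query: str, context: list[str]) -> str:
    """Pass the retrieved chunks as augmented context to an LLM.
    Model name and prompt wording are placeholders; requires OPENAI_API_KEY."""
    from openai import OpenAI
    client = OpenAI()
    prompt = (
        "Using only the context below, extract the information requested.\n\n"
        "Context:\n" + "\n---\n".join(context) + f"\n\nQuery: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
    chunks = parse_pdf("contract.pdf")               # hypothetical input file
    index = build_index(chunks, model)
    query = "What is the termination clause?"
    print(answer(query, retrieve(query, index, chunks, model)))
```

In a full system, the generated output would additionally pass through NER and rule-based validation, and the flat index could be swapped for a managed vector database such as Pinecone without changing the retrieval logic.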