Infosys Springboard: PaperIQ

Virtual AIML Internship


Overview

As part of the Infosys Springboard DSAI Virtual Internship, I worked on building PaperIQ, an end-to-end intelligent system designed to analyze research papers and convert them into structured, actionable insights. The internship was structured around multiple milestones, each focusing on a critical layer of the data and AI pipeline, from ingestion to intelligence and interaction.

The experience was highly practical, involving real-world challenges such as handling unstructured documents, applying NLP techniques, and designing meaningful evaluation metrics.


Problem Statement

Research papers are dense, unstructured, and time-consuming to analyze manually. The goal of PaperIQ was to build a system that could:

  • Extract structured content from research documents
  • Summarize key insights
  • Evaluate quality and integrity
  • Enable interactive querying

Milestone 1: Data Ingestion & Structural Parsing

The first phase focused on building a robust ingestion pipeline capable of handling PDF and DOCX files. For PDFs, pdfplumber parsed each document page by page and extracted the raw text.

To improve data quality, preprocessing steps were applied using regex to remove noise, fix broken words, and normalize formatting.

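import re

import pdfplumber

page_texts = []
with pdfplumber.open("paper.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() may return None (in some pdfplumber versions) for
        # pages without extractable text, so fall back to an empty string
        page_texts.append(page.extract_text() or "")

# Join pages with a newline so words at page boundaries do not run together
text = "\n".join(page_texts)

# Collapse runs of whitespace (including line breaks) into single spaces
clean_text = re.sub(r"\s+", " ", text).strip()
print(clean_text[:500])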
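
The snippet above only collapses whitespace. A minimal sketch of the other cleanup steps mentioned (removing noise, fixing broken words) could look like the following. The function name `clean_extracted_text` and the specific rules are assumptions for illustration, not the exact rules used in PaperIQ.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Illustrative cleanup for text extracted from a PDF (assumed rules)."""
    # Rejoin words hyphenated across a line break: "summari-\nzation" -> "summarization".
    # Limitation: a genuine compound split at its hyphen is also joined.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Drop lines that contain only a number, a common page-number artifact
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse every run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "Extractive summari-\nzation is   useful.\n12\nIt reduces reading time."
print(clean_extracted_text(sample))
```

The order matters here: hyphen repair and page-number removal both rely on line breaks, so they have to run before whitespace is collapsed.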

The system was then designed to identify structural sections such as Abstract, Methodology, and Conclusion by detecting heading patterns (uppercase lines or numbered sections). This step was critical for enabling downstream analysis.
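
As a rough sketch of that heading detection, the code below treats a short line as a heading if it is fully uppercase or starts with a section number. The names `is_heading` and `split_sections`, the eight-word limit, and the "FRONT MATTER" label are all hypothetical. It also assumes line breaks are still present, so it would run on the per-page text before whitespace is collapsed.

```python
import re

# A section number such as "1", "1.", or "2.1" followed by a capitalized word
NUMBERED_HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+[A-Z]\w*")

def is_heading(line: str) -> bool:
    """Heuristic: short line that is all uppercase or starts with a section number."""
    stripped = line.strip()
    if not stripped or len(stripped.split()) > 8:
        return False
    if NUMBERED_HEADING.match(stripped):
        return True
    letters = [c for c in stripped if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def split_sections(lines):
    """Group body lines under the most recent heading seen."""
    sections = {}
    current = "FRONT MATTER"  # bucket for text that appears before any heading
    for line in lines:
        if is_heading(line):
            current = line.strip()
            sections[current] = []
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {title: " ".join(body) for title, body in sections.items()}

page_lines = [
    "ABSTRACT",
    "We study extractive summarization.",
    "1. Introduction",
    "Research papers are long.",
    "2.1 Methodology",
    "We score sentences by word frequency.",
]
sections = split_sections(page_lines)
print(list(sections))
```

A heuristic like this will misfire on short body lines that happen to start with a number, so real papers would likely need extra checks.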


Milestone 2: NLP-Based Summarization

To reduce information overload, an extractive summarization engine was developed. The approach was frequency-based, where important words were identified and used to score sentences.

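import re
from collections import Counter
from heapq import nlargest

# Count word frequencies case-insensitively, ignoring punctuation
freq = Counter(re.findall(r"\w+", clean_text.lower()))

# Naive sentence split on periods; drop empty fragments
sentences = [s.strip() for s in clean_text.split(".") if s.strip()]

# Score each sentence by the summed frequency of its words
sentence_scores = {
    s: sum(freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences
}

# Keep the five highest-scoring sentences
summary = nlargest(5, sentence_scores, key=sentence_scores.get)
print(summary)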

This allowed the system to generate concise summaries representing the most important parts of the document, which is useful in research review workflows.


Milestone 3: Quantitative Quality Metrics

A major part of the project involved building a scoring engine to evaluate research quality using measurable metrics.

Readability (Flesch Reading Ease)

def flesch_score(words, sentences, syllables):
    return 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
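
The formula needs word, sentence, and syllable counts as inputs. One way to derive them from raw text is sketched below. The vowel-group syllable counter is a common approximation and an assumption here, not necessarily what PaperIQ used, and `flesch_reading_ease` is a hypothetical wrapper name.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Compute Flesch Reading Ease from raw text using approximate counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```

Higher scores mean easier text, so dense academic prose typically lands well below a simple sentence like the one above.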

Semantic Strength (Concept Density using POS tagging)

from textblob import TextBlob

blob = TextBlob(clean_text)
nouns = [word for word, tag in blob.tags if tag.startswith('NN')]
print(len(set(nouns)))

Additional metrics included:

  • Transition word density (coherence)
  • Causal keyword detection (reasoning quality)
  • Sentence complexity and variation
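
The three additional metrics above can be approximated with simple counting. This is a sketch only: the word lists and function names are hypothetical, chosen for illustration rather than taken from the project.

```python
import re
from statistics import mean, pstdev

# Hypothetical word lists chosen for illustration only
TRANSITION_WORDS = {"however", "therefore", "moreover", "furthermore", "consequently"}
CAUSAL_PHRASES = ["because", "due to", "as a result", "leads to", "results in"]

def transition_density(text: str) -> float:
    """Share of words that are transition words (a rough coherence signal)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in TRANSITION_WORDS for w in words) / len(words)

def causal_keyword_count(text: str) -> int:
    """Number of causal phrases found, matched on word boundaries."""
    lowered = text.lower()
    return sum(len(re.findall(rf"\b{re.escape(p)}\b", lowered)) for p in CAUSAL_PHRASES)

def sentence_length_stats(text: str) -> tuple[float, float]:
    """Mean and population standard deviation of sentence length in words."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if not lengths:
        return 0.0, 0.0
    return mean(lengths), pstdev(lengths)

sample = (
    "The model overfits because the dataset is small. "
    "However, dropout helps. "
    "Therefore, accuracy improves due to regularization."
)
print(transition_density(sample))
print(causal_keyword_count(sample))
print(sentence_length_stats(sample))
```

The standard deviation of sentence length serves as a simple proxy for variation: a value near zero means every sentence is about the same length.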

This resulted in an 11-metric evaluation system providing a structured “health score” for research papers.


Milestone 4: AI Interaction & Comparative Analysis

The final stage focused on making the system interactive. A chatbot-like interface was implemented using TF-IDF and cosine similarity to retrieve the sentence in the document most relevant to a user's question.

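from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = sentences  # sentences produced in the summarization step

# Keep the fitted vectorizer separate from the document matrix it produces
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

query = ["What is the main contribution?"]
query_vec = vectorizer.transform(query)

# Return the sentence whose TF-IDF vector is closest to the query
similarity = cosine_similarity(query_vec, doc_matrix)
print(docs[similarity.argmax()])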

This enabled users to query documents directly without reading them entirely.

Additionally, a similarity matrix was implemented to compare multiple papers, helping identify overlap, technical depth, and uniqueness.
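
To illustrate the idea of a paper-to-paper similarity matrix, here is a dependency-free sketch that builds TF-IDF vectors by hand and compares every pair with cosine similarity. It is a stand-in written with the standard library only. The function names are hypothetical, and scikit-learn's TfidfVectorizer with cosine_similarity would do the same job in fewer lines.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency, so no term gets a zero weight
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(docs):
    """Pairwise cosine similarity between all documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vectors] for a in vectors]

papers = [
    "graph neural networks for molecules",
    "graph neural networks for proteins",
    "survey of readability metrics",
]
for row in similarity_matrix(papers):
    print([round(value, 2) for value in row])
```

Values close to 1 off the diagonal indicate heavy overlap between two papers, while values near 0 suggest little shared vocabulary.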


Final Outcome

The project evolved into a complete intelligent research assistant with the following capabilities:

  • Multi-format ingestion (PDF, DOCX)
  • Automated summarization
  • 11-metric quality evaluation
  • Research integrity checks (citation density, semantic depth)
  • Interactive chatbot interface
  • Multi-document comparison

The system also supported exporting structured reports in formats like PDF, LaTeX, and DOCX.


Key Learnings

This internship provided strong exposure to:

  • Handling unstructured data
  • Applying NLP in real-world scenarios
  • Designing measurable evaluation systems
  • Building end-to-end AI pipelines
  • Translating theoretical ML concepts into practical solutions

Conclusion

The Infosys Springboard DSAI internship was a hands-on experience in building a real-world AI system from scratch. It reinforced the importance of combining data processing, NLP, and system design to create meaningful solutions.

The final project, PaperIQ, reflects a complete pipeline, from raw document ingestion to intelligent insights, and closely aligns with industry-level data and AI workflows.

The complete implementation and documentation can be accessed on my GitHub.

