Infosys Springboard: PaperIQ
Virtual AIML Internship

Overview
As part of the Infosys Springboard DSAI Virtual Internship, I built PaperIQ, an end-to-end system that analyzes research papers and converts them into structured, actionable insights. The internship was organized into four milestones, each covering one layer of the data and AI pipeline: ingestion, summarization, quality scoring, and interaction.
The experience was highly practical, involving real-world challenges such as handling unstructured documents, applying NLP techniques, and designing meaningful evaluation metrics.
Problem Statement
Research papers are dense, unstructured, and time-consuming to analyze manually. The goal of PaperIQ was to build a system that could:
- Extract structured content from research documents
- Summarize key insights
- Evaluate quality and integrity
- Enable interactive querying
Milestone 1: Data Ingestion & Structural Parsing
The first phase focused on building a robust ingestion pipeline for PDF and DOCX files. PDF documents were parsed page by page with pdfplumber to extract the raw text.
To improve data quality, preprocessing steps were applied using regex to remove noise, fix broken words, and normalize formatting.
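import pdfplumber
import re

text = ""
with pdfplumber.open("paper.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() may return None or "" for pages without extractable text
        text += (page.extract_text() or "") + "\n"

# Collapse runs of whitespace (including line breaks) into single spaces
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text[:500])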
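The whitespace collapse above covers only the normalization step. A minimal sketch of the other two cleanup steps mentioned earlier (removing noise and fixing broken words) might look like the following. The `clean_page_text` helper and its patterns are illustrative assumptions, not the exact rules used in PaperIQ.

```python
import re

def clean_page_text(raw: str) -> str:
    """Hypothetical cleanup helper; the patterns below are illustrative assumptions."""
    # Rejoin words hyphenated across a line break: "learn-\ning" -> "learning"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that contain nothing but a page number
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)
    # Collapse all remaining whitespace runs into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "Deep learn-\ning models\n12\nare   widely used."
print(clean_page_text(sample))
```

One caveat with this approach: a genuine compound that happens to break at a line end (for example "state-of-" followed by "the-art") would also lose its hyphen, so a dictionary check may be needed for stricter cleaning.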
The system was then designed to identify structural sections such as Abstract, Methodology, and Conclusion by detecting heading patterns (uppercase lines or numbered sections). This step was critical for enabling downstream analysis.
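The heading detection described above can be sketched roughly as follows. The regular expression, the list of known section names, and the eight-word length cutoff are assumptions chosen for illustration; the actual PaperIQ rules may differ.

```python
import re

# Assumed heading shapes: "1. Introduction", "2.1 Data Collection", or "ABSTRACT"
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+[A-Z]")
KNOWN = {"abstract", "introduction", "methodology", "results", "conclusion", "references"}

def is_heading(line: str) -> bool:
    stripped = line.strip()
    # Headings are short; empty or long lines are treated as body text
    if not stripped or len(stripped.split()) > 8:
        return False
    if NUMBERED.match(stripped):
        return True
    return stripped.isupper() or stripped.lower() in KNOWN

def split_sections(lines):
    """Group body lines under the most recent heading; earlier text goes to 'preamble'."""
    sections = {"preamble": []}
    current = "preamble"
    for line in lines:
        if is_heading(line):
            current = line.strip()
            sections[current] = []
        elif line.strip():
            sections[current].append(line.strip())
    return sections

sample_lines = [
    "PaperIQ: A Study", "ABSTRACT", "We study parsing.",
    "1. Introduction", "Papers are dense.",
    "2.1 Data Collection", "We used PDFs.",
]
for title, body in split_sections(sample_lines).items():
    print(title, "->", body)
```

This kind of detection has to run on text that still contains its line breaks, so it belongs before the step that collapses whitespace into single spaces.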
Milestone 2: NLP-Based Summarization
To reduce information overload, an extractive summarization engine was developed. The approach was frequency-based, where important words were identified and used to score sentences.
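import re
from collections import Counter
from heapq import nlargest

# Lowercased, punctuation-free tokens, so "Model" and "model." count as the same word
freq = Counter(re.findall(r'\w+', clean_text.lower()))
# Naive sentence split on periods; empty fragments are dropped
sentences = [s.strip() for s in clean_text.split('.') if s.strip()]
# Score each sentence by the summed frequency of its words
sentence_scores = {s: sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
# Keep the five highest-scoring sentences (ordered by score, not by position)
summary = nlargest(5, sentence_scores, key=sentence_scores.get)
print(summary)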
This allowed the system to generate concise summaries representing the most important parts of the document, which is useful in research review workflows.
Milestone 3: Quantitative Quality Metrics
A major part of the project involved building a scoring engine to evaluate research quality using measurable metrics.
Readability (Flesch Reading Ease)
def flesch_score(words, sentences, syllables):
    # words, sentences, and syllables are total counts for the document
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
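The formula takes three counts as input, so something has to produce them. The sketch below derives them from raw text. The syllable counter is a rough vowel-group heuristic and the `flesch_reading_ease` wrapper is a hypothetical helper, both assumptions made for illustration; a pronunciation dictionary would be more accurate.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    # Sentences are split on terminal punctuation; words are runs of letters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # assumed fallback for empty input
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```

Higher scores indicate easier text, so dense academic prose would be expected to score well below short everyday sentences like the sample.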
Semantic Strength (Concept Density using POS tagging)
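from textblob import TextBlob

# POS tagging needs the TextBlob corpora (python -m textblob.download_corpora)
blob = TextBlob(clean_text)
# Keep every token tagged as a noun (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in blob.tags if tag.startswith('NN')]
# Number of distinct nouns, used as a proxy for concept density
print(len(set(nouns)))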
Additional metrics included:
- Transition word density (coherence)
- Causal keyword detection (reasoning quality)
- Sentence complexity and variation
This resulted in an 11-metric evaluation system providing a structured “health score” for research papers.
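Two of the additional metrics listed above, transition word density and causal keyword detection, can both be approximated as the share of tokens that fall in a keyword list. The following is a minimal sketch under that assumption; the word lists and the `keyword_density` helper are illustrative, not the lexicons used in PaperIQ.

```python
import re

# Illustrative word lists (assumptions); a real system would use larger curated lexicons
TRANSITIONS = {"however", "therefore", "moreover", "furthermore", "consequently", "additionally"}
CAUSAL = {"because", "therefore", "thus", "hence", "consequently"}

def keyword_density(text: str, keywords: set) -> float:
    """Share of word tokens that belong to `keywords` (0.0 for empty text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)

sample = ("The model overfits. However, dropout helps because it adds noise. "
          "Therefore, accuracy improves.")
print("transition density:", keyword_density(sample, TRANSITIONS))
print("causal density:", keyword_density(sample, CAUSAL))
```

Single-token matching misses multi-word markers such as "as a result" or "due to", so a phrase-level matcher would be the natural next step.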
Milestone 4: AI Interaction & Comparative Analysis
The final stage focused on making the system interactive. A chatbot-like interface was implemented using TF-IDF and cosine similarity to retrieve the most relevant answers from the document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = sentences
# Keep the fitted vectorizer separate from the document matrix it produces
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

query = ["What is the main contribution?"]
query_vec = vectorizer.transform(query)
# One similarity score per sentence; argmax picks the best match
similarity = cosine_similarity(query_vec, doc_matrix)
print(docs[similarity.argmax()])
This enabled users to query documents directly without reading them entirely.
Additionally, a similarity matrix was implemented to compare multiple papers, helping identify overlap, technical depth, and uniqueness.
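A multi-paper similarity matrix of this kind can be sketched with the same TF-IDF and cosine similarity idea used for the chatbot. The version below is a dependency-free illustration with made-up sample texts and hypothetical helper names; scikit-learn's TfidfVectorizer and cosine_similarity would serve equally well.

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps a small positive weight for terms found in every document
        vectors.append({t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def similarity_matrix(docs):
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]

# Made-up stand-ins for paper texts
papers = [
    "graph neural networks for molecule property prediction",
    "neural networks for molecule generation",
    "survey of readability metrics for scientific writing",
]
for row in similarity_matrix(papers):
    print([round(v, 2) for v in row])
```

Each cell compares one pair of papers: values near 1 suggest heavy vocabulary overlap, while a row of low values points to a paper that is comparatively distinct from the rest.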
Final Outcome
The project evolved into a complete intelligent research assistant with the following capabilities:
- Multi-format ingestion (PDF, DOCX)
- Automated summarization
- 11-metric quality evaluation
- Research integrity checks (citation density, semantic depth)
- Interactive chatbot interface
- Multi-document comparison
The system also supported exporting structured reports in formats like PDF, LaTeX, and DOCX.
Key Learnings
This internship provided strong exposure to:
- Handling unstructured data
- Applying NLP in real-world scenarios
- Designing measurable evaluation systems
- Building end-to-end AI pipelines
- Translating theoretical ML concepts into practical solutions
Conclusion
The Infosys Springboard DSAI internship was a hands-on experience in building a real-world AI system from scratch. It reinforced the importance of combining data processing, NLP, and system design to create meaningful solutions.
The final project, PaperIQ, reflects a complete pipeline, from raw document ingestion to intelligent insights, and closely aligns with industry-level data and AI workflows.
The complete implementation and documentation can be accessed on my GitHub.


