Infosys Springboard Internship: Building PaperIQ (AI/NLP)

Overview

As part of the Infosys Springboard DSAI Virtual Internship, I built PaperIQ, an end-to-end system that analyzes research papers and converts them into structured, actionable insights. The internship was organized into four milestones, each covering one layer of the data and AI pipeline, from ingestion to intelligence and interaction.

The experience was highly practical, involving real-world challenges such as handling unstructured documents, applying NLP techniques, and designing meaningful evaluation metrics.

Problem Statement

Research papers are dense, unstructured, and time-consuming to analyze manually. The goal of PaperIQ was to build a system that could:

Extract structured content from research documents
Summarize key insights
Evaluate quality and integrity
Enable interactive querying

Milestone 1: Data Ingestion & Structural Parsing

The first phase focused on building a robust ingestion pipeline capable of handling PDF and DOCX files. PDF documents were parsed page by page with pdfplumber to extract the raw text.

To improve data quality, preprocessing steps were applied using regex to remove noise, fix broken words, and normalize formatting.

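import re

import pdfplumber

# Extract raw text page by page. extract_text() may return None for a page
# with no text layer (in some pdfplumber versions), so fall back to "".
pages = []
with pdfplumber.open("paper.pdf") as pdf:
    for page in pdf.pages:
        pages.append(page.extract_text() or "")
text = "\n".join(pages)

# Collapse runs of whitespace (including line breaks) into single spaces.
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text[:500])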
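
The snippet above only normalizes whitespace. The other cleanup steps mentioned earlier (removing noise and fixing broken words) could look roughly like the sketch below. The helper name `clean_extracted_text` and the specific patterns are assumptions for illustration, not the exact rules used in PaperIQ.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Illustrative cleanup for text extracted from a PDF (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "implemen-\ntation" -> "implementation".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Drop lines that contain only digits (assumed to be page numbers).
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)
    # Collapse all remaining whitespace runs into single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "The implemen-\ntation uses   regex.\n12\nResults follow."
print(clean_extracted_text(sample))
```

One trade-off to keep in mind: the hyphen rule also merges genuine compounds that happen to break at a line end, so it works best on body text rather than on tables or reference lists.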

The system was then designed to identify structural sections such as Abstract, Methodology, and Conclusion by detecting heading patterns (uppercase lines or numbered sections). This step was critical for enabling downstream analysis.
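
A minimal sketch of this heading detection is shown below. It assumes two heading shapes, an all-caps line or a numbered title, and the names `HEADING_RE` and `split_into_sections` are hypothetical. It also has to run on text that still contains line breaks, that is, before whitespace is collapsed into single spaces.

```python
import re

# Assumed heading shapes: an all-caps line ("ABSTRACT") or a numbered
# heading ("1. Introduction", "2.1 Related Work"). Real papers vary widely.
HEADING_RE = re.compile(
    r"^(?:[A-Z][A-Z &-]{2,}|\d+(?:\.\d+)*\.?\s+[A-Z][A-Za-z &-]*)$"
)

def split_into_sections(lines):
    """Group lines under the most recent heading (hypothetical helper)."""
    sections = {}
    current = "PREAMBLE"  # text that appears before the first heading
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if HEADING_RE.match(stripped):
            current = stripped
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(stripped)
    return {name: " ".join(body) for name, body in sections.items()}

sample_lines = [
    "ABSTRACT",
    "We study document parsing.",
    "1. Introduction",
    "Research papers are dense.",
    "2.1 Related Work",
    "Prior systems exist.",
]
for name, body in split_into_sections(sample_lines).items():
    print(f"{name}: {body}")
```

This is a heuristic: a short body line that starts with a number and has no trailing punctuation would also be treated as a heading, so real documents usually need extra checks such as line length or a list of expected section names.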

Milestone 2: NLP-Based Summarization

To reduce information overload, an extractive summarization engine was developed. The approach was frequency-based, where important words were identified and used to score sentences.

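import re
from collections import Counter
from heapq import nlargest

# Count lowercase word tokens so "Model" and "model," share one entry.
# Stopwords are not filtered here, so very common words still add to scores.
freq = Counter(re.findall(r"[a-z]+", clean_text.lower()))

# Split after sentence-ending punctuation followed by whitespace, which
# avoids breaking decimals such as "0.5" the way split('.') would.
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', clean_text) if s.strip()]
sentence_scores = {
    s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()))
    for s in sentences
}

# Keep the five highest-scoring sentences as the summary.
summary = nlargest(5, sentence_scores, key=sentence_scores.get)
print(summary)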

This allowed the system to generate concise summaries representing the most important parts of the document, which is useful in research review workflows.

Milestone 3: Quantitative Quality Metrics

A major part of the project involved building a scoring engine to evaluate research quality using measurable metrics.

Readability (Flesch Reading Ease)

def flesch_score(words, sentences, syllables):
    return 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
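
The function above expects word, sentence, and syllable counts as inputs. The sketch below shows one way those counts might be derived from raw text. The syllable counter is a rough vowel-group heuristic, assumed here for illustration, and both `count_syllables` and `flesch_from_text` are hypothetical names.

```python
import re

def count_syllables(word: str) -> int:
    """Rough estimate: count vowel groups and discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_from_text(text: str) -> float:
    """Compute Flesch Reading Ease directly from a string."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

print(round(flesch_from_text("The cat sat on the mat. It was happy."), 2))
```

Higher scores mean easier text. Academic writing usually scores low because of its long sentences and polysyllabic vocabulary, so the value is more useful for comparing papers than as an absolute grade.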

Semantic Strength (Concept Density using POS tagging)

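from textblob import TextBlob

# POS tagging needs the TextBlob corpora: python -m textblob.download_corpora
blob = TextBlob(clean_text)
# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in blob.tags if tag.startswith('NN')]
# The number of distinct nouns serves as a proxy for concept density.
print(len(set(nouns)))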

Additional metrics included:

Transition word density (coherence)
Causal keyword detection (reasoning quality)
Sentence complexity and variation
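
The three metrics listed above can be approximated with simple counting. The sketch below is illustrative only: the word lists are small placeholders (the lists used in PaperIQ are not given in this post), and `keyword_density` and `sentence_length_stats` are hypothetical helper names.

```python
import re
from statistics import mean, pstdev

# Placeholder word lists, assumed for illustration.
TRANSITION_WORDS = {"however", "therefore", "moreover", "furthermore",
                    "consequently", "additionally"}
CAUSAL_KEYWORDS = {"because", "therefore", "thus", "hence", "consequently"}

def keyword_density(text: str, keywords: set) -> float:
    """Fraction of word tokens that belong to the keyword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in keywords for token in tokens) / len(tokens)

def sentence_length_stats(text: str):
    """Mean and population standard deviation of sentence length in words."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    if not lengths:
        return 0.0, 0.0
    return mean(lengths), pstdev(lengths)

sample = (
    "The model converged quickly. "
    "However, accuracy dropped because the data was noisy. "
    "Therefore, we added regularization."
)
print("transition density:", keyword_density(sample, TRANSITION_WORDS))
print("causal density:", keyword_density(sample, CAUSAL_KEYWORDS))
print("sentence length mean/stdev:", sentence_length_stats(sample))
```

Densities are normalized by token count so that long and short papers can be compared on the same scale.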

This resulted in an 11-metric evaluation system providing a structured “health score” for research papers.

Milestone 4: AI Interaction & Comparative Analysis

The final stage focused on making the system interactive. A chatbot-like interface was implemented using TF-IDF and cosine similarity to retrieve the sentence in the document most relevant to a user's question.

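from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = sentences
# Keep the fitted vectorizer separate from the document matrix:
# fit_transform() returns a sparse matrix, which has no transform() method.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

query = ["What is the main contribution?"]
query_vec = vectorizer.transform(query)

# Compare the query against every sentence and print the best match.
similarity = cosine_similarity(query_vec, doc_matrix)
print(docs[similarity.argmax()])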

This enabled users to query documents directly without reading them entirely.

Additionally, a similarity matrix was implemented to compare multiple papers, helping identify overlap, technical depth, and uniqueness.
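
A pairwise similarity matrix of this kind can be sketched as follows. To keep the example dependency-free, it uses a small standard-library TF-IDF rather than scikit-learn, so it is an approximation of the idea and not the project's actual implementation. The function names and sample texts are assumptions.

```python
import math
import re
from collections import Counter

def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (term -> weight dicts) with smoothed IDF."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(documents):
    """Return an n x n matrix of pairwise cosine similarities."""
    vectors = tfidf_vectors(documents)
    return [[cosine(a, b) for b in vectors] for a in vectors]

papers = [
    "Transformers improve text summarization quality.",
    "Text summarization with transformers and attention.",
    "Soil moisture sensors for precision agriculture.",
]
for row in similarity_matrix(papers):
    print([round(value, 2) for value in row])
```

Values near 1 indicate heavy vocabulary overlap between two papers, while values near 0 indicate little shared terminology. Because TF-IDF is purely lexical, papers that express the same idea in different words will still score low.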

Final Outcome

The project evolved into a complete intelligent research assistant with the following capabilities:

Multi-format ingestion (PDF, DOCX)
Automated summarization
11-metric quality evaluation
Research integrity checks (citation density, semantic depth)
Interactive chatbot interface
Multi-document comparison

The system also supported exporting structured reports in formats like PDF, LaTeX, and DOCX.

Key Learnings

This internship provided strong exposure to:

Handling unstructured data
Applying NLP in real-world scenarios
Designing measurable evaluation systems
Building end-to-end AI pipelines
Translating theoretical ML concepts into practical solutions

Conclusion

The Infosys Springboard DSAI internship was a hands-on experience in building a real-world AI system from scratch. It reinforced the importance of combining data processing, NLP, and system design to create meaningful solutions.

The final project, PaperIQ, reflects a complete pipeline—from raw document ingestion to intelligent insights—closely aligning with industry-level data and AI workflows.

The complete implementation and documentation can be accessed on my GitHub.
