Celebal Internship – Weekly Learning Journal

Week 1: Basics of Data

The first week focused on data fundamentals and how modern systems handle data. Data was introduced as raw facts, and information as processed data that carries meaning. We explored three types of data: structured (tables), semi-structured (JSON/XML), and unstructured (images, free-text logs).
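The three data types can be illustrated with a minimal sketch using only the standard library. The sample values and field names below are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape.
structured = list(csv.DictReader(io.StringIO("id,name\n1,Asha\n2,Ravi\n")))

# Semi-structured: self-describing keys; nesting and optional fields are allowed.
semi_structured = json.loads('{"id": 1, "name": "Asha", "tags": ["intern"]}')

# Unstructured: no schema, so meaning has to be extracted from the raw text.
unstructured = "2024-01-01 12:00:03 ERROR disk full on node-7"
level = unstructured.split()[2]

print(structured[0]["name"], semi_structured["tags"], level)
```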

We studied storage systems by comparing plain file storage with a DBMS, which adds efficient querying and relationships between records. Relational databases (SQL-based) and non-relational databases (NoSQL) were introduced.

Concepts like OLTP (transactional systems) and OLAP (analytical systems) clarified how real-time operations differ from reporting systems. Big Data concepts (6 V’s) and modern architectures like Data Warehouse, Data Lake, and Lakehouse provided a high-level understanding of industry data flow.
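To make the OLTP/OLAP contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The `orders` table and both helper functions are hypothetical, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

# In-memory database; table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

def place_order(region, amount):
    # OLTP-style work: a small write committed as its own transaction.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

def revenue_by_region():
    # OLAP-style work: one read-heavy query that scans and aggregates rows.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)

for region, amount in [("north", 120.0), ("south", 80.0), ("north", 50.0)]:
    place_order(region, amount)

print(revenue_by_region())
```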

Week 2: Python Fundamentals

This week introduced Python as the primary language for data engineering. Core concepts included variables, data types, and collections like lists, sets, tuples, and dictionaries.
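A quick sketch of how the four collection types differ; the sensor names and readings are invented sample values.

```python
# A list keeps insertion order and allows duplicates.
readings = [21.5, 22.0, 21.5]

# A set keeps only unique values.
unique_readings = set(readings)

# A tuple is an immutable, fixed-size record.
point = ("sensor-1", 21.5)

# A dictionary maps keys to values.
latest = {"sensor-1": 21.5, "sensor-2": 19.0}

print(len(readings), len(unique_readings), point[0], latest["sensor-2"])
```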

A practical example used in data processing:

numbers = [1, 2, 3, 4]
squares = [x**2 for x in numbers]
print(squares)

This type of transformation is commonly used in preprocessing pipelines.

String handling was important for cleaning data:

name = "shivankur"
print(name.upper())
print(name[::-1])

We also learned to use virtual environments to isolate dependencies (the activate command below is for Linux/macOS shells):

python -m venv env
source env/bin/activate
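One way to confirm from inside Python that an environment is active is to compare `sys.prefix` with `sys.base_prefix`. This is a small sketch; the helper name `in_virtualenv` is made up for illustration.

```python
import sys

def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
```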

Week 3: Python Advanced Concepts

This week focused on writing scalable and structured code using functions and OOP.

class Student:
    def __init__(self, name):
        self.name = name

    def display(self):
        print(self.name)

s = Student("Shiv")
s.display()

Exception handling was critical for robust pipelines:

try:
    num = int("abc")
except ValueError:
    print("Conversion failed")

File handling (used in ingestion pipelines):

with open("data.txt", "w") as f: 
    f.write("sample data")

Concepts like multithreading and testing (Pytest) were introduced for performance and reliability.
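The two ideas can be combined in a small sketch: a thread pool that applies a cleaning step to many records, plus a Pytest-style test for it. The `clean` and `clean_all` functions are hypothetical. In CPython, threads mainly help with I/O-bound work, so treat this as an illustration of the API rather than a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def clean(value):
    # Hypothetical per-record step: trim whitespace and lowercase.
    return value.strip().lower()

def clean_all(values, workers=4):
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean, values))

# Pytest collects functions named test_* and reports a failed assert as a failure.
def test_clean_all():
    assert clean_all(["  A ", "b", " C"]) == ["a", "b", "c"]
```

Saved in a file such as `test_clean.py` (a hypothetical name), the test would run with the `pytest` command.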

Week 4: Pandas & PySpark

This week focused on data analysis and preprocessing.

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "B"], "marks": [90, None, 85]})
df.dropna(inplace=True)           # drop rows with missing values
df.drop_duplicates(inplace=True)  # drop fully identical rows

These operations reflect real-world cleaning tasks.

Aggregation example:

df.groupby("name")["marks"].mean()

PySpark was introduced for handling large-scale distributed data.

Basic SQL querying was also covered:

SELECT name, marks FROM students WHERE marks > 80;

Week 5: SQL for Data Manipulation

This week focused on querying structured data.

Filtering and joins:

SELECT e.name, d.dept FROM employees e JOIN departments d ON e.dept_id = d.id;

Aggregation:

SELECT dept, COUNT(*) FROM employees GROUP BY dept;

Set operations (combining datasets):

SELECT name FROM A UNION SELECT name FROM B;

These are heavily used in reporting and transformation layers.
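The three query patterns above can be tried end to end with Python's built-in sqlite3 module. This is a sketch on invented sample rows. Because `employees` stores only a `dept_id` here, the aggregation goes through the join instead of reading a `dept` column directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample schema and rows are assumptions made for this sketch.
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, dept TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Cloud');
    INSERT INTO employees VALUES ('Asha', 1), ('Ravi', 1), ('Meera', 2);
""")

# Join: attach the department name to each employee.
joined = conn.execute(
    "SELECT e.name, d.dept FROM employees e "
    "JOIN departments d ON e.dept_id = d.id ORDER BY e.name"
).fetchall()

# Aggregation: count employees per department.
counts = conn.execute(
    "SELECT d.dept, COUNT(*) FROM employees e "
    "JOIN departments d ON e.dept_id = d.id GROUP BY d.dept ORDER BY d.dept"
).fetchall()

# Set operation: UNION removes duplicates across both inputs.
names = conn.execute(
    "SELECT name FROM employees WHERE dept_id = 1 "
    "UNION SELECT name FROM employees WHERE name = 'Asha' ORDER BY name"
).fetchall()

print(joined, counts, names, sep="\n")
```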

Week 6: Advanced SQL

Advanced querying and optimization were covered.

CTE for simplifying queries:

WITH temp AS (
    SELECT * FROM employees WHERE salary > 50000
)
SELECT * FROM temp;

Stored procedures for reusable logic:

CREATE PROCEDURE GetEmployees AS SELECT * FROM employees;

Indexing and query optimization were introduced to improve performance in large datasets.
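A rough way to see what an index changes is to compare the query plan before and after creating one. The sketch below uses SQLite through Python's sqlite3 module with generated sample rows. The index name `idx_salary` and the `plan` helper are assumptions, and the exact plan wording differs between SQLite versions and other database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(f"emp{i}", 30000 + i * 100) for i in range(1000)],
)

query = "SELECT name FROM employees WHERE salary > 50000"

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)  # no index yet, so the table is scanned
conn.execute("CREATE INDEX idx_salary ON employees (salary)")
after = plan(query)   # the planner can now search via idx_salary

print(before)
print(after)
```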

Week 7: Cloud Basics (Azure)

This week introduced cloud-based data systems. Azure components like Storage Accounts and Data Lake Gen2 were explored.

A real-world ingestion task involved extracting metadata from filenames:

filename = "CUST_MSTR_20191112.csv"
date = filename.split("_")[-1].split(".")[0]
formatted = f"{date[:4]}-{date[4:6]}-{date[6:]}"
print(formatted)

This logic is used in pipelines to derive partition columns or metadata fields.
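A slightly more defensive variant is sketched below. It assumes the naming convention is always `<PREFIX>_<YYYYMMDD>.csv`; the `parse_filename` helper and the regular expression are illustrative rather than part of the original task.

```python
import re
from datetime import datetime

# Assumed naming convention: <PREFIX>_<YYYYMMDD>.csv
PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<date>\d{8})\.csv$")

def parse_filename(filename):
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    # strptime raises ValueError for impossible dates such as 20191340.
    file_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("prefix"), file_date.isoformat()

print(parse_filename("CUST_MSTR_20191112.csv"))
```

Validating the date with `strptime` means a malformed file name fails at ingestion time instead of producing a bad partition value later.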

Week 8: Azure Data Factory (ADF)

The final week focused on building ETL pipelines using ADF.

Key concepts included pipelines, datasets, linked services, and triggers. Data flows were designed to extract files from storage, transform them, and load the results into databases.

ADF pipelines automate workflows like:

Reading files from Data Lake
Transforming data
Loading into SQL tables

Parameterization was used to make pipelines dynamic and reusable across datasets.
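The idea behind parameterization can be sketched in plain Python. This is only an analogy and does not use ADF's pipeline JSON or any Azure SDK. The `run_pipeline` function, its parameters, and the sample CSV text are all hypothetical, with SQLite standing in for the target database.

```python
import csv
import io
import sqlite3

def run_pipeline(conn, source_text, table_name, min_marks):
    # Extract: read rows from CSV text (stands in for a file in the lake).
    rows = list(csv.DictReader(io.StringIO(source_text)))
    # Transform: tidy names and keep rows at or above the threshold.
    kept = [
        (r["name"].strip(), int(r["marks"]))
        for r in rows
        if int(r["marks"]) >= min_marks
    ]
    # Load: table names cannot be bound as SQL parameters, so validate first.
    if not table_name.isidentifier():
        raise ValueError(f"invalid table name: {table_name}")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name} (name TEXT, marks INTEGER)")
    conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", kept)
    conn.commit()
    return len(kept)

conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn, "name,marks\nA,90\nB,70\n", "students", 80)
print(loaded)
```

The same function can be pointed at a different source, table, or threshold without changing its body, which is the property that parameters give a reusable pipeline.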

Final Summary

The internship covered the full data engineering lifecycle:

Data fundamentals
Python programming
SQL (basic to advanced)
Data processing (Pandas, PySpark)
Cloud (Azure)
ETL pipelines (ADF)

By the end, the focus shifted from learning concepts to implementing real-world data pipelines.

