Celebal Internship – Weekly Learning Journal
Data Engineering Track

Week 1: Basics of Data
The first week covered data fundamentals and how modern systems handle data. Data was introduced as raw facts, and information as processed data that carries meaning. We explored three types of data: structured (tables), semi-structured (JSON/XML), and unstructured (images, free-text logs).
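As a small illustration of the three categories, the sketch below reads one made-up sample of each using only the Python standard library. The sample values and field names are invented for the example and do not come from the course material.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "id,name,marks\n1,Asha,90\n2,Ravi,85\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; fields may nest or be missing.
json_text = '{"id": 3, "name": "Meera", "contact": {"email": "meera@example.com"}}'
record = json.loads(json_text)

# Unstructured: free text with no schema; meaning has to be extracted.
log_line = "2024-01-15 10:32:07 ERROR payment service timed out"
has_error = "ERROR" in log_line

print(rows[0]["name"])             # Asha
print(record["contact"]["email"])  # meera@example.com
print(has_error)                   # True
```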
We studied storage systems, comparing plain file storage with a DBMS, which adds efficient querying and managed relationships between records. Relational (SQL-based) and non-relational (NoSQL) databases were introduced.
The distinction between OLTP (transactional systems) and OLAP (analytical systems) clarified how real-time operations differ from reporting workloads. Big Data concepts (the 6 V’s) and modern architectures such as the Data Warehouse, Data Lake, and Lakehouse gave a high-level view of how data flows through industry systems.
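The OLTP/OLAP difference can be sketched with Python's built-in sqlite3 module. This is only an illustration of the two access patterns on one small in-memory table; the `orders` table and its values are assumptions, and real OLAP systems run on separate, much larger stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small writes, each touching a single row.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("north", 120.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("south", 80.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("north", 50.0))

# OLAP-style work: one read that scans and aggregates many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```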
Week 2: Python Fundamentals
This week introduced Python as the primary language for data engineering. Core concepts included variables, data types, and collections like lists, sets, tuples, and dictionaries.
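A quick sketch of when each collection type fits, using invented sample values:

```python
# A list keeps order and allows duplicates: good for rows read in sequence.
marks = [90, 85, 85, 70]

# A set keeps only unique values: good for de-duplication.
unique_marks = set(marks)

# A tuple is immutable: good for fixed records such as (name, marks).
student = ("Asha", 90)

# A dictionary maps keys to values: good for lookups by name or id.
marks_by_name = {"Asha": 90, "Ravi": 85}

print(len(marks), len(unique_marks))  # 4 3
print(student[0])                     # Asha
print(marks_by_name["Ravi"])          # 85
```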
A practical example used in data processing:
numbers = [1, 2, 3, 4]
squares = [x**2 for x in numbers]
print(squares)
This type of transformation is commonly used in preprocessing pipelines.
String handling was important for cleaning data:
name = "shivankur"
print(name.upper())
print(name[::-1])
We also learned to use virtual environments to isolate dependencies:
python -m venv env
source env/bin/activate
Week 3: Python Advanced Concepts
This week focused on writing scalable and structured code using functions and OOP.
class Student:
    def __init__(self, name):
        self.name = name

    def display(self):
        print(self.name)

s = Student("Shiv")
s.display()
Exception handling was critical for robust pipelines:
try:
    num = int("abc")
except ValueError:
    print("Conversion failed")
File handling (used in ingestion pipelines):
with open("data.txt", "w") as f:
    f.write("sample data")
Concepts like multithreading and testing (Pytest) were introduced for performance and reliability.
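A minimal sketch of both ideas together, assuming a hypothetical `clean_name` cleaning step: a thread pool applies it to many values, and a Pytest-style test checks the result. The function names are illustrative and not from the course.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_name(raw: str) -> str:
    """Trim whitespace and normalise case (hypothetical cleaning step)."""
    return raw.strip().title()


def clean_all(names: list[str]) -> list[str]:
    # Threads help most with I/O-bound work; map() preserves input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(clean_name, names))


# Pytest collects functions named test_*; run with `pytest <file>.py`.
def test_clean_all_preserves_order():
    assert clean_all(["  asha ", "RAVI"]) == ["Asha", "Ravi"]
```

For CPU-bound transformations, threads in CPython usually give little speed-up, so this pattern is better suited to tasks such as reading many files or calling APIs.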
Week 4: Pandas & PySpark
This week focused on data analysis and preprocessing.
import pandas as pd
df = pd.DataFrame({"name": ["A", "B", "B"], "marks": [90, None, 85]})
df.dropna(inplace=True)           # drop rows with missing values
df.drop_duplicates(inplace=True)  # drop fully identical rows
These operations reflect real-world cleaning tasks.
Aggregation example:
df.groupby("name")["marks"].mean()
PySpark was introduced for handling large-scale distributed data.
Basic SQL querying was also covered:
SELECT name, marks FROM students WHERE marks > 80;
Week 5: SQL for Data Manipulation
This week focused on querying structured data.
Filtering and joins:
SELECT e.name, d.dept FROM employees e JOIN departments d ON e.dept_id = d.id;
Aggregation:
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
Set operations (combining datasets):
SELECT name FROM A UNION SELECT name FROM B;
These are heavily used in reporting and transformation layers.
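To try the join, aggregation, and set-operation patterns end to end, the sketch below runs them against an in-memory SQLite database from Python. The table contents are invented, and the aggregation joins to `departments` because this sample schema stores only `dept_id` on `employees`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, dept TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Cloud');
    INSERT INTO employees VALUES ('Asha', 1), ('Ravi', 1), ('Meera', 2);
    CREATE TABLE A (name TEXT);
    CREATE TABLE B (name TEXT);
    INSERT INTO A VALUES ('Asha'), ('Ravi');
    INSERT INTO B VALUES ('Ravi'), ('Meera');
""")

# Join: attach each employee's department name.
joined = conn.execute(
    "SELECT e.name, d.dept FROM employees e "
    "JOIN departments d ON e.dept_id = d.id ORDER BY e.name"
).fetchall()

# Aggregation: head count per department.
counts = conn.execute(
    "SELECT d.dept, COUNT(*) FROM employees e "
    "JOIN departments d ON e.dept_id = d.id GROUP BY d.dept ORDER BY d.dept"
).fetchall()

# UNION removes duplicates, so 'Ravi' appears only once.
names = conn.execute(
    "SELECT name FROM A UNION SELECT name FROM B ORDER BY name"
).fetchall()

print(joined)
print(counts)
print(names)
conn.close()
```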
Week 6: Advanced SQL
Advanced querying and optimization were covered.
CTE for simplifying queries:
WITH temp AS (
    SELECT * FROM employees WHERE salary > 50000
)
SELECT * FROM temp;
Stored procedures for reusable logic:
CREATE PROCEDURE GetEmployees AS SELECT * FROM employees;
Indexing and query optimization were introduced to improve performance in large datasets.
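One way to see the effect of an index is to compare the query plan before and after creating it. The sketch below does this with SQLite from Python. The table, the generated data, and the index name `idx_employees_salary` are assumptions for illustration, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [(f"emp{i}", 30000 + i * 10) for i in range(5000)],
)

query = "SELECT name FROM employees WHERE salary = 45000"


def plan(sql: str) -> str:
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


before = plan(query)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_employees_salary ON employees (salary)")
after = plan(query)   # expected to mention idx_employees_salary

print(before)
print(after)
conn.close()
```

Other engines, including SQL Server and Azure SQL, expose query plans through their own tools, but the idea of checking whether a query scans the table or seeks through an index is the same.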
Week 7: Cloud Basics (Azure)
This week introduced cloud-based data systems. Azure components like Storage Accounts and Data Lake Gen2 were explored.
A real-world ingestion task involved extracting metadata from filenames:
filename = "CUST_MSTR_20191112.csv"
date = filename.split("_")[-1].split(".")[0]
formatted = f"{date[:4]}-{date[4:6]}-{date[6:]}"
print(formatted)
This logic is used in pipelines to derive partition columns or metadata fields.
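A slightly more defensive variant is sketched below, assuming filenames always end in an underscore followed by a `YYYYMMDD` token. The function name `extract_file_date` is hypothetical. It rejects malformed tokens instead of silently producing a wrong partition value.

```python
from datetime import datetime
from pathlib import Path


def extract_file_date(filename: str) -> str:
    """Return the trailing YYYYMMDD token of a filename as YYYY-MM-DD."""
    token = Path(filename).stem.rsplit("_", 1)[-1]
    if len(token) != 8 or not token.isdigit():
        raise ValueError(f"no YYYYMMDD token in {filename!r}")
    # strptime raises ValueError if the digits are not a real calendar date.
    return datetime.strptime(token, "%Y%m%d").date().isoformat()


print(extract_file_date("CUST_MSTR_20191112.csv"))  # 2019-11-12
```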
Week 8: Azure Data Factory (ADF)
The final week focused on building ETL pipelines using ADF.
Key concepts included pipelines, datasets, linked services, and triggers. Data flows were designed to extract files from storage, transform them, and load them into databases.
ADF pipelines automate workflows like:
- Reading files from Data Lake
- Transforming data
- Loading into SQL tables
Parameterization was used to make pipelines dynamic and reusable across datasets.
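ADF pipelines are built in the ADF authoring UI or as JSON, not in Python, but the idea of one parameterized pipeline reused across datasets can be sketched in plain Python. Everything below is an analogy under stated assumptions: `run_pipeline`, its parameters, and the sample data are hypothetical, an in-memory CSV string stands in for a Data Lake file, and SQLite stands in for the target SQL database.

```python
import csv
import io
import sqlite3


def run_pipeline(conn, source_text: str, target_table: str, required_column: str) -> int:
    """Extract CSV rows, drop rows missing a required column, load the rest."""
    # Extract: the in-memory string stands in for a file in the data lake.
    rows = list(csv.DictReader(io.StringIO(source_text)))
    if not rows:
        return 0

    # Transform: keep only rows where the required column is filled in.
    clean = [r for r in rows if (r[required_column] or "").strip()]

    # Load: create the target table from the CSV header, then insert.
    columns = list(rows[0].keys())
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for ident in [target_table, *columns]:
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {target_table} ({col_list})")
    conn.executemany(
        f"INSERT INTO {target_table} ({col_list}) VALUES ({placeholders})",
        [tuple(r[c] for c in columns) for r in clean],
    )
    conn.commit()
    return len(clean)


conn = sqlite3.connect(":memory:")
# The same function handles two datasets just by changing its parameters.
loaded_customers = run_pipeline(conn, "id,name\n1,Asha\n2,\n", "customers", "name")
loaded_orders = run_pipeline(conn, "order_id,amount\n10,250\n11,90\n", "orders", "amount")
print(loaded_customers, loaded_orders)  # 1 2
```

The table name and required column play the role that pipeline and dataset parameters play in ADF: the logic is written once and the values are supplied per run.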
Final Summary
The internship covered the full data engineering lifecycle:
- Data fundamentals
- Python programming
- SQL (basic to advanced)
- Data processing (Pandas, PySpark)
- Cloud (Azure)
- ETL pipelines (ADF)
By the end, the focus shifted from learning concepts to implementing real-world data pipelines.