NLP by Vinod

A structured public journey from NLP fundamentals to real-world AI systems.

Vinod Codes is where I document my learning in AI, Machine Learning, Deep Learning, Natural Language Processing, Generative AI, and practical projects.

The main series here is NLP by Vinod — a learner-builder journey where I explain concepts with intuition, Python examples, mistakes, GitHub work, and honest implementation notes.

Start here: follow the Foundations Track first, then move into deep learning, transformers, projects, and real-world NLP systems.

Data Acquisition for NLP - Collecting Text Before Preprocessing

Data acquisition is the first real step before text preprocessing. In this post, I am documenting how I collected data using web scraping, JSON files, SQL, APIs, CSV workflows, and basic EDA.

Data acquisition for NLP means collecting the raw text or structured information that will later become useful input for an NLP pipeline. Before text preprocessing, tokenization, vectorization, model training, or transformers, there is one very practical question: where will the data come from?

In my NLP learning roadmap, I placed data acquisition before text preprocessing because a model cannot learn from data that has not been collected properly. In the previous topic, I learned Python strings and regex for NLP. That helped me understand raw text at the character and pattern level. Now this topic moves one step earlier in the pipeline: getting the data itself.

While working on the notebooks, I used multiple sources: web pages, JSON files, SQL databases, APIs, CSV files, and a student performance dataset for EDA. My rough understanding is simple: data acquisition is the starting point, and there are many ways to do it depending on where the data lives.

"Before preprocessing, there must be data."
Data acquisition is not the glamorous part of NLP, but it decides whether the rest of the pipeline starts with useful raw material or messy confusion.

Figure: Data acquisition pipeline for NLP (web scraping, JSON, APIs, SQL, and CSV files feeding a Pandas DataFrame). Data acquisition connects real-world sources to machine learning workflows. For NLP, this often means collecting text from pages, APIs, files, databases, and documents.

01 What Data Acquisition Means in NLP

In simple words, data acquisition means collecting data from a source and converting it into a usable format. In NLP, that data is usually text or text-related metadata. It may come from product reviews, news articles, movie descriptions, social media posts, PDFs, support tickets, company pages, chat logs, or public datasets.

The technical part is not just downloading something. The real work is converting messy sources into a structure that Python can process. Most of the time, that means creating a Pandas DataFrame with clear columns.

Raw Sources

  • HTML web pages
  • JSON files
  • SQL tables
  • REST APIs
  • CSV datasets
  • Documents and reports

Usable Output

  • Clean DataFrame columns
  • Text fields for NLP
  • Labels for supervised learning
  • Metadata for analysis
  • CSV files for storage
  • Notebook-ready datasets

This is why I started thinking of data acquisition as a bridge. On one side, there is raw information scattered across different places. On the other side, there is a structured dataset that can move into EDA, cleaning, preprocessing, and modeling.

  • Source: website, API, JSON, SQL, CSV
  • Fetch: requests, pandas, connector
  • Parse: HTML, JSON, tables, fields
  • Structure: DataFrame with columns
  • Use: EDA and preprocessing

My main learning: Data acquisition is not one method. It is a group of methods. The method depends on whether the data is stored on a website, inside a file, inside a database, or behind an API.

02 Web Scraping with Requests and BeautifulSoup

The first notebook focused on web scraping. I used requests to fetch the page and BeautifulSoup to parse the HTML. At first, the website returned an access denied page. That was a useful mistake because it showed me that websites may block plain Python requests.

After adding a browser-like User-Agent header, the response became usable. Then I inspected the HTML and found that company information was stored inside repeated card-like blocks. This is the real scraping process: observe the page structure, find repeated patterns, extract the fields, and store them.

Figure: Web scraping workflow using Python requests and BeautifulSoup. Web scraping starts by inspecting the page structure, finding repeated HTML patterns, and extracting useful text fields into a dataset.

Python
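import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/list-of-companies"

# A browser-like User-Agent; the plain Python request was denied by the site
headers = {
    "User-Agent": "Mozilla/5.0"
}

# timeout avoids hanging forever; raise_for_status surfaces 403/404 early
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

# "lxml" needs the lxml package; the built-in "html.parser" also works
soup = BeautifulSoup(response.text, "lxml")

# Each company sits inside one repeated card container
cards = soup.find_all("div", class_="companyCardWrapper__metaInformation")
len(cards)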

In the notebook, each page had 20 company cards. I extracted company name, description, rating, rating count, reviews, salaries, interviews, jobs, benefits, and photos. After that, I looped through 50 pages and created a dataset with 1000 rows and 10 columns.

  • Cards per page: 20
  • Pages scraped: 50
  • Rows collected: 1000
  • Columns created: 10

The Pattern I Used for Scraping

The most important idea was not the exact class name. Class names can change on websites. The important idea was the scraping pattern:

  1. Fetch the page: Use requests.get() to download the HTML. If the site blocks the request, try a browser-like header and still respect the website rules.
  2. Parse the HTML: Use BeautifulSoup to convert raw HTML into a searchable object where tags, classes, and text can be found.
  3. Find repeated cards: Identify the repeated container that holds one complete record. In my case, each company card had similar structure.
  4. Extract fields: Pull out the fields from each card using tag names and class names, then store them inside lists or a dictionary.
  5. Create a DataFrame: Convert the dictionary into a Pandas DataFrame so the data becomes useful for EDA, cleaning, and future modeling.
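
The five steps above can be sketched end to end on a small inline HTML string, so nothing is fetched from a real site. The class names (company-card, name, rating), the sample companies, and the helpers text_or_none and parse_cards are made up for illustration; they are not the markup of the site I scraped. The built-in "html.parser" is used here so that lxml is not required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page (step 1 would be requests.get)
SAMPLE_HTML = """
<div class="company-card"><h2 class="name">Acme Corp</h2><span class="rating">4.1</span></div>
<div class="company-card"><h2 class="name">Globex</h2></div>
"""

def text_or_none(card, tag, css_class):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    node = card.find(tag, class_=css_class)
    return node.get_text(strip=True) if node else None

def parse_cards(html):
    """Steps 2-4: parse the HTML, find the repeated cards, extract one dict per card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="company-card"):
        rows.append({
            "name": text_or_none(card, "h2", "name"),
            "rating": text_or_none(card, "span", "rating"),
        })
    return rows

# Step 5: a list of dicts becomes a DataFrame with one row per card
df = pd.DataFrame(parse_cards(SAMPLE_HTML))
print(df)
```

Returning None for a missing field keeps every row the same shape, so one incomplete card does not break the whole page.
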
Important ethical note: Web scraping should be done carefully. Check website rules, avoid aggressive requests, do not scrape private data, and do not overload servers. For learning, small controlled scraping is fine, but production scraping needs more responsibility.
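
One way to check website rules in code is the standard-library robots.txt parser. This is only a sketch: the robots.txt content below is invented, and the allowed helper is a hypothetical name. A real file would be read from the site itself, and robots.txt is only one part of a site's terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at the site's /robots.txt
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

# Seconds to wait between requests; fall back to 1 if the file sets no delay
delay = parser.crawl_delay("*") or 1

print(allowed("https://www.example.com/list-of-companies?page=2"))
print(allowed("https://www.example.com/private/data"))
print(delay)
```

In a page loop, calling time.sleep(delay) between requests keeps the scraping small and controlled.
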

03 Reading JSON Data into Pandas

The next notebook started with JSON data. JSON is very common in NLP because APIs usually return JSON responses, and many datasets store nested text fields in JSON format. I used pd.read_json() to read a file and convert it into a DataFrame.

The example had columns like id and ingredients. The interesting part was that one column contained lists. This is something I need to remember for NLP because text data can be nested: one row may contain a list of ingredients, tokens, tags, comments, or paragraphs.

Python
import pandas as pd

# Each top-level record in the file becomes one row
df = pd.read_json("test.json")

df.head()
df.columns

# Accessing one item from a list inside a cell:
# first row, first ingredient
df["ingredients"].iloc[0][0]

What clicked here: JSON to DataFrame is a common first step. Once the JSON is inside Pandas, I can apply EDA, cleaning, preprocessing, and feature engineering.
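
Here is a minimal sketch of handling a list-valued column once the JSON is in Pandas. The two records are made up and only mimic the id and ingredients columns from the notebook. The JSON is read from an in-memory string so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical records shaped like the notebook data: an id plus a list of ingredients
raw = '[{"id": 1, "ingredients": ["salt", "olive oil"]}, {"id": 2, "ingredients": ["rice"]}]'

df = pd.read_json(io.StringIO(raw))

# Option 1: join each list into one string, which is handy as a text field
df["ingredients_text"] = df["ingredients"].apply(", ".join)

# Option 2: one row per list item, useful for counting individual values
long_df = df.explode("ingredients")[["id", "ingredients"]].reset_index(drop=True)

print(df[["id", "ingredients_text"]])
print(long_df)
```

Joining suits later text preprocessing, while explode suits frequency counts during EDA.
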

04 Collecting Data from SQL Tables

Another source I explored was SQL. In real projects, data does not always come as a ready CSV file. Many applications store data in relational databases. For NLP, this could mean customer reviews, support tickets, product descriptions, messages, logs, or user feedback stored in tables.

In the notebook, I connected to a local MySQL database and used Pandas to run SQL queries. This helped me see how SQL tables can directly become DataFrames.

Python
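import mysql.connector
import pandas as pd

# Connect to a local MySQL server; keep the real password out of public notebooks
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="YOUR_PASSWORD",
    database="world"
)

# List the tables, then load two of them as DataFrames.
# pandas may warn that it prefers a SQLAlchemy connection; the queries still run.
pd.read_sql_query("SHOW TABLES", conn)

df_city = pd.read_sql_query("SELECT * FROM city", conn)
df_country = pd.read_sql_query("SELECT * FROM country", conn)

# Close the connection once the data is in memory
conn.close()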

The important part is that SQL helps when the data is already structured. I do not need to scrape or parse HTML. I just need a good query and a clean connection. From there, Pandas becomes the bridge between database storage and machine learning workflow.
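
To show what "a good query" can look like for text data, here is a sketch that selects only the needed columns and filters with a parameter. It swaps in an in-memory SQLite database for MySQL so it runs anywhere; the reviews table and its rows are hypothetical. SQLite uses ? as the placeholder, while MySQL drivers typically use %s.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, product TEXT, review TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (1, "phone", "Battery lasts long"),
        (2, "phone", "Screen cracked early"),
        (3, "laptop", "Fast and quiet"),
    ],
)

# Select only the columns needed for NLP and pass the filter value as a parameter
query = "SELECT id, review FROM reviews WHERE product = ?"
df_reviews = pd.read_sql_query(query, conn, params=("phone",))
conn.close()

print(df_reviews)
```

Passing the value through params instead of formatting it into the string avoids quoting mistakes and SQL injection.
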

05 Using APIs for Data Acquisition

APIs were the most practical part for me because many real-world datasets are accessed through APIs. An API is like a controlled data pipeline between two software systems. Instead of manually downloading pages, I send a request and receive structured data, usually as JSON.

In the notebook, I tried an API that returned related news data. Then I worked with a movie API to create a movie dataset with names, descriptions, and genres. This is highly connected to NLP because movie descriptions are text fields and genres can work as labels.

API key safety: Never publish real API keys in blog posts, GitHub notebooks, screenshots, or public repositories. Use environment variables or placeholders like YOUR_API_KEY_HERE.
Python
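import requests
import pandas as pd

# Placeholder only; load the real key from an environment variable
API_KEY = "YOUR_API_KEY_HERE"

url = (
    "https://api.themoviedb.org/3/movie/top_rated"
    f"?api_key={API_KEY}&language=en-US&page=1"
)

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 401 (bad key) or 404
data = response.json()

data.keys()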

The response contained keys like page, results, total_pages, and total_results. The actual movie records were inside results. From that, I selected useful columns like movie title, overview, and genre IDs.
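
The API key safety note can be sketched like this: read the key from an environment variable and pass the query values as a dictionary instead of typing them into the URL. The variable name TMDB_API_KEY and the build_params helper are my own choices, not something the API requires.

```python
import os

BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"

def build_params(page=1):
    """Build the query parameters, reading the key from the environment."""
    api_key = os.environ.get("TMDB_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set the TMDB_API_KEY environment variable first")
    return {"api_key": api_key, "language": "en-US", "page": page}

# Usage with requests (not executed here, since it needs a real key and network):
# response = requests.get(BASE_URL, params=build_params(page=1), timeout=10)
# response.raise_for_status()
# data = response.json()
```

With this pattern the notebook can be pushed to GitHub without the key ever appearing in a cell.
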

Creating a Movie Dataset from API Data

The movie dataset was a good NLP example because the overview field is text. Later, this kind of data can be used for text classification, genre prediction, recommendation systems, semantic search, or topic analysis.

Python
# Convert movie results into a DataFrame, keeping only the text and label fields
df = pd.DataFrame(data["results"])[["title", "overview", "genre_ids"]].copy()

# Rename to clearer column names for the dataset
df = df.rename(
    columns={
        "title": "movie_name",
        "overview": "description"
    }
)

df.head()

One problem was that the API returned genre IDs instead of genre names. So I used a second API endpoint to create a mapping from genre ID to genre name. Then I converted lists like [18, 80] into readable genre names like Drama, Crime.

Python
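# Example genre mapping idea
genre_dict = {
    18: "Drama",
    80: "Crime",
    35: "Comedy"
}

# Turn [18, 80] into "Drama, Crime"; IDs missing from genre_dict are skipped
df["genre"] = df["genre_ids"].apply(
    lambda ids: ", ".join(genre_dict[i] for i in ids if i in genre_dict)
)

# The numeric IDs are no longer needed once the names exist
df = df.drop(columns="genre_ids")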
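
The mapping above is typed by hand. A sketch of building it from the second endpoint's response could look like this, assuming the response has the shape {"genres": [{"id": ..., "name": ...}]}. The sample response, the two movies, and the ids_to_names helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like a genre-list response
genre_response = {"genres": [{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]}

# Build the ID-to-name lookup in one pass
genre_dict = {g["id"]: g["name"] for g in genre_response["genres"]}

df = pd.DataFrame({"movie_name": ["A", "B"], "genre_ids": [[18, 80], [18, 99]]})

def ids_to_names(ids):
    """Map each ID to its name; unknown IDs stay visible instead of vanishing."""
    return ", ".join(genre_dict.get(i, f"unknown_{i}") for i in ids)

df["genre"] = df["genre_ids"].apply(ids_to_names)
print(df[["movie_name", "genre"]])
```

Keeping an unknown_99 marker is a design choice: a silently dropped ID would look like a movie with fewer genres, which is harder to notice during EDA.
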

Finally, I saved the dataset as a CSV file. This step matters because once the raw API data is converted into a stable CSV, it becomes easier to reuse for EDA, preprocessing, and modeling.

Python
df.to_csv("movies.csv", index=False)
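
A quick way to trust the saved file is to read it back and compare. This sketch uses an in-memory buffer instead of movies.csv, and the two rows are invented. They include commas and quotes on purpose, because those are the characters that usually break a CSV.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "movie_name": ["A", "B"],
    "description": ["A quiet, slow drama.", 'He said "hi", then left.'],
    "genre": ["Drama, Crime", "Comedy"],
})

# Write to an in-memory buffer, then read the same text back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# True when every value and dtype survived the round trip
roundtrip_ok = restored.equals(df)
print(roundtrip_ok)
```

This works here because the genre column is already plain text. A column that still holds Python lists would come back from CSV as strings, which is one more reason to convert genre_ids before saving.
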

06 Comparing Data Acquisition Methods

After trying different methods, I started seeing the difference between them more clearly. Every method has a use case. There is no single best method for all NLP projects.

Method | Best For | What I Learned
Web scraping | Collecting visible data from web pages | Useful when no API is available, but it needs careful HTML inspection and ethical handling.
JSON files | Nested data and API-like stored data | Pandas can read JSON directly, but nested lists and dictionaries need attention.
SQL | Structured data stored in databases | SQL queries can bring selected tables directly into Pandas for analysis.
APIs | Controlled access to live or structured data | APIs are clean and scalable, but keys, limits, and response structure must be handled safely.
CSV | Saving and reusing datasets | CSV becomes a simple checkpoint after collecting and structuring data.

The practical conclusion: In real NLP work, data acquisition is usually mixed. I may scrape some data, read existing CSV files, pull extra metadata from APIs, and then combine everything into one clean dataset.
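
A minimal sketch of that mixing step: tag each source, stack the rows, and drop repeated text. The two small DataFrames are invented stand-ins for a scraped set and an existing CSV.

```python
import pandas as pd

# Hypothetical rows from two different sources, each tagged with where it came from
scraped = pd.DataFrame({"text": ["Great place to work", "Slow hiring process"]}).assign(source="scrape")
from_csv = pd.DataFrame({"text": ["Slow hiring process", "Good benefits"]}).assign(source="csv")

# Stack the sources into one table with a fresh index
combined = pd.concat([scraped, from_csv], ignore_index=True)

# Keep the first copy of each text so the same document is not counted twice
deduped = combined.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

print(deduped)
```

The source column is worth keeping, because it makes it possible to check later whether one source dominates the dataset.
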

07 Why EDA Comes Immediately After Data Acquisition

The third notebook connected data acquisition to EDA. I used a student performance dataset and checked its shape, columns, missing values, duplicate rows, categorical columns, numerical columns, outliers, correlations, and visual distributions.

Figure: EDA dashboard showing API responses, SQL results, text length distribution, and dataset summaries. After collecting data, EDA helps check dataset quality, source distribution, missing values, text length patterns, labels, and other signals before preprocessing.

This helped me understand that collecting data is not enough. I need to inspect the dataset before preprocessing or modeling. Otherwise, I may blindly train on missing values, duplicate rows, strange outliers, incorrect data types, or poorly encoded categories.

  • Rows in dataset: 649
  • Original columns: 33
  • Missing values: 0
  • Duplicate rows: 0

In the notebook, I separated categorical and numerical columns. I found 17 categorical columns and 16 numerical columns. I also observed outliers, but many of them were legitimate because some features were discrete or naturally limited.

Important observation: Outliers are not always bad data. Sometimes they are real values. Removing them without understanding the feature can damage the dataset.
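
One common way to flag outliers without deleting them is the 1.5 x IQR rule. This is a sketch, not necessarily the exact check from my notebook; the values and the iqr_outlier_mask helper are made up.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value falls outside the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical values for a count-like feature
absences = pd.Series([0, 2, 4, 4, 6, 8, 30], name="absences")

mask = iqr_outlier_mask(absences)
print(absences[mask])
```

The mask only marks rows for inspection. Whether a flagged value is an error or a real case still needs a look at the feature itself.
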

Basic EDA Checklist After Data Acquisition

Dataset Quality

  • Check shape and columns
  • Check missing values
  • Check duplicate rows
  • Check data types
  • Check unique values

Dataset Understanding

  • Separate categorical and numerical columns
  • Visualize distributions
  • Study correlations
  • Check outliers carefully
  • Understand target variable
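
Most of the checklist above fits into one small helper. This is a sketch on a made-up four-row DataFrame; quality_report is a hypothetical name and the columns are only examples.

```python
import pandas as pd

def quality_report(df):
    """Collect the basic quality checks into one dictionary."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
        "numerical_columns": df.select_dtypes(include="number").columns.tolist(),
    }

# Hypothetical data with one missing value and two repeated rows
df = pd.DataFrame({
    "school": ["GP", "MS", "GP", "GP"],
    "age": [15, 16, 15, 15],
    "score": [12.0, None, 12.0, 12.0],
})

report = quality_report(df)
print(report)
```
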

After that, I moved toward feature preparation. I used one-hot encoding for categorical features. The encoded categorical data had 43 columns. After combining encoded categorical features with numerical features, the final feature matrix had 58 columns. That is 43 encoded columns plus 15 numerical ones, which adds up if the target is held out of the 16 numerical columns.

Python
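from sklearn.preprocessing import OneHotEncoder

# Split column names by dtype: text-like columns are categorical, the rest numerical
cat_cols = []
num_cols = []

for col in df.columns:
    if df[col].dtype == "object":
        cat_cols.append(col)
    else:
        num_cols.append(col)

# One-hot encode the categorical columns (the result is a sparse matrix by default)
ohe = OneHotEncoder()
X_encoded = ohe.fit_transform(df[cat_cols])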
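
The combining step can be sketched on a tiny made-up table. To keep the example light, pandas.get_dummies is swapped in for OneHotEncoder here; the column names, including target, are hypothetical.

```python
import pandas as pd

# Hypothetical mini dataset: two categorical columns, one numerical, one target
df = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "guardian": ["mother", "father", "other"],
    "age": [15, 16, 17],
    "target": [10, 12, 14],
})

# Hold the target out before building the feature matrix
X = df.drop(columns="target")
y = df["target"]

cat_cols = X.select_dtypes(exclude="number").columns.tolist()
num_cols = X.select_dtypes(include="number").columns.tolist()

# One column per category value, then place the numerical columns next to them
X_cat = pd.get_dummies(X[cat_cols], dtype=int)
X_final = pd.concat([X_cat, X[num_cols]], axis=1)

print(X_final.shape)
```

The width of the final matrix is simply encoded columns plus numerical feature columns, here 5 + 1.
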

Even though this student performance dataset was not a pure NLP dataset, the workflow is still useful. NLP datasets also need EDA. For example, after collecting text data, I can check missing text, empty strings, duplicate reviews, label imbalance, text length distribution, number of categories, and noisy samples.

Connection to NLP: In text data, EDA might include average sentence length, word count distribution, missing labels, repeated documents, language detection, class imbalance, and noisy HTML or URL patterns.
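
A few of these text checks can be sketched in Pandas. The reviews below are invented, and the column names text and label are assumptions.

```python
import pandas as pd

# Hypothetical review data with an empty string, a duplicate, and a missing value
reviews = pd.DataFrame({
    "text": ["Great phone", "  ", "Great phone", "Battery died after two days", None],
    "label": ["pos", "neg", "pos", "neg", "pos"],
})

# Treat missing values and whitespace-only strings the same way
text = reviews["text"].fillna("").str.strip()
non_empty = text[text != ""]

summary = {
    "missing_or_empty": int((text == "").sum()),
    "duplicate_texts": int(non_empty.duplicated().sum()),
    "mean_word_count": float(non_empty.str.split().str.len().mean()),
    "label_counts": reviews["label"].value_counts().to_dict(),
}
print(summary)
```
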

08 How This Connects to Text Preprocessing

Data acquisition gives me raw data. Text preprocessing turns that raw data into cleaner model-ready text. These two topics are tightly connected. If I collect movie descriptions from an API, the next step may be lowercasing, removing extra spaces, handling punctuation, removing URLs, tokenization, and creating clean labels.

This is where the previous strings and regex topic becomes useful again. Regex can clean URLs, emails, phone numbers, repeated whitespace, and strange patterns. String methods can normalize casing, strip spaces, and replace fixed patterns. Data acquisition gives the dataset; preprocessing improves it.
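
A small sketch of that hand-off from acquisition to preprocessing, using only the standard library. The patterns are deliberately simple and will not catch every URL or email format; clean_text is a hypothetical helper name.

```python
import re

# Simple patterns, good enough for a first pass on scraped or API text
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Remove URLs and emails, collapse whitespace, and lowercase the result."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text)
    return text.strip().lower()

print(clean_text("Visit https://example.com NOW!   Mail me@example.com  "))
```

Replacing matches with a space instead of an empty string keeps neighbouring words from being glued together. The whitespace pattern then tidies up the gaps.
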

Data Acquisition Gives

  • Raw text
  • Metadata
  • Labels
  • Source information
  • Initial CSV or DataFrame

Text Preprocessing Does

  • Cleans noisy text
  • Normalizes casing
  • Handles punctuation and spacing
  • Tokenizes sentences or words
  • Prepares text for representation

This topic also made me realize something simple: a weak dataset leads to a weak model. If the collected data is incomplete, biased, duplicated, noisy, or badly labeled, the later NLP model will struggle no matter how advanced the algorithm is.

09 The Notebooks Behind This Post

This post is based on three notebooks from my NLP learning work. The blog explains what I understood, while the notebooks show the actual implementation and experiments.

github.com/vinod-kaumar/NLP-by-vinod

Notebook references: 01_web_scrapping.ipynb, 02_Data_Acquisition.ipynb, and 03_EDA_Student_performance.ipynb. These cover web scraping, JSON, SQL, APIs, CSV creation, and EDA.

Before pushing notebooks publicly, I should remove real API keys, passwords, local paths, and any private information. Public learning is powerful, but public notebooks should be safe and clean.

10 Related Reading in the NLP Journey

This post fits inside the Foundations Track of NLP by Vinod. It connects the early Python text handling topic to the next practical stage: text preprocessing.

  • NLP Learning Roadmap - From Fundamentals to Real-World AI Systems: The main roadmap that organizes this journey from NLP fundamentals to real-world AI systems.
  • Python Strings & Regex for NLP - The Real Foundation: The previous foundation topic where I learned string operations, Unicode, regex patterns, and text cleaning basics.
  • Text Preprocessing: The next topic where raw collected text becomes cleaner and more useful for tokenization, representation, and modeling.

11 What Comes Next in the NLP Journey

After data acquisition, the next topic is Text Preprocessing. This is where the collected raw text will be cleaned and normalized before it moves into tokenization and text representation.

  1. Text Cleaning: Removing unnecessary spaces, fixing casing, handling punctuation, and cleaning noisy patterns from raw text.
  2. Normalization: Making text consistent using lowercase conversion, contraction handling, spelling decisions, and standard formats.
  3. Tokenization: Breaking text into words, sentences, or subword units so models can process language more effectively.

Before Cleaning Text, I Need to Collect It Properly.

Data acquisition helped me understand where NLP datasets come from: websites, APIs, JSON files, SQL tables, and existing CSV datasets. Next, I will move into text preprocessing.

Following along? The notebooks and experiments are connected through GitHub so the learning stays visible and reproducible.
