Data Acquisition for NLP - Collecting Text Before Preprocessing

Data acquisition is the first real step before text preprocessing. In this post, I document how I collected data using web scraping, JSON files, SQL, and APIs, saved it through CSV workflows, and checked it with basic EDA.

Data acquisition for NLP means collecting the raw text or structured information that will later become useful input for an NLP pipeline. Before text preprocessing, tokenization, vectorization, model training, or transformers, there is one very practical question: where will the data come from?

In my NLP learning roadmap, I placed data acquisition before text preprocessing because a model cannot learn from data that has not been collected properly. In the previous topic, I learned Python strings and regex for NLP. That helped me understand raw text at the character and pattern level. Now this topic moves one step earlier in the pipeline: getting the data itself.

While working on the notebooks, I used multiple sources: web pages, JSON files, SQL databases, APIs, CSV files, and a student performance dataset for EDA. My rough understanding is simple: data acquisition is the starting point, and there are many ways to do it depending on where the data lives.

"Before preprocessing, there must be data."
Data acquisition is not the glamorous part of NLP, but it decides whether the rest of the pipeline starts with useful raw material or messy confusion.

Figure: Data acquisition pipeline for NLP (web scraping, JSON, APIs, SQL, CSV files, and a Pandas DataFrame). Data acquisition connects real-world sources to machine learning workflows. For NLP, this often means collecting text from pages, APIs, files, databases, and documents.

01 What Data Acquisition Means in NLP

In simple words, data acquisition means collecting data from a source and converting it into a usable format. In NLP, that data is usually text or text-related metadata. It may come from product reviews, news articles, movie descriptions, social media posts, PDFs, support tickets, company pages, chat logs, or public datasets.

The technical part is not just downloading something. The real work is converting messy sources into a structure that Python can process. Most of the time, that means creating a Pandas DataFrame with clear columns.

Raw Sources

HTML web pages
JSON files
SQL tables
REST APIs
CSV datasets
Documents and reports

Usable Output

Clean DataFrame columns
Text fields for NLP
Labels for supervised learning
Metadata for analysis
CSV files for storage
Notebook-ready datasets

This is why I started thinking of data acquisition as a bridge. On one side, there is raw information scattered across different places. On the other side, there is a structured dataset that can move into EDA, cleaning, preprocessing, and modeling.

Source: website, API, JSON, SQL, CSV
Fetch: requests, pandas, connector
Parse: HTML, JSON, tables, fields
Structure: DataFrame with columns
Use: EDA and preprocessing

My main learning: Data acquisition is not one method. It is a group of methods. The method depends on whether the data is stored on a website, inside a file, inside a database, or behind an API.

02 Web Scraping with Requests and BeautifulSoup

The first notebook focused on web scraping. I used requests to fetch the page and BeautifulSoup to parse the HTML. At first, the website returned an access denied page. That was a useful mistake because it showed me that websites may block plain Python requests.

After adding a browser-like User-Agent header, the response became usable. Then I inspected the HTML and found that company information was stored inside repeated card-like blocks. This is the real scraping process: observe the page structure, find repeated patterns, extract the fields, and store them.

Figure: Web scraping workflow using Python requests and BeautifulSoup with HTML inspection. Web scraping starts by inspecting the page structure, finding repeated HTML patterns, and extracting useful text fields into a dataset.

Python

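import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/list-of-companies"

# A browser-like User-Agent; some sites reject the default python-requests one.
headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 403/404 instead of parsing an error page

# "lxml" needs the lxml package; "html.parser" is the built-in alternative.
soup = BeautifulSoup(response.text, "lxml")

# Each company sits in one repeated card container.
cards = soup.find_all("div", class_="companyCardWrapper__metaInformation")
len(cards)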

In the notebook, each page had 20 company cards. I extracted company name, description, rating, rating count, reviews, salaries, interviews, jobs, benefits, and photos. After that, I looped through 50 pages and created a dataset with 1000 rows and 10 columns.

20 cards per page
50 pages scraped
1000 rows collected
10 columns created
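
The 50-page loop described above can be sketched roughly like this. It is a minimal sketch, not the notebook's exact code: the `?page=N` URL pattern, the `scrape_pages` helper, and the `fetch` and `parse` callables are my own placeholders, and the one-second delay is just a polite default.

```python
import time

BASE_URL = "https://www.example.com/list-of-companies"  # placeholder URL


def scrape_pages(fetch, parse, n_pages, delay=1.0):
    """Collect records from pages 1..n_pages.

    fetch(url) -> HTML string and parse(html) -> list of dicts are supplied
    by the caller (hypothetical helpers, not library functions).
    """
    rows = []
    for page in range(1, n_pages + 1):
        url = f"{BASE_URL}?page={page}"  # assumed pagination pattern
        rows.extend(parse(fetch(url)))
        if page < n_pages:
            time.sleep(delay)  # pause between requests to avoid overloading the server
    return rows
```

With 20 records per page and `n_pages=50`, the returned list has 1000 rows, which can go straight into `pd.DataFrame(rows)`.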

The Pattern I Used for Scraping

The most important idea was not the exact class name. Class names can change on websites. The important idea was the scraping pattern:

Fetch the page

Use requests.get() to download the HTML. If the site blocks the request, try a browser-like header and still respect the website rules.

Parse the HTML

Use BeautifulSoup to convert raw HTML into a searchable object where tags, classes, and text can be found.

Find repeated cards

Identify the repeated container that holds one complete record. In my case, each company card had similar structure.

Extract fields

Pull out the fields from each card using tag names and class names, then store them inside lists or a dictionary.

Create a DataFrame

Convert the dictionary into a Pandas DataFrame so the data becomes useful for EDA, cleaning, and future modeling.
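
The parse, find, extract, and structure steps above can be sketched on a tiny inline page. This is only an illustration under my own assumptions: the HTML, the `card`, `name`, and `rating` class names, and the `text_or_none` helper are made up, and I use the built-in `html.parser` so nothing extra needs installing.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for a fetched page; tag and class names are made up.
html = """
<div class="card"><h2 class="name">Acme Corp</h2><span class="rating">4.1</span></div>
<div class="card"><h2 class="name">Globex</h2><span class="rating">3.8</span></div>
<div class="card"><h2 class="name">Initech</h2></div>
"""


def text_or_none(card, tag, class_name):
    """Return the stripped text of the first matching element, or None if absent."""
    element = card.find(tag, class_=class_name)
    return element.get_text(strip=True) if element else None


soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all("div", class_="card")  # one card = one record

records = [
    {"name": text_or_none(c, "h2", "name"), "rating": text_or_none(c, "span", "rating")}
    for c in cards
]
df = pd.DataFrame(records)
```

Returning `None` for a missing field keeps one broken card from crashing the whole loop, and the gap shows up later as a missing value during EDA.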

Important ethical note: Web scraping should be done carefully. Check website rules, avoid aggressive requests, do not scrape private data, and do not overload servers. For learning, small controlled scraping is fine, but production scraping needs more responsibility.
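
One way to check website rules from Python is the standard library's `urllib.robotparser`. The sketch below is a rough illustration: the rules are inlined so it runs offline, and the user agent name is hypothetical. For a real site I would load its own file with `set_url(...)` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Example rules, inlined so the sketch runs without network access.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

agent = "my-learning-bot"  # hypothetical user agent name
allowed = parser.can_fetch(agent, "https://www.example.com/list-of-companies")
blocked = parser.can_fetch(agent, "https://www.example.com/private/data")
delay = parser.crawl_delay(agent)  # seconds requested by the site, or None
```

A robots.txt file is only one part of a site's rules, so the terms of service still need a separate read.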

03 Reading JSON Data into Pandas

The next notebook started with JSON data. JSON is very common in NLP because APIs usually return JSON responses, and many datasets store nested text fields in JSON format. I used pd.read_json() to read a file and convert it into a DataFrame.

The example had columns like id and ingredients. The interesting part was that one column contained lists. This is something I need to remember for NLP because text data can be nested: one row may contain a list of ingredients, tokens, tags, comments, or paragraphs.

Python

import pandas as pd

df = pd.read_json("test.json")

df.head()
df.columns

# Accessing one item from a list inside a cell:
# first row (by position), then first element of that row's list
df["ingredients"].iloc[0][0]

What clicked here: JSON to DataFrame is a common first step. Once the JSON is inside Pandas, I can apply EDA, cleaning, preprocessing, and feature engineering.
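
A list column usually needs one more step before NLP work. Here is a minimal sketch of two options on an inline sample shaped like the example above; the sample values and the `ingredients_text` column name are my own.

```python
from io import StringIO

import pandas as pd

# Inline sample: an id plus a list of ingredients per row.
raw = '[{"id": 1, "ingredients": ["salt", "pepper"]}, {"id": 2, "ingredients": ["rice"]}]'
df = pd.read_json(StringIO(raw))

# Option 1: join each list into one string, handy as a text field.
df["ingredients_text"] = df["ingredients"].str.join(" ")

# Option 2: one row per list item, handy for counting items.
long_df = df[["id", "ingredients"]].explode("ingredients", ignore_index=True)
```

Joining suits models that expect one text per row, while exploding suits frequency counts per item.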

04 Collecting Data from SQL Tables

Another source I explored was SQL. In real projects, data does not always come as a ready CSV file. Many applications store data in relational databases. For NLP, this could mean customer reviews, support tickets, product descriptions, messages, logs, or user feedback stored in tables.

In the notebook, I connected to a local MySQL database and used Pandas to run SQL queries. This helped me see how SQL tables can directly become DataFrames.

Python

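import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="YOUR_PASSWORD",  # placeholder; never commit a real password
    database="world"
)

# Note: pandas officially supports SQLAlchemy connectables and sqlite3 connections.
# A raw mysql.connector connection works here but may trigger a UserWarning.
pd.read_sql_query("SHOW TABLES", conn)

df_city = pd.read_sql_query("SELECT * FROM city", conn)
df_country = pd.read_sql_query("SELECT * FROM country", conn)

conn.close()  # release the connection once the DataFrames are loaded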

The important part is that SQL helps when the data is already structured. I do not need to scrape or parse HTML. I just need a good query and a clean connection. From there, Pandas becomes the bridge between database storage and machine learning workflow.
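
To show what "a good query" can look like for text data, here is a small sketch that pulls only the needed rows and columns. It is not from the notebook: I swap in `sqlite3` from the standard library so it runs without a MySQL server, and the `reviews` table and its columns are made up.

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for MySQL here so the sketch needs no server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, product TEXT, review_text TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (1, "phone", "Battery lasts all day."),
        (2, "phone", "Screen cracked in a week."),
        (3, "laptop", "Keyboard feels great."),
    ],
)

# Select only what is needed; the ? placeholder keeps the value out of the SQL string.
query = "SELECT id, review_text FROM reviews WHERE product = ?"
df_reviews = pd.read_sql_query(query, conn, params=("phone",))
conn.close()
```

Filtering in SQL instead of loading `SELECT *` keeps the DataFrame small. Note that the placeholder style depends on the driver: `?` is the sqlite3 form.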

05 Using APIs for Data Acquisition

APIs were the most practical part for me because many real-world datasets are accessed through APIs. An API is like a controlled data pipeline between two software systems. Instead of manually downloading pages, I send a request and receive structured data, usually as JSON.

In the notebook, I tried an API that returned related news data. Then I worked with a movie API to create a movie dataset with names, descriptions, and genres. This is highly connected to NLP because movie descriptions are text fields and genres can work as labels.

API key safety: Never publish real API keys in blog posts, GitHub notebooks, screenshots, or public repositories. Use environment variables or placeholders like YOUR_API_KEY_HERE.
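
A minimal sketch of the environment variable approach. The variable name `TMDB_API_KEY` and the `load_api_key` helper are my own choices, not something the API requires.

```python
import os


def load_api_key(env_name="TMDB_API_KEY"):
    """Read an API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(f"Set the {env_name} environment variable first.")
    return key
```

In a notebook this becomes `API_KEY = load_api_key()`, so the key never appears in the saved file or its outputs.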

Python

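import requests
import pandas as pd

API_KEY = "YOUR_API_KEY_HERE"  # placeholder; keep the real key out of the notebook

url = "https://api.themoviedb.org/3/movie/top_rated"
params = {"api_key": API_KEY, "language": "en-US", "page": 1}

# Passing params lets requests build and encode the query string
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # fail fast on an invalid key or a rate limit
data = response.json()

data.keys()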

The response contained keys like page, results, total_pages, and total_results. The actual movie records were inside results. From that, I selected useful columns like movie title, overview, and genre IDs.

Creating a Movie Dataset from API Data

The movie dataset was a good NLP example because the overview field is text. Later, this kind of data can be used for text classification, genre prediction, recommendation systems, semantic search, or topic analysis.

Python

# Convert movie results into a DataFrame
df = pd.DataFrame(data["results"])[["title", "overview", "genre_ids"]]

# Rename columns to clearer dataset names
df = df.rename(
    columns={
        "title": "movie_name",
        "overview": "description"
    }
)

df.head()

One problem was that the API returned genre IDs instead of genre names. So I used a second API endpoint to create a mapping from genre ID to genre name. Then I converted lists like [18, 80] into readable genre names like Drama, Crime.

Python

# Example genre mapping idea
genre_dict = {
    18: "Drama",
    80: "Crime",
    35: "Comedy"
}

# IDs missing from the mapping are skipped instead of raising KeyError
df["genre"] = df["genre_ids"].apply(
    lambda ids: ", ".join([genre_dict[i] for i in ids if i in genre_dict])
)

df = df.drop(columns="genre_ids")
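
Building the mapping from the second endpoint's response can be sketched like this. The payload is a hand-written sample, and I am assuming the genre endpoint returns a `genres` list of `id` and `name` objects, so it is worth checking the real response with `.keys()` first. The movie names are placeholders.

```python
import pandas as pd

# Hand-written sample of a genre-list response (assumed shape, not real API output).
genre_payload = {
    "genres": [
        {"id": 18, "name": "Drama"},
        {"id": 80, "name": "Crime"},
        {"id": 35, "name": "Comedy"},
    ]
}

# Build the id -> name lookup in one pass.
genre_dict = {g["id"]: g["name"] for g in genre_payload["genres"]}

movies = pd.DataFrame(
    {"movie_name": ["Movie A", "Movie B"], "genre_ids": [[18, 80], [35, 999]]}
)

# Unknown ids (999 here) are skipped instead of raising KeyError.
movies["genre"] = movies["genre_ids"].apply(
    lambda ids: ", ".join(genre_dict[i] for i in ids if i in genre_dict)
)
```

Generating the dictionary from the response avoids typing IDs by hand and picks up every genre the API knows about.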

Finally, I saved the dataset as a CSV file. This step matters because once the raw API data is converted into a stable CSV, it becomes easier to reuse for EDA, preprocessing, and modeling.

Python

df.to_csv("movies.csv", index=False)

06 Comparing Data Acquisition Methods

After trying different methods, I started seeing the difference between them more clearly. Every method has a use case. There is no single best method for all NLP projects.

Method	Best For	What I Learned
Web scraping	Collecting visible data from web pages	Useful when no API is available, but it needs careful HTML inspection and ethical handling.
JSON files	Nested data and API-like stored data	Pandas can read JSON directly, but nested lists and dictionaries need attention.
SQL	Structured data stored in databases	SQL queries can bring selected tables directly into Pandas for analysis.
APIs	Controlled access to live or structured data	APIs are clean and scalable, but keys, limits, and response structure must be handled safely.
CSV	Saving and reusing datasets	CSV becomes a simple checkpoint after collecting and structuring data.

The practical conclusion: In real NLP work, data acquisition is usually mixed. I may scrape some data, read existing CSV files, pull extra metadata from APIs, and then combine everything into one clean dataset.
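
Combining sources usually ends in a join on a shared key. Here is a rough sketch with two made-up frames standing in for a scraped CSV and API metadata; the column names and values are placeholders.

```python
import pandas as pd

# Pretend these came from two sources: a scraped CSV and an API call.
scraped = pd.DataFrame(
    {
        "movie_name": ["Movie A", "Movie B", "Movie C"],
        "description": ["text a", "text b", "text c"],
    }
)
api_meta = pd.DataFrame({"movie_name": ["Movie A", "Movie B"], "genre": ["Drama", "Comedy"]})

# A left join keeps every scraped row; indicator=True records which rows found metadata.
combined = scraped.merge(api_meta, on="movie_name", how="left", indicator=True)
unmatched = combined[combined["_merge"] == "left_only"]
```

Checking `unmatched` right after the merge shows which records still lack labels before any modeling starts.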

07 Why EDA Comes Immediately After Data Acquisition

The third notebook connected data acquisition to EDA. I used a student performance dataset and checked its shape, columns, missing values, duplicate rows, categorical columns, numerical columns, outliers, correlations, and visual distributions.

Figure: EDA dashboard showing API responses, SQL results, text length distribution, and dataset summaries. After collecting data, EDA helps check dataset quality, source distribution, missing values, text length patterns, labels, and other signals before preprocessing.

This helped me understand that collecting data is not enough. I need to inspect the dataset before preprocessing or modeling. Otherwise, I may blindly train on missing values, duplicate rows, strange outliers, incorrect data types, or poorly encoded categories.

649 rows in dataset
Also checked: original columns, missing values, duplicate rows

In the notebook, I separated categorical and numerical columns. I found 17 categorical columns and 16 numerical columns. I also observed outliers, but many of them were legitimate because some features were discrete or naturally limited.

Important observation: Outliers are not always bad data. Sometimes they are real values. Removing them without understanding the feature can damage the dataset.
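
A common way to flag outliers without deleting them is the 1.5 × IQR rule. This is a minimal sketch on made-up values for a count-like feature. It is not the notebook's data, and the IQR rule is my choice for illustration.

```python
import pandas as pd

# Made-up values for a count-like feature; not taken from the notebook's dataset.
absences = pd.Series([0, 0, 2, 2, 4, 4, 6, 8, 10, 30], name="absences")

q1, q3 = absences.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences instead of dropping them right away.
is_outlier = (absences < lower) | (absences > upper)
flagged = absences[is_outlier]
```

Here only the value 30 is flagged. Whether it is an error or a real student with many absences is a judgment about the feature, not something the rule can decide.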

Basic EDA Checklist After Data Acquisition

Dataset Quality

Check shape and columns
Check missing values
Check duplicate rows
Check data types
Check unique values

Dataset Understanding

Separate categorical and numerical columns
Visualize distributions
Study correlations
Check outliers carefully
Understand target variable
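
The quality half of this checklist maps onto a few Pandas calls. Here is a minimal sketch on a tiny made-up frame; the column names and values are placeholders, not the student dataset.

```python
import pandas as pd

# Tiny made-up frame with one missing value and two duplicated rows.
df = pd.DataFrame(
    {
        "school": ["GP", "GP", "MS", "GP"],
        "age": [17, 17, 18, 17],
        "score": [12.0, 12.0, None, 12.0],
    }
)

summary = {
    "shape": df.shape,
    "missing_per_column": df.isna().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "dtypes": df.dtypes.astype(str).to_dict(),
    "unique_per_column": df.nunique().to_dict(),
}
```

Collecting the checks in one dictionary makes it easy to rerun the same summary after each cleaning step and compare.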

After that, I moved toward feature preparation. I used one-hot encoding for categorical features. The encoded categorical data had 43 columns. After combining encoded categorical features with numerical features, the final feature matrix had 58 columns.

Python

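from sklearn.preprocessing import OneHotEncoder

# df is the student performance DataFrame loaded earlier in the notebook.
# Split columns by dtype: object (string) columns are treated as categorical.
cat_cols = []
num_cols = []

for col in df.columns:
    if df[col].dtype == "object":
        cat_cols.append(col)
    else:
        num_cols.append(col)

# One-hot encode the categorical columns (returns a sparse matrix by default)
ohe = OneHotEncoder()
X_encoded = ohe.fit_transform(df[cat_cols])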

Even though this student performance dataset was not a pure NLP dataset, the workflow is still useful. NLP datasets also need EDA. For example, after collecting text data, I can check missing text, empty strings, duplicate reviews, label imbalance, text length distribution, number of categories, and noisy samples.

Connection to NLP: In text data, EDA might include average sentence length, word count distribution, missing labels, repeated documents, language detection, class imbalance, and noisy HTML or URL patterns.
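
A few of these text checks can be sketched in Pandas. The reviews, labels, and the `report` dictionary below are made up for illustration.

```python
import pandas as pd

# Made-up reviews, including an empty string, a missing value, and a duplicate.
df = pd.DataFrame(
    {
        "text": ["Great movie", "Great movie", "", None, "Too long and a bit slow"],
        "label": ["pos", "pos", "neg", "neg", "neg"],
    }
)

text = df["text"].fillna("")
df["char_count"] = text.str.len()
df["word_count"] = text.str.split().str.len()  # whitespace-based word count

report = {
    "missing_or_empty": int((text.str.strip() == "").sum()),
    "duplicate_texts": int(df["text"].duplicated().sum()),
    "label_counts": df["label"].value_counts().to_dict(),
    "mean_word_count": float(df["word_count"].mean()),
}
```

Splitting on whitespace is a rough word count, good enough for spotting empty or unusually long texts before real tokenization.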

08 How This Connects to Text Preprocessing

Data acquisition gives me raw data. Text preprocessing turns that raw data into cleaner model-ready text. These two topics are tightly connected. If I collect movie descriptions from an API, the next step may be lowercasing, removing extra spaces, handling punctuation, removing URLs, tokenization, and creating clean labels.

This is where the previous strings and regex topic becomes useful again. Regex can clean URLs, emails, phone numbers, repeated whitespace, and strange patterns. String methods can normalize casing, strip spaces, and replace fixed patterns. Data acquisition gives the dataset; preprocessing improves it.
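
As a small preview of that cleaning step, here is a sketch that combines string methods and regex. The patterns are deliberately simplified assumptions: real URLs and emails have more edge cases than these expressions cover, and `clean_text` is my own helper name.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, drop URLs and emails, and collapse repeated whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


cleaned = clean_text("Visit  https://example.com/page NOW,\n or mail Me@Example.com  for details")
# cleaned == "visit now, or mail for details"
```

Replacing matches with a space rather than an empty string keeps neighbouring words from gluing together, and the final whitespace pass tidies up the extra spaces.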

Data Acquisition Gives

Raw text
Metadata
Labels
Source information
Initial CSV or DataFrame

Text Preprocessing Does

Cleans noisy text
Normalizes casing
Handles punctuation and spacing
Tokenizes sentences or words
Prepares text for representation

This topic also made me realize something simple: a weak dataset leads to a weak model. If the collected data is incomplete, biased, duplicated, noisy, or badly labeled, the later NLP model will struggle no matter how advanced the algorithm is.

09 The Notebooks Behind This Post

This post is based on three notebooks from my NLP learning work. The blog explains what I understood, while the notebooks show the actual implementation and experiments.

github.com/vinod-kaumar/NLP-by-vinod

Notebook references: 01_web_scrapping.ipynb, 02_Data_Acquisition.ipynb, and 03_EDA_Student_performance.ipynb. These cover web scraping, JSON, SQL, APIs, CSV creation, and EDA.

Before pushing notebooks publicly, I should remove real API keys, passwords, local paths, and any private information. Public learning is powerful, but public notebooks should be safe and clean.

10 Related Reading in the NLP Journey

This post fits inside the Foundations Track of NLP by Vinod. It connects the early Python text handling topic to the next practical stage: text preprocessing.

NLP Learning Roadmap

The main roadmap that organizes this journey from NLP fundamentals to real-world AI systems.

Python Strings and Regex for NLP

The previous foundation topic where I learned string operations, Unicode, regex patterns, and text cleaning basics.

Text Preprocessing in NLP

The next topic where raw collected text becomes cleaner and more useful for tokenization, representation, and modeling.

11 What Comes Next in the NLP Journey

After data acquisition, the next topic is Text Preprocessing. This is where the collected raw text will be cleaned and normalized before it moves into tokenization and text representation.

Text Cleaning

Removing unnecessary spaces, fixing casing, handling punctuation, and cleaning noisy patterns from raw text.

Normalization

Making text consistent using lowercase conversion, contraction handling, spelling decisions, and standard formats.

Tokenization

Breaking text into words, sentences, or subword units so models can process language more effectively.

Before Cleaning Text, I Need to Collect It Properly.

Data acquisition helped me understand where NLP datasets come from: websites, APIs, JSON files, SQL tables, and existing CSV datasets. Next, I will move into text preprocessing.

Following along? The notebooks and experiments are connected through GitHub so the learning stays visible and reproducible.

Vinod Codes | AI Engineering & Data Science

A structured public journey from NLP fundamentals to real-world AI systems.