Data Acquisition for NLP - Collecting Text Before Preprocessing
Data acquisition is the first real step before text preprocessing. In this post, I am documenting how I collected data using web scraping, JSON files, SQL, APIs, CSV workflows, and basic EDA.
Data acquisition for NLP means collecting the raw text or structured information that will later become useful input for an NLP pipeline. Before text preprocessing, tokenization, vectorization, model training, or transformers, there is one very practical question: where will the data come from?
In my NLP learning roadmap, I placed data acquisition before text preprocessing because a model cannot learn from data that has not been collected properly. In the previous topic, I learned Python strings and regex for NLP. That helped me understand raw text at the character and pattern level. Now this topic moves one step earlier in the pipeline: getting the data itself.
While working on the notebooks, I used multiple sources: web pages, JSON files, SQL databases, APIs, CSV files, and a student performance dataset for EDA. My rough understanding is simple: data acquisition is the starting point, and there are many ways to do it depending on where the data lives.
Data acquisition is not the glamorous part of NLP, but it decides whether the rest of the pipeline starts with useful raw material or messy confusion.
01 What Data Acquisition Means in NLP
In simple words, data acquisition means collecting data from a source and converting it into a usable format. In NLP, that data is usually text or text-related metadata. It may come from product reviews, news articles, movie descriptions, social media posts, PDFs, support tickets, company pages, chat logs, or public datasets.
The technical part is not just downloading something. The real work is converting messy sources into a structure that Python can process. Most of the time, that means creating a Pandas DataFrame with clear columns.
Raw Sources
- HTML web pages
- JSON files
- SQL tables
- REST APIs
- CSV datasets
- Documents and reports
Usable Output
- Clean DataFrame columns
- Text fields for NLP
- Labels for supervised learning
- Metadata for analysis
- CSV files for storage
- Notebook-ready datasets
This is why I started thinking of data acquisition as a bridge. On one side, there is raw information scattered across different places. On the other side, there is a structured dataset that can move into EDA, cleaning, preprocessing, and modeling.
02 Web Scraping with Requests and BeautifulSoup
The first notebook focused on web scraping. I used requests to fetch the page and BeautifulSoup to parse the HTML. At first, the website returned an access denied page. That was a useful mistake because it showed me that websites may block plain Python requests.
After adding a browser-like User-Agent header, the response became usable. Then I inspected the HTML and found that company information was stored inside repeated card-like blocks. This is the real scraping process: observe the page structure, find repeated patterns, extract the fields, and store them.
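import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/list-of-companies"
# A browser-like User-Agent; some sites reject the default requests header
headers = {
    "User-Agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 403/404 instead of parsing an error page
soup = BeautifulSoup(response.text, "lxml")  # the "lxml" parser needs the lxml package
# Each company sits inside one repeated card block
cards = soup.find_all("div", class_="companyCardWrapper__metaInformation")
len(cards)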
In the notebook, each page had 20 company cards. I extracted company name, description, rating, rating count, reviews, salaries, interviews, jobs, benefits, and photos. After that, I looped through 50 pages and created a dataset with 1000 rows and 10 columns.
The Pattern I Used for Scraping
The most important idea was not the exact class name. Class names can change on websites. The important idea was the scraping pattern:
- requests.get() downloads the HTML. If the site blocks the request, try a browser-like header while still respecting the website's rules.
- BeautifulSoup converts the raw HTML into a searchable object where tags, classes, and text can be found.
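The pattern above can be sketched on a small inline HTML string, so it runs without touching any website. This is only a minimal sketch: the markup, the class names (card, name, rating), and the helper names are hypothetical and are not the ones from the site I scraped. It uses the built-in html.parser instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded page (not the real site markup).
SAMPLE_HTML = """
<div class="card"><h2 class="name">Acme Corp</h2><span class="rating">4.1</span></div>
<div class="card"><h2 class="name">Globex</h2><span class="rating">3.8</span></div>
<div class="card"><h2 class="name">Initech</h2></div>
"""


def text_or_none(card, tag, class_name):
    """Return the stripped text of the first matching child, or None if it is missing."""
    node = card.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else None


def parse_cards(html):
    """Turn every repeated card block into one dictionary of fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="card"):
        rows.append({
            "name": text_or_none(card, "h2", "name"),
            "rating": text_or_none(card, "span", "rating"),
        })
    return rows


companies = pd.DataFrame(parse_cards(SAMPLE_HTML))
print(companies)
```

The helper returns None when a field is missing, so one incomplete card does not crash the whole loop. For many pages, the same parse_cards() call would sit inside a loop over page URLs, with the rows collected into one list before building the DataFrame.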
03 Reading JSON Data into Pandas
The next notebook started with JSON data. JSON is very common in NLP because APIs usually return JSON responses, and many datasets store nested text fields in JSON format. I used pd.read_json() to read a file and convert it into a DataFrame.
The example had columns like id and ingredients. The interesting part was that one column contained lists. This is something I need to remember for NLP because text data can be nested: one row may contain a list of ingredients, tokens, tags, comments, or paragraphs.
import pandas as pd

# Load the JSON file into a DataFrame
df = pd.read_json("test.json")
df.head()
df.columns
# Access the first item of the list stored in the first row
df["ingredients"].iloc[0][0]
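To make the nested-list point concrete, here is a small sketch of two common ways to handle a list column. The records are hypothetical and only mimic the id and ingredients columns described above; the new column name ingredients_text is my own choice.

```python
from io import StringIO
import pandas as pd

# Hypothetical records shaped like the id / ingredients columns described above.
raw_json = StringIO(
    '[{"id": 1, "ingredients": ["salt", "black pepper"]},'
    ' {"id": 2, "ingredients": ["rice", "water", "salt"]}]'
)
recipes = pd.read_json(raw_json)

# Option 1: join each list into a single text field for later NLP steps.
recipes["ingredients_text"] = recipes["ingredients"].apply(", ".join)

# Option 2: explode the lists so that every ingredient gets its own row.
long_form = recipes[["id", "ingredients"]].explode("ingredients")

print(recipes[["id", "ingredients_text"]])
print(long_form)
```

Joining is handy when the list should become one text field per row. Exploding is handy when each item needs to be counted or analyzed on its own.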
04 Collecting Data from SQL Tables
Another source I explored was SQL. In real projects, data does not always come as a ready CSV file. Many applications store data in relational databases. For NLP, this could mean customer reviews, support tickets, product descriptions, messages, logs, or user feedback stored in tables.
In the notebook, I connected to a local MySQL database and used Pandas to run SQL queries. This helped me see how SQL tables can directly become DataFrames.
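import mysql.connector
import pandas as pd

# Connect to the local MySQL server (never commit a real password)
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="YOUR_PASSWORD",
    database="world"
)
# List the tables, then load two of them as DataFrames.
# Recent pandas versions may warn that only SQLAlchemy connections are officially supported.
pd.read_sql_query("SHOW TABLES", conn)
df_city = pd.read_sql_query("SELECT * FROM city", conn)
df_country = pd.read_sql_query("SELECT * FROM country", conn)
conn.close()  # release the connection when done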
The important part is that SQL helps when the data is already structured. I do not need to scrape or parse HTML. I just need a good query and a clean connection. From there, Pandas becomes the bridge between database storage and machine learning workflow.
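The idea of "a good query and a clean connection" can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 with an in-memory database, which is not what I used in the notebook. The reviews table and its columns are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, product TEXT, review_text TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (1, "phone", "Battery life is great"),
        (2, "phone", "Screen cracked in a week"),
        (3, "laptop", "Keyboard feels solid"),
    ],
)
conn.commit()

# Select only the rows and columns needed; the value is passed as a parameter.
query = "SELECT id, review_text FROM reviews WHERE product = ?"
phone_reviews = pd.read_sql_query(query, conn, params=("phone",))
conn.close()

print(phone_reviews)
```

Filtering inside the query means only the needed text reaches Pandas, which matters once tables get large. Passing the value through params instead of string formatting also avoids SQL injection problems.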
05 Using APIs for Data Acquisition
APIs were the most practical part for me because many real-world datasets are accessed through APIs. An API is like a controlled data pipeline between two software systems. Instead of manually downloading pages, I send a request and receive structured data, usually as JSON.
In the notebook, I first tried an API that returned related news data. Then I worked with a movie API to create a movie dataset with names, descriptions, and genres. This is closely connected to NLP because movie descriptions are text fields and genres can work as labels.
In the snippet below, YOUR_API_KEY_HERE is a placeholder for a real API key.
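import requests
import pandas as pd

API_KEY = "YOUR_API_KEY_HERE"
url = "https://api.themoviedb.org/3/movie/top_rated"
# Let requests build the query string from a dictionary
params = {"api_key": API_KEY, "language": "en-US", "page": 1}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # stop early on an invalid key or a rate limit error
data = response.json()
data.keys()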
The response contained keys like page, results, total_pages, and total_results. The actual movie records were inside results. From that, I selected useful columns like movie title, overview, and genre IDs.
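Since the response reports page and total_pages, collecting more than one page is mostly a loop. Here is a rough sketch of that loop. The function collect_results and the fetch_page callable are hypothetical names of my own, and fake_fetch only imitates the response keys listed above so that the example runs offline.

```python
def collect_results(fetch_page, max_pages=5):
    """Gather the `results` lists from consecutive pages into one list.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dictionary for that page.
    """
    records = []
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        records.extend(payload.get("results", []))
        # Stop early when the API reports that there are no more pages.
        if page >= payload.get("total_pages", page):
            break
        page += 1
    return records


# Hypothetical stand-in for a real request, so the sketch runs offline.
def fake_fetch(page):
    return {
        "page": page,
        "results": [{"title": f"Movie {page}", "overview": f"Plot {page}"}],
        "total_pages": 3,
        "total_results": 3,
    }


movies = collect_results(fake_fetch, max_pages=10)
print(len(movies))
```

In real use, fetch_page would wrap a requests.get() call with the page number in its parameters. Keeping the network call separate from the loop makes the paging logic easy to test, and the max_pages cap helps stay within API rate limits.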
Creating a Movie Dataset from API Data
The movie dataset was a good NLP example because the overview field is text. Later, this kind of data can be used for text classification, genre prediction, recommendation systems, semantic search, or topic analysis.
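# Convert movie results into a DataFrame and keep only the useful columns
df = pd.DataFrame(data["results"])[["title", "overview", "genre_ids"]]
# rename() returns a new DataFrame, which avoids copy warnings when columns are added later
df = df.rename(
    columns={
        "title": "movie_name",
        "overview": "description"
    }
)
df.head()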
One problem was that the API returned genre IDs instead of genre names. So I used a second API endpoint to create a mapping from genre ID to genre name. Then I converted lists like [18, 80] into readable genre names like Drama, Crime.
# Example genre mapping idea
genre_dict = {
    18: "Drama",
    80: "Crime",
    35: "Comedy"
}
# Turn each list of IDs into a readable string, skipping IDs missing from the mapping
df["genre"] = df["genre_ids"].apply(
    lambda ids: ", ".join([genre_dict[i] for i in ids if i in genre_dict])
)
df = df.drop(columns="genre_ids")
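Instead of typing the mapping by hand, it can be built from the response of the genre endpoint. The sketch below assumes that response is a dictionary with a genres list of id and name pairs. The payload here is hard-coded and hypothetical, and ids_to_names is a helper name of my own.

```python
# Hypothetical payload with the assumed shape of a genre-list response.
genre_payload = {
    "genres": [
        {"id": 18, "name": "Drama"},
        {"id": 80, "name": "Crime"},
        {"id": 35, "name": "Comedy"},
    ]
}

# Build the id -> name lookup instead of typing it by hand.
genre_lookup = {g["id"]: g["name"] for g in genre_payload["genres"]}


def ids_to_names(ids, lookup):
    """Convert a list of genre ids to a comma-separated string, skipping unknown ids."""
    return ", ".join(lookup[i] for i in ids if i in lookup)


print(ids_to_names([18, 80], genre_lookup))
```

Skipping unknown IDs keeps the conversion from failing on a genre that is missing from the mapping. The downside is that such a genre disappears silently, so it is worth checking for empty genre strings afterwards.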
Finally, I saved the dataset as a CSV file. This step matters because once the raw API data is converted into a stable CSV, it becomes easier to reuse for EDA, preprocessing, and modeling.
df.to_csv("movies.csv", index=False)
06 Comparing Data Acquisition Methods
After trying different methods, I started seeing the difference between them more clearly. Every method has a use case. There is no single best method for all NLP projects.
| Method | Best For | What I Learned |
|---|---|---|
| Web scraping | Collecting visible data from web pages | Useful when no API is available, but it needs careful HTML inspection and ethical handling. |
| JSON files | Nested data and API-like stored data | Pandas can read JSON directly, but nested lists and dictionaries need attention. |
| SQL | Structured data stored in databases | SQL queries can bring selected tables directly into Pandas for analysis. |
| APIs | Controlled access to live or structured data | APIs are clean and scalable, but keys, limits, and response structure must be handled safely. |
| CSV | Saving and reusing datasets | CSV becomes a simple checkpoint after collecting and structuring data. |
07 Why EDA Comes Immediately After Data Acquisition
The third notebook connected data acquisition to EDA. I used a student performance dataset and checked its shape, columns, missing values, duplicate rows, categorical columns, numerical columns, outliers, correlations, and visual distributions.
This helped me understand that collecting data is not enough. I need to inspect the dataset before preprocessing or modeling. Otherwise, I may blindly train on missing values, duplicate rows, strange outliers, incorrect data types, or poorly encoded categories.
In the notebook, I separated categorical and numerical columns. I found 17 categorical columns and 16 numerical columns. I also observed outliers, but many of them were legitimate because some features were discrete or naturally limited.
Basic EDA Checklist After Data Acquisition
Dataset Quality
- Check shape and columns
- Check missing values
- Check duplicate rows
- Check data types
- Check unique values
Dataset Understanding
- Separate categorical and numerical columns
- Visualize distributions
- Study correlations
- Check outliers carefully
- Understand target variable
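The dataset quality half of the checklist maps almost one to one onto Pandas calls. Here is a minimal sketch of that mapping. The function name quality_report and the toy data are hypothetical, and the toy data is not the student performance dataset.

```python
import pandas as pd


def quality_report(df):
    """Collect the basic dataset-quality numbers from the checklist."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "unique_per_column": df.nunique().to_dict(),
    }


# Hypothetical toy data, not the student performance dataset.
toy = pd.DataFrame({
    "school": ["A", "A", "B", "A"],
    "age": [15, 15, 16, 15],
    "score": [70.0, 70.0, None, 70.0],
})
report = quality_report(toy)
print(report)
```

Returning a dictionary instead of printing makes it easy to run the same report again after cleaning and compare the two results.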
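After that, I moved toward feature preparation. I used one-hot encoding for categorical features. The encoded categorical data had 43 columns. After combining encoded categorical features with numerical features, the final feature matrix had 58 columns. That is 43 encoded columns plus 15 numerical ones, so one of the 16 numerical columns (most likely the target) stayed out of the feature matrix.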
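from sklearn.preprocessing import OneHotEncoder

cat_cols = []
num_cols = []
# Split columns by dtype: object columns are treated as categorical
for col in df.columns:
    if df[col].dtype == "object":
        cat_cols.append(col)
    else:
        num_cols.append(col)
# One-hot encode the categorical columns (the result is a sparse matrix by default)
ohe = OneHotEncoder()
X_encoded = ohe.fit_transform(df[cat_cols])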
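The combining step can be checked on a tiny example: the width of the final matrix should equal the number of encoded columns plus the number of numerical feature columns. This sketch uses pd.get_dummies as a stand-in for OneHotEncoder to stay inside Pandas, so it is not the exact code from the notebook. The toy columns, including score as the target, are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; "score" plays the role of the target.
toy = pd.DataFrame({
    "school": ["A", "B", "A"],
    "guardian": ["mother", "father", "other"],
    "age": [15, 16, 17],
    "score": [10, 12, 14],
})

features = toy.drop(columns=["score"])  # keep the target out of the feature matrix
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in features.columns if c not in num_cols]

# pd.get_dummies is used here as a stand-in for OneHotEncoder.
encoded = pd.get_dummies(features[cat_cols])
X_final = pd.concat([encoded, features[num_cols]], axis=1)

print(encoded.shape[1], len(num_cols), X_final.shape[1])
```

Here the two categorical columns expand into five encoded columns, and one numerical feature brings the total to six. The same arithmetic is a quick sanity check on any real dataset.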
Even though this student performance dataset was not a pure NLP dataset, the workflow is still useful. NLP datasets also need EDA. For example, after collecting text data, I can check missing text, empty strings, duplicate reviews, label imbalance, text length distribution, number of categories, and noisy samples.
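Those text-specific checks can be sketched in a few lines of Pandas. The reviews below are made up, and the column names text and label are assumptions for the example, not columns from my notebooks.

```python
import pandas as pd

# Hypothetical reviews; the column names "text" and "label" are assumptions.
reviews = pd.DataFrame({
    "text": ["Great movie", "great movie", "", None, "Too long and slow", "Great movie"],
    "label": ["pos", "pos", "neg", "neg", "neg", "pos"],
})

text = reviews["text"]
# A row counts as empty when the text is missing or contains only whitespace.
is_empty = text.isna() | (text.fillna("").str.strip() == "")

summary = {
    "missing_text": int(text.isna().sum()),
    "empty_or_missing": int(is_empty.sum()),
    "exact_duplicates": int(text[~is_empty].duplicated().sum()),
    "label_counts": reviews["label"].value_counts().to_dict(),
}
# Word counts per non-empty text, as a rough length distribution.
word_counts = text[~is_empty].str.split().str.len()

print(summary)
print(word_counts.describe())
```

Only exact duplicates are counted here, so "Great movie" and "great movie" are treated as different texts. That is a good reminder that duplicate checks are worth repeating after preprocessing.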
08 How This Connects to Text Preprocessing
Data acquisition gives me raw data. Text preprocessing turns that raw data into cleaner model-ready text. These two topics are tightly connected. If I collect movie descriptions from an API, the next step may be lowercasing, removing extra spaces, handling punctuation, removing URLs, tokenization, and creating clean labels.
This is where the previous strings and regex topic becomes useful again. Regex can clean URLs, emails, phone numbers, repeated whitespace, and strange patterns. String methods can normalize casing, strip spaces, and replace fixed patterns. Data acquisition gives the dataset; preprocessing improves it.
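As a small preview of that cleaning step, here is a sketch that combines string methods and regex. The patterns are deliberately simple and are assumptions for illustration; real URLs and emails have more edge cases than these expressions cover.

```python
import re

# Simple illustrative patterns, not complete URL or email validators.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, drop URLs and emails, and collapse repeated whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_text("Visit https://example.com  or mail Me@Example.org   NOW!"))
```

Replacing matches with a space instead of an empty string keeps neighboring words from merging. The final whitespace pass then removes the extra gaps.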
Data Acquisition Gives
- Raw text
- Metadata
- Labels
- Source information
- Initial CSV or DataFrame
Text Preprocessing Does
- Cleans noisy text
- Normalizes casing
- Handles punctuation and spacing
- Tokenizes sentences or words
- Prepares text for representation
This topic also made me realize something simple: a weak dataset leads to a weak model. If the collected data is incomplete, biased, duplicated, noisy, or badly labeled, the later NLP model will struggle no matter how advanced the algorithm is.
09 The Notebooks Behind This Post
This post is based on three notebooks from my NLP learning work. The blog explains what I understood, while the notebooks show the actual implementation and experiments.
github.com/vinod-kaumar/NLP-by-vinod
Notebook references: 01_web_scrapping.ipynb, 02_Data_Acquisition.ipynb, and 03_EDA_Student_performance.ipynb. These cover web scraping, JSON, SQL, APIs, CSV creation, and EDA.
Before pushing notebooks publicly, I should remove real API keys, passwords, local paths, and any private information. Public learning is powerful, but public notebooks should be safe and clean.
10 Related Reading in the NLP Journey
This post fits inside the Foundations Track of NLP by Vinod. It connects the early Python text handling topic to the next practical stage: text preprocessing.
11 What Comes Next in the NLP Journey
After data acquisition, the next topic is Text Preprocessing. This is where the collected raw text will be cleaned and normalized before it moves into tokenization and text representation.
Before Cleaning Text, I Need to Collect It Properly.
Data acquisition helped me understand where NLP datasets come from: websites, APIs, JSON files, SQL tables, and existing CSV datasets. Next, I will move into text preprocessing.
Following along? The notebooks and experiments are connected through GitHub so the learning stays visible and reproducible.