Modern natural language processing (NLP) increasingly treats diverse tasks as sequence-to-sequence generation problems. Rather than restricting models to classification or extraction, we can frame many NLP tasks, such as summarization, correction, and translation, as generating a target text from an input text. This framing lets one generative architecture be reused across tasks instead of building a task-specific model for each.
Foundations of Text Generation
Early text generation relied on statistical n-gram models that predicted the next word from the previous n-1 words. While simple, these models could not capture long-range context. Modern approaches use neural encoder-decoder frameworks and transformer-based generators (such as GPT and T5), and sometimes adversarial training, to produce coherent, context-aware outputs. The toy example below samples words from a hand-written bigram table:
import random

# Simplified vocabulary and transition probabilities
word_transitions = {
    "the": {"cat": 0.6, "dog": 0.3, "bird": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "jumped": 0.1},
    "sat": {"on": 0.8, "under": 0.2},
    "on": {"mat": 0.5, "chair": 0.3, "floor": 0.2}
}

def generate_sequence(start_word, length=4):
    current = start_word
    result = [current]
    for _ in range(length - 1):
        # Stop early when the current word has no known successors
        if current not in word_transitions:
            break
        next_word = random.choices(
            list(word_transitions[current].keys()),
            weights=list(word_transitions[current].values())
        )[0]
        result.append(next_word)
        current = next_word
    return " ".join(result)

print(generate_sequence("the"))  # Example: "the cat sat on"
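The transition table above is written by hand. In a statistical n-gram model, such probabilities would normally be estimated from counts in a corpus. The following is a minimal sketch of that estimation for bigrams; the function name `estimate_bigram_transitions` and the toy corpus are illustrative assumptions, and tokenization is plain whitespace splitting.

```python
from collections import Counter, defaultdict

def estimate_bigram_transitions(corpus):
    """Estimate P(next | current) from whitespace-tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Count each adjacent (current, next) word pair
        for current, nxt in zip(tokens, tokens[1:]):
            counts[current][nxt] += 1
    transitions = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalize counts into relative frequencies
        transitions[current] = {w: c / total for w, c in followers.items()}
    return transitions

# Hypothetical toy corpus, for illustration only
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the floor",
    "the cat ran",
]
transitions = estimate_bigram_transitions(toy_corpus)
print(transitions["the"])
```

The resulting dictionary has the same shape as `word_transitions`, so it could be passed to a sampler like `generate_sequence` after a small change to accept the table as an argument.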
During training, neural generators typically minimize the cross-entropy between the predicted next-token distribution and the tokens of the reference text. Evaluation commonly uses overlap metrics such as BLEU (for translation) or ROUGE (for summarization), which measure n-gram overlap with human-written references.
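To make these two ideas concrete, here is a small pure-Python sketch. It is a simplification under stated assumptions: the probability tables are made-up toy values, and `unigram_overlap` computes only clipped unigram precision and unigram recall, which are the building blocks of BLEU and ROUGE-1 rather than the full metrics (real BLEU also uses higher-order n-grams and a brevity penalty).

```python
import math
from collections import Counter

def cross_entropy(predicted_probs, reference_tokens):
    """Average negative log-likelihood of the reference tokens.

    predicted_probs[i] maps candidate tokens to the probability assigned
    at position i (hypothetical toy distributions, not model output).
    """
    eps = 1e-12  # avoids log(0) when a reference token received no mass
    losses = [
        -math.log(probs.get(token, 0.0) + eps)
        for probs, token in zip(predicted_probs, reference_tokens)
    ]
    return sum(losses) / len(losses)

def unigram_overlap(candidate, reference):
    """Return (clipped unigram precision, unigram recall)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word minimum of the counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

probs = [{"the": 0.9, "a": 0.1}, {"cat": 0.5, "dog": 0.5}]
loss = cross_entropy(probs, ["the", "cat"])
precision, recall = unigram_overlap("the cat sat", "the cat sat on the mat")
print(round(loss, 4), precision, recall)
```

A short candidate can reach perfect precision while covering only half of the reference, which is why precision-oriented and recall-oriented metrics are usually reported for different tasks.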
Text Summarization
Summarization condenses source content into shorter, informative versions. Approaches include:
- Extractive: Selects key sentences from the source
- Abstractive: Generates novel phrasing using natural language generation (NLG)
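The extractive approach can be illustrated without any model at all. The sketch below scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones in source order. It is one simple heuristic among many, not a reference method; the function name `extractive_summary` is an assumption, and no stop-word filtering is applied, so very common words can dominate on real text.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences by average word frequency."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore source order
    return " ".join(sentences[i] for i in chosen)

doc = "Cats sleep. Cats sleep and cats play. Dogs bark loudly today."
print(extractive_summary(doc, num_sentences=1))
```

Because the output is copied verbatim from the source, an extractive summary cannot introduce unsupported wording, but it also cannot merge or rephrase sentences.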
Here's an abstractive example using a multilingual T5 variant:
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="csebuetnlp/mT5_multilingual_XLSum",
    tokenizer="csebuetnlp/mT5_multilingual_XLSum"
)
source_text = """Automatic trust negotiation addresses cross-domain trust establishment through iterative disclosure of access policies and digital certificates. Due to its unique approach and complex environments, it faces various security threats requiring specialized analysis and defense mechanisms."""
summary = summarizer(source_text, max_length=50, min_length=20)[0]['summary_text']
print(summary)  # Output varies but captures core concepts
Fine-tuning for Domain Adaptation
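For specialized domains, fine-tuning on in-domain examples can improve performance. The example below prepares prompt/completion pairs as a JSONL file for OpenAI's legacy fine-tuning command-line tool: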
import pandas as pd
import openai

# Prepare training data as prompt/completion pairs
training_data = [
    {"prompt": "Source text about ML algorithms", "completion": "ML algorithm overview"},
    {"prompt": "Paper on neural networks", "completion": "Neural network study"}
]
df = pd.DataFrame(training_data)
# One JSON object per line (JSONL); force_ascii=False keeps non-ASCII text readable
df.to_json("training.jsonl", orient="records", lines=True, force_ascii=False)

# Initiate fine-tuning from the command line (legacy, pre-1.0 openai CLI)
# !openai api fine_tunes.create -t training.jsonl -m ada
Fine-tuned models often improve on base models for in-domain inputs, though quality depends on the volume and relevance of the training data.
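Since training data quality matters this much, it can help to check the JSONL file before submitting it. The following is a minimal sketch, assuming the prompt/completion record format used above; the helper name `validate_jsonl` and the sample lines are illustrative, not part of any library.

```python
import json

def validate_jsonl(lines, required_keys=("prompt", "completion")):
    """Return (valid_records, problems) for an iterable of JSONL lines."""
    records, problems = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A key counts as missing if absent, not a string, or blank
        missing = [
            key for key in required_keys
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if missing:
            problems.append((number, "missing or empty: " + ", ".join(missing)))
        else:
            records.append(record)
    return records, problems

# Hypothetical sample lines; a real check would iterate over the file instead
sample_lines = [
    '{"prompt": "Paper on neural networks", "completion": "Neural network study"}',
    '{"prompt": "Source text about ML algorithms", "completion": ""}',
    'not json',
]
records, problems = validate_jsonl(sample_lines)
print(len(records), problems)
```

To check the file written earlier, the same function can be called on an open file handle, for example `validate_jsonl(open("training.jsonl", encoding="utf-8"))`.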
Text Correction
Correction systems fix spelling, grammar, and typographical errors. Techniques have evolved from rule-based dictionaries to neural approaches:
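from transformers import pipeline

# English grammar-correction T5 checkpoint; it expects a "grammar: " prefix.
# (shibing624/macbert4csc-base-chinese targets Chinese spelling correction and
# is a BERT masked-LM, so it does not fit a text2text pipeline on English input.)
corrector = pipeline(
    "text2text-generation",
    model="vennify/t5-base-grammar-correction"
)
noisy_text = "Ths is an exmple with erors."
correction = corrector(f"grammar: {noisy_text}", max_length=50)[0]['generated_text']
print(correction)  # Output varies; ideally "This is an example with errors."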
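For contrast, the rule-based dictionary end of that spectrum can be sketched with the standard library alone. This is an illustrative simplification: the tiny `VOCABULARY` list, the `correct_spelling` helper, and the 0.7 similarity cutoff are all assumptions, and a real system would load a full word list and weigh candidates by frequency.

```python
import difflib
import re

# Hypothetical mini-dictionary; a real system would load a full word list
VOCABULARY = ["this", "is", "an", "example", "with", "errors", "three", "apples"]

def correct_spelling(text, vocabulary=VOCABULARY, cutoff=0.7):
    """Replace each unknown word with its closest dictionary entry, if any."""
    def fix(match):
        word = match.group(0)
        if word.lower() in vocabulary:
            return word  # already a known word
        candidates = difflib.get_close_matches(
            word.lower(), vocabulary, n=1, cutoff=cutoff
        )
        if not candidates:
            return word  # nothing similar enough; leave unchanged
        best = candidates[0]
        # Preserve an initial capital letter
        return best.capitalize() if word[0].isupper() else best

    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_spelling("Ths is an exmple with erors."))
# This is an example with errors.
```

A lookup of this kind only repairs misspelled words in isolation. It has no notion of grammar, so agreement errors such as a wrong verb form pass through untouched, which is the gap that neural and LLM-based correctors address.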
Large language models like GPT-3.5 often outperform specialized correctors due to broader linguistic knowledge:
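# Uses the legacy openai SDK interface (openai<1.0), imported earlier;
# the API key is read from the OPENAI_API_KEY environment variable.
def correct_with_llm(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Fix errors: {text}"}]
    )
    return response.choices[0].message.content

print(correct_with_llm("I has three apple."))  # e.g. "I have three apples."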
Machine Translation
Neural machine translation (NMT) dominates modern systems, having largely replaced statistical methods. Transformer architectures enable high-quality, context-aware translations:
from transformers import pipeline

translator = pipeline(
    "translation_zh_to_en",
    model="Helsinki-NLP/opus-mt-zh-en"
)
chinese_text = "欢迎参加机器学习课程"
english_translation = translator(chinese_text)[0]['translation_text']
print(english_translation)  # e.g. "Welcome to the machine learning course"
Handling Long Documents
For texts that exceed a model's input limit, split the document into chunks at paragraph boundaries and translate each chunk separately:
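import openai  # legacy SDK interface (openai<1.0)

def translate_long_text(text, max_chunk_chars=500):
    # The budget is measured in characters, a rough proxy for the model's token limit.
    paragraphs = text.split('\n')
    chunks, current_chunk = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget;
        # the current_chunk check avoids emitting an empty first chunk.
        if current_chunk and len(current_chunk) + len(para) + 1 > max_chunk_chars:
            chunks.append(current_chunk.strip())
            current_chunk = para
        else:
            current_chunk += "\n" + para if current_chunk else para
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    translations = []
    for chunk in chunks:
        # text-davinci-003 has been retired; substitute a currently available model.
        response = openai.Completion.create(
            engine="text-davinci-003",
            prompt=f"Translate to English:\n{chunk}",
            max_tokens=2048
        )
        translations.append(response.choices[0].text.strip())
    return "\n".join(translations)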
Chunking at paragraph boundaries keeps each request within the model's limits and makes book-length translation feasible. The trade-off is that context is lost across chunk boundaries: larger chunks preserve more context but cost more per request.
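The chunking step can also be separated from the API call so that it can be tested on its own. The sketch below is one possible variant, not the method used above: the helper name `split_into_chunks` is an assumption, the budget is counted in characters as a rough stand-in for tokens, and paragraphs longer than the budget are split further at sentence-ending punctuation (including the CJK marks 。！？).

```python
import re

def split_into_chunks(text, max_chars=500):
    """Group paragraphs into chunks of at most max_chars characters where possible."""
    pieces = []  # (separator placed before the piece, piece text)
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            pieces.append(("\n", para))
            continue
        # Oversize paragraph: fall back to sentence-level pieces
        sep = "\n"
        for sentence in re.findall(r".+?(?:[.!?。！？]+\s*|$)", para):
            body = sentence.rstrip()
            pieces.append((sep, body))
            sep = sentence[len(body):]  # whitespace that followed, possibly ""

    chunks, current = [], ""
    for sep, piece in pieces:
        candidate = piece if not current else current + sep + piece
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Short one.\nAlpha beta. Gamma delta. Epsilon zeta.\nTail."
for chunk in split_into_chunks(sample, max_chars=25):
    print(repr(chunk))
```

A single sentence longer than the budget still ends up as an oversize chunk, so a production version would need a further fallback, and an exact limit would require counting tokens with the target model's tokenizer instead of characters.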