NLP Overview
- What is NLP?
- Views from Behrooz Mansouri and Diyi Yang.
- Process, analyze, understand, and generate human (natural) language (text or speech).
- Techniques from linguistics (rule-based/symbolic), statistics, machine learning, deep
learning.
- Enormous amounts of unstructured text and speech data need automated processing.
- Bridges human communication and machines: voice assistants, chatbots, translation,
document analysis, etc.
- Automation of tasks previously human-intensive: e.g., customer support, document
classification, summarization, sentiment mining, information extraction.
- Historical Evolution and Approaches
- Early rule-/symbolic-based NLP
- Hand-written grammar rules, dictionaries, and heuristic symbol manipulation.
- Struggled with ambiguity, idioms, exceptions, syntactic variation, homonyms,
metaphors, etc.
- Shift to statistical/machine-learning approaches
- In the late 1980s and 1990s, growing computational power and large text corpora
made statistical methods feasible and popular.
- Statistical and ML-based approaches are more robust to noisy or unseen input.
- Modern deep-learning/pre-trained models era
- Large pre-trained models (e.g., transformer-based) learn from massive amounts
of text.
- Strong performance on translation, summarization, question answering, generation,
semantic understanding, etc.
- Views from Diyi Yang and FIU.
- Core NLP Tasks
- Text and speech processing
- Speech recognition (speech-to-text) - turning spoken utterances into textual form.
- Text classification - e.g., spam detection, sentiment analysis, topic classification,
etc.
- Lexical, syntactic, semantic analysis
- Tokenization - splitting text into words, tokens, punctuation, etc. (a short
tokenization/POS/NER sketch follows this section's task list).
- Part-of-speech tagging (POS tagging) - labeling each token with its grammatical
role, e.g., noun, verb, adjective, etc.
- Named-Entity Recognition (NER) - identifying entities, e.g., people, places,
organizations, dates, etc.
- Syntactic parsing and chunking - understanding sentence structure,
dependencies, and phrases.
- Semantic analysis - inferring meaning, context, word sense, and relationships
beyond syntax, e.g., coreference, metaphor, ambiguity.
- Discourse-level tasks and context-sensitive analysis
- Coreference resolution - when different words/phrases refer to the same entity.
- Discourse analysis - relations between sentences or paragraphs, e.g., elaboration,
contrast, cause-effect, narrative coherence.
- Generation, transformation, and higher-level tasks
- Machine translation - translating text from one language to another.
- Summarization - generating concise summaries of longer texts.
- Question answering (QA), conversational agents - responding to user queries,
maintaining context, engaging in conversation.
- Natural Language Generation (NLG) - generating fluent, human-like text from
structured data or knowledge: responses, reports, summaries, replies, etc.
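Several of the lexical tasks above map directly onto standard library calls. A minimal sketch using spaCy (an assumed library choice; NLTK or Stanza would work similarly):

```python
# A minimal sketch, assuming spaCy is installed along with its small
# English model:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris in March 2024.")

# Tokenization + part-of-speech tagging: each token gets a grammatical label.
for token in doc:
    print(token.text, token.pos_)

# Named-entity recognition: spans labeled as organizations, places, dates, etc.
for ent in doc.ents:
    print(ent.text, ent.label_)
```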
- Benefits and Use Cases
- Automation of repetitive and large-scale text tasks - e.g., customer support chatbots,
document classification and routing, data entry, information extraction.
- Analysis of unstructured text data - sentiment analysis of social media, reviews; trend
detection; summarizing large document corpora; extracting insights, e.g., from news,
legal, medical data.
- Bridging human-machine communication - voice assistants, speech-to-text systems,
conversational AI, translation tools.
- Business and enterprise workflows - document processing, compliance checking, customer
analytics, smart search and retrieval in large text archives, summarization, report
generation.
- NLP is Hard
- Views from Diyi Yang and Behrooz Mansouri.
- Human language is highly ambiguous, variable, and context-dependent - synonyms, idioms,
metaphors, homonyms, tone/sarcasm, grammar exceptions.
- Structure and meaning vs. statistical patterns
- Rule-based systems struggle with variability
- Statistical/ML systems require large, clean data
- Deep models may struggle with true understanding - semantics, real-world grounding,
world knowledge.
- Scalability and resource requirements - large corpora, computational resources, and
tuning.
- Ambiguity across levels - lexical ambiguity (word sense), syntactic ambiguity,
discourse-level ambiguities (coreference, anaphora), pragmatic/contextual inference.
- Evaluation difficulties - measuring understanding, coherence, meaning preservation,
semantic correctness is non-trivial compared to measuring syntactic correctness.
- The Modern NLP Stack
- Statistical + machine learning methods, supplemented by deep learning and pre-trained
language models.
- Pre-trained large language models (LLMs) serve as flexible backbones.
- Integration of symbolic processing - tokenization, POS tagging, parsing, NER, semantic
analysis, generation.
- End-to-end pipelines that support real-world applications (a minimal pipeline
sketch follows this list).
- Growing adoption in industry and enterprise.
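A minimal end-to-end sketch, assuming the Hugging Face transformers library is available; the default model choice is left to the library:

```python
# Assumes: pip install transformers
from transformers import pipeline

# Loads a default pre-trained sentiment model on first use (downloads weights).
classifier = pipeline("sentiment-analysis")
print(classifier("The new search feature works remarkably well."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```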
- What NLP Enables
- Computers understand human language, at least to a useful extent.
- Automation of tasks involving text.
- Interaction via natural language rather than programming languages - lowering the barrier
for users and enabling inclusive interfaces.
- Analysis of massive unstructured data sets, unlocking insights.
- Foundations for advanced AI systems: chatbots, search, recommendation, etc.
Exam-Style Questions
- Explain in 3-4 sentences why NLP is considered essential in modern AI and data processing.
Include an example of an application that fundamentally requires NLP.
- List and define four types of linguistic ambiguity (lexical, syntactic, semantic, pragmatic).
For each, provide an example that illustrates the ambiguity clearly.
- Describe the 5 main stages of a modern NLP pipeline.
What is the purpose of tokenization and embedding in this context?
- Compare the three eras of NLP: Rule-based, Statistical, Neural/Deep Learning.
What limitations motivated transitions between eras?
- Describe two strengths and two weaknesses of Large Language Models (LLMs) in NLP tasks.
- Explain the difference between syntactic parsing and semantic parsing.
- Provide an example sentence where syntactic analysis is correct but semantic analysis remains
unclear.
- Consider the sentence: "I saw the man with the telescope."
Provide two different parse trees or interpretations.
Identify which type(s) of ambiguity are present.
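For reference, the two readings of this classic PP-attachment (syntactic) ambiguity can be rendered as bracketed parses; the sketch below uses NLTK's Tree class (an assumed choice, since any bracketed notation makes the same point):

```python
# Assumes: pip install nltk
from nltk import Tree

# Reading 1: PP attaches to the VP - the telescope is the instrument of seeing.
vp_attach = Tree.fromstring(
    "(S (NP I) (VP (V saw) (NP (Det the) (N man)) "
    "(PP (P with) (NP (Det the) (N telescope)))))"
)

# Reading 2: PP attaches to the NP - the man has the telescope.
np_attach = Tree.fromstring(
    "(S (NP I) (VP (V saw) (NP (Det the) (N man) "
    "(PP (P with) (NP (Det the) (N telescope))))))"
)

vp_attach.pretty_print()
np_attach.pretty_print()
```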
- Tokenize the following text using simple whitespace tokenization, then a more realistic
tokenizer:
"U.S.-based models aren't always domain-specific."
Explain differences in token output and why they matter.
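A minimal sketch comparing naive whitespace tokenization with NLTK's word_tokenize (an assumption: nltk is installed and the "punkt" tokenizer data has been downloaded via nltk.download("punkt")):

```python
from nltk.tokenize import word_tokenize

text = "U.S.-based models aren't always domain-specific."

print(text.split())         # whitespace only: punctuation stays glued to words
print(word_tokenize(text))  # splits the contraction and final period into tokens
```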
- Given this POS-tagged sentence:
"Time/NN flies/VBZ like/IN an/DT arrow/NN ./.",
interpret the meaning suggested by these tags.
Provide an alternate tagging that corresponds to a different interpretation.
- Word embeddings place words in a vector space.
Given example cosine similarities:
sim("cat", "dog") = 0.87
sim("cat", "car") = 0.21
Explain what these numbers mean and why embeddings exhibit this behavior.
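A worked sketch of the cosine computation on toy 3-d vectors; the vectors are invented for illustration (real embeddings have hundreds of dimensions learned from co-occurrence patterns):

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|), in [-1, 1]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cat = np.array([0.8, 0.6, 0.1])
dog = np.array([0.7, 0.7, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(round(cosine(cat, dog), 2))  # high: "cat" and "dog" share contexts
print(round(cosine(cat, car), 2))  # low: "cat" and "car" rarely share contexts
```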
- Write Python code that (1) tokenizes a list of documents, (2) builds a
bag-of-words matrix, and (3) trains a simple classifier to detect whether text is
positive or negative. (No deep learning required; one possible sketch follows.)
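A minimal sketch of one possible answer, assuming scikit-learn is installed (pip install scikit-learn); the documents and labels are toy data invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "I loved this movie, great acting",
    "wonderful and uplifting story",
    "terrible plot, awful pacing",
    "boring, I hated every minute",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Tokenization + bag-of-words matrix in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# A simple linear classifier; no deep learning needed.
clf = LogisticRegression()
clf.fit(X, labels)

# Classify unseen text with the same vocabulary.
test = vectorizer.transform(["what a wonderful movie"])
print(clf.predict(test))  # expected: [1] (positive)
```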
- LLMs can hallucinate or misinterpret text.
Given the task "summarize this medical report accurately," explain
why an LLM might provide incorrect details, and
how adding symbolic or rule-based validation can reduce errors
(a toy validation sketch follows).
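A toy sketch of rule-based validation (an illustrative assumption, not a clinical-grade check): flag any number in a generated summary that does not appear in the source report, a common hallucination symptom:

```python
import re

def extract_numbers(text):
    # Capture integers and decimals, e.g., dosages or lab values.
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def unsupported_numbers(source, summary):
    # Numbers present in the summary but absent from the source.
    return extract_numbers(summary) - extract_numbers(source)

report = "Patient received 5 mg of drug X daily; blood pressure 120/80."
summary = "Patient received 50 mg of drug X daily."
print(unsupported_numbers(report, summary))  # {'50'} -> flag for human review
```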