NLP Overview
- What is NLP?
- Views from Behrooz Mansouri and Diyi Yang.
- Process, analyze, understand, and generate human (natural) language (text or speech).
- Techniques from linguistics (rule-based/symbolic), statistics, machine learning, deep
learning.
- Enormous amounts of unstructured text and speech data need automated processing.
- Bridges human communication and machines: voice assistants, chatbots, translation,
document analysis, etc.
- Automation of tasks previously human-intensive: e.g., customer support, document
classification, summarization, sentiment mining, information extraction.
- Historical Evolution and Approaches
- Early rule-/symbolic-based NLP
- Hand-written grammar rules, dictionaries, and heuristic symbol manipulation.
- Struggled with ambiguity, idioms, exceptions, syntactic variation, homonyms,
metaphors, etc.
- Shift to statistical/machine-learning approaches
- In the late 1980s and 1990s, growing computational power and large text corpora
made statistical methods feasible and popular.
- Statistical and ML-based approaches are more robust to noisy or unseen input.
- Modern deep-learning/pre-trained models era
- Large pre-trained models (e.g., transformer-based) learn from massive amounts
of text.
- Strong performance on translation, summarization, question answering, generation,
semantic understanding, etc.
- Views from Diyi Yang and FIU.
- Core NLP Tasks
- Text and speech processing
- Speech recognition (speech-to-text) - turning spoken utterances into textual form.
- Text classification - e.g., spam detection, sentiment analysis, topic classification,
etc.
- Lexical, syntactic, semantic analysis
- Tokenization - splitting text into words, tokens, punctuation, etc. (a short
tokenization/POS/NER sketch follows this section's task list).
- Part-of-speech tagging (POS tagging) - labeling each token with its grammatical
role, e.g., noun, verb, adjective, etc.
- Named-Entity Recognition (NER) - identifying entities, e.g., people, places,
organizations, dates, etc.
- Syntactic parsing and chunking - understanding sentence structure,
dependencies, and phrases.
- Semantic analysis - inferring meaning, context, word sense, and relationships
beyond syntax, e.g., coreference, metaphor, ambiguity.
- Discourse-level tasks and context-sensitive analysis
- Coreference resolution - when different words/phrases refer to the same entity.
- Discourse analysis - relations between sentences or paragraphs, e.g., elaboration,
contrast, cause-effect, narrative coherence.
- Generation, transformation, and higher-level tasks
- Machine translation - translating text from one language to another.
- Summarization - generating concise summaries of longer texts.
- Question answering (QA), conversational agents - responding to user queries,
maintaining context, engaging in conversation.
- Natural Language Generation (NLG) - generating fluent, human-like text from
structured data or knowledge: responses, reports, summaries, replies, etc.
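Several of the lexical tasks above map directly onto standard library calls. A minimal sketch using spaCy (an assumed library choice; NLTK or Stanza would work similarly):

```python
# A minimal sketch, assuming spaCy is installed along with its small
# English model:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris in March 2024.")

# Tokenization + part-of-speech tagging: each token gets a grammatical label.
for token in doc:
    print(token.text, token.pos_)

# Named-entity recognition: spans labeled as organizations, places, dates, etc.
for ent in doc.ents:
    print(ent.text, ent.label_)
```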
- Benefits and Use Cases
- Automation of repetitive and large-scale text tasks - e.g., customer support chatbots,
document classification and routing, data entry, information extraction.
- Analysis of unstructured text data - sentiment analysis of social media, reviews; trend
detection; summarizing large document corpora; extracting insights, e.g., from news,
legal, medical data.
- Bridging human-machine communication - voice assistants, speech-to-text systems,
conversational AI, translation tools.
- Business and enterprise workflows - document processing, compliance checking, customer
analytics, smart search and retrieval in large text archives, summarization, report
generation.
- NLP is Hard
- Views from Diyi Yang and Behrooz Mansouri.
- Human language is highly ambiguous, variable, and context-dependent - synonyms, idioms,
metaphors, homonyms, tone/sarcasm, grammar exceptions.
- Structure and meaning vs. statistical patterns
- Rule-based systems struggle with variability
- Statistical/ML systems require large, clean data
- Deep models may struggle with true understanding - semantics, real-world grounding,
world knowledge.
- Scalability and resource requirements - large corpora, computational resources, and
tuning.
- Ambiguity across levels - lexical ambiguity (word sense), syntactic ambiguity,
discourse-level ambiguities (coreference, anaphora), pragmatic/contextual inference.
- Evaluation difficulties - measuring understanding, coherence, meaning preservation,
semantic correctness is non-trivial compared to measuring syntactic correctness.
- The Modern NLP Stack
- Statistical + machine learning methods, supplemented by deep learning and pre-trained
language models.
- Pre-trained large language models (LLMs) serve as flexible backbones.
- Integration of symbolic processing - tokenization, POS tagging, parsing, NER, semantic
analysis, generation.
- End-to-end pipelines that support real-world applications (a minimal pipeline
sketch follows this list).
- Growing adoption in industry and enterprise.
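A minimal end-to-end sketch, assuming the Hugging Face transformers library is available; the default model choice is left to the library:

```python
# Assumes: pip install transformers
from transformers import pipeline

# Loads a default pre-trained sentiment model on first use (downloads weights).
classifier = pipeline("sentiment-analysis")
print(classifier("The new search feature works remarkably well."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```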
- What NLP Enables
- Computers understand human language, at least to a useful extent.
- Automation of tasks involving text.
- Interaction via natural language rather than programming languages - lowering the barrier
for users and enabling inclusive interfaces.
- Analysis of massive unstructured data sets, unlocking insights.
- Foundations for advanced AI systems: chatbots, search, recommendation, etc.
Exam-Style Questions
- Explain in 3-4 sentences why NLP is considered essential in modern AI and data processing.
Include an example of an application that fundamentally requires NLP.
- List and define four types of linguistic ambiguity (lexical, syntactic, semantic, pragmatic).
For each, provide an example that illustrates the ambiguity clearly.
- Describe the 5 main stages of a modern NLP pipeline.
What is the purpose of tokenization and embedding in this context?
- Compare the three eras of NLP: Rule-based, Statistical, Neural/Deep Learning.
What limitations motivated transitions between eras?
- Describe two strengths and two weaknesses of Large Language Models (LLMs) in NLP tasks.
- Explain the difference between syntactic parsing and semantic parsing.
- Provide an example sentence where syntactic analysis is correct but semantic analysis remains
unclear.
- Consider the sentence: "I saw the man with the telescope."
Provide two different parse trees or interpretations.
Identify which type(s) of ambiguity are present.
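For reference, the two readings of this classic PP-attachment (syntactic) ambiguity can be rendered as bracketed parses; the sketch below uses NLTK's Tree class (an assumed choice, since any bracketed notation makes the same point):

```python
# Assumes: pip install nltk
from nltk import Tree

# Reading 1: PP attaches to the VP - the telescope is the instrument of seeing.
vp_attach = Tree.fromstring(
    "(S (NP I) (VP (V saw) (NP (Det the) (N man)) "
    "(PP (P with) (NP (Det the) (N telescope)))))"
)

# Reading 2: PP attaches to the NP - the man has the telescope.
np_attach = Tree.fromstring(
    "(S (NP I) (VP (V saw) (NP (Det the) (N man) "
    "(PP (P with) (NP (Det the) (N telescope))))))"
)

vp_attach.pretty_print()
np_attach.pretty_print()
```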
- Tokenize the following text using simple whitespace tokenization, then a more realistic
tokenizer:
"U.S.-based models aren't always domain-specific."
Explain differences in token output and why they matter.
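A minimal sketch comparing naive whitespace tokenization with NLTK's word_tokenize (an assumption: nltk is installed and the "punkt" tokenizer data has been downloaded via nltk.download("punkt")):

```python
from nltk.tokenize import word_tokenize

text = "U.S.-based models aren't always domain-specific."

print(text.split())         # whitespace only: punctuation stays glued to words
print(word_tokenize(text))  # splits the contraction and final period into tokens
```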
- Given this POS-tagged sentence:
"Time/NN flies/VBZ like/IN an/DT arrow/NN ./.",
interpret the meaning suggested by these tags.
Provide an alternate tagging that corresponds to a different interpretation.
- Word embeddings place words in a vector space.
Given example cosine similarities:
sim("cat", "dog") = 0.87
sim("cat", "car") = 0.21
Explain what these numbers mean and why embeddings exhibit this behavior.
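A worked sketch of the cosine computation on toy 3-d vectors; the vectors are invented for illustration (real embeddings have hundreds of dimensions learned from co-occurrence patterns):

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|), in [-1, 1]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cat = np.array([0.8, 0.6, 0.1])
dog = np.array([0.7, 0.7, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(round(cosine(cat, dog), 2))  # high: "cat" and "dog" share contexts
print(round(cosine(cat, car), 2))  # low: "cat" and "car" rarely share contexts
```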
- Write Python code that (1) tokenizes a list of documents, (2) builds a
bag-of-words matrix, and (3) trains a simple classifier to detect whether text is
positive or negative. (No deep learning required; one possible sketch follows.)
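A minimal sketch of one possible answer, assuming scikit-learn is installed (pip install scikit-learn); the documents and labels are toy data invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "I loved this movie, great acting",
    "wonderful and uplifting story",
    "terrible plot, awful pacing",
    "boring, I hated every minute",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Tokenization + bag-of-words matrix in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# A simple linear classifier; no deep learning needed.
clf = LogisticRegression()
clf.fit(X, labels)

# Classify unseen text with the same vocabulary.
test = vectorizer.transform(["what a wonderful movie"])
print(clf.predict(test))  # expected: [1] (positive)
```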
- LLMs can hallucinate or misinterpret text.
Given the task "summarize this medical report accurately," explain
why an LLM might provide incorrect details, and
how adding symbolic or rule-based validation can reduce errors
(a toy validation sketch follows).
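A toy sketch of rule-based validation (an illustrative assumption, not a clinical-grade check): flag any number in a generated summary that does not appear in the source report, a common hallucination symptom:

```python
import re

def extract_numbers(text):
    # Capture integers and decimals, e.g., dosages or lab values.
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def unsupported_numbers(source, summary):
    # Numbers present in the summary but absent from the source.
    return extract_numbers(summary) - extract_numbers(source)

report = "Patient received 5 mg of drug X daily; blood pressure 120/80."
summary = "Patient received 50 mg of drug X daily."
print(unsupported_numbers(report, summary))  # {'50'} -> flag for human review
```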