
Named Entity Recognition (NER)

Writer: JA Soler

Updated: Jan 12

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and respond to human language. It bridges the gap between human communication and computer understanding by analyzing text or speech data. Applications include sentiment analysis, chatbots, machine translation, and text summarization.


There are different techniques available to extract information from a text:

  • Named Entity Recognition (NER): to identify key entities such as names of people, organizations, dates, locations, etc. Libraries like spaCy already have pre-trained models for this task.

  • Regular Expressions (RegEx): to identify specific patterns such as numbers, dates, email addresses, among others.

  • Keyword Extraction: to surface the most relevant terms in a text, using an algorithm such as TF-IDF or RAKE.

  • Transformer Models: to identify more complex information or perform context analysis, using pre-trained transformer-based models such as BERT or GPT.
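
As a quick illustration of the RegEx technique from the list above, here is a minimal sketch that uses only Python's standard library. The two patterns, the sample sentence, and the function name extract_patterns are assumptions made for this example; real email and date formats vary far more than these simplified patterns cover.

```python
import re

# Simplified patterns (assumptions for illustration, not exhaustive validators).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # e.g. 05/10/2024


def extract_patterns(text):
    """Return every email address and numeric date found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


sample = "Contact ana@example.com before 05/10/2024 or write to support@example.org."
print(extract_patterns(sample))
```

Patterns like these work well when the format is known in advance. Names of people or organizations have no fixed shape, which is where NER becomes the better fit.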


Named Entity Recognition (NER) is a fundamental technique in the field of NLP that aims to automatically identify and classify entities mentioned in a text. These entities can include names of people, organizations, locations, dates, numbers, monetary values, and more. NER is widely used in tasks such as information extraction, sentiment analysis, question-answering systems, and search engines.


The NER process involves two main tasks:


  1. Entity Detection in the text (identifying text fragments that represent entities). In this phase, the system identifies which words or sequences of words in the text represent entities. There are two steps:


    1. Tokenization: The first step is to divide the text into smaller units called "tokens," which are usually individual words.

    2. Boundary Marking: The system identifies the beginning and end of each named entity. This involves tagging each word in the text as part of an entity or not using labeling schemes such as BIO (Beginning, Inside, Outside):

      • B (Beginning): Indicates that the token is the start of an entity.

      • I (Inside): Indicates that the token is inside an entity spanning multiple tokens.

      • O (Outside): Indicates that the token is not part of an entity.


  2. Classification of Detected Entities: once entities are detected, the next step is to classify them into different predefined categories. Common categories include:

    • Person: Names of individuals (e.g., "Albert Einstein").

    • Organization: Names of companies, institutions, or government agencies (e.g., "Google," "United Nations").

    • Location: Geographical locations, countries, or cities (e.g., "Spain," "Madrid").

    • Date: Specific dates (e.g., "October 5, 2024").

    • Monetary Value: Amounts of money (e.g., "150 euros").

    • Time: Periods of time or specific hours (e.g., "2 weeks," "3:30 PM").
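
The two tasks above can be sketched in plain Python without any NLP library. In this minimal sketch the tokens and entity spans are written by hand rather than predicted by a model, and the helper names bio_encode and bio_decode, as well as the label strings, are hypothetical. It shows how boundary marking produces BIO tags with a category attached (B-PERSON, I-PERSON, O) and how those tags are turned back into classified entities.

```python
def bio_encode(tokens, spans):
    """Tag each token as B-<label>, I-<label>, or O.

    spans is a list of (start, end, label) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def bio_decode(tokens, tags):
    """Rebuild (entity text, label) pairs from a BIO tag sequence."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tokens = ["Albert", "Einstein", "visited", "Madrid", "."]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION")]  # hand-written for illustration

tags = bio_encode(tokens, spans)
print(tags)
print(bio_decode(tokens, tags))
# [('Albert Einstein', 'PERSON'), ('Madrid', 'LOCATION')]
```

In a real NER system a trained model predicts the tags; only the decoding step resembles what is shown here.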


Examples of NER applications:

  • Information Extraction: In legal or scientific documents, NER can be used to extract names of laws, drugs, people, companies, or important concepts.

  • Sentiment Analysis: NER can be combined with sentiment analysis to determine which people or brands are mentioned in a positive or negative context.

  • Search Engines: Advanced search systems use NER to recognize specific entities in queries and improve the relevance of the results.


We are going to develop a solution in Python that utilizes the NER technique. We will structure the code into the following steps:


  1. Load the Model: Load spaCy's small English model (en_core_web_sm), which is pre-trained on common tasks like tokenization, part-of-speech tagging, and Named Entity Recognition (NER).

  2. Input Text: Define the text you want to analyze for named entities.

  3. Process the Text: Run the text through spaCy's pipeline to extract entities, tokens, and other linguistic features.

  4. Print the Named Entities: for each entity, print its text and its type.
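
The four steps above can be sketched as follows. This is a minimal sketch, not the author's original code. It assumes spaCy is installed and that the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm). The sample sentence and the helper name extract_entities are assumptions made for illustration, and the exact entities and labels returned can vary with the model version.

```python
def extract_entities(doc):
    """Return (text, label) pairs for each entity in a processed spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # 1. Load the model. If spaCy or the model is missing, explain and stop.
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
    except (ImportError, OSError):
        print("Install spaCy and run: python -m spacy download en_core_web_sm")
        return

    # 2. Input text (hypothetical sample sentence, not from the original post).
    text = (
        "Elon Musk announced that Tesla will open a new factory in California "
        "on March 10, 2023, and plans to expand in Europe with an investment "
        "of $1,000 million."
    )

    # 3. Process the text through spaCy's pipeline.
    doc = nlp(text)

    # 4. Print each named entity with its type.
    for entity_text, label in extract_entities(doc):
        print(f"{entity_text} - {label}")


if __name__ == "__main__":
    main()
```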


 

 

The expected NER output is:

  • Elon Musk - PERSON

  • Tesla - ORG

  • California - GPE

  • March 10, 2023 - DATE

  • Europe - LOC

  • $1,000 million - MONEY


Explanation of Common Entity Labels:

  • PERSON: Names of individuals (e.g., "Elon Musk").

  • ORG: Organizations or companies (e.g., "Tesla").

  • LOC: Non-political or natural locations (e.g., "Europe").

  • GPE: Geopolitical entities like countries, cities, states (e.g., "California").

  • DATE: Dates in various formats (e.g., "March 10, 2023").

  • MONEY: Monetary amounts (e.g., "$1,000 million").


2 Comments


A very interesting post! There are many language processing techniques besides Large Language Models.


As a regular user of these techniques, I would like to point out when, in my opinion, I would use regular expressions (RegEx) and when I would use named entity recognition (NER).


For capturing text with a known format, such as a date, a quantity, or a code such as a person's ID, I would always go for regular expressions, as they are more manageable and robust. I would only use NER for specific problems, e.g., capturing proper nouns.


An example where I would use NER would be to capture ‘New York Times’. With regular expressions, you might be able to catch those words that…


JA Soler
Jan 07

Javier, thank you for sharing your experience in using RegEx versus NER. This is exactly the idea behind this learning community: for everyone to contribute their knowledge and experience while also learning from what others share.
