
Regular Expressions (RegEx)

Updated: Mar 16

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and respond to human language. It bridges the gap between human communication and computer understanding by analyzing text or speech data. Applications include sentiment analysis, chatbots, machine translation, and text summarization.


There are different techniques available to extract information from a text:

  • Named Entity Recognition (NER): to identify key entities such as names of people, organizations, dates, and locations. Libraries like spaCy already provide pre-trained models for this task.

  • Regular Expressions (RegEx): to identify specific patterns such as numbers, dates, email addresses, among others.

  • Keyword Extraction: to surface the most relevant terms in a text, using an algorithm such as TF-IDF or RAKE.

  • Transformer Models: If you need to identify more complex information or perform context analysis, pre-trained models like BERT, GPT, or any transformer-based model can deliver advanced results.


Regular Expressions (RegEx) are sequences of characters that define a search pattern. They are a powerful tool used for matching, extracting, and manipulating specific patterns of text in a string. RegEx is widely used in text processing, data cleaning, validation, and more. It provides a concise and flexible means of identifying strings of interest within text.


RegEx operates by defining a pattern composed of literal characters, metacharacters, and operators. When applied to a text, the pattern is matched against the string to find sequences that conform to the defined criteria.


Key Elements in RegEx Patterns:

  • Literal Characters: Exact characters to match, e.g., cat matches "cat".

  • Metacharacters: Special characters with unique meanings (e.g., ., *, +, ?).

  • Character Classes: Sets of characters to match, enclosed in brackets, e.g., [a-z].

  • Quantifiers: Specify the number of times a character or pattern can occur, e.g., *, +, {n}.

  • Anchors: Define positions in the text, such as start (^) or end ($) of a string.
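As a rough illustration of these elements, the following sketch uses Python's built-in re module on a made-up sentence. The sample strings and variable names are assumptions chosen for the example and are not part of the solution developed later in this post.

```python
import re

text = "The cat sat on a mat."  # made-up sample sentence

# Literal characters: match the exact sequence "cat".
literal = re.findall(r"cat", text)

# Metacharacter ".": any single character (except a newline), then "at".
any_char = re.findall(r".at", text)

# Character class: "c" or "m", then "at".
char_class = re.findall(r"[cm]at", text)

# Quantifier "+": one or more digits in a row.
digits = re.findall(r"\d+", "7 apples, 42 pears, 365 days")

# Anchors: "^" ties the match to the start, "$" to the end of the string.
starts_with_the = re.search(r"^The", text) is not None
ends_with_cat = re.search(r"cat$", text) is not None

print(literal)          # ['cat']
print(any_char)         # ['cat', 'sat', 'mat']
print(char_class)       # ['cat', 'mat']
print(digits)           # ['7', '42', '365']
print(starts_with_the)  # True
print(ends_with_cat)    # False
```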


Examples of RegEx Applications

  • Data Validation:

    • Email addresses: r'[a-zA-Z0-9.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

    • Phone numbers: r'\+?\d{1,4}?[\s.-]?\(?\d{1,4}?\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}'

    • Credit card numbers (16 digits): r'\b(?:\d{4}[- ]?){3}\d{4}\b'

    • IP addresses (IPv4 format): r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

    • URLs: r'https?://(?:www\.)?\S+(?:\.\S+)+'

  • Text Extraction:

    • Dates:

      • dd/mm/yyyy or dd-mm-yyyy: r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'

      • yyyy-mm-dd: r'\b\d{4}-\d{2}-\d{2}\b'

    • Numbers: r'\d+(?:\.\d+)?'
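
The patterns above can be tried out with Python's re module. The sketch below is one possible way to do it: the helper is_valid and the sample strings are hypothetical, introduced only for illustration. It uses re.fullmatch when checking that a whole string fits a pattern, and re.findall when pulling matches out of longer text.

```python
import re

# Patterns copied from the list above.
EMAIL = r'[a-zA-Z0-9.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
IPV4 = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
ISO_DATE = r'\b\d{4}-\d{2}-\d{2}\b'


def is_valid(pattern, candidate):
    """Hypothetical helper: True only if the entire string fits the pattern."""
    return re.fullmatch(pattern, candidate) is not None


# Validation: the whole candidate string must match.
email_ok = is_valid(EMAIL, "ana@example.com")
email_bad = is_valid(EMAIL, "not-an-email")
ip_shape_only = is_valid(IPV4, "999.999.999.999")

# Extraction: find every occurrence inside a longer (made-up) text.
sample = "Contact ana@example.com from 192.168.0.1 before 2024-10-05."
emails = re.findall(EMAIL, sample)
ips = re.findall(IPV4, sample)
dates = re.findall(ISO_DATE, sample)

print(email_ok, email_bad, ip_shape_only)  # True False True
print(emails)  # ['ana@example.com']
print(ips)     # ['192.168.0.1']
print(dates)   # ['2024-10-05']
```

Note that the IPv4 pattern only checks the shape of the address, so a value such as 999.999.999.999 still passes. Range checks (each part between 0 and 255) need extra logic beyond this pattern.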


RegEx is a practical tool for anyone working with text: it offers precise, flexible pattern matching and string manipulation, with applications across programming, data analysis, and web development.


We are going to develop a solution in Python that utilizes the RegEx technique. We will structure the code into the following steps:


  1. Define the Function: The function identify_patterns takes a string, text, as input, so it can be reused on any text.

  2. Date Pattern Matching: The regular expression r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b' identifies dates in formats like dd/mm/yyyy or dd-mm-yyyy.

    • \b: Asserts a word boundary, so the match cannot start or end in the middle of a longer run of letters or digits.

    • \d{1,2}: Matches one or two digits for day and month.

    • [/-]: Matches either / or - as a separator.

    • \d{2,4}: Matches two to four digits for the year (a three-digit run also matches).

  3. Remove Dates from Text: After finding dates, the code removes them from the input text. This step ensures that the number detection does not accidentally include date components (e.g., "05" from "05/10/2024").

  4. Number Pattern Matching: The regular expression r'\d+(?:\.\d+)?' identifies numbers, including decimals.

    • \d+: Matches one or more digits.

    • (?:\.\d+)?: Optionally matches a decimal part (e.g., ".50").

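    • (?:...): A non-capturing group. It groups the decimal part without capturing it, so a function such as re.findall returns the whole number instead of only the group's contents.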

  5. Display Results: The function prints two lists:

    • Numbers: Found in the text excluding dates.

    • Dates: Found in the text.

  6. Input Text: The input_text string holds the sample text to analyze, translated into English.

  7. Call the Function: The input_text is passed to the function identify_patterns.
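
The steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the original listing: the sample input_text is hypothetical, the constant names are invented for readability, and the function also returns the two lists (in addition to printing them) so the result can be checked.

```python
import re

# Patterns described in steps 2 and 4 (constant names are illustrative).
DATE_PATTERN = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
NUMBER_PATTERN = r'\d+(?:\.\d+)?'


def identify_patterns(text):
    # Step 2: collect dates such as dd/mm/yyyy or dd-mm-yyyy.
    dates = re.findall(DATE_PATTERN, text)

    # Step 3: remove the dates so their digits are not counted as numbers.
    text_without_dates = re.sub(DATE_PATTERN, '', text)

    # Step 4: collect integers and decimals from the remaining text.
    numbers = re.findall(NUMBER_PATTERN, text_without_dates)

    # Step 5: display both lists.
    print("Numbers found (excluding dates):", numbers)
    print("Dates found:", dates)

    # Assumption: returning the lists as well makes the function testable.
    return numbers, dates


# Step 6: hypothetical sample text, written for this sketch.
input_text = (
    "The invoice dated 05/10/2024 totals 1500.50 dollars. "
    "On 12-09-2023 the balance was 2017.43, and on 23/11/2021 "
    "a payment of 3200 was received."
)

# Step 7: call the function.
numbers, dates = identify_patterns(input_text)
```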



The expected RegEx output is:

  • Numbers found (excluding dates): ['1500.50', '2017.43', '3200']

  • Dates found: ['05/10/2024', '12-09-2023', '23/11/2021']
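
To see why steps 3 and 4 are written the way they are, the short sketch below compares the alternatives on made-up strings. The variable names and sample text are assumptions for illustration only.

```python
import re

DATE_PATTERN = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
NUMBER_PATTERN = r'\d+(?:\.\d+)?'

sample = "Paid 1500.50 on 05/10/2024."  # made-up sample

# Without step 3, the date's components are reported as numbers.
numbers_with_dates = re.findall(NUMBER_PATTERN, sample)

# With step 3, the date is removed first and only the amount remains.
numbers_clean = re.findall(NUMBER_PATTERN, re.sub(DATE_PATTERN, '', sample))

# With a capturing group, re.findall returns only the group's contents.
capturing = re.findall(r'\d+(\.\d+)?', "1500.50 and 3200")
non_capturing = re.findall(NUMBER_PATTERN, "1500.50 and 3200")

print(numbers_with_dates)  # ['1500.50', '05', '10', '2024']
print(numbers_clean)       # ['1500.50']
print(capturing)           # ['.50', '']
print(non_capturing)       # ['1500.50', '3200']
```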
