NLP in Action: Named Entity Recognition
How to tag known entities (such as people and places) in text and why it matters.
This is the second issue of our series, NLP in Action. This series highlights the good and the bad of common methods of natural language processing. In doing so, we hope to spark conversation and curiosity in the world of NLP.
This post is sponsored by Multimodal, a NYC-based development shop that focuses on building custom natural language processing solutions for product teams using large language models (LLMs).
With Multimodal, you will reduce your time-to-market for introducing NLP in your product. Projects take as little as 3 months from start to finish and cost less than 50% of newly formed NLP teams, without any of the hassle. Contact them to learn more.
Named entity recognition (NER) is a natural language processing tool for information extraction: the task of retrieving structured information from an unstructured text source.
Also known as entity extraction, identification, or chunking, NER identifies and classifies entities found in unstructured text. An entity is anything that exists as a distinct, self-contained unit. A named entity is a real-world entity, such as Google, London, or Jeff Bezos.
After identifying a named entity, NER then classifies the entity into categories. Some general classifications are organization (Google), location (London), or person (Jeff Bezos). NER tools also allow you to create your own classification category specific to the text you are analyzing. For example, a pharmaceutical company can create a “drug” category to assist them in analyzing research.
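The simplest way to support a custom category like "drug" is a gazetteer (dictionary) lookup. The sketch below is not how production NER models work (they are statistical), and the entity list is hypothetical, but it illustrates the identify-and-classify step NER performs:

```python
# Gazetteer-based tagger: a minimal, rule-based sketch of NER's
# identify-and-classify step. Real NER models are statistical; only the
# output shape (entity span + category) matches theirs.
GAZETTEER = {
    "Google": "ORGANIZATION",
    "London": "LOCATION",
    "Jeff Bezos": "PERSON",
    "ibuprofen": "DRUG",   # hypothetical custom "drug" category
}

def tag_entities(text):
    """Return (entity, category) pairs for gazetteer matches in text."""
    found = []
    for entity, category in GAZETTEER.items():
        if entity in text:
            found.append((entity, category))
    return found

print(tag_entities("Jeff Bezos visited Google's office in London."))
# -> [('Google', 'ORGANIZATION'), ('London', 'LOCATION'), ('Jeff Bezos', 'PERSON')]
```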
Sorting unstructured data into structured data makes it easier to pull meaning and insight from raw data. NER is useful in highlighting key elements and recognizing major themes in textual analysis.
Netflix, Hulu, and Disney+ all utilize NER in their content recommendations. If you enjoy action movies, they will curate a list of other movies labeled "action" that you may also enjoy. Customer support also uses NER when sorting through complaints or reviews. Would you rather manually read a thousand complaints or let a program analyze and provide statistics on specific products in under five minutes?
It’s a pretty clear choice. NER and other NLP tools make textual analysis of unstructured data easier and faster for companies and developers.
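Once entities have been extracted, turning a thousand complaints into per-product statistics is a few lines with a counter. A sketch assuming an NER pass has already produced (entity, category) pairs (the product names here are made up):

```python
from collections import Counter

# Hypothetical output of an NER pass over a batch of customer complaints:
# each complaint contributes (entity, category) pairs.
extracted = [
    ("Model X Router", "PRODUCT"),
    ("Model X Router", "PRODUCT"),
    ("Acme Support", "ORGANIZATION"),
    ("Model Y Modem", "PRODUCT"),
    ("Model X Router", "PRODUCT"),
]

# Count how often each product is mentioned across the whole batch.
product_counts = Counter(e for e, cat in extracted if cat == "PRODUCT")
print(product_counts.most_common(2))
# -> [('Model X Router', 3), ('Model Y Modem', 1)]
```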
Named Entity Recognition Methods
Five common NER toolkits are StanfordNLP, NLTK, SpaCy, Polyglot, and GATE's API. All of them offer pre-trained models to run on unstructured data, or users can build their own NER model on top of them to recognize specific named entities.
Let’s walk through each to understand how NER works and see the growth of this natural language processing tool since its introduction. Each of the five models will be tested and evaluated on a recent CNN article about the Olympics.
StanfordNLP Model
The StanfordNLP NER model, first released in 2006, is the classic standard among NER models. It relies heavily on probability and statistics rather than modern deep learning tools. Despite being computationally heavy, StanfordNLP is still a solid tool for learning the foundations of NER, and it offers over 10 models for detecting named entities.
NLTK Model
NLTK, the most widely used Python library for natural language processing, also offers an NER solution. NLTK has a built-in named entity recognition method that both identifies and classifies entities. Its general model follows three steps:
Tokenizing: This step breaks the text down into words.
Tagging parts of speech: This step tags each word with its part of speech (POS), taking special note of nouns, the primary markers of named entities.
Parts of speech the NLTK model recognizes include noun phrases, prepositions, conjunctions, verb phrases, determiners, and adjectives.
Parsing chunks: This step identifies and classifies named entities based on their parts of speech.
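In NLTK itself these three steps are `nltk.word_tokenize`, `nltk.pos_tag`, and `nltk.ne_chunk`. The pure-Python sketch below mimics the same tokenize → tag → chunk flow, with a toy capitalization rule standing in for NLTK's statistical POS tagger, so each step is visible:

```python
# Toy tokenize -> POS-tag -> chunk pipeline mirroring NLTK's NER flow.
# NLTK's real pipeline is word_tokenize -> pos_tag -> ne_chunk; here a
# simple capitalization rule stands in for the statistical POS tagger.

def tokenize(text):
    """Step 1: break the text into words."""
    return text.replace(".", "").split()

def tag_pos(tokens):
    """Step 2: tag tokens; proper nouns (NNP) via a capitalization rule."""
    return [(t, "NNP" if t[0].isupper() else "NN") for t in tokens]

def chunk(tagged):
    """Step 3: group consecutive NNP tokens into named-entity chunks."""
    chunks, current = [], []
    for token, pos in tagged:
        if pos == "NNP":
            current.append(token)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk(tag_pos(tokenize("Jeff Bezos founded Amazon in Seattle."))))
# -> ['Jeff Bezos', 'Amazon', 'Seattle']
```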
SpaCy Model
Similar to NLTK’s NER model, the SpaCy model also tokenizes, tags POS, and parses chunks to detect named entities for information extraction. It uses neural networks combined with feature engineering to extract entities from unstructured data. SpaCy also offers pre-trained models for users to employ on their data.
SpaCy also includes a feature called displacy that clearly labels named entities with their classifications. Below is a snapshot of the first few sentences of the CNN article from the SpaCy model utilizing displacy.
Because of this visualization tool and SpaCy’s effectiveness in correctly identifying named entities, SpaCy is widely considered the top-performing model of the five.
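displacy highlights each entity inline with its label. A toy renderer (not spaCy's implementation) that produces similar bracketed markup from (entity, label) pairs, such as those spaCy's `doc.ents` would supply:

```python
def render_entities(text, entities):
    """Mimic displacy's inline labeling by wrapping each entity span
    as [entity | LABEL]. `entities` is a list of (span, label) pairs,
    e.g. derived from spaCy's doc.ents."""
    for span, label in entities:
        text = text.replace(span, f"[{span} | {label}]")
    return text

marked = render_entities(
    "Tokyo hosted the Olympics.",
    [("Tokyo", "GPE"), ("Olympics", "EVENT")],
)
print(marked)
# -> [Tokyo | GPE] hosted the [Olympics | EVENT].
```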
Polyglot Model
Although not as popular as the previous models, Polyglot is a robust NER tool that uses a multitude of unsupervised datasets to identify named entities through the use of a neural network. It recognizes three predetermined entity categories: person, location, and organization.
Unlike StanfordNLP, NLTK, and SpaCy models, Polyglot was not trained on datasets that had already been annotated. Instead, Polyglot was trained on datasets from Wikipedia, using hyperlinks to its advantage to identify named entities.
Compared to the other models, Polyglot works best on informal, previously unseen datasets, thanks to its semi-supervised training approach.
GATE API Model
The last model tested was GATE Cloud’s ANNIE NER API. GATE Cloud provides text analytics as a service, built on GATE’s open-source platform.
With ten classifications available and 1,200 free requests a day, GATE Cloud’s API is an extremely useful tool to quickly perform NER. Users can import text files and test the pipeline directly on the website, or scale out their pipelines across thousands of documents.
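Because the ANNIE service is exposed over plain HTTP, calling it amounts to POSTing raw text and reading back JSON annotations. A hedged sketch of building such a request (the endpoint URL below is illustrative; check GATE Cloud's documentation for the real one), with the payload separated from the network call so it can be inspected:

```python
# Build an HTTP request for a text-analytics API such as GATE Cloud's
# ANNIE NER service. The endpoint below is illustrative only; consult
# the service's documentation for the actual URL and auth scheme.
ANNIE_ENDPOINT = "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer"

def build_ner_request(text):
    """Return the pieces of a POST request: raw text in, JSON out."""
    return {
        "url": ANNIE_ENDPOINT,
        "headers": {
            "Content-Type": "text/plain",
            "Accept": "application/json",
        },
        "data": text.encode("utf-8"),
    }

req = build_ner_request("Tokyo hosted the Olympics.")
# Sending it would be: requests.post(**req)  (requires the requests library)
print(sorted(req))
# -> ['data', 'headers', 'url']
```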
StanfordNLP, NLTK, SpaCy, Polyglot, and the GATE API were put to the test on the CNN article. Using three classifications (person, location, and organization), each model identified and classified named entities.
Here is a screenshot of the first 15 identified entities in the article and each model’s classification of them.
As you can see, entity labels are not consistent across models, and are sometimes simply wrong. NER has a long way to go, but the technology is in place for developers to improve these models.
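One simple way to quantify that inconsistency is to measure, per entity, what fraction of the models agree on the majority label. A sketch using hypothetical labels (the entities and votes below are illustrative, not the actual test results):

```python
from collections import Counter

# Hypothetical labels assigned by three models to the same entities.
model_labels = {
    "Tokyo": ["LOCATION", "LOCATION", "ORGANIZATION"],
    "IOC": ["ORGANIZATION", "ORGANIZATION", "ORGANIZATION"],
    "Thomas Bach": ["PERSON", "ORGANIZATION", "PERSON"],
}

def agreement(labels):
    """Return (majority label, fraction of models voting for it)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

for entity, labels in model_labels.items():
    label, score = agreement(labels)
    print(f"{entity}: majority={label}, agreement={score:.2f}")
```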
Attached is the link to the Google Colab notebook used to test these models with detailed documentation, functions, and outputs of each model.
You can also follow me on Twitter.