Isaac Sacolick
Contributing writer

3 ways to experiment with text analytics

Analysis
Feb 04, 2021 · 6 mins
Analytics, Data Mining

Sift through your unstructured text with cloud-native products, machine learning tools, or specialized text analytics programs.


Text analytics, sometimes called text data mining, is the process of uncovering insightful and actionable information, trends, or patterns from text. The extracted and structured data is much more convenient than the original text, making it easier to determine the information’s data quality and usefulness. Developers and data scientists can then use the mined data in downstream data visualizations, analytics, machine learning, and applications.

Text analytics aims to identify facts, relationships, sentiments, or other contextual information. Extraction often starts with tagging entities such as people’s names, places, and products. It can then advance to assigning topics, determining categories, and discovering sentiments. When measures such as currencies, dates, or quantities are extracted, establishing their relationship to other entities (and any qualifiers) is a key text analytics capability.
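As a rough illustration of extracting measures, the sketch below pulls currency amounts and dates out of a sentence with regular expressions. The patterns, the `extract_measures` name, and the sample sentence are assumptions made for this example; they cover only US-dollar amounts and ISO-style dates, and they do not attempt the harder step of relating each measure to an entity.

```python
import re

# Illustrative patterns only: US-dollar amounts and YYYY-MM-DD dates.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_measures(text):
    """Return the currency amounts and ISO dates found in text."""
    return {
        "money": MONEY.findall(text),
        "dates": ISO_DATE.findall(text),
    }


# Hypothetical sample sentence.
sample = "Acme paid $1,250.50 for the license on 2021-02-04."
measures = extract_measures(sample)
print(measures)
```

A production pipeline would more likely rely on a trained entity recognizer, since hand-written patterns miss most real-world variations.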

Extracting data from documents versus form fields

The hardest challenges in text analytics are processing enterprise repositories and large documents such as aggregated news from websites, corporate SEC filings, electronic health records, and other unstructured or semistructured documents. Parsing documents has some unique challenges as the document’s size and structure often dictate domain-specific preprocessing rules and NLP (natural language processing) algorithms. For example, categorizing a 1,000-word blog post is a lot easier than ranking all of the topics found in a book collection. Also, larger documents often require validating the extracted information based on context; for instance, the medical conditions of a patient should be categorized independently from the conditions listed in their family history.
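To make the context problem concrete, here is a minimal sketch that splits a record into sections before looking for medical conditions, so that a condition in the family history is not attributed to the patient. The heading format, the function names, the sample record, and the tiny vocabulary are all assumptions for illustration; real clinical text would need far more robust parsing than substring matching.

```python
import re

# Assumes a section heading is a capitalized phrase alone on a line, ending in a colon.
HEADING = re.compile(r"^(?P<title>[A-Z][A-Za-z ]+):\s*$")


def split_sections(document):
    """Group non-empty lines under the most recent heading."""
    sections = {}
    current = "preamble"
    for raw_line in document.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            current = match.group("title").lower()
            sections.setdefault(current, [])
        elif line:
            sections.setdefault(current, []).append(line)
    return sections


def find_conditions(document, vocabulary):
    """Map each section name to the vocabulary terms it mentions."""
    found = {}
    for section, lines in split_sections(document).items():
        text = " ".join(lines).lower()
        hits = [term for term in vocabulary if term in text]
        if hits:
            found[section] = hits
    return found


# Hypothetical record used only to exercise the sketch.
record = """Patient History:
Diagnosed with asthma in 2015.

Family History:
Father had diabetes. Mother had asthma.
"""
conditions = find_conditions(record, ["asthma", "diabetes"])
print(conditions)
```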

But what if you want to perform a potentially simpler task of extracting information from a form field or other short text snippet? Consider these possible scenarios:

  • Quantify feedback from an employee survey’s open-ended responses
  • Process social media posts for their sentiments about brands or products
  • Categorize different types of chatbot interactions
  • Assign topics to user stories on an agile backlog
  • Route service desk requests based on the problem details
  • Parse information submitted to marketing on your website

These problems call for simpler algorithms than document parsing because the text fields are identifiable, short, and often carry a specific type of information.
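For a short field such as a service desk request, even a keyword lookup can serve as a first baseline. The sketch below routes a request to a team by counting keyword matches; the team names, the keyword lists, and the `route_request` function are hypothetical and stand in for whatever categories a real service desk uses.

```python
import re

# Hypothetical teams and keywords; a real deployment would maintain its own lists.
ROUTES = {
    "network": ("vpn", "wifi", "network"),
    "access": ("password", "login", "locked"),
    "hardware": ("laptop", "monitor", "keyboard"),
}


def route_request(text, default="general"):
    """Pick the team whose keywords appear most often in the request."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        team: len(words.intersection(keywords))
        for team, keywords in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default queue when nothing matched.
    return best if scores[best] > 0 else default


print(route_request("I forgot my password and my login is locked"))
print(route_request("Where is the cafeteria?"))
```

A baseline like this is mainly useful for measuring how much a trained classifier improves on it.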

Let’s say you need to leverage unstructured field data in an application or are asked to include insightful information extracted from text in a data visualization. Text analytics is an important first step, and agile data science teams often use spikes to conduct discovery work. The team needs tools, skills, and methodologies to perform text analytics. Here are three different approaches.

1. Use a public cloud’s NLP and cognitive services

The major public clouds offer natural language processing and other cognitive services, so teams already working in these environments and skilled at using these algorithms should research these options.

  • Azure Cognitive Services offers several related services. Form Recognizer can extract key/value pairs from text fields and documents, and Text Analytics can identify entities, sentiment, and key phrases. The more advanced Language Understanding capability can be used for developing NLP models in chatbot, mobile, and IoT applications.
  • Google Cloud Platform has two separate natural language offerings. Developers can use the natural language API to analyze basic entities, extract sentiment, and categorize content into 700 predefined categories. The more advanced AutoML Natural Language creates custom categorization and sentiment models.
  • Amazon Comprehend has similar text analytics and NLP features with APIs for detecting entities, events, key phrases, topics, sentiments, and personally identifiable information. Developers and data scientists can also use Amazon SageMaker to test, train, and deploy NLP models and libraries such as BlazingText, BERT (Bidirectional Encoder Representations from Transformers), or spaCy.
  • IBM Watson Natural Language Understanding can extract entities, sentiment, categories, and concepts but also has more sophisticated features for identifying relations, emotions, and semantic roles.
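As one example of what calling such a service can look like, the sketch below tallies sentiment labels for a batch of survey responses using the shape of Amazon Comprehend's `detect_sentiment` call, which takes `Text` and `LanguageCode` and returns a `Sentiment` label. The `tally_sentiment` helper and the `StubClient` class are assumptions added for this example; the stub only mimics the response shape so the code runs without an AWS account.

```python
from collections import Counter


def tally_sentiment(client, texts, language_code="en"):
    """Count the sentiment label returned for each text."""
    counts = Counter()
    for text in texts:
        response = client.detect_sentiment(Text=text, LanguageCode=language_code)
        counts[response["Sentiment"]] += 1
    return counts


class StubClient:
    """Hypothetical offline stand-in that mimics the response shape."""

    def detect_sentiment(self, Text, LanguageCode):
        label = "POSITIVE" if "great" in Text.lower() else "NEUTRAL"
        return {"Sentiment": label}


# Hypothetical survey responses.
responses = ["The new portal is great", "I use it weekly", "Great support team"]
counts = tally_sentiment(StubClient(), responses)
print(counts)
```

With credentials and a region configured, passing `boto3.client("comprehend")` in place of the stub should send the same requests to the real service.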

2. Use text analytics tools in data integration and machine learning platforms

If your organization has invested in data integration, machine learning, or analytics platforms, it’s likely that at least one of them has some text analytics and NLP capabilities. Using these platforms may be an easier and faster way to perform lightweight text analytics than coding to APIs or working in data science notebooks. Here are some examples:

Other data science platforms such as RapidMiner, Knime, and Dataiku offer text mining functions natively, through plug-ins and integrations with public cloud services.

3. Use specialized text analytics tools

If coding on public cloud platforms is too complex, and if your organization does not already have an analytics, data science, or machine learning platform with text mining capabilities, then you’re probably seeking a third option. Specialized text analytics tools may be the answer. Take a look at Keatext, Lexalytics, MeaningCloud, MonkeyLearn, NetOwl, Provalis Research, Rosette Text Analytics, and other platforms that offer text analytics capabilities.

Text analytics is also common in customer experience, marketing automation, market research, social listening, chatbot, and other platforms that capture qualitative information around customers and sales prospects.

It’s no surprise that many tools have text analytics capabilities. Some offer simple on-ramps with prebuilt models based on standardized entities, categories, and topics, whereas others enable robust model building. The platforms also differ by target use cases, with some focusing on specific industries, document types, integration requirements, or technology use cases.

If you’re just getting started with text analytics, there are a few best practices. Begin any data and analytics discovery exercise by defining questions and target outcomes that potentially deliver business value. From there, consider the overall complexity of the documents, content, and text fields that require processing, and examine the details around the target entities, topics, and semantics. Understanding the problem’s complexity can help determine whether an agile spike using a lightweight approach is viable or whether a more extensive agile proof of concept, co-constructed with text mining experts, is needed.

Most importantly, recognize that text analytics and natural language processing are forms of machine learning. Arriving at robust solutions requires experimenting with different algorithms, improving models, adding new data sources, and validating the results’ quality. For organizations trying to improve customer experiences, text analytics is an important capability to develop.
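One simple way to validate result quality is to compare a model's labels against a small hand-labeled sample and compute precision and recall per label. The sketch below does that for a hypothetical "billing" category; the function name and the sample labels are assumptions for illustration, not output from any real tool.

```python
def precision_recall(predicted, expected, label):
    """Compute precision and recall for one label over paired predictions."""
    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p == label and e == label)
    predicted_pos = sum(1 for p, _ in pairs if p == label)
    actual_pos = sum(1 for _, e in pairs if e == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical model output and hand-labeled answers for four requests.
predicted = ["billing", "billing", "access", "access"]
expected = ["billing", "access", "access", "access"]
precision, recall = precision_recall(predicted, expected, "billing")
print(precision, recall)
```

Tracking these two numbers across experiments gives a consistent way to judge whether a new algorithm or data source actually improved the model.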


Isaac Sacolick, President of StarCIO, a digital transformation learning company, guides leaders on adopting the practices needed to lead transformational change in their organizations. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO, a digital transformation influencer, and has over 900 articles published at InfoWorld, CIO.com, his blog Social, Agile, and Transformation, and other sites.

The opinions expressed in this blog are those of Isaac Sacolick and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
