Insights from text data are derived through text analysis and mining techniques, mainly those practiced in natural language processing (NLP).
Cleaned and processed text data is rich in insights. For data scientists, however, text is harder to present in charts and graphs because it is not numerical. Text visualization calls for a different skill set, chiefly using screen real estate efficiently to show relationships between phenomena and to highlight the main message. This may mean leaving some data out so that the main insight comes through.
Several libraries have been developed to overcome this challenge. This guide will explore some text visualization libraries primarily written in Python.
The guide assumes you have intermediate-level skills in Python and general data visualization.
ScatterText is a powerful Python-based tool for extracting terms from a body of text and visualizing them in an interactive HTML display. The official GitHub repo can be found here.
To get started, install the library using pip.
pip install scattertext
To develop some code, check out a sample tutorial from the official repo here.
The output of a ScatterText scatterplot looks similar to this.
Source: Official Repository
A word cloud is a text visualization technique that focuses on the frequency of words and correlates the size and opacity of a word to its frequency within a body of text. The output is usually an image that depicts different words in different sizes and opacities relative to the word frequency.
One application of this form of visualization is document summarization: you process the text of a document and, from the most prominent words, get a general idea of what the document is about. The same approach applies to job applications. If you analyze a job description, the largest words in the cloud are most likely the most important skills for the job.
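The idea behind a word cloud can be sketched with the standard library alone: count how often each word occurs, drop stopwords, and scale each count into a font size. This is only an illustration of the principle, not how the wordcloud library lays out its image. The tiny stopword set, the sample text, the helper names word_frequencies and font_sizes, and the 10 to 60 size range are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; real libraries ship far larger ones.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "for", "is", "we", "are"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase word occurrences, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


def font_sizes(frequencies, min_size=10, max_size=60):
    """Scale each word's count linearly into a font size in [min_size, max_size]."""
    if not frequencies:
        return {}
    low, high = min(frequencies.values()), max(frequencies.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    return {
        word: round(min_size + (count - low) / span * (max_size - min_size))
        for word, count in frequencies.items()
    }


# Made-up snippet standing in for a job description.
sample = (
    "We are hiring a Python developer. The role needs Python, SQL and "
    "data visualization skills. Experience with Python testing is a plus."
)

freqs = word_frequencies(sample)
sizes = font_sizes(freqs)

# Print the three most frequent words with their counts and font sizes.
for word, count in freqs.most_common(3):
    print(word, count, sizes[word])
```

In this sample, "python" occurs three times and every other remaining word once, so "python" gets the largest size and would dominate the cloud, which is the summarization effect described above.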
The example below explores how to develop a word cloud in Python.
To get started, install wordcloud to generate the cloud of text and matplotlib to plot and visualize it. To install, run the command:
pip install wordcloud matplotlib
The body of text used is a job description from this link.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# STOPWORDS is a set of common words that carry little meaning on their own, mostly words such as "he" and "she".

# Generate the word cloud
text = "copy_text_from_job_description_in_link_provided_above"  # the input of the word cloud generator
# Create the WordCloud object, set the width and height, set random_state to ensure
# reproducibility of results, and set stopwords so that irrelevant words such as pronouns are discarded.
wordcloud = WordCloud(width=3000, height=2000, random_state=1, background_color='blue', collocations=False, stopwords=STOPWORDS).generate(text)
# text is the input to the generate() method
# Draw the figure
# Set the figure size
plt.figure(figsize=(40, 30))
# Display the image
plt.imshow(wordcloud)
# Hide the axes
plt.axis("off")
plt.show()
To run the above script, copy the code block into a Python file and name it wc_generator.py. To run the app, run:
python wc_generator.py
The figure below should appear.
These alternative visualization skills are vital in roles that involve business intelligence, data visualization, and data science. Visualization lets you communicate data insights graphically to an audience that may not be tech-savvy.