The amount of text data has grown rapidly in recent years, and with it the need to analyze that data at scale. Word clouds offer a quick way to explore text visually: they display the words, or tags, in a corpus, and the prominence of each word reflects how frequently it occurs. In this guide, you will learn how to visualize text data using the word cloud feature in Azure Machine Learning Studio.
In this guide, you will work with Twitter data about the Bollywood movie Rangoon. The movie was released on February 24, 2017, and the tweets were extracted on February 25. These tweets are stored in a file named movietweets.csv. The data contains one tweet per row, and the column you will use is the text variable, which contains the tweet itself. Start by loading the data into the workspace.
Once you have logged into your Azure Machine Learning Studio account, click the EXPERIMENTS option, listed on the left sidebar, followed by the NEW button.
Next, click the blank experiment option, and a new workspace will open. Name the workspace WordCloud.
Next, load the data into the workspace. Click NEW, and select the DATASET option.
This opens a window that you can use to upload the dataset from your local system.
Once the data is loaded, you can see it in the Saved Datasets option. The file name is movietweets.csv. The next step is to drag it from the Saved Datasets list into the workspace. To explore this data, right-click the dataset and select the Visualize option.
You can see there are 14933 rows and 20 columns.
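It is important to pre-process text before you visualize it with a word cloud. Common pre-processing steps include converting text to lowercase, removing punctuation and numbers, removing stop words, stripping extra whitespace, and stemming.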
The Preprocess Text module is used to perform these and other text cleaning steps. Search for the module, drag it into the workspace, and connect it to the dataset.
You must specify the text variable to be pre-processed. Click on the Launch column selector option, and select the text variable.
Run the experiment and click Visualize to see the result. The Preprocessed text variable contains the processed text.
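For intuition about what this kind of cleaning does to a tweet, here is a minimal base-R sketch of a few typical steps: lowercasing, stripping punctuation and numbers, and collapsing whitespace. It is not the Preprocess Text module's implementation. The clean_text helper and the sample tweets are invented for illustration, and stop-word removal and stemming are deliberately left out.

```r
# Hypothetical helper (not part of Azure ML Studio): applies a few common
# text-cleaning steps using only base R.
clean_text <- function(x) {
  x <- tolower(x)                      # convert to lowercase
  x <- gsub("[[:punct:]]+", " ", x)    # replace punctuation with a space
  x <- gsub("[[:digit:]]+", " ", x)    # replace numbers with a space
  x <- gsub("\\s+", " ", x)            # collapse runs of whitespace
  gsub("^\\s+|\\s+$", "", x)           # trim leading and trailing spaces
}

# Made-up sample tweets, standing in for rows of the text column.
tweets <- c("Watching #Rangoon tonight!!  10/10",
            "Miss Julia was GREAT, go see it")
cleaned <- clean_text(tweets)
print(cleaned)
```

Replacing punctuation with a space rather than deleting it outright keeps words such as "10/10" or "great,see" from being glued together before the whitespace is collapsed.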
You have performed the pre-processing step, and the corpus is ready to be used for building a word cloud. You will use the R programming language to generate the word cloud. The Execute R Script module is used to run R code in the machine learning experiment.
To begin, search and add the Execute R Script module to the experiment. Next, connect the data to the first input port (left-most) of the module.
Click on the module, and under the Properties pane you will see the option of writing your R script. Enter the code below.
# lines 1 to 4

library(tm)
library(wordcloud)
library(RColorBrewer)
dataset <- maml.mapInputPort(1)

# lines 5 to 12 - text preprocessing

commatokenizer = function(x) unlist(strsplit(as.character(x), ","))
corpus <- Corpus(DataframeSource(data.frame(dataset[,1])))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, stopwords('english'))
corpus = tm_map(corpus, stemDocument)

# lines 13 and 14 - create term-document matrix, frequency

tdm = TermDocumentMatrix(corpus, control = list(tokenize = commatokenizer))
freq <- rowSums(as.matrix(tdm))

# line 15

wordcloud(names(freq), freq, min.freq = 10, max.words = 150,
          random.order = FALSE, random.color = FALSE, rot.per = .25,
          colors = brewer.pal(8, "Dark2"))
In the code above, the first three lines of code load the required libraries. The fourth line creates a data frame, dataset, which is mapped to the first input port with the function maml.mapInputPort().
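The maml.mapInputPort() function is supplied by the Execute R Script module, so it is not defined in an ordinary R session. If you want to dry-run parts of the script on your own machine, one option is a small local stand-in like the sketch below. The stub and its toy rows are assumptions made for illustration; inside Studio the real function returns whatever dataset is connected to the port.

```r
# Hypothetical local stand-in: defined only when the real function is absent,
# that is, when the code runs outside the Execute R Script module.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort <- function(port) {
    # Toy rows standing in for the tweet data connected to the input port.
    data.frame(text = c("rangoon is out", "miss julia in rangoon"),
               stringsAsFactors = FALSE)
  }
}

dataset <- maml.mapInputPort(1)
str(dataset)   # inspect the structure of the data frame
```

Guarding the definition with exists() means the same script can be pasted into Studio unchanged, where the stub is skipped.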
Lines five to twelve define a tokenizer, build the corpus, and further refine the previously preprocessed text with the tm_map function. The next two lines create the term-document matrix and store the frequency of words in the freq object. Finally, the wordcloud() function is used to build the word cloud. The major arguments of this function are given below.
min.freq: An argument which ensures that words with a frequency below min.freq will not be plotted in the word cloud.
max.words: The maximum number of words to be plotted.
random.order: An argument that specifies whether words are plotted in random order. If false, the words are plotted in decreasing frequency.
rot.per: The proportion of words with 90-degree rotation (vertical text).
colors: An argument that specifies the color of words from least to most frequent.
The above arguments have been provided in the wordcloud() function. Once you have set up the experiment, the next step is to run it.
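To make the roles of the word frequencies, min.freq, and max.words more concrete, the base-R sketch below counts words in a few toy documents and then applies both thresholds. It approximates the filtering idea only and is not the wordcloud package's internal code. The documents and the min_freq and max_words values are made up for illustration.

```r
# Toy documents standing in for the cleaned tweets.
docs <- c("rangoon miss julia",
          "rangoon miss",
          "rangoon miss julia",
          "rangoon movie review")

# Count how often each word occurs across all documents.
words <- unlist(strsplit(docs, " ", fixed = TRUE))
counts <- table(words)
freq <- setNames(as.numeric(counts), names(counts))

# Illustrative thresholds, playing the roles of min.freq and max.words.
min_freq <- 2
max_words <- 2

# Drop rare words, order by decreasing frequency, keep the top max_words.
kept <- sort(freq[freq >= min_freq], decreasing = TRUE)
kept <- head(kept, max_words)
print(kept)
```

Here "movie" and "review" fall below the minimum frequency, and "julia" is cut by the word limit, so only "rangoon" and "miss" would remain to be drawn.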
On successful completion, a green tick appears on the module.
Right-click and select Visualize to look at the output.
The following output is displayed. The word cloud generated shows that the words are plotted in decreasing frequency, which means that the most frequent words are in the center of the word cloud, and the words with lower frequency are farther away from the center.
You can see that the word "rangoon" is at the center of the word cloud, which makes sense as it was the name of the movie. Another interesting word is "miss," because the name of the central character in the movie was Miss Julia. This way, you can analyze the important words in a text corpus using a word cloud.
Word clouds are very useful in sentiment analysis as they highlight the key words in text. This has application in Twitter, Facebook, and other social media analytics tasks. Word clouds are also applied to build marketing campaigns or plan promotional advertisements where significant words are used.