Visualization Module in Arabica Speeds Up Text Data Exploration

Introduction

Arabica is a Python library for exploratory text data analysis that treats text from a time-series perspective. It reflects the empirical reality that many text datasets are collected as repeated observations over time. Time-series text data include newspaper article headlines, research article abstracts and metadata, product reviews, social network communication, and many others. Arabica simplifies exploratory data analysis (EDA) of these datasets by providing these methods:

arabica_freq: descriptive n-gram analysis and time-series n-gram analysis, for n-gram-based EDA of a text dataset

cappuccino: visual exploration of the data

This article provides an introduction to Cappuccino, Arabica’s visualization module for exploratory analysis of time-series text data. Read the documentation and a tutorial here for a general introduction to Arabica.

Cappuccino, visualization for exploratory text data analysis

The plots implemented are word cloud (unigram, bigram, and trigram versions), heatmap, and line plot. They help discover (1) the most frequent n-grams for the whole dataset, reflecting its time-series character (word clouds), and (2) the development of n-grams over time (heatmap, line plot).

The graphs are designed for use in presentations, reports, and empirical studies. They are therefore rendered in high resolution (the pixel dimensions depend on the data range displayed in the graphs).

Cappuccino relies on matplotlib, wordcloud, and plotnine to create and display graphs, and on cleantext and the NLTK corpus of stop words for pre-processing. Plotnine brings the widely used ggplot2 library to Python. The method’s parameters are:

def cappuccino(text: str,                # Text
               time: str,                # Time
               date_format: str,         # Date format: 'eur' - European, 'us' - American
               plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,           # Maximum of most frequent n-grams displayed for each period
               stopwords: [],            # Languages for stop words
               stopwords_ext: [],        # Languages for extended stop words list
               skip: [],                 # Remove additional stop words
               numbers: bool = False,    # Remove numbers
               lower_case: bool = False  # Lowercase text
)
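
To make the cleaning parameters more concrete, here is a minimal sketch of the kind of pre-processing that numbers, lower_case, stopwords, and skip describe. It is not Arabica's implementation: the helper clean_text and the tiny stop word set are hypothetical stand-ins for cleantext and the NLTK stop word corpus, and the sample headline is made up.

```python
import re
import string

# Hypothetical stand-in for the NLTK English stop word list (assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "for", "on", "and", "over"}

# Map every punctuation character to a space so words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, skip=(), numbers=False, lower_case=False):
    """Illustrative cleaning step; not Arabica's actual implementation."""
    if lower_case:
        text = text.lower()
    if numbers:
        text = re.sub(r"\d+", " ", text)  # drop digit sequences
    text = text.translate(PUNCT_TO_SPACE)
    removed = STOPWORDS | set(skip)
    tokens = [tok for tok in text.split() if tok.lower() not in removed]
    return " ".join(tokens)


cleaned = clean_text("Police probe 2 fires in the CBD, br",
                     skip=["g", "br"], numbers=True, lower_case=True)
print(cleaned)
```

With the default arguments, the sketch only strips punctuation and stop words, which mirrors the False defaults of numbers and lower_case in the signature above.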

Descriptive n-gram visualization

Descriptive analysis in Arabica provides n-gram frequency calculations without aggregation over a specific period. In simple terms, n-gram frequencies are first calculated for each text record, then summed over the whole dataset, and finally visualized in a plot.
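
The two counting steps can be sketched with the standard library. This is an illustration of the idea under simple assumptions (whitespace tokenization, made-up headlines), not Arabica's own code; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def descriptive_ngram_counts(records, n=1):
    """Count n-grams per record, then sum the counts over all records."""
    total = Counter()
    for record in records:
        per_record = Counter(ngrams(record.split(), n))  # step 1: one record
        total.update(per_record)                          # step 2: whole dataset
    return total


# Made-up headlines for illustration only.
headlines = [
    "police investigate fire",
    "police search for man",
    "man charged over fire",
]
unigrams = descriptive_ngram_counts(headlines, n=1)
bigrams = descriptive_ngram_counts(headlines, n=2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The resulting counts are what a word cloud would then scale its words by.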

Word cloud

Let’s illustrate the coding on the Million News Headlines dataset, which contains news headlines published daily from 2003-02-19 to 2016-09-18. The dataset is provided by the Australian Broadcasting Corporation under the CC0: Public Domain license. We’ll subset the data to the first 50 000 headlines.

First, install Arabica in a Python 3.10 environment with pip install arabica, then import Cappuccino:

from arabica import cappuccino
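
Loading and subsetting the data might look like the sketch below. The in-memory CSV and its made-up headlines stand in for the downloaded file, and the column names date and headline are assumptions chosen to match the calls later in this article; the real file may use different names.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV file (made-up rows, assumed columns).
sample_csv = io.StringIO(
    "date,headline\n"
    "2003-02-19,council considers new water plan\n"
    "2003-02-19,police probe fire\n"
    "2003-02-20,rain eases drought fears\n"
)

data = pd.read_csv(sample_csv)  # in practice, pass the path to the CSV file
data = data.head(50_000)        # keep the first 50 000 headlines
print(data.shape)
```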

After reading the data with pandas, the data looks like this:

Figure 2. Million News Headlines data

We lowercase the text, remove punctuation and numbers, remove English stopwords and other unwanted strings (“g”, “br”), and plot a word cloud with the 100 most frequent words:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'wordcloud',
           ngram = 1,               # n-gram size, 1 = unigram, 2 = bigram, 3 = trigram
           time_freq = 'ungroup',   # no period aggregation
           max_words = 100,         # displays 100 most frequent words
           stopwords = ['english'], # remove English stopwords
           stopwords_ext = None, 
           skip = ['g','br'],       # remove additional strings
           numbers = True,          # remove numbers
           lower_case = True        # lowercase text
)

It returns the word cloud:

Figure 3. Word cloud, image by author.

After changing ngram = 2, we get a word cloud with the 100 most frequent bigrams (see the cover picture). Alternatively, ngram = 3 displays the most frequent trigrams:

Figure 4. Word cloud (trigram), image by author.

Time-series n-gram visualization

Time-series text data typically display variability over time. Political statements before elections and newspaper headlines during the Covid-19 pandemic are good examples. To display n-grams over time, Arabica implements a heatmap and a line plot for monthly and yearly periods.

Image by author, source: Draw.io
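
The aggregation behind both plots can be sketched as follows: group records by period, count n-grams within each period, and keep the most frequent ones. This is an illustration under simple assumptions (ISO-formatted date strings, unigrams only, made-up headlines), not Arabica's implementation; the function name is hypothetical, and its time_freq and max_words arguments only mimic the parameters of the same name.

```python
from collections import Counter, defaultdict


def top_ngrams_by_period(dates, texts, time_freq="M", max_words=2):
    """Return the max_words most frequent unigrams for each period.

    Dates are ISO strings (YYYY-MM-DD); the period key is the year for
    'Y' and the year-month for 'M'. Illustrative only.
    """
    key_len = 4 if time_freq == "Y" else 7  # 'YYYY' or 'YYYY-MM'
    counts = defaultdict(Counter)
    for date, text in zip(dates, texts):
        counts[date[:key_len]].update(text.split())
    return {period: [word for word, _ in counter.most_common(max_words)]
            for period, counter in sorted(counts.items())}


# Made-up records for illustration only.
dates = ["2003-02-19", "2003-02-20", "2003-02-21",
         "2003-03-01", "2003-03-02", "2003-03-03"]
texts = ["police probe fire", "police probe", "police search",
         "war protest", "war protest talks", "war war"]

monthly = top_ngrams_by_period(dates, texts, time_freq="M", max_words=2)
yearly = top_ngrams_by_period(dates, texts, time_freq="Y", max_words=2)
print(monthly)
print(yearly)
```

Each period's list corresponds to one column of a heatmap, or to the set of lines drawn in a line plot.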

Heatmap

A heatmap with the ten most frequent words in each month is displayed with the following code:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'heatmap',
           ngram = 1,               # n-gram size, 1 = unigram, 2 = bigram
           time_freq = 'M',         # monthly aggregation
           max_words = 10,          # displays 10 most frequent words for each period
           stopwords = ['english'], # remove English stopwords
           stopwords_ext = None,    
           skip = ['g', 'br'],      # remove additional strings
           numbers = True,          # remove numbers
           lower_case = True        # lowercase text
)

The unigram heatmap is the output:

Figure 5. Heatmap (unigram), image by author.

The unigram heatmap gives us a first look at the variability of the data over time. We can clearly identify the important patterns in the data:

most frequent n-grams: “us”, “police”, “new”, “man”.

outliers (terms appearing only in one period): “war”, “wa”, “rain”, “killed”, “iraqi”, “concerns”, “budget”, “bali”.
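
Terms that show up in only one period can also be picked out programmatically from the per-period lists of top terms. The sketch below uses a made-up toy dictionary rather than the values behind Figure 5, and the helper name is hypothetical.

```python
from collections import Counter


def single_period_terms(top_terms_by_period):
    """Return terms that appear in the top list of exactly one period."""
    appearances = Counter(
        term
        for terms in top_terms_by_period.values()
        for term in set(terms)  # count each term once per period
    )
    return sorted(term for term, n in appearances.items() if n == 1)


# Toy input for illustration only.
top_terms = {
    "2003-02": ["police", "us", "war"],
    "2003-03": ["police", "us", "iraqi"],
    "2003-04": ["police", "us", "budget"],
}
outliers = single_period_terms(top_terms)
print(outliers)
```

The returned list could be passed to the skip parameter if these terms should be excluded in a later run.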

We might consider removing the outliers at a later stage of the analysis. Alternatively, changing ngram = 2 and max_words = 5 produces a heatmap with the five most frequent bigrams in each period.

Figure 6. Heatmap (bigram), image by author.

Line plot

A line plot with n-grams is displayed by changing plot = 'line'. Setting ngram = 1 and max_words = 5 creates a line plot for the five most frequent words in each period:

Figure 7. Line plot (unigram), image by author.

Similarly, by changing ngram = 2 and max_words = 3, the bigram line plot looks like this:

Figure 8. Line plot (bigram), image by author.

Final remarks

Cappuccino greatly helps in the visual exploration of text data that has a time-series character. With a single function call, we pre-process the data and get a first exploratory glimpse of the dataset. Here are several tips to follow:

The visualization frequency also depends on the length of the time dimension in the data. In long time series, a monthly plot will not display the data clearly, while a yearly plot of a short time series (less than a year) will not show any variability over time.

Select a suitable form of visualization on the basis of the dataset in your project. A line plot is not a good choice for datasets with high n-gram variability over time (see Fig. 8). In this case, the heatmap gives a better picture, even with many n-grams in each period.

Some questions we can answer with Arabica are (1) how the concepts in a specific domain (economics, biology, etc.) evolved over time, using research article metadata, (2) which key topics were emphasized during a presidential campaign, using Twitter tweets, (3) which parts of the brand and communication a company should improve, using customer product reviews.
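
Returning to the tip on visualization frequency, the choice between monthly and yearly aggregation can be turned into a rough rule of thumb. The helper below is hypothetical and is not part of Arabica; in particular, the max_months cut-off of 36 is an arbitrary assumption that you should tune to how many periods still read clearly in your plots.

```python
from datetime import date


def suggest_time_freq(first, last, max_months=36):
    """Suggest 'M' or 'Y' from the time span; max_months is an assumed cut-off."""
    # Number of calendar months touched by the span, inclusive of both ends.
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1
    return "M" if months <= max_months else "Y"


short_span = suggest_time_freq(date(2003, 2, 19), date(2003, 12, 31))
long_span = suggest_time_freq(date(2003, 2, 19), date(2016, 9, 18))
print(short_span, long_span)
```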

Arabica now has a sentiment and structural breaks analytical module. Read more and also check practical applications in these tutorials:

Sentiment Analysis and Structural Breaks in Time-Series Text Data

Customer Satisfaction Measurement with N-gram and Sentiment Analysis

Research Article Meta-data Description Made Quick and Easy

The complete code and data for this tutorial are here.

Petr Korab is a Python Engineer and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.
