Topical Segmentation of Text Documents


Written by Alexandru Parvu (Machine Learning Engineer), in the February 2023 issue of Today Software Magazine.


Before the topic of this article can be addressed, we must first start with the problem it is trying to solve. It is a cliché at this point, but a true one: according to IBM, over 90% of all the data ever generated was created in the last two years.

However, what is rarely discussed is the fact that more than 80% of that data is unstructured. Most of the data generated today is in the form of videos, text, images, 3D art and so on.

This makes the data difficult to work with: unlike tabular data, there is no easy way to categorize, cluster, or filter it, often because the data carries no labels that could be used for filtering. Due to this lack of order, most of it is not readily accessible to the end user, and as a consequence most of this information can't be used in the decision-making process.

The Problem with Text Data

Text is by far the most widely spread form of unstructured data, and it is often the one that holds the most information. The old saying is that a picture is worth a thousand words, but nobody learns algebra from pictures, and nobody studies for engineering exams using pictures.

But we are not talking strictly about technical documents: most of the text data found today takes the form of articles, reviews, comments, historical documents, and many other kinds of information posted without labels or any way of making sense of it.

The issue could be solved by implementing a labelling convention from the very beginning of any process using, storing or sorting text data, but often by the time a convention is implemented, there are already many terabytes of text data that need to be reprocessed.

And even in the case of a pre-established convention, there would be no guarantee that the text data generated would fit within those constraints.

In that regard, this article aims to show that the problem of dealing with text data is not as hopeless as it seems.

In many cases, Machine Learning can resolve the issue, or at least make sense of what would otherwise be a very time-consuming problem.

Data Sources

For this article, three sources of text data have been used, each with the same problem: no structure, which makes any filtering, sorting, or discerning of information difficult if not impossible. Raw word counts convey little about the size of these text files, so a simple point of reference has been used instead: the volume of each text compared to the length of the Bible.

The data sources in question are:

  1. NPR News Articles, 9,266,936 words, 11.47 times larger than the Bible;

  2. Spotify Song Dataset, 12,653,383 words, 15.67 times larger than the Bible;

  3. Women’s Clothing Review, 1,221,308 words, 1.51 times larger than the Bible.

As can be seen, the datasets come from three very different sources, yet each presents the same problem: there are no labels, so there is no way to build a filter. Combined with the large text volumes, this makes it very difficult to extract any useful information from the text in a reasonable time.

The most obvious solution would be to build a simple count vectorizer, or a more advanced Term Frequency-Inverse Document Frequency (TF-IDF) embedding of the documents, and apply a classical clustering algorithm to the result.

However, while that would produce clusters, there would be no way of understanding what those clusters refer to. The initial problem of having to read the whole document collection to make sense of it would merely be split into having to do the same for several clusters. The time and resource problem would not be solved, only divided into smaller problems whose total sum is the same.
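For reference, a minimal sketch of that baseline approach (TF-IDF plus k-means; the variable documents is an assumption standing in for the raw texts):

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

# Embed the documents as TF-IDF vectors

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

X = tfidf.fit_transform(documents)

# Cluster them; the result is only an opaque cluster id per document

km = KMeans(n_clusters=7, random_state=42, n_init=10)

labels = km.fit_predict(X)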

There are, however, clustering techniques that can solve this issue; the two used in this article are Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).

Both algorithms rely upon the same assumptions:

  1. documents cluster around a series of words that define that cluster;

  2. each topic has a set of words with a higher probability of appearing.
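Although the rest of the article uses LDA, a minimal sketch of the NMF variant, which is conventionally paired with TF-IDF rather than raw counts, would look like this (documents is again an assumed stand-in for the raw texts):

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

dtm = tfidf.fit_transform(documents)

# Factorize the document-term matrix into document-topic and topic-word matrices

nmf = NMF(n_components=7, random_state=42).fit(dtm)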

NPR

The NPR dataset consists of 11,991 news articles, none of which carry a label. As Figure 1 shows, the most frequent words in the articles are common stop words, which cannot be used to determine a category.

[Figure 1: most frequent words in the NPR articles]

Even when these stop words are eliminated, it is difficult to determine a subject, as can be seen in Figure 2.

[Figure 2: most frequent words after stop-word removal]
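One way to run such a frequency check is a sketch like the following, using scikit-learn's built-in English stop-word list (data['Article'] is assumed, as in the code later in the article):

from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Count every word across all articles, skipping stop words

words = ' '.join(data['Article']).lower().split()

counts = Counter(w for w in words if w not in ENGLISH_STOP_WORDS)

print(counts.most_common(20))  # still no obvious category emerges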

It can be concluded that a simple word filter could not be used to categorize the articles. As a result, we are left with the tools mentioned above, namely Latent Dirichlet Allocation and Non-Negative Matrix Factorization.

We start by assuming seven clusters. There is no single correct number: the higher it is, the more specific the clusters become; the lower it is, the more general they become.

Starting from this assumption, we end up with 7 clusters, each with different frequencies for different words. For example, for the first topic (Topic_0), we can observe in Figure 3 that the words with the highest frequency tend to be financial in nature. Notice that the clustering algorithm does not assign a topic name.

[Figure 3: highest-frequency words for Topic_0]

It is up to us to determine the appropriate name from the word frequencies obtained for each cluster.

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.decomposition import LatentDirichletAllocation

# 'data' is a DataFrame holding the NPR articles in an 'Article' column.

# Build a document-term matrix, ignoring terms that appear in over 95%

# of the articles or in fewer than 2 of them.

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

dtm = cv.fit_transform(data['Article'])

# Fit LDA with the assumed 7 topics

LDA = LatentDirichletAllocation(n_components=7, random_state=42, n_jobs=-1, verbose=1, max_iter=100).fit(dtm)
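The twenty highest-weight words per topic can then be listed with a short helper (a sketch reusing the fitted objects above):

feature_names = cv.get_feature_names_out()

for index, topic in enumerate(LDA.components_):

    top_words = [feature_names[i] for i in topic.argsort()[-20:]]

    print(f'Topic_{index}: {top_words}')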

The following results were obtained for each Topic:

  1. Topic_0: [tax insurance states companies 000 law company year state money government new million federal care percent health people said says]. We could classify this topic as related to Finance.

  2. Topic_1: [attack state military war news department country according reported president russia security npr reports told government people says police said]. We could classify this topic as related to International News (from the point of view of the US).

  3. Topic_2: [local little land small way year make world home time day city new years just water people food like says]. We could classify this topic as National or Local News.

  4. Topic_3: [brain time years research new don percent just care drug children like disease medical patients study women health people says]. This topic clearly deals with Medical Research.

  5. Topic_4: [presidential just voters political vote donald party new people republican election white house obama state campaign clinton president said trump]. We can notice that this topic deals with the Presidential Election.

  6. Topic_5: [black says world ve said going story years don life music way really new think know time people just like]. This can be labeled as Cultural News.

  7. Topic_6: [children work science kids make really way schools don university education time new think just like people students school says]. This topic can be labeled as Education.

The topics are not labelled intuitively, but they can be deduced from the frequency of the words in each category. Using the labels determined above, we can take a few samples from the corpus to verify our assumptions, as shown below.
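To draw those samples, each article can be assigned to its most probable topic (a sketch; the Topic column name is an assumption):

topic_results = LDA.transform(dtm)            # document-topic probabilities

data['Topic'] = topic_results.argmax(axis=1)  # most probable topic per article

print(data[['Article', 'Topic']].sample(5))   # random spot-check of the labels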

[Table 1: sample NPR articles and their assigned topics]

Spotify

This clustering method clearly works for news articles, but the algorithm can also handle very different material, for example, song lyrics. Expectations should be kept in check, however: the algorithm should not be expected to detect the genre of the music.

As the name suggests, it segments text by topic, and the topic of a song is often not the same as its genre. The Spotify dataset contains a total of 57,650 song lyrics, which will be investigated to see whether they can be segmented by topic.

We can notice in Figure 4.1 and Figure 4.2 that the most common words in the songs are ordinary English stop words, as expected. However, if these are ignored, it can also be noticed that the most common words have to do with emotions. Thus, it could be assumed that the Topical Segmentation will lead to different topics relating to emotions.

[Figure 4.1: most common words in the song lyrics]

[Figure 4.2: most common words in the lyrics after stop-word removal]

The same algorithm used for the NPR data was applied to the Spotify text data, again assuming 7 topics. In Figure 5, the most frequent words in the 3rd topic can be observed. Most of them have a religious connotation, so it can be assumed that the songs in the 3rd category are religious in nature. Again, the topic name is only implied by the frequency of the words, as the algorithm provides only a number, not a topic name.

[Figure 5: most frequent words for the 3rd Spotify topic]

Thus, the topic names were decided by looking into the frequency of the words for each topic.

They are as follows:

  1. Topic_0: [wanna like hey know girl come let love want got don gonna yeah baby oh]. We can observe that the topic of these songs seems to be the Subject of a Courtship.

  2. Topic_1: [wind blue time dream rain day come sky eyes light sun away ll night like]. This topic seems to be focused on Nature.

  3. Topic_2: [feel away life want like way heart say time ve ll just don't know love]. This topic seems to be focused on Love, which the algorithm seems to consider a distinct topic from that of courtship.

  4. Topic_3: [free born heaven man sings die jesus soul oh let come life world lord god]. As we’ve noticed from this graph, these songs seem to have the topic of the Divine.

  5. Topic_4: [way day ll long town good little ve got just old said home man la]. This topic appears to be Locational in nature.

  6. Topic_5: [la santa gimme music di ba happy ha doo roll dance rock da christmas na]. This topic seems to be focused on Holidays.

  7. Topic_6: [nigga chorus fuck man shit money em just cause yaain know don got like]. The last topic seems to be indeed overlapping with Rap Music.

As before, a random sample is needed to confirm the assumptions made about the topics.

[Table 2: sample song lyrics and their assigned topics]

The assumptions appear to be correct. We can now check which topics are the most popular in the Spotify dataset, as seen in Figure 6:

[Figure 6: topic popularity in the Spotify dataset]
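The counts behind Figure 6 can be obtained by assigning each song its most probable topic; a minimal sketch, assuming songs is the lyrics DataFrame and that cv, LDA, and dtm were re-fitted on the lyrics as described above:

import matplotlib.pyplot as plt

# Most probable topic per song, from the lyrics document-topic matrix

songs['Topic'] = LDA.transform(dtm).argmax(axis=1)

songs['Topic'].value_counts().plot(kind='bar')  # topic popularity

plt.show()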

Women’s Clothing

No, this is not the beginning of a joke; from a technical point of view, it is closer to the beginning of a tragedy. The reason is that anyone selling women's clothing might assume it is straightforward to understand what customers are interested in when it comes to the product. And, of course, their satisfaction can be measured via the rating, as seen in Figure 7.

[Figure 7: distribution of customer ratings]

Overall, the customers are satisfied with the products. But we can’t really understand what the customers are interested in unless we go into the reviews, which are, of course, unstructured text data.

In such an environment, we can use the exact same tools as before. Figure 8 shows the most frequently used words, but these do not reveal any topics of interest.

[Figure 8: most frequent words in the clothing reviews]

As in the previous cases, we apply Latent Dirichlet Allocation, again assuming 7 topics of interest. The result is the following topics, each with its twenty most frequent words:

  1. Topic_0: [don, ordered, went, got, jeans, did, pants, try, love, bought, just, online, price, fit, retailer, tried, saw, size, sale, store]. It appears that here the focus is on the Sale of the item.

  2. Topic_1: [petite, bit, right, nice, little, great, short, just, love, hips, look, fabric, flattering, long, like, length, skirt, size, fit, waist]. In this topic the focus seems to be on the Fit of the item.

  3. Topic_2: [long, colors, cute, nice, black, look, bought, fall, like, looks, jacket, comfortable, jeans, perfect, soft, wear, color, sweater, love, great]. The focus of this topic seems to be the Comfort.

  4. Topic_3: [retailer, run, like, love, bit, fits, lbs, big, little, wear, usually, runs, petite, medium, fit, xs, ordered, large, small, size]. The focus of this topic appears to be Small Sizes, the algorithm seems to think women differentiate between Fit and Small Sizes.

  5. Topic_4: [looks, colors, material, blouse, pretty, bit, sheer, soft, really, underneath, little, bra, nice, love, wear, color, white, shirt, like, fabric]. This topic appears to be focused on Material and Color.

  6. Topic_5: [cut, cute, loved, beautiful, thought, wanted, work, model, material, looks, fit, way, didn, looked, really, fabric, look, just, dress, like]. Of course, Appearance appears to be a topic.

  7. Topic_6: [work, gorgeous, fabric, true, quality, recommend, summer, dresses, fit, compliments, size, fits, comfortable, beautiful, flattering, great, perfect, wear, love, dress]. Dresses are the only topic that overlaps with a category of clothing, similar to rap in the music segment.

As before, the assumptions need to be tested with a few random examples:

[Table 3: sample reviews and their assigned topics]

The assumptions are close to reality; however, this can be taken even further. We have the ratings of the products; therefore, we can dig deeper and see the topics for the liked and disliked products, thus getting valuable insight about the customers’ behaviour.

We adopt the convention that liked products are those with a rating of 4 stars or above; the rest are considered disliked. The data is separated into two datasets, one with positive ratings and one with negative ratings, and the clustering algorithm is applied to each individually, this time assuming just 4 topics per dataset.
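A sketch of that split and the two separate fits (the reviews DataFrame and its 'Rating' and 'Review Text' column names are assumptions):

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.decomposition import LatentDirichletAllocation

liked = reviews.loc[reviews['Rating'] >= 4, 'Review Text'].dropna()

disliked = reviews.loc[reviews['Rating'] < 4, 'Review Text'].dropna()

# Fit a separate 4-topic model on each subset

for subset in (liked, disliked):

    cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

    lda = LatentDirichletAllocation(n_components=4, random_state=42).fit(cv.fit_transform(subset))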

We can see in Figure 9 which topics are related to positive ratings.

[Figure 9: topics associated with positive ratings]

In much the same way, we can notice the topics negatively rated in Figure 10.

[Figure 10: topics associated with negative ratings]

It can be observed that positively reviewed products tend to get reviews regarding:

  1. Well-fitting small sizes

  2. Appearance

  3. Casual Style

  4. Price to Quality Ratio.

Whilst negatively reviewed products tend to be focused on:

  1. Tops that do not fit or do not look good

  2. Dresses made of cheap fabric

  3. Tops in the wrong color or made of low-quality material

  4. Items of clothing that are labeled as small but are larger than expected.

Thus, through a simple clustering algorithm, we have managed to gain valuable insight into the customers’ interests and complaints, even though this data is stored as text reviews.

Conclusion

Considering that most of the data generated today is unstructured, and that a large part of it is text, companies and individuals face significant challenges in making sense of it. However, at least in the case of text, there are tools available that help analyze and understand unstructured data, and despite the difficulties, valuable insight can be obtained from such sources.

