Automatically annotate social media posts with Chinese language context
Background:
The objective of this project was to develop efficient machine learning models that automatically annotate Chinese-language social media posts. It aimed to achieve several goals in order to provide useful perspectives for making informed decisions.
Specifically, it aimed to develop natural language processing models that can accurately identify products, sentiment, and product comparison reviews in Chinese social media posts. A large dataset of Chinese social media posts was used to train and test the models, and the models were evaluated with standard validation metrics. The project also discusses potential applications, limitations of the approaches, and recommendations for future research in this area.
The project was divided into three sub-tasks for better interpretation and classification: sentiment analysis, product detection, and product comparison review detection. All three are natural language understanding (NLU) classification tasks, and product comparison review detection is additionally an imbalanced classification task.
Sentiment Analysis
This task aimed to identify positive, negative, and neutral sentiment in social media posts. It helps to understand customer sentiment by condensing vast amounts of social media content into a manageable sentiment classification.
Product Detection
Social media generates large amounts of data every day on a wide range of topics, much of which is unrelated to any particular product or service. Analysing it manually is overwhelming, so in this task machine learning models were trained to process the data at scale and rapidly filter out product-related posts, providing insights that would otherwise be difficult to obtain. A post was labelled 1 if the text mentions a product and 0 otherwise.
Product Comparison Review Detection
In this task, samples were classified according to whether the post is a product comparison review mentioning more than one product. A post was labelled 1 if the text contains a product comparison and 0 otherwise.
Because of the nature of the problem, positive samples made up only a small proportion of the data. The machine learning models therefore had to overcome class imbalance, for example by avoiding degenerate behaviour such as predicting every sample as the same class.
RNN Models
LSTM Model
GRU Model
Pre-trained Models
BERT Model
RoBERTa Model
ERNIE Model
CPT Model
Sentiment Analysis and Product Detection
The procedures for sentiment analysis and product detection are alike because the two tasks share similar classification characteristics and complexity. After data preprocessing, the text was tokenized for the RNN models, whereas for the pre-trained models the corresponding checkpoints were loaded. The models were trained on manually labelled training samples, predictions were then made on the testing data, and finally the predictions were passed to validation.
The steps for the LSTM and GRU models are shown below, followed by a minimal sketch of the pipeline:
Read training data, validation data, and testing data
Text segmentation
Filter out stopwords and emojis
Padding of sentences
Vectorization of text
Model training with training data
Prediction on testing data
Validation
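The following is a minimal sketch of how such a pipeline could be implemented, assuming jieba for Chinese word segmentation and a Keras GRU classifier. The file names, stopword list, column names, and hyperparameters are illustrative assumptions rather than the project's exact code; only the maximum sentence length of 250 follows the setting reported later.

```python
# Minimal sketch (assumptions: jieba for segmentation, Keras GRU classifier,
# hypothetical file and column names).
import re
import jieba
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

MAX_LEN = 250        # maximum sentence length (shorter lengths such as 50 or 100 were also tried)
VOCAB_SIZE = 50000

stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())
emoji_pattern = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text):
    """Segment Chinese text and drop stopwords and emojis."""
    text = emoji_pattern.sub("", text)
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

train = pd.read_csv("train.csv")      # columns assumed: text, label
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

train_tokens = train["text"].map(preprocess).tolist()
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_tokens)

def vectorize(token_lists):
    """Turn token lists into padded integer sequences."""
    return pad_sequences(tokenizer.texts_to_sequences(token_lists), maxlen=MAX_LEN)

X_train = vectorize(train_tokens)
X_valid = vectorize(valid["text"].map(preprocess).tolist())
X_test = vectorize(test["text"].map(preprocess).tolist())

# GRU classifier; num_classes = 3 for sentiment analysis, 2 for product detection.
num_classes = 3
model = Sequential([
    Embedding(VOCAB_SIZE, 128),
    GRU(128),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, train["label"].values,
          validation_data=(X_valid, valid["label"].values),
          epochs=5, batch_size=64)

pred = model.predict(X_test).argmax(axis=1)   # predictions on testing data
```

The LSTM variant differs only in swapping the GRU layer for an LSTM layer of the same size.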
The steps for the pre-trained models are shown below, followed by a fine-tuning sketch:
Read training data, validation data, and testing data
Padding of sentences
Model loading from Hugging Face
Model training with training data
Prediction on testing data
Validation
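A minimal sketch of fine-tuning one of the pre-trained checkpoints with the Hugging Face transformers library is shown below. The checkpoint name (bert-base-chinese), file names, and hyperparameters are illustrative assumptions; the same pattern can be applied to the RoBERTa, ERNIE, and CPT checkpoints by swapping in the corresponding model name, though some (such as CPT) may require model-specific classes.

```python
# Minimal sketch (assumptions: bert-base-chinese checkpoint, hypothetical file
# and column names; swap the checkpoint for RoBERTa, ERNIE, or CPT as needed).
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "bert-base-chinese"
MAX_LEN = 250
NUM_LABELS = 3    # 3 for sentiment analysis, 2 for product detection

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

def load(path):
    """Read a CSV (columns assumed: text, label) and tokenize it with padding."""
    ds = Dataset.from_pandas(pd.read_csv(path))
    return ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                      padding="max_length", max_length=MAX_LEN),
                  batched=True)

train_ds, valid_ds, test_ds = load("train.csv"), load("valid.csv"), load("test.csv")

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=valid_ds)
trainer.train()

preds = trainer.predict(test_ds).predictions.argmax(axis=-1)   # predictions on testing data
```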
Product Comparison Review Detection
Because positive samples were scarce, two different approaches were attempted to improve accuracy.
For the supervised approach, the text was tokenized for the RNN models after data preprocessing. The majority class was then downsampled so that positive samples made up 25%, 50%, and 75% of the training data respectively. The models were trained on these adjusted samples, predictions were made on the testing data, and finally the predictions were passed to validation.
For the semi-supervised approach, the text was likewise tokenized for the RNN models after preprocessing. The models were first trained on the labelled samples (marked 0 or 1). Predictions were then made on the unlabelled training samples to obtain pseudo labels; these predictions were returned as floating-point numbers between 0 and 1 representing the model's confidence. Confidence thresholds were applied so that only samples with a high degree of certainty were kept: predictions above 0.9 were relabelled as 1 and predictions below 0.1 were relabelled as 0, while everything in between was discarded. The models were then retrained on the labelled samples together with the pseudo-labelled samples, and finally predictions on the testing data from the retrained models were passed to validation.
The steps for the supervised approach are shown below, followed by a sketch of the downsampling step:
Read training data, validation data, and testing data
Adjust training data proportion
Text segmentation
Filter out stopwords and emojis
Padding of sentences
Vectorization of text
Model training with training data
Prediction on testing data
Validation
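A minimal sketch of the downsampling step is given below, assuming a pandas DataFrame with a binary label column; the file name, column name, and random seed are illustrative assumptions.

```python
# Minimal sketch (assumptions: a CSV with a binary `label` column, hypothetical file name).
import pandas as pd

train = pd.read_csv("train.csv")   # columns assumed: text, label

def downsample_majority(df, positive_ratio, seed=42):
    """Reduce the majority (label 0) class so positives make up `positive_ratio` of the data."""
    pos = df[df["label"] == 1]
    neg = df[df["label"] == 0]
    n_neg = int(len(pos) * (1 - positive_ratio) / positive_ratio)
    neg_sampled = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    return pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)  # shuffle

# Training sets with 25%, 50%, and 75% positive samples, as in the experiments.
train_25 = downsample_majority(train, 0.25)
train_50 = downsample_majority(train, 0.50)
train_75 = downsample_majority(train, 0.75)
```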
The steps for the semi-supervised approach are shown below, followed by a sketch of the pseudo-labelling step:
Read training data, validation data, and testing data
Text segmentation
Filter out stopwords and emojis
Padding of sentences
Vectorization of text
Model training with labelled training data
Get pseudo labels from unlabelled training data
Retraining with labelled data and pseudo-labelled training data
Prediction on testing data
Validation
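The pseudo-labelling step could look like the minimal sketch below. It assumes a compiled Keras binary classifier (`model`, sigmoid output) and already vectorized arrays `X_labelled`, `y_labelled`, `X_unlabelled`, and `X_test`; the 0.1 and 0.9 thresholds follow the description above, while everything else is an illustrative assumption.

```python
# Minimal sketch (assumptions: a compiled Keras binary classifier `model` with a
# sigmoid output, and pre-vectorized numpy arrays X_labelled, y_labelled,
# X_unlabelled, X_test from the preprocessing pipeline above).
import numpy as np

LOW, HIGH = 0.1, 0.9   # confidence thresholds described above

# 1. Train on the manually labelled samples.
model.fit(X_labelled, y_labelled, epochs=5, batch_size=64)

# 2. Predict on unlabelled data; outputs are floats in [0, 1] (model confidence).
probs = model.predict(X_unlabelled).ravel()

# 3. Keep only confident predictions and convert them to hard pseudo labels.
confident = (probs > HIGH) | (probs < LOW)
pseudo_X = X_unlabelled[confident]
pseudo_y = (probs[confident] > HIGH).astype(int)   # >0.9 -> 1, <0.1 -> 0

# 4. Retrain on labelled + pseudo-labelled data.
model.fit(np.concatenate([X_labelled, pseudo_X]),
          np.concatenate([y_labelled, pseudo_y]),
          epochs=5, batch_size=64)

# 5. Predict on the testing data with the retrained model.
test_preds = (model.predict(X_test).ravel() > 0.5).astype(int)
```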
Model Performance
Table: Validation results for sentiment analysis
In this task, the GRU model performed better than the LSTM model even though the GRU has only two gates and a simpler structure. As shown in the table, the GRU model obtained an F1 score of 0.66 and an accuracy of 0.70, while the LSTM model scored the lowest, with an F1 score of 0.65 and an accuracy of 0.68.
Among the pre-trained models, the ERNIE model performed best and the BERT model had the second-highest F1 score, at 0.90 and 0.89 respectively, as shown in the table. The RoBERTa and CPT models scored the same F1 score and accuracy, 0.87 and 0.88 respectively.
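The F1 scores and accuracies reported throughout could be computed as in the brief sketch below; the toy label arrays are placeholders, and macro-averaged F1 for the three-class sentiment task is an assumption, since the report does not state the averaging method.

```python
# Minimal sketch (assumptions: toy labels; macro averaging for the 3-class task).
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions (e.g. 0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")   # use average="binary" for the two-class tasks
print(f"F1: {f1:.2f}  Accuracy: {accuracy:.2f}")
```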
Model Performance
Table: Validation results for product detection
Among the RNN models, the GRU model again performed better than the LSTM model: the GRU model reached an F1 score of 0.70 versus 0.67 for the LSTM model, and an accuracy of 0.71 versus 0.68, as shown in the table.
Among the pre-trained models, BERT, RoBERTa, and ERNIE performed better than CPT. The table shows that BERT and ERNIE reached an F1 score of 0.80 and RoBERTa 0.81, while BERT and RoBERTa reached an accuracy of 0.81 and ERNIE 0.80. In contrast, the CPT model obtained only 0.69 in both F1 score and accuracy, lower even than the GRU model.
Model Performance
Table: Validation results for LSTM model in product comparison review detection
Table: Validation results for GRU model in product comparison review detection
In this task, 10,000 samples were labelled for training, but only 879 of them were marked as 1, meaning that fewer than 9% of the samples were positive and the two classes were therefore imbalanced. Both a supervised approach with downsampling of the majority class and a semi-supervised approach were examined to improve accuracy. For the supervised approach, the positive proportion was adjusted to 25%, 50%, and 75% by reducing the majority class, whereas the semi-supervised approach kept the original class proportions.
From the tables, both the LSTM and GRU models trained with the supervised approach at a 75% positive proportion scored the lowest F1 score and accuracy, 0.02 and 0.09 respectively; all predictions on the testing data fell into the same class, positive. The F1 score also decreased as the positive proportion increased, indicating that model performance did not benefit from manually adjusting the class proportions.
However, the semi-supervised approach improved the performance of both models compared with their supervised counterparts.
From the tables, the GRU model with the semi-supervised approach had the highest F1 score, 0.91, while the GRU model with supervised learning and the original class proportion had the second-highest F1 score, 0.90. Both models reached the same accuracy, 0.92.
Among the LSTM models, the semi-supervised approach achieved the highest F1 score and accuracy, 0.89 and 0.92 respectively, while the supervised approach with 25% positive samples had the second-highest F1 score in the LSTM group, 0.88.
Apart from basic hyperparameters such as learning rate, batch size, and optimizer, the maximum sentence length proved to be an important setting. Several sentence lengths were examined during the experiments, and a maximum length of 250 was finally chosen for all models, because shorter lengths such as 50 or 100 did not produce proper results, for example all predictions falling into the same class.
Word segmentation, sentence padding, and text vectorization were applied to the corpus during data preprocessing to improve LSTM and GRU performance. Even so, these models did not perform as well as the pre-trained models.
In the product comparison review detection task, as the complexity of the problem increased, so did the amount of data required. Downsampling the majority class did not improve performance, whereas semi-supervised learning, which increases the amount of training data, did. A possible reason is that downsampling the majority class reduces the total amount of training data, which constrains the model and leaves it underfit for the classification problem, although more experiments are needed for verification.
The models in this project can potentially be applied to provide valuable insights into customer sentiment, emerging trends, and consumer behaviour, ultimately leading to improved products, services, and customer experiences.
There were also limitations. Only classification tasks were included in this project, and the project did not restrict itself to specific product types, such as cosmetics, groceries, or food, but covered all saleable products. Other natural language processing tasks, such as question answering, were not covered either. These areas are left for further research.
Overall, analysing social media posts with machine learning models can give stakeholders useful perspectives on customer sentiment, emerging trends, and consumer behaviour, ultimately resulting in better products, services, and customer experiences.
In this project, several machine learning models were examined for automatically annotating Chinese social media posts, and their performance was measured against manually labelled ground truth.
One key finding is that the GRU models generally performed better than the LSTM models in all three tasks despite their simpler structure. The CPT model was expected to perform better because it combines the advantages of NLU and NLG models, but in practice it did not do as well as the other pre-trained models, and in the product detection task it scored even lower than the GRU model. Among the pre-trained models, BERT, RoBERTa, and ERNIE are well suited to NLU tasks, which aligned with the nature of the problems in this project, and they outperformed the other models in the sentiment analysis and product detection tasks.
Hey there! If you're curious, why not swing by my GitHub link below? Check it out for more details!
Link: Story-Annotation