Hello Heng, can you explain to us how this project was born?
We started working on the semantic analysis of press articles in June 2019. The objective of this project is to react more quickly when predicting corporate failure.
This project was born out of a reflection on the analysis of press articles. The idea was to assign a sentiment to each article so that we could have a view complementary to the company failure score.
The approach we wanted to put in place was to propose a system capable of analysing press articles about a company, and classifying them as negative or positive. This is a way for us to exploit new information to detect failure risk signals earlier.
Specifically, how does it work?
At Ellisphere, we have a database of press articles that already covers more than 25,000 French companies, and new articles arrive every day. This gives us a sufficiently large corpus of texts to initiate our working methodology.
Take the example of a company we will call X. Company X has had a failure score of 10 since last year. However, last week we learned from the press that this company has ceased payments. As the bankruptcy has not yet been recorded, the failure probability score has not yet been updated.
This analysis therefore complements the score, guaranteeing our customers an accurate view of a company's situation.
When we talk about the “sentiment” of an article, we have to be careful. For example, if we take the reviews left on the Amazon platform, we can judge the sentiment of those reviews toward a particular product (I like or do not like this product). Our approach is different: in our case, the sentiment is linked to the risk of failure, detecting, through the analysis of press articles, the economic and financial difficulties to come for a company.
What methodology have you applied to arrive at a solution?
For this project we tried different approaches. Each of them allowed us to adjust our results to finally choose the most relevant methodology.
Labelling of press articles
At the beginning of the project, we established a corpus of 2,000 articles that our experts labeled. This work gave us the training set around which we could build the model.
This labeling work consisted of reading each article and deciding whether it could foreshadow a future failure (negative sentiment) or not (positive / neutral sentiment).
Preprocessing of data
To make our dataset analyzable, we applied standard NLP preprocessing. This essential step formats the data so that the analysis model can work with it.
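The interview does not detail the preprocessing steps, but a standard NLP pipeline of this kind typically lowercases the text, strips punctuation, tokenizes, and removes stopwords. A minimal sketch, with an illustrative English stopword list (Ellisphere's corpus is French, so a real pipeline would use a French stopword list):

```python
import re

# Illustrative stopword list; a real pipeline would use a full list
# appropriate to the corpus language (here, French).
STOPWORDS = {"the", "a", "of", "is", "in", "to", "and"}

def preprocess(text):
    """Standard NLP preprocessing: lowercase, strip punctuation,
    tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation/symbols
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The company is in cessation of payments."))
# → ['company', 'cessation', 'payments']
```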
Our approach: a semi-supervised model
We then opted for a semi-supervised model to estimate a press score. A second phase of data clean-up was required to support this approach.
To achieve our goals, we built two models based on deep learning. The first is the well-known Word2Vec (notably used by Google), which scans all the articles and, for each target word, learns to predict the words appearing in its context.
For example, take the phrase “I eat a potato”. Here, the model will take the word “potato” as a target and learn to predict its surrounding context from its occurrences across the corpus.
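This target-word-predicts-context idea corresponds to Word2Vec's skip-gram architecture, which can be illustrated by the (target, context) training pairs it extracts from a sentence. A minimal sketch; the window size and sentence are illustrative, not Ellisphere's actual configuration:

```python
def skipgram_pairs(tokens, window=1):
    """Generate (target, context) training pairs as in Word2Vec's
    skip-gram architecture: each target word is paired with the words
    appearing within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "eat", "a", "potato"], window=1))
# → [('i', 'eat'), ('eat', 'i'), ('eat', 'a'), ('a', 'eat'), ('a', 'potato'), ('potato', 'a')]
```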
Why choose a language model? It allows us to translate our language into a mathematical representation that carries semantic meaning. This language model is therefore used as a dictionary associating each word with a vector.
The next step was to create a neural network with two hidden layers to predict whether or not an article mentioning difficulties could foreshadow a failure.
Methodology of the project
After our experiments, we achieved a satisfactory result: 95% accuracy on articles with a positive sentiment, and 87% accuracy on articles with a negative sentiment.
The model is therefore able to predict positive or negative sentiment with sufficient precision on articles outside its training set.
(For reference: a perfect result would be 1, while a random classifier would score 0.5.)
What difficulties have you encountered in creating the press score?
The main difficulty we faced is the following: failure is a rare event. Accordingly, the probability that a given article signals a failure is low. It was therefore necessary to train the algorithm to predict this minority class, which is why the labelling step was essential in setting the score.
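One common way to make a classifier pay attention to a rare class like failure is inverse-frequency class weighting, so that minority examples count for more in the training loss. A hypothetical sketch; the 90/10 label split below is made up for illustration:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rare classes (articles signaling
    failure) receive proportionally larger weight in the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Illustrative imbalanced labels: failure-related articles are the minority.
labels = ["ok"] * 90 + ["failure"] * 10
print(class_weights(labels))
# 'failure' (minority) gets weight 5.0; 'ok' gets roughly 0.56
```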
We also faced other challenges. For example, an article in our corpus could concern two companies simultaneously, such as when one company acquires another. In that case, the press score affects both companies.
We also encountered several error cases at the backtesting stage of the model. Many articles mention financial or economic hardship at companies that had not failed one year later (sometimes the failure occurs 2 or 3 years after the article, or never).
Another observation concerns the relationship between press coverage and failure: some articles mention a company a few days or months before its failure without mentioning any particular difficulty.
What are the next steps?
This semantic analysis of press articles is not yet integrated into the methodology for calculating the failure score. However, we can already use it as an alert system for monitoring companies.
To improve the analysis, we plan to take an article's publication date into account. We are also working on adding a self-attention mechanism to strengthen our language model, and, to address explainability, we will generate comments explaining the AI's reasoning.
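Self-attention lets each token's representation incorporate information from every other token in the article. A minimal NumPy sketch of scaled dot-product self-attention with Q = K = V = X; the token count and embedding size are illustrative, and this is not Ellisphere's implementation:

```python
import numpy as np

def self_attention(X):
    """Minimal scaled dot-product self-attention (Q = K = V = X):
    each token's output is a weighted mix of all tokens, with weights
    given by the softmax of pairwise dot products."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over tokens
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, 8-d embeddings (illustrative)
out = self_attention(X)
print(out.shape)              # (5, 8): one mixed vector per token
```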
At this stage, the model already addresses the needs of our business. However, we are actively working to improve it and obtain better performance.