Razor Insights
Text Classification with Generative AI
What is Few-Shot Classification?
Explore the capabilities of Generative AI beyond chat interactions through few-shot classification, a method that allows for the categorisation of unstructured data with minimal labelled examples. Understand how this approach utilises transfer learning to apply pre-existing knowledge to new tasks, enhancing the extraction of insights from complex data sources.
In its most well-known form, Generative AI has become the gold standard for chat-like, question-and-answer style interactions. However, it can be used for much more than just data retrieval.
By providing specific instructions, known as ‘prompts’, we can use this conversational style of communication to achieve a wide variety of tasks, including inferring the category a text sample belongs to.
This has the potential to unlock insight into previously untapped sources of unstructured data such as emails, customer feedback and staff notes.
Few-Shot Classification
Traditional machine learning (ML) classification methods tend to rely on large amounts of labelled data to train a model to recognise the desired categories. A good example of this is spam filtering, where millions of labelled examples are available because it's a common problem for everyone.
This labelled data has to come from somewhere, and it is almost always classified manually: a time-consuming and costly process that most businesses do not have the resources for. And unlike our spam filter example, we can't rely on publicly available data.
Few-shot classification is a method that tries to circumvent this requirement. Models like GPT have been trained on a vast range of data, resulting in a general-purpose solution that can handle many natural language processing (NLP) tasks. They also have a deeper understanding of language and its nuances than traditional task-specific models.
By providing:
A very small sample of labelled data
A clear list of potential categories
Specific instructions describing our required outcome
We can leverage this general knowledge through a process called ‘transfer learning’, where a model can perform a task it was not explicitly trained for by relying on knowledge of another task.
While this approach is often outperformed by traditional ‘fine-tuned’ models, the results can still be surprisingly accurate, making it a viable solution where manual classification is not feasible.
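To make this concrete, here is a minimal sketch of a few-shot classification call using the OpenAI Python client. The categories, example comments, and model name are illustrative assumptions rather than a prescription; the same prompt structure works with any capable chat model.

```python
# A minimal sketch of few-shot classification with the OpenAI Python client.
# The categories, example comments, and model name are illustrative
# assumptions, not taken from any real project.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["Delivery", "Product Quality", "Customer Service", "Pricing"]

FEW_SHOT_EXAMPLES = [
    ("Arrived two weeks late and the box was crushed.", "Delivery"),
    ("The fabric feels cheap and tore after one wash.", "Product Quality"),
    ("Support resolved my refund within an hour.", "Customer Service"),
]

def classify(comment: str) -> str:
    # Build the prompt: instructions, the list of categories,
    # a handful of labelled examples, then the new comment.
    examples = "\n".join(f'Comment: "{text}"\nCategory: {label}'
                         for text, label in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify the customer comment into exactly one of these categories: "
        f"{', '.join(CATEGORIES)}.\n"
        "Respond with the category name only.\n\n"
        f"{examples}\n\nComment: \"{comment}\"\nCategory:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output suits classification
    )
    return response.choices[0].message.content.strip()

print(classify("Took ages to arrive but the quality is great."))
```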
Result Validation
There are many factors at work when using a method like this to classify data; the examples provided, the wording of the prompt, and the categories chosen can all directly impact performance. For example, choosing categories that overlap significantly can result in the model conflating them. Validating our results helps highlight areas for improvement and refine the model's performance, necessitating an iterative approach to development.
Manual Classification
To validate the results, we still have to perform some manual classification. However, rather than classifying thousands of records before we can begin training a model, we can instead review a small sample of the model's results (50-100 can be sufficient, although more is better). This process also tends to be much faster, as we can start from the model's suggested classification rather than approaching the task with no context.
As part of this process, it's also worth recording any records that could not be confidently classified manually. With this information, we can establish a baseline for predictive performance by calculating the percentage of classifiable records; the model cannot be expected to outperform this baseline, since even human intervention could not achieve a better result.
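As a quick illustration of that calculation, assuming a hand-reviewed sample of 100 records, 12 of which even a human reviewer couldn't label confidently:

```python
# Baseline: the share of sampled records a human could confidently classify.
# All figures here are invented for illustration.
manually_reviewed = 100   # records checked by hand
unclassifiable = 12       # records even a human couldn't label confidently

baseline = (manually_reviewed - unclassifiable) / manually_reviewed
print(f"Baseline (classifiable records): {baseline:.0%}")  # -> 88%
```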
Assessing the Predictive Performance
By comparing the manual classification against the model's results, we can assess its predictive performance. As well as calculating the overall accuracy (the number classified correctly divided by the total number in the sample), we can also quantify the success of each category by calculating its F-score: a commonly used measure of predictive performance for classification which combines:
Precision: The number of correct results, divided by the number of results predicted to be in that category
Recall: The number of correct results divided by the number of results that should have been identified as belonging to that category
Considering these together is important because it provides a single unified figure for performance, where either metric alone could give an inflated impression of success. As an example, the precision of a category might be 95%, whereas its recall might only be 30%. From this we can conclude that when the model did assign a comment to the category it was highly likely to be correct, but it struggled to spot some of the signs that a comment belonged in the category in the first place.
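A sketch of how this comparison might look with scikit-learn, using invented labels in place of a real validation sample:

```python
# Comparing manual labels against the model's predictions with scikit-learn.
# Both label lists are placeholders for a real validation sample.
from sklearn.metrics import accuracy_score, classification_report

manual_labels = ["Delivery", "Pricing", "Delivery", "Quality", "Pricing"]
model_labels  = ["Delivery", "Delivery", "Delivery", "Quality", "Pricing"]

# Overall accuracy: number classified correctly / total in the sample
print(f"Accuracy: {accuracy_score(manual_labels, model_labels):.0%}")

# Per-category precision, recall, and F-score in one report
print(classification_report(manual_labels, model_labels, zero_division=0))
```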
Tools like confusion matrices (a visual representation of performance by category) can also help identify issues with specific categories, such as overlap with another, similar category.
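Continuing the sketch above, scikit-learn can also produce the confusion matrix, where off-diagonal counts show which pairs of categories the model conflates:

```python
# Rows are the manual (true) labels, columns are the model's predictions;
# off-diagonal counts reveal categories the model tends to conflate.
from sklearn.metrics import confusion_matrix

categories = ["Delivery", "Pricing", "Quality"]
matrix = confusion_matrix(manual_labels, model_labels, labels=categories)

for category, row in zip(categories, matrix):
    print(f"{category:>10}: {row}")
```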
Example in Action
This method of text classification is best used in situations where:
The data is unstructured, such as in Word documents, emails, user comments, or product reviews.
The categories are known, but which records belong to each category is not.
There are too many records to classify manually, making automation essential.
Effectively classifying information like this can:
Unlock previously unknown insights into the operation of a business.
Validate intuitions that couldn't be proven tangibly.
Identify potential areas for process improvement.
Enhance data collection by highlighting overlooked opportunities.
For example, imagine you work for an e-commerce business. The website includes a review feature, where customers can provide a rating of a product and leave a comment about their individual experience.
While the ratings give an indication of satisfaction, it's difficult to discover the underlying reasons behind them without reading through the comments individually and noting down the key features.
You might have some idea of what these key features are from reviewing a small sample, but if you had thousands of comments it wouldn’t be possible to verify how important each feature is. Even if you were able to review all of the historical data, as more comments are added over time you would have to continue manually classifying them to keep your analysis up to date.
By using few-shot classification to classify these comments, meaningful analysis can be done on those key features to uncover previously hidden insights. From this you could learn more about your customers and the things they like to help inform future product lines, or identify why certain products are poorly received. You might even decide to add more questions to the review process to gather explicit information about a certain feature.
Once this classification process has been created and refined, it can continue to run as new comments are added, so the improvements you make can be proven directly through data.
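As a sketch of what that ongoing process might look like, reusing the hypothetical classify() function from the earlier example:

```python
# A hypothetical batch run over newly arrived reviews, reusing the earlier
# classify() sketch. The comments below are invented for illustration.
from collections import Counter

new_reviews = [
    "Great value for money, will definitely buy again.",
    "Courier left the parcel in the rain and the box was ruined.",
    "Sizing runs small, but the material feels lovely.",
]

feature_counts = Counter(classify(comment) for comment in new_reviews)

# Aggregated counts show which features drive satisfaction over time
for feature, count in feature_counts.most_common():
    print(f"{feature}: {count}")
```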
Conclusion
Generative AI is not a silver bullet for text classification tasks; it still requires many of the same skills data scientists have honed working with traditional ML models, and it benefits from an iterative approach that can take time to assess and perfect. It will also never outperform human intervention, nor will it successfully classify ambiguous data.
Where it really shines is in how significantly it reduces the barrier to entry: reliance on huge quantities of manually classified training data is no longer a blocker to getting meaningful results from unstructured data. The insights that can be gained are well worth the cost, especially when the outlay is so much lower than before.