
Statistical Models for Natural Language and Business Applications

Written by Daniele Gambetta | Jul 6, 2022 10:00:00 PM

Natural language and human speech are among the most complex and varied kinds of data to handle. While images have a common numerical interpretation as matrices, and time series as flows of values, it has always been hard to find a mathematical representation of text able to capture its lexical, phonological and semantic aspects. Furthermore, words are subject to inflections, conjugations, declensions and other variations depending on the specific language. Think about the development of a search engine or a chatbot: how do you explain and teach language to an algorithm…

In the last century, with the advent of computers and the dawn of artificial intelligence, many new fields arose, also in the social sciences: for example computational semantics and computational linguistics, which concern the modelling of natural language and the study of how to automate the process of constructing and reasoning with meaning representations of language expressions. In recent decades, with the support of large amounts of data, we have seen a huge explosion of new methods and algorithms for modelling language, in the machine learning field known as natural language processing (NLP).

In 1935 the American linguist George Kingsley Zipf stated his famous Zipf's law about the distribution of terms in a language: given a corpus of text, the frequency of any word is inversely proportional to its rank, following a power law of the kind frequently found in human behaviour statistics (f(r) ∝ 1/r). Thus the most common word (rank 1) in English, the, occurs about one-tenth of the time in a typical text; the next most common word (rank 2), of, occurs half as often as the, so about one-twentieth of the time; the third most frequent appears one-third as often as the, and so on…


[Figure: Zipf's law rank-frequency distribution]
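
Zipf's law is straightforward to check empirically. The following minimal Python sketch (assuming a plain-text corpus saved in a hypothetical file named corpus.txt) counts word frequencies and compares the observed counts with the Zipfian prediction f(r) ≈ f(1)/r:

```python
from collections import Counter
import re

# corpus.txt is a hypothetical plain-text file; any large English text works
text = open("corpus.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z']+", text)
counts = Counter(words)

# Rank words by frequency and compare with the Zipfian prediction f(r) = f(1) / r
top = counts.most_common(10)
f1 = top[0][1]
for rank, (word, freq) in enumerate(top, start=1):
    print(f"{rank:>2}  {word:<10} observed={freq:>8}  zipf_predicted={f1 / rank:>10.0f}")
```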

In the first decades of the 21st century, thanks to large text datasets, many new techniques have been developed to process language, such as the famous word2vec, created and published by a team of Google researchers led by Tomas Mikolov. The algorithm uses a neural network model to learn word associations and represent each distinct word as a vector (a list of numbers). The vectors are chosen in such a way that the distance between two of them is a measure of the semantic similarity of the corresponding words, and it is even possible to perform mathematical operations on them (the classic example being king − man + woman ≈ queen).


[Figure: visual illustration of word embeddings]
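
To give a feel for how such embeddings are used in practice, here is a minimal sketch with the gensim library (our choice of tool; the post names no specific implementation), trained on a tiny toy corpus. Real embeddings require corpora many orders of magnitude larger:

```python
from gensim.models import Word2Vec

# Toy corpus for illustration only; meaningful embeddings need millions of sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=42)

print(model.wv["king"][:5])                  # first components of the vector for "king"
print(model.wv.similarity("king", "queen"))  # cosine similarity between two words
# Vector arithmetic: with enough data this approximates "queen"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```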

Embeddings and representations like these allow algorithms to recognize the structure of a text, detecting and classifying entities into pre-defined categories such as person names, organizations, locations, medical codes, etc.

Named Entity Recognition (NER) has many business use cases, such as classifying and prioritizing customers' emails, or generating summaries of web content from a large number of recommendations.


[Figure: example of Named Entity Recognition]
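
In practice, off-the-shelf NER takes only a few lines of code. A minimal sketch with the spaCy library and its small English model (our choice of tool, not one named in the post; the example sentence is made up):

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Milan, hiring Tim Cook's former team.")

# Each detected entity carries its text span and a pre-defined category label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```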

Auto-tagging and topic recognition in Hyntelo R&D

Many companies and organizations have to handle a large amount of data every day, often in textual format such as folders of emails, archives and scanned documents, and they usually face the need to manage and organize this huge quantity of information.

At Hyntelo we focus on detecting and recognizing the specific topics of a document, and on extracting the entities and names cited in a text, to enable data analysis, smart suggestion generation and search capabilities over text databases.
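
To make topic detection concrete, here is a minimal sketch (not our production pipeline) using Latent Dirichlet Allocation from scikit-learn on a handful of toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents mixing two themes: legal rulings and clinical trials
docs = [
    "the court ruled that the contract clause was invalid",
    "the new drug reduced blood pressure in clinical trials",
    "the judge cited the appeal and the previous ruling",
    "patients received the medicine twice a day during the trial",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words that characterize each discovered topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```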


[Figure: NER in pharmaceutical applications]

Furthermore, by tailoring our algorithms to specific needs, we have developed text representation models based on the language (mainly Italian and English) and specialized in particular semantic domains, such as the legal and medical sectors. Through a fine-tuning process of NLP models we have taught the algorithms to recognize domain-specific entities and terms (names of medicines, references to laws, company names …), developing automatic tagging tools and textual search over large archives of files.
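
A full fine-tuning run is beyond the scope of this post, but the idea of recognizing domain-specific entities can be sketched with spaCy's rule-based EntityRuler, a technique often used to bootstrap such systems before statistical fine-tuning. The labels and patterns below are hypothetical examples, not our production configuration:

```python
import spacy

# Start from a blank English pipeline and add a rule-based entity recognizer
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical domain-specific labels and patterns (medicines, law references)
patterns = [
    {"label": "MEDICINE", "pattern": "ibuprofen"},
    {"label": "MEDICINE", "pattern": "amoxicillin"},
    {"label": "LAW_REF", "pattern": [{"TEXT": "GDPR"}]},
]
ruler.add_patterns(patterns)

doc = nlp("The patient was prescribed ibuprofen; consent is handled under GDPR.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```

A ruler like this can pre-annotate a large archive of documents, producing the training data on which a statistical NER model is then fine-tuned.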