Thanks to the release of OpenAI's ChatGPT, we've all become aware of the power of natural language processing (NLP). In a few short clicks, you can now make a computer write a poem, generate ideas for a new product, and write blog posts (like this one*).
BETA recently completed a project which used NLP to analyse job ads. In this post we'll explain common terms and make sense of some of the hype, and in our next post we'll explain how we used this method in our project.
What is natural language?
All written text or recorded speech produced by humans is 'natural language', but traditional statistics can't help us understand this information. What is the 'average' language in this blog? What is its key take-away? We can't calculate the 'mean' or the correlation between sentences to answer these questions. This is what the computer science/linguistics field of NLP has been trying to solve.
With NLP, computers are used to analyse and synthesise text and speech generated by people ('natural language'). In the past, researchers wrote complicated hand-crafted rules to get computers to understand language – and as you can imagine, these rules became increasingly incomprehensible. To see why, think about how many spelling and grammar rules the English language has, and all the exceptions to those rules. For example, the rule 'i' before 'e' except after 'c' – except for common words like 'species', 'science' and 'sufficient'. And for most text you also have to account for all the ways we misspell words! Instead, researchers began relying on machine learning – where a machine builds its own rules from data – to bypass hand-programmed rules. A neural network is the algorithm that underlies most modern machine learning, and to date it is the best way researchers have found to make a computer learn. How a neural network learns is too complicated for this blog, but it involves doing LOTS of simple maths and solving an optimisation problem.
But how exactly do you train a computer to understand language?
The answer is an algorithm BETA used in our recent project, `word2vec`.
`word2vec` is an older-style natural language processing algorithm developed by researchers at Google in 2013. The algorithm takes words and converts them to a vector of numbers that a computer can understand and manipulate (which is where the algorithm gets its name: word to vector). The algorithms that underpin ChatGPT (and other Large Language Models) are more complicated, but rely on essentially the same principles.
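One way to get a feel for the 'word to vector' idea is a much simpler cousin of `word2vec`: counting which words appear next to each other. This is a minimal sketch only – real `word2vec` learns dense vectors with a neural network rather than counting – and the tiny corpus below is invented for illustration.

```python
from collections import Counter

# Represent each word by the counts of its immediate neighbours.
# This is NOT word2vec itself, just the same underlying principle:
# words used in similar contexts end up with similar vectors.
corpus = [
    "i took the bus to work",
    "i took the train to work",
    "the cat sat on the mat",
]

vectors = {}
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        # one neighbour either side of the current word
        neighbours = words[max(0, i - 1):i] + words[i + 1:i + 2]
        vectors.setdefault(word, Counter()).update(neighbours)

print(vectors["bus"])    # Counter({'the': 1, 'to': 1})
print(vectors["train"])  # Counter({'the': 1, 'to': 1})
```

Because 'bus' and 'train' appear in the same contexts, they end up with identical neighbour counts, while 'cat' does not – the seed of the idea that `word2vec` scales up to millions of sentences.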
To teach a computer to convert words into vectors, we use machine learning to estimate the probability of a word appearing alongside other words in a sentence. To make this a little more concrete, let's solve the same problem a computer does. What words could fill in the blank:
'I took the (blank) to work'
No prizes for guessing that 'bus', 'train', 'tram' or even 'ferry', are words that all fit in this sentence. Your brain just did what we get computers to do. It looked at the surrounding words, and then based on your experience, filled the blank with the word that had the highest probability of being correct.
However, your brain has its own biases based on your experience – your 'training' dataset. If you're from Sydney, 'train' may be what first comes to mind; if you're from Canberra, 'bus' may have made more sense; and if you're from Melbourne, maybe 'tram' did.
Training a computer is similar, except we give it LOTS of sentences to help it work out what word fits into the sentence. This demonstrates how biases in the training data can affect the model. Where you're from has an effect on your 'default' mode of public transport, and so a bias exists in your model of the world. Computers face the same challenges when you train them. As our own language reflects our stereotypes and biases, when you train a computer on natural language, it too reflects the same stereotypes and biases!
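A toy sketch makes the link between training data and bias concrete. The corpus below is invented – imagine it was collected in Canberra, so buses appear more often – and the 'model' is just a frequency count, far simpler than a real neural network:

```python
from collections import Counter

# Invented 'Canberra-flavoured' training corpus: buses dominate.
corpus = [
    "i took the bus to work",
    "i took the bus to work",
    "i took the bus to work",
    "i took the train to work",
    "i took the tram to work",
]

# Count which word fills the blank in 'i took the ___ to work',
# then predict the most frequent one.
fills = Counter(sentence.split()[3] for sentence in corpus)
prediction = fills.most_common(1)[0][0]
print(prediction)  # prints: bus
```

Swap in a corpus where trains dominate and the 'model' predicts 'train' instead – the prediction is entirely a reflection of what the training data contained.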
Training word2vec and detecting bias
This is the underlying assumption that lets us use NLP to measure the existence of stereotypes in job ads, as part of our recent project. We trained our own word2vec model using all of English Language Wikipedia, which involved inputting over 5 million Wikipedia articles into our model!
We then used another concept, semantic similarity, to help us understand gendered stereotypes in language. This is the idea that if words have the same or similar meaning then they are semantically similar. For example, 'bus', 'train' and 'tram' are all semantically similar to each other, but not to 'cat'. We can then detect biased language, because stereotypical phrases like 'independent thinker' are more semantically similar to words for men than to words for women.
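Semantic similarity between word vectors is commonly measured with cosine similarity: vectors pointing in the same direction score close to 1, unrelated ones score lower. A minimal sketch, using hand-picked 3-number vectors (real word2vec vectors have hundreds of dimensions and are learned, not hand-picked):

```python
import math

# Hypothetical toy vectors, chosen by hand for illustration only.
vectors = {
    "bus":   [0.9, 0.8, 0.1],
    "train": [0.8, 0.9, 0.2],
    "cat":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Score of 1.0 means same direction; lower means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["bus"], vectors["train"]))  # close to 1
print(cosine_similarity(vectors["bus"], vectors["cat"]))    # much lower
```

The same comparison, applied to vectors for stereotypical phrases and gendered words, is what lets us put a number on how 'male-coded' or 'female-coded' a phrase in a job ad is.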
We will go into how to detect biased language in our next blog post, but with these tools we could start to analyse job ads.
Keep an eye out for the next blog to see if it worked!
Footnotes:
* Confirming this blog was written by a human.