Using machine learning to analyse job ads – Part 2

26 March 2024

Welcome back to Part 2 of our blog, where we go through the technical nitty-gritty of how we used machine learning in our recent project. We recommend reading the Part 1 blog if you haven’t already, as we’ll be using some of the terms and techniques introduced there.

Part 1 outlined how a computer can be trained to understand language using Natural Language Processing (NLP) and how we used semantic similarity to help us understand gendered stereotypes in language. In Part 2, we’ll explain how we used our ‘word2vec’ model to analyse job ads for gendered language, to see if employers may be signalling that a job is a better fit for a particular cohort. If you would like a less technical explanation of our analysis, read the ‘What we did’ section of the final report.

What is gendered language?

Gendered language is language that is associated with a particular gender stereotype. Examples of gendered language reflecting male stereotypes include statements such as “an assertive person, who is independent and makes fast decisions”. While including this language in a job ad does not explicitly state a gender preference, it may still implicitly appeal more to men. This ‘implicit’ appeal to a particular group is driven by how language is stereotypically associated with specific groups or cohorts.

Analysing 12 million job ads

Partnering with Jobs and Skills Australia, we accessed the Lightcast database which contains most job ads posted online in Australia since 2012.

To analyse the job ads, we first cleaned the raw job ad text and ‘embedded’ each word. Embedding involved using our word2vec model to convert each word into a series of numbers so the computer could work with the text. This resulted in a lot of computation (12,000,000 x 400 x 200 = 960 billion numbers - where we had 12 million job ads, the average job ad was 400 words long, and each word was converted to a vector of 200 numbers). We needed a very big computer to do this - fortunately we have modern computing services.
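As a rough illustration, here’s a minimal sketch of this embedding step in Python using the gensim library. The model file name, the 200-dimension setting and the simple whitespace tokenisation are illustrative assumptions, not our production pipeline.

```python
# Illustrative sketch only - the file name and tokenisation are assumptions.
import numpy as np
from gensim.models import KeyedVectors

# Load trained 200-dimension word2vec vectors (trained as in Part 1).
vectors = KeyedVectors.load("word2vec_200d.kv")

def embed_ad(ad_text: str) -> np.ndarray:
    """Convert a job ad into one 200-number vector per word,
    skipping any word the model has never seen."""
    words = ad_text.lower().split()
    return np.array([vectors[word] for word in words if word in vectors])

# A 400-word job ad becomes roughly a 400 x 200 array of numbers.
```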

Once the words in every job ad were embedded, we calculated the semantic similarity of each phrase with a list of stereotypical phrases that we collated. To get our phrases, we built on previous research into masculine and feminine workplace stereotypes.

Now, you may have noticed that we jumped from embedding individual words to embedding phrases. We can do this because word embeddings can be added together to approximate the meaning of a phrase or set of words. For example, if you add the word embeddings of ‘king’ and ‘woman’ together, you get an embedding that is semantically similar to ‘queen’. In the same way, we can take the word embeddings for ‘taking’ and ‘charge’ and add them together to represent the phrase ‘taking charge’. ‘Taking charge’ is a masculine stereotype identified in the literature, and the embedding for ‘taking charge’ is more semantically similar to ‘man’ than it is to ‘woman’ (according to the 5 million Wikipedia articles we analysed in Part 1 of the blog).
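Continuing the sketch above, phrase composition is just vector addition. The comparison at the end shows the kind of gender association we’d expect from a model trained as in Part 1; the exact scores depend on the training data.

```python
# Compose a phrase vector by adding word vectors (sketch, reusing
# the `vectors` object loaded in the earlier snippet).
phrase = vectors["taking"] + vectors["charge"]

# Compare the phrase with 'man' and 'woman' using cosine similarity.
sims = vectors.cosine_similarities(
    phrase, np.array([vectors["man"], vectors["woman"]])
)
# In a model carrying this stereotype, the first score ('man')
# comes out higher than the second ('woman').
print(sims)
```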

Quantifying semantic similarity to gendered stereotypes

For each job ad, we looked at how semantically similar each phrase in the job ad was to each gendered stereotype. We first converted each phrase in the job ad to a single vector. To do this, we removed all stop words, then summed the word embeddings within each 3-gram (trigram) in every sentence. We then calculated the cosine similarity, a measure of semantic similarity, between each stereotype and each 3-gram in the job ad. This gave us a range of scores, one for each phrase in the job ad.
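A sketch of this scoring step might look like the following. The stop word list here is a tiny illustrative stand-in for the full list we used, and `vectors` is the word2vec model from the earlier sketches.

```python
import numpy as np

# Tiny illustrative stop word list - the real list was much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "who", "is"}

def trigram_scores(sentence: str, stereotype_vec: np.ndarray) -> list:
    """Cosine similarity between a stereotype vector and every
    3-gram (after stop word removal) in a sentence."""
    words = [w for w in sentence.lower().split()
             if w not in STOP_WORDS and w in vectors]
    scores = []
    for i in range(len(words) - 2):
        # Sum the three word embeddings into one phrase vector.
        phrase_vec = sum(vectors[w] for w in words[i:i + 3])
        # Cosine similarity = normalised dot product.
        sim = np.dot(phrase_vec, stereotype_vec) / (
            np.linalg.norm(phrase_vec) * np.linalg.norm(stereotype_vec))
        scores.append(float(sim))
    return scores
```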

To combine all the phrases and summarise the overall ‘stereotypicalness’ of the language in a given job ad, we took the 95th percentile of the scores. While this score could in theory range from -1 to 1, in practice it tended to range from 0.17 to 0.53 in our analyses. We then multiplied it by 100 for ease of interpretation. For example, for the masculine stereotype ‘being in control’, cyber security job ads had the highest average score at 44, while Arts and Media Professionals job ads had the lowest average score at 37.
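Summarising one ad is then a one-liner; the example score in the comment just shows the scaling, not a real result.

```python
import numpy as np

def ad_score(phrase_scores: list) -> float:
    """Overall 'stereotypicalness' of one ad: the 95th percentile
    of its phrase similarity scores, scaled by 100 for readability."""
    return float(np.percentile(phrase_scores, 95)) * 100

# e.g. if the 95th percentile of an ad's similarities is 0.44,
# the ad scores 44 on that stereotype.
```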

Our analysis found that cyber security job ads use the most stereotypically masculine language of any occupation across a number of stereotypes,¹ including ‘lead/leadership’, ‘analysis’, ‘being in control’ and ‘problem solving’. While women clearly also possess these qualities, this language is stereotypically or subconsciously associated with men, and may therefore discourage women from applying to job ads that rely on these stereotypes. See Gaucher et al. (2011) for the original research.

You can read the full results in our report. You can also access all the code that underlies our analysis on request - please email BETA@pmc.gov.au.

Where to next? Machine learning and natural language processing in government

Our analysis of job ads is just one example of how government can use new tools to help inform public policy. There are many more opportunities to apply NLP models to understand bias in communications - for example, to check whether the communications we produce are skewed towards a particular audience, or whether other industries are biased in their advertising and communications in ways that may be driving workforce shortages.

  1. We used ANZSCO sub-major (level 2) codes to analyse different occupations, with an additional code for cyber security. We relied on Lightcast’s coding of job ads into their respective sub-major occupation code. See the final report for the list of occupations.