Using AI to Create Custom Listening Materials

All teachers are materials writers. We all make worksheets, think up activities, select texts and create writing assignments. There is, however, one type of material which is extremely challenging for us to make ourselves: audio.

Matching the quality of audio from coursebooks is extremely difficult. Coursebook recordings usually include a range of accents and speakers of different ages and genders. Recordings are made using high-quality microphones in expensive studios. After recording, editors take out hiss, remove gaps, and normalise the volume. If you want to create your own audio, you might record yourself and a friend using your phone or online meeting software. But this isn’t always helpful. Your students already get plenty of practice listening to you in class. They need help with other accents. Even if you can find other speakers to record, the audio quality is unlikely to match that of commercially produced textbooks. All of this makes creating your own audio a bit of a no-go zone for teachers.

But this has changed. Teachers can now create audio with an even greater variety of accents than most coursebook writers have access to. They can generate audio free of hiss and enjoy perfect sound quality. Teachers don’t need access to fancy studios or expensive microphones. They just need a computer and access to text-to-speech AI.

In this blog post we’re going to use AI to generate audio for use in language classes. I’ll introduce you to some of the best free text-to-speech AI tools and how to use these to create high quality audio at the right level for your students. We’ll also look at how to use free audio editing software to create dialogues. Finally, we’ll examine some of the limitations of these tools.

Why Use AI to Create Audio?

Text-to-speech AI makes it possible for teachers to create their own audio instead of relying on coursebook listening texts. But why would you want to do that? Here are some reasons why language teachers might use text-to-speech AI to generate audio for their classes:

Customizable. You can create audio to match the vocabulary, grammar, or topics your students are studying.
Accents. Many text-to-speech AI tools offer different accents, exposing students to a wider variety of pronunciation.
Speed. You can generate audio files quickly, and much faster than recording yourself or trawling YouTube for suitable conversations.
Adjustable. Text-to-speech software allows you to control the speed of speech. Audio can be slowed down to make it easier for beginners, or speeded up to challenge advanced learners.
Editable. Making changes to AI-generated audio is also straightforward. It’s easy to insert pauses, add or delete phrases, change the speakers, or slow down the audio.

Creating Audio Scripts

If you’re creating audio, you’re probably going to create your own audio scripts first. So, before we get into using text-to-speech AI, let’s briefly look at some tips for generating audio scripts.

Make it realistic. One of the best places to start is with real speech that has been transcribed as part of a spoken language corpus, like these, from movies and TV shows. Search for the target language that you want to put in your dialogue and be inspired by the contexts that you find.
Don’t make it perfect. We don’t speak perfectly, so it’s okay if your audio isn’t 100% grammatically correct. Run-on sentences are common in spoken English, as are contractions, repetitive constructions, speakers finishing each other’s sentences, and even mistakes.
Keep it short. What might look like a relatively short script can turn into a mammoth listening activity. Plus, some of the text-to-speech AIs that we’ll look at later have limitations on how much audio you can create.
Write scripts at the right level. Include vocabulary and grammar that you want your students to decode for meaning. Check how complex your script is before you give it to them (for example, using Hemingway Editor or Oxford’s Text Checker).
Make it interesting. Base your scripts on topics that are relevant and appropriate for your students' level and interests.

Creating a Script

For the most part, writing audio scripts is similar to writing listening texts. However, spoken language can be quite different to written language. Without specific instructions, AI text generators will create text that looks like writing, rather than speaking. Make sure that you include details in your prompt about the type of text you are trying to create.

Prompt: Please write a short [conversation / monologue] between [description of the speakers]. They are talking about [topic]. Word limit: [number] words. Please write this using spoken English conventions and include appropriate aspects of language, such as contractions.

A dialogue generated by ChatGPT from the prompt above
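If you find yourself reusing this prompt for lots of different topics, you could fill in the blanks with a few lines of Python. This is only a minimal sketch: the function name `build_script_prompt` and its parameters are my own invention for illustration, and the wording is simply the prompt above with the square brackets replaced by placeholders.

```python
# Hypothetical helper: fills in the blanks of the prompt template from this post.
PROMPT_TEMPLATE = (
    "Please write a short {text_type} between {speakers}. "
    "They are talking about {topic}. Word limit: {word_limit} words. "
    "Please write this using spoken English conventions and include "
    "appropriate aspects of language, such as contractions."
)


def build_script_prompt(text_type, speakers, topic, word_limit):
    """Return the prompt text with the four blanks filled in."""
    if text_type not in ("conversation", "monologue"):
        raise ValueError("text_type must be 'conversation' or 'monologue'")
    return PROMPT_TEMPLATE.format(
        text_type=text_type,
        speakers=speakers,
        topic=topic,
        word_limit=word_limit,
    )


# Example: print a prompt that is ready to paste into a chatbot.
print(build_script_prompt("conversation", "two university students",
                          "plans for the weekend", 120))
```

Paste the printed text into your AI text generator of choice, then check the result as you would any other script.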

Example transcripts of spoken English on similar topics or in similar situations can also help. By first asking AI to analyze the transcript, the AI is more likely to focus on the key elements and reproduce these in a new script.

Prompt: Read the transcript below. Analyze this transcript for features of spoken English. Then create a new transcript which includes the same features of spoken English, on the topic of [topic].

A transcript generated by ChatGPT from an example

For more about creating texts using AI, check out our blog post about creating materials, here.

Text to Speech AI Tools

There are a few text-to-speech AI tools, although far fewer than there are for speech-to-text.

Natural Readers has a wide range of high-quality voices. Users can search by country to choose an appropriate accent, and can preview each speaker using the default text. Natural Readers limits free users to five minutes of audio generation per day. If you use this tool, double-check your script before turning it into audio. Otherwise, you might use up your five-minute allowance and need to wait 24 hours to create your next file.

TTS Maker also has a range of speakers and accents, although not quite as many as Natural Readers. However, TTS Maker is free, and doesn’t limit users (even free ones) on how much audio they can create each day. There is a limit of 1000 characters per file, which is roughly 150-200 words. That works out to about one minute of audio. If this isn’t long enough, create multiple audio files and splice these together using free software like Audacity (more on this later).

Eleven Labs has perhaps the most realistic-sounding voices, and lets new users create up to 10,000 characters of audio for free (around 10 minutes’ worth). After that you’ll need to pay or create a new account.

Cereproc has a range of accents, mostly from different areas of the UK. Although Cereproc is a paid service, their website has a free demo which teachers can use to create audio.

Narakeet allows users to create around 100 words (or about 45 seconds) of audio for free. You can do this up to 20 times before being charged.

Murf AI lets users create audio for free. However, you are limited to just 250 characters each time – only around 15 seconds! Unless your students have extremely short attention spans, this probably isn’t the best choice.

Creating Audio with AI

Let’s look at how to use one of these text-to-speech AI tools to create some audio. I’m going to use TTS Maker because it’s got fairly good voices, gives users a lot of control with settings, and (most importantly) it’s free.

Let’s start with the text we want to turn into audio. (If you want to know how to use AI to generate texts, read our blog post on this.) Start by highlighting the text. If you’re using TTS Maker, you’re limited to 1000 characters each time, so you’ll want to check you’ve not exceeded the limit before you copy and paste. You can do this by highlighting the text in Microsoft Word, clicking “Review” at the top of the page, and then clicking “Word Count”. This will show you the number of characters, with and without spaces (TTS Maker counts the spaces). Copy and paste this text (no more than 1000 characters) into TTS Maker.
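If your script is longer than the limit, you can also let a short script do the counting and splitting for you. The sketch below is an assumption-laden illustration rather than anything TTS Maker provides: `split_script` is a hypothetical helper, the 1000-character figure is the limit quoted in this post, and sentences are detected naively by looking for ".", "!" or "?" followed by a space.

```python
import re

# Per-file limit quoted in this post (spaces are counted). Treat as an assumption.
TTS_MAKER_CHAR_LIMIT = 1000


def split_script(script, limit=TTS_MAKER_CHAR_LIMIT):
    """Split a script into chunks of at most `limit` characters,
    breaking only at the ends of sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence) > limit:
            raise ValueError("A single sentence is longer than the limit.")
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate      # the sentence still fits in this chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Example: report how long each chunk is (spaces included).
for number, chunk in enumerate(split_script("Hello there. " * 100), start=1):
    print(f"Chunk {number}: {len(chunk)} characters")
```

Each chunk can then be pasted into the tool separately, and the resulting files joined afterwards.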

Choose a Voice

All the text-to-speech generators mentioned earlier come with a range of voices. This lets you choose the accents for your audio. Most of the platforms above allow you to hear a short sample of each speaker before choosing them to create audio. You might want to choose accents your learners are familiar with, or accents your students are likely to encounter outside the classroom. You can also choose the gender of the speakers. This can be helpful in making dialogues easier or more challenging for students. (It’s easier to understand a conversation between speakers of different genders than between speakers of the same gender.) I note the voices that I like so I can find these quickly again in the future.

Some of the many accents available on Natural Readers

Other Settings

After choosing a speaker, set the speaking speed. Speeds are usually written as multipliers, so x1.05 is 5% faster than ‘normal’, although normal varies from speaker to speaker. You might want to slow down the speed of the audio for your students. Many of the AI voices in the platforms described above should sound relatively natural, even at slow speeds. Some teachers may want to generate faster audio based on the needs of the class. For example, you might need to match the speeds that learners will hear in a test or in a real-world context. Play around with the settings until you find a voice speed that your students will be comfortable listening to.
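To get a rough feel for what a speed setting does to the length of a recording, a quick calculation helps. This is a back-of-the-envelope sketch that assumes the multiplier simply scales the speaking rate, so the running time is the ‘normal’ time divided by the multiplier. Real tools may behave a little differently, and `adjusted_duration` is just an illustrative name.

```python
def adjusted_duration(normal_seconds, speed_multiplier):
    """Estimate how long the audio lasts at a given speed multiplier.

    Assumes the multiplier scales the speaking rate, so a higher
    multiplier means a shorter recording.
    """
    if speed_multiplier <= 0:
        raise ValueError("speed_multiplier must be greater than zero")
    return normal_seconds / speed_multiplier


# Example: a recording that lasts one minute at normal speed.
for speed in (0.8, 1.0, 1.05):
    print(f"x{speed}: about {adjusted_duration(60, speed):.0f} seconds")
```

So slowing a one-minute recording to x0.8 adds roughly a quarter of a minute, which is worth remembering when planning how long a listening activity will take.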

TTS Maker includes other options such as inserting pauses. To add a pause, select the place in the transcript, then click on the pause button, and choose the length of the pause. I find this is best done after creating a ‘draft’ audio, then hearing where the pauses should be longer (or shorter). If you’re really picky about pauses, you can add these yourself using audio editing software later.

Pauses in TTS Maker

When you think everything is where it should be, click “Convert to Speech” and create the audio. This usually takes a few seconds. It’s unlikely your audio will sound perfect on the first attempt. Take notes on the script as you listen, adding pauses, deleting contractions and changing phrases. Go through the process again, until your audio sounds as you want it to.

Creating Dialogues

Why Dialogues?

One of the biggest drawbacks of the AI tools we’ve looked at so far is that they only create audio one voice at a time. For better or worse, dialogues are one of the most common forms of audio that we ask students to listen to in class. In this section, we’re going to look at how to turn the audio created using the text-to-speech AI tools into dialogues. To do this, we’re going to need to use some software.

Audio Editing Software

I’m going to show you how to edit audio using Audacity. Audacity is free, open-source audio editing software. Although it was designed for recording and editing music, Audacity has everything you need to edit voices. To download Audacity, go here. Audacity isn’t the only option for editing audio. Some other editing programs are entirely online, like AudioMass and Twisted Wave.

Two Voices

Let’s put the editing software to one side and go back to generating the audio. Imagine that we have a script for a dialogue. Before we go from text to audio, we need to change the format of the script, organizing it by speaker. We need all the utterances from Speaker A together and all the utterances from Speaker B together.
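Reorganizing a script by hand is quick for a short dialogue, but it can also be automated. The sketch below assumes each line of your script starts with a speaker label followed by a colon (for example, "A: Hi there"). That format, and the helper name `group_by_speaker`, are my own assumptions for illustration.

```python
from collections import defaultdict


def group_by_speaker(script):
    """Collect each speaker's utterances, in order, from a script
    whose lines look like 'A: Hi there'."""
    grouped = defaultdict(list)
    for line in script.splitlines():
        if ":" not in line:
            continue  # skip blank lines or anything without a speaker label
        speaker, utterance = line.split(":", 1)
        grouped[speaker.strip()].append(utterance.strip())
    return dict(grouped)


script = """A: Hi, how's it going?
B: Not bad. You?
A: Yeah, pretty good."""

# Print one block of text per speaker, ready to paste into a text-to-speech tool.
for speaker, utterances in group_by_speaker(script).items():
    print(f"Speaker {speaker}:")
    print("\n".join(utterances))
```

Keeping each utterance on its own line makes it easier to spot where one turn ends and the next begins when you come to edit the audio.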

Copy and paste Speaker A’s utterances, and turn these into audio, using the process described above. Then, repeat the process for Speaker B, choosing a different voice. Choosing a different gender, age and accent should make the voices easier for learners to distinguish. We should now have two audio files: one for Speaker A and one for Speaker B.

Next, open Audacity. Create a new file by going to File -> New. Then add the audio files, either by dragging them into the window or by going to File -> Import -> Audio. Do this for both audio files (one for Speaker A and one for Speaker B). The audio from each speaker will now be on a separate track. Don’t worry if it looks like the audio is duplicated. These are the left and right channels in stereo (which you don’t need to worry about).

Audacity Interface, with Speaker A’s audio at the top and Speaker B’s at the bottom

Now we have all the audio together in one place, but in the wrong order. To get this in the right order, listen to the audio from Speaker A. Click at the end of their first utterance. Add a short bit of silence here (where Speaker B will talk) by going to Generate -> Silence. Choose roughly how long you want this silence to last. Then drag (or copy and paste) the audio for Speaker B into this silent section. Line up the clips so the speakers aren’t talking over each other. If the silence is too long, highlight the extra silence and hit “delete”. If it’s too short, add more silence. Repeat until you get to the end of your dialogue (which should look like the screenshot below).

To export, go to File -> Export audio. Choose a file type (like mp3) and a place to save. You now have a dialogue.
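For teachers who are comfortable with a little code, the same idea (utterances in order, with a short silence between them) can be sketched without Audacity. To be clear, this is a different route from the one described above and rests on several assumptions: each utterance has been saved as its own 16-bit PCM WAV file, all files share the same sample rate and number of channels, and the helper name `join_utterances` is hypothetical. It uses only Python’s built-in `wave` module.

```python
import wave


def join_utterances(paths, out_path, gap_seconds=0.5):
    """Join WAV files in the order given, with silence between them.

    Assumes 16-bit PCM files that all share one format; silence is
    written as zero-valued samples.
    """
    params = None
    pieces = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            current = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError(f"{path} has a different audio format")
            pieces.append(wav.readframes(wav.getnframes()))
    if params is None:
        raise ValueError("No input files were given")

    nchannels, sampwidth, framerate = params
    # One frame = one sample per channel, so the gap needs this many bytes.
    gap = b"\x00" * (int(gap_seconds * framerate) * nchannels * sampwidth)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(nchannels)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(gap.join(pieces))
```

You would call it with the files in speaking order, for example `join_utterances(["a1.wav", "b1.wav", "a2.wav"], "dialogue.wav")`, where the file names are placeholders. Audacity remains the easier choice if you want to fine-tune each pause by ear.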

This might sound like a lot of work. Once you get the hang of it, editing a short conversation (like the one above) from beginning to end should take less than five minutes.

Sound Effects

If you want to make your audio even more realistic, you can add sound effects over the top. For a dialogue taking place in a train station or airport, you might want to add some background noise. There are many free repositories for sound effects, such as Pixabay and ZapSplat. After downloading, add the files to your audio in the same way described above (but without the gaps). Adjust the volumes using the ‘-’ and ‘+’ under the ‘Effects’ tab at the left of each track.

Limitations

As with all AI technology, there are limits to what text-to-speech AI can do. Creating dialogues is a little complicated, although there are workarounds. The main limitations are with the voices themselves. AI voices aren’t real. That means that they might be missing some important aspects of ‘real world’ speech. This isn’t just a problem with AI-generated audio. Much of the audio in coursebooks (and even tests) is quite different from real-world speech. Make sure that you supplement any AI-generated audio with audio from authentic listening texts and your coursebook.

Although AI speech generators offer a range of accents from different countries, not every country or accent is available. The accents available skew towards more prestigious varieties of English. You might not be able to find the exact accent you want. But unless you work at the UN, you’ll probably have access to more accents than you did previously.

Conclusions

AI tools are changing how teachers create listening materials. These tools offer many benefits. They're fast, customizable, and free. Teachers can now make high-quality audio with accents from around the globe. However, there are some challenges. Creating dialogues takes extra steps and AI voices aren't perfect. They might miss some features of real speech. But overall, these tools give teachers new options. With just a computer and free tools, teachers now have the power to create their own listening texts for their classrooms. As AI improves, it will become an even more useful tool for language teachers.

Key Takeaways

Text-to-speech AI allows teachers to generate customizable, clear, and accent-diverse audio. The range of accents available is often greater than the variety found in coursebooks.
AI-generated audio is fast, adjustable, and editable. This gives teachers the flexibility to tailor listening materials to their students' needs.
Teachers can create dialogues using audio editing tools. With a little work, these can combine audio files from different speakers into one dialogue.
AI tools provide the opportunity to expose students to a range of accents. This helps learners develop listening skills for a variety of real-world English dialects.
While AI-generated voices are clear, they may lack some features of authentic speech. It is important to supplement AI-generated audio with voices from the real world.