Bocek - a talking Discord bot

The journey

As a 90s kid, I started my voice chat journey with Skype around 2009, playing Counter-Strike 1.6. Over the years I moved from Skype to TeamSpeak, mainly because it was less laggy, but the biggest nuisance was the constant search for free servers with empty rooms. Even finding one was no guarantee that some random person wouldn't join and start screaming. Fast forward a few years and boom, Discord was created and it changed the game, but not for all of my friends, as it seemed laggy for some of them and caused issues with games. So we went back to TeamSpeak. This time I had a spare PC, so it was pretty easy to start my own server that only friends could join. While setting up the server I came across the TeamSpeak 3 Python API and started playing with it. I implemented some pretty simple commands, like rolling a die, and it was great, but I quickly realised that the potential was pretty limited. Soon after that a friend of mine upgraded his setup and running Discord was no longer an issue for him. With that, the final transition happened: we settled on Discord with no problems, which meant I could put some effort into the server, and that's when the idea struck me. Discord must have an API, and hell yeah, there is one. Checking out the documentation for discord.py, I realised the potential. You can create a bot that reacts to messages, joins a channel, plays audio... and that's how Bocek slowly came to life.

The concept

As with the TeamSpeak bot, I started with some basics: dice rolls, replying to messages, etc. With the discord.py quickstart it is possible to make a bot in 15 lines of code. But that wasn't enough. I had seen bots that can play YouTube videos, which means you can feed a bot an audio stream and it will be played on a channel. It would have been nice to implement such a thing myself, but I didn't want to reinvent the wheel. Having some experience with Google Cloud (you could also call it wandering through their UI for hours until I find what I want), I remembered that they have Text-to-Speech functionality. That's perfect: there are tons of stupid sayings that we use all the time, plus classic one-liners from movies, games and our childhoods. This is it: Bocek is going to have some sort of dictionary of possible lines, send a call to Google TTS and then play the result on the channel.

The implementation

Text-To-Speech Cog

To start things off, discord.py suggests dividing a bot's functionality into so-called cogs.

from discord.ext.commands import Cog
from google.cloud import texttospeech

class Tts(Cog, name="tts"):
    def __init__(self, bot):
        self.bot = bot  # a cog needs to have a reference to the bot object
        self.client = texttospeech.TextToSpeechAsyncClient()

And voilà. Next on the table is creating an audio stream or file that can later be played on a channel. When making a call you can specify all the details of the resulting audio, such as voice, pitch, speaking rate, volume and of course the language. It is worth mentioning that it's possible to provide SSML input to the TTS API, but I thought it would be too much trouble when generating all of the various lines.

    async def tts_google(
        self,
        text,
        pitch=0.0,
        voice="pl-PL-Wavenet-B",
        volume=0.0,
        speaking_rate=0.9,
        lang="pl-PL",
    ):
        if not text:
            return

        tts = texttospeech.SynthesisInput(text=text)
        # voice creation
        voice_params = texttospeech.VoiceSelectionParams(language_code=lang, name=voice)
        # additional params
        audio_config = texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3,
            pitch=pitch,
            volume_gain_db=volume,
            speaking_rate=speaking_rate,
        )
        ...

As you can see, the method exposes all the parameters so that they can be set from outside, and then builds the configuration that will be sent alongside the text. After that, it's finally ready to call the endpoint responsible for synthesizing the speech.

        ...
        # generate response
        retry = AsyncRetry(initial=0.1, multiplier=2, deadline=10)
        response = None
        try:
            response = await self.client.synthesize_speech(
                input=tts,
                voice=voice_params,
                audio_config=audio_config,
                retry=retry,
            )
        except Exception as exc:
            log.exception(exc)
        if response is None:
            return None
        # save the response
        tts_path = MP3_DIR / f"{uuid4().hex[:10]}.mp3"
        with tts_path.open("wb") as out:
            out.write(response.audio_content)
        return tts_path

Just to make sure that the request goes through, I'm using AsyncRetry with a 10 s deadline. If everything goes well, we get a response containing the audio content, which is saved to a uniquely named .mp3 file. Why am I not using the raw audio content directly? I'll just say that saving and playing .mp3 files was more reliable for me and easier to debug.
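
To make the retry settings a bit more concrete, here is a rough, library-free sketch of what exponential backoff with initial=0.1, multiplier=2 and a 10 s deadline amounts to. It is not how google.api_core implements AsyncRetry (the real class, as far as I know, also randomizes the delays and only retries errors it considers transient); backoff_delays and retry_with_backoff are hypothetical names used only for illustration.

```python
import asyncio
import time


def backoff_delays(initial=0.1, multiplier=2.0, maximum=60.0):
    """Yield sleep times: initial, initial * multiplier, ... capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= multiplier


async def retry_with_backoff(func, deadline=10.0, sleep=asyncio.sleep):
    """Call the coroutine function `func` until it succeeds or the deadline passes."""
    give_up_at = time.monotonic() + deadline
    for delay in backoff_delays():
        try:
            return await func()
        except Exception:
            # Re-raise the last error if the next sleep would overshoot the deadline.
            if time.monotonic() + delay > give_up_at:
                raise
            await sleep(delay)
```

With these numbers the waits are 0.1 s, 0.2 s, 0.4 s and so on, so a handful of attempts fit comfortably into the 10 s budget before the error is finally passed on.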

In addition to that I wrote some utility functions, such as deleting a specific mp3 or all of them in one go, or selecting a random voice. With all of that in place it was pretty easy to implement a Discord command that allows users to create their own TTS, and you can get pretty crazy with the argument autocompletion. I went as far as requesting the list of supported voices on a command call; that list is then cached for later command calls (check it out on GitHub).
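
The caching of the voice list can be sketched roughly like this. It is not the code from the repository: VoiceCache and the fetch callable are hypothetical, and in the real cog the fetch would presumably wrap the Text-to-Speech client's request for the supported voices.

```python
import asyncio


class VoiceCache:
    """Fetch the list of voices once and reuse it for later command calls."""

    def __init__(self, fetch):
        self._fetch = fetch          # coroutine function returning a list of voice names
        self._voices = None          # None means "not fetched yet"
        self._lock = asyncio.Lock()  # avoid two simultaneous first fetches

    async def get(self):
        async with self._lock:
            if self._voices is None:
                self._voices = await self._fetch()
        return self._voices

    async def autocomplete(self, prefix, limit=25):
        """Return cached voices that start with what the user has typed so far."""
        voices = await self.get()
        matches = [v for v in voices if v.lower().startswith(prefix.lower())]
        return matches[:limit]  # Discord shows at most 25 autocomplete choices
```

Only the first autocomplete call pays for the network round trip; every later keystroke is answered from memory.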

Connecting and playing

Having .mp3 files is great, but Bocek needs a way of playing them while connected to a voice channel. Because it is a local bot, meant for only one channel, there is only ever one voice connection. The initial version made Bocek join the channel, talk and then disconnect, but that quickly became pretty annoying with all the sound notifications from Discord. So now he joins the channel as soon as someone else does and leaves when he is the last man standing. The main logic of the play functionality goes like this:

    async def play_on_channel(self, message=None):
        if not self.ready:
            return
        if len(self.voice_channel.members) <= 1 and self.vc:
            await self.disconnect_from_voice()
            return

        if not self.vc:
            self.vc = await self.voice_channel.connect()

        if self.vc and self.vc.is_playing():
            log.warning("Already playing")
            return
        duration = MP3(message).info.length
        self.vc.play(
            discord.FFmpegOpusAudio(
                executable=FFMPEG, source=message, options="-loglevel panic"
            )
        )

        timeout = time() + duration + 1  # timeout is audio duration + 1s
        # Sleep while audio is playing.
        while self.vc and self.vc.is_playing() and time() < timeout:
            await asyncio.sleep(0.1)
        else:
            await asyncio.sleep(0.5)  # sometimes mp3 is still playing
            # await self.vc.disconnect()
        await self.tts.delete_tts(message)

The code is pretty self-explanatory: I'm using ffmpeg to play the audio. FFMPEG points to the location of the executable (OS dependent: on Windows it needs the .exe suffix, on Linux it points to /usr/bin/ffmpeg).
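
One possible way to set such a constant without hard-coding it per machine is sketched below. This is an assumption on my side rather than what the bot does: find_ffmpeg and the bin fallback directory are made-up names, and the sketch simply prefers whatever ffmpeg is found on PATH.

```python
import platform
import shutil
from pathlib import Path


def find_ffmpeg(system=None, which=shutil.which, fallback_dir=Path("bin")):
    """Return a path to the ffmpeg executable for the current OS as a string."""
    system = system or platform.system()
    name = "ffmpeg.exe" if system == "Windows" else "ffmpeg"
    # Prefer whatever is on PATH (e.g. /usr/bin/ffmpeg on Linux) ...
    found = which(name)
    if found:
        return found
    # ... otherwise fall back to a copy shipped next to the bot (hypothetical location).
    return str(fallback_dir / name)


FFMPEG = find_ffmpeg()
```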

Bocek is now able to speak, but he's not speaking on his own; it would be great if he said some bullshit when no one asked him. For that he needed a glossary from which he could choose a line. It's not rocket science, just a big JSON file with a list of strings that are fed into the TTS cog. One interesting feature is that anything between the {} inside a string is evaluated as Python code, which is great for all sorts of things, like inserting a user's nickname or finding a rhyme for it (there is another cog for finding rhymes):

"{user} hello"
"{user} {self.get_rhyme(user)}"

I know that it is not safe to evaluate code from a string, but I own the project and there is no way someone could push changes without me knowing.
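
For illustration, the placeholder handling can be sketched without the rest of the bot. The helpers below are simplified stand-ins and may well differ from what is in the repository: find_placeholders and render are hypothetical names, and this replace_all is only a guess at what a helper of that name does.

```python
import re

# Anything between a pair of curly braces counts as a placeholder.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def find_placeholders(line):
    """Return the expressions found between curly braces, in order of appearance."""
    return PLACEHOLDER.findall(line)


def replace_all(text, replacements):
    """Replace every key of `replacements` found in `text` with its value."""
    for old, new in replacements.items():
        text = text.replace(old, str(new))
    return text


def render(line, scope):
    """Evaluate each placeholder with the names in `scope` and substitute the result."""
    placeholders = find_placeholders(line)
    # eval() runs arbitrary code, so this is only acceptable for a trusted glossary.
    values = {f"{{{p}}}": eval(p, {}, scope) for p in placeholders}
    return replace_all(line, values)


# A toy "rhyme" function standing in for the real rhyme cog.
scope = {"user": "Bob", "get_rhyme": lambda name: name[::-1]}
print(render("{user} {get_rhyme(user)}", scope))  # prints "Bob boB"
```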

RandomEvent cog

The only thing left is to make him say these lines randomly. To keep the modular approach I created a new cog named RandomEvent and put all the logic in there. It makes use of discord.py tasks, which basically let you run some code in a loop in the background. When the cog is initialized, it loads the glossary and starts the loop:

class RandomEvent(Cog, name="random_event"):
    def __init__(self, bot: commands.Bot):
        self.bot: commands.Bot = bot
        self.glossary = Glossary(self, "random_join.json")
        self.tzinfo = datetime.now().astimezone().tzinfo
        self.join_at = None
        self.update_join_time()
        self.random_check.start()

update_join_time randomly selects the time (here between 8 and 10 minutes from now) when Bocek will next say something, changes the interval of the loop accordingly and logs it.

    def update_join_time(self):
        new_interval = randint(8 * 60, 10 * 60)
        join_at = datetime.now(self.tzinfo) + timedelta(seconds=new_interval)
        self.join_at = join_at.strftime("%H:%M:%S")
        self.random_check.change_interval(seconds=new_interval)
        log.info(f"Next random join at {self.join_at}")
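
The time arithmetic in that method can be tried out in isolation. The snippet below is only a sketch with made-up names (pick_next_join is hypothetical) that takes the current time and the random generator as arguments so that it is easy to test.

```python
from datetime import datetime, timedelta
from random import Random


def pick_next_join(now, rng, low=8 * 60, high=10 * 60):
    """Return (interval in seconds, 'HH:MM:SS' of the next random join)."""
    interval = rng.randint(low, high)  # both ends are inclusive
    join_at = now + timedelta(seconds=interval)
    return interval, join_at.strftime("%H:%M:%S")


interval, join_at = pick_next_join(datetime(2024, 1, 1, 21, 55, 0), Random())
# The printed time lands somewhere between 22:03:00 and 22:05:00.
print(f"Next random join in {interval}s, at {join_at}")
```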

Then there is random_say, which evaluates the placeholders in the glossary. The special ones are handled first: user gets replaced with a random user and all_users with all users on the channel.

    def random_say(self):
        if members := [
            x.global_name for x in self.bot.voice_channel.members if not x.bot
        ]:
            msg, placeholders = self.glossary.get_random()
            if "user" in placeholders:
                user = choice(members)
            if "all_users" in placeholders:
                all_users = ", ".join(members) if len(members) > 1 else members[0]
            scope = locals()
            msg = replace_all(msg, {f"{{{p}}}": eval(p, scope) for p in placeholders})
            return msg
        return None

And finally the loop itself, which updates the next random event time, creates a TTS and then passes it to the play_on_channel function.

    @tasks.loop(seconds=8 * 60)
    async def random_check(self):
        self.update_join_time()
        if len(self.bot.voice_channel.members) <= 1:
            return
        if msg := self.random_say():
            tts = await self.bot.tts.create_tts(msg, random=True)
            await self.bot.play_on_channel(tts)

Greeting users

With these two cogs the bot comes to life, but there is one tiny detail that I added to make him even more sociable: greeting people when they join the voice channel. The on_voice_state_update hook is called every time something happens on a voice channel, and its parameters are the member that caused all that commotion plus their voice state before and after the change.

    async def on_voice_state_update(self, member: discord.Member, before, after):
        if member == self.user or not self.ready:
            return
        if after.channel is None:  # the member left voice entirely, nobody to greet
            return
        if before.channel != after.channel and after.channel == self.voice_channel:
            await asyncio.sleep(0.75)
            to_say, placeholders = self.glossary.get_random("greetings")
            user = member.name
            scope = locals()
            to_say = replace_all(
                to_say, {f"{{{p}}}": eval(p, scope) for p in placeholders}
            )
            tts = await self.tts.create_tts(to_say, random=True)
            await self.play_on_channel(tts)

By filtering out all other events with a few if statements, it picks one of the greetings, replaces {user} from the glossary with the user's nick and finally plays it on the channel.
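
The core of that filtering is a single condition, which can be checked without Discord at all. In the sketch below FakeVoiceState is a made-up stand-in for discord.VoiceState (only its channel attribute matters here) and joined_channel is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeVoiceState:
    """Stand-in for discord.VoiceState; channel is None when not connected."""
    channel: Optional[str]


def joined_channel(before, after, watched):
    """True only when the member moved into `watched` from somewhere else."""
    return before.channel != after.channel and after.channel == watched


# Joining from nowhere triggers a greeting ...
print(joined_channel(FakeVoiceState(None), FakeVoiceState("general"), "general"))  # True
# ... while muting or unmuting inside the channel does not.
print(joined_channel(FakeVoiceState("general"), FakeVoiceState("general"), "general"))  # False
```

Events such as mute, deafen or starting a stream leave the channel unchanged, so comparing before.channel with after.channel is enough to ignore them.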

Final thoughts

Bocek has been "alive" since 2023, and over time you start to remember all his quotes, but he can still surprise you, for example with his timing. Imagine it's getting late and everyone is wondering whether to play one last game or go to sleep, and Bocek randomly jumps in and says "time for bed, boys". Although debugging some features is a pain, I think he has brought me the most joy out of all the projects I've done, as I get to experience him a few times a week on the channel. The best part is that when you're tired of him you can just mute him and he won't make a fuss about it.

If you want to see the full code, check it out here. There is also another part of the Bocek series: live game commentary.