Localized Phoneme / Mood / Pose Expression System

Hello all,

Since a significant part of Chop Chop involves interacting with characters in dialogue / cutscene sequences, I have been thinking that it might be good to add in a little more nuance to characters. I know this project is meant to be a vertical slice and that this feature hasn't been specifically requested, but I actually developed this system for my own game and thought it would be nice to share the implementation if there is sufficient interest!

The system that I'm working on (implementation described in detail below) leverages Unity's Timeline API to create custom sequencing components, which can be used to define a character's mood as a function of time. The character's mood is then used in other sub-systems to control its eye textures, mouth textures, and animations. Additional features like eye blinking and lip-syncing are also supported and are ultimately determined by the character's mood. Finally, the system supports localization, so the lip-syncing mouth textures used on a character will change with the active language, for example.

Combined, I just refer to the whole thing as a Character Expression System. But before I carry on, please watch the demonstration video link below to get an idea of what this system does. Please note that the system runs smoother than is shown in the video. My computer just has trouble with video recording software.


The potentially "controversial" part of my system is the use of phonemes. If you don't know what those are, they are the basic sounds that make up spoken words, and they are commonly used to lip-sync characters to spoken dialogue. That's why I mentioned this system could be a little controversial...given that the characters will be speaking Sim-lish / Animal Crossing style. Still, I personally like the effect. Even though the audio won't be spoken language, the dialogue text will be, and I think it looks quite good when the character's lips sync with the text the player is reading on-screen. This also avoids the problem of having the same mouth animation play out every time a character speaks, as in the current implementation. But of course, a system like this has some tradeoffs too...for one, many more textures will be needed for each character. I'm happy to work on that, if needed, but I'm not an artist by trade, so please feel free to usurp me if your talents are better than my own (you probably won't have to try very hard).

Here's an overview of how the system works:
(1) A custom Timeline track / clip / behaviour is defined for character moods (see image below)

6760660--780487--IMG01.PNG

Note a few things in this image:

  • As you can see in the clip settings to the right, a custom "mood set" can be assigned to each of these clips...more on that later.
  • The CharacterMoodTrack has a binding to the main system I use: ExpressionManager, which is itself derived from MonoBehaviour.
  • There are additional clip settings to control which animation to play. As will be shown below, mood sets allow the assignment of an array of animations...so you might have five animations, for example, that go along with a character's "happy" mood. Now, you might want the animation that plays out to either be selected randomly (default) or you might want to force a specific animation within the array to play (that's what the animation index is for).
  • The CharacterMoodClips are independent of DialogueClips, so a character's mood can change mid-sentence or even when there is no dialogue at all
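The random-versus-forced animation choice described above can be sketched in plain C#. This is only an illustration: MoodAnimationPicker, PickIndex, and the convention that a negative index means "pick randomly" are my assumptions, not names or behaviour taken from the PR.

```csharp
using System;

// Hypothetical helper; the names here are illustrative and not from the PR.
public static class MoodAnimationPicker
{
    // clipCount:   number of animations assigned to the mood set.
    // forcedIndex: animation index from the clip settings; a negative value
    //              is assumed here to mean "no forced index, pick randomly".
    public static int PickIndex(int clipCount, int forcedIndex, Random rng)
    {
        if (clipCount <= 0)
            throw new ArgumentException("A mood set needs at least one animation.", nameof(clipCount));

        if (forcedIndex >= 0)
            return Math.Min(forcedIndex, clipCount - 1); // clamp out-of-range indices

        return rng.Next(clipCount); // uniform pick in [0, clipCount)
    }
}
```

Clamping an out-of-range forced index is a design choice in this sketch; falling back to the default animation would be just as reasonable.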

Before we continue, I had to crop a bunch of images together into one (see below) because I was running into my upload limit. Please reference this image as the systems below are discussed.

6760660--780568--Systems.png

(2) A mood system is implemented. This system again leverages Timeline and defines custom tracks / clips / behaviours. The mood system allows you to set the character's mood (happy / sad / angry / explaining, etc.) and the mood is then used to set textures for the eyebrows, eyes, mouth, and character pose. And yes, I know "explaining" is not a mood, but in this demo, I was trying to stretch the utility of this system. Maybe "mood system" isn't the best label since it really can be used to model lots of character states.

Here are the elements that define a mood collection...

  • The actor that it affects, represented by the ActorSO object
  • The mood itself, an enumeration that is used internally as a key in a dictionary to quickly reference other mood collection properties
  • Three eye textures for blinking - eyes fully open, eyes mid-blink, and eyes fully closed.
  • The mouth textures are obtained from another system called PhonemeSetSO (more on that later...). This system is localized because phonemes (the basic mouth shapes used for lip syncing) are language-specific. Here again, the character's mouth shape will be a function of its mood...so an "AH" sound, for example, will have different textures whether the character is happy or sad, etc.
  • As for the character pose, I currently have implemented small (and probably not very good) animations (I am not an animator) and use Animator.CrossFade to fade-in to the pose.
  • Finally, you'll see the animator clip title section. Here, the titles of anim clips (already in the character's Animator Controller) that you'd like to play out for this mood are set. Internally, the titles are converted to hashes for efficiency.
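As a rough sketch of the "mood as dictionary key" idea, here is a stripped-down version with no Unity dependencies. MoodCollection and MoodLibrary are hypothetical stand-ins, and plain strings take the place of texture and animation clip references; the real asset in the PR will look different.

```csharp
using System.Collections.Generic;

public enum Mood { Happy, Sad, Angry, Explaining }

// Stand-in for a mood collection; strings replace Unity texture / clip references.
public class MoodCollection
{
    public Mood Mood;
    public string EyesOpen, EyesMid, EyesClosed; // the three blink frames
    public List<string> AnimatorClipTitles = new List<string>();
}

// Hypothetical lookup keyed by the mood enumeration.
public class MoodLibrary
{
    private readonly Dictionary<Mood, MoodCollection> _byMood =
        new Dictionary<Mood, MoodCollection>();

    public void Register(MoodCollection collection)
    {
        _byMood[collection.Mood] = collection; // last registration for a mood wins
    }

    // Returns false when no collection exists, so the caller can fall back to defaults.
    public bool TryGet(Mood mood, out MoodCollection collection)
    {
        return _byMood.TryGetValue(mood, out collection);
    }
}
```

In a Unity project the clip titles would typically be converted once with Animator.StringToHash and the hashes cached, which is the "converted to hashes for efficiency" step mentioned above.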

(3) The PhonemeSetSO class, which defines the base sounds that a character can make, links the appropriate mouth texture to use for a given base sound. As can be seen in the thumbnail below, lots of sounds ("K", "AA", "AE", ...) use the "Happy_AH" mouth texture. If you'd like to know more about these base sounds, I use the CMU pronouncing dictionary. Remember also that in the mood collections (described above), the PhonemeSetSO member is localized. That means that you'll need one of these classes for each language and each mood. Ex: If we have happy and sad moods and support English and French, then we'd need four PhonemeSetSOs defined. Furthermore, as far as I know, the CMU pronouncing dictionary is only intended for the English language. Other phoneme sets can be found for other languages.

The PhonemeSetSO class has a public function called GetMouthShape(string phonemeKey), which takes in the phoneme key (base sound, like "AH") and returns a Texture2D, which is ultimately passed along by the ExpressionManager to the appropriate ActorSO component. That ActorSO component then sets its internal mouth material mainTexture property to the Texture2D that was returned from GetMouthShape(...). Then, the mouth mesh on the character that is assigned the mouth material will automatically update its mainTexture component.

(4) Modification to DialogueLineSO

In addition to the original localized dialogue "Sentence", there is another LocalizedString for the phoneme sequence. Here you can see an example of how a regular line of text is (manually) entered as a phoneme sequence: "And win the game" = "EH N D . W IH N . TH AH . G EY M ."

The system interprets whitespace as a separator for parsing out the individual phonemes (using string.Split(' ')). For this reason, it is important not to enter double spaces: each extra space produces an empty entry, and no phoneme is defined for an empty string. You'll also note that periods "." are used in the phoneme sentence. A period marks the end of a word and tells the Expression System that the character should close its mouth before forming the next word. Note that this format is automatically generated by the CMU pronouncing dictionary. There are additional options to indicate primary and secondary emphasis of phonemes, which could potentially be added in to this system at a later date.
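To make the parsing rule concrete, here is a minimal sketch in plain C#. MouthStep and PhonemeSentenceParser are hypothetical names, not the types used in the PR. Note one deliberate difference from what is described above: this sketch passes StringSplitOptions.RemoveEmptyEntries, so accidental double spaces are simply skipped.

```csharp
using System;
using System.Collections.Generic;

// One step of the lip-sync sequence (hypothetical type).
public struct MouthStep
{
    public string Phoneme;  // phoneme code such as "AH"; null when the mouth closes
    public bool CloseMouth; // true for the "." end-of-word marker
}

public static class PhonemeSentenceParser
{
    public static List<MouthStep> Parse(string sentence)
    {
        var steps = new List<MouthStep>();
        if (string.IsNullOrEmpty(sentence))
            return steps;

        // Split on spaces; empty entries (from double spaces) are dropped.
        string[] tokens = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            if (token == ".")
                steps.Add(new MouthStep { Phoneme = null, CloseMouth = true });
            else
                steps.Add(new MouthStep { Phoneme = token, CloseMouth = false });
        }

        return steps;
    }
}
```

For the example sentence "EH N D . W IH N . TH AH . G EY M ." this yields 15 steps, four of which are mouth-close steps (one after each word).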

Just to be clear, no outside resource is actually necessary for this system to work. Once you learn the basic sound phonemes, you'll be able to type out these phoneme sentences on your own. In my implementation, all of these sentences are entered manually.

(5) Modification to DialogueBehaviourSO

This is where all the cool stuff related to parsing the phoneme sentence is done. First, the LocalizationSettings are used to retrieve the "LocalizedStringAsync", and when that asynchronous operation is completed, the localized phoneme set is stored. In other words, I just get the underlying PhonemeSet needed for the current language.

Then, this localized phoneme set is parsed, using whitespace as the phoneme separator. See the image below for more detail...

6760660--780595--DialogueBehaviourMod.png

(6) Modification to ActorSO, and eye / mouth texture solution

None of the base functionality has changed in ActorSO, but there are several default settings that need to be set here. For example, a default eye texture, mouth texture, and animation clip need to be set in case the system doesn't know what textures / animations to display at any point. Also, two material references are needed (one for the eyes, one for the mouth).

Within the ActorSO script, there are several public functions used by the Expression System to set the "mainTexture" property of the eye and mouth materials. Note that this represents a change in how the characters are currently set up (this character change will not be part of the PR, unless requested). Prior to my changes, there were multiple copies of facial meshes parented to the head bone in the character rig. These meshes were enabled / disabled using an animation to give the desired expression sequence. My system needs only one mesh for each facial component (L Eye, R Eye, Mouth -- and can be extended to include eyebrows easily), and changes the texture on each of those facial components. Of course, the system can easily be changed back to how things were previously, but I think my solution is more scalable if lots of textures are needed.

Below, I have attached some example eye and mouth textures using this mood / expression system. Please note that these textures were made very quickly with a mouse...I will be getting a graphics tablet soon and can make improved textures later. These textures were modeled after the artwork for the main character, Hamlet. Some use the original artwork directly.

6760660--780571--Mood Chart.png

And that's basically it! Of course, I haven't really talked about how the ExpressionManager itself works, but it essentially is just a front end that interfaces with functions on the ActorSO component. For example, at the appropriate time, it will call the ActorSO.SetMouthTexture(...) or ActorSO.SetEyeTexture(...) or ActorSO.TransitionToAnimatorClip(...). See the diagram below for more information on the core systems.

6760660--780640--Diagram.png

All-in-all, I think this system is pretty flexible and slots in quite nicely to the existing framework. And I think it allows for nuanced interactions with characters. But of course, the decision has to be left up to the community and Unity itself. I should mention though that this system doesn't interfere with any underlying systems and can be bypassed if one wants to use the currently implemented system.

I will be submitting a pull request with this system implemented, hopefully by tomorrow. No worries if it doesn't fit the style of the game. Note that this currently only works on one character. I am in the process of extending this system to all characters (it involves a lot of dictionaries!) and that will be in a future pull request.

Awesome work!

Hey, why do the facial expressions need to be localised? I didn’t quite understand that.

Localization is only applied to the phoneme sentence inside the clip and to the set of mouth textures used for lip-syncing.

The phoneme sentence is essentially the control lever behind the lip-syncing system. As I explained above, if a sentence contains an "AH" mouth, for example, the Expression system will interpret that to mean that the "AH" mouth texture should be applied to the relevant actor.

This needs to be localized because the basic set of sounds / mouth shapes differ by language. For example, the English alphabet has the letter "n" but Spanish has the letter "ñ" (enye). Those are two different sounds and form two different mouth shapes. It wouldn't make much sense to have "ñ" (enye) be part of the English phoneme set, since that's not a basic sound used in English. So while "n" is a phoneme in English, "ñ" (enye) is a phoneme in Spanish...at least, I think so...I'm not a Linguist.

And, of course, the same is true with the actual mouth texture that is meant to show the character uttering that basic sound. If you watch a French-speaker's mouth while speaking, it looks nothing like an English-speaker's mouth. So in order to localize this system so that everyone experiences the game in a way that they consider authentic, I wanted to at least give the option to localize the phoneme sentences and mouth textures. It probably wouldn't be considered very authentic, for example, if we had a character speaking Spanish that didn't properly roll its R's. But that's not something that we need for an English character.

And if that doesn't complicate things enough, phonology is not a perfect science. You'll often find several different phoneme sets for the same language. My system is aimed at achieving what I consider to be (at least close to) authentic English lip-syncing, which I can identify as a native English speaker. But my goal was to give localization teams the tools to create their own authentic lip-syncing solutions for their native language.

Part of the confusion may also be that I didn't actually demonstrate the localization system in the video, but I didn't want to create a ton of textures for other languages.

Hopefully that made things a little bit clearer. If not, here are some links that break down some of the basic phonemes / sounds in English and Spanish:

English Phonemes: https://en.wikipedia.org/wiki/English_phonology
Spanish Phonemes: http://assets.cambridge.org/052154/5382/excerpt/0521545382_excerpt.htm

Expression System now Supports Multiple Actors
I've finally implemented support for multiple actors (new PR commit to be submitted soon).

In the demonstration video below, I've copied the original Hamlet character and made an "evil" version. I did it this way because Hamlet is one of the few characters in the game that is fully rigged, and the textures that I've made thus far are configured for the UV coordinates in Hamlet's eye and mouth meshes.

Evil Hamlet is the opposite of Good Hamlet - when Good Hamlet is happy, Evil Hamlet is sad, and vice versa. Evil Hamlet obviously won't be a character in the game...this was just made to demonstrate the newest version of the Character Expression System.

You'll have to excuse the lack of polish in this demonstration. I wanted to quickly get something up and going. I typed in the phoneme sentences for Evil Hamlet very quickly and didn't spend much time adjusting, so his lip-syncing might seem a bit off. But the point of this video is to demonstrate that the system now works for multiple characters.


Because this system relies on materials as a shared data asset, in a similar way to how scriptable objects are used already in the project, we have lots of options for configuring the expressions on characters. For example, you could have a cutscene or dialogue sequence where Hamlet is talking to five Townsfolk that operate as a hive mind, meaning their eye textures, mouth textures, animations, and dialogue lines would all be in sync. That could very easily be accomplished by setting all of the Townsfolk up to use the same eye / mouth materials and changing a few other things. That's what's great about using materials as an intermediate data communication layer - the end result on the character doesn't have hard ties to the Expression System implementation. And it's easy to change around a few things and get completely different results. If the PR is merged, I will make a detailed video on how to set up one of these sequences from scratch so that others can start experimenting with setting up their own cutscene and dialogue sequences.

For anyone that might be keeping up with this system, I am working to extend this system even further so that it supports BlendShapes. Looking at the Townsfolk and the main chef characters, it looks like their facial expressions will be controlled via BlendShapes rather than textures.

Alright, so I now have blendshapes working…see the demo below. As before, the demo was thrown together and I probably don’t have the most optimal settings for the blendshape-based character. I also made the blendshapes in Blender very quickly so…you know…no judging. The system probably needs a little more tweaking. Still, nothing is broken, so I’ll submit another commit to the PR later today.

To make this system work, I needed to start by changing the PhonemeSetSO class to handle both textures and another type that I’ve called BlendTargets (which includes both the blendshape name and target blend weight). As can be seen in the code snippet below, this was done by changing the Texture2D type in the previous version of the code to the generalized object type, then casting as necessary to get the appropriate type.

using System.Collections.Generic;
using UnityEngine;

// To specify whether the character uses 2D (textures) or 3D (blendshapes) for mouth expressions
public enum PhonemeType { TwoD, ThreeD }

// This class is essentially the phoneme "alphabet" in use. In other words, how the sounds the character
// makes are translated into the final mouth texture. It applies to characters whose facial expressions
// are controlled via the mainTexture of a material (2D) or blend shapes (3D).
[CreateAssetMenu(menuName = "Phonemes/New Phoneme Set")]
public class PhonemeSetSO : ScriptableObject
{
    public PhonemeType Type;
    public List<Phoneme> Phonemes = new List<Phoneme>(); // The "alphabet"

    // Dictionary used for efficient runtime lookup
    private Dictionary<string, object> _phonemeDictionary = new Dictionary<string, object>();
    private bool _init = false;

    // Initialize the phoneme dictionary, which associates phoneme codes with mouth textures or blend targets
    public void Initialize()
    {
        _phonemeDictionary.Clear();

        foreach (Phoneme p in Phonemes)
        {
            foreach (string s in p.Codes)
            {
                // The indexer overwrites duplicates; Add() would throw if a code were listed twice
                if (Type == PhonemeType.TwoD)
                {
                    _phonemeDictionary[s] = p.MouthShape;
                }
                else if (Type == PhonemeType.ThreeD)
                {
                    _phonemeDictionary[s] = p.BlendTargets;
                }
            }
        }

        _init = true;
    }

    // This function uses the phoneme dictionary to look up and return the corresponding mouth
    // texture or blend target for a given phoneme code.
    public object GetMouthShape(string phonemeKey)
    {
        if (!_init)
            Initialize();

        object mouthShape;
        if (_phonemeDictionary.TryGetValue(phonemeKey, out mouthShape))
        {
            return mouthShape;
        }

        return null;
    }
}
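
On the calling side, the object returned by GetMouthShape has to be cast back to the right type. Here is a minimal sketch of one way to do that with C# pattern matching. FakeTexture and FakeBlendTarget are stand-ins I made up so the snippet compiles without Unity; in the project they would correspond to Texture2D and the BlendTargets type, and MouthShapeDispatcher is likewise hypothetical.

```csharp
// Stand-ins for UnityEngine.Texture2D and the BlendTargets type (assumed shapes).
public class FakeTexture { public string Name; }
public class FakeBlendTarget { public string ShapeName; public float Weight; }

public static class MouthShapeDispatcher
{
    // Decides what to do with the untyped result of a phoneme lookup.
    public static string Describe(object mouthShape)
    {
        if (mouthShape is FakeTexture texture)
            return "texture:" + texture.Name;     // 2D actor: assign to the mouth material

        if (mouthShape is FakeBlendTarget target)
            return "blend:" + target.ShapeName;   // 3D actor: drive a blendshape

        return "default";                         // null or unknown: use the actor's default mouth
    }
}
```

Returning object keeps a single dictionary for both actor types, at the cost of a runtime type check on every lookup; a generic or two-dictionary design would trade that check for a bit more code.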

Additionally, the MoodCollectionSO script was modified to set BlendTargets for the eyes and mouth instead of a static Texture2D. ActorSO also needed to be modified so that default BlendTargets could be specified for the eyes and mouth. See the image below for changes to these components. In the future, I think I’ll make a custom inspector for these components so that only the appropriate properties show when the actor type is set to 2D or 3D. There’s no reason to request eye textures if the actor that this is being applied to uses blendshapes, for example.

As can be seen in the example image for the PhonemeSetSO component below, any time I set a new blend target, I need to also be cognizant of previously set blendshapes. So, for example, if I want to make the “AH” mouth active, I need to return all other phoneme blendshapes to zero. Also, because the mouth mood blend shapes (like happy, sad, etc.) are independent levers from the phoneme blendshapes, I moved the mouth mood blendshapes to MoodCollectionSO…in this way, I only need one phoneme set and can set the mouth mood shapes directly from MoodCollectionSO. This isn’t possible with 2D characters like Hamlet and Evil Hamlet because the mouth textures combine both the mood and phoneme - “Happy_AH” vs “Sad_AH”, for example. Hopefully that makes sense…
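
The "return all other phoneme blendshapes to zero" rule can be sketched as follows. This is only an illustration of the idea in plain C#; PhonemeBlendTargets and Activate are hypothetical names, and the PR stores this information in the PhonemeSetSO asset rather than computing it like this.

```csharp
using System.Collections.Generic;

public static class PhonemeBlendTargets
{
    // Builds target weights (0 to 100) for every phoneme blendshape:
    // the active shape gets the requested weight, all others go back to zero.
    public static Dictionary<string, float> Activate(
        IEnumerable<string> phonemeShapes, string activeShape, float weight)
    {
        var targets = new Dictionary<string, float>();

        foreach (string shape in phonemeShapes)
            targets[shape] = (shape == activeShape) ? weight : 0f;

        return targets;
    }
}
```

Mood blendshapes (happy, sad, etc.) are intentionally left out of the shape list here, matching the point above that they are independent levers set from MoodCollectionSO.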

Finally, the ExpressionManager needed to be reworked a bit since blendshapes operate a little differently than static textures. Because blendshapes can smoothly vary over a range of values from 0 to 100, I needed to make most of the functions controlling them time-dependent. So, a lot of these functions were implemented in Update(). But I don’t want to post the ExpressionManager code here because it has 907 lines of code. However, you can find the full code in the pull request: Added a Character Expression System by amadeus737 · Pull Request #323 · UnityTechnologies/open-project-1 · GitHub
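
To give a flavour of what "time-dependent" means here without pasting the full ExpressionManager, below is a small sketch of moving a blend weight toward its target at a fixed speed. It is not the code from the PR: BlendWeightTween, UnitsPerSecond, and the 400 units/second default are assumptions for illustration.

```csharp
using System;

// Moves a blendshape weight toward a target a little each frame (hypothetical helper).
public class BlendWeightTween
{
    public float Current { get; private set; }        // current weight, 0 to 100
    public float Target { get; set; }                  // weight we are heading toward
    public float UnitsPerSecond { get; set; } = 400f;  // assumed speed: 0 to 100 in 0.25 s

    // Call once per frame with that frame's delta time; returns the new weight.
    public float Step(float deltaTime)
    {
        float maxDelta = UnitsPerSecond * deltaTime;
        float diff = Target - Current;

        if (Math.Abs(diff) <= maxDelta)
            Current = Target;                        // close enough: snap to the target
        else
            Current += Math.Sign(diff) * maxDelta;   // otherwise move at constant speed

        return Current;
    }
}
```

In Unity, Step would be called from Update() with Time.deltaTime, and the result passed to SkinnedMeshRenderer.SetBlendShapeWeight; Mathf.MoveTowards does the same stepping in a single call.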

Before this system gets too large, I think I’ll hold off on making further major functional changes until I hear back about the PR. Still, I might do some cosmetic cleanup in the meantime.

@calculus7 : We can't see your video - it's set to private. I guess you wanted it "unlisted".

Thanks, yeah. It's still uploading to YouTube...around 50 minutes left. I started typing up this post while uploading and clicked "Post Reply" a bit too early =D

EDIT: Fixed! SD version is available now. HD version still processing.

Hey @calculus7 ! Wow. This system is super cool, and I like the integration with Timeline!
However, it might be a bit overkill for our game, but we’ll see! Maybe it can be integrated later - as you said - with minimal interaction with other systems.

However, one thing is that, in reality, the Pig Chef shouldn’t have that wide range of expressions, at least for the eyes. He should open the eyes only when hit or when he’s surprised.

For now, I’ll put the PR on hold (you’ll see it in a label), but I’ll keep an eye on it and we can return to this once we have the whole Timeline/dialogue stuff in place (hopefully soon!).

Thanks anyway, it’s a really cool addition and I can see you’ve put a lot of work in it!

@cirocontinisio , yeah, no worries. I always knew this system would be more flexible than what is needed in this project, just didn't know if its base functionality could be useful in some way. As you said, maybe some form of it will prove useful in the future.

I've done a lot of Timeline scripting in my own projects, so this was a lot of fun to delve into. Guess I got a little carried away...tends to happen with me. Anyway, I was developing this system for my own game so it's not wasted effort.

You mentioned that the Timeline / dialogue stuff still needs to be put in place. I was under the impression that was already done? If there's still implementation work to be done on the Timeline / dialogue systems, would you mind pointing me to that? That's something I'd like to contribute to if there's still a chance. Or I'd be happy to help out with really anything requiring Timeline scripting.

It’s one of those systems we did a while ago, but never used. I suspect it needs a refresh/refactor before we can actually create some cutscene with it. I’ll take a look at it and drop a card in the roadmap if so, and I’ll let you know here!