Introduction

In this tutorial, you will learn the basics of how to make Furhat listen to, and interpret, speech and how to create an interactive dialogue.

Note: this tutorial assumes that you:

In this tutorial we will use a new agent state, agent:listen, which allows Furhat to accept speech as input. Calling this method automatically uses Furhat's current speech recognition service (normally abbreviated ASR, for Automatic Speech Recognition), and will return semantics depending on the current context and the Skill's grammar. These terms and their use will be explained in this tutorial.

Generating a skill

In this tutorial, we will use the template speech_interaction, which adds a recognizer and some basic interactions.

Note: Currently, when creating a skill from the web interface, you will get a NullPointerException if you restart the mode. To get around this error, you have to restart the Furhat server. On a Furhat robot, this is done using the top right menu in the web interface. If you are using the SDK and development server, close down the server and start it up again.

Setting up a recognizer

Once we have a skill generated, we need to set up a recognizer (an ASR service as explained above). For instructions, check out the recognizer page.

Giving Furhat Ears

Writing agent:listen in the flow will make Furhat listen to speech input through the configured microphone. Example:

  <agent:say>Say something!</agent:say>
  <agent:listen/>

  <agent:say>Say something again! This time, you have to be fast</agent:say>
  <agent:listen timeout="4000"/>

In this example, the first listen is normal, with the default 8-second timeout. The second cuts the timeout down to 4 seconds (4000 milliseconds).

For future quick reference on what the agent:listen method does, check out the system agent flow reference.

Capturing speech

An agent:listen will always raise exactly one event: either an event signaling that audio has been received and parsed through a speech recognizer, or an event signaling that the timeout has been reached with no audio input. Example:

<state id="Init">
  <onevent name="sense.user.speak**">
      <agent:say>You said something!</agent:say>
  </onevent>

  <onevent name="sense.user.silence">
      <agent:say>I didn't hear anything!</agent:say>
  </onevent>
</state>

In the above example, the first event handler triggers on the sense.user.speak** event. Note: the two asterisks signal that we want to capture all sense.user.speak events. This is necessary if we have a situated dialogue (i.e. we have a vision sensor to find multiple users in Furhat's interaction space) and only one microphone.

Event                     Description
sense.user.speak          Speech from the user that Furhat is attending to, or any speech if no user is attended.
sense.user.speak.side     Triggered if the system is attending to one user and some other user speaks.
sense.user.speak.multi    Speech from several users simultaneously.

The second event handler triggers on the sense.user.silence event and is sent if the agent:listen has timed out.
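
If you want to react differently depending on which of these events occurred, you can handle the more specific event names from the table directly, instead of (or in addition to) the ** wildcard. Below is a minimal sketch with purely illustrative responses; the agent:listen in onentry is what triggers the events.

<state id="Init">
  <onentry>
    <agent:listen/>
  </onentry>

  <onevent name="sense.user.speak">
      <agent:say>I heard the user I am attending to.</agent:say>
  </onevent>

  <onevent name="sense.user.speak.side">
      <agent:say>I heard someone at the side.</agent:say>
  </onevent>

  <onevent name="sense.user.speak.multi">
      <agent:say>Please, one at a time.</agent:say>
  </onevent>
</state>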

An agent:listen triggers only once; for continued listening, agent:listen needs to be called again. Two ways of doing this are shown below.

Repeated Listening

This first flow shows an infinite listening loop, with Furhat logging whether he hears the user or not. An agent:listen is required in both event handlers, otherwise Furhat would stop listening to the user.

<state id="Init">
  <onentry>
    <agent:listen/>
  </onentry>

  <onevent name="sense.user.silence">
    <log>Heard nothing </log>
    <agent:listen/>
  </onevent>

  <onevent name="sense.user.speak**">
    <log> heard something </log>
    <agent:listen/>
  </onevent>
</state>

We can achieve similar results with the below flow:

<state id="Init">
  <onentry>
    <agent:listen/>
    <reentry/>
  </onentry>

  <onevent name="sense.user.silence">
    <log> we heard nothing, again </log>
  </onevent>
</state>

Note here that we are only acting on the sense.user.silence event, but there is still an infinite listen going on. Due to the reentry, after listening Furhat reenters the state, triggering another agent:listen.

The benefit of this second approach is that you do not need to catch every sense.user.speak** event, yet continue to listen. The use of this may become more apparent in the following examples with semantics.

Understanding what has been said through semantics

Once a speech event has been caught, the next challenge is to make Furhat understand what was said. To achieve this, we need to parse the speech event we received. We use a grammar file for defining how to parse and make sense of a text.

In the example below, Furhat asks for the user's favorite color, and responds with a note about the color.

<onentry>
  <agent:say> What is your favorite color? </agent:say>
  <agent:listen/>
</onentry>

<onevent name="sense.user.speak**">
  <if cond="event?:sem:red">
    <agent:say> Red, like the sunset </agent:say>
  <elseif cond="event?:sem:blue"/>
    <agent:say> Blue, like the ocean </agent:say>
  <else/>
    <agent:say> I don't know that color yet </agent:say>
  </if>
</onevent>

Breaking this down, we see that we catch the sense.user.speak** event and then set up an <if> conditional where we check the event for a parameter sem, which in turn should have a parameter red. If so, we say a comment about red. Similarly, if the speech event has a sem parameter with a blue parameter, we comment about blue instead. We then have a catch-all clause for any other user input.

Note: In pure Java, which the XML flow translates to, event?:sem:red translates to event.has("sem") && event.get("sem").has("red"). It is, in other words, a convenient shorthand. We could also use a parameter called color and check <if cond="event?:sem:color and event:sem:color == 'blue'">, but this is often lengthier than using the name of the parameter as in the above example.

Below is a grammar rule that can be used together with the above flow. Note that grammar rules are written in XML, using slightly different tags than the flow.

<rule id="root" scope="public">
  <one-of>
    <item>
      red<tag>out.red=1</tag>
    </item>
    <item>
      blue<tag>out.blue=1</tag>
    </item>
  </one-of>
</rule>

First we identify our rule. Grammar rules are named, similar to flows. root is the starting rule, and it has a public scope, meaning that it is accessible from the flow (unlike non-public rules, which can only be used through inclusion in other rules).

Second, we see a <one-of> tag, meaning that the parser will try to match one of (in contrast to all of) the following items.

We see two <item> tags; these are the phrases that we try to match to the speech. Each of them contains an optional semantic tag, where you can define an output variable that will be added to the event:sem record. So, if the user says "blue", the parameter blue is added to the semantic record of the speech event (its value 1 would be accessed by event:sem:blue in the flow, but checking that the variable exists is enough in this case, which is done through event?:sem:blue).
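
As an alternative to having one parameter per color, and as mentioned in the note above, you could use a single parameter that holds the name of the color. Below is a sketch of how the grammar could look in that case; the parameter name color is just an example.

<rule id="root" scope="public">
  <one-of>
    <item>
      red<tag>out.color="red"</tag>
    </item>
    <item>
      blue<tag>out.color="blue"</tag>
    </item>
  </one-of>
</rule>

The flow would then check the value of the parameter instead of just its existence:

<onevent name="sense.user.speak**">
  <if cond="event?:sem:color and event:sem:color == 'blue'">
    <agent:say> Blue, like the ocean </agent:say>
  <else/>
    <agent:say> I have nothing to say about that </agent:say>
  </if>
</onevent>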

Another example

In this flow, Furhat will ask the user if they would like to play a game. If they answer yes or no, he will make another small comment and then move on to a new state.

If the user says something that is not covered by our grammar semantics, or says nothing at all, the flow will reenter the state. The count variable is tied to the onentry block and is here used to ensure that Furhat only asks the question on the first pass through the state.

<state id="Question">
  <onentry>
    <if cond="2>count">
      <agent:say> Would you like to play a game with me? </agent:say>
    </if>
    <agent:listen/>
    <reentry/>
  </onentry>

  <onevent name="sense.user.speak**" cond="event?:sem:yes">
    <agent:say> great! </agent:say>
    <goto state="Setup_Game"/>
  </onevent>

  <onevent name="sense.user.speak**" cond="event?:sem:no">
    <agent:say> I can't believe you would say that.</agent:say>
    <goto state="Look_For_New_User"/>
  </onevent>

</state>
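
The goto targets Setup_Game and Look_For_New_User are assumed to be defined elsewhere in the flow. A minimal placeholder for one of them could look like this:

<state id="Setup_Game">
  <onentry>
    <agent:say> Let me set up the game. </agent:say>
  </onentry>
</state>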

Getting the exact speech of a user

If you wish to get the exact response of a user, for example to repeat it back to them, you can use event:text to get the text of the response. This is especially useful for verification, combined with the confidence score that the recognizer adds to the sense.user.speak** event.

In the following example, Furhat will repeat what a user said if its confidence in what the user said is below 50%.

Note: the > symbol works in most places in the flow XML, while < is forbidden due to the XML syntax. &lt; can be used to replace the < symbol, but for readability it is recommended to flip simple comparisons to use the > symbol instead, as in the example below.

<onevent name="sense.user.speak**">
  <if cond="0.5 > event:conf">
    <agent:say> Did you say, <expr> event:text </expr> ? </agent:say>
    <agent:listen/>
  </if>
</onevent>
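
If you also want to act when the confidence is high, an else branch can be added. This is just a sketch; what to say in each branch is up to your skill.

<onevent name="sense.user.speak**">
  <if cond="0.5 > event:conf">
    <agent:say> Did you say, <expr> event:text </expr> ? </agent:say>
    <agent:listen/>
  <else/>
    <agent:say> <expr> event:text </expr>, got it! </agent:say>
  </if>
</onevent>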

Grammar, with more complex rules

Spoken language is incredibly complex and far more varied than expected, which might require more advanced grammars with nested rules, optional inputs and phrases. As a grammar grows more complex, it is recommended to split the rules into several sections, similar to how flows have different states.

In this example we will cover the use of references to rules, the repeat attribute, and using the one-of tag inside an item:

    <rule id="root" scope="public">
        <one-of>
      <item>
        <ruleref uri="#yes" />
        <tag>out.yes=rules.yes</tag>
      </item>    
            <item>
                <ruleref uri="#no" />
                <tag>out.no="anything can go in here"</tag>
            </item>
        </one-of>
    </rule>

    <rule id="yes">
        <one-of>
            <item> yes</item>
            <item> okay</item>
            <item> yeah</item>
            <item> sure</item>
            <item> why not</item>
            <item> me too</item>
            <item> affirmative</item>
            <item>
                <item repeat="0-1"> okay</item>
                <item repeat="0-1"> Let's</item>
                play
        <item repeat="0-1"> a</item>
        <item repeat="0-1"> game</item>
            </item>
        </one-of>
    <tag>out=1</tag>
    </rule>

  <rule id="no">
        <one-of>
            <item> no </item>
            <item> nah </item>
            <item> not
                <one-of>
                    <item> interested</item>
                    <item> right now</item>
                    <item> today</item>
                    <item> in your life</item>
                    <item> </item>
                </one-of>
            </item>
        </one-of>
    </rule>
</grammar>

The root rule in the grammar contains all the references to the different semantic cases (the yes-rule and the no-rule), and each rule describes the inputs that will trigger it. Tags can be set at the individual rule level and then passed on to the root, as in the yes example, or set at the root level, as in the no example.

To determine whether a user means "yes", for example, several different variations are provided through the one-of tag. All the specified phrases are mapped to the yes output.

In the yes rule, we see the repeat attribute on an item. The attribute can be given either a range or a specific integer. By using a repetition of 0-1, we state that an item may or may not appear. In this example, all of these combinations are matched: "Okay let's play a game", "let's play", "Okay play game", "play", "play game". This adds robustness to the semantics, as spoken language is more garbled than written language.
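
As an illustration of giving repeat a specific integer, the hypothetical rule below (assuming the same repeat syntax as above) matches exactly "I am very very sure":

<rule id="really_sure">
  I am
  <item repeat="2"> very</item>
  sure
</rule>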

In the no rule, the last item allows utterances such as "I am not interested" and "not right now", built around the keyword "not". This is handled through a nested item where "not" is mandatory and is followed by one of the listed phrases (the empty item means that a plain "not" also matches).

If a grammar rule is matched, a variable can be set to a value in the sense.user.speak** event. For example, if the no rule is matched, a variable called no is set to the String "anything can go in here".

The sense.user.speak** event will therefore contain no: "anything can go in here" in its semantic record.
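
A flow snippet that reacts to this could, for example, read the value back; the wording of the response is purely illustrative.

<onevent name="sense.user.speak**" cond="event?:sem:no">
  <agent:say> You said no, and my note for that is <expr> event:sem:no </expr> </agent:say>
</onevent>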