Overview
Conversational AI has been a constant source of entertainment for me, even before the days of ChatGPT and similar systems.
In the past, I have used an ML system named Rasa that works on top of the spaCy NLP library to interpret user input and generate action frameworks.
Unfortunately, Rasa no longer offers individual or standard pricing plans, so this project is built directly on top of spaCy.
This gives me a chance to explore Python and NLP while trying to create a functional action processor.
The plan is to create a four-stage system:
1. Speech recognition to gather input (likely using OpenAI's Whisper model)
2. NLP on that input to extract actionable items and provide the action processor with context
3. An action processor that enacts the actual results
4. An AI response back to the human, based on the results
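Roughly sketched in code, the pipeline would hang together like this (the function names are placeholders for stages I haven't built yet):

```python
def listen() -> str:
    """Stage 1: speech-to-text (planned: OpenAI Whisper)."""
    return input("> ")  # keyboard stand-in until speech is wired up

def parse(text: str):
    """Stage 2: NLP pass extracting verbs and their objects."""
    raise NotImplementedError

def act(parsed) -> str:
    """Stage 3: action processor that enacts the parsed intent."""
    raise NotImplementedError

def respond(result: str) -> str:
    """Stage 4: turn the result into a reply for the user."""
    raise NotImplementedError
```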
In its current state, stages 2 and 3 are going well!
Using spaCy, I have managed to sort inputs into Verbs and VerbObjects (direct objects or prepositional objects). The hierarchical nature of spaCy's dependency parse means that every verb has associated objects to provide context. By grabbing the 'dobj' and 'pobj' tokens, I can give the action processor enough information to act.
The test phrase I have been using is 'Create a test file and add some json content to it'. Here we can see how this is being decoded:
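In simplified form, the extraction looks something like this (a sketch assuming the small English model, not my exact implementation):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Create a test file and add some json content to it")

for token in doc:
    if token.pos_ != "VERB":
        continue
    # Direct objects hang straight off the verb; prepositional objects
    # hang off a 'prep' child of the verb ('to' in this phrase).
    dobjs = [c for c in token.children if c.dep_ == "dobj"]
    pobjs = [gc for c in token.children if c.dep_ == "prep"
                for gc in c.children if gc.dep_ == "pobj"]
    for obj in dobjs:
        # Compound modifiers ('test file', 'json content') attach to the object.
        compounds = [c.text for c in obj.children if c.dep_ == "compound"]
        print(token.lemma_, "-> dobj:", obj.text, "compounds:", compounds)
    for obj in pobjs:
        print(token.lemma_, "-> pobj:", obj.text)
```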
We can see that it identifies the two verbs 'create' and 'add'. A useful feature of spaCy is that it provides a 'lemma_' attribute for each token it identifies, giving the root word with no prefixes, suffixes or tenses applied. This makes it much easier to run comparisons.
Currently, I am using the first verb as the 'primary' or 'root' verb. I may change this in the future, but for now it seems to work.
We can see that the 'create' verb has a single direct object, 'file'. Changing the direct object changes the effect of the create verb; in this case, it primes the system to create a new file. We can also see that 'file' is compounded by the word 'test'. Currently this has no effect, but I am hoping to use it for further contextualisation in the future, since a 'test' file in one program would look different from one in another.
Further, we see our second verb, 'add'. I am treating all verbs past the primary as 'sub actions', with the assumption that they will act upon the result of the primary action. So the primary action 'create file' has set up a file to be created at a certain path, and the 'add' action will add content to this file before it is created (or possibly after, if I need to change the structure). We can see a direct object 'content' with the compounder 'json'. I am using 'content' as a marker that this should be a text file, and the compounder 'json' to mark that the filler text should be in JSON format.
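To make the shape of this concrete, here is a simplified sketch of the structure (the names and the output path are illustrative, not my exact code):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    verb: str                                            # lemma, e.g. "create" or "add"
    obj: str                                             # e.g. "file" or "content"
    compounds: list[str] = field(default_factory=list)   # e.g. ["test"] or ["json"]
    sub_actions: list["Action"] = field(default_factory=list)

def process(primary: Action) -> None:
    """Dispatch on the primary verb; sub actions modify its pending result."""
    if primary.verb == "create" and primary.obj == "file":
        content = ""
        for sub in primary.sub_actions:
            # 'add content' fills the file before it is written out;
            # the 'json' compound selects the filler format.
            if sub.verb == "add" and sub.obj == "content":
                content = "{}" if "json" in sub.compounds else "placeholder text"
        with open("test_file.json", "w") as f:   # illustrative fixed path
            f.write(content)

# e.g. process(Action("create", "file", ["test"],
#                     [Action("add", "content", ["json"])]))
```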
Thus, we see the following:
It works!
Of course, this is just a 'happy path' right now: a single prompt with specific wording works. But I'm taking it as a promising first step as I begin to expand the system to handle different wordings.
One of the next steps is to introduce 'modes'. Actions are not always going to be a single-line interaction, and the user should not always have access to every action. So I will be taking a leaf from the Microsoft Bot Framework and introducing a processor context stack. The processor on top of the stack will be active until it is dismissed, at which point control falls back to the underlying processor. Actions can push to or pop from the stack.
An example of this architecture is:
Base stack (responsible for initial input): actions = ["hello", "start a command", "talk with me"]
Command stack (responsible for system operations): actions = ["create", "change", "delete", "go back"]
Discussion stack (responsible for basic chat): actions = ["go back"]
As we can see here, by separating the architecture into three stacks, we can control which inputs are valid in the user's current context.
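A rough sketch of the stack mechanics I have in mind (simplified, with illustrative handler wiring):

```python
class Processor:
    """Owns the set of actions available while it is on top of the stack."""
    def __init__(self, name, actions):
        self.name = name
        self.actions = actions            # maps trigger phrase -> handler

    def handle(self, text, stack):
        handler = self.actions.get(text)
        if handler is None:
            return False                  # unrecognised input in this context
        handler(stack)
        return True

class ProcessorStack:
    def __init__(self, base):
        self._stack = [base]

    def push(self, processor):
        self._stack.append(processor)

    def pop(self):
        if len(self._stack) > 1:          # the base processor is never dismissed
            self._stack.pop()

    def handle(self, text):
        # Only the top processor is active; when it pops itself,
        # control falls back to the processor beneath it.
        return self._stack[-1].handle(text, self)

# Example wiring:
command = Processor("command", {"go back": lambda s: s.pop()})
base = Processor("base", {"start a command": lambda s: s.push(command)})
stack = ProcessorStack(base)
stack.handle("start a command")   # the command processor is now active
stack.handle("go back")           # control falls back to the base processor
```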
This is a very traditional approach, and it works best when the user is aware of which commands can be triggered (such as creating files, etc.).
In my implementation of the discussion stack, you may notice only a single command, which returns to the base stack. This is because anything other than that return command will be sent through to a ChatGPT API implementation. This allows for easy use of ChatGPT, a fairly useful tool nowadays, without the confusion of differentiating between GPT responses and commands.
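In sketch form, that fall-through looks like this (send_to_chatgpt being a hypothetical helper, not a real OpenAI function):

```python
def send_to_chatgpt(text: str) -> str:
    # Hypothetical helper: in practice this would call the OpenAI
    # chat completions API and return the model's reply.
    raise NotImplementedError

def discussion_handle(text, stack):
    # The only real command dismisses this processor; everything else
    # falls through to the chat model instead of being rejected.
    if text == "go back":
        stack.pop()
        return "Back to base."
    return send_to_chatgpt(text)
```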