Making AI Smarter
I use various AI tools, both locally on my laptop and running on my server. About 80% of the time I am logged into the server via SSH, working from the command line with a CLI tool, most often the OpenAI Codex CLI. The only thing missing is multi-modal functionality.
I want the ability to say to the agent "Look at that module on the webpage I am on, please move this or that", or perhaps "The option you are referring to is not available to me; look at my screen and notice it is different from what you described"... things like that. So how did I tackle it?
Well, for starters, I am using two agents. The first is on the server, the second is on my Windows laptop, both running Codex CLI. On the Windows laptop I am running FFmpeg to capture the audio and video from my laptop. The stream is sent to an RTC endpoint on the VPS via WebRTC. Then, on the VPS, the Codex client watches the feed and interprets what it sees. The most important part is Redis, the messaging layer: Redis allows fast, simultaneous communication between the two agents.
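A minimal sketch of that Redis messaging layer, using the `redis-py` client's pub/sub API. The channel name, agent names, and the JSON envelope are my own assumptions, not anything standardized; the idea is just that both agents agree on a shared channel and message shape.

```python
import json
import time

# Hypothetical channel both agents subscribe to.
CHANNEL = "agent-bus"

def make_event(source: str, kind: str, payload: str) -> str:
    """Wrap a message in a small JSON envelope so both agents agree on its shape."""
    return json.dumps({
        "source": source,       # which agent sent it, e.g. "vps-agent"
        "kind": kind,           # e.g. "observation", "reply"
        "payload": payload,     # free-form text
        "ts": time.time(),      # timestamp for ordering/debugging
    })

def parse_event(raw) -> dict:
    """Decode an envelope received from the channel."""
    return json.loads(raw)

def run_vps_agent():
    # Requires a running Redis server and the redis-py client (pip install redis).
    import redis
    r = redis.Redis(host="localhost", port=6379)
    # The VPS-side agent publishes its interpretation of the video/audio feed:
    r.publish(CHANNEL, make_event("vps-agent", "observation",
                                  "The settings panel on screen differs from the docs"))

def run_laptop_agent():
    # The laptop-side agent subscribes in its own process and reacts as messages arrive.
    import redis
    r = redis.Redis(host="localhost", port=6379)
    sub = r.pubsub()
    sub.subscribe(CHANNEL)
    for msg in sub.listen():
        if msg["type"] == "message":
            event = parse_event(msg["data"])
            print(f'{event["source"]}: {event["payload"]}')
```

Because publish and subscribe are decoupled, either agent can push a message the instant it has something to say, which is what makes the round trip feel real-time.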
So this messaging layer means that as soon as the VPS agent interprets my video or audio feed, it can send a message or response back to the other client in real time. One final step, which I have not yet implemented, is that I want those responses to go through the ElevenLabs API so that it can be a full two-way voice conversation when needed, because I will often be buried in the terminal and unable to read the textual replies. But that will have to wait until after my assembly, perhaps next week, to finish it off.
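For that last unimplemented step, a sketch of what the ElevenLabs call could look like. The endpoint path, `xi-api-key` header, and `model_id` value reflect my reading of the ElevenLabs text-to-speech REST API and should be checked against their current docs; the voice ID and output path are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for ElevenLabs text-to-speech."""
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    body = json.dumps({
        "text": text,
        "model_id": "eleven_multilingual_v2",  # assumed model name; check the docs
    }).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def speak(text: str, voice_id: str, api_key: str, out_path: str = "reply.mp3") -> str:
    """Send the request and write the returned audio bytes to disk.

    Playing the file (e.g. piping it to ffplay) is left to the caller.
    """
    req = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

Wiring this into the Redis subscriber would mean calling `speak()` on each incoming reply instead of printing it, so the answer arrives as audio while I stay in the terminal.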