Vtubing with a webcam and real-time AI

I’m pretty much dropping everything at the moment other than working on real-time AI video.

The day I got it actually working at a decent frame rate, I knew I had to focus on this. It has unlocked my ability to operate online fully as MrAssisted.

You can find links to the open source code that allows you to run this on your own computer and warp your webcam in the browser at https://gendj.com

A lot of the code was lifted from the pioneering project https://github.com/kylemcdonald/i2i-realtime, so for more robust functionality use that repo. To run it in a browser, though, use the GenDJ repo.
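To give a feel for the browser half of that loop, here is a minimal sketch: grab webcam frames, push them to a backend over a WebSocket, and draw whatever warped frames come back. The ws://localhost:8765 endpoint and the JPEG-blob framing are stand-ins for illustration, not GenDJ's actual wire format.

```typescript
// Sketch of a browser-side capture/display loop for a GenDJ-style backend.
const video = document.createElement("video");
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;
document.body.append(canvas);

async function start() {
  // Ask for the webcam and start playback.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  await video.play();
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;

  const ws = new WebSocket("ws://localhost:8765"); // assumed local backend
  ws.binaryType = "blob";

  // Draw each warped frame the backend sends back.
  ws.onmessage = async (ev) => {
    const bitmap = await createImageBitmap(ev.data as Blob);
    ctx.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  };

  // Push webcam frames at roughly 20 fps as JPEG blobs.
  const scratch = document.createElement("canvas");
  scratch.width = canvas.width;
  scratch.height = canvas.height;
  const sctx = scratch.getContext("2d")!;
  setInterval(() => {
    if (ws.readyState !== WebSocket.OPEN) return;
    sctx.drawImage(video, 0, 0, scratch.width, scratch.height);
    scratch.toBlob((blob) => blob && ws.send(blob), "image/jpeg", 0.7);
  }, 50);
}

start();
```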

There are many ways forward for AI but this is mine. Looking back, all the uses of AI that I connect with are finite collaborations. I use it for writing, but I read every word it writes and manually alter it. I use it for code, but I read through every line of the code and use it as a jumping-off point for code I manually write.

We have so little precedent for anything in AI that we have to look at our existing behavior in the short time it has existed.

I believe video and audio will also move to real-time collaborations.

Audio can work like a synth + a turntable with a consciousness strapped to it. Current song generators are an awkward middle step. We’re building an instrument. To what extent it will be played by people vs algorithms remains to be seen. We can all go to Coachella and put Spotify on shuffle over the PA, but that’s not what we do. We go to see people perform.

My focus for GenDJ will be the user interface and increasing the quality/fidelity/accessibility of real-time AI video. I want to take the input devices humans have already used to control machines in real time (MIDI controllers, game controllers, freaking steering wheels, etc.) and wire them into this. We need to explore the prior art here as much as we can.
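As a rough illustration of the controller idea, the Web MIDI API already makes this kind of mapping possible in the browser. The sketch below maps a mod-wheel knob to a hypothetical "strength" parameter; sendParam and the parameter name are made up for illustration, but the MIDI plumbing is the standard API.

```typescript
// Sketch: map MIDI control-change messages from a knob to a generation parameter.
function sendParam(name: string, value: number) {
  // In a real setup this would go over the same WebSocket as the frames.
  console.log(`param ${name} -> ${value.toFixed(2)}`);
}

async function bindMidi() {
  const midi = await navigator.requestMIDIAccess();
  for (const input of midi.inputs.values()) {
    input.onmidimessage = (msg) => {
      const [status, controller, value] = msg.data!;
      const isControlChange = (status & 0xf0) === 0xb0;
      if (isControlChange && controller === 1) {
        // Map the knob's 0-127 range to a 0.0-1.0 strength.
        sendParam("strength", value / 127);
      }
    };
  }
}

bindMidi();
```

The Gamepad API could be wired in the same way for game controllers and steering wheels.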

I also want to eliminate the barrier to entry of needing an expensive PC with a powerful graphics card. I want to run it on a remote server and round-trip the warped image back to the client in real time.
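For the remote version the round-trip latency is the whole game, so it has to be measured. Here is one way that could look, assuming (purely for this sketch) a protocol where each outgoing frame carries a 4-byte sequence number that the server echoes back ahead of the warped JPEG bytes; none of this reflects an actual GenDJ protocol.

```typescript
// Sketch: tag outgoing frames and measure round-trip time when they return.
const inFlight = new Map<number, number>(); // seq -> send time in ms
let seq = 0;

function sendFrame(ws: WebSocket, jpeg: ArrayBuffer) {
  const header = new DataView(new ArrayBuffer(4));
  header.setUint32(0, seq);
  inFlight.set(seq, performance.now());
  seq += 1;
  // One message: [4-byte seq][jpeg bytes]
  ws.send(new Blob([header.buffer, jpeg]));
}

function onWarpedFrame(data: ArrayBuffer) {
  const echoedSeq = new DataView(data, 0, 4).getUint32(0);
  const sentAt = inFlight.get(echoedSeq);
  inFlight.delete(echoedSeq);
  if (sentAt !== undefined) {
    const rtt = performance.now() - sentAt;
    console.log(`round trip ${rtt.toFixed(0)} ms`);
  }
  const jpegBytes = data.slice(4);
  // ...decode jpegBytes and draw to the canvas as in the capture sketch above.
}
```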

I also think this may have massive implications beyond just the vtuber use-case.

Anything experienced as frames can be rendered this way. Compression can work this way. Websites can work this way. Books can work this way if the efficiency gets to the point where it’s a negligible cost difference to render a frame of a page vs the text of a page. Why have text models when the omni models can render videos of text?

We always expect the experiences of old paradigms to map to tech post-disruption but they only do as a stepping stone. Moving from print to digital didn’t get us digital newspapers, it got us blogs. Moving from desktop to mobile didn’t give us blogs on the phone, it gave us mass UGC, vertical video, and algo feeds.

What is native to this new disruption? Currently I see some kind of portal where you’re looking directly into an omni model. Something between a game and a video, constantly spitting out frames that respond to your reactions or input. Or it could spit out a 3D environment rendered some other way. It could be scary to drill down so directly, with no restraints, into what we want to see. This is going to be wild.

If the experience is multiplayer, how will you know where the model ends and other players begin? Very Matrix.

Regardless, step 1 is figuring out how to be a decent vtuber with this new tech stack, so follow @MrAssisted somewhere if you want to see what I cook up, or join the Discord if you want to meet others interested in this.