How To 🤗

Wed Oct 18, 2023

So catwalk is now kind of a thing. I've been running minor experiments on it on the box under my desk, and so far I'm seeing it perform much faster than making equivalent calls into replicate. And that's with no additional work on efficiency, or on batching requests sensibly to maximize state locality. All I've done so far is make sure all the models are loaded and ready to go. Based on the observed results, I'm guessing replicate does set-up, inference and tear-down for each individual request, or possibly sets a high bar for keeping a model warm between calls, and just avoiding that makes a massive difference in generation time. The only current downside is that I don't have enough memory to run kandinsky, but my first few use cases don't need any image/video generation, so I think that's probably ok. The small experiments I intend to run in the short term can still go through the API call.

Right now, this is a ridiculously simple, minimal HTTP server that exposes interfaces into models to do basic tasks like TTS, voice transcription and image captioning/summarization. I've also got a basic text model wired up, but I'll very probably be disabling it to save on memory for the first bit.
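
To make the shape of that concrete, here's a rough sketch of the sort of server I mean. This is not the actual catwalk code; the routes, the module layout and the transcribe/speak helper names are stand-ins I'm using for illustration, and the real thing has more plumbing around file handling.

from flask import Flask, request, jsonify

# The important part: the helpers below wrap models that get loaded
# once at startup, not per-request.
from basics import caption_image, transcribe  # assumed layout/names
from tts import speak  # assumed name

app = Flask(__name__)

@app.route("/caption", methods=["POST"])
def caption():
    # Expects {"url": "<image url>"}; returns the captioner's output as-is.
    return jsonify(caption_image(request.json["url"]))

@app.route("/transcribe", methods=["POST"])
def transcription():
    # Expects a wav upload named "audio"; saved to disk because the
    # transcription helper takes a file path.
    path = "/tmp/upload.wav"
    request.files["audio"].save(path)
    return jsonify({"text": transcribe(path)})

@app.route("/tts", methods=["POST"])
def tts():
    # Expects {"text": "..."}; returns the path of the generated wav.
    return jsonify({"wav": speak(request.json["text"])})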

Here are some quick observations about the models I've been running:

whisper

The whisper model is another one that, like tortoise, seems simpler to run through its own repo rather than 🤗. So it's a direct requirement rather than being pulled in through the pipeline interface. It's also by far the fastest of the models. I'm very seriously considering putting together some prototypes of voice command evaluation that I can run without hitting the Google/Amazon/Facebook/what-have-you servers with audio coming out of my mouth. My direct use case here is to use transcription on the output of my TTS calls as an error-trapping pass, but there are definitely a few more things I could think to do with it. I've got it set up alongside a couple of other trivial helpers in basics.py.
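
For reference, the whisper half of basics.py looks roughly like this. It's a sketch assuming the openai-whisper package rather than the 🤗 pipeline; the model size and function name are placeholders, not necessarily what's in the actual file.

import whisper

# Loaded once at import time; this is where most of the speed win
# over per-request setup/teardown comes from.
_WHISPER = whisper.load_model("base")

def transcribe(path):
    # Takes a path to an audio file, returns the transcribed text.
    return _WHISPER.transcribe(path)["text"]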

blip

Not much to say on this one. I call it trivially to get basic descriptions of images in things I'm going to be audio-fying. When I say "trivially", I mean it. The call is:

from transformers import pipeline

# Loaded once so repeated captioning calls don't pay the model load cost each time.
_CAPTIONER = pipeline("image-to-text", model="Salesforce/blip2-flan-t5-xl")

def caption_image(url):
    return _CAPTIONER(url)
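
For completeness, the pipeline returns a list of dicts rather than a bare string, so pulling the actual caption out looks something like this (the image URL is just a placeholder):

caption = caption_image("https://example.com/some-image.jpg")[0]["generated_text"]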

I haven't seen any advantages to running this one locally as opposed to on replicate. Possibly they keep it warm enough, or possibly loading it is fast enough that it doesn't matter either way.

tortoise

Once set up, tortoise is amazing in terms of quality and ease of use. You can see the setup in the separate tts.py. Weirdly, both the output quality and the performance are better than through the replicate interface. I have no theory for why this might be. I know the performance gains mostly come from my system keeping both the model and the target voice vectors in memory, and this also definitely saves network traffic since I'm not sending wavs over the wire each time, but I don't at all understand why text read in my voice would sound more robotic when generated through replicate rather than locally.
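
The gist of tts.py is roughly this. It's a sketch assuming the tortoise-tts repo's API as I understand it; the voice name, preset and output path are placeholders rather than what's in my actual file.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

# Both the model and the conditioning latents for the target voice are
# computed once and kept in memory between calls.
_TTS = TextToSpeech()
_VOICE_SAMPLES, _VOICE_LATENTS = load_voices(["my_voice"])

def speak(text, out_path="out.wav"):
    # Generate speech in the target voice and write it out as a wav.
    gen = _TTS.tts_with_preset(
        text,
        voice_samples=_VOICE_SAMPLES,
        conditioning_latents=_VOICE_LATENTS,
        preset="fast")
    torchaudio.save(out_path, gen.squeeze(0).cpu(), 24000)
    return out_path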

As far as I know, I am running the exact same model as they are. I guess it might be that I'm arranging the voice samples differently? Like, I send one 5-minute sample up to the server, but use several ~20 second clips locally because it's possible to do that, but it amounts to a similar "amount" of voice by wall-clock time, so I'm not sure why it would make a difference.

Oh, one more thing here; it looks like the tortoise process on some level understands the idea of people changing their voices? Like, if you feed it audio of someone intentionally distorting their own voice (like they would be if they were voice acting for some character), and then ask it to generate audio, it seems to generate that audio using the person's actual, non-distorted voice. I'm going to run a few more experiments to explore this, but I certainly wasn't expecting the effect. My theory is that there's something like SIFT for audio that the model is extracting from its voice samples, and that thing isn't affected by the changes a speaker makes to their own voice intentionally. I'm going to put a bit of work into understanding this more deeply, because having a character read some text, as opposed to a person's natural voice, would be a pretty interesting application.

Next Steps

I'm going to put a couple of experiments regarding tortoise voices on the back burner, but my main short-term goal is getting a version of ai-blog-reader up and running that

  1. runs on local models
  2. can be easily touched off with an external URL
  3. can automagically produce a reading, possibly while giving me an interface to correct/re-record some parts
  4. crawls through my archives and some other blogs I'd like to listen to, and turns them into podcasts

As always, I'll let you know how it goes.

