catwalk is now kind of a thing. I've been running minor experiments on it on the box under my desk, and so far it performs much faster than making equivalent calls into `replicate`. And that's with no additional work on efficiency, and no sensible batching of requests to maximize state locality; all I've done so far is make sure all the models are loaded and ready to go. Based on the observed results, I'm guessing `replicate` does set-up, inference, and tear-down for each individual request, or possibly has high thresholds for keeping a model warm between calls, and just avoiding that makes a massive difference in generation time. The only current downside is that I don't have enough memory to run kandinsky, but my first few use cases don't need any image/video generation, so I think that's probably ok. The small experiments I intend to run in the short term can still go through the API call.
Right now, this is a ridiculously simple, minimal HTTP server that exposes interfaces into models to do basic tasks like TTS, voice transcription and image captioning/summarization. I've also got a basic text model wired up, but I'll very probably be disabling it to save on memory for the first bit.
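The shape of that "ridiculously simple" server can be sketched roughly like this. To be clear, this is my own illustration and not catwalk's actual code; the endpoint names and payload shapes are made up. The point is just that each route dispatches to a model wrapper that was loaded once at startup and stays warm in memory, so there's no per-request set-up/tear-down.

```python
# A minimal sketch of a model-dispatch HTTP server. The routes and payload
# shapes below are hypothetical stand-ins, not catwalk's real interface.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-ins for models loaded once at startup and kept warm in memory.
ROUTES = {
    "/transcribe": lambda payload: {"text": "(transcription of %s)" % payload["url"]},
    "/caption": lambda payload: {"caption": "(caption of %s)" % payload["url"]},
}

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        handler = ROUTES.get(self.path)
        if handler is None:
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(handler(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Blocks forever; the models in ROUTES never get torn down between calls.
    HTTPServer(("127.0.0.1", port), ModelHandler).serve_forever()
```

The win over a per-request pipeline is entirely in `ROUTES` being built once, before the first request arrives.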
Here are some quick observations about the models I've been running:
The `whisper` model is another one that, like `tortoise`, seems simpler to run through its own repo rather than 🤗, so it's a direct requirement rather than being pulled in through the `pipeline` interface. It's also by far the fastest of the models. I'm very seriously considering putting together some prototypes of voice command evaluation that I can run without hitting the Google/Amazon/Facebook/what-have-you servers with audio coming out of my mouth. My direct use case here is to use transcription on the output of my TTS calls as an error-trapping pass, but there are definitely a few more things I could do with it. I've got it set up alongside a couple of other trivials in
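That error-trapping pass can be sketched as a plain text comparison: transcribe the TTS output with `whisper`, then flag any generation whose transcript drifts too far from the source text. The function names and the 0.85 threshold here are my own illustrative choices, not anything catwalk actually does yet.

```python
# Hedged sketch of the TTS round-trip check: compare the source text against
# whisper's transcript of the generated audio. Names/threshold are assumptions.
import difflib
import re

def normalize(text):
    # Lowercase and strip punctuation so "Hello, world!" matches "hello world".
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def transcript_matches(source_text, transcript, threshold=0.85):
    a, b = normalize(source_text), normalize(transcript)
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    return ratio >= threshold
```

Anything below the threshold gets queued for regeneration (or for manual correction, per the `ai-blog-reader` goals below).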
Not much to say on this one. I call it trivially to get basic descriptions of images for things I'm going to be audio-fying. When I say "trivially", I mean it. The call is:
```python
from transformers import pipeline

_CAPTIONER = pipeline("image-to-text", model="Salesforce/blip2-flan-t5-xl")

def caption_image(url):
    return _CAPTIONER(url)
```
I haven't seen any advantages to running this one locally as opposed to on `replicate`. Possibly they keep it warm enough, or possibly loading it is fast enough that it doesn't matter either way.
Once set up, `tortoise` is amazing in terms of quality and ease of use. You can see the setup in the separate `tts.py`. Weirdly, both the output quality and the performance on this one are better than through the `replicate` interface. I have no theory for why this might be. I know the performance gains mostly come from my system keeping both the model and target voice vectors in memory, and this also saves network traffic since I'm not sending `wav`s over the wire each time, but I don't at all understand why text read in my voice would sound more robotic when generated by the `replicate` model rather than locally.
As far as I know, I'm running the exact same model they are. I guess it might be that I'm arranging the voice samples differently? I send one 5-minute sample up to the server, but use several ~20-second clips locally because it's possible to do that. It amounts to a similar "amount" of voice by wall-clock time, though, so I'm not sure why it would make a difference.
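For what it's worth, the local-side sample prep amounts to something like the slicing below. This is a sketch of my own arrangement, not anything from the `tortoise` codebase; the 22050 Hz rate and the minimum-length cutoff are my assumptions.

```python
# Sketch of slicing one long reference recording into the ~20-second clips
# I feed tortoise locally. Rate and cutoff are assumptions, not tortoise's API.
SAMPLE_RATE = 22050

def slice_clips(waveform, clip_seconds=20, sample_rate=SAMPLE_RATE):
    clip_len = clip_seconds * sample_rate
    clips = [waveform[i:i + clip_len] for i in range(0, len(waveform), clip_len)]
    # Drop a trailing fragment too short to be a useful voice sample.
    return [c for c in clips if len(c) >= clip_len // 2]
```

A 5-minute recording comes out as fifteen 20-second clips, so by wall-clock time the two arrangements really are carrying the same amount of voice.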
Oh, one more thing here: it looks like the `tortoise` process on some level understands the idea of people changing their voices. If you feed it audio of someone intentionally distorting their own voice (as they would be if they were voice acting some character), and then ask it to generate audio, it seems to generate that audio using the person's actual, non-distorted voice. I'm going to run a few more experiments to explore this, but I certainly wasn't expecting the effect. My theory is that there's something like SIFT for audio that the model extracts from its voice samples, and that thing isn't affected by the changes a speaker intentionally makes to their own voice. I'm going to put a bit of work into understanding this more deeply, because having a character, as opposed to a person's natural voice, read some text would be a pretty interesting application.
I'm going to put a couple of experiments regarding `tortoise` voices on the back burner, but my main short-term goal is getting a version of `ai-blog-reader` up and running that
- runs on local models
- can be easily touched off with an external URL
- can automagically produce a reading, possibly while giving me an interface to correct/re-record some parts
- crawls through my archives and some other blogs I'd like to listen to, and turns them into podcasts
As always, I'll let you know how it goes.