If You Have an LLM-Hammer

As a developer who writes code that calls LLMs, I catch myself thinking I need to use LLMs for things that really don't need them.
If I use a fast, cloud-based LLM to do everything, it tends to hide the places where I could get better performance or accuracy by using a different technique. In contrast, local LLMs running on consumer-grade devices are slower. A silver lining: I'm more likely to notice that I'm burning CPU on the LLM when I could handle a task more efficiently.
I wanted to talk about a few alternatives to LLMs that work well inside general web app code, as well as the Decent App architecture for local-LLM web apps.
Vector Embeddings for Comparing
You could use an LLM to check for certain meanings in what a user says. So, for example, you could insert user messages from a chat session into a prompt template like:
Output TRUE if the user's message "{{message}}" indicates they want to end the chat session, or FALSE otherwise.
After a few refinements (e.g., few-shot examples), you could have Llama, Gemma, Phi, or another LLM detecting when a user's last message in the chat indicates they want to end the session.
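If you did go the LLM route, a minimal sketch with WebLLM might look something like this. The model ID is a placeholder; substitute any prebuilt WebLLM model that fits your memory budget.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder model ID -- substitute any model from WebLLM's prebuilt list.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f16_1-MLC");

// Ask the LLM whether the user's message means "end the chat session".
async function wantsToEndSession(message: string): Promise<boolean> {
  const prompt =
    `Output TRUE if the user's message "${message}" indicates they want ` +
    `to end the chat session, or FALSE otherwise.`;
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  const text = reply.choices[0]?.message?.content ?? "";
  return text.trim().toUpperCase().startsWith("TRUE");
}
```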
But you could also create a vector embedding of the sentence "I want to end the session", and then, whenever the user enters some text, embed their text and compare the two embeddings for a similarity score. Computationally, this technique is multiple orders of magnitude cheaper than a call to an LLM inference endpoint. The comparison will run in milliseconds, rather than the seconds an LLM takes.
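Here's a minimal sketch of that comparison. I'm assuming the @xenova/transformers package and a Xenova conversion of MiniLM-L6-v2; the 0.7 threshold is a guess you'd tune against real messages.

```ts
import { pipeline } from "@xenova/transformers";

// Load a small sentence-embedding model once; reuse it for every comparison.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embedText(text: string): Promise<number[]> {
  // Mean-pool the token embeddings into one vector and L2-normalize it.
  const output = await embed(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

// With normalized vectors, cosine similarity is just the dot product.
function cosineSimilarity(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

const exitEmbedding = await embedText("I want to end the session");

async function looksLikeExit(message: string): Promise<boolean> {
  const score = cosineSimilarity(await embedText(message), exitEmbedding);
  return score > 0.7; // threshold is a guess -- tune it on real messages
}
```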
And you can do clever things to make the comparisons more accurate, like creating a centroid (average) of multiple embeddings ("I want to end the session", "end the session", "I'm leaving", "exit") or adding logic that checks multiple criteria before deciding.
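The centroid part is just vector averaging. This sketch reuses the embedText helper from the previous snippet:

```ts
// Average several exit-phrase embeddings into a centroid, then re-normalize
// so the dot-product similarity above still works.
function centroid(vectors: number[][]): number[] {
  const sum = vectors[0].map((_, i) =>
    vectors.reduce((acc, v) => acc + v[i], 0) / vectors.length);
  const norm = Math.sqrt(sum.reduce((acc, v) => acc + v * v, 0));
  return sum.map(v => v / norm);
}

const exitCentroid = centroid(await Promise.all([
  "I want to end the session", "end the session", "I'm leaving", "exit",
].map(embedText)));
```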
There are many models for vector embedding that can be loaded via TransformersJS, and these will run alongside the WebLLM-based models in your Decent App. I've been using MiniLM-L6-v2 with good results.
The Decent Apps tools don't lock you into just the models available with WebLLM. You can mix and match. WebLLM is great for running medium-sized LLMs in the 8-billion-parameter range in the browser with GPU. And TransformersJS is great for its larger selection of models that comfortably run on CPU.
(I'm aware TransformersJS also has GPU capabilities. But without going into details, at the time of writing, it's my opinion that WebLLM is better suited to running larger LLMs from video/unified memory.)
NLI for Classification
You could use an LLM to classify a word, phrase, or sentence amongst a number of possibilities. Imagine a customer support form where you wanted to take some freeform text from a customer and route it to the best possible recipient. A prompt template that might work for this:
Considering a customer's message "{{message}}" output REFUND if they are asking for a refund, RETURN if they are asking for a replacement, or RANT if they are expressing anger.
Of course it would need tweaking, but you can see how it would work, right? The LLM's REFUND/RETURN/RANT response wouldn't just be output to a chat window. Some controlling code would parse the LLM's response and route the message via email or other channels.
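A rough sketch of that controlling code, with stub handlers standing in for whatever email or ticketing integration you'd actually use:

```ts
type Route = "REFUND" | "RETURN" | "RANT";

// Stub handlers -- in a real app these would call your email or ticketing APIs.
const handlers: Record<Route, (message: string) => void> = {
  REFUND: (m) => console.log("-> billing queue:", m),
  RETURN: (m) => console.log("-> fulfillment queue:", m),
  RANT:   (m) => console.log("-> human support queue:", m),
};

function routeCustomerMessage(llmResponse: string, message: string): Route {
  // The LLM sometimes wraps the label in extra words, so search rather than compare.
  const match = llmResponse.toUpperCase().match(/\b(REFUND|RETURN|RANT)\b/);
  const route = (match?.[1] as Route) ?? "RANT"; // unknown output goes to a human
  handlers[route](message);
  return route;
}
```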
This response from the LLM inference endpoint will come back in seconds. But let's say you want milliseconds. Use a natural language inference (NLI) model instead of an LLM. They’re trained to judge whether one text supports another, which makes them useful for classification.
With an NLI model, you specify a premise (the user's message) and multiple hypotheses ("customer wants a refund", "customer wants a replacement", "customer expresses anger") and the model will cheaply score the likelihood of each hypothesis matching the premise.
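With TransformersJS, that's the zero-shot-classification pipeline. The model ID below is an assumption on my part; any MNLI-style checkpoint converted for TransformersJS should behave similarly.

```ts
import { pipeline } from "@xenova/transformers";

// Assumed model ID -- swap in whichever NLI checkpoint you prefer.
const classify = await pipeline(
  "zero-shot-classification", "Xenova/nli-deberta-v3-xsmall");

const hypotheses = [
  "customer wants a refund",
  "customer wants a replacement",
  "customer expresses anger",
];

// Score each hypothesis against the customer's message and return the best one.
async function classifyMessage(message: string) {
  const result = (await classify(message, hypotheses)) as unknown as {
    labels: string[]; scores: number[];
  };
  return { label: result.labels[0], score: result.scores[0] }; // sorted by score
}
```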
Again, you can use TransformersJS models like "roberta-large-mnli" or "tinybert-mnli" that will run within a web browser. These can cohabitate with a local LLM inside of a Decent App, so you can use a fast small model for one task, and use the "big guns" LLM for another.
Good Old-Fashioned Parsing
You may remember the examples used to prove that LLMs suck, like counting the number of "i"s in "Mississippi" and similar variants. LLMs are relatively slow and unreliable at parsing. (I mean the models themselves; tool-calling in LLM-based systems can often address this.) If you've written a prompt that asks an LLM to parse something, that's fine for your ChatGPT session. But you don't want your application code invoking an inference endpoint to perform parsing. Much better to write a bit of boring "deterministic" code to count the "i"s.
Also, if you're looking for certain patterns in text, you might be surprised at what can be done with a clever set of keywords and the right logic. The first example of checking when a user wants to end a chat session could be handled by matching against a handful of keywords with some guarding against false positives. And parsing is millisecond-scale stuff unless you're doing something like processing 100-page documents.
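A minimal sketch of the keyword approach; the keyword and guard lists here are purely illustrative and would be grown from real chat transcripts.

```ts
// Illustrative keyword list -- expand it from real chat transcripts.
const EXIT_KEYWORDS = ["bye", "goodbye", "exit", "quit", "leave", "end the session"];

// Phrases that contain exit-ish words but don't mean "end the chat".
const FALSE_POSITIVE_GUARDS = ["don't leave", "before i leave", "exit code"];

function isExitMessage(message: string): boolean {
  const text = message.toLowerCase();
  if (FALSE_POSITIVE_GUARDS.some(g => text.includes(g))) return false;
  return EXIT_KEYWORDS.some(k => text.includes(k));
}
```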
A great tool to help with parsing is part-of-speech tagging. You can use a library like WinkNLP to analyze some freeform text and have all the verbs, nouns, adjectives, and other parts of speech identified for you. In the example of checking for an exit phrase from the user, your code can look just at verbs. And those verbs can be lemmatized, which basically converts the variants of the same word to a normalized root word, e.g. "leaving" -> "leave".
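A sketch with WinkNLP and its English lite web model; the exit-verb list is illustrative.

```ts
import winkNLP from "wink-nlp";
import model from "wink-eng-lite-web-model";

const nlp = winkNLP(model);
const its = nlp.its;

// Return the lemmatized verbs in a message, e.g. "I'm leaving now" -> ["leave"].
function verbLemmas(message: string): string[] {
  const doc = nlp.readDoc(message);
  return doc
    .tokens()
    .filter(t => t.out(its.pos) === "VERB")
    .out(its.lemma);
}

const EXIT_VERBS = new Set(["leave", "exit", "quit", "go"]); // illustrative list
const wantsToExit = verbLemmas("I'm leaving now").some(v => EXIT_VERBS.has(v));
```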
Not Everything is an LLM-Nail
A side effect of exploring these pre-"LLM Age" techniques is that you enlarge your understanding of the AI field in general. There are fundamental concepts like vector embeddings that are building blocks for LLMs, and learning these makes LLMs less of a black box.
But the main benefit of using non-LLM techniques like vector embeddings, natural language inference, and parsing is that your app will run faster and sometimes more accurately. There are other tools in the toolbox besides your LLM-Hammer.
September 2025 News
We've started a weekly Discord video call called "Decent Saturdays" where a small group of people talk mainly about making local-LLM-based web apps. Depending on who shows up and what's on people's minds, there might be demos, troubleshooting sessions, or general hanging out.
It's every Saturday at 10am Pacific Time and lasts about an hour. To see a magical conversion to your timezone, try clicking the event link.