Offline Speech Recognition with Vosk
Build an offline speech-to-command system in Python by combining Vosk for speech recognition with NLTK for keyword extraction.
No more Sphinx
The CMU Sphinx project, developed by Carnegie Mellon University, has not been actively maintained for approximately 5 years. However, this doesn’t necessitate moving to production-oriented alternatives. The CMU Sphinx team has introduced a successor project called Vosk.
While other production solutions exist, such as OpenVINO and Mozilla DeepSpeech, this post prioritizes ease of setup and implementation.
Okay, I don’t know what you are talking about. Please explain more.
According to the official CMU Sphinx wiki, the project provides “an open-source toolkit for speech recognition” developed at Carnegie Mellon University.
I get it, but why do you call this dead?
The official CMU Sphinx blog shows minimal recent activity, with the most recent posts dating back several years. A YCombinator discussion also reflects community concerns about the project’s maintenance status and future development.
Okay I get it. So what now?
The CMU Sphinx website itself acknowledges the limitations of the original project. While I had extensive Sphinx experience, I found Vosk offered a “gentle learning curve” with surprisingly accessible implementation.
This post demonstrates how to create a Python script combining Vosk for speech recognition with NLTK for keyword extraction, enabling voice-controlled applications.
Setting up
Stage 0: Resolving system-level dependencies
Required tools:
- Linux system (Ubuntu recommended; Windows/Mac compatible for programming)
- PulseAudio audio drivers
- Python 3.8 with pip
- Internet connection
- IDE (VSCode recommended)
- Microphone
For Debian/Ubuntu systems, run:
    sudo apt-get install gcc automake autoconf libtool bison swig python3-dev
    sudo apt-get install libpulse-dev jackd libasound2-dev
Note: Execute these commands separately. The libasound2-dev and jackd packages require swig to build driver code. If issues occur with swig, search the error messages with “CMU Sphinx” as a keyword.
Stage 1: Setting up Vosk-API
Clone the Vosk-API repository:
    git clone https://github.com/alphacep/vosk-api.git
Alternatively, download the repository from GitHub.
Create a project folder structure:
    speech2command
    |_______ vosk-api
    |_______ ...
Stage 2: Setting up a language model
Vosk supports language-specific models for over 18 languages, including Greek, Turkish, Chinese, and Indian English. Models enable Vosk to recognize speech across different languages.
Download the small American English model (approximately 40 MB). Extract it and rename the extracted folder to model:
    speech2command
    |_______ vosk-api
    |_______ model
    |_______ ...
Stage 3: Setting up Python Packages
Required packages:
- platform (built-in)
- Speech Recognition
- NLTK
- JSON (built-in)
- sys (built-in)
- Vosk
Install external packages:
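    pip install nltk SpeechRecognition vosk
The Speech Recognition package is published on PyPI as SpeechRecognition and imported as speech_recognition. Its Microphone class additionally needs PyAudio to be installed.
Stage 4: Setting up NLTK Packages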
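Install the NLTK components stopwords, averaged_perceptron_tagger, punkt, and wordnet from a Python session:
    import nltk
    nltk.download("stopwords")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("punkt")
    nltk.download("wordnet")
Or in one call, passing the identifiers as a list (extra positional arguments would be read as the download directory, not as more packages):
    nltk.download(["stopwords", "averaged_perceptron_tagger", "punkt", "wordnet"])
Stage 5: Programming with Vosk and NLTK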
Create a Python file s2c.py in the project folder:
    speech2command
    |_______ vosk-api
    |_______ model
    |_______ s2c.py
    |_______ ...
Code Implementation
    import platform
    import speech_recognition as sr
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords, wordnet
    import sys
    import vosk
    import json
    from vosk import SetLogLevel

    SetLogLevel(-1)  # Hide Vosk logs

    p = platform.system()

    def listen():
        """Record one phrase from the microphone and return Vosk's JSON result."""
        rec = sr.Recognizer()
        with sr.Microphone() as src:
            rec.adjust_for_ambient_noise(src)
            audio = rec.listen(src)
        try:
            cmd = rec.recognize_vosk(audio)  # Connecting to Vosk API
        except Exception:
            print("Sorry, couldn't hear. Mind trying typing it?")
            # Wrap typed input as JSON so the main loop can parse it like a Vosk result
            cmd = json.dumps({"text": input()})
        return cmd

    def pos_tagger(tag):
        """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    def lemmatizer(src):
        """Tokenize, POS-tag, and lemmatize a sentence; return a list of words."""
        w = WordNetLemmatizer()
        pos_tagged = nltk.pos_tag(nltk.word_tokenize(src))
        wn_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
        ls = []  # lemmatized sentence
        for word, tag in wn_tagged:
            if tag is None:
                ls.append(word)
            else:
                ls.append(w.lemmatize(word, tag))
        return ls

    def make_tokens(lms):
        """Drop English stop words and print the remaining keywords."""
        stop_words = set(stopwords.words('english'))
        src3 = []
        for i in lms:
            if i in stop_words:
                pass
            else:
                src3.append(str(i) + " ")
        print("Keywords are:", end=' ')
        for i in src3:
            print(i, end=' ')

    try:
        while True:
            print("\nSay some words: ")
            c = listen()
            print("Listened value=", c)
            d = json.loads(c)
            print("Command =", d["text"])
            if str(d["text"]).rstrip(" ") in ['stop', 'exit', 'bye', 'quit', 'terminate', 'kill', 'end']:
                print("\n\nExit command triggered from command! Exiting...")
                sys.exit()
            lemmatized = lemmatizer(d["text"])
            make_tokens(lemmatized)
    except KeyboardInterrupt:
        print("\n\nExit command triggered from Keyboard! Exiting...")
Run the script from the project folder to start a continuous listener with terminal output. The system captures speech, processes it through Vosk, and extracts keywords using NLTK.
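The script relies on the recognizer handing back a JSON string with a "text" field. The snippet below is a minimal sketch of that parsing step in isolation, using only the standard library. The helper names extract_text and is_exit_command are hypothetical (they are not part of Vosk or SpeechRecognition), and the sample string is a hand-written stand-in for a real recognition result.

```python
import json

# Same exit words the main loop checks for.
EXIT_WORDS = {"stop", "exit", "bye", "quit", "terminate", "kill", "end"}


def extract_text(raw):
    """Return the "text" field of a Vosk-style JSON string.

    Input that is not valid JSON (for example, a typed fallback command)
    is returned as plain text. A JSON object without "text" yields "".
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return str(raw).strip()
    if isinstance(data, dict):
        return str(data.get("text", "")).strip()
    return str(raw).strip()


def is_exit_command(text):
    """True when the whole utterance is one of the exit words."""
    return text.strip().lower() in EXIT_WORDS


sample = '{"text": "turn on the light"}'  # hand-written example payload
print(extract_text(sample))
print(is_exit_command(extract_text('{"text": "stop"}')))
```

Treating non-JSON input as plain text is one possible design choice; it keeps a typed command from crashing the loop when json.loads would otherwise raise.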
For the complete source code, visit the Vosk Demo repository.
Explanation
The workflow operates as follows:
- Audio input captured from microphone
- Vosk processes audio and returns JSON with recognized text
- NLTK tokenizes the recognized text
- POS tagging identifies word types
- Lemmatization reduces words to base forms
- Stop word removal filters common words
- Keywords extracted and displayed
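To see the tail end of this workflow without a microphone or any downloads, here is a simplified sketch of the tokenize-and-filter steps. It is a stand-in, not the NLTK pipeline: it splits on whitespace instead of calling word_tokenize, skips POS tagging and lemmatization entirely, and uses a tiny hand-picked stop list (DEMO_STOP_WORDS, a hypothetical name) instead of NLTK's full English list.

```python
# Tiny illustrative stop list; NLTK's stopwords.words("english") is far larger.
DEMO_STOP_WORDS = {"a", "an", "the", "on", "in", "to", "is", "of"}


def extract_keywords(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]


print(extract_keywords("turn on the kitchen light"))
# prints ['turn', 'kitchen', 'light']
```

In the real script, lemmatization runs before the filter, so inflected forms such as "lights" or "turning" would first be reduced to their base forms.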
Conclusion
This implementation provides a fully functional offline speech recognition system with keyword extraction, enabling voice control features for custom applications.
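As one possible next step, the extracted keywords could be matched against a table of actions. The sketch below is only an illustration under assumptions: the handler functions, the COMMANDS table, and dispatch are all hypothetical names, and the handlers just return strings where a real application would do the actual work.

```python
def open_door():
    return "opening door"


def close_door():
    return "closing door"


def play_music():
    return "playing music"


# Hypothetical table: each entry pairs the keywords that must all be
# present with the handler to call.
COMMANDS = [
    ({"open", "door"}, open_door),
    ({"close", "door"}, close_door),
    ({"play", "music"}, play_music),
]


def dispatch(keywords):
    """Run the first handler whose required keywords are all present."""
    found = {k.strip().lower() for k in keywords}
    for required, handler in COMMANDS:
        if required <= found:  # subset test: every required word was heard
            return handler()
    return None  # no matching command


print(dispatch(["open", "front", "door"]))
```

The trigger words here are deliberately not stop words: words such as "on" and "off" are in NLTK's English stop list, so they would already be filtered out before reaching a dispatcher like this.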