oops.se
First you need to decide on a strategy: local or cloud voice recognition.
Cloud: this is similar to Alexa, Siri and Google Home. A wake word is detected locally, and then the "voice to text" step is done in the cloud. The cloud service is NOT free.
Local: this is when everything, both "wake word" and "voice to text", is done locally, for example "Home Assistant Assist".
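To make the difference concrete, here is a rough sketch of the two strategies as one pipeline with a swappable "voice to text" step. Everything in it is a placeholder I made up for illustration (`local_stt`, `cloud_stt`, `wake_word_heard`, `handle_audio`, and the "hey pi" wake word); the "audio" is just bytes of text so the example runs without a microphone or any real recognizer.

```python
from typing import Callable, Optional

# Hypothetical type: a speech-to-text backend takes audio bytes and returns text.
SpeechToText = Callable[[bytes], str]


def local_stt(audio: bytes) -> str:
    """Stand-in for an on-device recognizer; no network involved."""
    return "local:" + audio.decode("utf-8")


def cloud_stt(audio: bytes) -> str:
    """Stand-in for a paid cloud service; a real one would make a network call here."""
    return "cloud:" + audio.decode("utf-8")


def wake_word_heard(audio: bytes, wake_word: bytes = b"hey pi") -> bool:
    """Stand-in wake-word detector; this step runs locally in both strategies."""
    return audio.startswith(wake_word)


def handle_audio(audio: bytes, stt: SpeechToText,
                 wake_word: bytes = b"hey pi") -> Optional[str]:
    """Ignore audio without the wake word; otherwise pass the rest to the chosen backend."""
    if not wake_word_heard(audio, wake_word):
        return None
    command = audio[len(wake_word):].strip()
    return stt(command)


if __name__ == "__main__":
    print(handle_audio(b"hey pi play some music", local_stt))
    print(handle_audio(b"hey pi play some music", cloud_stt))
    print(handle_audio(b"play some music", local_stt))
```

The only thing that changes between the two strategies is which backend gets passed in; the wake-word check stays on the device either way.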
Is a Raspberry Pi 3 enough? Well, less CPU means longer processing time and more delay.
I advocate local, as that is far more resilient than building long chains of dependencies. And personally, I would love to see an MP3 example: extracting metadata and playing from that collection.