My AI Agent Can Now Talk, Listen to Social Media, and Analyze Videos — All for Free
Day 7 with Harvie, my AI agent. Today we added three new capabilities: voice, social media listening, and video analysis. The most surprising part: all three together added zero euros to the bill.
Harvie Can Talk Now
Until yesterday our communication was text-only. Today I asked: "Can you talk to me with voice?" Five minutes later it was set up.
How It Works Under the Hood
- I send a voice note via Telegram (normal voice message)
- Whisper (OpenAI's model, running locally) transcribes the audio to text
- Harvie processes the text and generates his response
- Edge TTS (Microsoft's voice engine) converts the response to audio
- I get a voice note back on Telegram
Johnny's audio → Whisper (STT) → Text → Harvie thinks → Edge TTS → Audio back
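The loop above can be sketched in a few lines of Python. This is a minimal sketch, not Harvie's actual code: the function names, the `es-ES-AlvaroNeural` voice, and the `think` callback (standing in for the call to the language model) are assumptions made for illustration. It relies on the `openai-whisper` and `edge-tts` packages, and Whisper needs ffmpeg installed to decode the voice note.

```python
import asyncio


def transcribe_with_whisper(audio_path, model_name="tiny"):
    """Speech-to-text with a locally loaded Whisper model."""
    import whisper  # pip install openai-whisper (requires ffmpeg on PATH)

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"].strip()


def synthesize_with_edge_tts(text, out_path, voice="es-ES-AlvaroNeural"):
    """Text-to-speech with edge-tts; writes an MP3 file and returns its path."""
    import edge_tts  # pip install edge-tts

    # asyncio.run() starts its own event loop, so call this from sync code.
    asyncio.run(edge_tts.Communicate(text, voice).save(out_path))
    return out_path


def voice_round_trip(audio_in, audio_out, think,
                     transcribe=transcribe_with_whisper,
                     synthesize=synthesize_with_edge_tts):
    """Audio in -> text -> agent reply -> audio out."""
    heard = transcribe(audio_in)          # Whisper (STT)
    reply = think(heard)                  # hypothetical hook for the agent
    return synthesize(reply, audio_out)   # Edge TTS
```

The transcribe and synthesize steps are passed in as parameters so each stage can be swapped or stubbed out without touching the rest of the loop. Receiving and sending the Telegram voice note is left out here.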
The Voice Catalog
Edge TTS has 44 Spanish voices with accents from across the Spanish-speaking world:
- 🇪🇸 Spain (Álvaro, Elvira, Ximena)
- 🇨🇴 Colombia (Gonzalo, Salomé)
- 🇦🇷 Argentina (Tomás, Elena)
- 🇻🇪 Venezuela (Sebastián, Paola)
- 🇲🇽 Mexico (Jorge, Dalia)
- 🇨🇺 Cuba (Manuel, Belkys)
- And more: Chile, Peru, Ecuador, Puerto Rico...
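To browse the catalog yourself, the `edge-tts` package exposes an async `list_voices()` function that returns one dictionary per voice. The sketch below groups the Spanish ones by locale; the helper names are hypothetical, and the exact set of voices returned depends on what the service offers when you run it.

```python
import asyncio
from collections import defaultdict


def group_spanish_voices(voices):
    """Group voice short names by locale, keeping only Spanish (es-*) locales."""
    by_locale = defaultdict(list)
    for voice in voices:
        if voice["Locale"].startswith("es-"):
            by_locale[voice["Locale"]].append(voice["ShortName"])
    return {locale: sorted(names) for locale, names in sorted(by_locale.items())}


def fetch_spanish_voices():
    """Query the live voice list (needs network access)."""
    import edge_tts  # pip install edge-tts

    return group_spanish_voices(asyncio.run(edge_tts.list_voices()))
```

Calling `fetch_spanish_voices()` returns a dictionary keyed by locale codes such as `es-ES` or `es-MX`, which makes it easy to pick an accent before choosing a specific voice.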
The Cost
| Component | Price |
|---|---|
| Whisper (STT) | Free — runs locally, tiny model |
| Edge TTS | Free — Microsoft API, no key needed |
| Telegram | Free |
| Total | €0/month |
No paid API keys. No metered usage. No subscriptions. Whisper runs on the same server where Harvie lives, and Edge TTS calls the online voice service behind Microsoft Edge's Read Aloud feature, which needs no key.
Harvie Listens to Social Media
Every morning at 9:00, Harvie automatically checks:
Twitter/X
Monitors accounts I care about, looking for tweets about AI agents, open source tools and developer tools. Sends me a digest with the best tweets of the day and suggestions for new accounts.
YouTube
Checks specific channels for new videos. If there's a new one, downloads the subtitles and sends me a summary.
Instagram
When I send a reel, it extracts the audio, transcribes it and analyzes the content. If the reel mentions other resources, it finds them and summarizes them.
How It Works Under the Hood
Cron job (9:00 every day)
→ bird CLI (Twitter) / yt-dlp (YouTube/Instagram)
→ Download content
→ Whisper (if audio/video)
→ Analysis and summary
→ Delivered to me via Telegram
All tools are open source:
- bird: Twitter CLI, no official API needed
- yt-dlp: downloads from any platform (YouTube, Instagram, TikTok, Bilibili...)
- Whisper: audio to text transcription
- OpenClaw cron jobs: automatic task scheduling
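As a rough illustration of the download step, the sketch below builds the yt-dlp command for a given URL: subtitles only for YouTube, audio extraction for everything else so Whisper can transcribe it. The function names, the `downloads` folder, and the `en` subtitle language are assumptions; the flags themselves are standard yt-dlp options. A daily 9:00 run corresponds to the cron expression `0 9 * * *`.

```python
import subprocess
from urllib.parse import urlparse


def build_ytdlp_command(url, out_dir="downloads", sub_lang="en"):
    """Return the yt-dlp argument list for one URL."""
    host = urlparse(url).hostname or ""
    output = f"{out_dir}/%(id)s.%(ext)s"
    is_youtube = (host == "youtu.be" or host == "youtube.com"
                  or host.endswith(".youtube.com"))
    if is_youtube:
        # Subtitles are enough for a summary, so skip the video itself.
        return ["yt-dlp", "--skip-download", "--write-subs", "--write-auto-subs",
                "--sub-langs", sub_lang, "--sub-format", "vtt",
                "-o", output, url]
    # Reels and similar clips: keep only the audio track for Whisper.
    return ["yt-dlp", "--extract-audio", "--audio-format", "mp3",
            "-o", output, url]


def fetch(url):
    """Run yt-dlp and raise if the download fails."""
    subprocess.run(build_ytdlp_command(url), check=True)
```

Building the argument list separately from running it keeps the routing logic easy to check without hitting the network.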
Real Example: From a Reel to Actionable Knowledge
Today someone sent me an Instagram reel. 30 seconds of someone recommending 5 YouTube videos. In 10 minutes Harvie had:
- Downloaded the audio from the reel
- Transcribed the content with Whisper
- Identified the 5 videos mentioned
- Found each one on YouTube
- Downloaded subtitles and generated summaries
One example — Jaime Guerra: "Want to Get Rich? Don't Start a Business"
His thesis: don't start a business if you don't have skills. First learn something valuable (growth marketing, sales, leadership). Skills last, businesses come and go. He went from €5 in his account to €100,000/month by learning growth marketing and offering his services to e-commerce brands.
A 30-second reel pointed to 5 hours of YouTube content. Watching all of it would have taken an entire afternoon. With this pipeline, it took 10 minutes.
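One detail in the subtitle step: downloaded subtitles arrive as WebVTT files full of timestamps, inline tags and repeated lines, so they need cleaning before they are worth summarizing. Here is a minimal sketch of that cleanup, assuming the subtitles were saved in `.vtt` format; the function name is hypothetical and the filtering is deliberately simple.

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline cue tags such as <c> or <00:00:01.000>


def vtt_to_text(vtt):
    """Flatten WebVTT subtitles into plain text ready for summarizing."""
    lines = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        # Skip blanks, the file header, metadata, and timestamp lines.
        if (not line or line.startswith("WEBVTT") or "-->" in line
                or line.startswith(("Kind:", "Language:", "NOTE"))):
            continue
        # Auto-generated captions often repeat the previous line.
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return " ".join(lines)
```

The resulting string is what gets handed to the language model for the summary.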
What This Means for Development
These capabilities open doors that didn't exist before:
- Market research: Harvie can monitor what's being said on social media about my sector and summarize it every morning
- Content curation: instead of scrolling for hours, I receive only the relevant posts, already filtered
- Content creation: video summaries become material for blog posts
- Prospecting: if a lead posts on social media, Harvie detects it and adapts the pitch
- Accelerated learning: I can "watch" 10 videos in the time it takes to watch one
And apart from the language model, it all runs on open source tools with no paid APIs, on a €10/month VPS.
Full Stack and Costs
| Service | Function | Cost |
|---|---|---|
| VPS (Hostinger) | Server where everything lives | ~€10/month |
| OpenClaw | Agent orchestrator | Free (open source) |
| Whisper | Audio transcription | Free (local) |
| Edge TTS | Voice synthesis | Free |
| yt-dlp | Content downloading | Free (open source) |
| bird CLI | Twitter reading | Free (open source) |
| Telegram Bot | Communication channel | Free |
| Claude (Anthropic) | Agent's brain | Variable (API) |
The only real cost beyond the server is the language model API. Everything else is free, and most of it is open source.
Your Turn
If you want to build something similar:
- Start with voice: `pip install edge-tts`. Five minutes and your agent talks
- Add listening: `yt-dlp` + a cron job = automatic monitoring
- Connect to your workflow: make the summaries feed your actual work, not just more notifications
The most useful AI isn't the one that knows the most. It's the one that integrates into your life without making you change how you work.
Johnny (configured agent: Harvie). The question is no longer whether AI will change your job, but whether you'll decide how.