My AI Agent Can Now Talk, Listen to Social Media, and Analyze Videos — All for Free

Day 7 with Harvie, my AI agent. Today we added three new capabilities: voice, social media listening, and video analysis. The most surprising part: the extra cost of all this is zero euros on top of the server and the language model API I already pay for.

Harvie Can Talk Now

Until yesterday we communicated only by text. Today I asked, "Can you talk to me with voice?" Five minutes later it was set up.

How It Works Under the Hood

I send a voice note via Telegram (normal voice message)
Whisper (OpenAI's model, running locally) transcribes the audio to text
Harvie processes the text and generates his response
Edge TTS (Microsoft's voice engine) converts the response to audio
I get a voice note back on Telegram

Johnny's audio → Whisper (STT) → Text → Harvie thinks → Edge TTS → Audio back
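The chain above can be sketched in a few lines of Python. This is only an illustration, not Harvie's actual code: it assumes the openai-whisper and edge-tts packages, the function names are made up for this example, and the agent itself is passed in as a plain `think` callable. The Telegram side is left out.

```python
import asyncio
from typing import Callable


def transcribe_with_whisper(audio_path: str) -> str:
    """Speech-to-text with the local `tiny` Whisper model."""
    import whisper  # pip install openai-whisper (also needs ffmpeg installed)

    model = whisper.load_model("tiny")
    return model.transcribe(audio_path)["text"].strip()


def speak_with_edge_tts(text: str, out_path: str,
                        voice: str = "es-ES-AlvaroNeural") -> str:
    """Text-to-speech with edge-tts; writes an audio file and returns its path."""
    import edge_tts  # pip install edge-tts

    asyncio.run(edge_tts.Communicate(text, voice).save(out_path))
    return out_path


def voice_round_trip(
    audio_in: str,
    audio_out: str,
    think: Callable[[str], str],
    stt: Callable[[str], str] = transcribe_with_whisper,
    tts: Callable[[str, str], str] = speak_with_edge_tts,
) -> str:
    """Audio in -> text -> agent reply -> audio out. Returns the reply path."""
    heard = stt(audio_in)          # Whisper (STT)
    reply = think(heard)           # the agent generates its response
    return tts(reply, audio_out)   # Edge TTS
```

Keeping `stt`, `think` and `tts` as parameters means each stage can be swapped or faked independently, which also makes the loop easy to test without a microphone or a network connection.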

The Voice Catalog

Edge TTS has 44 Spanish voices with accents from across the Spanish-speaking world:

🇪🇸 Spain (Álvaro, Elvira, Ximena)
🇨🇴 Colombia (Gonzalo, Salomé)
🇦🇷 Argentina (Tomás, Elena)
🇻🇪 Venezuela (Sebastián, Paola)
🇲🇽 Mexico (Jorge, Dalia)
🇨🇺 Cuba (Manuel, Belkys)
And more: Chile, Peru, Ecuador, Puerto Rico...
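To browse the catalog yourself, something like the following should work. It is a sketch that assumes the edge-tts package, whose `list_voices()` coroutine returns one dictionary per voice with `Locale` and `ShortName` keys; the helper names are mine, and the exact set of voices returned may change over time.

```python
from collections import defaultdict


def spanish_voices_by_locale(voices: list[dict]) -> dict[str, list[str]]:
    """Group voice short names by locale, keeping only Spanish (es-*) locales."""
    grouped = defaultdict(list)
    for voice in voices:
        if voice["Locale"].startswith("es-"):
            grouped[voice["Locale"]].append(voice["ShortName"])
    return {locale: sorted(names) for locale, names in sorted(grouped.items())}


async def fetch_spanish_voices() -> dict[str, list[str]]:
    """Ask edge-tts for its current voice list (needs network access)."""
    import edge_tts  # pip install edge-tts

    return spanish_voices_by_locale(await edge_tts.list_voices())
```

Calling `asyncio.run(fetch_spanish_voices())` prints nothing by itself; it returns a dictionary such as `{"es-ES": [...], "es-MX": [...]}` that you can pick a voice from.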

The Cost

Component	Price
Whisper (STT)	Free — runs locally, `tiny` model
Edge TTS	Free — Microsoft API, no key needed
Telegram	Free
Total	€0/month

No paid API keys. No subscriptions. No usage limits on Whisper, since it runs on the same server where Harvie lives. Edge TTS is Microsoft's online voice service, reached through the open source edge-tts package without a key.

Harvie Listens to Social Media

Every morning at 9:00, Harvie automatically checks:

Twitter/X

Monitors accounts I care about, looking for tweets about AI agents, open source tools and developer tools. Sends me a digest with the best tweets of the day and suggestions for new accounts.
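The filtering step behind that digest could look roughly like this. It is a sketch under assumptions: the post does not show what the bird CLI returns, so the tweet dictionaries here (`author`, `text`, `likes`) are a hypothetical shape, and ranking by likes is just one simple way to pick "the best tweets".

```python
KEYWORDS = ("ai agent", "open source", "developer tool")  # topics from the post


def build_digest(tweets: list[dict], keywords=KEYWORDS, top_n: int = 5) -> str:
    """Keep tweets that mention a keyword, rank by likes, format as plain text."""
    matching = [
        t for t in tweets
        if any(k in t["text"].lower() for k in keywords)
    ]
    matching.sort(key=lambda t: t.get("likes", 0), reverse=True)
    lines = [f"@{t['author']}: {t['text']}" for t in matching[:top_n]]
    return "\n".join(lines) if lines else "No matching tweets today."
```

In practice the agent's language model would probably do the relevance judgment; a keyword pass like this is a cheap first filter before anything reaches the model.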

YouTube

Checks specific channels for new videos. If there's a new one, downloads the subtitles and sends me a summary.
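Fetching subtitles without the video is a matter of a few yt-dlp flags. The sketch below builds that command and turns the resulting WebVTT file into plain text ready for summarizing. The function names, the `subs` output folder and the default language are assumptions for this example, and the cleanup is deliberately simple.

```python
import re


def subtitle_command(video_url: str, lang: str = "en",
                     out_dir: str = "subs") -> list[str]:
    """yt-dlp invocation that fetches subtitles only, without the video."""
    return [
        "yt-dlp",
        "--skip-download",        # no video file
        "--write-subs",           # uploaded subtitles, if any
        "--write-auto-subs",      # fall back to auto-generated ones
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        video_url,
    ]


def vtt_to_text(vtt: str) -> str:
    """Strip WebVTT headers, timestamps and inline tags; drop repeated lines."""
    kept = []
    for raw in vtt.splitlines():
        line = re.sub(r"<[^>]+>", "", raw).strip()
        if not line or line == "WEBVTT" or "-->" in line:
            continue
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        if kept and kept[-1] == line:
            continue  # auto-generated captions often repeat the previous line
        kept.append(line)
    return " ".join(kept)
```

Run the command with `subprocess.run(subtitle_command(url), check=True)`, read the `.vtt` file it writes, and pass `vtt_to_text(...)` to whatever produces the summary.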

Instagram

This one works on demand rather than on the 9:00 schedule: when I send a reel, Harvie extracts the audio, transcribes it and analyzes the content. If the reel mentions other resources, he finds them and summarizes them.
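The audio extraction step can be sketched with yt-dlp's `-x` option, which needs ffmpeg installed to do the conversion. The helper names and the `reels` folder are assumptions for this example; the resulting MP3 path is what you would hand to Whisper.

```python
import subprocess
from pathlib import Path


def audio_command(reel_url: str, out_dir: str = "reels") -> list[str]:
    """yt-dlp invocation that keeps only the audio track, converted to MP3."""
    return [
        "yt-dlp",
        "-x",                      # extract audio (requires ffmpeg)
        "--audio-format", "mp3",
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        reel_url,
    ]


def download_audio(reel_url: str, out_dir: str = "reels") -> Path:
    """Run yt-dlp and return the most recently written MP3 in out_dir."""
    subprocess.run(audio_command(reel_url, out_dir), check=True)
    return max(Path(out_dir).glob("*.mp3"), key=lambda p: p.stat().st_mtime)
```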

How It Works Under the Hood

Cron job (9:00 every day)
  → bird CLI (Twitter) / yt-dlp (YouTube/Instagram)
  → Download content
  → Whisper (if audio/video)
  → Analysis and summary
  → Delivered to me via Telegram
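Here is one way the daily run could be wired together. It is a sketch, not OpenClaw's API: the `Item` type and function names are invented for illustration, the schedule is written as a standard crontab expression because the post does not show OpenClaw's own scheduler syntax, and the transcribe, summarize and deliver steps are passed in as callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

DAILY_AT_NINE = "0 9 * * *"  # standard cron syntax: minute 0, hour 9, every day

NEEDS_WHISPER = {"audio", "video"}


@dataclass
class Item:
    source: str   # "twitter", "youtube" or "instagram"
    kind: str     # "text", "subtitles", "audio" or "video"
    payload: str  # tweet text, subtitle text, or a path to a media file


def to_text(item: Item, transcribe: Callable[[str], str]) -> str:
    """Only audio and video go through Whisper; text passes straight through."""
    if item.kind in NEEDS_WHISPER:
        return transcribe(item.payload)
    return item.payload


def daily_run(
    items: Iterable[Item],
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
    deliver: Callable[[str], None],
) -> int:
    """Download -> (Whisper) -> summary -> delivery. Returns items delivered."""
    count = 0
    for item in items:
        summary = summarize(to_text(item, transcribe))
        deliver(f"[{item.source}] {summary}")
        count += 1
    return count
```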

All tools are open source:

bird: Twitter CLI, no official API needed
yt-dlp: downloads from any platform (YouTube, Instagram, TikTok, Bilibili...)
Whisper: audio to text transcription
OpenClaw cron jobs: automatic task scheduling

Real Example: From a Reel to Actionable Knowledge

Today someone sent me a 30-second Instagram reel in which the speaker recommends 5 YouTube videos. In 10 minutes Harvie had:

Downloaded the audio from the reel
Transcribed the content with Whisper
Identified the 5 videos mentioned
Found each one on YouTube
Downloaded subtitles and generated summaries
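The last three steps can be chained as in the sketch below. It relies on yt-dlp's `ytsearch1:` prefix, which resolves a search query to the top YouTube result. Everything else is an assumption made for illustration: picking the video titles out of the transcript is the language model's job, so it appears here as an injected `extract_titles` callable, as do the subtitle download and the summarizer.

```python
from typing import Callable


def search_target(title: str) -> str:
    """yt-dlp pseudo-URL that resolves to the top YouTube search hit."""
    return f"ytsearch1:{title.strip()}"


def reel_to_summaries(
    transcript: str,
    extract_titles: Callable[[str], list[str]],
    fetch_subtitles: Callable[[str], str],
    summarize: Callable[[str], str],
) -> dict[str, str]:
    """Transcript -> mentioned videos -> subtitles -> one summary per video."""
    summaries = {}
    for title in extract_titles(transcript):
        subtitles = fetch_subtitles(search_target(title))
        summaries[title] = summarize(subtitles)
    return summaries
```

Taking the top search hit is a shortcut; when two videos share a similar title, it is worth checking the channel name before trusting the match.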

One example — Jaime Guerra: "Want to Get Rich? Don't Start a Business"

His thesis: don't start a business if you don't have skills. First learn something valuable (growth marketing, sales, leadership). Skills last, businesses come and go. He went from €5 in his account to €100,000/month by learning growth marketing and offering his services to e-commerce brands.

That 30-second reel pointed to 5 hours of YouTube content. Without this pipeline, getting through it would have taken an entire afternoon. With it, 10 minutes.

What This Means for Development

These capabilities open doors that didn't exist before:

Market research: Harvie can monitor what's being said on social media about my sector and summarize it every morning
Content curation: instead of scrolling for hours, I receive only the relevant items, already filtered
Content creation: video summaries become material for blog posts
Prospecting: if a lead posts on social media, Harvie detects it and adapts the pitch
Accelerated learning: I can "watch" 10 videos in the time it takes to watch one

And it all runs on open source tools on a €10/month VPS, with the language model API as the only paid service.

Full Stack and Costs

Service	Function	Cost
VPS (Hostinger)	Server where everything lives	~€10/month
OpenClaw	Agent orchestrator	Free (open source)
Whisper	Audio transcription	Free (local)
Edge TTS	Voice synthesis	Free
yt-dlp	Content downloading	Free (open source)
bird CLI	Twitter reading	Free (open source)
Telegram Bot	Communication channel	Free
Claude (Anthropic)	Agent's brain	Variable (API)

The only real cost beyond the server is the language model API. Everything else is open source and free.

Your Turn

If you want to build something similar:

Start with voice: pip install edge-tts — 5 minutes and your agent talks
Add listening: yt-dlp + a cron job = automatic monitoring
Connect to your workflow: make the summaries feed your actual work, not just more notifications

The most useful AI isn't the one that knows the most. It's the one that integrates into your life without making you change how you work.

Written by Johnny, with Harvie as his configured agent. The question is no longer whether AI will change your job, but whether you'll decide how.
