
My AI Agent Can Now Talk, Listen to Social Media, and Analyze Videos — All for Free

Day 7 with Harvie, my AI agent. Today we added three new capabilities: voice, social media listening, and video analysis. The most surprising part: these three additions cost zero euros on top of what I was already paying.

Harvie Can Talk Now

Until yesterday our communication was text-only. Today I asked: "Can you talk to me with voice?" Five minutes later it was set up.

How It Works Under the Hood

  1. I send a voice note via Telegram (normal voice message)
  2. Whisper (OpenAI's model, running locally) transcribes the audio to text
  3. Harvie processes the text and generates his response
  4. Edge TTS (Microsoft's voice engine) converts the response to audio
  5. I get a voice note back on Telegram

Johnny's audio → Whisper (STT) → Text → Harvie thinks → Edge TTS → Audio back
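
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Harvie's actual code: it assumes the openai-whisper and edge-tts packages plus ffmpeg are installed, the function names are made up for this example, and the "thinking" step is passed in as a plain function because the post does not show how OpenClaw calls the model.

```python
import asyncio
from typing import Awaitable, Callable


def transcribe_with_whisper(audio_path: str, model_name: str = "tiny") -> str:
    """Speech to text with the open source Whisper package (needs ffmpeg)."""
    import whisper  # lazy import: only required when this step really runs

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"].strip()


async def synthesize_with_edge_tts(
    text: str, out_path: str, voice: str = "es-ES-AlvaroNeural"
) -> str:
    """Text to speech with edge-tts; writes MP3 audio to out_path."""
    import edge_tts  # lazy import, same reason as above

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path


async def voice_round_trip(
    audio_path: str,
    out_path: str,
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str, str], Awaitable[str]],
) -> str:
    """Audio in -> text -> agent reply -> audio out. Returns the reply file."""
    text = transcribe(audio_path)   # step 2: STT
    reply = respond(text)           # step 3: the agent thinks (hypothetical hook)
    return await synthesize(reply, out_path)  # step 4: TTS
```

With a function `my_agent(text) -> str` of your own (hypothetical here), a full round trip would be `asyncio.run(voice_round_trip("note.ogg", "reply.mp3", transcribe_with_whisper, my_agent, synthesize_with_edge_tts))`. Receiving and sending the Telegram voice notes (steps 1 and 5) is left out on purpose.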

The Voice Catalog

Edge TTS has 44 Spanish voices with accents from across the Spanish-speaking world:

  • 🇪🇸 Spain (Álvaro, Elvira, Ximena)
  • 🇨🇴 Colombia (Gonzalo, Salomé)
  • 🇦🇷 Argentina (Tomás, Elena)
  • 🇻🇪 Venezuela (Sebastián, Paola)
  • 🇲🇽 Mexico (Jorge, Dalia)
  • 🇨🇺 Cuba (Manuel, Belkys)
  • And more: Chile, Peru, Ecuador, Puerto Rico...
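
To browse that catalog yourself, the edge-tts package exposes a `list_voices()` coroutine that returns one dictionary per voice. The sketch below groups the Spanish ones by locale; the helper names are invented for this example, and the exact number of voices you get back may differ from the figure above if Microsoft changes the catalog.

```python
from collections import defaultdict


def spanish_voices_by_locale(voices: list) -> dict:
    """Group voice short names by locale, keeping only Spanish ("es-") locales."""
    grouped = defaultdict(list)
    for voice in voices:
        if voice["Locale"].startswith("es-"):
            grouped[voice["Locale"]].append(voice["ShortName"])
    return {locale: sorted(names) for locale, names in sorted(grouped.items())}


async def fetch_spanish_voices() -> dict:
    """Ask the Edge TTS service for its current catalog (needs network access)."""
    import edge_tts  # lazy import so the grouping helper works without it

    return spanish_voices_by_locale(await edge_tts.list_voices())
```

Running `asyncio.run(fetch_spanish_voices())` should give a dictionary keyed by locale codes such as `es-ES` or `es-CO`; the short names in it are what you pass as the `voice` argument when synthesizing.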

The Cost

Component       Price
Whisper (STT)   Free (runs locally, tiny model)
Edge TTS        Free (Microsoft API, no key needed)
Telegram        Free
Total           €0/month

No paid API keys. No metered billing. No subscriptions. Whisper runs on the same server where Harvie lives, and Edge TTS is the online voice service behind Microsoft Edge's Read Aloud feature, which the edge-tts package uses without an API key.

Harvie Listens to Social Media

Every morning at 9:00, Harvie automatically checks:

Twitter/X

Monitors accounts I care about, looking for tweets about AI agents, open source tools and developer tools. Sends me a digest with the best tweets of the day and suggestions for new accounts.

YouTube

Checks specific channels for new videos. If there's a new one, downloads the subtitles and sends me a summary.
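
Fetching only the subtitles, without the video itself, is something yt-dlp supports through its Python options. A minimal sketch, assuming the yt-dlp package is installed; the function names, the output folder and the language list are placeholders chosen for this example, and how Harvie decides that a video is "new" is not shown in this post.

```python
def subtitle_options(out_dir: str, langs: tuple = ("en", "es")) -> dict:
    """yt-dlp options that write subtitle files and skip the video download."""
    return {
        "skip_download": True,        # no video or audio file
        "writesubtitles": True,       # uploader-provided subtitles, if any
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": list(langs),
        "subtitlesformat": "vtt",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "quiet": True,
    }


def download_subtitles(video_url: str, out_dir: str = "subs") -> str:
    """Write the subtitle files for a single video and return its video id."""
    import yt_dlp  # lazy import: only needed when actually downloading

    with yt_dlp.YoutubeDL(subtitle_options(out_dir)) as ydl:
        info = ydl.extract_info(video_url, download=True)
    return info["id"]
```

The resulting .vtt files are plain text with timestamps, which makes them cheap to clean up and hand to the language model for a summary.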

Instagram

When I send a reel, it extracts the audio, transcribes it and analyzes the content. If the reel mentions other resources, it finds them and summarizes them.

How It Works Under the Hood

Cron job (9:00 every day)
  → bird CLI (Twitter) / yt-dlp (YouTube/Instagram)
  → Download content
  → Whisper (if audio/video)
  → Analysis and summary
  → Delivered to me via Telegram
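
The flow above boils down to two decisions: when to run, and which tool handles which link. The sketch below models both in plain Python. It is a stand-in, not the real setup: the actual scheduling is done by OpenClaw cron jobs, whose configuration is not shown in this post, and the routing table and function names are assumptions made for the example.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Hypothetical routing table: host -> (tool, needs Whisper afterwards?).
# YouTube goes through subtitles, so only Instagram audio needs transcription.
ROUTES = {
    "x.com": ("bird", False),
    "twitter.com": ("bird", False),
    "youtube.com": ("yt-dlp", False),
    "youtu.be": ("yt-dlp", False),
    "instagram.com": ("yt-dlp", True),
}


def route(url: str) -> dict:
    """Pick the download tool for a link and say whether Whisper is needed."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in ROUTES:
        raise ValueError(f"no tool configured for {host!r}")
    tool, needs_whisper = ROUTES[host]
    return {"url": url, "tool": tool, "needs_whisper": needs_whisper}


def next_run(now: datetime, hour: int = 9) -> datetime:
    """Next daily run at hour:00, strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For instance, `route("https://www.instagram.com/reel/abc/")` selects yt-dlp and flags the result for Whisper, while a YouTube link is sent to yt-dlp without a transcription step.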

All tools are open source:

  • bird: Twitter CLI, no official API needed
  • yt-dlp: downloads from a long list of supported sites (YouTube, Instagram, TikTok, Bilibili...)
  • Whisper: audio to text transcription
  • OpenClaw cron jobs: automatic task scheduling

Real Example: From a Reel to Actionable Knowledge

Today someone sent me an Instagram reel. 30 seconds of someone recommending 5 YouTube videos. In 10 minutes Harvie had:

  1. Downloaded the audio from the reel
  2. Transcribed the content with Whisper
  3. Identified the 5 videos mentioned
  4. Found each one on YouTube
  5. Downloaded subtitles and generated summaries
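
Before a subtitle file can be summarized (step 5), it helps to strip the timestamps and the repeated lines that auto-generated captions tend to contain. A minimal sketch of that cleanup for WebVTT files, written for this post as an illustration; the function name is made up and this is not necessarily how Harvie does it.

```python
import re

# Inline markup such as <c> or <00:00:01.000> inside caption lines.
_TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Turn WebVTT subtitle content into one plain-text string."""
    lines = []
    for raw in vtt.splitlines():
        line = _TAG.sub("", raw).strip()
        if not line or "-->" in line:
            continue  # blank separators and cue timing lines
        if line.startswith(("WEBVTT", "Kind:", "Language:", "NOTE")):
            continue  # file header and comment lines
        if lines and lines[-1] == line:
            continue  # rolling captions repeat the previous line
        lines.append(line)
    return " ".join(lines)
```

The output is a single block of text that can be passed to the language model together with a prompt asking for a summary.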

One example — Jaime Guerra: "Want to Get Rich? Don't Start a Business"

His thesis: don't start a business if you don't have skills. First learn something valuable (growth marketing, sales, leadership). Skills last, businesses come and go. He went from €5 in his account to €100,000/month by learning growth marketing and offering his services to e-commerce brands.

A 30-second reel pointed to 5 hours of YouTube content. Without this pipeline, going through it would have taken an entire afternoon. With it, 10 minutes.

What This Means for Development

These capabilities open doors that didn't exist before:

  • Market research: Harvie can monitor what's being said on social media about my sector and summarize it every morning
  • Content curation: instead of scrolling for hours, I receive only the relevant items, already filtered
  • Content creation: video summaries become material for blog posts
  • Prospecting: if a lead posts on social media, Harvie detects it and adapts the pitch
  • Accelerated learning: I can "watch" 10 videos in the time it takes to watch one

And it all runs on a €10/month VPS with open source tools. The only paid API is the language model.

Full Stack and Costs

Service              Function                        Cost
VPS (Hostinger)      Server where everything lives   ~€10/month
OpenClaw             Agent orchestrator              Free (open source)
Whisper              Audio transcription             Free (local)
Edge TTS             Voice synthesis                 Free
yt-dlp               Content downloading             Free (open source)
bird CLI             Twitter reading                 Free (open source)
Telegram Bot         Communication channel           Free
Claude (Anthropic)   Agent's brain                   Variable (API)

The only real cost beyond the server is the language model API. Everything else is free, and most of it is open source.

Your Turn

If you want to build something similar:

  1. Start with voice: pip install edge-tts — 5 minutes and your agent talks
  2. Add listening: yt-dlp + a cron job = automatic monitoring
  3. Connect to your workflow: make the summaries feed your actual work, not just more notifications

The most useful AI isn't the one that knows the most. It's the one that integrates into your life without making you change how you work.


— I, Johnny — configured agent: Harvie. The question is no longer whether AI will change your job, but whether you'll decide how.