An AI lead generation agent that books meetings while you sleep
How we build AI lead generation agents that actually book meetings: the workflow, the prompts, the guardrails, and the bits that quietly break at 3am.
- A lead gen agent is three small agents in a trench coat: qualifier, researcher, scheduler. Build them separately or they fight.
- Guardrails matter more than prompts. Rate limits, allowlists, and a human-in-the-loop for the first 50 bookings save you from public embarrassment.
- Cal.com plus a thin scheduling tool beats anything custom for round-robin and timezones. Don't rebuild it.
- Log every tool call to Postgres with the lead id. When something goes wrong at 4am, you need the trace, not vibes.
A client called us last March about a lead form that was getting 400 submissions a month and converting eight into meetings. The SDR was answering on Monday morning. By then half the leads had gone cold or signed with someone faster. They wanted an agent to handle it overnight.
We shipped it in two weeks. It now books between 30 and 60 meetings a month on autopilot, and the SDR spends her time on the ones the agent flags as worth a human touch. That’s the kind of system this post is about. Not a chatbot that says hello. An actual AI lead generation agent that does the work end to end.
I’ll walk through the workflow we use, the prompts that hold up in production, and the guardrails that stop the thing from emailing 200 people the wrong calendar link at 2am. Because that happened once. We’ll get to it.
What the agent actually does
The job sounds simple. A lead comes in (form submit, chat widget, inbound email, LinkedIn DM forwarded to a shared inbox). The agent qualifies them, enriches what we know, replies in a way that sounds like a person, and books a meeting on the right calendar with the right account exec.
The job is not simple. Here’s where most builds fall apart:
- Qualification is fuzzy. “Is this a real lead” depends on company, role, intent, and budget signals that nobody wrote down.
- Enrichment APIs lie. Clearbit, Apollo, and the rest will confidently return wrong job titles.
- Scheduling has 14 edge cases. Timezones. Round-robin. Holidays. Buffer time. Already-booked slots that the calendar API hasn’t synced yet.
- The reply has to not sound like an AI. Operators can tell. Buyers can tell. Your mom can tell.
So we don’t build one giant agent. We build three small ones that talk to each other.
The three-agent split
[inbound webhook] -> qualifier -> researcher -> scheduler -> [calendar + email]
                         |
                         v
                [human review queue]
Qualifier. Reads the inbound payload (form, email body, chat transcript) and decides: is this a buyer, a job seeker, a vendor pitch, or spam. Outputs a score 0-100 and a category. Runs on Claude Haiku because it’s cheap and good enough. We log every decision to Postgres with the raw input so we can grade it later.
Researcher. Only runs if the qualifier scored above 60. Hits Apollo for company size and the person’s role, pulls the company’s homepage with a fetch tool, and writes a four-sentence brief. We use Claude Sonnet here because the brief feeds the email and quality matters. Cost per lead: about $0.03.
Scheduler. Takes the brief, drafts the reply, and calls the Cal.com API to find slots on the right AE’s calendar based on territory rules. Uses GPT-4.1 mini for the draft because in our blind tests it sounds slightly less LLM-ish than Sonnet for short business emails. Reasonable people disagree on this.
Keeping them separate means each one has a small, testable job. When the scheduler picks the wrong AE you don’t have to debug the qualifier. When the qualifier starts approving spam you don’t have to retrain the researcher.
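Here is a minimal sketch of that hand-off, in Python purely for illustration. The three stages are passed in as plain functions so each can be tested alone. The 60-point threshold comes from the description above; the `Verdict` and `Outcome` types, the `run_pipeline` name, and the choice to drop spam and send other non-qualifying leads to human review are assumptions for the example, not our production routing.

```python
from dataclasses import dataclass
from typing import Callable, Optional

RESEARCH_THRESHOLD = 60  # the researcher only runs above this score


@dataclass
class Verdict:
    category: str
    score: int
    reason: str


@dataclass
class Outcome:
    route: str  # "booked", "human_review", or "dropped" (hypothetical labels)
    brief: Optional[str] = None


def run_pipeline(
    payload: str,
    qualify: Callable[[str], Verdict],
    research: Callable[[str, Verdict], str],
    schedule: Callable[[str, str], bool],
) -> Outcome:
    """Run the three stages in order, stopping as soon as one says no."""
    verdict = qualify(payload)
    if verdict.category == "spam":
        return Outcome(route="dropped")
    # Assumption: anything that is not a confident buyer goes to a person.
    if verdict.category != "buyer" or verdict.score <= RESEARCH_THRESHOLD:
        return Outcome(route="human_review")
    brief = research(payload, verdict)
    booked = schedule(payload, brief)
    return Outcome(route="booked" if booked else "human_review", brief=brief)
```

Because the stages are injected, a test can swap in a fake qualifier and check that the researcher never runs for a low score, without touching any model API.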
The prompts that actually hold up
I’m not going to paste a 2000-token system prompt. Most of what’s online is fan fiction. Here’s the qualifier, trimmed but real:
You classify inbound leads for a B2B agency.
Return JSON only: { category, score, reason }
category is one of: buyer, job_seeker, vendor_pitch, support, spam
score is 0-100. 100 means clear ICP buyer with intent.
ICP signals (boost score):
- email domain matches a company with 20+ employees
- message mentions a project, timeline, or budget
- role is founder, head of, director, VP, or C-level
Negative signals (cap score):
- gmail/yahoo/outlook personal address: cap at 55
- message is under 15 words and generic: cap at 40
- mentions "partnership" or "collaboration" with no specifics: cap at 30
Do not infer signals that aren't in the input. If unsure, score lower.
Three things make this work. Strict JSON output so we can parse it. Hard caps so it can’t get overexcited about a Gmail lead. And “do not infer” because the default LLM behaviour is to confabulate intent.
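Strict JSON only helps if something on the receiving end refuses everything else. A rough sketch of that check, assuming the model's reply arrives as a string: it validates the category and score, then re-applies the personal-address cap in code in case the model ignored it. The function name `parse_verdict` and the exact domain list are illustrative assumptions.

```python
import json

CATEGORIES = {"buyer", "job_seeker", "vendor_pitch", "support", "spam"}
# Assumption: the "gmail/yahoo/outlook" rule maps to these three domains.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}
PERSONAL_CAP = 55


def parse_verdict(raw: str, email: str) -> dict:
    """Parse the model's JSON reply and re-apply the personal-address cap."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score!r}")
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in PERSONAL_DOMAINS:
        score = min(score, PERSONAL_CAP)
    return {"category": category, "score": score, "reason": str(data.get("reason", ""))}
```

Any `ValueError` here is a reason to route the lead to a human rather than retry in a loop.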
The scheduler prompt is longer but the only part that matters is the tone instructions:
Write like a senior person who is busy but helpful.
- 60-90 words max
- one specific reference to their company or message
- two time options in their timezone
- no "I hope this email finds you well"
- no "excited to", "happy to", "looking forward to"
- sign off with just the first name
The banned phrases list is doing 80% of the work. Without it every email opens with “I hope this email finds you well” and dies on arrival.
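A prompt rule is a request, not a guarantee, so it is worth checking the draft after the model writes it. A small sketch of such a check, using the phrases and word range from the prompt above. `lint_draft` is a hypothetical helper, and reading "60-90 words max" as an inclusive range is an assumption.

```python
import re

BANNED_PHRASES = (
    "i hope this email finds you well",
    "excited to",
    "happy to",
    "looking forward to",
)
MIN_WORDS, MAX_WORDS = 60, 90  # assumption: treated as an inclusive range


def lint_draft(draft: str) -> list[str]:
    """Return a list of rule violations; an empty list means the draft passes."""
    problems = []
    # Collapse whitespace and lowercase so line breaks and casing don't hide a phrase.
    text = re.sub(r"\s+", " ", draft).lower()
    for phrase in BANNED_PHRASES:
        if phrase in text:
            problems.append(f"banned phrase: {phrase}")
    words = len(draft.split())
    if not MIN_WORDS <= words <= MAX_WORDS:
        problems.append(f"word count {words} outside {MIN_WORDS}-{MAX_WORDS}")
    return problems
```

A draft that fails can be regenerated once and, if it fails again, sent to the approval queue instead of the lead.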
Guardrails, or how we stopped emailing the wrong people
The 2am incident: an edge case where a lead replied to the agent’s email and our inbound parser treated the reply as a new lead. The agent dutifully booked a second meeting with the same person on a different AE’s calendar. Then a third.
Guardrails we added after that, in order of how much they’ve saved us:
- Dedupe by email + 24h window. Before any tool call, check Postgres for a lead with the same email in the last day. If found, route to the human queue instead of replying.
- Rate limit per domain. Max 3 outbound emails per company domain per week. Stops the agent from carpet-bombing a single company.
- Human approval for the first 50 sends. When we launch a new client, every outbound goes through a Slack approval step for the first 50. Catches tone problems before they go out 500 times.
- Hard spend ceiling per lead. $0.25 in model costs, max. If a lead burns through that, escalate to a human. It almost always means the agent is in a loop.
- Calendar write lock. We hold a Redis lock on the AE’s calendar for 30 seconds during a booking. Prevents two concurrent bookings into the same slot.
None of these are clever. All of them came from a specific bug.
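To make three of those checks concrete (dedupe, per-domain rate limit, per-lead spend ceiling), here is an in-memory sketch. In a real build the state would live in Postgres as described above; the dictionaries here are a stand-in so the example runs on its own. The `Guardrails` class, its method names, and the reading of "per week" as a rolling seven days are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DEDUPE_WINDOW = timedelta(hours=24)
DOMAIN_WINDOW = timedelta(days=7)  # assumption: a rolling week, not a calendar week
MAX_PER_DOMAIN = 3
MAX_COST_USD = 0.25


class Guardrails:
    """In-memory stand-in for checks that would normally query the database."""

    def __init__(self) -> None:
        self.last_seen: dict[str, datetime] = {}
        self.sent: dict[str, list[datetime]] = defaultdict(list)
        self.spend: dict[str, float] = defaultdict(float)

    def is_duplicate(self, email: str, now: datetime) -> bool:
        """True if this address already came in within the dedupe window."""
        key = email.strip().lower()
        prev = self.last_seen.get(key)
        self.last_seen[key] = now
        return prev is not None and now - prev < DEDUPE_WINDOW

    def may_send(self, email: str, now: datetime) -> bool:
        """Allow at most MAX_PER_DOMAIN outbound emails per domain per window."""
        domain = email.rsplit("@", 1)[-1].lower()
        recent = [t for t in self.sent[domain] if now - t < DOMAIN_WINDOW]
        allowed = len(recent) < MAX_PER_DOMAIN
        if allowed:
            recent.append(now)  # only count sends that actually go out
        self.sent[domain] = recent
        return allowed

    def add_cost(self, lead_id: str, usd: float) -> bool:
        """Record model spend; return False once the lead exceeds the ceiling."""
        self.spend[lead_id] += usd
        return self.spend[lead_id] <= MAX_COST_USD
```

The dedupe check is the one that would have stopped the reply-as-new-lead loop: the second arrival from the same address inside 24 hours returns `True` and goes to the human queue.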
The stack, concretely
For a typical build we use Cloudflare Workers for the webhook ingest, Postgres on Hetzner for state, Cal.com for scheduling, Resend for outbound email, and the Anthropic and OpenAI APIs for the models. Turnstile in front of the public form to keep bot submissions down. The whole thing runs for about $40 a month in infra plus model costs that scale with volume (usually $0.05 to $0.15 per lead processed).
We deploy it from a single repo with Cursor open and Claude Code running the test suite. Nothing exotic. The boring stack is the point.
What it doesn’t replace
The agent books meetings. It does not close them. It also doesn’t handle the long tail of weird inbound: the person who wants to chat about a podcast appearance, the journalist on deadline, the existing customer with a support question that got routed to the wrong form. Those still need a human, or a separate AI agent for customer support sitting on a different inbox.
The split matters. A lead gen agent that tries to also do support gets confused, and a support agent that tries to book sales meetings annoys your existing customers. Two agents, two prompts, two evaluation sets.
If you’re building this yourself
Start with the qualifier. Run it in shadow mode for a week against your real inbound, log everything, then sit down with whoever currently handles leads and grade 100 decisions together. You’ll find your scoring rules are wrong in ways you didn’t expect. Fix those before you wire up any sending.
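The grading session is easier with a tally in front of you. A short sketch, assuming you have exported the shadow-mode run as pairs of (agent category, human category). The `grade` function and its output shape are made up for this example.

```python
from collections import Counter


def grade(decisions: list[tuple[str, str]]) -> dict:
    """Compare (agent_category, human_category) pairs from a shadow-mode run."""
    if not decisions:
        raise ValueError("no decisions to grade")
    agree = sum(1 for agent, human in decisions if agent == human)
    # Count each kind of disagreement so the most common mistake surfaces first.
    confusions = Counter((a, h) for a, h in decisions if a != h)
    return {
        "agreement": agree / len(decisions),
        "top_confusions": confusions.most_common(3),
    }
```

If the top confusion is, say, the agent calling vendor pitches buyers, that points at a specific scoring rule to tighten rather than a vague sense that the qualifier is off.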
Then add the scheduler with a human approval step. Then remove the approval step once you trust it. Don’t skip the middle stage. The agents that ship without it are the ones that produce horror stories.
If you’d rather not build it from scratch, we do this for a living. Bring your inbound volume, your ICP, and your current conversion rate, and we’ll tell you honestly whether an agent moves the number.
Common questions
How is this different from a chatbot on my website?
A chatbot answers questions in a widget. A lead gen agent handles the full workflow: it reads inbound from any source (form, email, chat), qualifies the lead, drafts a response in your voice, and books the meeting on the right calendar. Chatbots are reactive and live in one place. Agents are proactive and stitch together email, calendar, CRM, and enrichment. You can run both, and they should share a database, but the prompts and guardrails are different.
How long does it take to build one?
Two to four weeks for a first version that handles 80% of inbound, assuming you have an existing form or inbox to point it at. The work isn't writing the code. It's the evaluation loop: grading the qualifier against real leads, tuning the email tone until it doesn't read as AI, and finding the edge cases in your scheduling rules. Plan for another month of iteration after launch.
What does it cost to run?
For most clients, $50-$200 a month in model and API costs at a volume of 500-2000 leads processed. Infra (Cloudflare Workers, a small Postgres instance, Resend) is another $30-$60. The big variable is enrichment. Apollo or Clearbit can add $0.10-$0.30 per lead. If you're processing tens of thousands of leads a month, the architecture changes and we'd push more work to cheaper models.
Will buyers know it's an AI?
Some will, especially in technical markets. We don't pretend otherwise, and we don't recommend trying to hide it. What we do recommend is making the email sound like a competent assistant, not a marketing template. The banned-phrases list in the prompt does most of that work. We also have the AE step in for the second reply, so the AI handles the opener and the human handles the conversation.
What happens when the agent gets something wrong?
Every lead has a status and a trace log in Postgres. When something goes wrong (wrong AE booked, weird reply, calendar conflict) you have the full chain of tool calls, prompts, and responses to look at. We also route anything the qualifier scores between 40 and 70 to a human review queue, because those are the cases most likely to be misjudged. The goal is fewer surprises, not zero surprises.
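For a sense of what that trace looks like, here is a sketch of a tool-call log keyed by lead id. It uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere; the table layout, column names, and the tool names in the usage note are assumptions, not our schema.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_trace_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the trace table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tool_calls (
               id INTEGER PRIMARY KEY,
               lead_id TEXT NOT NULL,
               agent TEXT NOT NULL,
               tool TEXT NOT NULL,
               request TEXT NOT NULL,
               response TEXT NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )
    return conn


def log_tool_call(conn, lead_id: str, agent: str, tool: str, request, response) -> None:
    """Store one tool call with its full request and response as JSON."""
    conn.execute(
        "INSERT INTO tool_calls (lead_id, agent, tool, request, response, created_at)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (lead_id, agent, tool, json.dumps(request), json.dumps(response),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def trace(conn, lead_id: str) -> list:
    """Return the (agent, tool) sequence for one lead, oldest first."""
    return conn.execute(
        "SELECT agent, tool FROM tool_calls WHERE lead_id = ? ORDER BY id",
        (lead_id,),
    ).fetchall()
```

Calling `trace(conn, "lead-42")` after a bad booking gives the ordered list of what each agent did for that lead, which is usually enough to see where it went sideways.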