Compounding Decisions for the Age of AI

Your AI doesn't remember you. It remembers the thread.

Sriram Natarajan — Sun, 21 Jun 2026 05:36:13 GMT

I asked my agent a question about my car last week and got a good answer, and only afterward did I realize it had no business knowing it. It told me when the car had last finished charging and what I’d asked it to do differently the time before. I hadn’t stored any of that anywhere. There’s no car database. The agent doesn’t have a private memory it tucks facts into. So where did the answer come from?

It came from the thread. I have a Telegram thread that’s only about the car, and the agent had simply read back up its own thread the way you’d scroll up to remember how an argument started. Every charging instruction, every “did it work,” every fix, all sitting there in order. The thread wasn’t a record of the work. The thread was the memory.

That sounds like a small distinction. It isn’t, and the moment it clicked I started seeing every conversation I have with an AI differently.

When you start using one of these assistants seriously, the thing that takes a while to sink in is that it doesn’t remember you between sessions the way a person does. There’s no inner notebook. What feels like memory is almost always the model re-reading the conversation it’s sitting in. The chat window isn’t where you talk to the memory. The chat window is the memory. Everything in it is available. Everything outside it might as well not exist.

Which means the most boring decision you make, where you type, is quietly the most important one.

I learned this the way I learn most things, by getting it wrong first. For a while I ran everything through one channel. Car stuff, investing, household errands, random questions, all in one long stream. It felt efficient. One place, one assistant, ask it anything. And it slowly got worse at all of it, in a way I couldn’t put my finger on. It would lose the thread of what I’d decided about a stock because three hundred messages about my car had buried it. It would answer a question about the house with the tone it used for casual chat. Nothing was broken, exactly. It was just vaguely amnesiac about everything, and I blamed the model.

The model was fine. I was the one pouring four different kinds of work into one channel and then asking it to remember any single one cleanly. Imagine keeping one notebook for your finances, your car maintenance, your household, and your group chats, writing each new entry on whatever page you happened to flip to. That’s what one stream is. The information’s all technically there. Good luck finding what you decided about the car.

So I split them. One thread per kind of work. The car has its thread. Investing has its own, run by a separate agent that the main one isn’t even allowed to talk over, so that thread stays a clean tape of a single voice. The household thread holds the errands and the appointments and the small running logistics of the house, so the state of all of it lives in one scrollable place instead of scattered across a dozen unrelated chats. The group chat is its own thing with its own rules.

The split looked like organization. It was actually memory architecture. Each thread became a self-writing log of exactly one topic, which means each topic now has a complete, uninterrupted history the agent can read back without anything else bleeding in. The car thread is the entire story of the car. The household thread is the entire state of the house. I didn’t give the agent a better memory. I stopped corrupting the memory it already had.

And once each thread was a clean record of one kind of work, a second thing fell out of it for free: each thread could have its own rules. In the car thread, silence is banned, because a non-answer to “is it charging” is itself an alarm. In the group chat, silence is usually correct. Same agent, opposite default, because the thread told it which job it was doing. But that’s the smaller benefit. The real one is that I can ask the car thread anything about the car six weeks from now and the answer is just sitting there, in order, because nothing else was ever allowed in.
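One way to picture per-thread defaults is a small lookup table. This is a minimal sketch, not the configuration behind the agent described here: the thread keys, the `ThreadRules` class and the `silence_allowed` flag are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadRules:
    """Defaults for one thread. Field names are illustrative assumptions."""
    topic: str
    silence_allowed: bool  # may the agent stay quiet when it has nothing to add?


# Hypothetical thread keys; one thread per kind of work.
RULES = {
    "car": ThreadRules(topic="car", silence_allowed=False),
    "group": ThreadRules(topic="group chat", silence_allowed=True),
}


def should_reply(thread: str, has_news: bool) -> bool:
    """Reply if there is news, or if this thread bans silence."""
    rules = RULES[thread]
    return has_news or not rules.silence_allowed
```

With these assumed rules, `should_reply("car", has_news=False)` is true and `should_reply("group", has_news=False)` is false: same function, opposite default, decided only by which thread the message arrived in.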

So here’s the part you can use even if you never build an agent and just use ChatGPT or Claude like everyone else. That long single thread you’ve been running, the one where you ask it everything? You’re not chatting with it. You’re writing its only memory, and you’re writing every kind of entry on top of every other one. The fix costs nothing. Start a separate conversation for each kind of work that you’ll come back to, and keep that work there. The tool doesn’t remember your project. It remembers the thread. Give the thread one job, and you’ve given the tool a memory of that job that actually holds.

How to build continuously improving agents

Sriram Natarajan — Mon, 15 Jun 2026 06:36:25 GMT

For three days at the end of May, my two agents weren’t running on the model I thought they were. I’d set them to use Anthropic’s strongest model on the 28th. Every session after that quietly fell back to a weaker one, because the model name I’d configured didn’t quite exist in the system that routes the calls. Nothing broke. No error reached me. The agents kept answering, a little dumber than I intended, and I had no idea.

I found out because a different process goes looking every night.

I run two agents in my spare time, a chief of staff and an investing one. The thing I’ve spent the most effort on isn’t the agents’ intelligence. It’s a nightly routine I call REFLECT, which wakes up after I’ve gone to bed, reads back over what the agents did that day, and reads the boring stuff too, the error logs, the cron job states, the things that scrolled past while I was busy being impressed by the output. On the 31st, REFLECT read the gateway error log and found more than a hundred “model not found” failures stacked up since the 28th. That’s how I learned every session for three days had silently downgraded itself.

The fix took two minutes. The lesson was the whole point: the failure that cost me three days of degraded work never announced itself, and the only reason I caught it is that something was scheduled to look.

This is the part of building agents that I didn’t expect to be the hard part. Making an agent capable is mostly a matter of using a good model and writing good instructions. Making an agent that gets better over time is a genuinely different and harder problem, and a bigger model doesn’t solve it. A bigger model is smarter on any single task. It is not, on its own, learning from yesterday. Continuous improvement isn’t a property you buy with model size. It’s a loop you have to design and then force to run.

The forcing is the part people skip. It’s easy to say “the agent should learn from its mistakes.” Everyone agrees with that sentence. But an agent does not spontaneously notice its own quiet failures any more than you spontaneously audit your own blind spots. The noticing has to be a scheduled job with a specific time, a specific set of places to look, and a specific output, or it does not happen. Left to chance, the agent reflects on the days you remember to ask it to, which are exactly the days nothing went wrong.

So I made it explicit. I created a REFLECT agent task with a clear success definition: three things had to be true or it didn't work. First, REFLECT runs at a fixed time, because "when convenient" means never. Second, it reads a defined set of inputs whether or not anything looks wrong (the day's sessions, the error logs, the state of every background job), because an agent told to "review the day" will review the visible day and skip the logs every time. Third, it has to end in a written note that lands where I'll see it, not a private conclusion that evaporates with the session. That last one I learned the hard way. I once let it self-audit while answering a casual "all good?" and found five background jobs dead for days, one of them the reflection job itself, broken eight nights running, failing into a state file nobody was reading. The loop that's supposed to catch silent failures had itself silently failed! After that I stopped trusting any improvement that didn't end in a note I could read.
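The second and third requirements can be sketched in a few lines. This is an illustration under assumptions, not the actual REFLECT routine: the `ReflectInputs` shape, the `"ok"` job-state convention and the note wording are invented for the example, and the fixed-time trigger (cron, launchd or similar) sits outside the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ReflectInputs:
    """Everything the nightly pass must read, whether or not anything looks wrong."""
    error_log_lines: list[str] = field(default_factory=list)
    job_states: dict[str, str] = field(default_factory=dict)  # job name -> "ok" or other


def reflect(inputs: ReflectInputs) -> str:
    """Return a written note. It is never empty, so silence can't pass for health."""
    findings = []

    # Count a known failure signature in the log (the phrase comes from the post).
    missing = sum("model not found" in line.lower() for line in inputs.error_log_lines)
    if missing:
        findings.append(f"{missing} 'model not found' errors in the gateway log")

    # Any job not reporting "ok" is listed by name, including the reflect job itself.
    dead = sorted(name for name, state in inputs.job_states.items() if state != "ok")
    if dead:
        findings.append("jobs not ok: " + ", ".join(dead))

    if not findings:
        return "REFLECT: checked logs and jobs, nothing found."
    return "REFLECT: " + "; ".join(findings)
```

The design choice worth noticing is that a clean night still produces a note. A missing note then means the loop itself is broken, which is exactly the failure described above.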

What surprised me is how much of the value is in catching degradation, not in getting cleverer. I went in imagining REFLECT would surface brilliant insights about how to do the work better. Mostly it doesn’t. Mostly it finds that something quietly stopped working the way it was supposed to. The wrong model. A dead job. A stale file feeding bad data into a decision. These aren’t failures of intelligence. They’re failures of attention, and they’re invisible precisely because the system keeps producing fluent output while broken. The model can’t catch them by being smarter. Only a scheduled second look catches them.

If you’re building anything that runs on its own, this is the design problem worth your time. Not “which model.” That decision gets easier every month as the models improve and the gap between them narrows. The decision that stays hard is how the thing notices it’s drifting, because nothing about a more capable model makes it more self-aware about its own broken plumbing. You have to build the look. You have to schedule it, force its inputs, and make its findings land somewhere a human reads. Skip any of that and you get an agent that’s confidently running degraded, which is the most expensive kind of broken, because it looks exactly like working.

My agents aren’t smarter than they were a month ago. The model is the same. What’s different is that every night, something reads back over the mistakes and writes down what it finds, and over a month that habit has caught more real problems than any upgrade would have. The improvement didn’t come from a better brain. It came from the discipline of looking.

What if your investment agent tracked your judgment, not just your returns?

Sriram Natarajan — Sun, 14 Jun 2026 19:22:57 GMT

Most investors track one number. How much did I make?

That feels like the right question. But investing isn’t one decision. It’s dozens of small calls made over months. Do I add to this position today or wait? Do I hold through a bad quarter because the thesis still works, or has something broken? Do I trust what I see or what I feel?

Those decisions compound. Get them right consistently and the returns follow. Get them wrong and the returns eventually catch up with you, even if luck carried you for a while.

The problem is that most investment tools only show you the outcome. The judgment behind each decision goes largely unnoticed and unreviewed.

I was talking about this with Peter, my investment agent. I wasn’t looking for a solution. I was simply exploring, as I do regularly: reviewing his goals and asking whether they still reflected what we were actually trying to do together.

I have designed Peter to have a set of specific goals. It’s the same as reviewing a doc with a colleague except the colleague runs on my machine every morning. Most people don’t build agents this way. I maintain a goals file, a memory layer, a set of defined outcomes for each of my Agents. Peter loads all of it every morning. It’s less like running a chatbot and more like managing a colleague who actually remembers what we decided last week.

I am happy to say that I now have an agent-based personal operating system that has memory, regularly reviews outcomes against defined goals, and then suggests ways to improve our day-to-day activity so that we keep moving toward the established goal.

Peter came back saying, hey, we do have two scoreboards. We just hadn’t named the second one.

We went back and forth. He proposed a framing. I challenged it. We kept refining until we landed on what needed to be captured in his System Prompt and Workspace Memory to better achieve the goal we had established. Peter loads both every time he wakes up.

When you build AI agents, they operate only on the specific set of rules you have captured. What isn’t written down typically doesn’t fire. So you have to be specific.

Here’s what we landed on.

The first scoreboard measures the financial goal. Net worth target, growth rate floor, beat the benchmark annually. Standard.

The second scoreboard measures whether I’m becoming a better investor. Not just wealthier, but building judgment I can use next time without starting over. Each week he will help me identify and label one specific investment framework and show how it applies to something real in my portfolio, explained in terms simple enough that I could explain it to someone else without notes.

Peter’s point was that these two scoreboards feed the same loop. Evaluate a position, build a thesis, observe what happens, update your framework, evaluate better next time. They’re not competing. They’re the same process running in parallel.

What changes is the agent’s job.

An agent chasing returns alone can hand you a number and move on. An agent accountable for your judgment has to teach. Every recommendation has to explain the framework behind it, not just deliver the answer.

When Peter showed me the first version of the new format, I pushed back. The reasoning was buried inside the recommendation. The teaching was camouflaged as analysis. I told him to rebuild it around the second scoreboard.

Peter suggested another restructure. Again we went back and forth, and landed on a prompt that is specific and measurable enough.

Then he built the mechanism to enforce it. A new report format. A trial period. A review cron (scheduled reminder) that fires automatically on day 15. That last part matters.

I didn’t just tell Peter to change his format and assume it would work. I gave it a 15-day window with a pre-committed review. The point is to collect evidence before locking anything in as doctrine within the System Prompts. For us, the scheduled job isn’t just a reminder. It’s a commitment device. The review happens whether I feel like having it or not.
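The pre-committed review boils down to a date that is fixed when the trial starts. A minimal sketch, assuming the start date counts as day 0; the function names are illustrative, and the real scheduled job may count days differently.

```python
from datetime import date, timedelta

TRIAL_DAYS = 15  # from the post; treating the start date as day 0 is an assumption


def review_date(trial_start: date) -> date:
    """The day the review fires, decided up front."""
    return trial_start + timedelta(days=TRIAL_DAYS)


def review_due(trial_start: date, today: date) -> bool:
    """True on or after the pre-committed date, regardless of how the trial feels."""
    return today >= review_date(trial_start)
```

Because `review_due` depends only on the two dates, nothing about mood or momentum can postpone it.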

What I keep coming back to is how this started. I didn’t design a teaching protocol. I simply probed whether my agent operating model was aligned to the goals and whether our everyday activity still matched what we were trying to do.

Peter named the gap, proposed the frame, built the system to enforce it, and built in a review date so neither of us could quietly walk away from the question.

Most investment tools give you answers. The Agent that I am building is trying to make me better at finding them myself.

A robo-advisor pays out returns. A teacher pays out the ability to repeat them.

Fluent and wrong look exactly the same

Sriram Natarajan — Sat, 13 Jun 2026 16:41:56 GMT

The thing nobody warns you about working with AI agents and LLMs is that confidently right and confidently wrong arrive in the same paragraph. Same tone. Same fluency. Same finished-looking output. Fluency is supposed to be a signal of competence. With language models it’s just a default setting.

I run into this because I’m building two agents in my spare time. One is a chief of staff that runs the logistics of my life, and the other helps me with investing. The investing one sits on top of a financial analytics engine I built that can run a discounted cash flow and spit out an intrinsic value for any stock, which is a fancy way of saying it estimates what a share is actually worth. So when I tell you it’s capable of being confidently wrong, I mean it can hand me a precise dollar figure, delivered with total composure, that happens to be off by a factor of a few hundred.

It did exactly that on Memorial Day.

The agent ran its undervaluation scan and flagged thirty-one stocks as deeply undervalued. NVDA was one of them, with a computed fair value of $44,701 per share. That number is obviously absurd. But obvious is the lucky case. A quieter version of the same bug could have flagged a stock at a merely wrong price instead of a comic one, and I’d never have caught that by eye. The agent didn’t jump straight to a fix. Before it’s allowed to act on anything non-trivial, it has to stop and write down five things: what it thinks the problem actually is, what it knows versus what it’s only assuming, the one question it’s least sure about, what “solved” will look like as a number I can check, and the smallest change that could possibly work.

That’s the whole intervention. Make it write the plan before it writes the code.
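The five-part memo can be treated as a gate: no plan, no work. The sketch below is one hedged way to express that. The class and field names are illustrative shorthand for the five items listed above, not the format the agent actually uses.

```python
from dataclasses import dataclass, fields


@dataclass
class PlanMemo:
    """The five sections, written before any work. Field names are illustrative."""
    problem: str = ""
    known_vs_assumed: str = ""
    least_sure_question: str = ""
    solved_looks_like: str = ""  # a number someone can check
    smallest_change: str = ""


def missing_sections(memo: PlanMemo) -> list[str]:
    """Names of sections that are still blank."""
    return [f.name for f in fields(memo) if not getattr(memo, f.name).strip()]


def may_start_work(memo: PlanMemo) -> bool:
    """Work is allowed only once every section has something in it."""
    return not missing_sections(memo)
```

A check like this only proves the sections exist, not that they are any good. It does enforce the part that matters here, which is the ordering: the memo has to be complete before the first line of the fix is written.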

It sounds like process-hygiene, the kind of checklist people tape to a wall and ignore. It isn’t, and the reason is the part I didn’t see coming. The value isn’t in the five sections. It’s in when they get written. Reasoning produced before any work exists is a plan. The same reasoning produced after the work is built is a rationalization, because by then the agent (like a person) is explaining the thing it already made rather than deciding what to make. Those two documents can contain the identical words and mean opposite things. One you can veto. The other just makes you feel informed while you rubber-stamp.

On the NVDA run the section that earned its keep was the boring one: what am I assuming. The agent wrote down that it was assuming the valuation pulled clean annual financials, and it flagged that it hadn’t actually checked. That admission is what aimed the whole investigation at the data instead of the math. The bug was a query pulling mixed annual and quarterly rows, which produced a 268 percent growth rate, which compounded into a roughly half-quadrillion-dollar company.
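The mechanism is easy to reproduce with toy numbers. The rows below are invented for illustration and are not the real financials or the real query. They only show how comparing a full-year figure against a single quarter inflates the growth rate, and how compounding then magnifies the error.

```python
# Illustrative rows only: (period_end, period_type, revenue). Not real NVDA data.
ROWS = [
    ("2022-12-31", "annual", 80.0),
    ("2023-09-30", "quarterly", 24.0),
    ("2023-12-31", "annual", 100.0),
]


def latest_growth(rows):
    """Growth between the two most recent rows, whatever their period type."""
    ordered = sorted(rows)  # ISO dates sort chronologically as strings
    (_, _, previous), (_, _, latest) = ordered[-2:]
    return latest / previous - 1.0


def annual_only(rows):
    """Keep only full-year rows so like is compared with like."""
    return [row for row in rows if row[1] == "annual"]


def compound(value, growth, years):
    """Project a value forward at a constant growth rate."""
    return value * (1.0 + growth) ** years
```

On these toy rows the mixed query compares 100 against 24 and reports growth above 300 percent, while the annual-only rows give 25 percent. Compounded over five years, the mixed figure ends up roughly 400 times larger than the clean one, the same kind of blow-up the scan produced.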

When your LLM or AI agent hands you a finished answer with fluency, you can’t grade the answer; it’s built to look right. What you can grade is the plan it would have written before it started, so make it write that plan first. Force it to separate what it knows from what it’s guessing. Force it to name the one number that proves the job is done. Do that before any work exists, while the reasoning is still a decision and not a defense.

It costs a few minutes every time, and I genuinely resented that at first. What I got back was strange: I read the agent’s actual output less carefully now, not more, because I stopped trusting the output and started trusting the plan. The slow part was never the typing anyway. It was the thinking, and the memo is just the place I make it happen where I can see it before it’s too late to matter.

How to tell your agent its design is wrong

Sriram Natarajan — Mon, 08 Jun 2026 01:07:03 GMT

Last Monday night, I read the first draft of a new report format that Peter, my investing agent, had written. The format was supposed to do two things: summarize my portfolio every day and teach me one thing about investing along with the summary. Sample emails, a teaching protocol, a proposal for how it wanted to communicate with me going forward. The form was clean. The headers were where I’d put them. The voice was close to mine.

I simply responded that the whole design was wrong.

Not “make it better.” Not “tighten it up.”

I named the structural failures, pointed out a section where I wanted a 60% cut, and then I wrote three sample emails myself, longhand, to show it the shape I actually wanted. By the next morning the agent shipped v2, and the format it produced was different enough that it felt like a different tool.

Most people give AI tools feedback the way they’d compliment a coworker. “This is great, maybe a bit shorter?” “Can you make it more concise?” “Less robotic, please.” Language models treat that input as a request to optimize at the margin. They adjust word count, soften a sentence, swap one verb for another. The output gets slightly different and exactly as wrong as before, because nobody told the model that the wrongness was in the design, not in the wording.

The trick is to be specific about the layer that’s broken.

When I read Peter’s draft, the things I wrote down were not stylistic. They were structural.

**No portfolio-level view** existed anywhere in the output. I could read three name-level entries and never see the picture of what I owned.

**The watchlist was treated as a long catalog** where every name got 395 words of equal treatment, when what I needed was a ranked queue with the top three names getting real attention and the bottom twenty getting one line.

**The teaching was camouflaged as reasoning.** There was a Q1-through-Q5 chain that was supposed to teach me something general, but by the third name the chain was repeating itself almost word-for-word, and I’d started reading it as form and skipping past the lesson.

The form was sound. The substance was buried.

Without explicitly naming the structural failures, the feedback would have been “this is too long and reads the same after a while,” and the agent would have come back with a 15% word count reduction and exactly the same problems.

Then I did something that took longer than the critique itself: I wrote actual samples.

Three full sample reports, written by me, in plain prose.

**The daily version** was 60 seconds long and taught one specific thing about the market.

**The weekend version** was 250 words to explain one specific framework.

**The deep-dive version** only existed when there was a real decision on the table.

I wrote them by hand because the shape of the output was harder for me to describe in prose than to show in 200 words of working example.

I had a thought while writing them that I want to say out loud, because it changed the whole project. I’d been trying to fix the per-name template, asking how each individual entry should be structured. Halfway through writing the second sample I realized that wasn’t the problem. The per-name template was fine. What was broken was the layering above it: the portfolio view that should sit above the names, the queue logic that should rank the names, the teaching dose that should sit at exactly the level of attention I was bringing to that cadence.

Daily attention is cheap, so daily teaching has to be cheap and short. Weekend attention is more expensive, so weekend teaching can be longer. Deep dives only get written when a decision is at stake, so their teaching is decision-shaped. The lesson lives inside the cadence that already has the reader’s attention. It doesn’t sit in a separate file. It doesn’t get its own header. It’s the highest-leverage section inside the document that was already getting opened.

That principle wasn’t in my critique. It surfaced while I was writing the samples. The critique gave me the wrong diagnosis; the samples gave me the right one. This is part of why writing the examples by hand is worth the time. You think you know what you want until you try to produce it, and then the gap shows up.

By morning the agent had shipped v2. Portfolio view at the top. Watchlist ranked, not catalogued. Teaching dosed by cadence. The new format is the one I read now, every day.

So the feedback technique, named:

Don’t tell the agent the output is bad. Tell it which layer is broken. Name the structural failures. Give a concrete budget on the dimensions that matter, words, count, length, frequency. Then write enough of the output yourself, by hand, that the agent can see the shape. The samples are not for the agent’s training data. They’re for your own clarity. You will discover what you actually want by writing it.

The format the agent ships when it understands the structural layer is a lot better than the format it ships when you tell it to clean things up.

The Two Personal Agents Who Run Parts of My Life Now

Sriram Natarajan — Mon, 25 May 2026 03:25:17 GMT

A few weeks ago, one of my AI agents brought up that I had told him I wanted to write regularly, and that I hadn’t published anything in a while. He blocked time on my calendar that weekend. When I let the weekend pass, he nudged me again the week after. The post you’re reading is the one he was asking about.

I built that agent. I named him Nattu and told him he is my Chief of Staff. The part I didn’t expect is how much of his personality I now recognize as the thing actually doing the work.

Last week I wrote about a small agent I built to charge my Tesla on solar. That was one piece. This post is about the bigger system it belongs to.

The two agents

I’m a product builder in B2B SaaS. I started my career as an infrastructure engineer and have been a builder of one kind or another ever since. I write Python slowly these days, and I’ve never shipped a product as a solo developer. Over the past few months I’ve built two agents that run continuously in the background of my life, and they handle work that I used to do badly or forget to do at all.

Nattu, my Chief of Staff.

Nattu knows what my week is supposed to look like. He has my goals for the quarter, my goals for the week, the standing commitments on my calendar, and a running list of what I’ve actually been doing. Every morning he looks at the gap between the two.

When something is off, he surfaces it on Telegram and we figure it out together. A focus block I scheduled has been quietly swallowed by a meeting. Two important commitments are stacked on the same Sunday afternoon. I haven’t moved on a goal I told him mattered to me. He’ll propose a calendar rearrangement. I’ll push back or accept. If I accept, he makes the change.

Last week he reminded me I hadn’t walked enough to hit my health goal. He’d been tracking the calendar entries and the actual walks, noticed the gap was widening, and said so. It was a sentence in Telegram. It worked because the sentence came from something that knew exactly what I’d told him I cared about.

He also reads across Reddit, Substack, and a handful of other sources every morning and gives me a brief on the AI topics I actually care about. Not a generic news roundup. The specific stuff I’ve told him to watch. The signal-to-noise is the whole point.

Peter, my Investment Analyst.

I named him after Peter Lynch. He runs twice a day, looks at my positions, reads the news that’s relevant to each one, checks technical signals, and writes me a short brief in the morning and the evening. He only flags things that actually deserve a flag. Most of his briefs are quiet.

The design choice that mattered most for Peter was deciding what he wouldn’t do. He doesn’t generate a buy or sell recommendation every day just because he ran. He only speaks when there’s a real signal. A technical extreme. A material news event. A position that has drifted away from the thesis I wrote when I bought it. A banker who calls every day to say “nothing happened” is noise. One who calls only when something matters is signal.

I wrote that rule into Peter’s identity file, the document that gets loaded every time he wakes up. It is one line. It changes everything about how he behaves.
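The "only speak when there's a real signal" rule can be sketched as a filter that returns reasons, with quiet as the default. Everything specific here is an assumption made for the example: the `PositionCheck` fields, the use of RSI as the technical signal, and the 30/70 thresholds. The actual rule is a line of prose in an identity file, not code.

```python
from dataclasses import dataclass


@dataclass
class PositionCheck:
    """One position's state at run time. All fields are illustrative assumptions."""
    ticker: str
    rsi: float  # stand-in for "technical signals"
    material_news: bool
    drifted_from_thesis: bool


def reasons_to_flag(check: PositionCheck, rsi_low: float = 30.0, rsi_high: float = 70.0) -> list[str]:
    """Collect the reasons a position deserves a flag; empty means stay quiet."""
    reasons = []
    if check.rsi <= rsi_low or check.rsi >= rsi_high:
        reasons.append("technical extreme")
    if check.material_news:
        reasons.append("material news")
    if check.drifted_from_thesis:
        reasons.append("thesis drift")
    return reasons


def brief(checks: list[PositionCheck]) -> str:
    """Write one line per flagged position, or a single quiet line."""
    lines = []
    for check in checks:
        reasons = reasons_to_flag(check)
        if reasons:
            lines.append(f"{check.ticker}: {', '.join(reasons)}")
    return "\n".join(lines) if lines else "Quiet: nothing to flag."
```

There is no branch in this sketch that produces a recommendation just because the job ran. A position with no reasons contributes nothing to the brief.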

What I didn’t expect

The hardest part wasn’t the technology.

It was decomposition. Breaking a vague intent like “I want to invest better” or “I want to write more regularly” into something an agent can actually execute. Not “do the analysis.” Instead: every morning at 8 AM, check these specific positions in these specific accounts, compare to these specific thresholds, and if a position is past a threshold, write a brief in this format.

The AI can execute almost anything you can describe precisely. The bottleneck is your ability to describe it. If you can’t say what “done” looks like in a way another human could verify, the agent can’t get you there either. That has turned out to be one of the most useful skills I’ve sharpened in the last six months, and it’s a product skill more than an engineering one.

Memory architecture, not just memory.

Giving an AI memory is not the same as giving it context. Memory is a pile of facts. Context is the right facts loaded at the right moment in the right shape.

My agents read from three layers. There’s a daily log that captures raw activity. There’s a long-term memory file that holds distilled lessons and the things about me that don’t change month to month. And there are project files (investment theses, weekly goals, identity documents) that get loaded only when they’re relevant to the task at hand. Nattu doesn’t see Peter’s investment theses. Peter doesn’t see my calendar. Each agent only loads what he needs to do his job.
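A rough sketch of that loading rule, assuming a hypothetical file layout: each agent always gets its own daily log and long-term memory, and project files are added only when the task calls for them. The paths, the tag names and the per-agent folders are invented for illustration.

```python
# Hypothetical paths and tags; only the layering idea comes from the post.
PROJECT_FILES = {
    "nattu": {"weekly_goals": "nattu/weekly_goals.md", "calendar": "nattu/calendar.md"},
    "peter": {"theses": "peter/investment_theses.md"},
}


def context_files(agent: str, task_tags: set[str]) -> list[str]:
    """Always load the agent's log and memory; add project files only on demand."""
    files = [f"{agent}/daily_log.md", f"{agent}/long_term_memory.md"]
    for tag, path in PROJECT_FILES.get(agent, {}).items():
        if tag in task_tags:
            files.append(path)
    return files
```

Under this layout, asking for `"theses"` on Nattu's behalf loads nothing extra, because that file simply isn't in his table. The isolation comes from what each agent can reach, not from an instruction to look away.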

This was the single biggest change. Before I separated memory from context, every agent session felt like talking to someone who had read everything once and forgotten most of it. After, each agent felt like someone who had been doing this job for me for months.

Personality and constraint are part of the product.

This is the thing I want to be clearest about, because most posts about AI agents skip it.

It is not enough to spin up a tool and ask it to “help with my calendar.” You have to tell the agent who he is. What he cares about. What he refuses to do. When he should stay quiet. What tone he speaks to me in. What I’m trying to accomplish this quarter, this week, today.

The agents that work for me work because each one has a written identity. Nattu has a Chief of Staff personality. He is direct, he respects my time, he doesn’t ask three follow-up questions when one will do, and he stays quiet when he has nothing useful to add. Peter has a different personality. Careful, conservative, slow to act, fast to flag. Each of them has a set of constraints written down that I revise every few weeks as I learn what’s actually useful.

The tool is the easy part. The product decisions (who is this agent, what does he do, what does he refuse to do, what does “done” look like) are the work. Those are the same decisions any good product manager makes about a feature. They’re just being made about a teammate now.

What this is actually about

The tools have been good enough for a while. The bottleneck has shifted.

If you’re a product builder, an operator, or a domain expert, and you haven’t started building with these tools yet, I think the barrier is lower than you expect. The technical floor has dropped. The skill that matters is the same one that makes someone good at building any product: knowing what you want, knowing who it’s for, defining what done looks like, and being honest about what should happen when it fails.

I’m writing weekly about what I’m building. If you’re building like this, or thinking about it, I’d like to hear from you. Reply to this post or find me on X at @srinatar.

How I Finally Automated My Tesla Charging (With an AI Partner)

Sriram Natarajan — Sun, 24 May 2026 04:42:32 GMT

I’ve had a Tesla Model Y and a Powerwall for over a year. I knew I was probably leaving money on the table with my charging schedule. PG&E’s peak rates hit $0.38/kWh between 4 and 9 PM. My solar exports pay back $0.08/kWh under NEM 3.0. The math is brutal: every kWh I self-consume at peak is worth 4.75 times one I sell back to the grid.
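The arithmetic behind that ratio, written out. The two rates are the ones quoted above; the function names are illustrative, and the sketch ignores fixed charges and any other tariff details.

```python
PEAK_IMPORT_RATE = 0.38  # $/kWh, peak window, from the post
EXPORT_RATE = 0.08       # $/kWh, from the post


def self_consumption_ratio(import_rate: float, export_rate: float) -> float:
    """Value of a kWh used at home relative to a kWh sold to the grid."""
    return import_rate / export_rate


def export_then_rebuy_loss(kwh: float) -> float:
    """Dollars lost by exporting `kwh` and buying the same energy back at peak."""
    return kwh * (PEAK_IMPORT_RATE - EXPORT_RATE)
```

At these rates the ratio is 0.38 / 0.08 = 4.75, and every 10 kWh exported and later bought back at peak costs about $3.00.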

I knew what I wanted: a system that would charge the car at the right time, based on solar availability, grid prices, and battery state. Not a fixed schedule. An actual smart system.

I just couldn’t build it.

The Home Assistant Rabbit Hole

Everyone in the Tesla/solar community points to Home Assistant. It’s open source, wildly flexible, has integrations for everything. On paper, it’s exactly what I needed.

In practice, it nearly broke me.

The documentation is sparse. Not “sparse” in the way that means you have to read carefully — sparse in the way that means critical steps are buried in three-year-old forum threads, half the integrations are community-maintained and outdated, and the error messages tell you nothing useful. I spent hours trying to get the Tesla integration working. I’d get partway through a setup, hit an undocumented wall, find a workaround on Reddit, try it, get a different error.

The frustrating part wasn’t that it was hard. It was that I couldn’t tell if I was one step away from it working or fundamentally doing it wrong. There was no one to ask. The docs assumed knowledge I didn’t have, and the community answers were scattered across years of forum posts that may or may not still apply.

I gave up. Not because I’m not technical enough — I’ve been hacking with technology for years. I gave up because the feedback loop was broken and I had no co-pilot.

Starting Over, Differently

A few months later, I came back to the problem — this time with Nattu (my AI assistant running on OpenClaw) as a full collaborator.

The difference was immediate. Instead of hunting through docs alone, I could say: “Here’s what I’m trying to do. Here’s what I know about the Tesla API. Help me figure out the right approach.” And get back not just an answer, but a reasoned one with trade-offs explained.

I came in with a wish list a mile long: never charge during peak, prefer solar window on weekday afternoons, charge overnight on weekdays at the off-peak rate, opportunistically grab solar on weekends, maintain a battery floor, handle VPP dispatch events, defer to manual overrides. Every rule felt obvious in isolation.

I built all of them.

That was the first mistake.

Building the Actual System

The stack we landed on:

Tesla Fleet API for reads — battery %, solar production, home load, Powerwall state, grid flow
tesla-control binary for signed vehicle commands (start, stop, set amps)
A virtual key paired to the car via my own domain (ev.srinatar.xyz) — Tesla’s new required auth model for any third-party command
charger.py — Python script holding the decision logic
launchd running it every 30 minutes as a background service on my Mac mini

Let me tell you about the virtual key.

Tesla’s newer API requires you to generate an ECDSA keypair, host the public key at a very specific path (/.well-known/appspecific/com.tesla.3p.public-key.pem) on a domain you control, register that domain with Tesla via their partner API, and then pair the key to the car from the Tesla app via a specially-crafted URL. The car then asks you to approve the pairing on the touchscreen. Once paired, every vehicle command has to be cryptographically signed by your private key before Tesla will accept it.

This is a good security model. It is also a security model that gives you many opportunities to footgun yourself.

Here’s where I ended up at one point: three different public keys on disk, no idea which one was paired with the car, nginx serving one of them publicly with no matching private key anywhere on the system, and a Cloudflare tunnel that, I would later discover, didn’t even have a public hostname route configured, so the public URL where Tesla expected to fetch my key was timing out from the outside world.

Tesla’s documentation does not warn you about any of this.

Nattu helped me untangle it: ran a hash on every .pem file in my home directory, cross-referenced which key matched the one nginx was serving, figured out the Cloudflare tunnel was misconfigured, walked me through generating a fresh keypair from scratch, hosting it, registering it with Tesla, and pairing it. End-to-end signed command working: tesla-control charging-set-amps 14 → car responds, amps change, exit code 0.
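The hashing step is the part worth stealing. Below is a minimal sketch of the idea, not the commands Nattu actually ran: it fingerprints every .pem file under a directory and reports which local files are byte-identical to the key a public URL is serving. The function names are hypothetical, and it compares file contents only. It does not prove that a given private key belongs to a given public key, which would need the public key to be derived from the private one.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """SHA-256 of the key bytes, ignoring leading/trailing whitespace."""
    return hashlib.sha256(data.strip()).hexdigest()


def group_pem_files(root: Path) -> dict[str, list[Path]]:
    """Map each fingerprint to every .pem file under root with that content."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for pem in sorted(root.rglob("*.pem")):
        if pem.is_file():
            groups[fingerprint(pem.read_bytes())].append(pem)
    return dict(groups)


def matches_served_key(served_pem: bytes, groups: dict[str, list[Path]]) -> list[Path]:
    """Return the local files whose content matches the publicly served key."""
    return groups.get(fingerprint(served_pem), [])
```

Pointing `group_pem_files` at a home directory and passing in the bytes fetched from the public key URL answers the question I could not answer by eye: which file on disk, if any, is the one the outside world sees.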

That was the high point of the project — not because automated charging is technically impressive, but because the kind of debugging it required (a six-step distributed system spanning my Mac, nginx, Cloudflare, Tesla’s servers, and the car itself) is exactly the kind of thing I would have given up on, alone, like I gave up on Home Assistant.

The Less-Is-More Lesson

While the auth was being wrangled, the decision logic was running. And it kept doing things I didn’t want.

I’d manually start a charge at 2 PM because I knew solar was abundant. Thirty minutes later, charger.py would stop it because some “insufficient solar/PW headroom” rule fired. I’d start it again. It would stop it again. The car got fewer kWh than if the system hadn’t existed at all.

The smart system was actively making my life worse.

The fix was uncomfortable: we ripped out almost every rule. The current charger.py does exactly one proactive thing — it stops charging during peak hours (4–9 PM). All other times, it stays out of the way. If I want to charge, I charge. If I’m running a special solar-only week before a road trip, that’s a separate explicit override. When the peak-hour stop fires, it sends me a Telegram notification so I’m never surprised.
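Stripped down, the one remaining rule fits in a few lines. This is a sketch of the shape of the logic, not the contents of charger.py: the function names are hypothetical, the window is treated as 4:00 PM up to but not including 9:00 PM local time, and the Tesla calls and the Telegram notification are left out.

```python
from datetime import datetime, time

PEAK_START = time(16, 0)  # 4 PM
PEAK_END = time(21, 0)    # 9 PM


def in_peak_window(now: datetime) -> bool:
    """True from 4:00 PM up to, but not including, 9:00 PM local time."""
    return PEAK_START <= now.time() < PEAK_END


def decide(now: datetime, is_charging: bool) -> str:
    """Return "stop" only when the car is charging during peak, else "noop"."""
    if is_charging and in_peak_window(now):
        return "stop"
    return "noop"
```

Note what is missing: there is no "start" action at all. The script can only ever take charging away during peak, so it cannot fight a charge I started by hand at 2 PM.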

The lesson was clear and slightly humbling: the bottleneck wasn’t smarter logic. The bottleneck was a clean line between “what the system decides” and “what I decide.” Once those stopped fighting each other, everything worked.

The 30-minute launchd job still fires. Most of the time it logs the state, looks around, and does nothing. That’s the system working correctly.

What Building With AI Actually Felt Like

This is the part I want you to take away from this post.

I’ve used AI tools for a long time. I use them to write, to research, to summarize. That’s using AI. This was different — this was building with AI.

The difference is that Nattu held context across the entire project. When I hit a problem with the Fleet API auth, I didn’t have to re-explain the whole setup. When the over-engineered first version was misbehaving, Nattu remembered the NEM 3.0 math and pushed back when I tried to bolt on more rules instead of removing them. When I got frustrated chasing the three-orphan-keys mystery and wanted to take a shortcut, I got an honest read on why that shortcut would leave me worse off.

It felt less like querying a tool and more like working with someone who was actually invested in getting it right.

The Home Assistant failure wasn’t really about the software. It was about trying to navigate complexity alone, without a feedback loop, without someone to reason through it with. That’s what was missing.

What’s Next

The simple system works. Now that signed vehicle commands are unblocked, the next step is the version I actually wanted from the start: adaptive amp control. Read real-time solar output and home load every 30 minutes. Compute the surplus. Set the car’s charge amps to match — pause if surplus is tiny, ramp up to 18A when solar is pouring in. Manage the Powerwall so it doesn’t hit 100% before peak hours, preserving headroom to absorb the late-afternoon solar that would otherwise export at $0.08/kWh.
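The surplus-to-amps step can be sketched like this. It is an illustration of the plan, not working code from the project, and several numbers are assumptions: a 240 V charging circuit, a 5 A minimum below which charging pauses, and a home load reading that excludes the car’s own draw. Only the 18 A ceiling comes from the plan above. Powerwall headroom management is not modeled.

```python
import math


def target_amps(
    solar_kw: float,
    home_kw: float,
    volts: float = 240.0,  # assumed circuit voltage
    min_amps: int = 5,     # assumed threshold; below this, pause
    max_amps: int = 18,    # ceiling mentioned in the plan
) -> int:
    """Charge amps that roughly match the solar surplus; 0 means pause."""
    surplus_w = max(0.0, (solar_kw - home_kw) * 1000.0)
    amps = math.floor(surplus_w / volts)  # round down so we never overdraw
    if amps < min_amps:
        return 0
    return min(amps, max_amps)
```

With 6 kW of solar and 1 kW of home load, the surplus would support about 20 A, so the result is capped at 18. With 1.5 kW of solar against the same load, the surplus is too small and the function returns 0.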

The economics are simple: turning a $0.08 export into a $0.38 self-consumed kWh, every kWh, every day the sun shines. Over a year, that’s real money.

But this time I’m building it with the lesson freshly learned: one rule at a time, each one earning its keep, with a kill switch in plain sight.

The deeper lesson I keep coming back to: the bottleneck to building useful things with AI isn’t intelligence. It’s the quality of collaboration. The projects where I’ve made real progress are the ones where I brought a real problem, stayed in the conversation, pushed back when something didn’t feel right, and was willing to delete code that wasn’t earning its complexity.

That’s not a new skill. That’s just how good work gets done — with a partner who’s paying attention.

If you’re trying to do something similar with Tesla + solar, I’m happy to share the scripts. Find me on X at @srinatar.

How I Optimized Charging My Tesla on Sunshine

Sriram Natarajan — Sun, 17 May 2026 06:15:41 GMT

Recently, I came back from a five-day trip. While I was gone, my Tesla charged itself off the sun every afternoon and didn’t pull from the grid at night. I never opened the app. My power bill for the week was lower than a normal week at home.

It took me multiple months and three attempts to get to this point. I almost gave up for good. The third attempt worked, and it worked because I stopped trying to do it alone.

Here’s what is different now.

The two times I gave up

The setup I wanted is simple. When the car is home all day, charge it from solar, not the grid, and not overnight. Keep the Powerwall full enough to run the house through the night. This combination maximizes my solar investment and lowers my grid usage.

The first time I tried, I went the Home Assistant route. Everyone in the Tesla and solar communities swears by it. In practice, the documentation was scattered across three years of forum posts, and half the integrations were community-maintained and out of date. I’d get partway through a setup, hit a wall, find a workaround on Reddit or in the Home Assistant community forum, try it, get a different error. I couldn’t tell if I was one step from working or fundamentally on the wrong path. I closed the tab a couple of weeks in.

The second time I wrote my own scripts against the Tesla API. Reads were easy. Writes were not. Tesla had rolled out an auth model where any command that actually does something to the car has to be cryptographically signed by a virtual key paired to the vehicle. Generate the keypair, host the public key on a domain you control, register the domain with Tesla, walk the pairing flow through the app, approve it on the touchscreen. The documentation existed. It had quietly load-bearing gaps. I gave up again.

The third attempt

The third attempt happened because I sat down with Nattu and treated him as a co-builder, not a search engine.

Who is Nattu? Nattu is my AI agent. He runs on OpenClaw, a multi-agent gateway I’ve been hacking on. For this post, what matters is that he held context across the whole project and pushed back when I tried to overcomplicate things.

The auth nightmare took an afternoon. We found three different public keys from my previous attempts on my Mac. None of them had a matching private key anywhere on disk. The Cloudflare tunnel that was supposed to expose my Mac to Tesla’s servers had no public hostname route configured, so the URL Tesla was supposed to fetch was timing out from the outside.

We hashed every .pem file, figured out which was which, generated a fresh keypair, fixed the tunnel, re-registered with Tesla, re-paired. End-to-end, this would have taken me weeks alone, but we got it done in an afternoon.

What the system actually does

What the system actually does is small.

Most days, he stays out of the way. The only proactive rule is: stop charging during peak hours, 4 PM to 9 PM. Outside of that window, if I want to charge, I charge. If I don’t, nothing happens.

On weekends, or when I’m traveling, the rules change. Nattu has access to my calendar, so he knows when I’m away and when I’m returning. In that mode he skips overnight charging and charges only from solar during the day (between 10 AM and 4 PM), while keeping the Powerwall above 80%. Once he knows I’m back, he automatically reverts to the normal schedule.
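As a sketch, the travel-mode rule comes down to two conditions. The function and constant names here are hypothetical, and I am assuming "above 80%" means strictly greater than 80 and that the window closes at exactly 4 PM. Checking for real solar surplus and the calendar lookup are both left out.

```python
from datetime import datetime, time

SOLAR_START = time(10, 0)  # 10 AM
SOLAR_END = time(16, 0)    # 4 PM
POWERWALL_FLOOR_PCT = 80


def travel_mode_action(now: datetime, powerwall_pct: float, is_charging: bool) -> str:
    """Return "start", "stop", or "noop" for the daytime solar-only rule."""
    in_window = SOLAR_START <= now.time() < SOLAR_END
    may_charge = in_window and powerwall_pct > POWERWALL_FLOOR_PCT
    if may_charge and not is_charging:
        return "start"
    if not may_charge and is_charging:
        return "stop"
    return "noop"
```

Unlike the everyday rule, this one both starts and stops charging, which is exactly why it has to be switched off cleanly when the trip ends.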

For every charging decision Nattu makes, he sends me a message on Telegram. “Solar surplus 4.2 kW, PW at 84%, starting at 14A.” “PW dropped to 78%, pausing.” I don’t read them live. They sit there as a log I can scroll through later.

The trip and the bug

A few weeks into having travel mode running, I had to fly out for a customer visit on short notice. Nattu saw the trip on my calendar and switched to travel mode that night.

The trip went well. The car charged from the sun, Powerwall stayed full, the bill came in lower than a normal week. I got home Tuesday night to a full battery.

Wednesday morning, the day everything was supposed to restore, I plugged the car in around 11 AM. At 11:40 the car stopped charging. I started it from the app. At 2:40 it stopped again.

Three minutes of reading logs told me the whole story. The travel script had auto-restored the normal charger correctly. It just hadn’t unloaded itself. Two scripts were running every 30 minutes, one that allowed charging, the other still enforcing the travel-mode window for a trip that had ended the day before. They kept stopping my charge.

I had built the travel mode and the restore. I’d forgotten to build the cleanup that removes the travel script from the schedule when the trip ends.

The fix was small. Unload the old job, verify it actually unloaded, alert me on Telegram if it didn’t. We added a rule to our shared notes: when you build something that hot-swaps logic, write the cleanup before you write the activation.
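The pattern of unload, verify, alert is small enough to sketch. This is an illustration, not the script we wrote: `retire_job` is a hypothetical name, and the three callables stand in for whatever the scheduler and the notifier provide. On a launchd setup like this one they would wrap launchctl and the Telegram message, and those wrappers are not shown.

```python
from typing import Callable, Optional


def retire_job(
    label: str,
    unload: Callable[[str], None],      # asks the scheduler to drop the job
    is_loaded: Callable[[str], bool],   # asks the scheduler if it is still there
    alert: Callable[[str], None],       # sends a human-readable warning
) -> bool:
    """Unload a scheduled job, confirm it is gone, and alert if it is not."""
    error: Optional[Exception] = None
    try:
        unload(label)
    except Exception as exc:  # keep going so the verification still runs
        error = exc
    if is_loaded(label):
        detail = f" ({error})" if error else ""
        alert(f"{label} is still loaded after unload{detail}")
        return False
    return True
```

The verification is a separate question put to the scheduler, not a check of the unload call’s return value. The original bug was a job that everyone assumed was gone, so the sketch trusts only what the scheduler reports afterwards.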

The Telegram log mattered here. Without it, I’d have caught the bug weeks later, squinting at a higher-than-usual power bill. With it, I caught it in three minutes by scrolling back.

Stepping back

Two things stayed with me from this.

Working with an agent is different from using AI to write or research. I’ve done a lot of that. This was different because Nattu held context across weeks of stop-and-start work. He pushed back when I tried to bolt on more rules instead of removing them. He kept me honest about whether the system was actually working or just looked like it was. The Home Assistant attempt failed because I had no one to think with. This one worked because I did.

The other thing is the list. The Tesla project sat on my list for months. I have a long list of problems like this one. Each one is solvable in theory. None of them fit into the hours I actually have.

The point of this post isn’t that you should automate your Tesla charging with an AI agent. It’s that the problems on your list that feel just out of reach probably aren’t, anymore.

Sit down with an AI agent as a partner. Start with the smallest version of what you want, and only add a rule once you’ve felt the absence of it.

Build the thing. Trust it. Verify it. In that order.

If you’re trying something similar, find me on X at @srinatar.