
Personal Philosophy

I recently read Harry Potter and the Methods of Rationality by Eliezer Yudkowsky.

This made me want to write down my personal philosophy. It's a work in progress and subject to change.


Core Framework

As best I can tell, I'm a consequentialist with modifications. Expected value reasoning feels right for major decisions, but I notice I operate within constraints and thresholds that pure utilitarianism doesn't require. I'm still figuring out why.

Some Tweaks

Epistemic integrity requirements. I find myself distrusting confidence that outpaces the underlying reasoning. When someone gives a narrow p(doom) estimate, my instinct isn't "you're wrong" but "I'm not sure the confidence your answer implies is justified by the work you did to arrive at it." I'd rather say "5-25%, depending on assumptions I can't pin down" than perform false precision. This feels like more than a methodological preference to me, though I'm not certain.

Agency commitments. I think individuals should be able to specify how AI systems behave on their behalf, even if their specifications aren't globally optimal. Luthien isn't built on "we know what AI should do." It's built on "users should be able to define their own constraints and verify they're enforced." This seems like a preference for preserving human autonomy layered on consequentialism.
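The shape of this idea can be sketched in code. To be clear, this is a hypothetical illustration, not Luthien's actual implementation; every name here (`Policy`, `override`, `permits`) is invented for the sketch. The point is the layering: sensible defaults where the user is silent, explicit user overrides that win, and an audit trail so enforcement can be verified after the fact.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: user-defined constraints layered over defaults.
# None of these names come from Luthien's real codebase.

@dataclass
class Policy:
    # Defaults the user can override, but never silently lose.
    rules: dict = field(default_factory=lambda: {
        "allow_shell_commands": False,
        "allow_network": False,
    })
    log: list = field(default_factory=list)

    def override(self, key: str, value: bool) -> None:
        """User explicitly changes a default; the change is recorded."""
        self.log.append(("override", key, self.rules.get(key), value))
        self.rules[key] = value

    def permits(self, action: str) -> bool:
        """Check an action against current rules and record the check."""
        allowed = self.rules.get(f"allow_{action}", False)
        self.log.append(("check", action, allowed))
        return allowed

policy = Policy()
policy.override("allow_network", True)      # user-specified constraint
print(policy.permits("network"))            # user's choice wins
print(policy.permits("shell_commands"))     # default still holds
print(policy.log)                           # audit trail: verify enforcement
```

The design choice worth noticing is that the log is part of the policy object itself: "verify they're enforced" only works if the system records both the user's overrides and every check it performed.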

Threshold effects for moral demands. On animal welfare, for example: I'm not confident enough in the moral weights to restructure my life around them. A pure consequentialist might say if the EV calculation favors veganism, do it regardless of confidence. I seem to apply a threshold: moral demands need to clear an epistemic bar before they bind me. I'm not sure if this is principled or just convenient.

Animal welfare, real talk. I think eating less meat is probably the right call. The personal cost is manageable. I admire people who go all the way. I mostly haven't, largely due to personal convenience 🙈😔. I hope others figure this one out, or at least make it easier to take a sensible middle path. I'm working on reducing the inhumanely produced animal products in my diet and reducing my cholesterol, but it's a work in progress.

Moral judgment. I try not to judge others. I don't know their actual options, constraints, or what they knew at the time. For example, the American founders embedded slavery in the Constitution, which seems bad. However, the alternative was separate nations, and an independent South probably keeps slavery longer. I don't know if the compromise was worth it, but it's certainly not clear-cut. I try to extend the same grace to myself: strive for improvement without being too self-critical.


The Wittgenstein Thread

I was drawn to Wittgenstein in college philosophy. Looking back, his later work seems to map onto my thinking more than I initially recognized, though I may be pattern-matching too eagerly.

Meaning as use. Wittgenstein argued that meaning isn't found in abstract definitions but in practical application, in how words function in actual "language games." This resonates with my instinct to dive deep but surface to ask "so what?" I'm not that interested in learning for its own sake; I want to know how concepts cash out in practice.

Dissolving false problems. Wittgenstein believed many philosophical problems aren't solved but dissolved. They arise from misusing language or asking malformed questions. I wonder if alignment is similar: "what should AI value?" may be philosophically unsolvable as posed, but "can users specify constraints and verify enforcement?" feels more tractable.

Family resemblance over rigid categories. Wittgenstein rejected the idea that categories have essential definitions. Instead, members share overlapping similarities, like family resemblances, without a single common thread. This matches my sense that the world is messier than clean categories suggest.

Against false precision. Wittgenstein critiqued philosophers who sought more precision than their subject matter allowed. This captures something about my discomfort with narrow p(doom) ranges.


Some Examples

Career choice: I left a 9-year Principal PM career at Amazon to co-found an AI safety startup. The decision logic was consequentialist (highest expected impact), but working directly on the problem matters to me, not just funding it. There's something about agency and direct contribution that I value.

P(doom) and timeline questions: I think the better question is: does your p(doom) clear the threshold that would cause you to work on AI safety full time? Humans are bad at reasoning about small numbers. A p(doom) of 0.01% should still be motivating if you take it seriously. I'd rather focus on that than on false precision about ranges.
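The arithmetic behind "small numbers should still move you" is worth making concrete. The figures below are toy numbers purely for illustration, not an actual risk estimate:

```python
# Toy expected-value calculation: even a 'negligible-sounding' p(doom)
# implies an enormous expected loss when the stakes are everyone.
p_doom = 0.0001              # 0.01%, the small-sounding estimate
people = 8_000_000_000       # rough world population

expected_loss = p_doom * people
print(f"{expected_loss:,.0f} expected lives")  # 800,000 -- hardly negligible
```

This is the sense in which the threshold question dominates the precision question: whether your estimate is 0.01% or 10%, both sit far above any plausible bar for "worth taking seriously."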

AI agents "lying" and "cheating": In Luthien's taxonomy, we categorize AI failures using moral language, not just optimization language. An agent that says one thing and does another isn't just making errors; it feels like it's violating something. Trust, honesty norms, the duty to be what you present yourself as. I'm not sure if this framing is philosophically defensible, but it's how I think about it.


The Synthesis

If I had to summarize: I seem to be a consequentialist who takes expected value seriously for big decisions, but operates with:

  1. Epistemic integrity requirements: try not to claim more confidence than you have
  2. Agency commitments: individuals get to specify their own values
  3. Threshold effects: uncertain moral demands don't fully bind
  4. Libertarian paternalist design philosophy: good defaults, user override, transparency

This framework seems internally consistent and explains my actual choices, at least as it looks from the inside. It's not pure anything, but hopefully coherent.


So What?

I aspire for my work in AI safety and at Luthien to embody these ideas:

Epistemic humility: We're not claiming to know what AI should value. We're saying: here are predictable failure modes, here are defaults that catch them, here's how to customize.

Agency preservation: Users specify their own constraints. The system is choice architecture, not value imposition.

Honesty as something that matters: AI "lying" (inconsistency between stated intent and actual behavior) is a first-class failure mode. I'm not certain whether this is because deception produces bad outcomes or because it's specifically wrong. Probably both.

Libertarian paternalism: Good defaults, easy override, transparency about the architecture. Nudge over ban. Enable over mandate.