This post was written under Evan Hubinger's direct guidance and mentorship, as a part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (MATS) program. Many additional thanks to Steve Byrnes and Adam Shimi for their helpful feedback on earlier drafts of this post.

TL;DR: Steve Byrnes has done really exciting work at the intersection of neuroscience and alignment theory. He argues that because we're probably going to end up at some point with an AGI whose subparts at least superficially resemble those of the brain (a value function, a world model, etc.), it's really important for alignment to proactively understand how the many ML-like algorithms in the brain actually do their thing. I build off of Steve's framework in the second half of this post: first, I discuss why it would be worthwhile to understand the computations that underlie theory of mind + affective empathy. Second, I introduce the problem of self-referential misalignment, which is essentially the worry that initially aligned ML systems with the capacity to model their own values could assign second-order values to these models that ultimately result in contradictory, and thus misaligned, behavioral policies. (A simple example of this general phenomenon in humans: Jack hates reading fiction, but Jack wants to be the kind of guy who likes reading fiction, so he forces himself to read fiction.)

Introduction

In this post, my goal is to distill and expand upon some of Steve Byrnes's thinking on AGI safety.