The Geometry of Concepts: Sparse Autoencoder Feature Structure
Abstract
1. Introduction
2. Related Work
3. “Atom”-Scale: Crystal Structure
4. “Brain”-Scale: Meso-Scale Modular Structure
- While we can cluster features based on whether they co-occur, we can also perform spectral clustering based on the cosine similarity between SAE feature decoder vectors. So instead of feature affinity values being, e.g., their co-occurrence phi coefficient, affinity matrix values are instead computed simply from feature geometry as the cosine similarity between decoder vectors, A_ij = d_i · d_j / (‖d_i‖ ‖d_j‖). Given a clustering of SAE features using cosine similarity and a clustering using co-occurrence, we compute the mutual information between these two sets of labels. In some sense, this measures the amount of information about geometric structure that one obtains from knowing functional structure. We report the adjusted mutual information [52] as implemented by scikit-learn [50], which corrects for chance agreements between the clusters.
- Another conceptually simple approach is to train models to predict which functional lobe a feature belongs to from its geometry. To accomplish this, we take a given set of lobe labels from our co-occurrence-based clustering and train a logistic regression model to predict these labels directly from the point positions, using an 80-20 train–test split and reporting the balanced test accuracy of this classifier. A minimal sketch of both comparisons appears after this list.
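The following is a minimal sketch, not the authors' released code, of the two comparisons described above. It assumes `decoder_dirs` is an (n_features, d_model) array of unit-normalized SAE decoder vectors and `cooc_labels` is an array of lobe labels from a co-occurrence-based clustering; both names, and the shift of cosine similarity into a non-negative affinity, are illustrative assumptions.

```python
# Sketch: compare geometric and functional clusterings of SAE features.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_mutual_info_score, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def geometry_vs_function(decoder_dirs, cooc_labels, n_lobes):
    # Affinity from geometry alone: cosine similarity between decoder vectors,
    # shifted to [0, 1] because spectral clustering expects non-negative affinities.
    cos_sim = decoder_dirs @ decoder_dirs.T
    affinity = (cos_sim + 1.0) / 2.0

    geo_labels = SpectralClustering(
        n_clusters=n_lobes, affinity="precomputed", random_state=0
    ).fit_predict(affinity)

    # Adjusted mutual information between the geometric clustering and the
    # co-occurrence ("functional") clustering, corrected for chance agreement.
    ami = adjusted_mutual_info_score(cooc_labels, geo_labels)

    # Predict the functional lobe directly from geometry: logistic regression
    # on decoder directions, 80-20 split, balanced test accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(
        decoder_dirs, cooc_labels, test_size=0.2, random_state=0, stratify=cooc_labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
    return ami, bal_acc
```

Higher AMI or balanced accuracy than a label-shuffled baseline would indicate that functional (co-occurrence) lobes are recoverable from feature geometry alone.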
5. “Galaxy”-Scale: Large-Scale Point Cloud Structure
5.1. Shape Analysis
- The eigenvalue spectrum of the point cloud's covariance matrix decays as a power law, rather than following the behavior expected of a random (Wishart) point cloud.
- As shown in Figure 6, this power-law decay is more pronounced for SAE features than for raw activations. A minimal sketch of how such a spectrum and its exponent can be computed follows this list.
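The following is a minimal sketch, under stated assumptions, of one way to compute the covariance eigenvalue spectrum of a feature point cloud and fit a power-law exponent; `points` is assumed to be an (n_points, d) array of SAE decoder directions or raw activations, and the least-squares fit in log-log space is an illustrative choice rather than the authors' exact procedure.

```python
# Sketch: covariance eigenvalue spectrum of a point cloud and a power-law fit
# lambda_k ∝ k^(-alpha), estimated by linear regression in log-log space.
import numpy as np

def eigenvalue_power_law(points):
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / (len(points) - 1)

    # eigvalsh returns ascending eigenvalues; reverse to descending rank order.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = eigvals[eigvals > 0]

    ranks = np.arange(1, len(eigvals) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return eigvals, -slope  # alpha > 0 indicates power-law decay
```

Comparing the fitted exponent for SAE features against raw activations (and against a matched Gaussian point cloud, whose spectrum follows Wishart/Marchenko–Pastur behavior) makes the "steeper than random" claim quantitative.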
5.2. Clustering Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Additional Information on Brain Lobes
Co-Occurrence Measures
Appendix B. Understanding Principal Components in Difference Space
Appendix C. Breaking Down SAE Vectors by PCA Component
References
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar]
- The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 24 March 2025).
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
- Slattery, P.; Saeri, A.K.; Grundy, E.A.; Graham, J.; Noetel, M.; Uuk, R.; Dao, J.; Pour, S.; Casper, S.; Thompson, N. The ai risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv 2024, arXiv:2408.12622. [Google Scholar]
- Sharma, M.; Tong, M.; Korbak, T.; Duvenaud, D.; Askell, A.; Bowman, S.R.; Cheng, N.; Durmus, E.; Hatfield-Dodds, Z.; Johnston, S.R.; et al. Towards understanding sycophancy in language models. arXiv 2023, arXiv:2310.13548. [Google Scholar]
- Park, P.S.; Goldstein, S.; O’Gara, A.; Chen, M.; Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. arXiv 2023, arXiv:2308.14752. [Google Scholar]
- Marks, S.; Treutlein, J.; Bricken, T.; Lindsey, J.; Marcus, J.; Mishra-Sharma, S.; Ziegler, D.; Ameisen, E.; Batson, J.; Belonax, T.; et al. Auditing Language Models for Hidden Objectives. arXiv 2024, arXiv:2503.10965. [Google Scholar]
- Ngo, R.; Chan, L.; Mindermann, S. The alignment problem from a deep learning perspective. arXiv 2022, arXiv:2209.00626. [Google Scholar]
- Bereska, L.; Gavves, E. Mechanistic Interpretability for AI Safety—A Review. arXiv 2024, arXiv:2404.14082. [Google Scholar]
- Sharkey, L.; Chughtai, B.; Batson, J.; Lindsey, J.; Wu, J.; Bushnaq, L.; Goldowsky-Dill, N.; Heimersheim, S.; Ortega, A.; Bloom, J.; et al. Open Problems in Mechanistic Interpretability. arXiv 2025, arXiv:2501.16496. [Google Scholar]
- Huben, R.; Cunningham, H.; Smith, L.R.; Ewart, A.; Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; et al. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread 2023. Available online: https://transformer-circuits.pub/2023/monosemantic-features/index.html (accessed on 24 March 2025).
- Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread 2024. Available online: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html (accessed on 24 March 2025).
- Faruqui, M.; Tsvetkov, Y.; Yogatama, D.; Dyer, C.; Smith, N. Sparse overcomplete word vector representations. arXiv 2015, arXiv:1506.02004. [Google Scholar]
- Zhang, J.; Chen, Y.; Cheung, B.; Olshausen, B.A. Word embedding visualization via dictionary learning. arXiv 2019, arXiv:1910.03833. [Google Scholar]
- Yun, Z.; Chen, Y.; Olshausen, B.A.; LeCun, Y. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. arXiv 2021, arXiv:2103.15949. [Google Scholar]
- Olshausen, B.A.; Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 1996, 381, 607–609. [Google Scholar] [CrossRef]
- Olshausen, B.A.; Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vis. Res. 1997, 37, 3311–3325. [Google Scholar] [CrossRef] [PubMed]
- Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; et al. Toy Models of Superposition. Transformer Circuits Thread 2022. Available online: https://transformer-circuits.pub/2022/toy_model/index.html (accessed on 24 March 2025).
- Park, K.; Choe, Y.J.; Veitch, V. The linear representation hypothesis and the geometry of large language models. arXiv 2023, arXiv:2311.03658. [Google Scholar]
- Olah, C. What is a Linear Representation? What is a Multidimensional Feature? Transformer Circuits Thread 2024. Available online: https://transformer-circuits.pub/2024/july-update/index.html#linear-representations (accessed on 24 March 2025).
- Engels, J.; Liao, I.; Michaud, E.J.; Gurnee, W.; Tegmark, M. Not All Language Model Features Are Linear. arXiv 2024, arXiv:2405.14860. [Google Scholar]
- Lieberum, T.; Rajamanoharan, S.; Conmy, A.; Smith, L.; Sonnerat, N.; Varma, V.; Kramár, J.; Dragan, A.; Shah, R.; Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv 2024, arXiv:2408.05147. [Google Scholar]
- Ansuini, A.; Laio, A.; Macke, J.H.; Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 1–15. [Google Scholar]
- Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. [Google Scholar] [CrossRef]
- Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009; Volume 25. [Google Scholar]
- Rushing, C.; Nanda, N. Explorations of Self-Repair in Language Models. arXiv 2024, arXiv:2402.15390. [Google Scholar]
- Belrose, N.; Furman, Z.; Smith, L.; Halawi, D.; Ostrovsky, I.; McKinney, L.; Biderman, S.; Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv 2023, arXiv:2303.08112. [Google Scholar]
- Conmy, A.; Mavor-Parker, A.; Lynch, A.; Heimersheim, S.; Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. Adv. Neural Inf. Process. Syst. 2023, 36, 16318–16352. [Google Scholar]
- Park, K.; Choe, Y.J.; Jiang, Y.; Veitch, V. The geometry of categorical and hierarchical concepts in large language models. arXiv 2024, arXiv:2406.01506. [Google Scholar]
- Mendel, J. SAE Feature Geometry is Outside the Superposition Hypothesis. AI Alignment Forum 2024. Available online: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis (accessed on 24 March 2025).
- Smith, L. The ‘Strong’ Feature Hypothesis Could be Wrong. AI Alignment Forum 2024. Available online: https://www.lesswrong.com/posts/tojtPCCRpKLSHBdpn/the-strong-feature-hypothesis-could-be-wrong (accessed on 24 March 2025).
- Bussmann, B.; Pearce, M.; Leask, P.; Bloom, J.I.; Sharkey, L.; Nanda, N. Showing SAE Latents Are Not Atomic Using Meta-SAEs. AI Alignment Forum 2024. Available online: https://www.alignmentforum.org/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes (accessed on 24 March 2025).
- Drozd, A.; Gladkova, A.; Matsuoka, S. Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen. In Proceedings of the Coling 2016, the 26th International Conference on Computational Linguistics: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 3519–3530. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
- Ma, L.; Zhang, Y. Using Word2Vec to process big text data. In Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, USA, 29 October–1 November 2015; pp. 2895–2897. [Google Scholar]
- Nanda, N.; Lee, A.; Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. arXiv 2023, arXiv:2309.00941. [Google Scholar]
- Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv 2022, arXiv:2210.13382. [Google Scholar]
- Michaud, E.J.; Liao, I.; Lad, V.; Liu, Z.; Mudide, A.; Loughridge, C.; Guo, Z.C.; Kheirkhah, T.R.; Vukelić, M.; Tegmark, M. Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code. Entropy 2024, 26, 1046. [Google Scholar] [CrossRef]
- Marks, S.; Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv 2023, arXiv:2310.06824. [Google Scholar]
- Gurnee, W.; Tegmark, M. Language models represent space and time. arXiv 2023, arXiv:2310.02207. [Google Scholar]
- Heinzerling, B.; Inui, K. Monotonic representation of numeric properties in language models. arXiv 2024, arXiv:2403.10381. [Google Scholar]
- Todd, E.; Li, M.L.; Sharma, A.S.; Mueller, A.; Wallace, B.C.; Bau, D. Function vectors in large language models. arXiv 2023, arXiv:2310.15213. [Google Scholar]
- Hendel, R.; Geva, M.; Globerson, A. In-context learning creates task vectors. arXiv 2023, arXiv:2310.15916. [Google Scholar]
- Kharlapenko, D.; neverix; Nanda, N.; Conmy, A. Extracting SAE Task features for In-Context Learning. AI Alignment Forum 2024. Available online: https://www.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning (accessed on 24 March 2024).
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in gpt. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372. [Google Scholar]
- Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
- Xanthopoulos, P.; Pardalos, P.M.; Trafalis, T.B. Linear discriminant analysis. In Robust Data Mining; Springer: New York, NY, USA, 2013; pp. 27–33. [Google Scholar]
- Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv 2020, arXiv:2101.00027. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1073–1080. [Google Scholar]
- Mueller, A.; Brinkmann, J.; Li, M.; Marks, S.; Pal, K.; Prakash, N.; Rager, C.; Sankaranarayanan, A.; Sharma, A.S.; Sun, J.; et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv 2024, arXiv:2408.01416. [Google Scholar]
- Olah, C. Transformer Circuits Thread: Interpretability Dreams; An Informal Note on Future Goals for Mechanistic Interpretability. Transformer Circuits Thread 2023. Available online: https://transformer-circuits.pub/2023/interpretability-dreams/index.html (accessed on 24 March 2025).
- Hoel, E.P.; Albantakis, L.; Tononi, G. Quantifying causal emergence shows that macro can beat micro. Proc. Natl. Acad. Sci. USA 2013, 110, 19790–19795. [Google Scholar]
- Kennicutt, R.C., Jr. Star formation in galaxies along the Hubble sequence. Annu. Rev. Astron. Astrophys. 1998, 36, 189–231. [Google Scholar]
- Hubble, E.P. Extragalactic Nebulae. Astrophys. J. 1926, 64, 321–369. [Google Scholar] [CrossRef]
- Kravtsov, A. Dark matter substructure and dwarf galactic satellites. Adv. Astron. 2010, 2010, 281913. [Google Scholar]
- Wishart, J. The generalised product moment distribution in samples from a normal multivariate population. Biometrika 1928, 20, 32–52. [Google Scholar]
- Marchenko, V.; Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Mat. Sb. 1967, 72, 4. [Google Scholar]
- Dasarathy, B.V. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques; IEEE Computer Society Tutorial: Los Alamitos, CA, USA, 1991. [Google Scholar]
- Kozachenko, L.F.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Informatsii 1987, 23, 9–16. [Google Scholar]
- Jaccard, P. Nouvelles Recherches Sur La Distribution Florale. Bull. De La Société Vaudoise Des Sci. Nat. 1908, 44, 223–270. [Google Scholar]
- Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302. [Google Scholar]
- Yule, G.U. On the Methods of Measuring Association Between Two Attributes. J. R. Stat. Soc. 1912, 75, 579–652. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).