{"id":15271,"date":"2026-04-17T04:00:58","date_gmt":"2026-04-17T03:00:58","guid":{"rendered":"https:\/\/emilkirkegaard.dk\/en\/?p=15271"},"modified":"2026-04-17T04:21:19","modified_gmt":"2026-04-17T03:21:19","slug":"the-structure-of-substack","status":"publish","type":"post","link":"https:\/\/emilkirkegaard.dk\/en\/2026\/04\/the-structure-of-substack\/","title":{"rendered":"The structure of Substack"},"content":{"rendered":"<p>So Substack has nice open APIs where one can download a lot of interesting data from. <a href=\"https:\/\/uncorrelated.xyz\/i-webscraped-2-million-substack-articles-this-is-what-i-learnt\/\">Uncorrelated already did some cool stuff on how to make money on Substack using posting patterns, paywalls etc. to predict revenue<\/a>. The TL;DR is:<\/p>\n<ul>\n<li>Charge More: Price remains a powerful lever. However, it should be noted that by default estimated revenue is a function of price. So this correlation may exist regardless.<\/li>\n<li>Post Often: Reducing the average time between posts (posting more frequently) still shows a strong positive association with revenue.<\/li>\n<li>Paid Percentage: The positive correlation holds \u2013 more paid posts generally link to higher revenue in the model, though the visual plots (below) still suggest a potential curve peaking around 50-60%.<\/li>\n<li>Word Count: Longer posts (both free and paid, after log transformations) still show a statistically significant positive correlation with revenue, but the effect sizes are smaller than before. Frequency likely remains more impactful than length alone.<\/li>\n<li>Consistency? Maybe Not: The slight positive correlation for more variance in posting intervals persists. Frequency seems to matter more than rigid timing.<\/li>\n<\/ul>\n<p>Today I wanted to tackle a different question. Substacks relate to each other in various ways. 
They can link to each other in posts, say, because they are having a debate, or because they just feature <a href=\"https:\/\/www.astralcodexten.com\/p\/links-for-february-2026\">link collection posts<\/a>. One could collect all of these by scraping the HTML code and finding all the cross-links. Another option is using Substack&#8217;s recommendation list. A given Substack can recommend other Substacks. I initially scraped this data for the top 100 science Substacks, but it turns out that many people don&#8217;t use the feature:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15272\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram.png\" alt=\"\" width=\"2000\" height=\"1200\" srcset=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram.png 2000w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram-300x180.png 300w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram-1024x614.png 1024w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram-768x461.png 768w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_outgoing_recs_histogram-1536x922.png 1536w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/a><\/p>\n<p>So 35% of these popular Substacks don&#8217;t recommend anyone, which is poor manners, but also gives the authors an easy way out in case someone asks them &#8220;Can you please add me to your recommendation list?&#8221;. Additionally, it protects the authors against <a href=\"https:\/\/en.wikipedia.org\/wiki\/Six_Degrees_of_Kevin_Bacon\">6 degrees of Kevin Bacon<\/a> guilt by <del>association<\/del> recommendation. Whatever the reason, this dataset is not great for analysis. 
Regarding the network data, Substack itself provides a crude overlap metric. Mine looks like this:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15273\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap.png\" alt=\"\" width=\"2234\" height=\"1004\" srcset=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap.png 2234w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap-300x135.png 300w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap-1024x460.png 1024w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap-768x345.png 768w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap-1536x690.png 1536w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/EK-SS-overlap-2048x920.png 2048w\" sizes=\"auto, (max-width: 2234px) 100vw, 2234px\" \/><\/a><\/p>\n<p>I think this is some variant of the crude overlap from my Substack&#8217;s perspective, which is to say that 33% of my subscribers also subscribe to Aporia, 26% to Cremieux and so on. The problem with this metric is that larger Substacks will always have more overlap in general, regardless of whether they have any particular shared topics or not. One solution to this is to use the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pointwise_mutual_information\">PMI metric<\/a>. This is based on the relative rate of overlap given a neutral prior. So if one Substack has 10% of all users and another has 5%, their expected overlap by chance is just the product, that is, 0.5%. If we find that the actual value is 2%, then the relative rate (RR) is 4x, and PMI is log2(RR), here log2(4) = 2. The related NPMI metric normalizes this value to the range -1 to 1. Anyway, the Substack-level recommendations are unsuitable given that 35% of the data is missing. But we can go further. 
I scraped all the posts for the top 100 science Substacks to find all the users (~31k posts). Then I scraped all the users&#8217; pages to see which Substacks they subscribed to (~62k users). The distribution of subscriptions per user looks like this:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15274\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram.png\" alt=\"\" width=\"2000\" height=\"1200\" srcset=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram.png 2000w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram-300x180.png 300w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram-1024x614.png 1024w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram-768x461.png 768w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_user_subs_histogram-1536x922.png 1536w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/a><\/p>\n<p>It&#8217;s a beautiful power-law distribution. 
Using these subscriptions, we can build an overall network of Substacks using NPMI:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-scaled.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15275\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-scaled.png\" alt=\"\" width=\"2560\" height=\"1862\" srcset=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-scaled.png 2560w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-300x218.png 300w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-1024x745.png 1024w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-768x559.png 768w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-1536x1117.png 1536w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_cosub_network_npmi-2048x1489.png 2048w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/a><\/p>\n<p>Yes, it is very small and hard to read. The cluster labels were generated by Claude based on the largest blogs in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Louvain_method\">each cluster<\/a>. For paying subscribers, there is an interactive version available so you can zoom in on whatever you want. 
Here&#8217;s my immediate network according to NPMI:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.technotheoria.org\">Technotheoria [Seb Jensen]<\/a> (0.65)<\/li>\n<li><a href=\"https:\/\/www.leonvoss.com\">Leon Vo\u00df [Bronski]<\/a> (0.64)<\/li>\n<li><a href=\"https:\/\/www.aporiamagazine.com\">Aporia<\/a> (0.62)<\/li>\n<li><a href=\"https:\/\/werkat.substack.com\">Half-Baked Thoughts [werkat]<\/a> (0.60)<\/li>\n<li><a href=\"https:\/\/www.anthro1.net\">Peter Frost&#8217;s Newsletter<\/a> (0.59)<\/li>\n<li><a href=\"https:\/\/menghuscience.com\">Meng Hu on HBD and Austrian Economics<\/a> (0.59)<\/li>\n<li><a href=\"https:\/\/ncofnas.com\">Nathan Cofnas&#8217;s Newsletter<\/a> (0.57)<\/li>\n<li><a href=\"https:\/\/noahcarl.substack.com\">Noah&#8217;s Newsletter<\/a> [inactive since writing for Aporia] (0.56)<\/li>\n<li><a href=\"https:\/\/davidepiffer.com\">PifferPilfer<\/a> (0.55)<\/li>\n<li><a href=\"https:\/\/arctotherium.substack.com\/\">Not With a Bang [arctotherium]<\/a> (0.54)<\/li>\n<\/ul>\n<p>It&#8217;s a pretty reasonable list. 
These are either collaborators, coauthors, or people I interact with frequently, so it is not surprising that our readers have about the same preferences.<\/p>\n<p>Regarding just science, it looks like this:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_science_network_npmi-scaled.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15276\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/substack_science_network_npmi-scaled.png\" alt=\"\" width=\"2560\" height=\"1849\" \/><\/a><\/p>\n<p>I think this also nicely illustrates that the HBD cluster is quite distant from the usual conspiracy\/quack networks (germ theory, COVID stuff), while being closely related to the popular science and rationalist clusters.<\/p>\n<p>If you want to explore further using the interactive visualization, <a href=\"https:\/\/front.emilkirkegaard.dk\/static\/substack_network.html\">you can do so here<\/a>. You can zoom in and click on any node to see its top connections. E.g. Razib Khan:<\/p>\n<p><a href=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15280\" src=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network.png\" alt=\"\" width=\"1536\" height=\"1590\" srcset=\"https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network.png 1536w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network-290x300.png 290w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network-989x1024.png 989w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network-768x795.png 768w, https:\/\/emilkirkegaard.dk\/en\/wp-content\/uploads\/razib-network-1484x1536.png 1484w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>So Substack has nice open APIs where one can download a lot of interesting data from. 
Uncorrelated already did some cool stuff on how to make money on Substack using posting patterns, paywalls etc. to predict revenue. The TL;DR is: Charge More: Price remains a powerful lever. However, it should be noted that by default [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":15275,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2589,2817],"tags":[3854,3857,3856,3855,2912],"class_list":["post-15271","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-meta-2","category-personal","tag-network","tag-overlap","tag-recommendation","tag-scraping","tag-substack","entry","has-media"],"_links":{"self":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/15271","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/comments?post=15271"}],"version-history":[{"count":4,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/15271\/revisions"}],"predecessor-version":[{"id":15282,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/posts\/15271\/revisions\/15282"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media\/15275"}],"wp:attachment":[{"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/media?parent=15271"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/categories?post=15271"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emilkirkegaard.dk\/en\/wp-json\/wp\/v2\/tags?post=15271"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel
}","templated":true}]}}