User Generated Content Data Platforms

Why is this a conversation?

To most readers it makes little difference whether data in a figure is sourced from Dune Analytics or Coin Metrics. However, for an analyst the two platforms are nearly incomparable.

The data you see in figures from Dune doesn’t exactly exist; to be more precise, it isn’t decoded until someone writes a query for it. Compare that with Coin Metrics: if you want daily Bitcoin transaction fee data, you type `fee` in the search bar and find a dataset that has already been curated by paid employees.

By contrast, user-generated content (UGC) platforms like Dune host minimally cleaned raw copies of the blockchain, and unless someone has done the work and published the exact dataset that you’re looking for, you’re stuck writing SQL code.

To view the same data as above, we need to extract daily Bitcoin transaction fee data with a query like the one below.
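As a rough sketch of the shape such a query takes, the Python snippet below runs a daily fee aggregation against an in-memory SQLite table. The table name `transactions`, the columns `block_time` and `fee_sats`, and the three sample rows are all assumptions made up for illustration; they are not Dune's actual schema, and Dune's SQL dialect differs from SQLite's.

```python
import sqlite3

# Hypothetical stand-in for a raw transactions table. The table name,
# column names, and rows are illustrative assumptions, not Dune's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (block_time TEXT, fee_sats INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("2023-04-01 00:10:00", 20000),
        ("2023-04-01 13:45:00", 50000),
        ("2023-04-02 08:30:00", 10000),
    ],
)

# Sum the fee of every transaction, grouped by calendar day.
DAILY_FEES_SQL = """
SELECT date(block_time) AS day, SUM(fee_sats) AS total_fee_sats
FROM transactions
GROUP BY date(block_time)
ORDER BY day
"""

daily_fees = conn.execute(DAILY_FEES_SQL).fetchall()
for day, total_fee_sats in daily_fees:
    print(day, total_fee_sats)
```

The point is less the aggregation itself than the fact that the analyst, not the platform, has to know which raw table and column to aggregate.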

The query above is trivial enough, but on the other end of the spectrum you can find analysis like Hildobby’s Pool Granular Stats query, which is 325 lines that you couldn’t pay me to write. This is the key strength of platforms like Dune: the ability to create arbitrarily complex queries and to dive into any data that you want, beyond what is curated for you by the company.

The non-curation of data is a strength, but it also creates complex dynamics around ownership and credit. Despite nuances around the evolving state of intellectual property rights on Dune, when someone fails to provide accurate attribution to the author of a query, they’re primarily stealing from unpaid creators who have donated their time to provide you with access to cleaned data.

The most blatant bad actors are often called out on Twitter for actions like cropping out the handles of query authors, or other forms of improper attribution, but the most malicious actors are other content creators who mostly slide under the radar.

To be blunt, every interesting piece of data cleaning or analysis that I’ve posted on Dune has been duplicated character-for-character by others without any form of attribution—it’s an action we even see from companies in the community that are viewed extremely positively.

The situation is so bad that work done by me, but credited to others, has been featured in essentially every prominent crypto publication that outsources any part of its data. It’s not even the fault of these publications, because the level of diligence required to avoid it is untenable. And by Dune standards I’m not even prolific!

The current solution suggested by Dune is private queries; thankfully, the paywall was recently removed. However, private queries are a stopgap that, in my opinion, does more to tick boxes for companies than to address the underlying issues that creators face.

The issue is not the visibility of one’s code, it’s the attribution of the code.

The reality is that private queries don’t work anyway.

In the data analysis world, it's rare that the biggest challenge is writing the actual code. Data analysts build their value by knowing what to look at and where to look for it.

As a first test of private queries, before I posted my modelling of the Shanghai upgrade, I hid my code for tracking Ethereum withdrawal credential types. The live breakdown of credential types had been a widely requested feature that no one had publicly figured out, with many data personalities in the space quoting unverified, months-old numbers.
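For context, the classification itself is easy to state once the raw credentials are in hand: in the Ethereum consensus specs, the first byte of a validator's 32-byte withdrawal credentials is `0x00` for a BLS withdrawal key and `0x01` for an execution-layer address. The sketch below is a minimal Python illustration of that rule, not the hidden Dune query; the function name and the sample credentials are made up for the example.

```python
from collections import Counter

# Prefix bytes as defined in the Ethereum consensus specs.
PREFIX_LABELS = {
    "00": "0x00 (BLS key)",
    "01": "0x01 (execution address)",
}


def credential_type(withdrawal_credentials: str) -> str:
    """Label a 32-byte withdrawal credential (hex string) by its first byte."""
    hex_str = withdrawal_credentials.lower()
    if hex_str.startswith("0x"):
        hex_str = hex_str[2:]
    if len(hex_str) != 64:
        raise ValueError("expected 32 bytes of hex")
    return PREFIX_LABELS.get(hex_str[:2], "unknown")


# Made-up sample credentials, for illustration only.
sample_credentials = [
    "0x00" + "ab" * 31,
    "0x01" + "00" * 11 + "cd" * 20,
    "0x01" + "00" * 11 + "ef" * 20,
]

breakdown = Counter(credential_type(c) for c in sample_credentials)
print(breakdown)
```

The hard part in practice is locating and decoding the credentials on-chain, which is exactly the knowledge a hidden query tries to protect.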

Within 12 hours of my posting the analysis, and of people becoming aware that it was possible to decode the data on Dune, someone who by their own admission didn’t know SQL had used ChatGPT to recreate the dataset (with a disaster of a query). I actually think it is fair play to recreate people’s hidden queries (data is a public good), so I don’t mean to call them out, but it shows the weakness of the proposed solution and elevates a crucial discussion about provenance and attribution.

I’ve included my code below as reference.

As technology evolves and programming grows increasingly automated, occurrences like this will only grow more common. Hiding code is only valuable when there is a barrier to entry in generating it, but as this barrier is dismantled, the protection that analysts need is at the social layer, not the technical layer.

The strength of UGC platforms comes from embracing open-source and collaboration. For example, live estimates for Ethereum staking APR and estimates around issuance have been long-requested features on Dune, but few people have both the data background and the desire and skills to read through Ethereum consensus documentation to put the whole picture together. Instead, we can leverage work done by hildobby and breakdowns of documentation prepared by Ben Edgington to turn countless hours of coding and research into four simple lines of code.
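To give a sense of why the result can be so compact, here is a minimal Python sketch of the ideal consensus-layer reward rate that follows from the base reward formula in the consensus specs. It is an illustration under stated assumptions, not the Dune query itself: it assumes perfect validator participation, ignores penalties, integer rounding, execution-layer tips, and MEV, and the function names are hypothetical.

```python
import math

# Constants from the Ethereum consensus specs.
BASE_REWARD_FACTOR = 64
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
GWEI_PER_ETH = 10**9

# Assumes a 365.25-day year.
EPOCHS_PER_YEAR = 365.25 * 24 * 3600 / (SECONDS_PER_SLOT * SLOTS_PER_EPOCH)


def ideal_consensus_apr(total_staked_eth: float) -> float:
    """Ideal yearly consensus reward as a fraction of stake (0.05 means 5%)."""
    total_staked_gwei = total_staked_eth * GWEI_PER_ETH
    # Per epoch, a perfectly performing validator earns
    # effective_balance * BASE_REWARD_FACTOR / sqrt(total active balance in Gwei).
    return BASE_REWARD_FACTOR * EPOCHS_PER_YEAR / math.sqrt(total_staked_gwei)


def ideal_annual_issuance_eth(total_staked_eth: float) -> float:
    """Ideal yearly consensus-layer issuance in ETH at a given level of stake."""
    return total_staked_eth * ideal_consensus_apr(total_staked_eth)


# At 10 million ETH staked the ideal rate is about 5.26%.
print(f"{ideal_consensus_apr(10_000_000):.4%}")
```

The only live input is the total amount of ETH staked, which is where a shared, already-decoded staking dataset does the heavy lifting.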

Within hours of my tweeting that the query was available, it had already been incorporated into some of the most popular dashboards monitoring Ethereum staking. This open collaboration is extremely beneficial to the whole community, but it only happens when there is a proper culture of respect and attribution.

We can also look at my work (alongside ilemi) decoding Bitcoin inscriptions. The Bitcoin community has been technically stagnant for so long that very few people in the data analysis community understand how to work with low-level Bitcoin data. Today’s popular Bitcoin analysts are predominantly not data engineers; instead, they work with data cleaned and prepared by others.

The game-theoretically optimal choice would have been for us to keep the queries private and prevent other wizards (and world-leading blockchain data platforms 👀 👀) from easily duplicating the queries, but this is not the nature of open-source. That choice would have made follow-up work by domo, niftytable/Kofi, and others much more difficult, and the data ecosystem would be significantly worse off.

The fostering of open data in the community came at the cost of allowing bad actors to simply duplicate my work and represent it as their own. Unsurprisingly it happened a lot. And unfortunately, the Bitcoin community’s response has left me unwilling to continue donating my time to analyze inscriptions.

The Future of UGC Data Platforms

The crux of the issue comes down to this description from 0xDataWolf.

UGC Data Platforms are the only platforms that allow for real-time analytics and publishing, while also enabling users to skip the data engineering process. […] UGC crypto data platforms serve as [both] very powerful data and content publishing houses.

UGC data platforms serve two audiences: the analysts that are responsible for content generation and the general public who are often ignorant about how the sausage is made.

It is not always clear who these platforms are primarily designed to serve. Popularity amongst the non-data community is important, but for these platforms to continue to be successful, they have to ensure that users continue generating content. Good analysts are the key customer that must be prioritized.

The class of user that does not generate content, but only distributes or rebrands it, provides just a short-term benefit to the platform. If that class becomes the focus, the content is destined to stagnate, and the platform along with it.

This is why subtle issues that may seem trivial to the wider public, but matter to analysts, are so crucial: dashboard prioritization algorithms, forking workflow, query plagiarism, IP rights management, and figure exports.

As a strident believer in open-source, I am incredibly happy to see others use and build upon my work. But open-source does not necessarily mean unlicensed, and granting analysts the option to exert control and ownership over their work will spur higher-quality contributions. Just because you can see the source code doesn’t mean that it is proper or legal to pass off other people’s work as your own, and this is something that needs to be ingrained in our culture quickly.
