The creator of the dataset issued an apology to concerned users in a post on Bluesky.
A dataset of 1m Bluesky posts that was uploaded to machine learning platform Hugging Face earlier this week has been removed.
On 26 November, Daniel van Strien, a machine learning librarian at Hugging Face, uploaded a dataset of 1m public posts and accompanying metadata taken from Bluesky’s firehose API. The dataset card explained it was “intended for machine learning research and experimentation with social media data”.
However, after facing a backlash, van Strien removed the Bluesky data and apologised yesterday (27 November).
“I’ve removed the Bluesky data from the repo,” van Strien posted on Bluesky.
“While I wanted to support tool development for the platform, I recognise this approach violated principles of transparency and consent in data collection. I apologise for this mistake.”
He said he has left the public repository that hosted the dataset online so that users can continue to give feedback.
As noted by 404 Media, the data wasn’t anonymous, with each post listed alongside the user’s decentralised identifier.
While many commentators said that data collection should be opt-in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use.
Discourse over data
Discourse about the use of people’s data for artificial intelligence (AI) training without their consent has been growing.
X, the social media site that rivals Bluesky, found itself in hot water earlier this year when a security expert said the Elon Musk-owned platform was “overstepping boundaries of digital ownership” by defaulting users into allowing their posts, interactions and even conversations to be shared with its AI chatbot Grok for the purpose of AI development.
A month later, the Irish Data Protection Commission (DPC) said that X had decided to suspend processing personal data of EU users to train Grok after the commission had taken legal action against it.
Meta, the parent company of WhatsApp, Facebook and Instagram, also faced complaints regarding its plans to use personal data for AI earlier in the year.
Instead of asking users for their consent, Meta argued that it had a “legitimate interest” in collecting and processing this data. The company used the same legal basis for its personalised advertising policies, but the European Court of Justice rejected that basis last year.
Earlier this month, a mass exodus of users from X brought a surge of new users to Bluesky, which even led to a brief outage for the site.
Open-source champion Kelsey Hightower, best known for his work with Kubernetes and Google, spoke to SiliconRepublic.com about the promise of Bluesky as a decentralised platform.
He said that we have been presented with a new opportunity to get social media right but added that we all have a responsibility to ensure that this happens.