AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

sabreW4K3@lazysoci.al · 12 hours ago

LandedGentry@lemmy.zip · edit-2 10 hours ago

Aaron Swartz was harassed by the feds for releasing publicly-funded academic works wrongly locked behind paywalls, until he took his own life, yet we let venture-capital-funded slop machines assault the cornerstones of our public knowledge, to the point where they are rendered inaccessible to the general populace.

When ChatGPT 3.5 dropped, there were lots of people calling for “guardrails” and “common sense legislation.” They were shouted down and called Luddites. These are the kinds of situations we should have been able to prevent.

404 has been on fire lately, by the way. Everyone should consider bookmarking and supporting them.

FaceDeer@fedia.io · 9 hours ago

This seems contradictory. On the one hand you’re saying that these works are wrongly locked behind paywalls, but on the other you’re saying that scraping them is an “assault on the cornerstones of our public knowledge.” Is this information supposed to be freely viewable or not?

IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file. That lets anyone who wants the whole thing have it without having to “hammer” the servers. Meanwhile the servers can be protected by standard load-balancing and DDoS prevention systems.
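
As one illustration of the "standard" protections mentioned above, here is a minimal token-bucket rate limiter in Python. It is only a sketch: the `TokenBucket` class and its parameters are made up for this example, and real deployments usually do this at the proxy or CDN layer rather than in application code.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-client token bucket: allows short bursts, caps the sustained rate."""

    def __init__(self, rate_per_sec: float, capacity: int, now: Optional[float] = None):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)     # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        now = time.monotonic() if now is None else now
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client key (IP address, API token, and so on) and answer with HTTP 429 whenever `allow()` returns `False`. The `now` argument exists only so the behaviour can be checked without waiting on a real clock.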

Zaleramancer@beehaw.org · 3 hours ago

There’s a difference between making information accessible to humans for the purposes of advancing our shared knowledge vs saying that public institutions should subsidize the needs of private for-profit organizations.

It’s like, you can say, “Oh yeah, people should have access to freshwater for free,” and also say, “Companies shouldn’t be allowed to pump infinite freshwater from those sources to bottle it for profit.”

Those aren’t contradictory if your actual goal is the benefit of humankind and not, like, pedantic genie logic.

FaceDeer@fedia.io · 3 hours ago

Unlike water, though, data can be duplicated easily.

Zaleramancer@beehaw.org · 3 hours ago

Bandwidth can’t, though.

Is it okay to hire a bunch of people to check out half a library’s books, then rent them to people for money? Is that fine, or an obvious abuse?

Rendering this service inaccessible to actual human people in order to feed your for-profit software is only different in medium from that.

FaceDeer@fedia.io · 2 hours ago

> Bandwidth can’t, though.

Bandwidth is incredibly cheap. The problem these sites are having is not running into bandwidth limits, it’s that providing the pages requires processing to generate them. That’s why Wikipedia’s solution works - they offer all the “raw” data in a single big archive, which takes just as much bandwidth to download but way fewer server resources to process (because there’s literally no processing - it’s just a big blob of data).
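
To make the "big blob" idea concrete, here is a rough Python sketch of building one compressed dump up front and reading it back. The record layout and the function names `build_dump` and `read_dump` are assumptions for illustration; this is not how Wikimedia actually produces its dumps.

```python
import gzip
import io
import json


def build_dump(records):
    """Serialize every record once into a single gzip-compressed JSON Lines blob."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for rec in records:
            gz.write((json.dumps(rec) + "\n").encode("utf-8"))
    return buf.getvalue()


def read_dump(blob):
    """What a downloader does locally: decompress and parse, with no server involved."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as gz:
        text = gz.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical catalogue entries standing in for a real collection.
records = [
    {"id": 1, "title": "Example manuscript"},
    {"id": 2, "title": "Example photograph"},
]
blob = build_dump(records)  # built once, e.g. by a nightly job
```

The blob can then be handed out as a plain static file, so each download costs bandwidth but no database queries or page rendering.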

> Is it okay to hire a bunch of people to check out half a library’s books, then rent them to people for money?

This analogy fails because, as I said, data can be duplicated easily. Making a copy of the data doesn’t obstruct other people from also viewing the data provided you avoid the sorts of resource bottlenecks I described above.

Is your problem really about the accessibility of this data? Or is it that you just don’t want those awful for-profit companies you hate to have access to it? I really get the impression that that’s the real problem here - people hate AI companies, and so a solution that gives everyone what they want is unacceptable because the AI companies are included in “everyone.”

Zaleramancer@beehaw.org · 54 minutes ago

Dude, my problem is that capitalism is going to ruin everything. It is a rotting sickness that cuts through every layer of society and creates systemic, ugly problems.

Do you know how excited I was when LLM tech was announced? Do you know how much it sucked to realize, so soon, that companies were going to do their best to use it to optimize profits?

The free access of information problem is just a manifestation of this dark specter on society.

You are acting as if we can approach this problem in the abstract, where you have to abide by simplistic, binary philosophical rules, rather than accepting that we live in a world of constant moral compromise and complexity.

It’s not as simple as, “Oh, you say that you believe in freedom of information, but curious how you don’t want private companies to use it to make money at your expense! Guess you’re a hypocrite.”

Tell me what you actually believe, or stop cycling back to this like it’s a damning rebuttal.

FaceDeer@fedia.io · 6 minutes ago

It’s ironic that you’re railing against capitalism while espousing exactly the sort of scarcity mindset that capitalism is rooted in, whereas I’m the one taking the “information wants to be free” attitude that would normally be associated with anti-capitalist mindsets.

> Do you know how excited I was when LLM tech was announced? Do you know how much it sucked to realize, so soon, that companies were going to do their best to use it to optimize profits?

They do that with everything. Does that mean that everything must therefore become some kind of all-or-nothing battleground wherein companies must be thwarted?

> It’s not as simple as, “Oh, you say that you believe in freedom of information, but curious how you don’t want private companies to use it to make money *at your expense*! Guess you’re a hypocrite.”

Emphasis added. That part is where you’re in error about my view: it’s not at my expense. It doesn’t harm me any.

> Tell me what you actually believe, or stop cycling back to this like it’s a damning rebuttal.

I have been.

Zaleramancer@beehaw.org · 1 hour ago

Wow, you’re beginning to understand the actual arguments and debates going on. :3

Why are you taking their side, buddy?

FaceDeer@fedia.io · 12 minutes ago

I’m not “taking their side.” I’m just not actively trying to harm them. The world is not a zero-sum game; it’s often possible for everyone to get what they want without harming each other in the process.

LandedGentry@lemmy.zip · edit-2 8 hours ago

so every single repository should have to spend their time, energy, and resources on accommodating a bunch of venture funded companies that want to get all of this shit for free without contributing to these repositories at all themselves? You think that is a fair ask? That these (often underfunded) institutions should have to accommodate the American private sector’s free lunch because they’re entitled to break our sites without warning?

Honestly the more I write the more this sounds like capitulating to hackers.

FaceDeer@fedia.io · 7 hours ago

> so every single repository should have to spend their time, energy, and resources on accommodating a bunch of venture funded companies that want to get all of this shit for free without contributing to these repositories at all themselves?

Was Aaron Swartz wrong to scrape those repositories? He shouldn’t have been accessing all those publicly-funded academic works? Making it easier for him to access that stuff would have been “capitulating to hackers?”

I think the problem here is that you don’t actually believe that information should be free. You want to decide who and what gets to use that “publicly-funded academic work”, and you have decided that some particular uses are allowable and others are not. Who made you that gatekeeper, though?

I think it’s reasonable that information that’s freely posted for public viewing should be freely viewable. As in anyone can view it. If they want to view all of it and that puts a load on the servers providing it, but there’s an alternate way of providing it that doesn’t put that load on the servers, what’s wrong with doing that? It solves everyone’s problems.

Zaleramancer@beehaw.org · 1 hour ago

Really?

Okay, look, the reason people are disagreeing with you is that you’re responding to the following problem:

“Private companies are preventing access to public resources due to their rapacious, selfish greed.”

And your response has been:

“By changing how we structure things to make it easier for them to take things, we can both enjoy the benefits of the public resources.”

The companies are not the same as normal patrons. They’re motivated by a desire for infinite growth and will consume anything that they can access for low prices to resell for high ones. They do not contribute to these public resources, because they only wish to abuse them for the potential capital they have.

Drawing an equivalence between these two things requires the willful disregard of this distinction so that you can act as if the underlying moral principle is being betrayed because your rhetorical opponent didn’t define it as rigorously as possible. They skipped the rigorous definition because they expected you to engage with this in good faith.

Why are you doing this?

FaceDeer@fedia.io · 13 minutes ago

Yes, I know the companies are not the same as normal patrons. I don’t care that they’re not the same as normal patrons. All I’m concerned about is that the normal patrons get access to the data. The solution I proposed does that.

The problem, as I see it, is that’s not all that you are concerned about. Your goal also includes a second aspect; you want those companies to not have access to that data. So my proposal is not acceptable because it doesn’t thwart those companies.

I’m not drawing an equivalence between companies and individual patrons, I’m just saying my goals don’t include actively obstructing those companies. If they can get what they want without interfering with what the normal patrons want, why is that a bad thing?

LandedGentry@lemmy.zip · edit-2 6 hours ago

I don’t think you understand what is mechanically occurring. He was not putting strain on public servers, he downloaded on-site as one person at a reasonable rate, and then distributed it to the public. It was essentially ethical piracy. No site or entity was put under strain. No one was denied access.

The reason I drew the comparison is his treatment as one person downloading journals and releasing them, vs AI companies scraping countless websites and public repositories, taking them down for the public in the process, and then monetizing it internally.

The reason they are being compared is their treatment for extracting publicly funded information. AI companies are being far more destructive. It’s not even close. They are actively harming public data and access with their unfettered sense of entitlement.

FaceDeer@fedia.io · 4 hours ago

If someone did an Aaron-Swartz-style scrape, then published the data they scraped in a downloadable archive so that AI trainers could download it and use it, would you find that objectionable?

LandedGentry@lemmy.zip · edit-2 3 hours ago

Certainly far less objectionable than taking down public resources, though there’s more to it than that - again, it puts the onus on everyone else to protect themselves from companies that are essentially acting like malicious hackers, companies that should be the ones responsible for not tearing down public resources. But I don’t really get what you’re trying to prove, because your proposal is not what they’re doing. They’re just doing whatever the fuck they want and don’t care who it impacts. They never do.

I don’t feel like this is very complicated. I’m not allowed to block public roads with my car. I’m not allowed to cut the power to a library and bar the doors. You can’t just deny people public resources like that as a private entity, unless of course you are an AI slop company, in which case states literally aren’t even allowed to make rules about you for the next decade due to our corrupt commander-in-chief. These AI companies are allowed to steamroll any private or public entity they want so long as they convince the right people that they will make them a lot of money. It is wildly unethical and the fact that I have to spend so much time convincing you they deserve a little more scrutiny is kind of baffling.

Aaron Swartz didn’t do anything like the above and your insistence that he is somehow critical to proving some perceived hypocrisy or inconsistency on my part is…well, I’m not sure what the word is, but it’s just not accurate at all.

FaceDeer@fedia.io · 3 hours ago

That suggestion is exactly the same as what I started with when I said “IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file.” It just cuts out the Aaron-Swartz-style external middleman, so it’s easier and more efficient to create the downloadable data.

Lucy :3@feddit.org · 11 hours ago

Use Iocaine and Anubis!

Geodad@beehaw.org · 8 hours ago

I’ve been seeing more Anubis lately. It pops up for like 5 seconds.

Lucy :3@feddit.org · 8 hours ago

Action -> Reaction

Geodad@beehaw.org · 8 hours ago

I use a VPN, so my traffic is automatically looked upon as suspicious.

Lucy :3@feddit.org · 8 hours ago

I doubt that there are (m)any Anubis deployments that distinguish between suspicious or not. It’s just that as more companies get aggressive with scraping, we are getting more aggressive with said tools.
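
For anyone wondering what the few-second delay is: as far as I understand it, Anubis hands every visitor a SHA-256 proof-of-work puzzle to solve in the browser. The Python below sketches that general idea only. It is not Anubis's actual code, and the challenge string and difficulty value are made up.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose SHA-256 hex digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the submitted work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Hypothetical challenge; a real server would issue a fresh random one per visitor.
nonce = solve("example-challenge", 3)
```

The asymmetry is the point: solving takes many hashes, checking takes one. That is a small cost for one person loading one page, but it adds up quickly for a scraper requesting millions of them.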

Geodad@beehaw.org · 8 hours ago

Yeah, I can see that. I like seeing the cute anime art pop up briefly.

sabreW4K3@lazysoci.al · 11 hours ago

Anubis is what slrpnk uses and it blocks the community icon for the electric vehicles community 😭