
  • I am planning to try it out, but for Caddy users: after being bombarded by AI crawlers for weeks, I came up with a solution that works.

    It is a custom Caddy CEL expression matcher coupled with caddy-ratelimit and caddy-defender.

    Now here’s the fun part: the defender plugin can respond with garbage, so when a request comes from a matching AI crawler, the response poisons their training dataset.

    Originally I relied only on the rate limiter and noticed that AI bots kept retrying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

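    git.blob42.xyz {
        # Match requests whose Accept-Language is exactly zh-CN, or whose
        # User-Agent contains a common crawler keyword (case-insensitive).
        @bot <<CEL
            header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
        CEL

        # Close the connection for matching requests without sending a response.
        abort @bot

        # Serve garbage to clients coming from these provider IP ranges.
        defender garbage {
            ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
        }

        # Limit each client IP to 1500 GET requests per 30-second window.
        rate_limit {
            zone dynamic_botstop {
                match {
                    method GET
                    # to use with defender
                    #header X-RateLimit-Apply true
                    #not header LetMeThrough 1
                }
                key {remote_ip}
                events 1500
                window 30s
                #events 10
                #window 1m
            }
        }

        reverse_proxy upstream.server:4242

        handle_errors 429 {
            respond "429: Rate limit exceeded."
        }
    }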
    
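    For anyone who wants to check which requests the @bot matcher would catch before deploying it, here is a rough Python sketch of the same logic. The helper name `is_bot` and the plain dict of headers are my own assumptions for illustration; Caddy evaluates the real matcher with Go's regexp engine, and I am only assuming this particular pattern behaves the same under Python's `re`.

```python
import re
from typing import Mapping

# Same pattern as the header_regexp matcher in the config above.
BOT_UA = re.compile(
    r"(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))"
)


def is_bot(headers: Mapping[str, str]) -> bool:
    """Approximate the @bot matcher (hypothetical helper, not part of Caddy)."""
    # header({'Accept-Language': 'zh-CN'}): the value must be exactly zh-CN.
    # Assumption: header names in the mapping use this exact casing.
    if headers.get("Accept-Language") == "zh-CN":
        return True
    # The regex is unanchored, so a substring hit anywhere in the
    # User-Agent is enough; re.search is the closest analogue.
    return BOT_UA.search(headers.get("User-Agent", "")) is not None
```

    Note that the keywords match as substrings, so any User-Agent that merely contains "meta" or "bot" somewhere is dropped too. It is worth running a sample of your access log through something like this to look for false positives.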

    If I am not mistaken, the 47.0.0.0/8 IP block belongs to Alibaba Cloud.
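    A /8 is a very wide net, so before relying on it you may want to check which of your visitors actually fall inside it. A minimal sketch using Python's standard `ipaddress` module; the names `BLOCK_47` and `in_blocked_range` are made up for this example and have nothing to do with how caddy-defender does its matching internally.

```python
import ipaddress

# The extra CIDR range passed to `ranges` in the config above.
BLOCK_47 = ipaddress.ip_network("47.0.0.0/8")


def in_blocked_range(remote_ip: str) -> bool:
    """Return True if remote_ip falls inside 47.0.0.0/8 (illustrative helper)."""
    # An IPv6 address is never a member of an IPv4 network, so this
    # simply returns False for IPv6 clients.
    return ipaddress.ip_address(remote_ip) in BLOCK_47


# A /8 covers 2**24 addresses, whoever they happen to be assigned to.
print(BLOCK_47.num_addresses)
```

    Every address in the block gets the garbage response regardless of who owns it, so narrowing the range to the provider's published prefixes is probably safer if you see legitimate traffic from 47.x.x.x.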

  • One thing I can imagine is even something like unconscious “self-censorship”: choosing a more permissive license to attract more people, and even corporations that will hire developers…

    This is the result of years of anti-copyleft propaganda, which has started to pay off. Now, all that corps need to do is wait for new projects and libraries to pop up and subtly (more often than not, openly) allocate resources to whichever project they need, or simply EEE (embrace, extend, extinguish). This is a much easier exercise than it was during the early years of copyleft, when we could literally have a free alternative operating system to those of Microsoft, Apple and IBM while they were openly fighting it. Read up on the Education and Government Incentives program for a reminder of what corporations are capable of.


  • I highly doubt these are sponsored by any big corp, just hobbyists/students who think it is an interesting project to undertake and who don’t care about the GPL as much as they care about doing something interesting to them.

    I wanted to test this theory. A quick look at the commit history shows that, although the project might have started as a hobby/student weekend project, it is currently maintained by someone whose official affiliation is director at Mozilla Corp.

    PS: I am not pointing the finger at any entity here; I picked this project as an example to start a discussion on this topic.