Introducing Bitmagnet: A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration

mgdigital@lemmy.world · 1 year ago

Shepy@feddit.uk · 1 year ago

This sounds amazing, definitely going to add this to my servarr setup next few days.

Dasnap@lemmy.world · edit-2 1 year ago

It’s only once you install something like this that you realize just how many torrents are porno.

I’ve always been curious about ‘Anal Police Stories 2’ but I’ve never found the time.

Salamendacious@lemmy.world · 1 year ago

scrubs YouTube clip

PipedLinkBot@feddit.rocks · 1 year ago

Here is an alternative Piped link(s):

scrubs YouTube clip

Piped is a privacy-respecting open-source alternative frontend to YouTube.

I’m open-source; check me out at GitHub.

thantik@lemmy.world · 1 year ago

Very nice. This gets rid of any questionably legal gray area of using sites like Nyaa, etc for Torrent links. Also provides a bit of robustness against censorship when those sites get taken down. Looks like I’m gonna have to set up proxmox on a machine this weekend, as Windows sucks dick for docker containers and that’s what I’ve got most of my *arr stuff hosted on currently.

It’ll be a good thing anyways, as most of those instances aren’t running through my VPN yet and I should just centralize them on proxmox and run all the torrents, etc through containerized instances for security.

kungen@feddit.nu · 1 year ago

This gets rid of any questionably legal gray area of using sites like Nyaa, etc for Torrent links

Except that now you’re asking the swarm for metadata behind a boatload of info_hashes? Unlikely anyone would care (though you’d be surprised how many DMCAs I get when just having a simple open tracker running, not even an indexer), but I don’t see it as being any less grey than using any existing sites.

thantik@lemmy.world · edit-2 1 year ago

In some jurisdictions hosting links to pirated content is considered illegal. In others it is not. You are now not hosting publicly available links. Many of those rulings were based on the publicly available nature and that you were providing OTHER people with the information. You are now simply obtaining the whole of the DHT yourself. You can’t be assumed to be doing anything illegal with it, because it’s everything. You could be doing research on swarms of computers, you could be looking for a linux torrent…the act of collecting ALL of the data yourself, doesn’t violate the laws in the way they were ruled on.

Additionally some sites have been MITMd so that they saw when people were browsing…say…“Barbie Movie”…and then they watched the DHT for a client connecting soon after, and could connect them to users with VPNs because people are browsing these sites not behind a VPN, but torrenting behind a VPN when they torrent.

Browsing something like Nyaa isn’t technically illegal - but people have been targeted over it. When you don’t have to browse Nyaa using a web browser, you bypass that whole shebang.

kungen@feddit.nu · 1 year ago

You are now not hosting publicly available links

That’s also the case with open trackers (without indexers), yet I’ve gotten shut down way too many times. But that made me wonder, does this project share metadata if someone else in the DHT swarm queries for an info_hash you have, or does it simply “leech”? Pretty cool project regardless.

droopy4096@lemmy.ca · 1 year ago

@mgdigital, first thing I’ve noticed: reliance on a “heavier” database stack (pg + redis), at least at first glance at docker-compose. My suggestion would be to have an option for a minimalist setup with sqlite and without redis if possible. That would work better for those of us flying with minimal hardware (rpi, old PC and such).

mgdigital@lemmy.world · 1 year ago

Hi, this is a great point and one that I’ve already given consideration to. I’ll address separately the issue of the primary datastore, i.e. Postgres, and the Redis dependency:

Postgres as the only option for the data store

There are 2 reasons for this:

Performance: while SQLite could offer a simpler/embedded data store, it simply doesn’t have the performance and features of Postgres. Bitmagnet has a faceted search engine and is write-intensive (it will be discovering ~5k torrents per hour and writing these to the database along with associated metadata). As such, its database may not be suitable for running on older hardware. A SQLite adapter, if it was developed, may simply not be up to the job (although as I haven’t attempted this I can’t say what the performance would be like). That said, Bitmagnet itself is not especially resource intensive; you could probably run it on a Raspberry Pi but point it to a Postgres instance on some more powerful hardware. At this stage I’ve only been running it on an M2 Mac Mini with Postgres located on its SSD and so would be interested to know people’s mileage on other hardware.
Development, support and maintenance overhead: I’m a lone developer and this project is already too big for one person. A SQLite adapter, if feasible performance-wise, I think could only happen if other contributors joined the project as my to-do list is already pretty long. It would have to achieve feature parity with the Postgres implementation which makes use of several Postgres-specific features and extensions. It would also mean a longer testing cycle and therefore probably a slower release cadence. That said, if there was enough demand and assistance then I’d be open to looking into the feasibility of this once the rest of the application is a little more mature and the current database schema more finalised.

Redis dependency

Redis is currently used only for the asynchronous task queue. I would like to have put this in Postgres, but there simply is not a good out-of-the-box solution that works well with Postgres and GoLang, and is actively maintained. I looked at quite a few queuing libraries and eventually settled on asynq (https://github.com/hibiken/asynq), which is a great library and does the job well - but could really do with support for non-Redis backends.

Using Redis here was a pragmatic decision that allowed me to make progress, rather than an optimal one. I guess I could have built a simple Postgres-based queue myself but that would have been a distraction and probably sub-optimal compared with a mature/separately developed library. It remains an option. Since I looked into this a new project has sprung up which I’m keeping an eye on - https://www.tork.run/ - it has a Postgres backend and looks like it might be up to the job, but is very new.

So yes, I’m very aware that the additional Redis dependency is not ideal and it may well disappear at some point.

mlunar@lemmy.world · 1 year ago

Hi, those points are certainly valid and I have nothing against these picks!

I just wanted to chime in that perf might not be as big of a problem as you might expect. 5k/hour is 1.4/sec, which sqlite should for sure be able to handle.

In fact, you can do hundreds to thousands of writes/sec, as long as you batch them in transactions (as by default each query is executed in its own transaction).
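To illustrate the batching point, here is a minimal sketch using Python's built-in sqlite3 module. The `torrents` table and the `make_db` / `insert_torrents` helpers are made up for this example and have nothing to do with Bitmagnet's actual schema or code; it only contrasts one transaction per row with one transaction per batch.

```python
import sqlite3
import time


def make_db(path=":memory:"):
    """Open a database with a hypothetical two-column torrents table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS torrents (info_hash TEXT PRIMARY KEY, name TEXT)"
    )
    return conn


def insert_torrents(conn, rows, batch=True):
    """Insert (info_hash, name) rows and return the elapsed seconds."""
    sql = "INSERT INTO torrents (info_hash, name) VALUES (?, ?)"
    start = time.perf_counter()
    if batch:
        # One transaction for the whole batch: a single commit at the end.
        with conn:
            conn.executemany(sql, rows)
    else:
        # One transaction (and one commit) per row.
        for row in rows:
            with conn:
                conn.execute(sql, row)
    return time.perf_counter() - start


if __name__ == "__main__":
    # 5000 fake rows, i.e. roughly one hour of crawling at ~5k torrents/hour.
    rows = [(f"{i:040x}", f"example torrent {i}") for i in range(5000)]
    conn = make_db()
    elapsed = insert_torrents(conn, rows)
    count = conn.execute("SELECT COUNT(*) FROM torrents").fetchone()[0]
    print(f"{count} rows in {elapsed:.3f}s; 5000/hour is {5000 / 3600:.2f} writes/sec")
```

The difference between the two modes is most visible with an on-disk database file, since each commit then has to reach the disk. This says nothing about query performance on a large index, which is a separate question.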

droopy4096@lemmy.ca · 1 year ago

Thank you for such a detailed response. I would love to contribute, but at the moment my capacity is rather limited; otherwise I’d be willing to add a SQLite adapter. From your description it sounds like the architecture is currently tightly coupled to PostgreSQL features. In my daily job I love PostgreSQL for big apps and stacks but I’m also aware how “hungry” PG can be, which is why I’m wondering whether it’s “too big of a hammer” for this particular problem. Also, setting up a single service is easier for novices than maintaining several. Docker Compose is nice but it has its limitations.

Stephen304@lemmy.ml · 1 year ago

A dht crawler is inherently an intensive service to run, magnetico used sqlite and would take 10 minutes just to load the splash page that includes the total count of discovered torrents.

ryannathans@aussie.zone · 1 year ago

Does it infinitely crawl, storing all metadata about every torrent it finds forever?

mgdigital@lemmy.world · 1 year ago

726a67@lemmy.sdf.org · edit-2 1 year ago

Looks super interesting; starred!

Will report back once I’ve run through the installation.

spiritedpause@sh.itjust.works · 1 year ago

Dude this is amazing! Exactly the sort of thing I’ve been hoping would pop up to further “decentralize” the torrent search experience.

So I’m trying to run it on my machine through the docker-compose option, and I’m seeing something weird. It shows as successfully running, but when I go to the port it should be running on, I get “unable to connect” on my browser.

When I check my containers running, it shows the 3 bitmagnet containers, but the port doesn’t show.

https://i.imgur.com/D4R1Le5.png

mgdigital@lemmy.world · 1 year ago

Hi, the default port is 3333, which should be exposed if you’re using the example configuration here: https://bitmagnet.io/setup/installation.html - I’m not sure what the app is in your screenshot but the provided config definitely exposes that port and is tested on Docker for Mac.

spiritedpause@sh.itjust.works · 1 year ago

Just pulled the latest and tried again, and it works now! Thanks

prim3r@lemmy.ca · 1 year ago

This looks really cool! How resource intensive is this? What sort of storage requirements are there for this to be a reasonably reliable method of acquiring media? I’m probably just gonna find out myself. I’ve recently fully switched over to usenet, but this could make torrents pretty compelling again.

mgdigital@lemmy.world · 1 year ago

Hi, and thanks!

As a priority I’d like to gather some more rigorous performance benchmarks, but I can give you some hand-wavey stats now: Bitmagnet is currently fluctuating between 2-10% CPU usage on my M2 Mac Mini, and is using ~120MB of memory having currently been running for around 48 hours. Overall, the GoLang implementation seems pretty efficient to me considering how much I know is going on in the background.

Disk space usage of the database: this will be highly dependent on 2 configuration options, the first of which I’ve only just added in the just-released version. Copied from the configuration page of the website:

dht_crawler.save_files (default: true): If true, file metadata from the DHT crawler will be saved to the database. This provides more rich information about a torrent, but will use a lot more disk space. If disk space is at a premium you may want to consider disabling this.
dht_crawler.save_pieces (default: false): If true, the DHT crawler will save the pieces bytes from the torrent metadata. The pieces take up quite a lot of space, and aren’t currently very useful, but they may be used by future features.

For me, 24 hours of crawling uses ~2.5GB of database disk space for metadata on the ~120k torrents it has discovered. Yep, that sounds like a lot, however 90% of that is taken up with the files metadata, and could have been saved by setting dht_crawler.save_files to false. In fact I may set this to false by default and allow users to opt-in to the full-fat torrent info.
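As a rough back-of-the-envelope check on those numbers, the sketch below works out the per-torrent cost and a simple linear projection. It assumes decimal gigabytes, a constant crawl rate and a constant 90% share for files metadata, none of which is guaranteed; `estimate_usage` is just a hypothetical helper for the arithmetic.

```python
GB = 1000 ** 3  # decimal gigabytes (an assumption)


def estimate_usage(daily_gb=2.5, daily_torrents=120_000, files_share=0.90, days=30):
    """Linear projection of database size from the hand-wavey stats above."""
    daily_bytes = daily_gb * GB
    # Portion left if files metadata is not saved (save_files set to false).
    lean_daily_bytes = daily_bytes * (1 - files_share)
    return {
        "bytes_per_torrent": daily_bytes / daily_torrents,
        "bytes_per_torrent_without_files": lean_daily_bytes / daily_torrents,
        "gb_after_days": daily_gb * days,
        "gb_after_days_without_files": daily_gb * (1 - files_share) * days,
    }


if __name__ == "__main__":
    for key, value in estimate_usage().items():
        print(f"{key}: {value:,.1f}")
```

On these assumptions that is roughly 21 KB per torrent with files metadata and about 2 KB without, so a month of crawling would land somewhere around 75 GB versus 7.5 GB. Real usage will vary with index overhead and what the crawler happens to find.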

I’ve also imported the entire RARBG backup (the SQLite one, see tutorial on the Bitmagnet website). This, along with all the associated metadata from TMDB, took around 4GB of database space, which seems quite acceptable considering it’s basically every movie and TV show. Note that this does NOT include the metadata on individual files as I described above.

A priority feature for me (detailed on website) is smart deletion - this would allow you to automatically discard a lot of data that can be automatically determined of no interest and therefore greatly reduce disk space demands.

kautau@lemmy.world · edit-2 1 year ago

As someone interested in Usenet, what’s the best provider and client to start with in your opinion?

prim3r@lemmy.ca · 1 year ago

I’ve been using easynews/nzbgeek/nzbget with an arr stack on debian and it’s worked well for me. I’m fairly new to usenet, so take this with a giant grain of salt.

kautau@lemmy.world · 1 year ago

Cool, thanks for the reply!

Kushan@lemmy.world · 1 year ago

Sabnzbd is probably the best choice of download client, fyi.

CosmicApe@kbin.social · 1 year ago

Linux program names are fucking wild

deafboy@lemmy.world · 1 year ago

Running for 6 days, save_pieces: false

My database is currently 184 GB

pete_the_cat@lemmy.world · 1 year ago

Sounds interesting 😀 I’ll keep an eye on it, though I won’t be a primary user, I switched to usenet about a decade ago and only use torrents as a last resort.

Shdwdrgn@mander.xyz · 1 year ago

Looks like a fun project, but will you be providing any info on setting it up from scratch? I just don’t have an interest in docker containers.

cyberpunk007@lemmy.world · 1 year ago

Out of curiosity, why not? I’ve come around.

Shdwdrgn@mander.xyz · 1 year ago

I’ve just always used VMs for everything and set up each service to match my existing system. For example, my postfix servers have to all tie in to LDAP, mailman, and the host of services for authenticating email. It seems like the point of docker is to just have a completely preconfigured and self-contained setup. I guess I just don’t see how that would work in my environment where I already have some services like databases or LDAP already running elsewhere, and I run multiple instances for redundancy. And if I have to reconfigure all that stuff in docker anyway, how is that any better than simply using my existing VMs?

cyberpunk007@lemmy.world · 1 year ago

Used to be like you, then I moved from truenas core to scale where it’s now Linux and docker instead of freebsd and iocage jails.

So docker has this concept of persistent volumes. You configure all your settings in the initial setup command (docker compose) and define persistent volumes. This way you don’t lose your data.

Here’s an example, Plex. I run Plex in docker now. So my config directory is defined as a persistent volume. If I need to update Plex, or rebuild it or whatever, the container just updates and has all the data I need via the persistent volume. If the install is messed up or whatever I just get a newer image and run the docker compose and it fires up and mounts the persistent volume and off I go.

Basically it takes away the burden of having to figure out the OS configuration. Makes backups easier - and smaller. And the things are spun up, installed, and usable in seconds.

Shdwdrgn@mander.xyz · 1 year ago

Not sure the OS configuration is really a burden :-) I have several servers I have to keep up to date anyway. And backups aren’t really an issue, I just run rdiff-backup on everything to provide a year’s worth of incremental backups, which doesn’t really take much extra space. Maybe one of these days when I catch up on other projects I’ll look into it though.

cyberpunk007@lemmy.world · 1 year ago

On truenas scale though it’s just tiles in a web browser, it’s super easy. And since it runs on ZFS backups are easier too. Just click your way through periodic volume snapshot tasks.

Definitely a bit of a learning curve but it’s a sleek setup once you understand.

Shdwdrgn@mander.xyz · 1 year ago

I’m not quite sure what “truenas” is? All of my stuff is individually installed, I decided a long time ago to split it up onto VMs that each perform a specific task. I have a main file server that runs zfs, then two servers to run the redundant VMs. There’s not really anything difficult about backups, I just add a cron job to run a script once a day and never touch it again, so I have backups of each VM but then the backups of the main servers includes the VM image files so each VM gets backed up twice. There’s a lot of info there but the backups of all the critical stuff only use about 6TB (I could actually cut that in half if I got rid of the backups from older machines).

So let’s say I put in the time to learn how docker works, and then put in a lot more time converting all of my existing systems over to docker images… What exactly would I get out of all that effort? The thing that nobody’s been able to sell me on so far is that I don’t see how docker is going to make anything any easier, it just seems like it’s a “different” way to do things but nothing more.

cyberpunk007@lemmy.world · 1 year ago

Your data footprint would be less. Maintenance is a breeze. If you update your image and it breaks, just roll it back. Less consumption of resources. No need to divide your storage and ram for VMs. There are millions of docker images so you can start something new in seconds. And the learning curve isn’t too bad if you’re on truenas scale. Truenas core is a NAS operating system built on freebsd (Unix), and truenas scale is built on Linux. Both use ZFS for the underlying storage.

Dasnap@lemmy.world · 1 year ago

I personally love containers (probably because I use them for work) but I can understand someone not wanting another layer of abstraction if they’ve worked bare-metal for a long time.

mgdigital@lemmy.world · 1 year ago

Hi, yes this is mentioned on the installation page of the website, below the Docker instructions. The app can be installed Dockerless using go install; if you choose this option you’ll have to provide and configure Postgres and Redis instances for the app to connect to. That said, Docker is the recommended and easiest option.

Shdwdrgn@mander.xyz · 1 year ago

I saw that, but didn’t recognize the ‘go’ command as anything available on Debian. Just did some quick digging though and now I see it’s a new language and I believe I have an idea how to get it installed for compiling so I will give that a shot.

paris@lemmy.blahaj.zone · 1 year ago

Golang v1.0 was released in March of 2012. Not sure I would consider it a new language.

Shdwdrgn@mander.xyz · 1 year ago

Oh interesting… I thought I read something that said 2017. No worries, I’ll get it figured out now that I understand what it is.

Decronym@lemmy.decronym.xyz · edit-2 1 year ago

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

Fewer Letters	More Letters
NAS	Network-Attached Storage
Plex	Brand of media server package
SSD	Solid State Drive mass storage
VPN	Virtual Private Network

4 acronyms in this thread; the most compressed thread commented on today has 9 acronyms.

[Thread #191 for this sub, first seen 5th Oct 2023, 14:25] [FAQ] [Full list] [Contact] [Source code]

Biscoot@thelemmy.club · 1 year ago

This sounds awesome, I’ll give it a try! Would this work in i2p?

mgdigital@lemmy.world · 1 year ago

I’ve never used I2P but I don’t see why not!

pedroapero@lemmy.ml · 1 year ago

Great project!

Naming conventions are missing some important information like bitrate, color depth, and most importantly language and subtitles.

Do you plan to scrape additional info from known torrent sites (searching for torrent hashes for well-named torrents)?

mgdigital@lemmy.world · 1 year ago

Scraping torrent sites will be avoided as it’ll be prohibitively slow and break the self-sufficiency concept - we’ll infer as much as possible from the torrent meta info alone. You could have a guess at the bitrate from the file sizes. Sonarr/Radarr will already do this for you with quality profiles I think.
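For what it's worth, the bitrate guess is simple arithmetic once you have a runtime to go with the file size. The sketch below assumes the runtime is known from somewhere else (for example from movie metadata), since the torrent meta info itself only gives file sizes; `estimate_bitrate_kbps` is a hypothetical helper, not something that exists in Bitmagnet.

```python
def estimate_bitrate_kbps(size_bytes, runtime_minutes):
    """Average overall bitrate in kbit/s for a file of the given size and runtime."""
    if runtime_minutes <= 0:
        raise ValueError("runtime_minutes must be positive")
    bits = size_bytes * 8
    seconds = runtime_minutes * 60
    return bits / seconds / 1000


if __name__ == "__main__":
    # Hypothetical example: a 1.4 GB video file with a 100 minute runtime.
    print(round(estimate_bitrate_kbps(1_400_000_000, 100)), "kbit/s")
```

This is an average over the whole file, audio and container overhead included, so it is only a rough indicator of video quality.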

railsdev@programming.dev · 1 year ago

You had me at GraphQL 🥰

BlueÆther@no.lastname.nz · 1 year ago

seems to work well

just one question, is it expected to have 10,000 out of 12,000 as unknown?

mgdigital@lemmy.world · edit-2 1 year ago

Hi, yep that’s expected. Torrents will only move out of “Unknown” once the classifier is able to categorise them. The classifier currently only supports movie and TV show content, and can recognise these with quite high accuracy assuming a well-named torrent (and a badly named torrent is unlikely to be a high quality release). The other content types (music, games etc) can currently only be populated via an import (see the tutorial on the website). A priority feature is classifiers for other content types - however we will likely always have a lot of torrents ending up in “Unknown” given the poor naming of many crawled items. Another roadmap feature, smart deletion, could help in future with getting rid of all the rubbish whose contents cannot be inferred from the torrent name.
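To give a feel for why well-named torrents are easy to recognise and badly named ones end up in “Unknown”, here is a toy name-based classifier. It is purely illustrative and is not how Bitmagnet's classifier is implemented; the two regular expressions, the `classify` function and the sample names are all assumptions made for the example.

```python
import re

# "Title.S02E05..." style names: a title, a separator, then season/episode.
TV_PATTERN = re.compile(
    r"^(?P<title>.+?)[ ._-]+S(?P<season>\d{1,2})E(?P<episode>\d{1,3})\b",
    re.IGNORECASE,
)
# "Title.2019..." or "Title (2019)" style names: a title, then a 19xx/20xx year.
MOVIE_PATTERN = re.compile(r"^(?P<title>.+?)[ ._(\[-]+(?P<year>(?:19|20)\d{2})\b")


def clean_title(raw):
    """Turn dot/underscore separated words into a plain title."""
    return re.sub(r"[._]+", " ", raw).strip()


def classify(name):
    """Guess a content type from a torrent name alone."""
    match = TV_PATTERN.match(name)
    if match:
        return {
            "type": "tv_show",
            "title": clean_title(match["title"]),
            "season": int(match["season"]),
            "episode": int(match["episode"]),
        }
    match = MOVIE_PATTERN.match(name)
    if match:
        return {
            "type": "movie",
            "title": clean_title(match["title"]),
            "year": int(match["year"]),
        }
    # Nothing recognisable in the name.
    return {"type": "unknown"}


if __name__ == "__main__":
    for sample in (
        "Some.Show.S02E05.1080p.WEB.x264",
        "Example Movie (2019) 1080p BluRay",
        "xyz123_final",
    ):
        print(sample, "->", classify(sample))
```

A real classifier has to cope with far messier input (season packs, anime numbering, years that are part of a title), which is part of why so many crawled names stay unclassified.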
