I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.

Would love to hear how you would design it.

  • mvirts@lemmy.world · 12 hours ago

    If you’re writing a program, definitely use multiple threads or processes that each scan a chunk of the file: seek to the start of the chunk and read lines into the scan code until you hit the end of the chunk. For JSONL, each chunk needs an alignment step so that chunk boundaries don’t split lines and break the JSON.
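
    A minimal Python sketch of that idea, under one common convention (the file name, pattern, and function names here are made up for illustration): each worker owns exactly the lines whose first byte falls inside its byte range, and the alignment step is to back up one byte and discard the line that straddles the chunk start, since the previous worker will read that line to completion.

    ```python
    import json
    import os
    from concurrent.futures import ThreadPoolExecutor

    def scan_chunk(path, start, end, pattern):
        """Collect JSON objects from lines whose first byte lies in [start, end)."""
        matches = []
        with open(path, "rb") as f:
            if start > 0:
                # Alignment step: back up one byte and discard the rest of
                # the line straddling the chunk start; the previous worker
                # owns that line and reads it to completion.
                f.seek(start - 1)
                f.readline()
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                if pattern in line:  # grep-style byte substring match
                    try:
                        matches.append(json.loads(line))
                    except json.JSONDecodeError:
                        pass  # tolerate the odd malformed line
        return matches

    def parallel_scan(path, pattern, workers=4):
        """Split the file into byte ranges and scan them concurrently."""
        size = os.path.getsize(path)
        step = -(-size // workers)  # ceiling division
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(scan_chunk, path, i * step,
                            min((i + 1) * step, size), pattern)
                for i in range(workers)
            ]
            return [m for fut in futures for m in fut.result()]
    ```

    Matching on raw bytes before parsing keeps the hot loop cheap; the same byte-range convention works unchanged with processes instead of threads.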

    For command-line trickery, maybe the file could be chunked up by running multiple dd instances, each with an offset, piped into grep. This has many synchronization issues, and the outputs should all be captured separately and then combined afterwards. I can’t think of a clean way to align this method to line edges, but maybe you can overlap the chunks and put some fancy regular-expression magic into the grep step to ignore the malformed JSON fragments at the beginning and end of each chunk.
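
    A rough sh sketch of that dd trick (the file name, pattern, and sizes are stand-ins, and it assumes lines shorter than one dd block): each chunk after the first starts one block early, so a line cut at a chunk boundary appears whole in the neighbouring chunk; a `grep '^{.*}$'` pass plays the role of the “regex magic”, dropping the truncated fragments at chunk edges; outputs are captured separately and de-duplicated at the end, which also assumes the lines themselves are unique.

    ```shell
    #!/bin/sh
    # Chunked dd-into-grep sketch; FILE, PATTERN, and sizes are hypothetical.
    FILE=data.jsonl
    PATTERN=error
    BS=4096          # dd block size in bytes
    CHUNK=4          # blocks per chunk

    # Tiny sample input so the sketch runs as-is; point FILE at the real data.
    [ -f "$FILE" ] || for n in $(seq 1 1000); do
        [ $((n % 7)) -eq 0 ] && msg=error || msg=ok
        printf '{"id": %d, "msg": "%s"}\n' "$n" "$msg"
    done > "$FILE"

    SIZE=$(wc -c < "$FILE")
    BLOCKS=$(( (SIZE + BS - 1) / BS ))
    i=0
    while [ $(( i * CHUNK )) -lt "$BLOCKS" ]; do
        skip=$(( i * CHUNK ))
        # Overlap: chunks after the first start one block early, so a line
        # split at a chunk boundary is seen whole by one of the chunks.
        [ "$skip" -gt 0 ] && skip=$(( skip - 1 ))
        dd if="$FILE" bs="$BS" skip="$skip" count=$(( CHUNK + 1 )) 2>/dev/null \
            | grep '^{.*}$' | grep -- "$PATTERN" > "out.$i" &
        i=$(( i + 1 ))
    done
    wait
    # Overlapping chunks can report the same line twice; sort -u drops repeats.
    sort -u out.* > matches.txt
    rm -f out.*
    ```

    The overlap is what buys line alignment here: every complete line lands wholly inside at least one chunk, at the cost of boundary matches showing up twice before the final de-duplication.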

    Grep is already fast, though; maybe test the simple single-process approach first and see how long it takes.
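
    That baseline is a one-liner (again with a made-up file and pattern; the stand-in input line just keeps the sketch runnable as-is):

    ```shell
    # Time a plain single-process grep before building anything parallel.
    FILE=data.jsonl
    [ -f "$FILE" ] || printf '{"msg": "error"}\n{"msg": "ok"}\n' > "$FILE"  # stand-in input
    time grep -c -- "error" "$FILE"
    ```

    If that finishes fast enough, the chunking machinery above may not be worth its complexity.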