I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.

Would love to hear how you would design it.

  • mvirts@lemmy.world · 12 hours ago

    If you’re writing a program, definitely use multiple threads or processes that each scan a chunk of the file: seek to the start of the chunk and read lines into the scan code until you hit the end of the chunk. For JSONL, each chunk needs an alignment step so that chunk boundaries don’t split lines and break the JSON.
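
    A minimal Python sketch of that idea, under one common convention (the file name, pattern, and function names here are made up for illustration): each worker owns exactly the lines whose first byte falls inside its byte range, and the alignment step is to back up one byte and discard the line that straddles the chunk start, since the previous worker will read that line to completion.

    ```python
    import json
    import os
    from concurrent.futures import ThreadPoolExecutor

    def scan_chunk(path, start, end, pattern):
        """Collect JSON objects from lines whose first byte lies in [start, end)."""
        matches = []
        with open(path, "rb") as f:
            if start > 0:
                # Alignment step: back up one byte and discard the rest of
                # the line straddling the chunk start; the previous worker
                # owns that line and reads it to completion.
                f.seek(start - 1)
                f.readline()
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                if pattern in line:  # grep-style byte substring match
                    try:
                        matches.append(json.loads(line))
                    except json.JSONDecodeError:
                        pass  # tolerate the odd malformed line
        return matches

    def parallel_scan(path, pattern, workers=4):
        """Split the file into byte ranges and scan them concurrently."""
        size = os.path.getsize(path)
        step = -(-size // workers)  # ceiling division
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(scan_chunk, path, i * step,
                            min((i + 1) * step, size), pattern)
                for i in range(workers)
            ]
            return [m for fut in futures for m in fut.result()]
    ```

    Matching on raw bytes before parsing keeps the hot loop cheap; the same byte-range convention works unchanged with processes instead of threads.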

    For command-line trickery, maybe the file could be chunked up by running multiple dd instances, each with an offset, piped into grep. This has many synchronization issues, and the outputs should all be captured separately and then combined afterwards. I can’t think of a clean way to align this method to line edges, but maybe you can overlap the chunks and put some fancy regular-expression magic into the grep step to ignore the malformed JSON fragments at the beginning and end of each chunk.
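
    A rough sh sketch of that dd trick (the file name, pattern, and sizes are stand-ins, and it assumes lines shorter than one dd block): each chunk after the first starts one block early, so a line cut at a chunk boundary appears whole in the neighbouring chunk; a `grep '^{.*}$'` pass plays the role of the “regex magic”, dropping the truncated fragments at chunk edges; outputs are captured separately and de-duplicated at the end, which also assumes the lines themselves are unique.

    ```shell
    #!/bin/sh
    # Chunked dd-into-grep sketch; FILE, PATTERN, and sizes are hypothetical.
    FILE=data.jsonl
    PATTERN=error
    BS=4096          # dd block size in bytes
    CHUNK=4          # blocks per chunk

    # Tiny sample input so the sketch runs as-is; point FILE at the real data.
    [ -f "$FILE" ] || for n in $(seq 1 1000); do
        [ $((n % 7)) -eq 0 ] && msg=error || msg=ok
        printf '{"id": %d, "msg": "%s"}\n' "$n" "$msg"
    done > "$FILE"

    SIZE=$(wc -c < "$FILE")
    BLOCKS=$(( (SIZE + BS - 1) / BS ))
    i=0
    while [ $(( i * CHUNK )) -lt "$BLOCKS" ]; do
        skip=$(( i * CHUNK ))
        # Overlap: chunks after the first start one block early, so a line
        # split at a chunk boundary is seen whole by one of the chunks.
        [ "$skip" -gt 0 ] && skip=$(( skip - 1 ))
        dd if="$FILE" bs="$BS" skip="$skip" count=$(( CHUNK + 1 )) 2>/dev/null \
            | grep '^{.*}$' | grep -- "$PATTERN" > "out.$i" &
        i=$(( i + 1 ))
    done
    wait
    # Overlapping chunks can report the same line twice; sort -u drops repeats.
    sort -u out.* > matches.txt
    rm -f out.*
    ```

    The overlap is what buys line alignment here: every complete line lands wholly inside at least one chunk, at the cost of boundary matches showing up twice before the final de-duplication.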

    Grep is already fast, though; maybe test the simple single-process approach first and see how long it takes.
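
    That baseline is a one-liner (again with a made-up file and pattern; the stand-in input line just keeps the sketch runnable as-is):

    ```shell
    # Time a plain single-process grep before building anything parallel.
    FILE=data.jsonl
    [ -f "$FILE" ] || printf '{"msg": "error"}\n{"msg": "ok"}\n' > "$FILE"  # stand-in input
    time grep -c -- "error" "$FILE"
    ```

    If that finishes fast enough, the chunking machinery above may not be worth its complexity.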