OK, so first of all, let's get one thing cleared up. What the heck is ired? It isn't in the Arch Linux package repos. I found this[1], but it looks like an incomplete and abandoned project. It doesn't even have proper docs:
So like, I don't even know what `ired -n` is doing. From what I can tell from your commands, it's searching for `string`, but you first need to convert it to a hexadecimal representation.
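For what it's worth, here's a minimal sketch of that conversion using POSIX `printf`, `od` and `sed` (I'm assuming `/` is ired's search prefix, purely based on your sed invocation):

```shell
# Convert a literal string into the slash-prefixed hex form that the
# ired search pipeline appears to expect: "string" -> /737472696e67
printf %s string | od -An -tx1 | tr -d ' \n' | sed 's/^/\//'
```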
But okay, let's also check the output between the commands and make sure they're the same. I used my own file:
$ time grep -ob string 1-2048.txt
333305:string
333380:string
920494:string
5166701:string
5210094:string
6775219:string
real 0.006
user 0.006
sys 0.000
maxmem 15 MB
faults 0
$ time rg -ob string 1-2048.txt
13123:333305:string
13124:333380:string
33382:920494:string
159885:5166701:string
161059:5210094:string
211466:6775219:string
real 0.003
user 0.000
sys 0.003
maxmem 15 MB
faults 0
$ time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n 1-2048.txt"
0x515f9
0x51644
0xe0bae
0x4ed66d
0x4f7fee
0x6761b3
real 0.013
user 0.010
sys 0.004
maxmem 15 MB
faults 0
Indeed, the hexadecimal offsets printed by ired line up with the offsets printed by grep and ripgrep. Notice also the timing. ired is slower here for me.
OK, now let's do context:
$ time grep -ob string 1-2048.txt
[..snip..]
real 0.006
user 0.006
sys 0.000
maxmem 16 MB
faults 0
$ time grep -ob .string 1-2048.txt
[..snip..]
real 0.005
user 0.003
sys 0.003
maxmem 16 MB
faults 0
$ time grep -ob ..string 1-2048.txt
[..snip..]
real 0.006
user 0.003
sys 0.003
maxmem 16 MB
faults 0
$ time rg -ob string 1-2048.txt
[..snip..]
real 0.004
user 0.003
sys 0.000
maxmem 16 MB
faults 0
$ time rg -ob .string 1-2048.txt
[..snip..]
real 0.004
user 0.000
sys 0.003
maxmem 16 MB
faults 0
$ time rg -ob ..string 1-2048.txt
[..snip..]
real 0.004
user 0.004
sys 0.000
maxmem 16 MB
faults 0
I don't see anything worth saying "yikes" about here.
One possible explanation for the timing differences is that your search produces a lot of matches. The match count is a crucial part of benchmarking, and you've made the same mistake as the ugrep author by omitting it. But okay, let me try a search with more hits.
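(Related nit: `grep -c` counts matching lines rather than individual matches, which is why counting `-o` output with `wc -l` is the reliable way to get a match count:)

```shell
# grep -c counts matching *lines*; grep -o emits one match per line,
# so piping it to wc -l counts individual matches:
printf 'the the\nthe\n' | grep -c the         # 2 matching lines
printf 'the the\nthe\n' | grep -o the | wc -l # 3 matches
```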
$ time rg -ob the 1-2048.txt | wc -l
60509
real 0.011
user 0.006
sys 0.006
maxmem 16 MB
faults 0
$ time rg -ob .the 1-2048.txt | wc -l
60477
real 0.014
user 0.014
sys 0.000
maxmem 16 MB
faults 0
$ time rg -ob ..the 1-2048.txt | wc -l
60359
real 0.014
user 0.014
sys 0.000
maxmem 16 MB
faults 0
A little slower, but that's what you'd expect with the higher match frequency. Now let's try your 1.sh script:
$ echo the | time sh 1.sh 1-2048.txt 6 | wc -l
63304
real 0.048
user 0.072
sys 0.052
maxmem 16 MB
faults 0
$ echo the | time sh 1.sh 1-2048.txt 7 1 | wc -l
63336
real 0.056
user 0.096
sys 0.042
maxmem 16 MB
faults 0
$ echo the | time sh 1.sh 1-2048.txt 8 2 | wc -l
63419
real 0.053
user 0.079
sys 0.049
maxmem 16 MB
faults 0
(The counts are a little different because `..the` matches fewer things than `the` when given to grep, but presumably `ired` doesn't care about that.)
But in any case, ired is quite a bit slower here.
OK, let's pop up a level. Your benchmark is somewhat flawed, for two reasons. The first is that the timings are so short that the differences here are generally irrelevant to human perception. It reminds me of when ripgrep came out and someone would respond with a "gotcha" that `ag` was faster because it ran a search on a tiny repository in 10ms whereas ripgrep took 12ms. That's not exactly what's happening here, but it's close. The second is that the haystack is so short that overhead is likely playing a role: the timings are just too short to be reliable indicators of performance as the haystack size scales. See my commentary on ugrep's benchmarks[2].
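(One cheap way to reduce that noise, short of a real harness like hyperfine, is to repeat each command and take the best of N. A crude sketch with a synthetic haystack; the file name and contents are made up:)

```shell
# Crude best-of-N timing: repeat the search over a synthetic haystack.
# A proper harness (hyperfine, for example) adds warmups and statistics.
f=$(mktemp)
yes 'a line with string in it' | head -n 100000 > "$f"
for i in 1 2 3; do
  time grep -ob string "$f" > /dev/null
done
rm -f "$f"
```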
Let's try a bigger file:
$ stat -c %s eigth.txt
1621035918
$ file eigth.txt
eigth.txt: ASCII text
$ time rg -ob Sherlock eigth.txt | wc -l
1068
real 0.154
user 0.103
sys 0.050
maxmem 1551 MB
faults 0
$ time rg -ob .Sherlock eigth.txt | wc -l
935
real 0.156
user 0.096
sys 0.060
maxmem 1551 MB
faults 0
$ time rg -ob ..Sherlock eigth.txt | wc -l
932
real 0.154
user 0.107
sys 0.047
maxmem 1551 MB
faults 0
And now ired:
$ echo Sherlock | time sh 1.sh eigth.txt 6 | wc -l
1068
real 1.393
user 0.671
sys 0.729
maxmem 16 MB
faults 0
$ echo Sherlock | time sh 1.sh eigth.txt 7 1 | wc -l
1201
real 1.391
user 0.604
sys 0.793
maxmem 16 MB
faults 0
$ echo Sherlock | time sh 1.sh eigth.txt 8 2 | wc -l
1204
real 1.395
user 0.578
sys 0.823
maxmem 16 MB
faults 0
Yikes. Over an order of magnitude slower.
Note that the memory usage reported for ripgrep is high just because it's using file-backed memory maps. It's not actual heap usage. You can check this by disabling memory maps:
$ time rg -ob ..Sherlock eigth.txt --no-mmap | wc -l
932
real 0.179
user 0.063
sys 0.116
maxmem 16 MB
faults 0
And if we increase the match frequency on the same large haystack, the gap closes a little, but ired is still about 4x slower:
$ time rg -ob ..the eigth.txt | wc -l
13141187
real 2.470
user 2.418
sys 0.050
maxmem 1551 MB
faults 0
$ echo the | time sh 1.sh eigth.txt 8 2 | wc -l
13894916
real 10.027
user 16.293
sys 8.122
maxmem 402 MB
faults 0
I'm not clear on why you're seeing the results you are. It could be because your haystack is so small that you're mostly just measuring noise. ripgrep 14 did introduce some optimizations in workloads like this by reducing match overhead, but I don't think it's anything huge in this case. (And I just tried ripgrep 13 on the same commands above and the timings are similar if a tiny bit slower.)
> Does piping rg output to wc -l affect time(1) output?
Oh yes absolutely! If `rg` is printing to a tty, it will automatically enable showing line numbers and printing with colors. Both of those have costs (over and beyond just printing matches and their byte offsets) that appear irrelevant to your use case. Neither of those things are done by ired. It's not about `wc -l` specifically, but about piping into anything. And of course, with `wc -l`, you avoid the time needed to actually render the results. But I used `wc -l` with ired too, so I "normalized" the benchmarking model and simplified it.
But either way, my most recent comment before this one capitulated to your demands and avoided the use of piping results into anything. It was for this reason that I showed commands with `--color=never -N`.
And yes, `grep -Eo` gets slower with more `.`. ripgrep does too, but is a bit more robust than GNU grep. I already demonstrated this in my previous comment, and even explicitly wrote some words about how regex engines typically can't handle increasing window sizes like this as well as a purpose-built tool like ired. Nevertheless, ired is still slower than ripgrep in most of the tests I showed in my previous comment.
But could something even faster than both ired and ripgrep be built? I believe so, yes, though I suspect only for some workloads with high match frequency. And ain't nobody going to build such a thing just to save a few milliseconds. Lol. The key is really to implement the windowing explicitly instead of relying on the regex engine to do it for you. Alternatively, one could add a special optimization pass to the regex engine that recognizes "windowing" patterns and does something clever. I have a ticket open for something similar[1].
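To make "explicit windowing" concrete, here's a hypothetical sketch (not how ired actually works; the file contents, pattern and context width are made up): find the literal matches first, then carve out each match plus N preceding bytes, instead of asking the regex engine to match `..string`:

```shell
# Hypothetical windowing sketch: locate literal matches with grep -bo,
# then carve out each match plus $before bytes of preceding context
# with dd. (Matches near offset 0 get their window clamped.)
f=$(mktemp); printf 'abstringcd 12string' > "$f"
pat=string; before=2
grep -bo "$pat" "$f" | cut -d: -f1 | while read -r off; do
  start=$((off - before)); [ "$start" -lt 0 ] && start=0
  dd if="$f" bs=1 skip="$start" count=$((before + ${#pat})) 2>/dev/null
  echo
done
rm -f "$f"
```

The searcher only ever looks for the literal, so the window size doesn't affect the search itself; it only changes how many bytes get sliced out afterwards.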
I tried your `_test2` and `rg` without `wc -l` takes 0.02s while with `wc -l` it takes 0.01s. The difference is meaningless. I don't believe you if you say that impacts your edit-compile-run cycle.
+ echo .{150}(https:)|(http:).{10}
+ n=0
+ test 0 -le 3
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
real 1m 33.37s
user 0m 1.25s
sys 0m 2.97s
+ sleep 2
+ echo
+ echo Now try this with a pipe to wc -l...
+ echo
+ sleep 2
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
+ wc -l
real 0m 0.49s
user 0m 0.45s
sys 0m 0.02s
+ echo
+ sleep 5
+ echo
+ n=1
+ test 1 -le 3
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
real 1m 34.23s
user 0m 1.75s
sys 0m 4.22s
+ sleep 2
+ echo
+ echo Now try this with a pipe to wc -l...
+ echo
+ sleep 2
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
+ wc -l
real 0m 0.40s
user 0m 0.37s
sys 0m 0.02s
+ echo
+ sleep 5
+ echo
+ n=2
+ test 2 -le 3
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
real 1m 33.59s
user 0m 1.05s
sys 0m 1.76s
+ sleep 2
+ echo
+ echo Now try this with a pipe to wc -l...
+ echo
+ sleep 2
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
+ wc -l
real 0m 0.45s
user 0m 0.37s
sys 0m 0.04s
+ echo
+ sleep 5
+ echo
+ n=3
+ test 3 -le 3
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
real 1m 33.99s
user 0m 1.93s
sys 0m 4.82s
+ sleep 2
+ echo
+ echo Now try this with a pipe to wc -l...
+ echo
+ sleep 2
+ busybox time rg -o .{150}(https:)|(http:).{10} test.json
+ wc -l
real 0m 0.40s
user 0m 0.37s
sys 0m 0.02s
+ echo
+ sleep 5
+ echo
+ n=4
+ test 4 -le 3
+ exit
No, I am not going to stare at the screen for a minute and a half as thousands of matches are displayed. (In fact, I am unlikely to even be examining a file of this size. It's more likely to be under 6M.) With a file this size, what I would do is examine a sample of the matches, say, for example, the first 20.
Look at the speed of a shell script with 9 pipes, using ired 3x to examine the first 20 matches.
Not sure where "wc -l" came from. It was not in any of the tests I authored; I am not interested in line counts. Nor am I interested in very large files, or ASCII files of Shakespeare, etc. As I stated in the beginning, I am working with files that are like "a wall of text": long lines, few if any linebreaks. Not the type of files that one can read or edit using less(1), ed(1) or vi(1).
What I am interested in is how fast the results display on the screen. For files of this type in the single digit MB range, piping the results to another program does not illustrate the speed of _displaying results to the screen_. In any event, that's not what I'm doing. I am not piping to another program. I am not looking for line counts. I am looking for patterns of characters and I need to see these on the screen. (If I wanted line counts I would just use grep -c -o.)
When working with these files interactively at the command line, performing many consecutive searches,^1 the differences in speed become readily observable. Without need for time(1). grep -o is ridiculously slow. Hence I am always looking for alternatives. Even a shell script with ired is faster than grep. Alas, ripgrep is not a solution either. It's not any faster than ired _at this task for files of this type and size_.
Part of the problem in comparing results is that we are ignoring the hardware. For example, I am using a small, low resource, underpowered computer; I imagine most software developers use more expensive computers that are much more powerful with vast amounts of resources.
Try some JSON as a sample, but note this is not necessarily the best example of the "walls of text" I am working with, which do not necessarily conform to any standard.
curl "https://api.crossref.org/works?query=unix&rows=1000" > test.json
busybox time grep -Eo .{100}https:.{50} test.json
real 0m 7.93s
user 0m 7.81s
sys 0m 0.02s
This is still a tiny file in the world of software developers. Obviously, if this takes less than 1 second on some developer machine, then any comparison with me, an end user with an ordinary computer, is not going to make much sense.
1. Not an actual "loop", but an iterative, loop-like process of search file, edit program, compile program, run program, search file, edit program, ...
With this, speed becomes noticeable even if the search task is relatively short-lived.
Well then, can you share such a file? I wasn't measuring the time of wc; I just did that to confirm the outputs were the same. The fact is that I can't reproduce your timings, and ired is significantly slower in the tests I showed above.
I tried my best to stress the importance of match frequency, and even varied the tests on that point. Yet I am still in the dark as to the match frequency in your tests.
The timing differences even in your tests seem insignificant to me, although they could become significant in a loop or something. Hence the reason I used a larger file. Otherwise the difference in wall time appears to be a matter of milliseconds. Why does that matter? Maybe I'm reading your timings wrong, but that would only deepen the mystery as to why our results are so different. Hence my request for an input that you care about, so that we can get on the same page.
Not sure if it was clear or not, but I'm the author of ripgrep. The benefit to you from this exchange is that I should be able to explain why the perf difference has to exist (or is difficult to remedy) or file a TODO for making rg -o faster.
Another iterative loop-like procedure is search, adjust pattern and/or amount of context, search, adjust pattern and/or amount of context, search, ...
If a program is sluggish, I will notice.
The reason I am searching for a pattern is because there is something I consider meaningful that follows or precedes it. Repeating patterns would generally not be something I am interested in. For example, a repeating pattern such as "httphttphttp". The search I would do would more likely be "http". If for some reason it repeats, then I will see that in the context.
For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, also in hexdump or formatted hex, like xxd -p.
And it will do all this faster than grep -o and nearly as fast as a big, fat grep clone in Rust that spits out coloured text by default, even when ired is in a shell script with multiple pipes and other programs. TBH, neither grep nor grep clones are as flexible; they are IMO not suitable for me, for this type of task. But who knows there may be some other program I do not know about yet.
Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different. Not every user is using the same hardware. Nor is every user trying to do the exact same things with their computer.
For example, I have tried all the popular UNIX shells. I would not touch zsh with a ten-foot pole. Because I can feel the sluggishness compared to working in dash or NetBSD sh. I want something smaller and lighter. I intentionally use the same shell for interactive and non-interactive use. Because I like the speed. But this is not for everyone. Some folks might like some other shell, like zsh. Because [whatever reasons]. That does not mean zsh is for everyone, either. Personally, I would never try to proclaim that the reasons these folks use zsh are "insignificant". To those users, those reasons are significant. But the size and speed differences still exist, whether any particular user deems them "significant" or not.
Well yes of course... But you haven't demonstrated ripgrep to be sluggish for your use case.
> For me, neither grep nor grep clones are as useful as ired. ired will show me the context including the formatting, e.g., spaces, carriage returns. It will print the pattern plus context to the screen exactly as it appears in the file, also in hexdump or formatted hex, like xxd -p.
Then what are you whinging about? grep isn't a command line hex editor like ired is. You're the one who came in here asking for grep -o to be faster. I never said grep (or ripgrep) could or even should replace ired in your workflow. You came in here talking about it and making claims about performance. At least for ripgrep, I think I've pretty thoroughly debunked your claims about perf. But in terms of functionality, I have no doubt whatsoever that ired is better fitted for the kinds of problems you talk about. Because of course it is. They are two completely different tools.
ired will also helpfully not report all substring results. I love how you just completely ignore the fact that your useful tool is utterly broken. I don't mean "broken" lightly. It has had hidden false negatives for 14 years. Lmao. YIKES.
> they are IMO not suitable for this type of task
Given that ripgrep gives the same output as your ired shell script (with a lot less faffing about) and it does it faster than ired, I find this claim baseless and without evidence. Of course, ripgrep will not be as flexible as ired for other hex editor use cases. Because it's not a hex editor. But for the specific case you brought up on your own accord because you wanted to come complain on an Internet message board, ripgrep is pretty clearly faster.
> nearly as fast as a big, fat grep clone in Rust
At least it doesn't have a 14 year old bug that can't find ABAB in ABAABAB. Lmao.
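(For the record, the ground truth is trivial to check with grep's byte offsets:)

```shell
# ABAB does occur in ABAABAB, at byte offset 3. A searcher that skips
# ahead by the full pattern length after a partial match will miss it.
printf ABAABAB | grep -bo ABAB    # prints "3:ABAB"
```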
But I see the goalposts are shifting. First it was speed. Now that that has been thoroughly debunked, you're whinging about binary size. I never made any claims about that or said that ripgrep was small. I know it's fatter than grep (although your grep is probably dynamically linked). If people like you want to be stingy with your MBs to the point that you won't use ripgrep, then I'm absolutely cool with that. You can keep your broken software.
> Significance can be subjective. What is important to me may not be important to someone else, and vice versa. Every user is different.
A trivial truism, and one that I've explicitly acknowledged throughout this discussion. I said that milliseconds in perf could matter for some use cases, but it isn't going to matter in a human paced iteration workflow. At least, I have seen no compelling argument to the contrary.
It's even subjective whether or not you care if your tool has given you false negatives because of a bug that has existed for 14 years. Different strokes for different folks, amiright?
1. Printing byte offsets.
2. Printing some "context" around the matched string. For example, add characters immediately preceding string.

Baseline. Add one character. Add another character. Yikes.

Now let's try ired. Another shell script. This one will print all occurrences of string. Baseline. Add one character before string. Add another.

Perhaps grep or ripgrep might be slightly faster at printing byte offsets. But ired is faster at printing matches with context. (NB. Context here means characters, not lines.)

Try using ripgrep to print offsets for ired.