
I was curious, so I tried to do this with GitHub search.

The search “Golang” returns 225k+ repositories. Each page of results has 10 repositories. I queried page 1000 to see the 10001st-10010th results and got a 404.

I searched Wikipedia (using its own search page) for “America” and was told there were 1.9M results. I requested the 10001st result and it failed with this message:

“An error has occurred while searching: Could not retrieve results. Up to 10000 search results are supported, but results starting at 10000 were requested.”

There was an HN thread along these lines about web search somewhat recently, with people making confident accusations about false advertising etc.

But it really seems that everyone does this. I don’t know, maybe it’s still wrong or misleading.

But personally I’m ok knowing “there are X hits and we will serve you some of them”.

EDIT: I just searched the Library of Congress for “god” and was told I was viewing results 1-25 of 324,782. Sure enough, when I asked for page 10,000 I was rebuffed. I really don’t think this is a Google thing.



It is easy to think that this is normal, and I can think of many non-crappy reasons that it became so.

That said, I do think it borders on dishonest. Even if well intended, there is no good-faith basis for believing there were actually that many meaningful hits on any of these searches. Nor is there really much gained by presenting such large numbers.

I can almost buy that it is intended directionally, to help refine search terms. But the numbers are so absurdly large that I don't. Especially as there is no way to see the last results and learn exactly why they count as hits but are not worth serving.


If you look at the number of results rather than the results themselves, you clearly care about the number. The number is the information. It can answer a question like: How many books in the Library of Congress contain the word "God"?[1]

Similarly whenever I've looked at the number of results in Google it was to judge how common something is. For example, if I want to know the most commonly used spelling between two options, I Google both and check the number. That's what the number is helpful for, and really the only thing it could be helpful for. It is its own information.

[1] I know the question might need to be revised for the number to reflect the answer accurately; you get the idea.


But there is no way of inspecting that number to know what it actually means. Is it including prefix matches? Suffix matches? Whole words only? Possible misspellings? Acronyms? Translations?

I get that it can be somewhat directional, but I question that anyone is getting meaningful data out of it. :(


Also, some pages are mirrored all over the internet.


You've just opened a can of worms. It raises a very interesting question: when asking something like

> if I want to know the most commonly used spelling between two options

do those mirrors count as "use" of the word? Or does "use" refer only to the original writer?

If a neologism is written once and read by millions, it carries more weight than one read by only a few people. Perhaps the same goes for spelling variants: the more people see it, the more people agree it's the typical spelling. Lots of mirroring could be a decent proxy for lots of readership.


You can in some cases compare different keywords to each other.


The number is not _the_ information. It's just _some_ information. When I search the web I want the content, and very very rarely care about the actual number of results. So yeah, showing a search result count they can't actually serve is dishonest and gives a false sense of reality.


No, in context of what I said (when the number is what you're looking at) the number is the information. If the content is what you care about, the number being 2 million or 5 million is meaningless because you'll never get through all the content.


This reads more like a sign saying "over x customers served" at a fast food restaurant. It doesn't really matter; nobody will ever care about the accuracy of the information, but it's a pride thing and makes a point about the establishment.

Obviously there aren't millions of meaningful hits on most terms, but you don't need to go to page 10001 for that. You can get to page 3 and know it for sure.


I'd argue that, to a layman, if it is obvious that there aren't that many meaningful hits, then there weren't that many hits. :(

This does call back to the odd false confidence that our industry bakes into the interview process. "Design a realtime chat program that can notify any number of followers that you posted and let them respond with sub millisecond latency."


I can’t push a single key in less than 20ms latency. Those poor interviewees are building Twitter but for bots. Unfortunately, that is probably a relevant skill…


The number of search results is plenty useful, even for grammar! If the spell checker is no help, use Google and see which option is more popular (I used to do this often when I wrote more in my native language, which has a few peculiarities). It can give valuable insight into the popularity of different things.

I do consider the number of results trustworthy (not exact, but a good estimate). With the index stored in inverted-index format, and efficient joins being a solved problem, it's just a matter of getting the size of the term's postings list and summing over the shards; something Google has to do anyway for regular retrieval.
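That calculation can be sketched in a few lines. This is a toy model (the structure and names are my own, not Google's): each shard keeps an inverted index, and the hit count is just the sum of postings-list sizes, with no ranking involved.

```python
from collections import defaultdict

def build_shard_index(docs):
    """Build a term -> set of doc IDs inverted index for one shard."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def estimate_hits(shards, term):
    """Sum the postings-list length for `term` over all shards."""
    return sum(len(shard.get(term, ())) for shard in shards)

shard_a = build_shard_index({1: "god of war", 2: "war and peace"})
shard_b = build_shard_index({3: "city of god", 4: "peace"})

assert estimate_hits([shard_a, shard_b], "god") == 2
assert estimate_hits([shard_a, shard_b], "war") == 2
```

Note that nothing here touches relevance scoring; that's why the count can be cheap even when serving page 10,000 is not.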


Right, I would understand "found 10k results but can only show the first 1k, please refine your search".

That's reasonable. But as you said, what they're doing now is bordering on dishonest.


The Library of Congress is quite explicit about this on its first search page, rather than showing an irrelevant number. So having limits isn't unusual, but not everyone is deceptive about it.

You Searched: ALL: God

Your search retrieved more records than can be displayed. Only the first 10,000 will be shown.

Titles List: 1-25 of 10000 https://catalog.loc.gov/vwebv/search?searchArg=God&searchCod...


Yeah - as someone who has run production search clusters on technologies like Elasticsearch / OpenSearch, deep pagination is rarely used and is an extremely annoying edge case that takes your cluster memory to zero.

I found it best to optimize for whatever is reasonable and useful for users, while capping seriously resource-intensive but low-value queries (mostly bots / folks trying to mess with your site) at some number that works within your main nodes' memory limits.


I guess they have multiple search interfaces.

I navigated to loc.gov and searched directly from the home page.

I was sent to this page: https://www.loc.gov/search/?in=&q=god&new=true&st=

At the top, it says “Results 1-25 of 324,782” and does not mention the limit of 10,000 anywhere.


It does once you get to the 4,001st page of the results (which is actually 100k items): "Sorry! We can't process this request. This request exceeds the maximum search results depth." The error page even includes a link to "LC for Robots" (https://labs.loc.gov/lc-for-robots/), a list of APIs, since few humans are going to get that far on their own.

The number here also seems less misleading because all 324,782 results do seem to exist. It doesn't want to generate pagination for the entire set, but you could get to them by choosing different formats, date ranges, collections, etc. The number Google reports, as far as I can tell, needs to be taken on faith.


Ahh, looks like the limit on results is 100,000 items on that page. https://www.loc.gov/search/?q=god&sp=4000

But you can subdivide the search results by date and get every item on your original list: https://www.loc.gov/search/?dates=1890/1899&q=god&sp=1808

So, they have and can show you every single one of those 324,782.


My guess is the estimate comes from a term-frequency index, which is pretty easy to build. Estimates like that can come from HyperLogLog or similar.

Asking for page 10,000 is asking the search engine to search and _rank_ 10,000 * 10 results and give you the last one. That's very expensive and ultimately useless - search is about finding what you're looking for on page 1, not on page 10,000 :)

So it is true that there are 225k+ repositories using Golang (you can compute that with an index scan once a week), but searching them is an entirely different problem.


If they cannot give you the page results, how can you trust that page 1 is better than page 34557?


Huh? It's a ranked retrieval model. Each result has scored a little bit worse on their relevancy function than the one above it.

To not trust that the results on page 1 are better than those on page 34557 is the same as saying that their ranking function does not work at all, which would mean that it's at best as good as random chance. That's clearly not the case, therefore I can trust that page 1 indeed has more relevant results than page 34557.

With that said, page 34557 doesn't exist. And that's fine. The result count estimate is not based on the actual ranking that has taken place (at least not directly). It would be an absolute waste of resources to rank that many results. If you cannot find what you're looking for on the first page, then it's much easier to reformulate your query. Easier for you because it gives you more control over what you want your search results to be, and easier for Google because it only needs to rank a couple hundred results instead of a bajillion.


Well, one could claim that their ranking function indeed does not work well - at least recently. Relevant stuff rarely shows up on the first page anymore; it loses to various spam sites with AI-written articles that match the typed query and rank well, but are themselves misleading and incorrect. I remember the days when you could really dig deep into the results pages. Maybe not page 34557, but past page 100 you could find niche, human-written content on an interesting topic.


Yeah, I used to pretty frequently click through, and find relevant/useful content, some 20+ pages in on paginated web search results. Well, unfortunately the major search engines have changed their functionality such that there's usually no point in even looking beyond a couple pages (or, arguably, even looking at the first page for that matter).


If you search a famous name like Joe Biden or Donald Trump, wouldn't it be useful to read the 10,000th thing written about them?


Like the others have said, Google wants you to refine your search (e.g. "Joe Biden foreign policy" instead of "Joe Biden") rather than dumping low-quality results on you. You don't have to accept Google's rule; you can switch to other search engines. But if you do use Google, abide by it.


Ehhh, I write very specific search terms, with quoted elements, and I still get shockingly irrelevant results. I'd literally rather get a "no useful results" response rather than hundreds of pages of results that literally DO NOT CONTAIN the terms I typed into the search box.


If you put the search terms in quotation marks it will only show results where that term appears verbatim on the page. Also searching -keyword will exclude search results where that word is present.


> But it really seems that everyone does this

If everyone lies, that doesn't make it right to lie, no matter the technological barrier.

We shouldn't accept and normalize this behavior. If a search claims there are six million results, I should be able to see any or all of those claimed.

If Google (and the others) can't let me see all the results, be honest and tell me. "Google found 6,553,500 results. Showing the 5,000 most relevant."

Google advertises that it has x results, but there is no way to know if that is true, or a lie.

Is it really so hard to not lie to your users?


The issue is that the number of results is often an estimate for performance reasons. So the results should say something like "around X results" or "approximately X results" to make that clearer.

If Google or some other search-based service had to check every document for a match to verify that the results were accurate (to avoid false positives and negatives) then 1) search would be slow; 2) it would be expensive; and 3) the search couldn't handle many requests (as that would kill the database server).

What search engines/databases tend to do is make use of fast and efficient lookup tables or indices, then perform operations on those results (such as joining between different tables) depending on the particular search terms/options (e.g. if you are searching for a specific content type or timeframe).

There is a lot of complexity in making the numbers and results both accurate and efficient.


Because as you skip pages, with most pagination implementations the DB has to scan more index keys. If they allowed paginating that far without some kind of expensive preaggregation, you could easily mount a DDoS attack.

?afterId=x performs much better than page-based pagination and doesn't have this problem as much, but it can still cause undesired I/O and cache activity in the DB when you paginate deep into unpopular content.
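The difference can be demonstrated with SQLite (a toy schema of my own, not anyone's real one): OFFSET pagination walks and discards every skipped row, while the ?afterId=x style seeks straight into the primary-key index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repos (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO repos VALUES (?, ?)",
                 [(i, f"repo-{i}") for i in range(1, 1001)])

PAGE = 10

def page_by_offset(page_no):
    # Scans and throws away (page_no - 1) * PAGE rows before returning.
    return conn.execute(
        "SELECT id, name FROM repos ORDER BY id LIMIT ? OFFSET ?",
        (PAGE, (page_no - 1) * PAGE)).fetchall()

def page_after_id(after_id):
    # Seeks directly into the primary-key index, then reads PAGE rows.
    return conn.execute(
        "SELECT id, name FROM repos WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, PAGE)).fetchall()

# Same page either way, but the keyset form never touches skipped rows.
assert page_by_offset(51) == page_after_id(500)
```

The keyset form's cost stays roughly constant regardless of depth, which is exactly why deep offsets are the part services refuse to serve.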


> But personally I’m ok knowing “there are X hits and we will serve you some of them”.

That's not what it is, or at least that's not what they're trying to convince people it is. Otherwise they wouldn't change the numbers on the last page.

    Page 21 of about 16,890,000,000 results (0.82 seconds)
    Page 22 of about 214 results (0.95 seconds) 
I don't think other companies doing the same thing makes it okay unless it's reasonably understood as puffery. That said, I suppose it's not false advertising, because you're not buying a search.


Not sure what GitHub uses internally, but Elasticsearch has a default limit of 10,000 records unless you update the index with a parameter. I imagine a lot of apps have a limit for this reason.
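For reference, the Elasticsearch setting behind that default 10,000-record cap is index.max_result_window. Raising it is a per-index settings update; this only builds the request body (the index name "my-index" is hypothetical, and a live cluster is needed to actually apply it):

```python
import json

# Body for: PUT /my-index/_settings
settings_body = {"index": {"max_result_window": 50000}}
request_json = json.dumps(settings_body)
```

Raising the window trades memory and CPU on deep requests for the ability to page further, which is why most apps leave the default alone.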


Loading that many results just to drop the first N is a lot of overhead... 10,000 is arbitrary, but it's a pretty common cutoff, since no reasonable person really wants to see that far into weighted search results.


> I queried page 1000 to see the 10001st-10010th results and got a 404.

Yeah, the GitHub API also only serves 100 pages with 100 results each max. If you want more, you need to use dirty hacks like slicing the search into blocks by sorting and filtering based on the creation date or the number of stars of a repo.

Be careful though, in my experience the results are not consistent anymore as soon as you enter low star or old repository territory, probably depending on the actual API server you hit, forcing you to query multiple times to really get (hopefully) all results. :)


If you're using Elastic, it's inefficient and discouraged to allow accessing random pages (e.g. requesting page 100 before having visited pages 1-99). So this could be a technical limitation, and there aren't many use cases that require accessing those pages arbitrarily. For Shodan, we allow random access for the first 10 pages, but if you want to go beyond that you need to go in order, so we can use a search cursor on the backend to step through the results more efficiently.


I think if you jumped forward by tens, you might see some results on some of those sites. Skipping forward in a search index is exactly the same on the server side as moving incrementally, and they likely didn't want to do 10000 pages for one request.


Reminds me of a rookie mistake I made - caching.

I had a key-value store populated from map-reduce jobs: look up a key, return the cached result, refresh the cache every X days. With increasing workloads, each job took longer, pushing X higher and higher. So some results you'd see were days old; not noticeable to most.

Is it possible that the "count" here is stale?


I noticed you get the thousands of results when you search logged out in incognito mode, and just a few hundred results when searching logged in with a Google account.

Maybe they are saving storage on the amount of results as they serve and track you on the web :-)


Mentioned previous related discussion is here https://news.ycombinator.com/item?id=32777737


From LOC website:

"Note about deep paging limitations

Due to the technical limitations of search engine technologies, it is not recommended that users page through a large number of result pages. If the number of result pages is excessive, it will be better to use faceting or more specific search terms to reduce the result set. Paging past the 100,000th item in a search result is not supported at this time. In some searches, responses may fail before 100,000 items."

https://www.loc.gov/apis/json-and-yaml/

This is what the OP states:

"A misconception regarding Google's search results is that all of the results are being served to the user conducting that particular search. Those 2 billion search results can't be gotten through Google's pagination, and it seems that this number is somewhat arbitrary to the search, or commonality of the keyword."

That seems accurate. One cannot retrieve the full number of results. The parent demonstrated this with some examples.^1

The pertinent question IMO is how many results can one retrieve. As someone who started using www search engines in 1993, that number keeps shrinking. IMO, this is a reflection of companies like Google seeking to commercialise the web for their own benefit. Google wants the web and paid advertisement to be synonymous.

With Wikipedia or LOC, one can retrieve more results than one can using Google. More importantly, the sorting order (ranking) is different. Not sure about Wikipedia, but no one uses "SEO" for LOC. It is a curated collection, unlike the unwashed web. In the pre-Google era, people would choose the name "Acme" for their businesses so they could be listed first in the Yellow Pages. Google does not allow alphabetical listing. The reader should be able to figure out why: the web is not curated. It is not a library. Google is only interested in what sells advertising.

Using Google from the command line (no cookies, no Javascript)

   "golang" 457 results max
   "god" 576 results max
   "america" 448 results max
Those numbers are laughable if one sees Google as some kind of "oracle" for open-ended questions, a gateway to the world's information, or even to the contents of the www, as many seem to do. It is not even close. It is a filter. The filter has a purpose. The purpose is commercial.

Of course Google is useful but one is kidding themselves if they believe Google is anything like a library. Libraries (public, academic) generally do not subsist on selling advertising services. Library websites and Google may use computers and similar software that have limitations but that does not mean they share the same principles. There is no reason to believe LOC would not serve all www users with more than 10000 results if the software allowed it. LOC is not promoting some results over others based on commercial objectives.

The parent may be referring to this recent HN comment allegedly from a former Google employee:

https://news.ycombinator.com/item?id=32785079

Google fans are happy if the reader conflates (a) technical limitations with (b) commercial objectives, e.g., "secret algorithms" for ranking results. Anyone using computers and software will be subject to (a) but not every entity using computers and software to assist patrons with searching its catalog (database) must engage in Google-like behaviour.

1.

YMMV, but I found GitHub would only return 90 results max for "golang" when searching from the command line (no cookies, no Javascript).

Wikipedia caps results at 10000. This is stated on the website.



> EDIT: I just searched the Library of Congress for “god” and was told I was viewing results 1-25 of 324,782. Sure enough when I asked for page 10,000 I was rebuffed. I really don’t think this is a Google thing

This is a common problem called deep pagination, and it basically comes down to how most search systems work. When you make that search request on loc.gov, it eventually turns into a complex request for Apache Solr. Answering that request requires Solr to find all of the documents that match the query and any other active filters (e.g. document type, date, etc.), and calculate their rank in the search result set so it can determine which specific results are numbers 10,000-10,050. Solr/Lucene are really mature and well-tuned, but when you think about how many millions of documents there are in that index, this is still calculating a pretty large temporary set only to serve 50 records, throw away the other 324k, and go on to the next query. Your next request might arrive a second later but get routed to a different replica, so it’s a cold cache there, too. Click one facet on the filter list, and the search engine has to recalculate the matching documents again.

Don’t forget that you’re often seeing the results of more than just the metadata. You often get the best search results by treating entire books, newspapers, etc. as a single group for the purposes of ranking and then displaying it as a single entry on the top-level search. I wrote about this a while back: https://blogs.loc.gov/thesignal/2014/08/making-scanned-conte...

There’s an alternative which is considerably more efficient: cursor-based pagination, where instead of saying count=50 offset=10000 you say after=<unique ID/time stamp on the last result>. That allows the search engine to efficiently ignore all of the records which don’t match that, avoiding the need to calculate their rank, but comes at the cost of breaking the pagination UI: sites like loc.gov give you numbered lists of pages which humans like but you don’t have a way to calculate those after= values so your faster search isn’t as popular with advanced users.

That last part gets around to the other big factor in decisions: relatively few people do this. Most of the time, people are going to adjust their search if they don’t find what they’re looking for in the first few pages. What does do this a lot are robots which crawl every URL on a page, generating millions of permutations for every filtering option you offer. Some of those robots honor robots.txt (perhaps even correctly implemented), some have accurate user-agents, etc. but many do not and basically everyone offering a search service with a large corpus ends up having to make decisions about rate-limiting, limiting pagination depth, capping the number of filter options, etc.
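In Solr, that cursor-based alternative is cursorMark paging: the first request passes cursorMark=* and each subsequent request feeds back the nextCursorMark from the previous response, with the sort required to include the uniqueKey field as a tiebreaker. A rough sketch of the request parameters (the field name "id" and the example cursor value are illustrative, not from any real index):

```python
from urllib.parse import urlencode

def cursor_params(query, cursor_mark="*", rows=50):
    # The sort must end with the uniqueKey field so the ordering is
    # total and the cursor position is unambiguous.
    return urlencode({
        "q": query,
        "rows": rows,
        "sort": "score desc, id asc",
        "cursorMark": cursor_mark,
    })

first_page = cursor_params("god")  # cursorMark=* starts the cursor
# Next request: pass the nextCursorMark returned in the response
# (this particular value is made up for illustration).
next_page = cursor_params("god", cursor_mark="AoEpVkRCREI1QTE2")
```

As the comment notes, the trade-off is losing numbered page links: you can only move forward from a cursor you've already seen.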


So clearly as a society we've forgotten how to count ¯\_(ツ)_/¯


That's a general thing I've been noticing.

Maybe it's long COVID or quiet-quitting or was it ghosting? I know it's not gaslighting. I'm trying to keep up, bear with me.



