Ask HN: Is there an equivalent of the Dewey Decimal System for data collections?
27 points by Mithriil on June 16, 2023 | 25 comments
The Dewey Decimal System is famous as a cataloguing system for books and other writings and media that have to be physically retrievable.

Is there an equivalent or something similar for data collections? I'm particularly interested in such a system for Machine Learning Operations (MLOps).



This is recent and (I think?) related:

Johnny Decimal - https://news.ycombinator.com/item?id=36308366 - June 2023 (192 comments)


Well, it could be related, but it is more a system designed for personal files and notes.


You might find something useful by poking around the literature around "digital libraries".

https://en.wikipedia.org/wiki/Digital_library

There's also some stuff that could be adapted to the purpose you're talking about, from the Semantic Web space. See the following example of using VoID with DBPedia categories, for one illustration of this idea.

https://www.w3.org/TR/void/#subject
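For a concrete flavour of the VoID approach, here's a minimal sketch (Python; the dataset URI and the helper name are illustrative assumptions, not part of the spec) that emits a Turtle description declaring a void:Dataset and tagging it with dcterms:subject URIs, in the spirit of the spec's DBpedia example:

```python
def void_description(dataset_uri, subject_uris):
    """Emit a minimal VoID description in Turtle: a void:Dataset
    with one dcterms:subject triple per subject URI."""
    prefixes = (
        "@prefix void: <http://rdfs.org/ns/void#> .\n"
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
    )
    subjects = " ;\n    ".join(
        f"dcterms:subject <{uri}>" for uri in subject_uris
    )
    return f"{prefixes}<{dataset_uri}> a void:Dataset ;\n    {subjects} .\n"

print(void_description(
    "http://example.org/datasets/campus-wifi",
    ["http://dbpedia.org/resource/Geolocation",
     "http://dbpedia.org/resource/Wireless_network"],
))
```

The subject URIs could point at DBpedia category resources, which gives you a shared, externally maintained vocabulary for free.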

And with that in mind, since somebody brought up the LOC, you might also find this of interest:

http://www.loc.gov/standards/mods/modsrdf-primer.html

http://www.loc.gov/standards/mods/

https://www.loc.gov/librarians/standards

And from the "see also" category:

https://www.dublincore.org/



That's a very good example of what I've been searching for! :) Thank you.

It makes sense that digital collections would be catalogued not through traditional classification codes, but through universally shared metadata.


Dublin Core is a metadata specification standard. It can reference classification system(s), but isn't a classification system itself.


What are you trying to do / get done?


I'm thinking ahead for the company I work for. I was wondering how we could organize and store datasets for the long run. Because, yeah, we're a bit late on data management.

So I work in a Machine Learning department, and getting our hands on data from clients has been a terrible mess (both architecturally and legally). How are we supposed to develop prediction machines for contexts we have no data on? I wish to push the discussions a bit in the right direction.


Having worked both in libraries and with large data sets, I don't think a classification system like Dewey Decimal would work for data sets. For example, if you had a large data set of locations for people on a college campus, that data set could be used for sociological research (who meets up with whom?), for epidemiology (to see who potentially spread a virus), or for building maintenance (which buildings have the most people using them?). So what do you classify the data set under? Maybe you could file it under location data (though it is also student+professor+manager data), but what if the basic data is WiFi AP associations? Then the data might also be of interest to wireless network managers and could be classified as wireless management data.

So any classification system probably needs a tag system that can apply multiple tags to one data set. The main point of classification is being able to find what you need, so pigeonholing a data set into one category is not a good idea, because data is much more versatile than a single book (although there are problems with Dewey Decimal and books in this regard also).

There is also the problem of future categories, categories that don't even exist now but might retroactively apply to current data sets. An old data set that was tagged with "location" might in the future also need to be tagged with "privacy". Being able to add tags in the future is important.

So a data set is not like a book on a shelf; it's more like a web page, and there should be multiple ways to link to it. Data sets are not physical things, and are not related to a single topic (though they may have been collected for only one purpose initially), so don't use a cataloguing system for physical objects to catalog data sets. That said, the range of data sets is infinite, so understanding your own needs may help limit the number of tags/categories, and understanding how your needs may change in the future would help too.

Machine Learning data sets are often repurposed. Business aims may expand. At the extreme, you may want to do some research into the problem of ontologies: specifically, how to build an ontology of your current and future intentions for these data sets. Look for something flexible, because your needs are likely to change, so you may need to transition the "catalog" to a new form in the future. That seems very complicated and difficult, so reduce the problem as much as you can by imposing limits. What is the minimum range of categories that defines your interests?

You might also take a completely different approach: instead of categorizing data sets, just feed them (and all their metadata) to a machine learning algorithm that can search and summarize them for you. That puts the categorization into the search query+process, rather than associating it with the data. By its nature, raw data is rather category-less; it can later be used for anything. Notes on butterfly migrations from the 1900s later become data on global warming.
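The multi-tag, add-tags-later scheme described above fits in a few lines. A minimal sketch (Python; the names and the in-memory dicts are illustrative assumptions, since a real catalogue would back this with a database):

```python
from collections import defaultdict

class TagIndex:
    """Minimal many-to-many tag index: a data set can carry any
    number of tags, and tags can be added retroactively."""

    def __init__(self):
        self._tags_for = defaultdict(set)      # dataset id -> tags
        self._datasets_for = defaultdict(set)  # tag -> dataset ids

    def tag(self, dataset_id, *tags):
        for t in tags:
            self._tags_for[dataset_id].add(t)
            self._datasets_for[t].add(dataset_id)

    def find(self, tag):
        return sorted(self._datasets_for[tag])

idx = TagIndex()
idx.tag("campus-locations", "location", "wifi", "epidemiology")
# A category that didn't exist when the data was collected:
idx.tag("campus-locations", "privacy")
print(idx.find("privacy"))  # the old data set is now retrievable under the new tag
```

The point is structural: retrieval goes through an index that can grow new categories, not through a single shelf position fixed at acquisition time.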


I appreciate your insights. Indeed, I can't see how a single template classification could work for every possible dataset.

The best approaches I have in mind are either determining a set of mandatory and optional metadata shared by every dataset or, as you mentioned, defining an ontology of use cases, sources, or characteristics to label datasets.
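One way to picture that mandatory-plus-optional metadata idea (a Python sketch; every field name here is an illustrative assumption, not a standard):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetRecord:
    # Mandatory for every dataset
    identifier: str
    title: str
    source: str          # client, instrument, pipeline, ...
    created: str         # ISO 8601 date
    # Optional, filled in as time permits
    description: Optional[str] = None
    licence: Optional[str] = None
    derived_from: list = field(default_factory=list)  # provenance links
    tags: list = field(default_factory=list)          # ontology labels

    def validate(self):
        missing = [f for f in ("identifier", "title", "source", "created")
                   if not getattr(self, f)]
        if missing:
            raise ValueError(f"missing mandatory metadata: {missing}")

rec = DatasetRecord("ds-0001", "Campus WiFi associations",
                    "client-A", "2023-06-01", tags=["location", "wifi"])
rec.validate()  # passes; an empty mandatory field would raise
```

The `derived_from` list is where provenance links between source and derived datasets would live, which matters once datasets start being repurposed.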


There are numerous classification and cataloguing systems; Dewey Decimal is only one of a large set. I'd strongly suggest looking at the US Library of Congress's two sets of classifiers: the Library of Congress Classification, a set of 21 alphabetically-denoted categories (A-Z, excluding I, O, W, X, and Y), originally based on a classification devised by Thomas Jefferson, whose donated collection seeded the Library of Congress's holdings.

The LoC also has a set of subject headings, which is a controlled vocabulary used to describe works.

The chief difference between the two is that any given work is assigned one Classification, which is used for shelving and retrieval, but can have multiple subject headings, which are used for general cataloguing.

Paul Otlet, an early 20th-century archivist, created the Universal Decimal Classification, based on Dewey, for a project similar to what you seem to be after: a collection of information rather than works, called the Mundaneum, in many ways a precursor to Google, though based on index cards. Much of the original was destroyed by Nazi Germany during WWII.

The UDC shares with the Dewey and Library of Congress classifications the benefit of being in widespread extant use, which is to say that these classifications reflect current informational needs and have been revised from historical standards which may no longer be especially suitable, and have established bodies and procedures for further updates and revisions.

The Library of Congress classification & subject headings, as works of the US government, are also in the public domain, though useful & usable electronic formats are not readily available to the best of my knowledge. I believe the Library of Congress sells various products, however.

There are other classifications as well, several of which I've heard of though I've not used them: the colon classification, Bliss Bibliographic Classification, and several national / language-specific classifications (e.g., German, Nippon, Chinese, Korean, Russian).

Among the more interesting classifications is SuDocs, the Superintendent of Documents Classification, developed and maintained by the U.S. Government Publishing Office. This is not a universal or subject-based system, but is principally organised by Federal department or agency, with additional classes for subordinate offices, category classes (e.g., annual reports, bulletins, law, ...), and book numbers. Whilst likely not directly useful to you, it's an example of a classification designed for the specifics of a particular organisational context.

There are bibliographic standards such as the Dublin Core, which defines 15 metadata elements (though without clearly defining their specifications, meanings, or encodings): Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type. You'll find many content management systems incorporate Dublin Core to some extent.
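Since a Dublin Core record is, in practice, just a small set of key/value pairs drawn from those 15 elements, a sanity check against the element set is easy to sketch (Python; `check_dc` is a hypothetical helper, not part of any DC library, and real systems usually serialize such records as XML or RDF):

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

def check_dc(record: dict) -> list:
    """Return any keys that are not Dublin Core elements."""
    return sorted(k for k in record if k.lower() not in DC_ELEMENTS)

record = {
    "title": "Campus WiFi associations, spring term",
    "creator": "Facilities IT",
    "date": "2023-06-01",
    "subject": "location data; wireless networks",
    "type": "Dataset",
}
print(check_dc(record))  # [] -> every key is a valid DC element
```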

Another bibliographic standard is MARC: Machine-Readable Cataloging, a standard which emerged from the Library of Congress beginning in the 1960s. It is arcane and heavily influenced by the mainframe-computer (and punch-card data storage) practices of the era.

It's quite useful to think of how any system you specify will be used, by whom, and how it will be applied and maintained:

- Who will be using the classification?

- What purposes will it serve? Especially any processes related to rights management, documentation, provenance, and/or regulatory frameworks.

- What processes will it serve?

- Who will maintain the classification itself?

- Who will maintain the catalogue, e.g., adding new records, updating and correcting existing records, deaccessioning records?

- What, if any, extant standards are there in my specific field? (I don't know, for example, what machine-learning standards exist, if any.)

Keep in mind that classifications tend to be married to the notion of physical records stored in a specific location, which is generally not the case for electronic data. For the latter, useful descriptions of the work or information, information on its provenance, unique item identifier(s), and cross references between related items (e.g., source or derived data) might be more relevant information to capture.

There are various schools of thought on cataloguing and classification generally. Over the past few decades, "self-describing" works, often based on full-text search and some relevance measure, have become popular --- essentially Google and other online General Web Search indices. Hashes (e.g., SHA-256 checksums) are another self-descriptive tool, which are useful for identifying a specific file but tell you nothing about related works (e.g., the plain text, PDF, and ePub versions of a document, or JPEG, PNG, and SVG versions of a graphic). The advantage of the self-descriptive approach is that human inputs are relatively minimal, and documents self-describe through their contents and relations to other materials and factors. The disadvantage is that this approach lacks any central coordination, uniform classification, quality controls, or validation of self-described contents. The traditional practice (owing much to Melville Dewey himself) of having an independent cataloguer role affords greater control and consistency, but tends to be hopelessly out of date with new incoming information --- something of an age-old problem in the library field.
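The checksum point is easy to demonstrate with the standard library: a SHA-256 digest identifies one exact byte sequence, so two encodings of the same work (here, plain vs. zlib-compressed text, as a stand-in for plain-text vs. PDF) get unrelated digests:

```python
import hashlib
import zlib

text = b"Notes on butterfly migrations, 1900-1910.\n"

plain_digest = hashlib.sha256(text).hexdigest()
compressed_digest = hashlib.sha256(zlib.compress(text)).hexdigest()

# Same bytes always yield the same 64-hex-char identifier...
print(plain_digest == hashlib.sha256(text).hexdigest())  # True
# ...but a different encoding of the same work gets an unrelated one.
print(plain_digest == compressed_digest)  # False
```

This is why hashes work well as unique item identifiers but need a separate metadata layer to express "these two files are the same work".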

In practice, hybrid systems are probably most feasible, with assigned bibliographic characteristics being added to works as time permits and need arises. My own thought is that the cataloguing workflow should be explicitly notated in the bibliographic metadata, and that levels of automated and manual review and assignment should be coded to works as well.

There are several schools of library & information science, and you might want to poke around their course offerings, syllabi, etc., for information. The School of Information at UC Berkeley (previously SIMS) and the Information programme at Pittsburgh are two of which I'm specifically aware. Wikipedia of course has a more extensive listing: <https://en.wikipedia.org/wiki/List_of_information_schools>

There are also organisations working with electronic data collections at scale, including the Internet Archive and the Wikimedia Foundation, most particularly Wikidata, which might be of interest or relevance to you.

Wikipedia also has articles on most of the classifications and topics I've mentioned here: Library of Congress Classification & Subject Headings, Universal Decimal Classification, Colon Classification, Paul Otlet, and more.


> Keep in mind that classifications tend to be married to the notion of physical records stored in a specific location, which is generally not the case for electronic data. For the latter, useful descriptions of the work or information, information on its provenance, unique item identifier(s), and cross references between related items (e.g., source or derived data) might be more relevant information to capture.

The many replies here on HN have brought me to this conclusion. As long as one can find the datasets they need for a use case, solid metadata standards for good description and indexing are a must.

> It's quite useful to think of how any system you specify will be used, by whom, and how it will be applied and maintained [...]

These sentences are very good starting points to design said metadata standards internally. I'll keep these questions in mind.

> There are also organisations working with electronic data collections at scale, including the Internet Archive and the Wikimedia Foundation, most particularly Wikidata, which might be of interest or relevance to you.

These are very good suggestions, thank you!


> These sentences are very good starting points to design said metadata standards internally. I'll keep these questions in mind.

There's a formulation of the cataloguer's / librarian's goals in creating a catalogue which I'd tried, and failed, to find and reference above.

The "Six Functions of Bibliographic Control" are one version of that. Abbreviating slightly:

1. Identifying the existence of all types of information resources as they are made available.

2. Identifying the works contained within those information resources or as parts of them.

3. Systematically pulling together these information resources into collections in libraries, archives, museums, and Internet communication files, and other such depositories.

4. Producing lists of these information resources prepared according to standard rules for citation.

5. Providing name, title, subject, and other useful access to these information resources.

6. Providing the means of locating each information resource or a copy of it.

<https://en.wikipedia.org/wiki/Cataloging_(library_science)#S...>

Citing:

- Hagler, Ronald (1997). The Bibliographic Record and Information Technology, 3rd ed. Chicago: American Library Association.

- Taylor, Arlene G., & Daniel N. Joudrey (2009). The organization of information. 3rd ed. Englewood: Libraries Unlimited, pp. 5-7

The Wikipedia Cataloguing article raises numerous other points, practices, and history.

Keep in mind that your own needs might overlap with these only partially, including some, excluding others, and including other considerations (e.g., provenance and derivative datasets or sources, generating and/or reporting software).

On the physical storage/retrieval aspect: idiosyncrasies of the US Library of Congress classification become far more understandable when one considers that local and proximate topics of history and geography are not only of greater concern, but historically comprised a far larger portion of the collection. (This is of course changing.) Having gone through the classification in some detail, I'd also been struck by how, within the law classification, the degree of detail provided for California and New York State is vastly greater than for nearly all other states (a handful of other industrial states also have relatively detailed classifications). The classification, in other words, maps onto the archived works.

The history and evolution of the classification is itself a story, though one you may not want to dive into. It's pretty substantively documented in the annual Librarian's Report to Congress, which began in the 1860s and continues through the present. Recent years (to about 2000) are available through the loc.gov website; older reports, back to the first, are at the Hathi Trust. Much of the first 50 years or so details the challenges of dealing with a swelling collection in a wholly insufficient space (either the Old Senate Chambers or the Old Supreme Court Chambers, if memory serves). The current freestanding main structure, the Thomas Jefferson Building, adjacent to the Congress and Supreme Court (both of which it serves through an underground messaging-and-delivery system), was opened in 1897.

The book delivery system is described on page 7 of the 1897 report, with the five requested books reaching the Capitol in eight to twelve minutes: <https://babel.hathitrust.org/cgi/pt?id=mdp.39015036735044&vi...>

Photo of the old Library (still within the US Capitol) in the 1890s: <https://en.wikipedia.org/wiki/Library_of_Congress#/media/Fil...>

Recent LoC annual reports: <https://www.loc.gov/about/reports-and-budgets/annual-reports...>

The Hathi Trust archive of reports 1866--2007: <https://catalog.hathitrust.org/Record/000072049>


Nobody uses the Dewey Decimal system for cataloguing resources. Unless you're an elementary school librarian, possibly, and even then... no.

You really want the Library of Congress's classification outline.


This is the appeal to authority fallacy that we see far too much in tech, here applied to library science.

The Dewey Decimal system is widely used and is completely appropriate for the vast majority of libraries. The LCC is useful for the Library of Congress. It's a custom classification solution designed by them, for them, for their own practical needs.

AWS uses a custom network protocol and custom hardware for their datacenters. It's designed by them, for them. It doesn't mean we should drop TCP/IP. Similarly, Netflix, Twitter, Google, etc. all have quite complex and robust architectures. That's not an invitation to do what they're doing - it's actually the opposite. They're operating at a unique scale, with their own unique needs on top of that. Your fresh startup with zero customers has no business adopting their special architectures.

For the vast majority, LCC is just overengineered. Dewey Decimal is a better pick.


AWS uses TCP/IP internally. And DNS. And DHCP. And NTP. And ssh. And all those other standard TCP/IP or UDP/IP algorithms you may think of. Now, if you want to claim they have their own bespoke build system, then I'd agree with you there.

When I was studying for my BSCS, I worked at a University library, shelving books. I had two floors -- Deck 4 (Dewey Decimal) in the old stacks, and 4 West in the new building (LCC). I also occasionally shelved books anywhere else in the library that needed shelving. So, I know both of those systems pretty well.

Dewey Decimal is like Latin or Ancient Greek -- a dead language. Not really designed to handle modern inventions and newer technologies. It was a huge improvement for its day and it served us well in the past. But it didn't age well, and it didn't scale well.

The Library of Congress system is alive, like modern Italian or English. It addressed the major weaknesses of Dewey Decimal and, although it has its own problems, it has continued to adjust and evolve as necessary. It also has scaling challenges, but at the really big sizes operated by the likes of the Library of Congress or AWS, I find that everything has its scaling challenges.

For me, if I was running my own personal small library (which I probably could, given all the technical books I've bought over the years), I might be inclined to use the Dewey Decimal system. Or maybe the Library of Congress system. I'd have to test them both out and see which served my needs the best.

And that's what I'd suggest you do before you make any Herculean or Sisyphean statements. Try them both out. See what actually works for you. Then speak from your own personal experience as to what works for you and why.


I heard that the Dewey Decimal System is something you have to pay for a license for in order to legally use. Is that not the case?


Lol what, it's literally just how you order and label your books - doesn't seem like anything someone would charge for even if they could


Yes, they license it.

https://www.oclc.org/en/dewey/resources.html

(Blue box about a third down the page)


I'm a former professional librarian. Left the field in 1985, but have never heard of anyone anywhere paying for use of the dewey decimal system. OCLC services is a different issue. You're paying for their provision of cataloging for your collection.


Here's one instance where they sued over this in the early 2000s.

https://scholarship.richmond.edu/cgi/viewcontent.cgi?article...

The issue was with a hotel that numbered their rooms according to the Dewey Decimal System.


I used to work in a library school. I'm not just making a trivial appeal to authority.


It usually takes time for one system to be replaced by a better one across all libraries.


All the public libraries I've been to in the US use a decimal classification system that I assume is Dewey Decimal. For example, programming books are under 005.* AUTHORLASTNAME.


Well, I guess it depends where you are.

Unless all the public libraries I went to in Québec, Canada have a variant of Dewey Decimal...

And no, I'm not seeking a surrogate for DDS; I'm searching for dataset cataloguing.



